-
Are We Done with MMLU?
Authors:
Aryo Pradipta Gema,
Joshua Ong Jun Leang,
Giwon Hong,
Alessio Devoto,
Alberto Carlo Maria Mancino,
Rohit Saxena,
Xuanli He,
Yu Zhao,
Xiaotang Du,
Mohammad Reza Ghasemi Madani,
Claire Barale,
Robert McHardy,
Joshua Harris,
Jean Kaddour,
Emile van Krieken,
Pasquale Minervini
Abstract:
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation at https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux.
Submitted 7 June, 2024; v1 submitted 6 June, 2024;
originally announced June 2024.
-
Anatomy of Industrial Scale Multilingual ASR
Authors:
Francis McCann Ramirez,
Luka Chkhetiani,
Andrew Ehrenberg,
Robert McHardy,
Rami Botros,
Yash Khare,
Andrea Vanzo,
Taufiquzzaman Peyash,
Gabriel Oexle,
Michael Liang,
Ilya Sklyar,
Enver Fakhan,
Ahmed Etefy,
Daniel McCrystal,
Sam Flamini,
Domenic Donato,
Takuya Yoshioka
Abstract:
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
Submitted 16 April, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Challenges and Applications of Large Language Models
Authors:
Jean Kaddour,
Joshua Harris,
Maximilian Mozes,
Herbie Bradley,
Roberta Raileanu,
Robert McHardy
Abstract:
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify the remaining challenges and already fruitful application areas. In this paper, we aim to establish a systematic set of open problems and application successes so that ML researchers can comprehend the field's current state more quickly and become productive.
Submitted 19 July, 2023;
originally announced July 2023.
-
Adversarial Training for Satire Detection: Controlling for Confounding Variables
Authors:
Robert McHardy,
Heike Adel,
Roman Klinger
Abstract:
The automatic detection of satire vs. regular news is relevant for downstream applications (for instance, knowledge base population) and to improve the understanding of linguistic characteristics of satire. Recent approaches build upon corpora which have been labeled automatically based on article sources. We hypothesize that this encourages the models to learn characteristics for different publication sources (e.g., "The Onion" vs. "The Guardian") rather than characteristics of satire, leading to poor generalization performance to unseen publication sources. We therefore propose a novel model for satire detection with an adversarial component to control for the confounding variable of publication source. On a large novel data set collected from German news (which we make available to the research community), we observe comparable satire classification performance and, as desired, a considerable drop in publication classification performance with adversarial training. Our analysis shows that the adversarial component is crucial for the model to learn to pay attention to linguistic properties of satire.
Submitted 1 March, 2019; v1 submitted 28 February, 2019;
originally announced February 2019.
-
On the Topology of the Facebook Page Network
Authors:
R. E. Slattery,
R. R. McHardy,
R. Bairathi
Abstract:
The Facebook Page Network (FPN) is a platform for Businesses, Public Figures and Organizations (BPOs) to connect with individuals and other BPOs in the digital space. For over a decade scale-free networks have most appropriately described a variety of seemingly disparate physical, biological and social real-world systems unified by similar network properties such as scale-invariance, growth via a preferential attachment mechanism, and a power law degree distribution P(k) = ck^-λ where typically 2<λ<3. In this paper we show that both the Facebook Page Network and its BPO-BPO subnetwork suggest power law and scale-free characteristics. We argue that social media analysts must consider the logarithmic and non-linear properties of social media audiences of scale.
Submitted 8 July, 2013;
originally announced July 2013.