-
Self-Taught Evaluators
Authors:
Tianlu Wang,
Ilia Kulikov,
Olga Golovneva,
Ping Yu,
Weizhe Yuan,
Jane Dwivedi-Yu,
Richard Yuanzhe Pang,
Maryam Fazel-Zarandi,
Jason Weston,
Xian Li
Abstract:
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to im-prove e…
▽ More
Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to im-prove evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.
△ Less
Submitted 8 August, 2024; v1 submitted 5 August, 2024;
originally announced August 2024.
-
AltGeoViz: Facilitating Accessible Geovisualization
Authors:
Chu Li,
Rock Yuren Pang,
Ather Sharif,
Arnavi Chheda-Kothary,
Jeffrey Heer,
Jon E. Froehlich
Abstract:
Geovisualizations are powerful tools for exploratory spatial analysis, enabling sighted users to discern patterns, trends, and relationships within geographic data. However, these visual tools have remained largely inaccessible to screen-reader users. We present AltGeoViz, a new system we designed to facilitate geovisualization exploration for these users. AltGeoViz dynamically generates alt-text…
▽ More
Geovisualizations are powerful tools for exploratory spatial analysis, enabling sighted users to discern patterns, trends, and relationships within geographic data. However, these visual tools have remained largely inaccessible to screen-reader users. We present AltGeoViz, a new system we designed to facilitate geovisualization exploration for these users. AltGeoViz dynamically generates alt-text descriptions based on the user's current map view, providing summaries of spatial patterns and descriptive statistics. In a study of five screen-reader users, we found that AltGeoViz enabled them to interact with geovisualizations in previously infeasible ways. Participants demonstrated a clear understanding of data summaries and their location context, and they could synthesize spatial understandings of their explorations. Moreover, we identified key areas for improvement, such as the addition of intuitive spatial navigation controls and comparative analysis features.
△ Less
Submitted 21 June, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
An Introduction to Vision-Language Modeling
Authors:
Florian Bordes,
Richard Yuanzhe Pang,
Anurag Ajay,
Alexander C. Li,
Adrien Bardes,
Suzanne Petryk,
Oscar Mañas,
Zhiqiu Lin,
Anas Mahmoud,
Bargav Jayaraman,
Mark Ibrahim,
Melissa Hall,
Yunyang Xiong,
Jonathan Lebensold,
Candace Ross,
Srihari Jayakumar,
Chuan Guo,
Diane Bouchacourt,
Haider Al-Tahan,
Karthik Padthe,
Vasu Sharma,
Hu Xu,
Xiaoqing Ellen Tan,
Megan Richards,
Samuel Lavoie
, et al. (16 additional authors not shown)
Abstract:
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…
▽ More
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
BLIP: Facilitating the Exploration of Undesirable Consequences of Digital Technologies
Authors:
Rock Yuren Pang,
Sebastin Santy,
René Just,
Katharina Reinecke
Abstract:
Digital technologies have positively transformed society, but they have also led to undesirable consequences not anticipated at the time of design or development. We posit that insights into past undesirable consequences can help researchers and practitioners gain awareness and anticipate potential adverse effects. To test this assumption, we introduce BLIP, a system that extracts real-world undes…
▽ More
Digital technologies have positively transformed society, but they have also led to undesirable consequences not anticipated at the time of design or development. We posit that insights into past undesirable consequences can help researchers and practitioners gain awareness and anticipate potential adverse effects. To test this assumption, we introduce BLIP, a system that extracts real-world undesirable consequences of technology from online articles, summarizes and categorizes them, and presents them in an interactive, web-based interface. In two user studies with 15 researchers in various computer science disciplines, we found that BLIP substantially increased the number and diversity of undesirable consequences they could list in comparison to relying on prior knowledge or searching online. Moreover, BLIP helped them identify undesirable consequences relevant to their ongoing projects, made them aware of undesirable consequences they "had never considered," and inspired them to reflect on their own experiences with technology.
△ Less
Submitted 10 May, 2024;
originally announced May 2024.
-
Iterative Reasoning Preference Optimization
Authors:
Richard Yuanzhe Pang,
Weizhe Yuan,
Kyunghyun Cho,
He He,
Sainbayar Sukhbaatar,
Jason Weston
Abstract:
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoni…
▽ More
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
△ Less
Submitted 25 June, 2024; v1 submitted 30 April, 2024;
originally announced April 2024.
-
Self-Rewarding Language Models
Authors:
Weizhe Yuan,
Richard Yuanzhe Pang,
Kyunghyun Cho,
Xian Li,
Sainbayar Sukhbaatar,
Jing Xu,
Jason Weston
Abstract:
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewardi…
▽ More
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
△ Less
Submitted 8 February, 2024; v1 submitted 18 January, 2024;
originally announced January 2024.
-
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Authors:
David Rein,
Betty Li Hou,
Asa Cooper Stickland,
Jackson Petty,
Richard Yuanzhe Pang,
Julien Dirani,
Julian Michael,
Samuel R. Bowman
Abstract:
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert v…
▽ More
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
△ Less
Submitted 20 November, 2023;
originally announced November 2023.
-
The Case for Anticipating Undesirable Consequences of Computing Innovations Early, Often, and Across Computer Science
Authors:
Rock Yuren Pang,
Dan Grossman,
Tadayoshi Kohno,
Katharina Reinecke
Abstract:
From smart sensors that infringe on our privacy to neural nets that portray realistic imposter deepfakes, our society increasingly bears the burden of negative, if unintended, consequences of computing innovations. As the experts in the technology we create, Computer Science (CS) researchers must do better at anticipating and addressing these undesirable consequences proactively. Our prior work sh…
▽ More
From smart sensors that infringe on our privacy to neural nets that portray realistic imposter deepfakes, our society increasingly bears the burden of negative, if unintended, consequences of computing innovations. As the experts in the technology we create, Computer Science (CS) researchers must do better at anticipating and addressing these undesirable consequences proactively. Our prior work showed that many of us recognize the value of thinking preemptively about the perils our research can pose, yet we tend to address them only in hindsight. How can we change the culture in which considering undesirable consequences of digital technology is deemed as important, but is not commonly done?
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
Leveraging Implicit Feedback from Deployment Data in Dialogue
Authors:
Richard Yuanzhe Pang,
Stephen Roller,
Kyunghyun Cho,
He He,
Jason Weston
Abstract:
We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployme…
▽ More
We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can lead to more generations with undesirable properties as well. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.
△ Less
Submitted 31 January, 2024; v1 submitted 26 July, 2023;
originally announced July 2023.
-
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Authors:
Abulhair Saparov,
Richard Yuanzhe Pang,
Vishakh Padmakumar,
Nitish Joshi,
Seyed Mehran Kazemi,
Najoung Kim,
He He
Abstract:
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, an…
▽ More
Given the intractably large size of the space of proofs, any model that is capable of general deductive reasoning must generalize to proofs of greater complexity. Recent studies have shown that large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts. However, they have primarily been tested on proofs using modus ponens or of a specific size, and from the same distribution as the in-context examples. To measure the general deductive reasoning ability of LLMs, we test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations from multiple angles: depth-, width-, and compositional generalization. To facilitate systematic exploration, we construct a new synthetic and programmable reasoning dataset that enables control over deduction rules and proof complexity. Our experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs. However, they have difficulty generalizing to longer proofs, and they require explicit demonstrations to produce hypothetical subproofs, specifically in proof by cases and proof by contradiction.
△ Less
Submitted 3 November, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Auditing Cross-Cultural Consistency of Human-Annotated Labels for Recommendation Systems
Authors:
Rock Yuren Pang,
Jack Cenatempo,
Franklyn Graham,
Bridgette Kuehn,
Maddy Whisenant,
Portia Botchway,
Katie Stone Perez,
Allison Koenecke
Abstract:
Recommendation systems increasingly depend on massive human-labeled datasets; however, the human annotators hired to generate these labels increasingly come from homogeneous backgrounds. This poses an issue when downstream predictive models -- based on these labels -- are applied globally to a heterogeneous set of users. We study this disconnect with respect to the labels themselves, asking whethe…
▽ More
Recommendation systems increasingly depend on massive human-labeled datasets; however, the human annotators hired to generate these labels increasingly come from homogeneous backgrounds. This poses an issue when downstream predictive models -- based on these labels -- are applied globally to a heterogeneous set of users. We study this disconnect with respect to the labels themselves, asking whether they are ``consistently conceptualized'' across annotators of different demographics. In a case study of video game labels, we conduct a survey on 5,174 gamers, identify a subset of inconsistently conceptualized game labels, perform causal analyses, and suggest both cultural and linguistic reasons for cross-country differences in label annotation. We further demonstrate that predictive models of game annotations perform better on global train sets as opposed to homogeneous (single-country) train sets. Finally, we provide a generalizable framework for practitioners to audit their own data annotation processes for consistent label conceptualization, and encourage practitioners to consider global inclusivity in recommendation systems starting from the early stages of annotator recruitment and data-labeling.
△ Less
Submitted 10 May, 2023;
originally announced May 2023.
-
Anticipating Unintended Consequences of Technology Using Insights from Creativity Support Tools
Authors:
Rock Yuren Pang,
Katharina Reinecke
Abstract:
Our society has been increasingly witnessing a number of negative, unintended consequences of digital technologies. While post-hoc policy regulation is crucial in addressing these issues, reasonably anticipating the consequences before deploying technology can help mitigate potential harm to society in the first place. Yet, the quest to anticipate potential harms can be difficult without seeing di…
▽ More
Our society has been increasingly witnessing a number of negative, unintended consequences of digital technologies. While post-hoc policy regulation is crucial in addressing these issues, reasonably anticipating the consequences before deploying technology can help mitigate potential harm to society in the first place. Yet, the quest to anticipate potential harms can be difficult without seeing digital technologies deployed in the real world. In this position paper, we argue that anticipating unintended consequences of technology can be facilitated through creativity-enhancing interventions, such as by building on existing knowledge and insights from diverse stakeholders. Using lessons learned from prior work on creativity-support tools, the HCI community is uniquely equipped to design novel systems that aid in anticipating negative unintended consequences of technology on society.
△ Less
Submitted 12 April, 2023;
originally announced April 2023.
-
"That's important, but...": How Computer Science Researchers Anticipate Unintended Consequences of Their Research Innovations
Authors:
Kimberly Do,
Rock Yuren Pang,
Jiachen Jiang,
Katharina Reinecke
Abstract:
Computer science research has led to many breakthrough innovations but has also been scrutinized for enabling technology that has negative, unintended consequences for society. Given the increasing discussions of ethics in the news and among researchers, we interviewed 20 researchers in various CS sub-disciplines to identify whether and how they consider potential unintended consequences of their…
▽ More
Computer science research has led to many breakthrough innovations but has also been scrutinized for enabling technology that has negative, unintended consequences for society. Given the increasing discussions of ethics in the news and among researchers, we interviewed 20 researchers in various CS sub-disciplines to identify whether and how they consider potential unintended consequences of their research innovations. We show that considering unintended consequences is generally seen as important but rarely practiced. Principal barriers are a lack of formal process and strategy as well as the academic practice that prioritizes fast progress and publications. Drawing on these findings, we discuss approaches to support researchers in routinely considering unintended consequences, from bringing diverse perspectives through community participation to increasing incentives to investigate potential consequences. We intend for our work to pave the way for routine explorations of the societal implications of technological innovations before, during, and after the research process.
△ Less
Submitted 27 March, 2023;
originally announced March 2023.
-
Extrapolative Controlled Sequence Generation via Iterative Refinement
Authors:
Vishakh Padmakumar,
Richard Yuanzhe Pang,
He He,
Ankur P. Parikh
Abstract:
We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their att…
▽ More
We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. Thus, by definition, the target sequences and their attribute values are out of the training distribution, posing challenges to existing methods that aim to directly generate the target sequence. Instead, in this work, we propose Iterative Controlled Extrapolation (ICE) which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.
△ Less
Submitted 7 June, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Reward Gaming in Conditional Text Generation
Authors:
Richard Yuanzhe Pang,
Vishakh Padmakumar,
Thibault Sellam,
Ankur P. Parikh,
He He
Abstract:
To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring sp…
▽ More
To align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (RL) with reward functions learned from human annotations. Under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. We show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during RL training of the text generation model. While there has been discussion about reward gaming in the RL or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (NLG) community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
△ Less
Submitted 1 June, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
-
How Do Data Science Workers Communicate Intermediate Results?
Authors:
Rock Yuren Pang,
Ruotong Wang,
Joely Nelson,
Leilani Battle
Abstract:
Data science workers increasingly collaborate on large-scale projects before communicating insights to a broader audience in the form of visualization. While prior work has modeled how data science teams, oftentimes with distinct roles and work processes, communicate knowledge to outside stakeholders, we have little knowledge of how data science workers communicate intermediately before delivering…
▽ More
Data science workers increasingly collaborate on large-scale projects before communicating insights to a broader audience in the form of visualization. While prior work has modeled how data science teams, oftentimes with distinct roles and work processes, communicate knowledge to outside stakeholders, we have little knowledge of how data science workers communicate intermediately before delivering the final products. In this work, we contribute a nuanced description of the intermediate communication process within data science teams. By analyzing interview data with 8 self-identified data science workers, we characterized the data science intermediate communication process with four factors, including the types of audience, communication goals, shared artifacts, and mode of communication. We also identified overarching challenges in the current communication process. We also discussed design implications that might inform better tools that facilitate intermediate communication within data science teams.
△ Less
Submitted 6 October, 2022;
originally announced October 2022.
-
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
Authors:
Julian Michael,
Ari Holtzman,
Alicia Parrish,
Aaron Mueller,
Alex Wang,
Angelica Chen,
Divyam Madaan,
Nikita Nangia,
Richard Yuanzhe Pang,
Jason Phang,
Samuel R. Bowman
Abstract:
We present the results of the NLP Community Metasurvey. Run from May to June 2022, the survey elicited opinions on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies: For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, w…
▽ More
We present the results of the NLP Community Metasurvey. Run from May to June 2022, the survey elicited opinions on controversial issues, including industry influence in the field, concerns about AGI, and ethics. Our results put concrete numbers to several controversies: For example, respondents are split almost exactly in half on questions about the importance of artificial general intelligence, whether language models understand language, and the necessity of linguistic structure and inductive bias for solving NLP problems. In addition, the survey posed meta-questions, asking respondents to predict the distribution of survey responses. This allows us not only to gain insight on the spectrum of beliefs held by NLP researchers, but also to uncover false sociological beliefs where the community's predictions don't match reality. We find such mismatches on a wide range of issues. Among other results, the community greatly overestimates its own belief in the usefulness of benchmarks and the potential for scaling to solve real-world problems, while underestimating its own belief in the importance of linguistic structure, inductive bias, and interdisciplinary science.
△ Less
Submitted 26 August, 2022;
originally announced August 2022.
-
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way
Authors:
Alex Wang,
Richard Yuanzhe Pang,
Angelica Chen,
Jason Phang,
Samuel R. Bowman
Abstract:
Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization bench…
▽ More
Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization benchmark data: We hire highly-qualified contractors to read stories and write original summaries from scratch. To amortize reading time, we collect five summaries per document, with the first giving an overview and the subsequent four addressing specific questions. We use this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021). Experiments with state-of-the-art summarization systems show that our dataset is challenging and that existing automatic evaluation metrics are weak indicators of quality.
△ Less
Submitted 23 May, 2022;
originally announced May 2022.
-
Token Dropping for Efficient BERT Pretraining
Authors:
Le Hou,
Richard Yuanzhe Pang,
Tianyi Zhou,
Yuexin Wu,
Xinying Song,
Xiaodan Song,
Denny Zhou
Abstract:
Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus…
▽ More
Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT, without degrading its performance on downstream tasks. In short, we drop unimportant tokens starting from an intermediate layer in the model to make the model focus on important tokens; the dropped tokens are later picked up by the last layer of the model so that the model still produces full-length sequences. We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead. In our experiments, this simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
△ Less
Submitted 24 March, 2022;
originally announced March 2022.
-
Amortized Noisy Channel Neural Machine Translation
Authors:
Richard Yuanzhe Pang,
He He,
Kyunghyun Cho
Abstract:
Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like "beam search and rerank" (BSR) incur significant computation overhead during inference, making real-world application infeasible. We aim to study if it is possible to build an amortized noisy channel NMT model such that when we do greedy decoding during inference, the translatio…
▽ More
Noisy channel models have been especially effective in neural machine translation (NMT). However, recent approaches like "beam search and rerank" (BSR) incur significant computation overhead during inference, making real-world application infeasible. We aim to study if it is possible to build an amortized noisy channel NMT model such that when we do greedy decoding during inference, the translation accuracy matches that of BSR in terms of reward (based on the source-to-target log probability and the target-to-source log probability) and quality (based on BLEU and BLEURT). We attempt three approaches to train the new model: knowledge distillation, one-step-deviation imitation learning, and Q learning. The first approach obtains the noisy channel signal from a pseudo-corpus, and the latter two approaches aim to optimize toward a noisy-channel MT reward directly. For all three approaches, the generated translations fail to achieve rewards comparable to BSR, but the translation quality approximated by BLEU and BLEURT is similar to the quality of BSR-produced translations. Additionally, all three approaches speed up inference by 1-2 orders of magnitude.
△ Less
Submitted 18 July, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
QuALITY: Question Answering with Long Input Texts, Yes!
Authors:
Richard Yuanzhe Pang,
Alicia Parrish,
Nitish Joshi,
Nikita Nangia,
Jason Phang,
Angelica Chen,
Vishakh Padmakumar,
Johnny Ma,
Jana Thompson,
He He,
Samuel R. Bowman
Abstract:
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than rely…
▽ More
To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
△ Less
Submitted 11 May, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
AgreeSum: Agreement-Oriented Multi-Document Summarization
Authors:
Richard Yuanzhe Pang,
Adam D. Lelkes,
Vinh Q. Tran,
Cong Yu
Abstract:
We aim to renew interest in a particular multi-document summarization (MDS) task which we call AgreeSum: agreement-oriented multi-document summarization. Given a cluster of articles, the goal is to provide abstractive summaries that represent information common and faithful to all input articles. Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on arti…
▽ More
We aim to renew interest in a particular multi-document summarization (MDS) task which we call AgreeSum: agreement-oriented multi-document summarization. Given a cluster of articles, the goal is to provide abstractive summaries that represent information common and faithful to all input articles. Given the lack of existing datasets, we create a dataset for AgreeSum, and provide annotations on article-summary entailment relations for a subset of the clusters in the dataset. We aim to create strong baselines for the task by applying the top-performing pretrained single-document summarization model PEGASUS onto AgreeSum, leveraging both annotated clusters by supervised losses, and unannotated clusters by T5-based entailment-related and language-related losses. Compared to other baselines, both automatic evaluation and human evaluation show better article-summary and cluster-summary entailment in generated summaries. On a separate note, we hope that our article-summary entailment annotations contribute to the community's effort in improving abstractive summarization faithfulness.
△ Less
Submitted 4 June, 2021;
originally announced June 2021.
-
Comparing Test Sets with Item Response Theory
Authors:
Clara Vania,
Phu Mon Htut,
William Huang,
Dhara Mungra,
Richard Yuanzhe Pang,
Jason Phang,
Haokun Liu,
Kyunghyun Cho,
Samuel R. Bowman
Abstract:
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind…
▽ More
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
△ Less
Submitted 1 June, 2021;
originally announced June 2021.
-
Text Generation by Learning from Demonstrations
Authors:
Richard Yuanzhe Pang,
He He
Abstract:
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation a…
▽ More
Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.
△ Less
Submitted 2 March, 2021; v1 submitted 16 September, 2020;
originally announced September 2020.
-
ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation
Authors:
Lifu Tu,
Richard Yuanzhe Pang,
Sam Wiseman,
Kevin Gimpel
Abstract:
We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled cor…
▽ More
We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.
△ Less
Submitted 12 May, 2020; v1 submitted 2 May, 2020;
originally announced May 2020.
-
Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?
Authors:
Yada Pruksachatkun,
Jason Phang,
Haokun Liu,
Phu Mon Htut,
Xiaoyi Zhang,
Richard Yuanzhe Pang,
Clara Vania,
Katharina Kann,
Samuel R. Bowman
Abstract:
While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large…
▽ More
While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.
△ Less
Submitted 9 May, 2020; v1 submitted 1 May, 2020;
originally announced May 2020.
-
Consistency of a Recurrent Language Model With Respect to Incomplete Decoding
Authors:
Sean Welleck,
Ilia Kulikov,
Jaedeok Kim,
Richard Yuanzhe Pang,
Kyunghyun Cho
Abstract:
Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm,…
▽ More
Despite strong performance on a variety of tasks, neural sequence models trained with maximum likelihood have been shown to exhibit issues such as length bias and degenerate repetition. We study the related issue of receiving infinite-length sequences from a recurrent language model when using common decoding algorithms. To analyze this issue, we first define inconsistency of a decoding algorithm, meaning that the algorithm can yield an infinite-length sequence that has zero probability under the model. We prove that commonly used incomplete decoding algorithms - greedy search, beam search, top-k sampling, and nucleus sampling - are inconsistent, despite the fact that recurrent language models are trained to produce sequences of finite length. Based on these insights, we propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model. Empirical results show that inconsistency occurs in practice, and that the proposed methods prevent inconsistency.
△ Less
Submitted 2 October, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Improving Joint Training of Inference Networks and Structured Prediction Energy Networks
Authors:
Lifu Tu,
Richard Yuanzhe Pang,
Kevin Gimpel
Abstract:
Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training "inference networks" to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requirin…
▽ More
Deep energy-based models are powerful, but pose challenges for learning and inference (Belanger and McCallum, 2016). Tu and Gimpel (2018) developed an efficient framework for energy-based models by training "inference networks" to approximate structured inference instead of using gradient descent. However, their alternating optimization approach suffers from instabilities during training, requiring additional loss terms and careful hyperparameter tuning. In this paper, we contribute several strategies to stabilize and improve this joint training of energy functions and inference networks for structured prediction. We design a compound objective to jointly train both cost-augmented and test-time inference networks along with the energy function. We propose joint parameterizations for the inference networks that encourage them to capture complementary functionality during learning. We empirically validate our strategies on two sequence labeling tasks, showing easier paths to strong performance than prior work, as well as further improvements with global energy terms.
△ Less
Submitted 10 October, 2020; v1 submitted 7 November, 2019;
originally announced November 2019.
-
The Daunting Task of Real-World Textual Style Transfer Auto-Evaluation
Authors:
Richard Yuanzhe Pang
Abstract:
The difficulty of textual style transfer lies in the lack of parallel corpora. Numerous advances have been proposed for the unsupervised generation. However, significant problems remain with the auto-evaluation of style transfer tasks. Based on the summary of Pang and Gimpel (2018) and Mir et al. (2019), style transfer evaluations rely on three criteria: style accuracy of transferred sentences, co…
▽ More
The difficulty of textual style transfer lies in the lack of parallel corpora. Numerous advances have been proposed for the unsupervised generation. However, significant problems remain with the auto-evaluation of style transfer tasks. Based on the summary of Pang and Gimpel (2018) and Mir et al. (2019), style transfer evaluations rely on three criteria: style accuracy of transferred sentences, content similarity between original and transferred sentences, and fluency of transferred sentences. We elucidate the problematic current state of style transfer research. Given that current tasks do not represent real use cases of style transfer, current auto-evaluation approach is flawed. This discussion aims to bring researchers to think about the future of style transfer and style transfer evaluation research.
△ Less
Submitted 9 October, 2019; v1 submitted 8 October, 2019;
originally announced October 2019.
-
Unsupervised Evaluation Metrics and Learning Criteria for Non-Parallel Textual Transfer
Authors:
Richard Yuanzhe Pang,
Kevin Gimpel
Abstract:
We consider the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data (Hu et al., 2017; Shen et al., 2017). This setting poses challenges for evaluation. We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and…
▽ More
We consider the problem of automatically generating textual paraphrases with modified attributes or properties, focusing on the setting without parallel data (Hu et al., 2017; Shen et al., 2017). This setting poses challenges for evaluation. We show that the metric of post-transfer classification accuracy is insufficient on its own, and propose additional metrics based on semantic preservation and fluency as well as a way to combine them into a single overall score. We contribute new loss functions and training strategies to address the different metrics. Semantic preservation is addressed by adding a cyclic consistency loss and a loss based on paraphrase pairs, while fluency is improved by integrating losses based on style-specific language models. We experiment with a Yelp sentiment dataset and a new literature dataset that we propose, using multiple models that extend prior work (Shen et al., 2017). We demonstrate that our metrics correlate well with human judgments, at both the sentence-level and system-level. Automatic and manual evaluation also show large improvements over the baseline method of Shen et al. (2017). We hope that our proposed metrics can speed up system development for new textual transfer tasks while also encouraging the community to address our three complementary aspects of transfer quality.
△ Less
Submitted 30 September, 2019; v1 submitted 28 October, 2018;
originally announced October 2018.