-
Refusal in Language Models Is Mediated by a Single Direction
Authors:
Andy Arditi,
Oscar Obeso,
Aaquib Syed,
Daniel Paleka,
Nina Panickssery,
Wes Gurnee,
Neel Nanda
Abstract:
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
Submitted 15 July, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
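The abstract above describes two interventions on a single direction: erasing it from an activation and adding it to an activation. A minimal sketch of both on one activation vector follows, in plain Python. The function names, the toy vectors, and the `scale` parameter are illustrative assumptions, not the authors' code; in a real model the operation would be applied to residual stream activations at many layers and token positions.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(activation, direction):
    """Erase the component of `activation` along `direction`: x - (x . r) r."""
    r = unit(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]


def add_direction(activation, direction, scale=1.0):
    """Add `scale` times the unit direction to `activation`: x + scale * r."""
    r = unit(direction)
    return [a + scale * ri for a, ri in zip(activation, r)]


# Toy 3-dimensional example (hypothetical values).
x = [1.0, 2.0, 3.0]
r = [0.0, 0.0, 2.0]

x_ablated = ablate(x, r)            # component along r is removed
x_steered = add_direction(x, r, 2.0)  # component along r is increased
```

After ablation the vector has zero projection onto the direction, which is the property the erasure intervention relies on.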
-
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Authors:
Edoardo Debenedetti,
Javier Rando,
Daniel Paleka,
Silaghi Fineas Florin,
Dragos Albastroiu,
Niv Cohen,
Yuval Lemberg,
Reshmi Ghosh,
Rui Wen,
Ahmed Salem,
Giovanni Cherubin,
Santiago Zanella-Beguelin,
Robin Schmid,
Victor Klemm,
Takahiro Miki,
Chenhao Li,
Stefan Kraft,
Mario Fritz,
Florian Tramèr,
Sahar Abdelnabi,
Lea Schönherr
Abstract:
Large language model systems face important security risks from maliciously crafted messages that aim to overwrite the system's original instructions or leak private data. To study this problem, we organized a capture-the-flag competition at IEEE SaTML 2024, where the flag is a secret string in the LLM system prompt. The competition was organized in two phases. In the first phase, teams developed defenses to prevent the model from leaking the secret. During the second phase, teams were challenged to extract the secrets hidden for defenses proposed by the other teams. This report summarizes the main insights from the competition. Notably, we found that all defenses were bypassed at least once, highlighting the difficulty of designing a successful defense and the necessity for additional research to protect LLM systems. To foster future research in this direction, we compiled a dataset with over 137k multi-turn attack chats and open-sourced the platform.
Submitted 12 June, 2024;
originally announced June 2024.
-
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Authors:
Usman Anwar,
Abulhair Saparov,
Javier Rando,
Daniel Paleka,
Miles Turpin,
Peter Hase,
Ekdeep Singh Lubana,
Erik Jenner,
Stephen Casper,
Oliver Sourbut,
Benjamin L. Edelman,
Zhaowei Zhang,
Mario Günther,
Anton Korinek,
Jose Hernandez-Orallo,
Lewis Hammond,
Eric Bigelow,
Alexander Pan,
Lauro Langosco,
Tomasz Korbak,
Heidi Zhang,
Ruiqi Zhong,
Seán Ó hÉigeartaigh,
Gabriel Recchia,
Giulio Corsi
, et al. (17 additional authors not shown)
Abstract:
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.
Submitted 5 September, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Stealing Part of a Production Language Model
Authors:
Nicholas Carlini,
Daniel Paleka,
Krishnamurthy Dj Dvijotham,
Thomas Steinke,
Jonathan Hayase,
A. Feder Cooper,
Katherine Lee,
Matthew Jagielski,
Milad Nasr,
Arthur Conmy,
Itay Yona,
Eric Wallace,
David Rolnick,
Florian Tramèr
Abstract:
We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under $20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
Submitted 9 July, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
ARB: Advanced Reasoning Benchmark for Large Language Models
Authors:
Tomohiro Sawada,
Daniel Paleka,
Alexander Havrilla,
Pranav Tadepalli,
Paula Vidas,
Alexander Kranias,
John J. Nay,
Kshitij Gupta,
Aran Komatsuzaki
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. Further, we conduct a human evaluation of the symbolic subset of ARB, finding promising agreement between annotators and GPT-4 rubric evaluation scores.
Submitted 27 July, 2023; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Evaluating Superhuman Models with Consistency Checks
Authors:
Lukas Fluri,
Daniel Paleka,
Florian Tramèr
Abstract:
If machine learning models were to achieve superhuman abilities at various reasoning or decision-making tasks, how would we go about evaluating such models, given that humans would necessarily be poor proxies for ground truth? In this paper, we propose a framework for evaluating superhuman models via consistency checks. Our premise is that while the correctness of superhuman decisions may be impossible to evaluate, we can still surface mistakes if the model's decisions fail to satisfy certain logical, human-interpretable rules. We instantiate our framework on three tasks where correctness of decisions is hard to evaluate due to either superhuman model abilities, or to otherwise missing ground truth: evaluating chess positions, forecasting future events, and making legal judgments. We show that regardless of a model's (possibly superhuman) performance on these tasks, we can discover logical inconsistencies in decision making. For example: a chess engine assigning opposing valuations to semantically identical boards; GPT-4 forecasting that sports records will evolve non-monotonically over time; or an AI judge assigning bail to a defendant only after we add a felony to their criminal record.
Submitted 19 October, 2023; v1 submitted 16 June, 2023;
originally announced June 2023.
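One of the consistency checks named in the abstract above is that forecasts of a sports record should evolve monotonically over time. A minimal sketch of such a check follows. The function name and the forecast values are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
def monotonicity_violations(forecasts, non_increasing=True):
    """Return (earlier, later) year pairs whose forecasts break monotonicity.

    `forecasts` maps a year to the predicted record value for that year.
    With non_increasing=True (e.g. a record time that can only be lowered),
    a later forecast larger than an earlier one is a violation.
    """
    years = sorted(forecasts)
    violations = []
    for earlier, later in zip(years, years[1:]):
        a, b = forecasts[earlier], forecasts[later]
        broken = b > a if non_increasing else b < a
        if broken:
            violations.append((earlier, later))
    return violations


# Hypothetical model forecasts of a record time in seconds.
predicted = {2025: 9.55, 2030: 9.50, 2035: 9.53, 2040: 9.48}
print(monotonicity_violations(predicted))  # [(2030, 2035)]
```

The check needs no ground truth: it flags the pair of years where the predicted record gets worse, which cannot happen for a record that only improves.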
-
Poisoning Web-Scale Training Datasets is Practical
Authors:
Nicholas Carlini,
Matthew Jagielski,
Christopher A. Choquette-Choo,
Daniel Paleka,
Will Pearce,
Hyrum Anderson,
Andreas Terzis,
Kurt Thomas,
Florian Tramèr
Abstract:
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we have notified the maintainers of each affected dataset and recommended several low-overhead defenses.
Submitted 6 May, 2024; v1 submitted 20 February, 2023;
originally announced February 2023.
-
Red-Teaming the Stable Diffusion Safety Filter
Authors:
Javier Rando,
Daniel Paleka,
David Lindner,
Lennart Heim,
Florian Tramèr
Abstract:
Stable Diffusion is a recent open-source image generation model comparable to proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with a safety filter that aims to prevent generating explicit images. Unfortunately, the filter is obfuscated and poorly documented. This makes it hard for users to prevent misuse in their applications, and to understand the filter's limitations and improve it. We first show that it is easy to generate disturbing content that bypasses the safety filter. We then reverse-engineer the filter and find that while it aims to prevent sexual content, it ignores violence, gore, and other similarly disturbing content. Based on our analysis, we argue safety measures in future model releases should strive to be fully open and properly documented to stimulate security contributions from the community.
Submitted 10 November, 2022; v1 submitted 3 October, 2022;
originally announced October 2022.
-
A law of adversarial risk, interpolation, and label noise
Authors:
Daniel Paleka,
Amartya Sanyal
Abstract:
In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm. We then investigate how different components of this problem affect this result, including properties of the distribution. We also discuss non-uniform label noise distributions, and prove a new theorem showing uniform label noise induces nearly as large an adversarial risk as the worst poisoning with the same noise rate. Then, we provide theoretical and empirical evidence that uniform label noise is more harmful than typical real-world label noise. Finally, we show how inductive biases amplify the effect of label noise and argue the need for future work in this direction.
Submitted 13 March, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.