Showing 1–16 of 16 results for author: Garriga-Alonso, A

  1. arXiv:2410.13032  [pdf, other]

    cs.AI cs.LG stat.ML

    Hypothesis Testing the Circuit Hypothesis in LLMs

    Authors: Claudia Shi, Nicolas Beltran-Velez, Achille Nazaret, Carolina Zheng, Adrià Garriga-Alonso, Andrew Jesson, Maggie Makar, David M. Blei

    Abstract: Large language models (LLMs) demonstrate surprising capabilities, but we do not understand how they are implemented. One hypothesis suggests that these capabilities are primarily executed by small subnetworks within the LLM, known as circuits. But how can we evaluate this hypothesis? In this paper, we formalize a set of criteria that a circuit is hypothesized to meet and develop a suite of hypothe…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: Code available here: https://github.com/blei-lab/circuitry

  2. arXiv:2407.15421  [pdf, other]

    cs.LG cs.AI

    Planning behavior in a recurrent neural network that plays Sokoban

    Authors: Adrià Garriga-Alonso, Mohammad Taufeeque, Adam Gleave

    Abstract: To predict how advanced neural networks generalize to novel situations, it is essential to understand how they reason. Guez et al. (2019, "An investigation of model-free planning") trained a recurrent neural network (RNN) to play Sokoban with model-free reinforcement learning. They found that adding extra computation steps to the start of episodes at test time improves the RNN's success rate. We f…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Mechanistic Interpretability workshop, ICML 2024

  3. arXiv:2407.15166  [pdf, other]

    cs.LG

    Adversarial Circuit Evaluation

    Authors: Niels uit de Bos, Adrià Garriga-Alonso

    Abstract: Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit…

    Submitted 21 July, 2024; originally announced July 2024.

    Comments: 19 pages, 10 figures

  4. arXiv:2407.14503  [pdf, other]

    cs.LG

    Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

    Authors: Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso

    Abstract: When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and, therefore, always has some error. It is common to mitigate this by regularizing the policy with KL divergence from a base model, with the hope that balancing reward with regularization will achieve desirable outcomes despite this reward misspecification. We show that when the reward function has l…

    Submitted 19 July, 2024; originally announced July 2024.

    Comments: Mechanistic Interpretability workshop at ICML 2024

  5. arXiv:2407.14494  [pdf, other]

    cs.LG

    InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

    Authors: Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso

    Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train these neural networks using a stricter version of Interchange Intervent…

    Submitted 19 July, 2024; originally announced July 2024.

  6. arXiv:2407.14008  [pdf, other]

    cs.LG

    Investigating the Indirect Object Identification circuit in Mamba

    Authors: Danielle Ensign, Adrià Garriga-Alonso

    Abstract: How well will current interpretability techniques generalize to future models? A relevant case study is Mamba, a recent recurrent architecture with scaling comparable to Transformers. We adapt pre-Mamba techniques to Mamba and partially reverse-engineer the circuit responsible for the Indirect Object Identification (IOI) task. Our techniques provide evidence that 1) Layer 39 is a key bottleneck, 2…

    Submitted 21 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

  7. arXiv:2407.12404  [pdf, other]

    cs.LG

    Analyzing the Generalization and Reliability of Steering Vectors

    Authors: Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adrià Garriga-Alonso, Robert Kirk

    Abstract: Steering vectors (SVs) are a new approach to efficiently adjust language model behaviour at inference time by intervening on intermediate model activations. They have shown promise in terms of improving both capabilities and model alignment. However, the reliability and generalisation properties of this approach are unknown. In this work, we rigorously investigate these properties, and show that s…

    Submitted 22 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

  8. arXiv:2304.14997  [pdf, other]

    cs.LG

    Towards Automated Circuit Discovery for Mechanistic Interpretability

    Authors: Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso

    Abstract: Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the…

    Submitted 28 October, 2023; v1 submitted 28 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023 Spotlight

  9. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  10. arXiv:2106.05586  [pdf, other]

    stat.ML cs.LG

    Data augmentation in Bayesian neural networks and the cold posterior effect

    Authors: Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, Laurence Aitchison

    Abstract: Bayesian neural networks that incorporate data augmentation implicitly use a ``randomly perturbed log-likelihood [which] does not have a clean interpretation as a valid likelihood function'' (Izmailov et al. 2021). Here, we provide several approaches to developing principled Bayesian neural networks incorporating data augmentation. We introduce a ``finite orbit'' setting which allows likelihoods t…

    Submitted 9 December, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

  11. BNNpriors: A library for Bayesian neural network inference with different prior distributions

    Authors: Vincent Fortuin, Adrià Garriga-Alonso, Mark van der Wilk, Laurence Aitchison

    Abstract: Bayesian neural networks have shown great promise in many applications where calibrated uncertainty estimates are crucial and can often also lead to a higher predictive performance. However, it remains challenging to choose a good prior distribution over their weights. While isotropic Gaussian priors are often chosen in practice due to their simplicity, they do not reflect our true prior beliefs w…

    Submitted 14 May, 2021; originally announced May 2021.

    Comments: Accepted for publication at Software Impacts

  12. arXiv:2102.06571  [pdf, other]

    stat.ML cs.LG

    Bayesian Neural Network Priors Revisited

    Authors: Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Rätsch, Richard E. Turner, Mark van der Wilk, Laurence Aitchison

    Abstract: Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolution…

    Submitted 16 March, 2022; v1 submitted 12 February, 2021; originally announced February 2021.

    Comments: Accepted at ICLR 2022

  13. arXiv:2102.01691  [pdf, ps, other]

    stat.ML cs.LG

    Exact Langevin Dynamics with Stochastic Gradients

    Authors: Adrià Garriga-Alonso, Vincent Fortuin

    Abstract: Stochastic gradient Markov Chain Monte Carlo algorithms are popular samplers for approximate inference, but they are generally biased. We show that many recent versions of these methods (e.g. Chen et al. (2014)) cannot be corrected using Metropolis-Hastings rejection sampling, because their acceptance probability is always zero. We can fix this by employing a sampler with realizable backwards traj…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: 13 pages, 2 figures. Accepted to the 3rd Symposium on Advances in Approximate Bayesian Inference (AABI 2021)

  14. arXiv:2101.04097  [pdf, other]

    stat.ML cs.LG

    Correlated Weights in Infinite Limits of Deep Convolutional Neural Networks

    Authors: Adrià Garriga-Alonso, Mark van der Wilk

    Abstract: Infinite width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide convolutional neural networks (CNNs), it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spati…

    Submitted 13 June, 2021; v1 submitted 11 January, 2021; originally announced January 2021.

    Comments: Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)

  15. arXiv:2011.09421  [pdf, other]

    stat.ML cs.LG

    Understanding Variational Inference in Function-Space

    Authors: David R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso, Mark van der Wilk

    Abstract: Recent work has attempted to directly approximate the `function-space' or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullba…

    Submitted 18 November, 2020; originally announced November 2020.

    Comments: 19 pages

  16. arXiv:1808.05587  [pdf, other]

    stat.ML cs.LG

    Deep Convolutional Networks as shallow Gaussian Processes

    Authors: Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison

    Abstract: We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the o…

    Submitted 4 May, 2019; v1 submitted 16 August, 2018; originally announced August 2018.