Showing 1–31 of 31 results for author: Bietti, A

  1. arXiv:2406.03068

    cs.LG cs.AI cs.CL stat.ML

    How Truncating Weights Improves Reasoning in Language Models

    Authors: Lei Chen, Joan Bruna, Alberto Bietti

    Abstract: In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical "reasoning" over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying…

    Submitted 5 June, 2024; originally announced June 2024.

  2. arXiv:2406.02585

    cs.LG cs.AI stat.ML

    Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

    Authors: Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho

    Abstract: Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datas…

    Submitted 30 May, 2024; originally announced June 2024.

  3. arXiv:2403.03362

    cs.LG math.OC

    Level Set Teleportation: An Optimization Perspective

    Authors: Aaron Mishkin, Alberto Bietti, Robert M. Gower

    Abstract: We study level set teleportation, an optimization sub-routine which seeks to accelerate gradient methods by maximizing the gradient norm on a level-set of the objective function. Since the descent lemma implies that gradient descent (GD) decreases the objective proportional to the squared norm of the gradient, level-set teleportation maximizes this one-step progress guarantee. For convex functions…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: Thirty-five pages including appendices

  4. arXiv:2402.19449

    cs.LG cs.CL math.OC stat.ML

    Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models

    Authors: Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti

    Abstract: Adam has been shown to outperform gradient descent on large language models by a larger margin than on other tasks, but it is unclear why. We show that a key factor in this performance gap is the heavy-tailed class imbalance found in language tasks. When trained with gradient descent, the loss of infrequent words decreases more slowly than the loss of frequent ones. This leads to a slow decrease o…

    Submitted 12 July, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

  5. arXiv:2402.18724

    cs.LG cs.AI stat.ML

    Learning Associative Memories with Gradient Descent

    Authors: Vivien Cabannes, Berfin Simsek, Alberto Bietti

    Abstract: This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic grow…

    Submitted 28 February, 2024; originally announced February 2024.

  6. arXiv:2310.19793

    stat.ML cs.LG math.OC

    On Learning Gaussian Multi-index Models with Gradient Flow

    Authors: Alberto Bietti, Joan Bruna, Loucas Pillaud-Vivien

    Abstract: We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link…

    Submitted 2 November, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

  7. arXiv:2310.03024

    astro-ph.IM cs.AI cs.LG

    AstroCLIP: A Cross-Modal Foundation Model for Galaxies

    Authors: Liam Parker, Francois Lanusse, Siavash Golkar, Leopoldo Sarra, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Geraud Krawezik, Michael McCabe, Ruben Ohana, Mariel Pettee, Bruno Regaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

    Abstract: We present AstroCLIP, a single, versatile model that can embed both galaxy images and spectra into a shared, physically meaningful latent space. These embeddings can then be used - without any model fine-tuning - for a variety of downstream tasks including (1) accurate in-modality and cross-modality semantic similarity search, (2) photometric redshift estimation, (3) galaxy property estimation fro…

    Submitted 14 June, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: 18 pages, accepted in Monthly Notices of the Royal Astronomical Society, Presented at the NeurIPS 2023 AI4Science Workshop

  8. arXiv:2310.02994

    cs.LG cs.AI stat.ML

    Multiple Physics Pretraining for Physical Surrogate Models

    Authors: Michael McCabe, Bruno Régaldo-Saint Blancard, Liam Holden Parker, Ruben Ohana, Miles Cranmer, Alberto Bietti, Michael Eickenberg, Siavash Golkar, Geraud Krawezik, Francois Lanusse, Mariel Pettee, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

    Abstract: We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling. MPP involves training large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks. In order to learn effectively in this setting, we introduce a…

    Submitted 4 October, 2023; originally announced October 2023.

  9. arXiv:2310.02989

    stat.ML cs.AI cs.CL cs.LG

    xVal: A Continuous Number Encoding for Large Language Models

    Authors: Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho

    Abstract: Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: 10 pages 7 figures. Supplementary: 5 pages 2 figures

  10. arXiv:2310.02984

    stat.ML cs.AI cs.CL cs.LG cs.NE

    Scaling Laws for Associative Memories

    Authors: Vivien Cabannes, Elvis Dohmatob, Alberto Bietti

    Abstract: Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which relates to the inner layers of transformer language models. We derive precise scaling laws with respect to sample size and parameter size, and discuss the stati…

    Submitted 20 February, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    ACM Class: I.2.6; G.1.6

  11. arXiv:2306.00802

    stat.ML cs.CL cs.LG

    Birth of a Transformer: A Memory Viewpoint

    Authors: Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, Leon Bottou

    Abstract: Large language models based on transformers have achieved great empirical successes. However, as they are deployed more widely, there is a growing need to better understand their internal mechanisms in order to make them more reliable. These models appear to store vast amounts of knowledge from their training data, and to adapt quickly to new information provided in their context or prompt. We stu…

    Submitted 6 November, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: NeurIPS 2023

  12. arXiv:2302.02774

    stat.ML cs.AI cs.LG math.ST

    The SSL Interplay: Augmentations, Inductive Bias, and Generalization

    Authors: Vivien Cabannes, Bobak T. Kiani, Randall Balestriero, Yann LeCun, Alberto Bietti

    Abstract: Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architect…

    Submitted 1 June, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    MSC Class: 68Q32 ACM Class: G.3

    Journal ref: Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023

  13. On minimal variations for unsupervised representation learning

    Authors: Vivien Cabannes, Alberto Bietti, Randall Balestriero

    Abstract: Unsupervised representation learning aims at describing raw data efficiently to solve various downstream tasks. It has been approached with many techniques, such as manifold learning, diffusion maps, or more recently self-supervised learning. Those techniques are arguably all based on the underlying assumption that target functions, associated with future downstream tasks, have low variations in d…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: 5 pages, 1 figure; 1 table

    MSC Class: 68Q32 ACM Class: G.3

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1-5

  14. arXiv:2210.15651

    cs.LG math.OC stat.ML

    Learning Single-Index Models with Shallow Neural Networks

    Authors: Alberto Bietti, Joan Bruna, Clayton Sanford, Min Jae Song

    Abstract: Single-index models are a class of functions given by an unknown univariate "link" function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recover…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: 76 pages. To appear at NeurIPS 2022

  15. arXiv:2206.01079

    cs.LG

    When does return-conditioned supervised learning work for offline reinforcement learning?

    Authors: David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche, Joan Bruna

    Abstract: Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous…

    Submitted 11 January, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

  16. arXiv:2203.11864

    stat.ML cs.LG

    On the (Non-)Robustness of Two-Layer Neural Networks in Different Learning Regimes

    Authors: Elvis Dohmatob, Alberto Bietti

    Abstract: Neural networks are known to be highly sensitive to adversarial examples. These may arise due to different factors, such as random initialization, or spurious correlations in the learning problem. To better understand these factors, we provide a precise study of the adversarial robustness in different scenarios, from initialization to the end of training in different regimes, as well as intermedia…

    Submitted 4 July, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

  17. arXiv:2202.05638

    cs.LG

    Efficient Kernel UCB for Contextual Bandits

    Authors: Houssam Zenati, Alberto Bietti, Eustache Diemert, Julien Mairal, Matthieu Martin, Pierre Gaillard

    Abstract: In this paper, we tackle the computational efficiency of kernelized UCB algorithms in contextual bandits. While standard methods require an O(CT^3) complexity where T is the horizon and the constant C is related to optimizing the UCB rule, we propose an efficient contextual algorithm for large-scale problems. Specifically, our method relies on incremental Nyström approximations of the joint kernel…

    Submitted 11 February, 2022; originally announced February 2022.

    Comments: To appear at AISTATS2022

  18. arXiv:2202.05318

    stat.ML cs.CR cs.LG math.OC

    Personalization Improves Privacy-Accuracy Tradeoffs in Federated Learning

    Authors: Alberto Bietti, Chen-Yu Wei, Miroslav Dudík, John Langford, Zhiwei Steven Wu

    Abstract: Large-scale machine learning systems often involve data distributed across a collection of users. Federated learning algorithms leverage this structure by communicating model updates to a central server, rather than entire datasets. In this paper, we study stochastic optimization algorithms for a personalized federated learning setting involving local and global models subject to user-level (joint…

    Submitted 15 July, 2022; v1 submitted 10 February, 2022; originally announced February 2022.

    Comments: ICML

  19. arXiv:2107.05134

    cs.LG math.OC stat.ML

    Dual Training of Energy-Based Models with Overparametrized Shallow Neural Networks

    Authors: Carles Domingo-Enrich, Alberto Bietti, Marylou Gabrié, Joan Bruna, Eric Vanden-Eijnden

    Abstract: Energy-based models (EBMs) are generative models that are usually trained via maximum likelihood estimation. This approach becomes challenging in generic situations where the trained energy is non-convex, due to the need to sample the Gibbs distribution associated with this energy. Using general Fenchel duality results, we derive variational principles dual to maximum likelihood EBMs with shallow…

    Submitted 15 February, 2022; v1 submitted 11 July, 2021; originally announced July 2021.

  20. arXiv:2106.07148

    stat.ML cs.LG

    On the Sample Complexity of Learning under Invariance and Geometric Stability

    Authors: Alberto Bietti, Luca Venturi, Joan Bruna

    Abstract: Many supervised learning problems involve high-dimensional data such as images, text, or graphs. In order to make efficient use of data, it is often useful to leverage certain geometric priors in the problem at hand, such as invariance to translations, permutation subgroups, or stability to small deformations. We study the sample complexity of learning problems where the target function presents s…

    Submitted 4 November, 2021; v1 submitted 13 June, 2021; originally announced June 2021.

  21. arXiv:2105.13099

    stat.ML cs.LG

    On the Universality of Graph Neural Networks on Large Random Graphs

    Authors: Nicolas Keriven, Alberto Bietti, Samuel Vaiter

    Abstract: We study the approximation power of Graph Neural Networks (GNNs) on latent position random graphs. In the large graph limit, GNNs are known to converge to certain "continuous" models known as c-GNNs, which directly enables a study of their approximation power on random graph models. In the absence of input node features however, just as GNNs are limited by the Weisfeiler-Lehman isomorphism test, c…

    Submitted 28 May, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

  22. arXiv:2104.07531

    cs.LG stat.ML

    On Energy-Based Models with Overparametrized Shallow Neural Networks

    Authors: Carles Domingo-Enrich, Alberto Bietti, Eric Vanden-Eijnden, Joan Bruna

    Abstract: Energy-based models (EBMs) are a simple yet powerful framework for generative modeling. They are based on a trainable energy function which defines an associated Gibbs measure, and they can be trained and sampled from via well-established statistical tools, such as MCMC. Neural networks may be used as energy function approximators, providing both a rich class of expressive models as well as a flex…

    Submitted 5 May, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

  23. arXiv:2102.10032

    stat.ML cs.LG

    Approximation and Learning with Deep Convolutional Models: a Kernel Perspective

    Authors: Alberto Bietti

    Abstract: The empirical success of deep convolutional networks on tasks involving high-dimensional data such as images or audio suggests that they can efficiently approximate certain functions that are well-suited for such tasks. In this paper, we study this through the lens of kernel methods, by considering simple hierarchical kernels with two or three convolution and pooling layers, inspired by convolutio…

    Submitted 18 March, 2022; v1 submitted 19 February, 2021; originally announced February 2021.

    Comments: ICLR 2022

  24. arXiv:2009.14397

    stat.ML cs.LG

    Deep Equals Shallow for ReLU Networks in Kernel Regimes

    Authors: Alberto Bietti, Francis Bach

    Abstract: Deep networks are often considered to be more expressive than shallow ones in terms of approximation. Indeed, certain functions can be approximated by deep networks provably more efficiently than by shallow ones, however, no tractable algorithms are known for learning such deep models. Separately, a recent line of work has shown that deep networks trained with gradient descent may behave like (tra…

    Submitted 26 August, 2021; v1 submitted 29 September, 2020; originally announced September 2020.

  25. arXiv:2006.01868

    stat.ML cs.LG

    Convergence and Stability of Graph Convolutional Networks on Large Random Graphs

    Authors: Nicolas Keriven, Alberto Bietti, Samuel Vaiter

    Abstract: We study properties of Graph Convolutional Networks (GCNs) by analyzing their behavior on standard models of random graphs, where nodes are represented by random latent variables and edges are drawn according to a similarity kernel. This allows us to overcome the difficulties of dealing with discrete notions such as isomorphisms on very large graphs, by considering instead more natural geometric a…

    Submitted 23 October, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

  26. arXiv:2004.11722

    stat.ML cs.LG

    Counterfactual Learning of Stochastic Policies with Continuous Actions: from Models to Offline Evaluation

    Authors: Houssam Zenati, Alberto Bietti, Matthieu Martin, Eustache Diemert, Pierre Gaillard, Julien Mairal

    Abstract: Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous action case rais…

    Submitted 14 December, 2022; v1 submitted 22 April, 2020; originally announced April 2020.

  27. arXiv:1905.12173

    stat.ML cs.LG

    On the Inductive Bias of Neural Tangent Kernels

    Authors: Alberto Bietti, Julien Mairal

    Abstract: State-of-the-art neural networks are heavily over-parameterized, making the optimization algorithm a crucial ingredient for learning predictive models with good generalization properties. A recent line of work has shown that in a certain over-parameterized regime, the learning dynamics of gradient descent are governed by a certain kernel obtained at initialization, called the neural tangent kernel…

    Submitted 31 October, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

    Comments: NeurIPS 2019

  28. arXiv:1810.00363

    stat.ML cs.LG

    A Kernel Perspective for Regularizing Deep Neural Networks

    Authors: Alberto Bietti, Grégoire Mialon, Dexiong Chen, Julien Mairal

    Abstract: We propose a new point of view for regularizing deep neural networks by using the norm of a reproducing kernel Hilbert space (RKHS). Even though this norm cannot be computed, it admits upper and lower approximations leading to various practical strategies. Specifically, this perspective (i) provides a common umbrella for many existing regularization principles, including spectral norm and gradient…

    Submitted 13 May, 2019; v1 submitted 30 September, 2018; originally announced October 2018.

    Comments: ICML

  29. arXiv:1802.04064

    stat.ML cs.LG

    A Contextual Bandit Bake-off

    Authors: Alberto Bietti, Alekh Agarwal, John Langford

    Abstract: Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes on statistically and computationally efficient methods, the practical behavior of these algorithms is still poorly understood. We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithm…

    Submitted 4 June, 2021; v1 submitted 12 February, 2018; originally announced February 2018.

    Comments: JMLR

  30. arXiv:1706.03078

    stat.ML cs.LG

    Group Invariance, Stability to Deformations, and Complexity of Deep Convolutional Representations

    Authors: Alberto Bietti, Julien Mairal

    Abstract: The success of deep convolutional architectures is often attributed in part to their ability to learn multiscale and invariant representations of natural signals. However, a precise study of these properties and how they affect learning guarantees is still missing. In this paper, we consider deep convolutional representations of signals; we study their invariance to translations and to more genera…

    Submitted 10 October, 2018; v1 submitted 9 June, 2017; originally announced June 2017.

    Journal ref: Journal of Machine Learning Research 20 (2019) 1-49

  31. arXiv:1610.00970

    stat.ML cs.LG math.OC

    Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure

    Authors: Alberto Bietti, Julien Mairal

    Abstract: Stochastic optimization algorithms with variance reduction have proven successful for minimizing large finite sums of functions. Unfortunately, these techniques are unable to deal with stochastic perturbations of input data, induced for example by data augmentation. In such cases, the objective is no longer a finite sum, and the main candidate for optimization is the stochastic gradient descent me…

    Submitted 15 November, 2017; v1 submitted 4 October, 2016; originally announced October 2016.

    Comments: Advances in Neural Information Processing Systems (NIPS), Dec 2017, Long Beach, CA, United States