Showing 1–50 of 147 results for author: Jaggi, M

  1. arXiv:2410.05090  [pdf, other]

    cs.LG stat.ML

    HyperINF: Unleashing the HyperPower of the Schulz's Method for Data Influence Estimation

    Authors: Xinyu Zhou, Simin Fan, Martin Jaggi

    Abstract: Influence functions provide a principled method to assess the contribution of individual training samples to a specific target. Yet, their high computational costs limit their applications on large-scale models and datasets. Existing methods proposed for influence function approximation have significantly reduced the computational overheads. However, they mostly suffer from inaccurate estimation d…

    Submitted 7 October, 2024; originally announced October 2024.

  2. arXiv:2409.13931  [pdf, other]

    cs.LG cs.CL

    On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists

    Authors: Dongyang Fan, Bettina Messmer, Martin Jaggi

    Abstract: On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate learning with private and scarce local data, federated learning has become a standard approach, though it introduces challenges related to system and data heterogeneity among end users. As a solution, we propose a novel $\textbf{Co}$llaborative learning app…

    Submitted 1 October, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

  3. arXiv:2409.05539  [pdf, other]

    cs.LG cs.DC

    CoBo: Collaborative Learning via Bilevel Optimization

    Authors: Diba Hashemi, Lie He, Martin Jaggi

    Abstract: Collaborative learning is an important tool to train multiple clients more effectively by enabling communication among clients. Identifying helpful clients, however, is challenging and often introduces significant overhead. In this paper, we model client-selection and model-training as two interconnected optimization problems, proposing a novel bilevel optimization problem for collaborative…

    Submitted 9 September, 2024; originally announced September 2024.

  4. arXiv:2409.03682  [pdf, other]

    cs.LG math.OC

    A New First-Order Meta-Learning Algorithm with Convergence Guarantees

    Authors: El Mahdi Chayti, Martin Jaggi

    Abstract: Learning new tasks by drawing on prior experience gathered from other (related) tasks is a core property of any intelligent system. Gradient-based meta-learning, especially MAML and its variants, has emerged as a viable solution to accomplish this goal. One problem MAML encounters is the computational and memory burden of computing the meta-gradients. We propose a new first-order variant of…

    Submitted 5 September, 2024; originally announced September 2024.

  5. arXiv:2408.11841  [pdf, other]

    cs.CY cs.AI cs.CL

    Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants

    Authors: Beatriz Borges, Negar Foroutan, Deniz Bayazit, Anna Sotnikova, Syrielle Montariol, Tanya Nazaretzky, Mohammadreza Banaei, Alireza Sakhaeirad, Philippe Servant, Seyed Parsa Neshaei, Jibril Frej, Angelika Romanou, Gail Weiss, Sepideh Mamooler, Zeming Chen, Simin Fan, Silin Gao, Mete Ismayilzada, Debjit Paul, Alexandre Schöpfer, Andrej Janchevski, Anja Tiede, Clarence Linden, Emanuele Troiani, Francesco Salvi, et al. (65 additional authors not shown)

    Abstract: AI assistants are being increasingly used by students enrolled in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability, the potential for university assessments and learning outcomes to be impacted by…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: 20 pages, 8 figures

  6. arXiv:2405.20935  [pdf, other]

    cs.LG cs.AI

    Effective Interplay between Sparsity and Quantization: From Theory to Practice

    Authors: Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh

    Abstract: The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two m…

    Submitted 31 May, 2024; originally announced May 2024.

  7. arXiv:2405.19454  [pdf, other]

    cs.LG stat.ML

    Deep Grokking: Would Deep Neural Networks Generalize Better?

    Authors: Simin Fan, Razvan Pascanu, Martin Jaggi

    Abstract: Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focuses on s…

    Submitted 29 May, 2024; originally announced May 2024.

  8. arXiv:2405.18392  [pdf, other]

    cs.LG

    Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

    Authors: Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi

    Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across…

    Submitted 17 October, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

    Comments: Spotlight at NeurIPS 2024

  9. arXiv:2405.01031  [pdf, other]

    cs.LG cs.CR cs.DC math.OC stat.ML

    The Privacy Power of Correlated Noise in Decentralized Learning

    Authors: Youssef Allouah, Anastasia Koloskova, Aymane El Firdoussi, Martin Jaggi, Rachid Guerraoui

    Abstract: Decentralized learning is appealing as it enables the scalable usage of large amounts of distributed data and resources (without resorting to any central entity), while promoting privacy since every user minimizes the direct exposure of their data. Yet, without additional precautions, curious users can still leverage models obtained from their peers to violate privacy. In this paper, we propose De…

    Submitted 3 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted as conference paper at ICML 2024

  10. arXiv:2404.09753  [pdf, other]

    cs.CL cs.LG

    Personalized Collaborative Fine-Tuning for On-Device Large Language Models

    Authors: Nicolas Wagner, Dongyang Fan, Martin Jaggi

    Abstract: We explore on-device self-supervised collaborative fine-tuning of large language models with limited local data availability. Taking inspiration from the collaborative learning community, we introduce three distinct trust-weighted gradient aggregation schemes: weight similarity-based, prediction similarity-based and validation performance-based. To minimize communication overhead, we integrate Low…

    Submitted 6 August, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

    Journal ref: COLM 2024

  11. arXiv:2404.00456  [pdf, other]

    cs.LG

    QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

    Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

    Abstract: We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to th…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 19 pages, 6 figures

  12. arXiv:2402.13089  [pdf, other]

    cs.LG cs.AI cs.CL

    Towards an empirical understanding of MoE design choices

    Authors: Dongyang Fan, Bettina Messmer, Martin Jaggi

    Abstract: In this study, we systematically evaluate the impact of common design choices in Mixture of Experts (MoEs) on validation performance, uncovering distinct influences at token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further…

    Submitted 20 February, 2024; originally announced February 2024.

  13. arXiv:2402.04843  [pdf, other]

    math.OC

    Spectral Preconditioning for Gradient Methods on Graded Non-convex Functions

    Authors: Nikita Doikov, Sebastian U. Stich, Martin Jaggi

    Abstract: The performance of optimization methods is often tied to the spectrum of the objective Hessian. Yet, conventional assumptions, such as smoothness, often do not enable us to make fine-grained convergence statements -- particularly not for non-convex problems. Striving for a more intricate characterization of complexity, we introduce a unique concept termed graded non-convexity. This allows to par…

    Submitted 7 February, 2024; originally announced February 2024.

  14. arXiv:2402.04161  [pdf, other]

    cs.LG cs.CL cs.IT stat.ML

    Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

    Authors: Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, Michael Gastpar

    Abstract: In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and s…

    Submitted 6 February, 2024; originally announced February 2024.

  15. arXiv:2402.02933  [pdf, other]

    cs.LG cs.CY cs.HC

    InterpretCC: Intrinsic User-Centric Interpretability through Global Mixture of Experts

    Authors: Vinitra Swamy, Syrielle Montariol, Julian Blackwell, Jibril Frej, Martin Jaggi, Tanja Käser

    Abstract: Interpretability for neural networks is a trade-off between three key requirements: 1) faithfulness of the explanation (i.e., how perfectly it explains the prediction), 2) understandability of the explanation by humans, and 3) model performance. Most existing methods compromise one or more of these requirements; e.g., post-hoc approaches provide limited faithfulness, automatically identified featu…

    Submitted 29 May, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

  16. arXiv:2402.02622  [pdf, other]

    cs.CL cs.LG

    DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

    Authors: Matteo Pagliardini, Amirkeivan Mohtashami, Francois Fleuret, Martin Jaggi

    Abstract: The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size -- adding a few thousand parameters for large-scale models in the 100B param…

    Submitted 21 March, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

  17. arXiv:2312.09316  [pdf, other]

    cs.AI cs.HC

    Distributional Latent Variable Models with an Application in Active Cognitive Testing

    Authors: Robert Kasumba, Dom CP Marticorena, Anja Pahor, Geetha Ramani, Imani Goffney, Susanne M Jaeggi, Aaron Seitz, Jacob R Gardner, Dennis L Barbour

    Abstract: Cognitive modeling commonly relies on asking participants to complete a battery of varied tests in order to estimate attention, working memory, and other latent variables. In many cases, these tests result in highly variable observation models. A near-ubiquitous approach is to repeat many observations for each test independently, resulting in a distribution over the outcomes from each test given t…

    Submitted 25 September, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: 11 pages, 6 figures

  18. arXiv:2311.16079  [pdf, other]

    cs.CL cs.AI cs.LG

    MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

    Authors: Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba, Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami, Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit, Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, Antoine Bosselut

    Abstract: Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs' medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale (<= 13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by rele…

    Submitted 27 November, 2023; originally announced November 2023.

  19. arXiv:2311.06724  [pdf, other]

    cs.CL cs.LG

    Controllable Topic-Focused Abstractive Summarization

    Authors: Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff

    Abstract: Controlled abstractive summarization focuses on producing condensed versions of a source article to cover specific aspects by shifting the distribution of generated text towards a desired style, e.g., a set of topics. Subsequently, the resulting summaries may be tailored to user-defined requirements. This paper presents a new Transformer-based architecture capable of producing topic-focused summar…

    Submitted 11 November, 2023; originally announced November 2023.

  20. arXiv:2310.15393  [pdf, other]

    cs.LG cs.AI cs.CL

    DoGE: Domain Reweighting with Generalization Estimation

    Authors: Simin Fan, Matteo Pagliardini, Martin Jaggi

    Abstract: The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite its importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data-domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (…

    Submitted 5 February, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

  21. arXiv:2310.15389  [pdf, other]

    cs.CL cs.AI cs.LG

    Irreducible Curriculum for Language Model Pretraining

    Authors: Simin Fan, Martin Jaggi

    Abstract: Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contributions of each individual training point. It is difficult to apply traditional datapoint selection methods on large langu…

    Submitted 23 October, 2023; originally announced October 2023.

  22. arXiv:2310.13033  [pdf, other]

    cs.NE cs.AI cs.IT cs.LG

    LASER: Linear Compression in Wireless Distributed Optimization

    Authors: Ashok Vardhan Makkuva, Marco Bondaschi, Thijs Vogels, Martin Jaggi, Hyeji Kim, Michael C. Gastpar

    Abstract: Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large scale machine learning. Despite its merits, the communication bottleneck is one of its persistent issues. Most compression schemes to alleviate this either assume noiseless communication links, or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineA…

    Submitted 6 February, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

  23. arXiv:2310.10845  [pdf, other]

    cs.CL cs.LG

    CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference

    Authors: Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi

    Abstract: Scaling language models to larger and deeper sizes has led to significant boosts in performance. Even though the size of these models limits their application in compute-constrained environments, the race to continually develop ever larger and deeper foundational models is underway. At the same time -- regardless of the model size -- task-specific techniques continue to play a pivotal role in achi…

    Submitted 14 August, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

  24. arXiv:2309.14118  [pdf, other]

    cs.LG

    MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks

    Authors: Vinitra Swamy, Malika Satayeva, Jibril Frej, Thierry Bossy, Thijs Vogels, Martin Jaggi, Tanja Käser, Mary-Anne Hartley

    Abstract: Predicting multiple real-world tasks in a single model often requires a particularly diverse feature space. Multimodal (MM) models aim to extract the synergistic predictive potential of multiple data types to create a shared feature space with aligned semantic meaning across inputs of drastically varying sizes (e.g., images, text, sound). Most current MM architectures fuse these representations in…

    Submitted 6 November, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted as a full paper at NeurIPS 2023 in New Orleans, USA

  25. arXiv:2307.06966  [pdf, other]

    cs.LG

    Layer-wise Linear Mode Connectivity

    Authors: Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi

    Abstract: Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models. It is most prominently used in federated learning. If models are averaged at the end of training, this can only lead to a well-performing model if the loss surface of interest is very particular, i.e., the loss in the midpoint between the two models needs to be sufficiently low. This is i…

    Submitted 19 March, 2024; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Published at ICLR 2024

  26. arXiv:2306.08393  [pdf, other]

    cs.LG cs.DC

    Provably Personalized and Robust Federated Learning

    Authors: Mariel Werner, Lie He, Michael Jordan, Martin Jaggi, Sai Praneeth Karimireddy

    Abstract: Identifying clients with similar objectives and learning a model-per-cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. We formalize this problem as a stochastic optimization problem, achieving optimal convergence rates for a large class of loss functions. We propose sim…

    Submitted 18 December, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

  27. arXiv:2306.01160  [pdf, other]

    cs.LG cs.AI cs.CL

    Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

    Authors: Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

    Abstract: Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t. the sequence length -- becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead…

    Submitted 1 June, 2023; originally announced June 2023.

  28. arXiv:2305.19259  [pdf, other]

    cs.LG math.OC stat.ML

    On Convergence of Incremental Gradient for Non-Convex Smooth Functions

    Authors: Anastasia Koloskova, Nikita Doikov, Sebastian U. Stich, Martin Jaggi

    Abstract: In machine learning and neural network optimization, algorithms like incremental gradient and shuffle SGD are popular because they minimize the number of cache misses and show good practical convergence behavior. However, their optimization properties in theory, especially for non-convex smooth functions, remain incompletely explored. This paper delves into the convergence properties of SGD algorithms w…

    Submitted 12 February, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

  29. arXiv:2305.18497  [pdf, other]

    cs.LG

    Collaborative Learning via Prediction Consensus

    Authors: Dongyang Fan, Celestine Mendler-Dünner, Martin Jaggi

    Abstract: We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a tru…

    Submitted 14 November, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  30. arXiv:2305.17212  [pdf, other]

    cs.LG

    Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

    Authors: Atli Kosson, Bettina Messmer, Martin Jaggi

    Abstract: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the…

    Submitted 3 June, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ICML 2024; Code available at https://github.com/epfml/REQ

  31. arXiv:2305.17205  [pdf, other]

    cs.LG

    Ghost Noise for Regularizing Deep Neural Networks

    Authors: Atli Kosson, Dongyang Fan, Martin Jaggi

    Abstract: Batch Normalization (BN) is widely used to stabilize the optimization process and improve the test performance of deep neural networks. The regularization effect of BN depends on the batch size, and explicitly using smaller batch sizes with Batch Normalization, a method known as Ghost Batch Normalization (GBN), has been found to improve generalization in many settings. We investigate the effectiven…

    Submitted 19 December, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Journal ref: AAAI 2024

  32. arXiv:2305.17190  [pdf, other]

    cs.LG

    Multiplication-Free Transformer Training via Piecewise Affine Operations

    Authors: Atli Kosson, Martin Jaggi

    Abstract: Multiplications are responsible for most of the computational cost involved in neural network training and inference. Recent research has thus looked for ways to reduce the cost associated with them. Inspired by Mogami (2020), we replace multiplication with a cheap piecewise affine approximation that is achieved by adding the bit representation of the floating point numbers together as integers. W…

    Submitted 25 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  33. arXiv:2305.16300  [pdf, other]

    cs.CL cs.LG

    Landmark Attention: Random-Access Infinite Context Length for Transformers

    Authors: Amirkeivan Mohtashami, Martin Jaggi

    Abstract: While Transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or…

    Submitted 19 November, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Published as a conference paper at NeurIPS 2023 - 37th Conference on Neural Information Processing Systems

  34. arXiv:2302.12808  [pdf, other]

    math.OC cs.LG

    Linearization Algorithms for Fully Composite Optimization

    Authors: Maria-Luiza Vladarean, Nikita Doikov, Martin Jaggi, Nicolas Flammarion

    Abstract: This paper studies first-order algorithms for solving fully composite optimization problems over convex and compact sets. We leverage the structure of the objective by handling its differentiable and non-differentiable components separately, linearizing only the smooth parts. This provides us with new generalizations of the classical Frank-Wolfe method and the Conditional Gradient Sliding algorith…

    Submitted 12 July, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

  35. arXiv:2302.11962  [pdf, other]

    math.OC cs.LG

    Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods

    Authors: El Mahdi Chayti, Nikita Doikov, Martin Jaggi

    Abstract: We study stochastic Cubic Newton methods for solving general, possibly non-convex minimization problems. We propose a new framework, which we call the helper framework, that provides a unified view of the stochastic and variance-reduced second-order algorithms equipped with global complexity guarantees. It can also be applied to learning with auxiliary information. Our helper framework offers the a…

    Submitted 5 September, 2024; v1 submitted 23 February, 2023; originally announced February 2023.

    Comments: Published in Transactions on Machine Learning Research

  36. arXiv:2301.02151  [pdf, other]

    cs.LG cs.DC math.OC

    Beyond spectral gap (extended): The role of the topology in decentralized learning

    Authors: Thijs Vogels, Hadrien Hendrikx, Martin Jaggi

    Abstract: In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. In the decentralized setting, in which workers communicate over a sparse graph, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communica…

    Submitted 5 January, 2023; originally announced January 2023.

    Comments: Extended version of the other paper (with the same name), that includes (among other things) theory for the heterogeneous case. arXiv admin note: substantial text overlap with arXiv:2206.03093

  37. arXiv:2212.00781  [pdf, other]

    math.OC cs.LG

    Second-order optimization with lazy Hessians

    Authors: Nikita Doikov, El Mahdi Chayti, Martin Jaggi

    Abstract: We analyze Newton's method with lazy Hessian updates for solving general, possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establis…

    Submitted 15 June, 2023; v1 submitted 1 December, 2022; originally announced December 2022.

  38. arXiv:2211.10943  [pdf, other]

    cs.LG cs.AI

    Scalable Collaborative Learning via Representation Sharing

    Authors: Frédéric Berdoz, Abhishek Singh, Martin Jaggi, Ramesh Raskar

    Abstract: Privacy-preserving machine learning has become a key conundrum for multi-party artificial intelligence. Federated learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device). In FL, each data holder trains a model locally and releases it to a central server for aggregation. In SL, the clients must release individual cut-lay…

    Submitted 13 December, 2022; v1 submitted 20 November, 2022; originally announced November 2022.

  39. arXiv:2211.10737  [pdf, other]

    cs.LG

    Accuracy Booster: Enabling 4-bit Fixed-point Arithmetic for DNN Training

    Authors: Simla Burcu Harma, Ayan Chakraborty, Nicholas Sperry, Babak Falsafi, Martin Jaggi, Yunho Oh

    Abstract: The unprecedented demand for computing resources to train DNN models has led to a search for minimal numerical encoding. Recent state-of-the-art (SOTA) proposals advocate for multi-level scaled narrow bitwidth numerical formats. In this paper, we show that single-level scaling is sufficient to maintain training accuracy while maximizing arithmetic density. We identify a previously proposed single-…

    Submitted 31 May, 2024; v1 submitted 19 November, 2022; originally announced November 2022.

  40. arXiv:2211.06637  [pdf, other]

    cs.LG

    Modular Clinical Decision Support Networks (MoDN) -- Updatable, Interpretable, and Portable Predictions for Evolving Clinical Environments

    Authors: Cécile Trottet, Thijs Vogels, Martin Jaggi, Mary-Anne Hartley

    Abstract: Data-driven Clinical Decision Support Systems (CDSS) have the potential to improve and standardise care with personalised probabilistic guidance. However, the size of data required necessitates collaborative learning from analogous CDSSs, which are often unsharable or imperfectly interoperable (IIO), meaning their feature sets are not perfectly overlapping. We propose Modular Clinical Decision Su…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States & Virtual, http://www.ml4h.cc, 9 pages

  41. arXiv:2210.04620  [pdf, other]

    cs.LG cs.CV

    FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings

    Authors: Jean Ogier du Terrail, Samy-Safwan Ayed, Edwige Cyffers, Felix Grimberg, Chaoyang He, Regis Loeb, Paul Mangold, Tanguy Marchand, Othmane Marfoq, Erum Mushtaq, Boris Muzellec, Constantin Philippenko, Santiago Silva, Maria Teleńczuk, Shadi Albarqouni, Salman Avestimehr, Aurélien Bellet, Aymeric Dieuleveut, Martin Jaggi, Sai Praneeth Karimireddy, Marco Lorenzi, Giovanni Neglia, Marc Tommasi, Mathieu Andreux

    Abstract: Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models, without centralizing data. The cross-silo FL setting corresponds to the case of few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works hav…

    Submitted 5 May, 2023; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS, Datasets and Benchmarks Track, this version fixes typos in the datasets' table and the appendix

  42. arXiv:2206.08307  [pdf, other]

    cs.LG cs.DC math.OC

    Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

    Authors: Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

    Abstract: We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives…

    Submitted 16 June, 2022; originally announced June 2022.

  43. arXiv:2206.03093  [pdf, other]

    cs.LG math.OC stat.ML

    Beyond spectral gap: The role of the topology in decentralized learning

    Authors: Thijs Vogels, Hadrien Hendrikx, Martin Jaggi

    Abstract: In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects o…

    Submitted 8 November, 2022; v1 submitted 7 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022

  44. arXiv:2205.15142  [pdf, other]

    cs.LG math.OC

    Special Properties of Gradient Descent with Large Learning Rates

    Authors: Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich

    Abstract: When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of exper…

    Submitted 16 February, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: A short version of this work appeared at the ICML 2022 Workshop on Continuous Time Methods for Machine Learning under the title "The Gap Between Continuous and Discrete Gradient Descent"

  45. arXiv:2205.08184  [pdf, other]

    cs.CL cs.AI cs.LG

    SKILL: Structured Knowledge Infusion for Large Language Models

    Authors: Fedor Moiseev, Zhe Dong, Enrique Alfonseca, Martin Jaggi

    Abstract: Large language models (LLMs) have demonstrated human-level performance on a vast spectrum of natural language tasks. However, it is largely unexplored whether they can better internalize knowledge from structured data, such as a knowledge graph, or from text. In this work, we propose a method to infuse structured knowledge into LLMs, by directly training T5 models on factual triples of knowledge…

    Submitted 17 May, 2022; originally announced May 2022.

    Comments: NAACL 2022

  46. arXiv:2204.06477  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Data-heterogeneity-aware Mixing for Decentralized Learning

    Authors: Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

    Abstract: Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the g…

    Submitted 13 April, 2022; originally announced April 2022.

  47. arXiv:2202.05737  [pdf, other]

    cs.LG

    Improving Generalization via Uncertainty Driven Perturbations

    Authors: Matteo Pagliardini, Gilberto Manunza, Martin Jaggi, Michael I. Jordan, Tatjana Chavdarova

    Abstract: Recently, Shah et al. (2020) pointed out the pitfalls of the simplicity bias - the tendency of gradient-based algorithms to learn simple models - which include the model's high sensitivity to small input perturbations, as well as sub-optimal margins. In particular, while Stochastic Gradient Descent yields a max-margin boundary on linear models, such a guarantee does not extend to non-linear models. To m…

    Submitted 28 February, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

  48. arXiv:2202.04414  [pdf, other]

    cs.LG

    Agree to Disagree: Diversity through Disagreement for Better Transferability

    Authors: Matteo Pagliardini, Martin Jaggi, François Fleuret, Sai Praneeth Karimireddy

    Abstract: Gradient-based learning algorithms have an implicit simplicity bias which in effect can limit the diversity of predictors being sampled by the learning procedure. This behavior can hinder the transferability of trained models by (i) favoring the learning of simpler but spurious features -- present in the training data but absent from the test data -- and (ii) by only leveraging a small subset of p…

    Submitted 23 November, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: 23 pages, 17 figures

  49. arXiv:2202.01838  [pdf, other]

    cs.LG

    Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

    Authors: Amirkeivan Mohtashami, Sebastian Stich, Martin Jaggi

    Abstract: While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvemen…

    Submitted 3 February, 2022; originally announced February 2022.

  50. arXiv:2202.01545  [pdf, other]

    cs.LG cs.DC math.OC stat.ML

    Byzantine-Robust Decentralized Learning via ClippedGossip

    Authors: Lie He, Sai Praneeth Karimireddy, Martin Jaggi

    Abstract: In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus and benefit from collaborative training. To address these issues, we propose a ClippedGossip alg…

    Submitted 20 April, 2023; v1 submitted 3 February, 2022; originally announced February 2022.