Showing 1–50 of 56 results for author: Kakade, S

  1. arXiv:2406.17748  [pdf, other]

    cs.LG math.OC stat.ML

    A New Perspective on Shampoo's Preconditioner

    Authors: Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

    Abstract: Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connec…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.08466  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 12 June, 2024; originally announced June 2024.

  3. arXiv:2404.12376  [pdf, other]

    cs.LG math.OC stat.ML

    Matching the Statistical Query Lower Bound for k-sparse Parity Problems with Stochastic Gradient Descent

    Authors: Yiwen Kou, Zixiang Chen, Quanquan Gu, Sham M. Kakade

    Abstract: The $k$-parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-parity problem with stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that SGD can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hyper…

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: 36 pages, 7 figures, 3 tables

  4. arXiv:2305.10634  [pdf, other]

    math.OC cs.LG

    Modified Gauss-Newton Algorithms under Noise

    Authors: Krishna Pillutla, Vincent Roulet, Sham Kakade, Zaid Harchaoui

    Abstract: Gauss-Newton methods and their stochastic version have been widely used in machine learning and signal processing. Their nonsmooth counterparts, modified Gauss-Newton or prox-linear algorithms, can lead to contrasting outcomes when compared to gradient descent in large-scale statistical settings. We explore the contrasting performance of these two classes of algorithms in theory on a stylized stat…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: IEEE SSP 2023

  5. arXiv:2303.02255  [pdf, other]

    cs.LG math.OC stat.ML

    Finite-Sample Analysis of Learning High-Dimensional Single ReLU Neuron

    Authors: Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: This paper considers the problem of learning a single ReLU neuron with squared loss (a.k.a., ReLU regression) in the overparameterized regime, where the input dimension can exceed the number of samples. We analyze a Perceptron-type algorithm called GLM-tron (Kakade et al., 2011) and provide its dimension-free risk upper bounds for high-dimensional ReLU regression in both well-specified and misspec…

    Submitted 26 June, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: ICML 2023 camera ready

  6. arXiv:2210.04157  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    The Role of Coverage in Online Reinforcement Learning

    Authors: Tengyang Xie, Dylan J. Foster, Yu Bai, Nan Jiang, Sham M. Kakade

    Abstract: Coverage conditions -- which assert that the data logging distribution adequately covers the state space -- play a fundamental role in determining the sample complexity of offline reinforcement learning. While such conditions might seem irrelevant to online reinforcement learning at first glance, we establish a new connection by showing -- somewhat surprisingly -- that the mere existence of a data…

    Submitted 8 October, 2022; originally announced October 2022.

  7. arXiv:2210.03137  [pdf, other]

    cs.LG math.OC

    Deep Inventory Management

    Authors: Dhruv Madeka, Kari Torkkola, Carson Eisenach, Anna Luo, Dean P. Foster, Sham M. Kakade

    Abstract: This work provides a Deep Reinforcement Learning approach to solving a periodic review inventory control system with stochastic vendor lead times, lost sales, correlated demand, and price matching. While this dynamic program has historically been considered intractable, our results show that several policy learning approaches are competitive with or outperform classical methods. In order to train…

    Submitted 28 November, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

  8. arXiv:2208.01857  [pdf, other]

    cs.LG math.OC stat.ML

    The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift

    Authors: Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: 32 pages, 1 figure, 1 table

  9. arXiv:2207.08799  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

    Authors: Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang

    Abstract: There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-spars…

    Submitted 15 January, 2023; v1 submitted 18 July, 2022; originally announced July 2022.

    Comments: v3: final camera-ready revisions for NeurIPS 2022

  10. arXiv:2203.03159  [pdf, other]

    cs.LG math.OC stat.ML

    Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly-used multi-pass SGD. Besides, theoretical analyses for multi-pass SGD often concern a worst-case instance in a class of problems, which…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: 28 pages, 2 figures

  11. arXiv:2112.13487  [pdf, other]

    cs.LG math.OC math.ST stat.ML

    The Statistical Complexity of Interactive Decision Making

    Authors: Dylan J. Foster, Sham M. Kakade, Jian Qian, Alexander Rakhlin

    Abstract: A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher…

    Submitted 11 July, 2023; v1 submitted 26 December, 2021; originally announced December 2021.

    Comments: Minor improvements to writing and organization

  12. arXiv:2110.06198  [pdf, other]

    cs.LG math.OC stat.ML

    Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

    Authors: Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) has been shown to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decay, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regress…

    Submitted 11 July, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: 35 pages, 2 figures, 1 table. In ICML 2022

  13. arXiv:2110.06150  [pdf, other]

    math.OC cs.LG

    Sparsity in Partially Controllable Linear Systems

    Authors: Yonathan Efroni, Sham Kakade, Akshay Krishnamurthy, Cyril Zhang

    Abstract: A fundamental concept in control theory is that of controllability, where any system state can be reached through an appropriate choice of control inputs. Indeed, a large body of classical and modern approaches are designed for controllable linear dynamical systems. However, in practice, we often encounter systems in which a large set of state variables evolve exogenously and independently of the…

    Submitted 9 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: ICML2022

  14. arXiv:2108.04552  [pdf, other]

    cs.LG math.OC stat.ML

    The Benefits of Implicit Regularization from SGD in Least Squares Problems

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade

    Abstract: Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make s…

    Submitted 10 July, 2022; v1 submitted 10 August, 2021; originally announced August 2021.

    Comments: 33 pages, 1 figure. In NeurIPS 2021

  15. arXiv:2107.02377  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    A Short Note on the Relationship of Information Gain and Eluder Dimension

    Authors: Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples of where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic…

    Submitted 6 July, 2021; originally announced July 2021.

  16. arXiv:2103.12692  [pdf, other]

    cs.LG math.OC stat.ML

    Benign Overfitting of Constant-Stepsize SGD for Linear Regression

    Authors: Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

    Abstract: There is an increasing realization that algorithmic inductive biases are central in preventing overfitting; empirically, we often see a benign overfitting phenomenon in overparameterized settings for natural learning algorithms, such as stochastic gradient descent (SGD), where little to no explicit regularization has been employed. This work considers this issue in arguably the most basic setting:…

    Submitted 12 October, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

    Comments: 56 pages, 2 figures. A short version is accepted at the 34th Annual Conference on Learning Theory (COLT 2021)

  17. arXiv:2103.10897  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Bilinear Classes: A Structural Framework for Provable Generalization in RL

    Authors: Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

    Abstract: This work introduces Bilinear Classes, a new structural framework, which permit generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the opti…

    Submitted 11 July, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Expanded extension section to include generalized linear Bellman complete and changed related work

  18. arXiv:2103.04947  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Instabilities of Offline RL with Pre-Trained Neural Representation

    Authors: Ruosong Wang, Yifan Wu, Ruslan Salakhutdinov, Sham M. Kakade

    Abstract: In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated. Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold, else…

    Submitted 8 March, 2021; originally announced March 2021.

  19. arXiv:2010.11895  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    What are the Statistical Limits of Offline RL with Linear Function Approximation?

    Authors: Ruosong Wang, Dean P. Foster, Sham M. Kakade

    Abstract: Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making p…

    Submitted 22 October, 2020; originally announced October 2020.

  20. arXiv:2007.07461  [pdf, ps, other]

    cs.LG cs.GT cs.MA math.OC stat.ML

    Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

    Authors: Kaiqing Zhang, Sham M. Kakade, Tamer Başar, Lin F. Yang

    Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intu…

    Submitted 8 August, 2023; v1 submitted 14 July, 2020; originally announced July 2020.

    Comments: Updated version accepted to Journal of Machine Learning Research (JMLR)

  21. arXiv:2006.12484  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

    Authors: Chi Jin, Sham M. Kakade, Akshay Krishnamurthy, Qinghua Liu

    Abstract: Partial observability is a common challenge in many reinforcement learning applications, which requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge leads to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hard…

    Submitted 24 October, 2020; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: To appear at NeurIPS 2020 as spotlight

  22. arXiv:2006.12466  [pdf, other]

    cs.LG cs.RO math.OC stat.ML

    Information Theoretic Regret Bounds for Online Nonlinear Control

    Authors: Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, Wen Sun

    Abstract: This work studies the problem of sequential control in an unknown, nonlinear dynamical system, where we model the underlying system dynamics as an unknown function in a known Reproducing Kernel Hilbert Space. This framework yields a general setting that permits discrete and continuous control inputs as well as non-smooth, non-differentiable dynamics. Our main result, the Lower Confidence-based Con…

    Submitted 22 June, 2020; originally announced June 2020.

  23. arXiv:2005.00527  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

    Authors: Ruosong Wang, Simon S. Du, Lin F. Yang, Sham M. Kakade

    Abstract: Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is $\varepsilon$ near to tha…

    Submitted 9 July, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

  24. arXiv:2003.01897  [pdf, other]

    cs.LG cs.NE math.ST stat.ML

    Optimal Regularization Can Mitigate Double Descent

    Authors: Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma

    Abstract: Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised questions of whether we need to re-think our current understanding of generalization. In this work,…

    Submitted 29 April, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

    Comments: v2: Accepted to ICLR 2021. Minor edits to Intro and Appendix

  25. arXiv:2002.09434  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Few-Shot Learning via Learning the Representation, Provably

    Authors: Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good common representation between source and target, and our goal is to understa…

    Submitted 30 March, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: ICLR2021

  26. arXiv:1911.12568  [pdf, other]

    cs.LG math.ST stat.ML

    Optimal Estimation of Change in a Population of Parameters

    Authors: Ramya Korlakai Vinayak, Weihao Kong, Sham M. Kakade

    Abstract: Paired estimation of change in parameters of interest over a population plays a central role in several application domains including those in the social sciences, epidemiology, medicine and biology. In these domains, the size of the population under study is often very large; however, the number of observations available per individual in the population is very small (sparse observations)…

    Submitted 28 November, 2019; originally announced November 2019.

  27. arXiv:1910.03016  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

    Authors: Simon S. Du, Sham M. Kakade, Ruosong Wang, Lin F. Yang

    Abstract: Modern deep learning methods provide effective means to learn good representations. However, is a good representation itself sufficient for sample efficient reinforcement learning? This question has largely been studied only with respect to (worst-case) approximation error, in the more classical approximate dynamic programming literature. With regard to the statistical viewpoint, this question is…

    Submitted 27 February, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: To appear in ICLR 2020

  28. arXiv:1909.04630  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Meta-Learning with Implicit Gradients

    Authors: Aravind Rajeswaran, Chelsea Finn, Sham Kakade, Sergey Levine

    Abstract: A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner loop, by using only a small amount of data from t…

    Submitted 10 September, 2019; originally announced September 2019.

    Comments: NeurIPS 2019. First two authors contributed equally

  29. arXiv:1906.03804  [pdf, ps, other]

    cs.LG math.PR stat.ML

    Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal

    Authors: Alekh Agarwal, Sham Kakade, Lin F. Yang

    Abstract: This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this work, we study the effectiveness of the most natural plug-in approach to model-based planning: we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an opt…

    Submitted 4 April, 2020; v1 submitted 10 June, 2019; originally announced June 2019.

  30. arXiv:1905.00313  [pdf, ps, other]

    math.OC

    Revisiting the Polyak step size

    Authors: Elad Hazan, Sham Kakade

    Abstract: This paper revisits the Polyak step size schedule for convex optimization problems, proving that a simple variant of it simultaneously attains near optimal convergence rates for the gradient descent algorithm, for all ranges of strong convexity, smoothness, and Lipschitz parameters, without a priori knowledge of these parameters.

    Submitted 2 August, 2022; v1 submitted 1 May, 2019; originally announced May 2019.

  31. arXiv:1904.12838  [pdf, other]

    cs.LG math.OC stat.ML

    The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

    Authors: Rong Ge, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli

    Abstract: Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this observation, this work p…

    Submitted 29 October, 2019; v1 submitted 29 April, 2019; originally announced April 2019.

    Comments: Appears in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 Algorithm, 7 figures

  32. arXiv:1902.08721  [pdf, ps, other]

    cs.LG eess.SY math.OC stat.ML

    Online Control with Adversarial Disturbances

    Authors: Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, Karan Singh

    Abstract: We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as that of a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this pro…

    Submitted 22 February, 2019; originally announced February 2019.

  33. arXiv:1902.04811  [pdf, ps, other]

    cs.LG math.OC stat.ML

    On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points

    Authors: Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan

    Abstract: Gradient descent (GD) and stochastic gradient descent (SGD) are the workhorses of large-scale machine learning. While classical theory focused on analyzing the performance of these methods in convex optimization problems, the most notable successes in machine learning have involved nonconvex optimization, and a gap has arisen between theory and practice. Indeed, traditional analyses of GD and SGD…

    Submitted 3 September, 2019; v1 submitted 13 February, 2019; originally announced February 2019.

    Comments: A preliminary version of this paper, with a subset of the results that are presented here, was presented at ICML 2017 (also as arXiv:1703.00887)

  34. arXiv:1902.04553  [pdf, ps, other]

    math.ST cs.LG stat.ML

    Maximum Likelihood Estimation for Learning Populations of Parameters

    Authors: Ramya Korlakai Vinayak, Weihao Kong, Gregory Valiant, Sham M. Kakade

    Abstract: Consider a setting with $N$ independent individuals, each with an unknown parameter, $p_i \in [0, 1]$ drawn from some unknown distribution $P^\star$. After observing the outcomes of $t$ independent Bernoulli trials, i.e., $X_i \sim \text{Binomial}(t, p_i)$ per individual, our objective is to accurately estimate $P^\star$. This problem arises in numerous domains, including the social sciences, psyc…

    Submitted 12 February, 2019; originally announced February 2019.

  35. arXiv:1902.03736  [pdf, ps, other]

    math.PR cs.LG stat.ML

    A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

    Authors: Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan

    Abstract: In this note, we derive concentration inequalities for random vectors with subGaussian norm (a generalization of both subGaussian random vectors and norm bounded random vectors), which are tight up to logarithmic factors.

    Submitted 11 February, 2019; originally announced February 2019.

  36. arXiv:1902.03228  [pdf, other]

    stat.ML cs.LG math.OC

    A Smoother Way to Train Structured Prediction Models

    Authors: Krishna Pillutla, Vincent Roulet, Sham M. Kakade, Zaid Harchaoui

    Abstract: We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optim…

    Submitted 8 February, 2019; originally announced February 2019.

    Comments: Short version appeared in Neural Information Processing Systems (NeurIPS) 2018

  37. arXiv:1809.08530  [pdf, ps, other]

    math.OC cs.LG stat.ML

    Provably Correct Automatic Subdifferentiation for Qualified Programs

    Authors: Sham Kakade, Jason D. Lee

    Abstract: The Cheap Gradient Principle (Griewank 2008) --- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself --- is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box grad…

    Submitted 14 January, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

  38. arXiv:1804.07795  [pdf, other]

    math.OC cs.LG

    Stochastic subgradient method converges on tame functions

    Authors: Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

    Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In part…

    Submitted 25 May, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

    Comments: 32 pages, 1 figure

    MSC Class: 65K05; 65K10; 90C15; 90C30

  39. arXiv:1803.05591  [pdf, other]

    cs.LG math.OC stat.ML

    On the insufficiency of existing momentum schemes for Stochastic Optimization

    Authors: Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, Sham M. Kakade

    Abstract: Momentum based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) method are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent on…

    Submitted 31 July, 2018; v1 submitted 15 March, 2018; originally announced March 2018.

    Comments: 28 pages, 10 figures. Updated acknowledgements. Appeared as an oral presentation at International Conference on Learning Representations (ICLR), 2018. Code implementing the ASGD method can be found at https://github.com/rahulkidambi/AccSGD

  40. arXiv:1711.08426  [pdf, ps, other]

    stat.ML cs.LG math.OC

    Leverage Score Sampling for Faster Accelerated Regression and ERM

    Authors: Naman Agarwal, Sham Kakade, Rahul Kidambi, Yin Tat Lee, Praneeth Netrapalli, Aaron Sidford

    Abstract: Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ and a vector $b \in\mathbb{R}^{n}$, we show how to compute an $\epsilon$-approximate solution to the regression problem $ \min_{x\in\mathbb{R}^{d}}\frac{1}{2} \|\mathbf{A} x - b\|_{2}^{2} $ in time $ \tilde{O} ((n+\sqrt{d\cdot\kappa_{\text{sum}}})\cdot s\cdot\log\epsilon^{-1}) $ where…

    Submitted 22 November, 2017; originally announced November 2017.

  41. arXiv:1710.09430  [pdf, ps, other]

    stat.ML cs.LG math.OC

    A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

    Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, Aaron Sidford

    Abstract: This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addre…

    Submitted 21 July, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

    Comments: Lemma 1 has been updated in v2

  42. arXiv:1704.08227  [pdf, other]

    stat.ML cs.LG math.OC math.ST

    Accelerating Stochastic Gradient Descent For Least Squares Regression

    Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

    Abstract: There is widespread sentiment that it is not possible to effectively utilize fast gradient methods (e.g. Nesterov's acceleration, conjugate gradient, heavy ball) for the purposes of stochastic optimization due to their instability and error accumulation, a notion made precise in d'Aspremont 2008 and Devolder, Glineur, and Nesterov 2014. This work considers these issues for the special case of stoc…

    Submitted 31 July, 2018; v1 submitted 26 April, 2017; originally announced April 2017.

    Comments: 54 pages, 3 figures, 1 table; updated acknowledgements, minor title change. Paper appeared in the proceedings of the Conference on Learning Theory (COLT), 2018

  43. arXiv:1703.00887  [pdf, ps, other

    cs.LG math.OC stat.ML

    How to Escape Saddle Points Efficiently

    Authors: Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan

    Abstract: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number of iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are no…

    Submitted 2 March, 2017; originally announced March 2017.

  44. arXiv:1605.08754  [pdf, other

    cs.DS cs.LG math.NA math.OC

    Faster Eigenvector Computation via Shift-and-Invert Preconditioning

    Authors: Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford

    Abstract: We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix $\Sigma$ -- i.e. computing a unit vector $x$ such that $x^T \Sigma x \ge (1-\epsilon)\lambda_1(\Sigma)$: Offline Eigenvector Estimation: Given an explicit $A \in \mathbb{R}^{n \times d}$ with $\Sigma = A^T A$, we show how to compute an $\epsilon$-approximate top eigenvector in time…

    Submitted 25 May, 2016; originally announced May 2016.

    Comments: Appearing in ICML 2016. Combination of work in arXiv:1509.05647 and arXiv:1510.08896

  45. arXiv:1605.08370  [pdf, ps, other

    cs.LG math.OC stat.ML

    Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent

    Authors: Chi Jin, Sham M. Kakade, Praneeth Netrapalli

    Abstract: Matrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a widely studied problem in both theory and practice with wide applications. Most of the provable algorithms so far on this problem have been restricted to the offline setting where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications,…

    Submitted 26 May, 2016; originally announced May 2016.

  46. arXiv:1604.03930  [pdf, ps, other

    cs.LG math.OC stat.ML

    Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

    Authors: Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford

    Abstract: This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide si…

    Submitted 27 May, 2016; v1 submitted 13 April, 2016; originally announced April 2016.

    Comments: International Conference on Machine Learning (ICML) 2016

  47. arXiv:1510.08896  [pdf, other

    cs.DS cs.LG math.NA math.OC

    Robust Shift-and-Invert Preconditioning: Faster and More Sample Efficient Algorithms for Eigenvector Computation

    Authors: Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford

    Abstract: We provide faster algorithms and improved sample complexities for approximating the top eigenvector of a matrix. Offline Setting: Given an $n \times d$ matrix $A$, we show how to compute an $\epsilon$-approximate top eigenvector in time $\tilde O ( [nnz(A) + \frac{d \cdot sr(A)}{gap^2}]\cdot \log 1/\epsilon)$ and $\tilde O([\frac{nnz(A)^{3/4} (d \cdot sr(A))^{1/4}}{\sqrt{gap}}]\cdot \log 1/\epsilon)$. Here $sr(A)$ is…

    Submitted 29 May, 2016; v1 submitted 29 October, 2015; originally announced October 2015.

    Comments: Manuscript outdated. Updated version at arXiv:1605.08754

  48. arXiv:1507.05854  [pdf, other

    math.NA cs.DS math.OC

    Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot

    Authors: Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli

    Abstract: While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, we take the first step in getting the best of both worlds -- establishing global…

    Submitted 9 March, 2017; v1 submitted 21 July, 2015; originally announced July 2015.

    Comments: Appears in AISTATS 2017

  49. arXiv:1308.2853  [pdf, ps, other

    cs.LG cs.IR math.NA math.ST stat.ML

    When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

    Authors: Animashree Anandkumar, Daniel Hsu, Majid Janzamin, Sham Kakade

    Abstract: Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary.…

    Submitted 13 August, 2013; originally announced August 2013.

  50. arXiv:1211.5414  [pdf, ps, other

    cs.DS cs.LG math.NA stat.ML

    Analysis of a randomized approximation scheme for matrix multiplication

    Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

    Abstract: This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors.

    Submitted 23 November, 2012; originally announced November 2012.