Skip to main content

Showing 1–50 of 133 results for author: Srebro, N

  1. arXiv:2407.05622  [pdf, other

    cs.LG cs.DS

    On the Complexity of Learning Sparse Functions with Statistical and Gradient Queries

    Authors: Nirmit Joshi, Theodor Misiakiewicz, Nathan Srebro

    Abstract: The goal of this paper is to investigate the complexity of gradient algorithms when learning sparse functions (juntas). We introduce a type of Statistical Queries ($\mathsf{SQ}$), which we call Differentiable Learning Queries ($\mathsf{DLQ}$), to model gradient queries on a specified loss with respect to an arbitrary model. We provide a tight characterization of the query complexity of… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 43 pages, 1 table, 1 figure

  2. arXiv:2406.04981  [pdf, other

    cs.LG stat.ML

    The Price of Implicit Bias in Adversarially Robust Generalization

    Authors: Nikolaos Tsilivis, Natalie Frank, Nathan Srebro, Julia Kempe

    Abstract: We study the implicit bias of optimization in robust empirical risk minimization (robust ERM) and its connection with robust generalization. In classification settings under adversarial perturbations with linear models, we study what type of regularization should ideally be applied for a given perturbation set to improve (robust) generalization. We then show that the implicit bias of optimization… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

  3. arXiv:2405.11667  [pdf, other

    cs.LG cs.DC math.OC stat.ML

    The Limits and Potentials of Local SGD for Distributed Heterogeneous Learning with Intermittent Communication

    Authors: Kumar Kshitij Patel, Margalit Glasgow, Ali Zindari, Lingxiao Wang, Sebastian U. Stich, Ziheng Cheng, Nirmit Joshi, Nathan Srebro

    Abstract: Local SGD is a popular optimization method in distributed learning, often outperforming other algorithms in practice, including mini-batch SGD. Despite this success, theoretically proving the dominance of local SGD in settings with reasonable data heterogeneity has been difficult, creating a significant gap between theory and practice. In this paper, we provide new lower bounds for local SGD under… ▽ More

    Submitted 19 May, 2024; originally announced May 2024.

  4. arXiv:2402.08808  [pdf, other

    cs.LG stat.ML

    Depth Separation in Norm-Bounded Infinite-Width Neural Networks

    Authors: Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro

    Abstract: We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well eve… ▽ More

    Submitted 13 February, 2024; originally announced February 2024.

  5. arXiv:2402.06323  [pdf, other

    cs.LG stat.ML

    How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

    Authors: Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry

    Abstract: Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly unifor… ▽ More

    Submitted 9 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

  6. arXiv:2312.13978  [pdf, other

    cs.LG cs.DS

    Metalearning with Very Few Samples Per Task

    Authors: Maryam Aliakbarpour, Konstantina Bairaktari, Gavin Brown, Adam Smith, Nathan Srebro, Jonathan Ullman

    Abstract: Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution… ▽ More

    Submitted 1 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

  7. arXiv:2311.15404  [pdf, other

    cs.LG cond-mat.dis-nn stat.ML

    Applying statistical learning theory to deep learning

    Authors: Cédric Gerbelot, Avetik Karagulyan, Stefani Karp, Kavya Ravichandran, Menachem Stern, Nathan Srebro

    Abstract: Although statistical learning theory provides a robust framework to understand supervised learning, many theoretical aspects of deep learning remain unclear, in particular how different architectures may lead to inductive bias when trained using gradient based methods. The goal of these lectures is to provide an overview of some of the main questions that arise when attempting to understand deep l… ▽ More

    Submitted 25 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

    Comments: 66 pages, 20 figures

  8. arXiv:2310.06113  [pdf, other

    cs.LG cs.AI math.ST stat.ML

    When is Agnostic Reinforcement Learning Statistically Tractable?

    Authors: Zeyu Jia, Gene Li, Alexander Rakhlin, Ayush Sekhari, Nathan Srebro

    Abstract: We study the problem of agnostic PAC reinforcement learning (RL): given a policy class $Π$, how many rounds of interaction with an unknown MDP (with a potentially large state and action space) are required to learn an $ε$-suboptimal policy with respect to $Π$? Towards that end, we introduce a new complexity measure, called the \emph{spanning capacity}, that depends solely on the set $Π$ and is ind… ▽ More

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted to NeurIPS 2023

  9. arXiv:2307.15396  [pdf, other

    cs.LG

    Noisy Interpolation Learning with Shallow Univariate ReLU Networks

    Authors: Nirmit Joshi, Gal Vardi, Nathan Srebro

    Abstract: Understanding how overparameterized neural networks generalize despite perfect interpolation of noisy training data is a fundamental question. Mallinar et. al. 2022 noted that neural networks seem to often exhibit ``tempered overfitting'', wherein the population risk does not converge to the Bayes optimal error, but neither does it approach infinity, yielding non-trivial generalization. However, t… ▽ More

    Submitted 21 March, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

    Comments: To appear at ICLR 2024. Updated version with minor changes in the presentation

  10. arXiv:2306.13188  [pdf, ps, other

    stat.ML cs.LG

    Uniform Convergence with Square-Root Lipschitz Loss

    Authors: Lijia Zhou, Zhen Dai, Frederic Koehler, Nathan Srebro

    Abstract: We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipsch… ▽ More

    Submitted 22 June, 2023; originally announced June 2023.

  11. arXiv:2306.13185  [pdf, ps, other

    stat.ML cs.LG

    An Agnostic View on the Cost of Overfitting in (Kernel) Ridge Regression

    Authors: Lijia Zhou, James B. Simon, Gal Vardi, Nathan Srebro

    Abstract: We study the cost of overfitting in noisy kernel ridge regression (KRR), which we define as the ratio between the test error of the interpolating ridgeless model and the test error of the optimally-tuned model. We take an "agnostic" view in the following sense: we consider the cost as a function of sample size for any target function, even if the sample size is not large enough for consistency or… ▽ More

    Submitted 22 March, 2024; v1 submitted 22 June, 2023; originally announced June 2023.

    Comments: This is the ICLR CR version

  12. arXiv:2306.03534  [pdf, other

    cs.LG math.NA

    Continual Learning in Linear Classification on Separable Data

    Authors: Itay Evron, Edward Moroshko, Gon Buzaglo, Maroun Khriesh, Badea Marjieh, Nathan Srebro, Daniel Soudry

    Abstract: We analyze continual learning on a sequence of separable linear classification tasks with binary labels. We show theoretically that learning with weak regularization reduces to solving a sequential max-margin problem, corresponding to a special case of the Projection Onto Convex Sets (POCS) framework. We then develop upper bounds on the forgetting and other quantities of interest under various set… ▽ More

    Submitted 6 June, 2023; originally announced June 2023.

  13. arXiv:2305.16508  [pdf, other

    cs.LG stat.ML

    Most Neural Networks Are Almost Learnable

    Authors: Amit Daniely, Nathan Srebro, Gal Vardi

    Abstract: We present a PTAS for learning random constant-depth networks. We show that for any fixed $ε>0$ and depth $i$, there is a poly-time algorithm that for any distribution on $\sqrt{d} \cdot \mathbb{S}^{d-1}$ learns random Xavier networks of depth $i$, up to an additive error of $ε$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(ε^{-1})}$, where $\bar d$ is the size of… ▽ More

    Submitted 24 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Small fixes after review

  14. arXiv:2303.01462  [pdf, ps, other

    cs.LG stat.ML

    Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

    Abstract: Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estim… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: 53 pages

  15. arXiv:2303.01456  [pdf, ps, other

    cs.LG stat.ML

    The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

    Abstract: In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial e… ▽ More

    Submitted 31 October, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 42 pages; NeurIPS 2023 camera ready

  16. arXiv:2302.07426  [pdf, ps, other

    cs.LG stat.ML

    Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy

    Authors: Amit Daniely, Nathan Srebro, Gal Vardi

    Abstract: Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-$2$ networks can be efficiently learned under the assumptions that the input distribution… ▽ More

    Submitted 4 October, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

    Comments: Changed the title, and made some other minor modifications. arXiv admin note: text overlap with arXiv:2101.08303

  17. arXiv:2302.07263  [pdf, other

    cs.LG cs.DS cs.IT stat.ML

    Interpolation Learning With Minimum Description Length

    Authors: Naren Sarayu Manoj, Nathan Srebro

    Abstract: We prove that the Minimum Description Length learning rule exhibits tempered overfitting. We obtain tempered agnostic finite sample learning guarantees and characterize the asymptotic behavior in the presence of random label noise.

    Submitted 14 February, 2023; originally announced February 2023.

  18. arXiv:2210.12082  [pdf, other

    stat.ML cs.LG math.ST

    A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models

    Authors: Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J. Sutherland, Nathan Srebro

    Abstract: We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the squa… ▽ More

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: As published at NeurIPS 2022

  19. arXiv:2210.07082  [pdf, other

    cs.LG stat.ML

    Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu

    Abstract: The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient… ▽ More

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: 54 pages

  20. arXiv:2209.07369  [pdf, ps, other

    cs.LG stat.ML

    Adversarially Robust Learning: A Generic Minimax Optimal Learner and Characterization

    Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

    Abstract: We present a minimax optimal learner for the problem of learning predictors robust to adversarial examples at test-time. Interestingly, we find that this requires new algorithmic ideas and approaches to adversarially robust learning. In particular, we show, in a strong negative sense, the suboptimality of the robust learner proposed by Montasser, Hanneke, and Srebro (2019) and a broader family of… ▽ More

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: To appear in NeurIPS 2022

  21. arXiv:2205.10671  [pdf, other

    cs.LG stat.ML

    Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets

    Authors: Gene Li, Cong Ma, Nathan Srebro

    Abstract: We present a family $\{\hatπ\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\hatπ_2$ corresponds to Bellman-consistent pessimism (BCP), while $\hatπ_\infty$ is a novel generalization of lower confidence bound (LCB) to the linear setting. We show that the novel $\hatπ_\infty$ le… ▽ More

    Submitted 4 October, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

    Comments: Accepted to NeurIPS 2022

  22. arXiv:2205.09588  [pdf, other

    cs.LG math.NA

    How catastrophic can catastrophic forgetting be in linear regression?

    Authors: Itay Evron, Edward Moroshko, Rachel Ward, Nati Srebro, Daniel Soudry

    Abstract: To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research… ▽ More

    Submitted 25 May, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Journal ref: 35th Annual Conference on Learning Theory (2022)

  23. arXiv:2202.13328  [pdf, ps, other

    cs.LG math.OC

    Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization

    Authors: Idan Amir, Roi Livni, Nathan Srebro

    Abstract: We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $ε$ (compared to the… ▽ More

    Submitted 30 October, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

  24. arXiv:2202.06233  [pdf, other

    cs.LG stat.ML

    The Sample Complexity of One-Hidden-Layer Neural Networks

    Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

    Abstract: We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to ge… ▽ More

    Submitted 22 September, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: Bug fixed in proof of Theorem 2 (resulting in different log factors); Other minor edits

  25. arXiv:2112.14195  [pdf, other

    cs.LG stat.ML

    Exponential Family Model-Based Reinforcement Learning via Score Matching

    Authors: Gene Li, Junbo Li, Anmol Kabra, Nathan Srebro, Zhaoran Wang, Zhuoran Yang

    Abstract: We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Un… ▽ More

    Submitted 8 January, 2023; v1 submitted 28 December, 2021; originally announced December 2021.

    Comments: NeurIPS 2022

  26. arXiv:2112.04470  [pdf, other

    stat.ML cs.LG math.ST

    Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

    Authors: Lijia Zhou, Frederic Koehler, Danica J. Sutherland, Nathan Srebro

    Abstract: We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis rec… ▽ More

    Submitted 8 December, 2021; originally announced December 2021.

  27. arXiv:2110.10602  [pdf, ps, other

    cs.LG stat.ML

    Transductive Robust Learning Guarantees

    Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

    Abstract: We study the problem of adversarially robust learning in the transductive setting. For classes $\mathcal{H}$ of bounded VC dimension, we propose a simple transductive learner that when presented with a set of labeled training examples and a set of unlabeled test examples (both sets possibly adversarially perturbed), it correctly labels the test examples with a robust error rate that is linear in t… ▽ More

    Submitted 20 October, 2021; originally announced October 2021.

  28. arXiv:2110.02954  [pdf, other

    math.OC cs.LG stat.ML

    A Stochastic Newton Algorithm for Distributed Convex Optimization

    Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

    Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations… ▽ More

    Submitted 7 October, 2021; originally announced October 2021.

  29. arXiv:2110.02732  [pdf, other

    cs.LG stat.ML

    On Margin Maximization in Linear and ReLU Networks

    Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

    Abstract: The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem.… ▽ More

    Submitted 2 October, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: This version includes some minor updates

  30. arXiv:2108.04190  [pdf, ps, other

    cs.LG stat.ML

    On the Power of Differentiable Learning versus PAC and SQ Learning

    Authors: Emmanuel Abbe, Pritish Kamath, Eran Malach, Colin Sandon, Nathan Srebro

    Abstract: We study the power of learning via mini-batch stochastic gradient descent (SGD) on the population loss, and batch Gradient Descent (GD) on the empirical loss, of a differentiable model or neural network, and ask what learning problems can be learnt using these paradigms. We show that SGD and GD can always simulate learning with statistical queries (SQ), but their ability to go beyond that depends… ▽ More

    Submitted 5 February, 2022; v1 submitted 9 August, 2021; originally announced August 2021.

  31. arXiv:2107.00595  [pdf, other

    cs.LG math.OC stat.ML

    Fast Margin Maximization via Dual Acceleration

    Authors: Ziwei Ji, Nathan Srebro, Matus Telgarsky

    Abstract: We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradi… ▽ More

    Submitted 21 August, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: ICML 2021

  32. arXiv:2106.09276  [pdf, other

    stat.ML cs.LG math.ST

    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

    Authors: Frederic Koehler, Lijia Zhou, Danica J. Sutherland, Nathan Srebro

    Abstract: We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators… ▽ More

    Submitted 3 January, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: v2: Minor changes only. As published at NeurIPS 2021: https://proceedings.neurips.cc/paper/2021/hash/ac9815bef801f58de83804bce86984ad-Abstract.html

  33. arXiv:2106.02720  [pdf, ps, other

    cs.LG math.OC

    An Even More Optimal Stochastic Optimization Algorithm: Minibatching and Interpolation Learning

    Authors: Blake Woodworth, Nathan Srebro

    Abstract: We present and analyze an algorithm for optimizing smooth and convex or strongly convex objectives using minibatch stochastic gradient estimates. The algorithm is optimal with respect to its dependence on both the minibatch size and minimum expected loss simultaneously. This improves over the optimal method of Lan (2012), which is insensitive to the minimum expected loss; over the optimistic accel… ▽ More

    Submitted 26 October, 2021; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: 24 pages

  34. arXiv:2104.06970  [pdf, other

    cs.LG stat.ML

    Understanding the Eluder Dimension

    Authors: Gene Li, Pritish Kamath, Dylan J. Foster, Nathan Srebro

    Abstract: We provide new insights on eluder dimension, a complexity measure that has been extensively used to bound the regret of algorithms for online bandits and reinforcement learning with function approximation. First, we study the relationship between the eluder dimension for a function class and a generalized notion of rank, defined for any monotone "activation" $σ: \mathbb{R}\to \mathbb{R}$, which co… ▽ More

    Submitted 4 October, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

    Comments: NeurIPS 2022

  35. arXiv:2103.01210  [pdf, ps, other

    cs.LG stat.ML

    Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

    Authors: Eran Malach, Pritish Kamath, Emmanuel Abbe, Nathan Srebro

    Abstract: We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small ev… ▽ More

    Submitted 1 March, 2021; originally announced March 2021.

  36. arXiv:2102.09769  [pdf, other

    cs.LG

    On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

    Authors: Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, Daniel Soudry

    Abstract: Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involve… ▽ More

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 33 pages, 2 figures

    MSC Class: 68T07 (Primary) ACM Class: I.2.6; G.1.6

  37. arXiv:2102.02145  [pdf, ps, other

    cs.LG stat.ML

    Adversarially Robust Learning with Unknown Perturbation Sets

    Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

    Abstract: We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of succ… ▽ More

    Submitted 3 February, 2021; originally announced February 2021.

  38. arXiv:2102.01583  [pdf, other

    cs.LG math.OC

    The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

    Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

    Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound wi… ▽ More

    Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: 48 pages

  39. arXiv:2101.01134  [pdf, other

    stat.ML cs.AI cs.LG

    Does Invariant Risk Minimization Capture Invariance?

    Authors: Pritish Kamath, Akilesh Tangella, Danica J. Sutherland, Nathan Srebro

    Abstract: We show that the Invariant Risk Minimization (IRM) formulation of Arjovsky et al. (2019) can fail to capture "natural" invariances, at least when used in its practical "linear" form, and even on very simple problems which directly follow the motivating examples for IRM. This can lead to worse generalization on new environments, even when compared to unconstrained ERM. The issue stems from a signif… ▽ More

    Submitted 26 February, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

    Comments: Code is available in the arXiv ancillary files, linked from this page

  40. arXiv:2010.12039  [pdf, ps, other

    cs.LG

    Reducing Adversarially Robust Learning to Non-Robust PAC Learning

    Authors: Omar Montasser, Steve Hanneke, Nathan Srebro

    Abstract: We study the problem of reducing adversarially robust learning to standard PAC learning, i.e. the complexity of learning adversarially robust predictors using access to only a black-box non-robust learner. We give a reduction that can robustly learn any hypothesis class $\mathcal{C}$ using any non-robust learner $\mathcal{A}$ for $\mathcal{C}$. The number of calls to $\mathcal{A}$ depends logarith… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: To appear in NeurIPS 2020

  41. arXiv:2007.06738  [pdf, other

    cs.LG stat.ML

    Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

    Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

    Abstract: We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accuratel… ▽ More

    Submitted 13 July, 2020; originally announced July 2020.

  42. arXiv:2007.05073  [pdf, other

    stat.ML cs.LG

    Predictive Value Generalization Bounds

    Authors: Keshav Vemuri, Nathan Srebro

    Abstract: In this paper, we study a bi-criterion framework for assessing scoring functions in the context of binary classification. The positive and negative predictive values (ppv and npv, respectively) are conditional probabilities of the true label matching a classifier's predicted label. The usual classification error rate is a linear combination of these probabilities, and therefore, concentration ineq… ▽ More

    Submitted 9 July, 2020; originally announced July 2020.

    Comments: 20 pages, 3 figures

  43. arXiv:2006.05942  [pdf, ps, other

    stat.ML cs.LG

    On Uniform Convergence and Low-Norm Interpolation Learning

    Authors: Lijia Zhou, Danica J. Sutherland, Nathan Srebro

    Abstract: We consider an underdetermined noisy linear regression model where the minimum-norm interpolating predictor is known to be consistent, and ask: can uniform convergence in a norm ball, or at least (following Nagarajan and Kolter) the subset of a norm ball that the algorithm selects on a typical input set, explain this success? We show that uniformly bounding the difference between empirical and pop… ▽ More

    Submitted 13 January, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

    Comments: v3: No content changes to this final version, as published at NeurIPS 2020: https://proceedings.neurips.cc/paper/2020/hash/4cc5400e63624c44fadeda99f57588a6-Abstract.html

  44. arXiv:2006.04735  [pdf, other

    cs.LG math.OC stat.ML

    Minibatch vs Local SGD for Heterogeneous Distributed Learning

    Authors: Blake Woodworth, Kumar Kshitij Patel, Nathan Srebro

    Abstract: We analyze Local SGD (aka parallel or federated SGD) and Minibatch SGD in the heterogeneous distributed setting, where each machine has access to stochastic gradient estimates for a different, machine-specific, convex objective; the goal is to optimize w.r.t. the average objective; and machines can only communicate intermittently. We argue that, (i) Minibatch SGD (even without acceleration) domina… ▽ More

    Submitted 1 March, 2022; v1 submitted 8 June, 2020; originally announced June 2020.

    Comments: 34 pages

  45. arXiv:2005.07652  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Efficiently Learning Adversarially Robust Halfspaces with Noise

    Authors: Omar Montasser, Surbhi Goel, Ilias Diakonikolas, Nathan Srebro

    Abstract: We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to… ▽ More

    Submitted 15 May, 2020; originally announced May 2020.

  46. arXiv:2004.01025  [pdf, ps, other

    cs.LG math.OC stat.ML

    Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent

    Authors: Suriya Gunasekar, Blake Woodworth, Nathan Srebro

    Abstract: We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows gen… ▽ More

    Submitted 1 July, 2021; v1 submitted 2 April, 2020; originally announced April 2020.

    Comments: 11 pages

  47. arXiv:2003.04180  [pdf, ps, other

    cs.LG stat.ML

    Approximate is Good Enough: Probabilistic Variants of Dimensional and Margin Complexity

    Authors: Pritish Kamath, Omar Montasser, Nathan Srebro

    Abstract: We present and study approximate notions of dimensional and margin complexity, which correspond to the minimal dimension or norm of an embedding required to approximate, rather then exactly represent, a given hypothesis class. We show that such notions are not only sufficient for learning using linear predictors or a kernel, but unlike the exact variants, are also necessary. Thus they are better s… ▽ More

    Submitted 9 March, 2020; originally announced March 2020.

  48. arXiv:2003.03397  [pdf, other

    cs.LG stat.ML

    Dropout: Explicit Forms and Capacity Control

    Authors: Raman Arora, Peter Bartlett, Poorya Mianjy, Nathan Srebro

    Abstract: We investigate the capacity control provided by dropout in various machine learning problems. First, we study dropout for matrix completion, where it induces a data-dependent regularizer that, in expectation, equals the weighted trace-norm of the product of the factors. In deep learning, we show that the data-dependent regularizer due to dropout directly controls the Rademacher complexity of the u… ▽ More

    Submitted 6 March, 2020; originally announced March 2020.

  49. arXiv:2002.11651  [pdf, other

    cs.LG cs.CY stat.ML

    Fair Learning with Private Demographic Data

    Authors: Hussein Mozannar, Mesrob I. Ohannessian, Nathan Srebro

    Abstract: Sensitive attributes such as race are rarely available to learners in real world settings as their collection is often restricted by laws and regulations. We give a scheme that allows individuals to release their sensitive information privately while still allowing any downstream entity to learn non-discriminatory predictors. We show how to adapt non-discriminatory learners to work with privatized… ▽ More

    Submitted 13 July, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

    Comments: ICML 2020

  50. arXiv:2002.09277  [pdf, other

    cs.LG stat.ML

    Kernel and Rich Regimes in Overparametrized Models

    Authors: Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

    Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich… ▽ More

    Submitted 27 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: This updates and significantly extends a previous article (arXiv:1906.05827), Sections 6 and 7 are the most major additions. 31 pages. arXiv admin note: text overlap with arXiv:1906.05827