subscribe to arXiv mailings

Open Problem: Anytime Convergence Rate of Gradient Descent

Abstract: Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$? Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$? △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: COLT 2024 open problem; 5 pages

arXiv:2402.08808 [pdf, other]

Depth Separation in Norm-Bounded Infinite-Width Neural Networks

Authors: Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro

Abstract: We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well eve… ▽ More We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well even when the network width is unbounded. Here, we study separation in terms of the sample complexity required for learnability. Specifically, we show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks (with any value for the norm). We also show that a similar statement in the reverse direction is not possible: any function learnable with polynomial sample complexity by a norm-controlled depth-2 ReLU network with infinite width is also learnable with polynomial sample complexity by a norm-controlled depth-3 ReLU network. △ Less

Submitted 13 February, 2024; originally announced February 2024.

arXiv:2312.15995 [pdf, ps, other]

Generalization in Kernel Regression Under Realistic Assumptions

Authors: Daniel Barzilai, Ohad Shamir

Abstract: It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup… ▽ More It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup. This work aims to provide a unified theory to upper bound the excess risk of kernel regression for nearly all common and realistic settings. Specifically, we provide rigorous bounds that hold for common kernels and for any amount of regularization, noise, any input dimension, and any number of samples. Furthermore, we provide relative perturbation bounds for the eigenvalues of kernel matrices, which may be of independent interest. These reveal a self-regularization phenomenon, whereby a heavy tail in the eigendecomposition of the kernel provides it with an implicit form of regularization, enabling good generalization. When applied to common kernels, our results imply benign overfitting in high input dimensions, nearly tempered overfitting in fixed dimensions, and explicit convergence rates for regularized regression. As a by-product, we obtain time-dependent bounds for neural networks trained in the kernel regime. △ Less

Submitted 20 February, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

arXiv:2307.04504 [pdf, ps, other]

An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization

Authors: Guy Kornowski, Ohad Shamir

Abstract: We study the complexity of producing $(δ,ε)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $Ω(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal.… ▽ More We study the complexity of producing $(δ,ε)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $Ω(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal. We refute this conjecture by providing a faster algorithm that has complexity $O(dδ^{-1}ε^{-3})$, which is optimal (up to numerical constants) with respect to $d$ and also optimal with respect to the accuracy parameters $δ,ε$, thus solving an open question due to Lin et al. (NeurIPS'22). Moreover, the convergence rate achieved by our algorithm is also optimal for smooth objectives, proving that in the nonconvex stochastic zero-order setting, nonsmooth optimization is as easy as smooth optimization. We provide algorithms that achieve the aforementioned convergence rate in expectation as well as with high probability. Our analysis is based on a simple yet powerful lemma regarding the Goldstein-subdifferential set, which allows utilizing recent advancements in first-order nonsmooth nonconvex optimization. △ Less

Submitted 15 April, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

Comments: Accepted to Journal of Machine Learning Research (JMLR); improved dependence on Lipschitz constant; some minor edits following reviews

arXiv:2305.16475 [pdf, ps, other]

Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks

Authors: Roey Magen, Ohad Shamir

Abstract: We provide several new results on the sample complexity of vector-valued linear predictors (parameterized by a matrix), and more generally neural networks. Focusing on size-independent bounds, where only the Frobenius norm distance of the parameters from some fixed reference matrix $W_0$ is controlled, we show that the sample complexity behavior can be surprisingly different than what we may expec… ▽ More We provide several new results on the sample complexity of vector-valued linear predictors (parameterized by a matrix), and more generally neural networks. Focusing on size-independent bounds, where only the Frobenius norm distance of the parameters from some fixed reference matrix $W_0$ is controlled, we show that the sample complexity behavior can be surprisingly different than what we may expect considering the well-studied setting of scalar-valued linear predictors. This also leads to new sample complexity bounds for feed-forward neural networks, tackling some open questions in the literature, and establishing a new convex linear prediction problem that is provably learnable without uniform convergence. △ Less

Submitted 25 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: 30 pages

arXiv:2305.15141 [pdf, other]

From Tempered to Benign Overfitting in ReLU Neural Networks

Authors: Guy Kornowski, Gilad Yehudai, Ohad Shamir

Abstract: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the pe… ▽ More Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the performance is non-optimal yet also non-trivial, and degrades as a function of the noise level. However, a theoretical justification of this claim for non-linear NNs has been lacking so far. In this work, we provide several results that aim at bridging these complementing views. We study a simple classification setting with 2-layer ReLU NNs, and prove that under various assumptions, the type of overfitting transitions from tempered in the extreme case of one-dimensional data, to benign in high dimensions. Thus, we show that the input dimension has a crucial role on the type of overfitting in this setting, which we also validate empirically for intermediate dimensions. Overall, our results shed light on the intricate connections between the dimension, sample size, architecture and training algorithm on the one hand, and the type of resulting overfitting on the other hand. △ Less

Submitted 21 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023; fixed bug

arXiv:2302.08300 [pdf, ps, other]

Deterministic Nonsmooth Nonconvex Optimization

Authors: Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis

Abstract: We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(δ,ε)$-stationary points. Several recent works have presented randomized algorithms that produce such points using $\tilde O(δ^{-1}ε^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We… ▽ More We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(δ,ε)$-stationary points. Several recent works have presented randomized algorithms that produce such points using $\tilde O(δ^{-1}ε^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We resolve this open problem, showing that randomization is necessary to obtain a dimension-free rate. In particular, we prove a lower bound of $Ω(d)$ for any deterministic algorithm. Moreover, we show that unlike smooth or convex optimization, access to function values is required for any deterministic algorithm to halt within any finite time. On the other hand, we prove that if the function is even slightly smooth, then the dimension-free rate of $\tilde O(δ^{-1}ε^{-3})$ can be obtained by a deterministic algorithm with merely a logarithmic dependence on the smoothness parameter. Motivated by these findings, we turn to study the complexity of deterministically smoothing Lipschitz functions. Though there are efficient black-box randomized smoothings, we start by showing that no such deterministic procedure can smooth functions in a meaningful manner, resolving an open question. We then bypass this impossibility result for the structured case of ReLU neural networks. To that end, in a practical white-box setting in which the optimizer is granted access to the network's architecture, we propose a simple, dimension-free, deterministic smoothing that provably preserves $(δ,ε)$-stationary points. Our method applies to a variety of architectures of arbitrary depth, including ResNets and ConvNets. Combined with our algorithm, this yields the first deterministic dimension-free algorithm for optimizing ReLU networks, circumventing our lower bound. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: This work supersedes arxiv:2209.12463 and arxiv:2209.10346[Section 3], with major additional results

arXiv:2209.10346 [pdf, other]

On the Complexity of Finding Small Subgradients in Nonsmooth Optimization

Authors: Guy Kornowski, Ohad Shamir

Abstract: We study the oracle complexity of producing $(δ,ε)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/δε^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that thi… ▽ More We study the oracle complexity of producing $(δ,ε)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/δε^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that this rate can be derandomized for smooth functions with merely a logarithmic dependence on the smoothness parameter. Moreover, we establish several lower bounds for this task which hold for any randomized algorithm, with or without convexity. Finally, we show how the convergence rate of finding $(δ,ε)$-stationary points can be improved in case the function is convex, a setting which we motivate by proving that in general no finite time algorithm can produce points with small subgradients even for convex functions. △ Less

Submitted 21 September, 2022; originally announced September 2022.

Comments: 18 pages

arXiv:2206.07758 [pdf, other]

Reconstructing Training Data from Trained Neural Networks

Authors: Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, Michal Irani

Abstract: Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results abo… ▽ More Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets. △ Less

Submitted 5 December, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: Fixed a typo in the acknowledgements

arXiv:2202.06233 [pdf, other]

The Sample Complexity of One-Hidden-Layer Neural Networks

Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

Abstract: We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to ge… ▽ More We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where (perhaps surprisingly) a mere spectral norm control turns out to be sufficient: First, when the network's activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches. △ Less

Submitted 22 September, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: Bug fixed in proof of Theorem 2 (resulting in different log factors); Other minor edits

arXiv:2202.04347 [pdf, other]

Gradient Methods Provably Converge to Non-Robust Networks

Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

Abstract: Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples. In this work, we identify natural settings where depth-$2$ ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial $\ell_2$-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly,… ▽ More Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples. In this work, we identify natural settings where depth-$2$ ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial $\ell_2$-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly, we show that the well-known implicit bias towards margin maximization induces bias towards non-robust networks, by proving that every network which satisfies the KKT conditions of the max-margin problem is non-robust. △ Less

Submitted 4 October, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: Minor fixes made for the NeurIPS CR version

arXiv:2202.03841 [pdf, ps, other]

Width is Less Important than Depth in ReLU Neural Networks

Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

Abstract: We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of wi… ▽ More We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of width and depth are interchanged, it follows that depth plays a more significant role than width in the expressive power of neural networks. We extend our results to constructing networks with bounded weights, and to constructing networks with width at most $d+2$, which is close to the minimal possible width due to previous lower bounds. Both of these constructions cause an extra polynomial factor in the number of parameters over the target network. We also show an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network. △ Less

Submitted 1 June, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

Comments: Camera ready version in COLT 2022

arXiv:2201.12760 [pdf, other]

Implicit Regularization Towards Rank Minimization in ReLU Networks

Authors: Nadav Timor, Gal Vardi, Ohad Shamir

Abstract: We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this genera… ▽ More We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this generalizes to nonlinear networks is an open problem. In this paper, we focus on nonlinear ReLU networks, providing several new positive and negative results. On the negative side, we prove (and demonstrate empirically) that, unlike the linear case, GF on ReLU networks may no longer tend to minimize ranks, in a rather strong sense (even approximately, for "most" datasets of size 2). On the positive side, we reveal that ReLU networks of sufficient depth are provably biased towards low-rank solutions in several reasonable settings. △ Less

Submitted 30 January, 2022; originally announced January 2022.

arXiv:2201.11489 [pdf, other]

The Implicit Bias of Benign Overfitting

Authors: Ohad Shamir

Abstract: The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining near-optimal expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and c… ▽ More The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining near-optimal expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and classification tasks. We consider a prototypical and rather generic data model for benign overfitting of linear predictors, where an arbitrary input distribution of some fixed dimension $k$ is concatenated with a high-dimensional distribution. For linear regression which is not necessarily well-specified, we show that the minimum-norm interpolating predictor (that standard training methods converge to) is biased towards an inconsistent solution in general, hence benign overfitting will generally not occur. Moreover, we show how this can be extended beyond standard linear regression, by an argument proving how the existence of benign overfitting on some regression problems precludes its existence on other regression problems. We then turn to classification problems, and show that the situation there is much more favorable. Specifically, we prove that the max-margin predictor (to which standard training methods are known to converge in direction) is asymptotically biased towards minimizing a weighted \emph{squared hinge loss}. This allows us to reduce the question of benign overfitting in classification to the simpler question of whether this loss is a good surrogate for the misclassification error, and use it to show benign overfitting in some new settings. △ Less

Submitted 17 April, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: Version published in the Journal of Machine Learning Research

Journal ref: JMLR 24(113):1-40, 2023

arXiv:2112.04229 [pdf, other]

Replay For Safety

Authors: Liran Szlak, Ohad Shamir

Abstract: Experience replay \citep{lin1993reinforcement, mnih2015human} is a widely used technique to achieve efficient use of data and improved performance in RL algorithms. In experience replay, past transitions are stored in a memory buffer and re-used during learning. Various suggestions for sampling schemes from the replay buffer have been suggested in previous works, attempting to optimally choose tho… ▽ More Experience replay \citep{lin1993reinforcement, mnih2015human} is a widely used technique to achieve efficient use of data and improved performance in RL algorithms. In experience replay, past transitions are stored in a memory buffer and re-used during learning. Various suggestions for sampling schemes from the replay buffer have been suggested in previous works, attempting to optimally choose those experiences which will most contribute to the convergence to an optimal policy. Here, we give some conditions on the replay sampling scheme that will ensure convergence, focusing on the well-known Q-learning algorithm in the tabular setting. After establishing sufficient conditions for convergence, we turn to suggest a slightly different usage for experience replay - replaying memories in a biased manner as a means to change the properties of the resulting policy. We initiate a rigorous study of experience replay as a tool to control and modify the properties of the resulting policy. In particular, we show that using an appropriate biased sampling scheme can allow us to achieve a \emph{safe} policy. We believe that using experience replay as a biasing mechanism that allows controlling the resulting policy in desirable ways is an idea with promising potential for many applications. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2112.04213 [pdf, other]

Convergence Results For Q-Learning With Experience Replay

Authors: Liran Szlak, Ohad Shamir

Abstract: A commonly used heuristic in RL is experience replay (e.g.~\citet{lin1993reinforcement, mnih2015human}), in which a learner stores and re-uses past trajectories as if they were sampled online. In this work, we initiate a rigorous study of this heuristic in the setting of tabular Q-learning. We provide a convergence rate guarantee, and discuss how it compares to the convergence of Q-learning depend… ▽ More A commonly used heuristic in RL is experience replay (e.g.~\citet{lin1993reinforcement, mnih2015human}), in which a learner stores and re-uses past trajectories as if they were sampled online. In this work, we initiate a rigorous study of this heuristic in the setting of tabular Q-learning. We provide a convergence rate guarantee, and discuss how it compares to the convergence of Q-learning depending on important parameters such as the frequency and number of replay iterations. We also provide theoretical evidence showing when we might expect this heuristic to strictly improve performance, by introducing and analyzing a simple class of MDPs. Finally, we provide some experiments to support our theoretical findings. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2110.03187 [pdf, ps, other]

On the Optimal Memorization Power of ReLU Neural Networks

Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

Abstract: We study the memorization power of feedforward ReLU neural networks. We show that such networks can memorize any $N$ points that satisfy a mild separability assumption using $\tilde{O}\left(\sqrt{N}\right)$ parameters. Known VC-dimension upper bounds imply that memorizing $N$ samples requires $Ω(\sqrt{N})$ parameters, and hence our construction is optimal up to logarithmic factors. We also give a… ▽ More We study the memorization power of feedforward ReLU neural networks. We show that such networks can memorize any $N$ points that satisfy a mild separability assumption using $\tilde{O}\left(\sqrt{N}\right)$ parameters. Known VC-dimension upper bounds imply that memorizing $N$ samples requires $Ω(\sqrt{N})$ parameters, and hence our construction is optimal up to logarithmic factors. We also give a generalized construction for networks with depth bounded by $1 \leq L \leq \sqrt{N}$, for memorizing $N$ samples using $\tilde{O}(N/L)$ parameters. This bound is also optimal up to logarithmic factors. Our construction uses weights with large bit complexity. We prove that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters. △ Less

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.02954 [pdf, other]

A Stochastic Newton Algorithm for Distributed Convex Optimization

Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations… ▽ More We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations performed between rounds of communication. We show that our method can reduce the number, and frequency, of required communication rounds compared to existing methods without hurting performance, by proving convergence guarantees for quasi-self-concordant objectives (e.g., logistic regression), alongside empirical evidence. △ Less

Submitted 7 October, 2021; originally announced October 2021.

arXiv:2110.02732 [pdf, other]

On Margin Maximization in Linear and ReLU Networks

Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

Abstract: The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem.… ▽ More The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem. In this paper, we study this question in detail, for several neural network architectures involving linear and ReLU activations. Perhaps surprisingly, we show that in many cases, the KKT point is not even a local optimum of the max margin problem. On the flip side, we identify multiple settings where a local or global optimum can be guaranteed. △ Less

Submitted 2 October, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: This version includes some minor updates

arXiv:2106.06880 [pdf, other]

Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

Authors: Itay Safran, Ohad Shamir

Abstract: Recently, there has been much interest in studying the convergence rates of without-replacement SGD, and proving that it is faster than with-replacement SGD in the worst case. However, known lower bounds ignore the problem's geometry, including its condition number, whereas the upper bounds explicitly depend on it. Perhaps surprisingly, we prove that when the condition number is taken into account… ▽ More Recently, there has been much interest in studying the convergence rates of without-replacement SGD, and proving that it is faster than with-replacement SGD in the worst case. However, known lower bounds ignore the problem's geometry, including its condition number, whereas the upper bounds explicitly depend on it. Perhaps surprisingly, we prove that when the condition number is taken into account, without-replacement SGD \emph{does not} significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number. Since many problems in machine learning and other areas are both ill-conditioned and involve large datasets, this indicates that without-replacement does not necessarily improve over with-replacement sampling for realistic iteration budgets. We show this by providing new lower and upper bounds which are tight (up to log factors), for quadratic problems with commuting quadratic terms, precisely quantifying the dependence on the problem parameters. △ Less

Submitted 3 December, 2021; v1 submitted 12 June, 2021; originally announced June 2021.

arXiv:2106.01101 [pdf, other]

Learning a Single Neuron with Bias Using Gradient Descent

Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

Abstract: We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto σ(<\mathbf{w},\mathbf{x}> + b)$) in the realizable setting with the ReLU activation, using gradient descent. Perhaps surprisingly, we show that this is a significantly different and more challenging problem than the bias-less case (which was the focus of previous works on single neurons… ▽ More We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto σ(<\mathbf{w},\mathbf{x}> + b)$) in the realizable setting with the ReLU activation, using gradient descent. Perhaps surprisingly, we show that this is a significantly different and more challenging problem than the bias-less case (which was the focus of previous works on single neurons), both in terms of the optimization geometry as well as the ability of gradient methods to succeed in some scenarios. We provide a detailed study of this problem, characterizing the critical points of the objective, demonstrating failure cases, and providing positive convergence guarantees under different sets of assumptions. To prove our results, we develop some tools which may be of independent interest, and improve previous results on learning single neurons. △ Less

Submitted 5 February, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

Comments: An updated version, corresponding to the NeurIPS 2021 camera-ready version

arXiv:2104.06763 [pdf, other]

Oracle Complexity in Nonsmooth Nonconvex Optimization

Authors: Guy Kornowski, Ohad Shamir

Abstract: It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $ε$-stationary points (with gradient norm less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inappli… ▽ More It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $ε$-stationary points (with gradient norm less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near $ε$-stationary points. This is perhaps the most natural relaxation of finding $ε$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and $ε$ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large. △ Less

Submitted 27 October, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

Comments: Accepted to Journal of Machine Learning Research (JMLR); some minor edits following reviews

arXiv:2102.01583 [pdf, other]

The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound wi… ▽ More We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound with a matching upper bound that establishes an optimal algorithm. △ Less

Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Comments: 48 pages

arXiv:2102.00434 [pdf, ps, other]

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

Abstract: Several recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricat… ▽ More Several recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricate connection between learnability and approximation capacity. We show that learnability with deep networks of a target function depends on the ability of simpler classes to approximate the target. Specifically, we show that a necessary condition for a function to be learnable by gradient descent on deep neural networks is to be able to approximate the function, at least in a weak sense, with shallow neural networks. We also show that a class of functions can be learned by an efficient statistical query algorithm if and only if it can be approximated in a weak sense by some kernel class. We give several examples of functions which demonstrate depth separation, and conclude that they cannot be efficiently learned, even by a hypothesis class that can efficiently approximate them. △ Less

Submitted 18 July, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

Comments: COLT 2021 camera ready version

arXiv:2102.00314 [pdf, ps, other]

Size and Depth Separation in Approximating Benign Functions with Neural Networks

Authors: Gal Vardi, Daniel Reichman, Toniann Pitassi, Ohad Shamir

Abstract: When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these cond… ▽ More When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these conditions "benign", and explore the benefits of size and depth for approximation of benign functions with ReLU networks. As we show, this problem is more challenging than the corresponding problem for non-benign functions. We give barriers to showing depth-lower-bounds: Proving existence of a benign function that cannot be approximated by polynomial-size networks of depth $4$ would settle longstanding open problems in computational complexity. It implies that beyond depth $4$ there is a barrier to showing depth-separation for benign functions, even between networks of constant depth and networks of nonconstant depth. We also study size-separation, namely, whether there are benign functions that can be approximated with networks of size $O(s(d))$, but not with networks of size $O(s'(d))$. We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function, that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$. For approximation in $L_\infty$ we achieve such separation already between size $O(d)$ and size $o(d)$. Moreover, we show superpolynomial size lower bounds and barriers to such lower bounds, depending on the assumptions on the function. Our size-separation results rely on an analysis of size lower bounds for Boolean functions, which is of independent interest: We show linear size lower bounds for computing explicit Boolean functions with neural networks and threshold circuits. △ Less

Submitted 28 June, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

Comments: Edits after review + changing the terminology from "natural functions" to "benign functions"

arXiv:2012.05156 [pdf, other]

Implicit Regularization in ReLU Networks with the Square Loss

Authors: Gal Vardi, Ohad Shamir

Abstract: Understanding the implicit regularization (or implicit bias) of gradient descent has recently been a very active research area. However, the implicit regularization in nonlinear neural networks is still poorly understood, especially for regression losses such as the square loss. Perhaps surprisingly, we prove that even for a single ReLU neuron, it is impossible to characterize the implicit regular… ▽ More Understanding the implicit regularization (or implicit bias) of gradient descent has recently been a very active research area. However, the implicit regularization in nonlinear neural networks is still poorly understood, especially for regression losses such as the square loss. Perhaps surprisingly, we prove that even for a single ReLU neuron, it is impossible to characterize the implicit regularization with the square loss by any explicit function of the model parameters (although on the positive side, we show it can be characterized approximately). For one hidden-layer networks, we prove a similar result, where in general it is impossible to characterize implicit regularization properties in this manner, except for the "balancedness" property identified in Du et al. [2018]. Our results suggest that a more general framework than the one considered so far may be needed to understand implicit regularization for nonlinear predictors, and provides some clues on what this framework should be. △ Less

Submitted 8 June, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

Comments: Small changes due to reviews

arXiv:2010.06642 [pdf, ps, other]

High-Order Oracle Complexity of Smooth and Strongly Convex Optimization

Authors: Guy Kornowski, Ohad Shamir

Abstract: In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted. Extending the techniques introduced in Arjevani et al. [2019], we prove that the worst-case oracle comp… ▽ More In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted. Extending the techniques introduced in Arjevani et al. [2019], we prove that the worst-case oracle complexity for any fixed $k$ to optimize the function up to accuracy $ε$ is on the order of $\left(\frac{μ_k D^{k-1}}λ\right)^{\frac{2}{3k+1}}+\log\log\left(\frac{1}ε\right)$ (in sufficiently high dimension, and up to log factors independent of $ε$), where $μ_k$ is the Lipschitz constant of the $k$-th derivative, $D$ is the initial distance to the optimum, and $λ$ is the strong convexity parameter. △ Less

Submitted 28 April, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: 21 pages; added low dimensional regime; some minor edits

arXiv:2007.00028 [pdf, other]

Gradient Methods Never Overfit On Separable Data

Authors: Ohad Shamir

Abstract: A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after s… ▽ More A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for $T$ iterations on a dataset of size $m$, both the empirical risk and the generalization error decrease at an essentially optimal rate of $\tilde{\mathcal{O}}(1/γ^2 T)$ up till $T\approx m$, at which point the generalization error remains fixed at an essentially optimal level of $\tilde{\mathcal{O}}(1/γ^2 m)$ regardless of how large $T$ is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness. △ Less

Submitted 10 September, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

Comments: Added a short appendix on polynomially-tailed losses; some minor edits

arXiv:2006.01005 [pdf, other]

The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks

Authors: Itay Safran, Gilad Yehudai, Ohad Shamir

Abstract: We study the effects of mild over-parameterization on the optimization landscape of a simple ReLU neural network of the form $\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a well-studied teacher-student setting where the target values are generated by the same architecture, and when directly optimizing over the population squared loss with respect to Gaussian inputs. We… ▽ More We study the effects of mild over-parameterization on the optimization landscape of a simple ReLU neural network of the form $\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a well-studied teacher-student setting where the target values are generated by the same architecture, and when directly optimizing over the population squared loss with respect to Gaussian inputs. We prove that while the objective is strongly convex around the global minima when the teacher and student networks possess the same number of neurons, it is not even \emph{locally convex} after any amount of over-parameterization. Moreover, related desirable properties (e.g., one-point strong convexity and the Polyak-Łojasiewicz condition) also do not hold even locally. On the other hand, we establish that the objective remains one-point strongly convex in \emph{most} directions (suitably defined), and show an optimization guarantee under this property. For the non-global minima, we prove that adding even just a single neuron will turn a non-global minimum into a saddle point. This holds under some technical conditions which we validate empirically. These results provide a possible explanation for why recovering a global minimum becomes significantly easier when we over-parameterize, even if the amount of over-parameterization is very moderate. △ Less

Submitted 30 July, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2006.00625 [pdf, ps, other]

Neural Networks with Small Weights and Depth-Separation Barriers

Authors: Gal Vardi, Ohad Shamir

Abstract: In studying the expressiveness of neural networks, an important question is whether there are functions which can only be approximated by sufficiently deep networks, assuming their size is bounded. However, for constant depths, existing results are limited to depths $2$ and $3$, and achieving results for higher depths has been an important open question. In this paper, we focus on feedforward ReLU… ▽ More In studying the expressiveness of neural networks, an important question is whether there are functions which can only be approximated by sufficiently deep networks, assuming their size is bounded. However, for constant depths, existing results are limited to depths $2$ and $3$, and achieving results for higher depths has been an important open question. In this paper, we focus on feedforward ReLU networks, and prove fundamental barriers to proving such results beyond depth $4$, by reduction to open problems and natural-proof barriers in circuit complexity. To show this, we study a seemingly unrelated problem of independent interest: Namely, whether there are polynomially-bounded functions which require super-polynomial weights in order to approximate with constant-depth neural networks. We provide a negative and constructive answer to that question, by showing that if a function can be approximated by a polynomially-sized, constant depth $k$ network with arbitrarily large weights, it can also be approximated by a polynomially-sized, depth $3k+3$ network, whose weights are polynomially bounded. △ Less

Submitted 27 December, 2020; v1 submitted 31 May, 2020; originally announced June 2020.

arXiv:2002.11962 [pdf, other]

Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?

Authors: Ohad Shamir

Abstract: It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $ε$-stationary points (where the gradient norm is less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. Moreover,… ▽ More It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $ε$-stationary points (where the gradient norm is less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. Moreover, as recently pointed out in Zhang et al. [2020], it is generally impossible to provide finite-time guarantees for finding an $ε$-stationary point of nonsmooth functions. Perhaps the most natural relaxation of this is to find points which are near such $ε$-stationary points. In this paper, we show that even this relaxed goal is hard to obtain in general, given only black-box access to the function values and gradients. We also discuss the pros and cons of alternative approaches. △ Less

Submitted 15 April, 2021; v1 submitted 27 February, 2020; originally announced February 2020.

Comments: This paper is superseded by arXiv:2104.06763, which contains major additional results

arXiv:2002.07839 [pdf, other]

Is Local SGD Better than Minibatch SGD?

Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibat… ▽ More We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibatch SGD and that accelerated local SGD is minimax optimal for quadratics; (2) For general convex objectives we provide the first guarantee that at least sometimes improves over minibatch SGD; (3) We show that indeed local SGD does not dominate minibatch SGD by presenting a lower bound on the performance of local SGD that is worse than the minibatch SGD guarantee. △ Less

Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages

arXiv:2002.00585 [pdf, ps, other]

Proving the Lottery Ticket Hypothesis: Pruning is All You Need

Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

Abstract: The lottery ticket hypothesis (Frankle and Carbin, 2018), states that a randomly-initialized network contains a small subnetwork such that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded wei… ▽ More The lottery ticket hypothesis (Frankle and Carbin, 2018), states that a randomly-initialized network contains a small subnetwork such that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training. △ Less

Submitted 3 February, 2020; originally announced February 2020.

arXiv:2001.05205 [pdf, other]

Learning a Single Neuron with Gradient Methods

Authors: Gilad Yehudai, Ohad Shamir

Abstract: We consider the fundamental problem of learning a single neuron $x \mapstoσ(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $σ(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribu… ▽ More We consider the fundamental problem of learning a single neuron $x \mapstoσ(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $σ(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribution and the activation function are necessary. On the other hand, we prove positive guarantees under mild assumptions, which go beyond those studied in the literature so far. We also point out and study the challenges in further strengthening and generalizing our results. △ Less

Submitted 27 February, 2022; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: Fixed a small bug in the proof of Theorem 4.2

arXiv:1910.01845 [pdf, ps, other]

The Complexity of Finding Stationary Points with Stochastic Gradient Descent

Authors: Yoel Drori, Ohad Shamir

Abstract: We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the $\mathcal{O}(ε^{-4})$ upper bound of Ghadimi and Lan~\cite{ghadimi2013stochastic} (for making the average gradient norm less than $ε$) cannot be improved upon, unless a combination of additional assumptions is… ▽ More We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the $\mathcal{O}(ε^{-4})$ upper bound of Ghadimi and Lan~\cite{ghadimi2013stochastic} (for making the average gradient norm less than $ε$) cannot be improved upon, unless a combination of additional assumptions is made. Notably, this holds even if we limit ourselves to convex quadratic functions. We also show that for nonconvex functions, the feasibility of minimizing gradients with SGD is surprisingly sensitive to the choice of optimality criteria. △ Less

Submitted 29 July, 2021; v1 submitted 4 October, 2019; originally announced October 2019.

Comments: Corrected the attribution of Ghadimi and Lan's result

arXiv:1908.00045 [pdf, ps, other]

How Good is SGD with Random Shuffling?

Authors: Itay Safran, Ohad Shamir

Abstract: We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting… ▽ More We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting has been investigated in several recent works, but the optimal error rates remain unclear. In this paper, we provide lower bounds on the expected optimization error with these heuristics (using SGD with any constant step size), which elucidate their advantages and disadvantages. In particular, we prove that after $k$ passes over $n$ individual functions, if the functions are re-shuffled after every pass, the best possible optimization error for SGD is at least $Ω\left(1/(nk)^2+1/nk^3\right)$, which partially corresponds to recently derived upper bounds. Moreover, if the functions are only shuffled once, then the lower bound increases to $Ω(1/nk^2)$. Since there are strictly smaller upper bounds for repeated reshuffling, this proves an inherent performance gap between SGD with single shuffling and repeated shuffling. As a more minor contribution, we also provide a non-asymptotic $Ω(1/k^2)$ lower bound (independent of $n$) for the incremental gradient method, when no random shuffling takes place. Finally, we provide an indication that our lower bounds are tight, by proving matching upper bounds for univariate quadratic functions. △ Less

Submitted 2 June, 2021; v1 submitted 31 July, 2019; originally announced August 2019.

arXiv:1904.06984 [pdf, other]

Depth Separations in Neural Networks: What is Actually Being Separated?

Authors: Itay Safran, Ronen Eldan, Ohad Shamir

Abstract: Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz paramet… ▽ More Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $ε$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $ε$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth $2$, size $\text{poly}(d)$ networks, for every constant $ε$. We complement it by showing that approximating such functions is also possible with depth $2$, size $\text{poly}(1/ε)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence in both $d,1/ε$ simultaneously. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature. △ Less

Submitted 2 June, 2021; v1 submitted 15 April, 2019; originally announced April 2019.

arXiv:1904.00687 [pdf, ps, other]

On the Power and Limitations of Random Features for Understanding Neural Networks

Authors: Gilad Yehudai, Ohad Shamir

Abstract: Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as… ▽ More Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random feature approaches are also inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks. △ Less

Submitted 27 February, 2022; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: Comparison to previous version: Fixed a bug in Theorem 3.4 about approximating polynomials as an expectation of random features. Also added another assumption on the activaion function in theorem 3.1

arXiv:1902.04686 [pdf, ps, other]

The Complexity of Making the Gradient Small in Stochastic Convex Optimization

Authors: Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth

Abstract: We give nearly matching upper and lower bounds on the oracle complexity of finding $ε$-stationary points ($\| \nabla F(x) \| \leqε$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimizatio… ▽ More We give nearly matching upper and lower bounds on the oracle complexity of finding $ε$-stationary points ($\| \nabla F(x) \| \leqε$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and that the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent "recursive regularization" technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff. △ Less

Submitted 14 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

arXiv:1902.03498 [pdf, ps, other]

Space lower bounds for linear prediction in the streaming model

Authors: Yuval Dagan, Gil Kur, Ohad Shamir

Abstract: We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. This implies that such problems cannot be solved (at least in this setting) by scalable memory-efficient streaming algorithms. Our results build on a memory lower bound for a simple linear-algebraic… ▽ More We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. This implies that such problems cannot be solved (at least in this setting) by scalable memory-efficient streaming algorithms. Our results build on a memory lower bound for a simple linear-algebraic problem -- finding orthogonal vectors -- and utilize the estimates on the packing of the Grassmannian, the manifold of all linear subspaces of fixed dimension. △ Less

Submitted 11 June, 2019; v1 submitted 9 February, 2019; originally announced February 2019.

Comments: Added a minor correction in referencing the prior work

arXiv:1810.12361 [pdf, other]

Global Non-convex Optimization with Discretized Diffusions

Authors: Murat A. Erdogdu, Lester Mackey, Ohad Shamir

Abstract: An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing… ▽ More An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing convex and non-convex functions not covered by the existing Langevin theory. Our non-asymptotic analysis delivers computable optimization and integration error bounds based on easily accessed properties of the objective and chosen diffusion. Central to our approach are new explicit Stein factor bounds on the solutions of Poisson equations. We complement these results with improved optimization guarantees for targets other than the standard Gibbs measure. △ Less

Submitted 27 December, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

Comments: 19 pages, NeurIPS 2018 camera ready version

arXiv:1809.08587 [pdf, other]

Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

Authors: Ohad Shamir

Abstract: We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth… ▽ More We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth $k$. We also show empirically that this phenomenon can occur in higher dimensions, where each $w_i$ is a matrix. This highlights a potential obstacle in understanding the convergence of gradient-based methods for deep linear neural networks, where $k$ is large. △ Less

Submitted 13 June, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

Comments: Comparison to previous version: Fixed a bug in lemma 1 part 3 (does not affect any other part of the paper)

arXiv:1806.10188 [pdf, ps, other]

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Authors: Yossi Arjevani, Ohad Shamir, Nathan Srebro

Abstract: We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $τ$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/τ$ of the gradient… ▽ More We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $τ$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/τ$ of the gradients. In sharp contrast, we quantify how stochastic noise makes the effect of delays negligible, improving on previous work which only showed this phenomenon asymptotically or for much smaller delays. Also, in the context of distributed optimization, the results indicate that the performance of gradient descent with delays is competitive with synchronous approaches such as mini-batching. Our results are based on a novel technique for analyzing convergence of optimization algorithms using generating functions. △ Less

Submitted 26 June, 2018; originally announced June 2018.

arXiv:1804.06739 [pdf, other]

Are ResNets Provably Better than Linear Predictors?

Authors: Ohad Shamir

Abstract: A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remov… ▽ More A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead. However, due to the non-convexity of the optimization problem, it is not at all clear that ResNets indeed achieve this behavior, rather than getting stuck at some arbitrarily poor local minimum. In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network). Notably, we show this under minimal or no assumptions on the precise network architecture, data distribution, or loss function used. We also provide a quantitative analysis of approximate stationary points for this problem. Finally, we show that with a certain tweak to the architecture, training the network with standard stochastic gradient descent achieves an objective value close or better than any linear predictor. △ Less

Submitted 27 September, 2018; v1 submitted 18 April, 2018; originally announced April 2018.

Comments: Comparison to previous arXiv version: Minor changes to incorporate comments of NIPS 2018 reviewers (main results are unaffected)

arXiv:1803.01420 [pdf, ps, other]

Detecting Correlations with Little Memory and Communication

Authors: Yuval Dagan, Ohad Shamir

Abstract: We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise co… ▽ More We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise correlations with optimal sample complexity, the number of required memory/communication bits is at least quadratic in the dimension. Our results substantially improve those of Shamir [2014], which studied a similar question in a much more restricted setting. To the best of our knowledge, these are the first provable sample/memory/communication trade-offs for a practical estimation problem, using standard distributions, and in the natural regime where the memory/communication budget is larger than the size of a single data point. To derive our theorems, we prove a new information-theoretic result, which may be relevant for studying other information-constrained learning problems. △ Less

Submitted 6 June, 2018; v1 submitted 4 March, 2018; originally announced March 2018.

Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2018. Changes: Added a comparison to Raz [2016]; Corrected typos; Added references

arXiv:1712.08968 [pdf, other]

Spurious Local Minima are Common in Two-Layer ReLU Neural Networks

Authors: Itay Safran, Ohad Shamir

Abstract: We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, even if the dimension is arbitrarily large, and even if the target values are generated by such… ▽ More We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, even if the dimension is arbitrarily large, and even if the target values are generated by such a network, with orthonormal parameter vectors, the problem can still have spurious local minima once $6\le k\le 20$. By a concentration of measure argument, this implies that in high input dimensions, \emph{nearly all} target networks of the relevant sizes lead to spurious local minima. Moreover, we conduct experiments which show that the probability of hitting such local minima is quite high, and increasing with the network size. On the positive side, mild over-parameterization appears to drastically reduce such local minima, indicating that an over-parameterization assumption is necessary to get a positive result in this setting. △ Less

Submitted 9 August, 2018; v1 submitted 24 December, 2017; originally announced December 2017.

arXiv:1712.06541 [pdf, ps, other]

Size-Independent Sample Complexity of Neural Networks

Authors: Noah Golowich, Alexander Rakhlin, Ohad Shamir

Abstract: We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are… ▽ More We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are derived using some novel techniques, which may be of independent interest. △ Less

Submitted 17 November, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: Fixed a bug in the proof of theorem 7 (not affecting theorem statement), by slightly changing the construction

arXiv:1706.00687 [pdf, ps, other]

Weight Sharing is Crucial to Succesful Optimization

Authors: Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

Abstract: Exploiting the great expressive power of Deep Neural Network architectures, relies on the ability to train them. While current theoretical work provides, mostly, results showing the hardness of this task, empirical evidence usually differs from this line, with success stories in abundance. A strong position among empirically successful architectures is captured by networks where extensive weight s… ▽ More Exploiting the great expressive power of Deep Neural Network architectures, relies on the ability to train them. While current theoretical work provides, mostly, results showing the hardness of this task, empirical evidence usually differs from this line, with success stories in abundance. A strong position among empirically successful architectures is captured by networks where extensive weight sharing is used, either by Convolutional or Recurrent layers. Additionally, characterizing specific aspects of different tasks, making them "harder" or "easier", is an interesting direction explored both theoretically and empirically. We consider a family of ConvNet architectures, and prove that weight sharing can be crucial, from an optimization point of view. We explore different notions of the frequency, of the target function, proving necessity of the target function having some low frequency components. This necessity is not sufficient - only with weight sharing can it be exploited, thus theoretically separating architectures using it, from others which do not. Our theoretical results are aligned with empirical experiments in an even more general setting, suggesting viability of examination of the role played by interleaving those aspects in broader families of tasks. △ Less

Submitted 2 June, 2017; originally announced June 2017.

arXiv:1705.05091 [pdf, ps, other]

Bandit Regret Scaling with the Effective Loss Range

Authors: Nicolò Cesa-Bianchi, Ohad Shamir

Abstract: We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advanc… ▽ More We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advance knowledge of the loss of a single, possibly unspecified arm. Along the way, we develop a novel technique which might be of independent interest, to convert any multi-armed bandit algorithm with regret depending on the loss range, to an algorithm with regret depending only on the effective range, while avoiding predictably bad arms altogether. △ Less

Submitted 2 January, 2020; v1 submitted 15 May, 2017; originally announced May 2017.

Comments: The results in section 4 are incorrect as stated -- we have added an erratum at the beginning of the document. The results in the other sections are still valid. We thank Étienne de Montbrun for locating the error

arXiv:1703.07950 [pdf, other]

Failures of Gradient-Based Deep Learning

Authors: Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

Abstract: In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithm… ▽ More In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied. △ Less

Submitted 26 April, 2017; v1 submitted 23 March, 2017; originally announced March 2017.

Showing 1–50 of 100 results for author: Shamir, O