Skip to main content

Showing 1–50 of 100 results for author: Shamir, O

  1. arXiv:2406.13888  [pdf, ps, other

    math.OC cs.LG

    Open Problem: Anytime Convergence Rate of Gradient Descent

    Authors: Guy Kornowski, Ohad Shamir

    Abstract: Recent results show that vanilla gradient descent can be accelerated for smooth convex objectives, merely by changing the stepsize sequence. We show that this can lead to surprisingly large errors indefinitely, and therefore ask: Is there any stepsize schedule for gradient descent that accelerates the classic $\mathcal{O}(1/T)$ convergence rate, at \emph{any} stopping time $T$?

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: COLT 2024 open problem; 5 pages

  2. arXiv:2402.08808  [pdf, other

    cs.LG stat.ML

    Depth Separation in Norm-Bounded Infinite-Width Neural Networks

    Authors: Suzanna Parkinson, Greg Ongie, Rebecca Willett, Ohad Shamir, Nathan Srebro

    Abstract: We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights (sum of squares of all weights in the network). Whereas previous depth separation results focused on separation in terms of width, such results do not give insight into whether depth determines if it is possible to learn a network that generalizes well eve… ▽ More

    Submitted 13 February, 2024; originally announced February 2024.

  3. arXiv:2312.15995  [pdf, ps, other

    cs.LG cs.AI stat.ML

    Generalization in Kernel Regression Under Realistic Assumptions

    Authors: Daniel Barzilai, Ohad Shamir

    Abstract: It is by now well-established that modern over-parameterized models seem to elude the bias-variance tradeoff and generalize well despite overfitting noise. Many recent works attempt to analyze this phenomenon in the relatively tractable setting of kernel regression. However, as we argue in detail, most past works on this topic either make unrealistic assumptions, or focus on a narrow problem setup… ▽ More

    Submitted 20 February, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

  4. arXiv:2307.04504  [pdf, ps, other

    math.OC cs.LG

    An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization

    Authors: Guy Kornowski, Ohad Shamir

    Abstract: We study the complexity of producing $(δ,ε)$-stationary points of Lipschitz objectives which are possibly neither smooth nor convex, using only noisy function evaluations. Recent works proposed several stochastic zero-order algorithms that solve this task, all of which suffer from a dimension-dependence of $Ω(d^{3/2})$ where $d$ is the dimension of the problem, which was conjectured to be optimal.… ▽ More

    Submitted 15 April, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to Journal of Machine Learning Research (JMLR); improved dependence on Lipschitz constant; some minor edits following reviews

  5. arXiv:2305.16475  [pdf, ps, other

    cs.LG stat.ML

    Initialization-Dependent Sample Complexity of Linear Predictors and Neural Networks

    Authors: Roey Magen, Ohad Shamir

    Abstract: We provide several new results on the sample complexity of vector-valued linear predictors (parameterized by a matrix), and more generally neural networks. Focusing on size-independent bounds, where only the Frobenius norm distance of the parameters from some fixed reference matrix $W_0$ is controlled, we show that the sample complexity behavior can be surprisingly different than what we may expec… ▽ More

    Submitted 25 October, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: 30 pages

  6. arXiv:2305.15141  [pdf, other

    cs.LG cs.NE stat.ML

    From Tempered to Benign Overfitting in ReLU Neural Networks

    Authors: Guy Kornowski, Gilad Yehudai, Ohad Shamir

    Abstract: Overparameterized neural networks (NNs) are observed to generalize well even when trained to perfectly fit noisy data. This phenomenon motivated a large body of work on "benign overfitting", where interpolating predictors achieve near-optimal performance. Recently, it was conjectured and empirically observed that the behavior of NNs is often better described as "tempered overfitting", where the pe… ▽ More

    Submitted 21 March, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023; fixed bug

  7. arXiv:2302.08300  [pdf, ps, other

    cs.LG math.OC

    Deterministic Nonsmooth Nonconvex Optimization

    Authors: Michael I. Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, Manolis Zampetakis

    Abstract: We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing $(δ,ε)$-stationary points. Several recent works have presented randomized algorithms that produce such points using $\tilde O(δ^{-1}ε^{-3})$ first-order oracle calls, independent of the dimension $d$. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We… ▽ More

    Submitted 16 February, 2023; originally announced February 2023.

    Comments: This work supersedes arxiv:2209.12463 and arxiv:2209.10346[Section 3], with major additional results

  8. arXiv:2209.10346  [pdf, other

    math.OC cs.LG

    On the Complexity of Finding Small Subgradients in Nonsmooth Optimization

    Authors: Guy Kornowski, Ohad Shamir

    Abstract: We study the oracle complexity of producing $(δ,ε)$-stationary points of Lipschitz functions, in the sense proposed by Zhang et al. [2020]. While there exist dimension-free randomized algorithms for producing such points within $\widetilde{O}(1/δε^3)$ first-order oracle calls, we show that no dimension-free rate can be achieved by a deterministic algorithm. On the other hand, we point out that thi… ▽ More

    Submitted 21 September, 2022; originally announced September 2022.

    Comments: 18 pages

  9. arXiv:2206.07758  [pdf, other

    cs.LG cs.CR cs.CV cs.NE stat.ML

    Reconstructing Training Data from Trained Neural Networks

    Authors: Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, Michal Irani

    Abstract: Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results abo… ▽ More

    Submitted 5 December, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: Fixed a typo in the acknowledgements

  10. arXiv:2202.06233  [pdf, other

    cs.LG stat.ML

    The Sample Complexity of One-Hidden-Layer Neural Networks

    Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

    Abstract: We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to ge… ▽ More

    Submitted 22 September, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: Bug fixed in proof of Theorem 2 (resulting in different log factors); Other minor edits

  11. arXiv:2202.04347  [pdf, other

    cs.LG

    Gradient Methods Provably Converge to Non-Robust Networks

    Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

    Abstract: Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples. In this work, we identify natural settings where depth-$2$ ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial $\ell_2$-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly,… ▽ More

    Submitted 4 October, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: Minor fixes made for the NeurIPS CR version

  12. arXiv:2202.03841  [pdf, ps, other

    cs.LG cs.NE stat.ML

    Width is Less Important than Depth in ReLU Neural Networks

    Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

    Abstract: We solve an open question from Lu et al. (2017), by showing that any target network with inputs in $\mathbb{R}^d$ can be approximated by a width $O(d)$ network (independent of the target network's architecture), whose number of parameters is essentially larger only by a linear factor. In light of previous depth separation theorems, which imply that a similar result cannot hold when the roles of wi… ▽ More

    Submitted 1 June, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Camera ready version in COLT 2022

  13. arXiv:2201.12760  [pdf, other

    cs.LG stat.ML

    Implicit Regularization Towards Rank Minimization in ReLU Networks

    Authors: Nadav Timor, Gal Vardi, Ohad Shamir

    Abstract: We study the conjectured relationship between the implicit regularization in neural networks, trained with gradient-based methods, and rank minimization of their weight matrices. Previously, it was proved that for linear networks (of depth 2 and vector-valued outputs), gradient flow (GF) w.r.t. the square loss acts as a rank minimization heuristic. However, understanding to what extent this genera… ▽ More

    Submitted 30 January, 2022; originally announced January 2022.

  14. arXiv:2201.11489  [pdf, other

    cs.LG stat.ML

    The Implicit Bias of Benign Overfitting

    Authors: Ohad Shamir

    Abstract: The phenomenon of benign overfitting, where a predictor perfectly fits noisy training data while attaining near-optimal expected loss, has received much attention in recent years, but still remains not fully understood beyond well-specified linear regression setups. In this paper, we provide several new results on when one can or cannot expect benign overfitting to occur, for both regression and c… ▽ More

    Submitted 17 April, 2023; v1 submitted 27 January, 2022; originally announced January 2022.

    Comments: Version published in the Journal of Machine Learning Research

    Journal ref: JMLR 24(113):1-40, 2023

  15. arXiv:2112.04229  [pdf, other

    cs.LG cs.AI

    Replay For Safety

    Authors: Liran Szlak, Ohad Shamir

    Abstract: Experience replay \citep{lin1993reinforcement, mnih2015human} is a widely used technique to achieve efficient use of data and improved performance in RL algorithms. In experience replay, past transitions are stored in a memory buffer and re-used during learning. Various suggestions for sampling schemes from the replay buffer have been suggested in previous works, attempting to optimally choose tho… ▽ More

    Submitted 8 December, 2021; originally announced December 2021.

  16. arXiv:2112.04213  [pdf, other

    cs.LG cs.AI

    Convergence Results For Q-Learning With Experience Replay

    Authors: Liran Szlak, Ohad Shamir

    Abstract: A commonly used heuristic in RL is experience replay (e.g.~\citet{lin1993reinforcement, mnih2015human}), in which a learner stores and re-uses past trajectories as if they were sampled online. In this work, we initiate a rigorous study of this heuristic in the setting of tabular Q-learning. We provide a convergence rate guarantee, and discuss how it compares to the convergence of Q-learning depend… ▽ More

    Submitted 8 December, 2021; originally announced December 2021.

  17. arXiv:2110.03187  [pdf, ps, other

    cs.LG cs.NE stat.ML

    On the Optimal Memorization Power of ReLU Neural Networks

    Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

    Abstract: We study the memorization power of feedforward ReLU neural networks. We show that such networks can memorize any $N$ points that satisfy a mild separability assumption using $\tilde{O}\left(\sqrt{N}\right)$ parameters. Known VC-dimension upper bounds imply that memorizing $N$ samples requires $Ω(\sqrt{N})$ parameters, and hence our construction is optimal up to logarithmic factors. We also give a… ▽ More

    Submitted 7 October, 2021; originally announced October 2021.

  18. arXiv:2110.02954  [pdf, other

    math.OC cs.LG stat.ML

    A Stochastic Newton Algorithm for Distributed Convex Optimization

    Authors: Brian Bullins, Kumar Kshitij Patel, Ohad Shamir, Nathan Srebro, Blake Woodworth

    Abstract: We propose and analyze a stochastic Newton algorithm for homogeneous distributed stochastic convex optimization, where each machine can calculate stochastic gradients of the same population objective, as well as stochastic Hessian-vector products (products of an independent unbiased estimator of the Hessian of the population objective with arbitrary vectors), with many such stochastic computations… ▽ More

    Submitted 7 October, 2021; originally announced October 2021.

  19. arXiv:2110.02732  [pdf, other

    cs.LG stat.ML

    On Margin Maximization in Linear and ReLU Networks

    Authors: Gal Vardi, Ohad Shamir, Nathan Srebro

    Abstract: The implicit bias of neural networks has been extensively studied in recent years. Lyu and Li [2019] showed that in homogeneous networks trained with the exponential or the logistic loss, gradient flow converges to a KKT point of the max margin problem in the parameter space. However, that leaves open the question of whether this point will generally be an actual optimum of the max margin problem.… ▽ More

    Submitted 2 October, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: This version includes some minor updates

  20. arXiv:2106.06880  [pdf, other

    cs.LG

    Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems

    Authors: Itay Safran, Ohad Shamir

    Abstract: Recently, there has been much interest in studying the convergence rates of without-replacement SGD, and proving that it is faster than with-replacement SGD in the worst case. However, known lower bounds ignore the problem's geometry, including its condition number, whereas the upper bounds explicitly depend on it. Perhaps surprisingly, we prove that when the condition number is taken into account… ▽ More

    Submitted 3 December, 2021; v1 submitted 12 June, 2021; originally announced June 2021.

  21. arXiv:2106.01101  [pdf, other

    cs.LG cs.NE stat.ML

    Learning a Single Neuron with Bias Using Gradient Descent

    Authors: Gal Vardi, Gilad Yehudai, Ohad Shamir

    Abstract: We theoretically study the fundamental problem of learning a single neuron with a bias term ($\mathbf{x} \mapsto σ(<\mathbf{w},\mathbf{x}> + b)$) in the realizable setting with the ReLU activation, using gradient descent. Perhaps surprisingly, we show that this is a significantly different and more challenging problem than the bias-less case (which was the focus of previous works on single neurons… ▽ More

    Submitted 5 February, 2022; v1 submitted 2 June, 2021; originally announced June 2021.

    Comments: An updated version, corresponding to the NeurIPS 2021 camera-ready version

  22. arXiv:2104.06763  [pdf, other

    math.OC cs.LG

    Oracle Complexity in Nonsmooth Nonconvex Optimization

    Authors: Guy Kornowski, Ohad Shamir

    Abstract: It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $ε$-stationary points (with gradient norm less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inappli… ▽ More

    Submitted 27 October, 2022; v1 submitted 14 April, 2021; originally announced April 2021.

    Comments: Accepted to Journal of Machine Learning Research (JMLR); some minor edits following reviews

  23. arXiv:2102.01583  [pdf, other

    cs.LG math.OC

    The Min-Max Complexity of Distributed Stochastic Convex Optimization with Intermittent Communication

    Authors: Blake Woodworth, Brian Bullins, Ohad Shamir, Nathan Srebro

    Abstract: We resolve the min-max complexity of distributed stochastic convex optimization (up to a log factor) in the intermittent communication setting, where $M$ machines work in parallel over the course of $R$ rounds of communication to optimize the objective, and during each round of communication, each machine may sequentially compute $K$ stochastic gradient estimates. We present a novel lower bound wi… ▽ More

    Submitted 5 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: 48 pages

  24. arXiv:2102.00434  [pdf, ps, other

    cs.LG cs.NE stat.ML

    The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

    Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

    Abstract: Several recent works have shown separation results between deep neural networks, and hypothesis classes with inferior approximation capacity such as shallow networks or kernel classes. On the other hand, the fact that deep networks can efficiently express a target function does not mean that this target function can be learned efficiently by deep neural networks. In this work we study the intricat… ▽ More

    Submitted 18 July, 2021; v1 submitted 31 January, 2021; originally announced February 2021.

    Comments: COLT 2021 camera ready version

  25. arXiv:2102.00314  [pdf, ps, other

    cs.LG cs.CC stat.ML

    Size and Depth Separation in Approximating Benign Functions with Neural Networks

    Authors: Gal Vardi, Daniel Reichman, Toniann Pitassi, Ohad Shamir

    Abstract: When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these cond… ▽ More

    Submitted 28 June, 2021; v1 submitted 30 January, 2021; originally announced February 2021.

    Comments: Edits after review + changing the terminology from "natural functions" to "benign functions"

  26. arXiv:2012.05156  [pdf, other

    cs.LG stat.ML

    Implicit Regularization in ReLU Networks with the Square Loss

    Authors: Gal Vardi, Ohad Shamir

    Abstract: Understanding the implicit regularization (or implicit bias) of gradient descent has recently been a very active research area. However, the implicit regularization in nonlinear neural networks is still poorly understood, especially for regression losses such as the square loss. Perhaps surprisingly, we prove that even for a single ReLU neuron, it is impossible to characterize the implicit regular… ▽ More

    Submitted 8 June, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

    Comments: Small changes due to reviews

  27. arXiv:2010.06642  [pdf, ps, other

    math.OC cs.LG

    High-Order Oracle Complexity of Smooth and Strongly Convex Optimization

    Authors: Guy Kornowski, Ohad Shamir

    Abstract: In this note, we consider the complexity of optimizing a highly smooth (Lipschitz $k$-th order derivative) and strongly convex function, via calls to a $k$-th order oracle which returns the value and first $k$ derivatives of the function at a given point, and where the dimension is unrestricted. Extending the techniques introduced in Arjevani et al. [2019], we prove that the worst-case oracle comp… ▽ More

    Submitted 28 April, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: 21 pages; added low dimensional regime; some minor edits

  28. arXiv:2007.00028  [pdf, other

    cs.LG stat.ML

    Gradient Methods Never Overfit On Separable Data

    Authors: Ohad Shamir

    Abstract: A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after s… ▽ More

    Submitted 10 September, 2020; v1 submitted 30 June, 2020; originally announced July 2020.

    Comments: Added a short appendix on polynomially-tailed losses; some minor edits

  29. arXiv:2006.01005  [pdf, other

    cs.LG stat.ML

    The Effects of Mild Over-parameterization on the Optimization Landscape of Shallow ReLU Neural Networks

    Authors: Itay Safran, Gilad Yehudai, Ohad Shamir

    Abstract: We study the effects of mild over-parameterization on the optimization landscape of a simple ReLU neural network of the form $\mathbf{x}\mapsto\sum_{i=1}^k\max\{0,\mathbf{w}_i^{\top}\mathbf{x}\}$, in a well-studied teacher-student setting where the target values are generated by the same architecture, and when directly optimizing over the population squared loss with respect to Gaussian inputs. We… ▽ More

    Submitted 30 July, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

  30. arXiv:2006.00625  [pdf, ps, other

    cs.LG cs.CC stat.ML

    Neural Networks with Small Weights and Depth-Separation Barriers

    Authors: Gal Vardi, Ohad Shamir

    Abstract: In studying the expressiveness of neural networks, an important question is whether there are functions which can only be approximated by sufficiently deep networks, assuming their size is bounded. However, for constant depths, existing results are limited to depths $2$ and $3$, and achieving results for higher depths has been an important open question. In this paper, we focus on feedforward ReLU… ▽ More

    Submitted 27 December, 2020; v1 submitted 31 May, 2020; originally announced June 2020.

  31. arXiv:2002.11962  [pdf, other

    math.OC cs.LG

    Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?

    Authors: Ohad Shamir

    Abstract: It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $ε$-stationary points (where the gradient norm is less than $ε$) in $\mathcal{O}(1/ε^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. Moreover,… ▽ More

    Submitted 15 April, 2021; v1 submitted 27 February, 2020; originally announced February 2020.

    Comments: This paper is superseded by arXiv:2104.06763, which contains major additional results

  32. arXiv:2002.07839  [pdf, other

    cs.LG math.OC stat.ML

    Is Local SGD Better than Minibatch SGD?

    Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, Nathan Srebro

    Abstract: We study local SGD (also known as parallel SGD and federated averaging), a natural and frequently used stochastic distributed optimization method. Its theoretical foundations are currently lacking and we highlight how all existing error guarantees in the convex setting are dominated by a simple baseline, minibatch SGD. (1) For quadratic objectives we prove that local SGD strictly dominates minibat… ▽ More

    Submitted 20 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

    Comments: 29 pages

  33. arXiv:2002.00585  [pdf, ps, other

    cs.LG stat.ML

    Proving the Lottery Ticket Hypothesis: Pruning is All You Need

    Authors: Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir

    Abstract: The lottery ticket hypothesis (Frankle and Carbin, 2018), states that a randomly-initialized network contains a small subnetwork such that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded wei… ▽ More

    Submitted 3 February, 2020; originally announced February 2020.

  34. arXiv:2001.05205  [pdf, other

    cs.LG cs.NE stat.ML

    Learning a Single Neuron with Gradient Methods

    Authors: Gilad Yehudai, Ohad Shamir

    Abstract: We consider the fundamental problem of learning a single neuron $x \mapstoσ(w^\top x)$ using standard gradient methods. As opposed to previous works, which considered specific (and not always realistic) input distributions and activation functions $σ(\cdot)$, we ask whether a more general result is attainable, under milder assumptions. On the one hand, we show that some assumptions on the distribu… ▽ More

    Submitted 27 February, 2022; v1 submitted 15 January, 2020; originally announced January 2020.

    Comments: Fixed a small bug in the proof of Theorem 4.2

  35. arXiv:1910.01845  [pdf, ps, other

    cs.LG math.OC stat.ML

    The Complexity of Finding Stationary Points with Stochastic Gradient Descent

    Authors: Yoel Drori, Ohad Shamir

    Abstract: We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the $\mathcal{O}(ε^{-4})$ upper bound of Ghadimi and Lan~\cite{ghadimi2013stochastic} (for making the average gradient norm less than $ε$) cannot be improved upon, unless a combination of additional assumptions is… ▽ More

    Submitted 29 July, 2021; v1 submitted 4 October, 2019; originally announced October 2019.

    Comments: Corrected the attribution of Ghadimi and Lan's result

  36. arXiv:1908.00045  [pdf, ps, other

    cs.LG math.OC stat.ML

    How Good is SGD with Random Shuffling?

    Authors: Itay Safran, Ohad Shamir

    Abstract: We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with replacement, we focus here on popular but poorly-understood heuristics, which involve going over random permutations of the individual functions. This setting… ▽ More

    Submitted 2 June, 2021; v1 submitted 31 July, 2019; originally announced August 2019.

  37. arXiv:1904.06984  [pdf, other

    cs.LG stat.ML

    Depth Separations in Neural Networks: What is Actually Being Separated?

    Authors: Itay Safran, Ronen Eldan, Ohad Shamir

    Abstract: Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz paramet… ▽ More

    Submitted 2 June, 2021; v1 submitted 15 April, 2019; originally announced April 2019.

  38. arXiv:1904.00687  [pdf, ps, other

    cs.LG cs.NE stat.ML

    On the Power and Limitations of Random Features for Understanding Neural Networks

    Authors: Gilad Yehudai, Ohad Shamir

    Abstract: Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as… ▽ More

    Submitted 27 February, 2022; v1 submitted 1 April, 2019; originally announced April 2019.

    Comments: Comparison to previous version: Fixed a bug in Theorem 3.4 about approximating polynomials as an expectation of random features. Also added another assumption on the activaion function in theorem 3.1

  39. arXiv:1902.04686  [pdf, ps, other

    cs.LG math.OC stat.ML

    The Complexity of Making the Gradient Small in Stochastic Convex Optimization

    Authors: Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth

    Abstract: We give nearly matching upper and lower bounds on the oracle complexity of finding $ε$-stationary points ($\| \nabla F(x) \| \leqε$) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimizatio… ▽ More

    Submitted 14 February, 2019; v1 submitted 12 February, 2019; originally announced February 2019.

  40. arXiv:1902.03498  [pdf, ps, other

    cs.LG stat.ML

    Space lower bounds for linear prediction in the streaming model

    Authors: Yuval Dagan, Gil Kur, Ohad Shamir

    Abstract: We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. This implies that such problems cannot be solved (at least in this setting) by scalable memory-efficient streaming algorithms. Our results build on a memory lower bound for a simple linear-algebraic… ▽ More

    Submitted 11 June, 2019; v1 submitted 9 February, 2019; originally announced February 2019.

    Comments: Added a minor correction in referencing the prior work

  41. arXiv:1810.12361  [pdf, other

    stat.ML cs.LG stat.CO

    Global Non-convex Optimization with Discretized Diffusions

    Authors: Murat A. Erdogdu, Lester Mackey, Ohad Shamir

    Abstract: An Euler discretization of the Langevin diffusion is known to converge to the global minimizers of certain convex and non-convex optimization problems. We show that this property holds for any suitably smooth diffusion and that different diffusions are suitable for optimizing different classes of convex and non-convex functions. This allows us to design diffusions suitable for globally optimizing… ▽ More

    Submitted 27 December, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

    Comments: 19 pages, NeurIPS 2018 camera ready version

  42. arXiv:1809.08587  [pdf, other

    cs.LG cs.NE math.OC stat.ML

    Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

    Authors: Ohad Shamir

    Abstract: We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. We prove that for standard random initializations, and under mild assumptions on $f$, the number of iterations required for convergence scales exponentially with the depth… ▽ More

    Submitted 13 June, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

    Comments: Comparison to previous version: Fixed a bug in lemma 1 part 3 (does not affect any other part of the paper)

  43. arXiv:1806.10188  [pdf, ps, other

    math.OC cs.LG stat.ML

    A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

    Authors: Yossi Arjevani, Ohad Shamir, Nathan Srebro

    Abstract: We provide tight finite-time convergence bounds for gradient descent and stochastic gradient descent on quadratic functions, when the gradients are delayed and reflect iterates from $τ$ rounds ago. First, we show that without stochastic noise, delays strongly affect the attainable optimization error: In fact, the error can be as bad as non-delayed gradient descent ran on only $1/τ$ of the gradient… ▽ More

    Submitted 26 June, 2018; originally announced June 2018.

  44. arXiv:1804.06739  [pdf, other

    cs.LG cs.NE stat.ML

    Are ResNets Provably Better than Linear Predictors?

    Authors: Ohad Shamir

    Abstract: A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remov… ▽ More

    Submitted 27 September, 2018; v1 submitted 18 April, 2018; originally announced April 2018.

    Comments: Comparison to previous arXiv version: Minor changes to incorporate comments of NIPS 2018 reviewers (main results are unaffected)

  45. arXiv:1803.01420  [pdf, ps, other

    cs.LG stat.ML

    Detecting Correlations with Little Memory and Communication

    Authors: Yuval Dagan, Ohad Shamir

    Abstract: We study the problem of identifying correlations in multivariate data, under information constraints: Either on the amount of memory that can be used by the algorithm, or the amount of communication when the data is distributed across several machines. We prove a tight trade-off between the memory/communication complexity and the sample complexity, implying (for example) that to detect pairwise co… ▽ More

    Submitted 6 June, 2018; v1 submitted 4 March, 2018; originally announced March 2018.

    Comments: Accepted for presentation at Conference on Learning Theory (COLT) 2018. Changes: Added a comparison to Raz [2016]; Corrected typos; Added references

  46. arXiv:1712.08968  [pdf, other

    cs.LG stat.ML

    Spurious Local Minima are Common in Two-Layer ReLU Neural Networks

    Authors: Itay Safran, Ohad Shamir

    Abstract: We consider the optimization problem associated with training simple ReLU neural networks of the form $\mathbf{x}\mapsto \sum_{i=1}^{k}\max\{0,\mathbf{w}_i^\top \mathbf{x}\}$ with respect to the squared loss. We provide a computer-assisted proof that even if the input distribution is standard Gaussian, even if the dimension is arbitrarily large, and even if the target values are generated by such… ▽ More

    Submitted 9 August, 2018; v1 submitted 24 December, 2017; originally announced December 2017.

  47. arXiv:1712.06541  [pdf, ps, other

    cs.LG cs.NE stat.ML

    Size-Independent Sample Complexity of Neural Networks

    Authors: Noah Golowich, Alexander Rakhlin, Ohad Shamir

    Abstract: We study the sample complexity of learning neural networks, by providing new bounds on their Rademacher complexity assuming norm constraints on the parameter matrix of each layer. Compared to previous work, these complexity bounds have improved dependence on the network depth, and under some additional assumptions, are fully independent of the network size (both depth and width). These results are… ▽ More

    Submitted 17 November, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

    Comments: Fixed a bug in the proof of theorem 7 (not affecting theorem statement), by slightly changing the construction

  48. arXiv:1706.00687  [pdf, ps, other

    cs.LG

    Weight Sharing is Crucial to Succesful Optimization

    Authors: Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

    Abstract: Exploiting the great expressive power of Deep Neural Network architectures, relies on the ability to train them. While current theoretical work provides, mostly, results showing the hardness of this task, empirical evidence usually differs from this line, with success stories in abundance. A strong position among empirically successful architectures is captured by networks where extensive weight s… ▽ More

    Submitted 2 June, 2017; originally announced June 2017.

  49. arXiv:1705.05091  [pdf, ps, other

    cs.LG stat.ML

    Bandit Regret Scaling with the Effective Loss Range

    Authors: Nicolò Cesa-Bianchi, Ohad Shamir

    Abstract: We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or advanc… ▽ More

    Submitted 2 January, 2020; v1 submitted 15 May, 2017; originally announced May 2017.

    Comments: The results in section 4 are incorrect as stated -- we have added an erratum at the beginning of the document. The results in the other sections are still valid. We thank Étienne de Montbrun for locating the error

  50. arXiv:1703.07950  [pdf, other

    cs.LG cs.NE stat.ML

    Failures of Gradient-Based Deep Learning

    Authors: Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah

    Abstract: In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithm… ▽ More

    Submitted 26 April, 2017; v1 submitted 23 March, 2017; originally announced March 2017.