Skip to main content

Showing 1–19 of 19 results for author: Jelassi, S

  1. arXiv:2407.03310  [pdf, other

    cs.LG

    Universal Length Generalization with Turing Programs

    Authors: Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

    Abstract: Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  2. arXiv:2407.00968  [pdf, other

    cs.LG

    How Does Overparameterization Affect Features?

    Authors: Ahmet Cagri Duzgun, Samy Jelassi, Yuanzhi Li

    Abstract: Overparameterization, the condition where models have more parameters than necessary to fit their training loss, is a crucial factor for the success of deep learning. However, the characteristics of the features learned by overparameterized networks are not well understood. In this work, we explore this question by comparing models with the same architecture but different widths. We first examine… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  3. arXiv:2402.14688  [pdf, other

    cs.LG

    Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

    Authors: Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener

    Abstract: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candid… ▽ More

    Submitted 2 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  4. arXiv:2402.01032  [pdf, other

    cs.LG cs.AI cs.CL

    Repeat After Me: Transformers are Better than State Space Models at Copying

    Authors: Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach

    Abstract: Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks th… ▽ More

    Submitted 3 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  5. arXiv:2306.15400  [pdf, other

    cs.LG

    Length Generalization in Arithmetic Transformers

    Authors: Samy Jelassi, Stéphane d'Ascoli, Carles Domingo-Enrich, Yuhuai Wu, Yuanzhi Li, François Charton

    Abstract: We examine how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training. We find that relative position embeddings enable length generalization for simple tasks, such as addition: models trained on $5$-digit numbers can perform $15$-digit sums. However, this method fails for multiplication, and we propose train set pri… ▽ More

    Submitted 27 June, 2023; originally announced June 2023.

  6. arXiv:2305.07810  [pdf, ps, other

    cs.LG stat.ML

    Depth Dependence of $μ$P Learning Rates in ReLU MLPs

    Authors: Samy Jelassi, Boris Hanin, Ziwei Ji, Sashank J. Reddi, Srinadh Bhojanapalli, Sanjiv Kumar

    Abstract: In this short note we consider random fully connected ReLU networks of width $n$ and depth $L$ equipped with a mean-field weight initialization. Our purpose is to study the dependence on $n$ and $L$ of the maximal update ($μ$P) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large $n,L$. A… ▽ More

    Submitted 12 May, 2023; originally announced May 2023.

  7. arXiv:2210.09221  [pdf, other

    cs.CV cs.LG

    Vision Transformers provably learn spatial structure

    Authors: Samy Jelassi, Michael E. Sander, Yuanzhi Li

    Abstract: Vision Transformers (ViTs) have achieved comparable or superior performance than Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any visual inductive bias of spatial locality. Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized p… ▽ More

    Submitted 13 October, 2022; originally announced October 2022.

  8. arXiv:2210.04319  [pdf, other

    cs.LG

    Dissecting adaptive methods in GANs

    Authors: Samy Jelassi, David Dobre, Arthur Mensch, Yuanzhi Li, Gauthier Gidel

    Abstract: Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the "marginal value of adaptive methods" in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in arXiv:2002.… ▽ More

    Submitted 9 October, 2022; originally announced October 2022.

  9. arXiv:2207.05931  [pdf, other

    cs.LG stat.ML

    Towards understanding how momentum improves generalization in deep learning

    Authors: Samy Jelassi, Yuanzhi Li

    Abstract: Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well-understood that using momentum can lead to faster convergence rate in various settings, it has also been observed that momentum yields higher generalization. Prior work argue that momentum stabilizes the SGD noise during training and this leads to higher generalization.… ▽ More

    Submitted 12 July, 2022; originally announced July 2022.

    Comments: Accepted at ICML 2022

  10. arXiv:2102.01621  [pdf, ps, other

    cs.LG cs.NE stat.ML

    Depth separation beyond radial functions

    Authors: Luca Venturi, Samy Jelassi, Tristan Ozuch, Joan Bruna

    Abstract: High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high-dimensions $d$. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this pa… ▽ More

    Submitted 22 September, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

  11. arXiv:2101.11075  [pdf, other

    cs.LG cs.AI math.OC

    Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

    Authors: Aaron Defazio, Samy Jelassi

    Abstract: We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both… ▽ More

    Submitted 26 August, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

  12. arXiv:2010.10502  [pdf, other

    cs.LG math.OC stat.ML

    Dual Averaging is Surprisingly Effective for Deep Learning Optimization

    Authors: Samy Jelassi, Aaron Defazio

    Abstract: First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of the optimizer has become an ad-hoc rule that can significantly affect the performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV) and Adam is used for training transformer models for Natural Language Process… ▽ More

    Submitted 20 October, 2020; originally announced October 2020.

  13. arXiv:2006.05684  [pdf, other

    cs.GT cs.LG

    Auction learning as a two-player game

    Authors: Jad Rahme, Samy Jelassi, S. Matthew Weinberg

    Abstract: Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. While theoretical approaches to the problem have hit some limits, a recent research direction initiated by Duetting et al. (2019) consists in building neural network architectures to find optimal auctions. We propose two conceptual deviations from their approach which result in enhance… ▽ More

    Submitted 25 October, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

    Journal ref: The 9th International Conference on Learning Representations, ICLR 2021

  14. arXiv:2003.01497  [pdf, other

    cs.GT cs.LG stat.ML

    A Permutation-Equivariant Neural Network Architecture For Auction Design

    Authors: Jad Rahme, Samy Jelassi, Joan Bruna, S. Matthew Weinberg

    Abstract: Designing an incentive compatible auction that maximizes expected revenue is a central problem in Auction Design. Theoretical approaches to the problem have hit some limits in the past decades and analytical solutions are known for only a few simple settings. Computational approaches to the problem through the use of LPs have their own set of limitations. Building on the success of deep learning,… ▽ More

    Submitted 25 October, 2021; v1 submitted 1 March, 2020; originally announced March 2020.

    Journal ref: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021

  15. arXiv:2002.06277  [pdf, other

    cs.LG math.OC math.PR stat.ML

    A mean-field analysis of two-player zero-sum games

    Authors: Carles Domingo-Enrich, Samy Jelassi, Arthur Mensch, Grant Rotskoff, Joan Bruna

    Abstract: Finding Nash equilibria in two-player zero-sum continuous games is a central problem in machine learning, e.g. for training both GANs and robust models. The existence of pure Nash equilibria requires strong conditions which are not typically met in practice. Mixed Nash equilibria exist in greater generality and may be found using mirror descent. Yet this approach does not scale to high dimensions.… ▽ More

    Submitted 6 May, 2021; v1 submitted 14 February, 2020; originally announced February 2020.

    Journal ref: Published at NeurIPS 2020

  16. arXiv:1908.02725  [pdf, other

    math.OC cs.LG

    Towards closing the gap between the theory and practice of SVRG

    Authors: Othmane Sebbouh, Nidham Gazagnadou, Samy Jelassi, Francis Bach, Robert M. Gower

    Abstract: Among the very first variance reduced stochastic methods for solving the empirical risk minimization problem was the SVRG method (Johnson & Zhang 2013). SVRG is an inner-outer loop based method, where in the outer loop a reference full gradient is evaluated, after which $m \in \mathbb{N}$ steps of an inner loop are executed where the reference gradient is used to build a variance reduced estimate… ▽ More

    Submitted 2 July, 2021; v1 submitted 31 July, 2019; originally announced August 2019.

    Comments: 39 pages, 23 figures

    MSC Class: 90C15; 90C25; 68W20

  17. arXiv:1905.12363  [pdf, other

    stat.ML cs.LG math.OC

    Extragradient with player sampling for faster Nash equilibrium finding

    Authors: Carles Domingo Enrich, Samy Jelassi, Carles Domingo-Enrich, Damien Scieur, Arthur Mensch, Joan Bruna

    Abstract: Data-driven modeling increasingly requires to find a Nash equilibrium in multi-player games, e.g. when training GANs. In this paper, we analyse a new extra-gradient method for Nash equilibrium finding, that performs gradient extrapolations and updates on a random subset of players at each iteration. This approach provably exhibits a better rate of convergence than full extra-gradient for non-smoot… ▽ More

    Submitted 21 July, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

  18. arXiv:1902.01843  [pdf, other

    stat.ML cs.LG

    Global convergence of neuron birth-death dynamics

    Authors: Grant Rotskoff, Samy Jelassi, Joan Bruna, Eric Vanden-Eijnden

    Abstract: Neural networks with a large number of parameters admit a mean-field description, which has recently served as a theoretical explanation for the favorable training properties of "overparameterized" models. In this regime, gradient descent obeys a deterministic partial differential equation (PDE) that converges to a globally optimal solution for networks with a single hidden layer under appropriate… ▽ More

    Submitted 27 March, 2019; v1 submitted 5 February, 2019; originally announced February 2019.

  19. arXiv:1806.03763  [pdf, other

    stat.ML cs.LG math.OC

    Smoothed analysis of the low-rank approach for smooth semidefinite programs

    Authors: Thomas Pumir, Samy Jelassi, Nicolas Boumal

    Abstract: We consider semidefinite programs (SDPs) of size n with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix Y of size $n$ by $k$ such that $X = YY^*$ is the SDP variable. The advantages of such formulation are twofold: the dimension of the optimization variable is reduced and positive semidefiniteness is… ▽ More

    Submitted 27 November, 2018; v1 submitted 10 June, 2018; originally announced June 2018.