Showing 1–30 of 30 results for author: Gunasekar, S

  1. arXiv:2404.14219  [pdf, other]

    cs.CL cs.AI

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Authors: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, et al. (90 additional authors not shown)

    Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset…

    Submitted 23 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 19 pages

  2. arXiv:2310.15511  [pdf, other]

    cs.LG cs.AI cs.CL cs.IR

    KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

    Authors: Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

    Abstract: We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered to be tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many cu…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: 23 pages

    ACM Class: I.2.7

  3. arXiv:2309.15098  [pdf, other]

    cs.CL cs.AI cs.LG

    Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

    Authors: Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi

    Abstract: We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the LLM interacts internally with factual constraints. We find a strong positive relationship between the LLM's attention to constraint tokens and the fac…

    Submitted 17 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Published at ICLR 2024

  4. arXiv:2309.05463  [pdf, other]

    cs.CL cs.AI

    Textbooks Are All You Need II: phi-1.5 technical report

    Authors: Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

    Abstract: We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs)…

    Submitted 11 September, 2023; originally announced September 2023.

  5. arXiv:2306.11644  [pdf, other]

    cs.CL cs.AI cs.LG

    Textbooks Are All You Need

    Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

    Abstract: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accu…

    Submitted 2 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: 26 pages; changed color scheme of plot, fixed minor typos, and added a couple of clarifications

  6. arXiv:2302.08982  [pdf, other]

    cs.LG math.OC stat.ML

    (S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability

    Authors: Mathieu Even, Scott Pesme, Suriya Gunasekar, Nicolas Flammarion

    Abstract: In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp ch…

    Submitted 25 October, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

  7. arXiv:2211.09359  [pdf, other]

    cs.CV cs.LG

    How to Fine-Tune Vision Models with SGD

    Authors: Ananya Kumar, Ruoqi Shen, Sebastien Bubeck, Suriya Gunasekar

    Abstract: SGD and AdamW are the two most used optimizers for fine-tuning large neural networks in computer vision. When the two methods perform the same, SGD is preferable because it uses less memory (12 bytes/parameter with momentum and 8 bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite of downstream tasks, especially those with distribution shifts, we find that fine-tuning wit…

    Submitted 10 October, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

  8. arXiv:2207.11368  [pdf, other]

    cs.CV

    Neural-Sim: Learning to Generate Training Data with NeRF

    Authors: Yunhao Ge, Harkirat Behl, Jiashu Xu, Suriya Gunasekar, Neel Joshi, Yale Song, Xin Wang, Laurent Itti, Vibhav Vineet

    Abstract: Training computer vision models usually requires collecting and labeling vast amounts of imagery under a diverse set of scene configurations and properties. This process is incredibly time-consuming, and it is challenging to ensure that the captured data distribution maps well to the target domain of an application scenario. Recently, synthetic data has emerged as a way to address both of these is…

    Submitted 22 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  9. arXiv:2207.02349  [pdf, other]

    cs.CV cs.LG

    Generalization to translation shifts: a study in architectures and augmentations

    Authors: Suriya Gunasekar

    Abstract: We study how effective data augmentation is at capturing the inductive bias of carefully designed network architectures for spatial translation invariance. We evaluate various image classification architectures (antialiased, convolutional, vision transformer, and fully connected MLP networks) and data augmentation techniques towards generalization to large translation shifts. We observe that: (a)…

    Submitted 12 November, 2022; v1 submitted 5 July, 2022; originally announced July 2022.

  10. arXiv:2206.04301  [pdf, other]

    cs.LG cs.AI cs.CL

    Unveiling Transformers with LEGO: a synthetic reasoning task

    Authors: Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner

    Abstract: We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well…

    Submitted 17 February, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

  11. arXiv:2203.01572  [pdf, other]

    cs.LG stat.ML

    Data Augmentation as Feature Manipulation

    Authors: Ruoqi Shen, Sébastien Bubeck, Suriya Gunasekar

    Abstract: Data augmentation is a cornerstone of the machine learning pipeline, yet its theoretical underpinnings remain unclear. Is it merely a way to artificially augment the data set size? Or is it about encouraging the model to satisfy certain invariance? In this work we consider another angle, and we study the effect of data augmentation on the dynamic of the learning process. We find that data augmenta…

    Submitted 20 September, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: 38 pages, 4 figures. ICML22 camera-ready version

  12. arXiv:2102.12238  [pdf, other]

    cs.LG stat.ML

    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm

    Authors: Meena Jagadeesan, Ilya Razenshteyn, Suriya Gunasekar

    Abstract: We provide a function space characterization of the inductive bias resulting from minimizing the $\ell_2$ norm of the weights in multi-channel convolutional neural networks with linear activations and empirically test our resulting hypothesis on ReLU networks trained using gradient descent. We define an induced regularizer in the function space as the minimum $\ell_2$ norm of weights of a network…

    Submitted 11 July, 2022; v1 submitted 24 February, 2021; originally announced February 2021.

    Comments: Appeared at COLT 2022

  13. arXiv:2012.07976  [pdf, other]

    cs.LG stat.ML

    NeurIPS 2020 Competition: Predicting Generalization in Deep Learning

    Authors: Yiding Jiang, Pierre Foret, Scott Yak, Daniel M. Roy, Hossein Mobahi, Gintare Karolina Dziugaite, Samy Bengio, Suriya Gunasekar, Isabelle Guyon, Behnam Neyshabur

    Abstract: Understanding generalization in deep learning is arguably one of the most important questions in deep learning. Deep learning has been successfully adopted to a large number of problems ranging from pattern recognition to complex decision making, but many recent researchers have raised many concerns about deep learning, among which the most important is generalization. Despite numerous attempts, c…

    Submitted 14 December, 2020; originally announced December 2020.

    Comments: 20 pages, 2 figures. Accepted for NeurIPS 2020 Competitions Track. Lead organizer: Yiding Jiang

  14. arXiv:2007.06738  [pdf, other]

    cs.LG stat.ML

    Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

    Authors: Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D. Lee, Nathan Srebro, Daniel Soudry

    Abstract: We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accuratel…

    Submitted 13 July, 2020; originally announced July 2020.

  15. arXiv:2004.01025  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent

    Authors: Suriya Gunasekar, Blake Woodworth, Nathan Srebro

    Abstract: We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows gen…

    Submitted 1 July, 2021; v1 submitted 2 April, 2020; originally announced April 2020.

    Comments: 11 pages

  16. arXiv:2002.09277  [pdf, other]

    cs.LG stat.ML

    Kernel and Rich Regimes in Overparametrized Models

    Authors: Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, Nathan Srebro

    Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich…

    Submitted 27 July, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: This updates and significantly extends a previous article (arXiv:1906.05827); Sections 6 and 7 are the most significant additions. 31 pages. arXiv admin note: text overlap with arXiv:1906.05827

  17. arXiv:1911.07956  [pdf, other]

    cs.LG cs.CV math.OC stat.ML

    Implicit Regularization and Convergence for Weight Normalization

    Authors: Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu

    Abstract: Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-s…

    Submitted 30 August, 2022; v1 submitted 18 November, 2019; originally announced November 2019.

    Comments: NeurIPS 2020

  18. arXiv:1906.05827   

    cs.LG stat.ML

    Kernel and Rich Regimes in Overparametrized Models

    Authors: Blake Woodworth, Suriya Gunasekar, Pedro Savarese, Edward Moroshko, Itay Golan, Jason Lee, Daniel Soudry, Nathan Srebro

    Abstract: A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich…

    Submitted 25 February, 2020; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: This paper has been substantially modified, updated, and expanded with additional content (arXiv:2002.09277). To avoid confusion with already existing citations, we are withdrawing the old version of this article

  19. arXiv:1905.07325  [pdf, ps, other]

    stat.ML cs.LG

    Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

    Authors: Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry

    Abstract: With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization lead to margin maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models. To this end we study the limit of loss minimization with a diverging norm…

    Submitted 17 May, 2019; originally announced May 2019.

    Comments: ICML Camera ready version

  20. arXiv:1810.11829  [pdf, ps, other]

    cs.LG cs.DS stat.ML

    On preserving non-discrimination when combining expert advice

    Authors: Avrim Blum, Suriya Gunasekar, Thodoris Lykouris, Nathan Srebro

    Abstract: We study the interplay between sequential decision making and avoiding discrimination against protected groups, when examples arrive online and do not follow distributional assumptions. We consider the most basic extension of classical online learning: "Given a class of predictors that are individually non-discriminatory with respect to a particular metric, how can we combine them to perform as we…

    Submitted 29 March, 2019; v1 submitted 28 October, 2018; originally announced October 2018.

    Comments: Appeared in NIPS 2018

  21. arXiv:1806.00468  [pdf, other]

    cs.LG stat.ML

    Implicit Bias of Gradient Descent on Linear Convolutional Networks

    Authors: Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro

    Abstract: We show that gradient descent on full-width linear convolutional networks of depth $L$ converges to a linear predictor related to the $\ell_{2/L}$ bridge penalty in the frequency domain. This is in contrast to linear fully connected networks, where gradient descent converges to the hard margin linear support vector machine solution, regardless of depth.

    Submitted 10 January, 2019; v1 submitted 1 June, 2018; originally announced June 2018.

  22. arXiv:1803.01905  [pdf, other]

    stat.ML cs.LG

    Convergence of Gradient Descent on Separable Data

    Authors: Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

    Abstract: We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margi…

    Submitted 24 March, 2019; v1 submitted 5 March, 2018; originally announced March 2018.

    Comments: AISTATS Camera ready version

  23. arXiv:1802.08246  [pdf, other]

    stat.ML cs.LG

    Characterizing Implicit Bias in Terms of Optimization Geometry

    Authors: Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro

    Abstract: We study the implicit bias of generic optimization methods, such as mirror descent, natural gradient descent, and steepest descent with respect to different potentials and norms, when optimizing underdetermined linear regression or separable linear classification problems. We explore the question of whether the specific global minimum (among the many possible global minima) reached by an algorithm…

    Submitted 22 June, 2020; v1 submitted 22 February, 2018; originally announced February 2018.

    Comments: (1) A bug in the proof of implicit bias for matrix factorization was fixed. v2 gives a characterization of the asymptotic bias of the factor matrices, while v1 made a stronger claim on the limit direction of the unfactored matrix. (2) v2 also includes new results on implicit bias of mirror descent with realizable affine constraints

  24. arXiv:1710.10345  [pdf, ps, other]

    stat.ML cs.LG

    The Implicit Bias of Gradient Descent on Separable Data

    Authors: Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro

    Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a d…

    Submitted 16 April, 2024; v1 submitted 27 October, 2017; originally announced October 2017.

    Comments: Change from v5: clarified the derivation between eqs. (41) and (42)

  25. arXiv:1705.09280  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization in Matrix Factorization

    Authors: Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro

    Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

    Submitted 25 May, 2017; originally announced May 2017.

  26. arXiv:1702.06081  [pdf, other]

    cs.LG

    Learning Non-Discriminatory Predictors

    Authors: Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, Nathan Srebro

    Abstract: We consider learning a predictor which is non-discriminatory with respect to a "protected attribute" according to the notion of "equalized odds" proposed by Hardt et al. [2016]. We study the problem of learning such a non-discriminatory predictor from a finite training set, both statistically and computationally. We show that a post-hoc correction approach, as suggested by Hardt et al., can be high…

    Submitted 1 November, 2017; v1 submitted 20 February, 2017; originally announced February 2017.

    Comments: 28 pages

  27. arXiv:1611.04218  [pdf, other]

    stat.ML cs.LG

    Preference Completion from Partial Rankings

    Authors: Suriya Gunasekar, Oluwasanmi Koyejo, Joydeep Ghosh

    Abstract: We propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerical scores, the predictive quantity of interest is t…

    Submitted 13 November, 2016; originally announced November 2016.

    Comments: NIPS 2016

  28. arXiv:1608.00704  [pdf, other]

    stat.ML cs.LG

    Identifiable Phenotyping using Constrained Non-Negative Matrix Factorization

    Authors: Shalmali Joshi, Suriya Gunasekar, David Sontag, Joydeep Ghosh

    Abstract: This work proposes a new algorithm for automated and simultaneous phenotyping of multiple co-occurring medical conditions, also referred to as comorbidities, using clinical notes from the electronic health records (EHRs). A basic latent factor estimation technique of non-negative matrix factorization (NMF) is augmented with domain specific constraints to obtain sparse latent factors that are anchored…

    Submitted 20 September, 2016; v1 submitted 2 August, 2016; originally announced August 2016.

    Comments: Presented at 2016 Machine Learning and Healthcare Conference (MLHC 2016), Los Angeles, CA

  29. arXiv:1509.04397  [pdf, ps, other]

    stat.ML cs.LG

    Exponential Family Matrix Completion under Structural Constraints

    Authors: Suriya Gunasekar, Pradeep Ravikumar, Joydeep Ghosh

    Abstract: We consider the matrix completion problem of recovering a structured matrix from noisy and partial measurements. Recent works have proposed tractable estimators with strong statistical guarantees for the case where the underlying matrix is low-rank, and the measurements consist of a subset, either of the exact individual entries, or of the entries perturbed by additive Gaussian noise, which is th…

    Submitted 15 September, 2015; originally announced September 2015.

    Comments: 20 pages, 9 figures

    Journal ref: Gunasekar, Suriya, Pradeep Ravikumar, and Joydeep Ghosh. "Exponential family matrix completion under structural constraints". Proceedings of The 31st International Conference on Machine Learning, pp. 1917-1925, 2014

  30. arXiv:1412.2113  [pdf, other]

    stat.ML cs.LG

    Consistent Collective Matrix Completion under Joint Low Rank Structure

    Authors: Suriya Gunasekar, Makoto Yamada, Dawei Yin, Yi Chang

    Abstract: We address the collective matrix completion problem of jointly recovering a collection of matrices with shared structure from partial (and potentially noisy) observations. To ensure well-posedness of the problem, we impose a joint low rank structure, wherein each component matrix is low rank and the latent space of the low rank factors corresponding to each entity is shared across the entire coll…

    Submitted 7 April, 2015; v1 submitted 5 December, 2014; originally announced December 2014.

    Comments: 19 pages, 3 figures