Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, et al. (90 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

Submitted 23 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 19 pages

arXiv:2403.19146 [pdf, ps, other]

Improving the Bit Complexity of Communication for Distributed Convex Optimization

Authors: Mehrdad Ghadiri, Yin Tat Lee, Swati Padmanabhan, William Swartworth, David Woodruff, Guanghao Ye

Abstract: We consider the communication complexity of some fundamental convex optimization problems in the point-to-point (coordinator) and blackboard communication models. We strengthen known bounds for approximately solving linear regression, $p$-norm regression (for $1\leq p\leq 2$), linear programming, minimizing the sum of finitely many convex nonsmooth functions with varying supports, and low rank approximation; for a number of these fundamental problems our bounds are nearly optimal, as proven by our lower bounds. Among our techniques, we use the notion of block leverage scores, which have been relatively unexplored in this context, as well as dropping all but the "middle" bits in Richardson-style algorithms. We also introduce a new communication problem for accurately approximating inner products and establish a lower bound using the spherical Radon transform. Our lower bound can be used to show the first separation of linear programming and linear systems in the distributed model when the number of constraints is polynomial, addressing an open question in prior work.

Submitted 28 March, 2024; originally announced March 2024.

Comments: To appear in STOC '24. Abstract shortened to meet the arXiv limits. Comments welcome!

arXiv:2403.01749 [pdf, other]

Differentially Private Synthetic Data via Foundation Model APIs 2: Text

Authors: Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin

Abstract: Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2311.16452 [pdf, other]

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

Authors: Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz

Abstract: Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model.
Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.

Submitted 27 November, 2023; originally announced November 2023.

Comments: 21 pages, 7 figures

ACM Class: I.2.7

arXiv:2311.14737 [pdf, other]

Positional Description Matters for Transformers Arithmetic

Authors: Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang

Abstract: Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities --which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15 digits multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4 digits multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12 digits numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).

Submitted 21 November, 2023; originally announced November 2023.

Comments: 18 pages

arXiv:2309.05463 [pdf, other]

Textbooks Are All You Need II: phi-1.5 technical report

Authors: Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

Abstract: We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.

Submitted 11 September, 2023; originally announced September 2023.

arXiv:2306.11644 [pdf, other]

Textbooks Are All You Need

Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

Abstract: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

Submitted 2 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 26 pages; changed color scheme of plot. fixed minor typos and added couple clarifications

arXiv:2306.01337 [pdf, other]

MathChat: Converse to Tackle Challenging Math Problems with LLM Agents

Authors: Yiran Wu, Feiran Jia, Shaokun Zhang, Hangyu Li, Erkang Zhu, Yue Wang, Yin Tat Lee, Richard Peng, Qingyun Wu, Chi Wang

Abstract: Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. LLMs, with their generalized ability, are used as a foundation model to build AI agents for different tasks. In this paper, we study the effectiveness of utilizing LLM agents to solve math problems through conversations. We propose MathChat, a conversational problem-solving framework designed for math problems. MathChat consists of an LLM agent and a user proxy agent which is responsible for tool execution and additional guidance. This synergy facilitates a collaborative problem-solving process, where the agents engage in a dialogue to solve the problems. We perform evaluation on difficult high school competition problems from the MATH dataset. Utilizing Python, we show that MathChat can further improve previous tool-using prompting methods by 6%.

Submitted 28 June, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: Update version

arXiv:2305.03495 [pdf, other]

Automatic Prompt Optimization with "Gradient Descent" and Beam Search

Authors: Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, Michael Zeng

Abstract: Large Language Models (LLMs) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. We propose a simple and nonparametric solution to this problem, Automatic Prompt Optimization (APO), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an LLM API. The algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. The gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. These gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. Preliminary results across three benchmark NLP tasks and the novel problem of LLM jailbreak detection suggest that Automatic Prompt Optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

Submitted 19 October, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

Comments: EMNLP 2023

arXiv:2304.03426 [pdf, ps, other]

Convex Minimization with Integer Minima in $\widetilde O(n^4)$ Time

Authors: Haotian Jiang, Yin Tat Lee, Zhao Song, Lichen Zhang

Abstract: Given a convex function $f$ on $\mathbb{R}^n$ with an integer minimizer, we show how to find an exact minimizer of $f$ using $O(n^2 \log n)$ calls to a separation oracle and $O(n^4 \log n)$ time. The previous best polynomial time algorithm for this problem given in [Jiang, SODA 2021, JACM 2022] achieves $O(n^2\log\log n/\log n)$ oracle complexity. However, the overall runtime of Jiang's algorithm is at least $\widetilde{\Omega}(n^8)$, due to expensive sub-routines such as the Lenstra-Lenstra-Lovász (LLL) algorithm [Lenstra, Lenstra, Lovász, Math. Ann. 1982] and random walk based cutting plane method [Bertsimas, Vempala, JACM 2004]. Our significant speedup is obtained by a nontrivial combination of a faster version of the LLL algorithm due to [Neumaier, Stehlé, ISSAC 2016] that gives similar guarantees, the volumetric center cutting plane method (CPM) by [Vaidya, FOCS 1989] and its fast implementation given in [Jiang, Lee, Song, Wong, STOC 2020]. For the special case of submodular function minimization (SFM), our result implies a strongly polynomial time algorithm for this problem using $O(n^3 \log n)$ calls to an evaluation oracle and $O(n^4 \log n)$ additional arithmetic operations. Both the oracle complexity and the number of arithmetic operations of our more general algorithm are better than the previous best-known runtime algorithms for this specific problem given in [Lee, Sidford, Wong, FOCS 2015] and [Dadush, Végh, Zambelli, SODA 2018, MOR 2021].

Submitted 14 November, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

Comments: SODA 2024

arXiv:2303.12712 [pdf, other]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Abstract: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction.
We conclude with reflections on societal influences of the recent technological leap and future research directions.

Submitted 13 April, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2302.10879 [pdf, other]

$k$NN-Adapter: Efficient Domain Adaptation for Black-Box Language Models

Authors: Yangsibo Huang, Daogao Liu, Zexuan Zhong, Weijia Shi, Yin Tat Lee

Abstract: Fine-tuning a language model on a new domain is standard practice for domain adaptation. However, it can be infeasible when it comes to modern large-scale language models such as GPT-3, which can only be accessed through APIs, making it difficult to access the internal parameters of the model. In this paper, we propose $k$NN-Adapter, a method to effectively adapt these black-box large language models (LLMs) to a new domain. The $k$NN-Adapter builds on top of the retrieval-augmented language model, and adaptively learns to interpolate the output of the language model with retrieval results from a datastore consisting of the target domain data. Our experiments on four different domains demonstrate that $k$NN-Adapter significantly improves perplexity, and works particularly well in settings with limited access to LLMs. Additionally, we show that $k$NN-Adapter is more effective than fine-tuning when the amount of training data is limited. We also release a dataset to encourage further study.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.06085 [pdf, ps, other]

Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler

Authors: Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, Kevin Tian

Abstract: The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.

Submitted 22 February, 2023; v1 submitted 12 February, 2023; originally announced February 2023.

Comments: Comments welcome! v2 improves constant in duality result, adds citations

arXiv:2301.00457 [pdf, other]

ReSQueing Parallel and Private Stochastic Convex Optimization

Authors: Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, Kevin Tian

Abstract: We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJJLST20, ACJJS21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^d$, we obtain the following results (up to polylogarithmic factors). We give a parallel algorithm obtaining optimization error $ε_{\text{opt}}$ with $d^{1/3}ε_{\text{opt}}^{-2/3}$ gradient oracle query depth and $d^{1/3}ε_{\text{opt}}^{-2/3} + ε_{\text{opt}}^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $ε_{\text{opt}} \in [d^{-1}, d^{-1/4}]$, our algorithm matches the state-of-the-art oracle depth of [BJLLS19] while maintaining the optimal total work of stochastic gradient descent. Given $n$ samples of Lipschitz loss functions, prior works [BFTT19, BFGT20, AFKT21, KLL21] established that if $n \gtrsim d ε_{\text{dp}}^{-2}$, $(ε_{\text{dp}}, δ)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gtrsim d^2 ε_{\text{dp}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.

Submitted 27 October, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

arXiv:2212.07469 [pdf, other]

Learning threshold neurons via the "edge of stability"

Authors: Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Yin Tat Lee, Felipe Suarez, Yi Zhang

Abstract: Existing analyses of neural network training often operate under the unrealistic assumption of an extremely small learning rate. This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the "edge of stability" or "unstable convergence") and potential benefits for generalization in the large learning rate regime. Despite a flurry of recent works on this topic, however, the latter effect is still poorly understood. In this paper, we take a step towards understanding genuinely non-convex training dynamics with large learning rates by performing a detailed analysis of gradient descent for simplified models of two-layer neural networks. For these models, we provably establish the edge of stability phenomenon and discover a sharp phase transition for the step size below which the neural network fails to learn "threshold-like" neurons (i.e., neurons with a non-zero first-layer bias). This elucidates one possible mechanism by which the edge of stability can in fact lead to better generalization, as threshold neurons are basic building blocks with useful inductive bias for many tasks.

Submitted 19 October, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

Comments: 31 pages, 13 figures, Published at NeurIPS 2023

arXiv:2212.01539 [pdf, other]

Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping

Authors: Jiyan He, Xuechen Li, Da Yu, Huishuai Zhang, Janardhan Kulkarni, Yin Tat Lee, Arturs Backurs, Nenghai Yu, Jiang Bian

Abstract: Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of group-wise clipping. To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with per-device clipping that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at $ε=1$ better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.

Submitted 3 December, 2022; originally announced December 2022.

Comments: 25 pages
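
The difference between flat clipping and per-layer clipping described in the abstract above can be illustrated with a minimal sketch. The function names, the two-layer toy gradient, and the threshold values are hypothetical and are not taken from the paper; the noise addition that differential privacy also requires is omitted, and the adaptive choice of thresholds is not shown.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def clip_scale(norm, threshold):
    """Factor that shrinks a vector of the given norm to at most `threshold`."""
    return min(1.0, threshold / norm) if norm > 0 else 1.0

def flat_clip(layer_grads, threshold):
    """Flat clipping: one threshold applied to the norm of the whole gradient."""
    total = l2_norm([x for g in layer_grads for x in g])
    s = clip_scale(total, threshold)
    return [[x * s for x in g] for g in layer_grads]

def per_layer_clip(layer_grads, thresholds):
    """Per-layer clipping: each layer is clipped against its own threshold."""
    out = []
    for g, c in zip(layer_grads, thresholds):
        s = clip_scale(l2_norm(g), c)
        out.append([x * s for x in g])
    return out

# Hypothetical per-example gradient with two "layers".
grads = [[3.0, 4.0], [0.0, 12.0]]
flat = flat_clip(grads, threshold=6.5)                  # whole-gradient norm 13 -> 6.5
layered = per_layer_clip(grads, thresholds=[1.0, 6.0])  # layer norms 5 -> 1 and 12 -> 6
print(flat)
print(layered)
```

Flat clipping needs the norm of the entire gradient before any layer can be rescaled, whereas the per-layer variant only needs each layer's own norm, which is what lets it run alongside backpropagation.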

arXiv:2211.11860 [pdf, other]

Upper and Lower Bounds on the Smoothed Complexity of the Simplex Method

Authors: Sophie Huiberts, Yin Tat Lee, Xinzhi Zhang

Abstract: The simplex method for linear programming is known to be highly efficient in practice, and understanding its performance from a theoretical perspective is an active research topic. The framework of smoothed analysis, first introduced by Spielman and Teng (JACM '04) for this purpose, defines the smoothed complexity of solving a linear program with $d$ variables and $n$ constraints as the expected running time when Gaussian noise of variance $σ^2$ is added to the LP data. We prove that the smoothed complexity of the simplex method is $O(σ^{-3/2} d^{13/4}\log^{7/4} n)$, improving the dependence on $1/σ$ compared to the previous bound of $O(σ^{-2} d^2\sqrt{\log n})$. We accomplish this through a new analysis of the \emph{shadow bound}, key to earlier analyses as well. Illustrating the power of our new method, we use our method to prove a nearly tight upper bound on the smoothed complexity of two-dimensional polygons. We also establish the first non-trivial lower bound on the smoothed complexity of the simplex method, proving that the \emph{shadow vertex simplex method} requires at least $Ω\Big(\min \big(σ^{-1/2} d^{-1/2}\log^{-1/4} d,2^d \big) \Big)$ pivot steps with high probability. A key part of our analysis is a new variation on the extended formulation for the regular $2^k$-gon. We end with a numerical experiment that suggests this analysis could be further improved.

Submitted 15 May, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: 43 pages, 5 figures. STOC 2023

arXiv:2210.07219 [pdf, ps, other]

Condition-number-independent convergence rate of Riemannian Hamiltonian Monte Carlo with numerical integrators

Authors: Yunbum Kook, Yin Tat Lee, Ruoqi Shen, Santosh S. Vempala

Abstract: We study the convergence rate of discretized Riemannian Hamiltonian Monte Carlo on sampling from distributions in the form of $e^{-f(x)}$ on a convex body $\mathcal{M}\subset\mathbb{R}^{n}$. We show that for distributions in the form of $e^{-α^{\top}x}$ on a polytope with $m$ constraints, the convergence rate of a family of commonly-used integrators is independent of $\left\Vert α\right\Vert _{2}$ and the geometry of the polytope. In particular, the implicit midpoint method (IMM) and the generalized Leapfrog method (LM) have a mixing time of $\widetilde{O}\left(mn^{3}\right)$ to achieve $ε$ total variation distance to the target distribution. These guarantees are based on a general bound on the convergence rate for densities of the form $e^{-f(x)}$ in terms of parameters of the manifold and the integrator. Our theoretical guarantee complements the empirical results of [KLSV22], which shows that RHMC with IMM can sample ill-conditioned, non-smooth and constrained distributions in very high dimension efficiently in practice.

Submitted 10 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: Improved writing & Theory for arXiv:2202.01908

arXiv:2208.11644 [pdf, ps, other]

A Slightly Improved Bound for the KLS Constant

Authors: Arun Jambulapati, Yin Tat Lee, Santosh S. Vempala

Abstract: We refine the recent breakthrough technique of Klartag and Lehec to obtain an improved polylogarithmic bound for the KLS constant.

Submitted 6 October, 2022; v1 submitted 24 August, 2022; originally announced August 2022.

Comments: minor revision fixing typos

arXiv:2208.03811 [pdf, ps, other]

Decomposable Non-Smooth Convex Optimization with Nearly-Linear Gradient Oracle Complexity

Authors: Sally Dong, Haotian Jiang, Yin Tat Lee, Swati Padmanabhan, Guanghao Ye

Abstract: Many fundamental problems in machine learning can be formulated by the convex program \[ \min_{θ\in R^d}\ \sum_{i=1}^{n}f_{i}(θ), \] where each $f_i$ is a convex, Lipschitz function supported on a subset of $d_i$ coordinates of $θ$. One common approach to this problem, exemplified by stochastic gradient descent, involves sampling one $f_i$ term at every iteration to make progress. This approach crucially relies on a notion of uniformity across the $f_i$'s, formally captured by their condition number. In this work, we give an algorithm that minimizes the above convex formulation to $ε$-accuracy in $\widetilde{O}(\sum_{i=1}^n d_i \log (1 /ε))$ gradient computations, with no assumptions on the condition number. The previous best algorithm independent of the condition number is the standard cutting plane method, which requires $O(nd \log (1/ε))$ gradient computations. As a corollary, we improve upon the evaluation oracle complexity for decomposable submodular minimization by Axiotis et al. (ICML 2021). Our main technical contribution is an adaptive procedure to select an $f_i$ term at every iteration via a novel combination of cutting-plane and interior-point methods.

Submitted 7 August, 2022; originally announced August 2022.

arXiv:2207.08347 [pdf, ps, other]

Private Convex Optimization in General Norms

Authors: Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, Kevin Tian

Abstract: We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $ε\to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.

Submitted 10 November, 2022; v1 submitted 17 July, 2022; originally announced July 2022.

Comments: SODA 2023

arXiv:2207.00160 [pdf, other]

When Does Differentially Private Learning Not Suffer in High Dimensions?

Authors: Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin A. Inan, Janardhan Kulkarni, Yin Tat Lee, Abhradeep Guha Thakurta

Abstract: Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces is a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.

Submitted 26 October, 2022; v1 submitted 30 June, 2022; originally announced July 2022.

Comments: 26 pages; v3 includes additional experiments and clarification

arXiv:2205.01562 [pdf, ps, other]

Nested Dissection Meets IPMs: Planar Min-Cost Flow in Nearly-Linear Time

Authors: Sally Dong, Yu Gao, Gramoz Goranci, Yin Tat Lee, Richard Peng, Sushant Sachdeva, Guanghao Ye

Abstract: We present a nearly-linear time algorithm for finding a minimum-cost flow in planar graphs with polynomially bounded integer costs and capacities. The previous fastest algorithm for this problem is based on interior point methods (IPMs) and works for general sparse graphs in $O(n^{1.5}\text{poly}(\log n))$ time [Daitch-Spielman, STOC'08]. Intuitively, $Ω(n^{1.5})$ is a natural runtime barrier for IPM-based methods, since they require $\sqrt{n}$ iterations, each routing a possibly-dense electrical flow. To break this barrier, we develop a new implicit representation for flows based on generalized nested-dissection [Lipton-Rose-Tarjan, JSTOR'79] and approximate Schur complements [Kyng-Sachdeva, FOCS'16]. This implicit representation permits us to design a data structure to route an electrical flow with sparse demands in roughly $\sqrt{n}$ update time, resulting in a total running time of $O(n\cdot\text{poly}(\log n))$. Our results immediately extend to all families of separable graphs.

Submitted 3 May, 2022; originally announced May 2022.

Comments: 93 pages

arXiv:2203.00263 [pdf, ps, other]

Private Convex Optimization via Exponential Mechanism

Authors: Sivakanth Gopi, Yin Tat Lee, Daogao Liu

Abstract: In this paper, we study private optimization problems for non-smooth convex functions $F(x)=\mathbb{E}_i f_i(x)$ on $\mathbb{R}^d$. We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ and sampling from $π(x)\propto \exp(-k(F(x)+μ\|x\|_2^2/2))$ recovers both the known optimal empirical risk and population loss under $(ε,δ)$-DP. Furthermore, we show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $f_i(x)$ for the DP-SCO where $n$ is the number of samples/users and $d$ is the ambient dimension. We also give a (nearly) matching lower bound $\widetildeΩ(n \min(d, n))$ on the number of evaluation queries. Our results utilize the following tools that are of independent interest: (1) We prove Gaussian Differential Privacy (GDP) of the exponential mechanism if the loss function is strongly convex and the perturbation is Lipschitz. Our privacy bound is \emph{optimal} as it includes the privacy of Gaussian mechanism as a special case and is proved using the isoperimetric inequality for strongly log-concave measures. (2) We show how to sample from $\exp(-F(x)-μ\|x\|^2_2/2)$ for $G$-Lipschitz $F$ with $η$ error in total variation (TV) distance using $\widetilde{O}((G^2/μ) \log^2(d/η))$ unbiased queries to $F(x)$. This is the first sampler whose query complexity has \emph{polylogarithmic dependence} on both dimension $d$ and accuracy $η$.

Submitted 28 July, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

arXiv:2202.01908 [pdf, other]

Sampling with Riemannian Hamiltonian Monte Carlo in a Constrained Space

Authors: Yunbum Kook, Yin Tat Lee, Ruoqi Shen, Santosh S. Vempala

Abstract: We demonstrate for the first time that ill-conditioned, non-smooth, constrained distributions in very high dimension, upwards of 100,000, can be sampled efficiently $\textit{in practice}$. Our algorithm incorporates constraints into the Riemannian version of Hamiltonian Monte Carlo and maintains sparsity. This allows us to achieve a mixing rate independent of smoothness and condition numbers. On benchmark data sets in systems biology and linear programming, our algorithm outperforms existing packages by orders of magnitude. In particular, we achieve a 1,000-fold speed-up for sampling from the largest published human metabolic network (RECON3D). Our package has been incorporated into the COBRA toolbox.

Submitted 15 October, 2022; v1 submitted 3 February, 2022; originally announced February 2022.

Comments: Mixing-rate proof added. To appear in NeurIPS 2022

arXiv:2112.00722 [pdf, ps, other]

Faster Maxflow via Improved Dynamic Spectral Vertex Sparsifiers

Authors: Jan van den Brand, Yu Gao, Arun Jambulapati, Yin Tat Lee, Yang P. Liu, Richard Peng, Aaron Sidford

Abstract: We make several advances broadly related to the maintenance of electrical flows in weighted graphs undergoing dynamic resistance updates, including: 1. More efficient dynamic spectral vertex sparsification, achieved by faster length estimation of random walks in weighted graphs using Morris counters [Morris 1978, Nelson-Yu 2020]. 2. A direct reduction from detecting edges with large energy in dynamic electric flows to dynamic spectral vertex sparsifiers. 3. A procedure for turning algorithms for estimating a sequence of vectors under updates from an oblivious adversary to one that tolerates adaptive adversaries via the Gaussian-mechanism from differential privacy. Combining these pieces with modifications to prior robust interior point frameworks gives an algorithm that on graphs with $m$ edges computes a mincost flow with edge costs and capacities in $[1, U]$ in time $\widetilde{O}(m^{3/2-1/58} \log^2 U)$. In prior and independent work, [Axiotis-Mądry-Vladu FOCS 2021] also obtained an improved algorithm for sparse mincost flows on capacitated graphs. Our algorithm implies a $\widetilde{O}(m^{3/2-1/58} \log U)$ time maxflow algorithm, improving over the $\widetilde{O}(m^{3/2-1/328}\log U)$ time maxflow algorithm of [Gao-Liu-Peng FOCS 2021].

Submitted 1 December, 2021; originally announced December 2021.

Comments: 63 pages

arXiv:2110.15563 [pdf, other]

Computing Lewis Weights to High Precision

Authors: Maryam Fazel, Yin Tat Lee, Swati Padmanabhan, Aaron Sidford

Abstract: We present an algorithm for computing approximate $\ell_p$ Lewis weights to high precision. Given a full-rank $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \geq n$ and a scalar $p>2$, our algorithm computes $ε$-approximate $\ell_p$ Lewis weights of $\mathbf{A}$ in $\widetilde{O}_p(\log(1/ε))$ iterations; the cost of each iteration is linear in the input size plus the cost of computing the leverage scores of $\mathbf{D}\mathbf{A}$ for diagonal $\mathbf{D} \in \mathbb{R}^{m \times m}$. Prior to our work, such a computational complexity was known only for $p \in (0, 4)$ [CohenPeng2015], and combined with this result, our work yields the first polylogarithmic-depth polynomial-work algorithm for the problem of computing $\ell_p$ Lewis weights to high precision for all constant $p > 0$. An important consequence of this result is also the first polylogarithmic-depth polynomial-work algorithm for computing a nearly optimal self-concordant barrier for a polytope.

Submitted 29 October, 2021; originally announced October 2021.

Comments: 24 pages

arXiv:2110.06500 [pdf, other]

Differentially Private Fine-tuning of Language Models

Authors: Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang

Abstract: We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. Privately fine-tuning with DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8 respectively (privacy budget of $ε= 6.8,δ=$ 1e-5) whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.

Submitted 14 July, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: ICLR 2022. Code available at https://github.com/huseyinatahaninan/Differentially-Private-Fine-tuning-of-Language-Models

arXiv:2108.04734 [pdf, other]

Tutorial on the Robust Interior Point Method

Authors: Yin Tat Lee, Santosh S. Vempala

Abstract: We give a short, self-contained proof of the interior point method and its robust version.

Submitted 10 August, 2021; originally announced August 2021.

arXiv:2106.05480 [pdf, other]

Lower Bounds on Metropolized Sampling Methods for Well-Conditioned Distributions

Authors: Yin Tat Lee, Ruoqi Shen, Kevin Tian

Abstract: We give lower bounds on the performance of two of the most popular sampling methods in practice, the Metropolis-adjusted Langevin algorithm (MALA) and multi-step Hamiltonian Monte Carlo (HMC) with a leapfrog integrator, when applied to well-conditioned distributions. Our main result is a nearly-tight lower bound of $\widetildeΩ(κd)$ on the mixing time of MALA from an exponentially warm start, matching a line of algorithmic results up to logarithmic factors and answering an open question of Chewi et al. We also show that a polynomial dependence on dimension is necessary for the relaxation time of HMC under any number of leapfrog steps, and bound the gains achievable by changing the step count. Our HMC analysis draws upon a novel connection between leapfrog integration and Chebyshev polynomials, which may be of independent interest.

Submitted 26 October, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

Comments: 46 pages, 1 figure. This version removes Gaussian upper bound claim

arXiv:2106.02848 [pdf, ps, other]

Numerical Composition of Differential Privacy

Authors: Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz

Abstract: We give a fast algorithm to optimally compose privacy guarantees of differentially private (DP) algorithms to arbitrary accuracy. Our method is based on the notion of privacy loss random variables to quantify the privacy loss of DP algorithms. The running time and memory needed for our algorithm to approximate the privacy curve of a DP algorithm composed with itself $k$ times is $\tilde{O}(\sqrt{k})$. This improves over the best prior method by Koskela et al. (2020) which requires $\tildeΩ(k^{1.5})$ running time. We demonstrate the utility of our algorithm by accurately computing the privacy loss of DP-SGD algorithm of Abadi et al. (2016) and showing that our algorithm speeds up the privacy computations by a few orders of magnitude compared to prior work, while maintaining similar accuracy.

Submitted 26 October, 2021; v1 submitted 5 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021 Spotlight

arXiv:2105.13637 [pdf, ps, other]

The Power of Sampling: Dimension-free Risk Bounds in Private ERM

Authors: Yin Tat Lee, Daogao Liu, Zhou Lu

Abstract: Differentially private empirical risk minimization (DP-ERM) is a fundamental problem in private optimization. While the theory of DP-ERM is well-studied, as large-scale models become prevalent, traditional DP-ERM methods face new challenges, including (1) the prohibitive dependence on the ambient dimension, (2) the highly non-smooth objective functions, (3) costly first-order gradient oracles. Such challenges demand rethinking existing DP-ERM methodologies. In this work, we show that the regularized exponential mechanism combined with existing samplers can address these challenges altogether: under the standard unconstrained domain and low-rank gradients assumptions, our algorithm can achieve rank-dependent risk bounds for non-smooth convex objectives using only zeroth order oracles, which was not accomplished by prior methods. This highlights the power of sampling in differential privacy. We further construct lower bounds, demonstrating that when gradients are full-rank, there is no separation between the constrained and unconstrained settings. Our lower bound is derived from a general black-box reduction from unconstrained to the constrained domain and an improved lower bound in the constrained setting, which might be of independent interest.

Submitted 3 June, 2024; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: We add the dimension-independent upper bounds results

arXiv:2103.15352 [pdf, other]

Private Non-smooth Empirical Risk Minimization and Stochastic Convex Optimization in Subquadratic Steps

Authors: Janardhan Kulkarni, Yin Tat Lee, Daogao Liu

Abstract: We study the differentially private Empirical Risk Minimization (ERM) and Stochastic Convex Optimization (SCO) problems for non-smooth convex functions. We get a (nearly) optimal bound on the excess empirical risk and excess population loss with subquadratic gradient complexity. More precisely, our differentially private algorithm requires $O(\frac{N^{3/2}}{d^{1/8}}+ \frac{N^2}{d})$ gradient queries for optimal excess empirical risk, which is achieved with the help of subsampling and smoothing the function via convolution. This is the first subquadratic algorithm for the non-smooth case when $d$ is super constant. As a direct application, using the iterative localization approach of Feldman et al. \cite{fkt20}, we achieve the optimal excess population loss for stochastic convex optimization problem, with $O(\min\{N^{5/4}d^{1/8},\frac{ N^{3/2}}{d^{1/8}}\})$ gradient queries. Our work makes progress towards resolving a question raised by Bassily et al. \cite{bfgt20}, giving first algorithms for private ERM and SCO with subquadratic steps. We note that independently Asi et al. \cite{afkt21} gave other algorithms for private ERM and SCO with subquadratic steps.

Submitted 29 March, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

arXiv:2102.03013 [pdf, other]

Fast and Memory Efficient Differentially Private-SGD via JL Projections

Authors: Zhiqi Bu, Sivakanth Gopi, Janardhan Kulkarni, Yin Tat Lee, Judy Hanwen Shen, Uthaipon Tantipongpipat

Abstract: Differentially Private-SGD (DP-SGD) of Abadi et al. (2016) and its variations are the only known algorithms for private training of large scale neural networks. This algorithm requires computation of per-sample gradient norms which is extremely slow and memory intensive in practice. In this paper, we present a new framework to design differentially private optimizers called DP-SGD-JL and DP-Adam-JL. Our approach uses Johnson-Lindenstrauss (JL) projections to quickly approximate the per-sample gradient norms without exactly computing them, thus making the training time and memory requirements of our optimizers closer to that of their non-DP versions. Unlike previous attempts to make DP-SGD faster which work only on a subset of network architectures or use compiler techniques, we propose an algorithmic solution which works for any network in a black-box manner which is the main contribution of this paper. To illustrate this, on IMDb dataset, we train a Recurrent Neural Network (RNN) to achieve good privacy-vs-accuracy tradeoff, while being significantly faster than DP-SGD and with a similar memory footprint as non-private SGD. The privacy analysis of our algorithms is more involved than that of DP-SGD; we use the recently proposed f-DP framework of Dong et al. (2019) to prove privacy.

Submitted 5 February, 2021; originally announced February 2021.
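
The norm approximation mentioned in the abstract above rests on a standard fact: for a Gaussian vector $r \sim N(0, I)$, the expectation of $(r \cdot g)^2$ equals $\|g\|_2^2$. The following is a minimal sketch of that estimator only. The function name, the vector, and the number of projections are hypothetical, and the sketch works on an explicit vector purely for illustration; how the paper obtains the projections without forming per-sample gradients is not shown here.

```python
import random

def jl_norm_estimate(vec, num_projections, rng):
    """Estimate the L2 norm of `vec` from random Gaussian projections.

    For r ~ N(0, I), E[(r . vec)^2] = ||vec||^2, so averaging the squared
    projections gives an unbiased estimate of the squared norm.
    """
    total = 0.0
    for _ in range(num_projections):
        dot = sum(rng.gauss(0.0, 1.0) * x for x in vec)
        total += dot * dot
    return (total / num_projections) ** 0.5

rng = random.Random(0)
vec = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # stand-in for a per-sample gradient
exact = sum(x * x for x in vec) ** 0.5
approx = jl_norm_estimate(vec, num_projections=2000, rng=rng)
print(exact, approx)
```

More projections give a tighter estimate at higher cost; the relative error of the squared-norm estimate shrinks roughly like $\sqrt{2/k}$ for $k$ projections.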

arXiv:2101.08993 [pdf]

Automatic Volumetric Segmentation of Additive Manufacturing Defects with 3D U-Net

Authors: Vivian Wen Hui Wong, Max Ferguson, Kincho H. Law, Yung-Tsun Tina Lee, Paul Witherell

Abstract: Segmentation of additive manufacturing (AM) defects in X-ray Computed Tomography (XCT) images is challenging, due to the poor contrast, small sizes and variation in appearance of defects. Automatic segmentation can, however, provide quality control for additive manufacturing. Over recent years, three-dimensional convolutional neural networks (3D CNNs) have performed well in the volumetric segmentation of medical images. In this work, we leverage techniques from the medical imaging domain and propose training a 3D U-Net model to automatically segment defects in XCT images of AM samples. This work not only contributes to the use of machine learning for AM defect detection but also demonstrates for the first time 3D volumetric segmentation in AM. We train and test with three variants of the 3D U-Net on an AM dataset, achieving a mean intersection of union (IOU) value of 88.4%.

Submitted 22 January, 2021; originally announced January 2021.

Comments: Accepted by AAAI 2020 Spring Symposia

Journal ref: AAAI 2020 Spring Symposia, Stanford, CA, USA, Mar 23-25, 2020

arXiv:2101.05719 [pdf, ps, other]

Minimum Cost Flows, MDPs, and $\ell_1$-Regression in Nearly Linear Time for Dense Instances

Authors: Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, Di Wang

Abstract: In this paper we provide new randomized algorithms with improved runtimes for solving linear programs with two-sided constraints. In the special case of the minimum cost flow problem on $n$-vertex $m$-edge graphs with integer polynomially-bounded costs and capacities we obtain a randomized method which solves the problem in $\tilde{O}(m+n^{1.5})$ time. This improves upon the previous best runtime of $\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) and, in the special case of unit-capacity maximum flow, improves upon the previous best runtimes of $m^{4/3+o(1)}$ (Liu-Sidford 2020, Kathuria 2020) and $\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) for sufficiently dense graphs. For $\ell_1$-regression in a matrix with $n$ columns and $m$ rows we obtain a randomized method which computes an $ε$-approximate solution in $\tilde{O}(mn+n^{2.5})$ time. This yields a randomized method which computes an $ε$-optimal policy of a discounted Markov Decision Process with $S$ states and $A$ actions per state in time $\tilde{O}(S^2A+S^{2.5})$. These methods improve upon the previous best runtimes of methods which depend polylogarithmically on problem parameters, which were $\tilde{O}(mn^{1.5})$ (Lee-Sidford 2015) and $\tilde{O}(S^{2.5}A)$ (Lee-Sidford 2014, Sidford-Wang-Wu-Ye 2018). To obtain this result we introduce two new algorithmic tools of independent interest. First, we design a new general interior point method for solving linear programs with two-sided constraints which combines techniques from (Lee-Song-Zhang 2019, Brand et al. 2020) to obtain a robust stochastic method with iteration count nearly the square root of the smaller dimension. Second, to implement this method we provide dynamic data structures for efficiently maintaining approximations to variants of Lewis-weights, a fundamental importance measure for matrices which generalize leverage scores and effective resistances.

Submitted 21 August, 2021; v1 submitted 14 January, 2021; originally announced January 2021.

arXiv:2011.05365 [pdf, other]

A Nearly-Linear Time Algorithm for Linear Programs with Small Treewidth: A Multiscale Representation of Robust Central Path

Authors: Sally Dong, Yin Tat Lee, Guanghao Ye

Abstract: Arising from structural graph theory, treewidth has become a focus of study in fixed-parameter tractable algorithms in various communities including combinatorics, integer-linear programming, and numerical analysis. Many NP-hard problems are known to be solvable in $\widetilde{O}(n \cdot 2^{O(\mathrm{tw})})$ time, where $\mathrm{tw}$ is the treewidth of the input graph. Analogously, many problems in P should be solvable in $\widetilde{O}(n \cdot \mathrm{tw}^{O(1)})$ time; however, due to the lack of appropriate tools, only a few such results are currently known. [Fom+18] conjectured this to hold as broadly as all linear programs; in our paper, we show this is true: Given a linear program of the form $\min_{Ax=b,\ell \leq x\leq u} c^{\top} x$, and a width-$τ$ tree decomposition of a graph $G_A$ related to $A$, we show how to solve it in time $$\widetilde{O}(n \cdot τ^2 \log (1/\varepsilon)),$$ where $n$ is the number of variables and $\varepsilon$ is the relative accuracy. Combined with recent techniques in vertex-capacitated flow [BGS21], this leads to an algorithm with $\widetilde{O}(n^{1+o(1)} \cdot \mathrm{tw}^2 \log (1/\varepsilon))$ run-time. Besides being the first of its kind, our algorithm has run-time nearly matching the fastest run-time for solving the sub-problem $Ax=b$ (under the assumption that no fast matrix multiplication is used). We obtain these results by combining recent techniques in interior-point methods (IPMs), sketching, and a novel representation of the solution under a multiscale basis similar to the wavelet basis.

Submitted 13 September, 2023; v1 submitted 10 November, 2020; originally announced November 2020.

arXiv:2010.03106 [pdf, ps, other]

Structured Logconcave Sampling with a Restricted Gaussian Oracle

Authors: Yin Tat Lee, Ruoqi Shen, Kevin Tian

Abstract: We give algorithms for sampling several structured logconcave families to high accuracy. We further develop a reduction framework, inspired by proximal point methods in convex optimization, which bootstraps samplers for regularized densities to improve dependences on problem conditioning. A key ingredient in our framework is the notion of a "restricted Gaussian oracle" (RGO) for $g: \mathbb{R}^d \rightarrow \mathbb{R}$, which is a sampler for distributions whose negative log-likelihood sums a quadratic and $g$. By combining our reduction framework with our new samplers, we obtain the following bounds for sampling structured distributions to total variation distance $ε$. For composite densities $\exp(-f(x) - g(x))$, where $f$ has condition number $κ$ and convex (but possibly non-smooth) $g$ admits an RGO, we obtain a mixing time of $O(κd \log^3\frac{κd}ε)$, matching the state-of-the-art non-composite bound; no composite samplers with better mixing than general-purpose logconcave samplers were previously known. For logconcave finite sums $\exp(-F(x))$, where $F(x) = \frac{1}{n}\sum_{i \in [n]} f_i(x)$ has condition number $κ$, we give a sampler querying $\widetilde{O}(n + κ\max(d, \sqrt{nd}))$ gradient oracles to $\{f_i\}_{i \in [n]}$; no high-accuracy samplers with nontrivial gradient query complexity were previously known. For densities with condition number $κ$, we give an algorithm obtaining mixing time $O(κd \log^2\frac{κd}ε)$, improving the prior state-of-the-art by a logarithmic factor with a significantly simpler analysis; we also show a zeroth-order algorithm attains the same query complexity.

Submitted 22 October, 2021; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: 58 pages. The results of Section 5 of this paper, as well as an empirical evaluation, appeared earlier as arXiv:2006.05976. This version fixes an error in the proof of Theorem 1, see Section 1.4

arXiv:2009.10217 [pdf, ps, other]

A Faster Interior Point Method for Semidefinite Programming

Authors: Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, Zhao Song

Abstract: Semidefinite programs (SDPs) are a fundamental class of optimization problems with important recent applications in approximation algorithms, quantum complexity, robust learning, algorithmic rounding, and adversarial deep learning. This paper presents a faster interior point method to solve generic SDPs with variable size $n \times n$ and $m$ constraints in time \begin{align*} \widetilde{O}(\sqrt{n}( mn^2 + m^ω + n^ω) \log(1 / ε) ), \end{align*} where $ω$ is the exponent of matrix multiplication and $ε$ is the relative accuracy. In the predominant case of $m \geq n$, our runtime outperforms that of the previous fastest SDP solver, which is based on the cutting plane method of Jiang, Lee, Song, and Wong [JLSW20]. Our algorithm's runtime can be naturally interpreted as follows: $\widetilde{O}(\sqrt{n} \log (1/ε))$ is the number of iterations needed for our interior point method, $mn^2$ is the input size, and $m^ω + n^ω$ is the time to invert the Hessian and slack matrix in each iteration. These constitute natural barriers to further improving the runtime of interior point methods for solving generic SDPs.

Submitted 21 September, 2020; originally announced September 2020.

Comments: FOCS 2020

arXiv:2008.02146 [pdf, other]

Reducing Isotropy and Volume to KLS: An $O(n^3ψ^2)$ Volume Algorithm

Authors: He Jia, Aditi Laddha, Yin Tat Lee, Santosh S. Vempala

Abstract: We show that the volume of a convex body in ${\bf R}^{n}$ in the general membership oracle model can be computed to within relative error $\varepsilon$ using $\widetilde{O}(n^{3}ψ^{2} + n^{3}/\varepsilon^{2})$ oracle queries, where $ψ$ is the KLS constant. With the current bound of $ψ=\widetilde{O}(1)$, this gives an $\widetilde{O}(n^{3}/\varepsilon^{2})$ algorithm, improving on the Lovász-Vempala $\widetilde{O}(n^{4}/\varepsilon^{2})$ algorithm from 2003. The main new ingredient is an $\widetilde{O}(n^{3}ψ^{2})$ algorithm for isotropic transformation, following which we can apply the $\widetilde{O}(n^{3}/\varepsilon^{2})$ volume algorithm of Cousins and Vempala for well-rounded convex bodies. We also give an efficient implementation of the new algorithm for convex polytopes defined by $m$ inequalities in ${\bf R}^{n}$: polytope volume can be estimated in time $\widetilde{O}(mn^{c}/\varepsilon^{2})$ where $c<3.2$ depends on the current matrix multiplication exponent; this improves known bounds.

Submitted 3 September, 2022; v1 submitted 5 August, 2020; originally announced August 2020.

Comments: 23 pages, 1 figure; updated with current KLS bound and resulting complexity

arXiv:2006.05976 [pdf, other]

Composite Logconcave Sampling with a Restricted Gaussian Oracle

Authors: Ruoqi Shen, Kevin Tian, Yin Tat Lee

Abstract: We consider sampling from composite densities on $\mathbb{R}^d$ of the form $dπ(x) \propto \exp(-f(x) - g(x))dx$ for well-conditioned $f$ and convex (but possibly non-smooth) $g$, a family generalizing restrictions to a convex set, through the abstraction of a restricted Gaussian oracle. For $f$ with condition number $κ$, our algorithm runs in $O \left(κ^2 d \log^2\tfrac{κd}ε\right)$ iterations, each querying a gradient of $f$ and a restricted Gaussian oracle, to achieve total variation distance $ε$. The restricted Gaussian oracle, which draws samples from a distribution whose negative log-likelihood sums a quadratic and $g$, has been previously studied and is a natural extension of the proximal oracle used in composite optimization. Our algorithm is conceptually simple and obtains stronger provable guarantees and greater generality than existing methods for composite sampling. We conduct experiments showing our algorithm vastly improves upon the hit-and-run algorithm for sampling the restriction of a (non-diagonal) Gaussian to the positive orthant.
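To make the restricted Gaussian oracle in the abstract above concrete, here is a minimal sketch for one special case. It assumes the quadratic is isotropic, $\frac{1}{2η}\|x - v\|^2$, and that $g$ is the indicator of the positive orthant, so the target factorizes into independent one-dimensional Gaussians truncated to $[0, \infty)$. The function name `rgo_positive_orthant`, the parameters `eta` and `max_tries`, and the use of per-coordinate rejection sampling are illustrative choices, not taken from the paper.

```python
import numpy as np

def rgo_positive_orthant(v, eta, rng, max_tries=10_000):
    """Sample x with density proportional to exp(-||x - v||^2 / (2 * eta)) on x >= 0.

    Assumed setting: isotropic quadratic plus the indicator of the
    positive orthant, so each coordinate can be sampled independently.
    """
    v = np.asarray(v, dtype=float)
    x = np.empty_like(v)
    pending = np.ones(v.shape, dtype=bool)   # coordinates not yet accepted
    for _ in range(max_tries):
        idx = np.flatnonzero(pending)
        if idx.size == 0:
            break
        # Propose from the untruncated Gaussian and keep non-negative draws;
        # rejection leaves exactly the truncated distribution.
        draw = v[idx] + np.sqrt(eta) * rng.standard_normal(idx.size)
        ok = draw >= 0.0
        x[idx[ok]] = draw[ok]
        pending[idx[ok]] = False
    if pending.any():
        raise RuntimeError("rejection sampling did not finish; v may be too negative")
    return x

rng = np.random.default_rng(1)
sample = rgo_positive_orthant([0.5, -0.2, 1.0], eta=0.25, rng=rng)
```

Plain rejection becomes slow when a coordinate of `v` is far below zero; a dedicated truncated-normal sampler would be the natural replacement in that regime.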

Submitted 10 June, 2020; originally announced June 2020.

arXiv:2006.02855 [pdf, ps, other]

Network size and weights size for memorization with two-layers neural networks

Authors: Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Dan Mikulincer

Abstract: In 1988, Eric B. Baum showed that two-layers neural networks with threshold activation function can perfectly memorize the binary labels of $n$ points in general position in $\mathbb{R}^d$ using only $\ulcorner n/d \urcorner$ neurons. We observe that with ReLU networks, using four times as many neurons one can fit arbitrary real labels. Moreover, for approximate memorization up to error $ε$, the neural tangent kernel can also memorize with only $O\left(\frac{n}{d} \cdot \log(1/ε) \right)$ neurons (assuming that the data is well dispersed too). We show however that these constructions give rise to networks where the magnitude of the neurons' weights is far from optimal. In contrast we propose a new training procedure for ReLU networks, based on complex (as opposed to real) recombination of the neurons, for which we show approximate memorization with both $O\left(\frac{n}{d} \cdot \frac{\log(1/ε)}ε\right)$ neurons, as well as nearly-optimal size of the weights.

Submitted 3 November, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 27 pages

arXiv:2004.04250 [pdf, other]

An Improved Cutting Plane Method for Convex Optimization, Convex-Concave Games and its Applications

Authors: Haotian Jiang, Yin Tat Lee, Zhao Song, Sam Chiu-wai Wong

Abstract: Given a separation oracle for a convex set $K \subset \mathbb{R}^n$ that is contained in a box of radius $R$, the goal is to either compute a point in $K$ or prove that $K$ does not contain a ball of radius $ε$. We propose a new cutting plane algorithm that uses an optimal $O(n \log (κ))$ evaluations of the oracle and an additional $O(n^2)$ time per evaluation, where $κ = nR/ε$. $\bullet$ This improves upon Vaidya's $O( \text{SO} \cdot n \log (κ) + n^{ω+1} \log (κ))$ time algorithm [Vaidya, FOCS 1989a] in terms of polynomial dependence on $n$, where $ω < 2.373$ is the exponent of matrix multiplication and $\text{SO}$ is the time for oracle evaluation. $\bullet$ This improves upon Lee-Sidford-Wong's $O( \text{SO} \cdot n \log (κ) + n^3 \log^{O(1)} (κ))$ time algorithm [Lee, Sidford and Wong, FOCS 2015] in terms of dependence on $κ$. For many important applications in economics, $κ = Ω(\exp(n))$ and this leads to a significant difference between $\log(κ)$ and $\mathrm{poly}(\log (κ))$. We also provide evidence that the $n^2$ time per evaluation cannot be improved and thus our running time is optimal. A bottleneck of previous cutting plane methods is to compute leverage scores, a measure of the relative importance of past constraints. Our result is achieved by a novel multi-layered data structure for leverage score maintenance, which is a sophisticated combination of diverse techniques such as random projection, batched low-rank update, inverse maintenance, polynomial interpolation, and fast rectangular matrix multiplication. Interestingly, our method requires a combination of different fast rectangular matrix multiplication algorithms.

Submitted 8 April, 2020; originally announced April 2020.

Comments: STOC 2020

arXiv:2003.08078 [pdf, other]

Acceleration with a Ball Optimization Oracle

Authors: Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian

Abstract: Consider an oracle which takes a point $x$ and returns the minimizer of a convex function $f$ in an $\ell_2$ ball of radius $r$ around $x$. It is straightforward to show that roughly $r^{-1}\log\frac{1}ε$ calls to the oracle suffice to find an $ε$-approximate minimizer of $f$ in an $\ell_2$ unit ball. Perhaps surprisingly, this is not optimal: we design an accelerated algorithm which attains an $ε$-approximate minimizer with roughly $r^{-2/3} \log \frac{1}ε$ oracle queries, and give a matching lower bound. Further, we implement ball optimization oracles for functions with locally stable Hessians using a variant of Newton's method. The resulting algorithm applies to a number of problems of practical and theoretical import, improving upon previous results for logistic and $\ell_\infty$ regression and achieving guarantees comparable to the state-of-the-art for $\ell_p$ regression.
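The oracle model in the abstract above can be sketched for a toy objective. This illustrates only the straightforward, unaccelerated baseline (call the oracle repeatedly from its own output), not the accelerated algorithm the paper designs. It assumes $f(x) = \frac{1}{2}\|x - x^\star\|^2$, for which the ball-constrained minimizer has a closed form; the names `ball_oracle`, `naive_ball_method`, and `x_star`, and the stopping tolerance, are hypothetical.

```python
import numpy as np

def ball_oracle(x, x_star, r):
    """Minimizer of f(y) = 0.5 * ||y - x_star||^2 over the ball ||y - x|| <= r."""
    diff = x_star - x
    dist = np.linalg.norm(diff)
    if dist <= r:
        return x_star.copy()          # global minimizer lies inside the ball
    return x + (r / dist) * diff      # otherwise move distance r toward it

def naive_ball_method(x0, x_star, r, max_calls=1000):
    """Unaccelerated baseline: feed each oracle output back into the oracle."""
    x = np.asarray(x0, dtype=float)
    for calls in range(1, max_calls + 1):
        x_next = ball_oracle(x, x_star, r)
        step = np.linalg.norm(x_next - x)
        x = x_next
        # A step clearly shorter than r means the oracle returned an interior
        # point, which for convex f is a global minimizer.
        if step < 0.999 * r:
            return x, calls
    return x, max_calls

x_star = np.array([0.57, 0.76])       # at distance 0.95 from the origin
x_final, n_calls = naive_ball_method(np.zeros(2), x_star, r=0.1)
```

For this toy objective the number of calls grows like the starting distance divided by `r`, which matches the $r^{-1}$ scaling of the baseline that the paper improves to roughly $r^{-2/3}$.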

Submitted 18 March, 2020; originally announced March 2020.

Comments: 37 pages

arXiv:2002.04830 [pdf, ps, other]

Positive Semidefinite Programming: Mixed, Parallel, and Width-Independent

Authors: Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, Kevin Tian

Abstract: We give the first approximation algorithm for mixed packing and covering semidefinite programs (SDPs) with polylogarithmic dependence on width. Mixed packing and covering SDPs constitute a fundamental algorithmic primitive with recent applications in combinatorial optimization, robust learning, and quantum complexity. The current approximate solvers for positive semidefinite programming can handle only pure packing instances, and technical hurdles prevent their generalization to a wider class of positive instances. For a given multiplicative accuracy of $ε$, our algorithm takes $O(\log^3(ndρ) \cdot ε^{-3})$ parallelizable iterations, where $n$, $d$ are dimensions of the problem and $ρ$ is a width parameter of the instance, generalizing or improving all previous parallel algorithms in the positive linear and semidefinite programming literature. When specialized to pure packing SDPs, our algorithm's iteration complexity is $O(\log^2 (nd) \cdot ε^{-2})$, a slight improvement and derandomization of the state-of-the-art (Allen-Zhu et al. '16, Peng et al. '16, Wang et al. '15). For a wide variety of structured instances commonly found in applications, the iterations of our algorithm run in nearly-linear time. In doing so, we give matrix analytic techniques for overcoming obstacles that have stymied prior approaches to this open problem, as stated in past works (Peng et al. '16, Mahoney et al. '16). Crucial to our analysis are a simplification of existing algorithms for mixed positive linear programs, achieved by removing an asymmetry caused by modifying covering constraints, and a suite of matrix inequalities whose proofs are based on analyzing the Schur complements of matrices in a higher dimension. We hope that both our algorithm and techniques open the door to improved solvers for positive semidefinite programming, as well as its applications.

Submitted 12 July, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: There is an error in this manuscript. This version notes the source of the error on the first page

arXiv:2002.04121 [pdf, ps, other]

Logsmooth Gradient Concentration and Tighter Runtimes for Metropolized Hamiltonian Monte Carlo

Authors: Yin Tat Lee, Ruoqi Shen, Kevin Tian

Abstract: We show that the gradient norm $\|\nabla f(x)\|$ for $x \sim \exp(-f(x))$, where $f$ is strongly convex and smooth, concentrates tightly around its mean. This removes a barrier in the prior state-of-the-art analysis for the well-studied Metropolized Hamiltonian Monte Carlo (HMC) algorithm for sampling from a strongly logconcave distribution. We correspondingly demonstrate that Metropolized HMC mixes in $\tilde{O}(κd)$ iterations, improving upon the $\tilde{O}(κ^{1.5}\sqrt{d} + κd)$ runtime of (Dwivedi et al. '18, Chen et al. '19) by a factor $(κ/d)^{1/2}$ when the condition number $κ$ is large. Our mixing time analysis introduces several techniques which to our knowledge have not appeared in the literature and may be of independent interest, including restrictions to a nonconvex set with good conductance behavior, and a new reduction technique for boosting a constant-accuracy total variation guarantee under weak warmness assumptions. This is the first high-accuracy mixing time result for logconcave distributions using only first-order function information which achieves linear dependence on $κ$; we also give evidence that this dependence is likely to be necessary for standard Metropolized first-order methods.

Submitted 13 June, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: 31 pages. v2 propagates changes from COLT 2020 camera-ready

arXiv:2002.02304 [pdf, other]

Solving Tall Dense Linear Programs in Nearly Linear Time

Authors: Jan van den Brand, Yin Tat Lee, Aaron Sidford, Zhao Song

Abstract: In this paper we provide an $\tilde{O}(nd+d^{3})$ time randomized algorithm for solving linear programs with $d$ variables and $n$ constraints with high probability. To obtain this result we provide a robust, primal-dual $\tilde{O}(\sqrt{d})$-iteration interior point method inspired by the methods of Lee and Sidford (2014, 2019) and show how to efficiently implement this method using new data-structures based on heavy-hitters, the Johnson-Lindenstrauss lemma, and inverse maintenance. Interestingly, we obtain this running time without using fast matrix multiplication and consequently, barring a major advance in linear system solving, our running time is near optimal for solving dense linear programs among algorithms that do not use fast matrix multiplication.

Submitted 21 August, 2021; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:1911.10765 [pdf, other]

Faster Matroid Intersection

Authors: Deeparnab Chakrabarty, Yin Tat Lee, Aaron Sidford, Sahil Singla, Sam Chiu-wai Wong

Abstract: In this paper we consider the classic matroid intersection problem: given two matroids $\M_{1}=(V,\I_{1})$ and $\M_{2}=(V,\I_{2})$ defined over a common ground set $V$, compute a set $S\in\I_{1}\cap\I_{2}$ of largest possible cardinality, denoted by $r$. We consider this problem both in the setting where each $\M_{i}$ is accessed through an independence oracle, i.e. a routine which returns whether or not a set $S\in\I_{i}$ in $\indep$ time, and the setting where each $\M_{i}$ is accessed through a rank oracle, i.e. a routine which returns the size of the largest independent subset of $S$ in $\M_{i}$ in $\rank$ time. In each setting we provide faster exact and approximate algorithms. Given an independence oracle, we provide an exact $O(nr\log r \indep)$ time algorithm. This improves upon the running time of $O(nr^{1.5} \indep)$ due to Cunningham in 1986 and $\tilde{O}(n^{2} \indep+n^{3})$ due to Lee, Sidford, and Wong in 2015. We also provide two algorithms which compute a $(1-ε)$-approximate solution to matroid intersection running in times $\tilde{O}(n^{1.5}/ε^{1.5} \indep)$ and $\tilde{O}((n^{2}r^{-1}ε^{-2}+r^{1.5}ε^{-4.5}) \indep)$, respectively. These results improve upon the $O(nr/ε \indep)$-time algorithm of Cunningham as noted recently by Chekuri and Quanrud. Given a rank oracle, we provide algorithms with even better dependence on $n$ and $r$. We provide an $O(n\sqrt{r}\log n \rank)$-time exact algorithm and an $O(nε^{-1}\log n \rank)$-time algorithm which obtains a $(1-ε)$-approximation to the matroid intersection problem. The former result improves over the $\tilde{O}(nr \rankt+n^{3})$-time algorithm by Lee, Sidford, and Wong. The rank oracle is of particular interest as the matroid intersection problem with this oracle is a special case of the submodular function minimization problem with an evaluation oracle.

Submitted 25 November, 2019; originally announced November 2019.

Comments: 38 pages. Preliminary version appeared in FOCS 2019

arXiv:1911.05656 [pdf, other]

Strong Self-Concordance and Sampling

Authors: Aditi Laddha, Yin Tat Lee, Santosh Vempala

Abstract: Motivated by the Dikin walk, we develop aspects of an interior-point theory for sampling in high dimension. Specifically, we introduce a symmetric parameter and the notion of strong self-concordance. These properties imply that the corresponding Dikin walk mixes in $\tilde{O}(n\barν)$ steps from a warm start in a convex body in $\mathbb{R}^{n}$ using a strongly self-concordant barrier with symmetric self-concordance parameter $\barν$. For many natural barriers, $\barν$ is roughly bounded by $ν$, the standard self-concordance parameter. We show that this property and strong self-concordance hold for the Lee-Sidford barrier. As a consequence, we obtain the first walk to mix in $\tilde{O}(n^{2})$ steps for an arbitrary polytope in $\mathbb{R}^{n}$. Strong self-concordance for other barriers leads to an interesting (and unexpected) connection -- for the universal and entropic barriers, it is implied by the KLS conjecture.

Submitted 9 July, 2020; v1 submitted 13 November, 2019; originally announced November 2019.

arXiv:1911.00612 [pdf, other]

Computing Circle Packing Representations of Planar Graphs

Authors: Sally Dong, Yin Tat Lee, Kent Quanrud

Abstract: The Circle Packing Theorem states that every planar graph can be represented as the tangency graph of a family of internally-disjoint circles. A well-known generalization is the Primal-Dual Circle Packing Theorem for 3-connected planar graphs. The existence of these representations has widespread applications in theoretical computer science and mathematics; however, the algorithmic aspect has received relatively little attention. In this work, we present an algorithm based on convex optimization for computing a primal-dual circle packing representation of maximal planar graphs, i.e. triangulations. This in turn gives an algorithm for computing a circle packing representation of any planar graph. Both take $\widetilde{O}(n \log(R/\varepsilon))$ expected run-time to produce a solution that is $\varepsilon$ close to a true representation, where $R$ is the ratio between the maximum and minimum circle radius in the true representation.

Submitted 1 November, 2019; originally announced November 2019.

Comments: 19 pages, 10 figures. SODA 2020

Showing 1–50 of 96 results for author: Lee, Y T