-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1092 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…
▽ More
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
△ Less
Submitted 14 June, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
Authors:
Xuesu Xiao,
Tingnan Zhang,
Krzysztof Choromanski,
Edward Lee,
Anthony Francis,
Jake Varley,
Stephen Tu,
Sumeet Singh,
Peng Xu,
Fei Xia,
Sven Mikael Persson,
Dmitry Kalashnikov,
Leila Takayama,
Roy Frostig,
Jie Tan,
Carolina Parada,
Vikas Sindhwani
Abstract:
Despite decades of research, existing navigation systems still face real-world challenges when deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). Our approach…
▽ More
Despite decades of research, existing navigation systems still face real-world challenges when deployed in the wild, e.g., in cluttered home environments or in human-occupied public spaces. To address this, we present a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision context embeddings provided by Performers -- a low-rank implicit-attention Transformer. We jointly train the cost function and construct the controller relying on it, effectively solving end-to-end the corresponding bi-level optimization problem. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations of the desired navigation behavior in different challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves >40% better goal reached in cluttered environments and >65% better on social metrics when navigating around humans.
△ Less
Submitted 23 September, 2022; v1 submitted 22 September, 2022;
originally announced September 2022.
-
You Only Linearize Once: Tangents Transpose to Gradients
Authors:
Alexey Radul,
Adam Paszke,
Roy Frostig,
Matthew Johnson,
Dougal Maclaurin
Abstract:
Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) forward-mode AD followed by (ii) unzipping the…
▽ More
Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented) separately. Can there be only one? Following up on the AD systems developed in the JAX and Dex projects, we formalize a decomposition of reverse-mode AD into (i) forward-mode AD followed by (ii) unzipping the linear and non-linear parts and then (iii) transposition of the linear part.
To that end, we define a (substructurally) linear type system that can prove a class of functions are (algebraically) linear. Our main results are that forward-mode AD produces such linear functions, and that we can unzip and transpose any such linear function, conserving cost, size, and linearity. Composing these three transformations recovers reverse-mode AD. This decomposition also sheds light on checkpointing, which emerges naturally from a free choice in unzipping `let` expressions. As a corollary, checkpointing techniques are applicable to general-purpose partial evaluation, not just AD.
We hope that our formalization will lead to a deeper understanding of automatic differentiation and that it will simplify implementations, by separating the concerns of differentiation proper from the concerns of gaining efficiency (namely, separating the derivative computation from the act of running it backward).
△ Less
Submitted 6 December, 2022; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Learning from many trajectories
Authors:
Stephen Tu,
Roy Frostig,
Mahdi Soltanolkotabi
Abstract:
We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Ou…
▽ More
We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable.
For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $Θ(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $Ω(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
△ Less
Submitted 31 January, 2023; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Filtrated Common Functional Principal Components for Multivariate Functional data
Authors:
Shuhao Jiao,
Ron D. Frostig,
Hernando Ombao
Abstract:
Local field potentials (LFPs) are signals that measure electrical activity in localized cortical regions from implanted tetrodes in the human or animal brain. The LFP signals are curves observed at multiple tetrodes which are implanted across a patch on the surface of the cortex. Hence, they can be treated as multi-group functional data, where the trajectories collected across temporal epochs from…
▽ More
Local field potentials (LFPs) are signals that measure electrical activity in localized cortical regions from implanted tetrodes in the human or animal brain. The LFP signals are curves observed at multiple tetrodes which are implanted across a patch on the surface of the cortex. Hence, they can be treated as multi-group functional data, where the trajectories collected across temporal epochs from one tetrode are viewed as a group of functions. In many cases, multi-tetrode LFP trajectories contain both global variation patterns (which are shared in common to all groups, due to signal synchrony) and isolated variation patterns (common only to a small subset of groups), and such structure is very informative to the analysis of such data. Therefore, one goal in this paper is to develop an efficient procedure that is able to capture and quantify both global and isolated features. We propose a novel tree-structured functional principal components (filt-fPC) analysis through finite-dimensional functional representation - specifically via filtration. A major advantage of the proposed filt-fPC method is the ability to extract the components that are common to multiple groups (or tetrodes) in a flexible "multi-resolution" manner and simultaneously preserve the idiosyncratic individual components of different tetrodes. The proposed filt-fPC approach is highly data-driven and no "ground-truth" model pre-specification is needed, making it a suitable approach for analyzing multi-group functional data that is complex. In addition, the filt-fPC method is able to produce a parsimonious, interpretable, and efficient low dimensional representation of multi-group functional data with orthonormal basis functions. Here, the proposed filt-fPCA method is employed to study the impact of a shock (induced stroke) on the synchrony structure of the rat brain.
△ Less
Submitted 26 November, 2022; v1 submitted 2 June, 2021;
originally announced June 2021.
-
Efficient and Modular Implicit Differentiation
Authors:
Mathieu Blondel,
Quentin Berthet,
Marco Cuturi,
Roy Frostig,
Stephan Hoyer,
Felipe Llinares-López,
Fabian Pedregosa,
Jean-Philippe Vert
Abstract:
Automatic differentiation (autodiff) has revolutionized machine learning. It allows to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems suc…
▽ More
Automatic differentiation (autodiff) has revolutionized machine learning. It allows to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
△ Less
Submitted 12 October, 2022; v1 submitted 31 May, 2021;
originally announced May 2021.
-
Decomposing reverse-mode automatic differentiation
Authors:
Roy Frostig,
Matthew J. Johnson,
Dougal Maclaurin,
Adam Paszke,
Alexey Radul
Abstract:
We decompose reverse-mode automatic differentiation into (forward-mode) linearization followed by transposition. Doing so isolates the essential difference between forward- and reverse-mode AD, and simplifies their joint implementation. In particular, once forward-mode AD rules are defined for every primitive operation in a source language, only linear primitives require an additional transpositio…
▽ More
We decompose reverse-mode automatic differentiation into (forward-mode) linearization followed by transposition. Doing so isolates the essential difference between forward- and reverse-mode AD, and simplifies their joint implementation. In particular, once forward-mode AD rules are defined for every primitive operation in a source language, only linear primitives require an additional transposition rule in order to arrive at a complete reverse-mode AD implementation. This is how reverse-mode AD is written in JAX and Dex.
△ Less
Submitted 19 May, 2021;
originally announced May 2021.
-
Break Point Detection for Functional Covariance
Authors:
Shuhao Jiao,
Ron D. Frostig,
Hernando Ombao
Abstract:
Many experiments record sequential trajectories where each trajectory consists of oscillations and fluctuations around zero. Such trajectories can be viewed as zero-mean functional data. When there are structural breaks (on the sequence of trajectories) in higher order moments, it is not always easy to spot these by mere visual inspection. Motivated by this challenging problem in brain signal anal…
▽ More
Many experiments record sequential trajectories where each trajectory consists of oscillations and fluctuations around zero. Such trajectories can be viewed as zero-mean functional data. When there are structural breaks (on the sequence of trajectories) in higher order moments, it is not always easy to spot these by mere visual inspection. Motivated by this challenging problem in brain signal analysis, we propose a detection and testing procedure to find the change point in functional covariance. The detection procedure is based on the cumulative sum statistics (CUSUM). The classical testing procedure for functional data depends on a null distribution which depends on infinitely many unknown parameters, though in practice only a finite number of these can be included for the hypothesis test of the existence of change point. This paper provides some theoretical insights on the influence of the number of parameters. Meanwhile, the asymptotic properties of the estimated change point are developed. The effectiveness of the proposed method is numerically validated in simulation studies and an application to investigate changes in rat brain signals following an experimentally-induced stroke.
△ Less
Submitted 4 February, 2022; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Variation Pattern Classification of Functional Data
Authors:
Shuhao Jiao,
Ron D. Frostig,
Hernando Ombao
Abstract:
A new classification method for functional data is proposed in this paper. This work is motivated by the need to identify features that discriminate between neurological conditions on which local field potentials (LFPs) were recorded. Regardless of the condition, these local field potentials have zero mean and thus the first moments of these random processes do not have discriminating power. We pr…
▽ More
A new classification method for functional data is proposed in this paper. This work is motivated by the need to identify features that discriminate between neurological conditions on which local field potentials (LFPs) were recorded. Regardless of the condition, these local field potentials have zero mean and thus the first moments of these random processes do not have discriminating power. We propose the variation pattern classification (VPC) method {which employs the (auto-)covariance operators as the discriminating features} and uses the Hilbert-Schmidt norm to measure the discrepancy between the (auto-)covariance operators of different groups. The proposed VPC method is demonstrated to be sensitive to the discrepancy, {potentially leading to a higher rate of classification}. One important innovation lies in the dimension reduction where the VPC method data-adaptively determines the basis functions (discriminative feature functions) that account for the major discrepancy. In addition, the selected discriminative feature functions provide insights on the discrepancy between different groups because they reveal the features of variation pattern that differentiate groups. Consistency properties are established and, furthermore, simulation studies and the analysis of rat brain LFP trajectories empirically demonstrate the advantages and effectiveness of the proposed method.
△ Less
Submitted 4 February, 2022; v1 submitted 2 April, 2020;
originally announced April 2020.
-
Modeling Spectral Properties in Stationary Processes of Varying Dimensions with Applications to Brain Local Field Potential Signals
Authors:
Raanju Ragavendar Sundararajan,
Ron D. Frostig,
Hernando Ombao
Abstract:
A common class of methods for analyzing of multivariate time series, stationary and nonstationary, decomposes the observed series into latent sources. Methods such as principal compoment analysis (PCA), independent component analysis (ICA) and Stationary Subspace Analysis (SSA) assume the observed multivariate process is generated by latent sources that are stationary or nonstationary. We develop…
▽ More
A common class of methods for analyzing of multivariate time series, stationary and nonstationary, decomposes the observed series into latent sources. Methods such as principal compoment analysis (PCA), independent component analysis (ICA) and Stationary Subspace Analysis (SSA) assume the observed multivariate process is generated by latent sources that are stationary or nonstationary. We develop a method that tracks changes in the complexity of a 32-channel local field potential (LFP) signal from a rat following an experimentally induced stroke. We study complexity through the latent sources and their dimensions that can change across epochs due to an induced shock to the cortical system. Our method compares the spread of spectral information in several multivariate stationary processes with different dimensions. A frequency specific spectral ratio (FS-ratio) statistic is proposed and its asymptotic properties are derived. The FS-ratio is blind to the dimension of the stationary process and captures the proportion of spectral information in various (user-specified) frequency bands. We apply our method to study differences in complexity and structure of the LFP before and after system shock. The analysis indicates that spectral information in the beta frequency band (12-30 Hertz) demonstrated the greatest change in structure and complexity due to the stroke.
△ Less
Submitted 29 November, 2019; v1 submitted 27 November, 2019;
originally announced November 2019.
-
The advantages of multiple classes for reducing overfitting from test set reuse
Authors:
Vitaly Feldman,
Roy Frostig,
Moritz Hardt
Abstract:
Excessive reuse of holdout data can lead to overfitting. However, there is little concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks today. Known results show that, in the worst-case, revealing the accuracy of $k$ adaptively chosen classifiers on a data set of size $n$ allows to create a classifier with bias of $Θ(\sqrt{k/n})$ for any binary predicti…
▽ More
Excessive reuse of holdout data can lead to overfitting. However, there is little concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks today. Known results show that, in the worst-case, revealing the accuracy of $k$ adaptively chosen classifiers on a data set of size $n$ allows to create a classifier with bias of $Θ(\sqrt{k/n})$ for any binary prediction problem. We show a new upper bound of $\tilde O(\max\{\sqrt{k\log(n)/(mn)},k/n\})$ on the worst-case bias that any attack can achieve in a prediction problem with $m$ classes. Moreover, we present an efficient attack that achieve a bias of $Ω(\sqrt{k/(m^2 n)})$ and improves on previous work for the binary setting ($m=2$). We also present an inefficient attack that achieves a bias of $\tildeΩ(k/n)$. Complementing our theoretical work, we give new practical attacks to stress-test multiclass benchmarks by aiming to create as large a bias as possible with a given number of queries. Our experiments show that the additional uncertainty of prediction with a large number of classes indeed mitigates the effect of our best attacks.
Our work extends developments in understanding overfitting due to adaptive data analysis to multiclass prediction problems. It also bears out the surprising fact that multiclass prediction problems are significantly more robust to overfitting when reusing a test (or holdout) dataset. This offers an explanation as to why popular multiclass prediction benchmarks, such as ImageNet, may enjoy a longer lifespan than what intuition from literature on binary classification suggests.
△ Less
Submitted 24 May, 2019;
originally announced May 2019.
-
Measuring the Effects of Data Parallelism on Neural Network Training
Authors:
Christopher J. Shallue,
Jaehoon Lee,
Joseph Antognini,
Jascha Sohl-Dickstein,
Roy Frostig,
George E. Dahl
Abstract:
Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by…
▽ More
Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.
△ Less
Submitted 18 July, 2019; v1 submitted 8 November, 2018;
originally announced November 2018.
-
Modeling Dependence via Copula of Functionals of Fourier Coefficients
Authors:
Charles Fontaine,
Ron D. Frostig,
Hernando Ombao
Abstract:
The goal of this paper is to develop a measure for characterizing complex dependence between stationary time series that cannot be captured by traditional measures such as correlation and coherence. Our approach is to use copula models of functionals of the Fourier coefficients which is a generalization of coherence. Here, we use standard parametric copula models with a single parameter both from…
▽ More
The goal of this paper is to develop a measure for characterizing complex dependence between stationary time series that cannot be captured by traditional measures such as correlation and coherence. Our approach is to use copula models of functionals of the Fourier coefficients which is a generalization of coherence. Here, we use standard parametric copula models with a single parameter both from elliptical and Archimedean families. Our approach is to analyze changes in local field potentials in the rat cortex prior to and immediately following the onset of stroke. We present the necessary theoretical background, the multivariate models and an illustration of our methodology on these local field potential data. Simulations with non-linear dependent data show information that were missed by not taking into account dependence on specific frequencies. Moreover, these simulations demonstrate how our proposed method captures more complex non-linear dependence between time series. Finally, we illustrate our copula-based approach in the analysis of local field potentials of rats.
△ Less
Submitted 25 September, 2018;
originally announced September 2018.
-
Modeling non-linear spectral domain dependence using copulas with applications to rat local field potentials
Authors:
Charles Fontaine,
Ron D. Frostig,
Hernando Ombao
Abstract:
This paper intends to develop tools for characterizing non-linear spectral dependence between spontaneous brain signals. We use parametric copula models (both bivariate and vine models) applied on the magnitude of Fourier coefficients rather than using coherence. The motivation behind this work is an experiment on rats that studied the impact of stroke on the connectivity structure (dependence) be…
▽ More
This paper intends to develop tools for characterizing non-linear spectral dependence between spontaneous brain signals. We use parametric copula models (both bivariate and vine models) applied on the magnitude of Fourier coefficients rather than using coherence. The motivation behind this work is an experiment on rats that studied the impact of stroke on the connectivity structure (dependence) between local field potentials recorded at various channels. We address the following major questions. First, we ask whether one can detect any changepoint in the regime of a brain channel for a given frequency band based on a difference between the cumulative distribution functions modeled for each epoch (small window of time). Our proposed approach is an iterative algorithm which compares each successive bivariate copulas on all the epochs range, using a bivariate Kolmogorov-Smirnov statistic. Second, we ask whether stroke can alter the dependence structure of brain signals; and examine whether changes in dependence are present only in some channels or generalized across channels. These questions are addressed by comparing Vine-copulas models fitted for each epoch. We provide the necessary framework and show the effectiveness of our methods through the results for the local field potential data analysis of a rat.
△ Less
Submitted 24 September, 2018;
originally announced September 2018.
-
Random Features for Compositional Kernels
Authors:
Amit Daniely,
Roy Frostig,
Vineet Gupta,
Yoram Singer
Abstract:
We describe and analyze a simple random feature scheme (RFS) from prescribed compositional kernels. The compositional kernels we use are inspired by the structure of convolutional neural networks and kernels. The resulting scheme yields sparse and efficiently computable features. Each random feature can be represented as an algebraic expression over a small number of (random) paths in a compositio…
▽ More
We describe and analyze a simple random feature scheme (RFS) from prescribed compositional kernels. The compositional kernels we use are inspired by the structure of convolutional neural networks and kernels. The resulting scheme yields sparse and efficiently computable features. Each random feature can be represented as an algebraic expression over a small number of (random) paths in a composition tree. Thus, compositional random features can be stored compactly. The discrete nature of the generation process enables de-duplication of repeated features, further compacting the representation and increasing the diversity of the embeddings. Our approach complements and can be combined with previous random feature schemes.
△ Less
Submitted 22 March, 2017;
originally announced March 2017.
-
Estimation from Indirect Supervision with Linear Moments
Authors:
Aditi Raghunathan,
Roy Frostig,
John Duchi,
Percy Liang
Abstract:
In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estim…
▽ More
In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations.
△ Less
Submitted 10 August, 2016;
originally announced August 2016.
-
Principal Component Projection Without Principal Component Analysis
Authors:
Roy Frostig,
Cameron Musco,
Christopher Musco,
Aaron Sidford
Abstract:
We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression.
By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependenc…
▽ More
We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression.
By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression.
To achieve our results, we first observe that ridge regression can be used to obtain a "smooth projection" onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.
△ Less
Submitted 26 November, 2019; v1 submitted 22 February, 2016;
originally announced February 2016.
-
Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
Authors:
Amit Daniely,
Roy Frostig,
Yoram Singer
Abstract:
We develop a general duality between neural networks and compositional kernels, striving towards a better understanding of deep learning. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good s…
▽ More
We develop a general duality between neural networks and compositional kernels, striving towards a better understanding of deep learning. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective of neural networks and underscores their expressive power.
△ Less
Submitted 19 May, 2017; v1 submitted 18 February, 2016;
originally announced February 2016.
-
Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization
Authors:
Roy Frostig,
Rong Ge,
Sham M. Kakade,
Aaron Sidford
Abstract:
We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several a…
▽ More
We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem.
△ Less
Submitted 24 June, 2015;
originally announced June 2015.
-
Competing with the Empirical Risk Minimizer in a Single Pass
Authors:
Roy Frostig,
Rong Ge,
Sham M. Kakade,
Aaron Sidford
Abstract:
In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or…
▽ More
In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage.
We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:
* The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
* The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
* The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
* The algorithm is easily parallelizable.
Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.
△ Less
Submitted 25 February, 2015; v1 submitted 20 December, 2014;
originally announced December 2014.
-
A sub-constant improvement in approximating the positive semidefinite Grothendieck problem
Authors:
Roy Frostig,
Sida I. Wang
Abstract:
Semidefinite relaxations are a powerful tool for approximately solving combinatorial optimization problems such as MAX-CUT and the Grothendieck problem. By exploiting a bounded rank property of extreme points in the semidefinite cone, we make a sub-constant improvement in the approximation ratio of one such problem. Precisely, we describe a polynomial-time algorithm for the positive semidefinite G…
▽ More
Semidefinite relaxations are a powerful tool for approximately solving combinatorial optimization problems such as MAX-CUT and the Grothendieck problem. By exploiting a bounded rank property of extreme points in the semidefinite cone, we make a sub-constant improvement in the approximation ratio of one such problem. Precisely, we describe a polynomial-time algorithm for the positive semidefinite Grothendieck problem -- based on rounding from the standard relaxation -- that achieves a ratio of $2/π+ Θ(1/{\sqrt n})$, whereas the previous best is $2/π+ Θ(1/n)$. We further show a corresponding integrality gap of $2/π+\tilde{O}(1/n^{1/3})$.
△ Less
Submitted 10 August, 2014;
originally announced August 2014.
-
Relaxations for inference in restricted Boltzmann machines
Authors:
Sida I. Wang,
Roy Frostig,
Percy Liang,
Christopher D. Manning
Abstract:
We propose a relaxation-based approximate inference algorithm that samples near-MAP configurations of a binary pairwise Markov random field. We experiment on MAP inference tasks in several restricted Boltzmann machines. We also use our underlying sampler to estimate the log-partition function of restricted Boltzmann machines and compare against other sampling-based methods.
We propose a relaxation-based approximate inference algorithm that samples near-MAP configurations of a binary pairwise Markov random field. We experiment on MAP inference tasks in several restricted Boltzmann machines. We also use our underlying sampler to estimate the log-partition function of restricted Boltzmann machines and compare against other sampling-based methods.
△ Less
Submitted 2 January, 2014; v1 submitted 20 December, 2013;
originally announced December 2013.