subscribe to arXiv mailings

arXiv:1610.03774 [pdf, other]

Parallelizing Stochastic Gradient Descent for Least Squares Regression: mini-batching, averaging, and model misspecification

Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Aaron Sidford

Abstract: This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averagin… ▽ More This work characterizes the benefits of averaging schemes widely used in conjunction with stochastic gradient descent (SGD). In particular, this work provides a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of the stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD to decrease the variance in SGD's final iterate. This work presents non-asymptotic excess risk bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batch SGD yields provable near-linear parallelization speedups over SGD with batch size one. This allows for understanding learning rate versus batch size tradeoffs for the final iterate of an SGD method. These results are then utilized in providing a highly parallelizable SGD method that obtains the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD methods. A non-asymptotic analysis of communication efficient parallelization schemes such as model-averaging/parameter mixing methods is then provided. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure minimax risk for the agnostic case must be a function of the noise properties. This paper builds on the operator view of analyzing SGD methods, introduced by Defossez and Bach (2015), followed by developing a novel analysis in bounding these operators to characterize the excess risk. These techniques are of broader interest in analyzing computational aspects of stochastic approximation. △ Less

Submitted 31 July, 2018; v1 submitted 12 October, 2016; originally announced October 2016.

Comments: 39 pages. Published in the Journal of Machine Learning Research (JMLR)

arXiv:1605.08754 [pdf, other]

Faster Eigenvector Computation via Shift-and-Invert Preconditioning

Authors: Dan Garber, Elad Hazan, Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford

Abstract: We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix $Σ$ -- i.e. computing a unit vector $x$ such that $x^T Σx \ge (1-ε)λ_1(Σ)$: Offline Eigenvector Estimation: Given an explicit $A \in \mathbb{R}^{n \times d}$ with $Σ= A^TA$, we show how to compute an $ε$ approximate top eigenvector in time… ▽ More We give faster algorithms and improved sample complexities for estimating the top eigenvector of a matrix $Σ$ -- i.e. computing a unit vector $x$ such that $x^T Σx \ge (1-ε)λ_1(Σ)$: Offline Eigenvector Estimation: Given an explicit $A \in \mathbb{R}^{n \times d}$ with $Σ= A^TA$, we show how to compute an $ε$ approximate top eigenvector in time $\tilde O([nnz(A) + \frac{d*sr(A)}{gap^2} ]* \log 1/ε)$ and $\tilde O([\frac{nnz(A)^{3/4} (d*sr(A))^{1/4}}{\sqrt{gap}} ] * \log 1/ε)$. Here $nnz(A)$ is the number of nonzeros in $A$, $sr(A)$ is the stable rank, $gap$ is the relative eigengap. By separating the $gap$ dependence from the $nnz(A)$ term, our first runtime improves upon the classical power and Lanczos methods. It also improves prior work using fast subspace embeddings [AC09, CW13] and stochastic optimization [Sha15c], giving significantly better dependencies on $sr(A)$ and $ε$. Our second running time improves these further when $nnz(A) \le \frac{d*sr(A)}{gap^2}$. Online Eigenvector Estimation: Given a distribution $D$ with covariance matrix $Σ$ and a vector $x_0$ which is an $O(gap)$ approximate top eigenvector for $Σ$, we show how to refine to an $ε$ approximation using $ O(\frac{var(D)}{gap*ε})$ samples from $D$. Here $var(D)$ is a natural notion of variance. Combining our algorithm with previous work to initialize $x_0$, we obtain improved sample complexity and runtime results under a variety of assumptions on $D$. We achieve our results using a general framework that we believe is of independent interest. We give a robust analysis of the classic method of shift-and-invert preconditioning to reduce eigenvector computation to approximately solving a sequence of linear systems. We then apply fast stochastic variance reduced gradient (SVRG) based system solvers to achieve our claims. △ Less

Submitted 25 May, 2016; originally announced May 2016.

Comments: Appearing in ICML 2016. Combination of work in arXiv:1509.05647 and arXiv:1510.08896

arXiv:1605.08370 [pdf, ps, other]

Provable Efficient Online Matrix Completion via Non-convex Stochastic Gradient Descent

Authors: Chi Jin, Sham M. Kakade, Praneeth Netrapalli

Abstract: Matrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a widely studied problem in both theory and practice with wide applications. Most of the provable algorithms so far on this problem have been restricted to the offline setting where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications,… ▽ More Matrix completion, where we wish to recover a low rank matrix by observing a few entries from it, is a widely studied problem in both theory and practice with wide applications. Most of the provable algorithms so far on this problem have been restricted to the offline setting where they provide an estimate of the unknown matrix using all observations simultaneously. However, in many applications, the online version, where we observe one entry at a time and dynamically update our estimate, is more appealing. While existing algorithms are efficient for the offline setting, they could be highly inefficient for the online setting. In this paper, we propose the first provable, efficient online algorithm for matrix completion. Our algorithm starts from an initial estimate of the matrix and then performs non-convex stochastic gradient descent (SGD). After every observation, it performs a fast update involving only one row of two tall matrices, giving near linear total runtime. Our algorithm can be naturally used in the offline setting as well, where it gives competitive sample complexity and runtime to state of the art algorithms. Our proofs introduce a general framework to show that SGD updates tend to stay away from saddle surfaces and could be of broader interests for other non-convex problems to prove tight rates. △ Less

Submitted 26 May, 2016; originally announced May 2016.

arXiv:1604.03930 [pdf, ps, other]

Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

Authors: Rong Ge, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford

Abstract: This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide si… ▽ More This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-$k$ generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of $O(\frac{z k \sqrtκ}ρ \log(1/ε) \log \left(kκ/ρ\right))$ where $z$ is the total number of nonzero entries, $κ$ is the condition number and $ρ$ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets. △ Less

Submitted 27 May, 2016; v1 submitted 13 April, 2016; originally announced April 2016.

Comments: International Conference on Machine Learning (ICML) 2016

arXiv:1602.06929 [pdf, ps, other]

Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja's Algorithm

Authors: Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli, Aaron Sidford

Abstract: This work provides improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = Σ$ for $Σ\succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $Σ$. The algorithm nearly matches (and in… ▽ More This work provides improved guarantees for streaming principle component analysis (PCA). Given $A_1, \ldots, A_n\in \mathbb{R}^{d\times d}$ sampled independently from distributions satisfying $\mathbb{E}[A_i] = Σ$ for $Σ\succeq \mathbf{0}$, this work provides an $O(d)$-space linear-time single-pass streaming algorithm for estimating the top eigenvector of $Σ$. The algorithm nearly matches (and in certain cases improves upon) the accuracy obtained by the standard batch method that computes top eigenvector of the empirical covariance $\frac{1}{n} \sum_{i \in [n]} A_i$ as analyzed by the matrix Bernstein inequality. Moreover, to achieve constant accuracy, our algorithm improves upon the best previous known sample complexities of streaming algorithms by either a multiplicative factor of $O(d)$ or $1/\mathrm{gap}$ where $\mathrm{gap}$ is the relative distance between the top two eigenvalues of $Σ$. These results are achieved through a novel analysis of the classic Oja's algorithm, one of the oldest and most popular algorithms for streaming PCA. In particular, this work shows that simply picking a random initial point $w_0$ and applying the update rule $w_{i + 1} = w_i + η_i A_i w_i$ suffices to accurately estimate the top eigenvector, with a suitable choice of $η_i$. We believe our result sheds light on how to efficiently perform streaming PCA both in theory and in practice and we hope that our analysis may serve as the basis for analyzing many variants and extensions of streaming PCA. △ Less

Submitted 28 March, 2016; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: Updated title

arXiv:1602.06586 [pdf, ps, other]

Recovering Structured Probability Matrices

Authors: Qingqing Huang, Sham M. Kakade, Weihao Kong, Gregory Valiant

Abstract: We consider the problem of accurately recovering a matrix B of size M by M , which represents a probability distribution over M2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal recons… ▽ More We consider the problem of accurately recovering a matrix B of size M by M , which represents a probability distribution over M2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient algorithm that accurately recovers the underlying M by M matrix using Theta(M) samples. This result easily translates to Theta(M) sample algorithms for learning topic models and learning hidden Markov Models. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. We provide an even stronger lower bound where distinguishing whether a sequence of observations were drawn from the uniform distribution over M observations versus being generated by an HMM with two hidden states requires Omega(M) observations. This precludes sublinear-sample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear sample estimators for quantities such as the entropy rate of HMMs. △ Less

Submitted 5 February, 2018; v1 submitted 21 February, 2016; originally announced February 2016.

arXiv:1510.08896 [pdf, other]

Robust Shift-and-Invert Preconditioning: Faster and More Sample Efficient Algorithms for Eigenvector Computation

Authors: Chi Jin, Sham M. Kakade, Cameron Musco, Praneeth Netrapalli, Aaron Sidford

Abstract: We provide faster algorithms and improved sample complexities for approximating the top eigenvector of a matrix. Offline Setting: Given an $n \times d$ matrix $A$, we show how to compute an $ε$ approximate top eigenvector in time $\tilde O ( [nnz(A) + \frac{d \cdot sr(A)}{gap^2}]\cdot \log 1/ε)$ and $\tilde O([\frac{nnz(A)^{3/4} (d \cdot sr(A))^{1/4}}{\sqrt{gap}}]\cdot \log1/ε)$. Here $sr(A)$ is… ▽ More We provide faster algorithms and improved sample complexities for approximating the top eigenvector of a matrix. Offline Setting: Given an $n \times d$ matrix $A$, we show how to compute an $ε$ approximate top eigenvector in time $\tilde O ( [nnz(A) + \frac{d \cdot sr(A)}{gap^2}]\cdot \log 1/ε)$ and $\tilde O([\frac{nnz(A)^{3/4} (d \cdot sr(A))^{1/4}}{\sqrt{gap}}]\cdot \log1/ε)$. Here $sr(A)$ is the stable rank and $gap$ is the multiplicative eigenvalue gap. By separating the $gap$ dependence from $nnz(A)$ we improve on the classic power and Lanczos methods. We also improve prior work using fast subspace embeddings and stochastic optimization, giving significantly improved dependencies on $sr(A)$ and $ε$. Our second running time improves this further when $nnz(A) \le \frac{d\cdot sr(A)}{gap^2}$. Online Setting: Given a distribution $D$ with covariance matrix $Σ$ and a vector $x_0$ which is an $O(gap)$ approximate top eigenvector for $Σ$, we show how to refine to an $ε$ approximation using $\tilde O(\frac{v(D)}{gap^2} + \frac{v(D)}{gap \cdot ε})$ samples from $D$. Here $v(D)$ is a natural variance measure. Combining our algorithm with previous work to initialize $x_0$, we obtain a number of improved sample complexity and runtime results. For general distributions, we achieve asymptotically optimal accuracy as a function of sample size as the number of samples grows large. Our results center around a robust analysis of the classic method of shift-and-invert preconditioning to reduce eigenvector computation to approximately solving a sequence of linear systems. We then apply fast SVRG based approximate system solvers to achieve our claims. We believe our results suggest the general effectiveness of shift-and-invert based approaches and imply that further computational gains may be reaped in practice. △ Less

Submitted 29 May, 2016; v1 submitted 29 October, 2015; originally announced October 2015.

Comments: Manuscript outdated. Updated version at arxiv:1605.08754

arXiv:1509.07943 [pdf, other]

Super-Resolution Off the Grid

Authors: Qingqing Huang, Sham M. Kakade

Abstract: Super-resolution is the problem of recovering a superposition of point sources using bandlimited measurements, which may be corrupted with noise. This signal processing problem arises in numerous imaging problems, ranging from astronomy to biology to spectroscopy, where it is common to take (coarse) Fourier measurements of an object. Of particular interest is in obtaining estimation procedures whi… ▽ More Super-resolution is the problem of recovering a superposition of point sources using bandlimited measurements, which may be corrupted with noise. This signal processing problem arises in numerous imaging problems, ranging from astronomy to biology to spectroscopy, where it is common to take (coarse) Fourier measurements of an object. Of particular interest is in obtaining estimation procedures which are robust to noise, with the following desirable statistical and computational properties: we seek to use coarse Fourier measurements (bounded by some cutoff frequency); we hope to take a (quantifiably) small number of measurements; we desire our algorithm to run quickly. Suppose we have k point sources in d dimensions, where the points are separated by at least Δfrom each other (in Euclidean distance). This work provides an algorithm with the following favorable guarantees: - The algorithm uses Fourier measurements, whose frequencies are bounded by O(1/Δ) (up to log factors). Previous algorithms require a cutoff frequency which may be as large as Ω( d/Δ). - The number of measurements taken by and the computational complexity of our algorithm are bounded by a polynomial in both the number of points k and the dimension d, with no dependence on the separation Δ. In contrast, previous algorithms depended inverse polynomially on the minimal separation and exponentially on the dimension for both of these quantities. Our estimation procedure itself is simple: we take random bandlimited measurements (as opposed to taking an exponential number of measurements on the hyper-grid). Furthermore, our analysis and algorithm are elementary (based on concentration bounds for sampling and the singular value decomposition). △ Less

Submitted 25 September, 2015; originally announced September 2015.

arXiv:1507.05854 [pdf, other]

Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot

Authors: Prateek Jain, Chi Jin, Sham M. Kakade, Praneeth Netrapalli

Abstract: While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, we take the first step in getting the best of both worlds -- establishing global… ▽ More While there has been a significant amount of work studying gradient descent techniques for non-convex optimization problems over the last few years, all existing results establish either local convergence with good rates or global convergence with highly suboptimal rates, for many problems of interest. In this paper, we take the first step in getting the best of both worlds -- establishing global convergence and obtaining a good rate of convergence for the problem of computing squareroot of a positive definite (PD) matrix, which is a widely studied problem in numerical linear algebra with applications in machine learning and statistics among others. Given a PD matrix $M$ and a PD starting point $U_0$, we show that gradient descent with appropriately chosen step-size finds an $ε$-accurate squareroot of $M$ in $O(α\log (\|M-U_0^2\|_F /ε))$ iterations, where $α= (\max\{\|U_0\|_2^2,\|M\|_2\} / \min \{σ_{\min}^2(U_0),σ_{\min}(M) \} )^{3/2}$. Our result is the first to establish global convergence for this problem and that it is robust to errors in each iteration. A key contribution of our work is the general proof technique which we believe should further excite research in understanding deterministic and stochastic variants of simple non-convex gradient descent algorithms with good global convergence rates for other problems in machine learning and numerical linear algebra. △ Less

Submitted 9 March, 2017; v1 submitted 21 July, 2015; originally announced July 2015.

Comments: Appear in AISTATS 2017

arXiv:1506.07512 [pdf, other]

Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization

Authors: Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Abstract: We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several a… ▽ More We develop a family of accelerated stochastic algorithms that minimize sums of convex functions. Our algorithms improve upon the fastest running time for empirical risk minimization (ERM), and in particular linear least-squares regression, across a wide range of problem settings. To achieve this, we establish a framework based on the classical proximal point algorithm. Namely, we provide several algorithms that reduce the minimization of a strongly convex function to approximate minimizations of regularizations of the function. Using these results, we accelerate recent fast stochastic algorithms in a black-box fashion. Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that are advantageous in practice. Both in theory and in practice, the provided algorithms reap the computational benefits of adding a large strongly convex regularization term, without incurring a corresponding bias to the original problem. △ Less

Submitted 24 June, 2015; originally announced June 2015.

arXiv:1503.00424 [pdf, other]

Learning Mixtures of Gaussians in High Dimensions

Authors: Rong Ge, Qingqing Huang, Sham M. Kakade

Abstract: Efficiently learning mixture of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in Rn, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many… ▽ More Efficiently learning mixture of Gaussians is a fundamental problem in statistics and learning theory. Given samples coming from a random one out of k Gaussian distributions in Rn, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This learning problem arises in many areas ranging from the natural sciences to the social sciences, and has also found many machine learning applications. Unfortunately, learning mixture of Gaussians is an information theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is exponential in the number of Gaussian components in the worst case. In this work, we show that provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give an algorithm that learns the parameters with polynomial running time and using polynomial number of samples. The central algorithmic ideas consist of new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor are derived from the combinatorial structure of higher order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings. △ Less

Submitted 9 March, 2015; v1 submitted 2 March, 2015; originally announced March 2015.

arXiv:1412.6606 [pdf, other]

Competing with the Empirical Risk Minimizer in a Single Pass

Authors: Roy Frostig, Rong Ge, Sham M. Kakade, Aaron Sidford

Abstract: In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or… ▽ More In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties: * The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample. * The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors. * The algorithm's performance depends on the initial error at a rate that decreases super-polynomially. * The algorithm is easily parallelizable. Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM. △ Less

Submitted 25 February, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

arXiv:1310.1949 [pdf, other]

Least Squares Revisited: Scalable Approaches for Multi-class Prediction

Authors: Alekh Agarwal, Sham M. Kakade, Nikos Karampatziakis, Le Song, Gregory Valiant

Abstract: This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees… ▽ More This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we present a scalable stagewise variant of our approach, which achieves dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies. △ Less

Submitted 21 October, 2013; v1 submitted 7 October, 2013; originally announced October 2013.

arXiv:1302.2684 [pdf, ps, other]

A Tensor Approach to Learning Mixed Membership Community Models

Authors: Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade

Abstract: Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed… ▽ More Community detection is the task of detecting hidden communities from observed interactions. Guaranteed community detection has so far been mostly limited to models with non-overlapping communities such as the stochastic block model. In this paper, we remove this restriction, and provide guaranteed community detection for a family of probabilistic network models with overlapping communities, termed as the mixed membership Dirichlet model, first introduced by Airoldi et al. This model allows for nodes to have fractional memberships in multiple communities and assumes that the community memberships are drawn from a Dirichlet distribution. Moreover, it contains the stochastic block model as a special case. We propose a unified approach to learning these models via a tensor spectral decomposition method. Our estimator is based on low-order moment tensor of the observed network, consisting of 3-star counts. Our learning method is fast and is based on simple linear algebraic operations, e.g. singular value decomposition and tensor power iterations. We provide guaranteed recovery of community memberships and model parameters and present a careful finite sample analysis of our learning method. As an important special case, our results match the best known scaling requirements for the (homogeneous) stochastic block model. △ Less

Submitted 24 October, 2013; v1 submitted 11 February, 2013; originally announced February 2013.

arXiv:1211.5414 [pdf, ps, other]

Analysis of a randomized approximation scheme for matrix multiplication

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors. This note gives a simple analysis of a randomized approximation scheme for matrix multiplication proposed by Sarlos (2006) based on a random rotation followed by uniform column sampling. The result follows from a matrix version of Bernstein's inequality and a tail inequality for quadratic forms in subgaussian random vectors. △ Less

Submitted 23 November, 2012; originally announced November 2012.

arXiv:1210.7559 [pdf, ps, other]

Tensor decompositions for learning latent variable models

Authors: Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, Matus Telgarsky

Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to… ▽ More This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models. △ Less

Submitted 13 November, 2014; v1 submitted 29 October, 2012; originally announced October 2012.

Journal ref: Journal of Machine Learning Research, 15(Aug):2773-2832, 2014

arXiv:1209.5350 [pdf, other]

Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints

Authors: Animashree Anandkumar, Daniel Hsu, Adel Javanmard, Sham M. Kakade

Abstract: Unsupervised estimation of latent variable models is a fundamental problem central to numerous applications of machine learning and statistics. This work presents a principled approach for estimating broad classes of such models, including probabilistic topic models and latent linear Bayesian networks, using only second-order observed moments. The sufficient conditions for identifiability of these… ▽ More Unsupervised estimation of latent variable models is a fundamental problem central to numerous applications of machine learning and statistics. This work presents a principled approach for estimating broad classes of such models, including probabilistic topic models and latent linear Bayesian networks, using only second-order observed moments. The sufficient conditions for identifiability of these models are primarily based on weak expansion constraints on the topic-word matrix, for topic models, and on the directed acyclic graph, for Bayesian networks. Because no assumptions are made on the distribution among the latent variables, the approach can handle arbitrary correlations among the topics or latent factors. In addition, a tractable learning method via $\ell_1$ optimization is proposed and studied in numerical experiments. △ Less

Submitted 24 May, 2013; v1 submitted 24 September, 2012; originally announced September 2012.

Comments: 38 pages, 6 figures, 2 tables, applications in topic models and Bayesian networks are studied. Simulation section is added

arXiv:1207.1388 [pdf]

Planning in POMDPs Using Multiplicity Automata

Authors: Eyal Even-Dar, Sham M. Kakade, Yishay Mansour

Abstract: Planning and learning in Partially Observable MDPs (POMDPs) are among the most challenging tasks in both the AI and Operation Research communities. Although solutions to these problems are intractable in general, there might be special cases, such as structured POMDPs, which can be solved efficiently. A natural and possibly efficient way to represent a POMDP is through the predictive state represe… ▽ More Planning and learning in Partially Observable MDPs (POMDPs) are among the most challenging tasks in both the AI and Operation Research communities. Although solutions to these problems are intractable in general, there might be special cases, such as structured POMDPs, which can be solved efficiently. A natural and possibly efficient way to represent a POMDP is through the predictive state representation (PSR) - a representation which recently has been receiving increasing attention. In this work, we relate POMDPs to multiplicity automata- showing that POMDPs can be represented by multiplicity automata with no increase in the representation size. Furthermore, we show that the size of the multiplicity automaton is equal to the rank of the predictive state representation. Therefore, we relate both the predictive state representation and POMDPs to the well-founded multiplicity automata literature. Based on the multiplicity automata representation, we provide a planning algorithm which is exponential only in the multiplicity automata rank rather than the number of states of the POMDP. As a result, whenever the predictive state representation is logarithmic in the standard POMDP representation, our planning algorithm is efficient. △ Less

Submitted 4 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

Report number: UAI-P-2005-PG-185-192

arXiv:1206.5766 [pdf, ps, other]

Learning mixtures of spherical Gaussians: moment methods and spectral decompositions

Authors: Daniel Hsu, Sham M. Kakade

Abstract: This work provides a computationally efficient and statistically consistent moment-based estimator for mixtures of spherical Gaussians. Under the condition that component means are in general position, a simple spectral decomposition technique yields consistent parameter estimates from low-order observable moments, without additional minimum separation assumptions needed by previous computationall… ▽ More This work provides a computationally efficient and statistically consistent moment-based estimator for mixtures of spherical Gaussians. Under the condition that component means are in general position, a simple spectral decomposition technique yields consistent parameter estimates from low-order observable moments, without additional minimum separation assumptions needed by previous computationally efficient estimation procedures. Thus computational and information-theoretic barriers to efficient estimation in mixture models are precluded when the mixture components have means in general position and spherical covariances. Some connections are made to estimation problems related to independent component analysis. △ Less

Submitted 28 October, 2012; v1 submitted 25 June, 2012; originally announced June 2012.

arXiv:1206.3137 [pdf, other]

Identifiability and Unmixing of Latent Parse Trees

Authors: Daniel Hsu, Sham M. Kakade, Percy Liang

Abstract: This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficient… ▽ More This paper explores unsupervised learning of parsing models along two directions. First, which models are identifiable from infinite data? We use a general technique for numerically checking identifiability based on the rank of a Jacobian matrix, and apply it to several standard constituency and dependency parsing models. Second, for identifiable models, how do we estimate the parameters efficiently? EM suffers from local optima, while recent work using spectral methods cannot be directly applied since the topology of the parse tree varies across sentences. We develop a strategy, unmixing, which deals with this additional complexity for restricted classes of parsing models. △ Less

Submitted 14 June, 2012; originally announced June 2012.

arXiv:1204.6703 [pdf, ps, other]

A Spectral Algorithm for Latent Dirichlet Allocation

Authors: Animashree Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, Yi-Kai Liu

Abstract: The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimat… ▽ More The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on $k\times k$ matrices, where $k$ is the number of latent factors (e.g. the number of topics), rather than in the $d$-dimensional observed space (typically $d \gg k$). △ Less

Submitted 17 January, 2013; v1 submitted 30 April, 2012; originally announced April 2012.

Comments: Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 2012

arXiv:1203.0697 [pdf, ps, other]

Learning High-Dimensional Mixtures of Graphical Models

Authors: A. Anandkumar, D. Hsu, F. Huang, S. M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable corresponding to the mixture components is hidden and each mixture component over the observed variables can have a potentially different Markov graph structure and parameters. We propose a novel approach for estimating the mixture components, and our output is a tree-mixture model which serves a… ▽ More We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable corresponding to the mixture components is hidden and each mixture component over the observed variables can have a potentially different Markov graph structure and parameters. We propose a novel approach for estimating the mixture components, and our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. Our method is efficient when the union graph, which is the union of the Markov graphs of the mixture components, has sparse vertex separators between any pair of observed variables. This includes tree mixtures and mixtures of bounded degree graphs. For such models, we prove that our method correctly recovers the union graph structure and the tree structures corresponding to maximum-likelihood tree approximations of the mixture components. The sample and computational complexities of our method scale as $\poly(p, r)$, for an $r$-component mixture of $p$-variate graphical models. We further extend our results to the case when the union graph has sparse local separators between any pair of observed variables, such as mixtures of locally tree-like graphs, and the mixture components are in the regime of correlation decay. △ Less

Submitted 30 June, 2012; v1 submitted 3 March, 2012; originally announced March 2012.

arXiv:1203.0683 [pdf, ps, other]

A Method of Moments for Mixture Models and Hidden Markov Models

Authors: Animashree Anandkumar, Daniel Hsu, Sham M. Kakade

Abstract: Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity which typi… ▽ More Mixture models are a fundamental tool in applied statistics and machine learning for treating data taken from multiple subpopulations. The current practice for estimating the parameters of such models relies on local search heuristics (e.g., the EM algorithm) which are prone to failure, and existing consistent methods are unfavorable due to their high computational and sample complexity which typically scale exponentially with the number of mixture components. This work develops an efficient method of moments approach to parameter estimation for a broad class of high-dimensional mixture models with many components, including multi-view mixtures of Gaussians (such as mixtures of axis-aligned Gaussians) and hidden Markov models. The new method leads to rigorous unsupervised learning results for mixture models that were not achieved by previous works; and, because of its simplicity, it offers a viable alternative to EM for practical deployment. △ Less

Submitted 5 September, 2012; v1 submitted 3 March, 2012; originally announced March 2012.

arXiv:1202.3079 [pdf, ps, other]

Towards minimax policies for online linear optimization with bandit feedback

Authors: Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade

Abstract: We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and give… ▽ More We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and gives a regret bound of order $d \sqrt{n \log n}$ for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in $d$ dimension is the same as for the basic $d$-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax optimal regret bounds in specific examples. More precisely we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a $d \sqrt{n}$ regret, thus improving by a factor $\sqrt{d \log n}$ over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{d n \log n}$ regret, again shaving off an extraneous $\sqrt{d}$ compared to previous works. △ Less

Submitted 14 February, 2012; originally announced February 2012.

arXiv:1110.2842 [pdf, ps, other]

A tail inequality for quadratic forms of subgaussian random vectors

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries. We prove an exponential probability tail inequality for positive semidefinite quadratic forms in a subgaussian random vector. The bound is analogous to one that holds when the vector has independent Gaussian entries. △ Less

Submitted 13 October, 2011; originally announced October 2011.

arXiv:1107.1744 [pdf, other]

Stochastic convex optimization with bandit feedback

Authors: Alekh Agarwal, Dean P. Foster, Daniel Hsu, Sham M. Kakade, Alexander Rakhlin

Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $\xset$ under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value $f(x)$ at any query point $x \in \xset$. The quantity of interest is the regret of the algorithm, which is the sum of the function values at algorithm'… ▽ More This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $\xset$ under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value $f(x)$ at any query point $x \in \xset$. The quantity of interest is the regret of the algorithm, which is the sum of the function values at algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs $\otil(\poly(d)\sqrt{T})$ regret. Since any algorithm has regret at least $Ω(\sqrt{T})$ on this problem, our algorithm is optimal in terms of the scaling with $T$. △ Less

Submitted 8 October, 2011; v1 submitted 8 July, 2011; originally announced July 2011.

arXiv:1107.1283 [pdf, other]

Spectral Methods for Learning Multivariate Latent Tree Structure

Authors: Animashree Anandkumar, Kamalika Chaudhuri, Daniel Hsu, Sham M. Kakade, Le Song, Tong Zhang

Abstract: This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the… ▽ More This work considers the problem of learning the structure of multivariate linear tree models, which include a variety of directed tree graphical models with continuous, discrete, and mixed latent variables such as linear-Gaussian models, hidden Markov models, Gaussian mixture models, and Markov evolutionary trees. The setting is one where we only have samples from certain observed variables in the tree, and our goal is to estimate the tree structure (i.e., the graph of how the underlying hidden variables are connected to each other and to the observed variables). We propose the Spectral Recursive Grouping algorithm, an efficient and simple bottom-up procedure for recovering the tree structure from independent samples of the observed variables. Our finite sample size bounds for exact recovery of the tree structure reveal certain natural dependencies on underlying statistical and structural properties of the underlying joint distribution. Furthermore, our sample complexity guarantees have no explicit dependence on the dimensionality of the observed variables, making the algorithm applicable to many high-dimensional settings. At the heart of our algorithm is a spectral quartet test for determining the relative topology of a quartet of variables from second-order statistics. △ Less

Submitted 8 November, 2011; v1 submitted 6 July, 2011; originally announced July 2011.

arXiv:1106.2363 [pdf, ps, other]

Random design analysis of ridge regression

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the ``out-of-sample'' prediction error, as opposed to the ``in-sample'' (fixed design) error. The analysis also reveals the effect of… ▽ More This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the ``out-of-sample'' prediction error, as opposed to the ``in-sample'' (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which effects are present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices. △ Less

Submitted 24 March, 2014; v1 submitted 12 June, 2011; originally announced June 2011.

arXiv:1105.0875 [pdf, ps, other]

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Authors: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a Principal Component Analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method is within a constant factor (name… ▽ More We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a Principal Component Analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method is within a constant factor (namely 4) of the risk of ridge regression. △ Less

Submitted 31 May, 2013; v1 submitted 4 May, 2011; originally announced May 2011.

Comments: Appearing in JMLR 14, June 2013

arXiv:1104.1672 [pdf, ps, other]

Dimension-free tail inequalities for sums of random matrices

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and… ▽ More We derive exponential tail inequalities for sums of random matrices with no dependence on the explicit matrix dimensions. These are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the dimension is large or infinite. Some applications to principal component analysis and approximate matrix multiplication are given to illustrate the utility of the new bounds. △ Less

Submitted 16 April, 2011; v1 submitted 9 April, 2011; originally announced April 2011.

arXiv:1011.1518 [pdf, ps, other]

Robust Matrix Decomposition with Outliers

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix (outliers), and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions… ▽ More Suppose a given observation matrix can be decomposed as the sum of a low-rank matrix and a sparse matrix (outliers), and the goal is to recover these individual components from the observed sum. Such additive decompositions have applications in a variety of numerical problems including system identification, latent variable graphical modeling, and principal components analysis. We study conditions under which recovering such a decomposition is possible via a combination of $\ell_1$ norm and trace norm minimization. We are specifically interested in the question of how many outliers are allowed so that convex programming can still achieve accurate recovery, and we obtain stronger recovery guarantees than previous studies. Moreover, we do not assume that the spatial pattern of outliers is random, which stands in contrast to related analyses under such assumptions via matrix completion. △ Less

Submitted 3 December, 2010; v1 submitted 5 November, 2010; originally announced November 2010.

Comments: Corrected comparisons to previous work of Candes et al (2009)

arXiv:1001.4598 [pdf, ps, other]

An Optimal Dynamic Mechanism for Multi-Armed Bandit Processes

Authors: Sham M. Kakade, Ilan Lobel, Hamid Nazerzadeh

Abstract: We consider the problem of revenue-optimal dynamic mechanism design in settings where agents' types evolve over time as a function of their (both public and private) experience with items that are auctioned repeatedly over an infinite horizon. A central question here is understanding what natural restrictions on the environment permit the design of optimal mechanisms (note that even in the simpler… ▽ More We consider the problem of revenue-optimal dynamic mechanism design in settings where agents' types evolve over time as a function of their (both public and private) experience with items that are auctioned repeatedly over an infinite horizon. A central question here is understanding what natural restrictions on the environment permit the design of optimal mechanisms (note that even in the simpler static setting, optimal mechanisms are characterized only under certain restrictions). We provide a {\em structural characterization} of a natural "separable: multi-armed bandit environment (where the evolution and incentive structure of the a-priori type is decoupled from the subsequent experience in a precise sense) where dynamic optimal mechanism design is possible. Here, we present the Virtual Index Mechanism, an optimal dynamic mechanism, which maximizes the (long term) {\em virtual surplus} using the classical Gittins algorithm. The mechanism optimally balances exploration and exploitation, taking incentives into account. △ Less

Submitted 15 October, 2010; v1 submitted 26 January, 2010; originally announced January 2010.

arXiv:0912.3995 [pdf, other]

doi 10.1109/TIT.2011.2182033

Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design

Authors: Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias Seeger

Abstract: Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-U… ▽ More Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristical GP optimization approaches. △ Less

Submitted 9 June, 2010; v1 submitted 20 December, 2009; originally announced December 2009.

arXiv:0911.0054 [pdf, ps, other]

Learning Exponential Families in High-Dimensions: Strong Convexity and Sparsity

Authors: Sham M. Kakade, Ohad Shamir, Karthik Sridharan, Ambuj Tewari

Abstract: The versatility of exponential families, along with their attendant convexity properties, make them a popular and effective statistical model. A central issue is learning these models in high-dimensions, such as when there is some sparsity pattern of the optimal parameter. This work characterizes a certain strong convexity property of general exponential families, which allow their generalization… ▽ More The versatility of exponential families, along with their attendant convexity properties, make them a popular and effective statistical model. A central issue is learning these models in high-dimensions, such as when there is some sparsity pattern of the optimal parameter. This work characterizes a certain strong convexity property of general exponential families, which allow their generalization ability to be quantified. In particular, we show how this property can be used to analyze generic exponential families under L_1 regularization. △ Less

Submitted 16 May, 2015; v1 submitted 30 October, 2009; originally announced November 2009.

Comments: Errata added. Incorrect claim about cumulants of the Bernoulli distribution fixed

ACM Class: I.2.6

arXiv:0910.0610 [pdf, ps, other]

Regularization Techniques for Learning with Matrices

Authors: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari

Abstract: There is growing body of learning problems for which it is natural to organize the parameters into matrix, so as to appropriately regularize the parameters under some matrix norm (in order to impose some more sophisticated prior knowledge). This work describes and analyzes a systematic method for constructing such matrix-based, regularization methods. In particular, we focus on how the underlying… ▽ More There is growing body of learning problems for which it is natural to organize the parameters into matrix, so as to appropriately regularize the parameters under some matrix norm (in order to impose some more sophisticated prior knowledge). This work describes and analyzes a systematic method for constructing such matrix-based, regularization methods. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on the known duality fact: that a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and kernel learning. △ Less

Submitted 17 October, 2010; v1 submitted 4 October, 2009; originally announced October 2009.

ACM Class: I.2.6

arXiv:0902.1284 [pdf, ps, other]

Multi-Label Prediction via Compressed Sensing

Authors: Daniel Hsu, Sham M. Kakade, John Langford, Tong Zhang

Abstract: We consider multi-label prediction problems with large output spaces under the assumption of output sparsity -- that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regressi… ▽ More We consider multi-label prediction problems with large output spaces under the assumption of output sparsity -- that the target (label) vectors have small support. We develop a general theory for a variant of the popular error correcting output code scheme, using ideas from compressed sensing for exploiting this sparsity. The method can be regarded as a simple reduction from multi-label regression problems to binary regression problems. We show that the number of subproblems need only be logarithmic in the total number of possible labels, making this approach radically more efficient than others. We also state and prove robustness guarantees for this method in the form of regret transform bounds (in general), and also provide a more detailed analysis for the linear prediction setting. △ Less

Submitted 2 June, 2009; v1 submitted 7 February, 2009; originally announced February 2009.

arXiv:0811.4413 [pdf, ps, other]

A Spectral Algorithm for Learning Hidden Markov Models

Authors: Daniel Hsu, Sham M. Kakade, Tong Zhang

Abstract: Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the small… ▽ More Hidden Markov Models (HMMs) are one of the most fundamental and widely used statistical tools for modeling discrete time series. In general, learning HMMs from data is computationally hard (under cryptographic assumptions), and practitioners typically resort to search heuristics which suffer from the usual local optima issues. We prove that under a natural separation condition (bounds on the smallest singular value of the HMM parameters), there is an efficient and provably correct algorithm for learning HMMs. The sample complexity of the algorithm does not explicitly depend on the number of distinct (discrete) observations---it implicitly depends on this quantity through spectral properties of the underlying HMM. This makes the algorithm particularly applicable to settings with a large number of observations, such as those in natural language processing where the space of observation is sometimes the words in a language. The algorithm is also simple, employing only a singular value decomposition and matrix multiplications. △ Less

Submitted 6 July, 2012; v1 submitted 26 November, 2008; originally announced November 2008.

Comments: Published in JCSS Special Issue "Learning Theory 2009"

Journal ref: Journal of Computer and System Sciences, 78(5):1460-1480, 2012

Showing 51–87 of 87 results for author: Kakade, S M