-
Singular Subspace Perturbation Bounds via Rectangular Random Matrix Diffusions
Authors:
Peiyao Lai,
Oren Mangoubi
Abstract:
Given a matrix $A \in \mathbb{R}^{m\times d}$ with singular values $σ_1\geq \cdots \geq σ_d$, and a random matrix $G \in \mathbb{R}^{m\times d}$ with iid $N(0,T)$ entries for some $T>0$, we derive new bounds on the Frobenius distance between subspaces spanned by the top-$k$ (right) singular vectors of $A$ and $A+G$. This problem arises in numerous applications in statistics where a data matrix may…
▽ More
Given a matrix $A \in \mathbb{R}^{m\times d}$ with singular values $σ_1\geq \cdots \geq σ_d$, and a random matrix $G \in \mathbb{R}^{m\times d}$ with iid $N(0,T)$ entries for some $T>0$, we derive new bounds on the Frobenius distance between subspaces spanned by the top-$k$ (right) singular vectors of $A$ and $A+G$. This problem arises in numerous applications in statistics where a data matrix may be corrupted by Gaussian noise, and in the analysis of the Gaussian mechanism in differential privacy, where Gaussian noise is added to data to preserve private information. We show that, for matrices $A$ where the gaps in the top-$k$ singular values are roughly $Ω(σ_k-σ_{k+1})$ the expected Frobenius distance between the subspaces is $\tilde{O}(\frac{\sqrt{d}}{σ_k-σ_{k+1}} \times \sqrt{T})$, improving on previous bounds by a factor of $\frac{\sqrt{m}}{\sqrt{d}} \sqrt{k}$. To obtain our bounds we view the perturbation to the singular vectors as a diffusion process -- the Dyson-Bessel process -- and use tools from stochastic calculus to track the evolution of the subspace spanned by the top-$k$ singular vectors.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Private Covariance Approximation and Eigenvalue-Gap Bounds for Complex Gaussian Perturbations
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,δ)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and show that the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$ is bounded by roughly $\tilde{O}(\sqrt{kd})$, wh…
▽ More
We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,δ)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and show that the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$ is bounded by roughly $\tilde{O}(\sqrt{kd})$, whenever there is an appropriately large gap between the $k$'th and the $k+1$'th eigenvalues of $M$. This improves on previous work that requires that the gap between every pair of top-$k$ eigenvalues of $M$ is at least $\sqrt{d}$ for a similar bound. Our analysis leverages the fact that the eigenvalues of complex matrix Brownian motion repel more than in the real case, and uses Dyson's stochastic differential equations governing the evolution of its eigenvalues to show that the eigenvalues of the matrix $M$ perturbed by complex Gaussian noise have large gaps with high probability. Our results contribute to the analysis of low-rank approximations under average-case perturbations and to an understanding of eigenvalue gaps for random matrices, which may be of independent interest.
△ Less
Submitted 28 June, 2023;
originally announced June 2023.
-
Re-Analyze Gauss: Bounds for Private Matrix Approximation via Dyson Brownian Motion
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
Given a symmetric matrix $M$ and a vector $λ$, we present new bounds on the Frobenius-distance utility of the Gaussian mechanism for approximating $M$ by a matrix whose spectrum is $λ$, under $(\varepsilon,δ)$-differential privacy. Our bounds depend on both $λ$ and the gaps in the eigenvalues of $M$, and hold whenever the top $k+1$ eigenvalues of $M$ have sufficiently large gaps. When applied to t…
▽ More
Given a symmetric matrix $M$ and a vector $λ$, we present new bounds on the Frobenius-distance utility of the Gaussian mechanism for approximating $M$ by a matrix whose spectrum is $λ$, under $(\varepsilon,δ)$-differential privacy. Our bounds depend on both $λ$ and the gaps in the eigenvalues of $M$, and hold whenever the top $k+1$ eigenvalues of $M$ have sufficiently large gaps. When applied to the problems of private rank-$k$ covariance matrix approximation and subspace recovery, our bounds yield improvements over previous bounds. Our bounds are obtained by viewing the addition of Gaussian noise as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations allow us to bound the utility as the square-root of a sum-of-squares of perturbations to the eigenvectors, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.
-
Private Matrix Approximation and Geometry of Unitary Orbits
Authors:
Oren Mangoubi,
Yikai Wu,
Satyen Kale,
Abhradeep Guha Thakurta,
Nisheeth K. Vishnoi
Abstract:
Consider the following optimization problem: Given $n \times n$ matrices $A$ and $Λ$, maximize $\langle A, UΛU^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $Λ$ and, by setting $Λ$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximat…
▽ More
Consider the following optimization problem: Given $n \times n$ matrices $A$ and $Λ$, maximize $\langle A, UΛU^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $Λ$ and, by setting $Λ$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
△ Less
Submitted 6 July, 2022;
originally announced July 2022.
-
Sampling from Log-Concave Distributions over Polytopes via a Soft-Threshold Dikin Walk
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
Given a Lipschitz or smooth convex function $\, f:K \to \mathbb{R}$ for a bounded polytope $K \subseteq \mathbb{R}^d$ defined by $m$ inequalities, we consider the problem of sampling from the log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to $K$. Interest in this problem derives from its applications to Bayesian inference and differentially private learning. Our main result is a gen…
▽ More
Given a Lipschitz or smooth convex function $\, f:K \to \mathbb{R}$ for a bounded polytope $K \subseteq \mathbb{R}^d$ defined by $m$ inequalities, we consider the problem of sampling from the log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to $K$. Interest in this problem derives from its applications to Bayesian inference and differentially private learning. Our main result is a generalization of the Dikin walk Markov chain to this setting that requires at most $O((md + d L^2 R^2) \times md^{ω-1}) \log(\frac{w}δ))$ arithmetic operations to sample from $π$ within error $δ>0$ in the total variation distance from a $w$-warm start. Here $L$ is the Lipschitz-constant of $f$, $K$ is contained in a ball of radius $R$ and contains a ball of smaller radius $r$, and $ω$ is the matrix-multiplication constant. Our algorithm improves on the running time of prior works for a range of parameter settings important for the aforementioned learning applications. Technically, we depart from previous Dikin walks by adding a "soft-threshold" regularizer derived from the Lipschitz or smoothness properties of $f$ to the log-barrier function for $K$ that allows our version of the Dikin walk to propose updates that have a high Metropolis acceptance ratio for $f$, while at the same time remaining inside the polytope $K$.
△ Less
Submitted 14 November, 2022; v1 submitted 19 June, 2022;
originally announced June 2022.
-
Sampling from Log-Concave Distributions with Infinity-Distance Guarantees
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
For a $d$-dimensional log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $ν$ which is $\varepsilon$-close in infinity-distance $\sup_{θ\in K} |\log \frac{ν(θ)}{π(θ)}|$ to $π$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $π$ can be done by algorith…
▽ More
For a $d$-dimensional log-concave distribution $π(θ) \propto e^{-f(θ)}$ constrained to a convex body $K$, the problem of outputting samples from a distribution $ν$ which is $\varepsilon$-close in infinity-distance $\sup_{θ\in K} |\log \frac{ν(θ)}{π(θ)}|$ to $π$ arises in differentially private optimization. While sampling within total-variation distance $\varepsilon$ of $π$ can be done by algorithms whose runtime depends polylogarithmically on $\frac{1}{\varepsilon}$, prior algorithms for sampling in $\varepsilon$ infinity distance have runtime bounds that depend polynomially on $\frac{1}{\varepsilon}$. We bridge this gap by presenting an algorithm that outputs a point $\varepsilon$-close to $π$ in infinity distance that requires at most $\mathrm{poly}(\log \frac{1}{\varepsilon}, d)$ calls to a membership oracle for $K$ and evaluation oracle for $f$, when $f$ is Lipschitz. Our approach departs from prior works that construct Markov chains on a $\frac{1}{\varepsilon^2}$-discretization of $K$ to achieve a sample with $\varepsilon$ infinity-distance error, and present a method to directly convert continuous samples from $K$ with total-variation bounds to samples with infinity bounds. This approach also allows us to obtain an improvement on the dimension $d$ in the running time for the problem of sampling from a log-concave distribution on polytopes $K$ with infinity distance $\varepsilon$, by plugging in TV-distance running time bounds for the Dikin Walk Markov chain.
△ Less
Submitted 11 November, 2022; v1 submitted 7 November, 2021;
originally announced November 2021.
-
Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning
Authors:
Shijian Li,
Oren Mangoubi,
Lijie Xu,
Tian Guo
Abstract:
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters. A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol. For example, while Bulk Synchronous Parallel (BSP) often achieves better converged accuracy, the corresponding training throughput can be negatively i…
▽ More
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters. A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol. For example, while Bulk Synchronous Parallel (BSP) often achieves better converged accuracy, the corresponding training throughput can be negatively impacted by stragglers. In contrast, Asynchronous Parallel (ASP) can have higher throughput, but its convergence and accuracy can be impacted by stale gradients. To improve the performance of synchronization protocol, recent work often focuses on designing new protocols with a heavy reliance on hard-to-tune hyper-parameters. In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP, i.e., reducing training time while simultaneously maintaining the converged accuracy. Based on extensive empirical profiling, we devise a collection of adaptive policies that determine how and when to switch between synchronization protocols. Our policies include both offline ones that target recurring jobs and online ones for handling transient stragglers. We implement the proposed policies in a prototype system, called Sync-Switch, on top of TensorFlow, and evaluate the training performance with popular deep learning models and datasets. Our experiments show that Sync-Switch achieves up to 5.13X throughput speedup and similar converged accuracy when comparing to BSP. Further, we observe that Sync-Switch achieves 3.8% higher converged accuracy with just 1.23X the training time compared to training with ASP. Moreover, Sync-Switch can be used in settings when training with ASP leads to divergence errors. Sync-Switch achieves all of these benefits with very low overhead, e.g., the framework overhead can be as low as 1.7% of the total training time.
△ Less
Submitted 19 April, 2021; v1 submitted 16 April, 2021;
originally announced April 2021.
-
A Convergent and Dimension-Independent Min-Max Optimization Algorithm
Authors:
Vijay Keswani,
Oren Mangoubi,
Sushant Sachdeva,
Nisheeth K. Vishnoi
Abstract:
We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and…
▽ More
We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test-functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse
△ Less
Submitted 30 June, 2022; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Greedy Adversarial Equilibrium: An Efficient Alternative to Nonconvex-Nonconcave Min-Max Optimization
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
Min-max optimization of an objective function $f: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is an important model for robustness in an adversarial setting, with applications to many areas including optimization, economics, and deep learning. In many applications $f$ may be nonconvex-nonconcave, and finding a global min-max point may be computationally intractable. There is a long li…
▽ More
Min-max optimization of an objective function $f: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is an important model for robustness in an adversarial setting, with applications to many areas including optimization, economics, and deep learning. In many applications $f$ may be nonconvex-nonconcave, and finding a global min-max point may be computationally intractable. There is a long line of work that seeks computationally tractable algorithms for alternatives to the min-max optimization model. However, many of the alternative models have solution points which are only guaranteed to exist under strong assumptions on $f$, such as convexity, monotonicity, or special properties of the starting point. We propose an optimization model, the $\varepsilon$-greedy adversarial equilibrium, and show that it can serve as a computationally tractable alternative to the min-max optimization model. Roughly, we say that a point $(x^\star, y^\star)$ is an $\varepsilon$-greedy adversarial equilibrium if $y^\star$ is an $\varepsilon$-approximate local maximum for $f(x^\star,\cdot)$, and $x^\star$ is an $\varepsilon$-approximate local minimum for a "greedy approximation" to the function $\max_z f(x, z)$ which can be efficiently estimated using second-order optimization algorithms. We prove the existence of such a point for any smooth function which is bounded and has Lipschitz Hessian. To prove existence, we introduce an algorithm that converges from any starting point to an $\varepsilon$-greedy adversarial equilibrium in a number of evaluations of the function $f$, the max-player's gradient $\nabla_y f(x,y)$, and its Hessian $\nabla^2_y f(x,y)$, that is polynomial in the dimension $d$, $1/\varepsilon$, and the bounds on $f$ and its Lipschitz constant.
△ Less
Submitted 4 May, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Faster polytope rounding, sampling, and volume computation via a sublinear "Ball Walk"
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
We study the problem of "isotropically rounding" a polytope $K\subset\mathbb{R}^n$, that is, computing a linear transformation which makes the uniform distribution on the polytope have roughly identity covariance matrix. We assume $K$ is defined by $m$ linear inequalities, with guarantee that $rB\subset K\subset RB$, where $B$ is the unit ball. We introduce a new variant of the ball walk Markov ch…
▽ More
We study the problem of "isotropically rounding" a polytope $K\subset\mathbb{R}^n$, that is, computing a linear transformation which makes the uniform distribution on the polytope have roughly identity covariance matrix. We assume $K$ is defined by $m$ linear inequalities, with guarantee that $rB\subset K\subset RB$, where $B$ is the unit ball. We introduce a new variant of the ball walk Markov chain and show that, roughly, the expected number of arithmetic operations per-step of this Markov chain is $O(m)$ that is sublinear in the input size $mn$--the per-step time of all prior Markov chains. Subsequently, we give a rounding algorithm that succeeds with probability $1-\varepsilon$ in $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ arithmetic operations. This gives a factor of $\sqrt{n}$ improvement on the previous bound of $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for rounding, which uses the hit-and-run algorithm. Since the rounding preprocessing step is in many cases the bottleneck in improving sampling or volume computation, our results imply these tasks can also be achieved in roughly $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4δ^{-2})$ operations for computing the volume of $K$ up to a factor $1+δ$ and $\tilde{O}(mn^{4.5}\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})))$ for uniformly sampling on $K$ with TV error $\varepsilon$. This improves on the previous bounds of $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r})+mn^4δ^{-2})$ for volume computation when roughly $m\geq n^{2.5}$, and $\tilde{O}(mn^5\mbox{polylog}(\frac{1}{\varepsilon},\frac{R}{r}))$ for sampling when roughly $m\geq n^{1.5}$. We achieve this improvement by a novel method of computing polytope membership, where one avoids checking inequalities estimated to have a very low probability of being violated.
△ Less
Submitted 14 September, 2019; v1 submitted 5 May, 2019;
originally announced May 2019.
-
Nonconvex sampling with the Metropolis-adjusted Langevin algorithm
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex statistics and machine learning applications. Despite this, current bounds for the Langevin algorithms are slower than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampli…
▽ More
The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex statistics and machine learning applications. Despite this, current bounds for the Langevin algorithms are slower than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampling or optimizing non-convex log-densities. In this paper, we obtain improved bounds in many of these situations, showing that the Metropolis-adjusted Langevin algorithm (MALA) is faster than the best bounds for its competitor algorithms when the target distribution satisfies weak third- and fourth- order regularity properties associated with the input data. In many settings, our regularity conditions are weaker than the usual Euclidean operator norm regularity properties, allowing us to show faster bounds for a much larger class of distributions than would be possible with the usual Euclidean operator norm approach, including in statistics and machine learning applications where the data satisfy a certain incoherence condition. In particular, we show that using our regularity conditions one can obtain faster bounds for applications which include sampling problems in Bayesian logistic regression with weakly convex priors, and the nonconvex optimization problem of learning linear classifiers with zero-one loss functions.
Our main technical contribution in this paper is our analysis of the Metropolis acceptance probability of MALA in terms of its "energy-conservation error," and our bound for this error in terms of third- and fourth- order regularity conditions. Our combination of this higher-order analysis of the energy conservation error with the conductance method is key to obtaining bounds which have a sub-linear dependence on the dimension $d$ in the non-strongly logconcave setting.
△ Less
Submitted 9 April, 2019; v1 submitted 22 February, 2019;
originally announced February 2019.
-
Online Sampling from Log-Concave Distributions
Authors:
Holden Lee,
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $π_t \propto e^{-\sum_{k=0}^tf_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data,…
▽ More
Given a sequence of convex functions $f_0, f_1, \ldots, f_T$, we study the problem of sampling from the Gibbs distribution $π_t \propto e^{-\sum_{k=0}^tf_k}$ for each epoch $t$ in an online manner. Interest in this problem derives from applications in machine learning, Bayesian statistics, and optimization where, rather than obtaining all the observations at once, one constantly acquires new data, and must continuously update the distribution. Our main result is an algorithm that generates roughly independent samples from $π_t$ for every epoch $t$ and, under mild assumptions, makes $\mathrm{polylog}(T)$ gradient evaluations per epoch. All previous results imply a bound on the number of gradient or function evaluations which is at least linear in $T$. Motivated by real-world applications, we assume that functions are smooth, their associated distributions have a bounded second moment, and their minimizer drifts in a bounded manner, but do not assume they are strongly convex. In particular, our assumptions hold for online Bayesian logistic regression, when the data satisfy natural regularity properties, giving a sampling algorithm with updates that are poly-logarithmic in $T$. In simulations, our algorithm achieves accuracy comparable to an algorithm specialized to logistic regression. Key to our algorithm is a novel stochastic gradient Langevin dynamics Markov chain with a carefully designed variance reduction step and constant batch size. Technically, lack of strong convexity is a significant barrier to analysis and, here, our main contribution is a martingale exit time argument that shows our Markov chain remains in a ball of radius roughly poly-logarithmic in $T$ for enough time to reach within $\varepsilon$ of $π_t$.
△ Less
Submitted 4 December, 2019; v1 submitted 21 February, 2019;
originally announced February 2019.
-
Does Hamiltonian Monte Carlo mix faster than a random walk on multimodal densities?
Authors:
Oren Mangoubi,
Natesh S. Pillai,
Aaron Smith
Abstract:
Hamiltonian Monte Carlo (HMC) is a very popular and generic collection of Markov chain Monte Carlo (MCMC) algorithms. One explanation for the popularity of HMC algorithms is their excellent performance as the dimension $d$ of the target becomes large: under conditions that are satisfied for many common statistical models, optimally-tuned HMC algorithms have a running time that scales like…
▽ More
Hamiltonian Monte Carlo (HMC) is a very popular and generic collection of Markov chain Monte Carlo (MCMC) algorithms. One explanation for the popularity of HMC algorithms is their excellent performance as the dimension $d$ of the target becomes large: under conditions that are satisfied for many common statistical models, optimally-tuned HMC algorithms have a running time that scales like $d^{0.25}$. In stark contrast, the running time of the usual Random-Walk Metropolis (RWM) algorithm, optimally tuned, scales like $d$. This superior scaling of the HMC algorithm with dimension is attributed to the fact that it, unlike RWM, incorporates the gradient information in the proposal distribution. In this paper, we investigate a different scaling question: does HMC beat RWM for highly $\textit{multimodal}$ targets? We find that the answer is often $\textit{no}$. We compute the spectral gaps for both the algorithms for a specific class of multimodal target densities, and show that they are identical. The key reason is that, within one mode, the gradient is effectively ignorant about other modes, thus negating the advantage the HMC algorithm enjoys in unimodal targets. We also give heuristic arguments suggesting that the above observation may hold quite generally. Our main tool for answering this question is a novel simple formula for the conductance of HMC using Liouville's theorem. This result allows us to compute the spectral gap of HMC algorithms, for both the classical HMC with isotropic momentum and the recent Riemannian HMC, for multimodal targets.
△ Less
Submitted 4 September, 2018; v1 submitted 9 August, 2018;
originally announced August 2018.
-
Dimensionally Tight Bounds for Second-Order Hamiltonian Monte Carlo
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
Hamiltonian Monte Carlo (HMC) is a widely deployed method to sample from high-dimensional distributions in Statistics and Machine learning. HMC is known to run very efficiently in practice and its popular second-order "leapfrog" implementation has long been conjectured to run in $d^{1/4}$ gradient evaluations. Here we show that this conjecture is true when sampling from strongly log-concave target…
▽ More
Hamiltonian Monte Carlo (HMC) is a widely deployed method to sample from high-dimensional distributions in Statistics and Machine learning. HMC is known to run very efficiently in practice and its popular second-order "leapfrog" implementation has long been conjectured to run in $d^{1/4}$ gradient evaluations. Here we show that this conjecture is true when sampling from strongly log-concave target distributions that satisfy a weak third-order regularity property associated with the input data. Our regularity condition is weaker than the Lipschitz Hessian property and allows us to show faster convergence bounds for a much larger class of distributions than would be possible with the usual Lipschitz Hessian constant alone. Important distributions that satisfy our regularity condition include posterior distributions used in Bayesian logistic regression for which the data satisfies an "incoherence" property. Our result compares favorably with the best available bounds for the class of strongly log-concave distributions, which grow like $d^{{1}/{2}}$ gradient evaluations with the dimension. Moreover, our simulations on synthetic data suggest that, when our regularity condition is satisfied, leapfrog HMC performs better than its competitors -- both in terms of accuracy and in terms of the number of gradient evaluations it requires.
△ Less
Submitted 9 August, 2018; v1 submitted 24 February, 2018;
originally announced February 2018.
-
Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing
Authors:
Oren Mangoubi,
Nisheeth K. Vishnoi
Abstract:
We consider the problem of minimizing a convex objective function $F$ when one can only evaluate its noisy approximation $\hat{F}$. Unless one assumes some structure on the noise, $\hat{F}$ may be an arbitrary nonconvex function, making the task of minimizing $F$ intractable. To overcome this, prior work has often focused on the case when $F(x)-\hat{F}(x)$ is uniformly-bounded. In this paper we st…
▽ More
We consider the problem of minimizing a convex objective function $F$ when one can only evaluate its noisy approximation $\hat{F}$. Unless one assumes some structure on the noise, $\hat{F}$ may be an arbitrary nonconvex function, making the task of minimizing $F$ intractable. To overcome this, prior work has often focused on the case when $F(x)-\hat{F}(x)$ is uniformly-bounded. In this paper we study the more general case when the noise has magnitude $αF(x) + β$ for some $α, β> 0$, and present a polynomial time algorithm that finds an approximate minimizer of $F$ for this noise model. Previously, Markov chains, such as the stochastic gradient Langevin dynamics, have been used to arrive at approximate solutions to these optimization problems. However, for the noise model considered in this paper, no single temperature allows such a Markov chain to both mix quickly and concentrate near the global minimizer. We bypass this by combining "simulated annealing" with the stochastic gradient Langevin dynamics, and gradually decreasing the temperature of the chain in order to approach the global minimizer. As a corollary one can approximately minimize a nonconvex function that is close to a convex function; however, the closeness can deteriorate as one moves away from the optimum.
△ Less
Submitted 18 June, 2018; v1 submitted 7 November, 2017;
originally announced November 2017.
-
Rapid Mixing of Geodesic Walks on Manifolds with Positive Curvature
Authors:
Oren Mangoubi,
Aaron Smith
Abstract:
We introduce a Markov chain for sampling from the uniform distribution on a Riemannian manifold $\mathcal{M}$, which we call the $\textit{geodesic walk}$. We prove that the mixing time of this walk on any manifold with positive sectional curvature $C_{x}(u,v)$ bounded both above and below by $0 < \mathfrak{m}_{2} \leq C_{x}(u,v) \leq \mathfrak{M}_2 < \infty$ is…
▽ More
We introduce a Markov chain for sampling from the uniform distribution on a Riemannian manifold $\mathcal{M}$, which we call the $\textit{geodesic walk}$. We prove that the mixing time of this walk on any manifold with positive sectional curvature $C_{x}(u,v)$ bounded both above and below by $0 < \mathfrak{m}_{2} \leq C_{x}(u,v) \leq \mathfrak{M}_2 < \infty$ is $\mathcal{O}^*\left(\frac{\mathfrak{M}_2}{\mathfrak{m}_2}\right)$. In particular, this bound on the mixing time does not depend explicitly on the dimension of the manifold. In the special case that $\mathcal{M}$ is the boundary of a convex body, we give an explicit and computationally tractable algorithm for approximating the exact geodesic walk. As a consequence, we obtain an algorithm for sampling uniformly from the surface of a convex body that has running time bounded solely in terms of the curvature of the body.
△ Less
Submitted 26 November, 2017; v1 submitted 9 September, 2016;
originally announced September 2016.