subscribe to arXiv mailings

Variational Inference for Uncertainty Quantification: an Analysis of Trade-offs

Authors: Charles C. Margossian, Loucas Pillaud-Vivien, Lawrence K. Saul

Abstract: Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though~$p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any fac… ▽ More Given an intractable distribution $p$, the problem of variational inference (VI) is to find the best approximation from some more tractable family $Q$. Commonly, one chooses $Q$ to be a family of factorized distributions (i.e., the mean-field assumption), even though~$p$ itself does not factorize. We show that this mismatch leads to an impossibility theorem: if $p$ does not factorize, then any factorized approximation $q\in Q$ can correctly estimate at most one of the following three measures of uncertainty: (i) the marginal variances, (ii) the marginal precisions, or (iii) the generalized variance (which can be related to the entropy). In practice, the best variational approximation in $Q$ is found by minimizing some divergence $D(q,p)$ between distributions, and so we ask: how does the choice of divergence determine which measure of uncertainty, if any, is correctly estimated by VI? We consider the classic Kullback-Leibler divergences, the more general Rényi divergences, and a score-based divergence which compares $\nabla \log p$ and $\nabla \log q$. We provide a thorough theoretical analysis in the setting where $p$ is a Gaussian and $q$ is a (factorized) Gaussian. We show that all the considered divergences can be \textit{ordered} based on the estimates of uncertainty they yield as objective functions for~VI. Finally, we empirically evaluate the validity of this ordering when the target distribution $p$ is not Gaussian. △ Less

Submitted 7 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

arXiv:2402.14758 [pdf, other]

Batch and match: black-box variational inference with a score-based divergence

Authors: Diana Cai, Chirag Modi, Loucas Pillaud-Vivien, Charles C. Margossian, Robert M. Gower, David M. Blei, Lawrence K. Saul

Abstract: Most leading implementations of black-box variational inference (BBVI) are based on optimizing a stochastic evidence lower bound (ELBO). But such approaches to BBVI often converge slowly due to the high variance of their gradient estimates and their sensitivity to hyperparameters. In this work, we propose batch and match (BaM), an alternative approach to BBVI based on a score-based divergence. Not… ▽ More Most leading implementations of black-box variational inference (BBVI) are based on optimizing a stochastic evidence lower bound (ELBO). But such approaches to BBVI often converge slowly due to the high variance of their gradient estimates and their sensitivity to hyperparameters. In this work, we propose batch and match (BaM), an alternative approach to BBVI based on a score-based divergence. Notably, this score-based divergence can be optimized by a closed-form proximal update for Gaussian variational families with full covariance matrices. We analyze the convergence of BaM when the target distribution is Gaussian, and we prove that in the limit of infinite batch size the variational parameter updates converge exponentially quickly to the target mean and covariance. We also evaluate the performance of BaM on Gaussian and non-Gaussian target distributions that arise from posterior inference in hierarchical and deep generative models. In these experiments, we find that BaM typically converges in fewer (and sometimes significantly fewer) gradient evaluations than leading implementations of BBVI based on ELBO maximization. △ Less

Submitted 12 June, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: 49 pages, 14 figures. To appear in the Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

arXiv:1409.3518 [pdf, other]

Topic Modeling of Hierarchical Corpora

Authors: Do-kyum Kim, Geoffrey M. Voelker, Lawrence K. Saul

Abstract: We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that ther… ▽ More We study the problem of topic modeling in corpora whose documents are organized in a multi-level hierarchy. We explore a parametric approach to this problem, assuming that the number of topics is known or can be estimated by cross-validation. The models we consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models we show that there exists a simple variational approximation for probabilistic inference. The approximation relies on a previously unexploited inequality that handles the conditional dependence between Dirichlet latent variables in adjacent levels of the model's hierarchy. We compare our approach to existing implementations of nonparametric HDPs. On several benchmarks we find that our approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods. Finally, we demonstrate the large-scale viability of our approach on two newly available corpora from researchers in computer security---one with 350,000 documents and over 6,000 internal subcategories, the other with a five-level deep hierarchy. △ Less

Submitted 13 April, 2015; v1 submitted 11 September, 2014; originally announced September 2014.

arXiv:1112.3714 [pdf]

Nonnegative Matrix Factorization for Semi-supervised Dimensionality Reduction

Authors: Youngmin Cho, Lawrence K. Saul

Abstract: We show how to incorporate information from labeled examples into nonnegative matrix factorization (NMF), a popular unsupervised learning algorithm for dimensionality reduction. In addition to mapping the data into a space of lower dimensionality, our approach aims to preserve the nonnegative components of the data that are important for classification. We identify these components from the suppor… ▽ More We show how to incorporate information from labeled examples into nonnegative matrix factorization (NMF), a popular unsupervised learning algorithm for dimensionality reduction. In addition to mapping the data into a space of lower dimensionality, our approach aims to preserve the nonnegative components of the data that are important for classification. We identify these components from the support vectors of large-margin classifiers and derive iterative updates to preserve them in a semi-supervised version of NMF. These updates have a simple multiplicative form like their unsupervised counterparts; they are also guaranteed at each iteration to decrease their loss function---a weighted sum of I-divergences that captures the trade-off between unsupervised and supervised learning. We evaluate these updates for dimensionality reduction when they are used as a precursor to linear classification. In this role, we find that they yield much better performance than their unsupervised counterparts. We also find one unexpected benefit of the low dimensional representations discovered by our approach: often they yield more accurate classifiers than both ordinary and transductive SVMs trained in the original input space. △ Less

Submitted 16 December, 2011; originally announced December 2011.

Comments: Preprint submitted to Machine Learning Journal

arXiv:1112.3712 [pdf]

Analysis and Extension of Arc-Cosine Kernels for Large Margin Classification

Authors: Youngmin Cho, Lawrence K. Saul

Abstract: We investigate a recently proposed family of positive-definite kernels that mimic the computation in large neural networks. We examine the properties of these kernels using tools from differential geometry; specifically, we analyze the geometry of surfaces in Hilbert space that are induced by these kernels. When this geometry is described by a Riemannian manifold, we derive results for the metric,… ▽ More We investigate a recently proposed family of positive-definite kernels that mimic the computation in large neural networks. We examine the properties of these kernels using tools from differential geometry; specifically, we analyze the geometry of surfaces in Hilbert space that are induced by these kernels. When this geometry is described by a Riemannian manifold, we derive results for the metric, curvature, and volume element. Interestingly, though, we find that the simplest kernel in this family does not admit such an interpretation. We explore two variations of these kernels that mimic computation in neural networks with different activation functions. We experiment with these new kernels on several data sets and highlight their general trends in performance for classification. △ Less

Submitted 16 December, 2011; originally announced December 2011.

Comments: Preprint submitted to Neural Networks

arXiv:cs/9603102 [pdf, ps]

Mean Field Theory for Sigmoid Belief Networks

Authors: L. K. Saul, T. Jaakkola, M. I. Jordan

Abstract: We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition---the classification… ▽ More We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition---the classification of handwritten digits. △ Less

Submitted 29 February, 1996; originally announced March 1996.

Comments: See http://www.jair.org/ for any accompanying files

Journal ref: Journal of Artificial Intelligence Research, Vol 4, (1996), 61-76

Showing 1–6 of 6 results for author: Saul, L K