subscribe to arXiv mailings

Inferring Dynamic Networks from Marginals with Iterative Proportional Fitting

Authors: Serina Chang, Frederic Koehler, Zhaonan Qu, Jure Leskovec, Johan Ugander

Abstract: A common network inference problem, arising from real-world data constraints, is how to infer a dynamic network from its time-aggregated adjacency matrix and time-varying marginals (i.e., row and column sums). Prior approaches to this problem have repurposed the classic iterative proportional fitting (IPF) procedure, also known as Sinkhorn's algorithm, with promising empirical results. However, th… ▽ More A common network inference problem, arising from real-world data constraints, is how to infer a dynamic network from its time-aggregated adjacency matrix and time-varying marginals (i.e., row and column sums). Prior approaches to this problem have repurposed the classic iterative proportional fitting (IPF) procedure, also known as Sinkhorn's algorithm, with promising empirical results. However, the statistical foundation for using IPF has not been well understood: under what settings does IPF provide principled estimation of a dynamic network from its marginals, and how well does it estimate the network? In this work, we establish such a setting, by identifying a generative network model whose maximum likelihood estimates are recovered by IPF. Our model both reveals implicit assumptions on the use of IPF in such settings and enables new analyses, such as structure-dependent error bounds on IPF's parameter estimates. When IPF fails to converge on sparse network data, we introduce a principled algorithm that guarantees IPF converges under minimal changes to the network structure. Finally, we conduct experiments with synthetic and real-world data, which demonstrate the practical value of our theoretical and algorithmic contributions. △ Less

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.15409 [pdf, other]

Lasso with Latents: Efficient Estimation, Covariate Rescaling, and Computational-Statistical Gaps

Authors: Jonathan Kelner, Frederic Koehler, Raghu Meka, Dhruv Rohatgi

Abstract: It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impo… ▽ More It is well-known that the statistical performance of Lasso can suffer significantly when the covariates of interest have strong correlations. In particular, the prediction error of Lasso becomes much worse than computationally inefficient alternatives like Best Subset Selection. Due to a large conjectured computational-statistical tradeoff in the problem of sparse linear regression, it may be impossible to close this gap in general. In this work, we propose a natural sparse linear regression setting where strong correlations between covariates arise from unobserved latent variables. In this setting, we analyze the problem caused by strong correlations and design a surprisingly simple fix. While Lasso with standard normalization of covariates fails, there exists a heterogeneous scaling of the covariates with which Lasso will suddenly obtain strong provable guarantees for estimation. Moreover, we design a simple, efficient procedure for computing such a "smart scaling." The sample complexity of the resulting "rescaled Lasso" algorithm incurs (in the worst case) quadratic dependence on the sparsity of the underlying signal. While this dependence is not information-theoretically necessary, we give evidence that it is optimal among the class of polynomial-time algorithms, via the method of low-degree polynomials. This argument reveals a new connection between sparse linear regression and a special version of sparse PCA with a near-critical negative spike. The latter problem can be thought of as a real-valued analogue of learning a sparse parity. Using it, we also establish the first computational-statistical gap for the closely related problem of learning a Gaussian Graphical Model. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2310.01762 [pdf, other]

Sampling Multimodal Distributions with the Vanilla Score: Benefits of Data-Based Initialization

Authors: Frederic Koehler, Thuy-Duong Vuong

Abstract: There is a long history, as well as a recent explosion of interest, in statistical and generative modeling approaches based on score functions -- derivatives of the log-likelihood of a distribution. In seminal works, Hyvärinen proposed vanilla score matching as a way to learn distributions from data by computing an estimate of the score function of the underlying ground truth, and established conn… ▽ More There is a long history, as well as a recent explosion of interest, in statistical and generative modeling approaches based on score functions -- derivatives of the log-likelihood of a distribution. In seminal works, Hyvärinen proposed vanilla score matching as a way to learn distributions from data by computing an estimate of the score function of the underlying ground truth, and established connections between this method and established techniques like Contrastive Divergence and Pseudolikelihood estimation. It is by now well-known that vanilla score matching has significant difficulties learning multimodal distributions. Although there are various ways to overcome this difficulty, the following question has remained unanswered -- is there a natural way to sample multimodal distributions using just the vanilla score? Inspired by a long line of related experimental works, we prove that the Langevin diffusion with early stopping, initialized at the empirical distribution, and run on a score function estimated from data successfully generates natural multimodal distributions (mixtures of log-concave distributions). △ Less

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2307.10466 [pdf, ps, other]

Universality of Spectral Independence with Applications to Fast Mixing in Spin Glasses

Authors: Nima Anari, Vishesh Jain, Frederic Koehler, Huy Tuan Pham, Thuy-Duong Vuong

Abstract: We study Glauber dynamics for sampling from discrete distributions $μ$ on the hypercube $\{\pm 1\}^n$. Recently, techniques based on spectral independence have successfully yielded optimal $O(n)$ relaxation times for a host of different distributions $μ$. We show that spectral independence is universal: a relaxation time of $O(n)$ implies spectral independence. We then study a notion of tractabi… ▽ More We study Glauber dynamics for sampling from discrete distributions $μ$ on the hypercube $\{\pm 1\}^n$. Recently, techniques based on spectral independence have successfully yielded optimal $O(n)$ relaxation times for a host of different distributions $μ$. We show that spectral independence is universal: a relaxation time of $O(n)$ implies spectral independence. We then study a notion of tractability for $μ$, defined in terms of smoothness of the multilinear extension of its Hamiltonian -- $\log μ$ -- over $[-1,+1]^n$. We show that Glauber dynamics has relaxation time $O(n)$ for such $μ$, and using the universality of spectral independence, we conclude that these distributions are also fractionally log-concave and consequently satisfy modified log-Sobolev inequalities. We sharpen our estimates and obtain approximate tensorization of entropy and the optimal $\widetilde{O}(n)$ mixing time for random Hamiltonians, i.e. the classically studied mixed $p$-spin model at sufficiently high temperature. These results have significant downstream consequences for concentration of measure, statistical testing, and learning. △ Less

Submitted 19 July, 2023; originally announced July 2023.

arXiv:2307.07625 [pdf, ps, other]

Influences in Mixing Measures

Authors: Frederic Koehler, Noam Lifshitz, Dor Minzer, Elchanan Mossel

Abstract: The theory of influences in product measures has profound applications in theoretical computer science, combinatorics, and discrete probability. This deep theory is intimately connected to functional inequalities and to the Fourier analysis of discrete groups. Originally, influences of functions were motivated by the study of social choice theory, wherein a Boolean function represents a voting sch… ▽ More The theory of influences in product measures has profound applications in theoretical computer science, combinatorics, and discrete probability. This deep theory is intimately connected to functional inequalities and to the Fourier analysis of discrete groups. Originally, influences of functions were motivated by the study of social choice theory, wherein a Boolean function represents a voting scheme, its inputs represent the votes, and its output represents the outcome of the elections. Thus, product measures represent a scenario in which the votes of the parties are randomly and independently distributed, which is often far from the truth in real-life scenarios. We begin to develop the theory of influences for more general measures under mixing or correlation decay conditions. More specifically, we prove analogues of the KKL and Talagrand influence theorems for Markov Random Fields on bounded degree graphs with correlation decay. We show how some of the original applications of the theory of in terms of voting and coalitions extend to general measures with correlation decay. Our results thus shed light both on voting with correlated voters and on the behavior of general functions of Markov Random Fields (also called ``spin-systems") with correlation decay. △ Less

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2306.13188 [pdf, ps, other]

Uniform Convergence with Square-Root Lipschitz Loss

Authors: Lijia Zhou, Zhen Dai, Frederic Koehler, Nathan Srebro

Abstract: We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipsch… ▽ More We establish generic uniform convergence guarantees for Gaussian data in terms of the Rademacher complexity of the hypothesis class and the Lipschitz constant of the square root of the scalar loss function. We show how these guarantees substantially generalize previous results based on smoothness (Lipschitz constant of the derivative), and allow us to handle the broader class of square-root-Lipschitz losses, which includes also non-smooth loss functions appropriate for studying phase retrieval and ReLU regression, as well as rederive and better understand "optimistic rate" and interpolation learning guarantees. △ Less

Submitted 22 June, 2023; originally announced June 2023.

arXiv:2305.16892 [pdf, other]

Feature Adaptation for Sparse Linear Regression

Authors: Jonathan Kelner, Frederic Koehler, Raghu Meka, Dhruv Rohatgi

Abstract: Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0,Σ)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, information-theoretically, it is possible to achieve strong recovery guarantees with only $O(t\log n)$ samples. However,… ▽ More Sparse linear regression is a central problem in high-dimensional statistics. We study the correlated random design setting, where the covariates are drawn from a multivariate Gaussian $N(0,Σ)$, and we seek an estimator with small excess risk. If the true signal is $t$-sparse, information-theoretically, it is possible to achieve strong recovery guarantees with only $O(t\log n)$ samples. However, computationally efficient algorithms have sample complexity linear in (some variant of) the condition number of $Σ$. Classical algorithms such as the Lasso can require significantly more samples than necessary even if there is only a single sparse approximate dependency among the covariates. We provide a polynomial-time algorithm that, given $Σ$, automatically adapts the Lasso to tolerate a small number of approximate dependencies. In particular, we achieve near-optimal sample complexity for constant sparsity and if $Σ$ has few ``outlier'' eigenvalues. Our algorithm fits into a broader framework of feature adaptation for sparse linear regression with ill-conditioned covariates. With this framework, we additionally provide the first polynomial-factor improvement over brute-force search for constant sparsity $t$ and arbitrary covariance $Σ$. △ Less

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2210.12082 [pdf, other]

A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models

Authors: Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J. Sutherland, Nathan Srebro

Abstract: We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the squa… ▽ More We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal $\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand's well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: As published at NeurIPS 2022

arXiv:2210.00726 [pdf, other]

Statistical Efficiency of Score Matching: The View from Isoperimetry

Authors: Frederic Koehler, Alexander Heckett, Andrej Risteski

Abstract: Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score functio… ▽ More Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, its unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincaré, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics. △ Less

Submitted 22 December, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2203.02824 [pdf, ps, other]

Distributional Hardness Against Preconditioned Lasso via Erasure-Robust Designs

Authors: Jonathan A. Kelner, Frederic Koehler, Raghu Meka, Dhruv Rohatgi

Abstract: Sparse linear regression with ill-conditioned Gaussian random designs is widely believed to exhibit a statistical/computational gap, but there is surprisingly little formal evidence for this belief, even in the form of examples that are hard for restricted classes of algorithms. Recent work has shown that, for certain covariance matrices, the broad class of Preconditioned Lasso programs provably c… ▽ More Sparse linear regression with ill-conditioned Gaussian random designs is widely believed to exhibit a statistical/computational gap, but there is surprisingly little formal evidence for this belief, even in the form of examples that are hard for restricted classes of algorithms. Recent work has shown that, for certain covariance matrices, the broad class of Preconditioned Lasso programs provably cannot succeed on polylogarithmically sparse signals with a sublinear number of samples. However, this lower bound only shows that for every preconditioner, there exists at least one signal that it fails to recover successfully. This leaves open the possibility that, for example, trying multiple different preconditioners solves every sparse linear regression problem. In this work, we prove a stronger lower bound that overcomes this issue. For an appropriate covariance matrix, we construct a single signal distribution on which any invertibly-preconditioned Lasso program fails with high probability, unless it receives a linear number of samples. Surprisingly, at the heart of our lower bound is a new positive result in compressed sensing. We show that standard sparse random designs are with high probability robust to adversarial measurement erasures, in the sense that if $b$ measurements are erased, then all but $O(b)$ of the coordinates of the signal are still information-theoretically identifiable. To our knowledge, this is the first time that partial recoverability of arbitrary sparse signals under erasures has been studied in compressed sensing. △ Less

Submitted 5 March, 2022; originally announced March 2022.

Comments: 39 pages

arXiv:2202.08907 [pdf, ps, other]

Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods

Authors: Frederic Koehler, Holden Lee, Andrej Risteski

Abstract: We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the… ▽ More We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the Glauber dynamics to mix torpidly. Our general result implies the first polynomial time sampling algorithms for low-rank Ising models such as Hopfield networks with a fixed number of patterns and Bayesian clustering models with low-dimensional contexts, and greatly improves the polynomial time sampling regime for the antiferromagnetic/ferromagnetic Ising model with inconsistent field on expander graphs. It also improves on previous approximation algorithm results based on the naive mean-field approximation in variational methods and statistical physics. Our approach is based on a new fusion of ideas from the MCMC and variational inference worlds. As part of our algorithm, we define a new nonconvex variational problem which allows us to sample from an exponential reweighting of a distribution by a negative definite quadratic form, and show how to make this procedure provably efficient using stochastic gradient descent. On top of this, we construct a new simulated tempering chain (on an extended state space arising from the Hubbard-Stratonovich transform) which overcomes the obstacle posed by large positive eigenvalues, and combine it with the SGD-based sampler to solve the full problem. △ Less

Submitted 17 February, 2022; originally announced February 2022.

Comments: 43 pages

arXiv:2112.06868 [pdf, other]

Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias

Authors: Frederic Koehler, Viraj Mehta, Chenghui Zhou, Andrej Risteski

Abstract: Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance… ▽ More Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance which is correctly supported on the ground truth manifold. They gave partial support for that conjecture by showing that some optima of the VAE loss do satisfy this property, but did not analyze the training dynamics. In this paper, we show that for linear encoders/decoders, the conjecture is true-that is the VAE training does recover a generator with support equal to the ground truth manifold-and does so due to an implicit bias of gradient descent rather than merely the VAE loss itself. In the nonlinear case, we show that VAE training frequently learns a higher-dimensional manifold which is a superset of the ground truth manifold. △ Less

Submitted 17 May, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: Accepted as a conference paper at ICLR 2022

arXiv:2112.04470 [pdf, other]

Optimistic Rates: A Unifying Theory for Interpolation Learning and Regularization in Linear Regression

Authors: Lijia Zhou, Frederic Koehler, Danica J. Sutherland, Nathan Srebro

Abstract: We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis rec… ▽ More We study a localized notion of uniform convergence known as an "optimistic rate" (Panchenko 2002; Srebro et al. 2010) for linear regression with Gaussian data. Our refined analysis avoids the hidden constant and logarithmic factor in existing results, which are known to be crucial in high-dimensional settings, especially for understanding interpolation learning. As a special case, our analysis recovers the guarantee from Koehler et al. (2021), which tightly characterizes the population risk of low-norm interpolators under the benign overfitting conditions. Our optimistic rate bound, though, also analyzes predictors with arbitrary training error. This allows us to recover some classical statistical guarantees for ridge and LASSO regression under random designs, and helps us obtain a precise understanding of the excess risk of near-interpolators in the over-parameterized regime. △ Less

Submitted 8 December, 2021; originally announced December 2021.

arXiv:2111.06395 [pdf, ps, other]

Kalman Filtering with Adversarial Corruptions

Authors: Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau

Abstract: Here we revisit the classic problem of linear quadratic estimation, i.e. estimating the trajectory of a linear dynamical system from noisy measurements. The celebrated Kalman filter gives an optimal estimator when the measurement noise is Gaussian, but is widely known to break down when one deviates from this assumption, e.g. when the noise is heavy-tailed. Many ad hoc heuristics have been employe… ▽ More Here we revisit the classic problem of linear quadratic estimation, i.e. estimating the trajectory of a linear dynamical system from noisy measurements. The celebrated Kalman filter gives an optimal estimator when the measurement noise is Gaussian, but is widely known to break down when one deviates from this assumption, e.g. when the noise is heavy-tailed. Many ad hoc heuristics have been employed in practice for dealing with outliers. In a pioneering work, Schick and Mitter gave provable guarantees when the measurement noise is a known infinitesimal perturbation of a Gaussian and raised the important question of whether one can get similar guarantees for large and unknown perturbations. In this work we give a truly robust filter: we give the first strong provable guarantees for linear quadratic estimation when even a constant fraction of measurements have been adversarially corrupted. This framework can model heavy-tailed and even non-stationary noise processes. Our algorithm robustifies the Kalman filter in the sense that it competes with the optimal algorithm that knows the locations of the corruptions. Our work is in a challenging Bayesian setting where the number of measurements scales with the complexity of what we need to estimate. Moreover, in linear dynamical systems past information decays over time. We develop a suite of new techniques to robustly extract information across different time steps and over varying time scales. △ Less

Submitted 11 November, 2021; originally announced November 2021.

Comments: 57 pages, comments welcome

arXiv:2111.03247 [pdf, ps, other]

Entropic Independence II: Optimal Sampling and Concentration via Restricted Modified Log-Sobolev Inequalities

Authors: Nima Anari, Vishesh Jain, Frederic Koehler, Huy Tuan Pham, Thuy-Duong Vuong

Abstract: We introduce a framework for obtaining tight mixing times for Markov chains based on what we call restricted modified log-Sobolev inequalities. Modified log-Sobolev inequalities (MLSI) quantify the rate of relative entropy contraction for the Markov operator, and are notoriously difficult to establish. However, infinitesimally close to stationarity, entropy contraction becomes equivalent to varian… ▽ More We introduce a framework for obtaining tight mixing times for Markov chains based on what we call restricted modified log-Sobolev inequalities. Modified log-Sobolev inequalities (MLSI) quantify the rate of relative entropy contraction for the Markov operator, and are notoriously difficult to establish. However, infinitesimally close to stationarity, entropy contraction becomes equivalent to variance contraction, a.k.a. a Poincare inequality, which is significantly easier to establish through, e.g., spectral analysis. Motivated by this observation, we study restricted modified log-Sobolev inequalities that guarantee entropy contraction not for all starting distributions, but for those in a large neighborhood of the stationary distribution. We show how to sample from the hardcore and Ising models on $n$-node graphs that have a constant $δ$ relative gap to the tree-uniqueness threshold, in nearly-linear time $\widetilde O_δ(n)$. Notably, our bound does not depend on the maximum degree $Δ$, and is therefore optimal even for high-degree graphs. This improves on prior mixing time bounds of $\widetilde O_{δ, Δ}(n)$ and $\widetilde O_δ(n^2)$, established via (non-restricted) modified log-Sobolev and Poincare inequalities respectively. We further show that optimal concentration inequalities can still be achieved from the restricted form of modified log-Sobolev inequalities. To establish restricted entropy contraction, we extend the entropic independence framework of Anari, Jain, Koehler, Pham, and Vuong to the setting of distributions that are spectrally independent under a restricted set of external fields. We also develop an orthogonal trick that might be of independent interest: utilizing Bernoulli factories we show how to implement Glauber dynamics updates on high-degree graphs in $O(1)$ time, assuming standard adjacency array representation of the graph. △ Less

Submitted 5 November, 2021; originally announced November 2021.

arXiv:2109.11505 [pdf, other]

Multidimensional Scaling: Approximation and Complexity

Authors: Erik Demaine, Adam Hesterberg, Frederic Koehler, Jayson Lynch, John Urschel

Abstract: Metric Multidimensional scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into… ▽ More Metric Multidimensional scaling (MDS) is a classical method for generating meaningful (non-linear) low-dimensional embeddings of high-dimensional data. MDS has a long history in the statistics, machine learning, and graph drawing communities. In particular, the Kamada-Kawai force-directed graph drawing method is equivalent to MDS and is one of the most popular ways in practice to embed graphs into low dimensions. Despite its ubiquity, our theoretical understanding of MDS remains limited as its objective function is highly non-convex. In this paper, we prove that minimizing the Kamada-Kawai objective is NP-hard and give a provable approximation algorithm for optimizing it, which in particular is a PTAS on low-diameter graphs. We supplement this result with experiments suggesting possible connections between our greedy approximation algorithm and gradient-based methods. △ Less

Submitted 23 September, 2021; originally announced September 2021.

arXiv:2109.06915 [pdf, other]

Reconstruction on Trees and Low-Degree Polynomials

Authors: Frederic Koehler, Elchanan Mossel

Abstract: The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, graphical models, phylogenetic reconstruction, Markov Chain Monte Carlo, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves Bayes-optimal performance for the reconstruction problem of predicting the value of the… ▽ More The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, graphical models, phylogenetic reconstruction, Markov Chain Monte Carlo, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves Bayes-optimal performance for the reconstruction problem of predicting the value of the Markov process at the root of the tree from its values at the leaves. Recently, the analysis of low-degree polynomials has emerged as a valuable tool for predicting computational-to-statistical gaps. In this work, we investigate the performance of low-degree polynomials for the reconstruction problem on trees. Perhaps surprisingly, we show that there are simple tree models with $N$ leaves and bounded arity where (1) nontrivial reconstruction of the root value is possible with a simple polynomial time algorithm and with robustness to noise, but not with any polynomial of degree $N^{c}$ for $c > 0$ a constant depending only on the arity, and (2) when the tree is unknown and given multiple samples with correlated root assignments, nontrivial reconstruction of the root value is possible with a simple Statistical Query algorithm but not with any polynomial of degree $N^c$. These results clarify some of the limitations of low-degree polynomials vs. polynomial time algorithms for Bayesian estimation problems. They also complement recent work of Moitra, Mossel, and Sandon who studied the circuit complexity of Belief Propagation. As a consequence of our main result, we show that for some $c' > 0$ depending only on the arity, $\exp(N^{c'})$ many samples are needed for RBF kernel regression to obtain nontrivial correlation with the true regression function (BP). We pose related open questions about low-degree polynomials and the Kesten-Stigum threshold. △ Less

Submitted 25 October, 2022; v1 submitted 14 September, 2021; originally announced September 2021.

Comments: 28 pages, comments welcome. V2: added new result and more exposition. abstract shortened for arxiv

arXiv:2106.09276 [pdf, other]

Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting

Authors: Frederic Koehler, Lijia Zhou, Danica J. Sutherland, Nathan Srebro

Abstract: We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators… ▽ More We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings. △ Less

Submitted 3 January, 2022; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: v2: Minor changes only. As published at NeurIPS 2021: https://proceedings.neurips.cc/paper/2021/hash/ac9815bef801f58de83804bce86984ad-Abstract.html

arXiv:2106.09207 [pdf, other]

On the Power of Preconditioning in Sparse Linear Regression

Authors: Jonathan Kelner, Frederic Koehler, Raghu Meka, Dhruv Rohatgi

Abstract: Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian $N(0,Σ)$ with $Σ: n \times n$, and seek estimators $\hat{w}$ minimizing… ▽ More Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian $N(0,Σ)$ with $Σ: n \times n$, and seek estimators $\hat{w}$ minimizing $(\hat{w}-w^*)^TΣ(\hat{w}-w^*)$, where $w^*$ is the $k$-sparse ground truth. Information theoretically, one can achieve strong error bounds with $O(k \log n)$ samples for arbitrary $Σ$ and $w^*$; however, no efficient algorithms are known to match these guarantees even with $o(n)$ samples, without further assumptions on $Σ$ or $w^*$. As far as hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change-of-basis (i.e. preconditioning). In this work, we give upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if $Σ$ is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. In fact, we complete our treewidth classification by proving that for any treewidth-$t$ graph, there exists a Gaussian Markov Random Field on this graph such that the preconditioned Lasso, with any choice of preconditioner, requires $Ω(t^{1/20})$ samples to recover $O(\log n)$-sparse signals when covariates are drawn from this model. △ Less

Submitted 16 June, 2021; originally announced June 2021.

Comments: 73 pages, 5 figures

arXiv:2106.04105 [pdf, ps, other]

Entropic Independence I: Modified Log-Sobolev Inequalities for Fractionally Log-Concave Distributions and High-Temperature Ising Models

Authors: Nima Anari, Vishesh Jain, Frederic Koehler, Huy Tuan Pham, Thuy-Duong Vuong

Abstract: We introduce a notion called entropic independence that is an entropic analog of spectral notions of high-dimensional expansion. Informally, entropic independence of a background distribution $μ$ on $k$-sized subsets of a ground set of elements says that for any (possibly randomly chosen) set $S$, the relative entropy of a single element of $S$ drawn uniformly at random carries at most $O(1/k)$ fr… ▽ More We introduce a notion called entropic independence that is an entropic analog of spectral notions of high-dimensional expansion. Informally, entropic independence of a background distribution $μ$ on $k$-sized subsets of a ground set of elements says that for any (possibly randomly chosen) set $S$, the relative entropy of a single element of $S$ drawn uniformly at random carries at most $O(1/k)$ fraction of the relative entropy of $S$. Entropic independence is the analog of the notion of spectral independence, if one replaces variance by entropy. We use entropic independence to derive tight mixing time bounds, overcoming the lossy nature of spectral analysis of Markov chains on exponential-sized state spaces. In our main technical result, we show a general way of deriving entropy contraction, a.k.a. modified log-Sobolev inequalities, for down-up random walks from spectral notions. We show that spectral independence of a distribution under arbitrary external fields automatically implies entropic independence. To derive our results, we relate entropic independence to properties of polynomials: $μ$ is entropically independent exactly when a transformed version of the generating polynomial of $μ$ is upper bounded by its linear tangent; this property is implied by concavity of the said transformation, which was shown by prior work to be locally equivalent to spectral independence. We apply our results to obtain tight modified log-Sobolev inequalities and mixing times for multi-step down-up walks on fractionally log-concave distributions. As our flagship application, we establish the tight mixing time of $O(n\log n)$ for Glauber dynamics on Ising models whose interaction matrix has eigenspectrum lying within an interval of length smaller than $1$, improving upon the prior quadratic dependence on $n$. △ Less

Submitted 4 November, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

arXiv:2106.03969 [pdf, other]

Chow-Liu++: Optimal Prediction-Centric Learning of Tree Ising Models

Authors: Enric Boix-Adsera, Guy Bresler, Frederic Koehler

Abstract: We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been… ▽ More We consider the problem of learning a tree-structured Ising model from data, such that subsequent predictions computed using the model are accurate. Concretely, we aim to learn a model such that posteriors $P(X_i|X_S)$ for small sets of variables $S$ are accurate. Since its introduction more than 50 years ago, the Chow-Liu algorithm, which efficiently computes the maximum likelihood tree, has been the benchmark algorithm for learning tree-structured graphical models. A bound on the sample complexity of the Chow-Liu algorithm with respect to the prediction-centric local total variation loss was shown in [BK19]. While those results demonstrated that it is possible to learn a useful model even when recovering the true underlying graph is impossible, their bound depends on the maximum strength of interactions and thus does not achieve the information-theoretic optimum. In this paper, we introduce a new algorithm that carefully combines elements of the Chow-Liu algorithm with tree metric reconstruction methods to efficiently and optimally learn tree Ising models under a prediction-centric loss. Our algorithm is robust to model misspecification and adversarial corruptions. In contrast, we show that the celebrated Chow-Liu algorithm can be arbitrarily suboptimal. △ Less

Submitted 23 November, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: 49 pages, 3 figures, to appear in FOCS'21

arXiv:2010.04157 [pdf, other]

Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination

Authors: Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau

Abstract: In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness g… ▽ More In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees even without distributional assumptions altogether, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact our algorithms succeed where conventional methods fail. In particular we show strong lower bounds against Huber regression and more generally any convex M-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least-squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level $η$, reach the optimal breakdown point, and naturally apply to infinite dimensional settings where the feature vectors are represented implicitly via a kernel map. △ Less

Submitted 10 June, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

Comments: 66 pages, 1 figure, v3: refined exposition and improved rates

arXiv:2010.01155 [pdf, other]

Representational aspects of depth and conditioning in normalizing flows

Authors: Frederic Koehler, Viraj Mehta, Andrej Risteski

Abstract: Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: model… ▽ More Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep -- which comes with accompanying vanishing/exploding gradient problems. A very related problem is that they are often poorly conditioned: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data like images intuitively is lower-dimensional, the learned maps often have Jacobians that are close to being singular. In our paper, we tackle representational aspects around depth and conditioning of normalizing flows: both for general invertible architectures, and for a particular common architecture, affine couplings. We prove that $Θ(1)$ affine coupling layers suffice to exactly represent a permutation or $1 \times 1$ convolution, as used in GLOW, showing that representationally the choice of partition is not a bottleneck for depth. We also show that shallow affine coupling networks are universal approximators in Wasserstein distance if ill-conditioning is allowed, and experimentally investigate related phenomena involving padding. Finally, we show a depth lower bound for general flow architectures with few neurons per layer and bounded Lipschitz constant. △ Less

Submitted 25 June, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

Comments: Appeared in ICML 2021

arXiv:2007.12815 [pdf, other]

From Boltzmann Machines to Neural Networks and Back Again

Authors: Surbhi Goel, Adam Klivans, Frederic Koehler

Abstract: Graphical models are powerful tools for modeling high-dimensional data, but learning graphical models in the presence of latent variables is well-known to be difficult. In this work we give new results for learning Restricted Boltzmann Machines, probably the most well-studied class of latent variable models. Our results are based on new connections to learning two-layer neural networks under… ▽ More Graphical models are powerful tools for modeling high-dimensional data, but learning graphical models in the presence of latent variables is well-known to be difficult. In this work we give new results for learning Restricted Boltzmann Machines, probably the most well-studied class of latent variable models. Our results are based on new connections to learning two-layer neural networks under $\ell_{\infty}$ bounded input; for both problems, we give nearly optimal results under the conjectured hardness of sparse parity with noise. Using the connection between RBMs and feedforward networks, we also initiate the theoretical study of $supervised~RBMs$ [Hinton, 2012], a version of neural-network learning that couples distributional assumptions induced from the underlying graphical model with the architecture of the unknown function class. We then give an algorithm for learning a natural class of supervised RBMs with better runtime than what is possible for its related class of networks without distributional assumptions. △ Less

Submitted 24 July, 2020; originally announced July 2020.

arXiv:2006.04787 [pdf, other]

Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Connections to Evolvability

Authors: Sitan Chen, Frederic Koehler, Ankur Moitra, Morris Yau

Abstract: In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $η$. In a recent work, Diakonikolas, Goulekakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $η+ ε$ for any $ε> 0$. However, their algorithm outputs a compli… ▽ More In this paper we revisit some classic problems on classification under misspecification. In particular, we study the problem of learning halfspaces under Massart noise with rate $η$. In a recent work, Diakonikolas, Goulekakis, and Tzamos resolved a long-standing problem by giving the first efficient algorithm for learning to accuracy $η+ ε$ for any $ε> 0$. However, their algorithm outputs a complicated hypothesis, which partitions space into $\text{poly}(d,1/ε)$ regions. Here we give a much simpler algorithm and in the process resolve a number of outstanding open questions: (1) We give the first proper learner for Massart halfspaces that achieves $η+ ε$. We also give improved bounds on the sample complexity achievable by polynomial time algorithms. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to evolvability, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + ε$. Moreover we study generalized linear models where $\mathbb{E}[Y|\mathbf{X}] = σ(\langle \mathbf{w}^*, \mathbf{X}\rangle)$ for any odd, monotone, and Lipschitz function $σ$. This family includes the previously mentioned halfspace models as a special case, but is much richer and includes other fundamental models like logistic regression. We introduce a challenging new corruption model that generalizes Massart noise, and give a general algorithm for learning in this setting. Our algorithms are based on a small set of core recipes for learning to classify in the presence of misspecification. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties. △ Less

Submitted 20 September, 2023; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: 52 pages, v2: updated references

arXiv:2004.12580 [pdf, other]

A Phase Transition in Arrow's Theorem

Authors: Frederic Koehler, Elchanan Mossel

Abstract: Arrow's Theorem concerns a fundamental problem in social choice theory: given the individual preferences of members of a group, how can they be aggregated to form rational group preferences? Arrow showed that in an election between three or more candidates, there are situations where any voting rule satisfying a small list of natural "fairness" axioms must produce an apparently irrational intransi… ▽ More Arrow's Theorem concerns a fundamental problem in social choice theory: given the individual preferences of members of a group, how can they be aggregated to form rational group preferences? Arrow showed that in an election between three or more candidates, there are situations where any voting rule satisfying a small list of natural "fairness" axioms must produce an apparently irrational intransitive outcome. Furthermore, quantitative versions of Arrow's Theorem in the literature show that when voters choose rankings in an i.i.d.\ fashion, the outcome is intransitive with non-negligible probability. It is natural to ask if such a quantitative version of Arrow's Theorem holds for non-i.i.d.\ models. To answer this question, we study Arrow's Theorem under a natural non-i.i.d.\ model of voters inspired by canonical models in statistical physics; indeed, a version of this model was previously introduced by Raffaelli and Marsili in the physics literature. This model has a parameter, temperature, that prescribes the correlation between different voters. We show that the behavior of Arrow's Theorem in this model undergoes a striking phase transition: in the entire high temperature regime of the model, a Quantitative Arrow's Theorem holds showing that the probability of paradox for any voting rule satisfying the axioms is non-negligible; this is tight because the probability of paradox under pairwise majority goes to zero when approaching the critical temperature, and becomes exponentially small in the number of voters beyond it. We prove this occurs in another natural model of correlated voters and conjecture this phenomena is quite general. △ Less

Submitted 24 September, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

Comments: 52 pages; comments welcome!

arXiv:1905.10031 [pdf, ps, other]

Accuracy-Memory Tradeoffs and Phase Transitions in Belief Propagation

Authors: Vishesh Jain, Frederic Koehler, Jingbo Liu, Elchanan Mossel

Abstract: The analysis of Belief Propagation and other algorithms for the {\em reconstruction problem} plays a key role in the analysis of community detection in inference on graphs, phylogenetic reconstruction in bioinformatics, and the cavity method in statistical physics. We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is… ▽ More The analysis of Belief Propagation and other algorithms for the {\em reconstruction problem} plays a key role in the analysis of community detection in inference on graphs, phylogenetic reconstruction in bioinformatics, and the cavity method in statistical physics. We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is statistically much weaker than Belief Propagation for the reconstruction problem. More formally, any recursive algorithm with bounded memory for the reconstruction problem on the trees with the binary symmetric channel has a phase transition strictly below the Belief Propagation threshold, also known as the Kesten-Stigum bound. The proof combines in novel fashion tools from recursive reconstruction, information theory, and optimal transport, and also establishes an asymptotic normality result for BP and other message-passing algorithms near the critical threshold. △ Less

Submitted 24 May, 2019; originally announced May 2019.

Comments: To be presented on COLT 2019

arXiv:1905.09992 [pdf, other]

Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay

Authors: Frederic Koehler

Abstract: Belief propagation is a fundamental message-passing algorithm for probabilistic reasoning and inference in graphical models. While it is known to be exact on trees, in most applications belief propagation is run on graphs with cycles. Understanding the behavior of "loopy" belief propagation has been a major challenge for researchers in machine learning, and positive convergence results for BP are… ▽ More Belief propagation is a fundamental message-passing algorithm for probabilistic reasoning and inference in graphical models. While it is known to be exact on trees, in most applications belief propagation is run on graphs with cycles. Understanding the behavior of "loopy" belief propagation has been a major challenge for researchers in machine learning, and positive convergence results for BP are known under strong assumptions which imply the underlying graphical model exhibits decay of correlations. We show that under a natural initialization, BP converges quickly to the global optimum of the Bethe free energy for Ising models on arbitrary graphs, as long as the Ising model is \emph{ferromagnetic} (i.e. neighbors prefer to be aligned). This holds even though such models can exhibit long range correlations and may have multiple suboptimal BP fixed points. We also show an analogous result for iterating the (naive) mean-field equations; perhaps surprisingly, both results are dimension-free in the sense that a constant number of iterations already provides a good estimate to the Bethe/mean-field free energy. △ Less

Submitted 23 May, 2019; originally announced May 2019.

Comments: 24 pages; comments welcome!

arXiv:1905.01282 [pdf, other]

Learning Some Popular Gaussian Graphical Models without Condition Number Bounds

Authors: Jonathan Kelner, Frederic Koehler, Raghu Meka, Ankur Moitra

Abstract: Gaussian Graphical Models (GGMs) have wide-ranging applications in machine learning and the natural and social sciences. In most of the settings in which they are applied, the number of observed samples is much smaller than the dimension and they are assumed to be sparse. While there are a variety of algorithms (e.g. Graphical Lasso, CLIME) that provably recover the graph structure with a logarith… ▽ More Gaussian Graphical Models (GGMs) have wide-ranging applications in machine learning and the natural and social sciences. In most of the settings in which they are applied, the number of observed samples is much smaller than the dimension and they are assumed to be sparse. While there are a variety of algorithms (e.g. Graphical Lasso, CLIME) that provably recover the graph structure with a logarithmic number of samples, they assume various conditions that require the precision matrix to be in some sense well-conditioned. Here we give the first polynomial-time algorithms for learning attractive GGMs and walk-summable GGMs with a logarithmic number of samples without any such assumptions. In particular, our algorithms can tolerate strong dependencies among the variables. Our result for structure recovery in walk-summable GGMs is derived from a more general result for efficient sparse linear regression in walk-summable models without any norm dependencies. We complement our results with experiments showing that many existing algorithms fail even in some simple settings where there are long dependency chains, whereas ours do not. △ Less

Submitted 8 March, 2020; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: V2: Updated version with some new results

arXiv:1808.07226 [pdf, ps, other]

Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective

Authors: Vishesh Jain, Frederic Koehler, Andrej Risteski

Abstract: The free energy is a key quantity of interest in Ising models, but unfortunately, computing it in general is computationally intractable. Two popular (variational) approximation schemes for estimating the free energy of general Ising models (in particular, even in regimes where correlation decay does not hold) are: (i) the mean-field approximation with roots in statistical physics, which estimates… ▽ More The free energy is a key quantity of interest in Ising models, but unfortunately, computing it in general is computationally intractable. Two popular (variational) approximation schemes for estimating the free energy of general Ising models (in particular, even in regimes where correlation decay does not hold) are: (i) the mean-field approximation with roots in statistical physics, which estimates the free energy from below, and (ii) hierarchies of convex relaxations with roots in theoretical computer science, which estimate the free energy from above. We show, surprisingly, that the tight regime for both methods to compute the free energy to leading order is identical. More precisely, we show that the mean-field approximation is within $O((n\|J\|_{F})^{2/3})$ of the free energy, where $\|J\|_F$ denotes the Frobenius norm of the interaction matrix of the Ising model. This simultaneously subsumes both the breakthrough work of Basak and Mukherjee, who showed the tight result that the mean-field approximation is within $o(n)$ whenever $\|J\|_{F} = o(\sqrt{n})$, as well as the work of Jain, Koehler, and Mossel, who gave the previously best known non-asymptotic bound of $O((n\|J\|_{F})^{2/3}\log^{1/3}(n\|J\|_{F}))$. We give a simple, algorithmic proof of this result using a convex relaxation proposed by Risteski based on the Sherali-Adams hierarchy, automatically giving sub-exponential time approximation schemes for the free energy in this entire regime. Our algorithmic result is tight under Gap-ETH. We furthermore combine our techniques with spin glass theory to prove (in a strong sense) the optimality of correlation rounding, refuting a recent conjecture of Allen, O'Donnell, and Zhou. Finally, we give the tight generalization of all of these results to $k$-MRFs, capturing as a special case previous work on approximating MAX-$k$-CSP. △ Less

Submitted 28 August, 2018; v1 submitted 22 August, 2018; originally announced August 2018.

Comments: This version: minor formatting changes, added grant acknowledgements

arXiv:1805.11405 [pdf, ps, other]

Representational Power of ReLU Networks and Polynomial Kernels: Beyond Worst-Case Analysis

Authors: Frederic Koehler, Andrej Risteski

Abstract: There has been a large amount of interest, both in the past and particularly recently, into the power of different families of universal approximators, e.g. ReLU networks, polynomials, rational functions. However, current research has focused almost exclusively on understanding this problem in a worst-case setting, e.g. bounding the error of the best infinity-norm approximation in a box. In this s… ▽ More There has been a large amount of interest, both in the past and particularly recently, into the power of different families of universal approximators, e.g. ReLU networks, polynomials, rational functions. However, current research has focused almost exclusively on understanding this problem in a worst-case setting, e.g. bounding the error of the best infinity-norm approximation in a box. In this setting a high-degree polynomial is required to even approximate a single ReLU. However, in real applications with high dimensional data we expect it is only important to approximate the desired function well on certain relevant parts of its domain. With this motivation, we analyze the ability of neural networks and polynomial kernels of bounded degree to achieve good statistical performance on a simple, natural inference problem with sparse latent structure. We give almost-tight bounds on the performance of both neural networks and low degree polynomials for this problem. Our bounds for polynomials involve new techniques which may be of independent interest and show major qualitative differences with what is known in the worst-case setting. △ Less

Submitted 29 May, 2018; originally announced May 2018.

arXiv:1805.10262 [pdf, other]

Learning Restricted Boltzmann Machines via Influence Maximization

Authors: Guy Bresler, Frederic Koehler, Ankur Moitra, Elchanan Mossel

Abstract: Graphical models are a rich language for describing high-dimensional distributions in terms of their dependence structure. While there are algorithms with provable guarantees for learning undirected graphical models in a variety of settings, there has been much less progress in the important scenario when there are latent variables. Here we study Restricted Boltzmann Machines (or RBMs), which are… ▽ More Graphical models are a rich language for describing high-dimensional distributions in terms of their dependence structure. While there are algorithms with provable guarantees for learning undirected graphical models in a variety of settings, there has been much less progress in the important scenario when there are latent variables. Here we study Restricted Boltzmann Machines (or RBMs), which are a popular model with wide-ranging applications in dimensionality reduction, collaborative filtering, topic modeling, feature extraction and deep learning. The main message of our paper is a strong dichotomy in the feasibility of learning RBMs, depending on the nature of the interactions between variables: ferromagnetic models can be learned efficiently, while general models cannot. In particular, we give a simple greedy algorithm based on influence maximization to learn ferromagnetic RBMs with bounded degree. In fact, we learn a description of the distribution on the observed variables as a Markov Random Field. Our analysis is based on tools from mathematical physics that were developed to show the concavity of magnetization. Our algorithm extends straighforwardly to general ferromagnetic Ising models with latent variables. Conversely, we show that even for a contant number of latent variables with constant degree, without ferromagneticity the problem is as hard as sparse parity with noise. This hardness result is based on a sharp and surprising characterization of the representational power of bounded degree RBMs: the distribution on their observed variables can simulate any bounded order MRF. This result is of independent interest since RBMs are the building blocks of deep belief networks. △ Less

Submitted 5 November, 2018; v1 submitted 25 May, 2018; originally announced May 2018.

Comments: 29 pages

arXiv:1802.06129 [pdf, ps, other]

The Vertex Sample Complexity of Free Energy is Polynomial

Authors: Vishesh Jain, Frederic Koehler, Elchanan Mossel

Abstract: We study the following question: given a massive Markov random field on $n$ nodes, can a small sample from it provide a rough approximation to the free energy $\mathcal{F}_n = \log{Z_n}$? Results in graph limit literature by Borgs, Chayes, Lovász, Sós, and Vesztergombi show that for Ising models on $n$ nodes and interactions of strength $Θ(1/n)$, an $ε$ approximation to $\log Z_n / n$ can be ach… ▽ More We study the following question: given a massive Markov random field on $n$ nodes, can a small sample from it provide a rough approximation to the free energy $\mathcal{F}_n = \log{Z_n}$? Results in graph limit literature by Borgs, Chayes, Lovász, Sós, and Vesztergombi show that for Ising models on $n$ nodes and interactions of strength $Θ(1/n)$, an $ε$ approximation to $\log Z_n / n$ can be achieved by sampling a randomly induced model on $2^{O(1/ε^2)}$ nodes. We show that the sampling complexity of this problem is {\em polynomial in} $1/ε$. We further show a polynomial dependence on $ε$ cannot be avoided. Our results are very general as they apply to higher order Markov random fields. For Markov random fields of order $r$, we obtain an algorithm that achieves $ε$ approximation using a number of samples polynomial in $r$ and $1/ε$ and running time that is $2^{O(1/ε^2)}$ up to polynomial factors in $r$ and $ε$. For ferromagnetic Ising models, the running time is polynomial in $1/ε$. Our results are intimately connected to recent research on the regularity lemma and property testing, where the interest is in finding which properties can tested within $ε$ error in time polynomial in $1/ε$. In particular, our proofs build on results from a recent work by Alon, de la Vega, Kannan and Karpinski, who also introduced the notion of polynomial vertex sample complexity. Another critical ingredient of the proof is an effective bound by the authors of the paper relating the variational free energy and the free energy. △ Less

Submitted 23 February, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: arXiv admin note: text overlap with arXiv:1802.06126 Updated bibliography

arXiv:1802.06126 [pdf, ps, other]

The Mean-Field Approximation: Information Inequalities, Algorithms, and Complexity

Authors: Vishesh Jain, Frederic Koehler, Elchanan Mossel

Abstract: The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by… ▽ More The mean field approximation to the Ising model is a canonical variational tool that is used for analysis and inference in Ising models. We provide a simple and optimal bound for the KL error of the mean field approximation for Ising models on general graphs, and extend it to higher order Markov random fields. Our bound improves on previous bounds obtained in work in the graph limit literature by Borgs, Chayes, Lovász, Sós, and Vesztergombi and another recent work by Basak and Mukherjee. Our bound is tight up to lower order terms. Building on the methods used to prove the bound, along with techniques from combinatorics and optimization, we study the algorithmic problem of estimating the (variational) free energy for Ising models and general Markov random fields. For a graph $G$ on $n$ vertices and interaction matrix $J$ with Frobenius norm $\| J \|_F$, we provide algorithms that approximate the free energy within an additive error of $εn \|J\|_F$ in time $\exp(poly(1/ε))$. We also show that approximation within $(n \|J\|_F)^{1-δ}$ is NP-hard for every $δ> 0$. Finally, we provide more efficient approximation algorithms, which find the optimal mean field approximation, for ferromagnetic Ising models and for Ising models satisfying Dobrushin's condition. △ Less

Submitted 20 February, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

Comments: Updated bibliography

arXiv:1711.01655 [pdf, ps, other]

Approximating Partition Functions in Constant Time

Authors: Vishesh Jain, Frederic Koehler, Elchanan Mossel

Abstract: We study approximations of the partition function of dense graphical models. Partition functions of graphical models play a fundamental role is statistical physics, in statistics and in machine learning. Two of the main methods for approximating the partition function are Markov Chain Monte Carlo and Variational Methods. An impressive body of work in mathematics, physics and theoretical computer s… ▽ More We study approximations of the partition function of dense graphical models. Partition functions of graphical models play a fundamental role is statistical physics, in statistics and in machine learning. Two of the main methods for approximating the partition function are Markov Chain Monte Carlo and Variational Methods. An impressive body of work in mathematics, physics and theoretical computer science provides conditions under which Markov Chain Monte Carlo methods converge in polynomial time. These methods often lead to polynomial time approximation algorithms for the partition function in cases where the underlying model exhibits correlation decay. There are very few theoretical guarantees for the performance of variational methods. One exception is recent results by Risteski (2016) who considered dense graphical models and showed that using variational methods, it is possible to find an $O(εn)$ additive approximation to the log partition function in time $n^{O(1/ε^2)}$ even in a regime where correlation decay does not hold. We show that under essentially the same conditions, an $O(εn)$ additive approximation of the log partition function can be found in constant time, independent of $n$. In particular, our results cover dense Ising and Potts models as well as dense graphical models with $k$-wise interaction. They also apply for low threshold rank models. △ Less

Submitted 20 February, 2018; v1 submitted 5 November, 2017; originally announced November 2017.

Comments: This preprint is completely subsumed by preprints arXiv:1802.06126 and arXiv:1802.06129 by the same authors which also include important references that are missing in the current preprint

arXiv:1705.11107 [pdf, other]

Information Theoretic Properties of Markov Random Fields, and their Algorithmic Applications

Authors: Linus Hamilton, Frederic Koehler, Ankur Moitra

Abstract: Markov random fields area popular model for high-dimensional probability distributions. Over the years, many mathematical, statistical and algorithmic problems on them have been studied. Until recently, the only known algorithms for provably learning them relied on exhaustive search, correlation decay or various incoherence assumptions. Bresler gave an algorithm for learning general Ising models o… ▽ More Markov random fields area popular model for high-dimensional probability distributions. Over the years, many mathematical, statistical and algorithmic problems on them have been studied. Until recently, the only known algorithms for provably learning them relied on exhaustive search, correlation decay or various incoherence assumptions. Bresler gave an algorithm for learning general Ising models on bounded degree graphs. His approach was based on a structural result about mutual information in Ising models. Here we take a more conceptual approach to proving lower bounds on the mutual information through setting up an appropriate zero-sum game. Our proof generalizes well beyond Ising models, to arbitrary Markov random fields with higher order interactions. As an application, we obtain algorithms for learning Markov random fields on bounded degree graphs on $n$ nodes with $r$-order interactions in $n^r$ time and $\log n$ sample complexity. The sample complexity is information theoretically optimal up to the dependence on the maximum degree. The running time is nearly optimal under standard conjectures about the hardness of learning parity with noise. △ Less

Submitted 31 May, 2017; originally announced May 2017.

Comments: 25 pages

arXiv:1605.08491 [pdf, other]

Provable Algorithms for Inference in Topic Models

Authors: Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, Ankur Moitra

Abstract: Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us… ▽ More Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds that for shorter documents it can be information theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a {\em single} iteration of Gibbs sampling. △ Less

Submitted 26 May, 2016; originally announced May 2016.

Comments: to appear at ICML'2016

Showing 1–37 of 37 results for author: Koehler, F