Machine Learning
See recent articles
- [1] arXiv:2407.11268 [pdf, html, other]
-
Title: Heterogenous Multi-Source Data Fusion Through Input Mapping and Latent Variable Gaussian ProcessComments: 20 Pages,9 Figures, Data is available per requestSubjects: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Artificial intelligence and machine learning frameworks have served as computationally efficient mapping between inputs and outputs for engineering problems. These mappings have enabled optimization and analysis routines that have warranted superior designs, ingenious material systems and optimized manufacturing processes. A common occurrence in such modeling endeavors is the existence of multiple source of data, each differentiated by fidelity, operating conditions, experimental conditions, and more. Data fusion frameworks have opened the possibility of combining such differentiated sources into single unified models, enabling improved accuracy and knowledge transfer. However, these frameworks encounter limitations when the different sources are heterogeneous in nature, i.e., not sharing the same input parameter space. These heterogeneous input scenarios can occur when the domains differentiated by complexity, scale, and fidelity require different parametrizations. Towards addressing this void, a heterogeneous multi-source data fusion framework is proposed based on input mapping calibration (IMC) and latent variable Gaussian process (LVGP). In the first stage, the IMC algorithm is utilized to transform the heterogeneous input parameter spaces into a unified reference parameter space. In the second stage, a multi-source data fusion model enabled by LVGP is leveraged to build a single source-aware surrogate model on the transformed reference space. The proposed framework is demonstrated and analyzed on three engineering case studies (design of cantilever beam, design of ellipsoidal void and modeling properties of Ti6Al4V alloy). The results indicate that the proposed framework provides improved predictive accuracy over a single source model and transformed but source unaware model.
- [2] arXiv:2407.11353 [pdf, other]
-
Title: Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric RegressionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or its variant in this paper. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\cO({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper the current standard rate of $\cO({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\RR^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that neural network trained with regular GD with early stopping still enjoys minimax optimal rate, and in this case our results do not require distributional assumptions in contrast with the current known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during the training process by PGD or GD, so that we can have a nice decomposition of the neural network function at any step of GD or PGD into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of NTK and obtaining sharper generalization bound, because PGD induces a different kernel with lower kernel complexity during the training than the regular NTK induced by the network architecture trained by regular GD.
- [3] arXiv:2407.11518 [pdf, html, other]
-
Title: Ensemble Transport Filter via Optimized Maximum Mean DiscrepancyComments: 27 pages, 14 figuresSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Other Statistics (stat.OT)
In this paper, we present a new ensemble-based filter method by reconstructing the analysis step of the particle filter through a transport map, which directly transports prior particles to posterior particles. The transport map is constructed through an optimization problem described by the Maximum Mean Discrepancy loss function, which matches the expectation information of the approximated posterior and reference posterior. The proposed method inherits the accurate estimation of the posterior distribution from particle filtering. To improve the robustness of Maximum Mean Discrepancy, a variance penalty term is used to guide the optimization. It prioritizes minimizing the discrepancy between the expectations of highly informative statistics for the approximated and reference posteriors. The penalty term significantly enhances the robustness of the proposed method and leads to a better approximation of the posterior. A few numerical examples are presented to illustrate the advantage of the proposed method over the ensemble Kalman filter.
- [4] arXiv:2407.11873 [pdf, other]
-
Title: Variance Norms for Kernelized Anomaly DetectionSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
We present a unified theory for Mahalanobis-type anomaly detection on Banach spaces, using ideas from Cameron-Martin theory applied to non-Gaussian measures. This approach leads to a basis-free, data-driven notion of anomaly distance through the so-called variance norm of a probability measure, which can be consistently estimated using empirical measures. Our framework generalizes the classical $\mathbb{R}^d$, functional $(L^2[0,1])^d$, and kernelized settings, including the general case of non-injective covariance operator. We prove that the variance norm depends solely on the inner product in a given Hilbert space, and hence that the kernelized Mahalanobis distance can naturally be recovered by working on reproducing kernel Hilbert spaces.
Using the variance norm, we introduce the notion of a kernelized nearest-neighbour Mahalanobis distance for semi-supervised anomaly detection. In an empirical study on 12 real-world datasets, we demonstrate that the kernelized nearest-neighbour Mahalanobis distance outperforms the traditional kernelized Mahalanobis distance for multivariate time series anomaly detection, using state-of-the-art time series kernels such as the signature, global alignment, and Volterra reservoir kernels. Moreover, we provide an initial theoretical justification of nearest-neighbour Mahalanobis distances by developing concentration inequalities in the finite-dimensional Gaussian case. - [5] arXiv:2407.11901 [pdf, html, other]
-
Title: Combining Wasserstein-1 and Wasserstein-2 proximals: robust manifold learning via well-posed generative flowsSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
We formulate well-posed continuous-time generative flows for learning distributions that are supported on low-dimensional manifolds through Wasserstein proximal regularizations of $f$-divergences. Wasserstein-1 proximal operators regularize $f$-divergences so that singular distributions can be compared. Meanwhile, Wasserstein-2 proximal operators regularize the paths of the generative flows by adding an optimal transport cost, i.e., a kinetic energy penalization. Via mean-field game theory, we show that the combination of the two proximals is critical for formulating well-posed generative flows. Generative flows can be analyzed through optimality conditions of a mean-field game (MFG), a system of a backward Hamilton-Jacobi (HJ) and a forward continuity partial differential equations (PDEs) whose solution characterizes the optimal generative flow. For learning distributions that are supported on low-dimensional manifolds, the MFG theory shows that the Wasserstein-1 proximal, which addresses the HJ terminal condition, and the Wasserstein-2 proximal, which addresses the HJ dynamics, are both necessary for the corresponding backward-forward PDE system to be well-defined and have a unique solution with provably linear flow trajectories. This implies that the corresponding generative flow is also unique and can therefore be learned in a robust manner even for learning high-dimensional distributions supported on low-dimensional manifolds. The generative flows are learned through adversarial training of continuous-time flows, which bypasses the need for reverse simulation. We demonstrate the efficacy of our approach for generating high-dimensional images without the need to resort to autoencoders or specialized architectures.
- [6] arXiv:2407.11927 [pdf, html, other]
-
Title: Bayesian Causal Forests for Longitudinal Data: Assessing the Impact of Part-Time Work on Growth in High School Mathematics AchievementComments: 25 pages, 7 figures, 3 tablesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Modelling growth in student achievement is a significant challenge in the field of education. Understanding how interventions or experiences such as part-time work can influence this growth is also important. Traditional methods like difference-in-differences are effective for estimating causal effects from longitudinal data. Meanwhile, Bayesian non-parametric methods have recently become popular for estimating causal effects from single time point observational studies. However, there remains a scarcity of methods capable of combining the strengths of these two approaches to flexibly estimate heterogeneous causal effects from longitudinal data. Motivated by two waves of data from the High School Longitudinal Study, the NCES' most recent longitudinal study which tracks a representative sample of over 20,000 students in the US, our study introduces a longitudinal extension of Bayesian Causal Forests. This model allows for the flexible identification of both individual growth in mathematical ability and the effects of participation in part-time work. Simulation studies demonstrate the predictive performance and reliable uncertainty quantification of the proposed model. Results reveal the negative impact of part time work for most students, but hint at potential benefits for those students with an initially low sense of school belonging. Clear signs of a widening achievement gap between students with high and low academic achievement are also identified. Potential policy implications are discussed, along with promising areas for future research.
New submissions for Wednesday, 17 July 2024 (showing 6 of 6 entries )
- [7] arXiv:2407.11029 (cross-list from cs.LG) [pdf, html, other]
-
Title: A Geometric Framework for Adversarial Vulnerability in Machine LearningComments: 165 pages, PhD thesis, many figuresSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
This work starts with the intention of using mathematics to understand the intriguing vulnerability observed by ~\citet{szegedy2013} within artificial neural networks. Along the way, we will develop some novel tools with applications far outside of just the adversarial domain. We will do this while developing a rigorous mathematical framework to examine this problem. Our goal is to build out theory which can support increasingly sophisticated conjecture about adversarial attacks with a particular focus on the so called ``Dimpled Manifold Hypothesis'' by ~\citet{shamir2021dimpled}. Chapter one will cover the history and architecture of neural network architectures. Chapter two is focused on the background of adversarial vulnerability. Starting from the seminal paper by ~\citet{szegedy2013} we will develop the theory of adversarial perturbation and attack.
Chapter three will build a theory of persistence that is related to Ricci Curvature, which can be used to measure properties of decision boundaries. We will use this foundation to make a conjecture relating adversarial attacks. Chapters four and five represent a sudden and wonderful digression that examines an intriguing related body of theory for spatial analysis of neural networks as approximations of kernel machines and becomes a novel theory for representing neural networks with bilinear maps. These heavily mathematical chapters will set up a framework and begin exploring applications of what may become a very important theoretical foundation for analyzing neural network learning with spatial and geometric information. We will conclude by setting up our new methods to address the conjecture from chapter 3 in continuing research. - [8] arXiv:2407.11035 (cross-list from stat.ME) [pdf, html, other]
-
Title: Optimal estimators of cross-partial derivatives and surrogates of functionsSubjects: Methodology (stat.ME); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO); Machine Learning (stat.ML)
Computing cross-partial derivatives using fewer model runs is relevant in modeling, such as stochastic approximation, derivative-based ANOVA, exploring complex models, and active subspaces. This paper introduces surrogates of all the cross-partial derivatives of functions by evaluating such functions at $N$ randomized points and using a set of $L$ constraints. Randomized points rely on independent, central, and symmetric variables. The associated estimators, based on $NL$ model runs, reach the optimal rates of convergence (i.e., $\mathcal{O}(N^{-1})$), and the biases of our approximations do not suffer from the curse of dimensionality for a wide class of functions. Such results are used for i) computing the main and upper-bounds of sensitivity indices, and ii) deriving emulators of simulators or surrogates of functions thanks to the derivative-based ANOVA. Simulations are presented to show the accuracy of our emulators and estimators of sensitivity indices. The plug-in estimates of indices using the U-statistics of one sample are numerically much stable.
- [9] arXiv:2407.11069 (cross-list from cs.LG) [pdf, html, other]
-
Title: Combining Federated Learning and Control: A SurveyComments: Submitted to IEEE Communications Survey and TutorialsSubjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
This survey provides an overview of combining Federated Learning (FL) and control to enhance adaptability, scalability, generalization, and privacy in (nonlinear) control applications. Traditional control methods rely on controller design models, but real-world scenarios often require online model retuning or learning. FL offers a distributed approach to model training, enabling collaborative learning across distributed devices while preserving data privacy. By keeping data localized, FL mitigates concerns regarding privacy and security while reducing network bandwidth requirements for communication. This survey summarizes the state-of-the-art concepts and ideas of combining FL and control. The methodical benefits are further discussed, culminating in a detailed overview of expected applications, from dynamical system modeling over controller design, focusing on adaptive control, to knowledge transfer in multi-agent decision-making systems.
- [10] arXiv:2407.11094 (cross-list from stat.ME) [pdf, html, other]
-
Title: Robust Score-Based Quickest Change DetectionComments: arXiv admin note: text overlap with arXiv:2306.05091Subjects: Methodology (stat.ME); Signal Processing (eess.SP); Machine Learning (stat.ML)
Methods in the field of quickest change detection rapidly detect in real-time a change in the data-generating distribution of an online data stream. Existing methods have been able to detect this change point when the densities of the pre- and post-change distributions are known. Recent work has extended these results to the case where the pre- and post-change distributions are known only by their score functions. This work considers the case where the pre- and post-change score functions are known only to correspond to distributions in two disjoint sets. This work employs a pair of "least-favorable" distributions to robustify the existing score-based quickest change detection algorithm, the properties of which are studied. This paper calculates the least-favorable distributions for specific model classes and provides methods of estimating the least-favorable distributions for common constructions. Simulation results are provided demonstrating the performance of our robust change detection algorithm.
- [11] arXiv:2407.11249 (cross-list from cs.LG) [pdf, html, other]
-
Title: Disentangling Representations in RNNs through Multi-task LearningComments: 32 pages, 12 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Abstract, or disentangled, representations are a promising mathematical framework for efficient and effective generalization in both biological and artificial systems. We investigate abstract representations in the context of multi-task classification over noisy evidence streams -- a canonical decision-making neuroscience paradigm. We derive theoretical bounds that guarantee the emergence of disentangled representations in the latent state of any optimal multi-task classifier, when the number of tasks exceeds the dimensionality of the state space. We experimentally confirm that RNNs trained on multi-task classification learn disentangled representations in the form of continuous attractors, leading to zero-shot out-of-distribution (OOD) generalization. We demonstrate the flexibility of the abstract RNN representations across various decision boundary geometries and in tasks requiring classification confidence estimation. Our framework suggests a general principle for the formation of cognitive maps that organize knowledge to enable flexible generalization in biological and artificial systems alike, and closely relates to representations found in humans and animals during decision-making and spatial reasoning tasks.
- [12] arXiv:2407.11274 (cross-list from cs.LG) [pdf, html, other]
-
Title: Empirical Mean and Frequency Estimation Under Heterogeneous Privacy: A Worst-Case AnalysisSubjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Differential Privacy (DP) is the current gold-standard for measuring privacy. Estimation problems under DP constraints appearing in the literature have largely focused on providing equal privacy to all users. We consider the problems of empirical mean estimation for univariate data and frequency estimation for categorical data, two pillars of data analysis in the industry, subject to heterogeneous privacy constraints. Each user, contributing a sample to the dataset, is allowed to have a different privacy demand. The dataset itself is assumed to be worst-case and we study both the problems in two different formulations -- the correlated and the uncorrelated setting. In the former setting, the privacy demand and the user data can be arbitrarily correlated while in the latter setting, there is no correlation between the dataset and the privacy demand. We prove some optimality results, under both PAC error and mean-squared error, for our proposed algorithms and demonstrate superior performance over other baseline techniques experimentally.
- [13] arXiv:2407.11284 (cross-list from hep-ph) [pdf, html, other]
-
Title: Moment UnfoldingComments: 16 pages, 6 figures, 1 tableSubjects: High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Deconvolving ("unfolding'') detector distortions is a critical step in the comparison of cross section measurements with theoretical predictions in particle and nuclear physics. However, most existing approaches require histogram binning while many theoretical predictions are at the level of statistical moments. We develop a new approach to directly unfold distribution moments as a function of another observable without having to first discretize the data. Our Moment Unfolding technique uses machine learning and is inspired by Generative Adversarial Networks (GANs). We demonstrate the performance of this approach using jet substructure measurements in collider physics. With this illustrative example, we find that our Moment Unfolding protocol is more precise than bin-based approaches and is as or more precise than completely unbinned methods.
- [14] arXiv:2407.11386 (cross-list from math.PR) [pdf, html, other]
-
Title: Exponential tilting of subweibull distributionsSubjects: Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
The class of subweibull distributions has recently been shown to generalize the important properties of subexponential and subgaussian random variables. We describe alternative characterizations of subweibull distributions and detail the conditions under which their tail behavior is preserved after exponential tilting.
- [15] arXiv:2407.11427 (cross-list from cs.LG) [pdf, html, other]
-
Title: Semi-Supervised Generative Models for Disease Trajectories: A Case Study on Systemic SclerosisCécile Trottet, Manuel Schürch, Ahmed Allam, Imon Barua, Liubov Petelytska, Oliver Distler, Anna-Maria Hoffmann-Vold, Michael Krauthammer, the EUSTAR collaboratorsComments: Accepted at Machine Learning for Healthcare 2024. arXiv admin note: substantial text overlap with arXiv:2311.08149Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a deep generative approach using latent temporal processes for modeling and holistically analyzing complex disease trajectories, with a particular focus on Systemic Sclerosis (SSc). We aim to learn temporal latent representations of the underlying generative process that explain the observed patient disease trajectories in an interpretable and comprehensive way. To enhance the interpretability of these latent temporal processes, we develop a semi-supervised approach for disentangling the latent space using established medical knowledge. By combining the generative approach with medical definitions of different characteristics of SSc, we facilitate the discovery of new aspects of the disease. We show that the learned temporal latent processes can be utilized for further data analysis and clinical hypothesis testing, including finding similar patients and clustering SSc patient trajectories into novel sub-types. Moreover, our method enables personalized online monitoring and prediction of multivariate time series with uncertainty quantification.
- [16] arXiv:2407.11435 (cross-list from q-bio.GN) [pdf, html, other]
-
Title: Genomic Language Models: Opportunities and ChallengesComments: Review article; 25 pages, 3 figures, 1 tableSubjects: Genomics (q-bio.GN); Machine Learning (cs.LG); Machine Learning (stat.ML)
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. In this review, we showcase this potential by highlighting key applications of gLMs, including fitness prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. We discuss major considerations for developing and evaluating gLMs.
- [17] arXiv:2407.11676 (cross-list from cs.LG) [pdf, html, other]
-
Title: SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic ValidationYanis Lalou, Théo Gnassounou, Antoine Collas, Antoine de Mathelin, Oleksii Kachaiev, Ambroise Odonnat, Alexandre Gramfort, Thomas Moreau, Rémi FlamarySubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With SKADA-Bench, we propose a framework to evaluate DA methods and present a fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data with specific feature extraction. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. SKADA-Bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors. SKADA-Bench is available on GitHub at this https URL.
- [18] arXiv:2407.11678 (cross-list from cs.LG) [pdf, html, other]
-
Title: Theoretical Insights into CycleGAN: Analyzing Approximation and Estimation Errors in Unpaired Data GenerationSubjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper, we focus on analyzing the excess risk of the unpaired data generation model, called CycleGAN. Unlike classical GANs, CycleGAN not only transforms data between two unpaired distributions but also ensures the mappings are consistent, which is encouraged by the cycle-consistency term unique to CycleGAN. The increasing complexity of model structure and the addition of the cycle-consistency term in CycleGAN present new challenges for error analysis. By considering the impact of both the model architecture and training procedure, the risk is decomposed into two terms: approximation error and estimation error. These two error terms are analyzed separately and ultimately combined by considering the trade-off between them. Each component is rigorously analyzed; the approximation error through constructing approximations of the optimal transport maps, and the estimation error through establishing an upper bound using Rademacher complexity. Our analysis not only isolates these errors but also explores the trade-offs between them, which provides a theoretical insights of how CycleGAN's architecture and training procedures influence its performance.
- [19] arXiv:2407.11735 (cross-list from cs.LG) [pdf, other]
-
Title: ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution DetectionComments: ECCV2024Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed \emph{ProSub}, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at this https URL.
- [20] arXiv:2407.11800 (cross-list from math.AP) [pdf, other]
-
Title: Gradient Flows and Riemannian Structure in the Gromov-Wasserstein GeometryComments: 73 pagesSubjects: Analysis of PDEs (math.AP); Optimization and Control (math.OC); Machine Learning (stat.ML)
The Wasserstein space of probability measures is known for its intricate Riemannian structure, which underpins the Wasserstein geometry and enables gradient flow algorithms. However, the Wasserstein geometry may not be suitable for certain tasks or data modalities. Motivated by scenarios where the global structure of the data needs to be preserved, this work initiates the study of gradient flows and Riemannian structure in the Gromov-Wasserstein (GW) geometry, which is particularly suited for such purposes. We focus on the inner product GW (IGW) distance between distributions on $\mathbb{R}^d$. Given a functional $\mathsf{F}:\mathcal{P}_2(\mathbb{R}^d)\to\mathbb{R}$ to optimize, we present an implicit IGW minimizing movement scheme that generates a sequence of distributions $\{\rho_i\}_{i=0}^n$, which are close in IGW and aligned in the 2-Wasserstein sense. Taking the time step to zero, we prove that the discrete solution converges to an IGW generalized minimizing movement (GMM) $(\rho_t)_t$ that follows the continuity equation with a velocity field $v_t\in L^2(\rho_t;\mathbb{R}^d)$, specified by a global transformation of the Wasserstein gradient of $\mathsf{F}$. The transformation is given by a mobility operator that modifies the Wasserstein gradient to encode not only local information, but also global structure. Our gradient flow analysis leads us to identify the Riemannian structure that gives rise to the intrinsic IGW geometry, using which we establish a Benamou-Brenier-like formula for IGW. We conclude with a formal derivation, akin to the Otto calculus, of the IGW gradient as the inverse mobility acting on the Wasserstein gradient. Numerical experiments validating our theory and demonstrating the global nature of IGW interpolations are provided.
- [21] arXiv:2407.11823 (cross-list from cs.LG) [pdf, html, other]
-
Title: Harmonizing Safety and Speed: A Human-Algorithm Approach to Enhance the FDA's Medical Device Clearance PolicySubjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC); Machine Learning (stat.ML)
The United States Food and Drug Administration's (FDA's) Premarket Notification 510(K) pathway allows manufacturers to gain approval for a medical device by demonstrating its substantial equivalence to another legally marketed device. However, the inherent ambiguity of this regulatory procedure has led to high recall rates for many devices cleared through this pathway. This trend has raised significant concerns regarding the efficacy of the FDA's current approach, prompting a reassessment of the 510(K) regulatory framework. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing the risk of potential recalls and the workload imposed on the FDA. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to FDA's committees for in-depth evaluation. We conduct an empirical study using a unique large-scale dataset of over 31,000 medical devices and 12,000 national and international manufacturers from over 65 countries that we assembled based on data sources from the FDA and Centers for Medicare and Medicaid Service (CMS). A conservative evaluation of our proposed policy based on this data shows a 38.9% improvement in the recall rate and a 43.0% reduction in the FDA's workload. Our analyses also indicate that implementing our policy could result in significant annual cost-savings ranging between \$2.4 billion and \$2.7 billion, which highlights the value of using a holistic and data-driven approach to improve the FDA's current 510(K) medical device evaluation pathway.
- [22] arXiv:2407.11844 (cross-list from cs.LG) [pdf, html, other]
-
Title: Variational Randomized Smoothing for Sample-Wise Adversarial RobustnessComments: 20 pages, under preparationSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Randomized smoothing is a defensive technique to achieve enhanced robustness against adversarial examples which are small input perturbations that degrade the performance of neural network models. Conventional randomized smoothing adds random noise with a fixed noise level for every input sample to smooth out adversarial perturbations. This paper proposes a new variational framework that uses a per-sample noise level suitable for each input by introducing a noise level selector. Our experimental results demonstrate enhancement of empirical robustness against adversarial attacks. We also provide and analyze the certified robustness for our sample-wise smoothing method.
- [23] arXiv:2407.11894 (cross-list from cs.LG) [pdf, html, other]
-
Title: Deep Learning without Global Optimization by Random Fourier Neural NetworksSubjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
We introduce a new training algorithm for variety of deep neural networks that utilize random complex exponential activation functions. Our approach employs a Markov Chain Monte Carlo sampling procedure to iteratively train network layers, avoiding global and gradient-based optimization while maintaining error control. It consistently attains the theoretical approximation rate for residual networks with complex exponential activation functions, determined by network complexity. Additionally, it enables efficient learning of multiscale and high-frequency features, producing interpretable parameter distributions. Despite using sinusoidal basis functions, we do not observe Gibbs phenomena in approximating discontinuous target functions.
- [24] arXiv:2407.11917 (cross-list from cs.LG) [pdf, html, other]
-
Title: Global Optimisation of Black-Box Functions with Generative Models in the Wasserstein SpaceComments: ECAI 2024 (Oral)Subjects: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We propose a new uncertainty estimator for gradient-free optimisation of black-box simulators using deep generative surrogate models. Optimisation of these simulators is especially challenging for stochastic simulators and higher dimensions. To address these issues, we utilise a deep generative surrogate approach to model the black box response for the entire parameter space. We then leverage this knowledge to estimate the proposed uncertainty based on the Wasserstein distance - the Wasserstein uncertainty. This approach is employed in a posterior agnostic gradient-free optimisation algorithm that minimises regret over the entire parameter space. A series of tests were conducted to demonstrate that our method is more robust to the shape of both the black box function and the stochastic response of the black box than state-of-the-art methods, such as efficient global optimisation with a deep Gaussian process surrogate.
- [25] arXiv:2407.11932 (cross-list from math.ST) [pdf, html, other]
-
Title: Impossibility of latent inner product recovery via rate distortionSubjects: Statistics Theory (math.ST); Information Theory (cs.IT); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
In this largely expository note, we present an impossibility result for inner product recovery in a random geometric graph or latent space model using the rate-distortion theory. More precisely, suppose that we observe a graph $A$ on $n$ vertices with average edge density $p$ generated from Gaussian or spherical latent locations $z_1, \dots, z_n \in \mathbb{R}^d$ associated with the $n$ vertices. It is of interest to estimate the inner products $\langle z_i, z_j \rangle$ which represent the geometry of the latent points. We prove that it is impossible to recover the inner products if $d \gtrsim n h(p)$ where $h(p)$ is the binary entropy function. This matches the condition required for positive results on inner product recovery in the literature. The proof follows the well-established rate-distortion theory with the main technical ingredient being a lower bound on the rate-distortion function of the Wishart distribution which is interesting in its own right.
- [26] arXiv:2407.11942 (cross-list from q-bio.BM) [pdf, html, other]
-
Title: Context-Guided Diffusion for Out-of-Distribution Molecular and Protein DesignComments: Published in the Proceedings of the 41st International Conference on Machine Learning (ICML 2024)Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Generative models have the potential to accelerate key steps in the discovery of novel molecular therapeutics and materials. Diffusion models have recently emerged as a powerful approach, excelling at unconditional sample generation and, with data-driven guidance, conditional generation within their training domain. Reliably sampling from high-value regions beyond the training data, however, remains an open challenge -- with current methods predominantly focusing on modifying the diffusion process itself. In this paper, we develop context-guided diffusion (CGD), a simple plug-and-play method that leverages unlabeled data and smoothness constraints to improve the out-of-distribution generalization of guided diffusion models. We demonstrate that this approach leads to substantial performance gains across various settings, including continuous, discrete, and graph-structured diffusion processes with applications across drug discovery, materials science, and protein design.
Cross submissions for Wednesday, 17 July 2024 (showing 20 of 20 entries )
- [27] arXiv:2301.04257 (replaced) [pdf, other]
-
Title: ODIM: Outlier Detection via Likelihood of Under-Fitted Generative ModelsComments: 31 pages in total. Published at ICML 2024Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The unsupervised outlier detection (UOD) problem refers to a task to identify inliers given training data which contain outliers as well as inliers, without any labeled information about inliers and outliers. It has been widely recognized that using fully-trained likelihood-based deep generative models (DGMs) often results in poor performance in distinguishing inliers from outliers. In this study, we claim that the likelihood itself could serve as powerful evidence for identifying inliers in UOD tasks, provided that DGMs are carefully under-fitted. Our approach begins with a novel observation called the inlier-memorization (IM) effect-when training a deep generative model with data including outliers, the model initially memorizes inliers before outliers. Based on this finding, we develop a new method called the outlier detection via the IM effect (ODIM). Remarkably, the ODIM requires only a few updates, making it computationally efficient-at least tens of times faster than other deep-learning-based algorithms. Also, the ODIM filters out outliers excellently, regardless of the data type, including tabular, image, and text data. To validate the superiority and efficiency of our method, we provide extensive empirical analyses on close to 60 datasets.
- [28] arXiv:2402.04146 (replaced) [pdf, html, other]
-
Title: Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian ProcessSandipp Krishnan Ravi, Yigitcan Comlek, Wei Chen, Arjun Pathak, Vipul Gupta, Rajnikant Umretiya, Andrew Hoffman, Ghanshyam Pilania, Piyush Pandita, Sayan Ghosh, Nathaniel Mckeever, Liping WangComments: 27 Pages,10 Figures, 3 Supplementary Figures, 2 Supplementary TablesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
With the advent of artificial intelligence (AI) and machine learning (ML), various domains of science and engineering communites has leveraged data-driven surrogates to model complex systems from numerous sources of information (data). The proliferation has led to significant reduction in cost and time involved in development of superior systems designed to perform specific functionalities. A high proposition of such surrogates are built extensively fusing multiple sources of data, may it be published papers, patents, open repositories, or other resources. However, not much attention has been paid to the differences in quality and comprehensiveness of the known and unknown underlying physical parameters of the information sources that could have downstream implications during system optimization. Towards resolving this issue, a multi-source data fusion framework based on Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as a characteristic categorical variable that are mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences in the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical (representative parabola problem, 2D Ackley function) and two materials science (design of FeCrAl and SmCoFe alloys) case studies. From the case studies, it is observed that compared to using single-source and source unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
- [29] arXiv:2403.02011 (replaced) [pdf, html, other]
-
Title: Bipartite Graph Variational Auto-Encoder with Fair Latent Representation to Account for Sampling Bias in Ecological NetworksSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
We propose a method to represent bipartite networks using graph embeddings tailored to tackle the challenges of studying ecological networks, such as the ones linking plants and pollinators, where many covariates need to be accounted for, in particular to control for sampling bias. We adapt the variational graph auto-encoder approach to the bipartite case, which enables us to generate embeddings in a latent space where the two sets of nodes are positioned based on their probability of connection. We translate the fairness framework commonly considered in sociology in order to address sampling bias in ecology. By incorporating the Hilbert-Schmidt independence criterion (HSIC) as an additional penalty term in the loss we optimize, we ensure that the structure of the latent space is independent of continuous variables, which are related to the sampling process. Finally, we show how our approach can change our understanding of ecological networks when applied to the Spipoll data set, a citizen science monitoring program of plant-pollinator interactions to which many observers contribute, making it prone to sampling bias.
- [30] arXiv:2404.09779 (replaced) [pdf, other]
-
Title: A replica analysis of under-baggingComments: 31 pages, 14 figuresSubjects: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Under-bagging (UB), which combines under-sampling and bagging, is a popular ensemble learning method for training classifiers on an imbalanced data. Using bagging to reduce the increased variance caused by the reduction in sample size due to under-sampling is a natural approach. However, it has recently been pointed out that in generalized linear models, naive bagging, which does not consider the class imbalance structure, and ridge regularization can produce the same results. Therefore, it is not obvious whether it is better to use UB, which requires an increased computational cost proportional to the number of under-sampled data sets, when training linear models. Given such a situation, in this study, we heuristically derive a sharp asymptotics of UB and use it to compare with several other popular methods for learning from imbalanced data, in the scenario where a linear classifier is trained from a two-component mixture data. The methods compared include the under-sampling (US) method, which trains a model using a single realization of the under-sampled data, and the simple weighting (SW) method, which trains a model with a weighted loss on the entire data. It is shown that the performance of UB is improved by increasing the size of the majority class while keeping the size of the minority fixed, even though the class imbalance can be large, especially when the size of the minority class is small. This is in contrast to US, whose performance is almost independent of the majority class size. In this sense, bagging and simple regularization differ as methods to reduce the variance increased by under-sampling. On the other hand, the performance of SW with the optimal weighting coefficients is almost equal to UB, indicating that the combination of reweighting and regularization may be similar to UB.
- [31] arXiv:2405.14038 (replaced) [pdf, html, other]
-
Title: FLIPHAT: Joint Differential Privacy for High Dimensional Sparse Linear BanditsComments: 28 pages, 1 figureSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
High dimensional sparse linear bandits serve as an efficient model for sequential decision-making problems (e.g. personalized medicine), where high dimensional features (e.g. genomic data) on the users are available, but only a small subset of them are relevant. Motivated by data privacy concerns in these applications, we study the joint differentially private high dimensional sparse linear bandits, where both rewards and contexts are considered as private data. First, to quantify the cost of privacy, we derive a lower bound on the regret achievable in this setting. To further address the problem, we design a computationally efficient bandit algorithm, \textbf{F}orgetfu\textbf{L} \textbf{I}terative \textbf{P}rivate \textbf{HA}rd \textbf{T}hresholding (FLIPHAT). Along with doubling of episodes and episodic forgetting, FLIPHAT deploys a variant of Noisy Iterative Hard Thresholding (N-IHT) algorithm as a sparse linear regression oracle to ensure both privacy and regret-optimality. We show that FLIPHAT achieves optimal regret up to logarithmic factors. We analyze the regret by providing a novel refined analysis of the estimation error of N-IHT, which is of parallel interest.
- [32] arXiv:2407.10955 (replaced) [pdf, html, other]
-
Title: Enhancing Stochastic Optimization for Statistical Efficiency Using ROOT-SGD with Diminishing StepsizeSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
In this paper, we revisit \textsf{ROOT-SGD}, an innovative method for stochastic optimization to bridge the gap between stochastic optimization and statistical efficiency. The proposed method enhances the performance and reliability of \textsf{ROOT-SGD} by integrating a carefully designed \emph{diminishing stepsize strategy}. This approach addresses key challenges in optimization, providing robust theoretical guarantees and practical benefits. Our analysis demonstrates that \textsf{ROOT-SGD} with diminishing achieves optimal convergence rates while maintaining computational efficiency. By dynamically adjusting the learning rate, \textsf{ROOT-SGD} ensures improved stability and precision throughout the optimization process. The findings of this study offer valuable insights for developing advanced optimization algorithms that are both efficient and statistically robust.
- [33] arXiv:2207.06569 (replaced) [pdf, html, other]
-
Title: Benign, Tempered, or Catastrophic: A Taxonomy of OverfittingNeil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum NakkiranComments: NM and JS co-first authorsSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with powerlaw spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
- [34] arXiv:2305.02997 (replaced) [pdf, html, other]
-
Title: When Do Neural Nets Outperform Boosted Trees on Tabular Data?Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, Colin WhiteComments: NeurIPS Datasets and Benchmarks Track 2023Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at this https URL.
- [35] arXiv:2306.03928 (replaced) [pdf, html, other]
-
Title: Designing Decision Support Systems Using Counterfactual Prediction SetsComments: Best paper award in the ICML 2023 AI&HCI Workshop, spotlight paper at ICML 2024Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Methodology (stat.ME); Machine Learning (stat.ML)
Decision support systems for classification tasks are predominantly designed to predict the value of the ground truth labels. However, since their predictions are not perfect, these systems also need to make human experts understand when and how to use these predictions to update their own predictions. Unfortunately, this has been proven challenging. In this context, it has been recently argued that an alternative type of decision support systems may circumvent this challenge. Rather than providing a single label prediction, these systems provide a set of label prediction values constructed using a conformal predictor, namely a prediction set, and forcefully ask experts to predict a label value from the prediction set. However, the design and evaluation of these systems have so far relied on stylized expert models, questioning their promise. In this paper, we revisit the design of this type of systems from the perspective of online learning and develop a methodology that does not require, nor assumes, an expert model. Our methodology leverages the nested structure of the prediction sets provided by any conformal predictor and a natural counterfactual monotonicity assumption to achieve an exponential improvement in regret in comparison to vanilla bandit algorithms. We conduct a large-scale human subject study ($n = 2{,}751$) to compare our methodology to several competitive baselines. The results show that, for decision support systems based on prediction sets, limiting experts' level of agency leads to greater performance than allowing experts to always exercise their own agency. We have made available the data gathered in our human subject study as well as an open source implementation of our system at this https URL.
- [36] arXiv:2307.00310 (replaced) [pdf, html, other]
-
Title: Gradients Look Alike: Sensitivity is Often Overestimated in DP-SGDComments: published in 33rd USENIX Security SymposiumSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Differentially private stochastic gradient descent (DP-SGD) is the canonical approach to private deep learning. While the current privacy analysis of DP-SGD is known to be tight in some settings, several empirical results suggest that models trained on common benchmark datasets leak significantly less privacy for many datapoints. Yet, despite past attempts, a rigorous explanation for why this is the case has not been reached. Is it because there exist tighter privacy upper bounds when restricted to these dataset settings, or are our attacks not strong enough for certain datapoints? In this paper, we provide the first per-instance (i.e., ``data-dependent") DP analysis of DP-SGD. Our analysis captures the intuition that points with similar neighbors in the dataset enjoy better data-dependent privacy than outliers. Formally, this is done by modifying the per-step privacy analysis of DP-SGD to introduce a dependence on the distribution of model updates computed from a training dataset. We further develop a new composition theorem to effectively use this new per-step analysis to reason about an entire training run. Put all together, our evaluation shows that this novel DP-SGD analysis allows us to now formally show that DP-SGD leaks significantly less privacy for many datapoints (when trained on common benchmarks) than the current data-independent guarantee. This implies privacy attacks will necessarily fail against many datapoints if the adversary does not have sufficient control over the possible training datasets.
- [37] arXiv:2309.05697 (replaced) [pdf, html, other]
-
Title: 21cmEMU: an emulator of 21cmFAST summary observablesComments: 21 pages, 13 figures, submitted to MNRASSubjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (stat.ML)
Recent years have witnessed rapid progress in observations of the Epoch of Reionization (EoR). These have enabled high-dimensional inference of galaxy and intergalactic medium (IGM) properties during the first billion years of our Universe. However, even using efficient, semi-numerical simulations, traditional inference approaches that compute 3D lightcones on-the-fly can take $10^5$ core hours. Here we present 21cmEMU: an emulator of several summary observables from the popular 21cmFAST simulation code. 21cmEMU takes as input nine parameters characterizing EoR galaxies, and outputs the following summary statistics: (i) the IGM mean neutral fraction; (ii) the 21-cm power spectrum; (iii) the mean 21-cm spin temperature; (iv) the sky-averaged (global) 21-cm signal; (v) the ultraviolet (UV) luminosity functions (LFs); and (vi) the Thomson scattering optical depth to the cosmic microwave background (CMB). All observables are predicted with sub-percent median accuracy, with a reduction of the computational cost by a factor of over 10$^4$. After validating inference results, we showcase a few applications, including: (i) quantifying the relative constraining power of different observational datasets; (ii) seeing how recent claims of a late EoR impact previous inferences; and (iii) forecasting upcoming constraints from the sixth observing season of the Hydrogen Epoch of Reionization Array (HERA) telescope. 21cmEMU is publicly-available, and is included as an alternative simulator in the public 21CMMC sampler.
- [38] arXiv:2311.08214 (replaced) [pdf, html, other]
-
Title: Frequentist Guarantees of Distributed (Non)-Bayesian InferenceSubjects: Statistics Theory (math.ST); Machine Learning (stat.ML)
Motivated by the need to analyze large, decentralized datasets, distributed Bayesian inference has become a critical research area across multiple fields, including statistics, electrical engineering, and economics. This paper establishes Frequentist properties, such as posterior consistency, asymptotic normality, and posterior contraction rates, for the distributed (non-)Bayes Inference problem among agents connected via a communication network. Our results show that, under appropriate assumptions on the communication graph, distributed Bayesian inference retains parametric efficiency while enhancing robustness in uncertainty quantification. We also explore the trade-off between statistical efficiency and communication efficiency by examining how the design and size of the communication graph impact the posterior contraction rate. Furthermore, We extend our analysis to time-varying graphs and apply our results to exponential family models, distributed logistic regression, and decentralized detection models.
- [39] arXiv:2402.01143 (replaced) [pdf, html, other]
-
Title: Learning Network Representations with Disentangled Graph Auto-EncoderComments: 15 pages, 9 figuresSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
The (variational) graph auto-encoder is widely used to learn representations for graph-structured data. However, the formation of real-world graphs is a complicated and heterogeneous process influenced by latent factors. Existing encoders are fundamentally holistic, neglecting the entanglement of latent factors. This reduces the effectiveness of graph analysis tasks, while also making it more difficult to explain the learned representations. As a result, learning disentangled graph representations with the (variational) graph auto-encoder poses significant challenges and remains largely unexplored in the current research. In this paper, we introduce the Disentangled Graph Auto-Encoder (DGA) and the Disentangled Variational Graph Auto-Encoder (DVGA) to learn disentangled representations. Specifically, we first design a disentangled graph convolutional network with multi-channel message-passing layers to serve as the encoder. This allows each channel to aggregate information about each latent factor. The disentangled variational graph auto-encoder's expressive capability is then enhanced by applying a component-wise flow to each channel. In addition, we construct a factor-wise decoder that takes into account the characteristics of disentangled representations. We improve the independence of representations by imposing independence constraints on the mapping channels for distinct latent factors. Empirical experiments on both synthetic and real-world datasets demonstrate the superiority of our proposed method compared to several state-of-the-art baselines.
- [40] arXiv:2402.01845 (replaced) [pdf, html, other]
-
Title: Multi-Armed Bandits with InterferenceSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. The cumulative performance, while equally crucial, is less well understood. To address this gap, we introduce the problem of {\em Multi-armed Bandits with Interference} (MABI), where the learner assigns an arm to each of $N$ experimental units over a time horizon of $T$ rounds. The reward of each unit in each round depends on the treatments of {\em all} units, where the influence of a unit decays in the spatial distance between units. Furthermore, we employ a general setup wherein the reward functions are chosen by an adversary and may vary arbitrarily across rounds and units. We first show that switchback policies achieve an optimal {\em expected} regret $\tilde O(\sqrt T)$ against the best fixed-arm policy. Nonetheless, the regret (as a random variable) for any switchback policy suffers a high variance, as it does not account for $N$. We propose a cluster randomization policy whose regret (i) is optimal in {\em expectation} and (ii) admits a high probability bound that vanishes in $N$.
- [41] arXiv:2402.14781 (replaced) [pdf, html, other]
-
Title: Effective Bayesian Causal Inference via Structural Marginalisation and Autoregressive OrdersComments: 8 pages + references + appendices (19 pages total)Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Bayesian causal inference (BCI) naturally incorporates epistemic uncertainty about the true causal model into down-stream causal reasoning tasks by posterior averaging over causal models. However, this poses a tremendously hard computational problem due to the intractable number of causal structures to marginalise over. In this work, we decompose the structure learning problem into inferring (i) a causal order and (ii) a parent set for each variable given a causal order. By limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time, which leaves only the causal order to be marginalised. To this end, we propose a novel autoregressive model over causal orders (ARCO) learnable with gradient-based methods. Our method yields state-of-the-art in structure learning on simulated non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures, and competitive results on real-world data. Moreover, we illustrate that our method accurately infers interventional distributions, which allows us to estimate posterior average causal effects and many other causal quantities of interest.
- [42] arXiv:2403.16883 (replaced) [pdf, html, other]
-
Title: GLAD: Improving Latent Graph Generative Modeling with Simple QuantizationComments: Accepted in the 2nd Structured Probabilistic Inference & Generative Modeling workshop of ICML 2024Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Exploring the graph latent structures has not garnered much attention in the graph generative research field. Yet, exploiting the latent space is as crucial as working on the data space for discrete data such as graphs. However, previous methods either failed to preserve the permutation symmetry of graphs or lacked an effective approaches to model appropriately within the latent space. To mitigate those issues, we propose a simple, yet effective discrete latent graph diffusion generative model. Our model, namely GLAD, not only overcomes the drawbacks of existing latent approaches, but also alleviates inherent issues present in diffusion methods applied on the graph space. We validate our generative model on the molecular benchmark datasets, on which it demonstrates competitive performance compared with the state-of-the-art baselines.