-
Local to Global: Learning Dynamics and Effect of Initialization for Transformers
Authors:
Ashok Vardhan Makkuva,
Marco Bondaschi,
Chanakya Ekbote,
Adway Girish,
Alliot Nagle,
Hyeji Kim,
Michael Gastpar
Abstract:
In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited with many fundamental questions about how transformers learn Markov chains still unanswered. In this…
▽ More
In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited with many fundamental questions about how transformers learn Markov chains still unanswered. In this paper, we address this by focusing on first-order Markov chains and single-layer transformers, providing a comprehensive characterization of the learning dynamics in this context. Specifically, we prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima, contingent on the initialization and the Markovian data properties, and we characterize the precise conditions under which this occurs. To the best of our knowledge, this is the first result of its kind highlighting the role of initialization. We further demonstrate that our theoretical findings are corroborated by empirical evidence. Based on these insights, we provide guidelines for the initialization of transformer parameters and demonstrate their effectiveness. Finally, we outline several open problems in this arena. Code is available at: https://github.com/Bond1995/Markov.
△ Less
Submitted 27 June, 2024; v1 submitted 5 June, 2024;
originally announced June 2024.
-
Sibson's $α$-Mutual Information and its Variational Representations
Authors:
Amedeo Roberto Esposito,
Michael Gastpar,
Ibrahim Issa
Abstract:
Information measures can be constructed from Rényi divergences much like mutual information from Kullback-Leibler divergence. One such information measure is known as Sibson's $α$-mutual information and has received renewed attention recently in several contexts: concentration of measure under dependence, statistical learning, hypothesis testing, and estimation theory. In this paper, we survey and…
▽ More
Information measures can be constructed from Rényi divergences much like mutual information from Kullback-Leibler divergence. One such information measure is known as Sibson's $α$-mutual information and has received renewed attention recently in several contexts: concentration of measure under dependence, statistical learning, hypothesis testing, and estimation theory. In this paper, we survey and extend the state of the art. In particular, we introduce variational representations for Sibson's $α$-mutual information and employ them in each of the contexts just described to derive novel results. Namely, we produce generalized Transportation-Cost inequalities and Fano-type inequalities. We also present an overview of known applications, spanning from learning theory and Bayesian risk to universal prediction.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
Characterising directed and undirected metrics of high-order interdependence
Authors:
Fernando E. Rosas,
Pedro A. M. Mediano,
Michael Gastpar
Abstract:
Systems of interest for theoretical or experimental work often exhibit high-order interactions, corresponding to statistical interdependencies in groups of variables that cannot be reduced to dependencies in subsets of them. While still under active development, the framework of partial information decomposition (PID) has emerged as the dominant approach to conceptualise and calculate high-order i…
▽ More
Systems of interest for theoretical or experimental work often exhibit high-order interactions, corresponding to statistical interdependencies in groups of variables that cannot be reduced to dependencies in subsets of them. While still under active development, the framework of partial information decomposition (PID) has emerged as the dominant approach to conceptualise and calculate high-order interdependencies. PID approaches can be grouped in two types: directed approaches that divide variables into sources and targets, and undirected approaches that treat all variables equally. Directed and undirected approaches are usually employed to investigate different scenarios, and hence little is known about how these two types of approaches may relate to each other, or if their corresponding quantities are linked in some way. In this paper we investigate the relationship between the redundancy-synergy index (RSI) and the O-information, which are practical metrics of directed and undirected high-order interdependencies, respectively. Our results reveal tight links between these two quantities, and provide interpretations of them in terms of likelihood ratios in a hypothesis testing setting, as well as in terms of projections in information geometry.
△ Less
Submitted 10 April, 2024;
originally announced April 2024.
-
Properties of the Strong Data Processing Constant for Rényi Divergence
Authors:
Lifu Jin,
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
Strong data processing inequalities (SDPI) are an important object of study in Information Theory and have been well studied for $f$-divergences. Universal upper and lower bounds have been provided along with several applications, connecting them to impossibility (converse) results, concentration of measure, hypercontractivity, and so on. In this paper, we study Rényi divergence and the correspond…
▽ More
Strong data processing inequalities (SDPI) are an important object of study in Information Theory and have been well studied for $f$-divergences. Universal upper and lower bounds have been provided along with several applications, connecting them to impossibility (converse) results, concentration of measure, hypercontractivity, and so on. In this paper, we study Rényi divergence and the corresponding SDPI constant whose behavior seems to deviate from that of ordinary $Φ$-divergences. In particular, one can find examples showing that the universal upper bound relating its SDPI constant to the one of Total Variation does not hold in general. In this work, we prove, however, that the universal lower bound involving the SDPI constant of the Chi-square divergence does indeed hold. Furthermore, we also provide a characterization of the distribution that achieves the supremum when $α$ is equal to $2$ and consequently compute the SDPI constant for Rényi divergence of the general binary channel.
△ Less
Submitted 14 May, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
The Fundamental Limits of Least-Privilege Learning
Authors:
Theresa Stadler,
Bogdan Kulynych,
Michael C. Gastpar,
Nicolas Papernot,
Carmela Troncoso
Abstract:
The promise of least-privilege learning -- to find feature representations that are useful for a learning task but prevent inference of any sensitive information unrelated to this task -- is highly appealing. However, so far this concept has only been stated informally. It thus remains an open question whether and how we can achieve this goal. In this work, we provide the first formalisation of th…
▽ More
The promise of least-privilege learning -- to find feature representations that are useful for a learning task but prevent inference of any sensitive information unrelated to this task -- is highly appealing. However, so far this concept has only been stated informally. It thus remains an open question whether and how we can achieve this goal. In this work, we provide the first formalisation of the least-privilege principle for machine learning and characterise its feasibility. We prove that there is a fundamental trade-off between a representation's utility for a given task and its leakage beyond the intended task: it is not possible to learn representations that have high utility for the intended task but, at the same time prevent inference of any attribute other than the task label itself. This trade-off holds under realistic assumptions on the data distribution and regardless of the technique used to learn the feature mappings that produce these representations. We empirically validate this result for a wide range of learning techniques, model architectures, and datasets.
△ Less
Submitted 26 June, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
-
Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Authors:
Ashok Vardhan Makkuva,
Marco Bondaschi,
Adway Girish,
Alliot Nagle,
Martin Jaggi,
Hyeji Kim,
Michael Gastpar
Abstract:
In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and s…
▽ More
In recent years, attention-based transformers have achieved tremendous success across a variety of disciplines including natural languages. A key ingredient behind their success is the generative pretraining procedure, during which these models are trained on a large text corpus in an auto-regressive manner. To shed light on this phenomenon, we propose a new framework that allows both theory and systematic experiments to study the sequential modeling capabilities of transformers through the lens of Markov chains. Inspired by the Markovianity of natural languages, we model the data as a Markovian source and utilize this framework to systematically study the interplay between the data-distributional properties, the transformer architecture, the learnt distribution, and the final model performance. In particular, we theoretically characterize the loss landscape of single-layer transformers and show the existence of global minima and bad local minima contingent upon the specific data characteristics and the transformer architecture. Backed by experiments, we demonstrate that our theoretical findings are in congruence with the empirical results. We further investigate these findings in the broader context of higher order Markov chains and deeper architectures, and outline open problems in this arena. Code is available at \url{https://github.com/Bond1995/Markov}.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Batch Universal Prediction
Authors:
Marco Bondaschi,
Michael Gastpar
Abstract:
Large language models (LLMs) have recently gained much popularity due to their surprising ability at generating human-like English sentences. LLMs are essentially predictors, estimating the probability of a sequence of words given the past. Therefore, it is natural to evaluate their performance from a universal prediction perspective. In order to do that fairly, we introduce the notion of batch re…
▽ More
Large language models (LLMs) have recently gained much popularity due to their surprising ability at generating human-like English sentences. LLMs are essentially predictors, estimating the probability of a sequence of words given the past. Therefore, it is natural to evaluate their performance from a universal prediction perspective. In order to do that fairly, we introduce the notion of batch regret as a modification of the classical average regret, and we study its asymptotical value for add-constant predictors, in the case of memoryless sources and first-order Markov sources.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Simultaneous Computation and Communication over MAC
Authors:
Matthias Frey,
Igor Bjelaković,
Michael C. Gastpar,
Jingge Zhu
Abstract:
We study communication over a Gaussian multiple-access channel (MAC) with two types of transmitters: Digital transmitters hold a message from a discrete set that needs to be communicated to the receiver with vanishing error probability. Analog transmitters hold sequences of analog values, and some functions of these distributed values (but not the values themselves) need to be conveyed to the rece…
▽ More
We study communication over a Gaussian multiple-access channel (MAC) with two types of transmitters: Digital transmitters hold a message from a discrete set that needs to be communicated to the receiver with vanishing error probability. Analog transmitters hold sequences of analog values, and some functions of these distributed values (but not the values themselves) need to be conveyed to the receiver, subject to a fidelity criterion such as mean squared error (MSE) or a certain maximum error with given confidence. For the case in which the computed function for the analog transmitters is a sum of values in [-1,1], we derive inner and outer bounds for the tradeoff of digital and analog rates of communication under peak and average power constraints for digital transmitters and a peak power constraint for analog transmitters. We then extend the achievability result to a class of functions that includes all linear and some non-linear functions. The practicality of our proposed communication scheme is shown in channel simulations that use a version of the scheme based on low density parity check (LDPC) coding and evaluate the system performance for different block lengths and Gaussian as well as non-Gaussian noise distributions.
△ Less
Submitted 23 April, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
LASER: Linear Compression in Wireless Distributed Optimization
Authors:
Ashok Vardhan Makkuva,
Marco Bondaschi,
Thijs Vogels,
Martin Jaggi,
Hyeji Kim,
Michael C. Gastpar
Abstract:
Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large scale machine learning. Despite its merits, communication bottleneck is one of its persistent issues. Most compression schemes to alleviate this either assume noiseless communication links, or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineA…
▽ More
Data-parallel SGD is the de facto algorithm for distributed optimization, especially for large scale machine learning. Despite its merits, communication bottleneck is one of its persistent issues. Most compression schemes to alleviate this either assume noiseless communication links, or fail to achieve good performance on practical tasks. In this paper, we close this gap and introduce LASER: LineAr CompreSsion in WirEless DistRibuted Optimization. LASER capitalizes on the inherent low-rank structure of gradients and transmits them efficiently over the noisy channels. Whilst enjoying theoretical guarantees similar to those of the classical SGD, LASER shows consistent gains over baselines on a variety of practical benchmarks. In particular, it outperforms the state-of-the-art compression schemes on challenging computer vision and GPT language modeling tasks. On the latter, we obtain $50$-$64 \%$ improvement in perplexity over our baselines for noisy channels.
△ Less
Submitted 6 February, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Fantastic Generalization Measures are Nowhere to be Found
Authors:
Michael Gastpar,
Ido Nachum,
Jonathan Shafer,
Thomas Weinberger
Abstract:
We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small for all learning algorithms and all population distributions. Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, in t…
▽ More
We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small for all learning algorithms and all population distributions. Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, in their paper ``Fantastic Generalization Measures and Where to Find Them,'' Jiang et al. (2020) examine more than a dozen generalization bounds, and show empirically that none of them are uniformly tight. This raises the question of whether uniformly-tight generalization bounds are at all possible in the overparameterized setting. We consider two types of generalization bounds: (1) bounds that may depend on the training set and the learned hypothesis (e.g., margin bounds). We prove mathematically that no such bound can be uniformly tight in the overparameterized setting; (2) bounds that may in addition also depend on the learning algorithm (e.g., stability bounds). For these bounds, we show a trade-off between the algorithm's performance and the bound's tightness. Namely, if the algorithm achieves good accuracy on certain distributions, then no generalization bound can be uniformly tight for it in the overparameterized setting. We explain how these formal results can, in our view, inform research on generalization bounds for neural networks, while stressing that other interpretations of these results are also possible.
△ Less
Submitted 28 November, 2023; v1 submitted 24 September, 2023;
originally announced September 2023.
-
Lower Bounds on the Bayesian Risk via Information Measures
Authors:
Amedeo Roberto Esposito,
Adrien Vandenbroucque,
Michael Gastpar
Abstract:
This paper focuses on parameter estimation and introduces a new method for lower bounding the Bayesian risk. The method allows for the use of virtually \emph{any} information measure, including Rényi's $α$, $\varphi$-Divergences, and Sibson's $α$-Mutual Information. The approach considers divergences as functionals of measures and exploits the duality between spaces of measures and spaces of funct…
▽ More
This paper focuses on parameter estimation and introduces a new method for lower bounding the Bayesian risk. The method allows for the use of virtually \emph{any} information measure, including Rényi's $α$, $\varphi$-Divergences, and Sibson's $α$-Mutual Information. The approach considers divergences as functionals of measures and exploits the duality between spaces of measures and spaces of functions. In particular, we show that one can lower bound the risk with any information measure by upper bounding its dual via Markov's inequality. We are thus able to provide estimator-independent impossibility results thanks to the Data-Processing Inequalities that divergences satisfy. The results are then applied to settings of interest involving both discrete and continuous parameters, including the ``Hide-and-Seek'' problem, and compared to the state-of-the-art techniques. An important observation is that the behaviour of the lower bound in the number of samples is influenced by the choice of the information measure. We leverage this by introducing a new divergence inspired by the ``Hockey-Stick'' Divergence, which is demonstrated empirically to provide the largest lower-bound across all considered settings. If the observations are subject to privatisation, stronger impossibility results can be obtained via Strong Data-Processing Inequalities. The paper also discusses some generalisations and alternative directions.
△ Less
Submitted 24 March, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
Generalization Error Bounds for Noisy, Iterative Algorithms via Maximal Leakage
Authors:
Ibrahim Issa,
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equiv…
▽ More
We adopt an information-theoretic framework to analyze the generalization behavior of the class of iterative, noisy learning algorithms. This class is particularly suitable for study under information-theoretic metrics as the algorithms are inherently randomized, and it includes commonly used algorithms such as Stochastic Gradient Langevin Dynamics (SGLD). Herein, we use the maximal leakage (equivalently, the Sibson mutual information of order infinity) metric, as it is simple to analyze, and it implies both bounds on the probability of having a large generalization error and on its expected value. We show that, if the update function (e.g., gradient) is bounded in $L_2$-norm and the additive noise is isotropic Gaussian noise, then one can obtain an upper-bound on maximal leakage in semi-closed form. Furthermore, we demonstrate how the assumptions on the update function affect the optimal (in the sense of minimizing the induced maximal leakage) choice of the noise. Finally, we compute explicit tight upper bounds on the induced maximal leakage for other scenarios of interest.
△ Less
Submitted 19 July, 2023; v1 submitted 28 February, 2023;
originally announced February 2023.
-
Finite Littlestone Dimension Implies Finite Information Complexity
Authors:
Aditya Pradeep,
Ido Nachum,
Michael Gastpar
Abstract:
We prove that every online learnable class of functions of Littlestone dimension $d$ admits a learning algorithm with finite information complexity. Towards this end, we use the notion of a globally stable algorithm. Generally, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in $d$. We also show there is room for improvement; for a canonical…
▽ More
We prove that every online learnable class of functions of Littlestone dimension $d$ admits a learning algorithm with finite information complexity. Towards this end, we use the notion of a globally stable algorithm. Generally, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in $d$. We also show there is room for improvement; for a canonical online learnable class, indicator functions of affine subspaces of dimension $d$, the information complexity can be upper bounded logarithmically in $d$.
△ Less
Submitted 27 June, 2022;
originally announced June 2022.
-
Alpha-NML Universal Predictors
Authors:
Marco Bondaschi,
Michael Gastpar
Abstract:
Inspired by the connection between classical regret measures employed in universal prediction and Rényi divergence, we introduce a new class of universal predictors that depend on a real parameter $α\geq 1$. This class interpolates two well-known predictors, the mixture estimators, that include the Laplace and the Krichevsky-Trofimov predictors, and the Normalized Maximum Likelihood (NML) estimato…
▽ More
Inspired by the connection between classical regret measures employed in universal prediction and Rényi divergence, we introduce a new class of universal predictors that depend on a real parameter $α\geq 1$. This class interpolates two well-known predictors, the mixture estimators, that include the Laplace and the Krichevsky-Trofimov predictors, and the Normalized Maximum Likelihood (NML) estimator. We point out some advantages of this new class of predictors and study its benefits from two complementary viewpoints: (1) we prove its optimality when the maximal Rényi divergence is considered as a regret measure, which can be interpreted operationally as a middle ground between the standard average and worst-case regret measures; (2) we discuss how it can be employed when NML is not a viable option, as an alternative to other predictors such as Luckiness NML. Finally, we apply the $α$-NML predictor to the class of discrete memoryless sources (DMS), where we derive simple formulas to compute the predictor and analyze its asymptotic performance in terms of worst-case regret.
△ Less
Submitted 1 May, 2024; v1 submitted 25 February, 2022;
originally announced February 2022.
-
From Generalisation Error to Transportation-cost Inequalities and Back
Authors:
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
In this work, we connect the problem of bounding the expected generalisation error with transportation-cost inequalities. Exposing the underlying pattern behind both approaches we are able to generalise them and go beyond Kullback-Leibler Divergences/Mutual Information and sub-Gaussian measures. In particular, we are able to provide a result showing the equivalence between two families of inequali…
▽ More
In this work, we connect the problem of bounding the expected generalisation error with transportation-cost inequalities. Exposing the underlying pattern behind both approaches we are able to generalise them and go beyond Kullback-Leibler Divergences/Mutual Information and sub-Gaussian measures. In particular, we are able to provide a result showing the equivalence between two families of inequalities: one involving functionals and one involving measures. This result generalises the one proposed by Bobkov and Götze that connects transportation-cost inequalities with concentration of measure. Moreover, it allows us to recover all standard generalisation error bounds involving mutual information and to introduce new, more general bounds, that involve arbitrary divergence measures.
△ Less
Submitted 25 March, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
-
On Sibson's $α$-Mutual Information
Authors:
Amedeo Roberto Esposito,
Adrien Vandenbroucque,
Michael Gastpar
Abstract:
We explore a family of information measures that stems from Rényi's $α$-Divergences with $α<0$. In particular, we extend the definition of Sibson's $α$-Mutual Information to negative values of $α$ and show several properties of these objects. Moreover, we highlight how this family of information measures is related to functional inequalities that can be employed in a variety of fields, including l…
▽ More
We explore a family of information measures that stems from Rényi's $α$-Divergences with $α<0$. In particular, we extend the definition of Sibson's $α$-Mutual Information to negative values of $α$ and show several properties of these objects. Moreover, we highlight how this family of information measures is related to functional inequalities that can be employed in a variety of fields, including lower-bounds on the Risk in Bayesian Estimation Procedures.
△ Less
Submitted 8 February, 2022;
originally announced February 2022.
-
Lower-bounds on the Bayesian Risk in Estimation Procedures via $f$-Divergences
Authors:
Adrien Vandenbroucque,
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
We consider the problem of parameter estimation in a Bayesian setting and propose a general lower-bound that includes part of the family of $f$-Divergences. The results are then applied to specific settings of interest and compared to other notable results in the literature. In particular, we show that the known bounds using Mutual Information can be improved by using, for example, Maximal Leakage…
▽ More
We consider the problem of parameter estimation in a Bayesian setting and propose a general lower-bound that includes part of the family of $f$-Divergences. The results are then applied to specific settings of interest and compared to other notable results in the literature. In particular, we show that the known bounds using Mutual Information can be improved by using, for example, Maximal Leakage, Hellinger divergence, or generalizations of the Hockey-Stick divergence.
△ Less
Submitted 18 May, 2022; v1 submitted 5 February, 2022;
originally announced February 2022.
-
The Price of Distributed: Rate Loss in the CEO Problem
Authors:
Arda Atalik,
Alper Köse,
Michael Gastpar
Abstract:
In the distributed remote (CEO) source coding problem, many separate encoders observe independently noisy copies of an underlying source. The rate loss is the difference between the rate required in this distributed setting and the rate that would be required in a setting where the encoders can fully cooperate. In this sense, the rate loss characterizes the price of distributed processing. We surv…
▽ More
In the distributed remote (CEO) source coding problem, many separate encoders observe independently noisy copies of an underlying source. The rate loss is the difference between the rate required in this distributed setting and the rate that would be required in a setting where the encoders can fully cooperate. In this sense, the rate loss characterizes the price of distributed processing. We survey and extend the known results on the rate loss in various settings, with a particular emphasis on the case where the noise in the observations is Gaussian, but the underlying source is general.
△ Less
Submitted 3 February, 2022;
originally announced February 2022.
-
Shannon Bounds on Lossy Gray-Wyner Networks
Authors:
Erixhen Sula,
Michael Gastpar
Abstract:
The Gray-Wyner network subject to a fidelity criterion is studied. Upper and lower bounds for the trade-offs between the private sum-rate and the common rate are obtained for arbitrary sources subject to mean-squared error distortion. The bounds meet exactly, leading to the computation of the rate region, when the source is jointly Gaussian. They meet partially when the sources are modeled via an…
▽ More
The Gray-Wyner network subject to a fidelity criterion is studied. Upper and lower bounds for the trade-offs between the private sum-rate and the common rate are obtained for arbitrary sources subject to mean-squared error distortion. The bounds meet exactly, leading to the computation of the rate region, when the source is jointly Gaussian. They meet partially when the sources are modeled via an additive Gaussian "channel". The bounds are inspired from the Shannon bounds on the rate-distortion problem.
△ Less
Submitted 2 February, 2022;
originally announced February 2022.
-
A Johnson--Lindenstrauss Framework for Randomly Initialized CNNs
Authors:
Ido Nachum,
Jan Hązła,
Michael Gastpar,
Anatoly Khina
Abstract:
How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson--Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with the ReLU activation, the angle between two inputs contracts according to a known…
▽ More
How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson--Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with the ReLU activation, the angle between two inputs contracts according to a known mapping. The question for non-linear convolutional neural networks (CNNs) becomes much more intricate. To answer this question, we introduce a geometric framework. For linear CNNs, we show that the Johnson--Lindenstrauss lemma continues to hold, namely, that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: The angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.
△ Less
Submitted 7 March, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
A Unified Discretization Approach to Compute-Forward: From Discrete to Continuous Inputs
Authors:
Adriano Pastore,
Sung Hoon Lim,
Chen Feng,
Bobak Nazer,
Michael Gastpar
Abstract:
Compute-forward is a coding technique that enables receiver(s) in a network to directly decode one or more linear combinations of the transmitted codewords. Initial efforts focused on Gaussian channels and derived achievable rate regions via nested lattice codes and single-user (lattice) decoding as well as sequential (lattice) decoding. Recently, these results have been generalized to discrete me…
▽ More
Compute-forward is a coding technique that enables receiver(s) in a network to directly decode one or more linear combinations of the transmitted codewords. Initial efforts focused on Gaussian channels and derived achievable rate regions via nested lattice codes and single-user (lattice) decoding as well as sequential (lattice) decoding. Recently, these results have been generalized to discrete memoryless channels via nested linear codes and joint typicality coding, culminating in a simultaneous-decoding rate region for recovering one or more linear combinations from $K$ users. Using a discretization approach, this paper translates this result into a simultaneous-decoding rate region for a wide class of continuous memoryless channels, including the important special case of Gaussian channels. Additionally, this paper derives a single, unified expression for both discrete and continuous rate regions via an algebraic generalization of Rényi's information dimension.
△ Less
Submitted 30 September, 2021;
originally announced October 2021.
-
Private Retrieval, Computing and Learning: Recent Progress and Future Challenges
Authors:
Sennur Ulukus,
Salman Avestimehr,
Michael Gastpar,
Syed Jafar,
Ravi Tandon,
Chao Tian
Abstract:
Most of our lives are conducted in the cyberspace. The human notion of privacy translates into a cyber notion of privacy on many functions that take place in the cyberspace. This article focuses on three such functions: how to privately retrieve information from cyberspace (privacy in information retrieval), how to privately leverage large-scale distributed/parallel processing (privacy in distribu…
▽ More
Most of our lives are conducted in the cyberspace. The human notion of privacy translates into a cyber notion of privacy on many functions that take place in the cyberspace. This article focuses on three such functions: how to privately retrieve information from cyberspace (privacy in information retrieval), how to privately leverage large-scale distributed/parallel processing (privacy in distributed computing), and how to learn/train machine learning models from private data spread across multiple users (privacy in distributed (federated) learning). The article motivates each privacy setting, describes the problem formulation, summarizes breakthrough results in the history of each problem, and gives recent results and discusses some of the major ideas that emerged in each field. In addition, the cross-cutting techniques and interconnections between the three topics are discussed along with a set of open problems and challenges.
△ Less
Submitted 30 July, 2021;
originally announced August 2021.
-
Differential Entropy of the Conditional Expectation under Additive Gaussian Noise
Authors:
Arda Atalik,
Alper Köse,
Michael Gastpar
Abstract:
The conditional mean is a fundamental and important quantity whose applications include the theories of estimation and rate-distortion. It is also notoriously difficult to work with. This paper establishes novel bounds on the differential entropy of the conditional mean in the case of finite-variance input signals and additive Gaussian noise. The main result is a new lower bound in terms of the di…
▽ More
The conditional mean is a fundamental and important quantity whose applications include the theories of estimation and rate-distortion. It is also notoriously difficult to work with. This paper establishes novel bounds on the differential entropy of the conditional mean in the case of finite-variance input signals and additive Gaussian noise. The main result is a new lower bound in terms of the differential entropies of the input signal and the noisy observation. The main results are also extended to the vector Gaussian channel and to the natural exponential family. Various other properties such as upper bounds, asymptotics, Taylor series expansion, and connection to Fisher Information are obtained. Two applications of the lower bound in the remote-source coding and CEO problem are discussed.
△ Less
Submitted 11 August, 2022; v1 submitted 8 June, 2021;
originally announced June 2021.
-
Lower bound on Wyner's Common Information
Authors:
Erixhen Sula,
Michael Gastpar
Abstract:
An important notion of common information between two random variables is due to Wyner. In this paper, we derive a lower bound on Wyner's common information for continuous random variables. The new bound improves on the only other general lower bound on Wyner's common information, which is the mutual information. We also show that the new lower bound is tight for the so-called "Gaussian channels"…
▽ More
An important notion of common information between two random variables is due to Wyner. In this paper, we derive a lower bound on Wyner's common information for continuous random variables. The new bound improves on the only other general lower bound on Wyner's common information, which is the mutual information. We also show that the new lower bound is tight for the so-called "Gaussian channels" case, namely, when the joint distribution of the random variables can be written as the sum of a single underlying random variable and Gaussian noises. We motivate this work from the recent variations of Wyner's common information and applications to network data compression problems such as the Gray-Wyner network.
△ Less
Submitted 16 February, 2021;
originally announced February 2021.
-
On conditional Sibson's $α$-Mutual Information
Authors:
Amedeo Roberto Esposito,
Diyuan Wu,
Michael Gastpar
Abstract:
In this work, we analyse how to define a conditional version of Sibson's $α$-Mutual Information. Several such definitions can be advanced and they all lead to different information measures with different (but similar) operational meanings. We will analyse in detail one such definition, compute a closed-form expression for it and endorse it with an operational meaning while also considering some a…
▽ More
In this work, we analyse how to define a conditional version of Sibson's $α$-Mutual Information. Several such definitions can be advanced and they all lead to different information measures with different (but similar) operational meanings. We will analyse in detail one such definition, compute a closed-form expression for it and endorse it with an operational meaning while also considering some applications. The alternative definitions will also be mentioned and compared.
△ Less
Submitted 8 June, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Learning, compression, and leakage: Minimising classification error via meta-universal compression principles
Authors:
Fernando E. Rosas,
Pedro A. M. Mediano,
Michael Gastpar
Abstract:
Learning and compression are driven by the common aim of identifying and exploiting statistical regularities in data, which opens the door for fertile collaboration between these areas. A promising group of compression techniques for learning scenarios is normalised maximum likelihood (NML) coding, which provides strong guarantees for compression of small datasets - in contrast with more popular e…
▽ More
Learning and compression are driven by the common aim of identifying and exploiting statistical regularities in data, which opens the door for fertile collaboration between these areas. A promising group of compression techniques for learning scenarios is normalised maximum likelihood (NML) coding, which provides strong guarantees for compression of small datasets - in contrast with more popular estimators whose guarantees hold only in the asymptotic limit. Here we consider a NML-based decision strategy for supervised classification problems, and show that it attains heuristic PAC learning when applied to a wide variety of models. Furthermore, we show that the misclassification rate of our method is upper bounded by the maximal leakage, a recently proposed metric to quantify the potential of data leakage in privacy-sensitive scenarios.
△ Less
Submitted 31 January, 2021; v1 submitted 14 October, 2020;
originally announced October 2020.
-
The Gaussian lossy Gray-Wyner network
Authors:
Erixhen Sula,
Michael Gastpar
Abstract:
We consider the problem of source coding subject to a fidelity criterion for the Gray-Wyner network that connects a single source with two receivers via a common channel and two private channels. The pareto-optimal trade-offs between the sum-rate of the private channels and the rate of the common channel is completely characterized for jointly Gaussian sources subject to the mean-squared error cri…
▽ More
We consider the problem of source coding subject to a fidelity criterion for the Gray-Wyner network that connects a single source with two receivers via a common channel and two private channels. The pareto-optimal trade-offs between the sum-rate of the private channels and the rate of the common channel is completely characterized for jointly Gaussian sources subject to the mean-squared error criterion, leveraging convex duality and an argument involving the factorization of convex envelopes. Specifically, it is attained by selecting the auxiliary random variable to be jointly Gaussian with the sources.
△ Less
Submitted 30 September, 2020; v1 submitted 2 February, 2020;
originally announced February 2020.
-
Common Information Components Analysis
Authors:
Michael Gastpar,
Erixhen Sula
Abstract:
We give an information-theoretic interpretation of Canonical Correlation Analysis (CCA) via (relaxed) Wyner's common information. CCA permits to extract from two high-dimensional data sets low-dimensional descriptions (features) that capture the commonalities between the data sets, using a framework of correlations and linear transforms. Our interpretation first extracts the common information up…
▽ More
We give an information-theoretic interpretation of Canonical Correlation Analysis (CCA) via (relaxed) Wyner's common information. CCA permits to extract from two high-dimensional data sets low-dimensional descriptions (features) that capture the commonalities between the data sets, using a framework of correlations and linear transforms. Our interpretation first extracts the common information up to a pre-selected resolution level, and then projects this back onto each of the data sets. In the case of Gaussian statistics, this procedure precisely reduces to CCA, where the resolution level specifies the number of CCA components that are extracted. This also suggests a novel algorithm, Common Information Components Analysis (CICA), with several desirable features, including a natural extension to beyond just two data sets.
△ Less
Submitted 28 February, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Robust Generalization via $α$-Mutual Information
Authors:
Amedeo Roberto Esposito,
Michael Gastpar,
Ibrahim Issa
Abstract:
The aim of this work is to provide bounds connecting two probability measures of the same event using Rényi $α$-Divergences and Sibson's $α$-Mutual Information, a generalization of respectively the Kullback-Leibler Divergence and Shannon's Mutual Information. A particular case of interest can be found when the two probability measures considered are a joint distribution and the corresponding produ…
▽ More
The aim of this work is to provide bounds connecting two probability measures of the same event using Rényi $α$-Divergences and Sibson's $α$-Mutual Information, a generalization of respectively the Kullback-Leibler Divergence and Shannon's Mutual Information. A particular case of interest can be found when the two probability measures considered are a joint distribution and the corresponding product of marginals (representing the statistically independent scenario). In this case, a bound using Sibson's $α-$Mutual Information is retrieved, extending a result involving Maximal Leakage to general alphabets. These results have broad applications, from bounding the generalization error of learning algorithms to the more general framework of adaptive data analysis, provided that the divergences and/or information measures used are amenable to such an analysis ({\it i.e.,} are robust to post-processing and compose adaptively). The generalization error bounds are derived with respect to high-probability events but a corresponding bound on expected generalization error is also retrieved.
△ Less
Submitted 14 January, 2020;
originally announced January 2020.
-
On Wyner's Common Information in the Gaussian Case
Authors:
Erixhen Sula,
Michael Gastpar
Abstract:
Wyner's Common Information and a natural relaxation are studied in the special case of Gaussian random variables. The relaxation replaces conditional independence by a bound on the conditional mutual information. The main contribution is the proof that Gaussian auxiliaries are optimal, leading to a closed-form formula. As a corollary, the proof technique also establishes the optimality of Gaussian…
▽ More
Wyner's Common Information and a natural relaxation are studied in the special case of Gaussian random variables. The relaxation replaces conditional independence by a bound on the conditional mutual information. The main contribution is the proof that Gaussian auxiliaries are optimal, leading to a closed-form formula. As a corollary, the proof technique also establishes the optimality of Gaussian auxiliaries for the Gaussian Gray-Wyner network, a long-standing open problem.
△ Less
Submitted 28 September, 2020; v1 submitted 15 December, 2019;
originally announced December 2019.
-
Generalization Error Bounds Via Rényi-, $f$-Divergences and Maximal Leakage
Authors:
Amedeo Roberto Esposito,
Michael Gastpar,
Ibrahim Issa
Abstract:
In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze) together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are introduced and in learning theory, where they can be…
▽ More
In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze) together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are introduced and in learning theory, where they can be employed to bound the generalization error of a learning algorithm. Bounds are given in terms of Sibson's Mutual Information, $α-$Divergences, Hellinger Divergences, and $f-$Divergences. A case of particular interest is the Maximal Leakage (or Sibson's Mutual Information of order infinity), since this measure is robust to post-processing and composes adaptively. The corresponding bound can be seen as a generalization of classical bounds, such as Hoeffding's and McDiarmid's inequalities, to the case of dependent random variables.
△ Less
Submitted 21 October, 2020; v1 submitted 1 December, 2019;
originally announced December 2019.
-
Strengthened Information-theoretic Bounds on the Generalization Error
Authors:
Ibrahim Issa,
Amedeo Roberto Esposito,
Michael Gastpar
Abstract:
The following problem is considered: given a joint distribution $P_{XY}$ and an event $E$, bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the product of the marginals of $P_{XY}$) and a measure of dependence of $X$ and $Y$. Such bounds have direct applications in the analysis of the generalization error of learning algorithms, where $E$ represents a large error event and the measure…
▽ More
The following problem is considered: given a joint distribution $P_{XY}$ and an event $E$, bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the product of the marginals of $P_{XY}$) and a measure of dependence of $X$ and $Y$. Such bounds have direct applications in the analysis of the generalization error of learning algorithms, where $E$ represents a large error event and the measure of dependence controls the degree of overfitting. Herein, bounds are demonstrated using several information-theoretic metrics, in particular: mutual information, lautum information, maximal leakage, and $J_\infty$. The mutual information bound can outperform comparable bounds in the literature by an arbitrarily large factor.
△ Less
Submitted 9 March, 2019;
originally announced March 2019.
-
A New Approach to Adaptive Data Analysis and Learning via Maximal Leakage
Authors:
Amedeo Roberto Esposito,
Michael Gastpar,
Ibrahim Issa
Abstract:
There is an increasing concern that most current published research findings are false. The main cause seems to lie in the fundamental disconnection between theory and practice in data analysis. While the former typically relies on statistical independence, the latter is an inherently adaptive process: new hypotheses are formulated based on the outcomes of previous analyses. A recent line of work…
▽ More
There is an increasing concern that most current published research findings are false. The main cause seems to lie in the fundamental disconnection between theory and practice in data analysis. While the former typically relies on statistical independence, the latter is an inherently adaptive process: new hypotheses are formulated based on the outcomes of previous analyses. A recent line of work tries to mitigate these issues by enforcing constraints, such as differential privacy, that compose adaptively while degrading gracefully and thus provide statistical guarantees even in adaptive contexts. Our contribution consists in the introduction of a new approach, based on the concept of Maximal Leakage, an information-theoretic measure of leakage of information. The main result allows us to compare the probability of an event happening when adaptivity is considered with respect to the non-adaptive scenario. The bound we derive represents a generalization of the bounds used in non-adaptive scenarios (e.g., McDiarmid's inequality for $c$-sensitive functions, false discovery error control via significance level, etc.), and allows us to replicate or even improve, in certain regimes, the results obtained using Max-Information or Differential Privacy. In contrast with the line of work started by Dwork et al., our results do not rely on Differential Privacy but are, in principle, applicable to every algorithm that has a bounded leakage, including the differentially private algorithms and the ones with a short description length.
△ Less
Submitted 5 March, 2019;
originally announced March 2019.
-
Quantifying High-order Interdependencies via Multivariate Extensions of the Mutual Information
Authors:
Fernando Rosas,
Pedro A. M. Mediano,
Michael Gastpar,
Henrik J. Jensen
Abstract:
This article introduces a model-agnostic approach to study statistical synergy, a form of emergence in which patterns at large scales are not traceable from lower scales. Our framework leverages various multivariate extensions of Shannon's mutual information, and introduces the O-information as a metric capable of characterising synergy- and redundancy-dominated systems. We develop key analytical…
▽ More
This article introduces a model-agnostic approach to study statistical synergy, a form of emergence in which patterns at large scales are not traceable from lower scales. Our framework leverages various multivariate extensions of Shannon's mutual information, and introduces the O-information as a metric capable of characterising synergy- and redundancy-dominated systems. We develop key analytical properties of the O-information, and study how it relates to other metrics of high-order interactions from the statistical mechanics and neuroscience literature. Finally, as a proof of concept, we use the proposed framework to explore the relevance of statistical synergy in Baroque music scores.
△ Less
Submitted 28 February, 2019;
originally announced February 2019.
-
Towards an Algebraic Network Information Theory: Simultaneous Joint Typicality Decoding
Authors:
Sung Hoon Lim,
Chen Feng,
Adriano Pastore,
Bobak Nazer,
Michael Gastpar
Abstract:
Consider a receiver in a multi-user network that wishes to decode several messages. Simultaneous joint typicality decoding is one of the most powerful techniques for determining the fundamental limits at which reliable decoding is possible. This technique has historically been used in conjunction with random i.i.d. codebooks to establish achievable rate regions for networks. Recently, it has been…
▽ More
Consider a receiver in a multi-user network that wishes to decode several messages. Simultaneous joint typicality decoding is one of the most powerful techniques for determining the fundamental limits at which reliable decoding is possible. This technique has historically been used in conjunction with random i.i.d. codebooks to establish achievable rate regions for networks. Recently, it has been shown that, in certain scenarios, nested linear codebooks in conjunction with "single-user" or sequential decoding can yield better achievable rates. For instance, the compute-forward problem examines the scenario of recovering $L \le K$ linear combinations of transmitted codewords over a $K$-user multiple-access channel (MAC), and it is well established that linear codebooks can yield higher rates. Here, we develop bounds for simultaneous joint typicality decoding used in conjunction with nested linear codebooks, and apply them to obtain a larger achievable region for compute-forward over a $K$-user discrete memoryless MAC. The key technical challenge is that competing codeword tuples that are linearly dependent on the true codeword tuple introduce statistical dependencies, which requires careful partitioning of the associated error events.
△ Less
Submitted 10 January, 2019;
originally announced January 2019.
-
Locally Differentially-Private Randomized Response for Discrete Distribution Learning
Authors:
Adriano Pastore,
Michael Gastpar
Abstract:
We consider a setup in which confidential i.i.d. samples $X_1,\dotsc,X_n$ from an unknown finite-support distribution $\boldsymbol{p}$ are passed through $n$ copies of a discrete privatization channel (a.k.a. mechanism) producing outputs $Y_1,\dotsc,Y_n$. The channel law guarantees a local differential privacy of $ε$. Subject to a prescribed privacy level $ε$, the optimal channel should be designe…
▽ More
We consider a setup in which confidential i.i.d. samples $X_1,\dotsc,X_n$ from an unknown finite-support distribution $\boldsymbol{p}$ are passed through $n$ copies of a discrete privatization channel (a.k.a. mechanism) producing outputs $Y_1,\dotsc,Y_n$. The channel law guarantees a local differential privacy of $ε$. Subject to a prescribed privacy level $ε$, the optimal channel should be designed such that an estimate of the source distribution based on the channel outputs $Y_1,\dotsc,Y_n$ converges as fast as possible to the exact value $\boldsymbol{p}$. For this purpose we study the convergence to zero of three distribution distance metrics: $f$-divergence, mean-squared error and total variation. We derive the respective normalized first-order terms of convergence (as $n\to\infty$), which for a given target privacy $ε$ represent a rule-of-thumb factor by which the sample size must be augmented so as to achieve the same estimation accuracy as that of a non-randomizing channel. We formulate the privacy-fidelity trade-off problem as being that of minimizing said first-order term under a privacy constraint $ε$. We further identify a scalar quantity that captures the essence of this trade-off, and prove bounds and data-processing inequalities on this quantity. For some specific instances of the privacy-fidelity trade-off problem, we derive inner and outer bounds on the optimal trade-off curve.
△ Less
Submitted 29 November, 2018;
originally announced November 2018.
-
Sum-Rate Capacity for Symmetric Gaussian Multiple Access Channels with Feedback
Authors:
Erixhen Sula,
Michael Gastpar,
Gerhard Kramer
Abstract:
The feedback sum-rate capacity is established for the symmetric $J$-user Gaussian multiple-access channel (GMAC). The main contribution is a converse bound that combines the dependence-balance argument of Hekstra and Willems (1989) with a variant of the factorization of a convex envelope of Geng and Nair (2014). The converse bound matches the achievable sum-rate of the Fourier-Modulated Estimate C…
▽ More
The feedback sum-rate capacity is established for the symmetric $J$-user Gaussian multiple-access channel (GMAC). The main contribution is a converse bound that combines the dependence-balance argument of Hekstra and Willems (1989) with a variant of the factorization of a convex envelope of Geng and Nair (2014). The converse bound matches the achievable sum-rate of the Fourier-Modulated Estimate Correction strategy of Kramer (2002).
△ Less
Submitted 23 November, 2018;
originally announced November 2018.
-
Converse for Multi-Server Single-Message PIR with Side Information
Authors:
Su Li,
Michael Gastpar
Abstract:
Multi-server single-message private information retrieval is studied in the presence of side information. In this problem, $K$ independent messages are replicatively stored at $N$ non-colluding servers. The user wants to privately download one message from the servers without revealing the index of the message to any of the servers, leveraging its $M$ side information messages. We assume that the…
▽ More
Multi-server single-message private information retrieval is studied in the presence of side information. In this problem, $K$ independent messages are replicatively stored at $N$ non-colluding servers. The user wants to privately download one message from the servers without revealing the index of the message to any of the servers, leveraging its $M$ side information messages. We assume that the servers only know the number of the side information messages available at the user but not their indices. We prove a converse bound on the maximum download rates, which coincides with the known achievability scheme proposed by Kadhe {\it et. al.}. Hence, we characterize the capacity for this problem, which is $(1+\frac{1}{N}+\frac{1}{N^2}+\dots+\frac{1}{N^{\left\lceil \frac{K}{M+1}\right\rceil-1}})^{-1}$. The proof leverages a novel concept that we call {\it virtual side information}, which, for a fixed query and any message, identifies the side information that would be needed in order to recover that message.
△ Less
Submitted 1 November, 2019; v1 submitted 26 September, 2018;
originally announced September 2018.
-
The Optimal Memory-Rate Trade-off for the Non-uniform Centralized Caching Problem with Two Files under Uncoded Placement
Authors:
Saeid Sahraei,
Pierre Quinton,
Michael Gastpar
Abstract:
A new scheme for the problem of centralized coded caching with non-uniform demands is proposed. The distinguishing feature of the proposed placement strategy is that it admits equal sub-packetization for all files while allowing the users to allocate more cache to the files which are more popular. This creates natural broadcasting opportunities in the delivery phase which are simultaneously helpfu…
▽ More
A new scheme for the problem of centralized coded caching with non-uniform demands is proposed. The distinguishing feature of the proposed placement strategy is that it admits equal sub-packetization for all files while allowing the users to allocate more cache to the files which are more popular. This creates natural broadcasting opportunities in the delivery phase which are simultaneously helpful for the users who have requested files of different popularities. For the case of two files, we propose a new delivery strategy based on interference alignment which enables each user to decode his desired file following a two-layer peeling decoder. Furthermore, we extend the existing converse bounds for uniform demands under uncoded placement to the nonuniform case. To accomplish this, we construct $N!$ auxiliary users, corresponding to all permutations of the $N$ files, each caching carefully selected sub-packets of the files. Each auxiliary user provides a different converse bound. The overall converse bound is the maximum of all these $N!$ bounds. We prove that our achievable delivery rate for the case of two files meets this converse, thereby establishing the optimal expected memory-rate trade-off for the case of $K$ users and two files with arbitrary popularities under uncoded placement.
△ Less
Submitted 25 July, 2019; v1 submitted 23 August, 2018;
originally announced August 2018.
-
Single-Server Multi-Message Private Information Retrieval with Side Information
Authors:
Su Li,
Michael Gastpar
Abstract:
We study the problem of single-server multi-message private information retrieval with side information. One user wants to recover $N$ out of $K$ independent messages which are stored at a single server. The user initially possesses a subset of $M$ messages as side information. The goal of the user is to download the $N$ demand messages while not leaking any information about the indices of these…
▽ More
We study the problem of single-server multi-message private information retrieval with side information. One user wants to recover $N$ out of $K$ independent messages which are stored at a single server. The user initially possesses a subset of $M$ messages as side information. The goal of the user is to download the $N$ demand messages while not leaking any information about the indices of these messages to the server. In this paper, we characterize the minimum number of required transmissions. We also present the optimal linear coding scheme which enables the user to download the demand messages and preserves the privacy of their indices. Moreover, we show that the trivial MDS coding scheme with $K-M$ transmissions is optimal if $N>M$ or $N^2+N \ge K-M$. This means if one wishes to privately download more than the square-root of the number of files in the database, then one must effectively download the full database (minus the side information), irrespective of the amount of side information one has available.
△ Less
Submitted 17 August, 2018;
originally announced August 2018.
-
Remote Source Coding under Gaussian Noise : Dueling Roles of Power and Entropy Power
Authors:
Krishnan Eswaran,
Michael Gastpar
Abstract:
The distributed remote source coding (so-called CEO) problem is studied in the case where the underlying source, not necessarily Gaussian, has finite differential entropy and the observation noise is Gaussian. The main result is a new lower bound for the sum-rate-distortion function under arbitrary distortion measures. When specialized to the case of mean-squared error, it is shown that the bound…
▽ More
The distributed remote source coding (so-called CEO) problem is studied in the case where the underlying source, not necessarily Gaussian, has finite differential entropy and the observation noise is Gaussian. The main result is a new lower bound for the sum-rate-distortion function under arbitrary distortion measures. When specialized to the case of mean-squared error, it is shown that the bound exactly mirrors a corresponding upper bound, except that the upper bound has the source power (variance) whereas the lower bound has the source entropy power. Bounds exhibiting this pleasing duality of power and entropy power have been well known for direct and centralized source coding since Shannon's work. While the bounds hold generally, their value is most pronounced when interpreted as a function of the number of agents in the CEO problem.
△ Less
Submitted 7 February, 2019; v1 submitted 16 May, 2018;
originally announced May 2018.
-
A Novel Centralized Strategy for Coded Caching with Non-uniform Demands
Authors:
Pierre Quinton,
Saeid Sahraei,
Michael Gastpar
Abstract:
Despite significant progress in the caching literature concerning the worst case and uniform average case regimes, the algorithms for caching with nonuniform demands are still at a basic stage and mostly rely on simple grouping and memory-sharing techniques. In this work we introduce a novel centralized caching strategy for caching with nonuniform file popularities. Our scheme allows for assigning…
▽ More
Despite significant progress in the caching literature concerning the worst case and uniform average case regimes, the algorithms for caching with nonuniform demands are still at a basic stage and mostly rely on simple grouping and memory-sharing techniques. In this work we introduce a novel centralized caching strategy for caching with nonuniform file popularities. Our scheme allows for assigning more cache to the files which are more likely to be requested, while maintaining the same sub-packetization for all the files. As a result, in the delivery phase it is possible to perform linear codes across files with different popularities without resorting to zero-padding or concatenation techniques. We will describe our placement strategy for arbitrary range of parameters. The delivery phase will be outlined for a small example for which we are able to show a noticeable improvement over the state of the art.
△ Less
Submitted 31 January, 2018;
originally announced January 2018.
-
Compute--Forward Multiple Access (CFMA): Practical Code Design
Authors:
Erixhen Sula,
Jingge Zhu,
Adriano Pastore,
Sung Hoon Lim,
Michael Gastpar
Abstract:
We present a practical strategy that aims to attain rate points on the dominant face of the multiple access channel capacity using a standard low complexity decoder. This technique is built upon recent theoretical developments of Zhu and Gastpar on compute-forward multiple access (CFMA) which achieves the capacity of the multiple access channel using a sequential decoder. We illustrate this strate…
▽ More
We present a practical strategy that aims to attain rate points on the dominant face of the multiple access channel capacity using a standard low complexity decoder. This technique is built upon recent theoretical developments of Zhu and Gastpar on compute-forward multiple access (CFMA) which achieves the capacity of the multiple access channel using a sequential decoder. We illustrate this strategy with off-the-shelf LDPC codes. In the first stage of decoding, the receiver first recovers a linear combination of the transmitted codewords using the sum-product algorithm (SPA). In the second stage, by using the recovered sum-of-codewords as side information, the receiver recovers one of the two codewords using a modified SPA, ultimately recovering both codewords. The main benefit of recovering the sum-of-codewords instead of the codeword itself is that it allows to attain points on the dominant face of the multiple access channel capacity without the need of rate-splitting or time sharing while maintaining a low complexity in the order of a standard point-to-point decoder. This property is also shown to be crucial for some applications, e.g., interference channels. For all the simulations with single-layer binary codes, our proposed practical strategy is shown to be within \SI{1.7}{\decibel} of the theoretical limits, without explicit optimization on the off-the-self LDPC codes.
△ Less
Submitted 29 December, 2017;
originally announced December 2017.
-
Cooperative Data Exchange based on MDS Codes
Authors:
Su Li,
Michael Gastpar
Abstract:
The cooperative data exchange problem is studied for the fully connected network. In this problem, each node initially only possesses a subset of the $K$ packets making up the file. Nodes make broadcast transmissions that are received by all other nodes. The goal is for each node to recover the full file. In this paper, we present a polynomial-time deterministic algorithm to compute the optimal (i…
▽ More
The cooperative data exchange problem is studied for the fully connected network. In this problem, each node initially only possesses a subset of the $K$ packets making up the file. Nodes make broadcast transmissions that are received by all other nodes. The goal is for each node to recover the full file. In this paper, we present a polynomial-time deterministic algorithm to compute the optimal (i.e., minimal) number of required broadcast transmissions and to determine the precise transmissions to be made by the nodes. A particular feature of our approach is that {\it each} of the $K-d$ transmissions is a linear combination of {\it exactly} $d+1$ packets, and we show how to optimally choose the value of $d.$ We also show how the coefficients of these linear combinations can be chosen by leveraging a connection to Maximum Distance Separable (MDS) codes. Moreover, we show that our method can be used to solve cooperative data exchange problems with weighted cost as well as the so-called successive local omniscience problem.
△ Less
Submitted 6 December, 2017;
originally announced December 2017.
-
Increasing Availability in Distributed Storage Systems via Clustering
Authors:
Saeid Sahraei,
Michael Gastpar
Abstract:
We introduce the Fixed Cluster Repair System (FCRS) as a novel architecture for Distributed Storage Systems (DSS), achieving a small repair bandwidth while guaranteeing a high availability. Specifically we partition the set of servers in a DSS into $s$ clusters and allow a failed server to choose any cluster other than its own as its repair group. Thereby, we guarantee an availability of $s-1$. We…
▽ More
We introduce the Fixed Cluster Repair System (FCRS) as a novel architecture for Distributed Storage Systems (DSS), achieving a small repair bandwidth while guaranteeing a high availability. Specifically we partition the set of servers in a DSS into $s$ clusters and allow a failed server to choose any cluster other than its own as its repair group. Thereby, we guarantee an availability of $s-1$. We characterize the repair bandwidth vs. storage trade-off for the FCRS under functional repair and show that the minimum repair bandwidth can be improved by an asymptotic multiplicative factor of $2/3$ compared to the state of the art coding techniques that guarantee the same availability. We further introduce Cubic Codes designed to minimize the repair bandwidth of the FCRS under the exact repair model. We prove an asymptotic multiplicative improvement of $0.79$ in the minimum repair bandwidth compared to the existing exact repair coding techniques that achieve the same availability. We show that Cubic Codes are information-theoretically optimal for the FCRS with $2$ and $3$ complete clusters. Furthermore, under the repair-by-transfer model, Cubic Codes are optimal irrespective of the number of clusters.
△ Less
Submitted 4 March, 2019; v1 submitted 7 October, 2017;
originally announced October 2017.
-
Communication versus Computation: Duality for multiple access channels and source coding
Authors:
Jingge Zhu,
Sung Hoon Lim,
Michael Gastpar
Abstract:
Computation codes in network information theory are designed for the scenarios where the decoder is not interested in recovering the information sources themselves, but only a function thereof. Körner and Marton showed for distributed source coding that such function decoding can be achieved more efficiently than decoding the full information sources. Compute-and-forward has shown that function de…
▽ More
Computation codes in network information theory are designed for the scenarios where the decoder is not interested in recovering the information sources themselves, but only a function thereof. Körner and Marton showed for distributed source coding that such function decoding can be achieved more efficiently than decoding the full information sources. Compute-and-forward has shown that function decoding, in combination with network coding ideas, is a useful building block for end-to-end communication. In both cases, good computation codes are the key component in the coding schemes. In this work, we expose the fact that good computation codes could undermine the capability of the codes for recovering the information sources individually, e.g., for the purpose of multiple access and distributed source coding. Particularly, we establish duality results between the codes which are good for computation and the codes which are good for multiple access or distributed compression.
△ Less
Submitted 26 July, 2017;
originally announced July 2017.
-
GDSP: A Graphical Perspective on the Distributed Storage Systems
Authors:
Saeid Sahraei,
Michael Gastpar
Abstract:
The classical distributed storage problem can be modeled by a k-uniform {\it complete} hyper-graph where vertices represent servers and hyper-edges represent users. Hence each hyper-edge should be able to recover the full file using only the memories of the vertices associated with it. This paper considers the generalization of this problem to {\it arbitrary} hyper-graphs and to the case of multip…
▽ More
The classical distributed storage problem can be modeled by a k-uniform {\it complete} hyper-graph where vertices represent servers and hyper-edges represent users. Hence each hyper-edge should be able to recover the full file using only the memories of the vertices associated with it. This paper considers the generalization of this problem to {\it arbitrary} hyper-graphs and to the case of multiple files, where each user is only interested in one, a problem we will refer to as the graphical distributed storage problem (GDSP). Specifically, we make progress in the analysis of minimum-storage codes for two main subproblems of the GDSP which extend the classical model in two independent directions: the case of an arbitrary graph with multiple files, and the case of an arbitrary hyper-graph with a single file.
△ Less
Submitted 15 February, 2017;
originally announced February 2017.
-
A Joint Typicality Approach to Algebraic Network Information Theory
Authors:
Sung Hoon Lim,
Chen Feng,
Adriano Pastore,
Bobak Nazer,
Michael Gastpar
Abstract:
This paper presents a joint typicality framework for encoding and decoding nested linear codes for multi-user networks. This framework provides a new perspective on compute-forward within the context of discrete memoryless networks. In particular, it establishes an achievable rate region for computing the weighted sum of nested linear codewords over a discrete memoryless multiple-access channel (M…
▽ More
This paper presents a joint typicality framework for encoding and decoding nested linear codes for multi-user networks. This framework provides a new perspective on compute-forward within the context of discrete memoryless networks. In particular, it establishes an achievable rate region for computing the weighted sum of nested linear codewords over a discrete memoryless multiple-access channel (MAC). When specialized to the Gaussian MAC, this rate region recovers and improves upon the lattice-based compute-forward rate region of Nazer and Gastpar, thus providing a unified approach for discrete memoryless and Gaussian networks. Furthermore, this framework can be used to shed light on the joint decoding rate region for compute-forward, which is considered an open problem. Specifically, this work establishes an achievable rate region for simultaneously decoding two linear combinations of nested linear codewords from K senders.
△ Less
Submitted 30 June, 2016;
originally announced June 2016.
-
Information Theoretic Caching: The Multi-User Case
Authors:
Sung Hoon Lim,
Chien-Yi Wang,
Michael Gastpar
Abstract:
In this paper, we consider a cache aided network in which each user is assumed to have individual caches, while upon users' requests, an update message is sent though a common link to all users. First, we formulate a general information theoretic setting that represents the database as a discrete memoryless source, and the users' requests as side information that is available everywhere except at…
▽ More
In this paper, we consider a cache aided network in which each user is assumed to have individual caches, while upon users' requests, an update message is sent though a common link to all users. First, we formulate a general information theoretic setting that represents the database as a discrete memoryless source, and the users' requests as side information that is available everywhere except at the cache encoder. The decoders' objective is to recover a function of the source and the side information. By viewing cache aided networks in terms of a general distributed source coding problem and through information theoretic arguments, we present inner and outer bounds on the fundamental tradeoff of cache memory size and update rate. Then, we specialize our general inner and outer bounds to a specific model of content delivery networks: File selection networks, in which the database is a collection of independent equal-size files and each user requests one of the files independently. For file selection networks, we provide an outer bound and two inner bounds (for centralized and decentralized caching strategies). For the case when the user request information is uniformly distributed, we characterize the rate vs. cache size tradeoff to within a multiplicative gap of 4. By further extending our arguments to the framework of Maddah-Ali and Niesen, we also establish a new outer bound and two new inner bounds in which it is shown to recover the centralized and decentralized strategies, previously established by Maddah-Ali and Niesen. Finally, in terms of rate vs. cache size tradeoff, we improve the previous multiplicative gap of 72 to 4.7 for the average case with uniform requests.
△ Less
Submitted 8 April, 2016;
originally announced April 2016.
-
Multi-Library Coded Caching
Authors:
Saeid Sahraei,
Michael Gastpar
Abstract:
We study the problem of coded caching when the server has access to several libraries and each user makes independent requests from every library. The single-library scenario has been well studied and it has been proved that coded caching can significantly improve the delivery rate compared to uncoded caching. In this work we show that when all the libraries have the same number of files, memory-s…
▽ More
We study the problem of coded caching when the server has access to several libraries and each user makes independent requests from every library. The single-library scenario has been well studied and it has been proved that coded caching can significantly improve the delivery rate compared to uncoded caching. In this work we show that when all the libraries have the same number of files, memory-sharing is optimal and the delivery rate cannot be improved via coding across files from different libraries. In this setting, the optimal memory-sharing strategy is one that divides the cache of each user proportional to the size of the files in different libraries. As for the general case, when the number of files in different libraries are arbitrary, we propose an inner-bound based on memory-sharing and an outer-bound based on concatenation of files from different libraries.
△ Less
Submitted 22 January, 2016;
originally announced January 2016.