-
Decentralized Privacy Preservation for Critical Connections in Graphs
Authors:
Conggai Li,
Wei Ni,
Ming Ding,
Youyang Qu,
Jianjun Chen,
David Smith,
Wenjie Zhang,
Thierry Rakotoarivelo
Abstract:
Many real-world interconnections among entities can be characterized as graphs. Collecting local graph information with balanced privacy and data utility has garnered notable interest recently. This paper delves into the problem of identifying and protecting critical information of entity connections for individual participants in a graph based on cohesive subgraph searches. This problem has not b…
▽ More
Many real-world interconnections among entities can be characterized as graphs. Collecting local graph information with balanced privacy and data utility has garnered notable interest recently. This paper delves into the problem of identifying and protecting critical information of entity connections for individual participants in a graph based on cohesive subgraph searches. This problem has not been addressed in the literature. To address the problem, we propose to extract the critical connections of a queried vertex using a fortress-like cohesive subgraph model known as $p$-cohesion. A user's connections within a fortress are obfuscated when being released, to protect critical information about the user. Novel merit and penalty score functions are designed to measure each participant's critical connections in the minimal $p$-cohesion, facilitating effective identification of the connections. We further propose to preserve the privacy of a vertex enquired by only protecting its critical connections when responding to queries raised by data collectors. We prove that, under the decentralized differential privacy (DDP) mechanism, one's response satisfies $(\varepsilon, δ)$-DDP when its critical connections are protected while the rest remains unperturbed. The effectiveness of our proposed method is demonstrated through extensive experiments on real-life graph datasets.
△ Less
Submitted 19 May, 2024;
originally announced May 2024.
-
Privacy at a Price: Exploring its Dual Impact on AI Fairness
Authors:
Mengmeng Yang,
Ming Ding,
Youyang Qu,
Wei Ni,
David Smith,
Thierry Rakotoarivelo
Abstract:
The worldwide adoption of machine learning (ML) and deep learning models, particularly in critical sectors, such as healthcare and finance, presents substantial challenges in maintaining individual privacy and fairness. These two elements are vital to a trustworthy environment for learning systems. While numerous studies have concentrated on protecting individual privacy through differential priva…
▽ More
The worldwide adoption of machine learning (ML) and deep learning models, particularly in critical sectors, such as healthcare and finance, presents substantial challenges in maintaining individual privacy and fairness. These two elements are vital to a trustworthy environment for learning systems. While numerous studies have concentrated on protecting individual privacy through differential privacy (DP) mechanisms, emerging research indicates that differential privacy in machine learning models can unequally impact separate demographic subgroups regarding prediction accuracy. This leads to a fairness concern, and manifests as biased performance. Although the prevailing view is that enhancing privacy intensifies fairness disparities, a smaller, yet significant, subset of research suggests the opposite view. In this article, with extensive evaluation results, we demonstrate that the impact of differential privacy on fairness is not monotonous. Instead, we observe that the accuracy disparity initially grows as more DP noise (enhanced privacy) is added to the ML process, but subsequently diminishes at higher privacy levels with even more noise. Moreover, implementing gradient clipping in the differentially private stochastic gradient descent ML method can mitigate the negative impact of DP noise on fairness. This mitigation is achieved by moderating the disparity growth through a lower clipping threshold.
△ Less
Submitted 14 April, 2024;
originally announced April 2024.
-
{A New Hope}: Contextual Privacy Policies for Mobile Applications and An Approach Toward Automated Generation
Authors:
Shidong Pan,
Zhen Tao,
Thong Hoang,
Dawen Zhang,
Tianshi Li,
Zhenchang Xing,
Sherry Xu,
Mark Staples,
Thierry Rakotoarivelo,
David Lo
Abstract:
Privacy policies have emerged as the predominant approach to conveying privacy notices to mobile application users. In an effort to enhance both readability and user engagement, the concept of contextual privacy policies (CPPs) has been proposed by researchers. The aim of CPPs is to fragment privacy policies into concise snippets, displaying them only within the corresponding contexts within the a…
▽ More
Privacy policies have emerged as the predominant approach to conveying privacy notices to mobile application users. In an effort to enhance both readability and user engagement, the concept of contextual privacy policies (CPPs) has been proposed by researchers. The aim of CPPs is to fragment privacy policies into concise snippets, displaying them only within the corresponding contexts within the application's graphical user interfaces (GUIs). In this paper, we first formulate CPP in mobile application scenario, and then present a novel multimodal framework, named SeePrivacy, specifically designed to automatically generate CPPs for mobile applications. This method uniquely integrates vision-based GUI understanding with privacy policy analysis, achieving 0.88 precision and 0.90 recall to detect contexts, as well as 0.98 precision and 0.96 recall in extracting corresponding policy segments. A human evaluation shows that 77% of the extracted privacy policy segments were perceived as well-aligned with the detected contexts. These findings suggest that SeePrivacy could serve as a significant tool for bolstering user interaction with, and understanding of, privacy policies. Furthermore, our solution has the potential to make privacy notices more accessible and inclusive, thus appealing to a broader demographic. A demonstration of our work can be accessed at https://cpp4app.github.io/SeePrivacy/
△ Less
Submitted 10 March, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
Privacy risk in GeoData: A survey
Authors:
Mahrokh Abdollahi Lorestani,
Thilina Ranbaduge,
Thierry Rakotoarivelo
Abstract:
With the ubiquitous use of location-based services, large-scale individual-level location data has been widely collected through location-awareness devices. The exposure of location data constitutes a significant privacy risk to users as it can lead to de-anonymisation, the inference of sensitive information, and even physical threats. Geoprivacy concerns arise on the issues of user identity de-an…
▽ More
With the ubiquitous use of location-based services, large-scale individual-level location data has been widely collected through location-awareness devices. The exposure of location data constitutes a significant privacy risk to users as it can lead to de-anonymisation, the inference of sensitive information, and even physical threats. Geoprivacy concerns arise on the issues of user identity de-anonymisation and location exposure. In this survey, we analyse different geomasking techniques that have been proposed to protect the privacy of individuals in geodata. We present a taxonomy to characterise these techniques along different dimensions, and conduct a survey of geomasking techniques. We then highlight shortcomings of current techniques and discuss avenues for future research.
△ Less
Submitted 5 February, 2024;
originally announced February 2024.
-
Learn to Unlearn: A Survey on Machine Unlearning
Authors:
Youyang Qu,
Xin Yuan,
Ming Ding,
Wei Ni,
Thierry Rakotoarivelo,
David Smith
Abstract:
Machine Learning (ML) models have been shown to potentially leak sensitive information, thus raising privacy concerns in ML-driven applications. This inspired recent research on removing the influence of specific data samples from a trained ML model. Such efficient removal would enable ML to comply with the "right to be forgotten" in many legislation, and could also address performance bottlenecks…
▽ More
Machine Learning (ML) models have been shown to potentially leak sensitive information, thus raising privacy concerns in ML-driven applications. This inspired recent research on removing the influence of specific data samples from a trained ML model. Such efficient removal would enable ML to comply with the "right to be forgotten" in many legislation, and could also address performance bottlenecks from low-quality or poisonous samples. In that context, machine unlearning methods have been proposed to erase the contributions of designated data samples on models, as an alternative to the often impracticable approach of retraining models from scratch. This article presents a comprehensive review of recent machine unlearning techniques, verification mechanisms, and potential attacks. We further highlight emerging challenges and prospective research directions (e.g. resilience and fairness concerns). We aim for this paper to provide valuable resources for integrating privacy, equity, andresilience into ML systems and help them "learn to unlearn".
△ Less
Submitted 26 October, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
PADDLES: Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy Labels
Authors:
Huaxi Huang,
Hui Kang,
Sheng Liu,
Olivier Salvado,
Thierry Rakotoarivelo,
Dadong Wang,
Tongliang Liu
Abstract:
Convolutional Neural Networks (CNNs) have demonstrated superiority in learning patterns, but are sensitive to label noises and may overfit noisy labels during training. The early stopping strategy averts updating CNNs during the early training phase and is widely employed in the presence of noisy labels. Motivated by biological findings that the amplitude spectrum (AS) and phase spectrum (PS) in t…
▽ More
Convolutional Neural Networks (CNNs) have demonstrated superiority in learning patterns, but are sensitive to label noises and may overfit noisy labels during training. The early stopping strategy averts updating CNNs during the early training phase and is widely employed in the presence of noisy labels. Motivated by biological findings that the amplitude spectrum (AS) and phase spectrum (PS) in the frequency domain play different roles in the animal's vision system, we observe that PS, which captures more semantic information, can increase the robustness of DNNs to label noise, more so than AS can. We thus propose early stops at different times for AS and PS by disentangling the features of some layer(s) into AS and PS using Discrete Fourier Transform (DFT) during training. Our proposed Phase-AmplituDe DisentangLed Early Stopping (PADDLES) method is shown to be effective on both synthetic and real-world label-noise datasets. PADDLES outperforms other early stopping methods and obtains state-of-the-art performance.
△ Less
Submitted 7 December, 2022;
originally announced December 2022.
-
Enhancing Utility in the Watchdog Privacy Mechanism
Authors:
Mohammad Amin Zarrabian,
Ni Ding,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
This paper is concerned with enhancing data utility in the privacy watchdog method for attaining information-theoretic privacy. For a specific privacy constraint, the watchdog method filters out the high-risk data symbols through applying a uniform data regulation scheme, e.g., merging all high-risk symbols together. While this method entirely trades the symbols resolution off for privacy, we show…
▽ More
This paper is concerned with enhancing data utility in the privacy watchdog method for attaining information-theoretic privacy. For a specific privacy constraint, the watchdog method filters out the high-risk data symbols through applying a uniform data regulation scheme, e.g., merging all high-risk symbols together. While this method entirely trades the symbols resolution off for privacy, we show that the data utility can be greatly improved by partitioning the high-risk symbols set and individually privatizing each subset. We further propose an agglomerative merging algorithm that finds a suitable partition of high-risk symbols: it starts with a singleton high-risk symbol, which is iteratively fused with others until the resulting subsets are private.~Numerical simulations demonstrate the efficacy of this algorithm in privately achieving higher utilities in the watchdog scheme.
△ Less
Submitted 10 October, 2021;
originally announced October 2021.
-
Private Graph Data Release: A Survey
Authors:
Yang Li,
Michael Purcell,
Thierry Rakotoarivelo,
David Smith,
Thilina Ranbaduge,
Kee Siong Ng
Abstract:
The application of graph analytics to various domains has yielded tremendous societal and economical benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph data, especially in light of the many privacy breaches in real-world graph data that was supposed to preserve sensitive i…
▽ More
The application of graph analytics to various domains has yielded tremendous societal and economical benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph data, especially in light of the many privacy breaches in real-world graph data that was supposed to preserve sensitive information. This paper provides a comprehensive survey of private graph data release algorithms that seek to achieve the fine balance between privacy and utility, with a specific focus on provably private mechanisms. Many of these mechanisms are natural extensions of the Differential Privacy framework to graph data, but we also investigate more general privacy formulations like Pufferfish Privacy that address some of the limitations of Differential Privacy. We also provide a wide-ranging survey of the applications of private graph data release mechanisms to social networks, finance, supply chain, and health care. This survey paper and the taxonomy it provides should benefit practitioners and researchers alike in the increasingly important area of private analytics and data release.
△ Less
Submitted 4 June, 2022; v1 submitted 9 July, 2021;
originally announced July 2021.
-
Realistic Differentially-Private Transmission Power Flow Data Release
Authors:
David Smith,
Frederik Geth,
Elliott Vercoe,
Andrew Feutrill,
Ming Ding,
Jonathan Chan,
James Foster,
Thierry Rakotoarivelo
Abstract:
For the modeling, design and planning of future energy transmission networks, it is vital for stakeholders to access faithful and useful power flow data, while provably maintaining the privacy of business confidentiality of service providers. This critical challenge has recently been somewhat addressed in [1]. This paper significantly extends this existing work. First, we reduce the potential leak…
▽ More
For the modeling, design and planning of future energy transmission networks, it is vital for stakeholders to access faithful and useful power flow data, while provably maintaining the privacy of business confidentiality of service providers. This critical challenge has recently been somewhat addressed in [1]. This paper significantly extends this existing work. First, we reduce the potential leakage information by proposing a fundamentally different post-processing method, using public information of grid losses rather than power dispatch, which achieve a higher level of privacy protection. Second, we protect more sensitive parameters, i.e., branch shunt susceptance in addition to series impedance (complete pi-model). This protects power flow data for the transmission high-voltage networks, using differentially private transformations that maintain the optimal power flow consistent with, and faithful to, expected model behaviour. Third, we tested our approach at a larger scale than previous work, using the PGLib-OPF test cases [10]. This resulted in the successful obfuscation of up to a 4700-bus system, which can be successfully solved with faithfulness of parameters and good utility to data analysts. Our approach addresses a more feasible and realistic scenario, and provides higher than state-of-the-art privacy guarantees, while maintaining solvability, fidelity and feasibility of the system.
△ Less
Submitted 25 March, 2021;
originally announced March 2021.
-
On Properties and Optimization of Information-theoretic Privacy Watchdog
Authors:
Parastoo Sadeghi,
Ni Ding,
Thierry Rakotoarivelo
Abstract:
We study the problem of privacy preservation in data sharing, where $S$ is a sensitive variable to be protected and $X$ is a non-sensitive useful variable correlated with $S$. Variable $X$ is randomized into variable $Y$, which will be shared or released according to $p_{Y|X}(y|x)$. We measure privacy leakage by \emph{information privacy} (also known as \emph{log-lift} in the literature), which gu…
▽ More
We study the problem of privacy preservation in data sharing, where $S$ is a sensitive variable to be protected and $X$ is a non-sensitive useful variable correlated with $S$. Variable $X$ is randomized into variable $Y$, which will be shared or released according to $p_{Y|X}(y|x)$. We measure privacy leakage by \emph{information privacy} (also known as \emph{log-lift} in the literature), which guarantees mutual information privacy and differential privacy (DP). Let $\Xepsc \subseteq \X$ contain elements n the alphabet of $X$ for which the absolute value of log-lift (abs-log-lift for short) is greater than a desired threshold $\eps$. When elements $x\in \Xepsc$ are randomized into $y\in \Y$, we derive the best upper bound on the abs-log-lift across the resultant pairs $(s,y)$. We then prove that this bound is achievable via an \emph{$X$-invariant} randomization $p(y|x) = R(y)$ for $x,y\in\Xepsc$. However, the utility measured by the mutual information $I(X;Y)$ is severely damaged in imposing a strict upper bound $\eps$ on the abs-log-lift. To remedy this and inspired by the probabilistic ($\eps$, $δ$)-DP, we propose a relaxed ($\eps$, $δ$)-log-lift framework. To achieve this relaxation, we introduce a greedy algorithm which exempts some elements in $\Xepsc$ from randomization, as long as their abs-log-lift is bounded by $\eps$ with probability $1-δ$. Numerical results demonstrate efficacy of this algorithm in achieving a better privacy-utility tradeoff.
△ Less
Submitted 19 October, 2020;
originally announced October 2020.
-
Privacy-Utility Tradeoff in a Guessing Framework Inspired by Index Coding
Authors:
Yucheng Liu,
Ni Ding,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
This paper studies the tradeoff in privacy and utility in a single-trial multi-terminal guessing (estimation) framework using a system model that is inspired by index coding. There are $n$ independent discrete sources at a data curator. There are $m$ legitimate users and one adversary, each with some side information about the sources. The data curator broadcasts a distorted function of sources to…
▽ More
This paper studies the tradeoff in privacy and utility in a single-trial multi-terminal guessing (estimation) framework using a system model that is inspired by index coding. There are $n$ independent discrete sources at a data curator. There are $m$ legitimate users and one adversary, each with some side information about the sources. The data curator broadcasts a distorted function of sources to legitimate users, which is also overheard by the adversary. In terms of utility, each legitimate user wishes to perfectly reconstruct some of the unknown sources and attain a certain gain in the estimation correctness for the remaining unknown sources. In terms of privacy, the data curator wishes to minimize the maximal leakage: the worst-case guessing gain of the adversary in estimating any target function of its unknown sources after receiving the broadcast data. Given the system settings, we derive fundamental performance lower bounds on the maximal leakage to the adversary, which are inspired by the notion of confusion graph and performance bounds for the index coding problem. We also detail a greedy privacy enhancing mechanism, which is inspired by the agglomerative clustering algorithms in the information bottleneck and privacy funnel problems.
△ Less
Submitted 18 June, 2020; v1 submitted 19 January, 2020;
originally announced January 2020.
-
Part II: A Practical Approach for Successive Omniscience
Authors:
Ni Ding,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
In Part I, we studied the communication for omniscience (CO) problem and proposed a parametric (PAR) algorithm to determine the minimum sum-rate at which a set of users indexed by a finite set $V$ attain omniscience. The omniscience in CO refers to the status that each user in $V$ recovers the observations of a multiple random source. It is called the global omniscience in this paper in contrast t…
▽ More
In Part I, we studied the communication for omniscience (CO) problem and proposed a parametric (PAR) algorithm to determine the minimum sum-rate at which a set of users indexed by a finite set $V$ attain omniscience. The omniscience in CO refers to the status that each user in $V$ recovers the observations of a multiple random source. It is called the global omniscience in this paper in contrast to the study of the successive omniscience (SO), where the local omniscience is attained subsequently in user subsets. By inputting a lower bound on the minimum sum-rate for CO, we apply the PAR algorithm to search a complimentary subset $X_* \subsetneq V$ such that if the local omniscience in $X_*$ is reached first, the global omniscience whereafter can still be attained with the minimum sum-rate. We further utilize the outputs of the PAR algorithm to outline a multi-stage SO approach that is characterized by $K \leq |V| - 1$ complimentary subsets $X_*^{(k)}, \forall k \in \{1,\dotsc,K\}$ forming a nesting sequence $X_*^{(1)} \subsetneq \dotsc \subsetneq X_*^{(K)} = V$. Starting from stage $k = 1$, the local omniscience in $X_*^{(k)}$ is attained at each stage $k$ until the final global omniscience in $X_*^{(K)} = V$. A $|X_*{(k)}|$-dimensional local omniscience achievable rate vector is also derived for each stage $k$ designating individual users transmitting rates. The sum-rate of this rate vector in the last stage $K$ coincides with the minimized sum-rate for the global omniscience.
△ Less
Submitted 26 December, 2019;
originally announced December 2019.
-
Part I: Improving Computational Efficiency of Communication for Omniscience
Authors:
Ni Ding,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
Communication for omniscience (CO) refers to the problem where the users in a finite set $V$ observe a discrete multiple random source and want to exchange data over broadcast channels to reach omniscience, the state where everyone recovers the entire source. This paper studies how to improve the computational complexity for the problem of minimizing the sum-rate for attaining omniscience in $V$.…
▽ More
Communication for omniscience (CO) refers to the problem where the users in a finite set $V$ observe a discrete multiple random source and want to exchange data over broadcast channels to reach omniscience, the state where everyone recovers the entire source. This paper studies how to improve the computational complexity for the problem of minimizing the sum-rate for attaining omniscience in $V$. While the existing algorithms rely on the submodular function minimization (SFM) techniques and complete in $O(|V|^2 \cdot \text{SFM}(|V|)$ time, we prove the strict strong map property of the nesting SFM problem. We propose a parametric (PAR) algorithm that utilizes the parametric SFM techniques and reduce the the complexity to $O(|V| \cdot \text{SFM}(|V|)$. The output of the PAR algorithm is in fact the segmented Dilworth truncation of the residual entropy for all minimum sum-rate estimates $α$, which characterizes the principal sequence of partitions (PSP) and solves some related problems: It not only determines the secret capacity, a dual problem to CO, and the network strength of a graph, but also outlines the hierarchical solution to a combinatorial clustering problem. \end{abstract}
△ Less
Submitted 18 December, 2020; v1 submitted 26 December, 2019;
originally announced December 2019.
-
Improving Computational Efficiency of Communication for Omniscience and Successive Omniscience
Authors:
Ni Ding,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
For a group of users in $V$ where everyone observes a component of a discrete multiple random source, the process that users exchange data so as to reach omniscience, the state where everyone recovers the entire source, is called communication for omniscience (CO). We first consider how to improve the existing complexity $O(|V|^2 \cdot \text{SFM}(|V|)$ of minimizing the sum of communication rates…
▽ More
For a group of users in $V$ where everyone observes a component of a discrete multiple random source, the process that users exchange data so as to reach omniscience, the state where everyone recovers the entire source, is called communication for omniscience (CO). We first consider how to improve the existing complexity $O(|V|^2 \cdot \text{SFM}(|V|)$ of minimizing the sum of communication rates in CO, where $\text{SFM}(|V|)$ denotes the complexity of minimizing a submodular function. We reveal some structured property in an existing coordinate saturation algorithm: the resulting rate vector and the corresponding partition of $V$ are segmented in $α$, the estimation of the minimum sum-rate. A parametric (PAR) algorithm is then proposed where, instead of a particular $α$, we search the critical points that fully determine the segmented variables for all $α$ so that they converge to the solution to the minimum sum-rate problem and the overall complexity reduces to $O(|V| \cdot \text{SFM}(|V|))$.
For the successive omniscience (SO), we consider how to attain local omniscience in some complimentary user subset so that the overall sum-rate for the global omniscience still remains minimum. While the existing algorithm only determines a complimentary user subset in $O(|V| \cdot \text{SFM}(|V|))$ time, we show that, if a lower bound on the minimum sum-rate is applied to the segmented variables in the PAR algorithm, not only a complimentary subset, but also an optimal rate vector for attaining the local omniscience in it are returned in $O(|V| \cdot \text{SFM}(|V|))$ time.
△ Less
Submitted 3 March, 2019;
originally announced March 2019.
-
Attaining Fairness in Communication for Omniscience
Authors:
Ni Ding,
Parastoo Sadeghi,
David Smith,
Thierry Rakotoarivelo
Abstract:
This paper studies how to attain fairness in communication for omniscience, where a set of users exchange their observations of a discrete multiple random source to attain omniscience---the state that all users recover the entire source. The optimal rate region containing all source coding rate vectors that achieve the omniscience with the minimum sum rate is shown to coincide with the core (the s…
▽ More
This paper studies how to attain fairness in communication for omniscience, where a set of users exchange their observations of a discrete multiple random source to attain omniscience---the state that all users recover the entire source. The optimal rate region containing all source coding rate vectors that achieve the omniscience with the minimum sum rate is shown to coincide with the core (the solution set) of a coalitional game. Two game-theoretic fairness solutions are studied: the Shapley value and the egalitarian solution. It is shown that the Shapley value assigns each user the source coding rate measured by his/her remaining information of the multiple source given the common randomness that is shared by all users, while the egalitarian solution simply distributes the rates as evenly as possible in the core. To avoid the exponentially growing complexity of obtaining the Shapley value, a polynomial-time approximation method is proposed by utilizing the fact that the Shapley value is the mean value over all extreme points in the core. In addition, a steepest descent algorithm is proposed which converges in polynomial time to the fractional egalitarian solution in the core that can be implemented by network coding schemes. Finally, it is shown that the game can be decomposed into subgames so that both the Shapley value and the egalitarian solution can be obtained within each subgame in a distributed manner with reduced complexity.
△ Less
Submitted 10 February, 2019;
originally announced February 2019.
-
Differentially Private Release of High-Dimensional Datasets using the Gaussian Copula
Authors:
Hassan Jameel Asghar,
Ming Ding,
Thierry Rakotoarivelo,
Sirine Mrabet,
Mohamed Ali Kaafar
Abstract:
We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample syn…
▽ More
We propose a generic mechanism to efficiently release differentially private synthetic versions of high-dimensional datasets with high utility. The core technique in our mechanism is the use of copulas. Specifically, we use the Gaussian copula to define dependencies of attributes in the input dataset, whose rows are modelled as samples from an unknown multivariate distribution, and then sample synthetic records through this copula. Despite the inherently numerical nature of Gaussian correlations we construct a method that is applicable to both numerical and categorical attributes alike. Our mechanism is efficient in that it only takes time proportional to the square of the number of attributes in the dataset. We propose a differentially private way of constructing the Gaussian copula without compromising computational efficiency. Through experiments on three real-world datasets, we show that we can obtain highly accurate answers to the set of all one-way marginal, and two-and three-way positive conjunction queries, with 99\% of the query answers having absolute (fractional) error rates between 0.01 to 3\%. Furthermore, for a majority of two-way and three-way queries, we outperform independent noise addition through the well-known Laplace mechanism. In terms of computational time we demonstrate that our mechanism can output synthetic datasets in around 6 minutes 47 seconds on average with an input dataset of about 200 binary attributes and more than 32,000 rows, and about 2 hours 30 mins to execute a much larger dataset of about 700 binary attributes and more than 5 million rows. To further demonstrate scalability, we ran the mechanism on larger (artificial) datasets with 1,000 and 2,000 binary attributes (and 5 million rows) obtaining synthetic outputs in approximately 6 and 19 hours, respectively.
△ Less
Submitted 4 February, 2019;
originally announced February 2019.
-
Fairness in Multiterminal Data Compression: A Splitting Method for The Egalitarian Solution
Authors:
Ni Ding,
David Smith,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
This paper proposes a novel splitting (SPLIT) algorithm to achieve fairness in the multiterminal lossless data compression problem. It finds the egalitarian solution in the Slepian-Wolf region and completes in strongly polynomial time. We show that the SPLIT algorithm adaptively updates the source coding rates to the optimal solution, while recursively splitting the terminal set, enabling parallel…
▽ More
This paper proposes a novel splitting (SPLIT) algorithm to achieve fairness in the multiterminal lossless data compression problem. It finds the egalitarian solution in the Slepian-Wolf region and completes in strongly polynomial time. We show that the SPLIT algorithm adaptively updates the source coding rates to the optimal solution, while recursively splitting the terminal set, enabling parallel and distributed computation. The result of an experiment demonstrates a significant reduction in computation time by the parallel implementation when the number of terminals becomes large. The achieved egalitarian solution is also shown to be superior to the Shapley value in distributed networks, e.g., wireless sensor networks, in that it best balances the nodes' energy consumption and is far less computationally complex to obtain.
△ Less
Submitted 3 May, 2018;
originally announced May 2018.
-
Fairness in Multiterminal Data Compression: Decomposition of Shapley Value
Authors:
Ni Ding,
David Smith,
Parastoo Sadeghi,
Thierry Rakotoarivelo
Abstract:
We consider the problem of how to determine a fair source coding rate allocation method for the lossless data compression problem in multiterminal networks, e.g, the wireless sensor network where there are a large number of sources to be encoded. We model this problem by a game-theoretic approach and present a decomposition method for obtaining the Shapley value, a fair source coding rate vector i…
▽ More
We consider the problem of how to determine a fair source coding rate allocation method for the lossless data compression problem in multiterminal networks, e.g, the wireless sensor network where there are a large number of sources to be encoded. We model this problem by a game-theoretic approach and present a decomposition method for obtaining the Shapley value, a fair source coding rate vector in the Slepian-Wolf achievable region. We formulate a coalitional game model where the entropy function quantifies the cost incurred due to the source coding rates in each coalition. In the typical case for which the game is decomposable, we show that the Shapley value can be obtained separately for each subgame. The complexity of this decomposition method is determined by the maximum size of subgames, which is strictly smaller than the total number of sources and contributes to a considerable reduction in computational complexity. Experiments demonstrate large complexity reduction when the number of sources becomes large.
△ Less
Submitted 5 April, 2018;
originally announced April 2018.
-
Distributed Data Compression in Sensor Clusters: A Maximum Independent Flow Approach
Authors:
Ni Ding,
Parastoo Sadeghi,
David Smith,
Thierry Rakotoarivelo
Abstract:
Let a cluster (network) of sensors be connected by the communication links, each link having a capacity upper bound. Each sensor observes a discrete random variable in private and one sensor serves as a cluster header or sink. Here, we formulate the problem of how to let the sensors encode their observations such that the direction of compressed data is a feasible flow towards the sink. We demonst…
▽ More
Let a cluster (network) of sensors be connected by the communication links, each link having a capacity upper bound. Each sensor observes a discrete random variable in private and one sensor serves as a cluster header or sink. Here, we formulate the problem of how to let the sensors encode their observations such that the direction of compressed data is a feasible flow towards the sink. We demonstrate that this problem can be solved by an existing maximum independent flow (MIF) algorithm in polynomial time. Further, we reveal that this algorithm in fact determines an optimal solution by recursively pushing the remaining randomness in the sources via unsaturated communication links towards the sink. We then show that the MIF algorithm can be implemented in a distributed manner. For those networks with integral communication capacities, we propose an integral MIF algorithm which completes much faster than MIF. Finally, we point out that the nature of the data compression problem in a sensor cluster is to seek the maximum independent information flow in the intersection of two submodular polyhedra, which can be further utilized to improve the MIF algorithm in the future.
△ Less
Submitted 5 April, 2018;
originally announced April 2018.
-
Repeatable Experiments with LabWiki
Authors:
Thierry Rakotoarivelo,
Guillaume Jourjon,
Olivier Mehani,
Maximilian Ott,
Mike Zink
Abstract:
The ability to repeat the experiments from a research study and obtain similar results is a corner stone in experiment-based scientific discovery. This essential feature has been often ignored by the distributed computing and networking community. There are many reasons for that, such as the complexity of provisioning, configuring, and orchestrating the resources used by experiments, their multipl…
▽ More
The ability to repeat the experiments from a research study and obtain similar results is a corner stone in experiment-based scientific discovery. This essential feature has been often ignored by the distributed computing and networking community. There are many reasons for that, such as the complexity of provisioning, configuring, and orchestrating the resources used by experiments, their multiple external dependencies, and the difficulty to seamlessly record these dependencies. This paper describes a methodology based on well-established principles to plan, prepare and execute experiments. We propose and describe a family of tools, the LabWiki workspace, to support an experimenter's workflow based on that methodology. This proposed workspace provides services and mechanisms for each step of an experiment-based study, while automatically capturing the necessary information to allow others to repeat, inspect, validate and modify prior experiments. Our LabWiki workspace builds on existing contributions, and de-facto protocol and model standards, which emerged from recent experimental facility initiatives. We use a real experiment as a thread to guide and illustrate the discussion throughout this paper.
△ Less
Submitted 7 October, 2014;
originally announced October 2014.
-
A Method for the Characterisation of Observer Effects and its Application to OML
Authors:
Olivier Mehani,
Guillaume Jourjon,
Thierry Rakotoarivelo
Abstract:
In all measurement campaigns, one needs to assert that the instrumentation tools do not significantly impact the system being monitored. This is critical to future claims based on the collected data and is sometimes overseen in experimental studies. We propose a method to evaluate the potential "observer effect" of an instrumentation system, and apply it to the OMF Measurement Library (OML). OML a…
▽ More
In all measurement campaigns, one needs to assert that the instrumentation tools do not significantly impact the system being monitored. This is critical to future claims based on the collected data and is sometimes overseen in experimental studies. We propose a method to evaluate the potential "observer effect" of an instrumentation system, and apply it to the OMF Measurement Library (OML). OML allows the instrumentation of almost any software to collect any type of measurements. As it is increasingly being used in networking research, it is important to characterise possible biases it may introduce in the collected metrics. Thus, we study its effect on multiple types of reports from various applications commonly used in wireless research. To this end, we designed experiments comparing OML-instrumented software with their original flavours. Our analyses of the results from these experiments show that, with an appropriate reporting setup, OML has no significant impact on the instrumented applications, and may even improve some of their performances in specifics cases. We discuss our methodology and the implication of using OML, and provide guidelines on instrumenting off-the-shelf software.
△ Less
Submitted 17 May, 2012;
originally announced May 2012.