-
Measure-Observe-Remeasure: An Interactive Paradigm for Differentially-Private Exploratory Analysis
Authors:
Priyanka Nanayakkara,
Hyeok Kim,
Yifan Wu,
Ali Sarvghad,
Narges Mahyar,
Gerome Miklau,
Jessica Hullman
Abstract:
Differential privacy (DP) has the potential to enable privacy-preserving analysis on sensitive data, but requires analysts to judiciously spend a limited ``privacy loss budget'' $ε$ across queries. Analysts conducting exploratory analyses do not, however, know all queries in advance and seldom have DP expertise. Thus, they are limited in their ability to specify $ε$ allotments across queries prior…
▽ More
Differential privacy (DP) has the potential to enable privacy-preserving analysis on sensitive data, but requires analysts to judiciously spend a limited ``privacy loss budget'' $ε$ across queries. Analysts conducting exploratory analyses do not, however, know all queries in advance and seldom have DP expertise. Thus, they are limited in their ability to specify $ε$ allotments across queries prior to an analysis. To support analysts in spending $ε$ efficiently, we propose a new interactive analysis paradigm, Measure-Observe-Remeasure, where analysts ``measure'' the database with a limited amount of $ε$, observe estimates and their errors, and remeasure with more $ε$ as needed.
We instantiate the paradigm in an interactive visualization interface which allows analysts to spend increasing amounts of $ε$ under a total budget. To observe how analysts interact with the Measure-Observe-Remeasure paradigm via the interface, we conduct a user study that compares the utility of $ε$ allocations and findings from sensitive data participants make to the allocations and findings expected of a rational agent who faces the same decision task. We find that participants are able to use the workflow relatively successfully, including using budget allocation strategies that maximize over half of the available utility stemming from $ε$ allocation. Their loss in performance relative to a rational agent appears to be driven more by their inability to access information and report it than to allocate $ε$.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Joint Selection: Adaptively Incorporating Public Information for Private Synthetic Data
Authors:
Miguel Fuentes,
Brett Mullins,
Ryan McKenna,
Gerome Miklau,
Daniel Sheldon
Abstract:
Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable w…
▽ More
Mechanisms for generating differentially private synthetic data based on marginals and graphical models have been successful in a wide range of settings. However, one limitation of these methods is their inability to incorporate public data. Initializing a data generating model by pre-training on public data has shown to improve the quality of synthetic data, but this technique is not applicable when model structure is not determined a priori. We develop the mechanism jam-pgm, which expands the adaptive measurements framework to jointly select between measuring public data and private data. This technique allows for public data to be included in a graphical-model-based mechanism. We show that jam-pgm is able to outperform both publicly assisted and non publicly assisted synthetic data generation mechanisms even when the public data distribution is biased.
△ Less
Submitted 12 March, 2024;
originally announced March 2024.
-
Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy
Authors:
Skye Berghel,
Philip Bohannon,
Damien Desfontaines,
Charles Estes,
Sam Haney,
Luke Hartman,
Michael Hay,
Ashwin Machanavajjhala,
Tom Magerlein,
Gerome Miklau,
Amritha Pai,
William Sexton,
Ruchit Shrestha
Abstract:
In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service.
In this short paper, we outline the design of Tumult Analytics, a Python framework for differential privacy used at institutions such as the U.S. Census Bureau, the Wikimedia Foundation, or the Internal Revenue Service.
△ Less
Submitted 8 December, 2022;
originally announced December 2022.
-
AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data
Authors:
Ryan McKenna,
Brett Mullins,
Daniel Sheldon,
Gerome Miklau
Abstract:
We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first selects a set of queries, then privately measures those queries, and finally generates synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting b…
▽ More
We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first selects a set of queries, then privately measures those queries, and finally generates synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions to bound per-query error with high probability which can be used to construct confidence intervals and inform users about the accuracy of generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across a variety of experimental settings.
△ Less
Submitted 12 June, 2024; v1 submitted 29 January, 2022;
originally announced January 2022.
-
Benchmarking Differentially Private Synthetic Data Generation Algorithms
Authors:
Yuchao Tao,
Ryan McKenna,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau
Abstract:
This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluat…
▽ More
This work presents a systematic benchmark of differentially private synthetic data generation algorithms that can generate tabular data. Utility of the synthetic data is evaluated by measuring whether the synthetic data preserve the distribution of individual and pairs of attributes, pairwise correlation as well as on the accuracy of an ML classification model. In a comprehensive empirical evaluation we identify the top performing algorithms and those that consistently fail to beat baseline approaches.
△ Less
Submitted 15 February, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
Relaxed Marginal Consistency for Differentially Private Query Answering
Authors:
Ryan McKenna,
Siddhant Pradhan,
Daniel Sheldon,
Gerome Miklau
Abstract:
Many differentially private algorithms for answering database queries involve a step that reconstructs a discrete data distribution from noisy measurements. This provides consistent query answers and reduces error, but often requires space that grows exponentially with dimension. Private-PGM is a recent approach that uses graphical models to represent the data distribution, with complexity proport…
▽ More
Many differentially private algorithms for answering database queries involve a step that reconstructs a discrete data distribution from noisy measurements. This provides consistent query answers and reduces error, but often requires space that grows exponentially with dimension. Private-PGM is a recent approach that uses graphical models to represent the data distribution, with complexity proportional to that of exact marginal inference in a graphical model with structure determined by the co-occurrence of variables in the noisy measurements. Private-PGM is highly scalable for sparse measurements, but may fail to run in high dimensions with dense measurements. We overcome the main scalability limitation of Private-PGM through a principled approach that relaxes consistency constraints in the estimation objective. Our new approach works with many existing private query answering algorithms and improves scalability or accuracy with no privacy cost.
△ Less
Submitted 25 October, 2021; v1 submitted 13 September, 2021;
originally announced September 2021.
-
Winning the NIST Contest: A scalable and general approach to differentially private synthetic data
Authors:
Ryan McKenna,
Gerome Miklau,
Daniel Sheldon
Abstract:
We propose a general approach for differentially private synthetic data generation, that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a hi…
▽ More
We propose a general approach for differentially private synthetic data generation, that consists of three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise addition mechanism, and (3) generate synthetic data that preserves the measured marginals well. Central to this approach is Private-PGM, a post-processing method that is used to estimate a high-dimensional data distribution from noisy measurements of its marginals. We present two mechanisms, NIST-MST and MST, that are instances of this general approach. NIST-MST was the winning mechanism in the 2018 NIST differential privacy synthetic data competition, and MST is a new mechanism that can work in more general settings, while still performing comparably to NIST-MST. We believe our general approach should be of broad interest, and can be adopted in future mechanisms for synthetic data generation.
△ Less
Submitted 10 August, 2021;
originally announced August 2021.
-
Differentially Private Algorithms for 2020 Census Detailed DHC Race \& Ethnicity
Authors:
Sam Haney,
William Sexton,
Ashwin Machanavajjhala,
Michael Hay,
Gerome Miklau
Abstract:
This article describes a proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribe…
▽ More
This article describes a proposed differentially private (DP) algorithms that the US Census Bureau is considering to release the Detailed Demographic and Housing Characteristics (DHC) Race & Ethnicity tabulations as part of the 2020 Census. The tabulations contain statistics (counts) of demographic and housing characteristics of the entire population of the US crossed with detailed races and tribes at varying levels of geography. We describe two differentially private algorithmic strategies, one based on adding noise drawn from a two-sided Geometric distribution that satisfies "pure"-DP, and another based on adding noise from a Discrete Gaussian distribution that satisfied a well studied variant of differential privacy, called Zero Concentrated Differential Privacy (zCDP). We analytically estimate the privacy loss parameters ensured by the two algorithms for comparable levels of error introduced in the statistics.
△ Less
Submitted 22 July, 2021;
originally announced July 2021.
-
HDMM: Optimizing error of high-dimensional statistical queries under differential privacy
Authors:
Ryan McKenna,
Gerome Miklau,
Michael Hay,
Ashwin Machanavajjhala
Abstract:
In this work we describe the High-Dimensional Matrix Mechanism (HDMM), a differentially private algorithm for answering a workload of predicate counting queries. HDMM represents query workloads using a compact implicit matrix representation and exploits this representation to efficiently optimize over (a subset of) the space of differentially private algorithms for one that is unbiased and answers…
▽ More
In this work we describe the High-Dimensional Matrix Mechanism (HDMM), a differentially private algorithm for answering a workload of predicate counting queries. HDMM represents query workloads using a compact implicit matrix representation and exploits this representation to efficiently optimize over (a subset of) the space of differentially private algorithms for one that is unbiased and answers the input query workload with low expected error. HDMM can be deployed for both $ε$-differential privacy (with Laplace noise) and $(ε, δ)$-differential privacy (with Gaussian noise), although the core techniques are slightly different for each. We demonstrate empirically that HDMM can efficiently answer queries with lower expected error than state-of-the-art techniques, and in some cases, it nearly matches existing lower bounds for the particular class of mechanisms we consider.
△ Less
Submitted 22 June, 2021;
originally announced June 2021.
-
A workload-adaptive mechanism for linear queries under local differential privacy
Authors:
Ryan McKenna,
Raj Kumar Maity,
Arya Mazumdar,
Gerome Miklau
Abstract:
We propose a new mechanism to accurately answer a user-provided set of linear counting queries under local differential privacy (LDP). Given a set of linear counting queries (the workload) our mechanism automatically adapts to provide accuracy on the workload queries. We define a parametric class of mechanisms that produce unbiased estimates of the workload, and formulate a constrained optimizatio…
▽ More
We propose a new mechanism to accurately answer a user-provided set of linear counting queries under local differential privacy (LDP). Given a set of linear counting queries (the workload) our mechanism automatically adapts to provide accuracy on the workload queries. We define a parametric class of mechanisms that produce unbiased estimates of the workload, and formulate a constrained optimization problem to select a mechanism from this class that minimizes expected total squared error. We solve this optimization problem numerically using projected gradient descent and provide an efficient implementation that scales to large workloads. We demonstrate the effectiveness of our optimization-based approach in a wide variety of settings, showing that it outperforms many competitors, even outperforming existing mechanisms on the workloads for which they were intended.
△ Less
Submitted 18 May, 2020; v1 submitted 4 February, 2020;
originally announced February 2020.
-
Fair Decision Making using Privacy-Protected Data
Authors:
Satya Kuppam,
Ryan Mckenna,
David Pujol,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau
Abstract:
Data collected about individuals is regularly used to make decisions that impact those same individuals. We consider settings where sensitive personal data is used to decide who will receive resources or benefits. While it is well known that there is a tradeoff between protecting privacy and the accuracy of decisions, we initiate a first-of-its-kind study into the impact of formally private mechan…
▽ More
Data collected about individuals is regularly used to make decisions that impact those same individuals. We consider settings where sensitive personal data is used to decide who will receive resources or benefits. While it is well known that there is a tradeoff between protecting privacy and the accuracy of decisions, we initiate a first-of-its-kind study into the impact of formally private mechanisms (based on differential privacy) on fair and equitable decision-making. We empirically investigate novel tradeoffs on two real-world decisions made using U.S. Census data (allocation of federal funds and assignment of voting rights benefits) as well as a classic apportionment problem. Our results show that if decisions are made using an $ε$-differentially private version of the data, under strict privacy constraints (smaller $ε$), the noise added to achieve privacy may disproportionately impact some groups over others. We propose novel measures of fairness in the context of randomized differentially private algorithms and identify a range of causes of outcome disparities.
△ Less
Submitted 24 January, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Privacy Preserving Off-Policy Evaluation
Authors:
Tengyang Xie,
Philip S. Thomas,
Gerome Miklau
Abstract:
Many reinforcement learning applications involve the use of data that is sensitive, such as medical records of patients or financial information. However, most current reinforcement learning methods can leak information contained within the (possibly sensitive) data on which they are trained. To address this problem, we present the first differentially private approach for off-policy evaluation. W…
▽ More
Many reinforcement learning applications involve the use of data that is sensitive, such as medical records of patients or financial information. However, most current reinforcement learning methods can leak information contained within the (possibly sensitive) data on which they are trained. To address this problem, we present the first differentially private approach for off-policy evaluation. We provide a theoretical analysis of the privacy-preserving properties of our algorithm and analyze its utility (speed of convergence). After describing some results of this theoretical analysis, we show empirically that our method outperforms previous methods (which are restricted to the on-policy setting).
△ Less
Submitted 31 January, 2019;
originally announced February 2019.
-
Graphical-model based estimation and inference for differential privacy
Authors:
Ryan McKenna,
Daniel Sheldon,
Gerome Miklau
Abstract:
Many privacy mechanisms reveal high-level information about a data distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-d…
▽ More
Many privacy mechanisms reveal high-level information about a data distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms.
△ Less
Submitted 25 January, 2019;
originally announced January 2019.
-
Ektelo: A Framework for Defining Differentially-Private Computations
Authors:
Dan Zhang,
Ryan McKenna,
Ios Kotsogiannis,
George Bissias,
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau
Abstract:
The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming…
▽ More
The adoption of differential privacy is growing but the complexity of designing private, efficient and accurate algorithms is still high. We propose a novel programming framework and system, Ektelo, for implementing both existing and new privacy algorithms. For the task of answering linear counting queries, we show that nearly all existing algorithms can be composed from operators, each conforming to one of a small number of operator classes. While past programming frameworks have helped to ensure the privacy of programs, the novelty of our framework is its significant support for authoring accurate and efficient (as well as private) programs.
After describing the design and architecture of the Ektelo system, we show that Ektelo is expressive, allows for safer implementations through code reuse, and that it allows both privacy novices and experts to easily design algorithms. We demonstrate the use of Ektelo by designing several new state-of-the-art algorithms.
△ Less
Submitted 24 May, 2019; v1 submitted 10 August, 2018;
originally announced August 2018.
-
Optimizing error of high-dimensional statistical queries under differential privacy
Authors:
Ryan McKenna,
Gerome Miklau,
Michael Hay,
Ashwin Machanavajjhala
Abstract:
Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especial…
▽ More
Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets.
△ Less
Submitted 10 August, 2018;
originally announced August 2018.
-
On Obtaining Stable Rankings
Authors:
Abolfazl Asudeh,
H. V. Jagadish,
Gerome Miklau,
Julia Stoyanovich
Abstract:
Decision making is challenging when there is more than one criterion to consider. In such cases, it is common to assign a goodness score to each item as a weighted sum of its attribute values and rank them accordingly. Clearly, the ranking obtained depends on the weights used for this summation. Ideally, one would want the ranked order not to change if the weights are changed slightly. We call thi…
▽ More
Decision making is challenging when there is more than one criterion to consider. In such cases, it is common to assign a goodness score to each item as a weighted sum of its attribute values and rank them accordingly. Clearly, the ranking obtained depends on the weights used for this summation. Ideally, one would want the ranked order not to change if the weights are changed slightly. We call this property {\em stability} of the ranking. A consumer of a ranked list may trust the ranking more if it has high stability. A producer of a ranked list prefers to choose weights that result in a stable ranking, both to earn the trust of potential consumers and because a stable ranking is intrinsically likely to be more meaningful. In this paper, we develop a framework that can be used to assess the stability of a provided ranking and to obtain a stable ranking within an "acceptable" range of weight values (called "the region of interest"). We address the case where the user cares about the rank order of the entire set of items, and also the case where the user cares only about the top-$k$ items. Using a geometric interpretation, we propose algorithms that produce stable rankings. In addition to theoretical analyses, we conduct extensive experiments on real datasets that validate our proposal.
△ Less
Submitted 18 December, 2018; v1 submitted 29 April, 2018;
originally announced April 2018.
-
A Nutritional Label for Rankings
Authors:
Ke Yang,
Julia Stoyanovich,
Abolfazl Asudeh,
Bill Howe,
HV Jagadish,
Gerome Miklau
Abstract:
Algorithmic decisions often result in scoring and ranking individuals to determine credit worthiness, qualifications for college admissions and employment, and compatibility as dating partners. While automatic and seemingly objective, ranking algorithms can discriminate against individuals and protected groups, and exhibit low diversity. Furthermore, ranked results are often unstable --- small cha…
▽ More
Algorithmic decisions often result in scoring and ranking individuals to determine credit worthiness, qualifications for college admissions and employment, and compatibility as dating partners. While automatic and seemingly objective, ranking algorithms can discriminate against individuals and protected groups, and exhibit low diversity. Furthermore, ranked results are often unstable --- small changes in the input data or in the ranking methodology may lead to drastic changes in the output, making the result uninformative and easy to manipulate. Similar concerns apply in cases where items other than individuals are ranked, including colleges, academic departments, or products.
In this demonstration we present Ranking Facts, a Web-based application that generates a "nutritional label" for rankings. Ranking Facts is made up of a collection of visual widgets that implement our latest research results on fairness, stability, and transparency for rankings, and that communicate details of the ranking methodology, or of the output, to the end user. We will showcase Ranking Facts on real datasets from different domains, including college rankings, criminal risk assessment, and financial services.
△ Less
Submitted 21 April, 2018;
originally announced April 2018.
-
Differentially Private Learning of Undirected Graphical Models using Collective Graphical Models
Authors:
Garrett Bernstein,
Ryan McKenna,
Tao Sun,
Daniel Sheldon,
Michael Hay,
Gerome Miklau
Abstract:
We investigate the problem of learning discrete, undirected graphical models in a differentially private way. We show that the approach of releasing noisy sufficient statistics using the Laplace mechanism achieves a good trade-off between privacy, utility, and practicality. A naive learning algorithm that uses the noisy sufficient statistics "as is" outperforms general-purpose differentially priva…
▽ More
We investigate the problem of learning discrete, undirected graphical models in a differentially private way. We show that the approach of releasing noisy sufficient statistics using the Laplace mechanism achieves a good trade-off between privacy, utility, and practicality. A naive learning algorithm that uses the noisy sufficient statistics "as is" outperforms general-purpose differentially private learning algorithms. However, it has three limitations: it ignores knowledge about the data generating process, rests on uncertain theoretical foundations, and exhibits certain pathologies. We develop a more principled approach that applies the formalism of collective graphical models to perform inference over the true sufficient statistics within an expectation-maximization framework. We show that this learns better models than competing approaches on both synthetic data and on real human mobility data used as a case study.
△ Less
Submitted 14 June, 2017;
originally announced June 2017.
-
A Consumer-Centric Market for Database Computation in the Cloud
Authors:
Yue Wang,
Alexandra Meliou,
Gerome Miklau
Abstract:
The availability of public computing resources in the cloud has revolutionized data analysis, but requesting cloud resources often involves complex decisions for consumers. Under the current pricing mechanisms, cloud service providers offer several service options and charge consumers based on the resources they use. Before they can decide which cloud resources to request, consumers have to estima…
▽ More
The availability of public computing resources in the cloud has revolutionized data analysis, but requesting cloud resources often involves complex decisions for consumers. Under the current pricing mechanisms, cloud service providers offer several service options and charge consumers based on the resources they use. Before they can decide which cloud resources to request, consumers have to estimate the completion time and cost of their computational tasks for different service options and possibly for different service providers. This estimation is challenging even for expert cloud users. We propose a new market-based framework for pricing computational tasks in the cloud. Our framework introduces an agent between consumers and cloud providers. The agent takes data and computational tasks from users, estimates time and cost for evaluating the tasks, and returns to consumers contracts that specify the price and completion time. Our framework can be applied directly to existing cloud markets without altering the way cloud providers offer and price services. In addition, it simplifies cloud use for consumers by allowing them to compare contracts, rather than choose resources directly. We present design, analytical, and algorithmic contributions focusing on pricing computation contracts, analyzing their properties, and optimizing them in complex workflows. We conduct an experimental evaluation of our market framework over a real-world cloud service and demonstrate empirically that our market ensures three key properties: competitiveness, fairness, and resilience. Finally, we present a fine-grained pricing mechanism for complex workflows and show that it can increase agent profits by more than an order of magnitude in some cases.
△ Less
Submitted 16 June, 2017; v1 submitted 7 September, 2016;
originally announced September 2016.
-
Principled Evaluation of Differentially Private Algorithms using DPBench
Authors:
Michael Hay,
Ashwin Machanavajjhala,
Gerome Miklau,
Yan Chen,
Dan Zhang
Abstract:
Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers ma…
▽ More
Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is {\em data dependent}, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important.
In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.
△ Less
Submitted 15 December, 2015;
originally announced December 2015.
-
A Data- and Workload-Aware Algorithm for Range Queries Under Differential Privacy
Authors:
Chao Li,
Michael Hay,
Gerome Miklau,
Yue Wang
Abstract:
We describe a new algorithm for answering a given set of range queries under $ε$-differential privacy which often achieves substantially lower error than competing methods. Our algorithm satisfies differential privacy by adding noise that is adapted to the input data and to the given query set. We first privately learn a partitioning of the domain into buckets that suit the input data well. Then w…
▽ More
We describe a new algorithm for answering a given set of range queries under $ε$-differential privacy which often achieves substantially lower error than competing methods. Our algorithm satisfies differential privacy by adding noise that is adapted to the input data and to the given query set. We first privately learn a partitioning of the domain into buckets that suit the input data well. Then we privately estimate counts for each bucket, doing so in a manner well-suited for the given query set. Since the performance of the algorithm depends on the input database, we evaluate it on a wide range of real datasets, showing that we can achieve the benefits of data-dependence on both "easy" and "hard" databases.
△ Less
Submitted 1 October, 2014;
originally announced October 2014.
-
Introducing Access Control in Webdamlog
Authors:
Serge Abiteboul,
Émilien Antoine,
Gerome Miklau,
Julia Stoyanovich,
Vera Zaychik Moffitt
Abstract:
We survey recent work on the specification of an access control mechanism in a collaborative environment. The work is presented in the context of the WebdamLog language, an extension of datalog to a distributed context. We discuss a fine-grained access control mechanism for intentional data based on provenance as well as a control mechanism for delegation, i.e., for deploying rules at remote peers…
▽ More
We survey recent work on the specification of an access control mechanism in a collaborative environment. The work is presented in the context of the WebdamLog language, an extension of datalog to a distributed context. We discuss a fine-grained access control mechanism for intentional data based on provenance as well as a control mechanism for delegation, i.e., for deploying rules at remote peers.
△ Less
Submitted 31 July, 2013;
originally announced July 2013.
-
Rule-Based Application Development using Webdamlog
Authors:
Serge Abiteboul,
Émilien Antoine,
Gerome Miklau,
Julia Stoyanovich,
Jules Testard
Abstract:
We present the WebdamLog system for managing distributed data on the Web in a peer-to-peer manner. We demonstrate the main features of the system through an application called Wepic for sharing pictures between attendees of the sigmod conference. Using Wepic, the attendees will be able to share, download, rate and annotate pictures in a highly decentralized manner. We show how WebdamLog handles he…
▽ More
We present the WebdamLog system for managing distributed data on the Web in a peer-to-peer manner. We demonstrate the main features of the system through an application called Wepic for sharing pictures between attendees of the sigmod conference. Using Wepic, the attendees will be able to share, download, rate and annotate pictures in a highly decentralized manner. We show how WebdamLog handles heterogeneity of the devices and services used to share data in such a Web setting. We exhibit the simple rules that define the Wepic application and show how to easily modify the Wepic application.
△ Less
Submitted 14 May, 2013;
originally announced May 2013.
-
A Theory of Pricing Private Data
Authors:
Chao Li,
Daniel Yang Li,
Gerome Miklau,
Dan Suciu
Abstract:
Personal data has value to both its owner and to institutions who would like to analyze it. Privacy mechanisms protect the owner's data while releasing to analysts noisy versions of aggregate query results. But such strict protections of individual's data have not yet found wide use in practice. Instead, Internet companies, for example, commonly provide free services in return for valuable sensiti…
▽ More
Personal data has value to both its owner and to institutions who would like to analyze it. Privacy mechanisms protect the owner's data while releasing to analysts noisy versions of aggregate query results. But such strict protections of individual's data have not yet found wide use in practice. Instead, Internet companies, for example, commonly provide free services in return for valuable sensitive information from users, which they exploit and sometimes sell to third parties.
As the awareness of the value of the personal data increases, so has the drive to compensate the end user for her private information. The idea of monetizing private data can improve over the narrower view of hiding private data, since it empowers individuals to control their data through financial means.
In this paper we propose a theoretical framework for assigning prices to noisy query answers, as a function of their accuracy, and for dividing the price amongst data owners who deserve compensation for their loss of privacy. Our framework adopts and extends key principles from both differential privacy and query pricing in data markets. We identify essential properties of the price function and micro-payments, and characterize valid solutions.
△ Less
Submitted 17 December, 2012; v1 submitted 26 August, 2012;
originally announced August 2012.
-
An Adaptive Mechanism for Accurate Query Answering under Differential Privacy
Authors:
Chao Li,
Gerome Miklau
Abstract:
We propose a novel mechanism for answering sets of count- ing queries under differential privacy. Given a workload of counting queries, the mechanism automatically selects a different set of "strategy" queries to answer privately, using those answers to derive answers to the workload. The main algorithm proposed in this paper approximates the optimal strategy for any workload of linear counting qu…
▽ More
We propose a novel mechanism for answering sets of count- ing queries under differential privacy. Given a workload of counting queries, the mechanism automatically selects a different set of "strategy" queries to answer privately, using those answers to derive answers to the workload. The main algorithm proposed in this paper approximates the optimal strategy for any workload of linear counting queries. With no cost to the privacy guarantee, the mechanism improves significantly on prior approaches and achieves near-optimal error for many workloads, when applied under (ε, δ)-differential privacy. The result is an adaptive mechanism which can help users achieve good utility without requiring that they reason carefully about the best formulation of their task.
△ Less
Submitted 16 February, 2012;
originally announced February 2012.
-
Optimal error of query sets under the differentially-private matrix mechanism
Authors:
Chao Li,
Gerome Miklau
Abstract:
A common goal of privacy research is to release synthetic data that satisfies a formal privacy guarantee and can be used by an analyst in place of the original data. To achieve reasonable accuracy, a synthetic data set must be tuned to support a specified set of queries accurately, sacrificing fidelity for other queries.
This work considers methods for producing synthetic data under differential…
▽ More
A common goal of privacy research is to release synthetic data that satisfies a formal privacy guarantee and can be used by an analyst in place of the original data. To achieve reasonable accuracy, a synthetic data set must be tuned to support a specified set of queries accurately, sacrificing fidelity for other queries.
This work considers methods for producing synthetic data under differential privacy and investigates what makes a set of queries "easy" or "hard" to answer. We consider answering sets of linear counting queries using the matrix mechanism, a recent differentially-private mechanism that can reduce error by adding complex correlated noise adapted to a specified workload.
Our main result is a novel lower bound on the minimum total error required to simultaneously release answers to a set of workload queries. The bound reveals that the hardness of a query workload is related to the spectral properties of the workload when it is represented in matrix form. The bound is most informative for $(ε,δ)$-differential privacy but also applies to $ε$-differential privacy.
△ Less
Submitted 17 December, 2012; v1 submitted 15 February, 2012;
originally announced February 2012.
-
Efficient Batch Query Answering Under Differential Privacy
Authors:
Chao Li,
Gerome Miklau
Abstract:
Differential privacy is a rigorous privacy condition achieved by randomizing query answers. This paper develops efficient algorithms for answering multiple queries under differential privacy with low error. We pursue this goal by advancing a recent approach called the matrix mechanism, which generalizes standard differentially private mechanisms. This new mechanism works by first answering a diffe…
▽ More
Differential privacy is a rigorous privacy condition achieved by randomizing query answers. This paper develops efficient algorithms for answering multiple queries under differential privacy with low error. We pursue this goal by advancing a recent approach called the matrix mechanism, which generalizes standard differentially private mechanisms. This new mechanism works by first answering a different set of queries (a strategy) and then inferring the answers to the desired workload of queries. Although a few strategies are known to work well on specific workloads, finding the strategy which minimizes error on an arbitrary workload is intractable. We prove a new lower bound on the optimal error of this mechanism, and we propose an efficient algorithm that approaches this bound for a wide range of workloads.
△ Less
Submitted 7 March, 2011;
originally announced March 2011.
-
Scalable Probabilistic Databases with Factor Graphs and MCMC
Authors:
Michael Wick,
Andrew McCallum,
Gerome Miklau
Abstract:
Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or restrict the class of relational algebra formula under which they are closed. We propose an alternative approach where the underlying…
▽ More
Probabilistic databases play a crucial role in the management and understanding of uncertain data. However, incorporating probabilities into the semantics of incomplete databases has posed many challenges, forcing systems to sacrifice modeling power, scalability, or restrict the class of relational algebra formula under which they are closed. We propose an alternative approach where the underlying relational database always represents a single world, and an external factor graph encodes a distribution over possible worlds; Markov chain Monte Carlo (MCMC) inference is then used to recover this uncertainty to a desired level of fidelity. Our approach allows the efficient evaluation of arbitrary queries over probabilistic databases with arbitrary dependencies expressed by graphical models with structure that changes during inference. MCMC sampling provides efficiency by hypothesizing {\em modifications} to possible worlds rather than generating entire worlds from scratch. Queries are then run over the portions of the world that change, avoiding the onerous cost of running full queries over each sampled world. A significant innovation of this work is the connection between MCMC sampling and materialized view maintenance techniques: we find empirically that using view maintenance techniques is several orders of magnitude faster than naively querying each sampled world. We also demonstrate our system's ability to answer relational queries with aggregation, and demonstrate additional scalability through the use of parallelization.
△ Less
Submitted 11 May, 2010;
originally announced May 2010.
-
Optimizing Histogram Queries under Differential Privacy
Authors:
Chao Li,
Michael Hay,
Vibhor Rastogi,
Gerome Miklau,
Andrew McGregor
Abstract:
Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. Despite much recent work, optimal strategies for answering a collection of correlated queries are not known.
We study the problem of devising a set of strategy queries, to be submitted and answered privately, that will support the answers to a given workload of queries. We prop…
▽ More
Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. Despite much recent work, optimal strategies for answering a collection of correlated queries are not known.
We study the problem of devising a set of strategy queries, to be submitted and answered privately, that will support the answers to a given workload of queries. We propose a general framework in which query strategies are formed from linear combinations of counting queries, and we describe an optimal method for deriving new query answers from the answers to the strategy queries. Using this framework we characterize the error of strategies geometrically, and we propose solutions to the problem of finding optimal strategies.
△ Less
Submitted 6 September, 2010; v1 submitted 23 December, 2009;
originally announced December 2009.
-
Boosting the Accuracy of Differentially-Private Histograms Through Consistency
Authors:
Michael Hay,
Vibhor Rastogi,
Gerome Miklau,
Dan Suciu
Abstract:
We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The f…
▽ More
We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially-private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.
△ Less
Submitted 8 July, 2010; v1 submitted 6 April, 2009;
originally announced April 2009.