subscribe to arXiv mailings

Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Authors: Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati

Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To… ▽ More Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding. △ Less

Submitted 24 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: 22 pages, 15 figures

arXiv:2311.05784 [pdf, other]

Are "Hierarchical" Visual Representations Hierarchical?

Authors: Ethan Shen, Ali Farhadi, Aditya Kusupati

Abstract: Learned visual representations often capture large amounts of semantic information for accurate downstream applications. Human understanding of the world is fundamentally grounded in hierarchy. To mimic this and further improve representation capabilities, the community has explored "hierarchical" visual representations that aim at modeling the underlying hierarchy of the visual world. In this wor… ▽ More Learned visual representations often capture large amounts of semantic information for accurate downstream applications. Human understanding of the world is fundamentally grounded in hierarchy. To mimic this and further improve representation capabilities, the community has explored "hierarchical" visual representations that aim at modeling the underlying hierarchy of the visual world. In this work, we set out to investigate if hierarchical visual representations truly capture the human perceived hierarchy better than standard learned representations. To this end, we create HierNet, a suite of 12 datasets spanning 3 kinds of hierarchy from the BREEDs subset of ImageNet. After extensive evaluation of Hyperbolic and Matryoshka Representations across training setups, we conclude that they do not capture hierarchy any better than the standard representations but can assist in other aspects like search efficiency and interpretability. Our benchmark and the datasets are open-sourced at https://github.com/ethanlshen/HierNet. △ Less

Submitted 23 November, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

arXiv:2310.10351 [pdf, other]

B-rep Boolean Resulting Model Repair by Correcting Intersection Edges Based on Inference Procedure

Authors: Haomian Huang, Li Chen, Enya Shen, Jianmin Wang

Abstract: As the most essential part of CAD modeling operations, boolean operations on B-rep CAD models often suffer from errors. Errors caused by geometric precision or numerical uncertainty are hard to eliminate. They will reduce the reliability of boolean operations and damage the integrity of the resulting models. And it is difficult to repair false boolean resulting models damaged by errors. In practic… ▽ More As the most essential part of CAD modeling operations, boolean operations on B-rep CAD models often suffer from errors. Errors caused by geometric precision or numerical uncertainty are hard to eliminate. They will reduce the reliability of boolean operations and damage the integrity of the resulting models. And it is difficult to repair false boolean resulting models damaged by errors. In practice, we find that the illegal boolean resulting models stem from the false intersection edges caused by errors. Therefore, this paper proposes an automatic method based on set reasoning to repair flawed structures of the boolean resulting models by correcting their topological intersection edges. We provide a local adaptive tolerance estimation method for each intersection edge based on its geometric features as well as its origin. Then, we propose a set of inference mechanisms based on set operations to infer whether a repair is needed based on the tolerance value and how to correct the inaccurate intersection edge. Our inference strategies are strictly proven, ensuring the reliability and robustness of the repair process. The inference process will transform the problem into a geometric equivalent form less susceptible to errors to get a more accurate intersection edge. Since our inference procedure focuses on topological features, our method can repair the flawed boolean resulting models, no matter what source of errors causes the problem. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2307.13339 [pdf, other]

Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions

Authors: Skyler Wu, Eric Meng Shen, Charumathi Badrinath, Jiaqi Ma, Himabindu Lakkaraju

Abstract: Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. While understanding why CoT prompting is effective is crucial to ensuring that this phenomenon is a consequence of desired model behavior, little work has addressed this; nonetheless, such an understanding is a critical prerequisite for responsibl… ▽ More Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. While understanding why CoT prompting is effective is crucial to ensuring that this phenomenon is a consequence of desired model behavior, little work has addressed this; nonetheless, such an understanding is a critical prerequisite for responsible model deployment. We address this question by leveraging gradient-based feature attribution methods which produce saliency scores that capture the influence of input tokens on model output. Specifically, we probe several open-source LLMs to investigate whether CoT prompting affects the relative importances they assign to particular input tokens. Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt compared to standard few-shot prompting, it increases the robustness of saliency scores to question perturbations and variations in model output. △ Less

Submitted 25 July, 2023; originally announced July 2023.

Comments: Accepted to Workshop on Challenges in Deployable Generative AI at ICML 2023

arXiv:2307.10405 [pdf, other]

Generative Visual Question Answering

Authors: Ethan Shen, Scotty Singh, Bhavesh Kumar

Abstract: Multi-modal tasks involving vision and language in deep learning continue to rise in popularity and are leading to the development of newer models that can generalize beyond the extent of their training data. The current models lack temporal generalization which enables models to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answeri… ▽ More Multi-modal tasks involving vision and language in deep learning continue to rise in popularity and are leading to the development of newer models that can generalize beyond the extent of their training data. The current models lack temporal generalization which enables models to adapt to changes in future data. This paper discusses a viable approach to creating an advanced Visual Question Answering (VQA) model which can produce successful results on temporal generalization. We propose a new data set, GenVQA, utilizing images and captions from the VQAv2 and MS-COCO dataset to generate new images through stable diffusion. This augmented dataset is then used to test a combination of seven baseline and cutting edge VQA models. Performance evaluation focuses on questions mirroring the original VQAv2 dataset, with the answers having been adjusted to the new images. This paper's purpose is to investigate the robustness of several successful VQA models to assess their performance on future data distributions. Model architectures are analyzed to identify common stylistic choices that improve generalization under temporal distribution shifts. This research highlights the importance of creating a large-scale future shifted dataset. This data can enhance the robustness of VQA models, allowing their future peers to have improved ability to adapt to temporal distribution shifts. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2210.05650 [pdf, other]

Regret Bounds for Risk-Sensitive Reinforcement Learning

Authors: O. Bastani, Y. J. Ma, E. Shen, W. Xu

Abstract: In safety-critical applications of reinforcement learning such as healthcare and robotics, it is often desirable to optimize risk-sensitive objectives that account for tail outcomes rather than expected reward. We prove the first regret bounds for reinforcement learning under a general class of risk-sensitive objectives including the popular CVaR objective. Our theory is based on a novel character… ▽ More In safety-critical applications of reinforcement learning such as healthcare and robotics, it is often desirable to optimize risk-sensitive objectives that account for tail outcomes rather than expected reward. We prove the first regret bounds for reinforcement learning under a general class of risk-sensitive objectives including the popular CVaR objective. Our theory is based on a novel characterization of the CVaR objective as well as a novel optimistic MDP construction. △ Less

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2208.13699 [pdf, other]

Graph Exploration with Embedding-Guided Layouts

Authors: Leixian Shen, Zhiwei Tai, Enya Shen, Jianmin Wang

Abstract: Node-link diagrams are widely used to visualize graphs. Most graph layout algorithms only use graph topology for aesthetic goals (e.g., minimize node occlusions and edge crossings) or use node attributes for exploration goals (e.g., preserve visible communities). Existing hybrid methods that bind the two perspectives still suffer from various generation restrictions (e.g., limited input types and… ▽ More Node-link diagrams are widely used to visualize graphs. Most graph layout algorithms only use graph topology for aesthetic goals (e.g., minimize node occlusions and edge crossings) or use node attributes for exploration goals (e.g., preserve visible communities). Existing hybrid methods that bind the two perspectives still suffer from various generation restrictions (e.g., limited input types and required manual adjustments and prior knowledge of graphs) and the imbalance between aesthetic and exploration goals. In this paper, we propose a flexible embedding-based graph exploration pipeline to enjoy the best of both graph topology and node attributes. First, we leverage embedding algorithms for attributed graphs to encode the two perspectives into latent space. Then, we present an embedding-driven graph layout algorithm, GEGraph, which can achieve aesthetic layouts with better community preservation to support an easy interpretation of the graph structure. Next, graph explorations are extended based on the generated graph layout and insights extracted from the embedding vectors. Illustrated with examples, we build a layout-preserving aggregation method with Focus+Context interaction and a related nodes searching approach with multiple proximity strategies. Finally, we conduct quantitative and qualitative evaluations, a user study, and two case studies to validate our approach. △ Less

Submitted 19 January, 2023; v1 submitted 29 August, 2022; originally announced August 2022.

Comments: accepted by TVCG

arXiv:2205.11055 [pdf, other]

TempLM: Distilling Language Models into Template-Based Generators

Authors: Tianyi Zhang, Mina Lee, Lisa Li, Ende Shen, Tatsunori B. Hashimoto

Abstract: While pretrained language models (PLMs) have greatly improved text generation, they have also been known to produce unfaithful or inappropriate content. In contrast, classic template-based systems provide strong guarantees of faithfulness at the cost of fluency. We propose TempLM, which achieves the best of both worlds by distilling a PLM into a template-based generator. On the E2E and SynthBio da… ▽ More While pretrained language models (PLMs) have greatly improved text generation, they have also been known to produce unfaithful or inappropriate content. In contrast, classic template-based systems provide strong guarantees of faithfulness at the cost of fluency. We propose TempLM, which achieves the best of both worlds by distilling a PLM into a template-based generator. On the E2E and SynthBio data-to-text datasets, we show that TempLM is more faithful than the original PLM and is more fluent than prior template systems. Notably, on an out-of-domain evaluation, TempLM reduces a finetuned BART model's unfaithfulness rate from 83% to 0%. In a human study, we find that TempLM's templates substantially improve upon human-written ones in BERTScore. △ Less

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.03183 [pdf]

doi 10.1007/s41019-022-00195-3

Visual Data Analysis with Task-based Recommendations

Authors: Leixian Shen, Enya Shen, Zhiwei Tai, Yihao Xu, Jianmin Wang

Abstract: General visualization recommendation systems typically make design decisions for the dataset automatically. However, most of them can only prune meaningless visualizations but fail to recommend targeted results. This paper contributes TaskVis, a task-oriented visualization recommendation system that allows users to select their tasks precisely on the interface. We first summarize a task base with… ▽ More General visualization recommendation systems typically make design decisions for the dataset automatically. However, most of them can only prune meaningless visualizations but fail to recommend targeted results. This paper contributes TaskVis, a task-oriented visualization recommendation system that allows users to select their tasks precisely on the interface. We first summarize a task base with 18 classical analytic tasks by a survey both in academia and industry. On this basis, we maintain a rule base, which extends empirical wisdom with our targeted modeling of the analytic tasks. Then, our rule-based approach enumerates all the candidate visualizations through answer set programming. After that, the generated charts can be ranked by four ranking schemes. Furthermore, we introduce a task-based combination recommendation strategy, leveraging a set of visualizations to give a brief view of the dataset collaboratively. Finally, we evaluate TaskVis through a series of use cases and a user study. △ Less

Submitted 14 September, 2022; v1 submitted 6 May, 2022; originally announced May 2022.

Comments: 16 pages 10 figures. Data Sci. Eng. (2022)

arXiv:2109.03506 [pdf, other]

doi 10.1109/TVCG.2022.3148007

Towards Natural Language Interfaces for Data Visualization: A Survey

Authors: Leixian Shen, Enya Shen, Yuyu Luo, Xiaocong Yang, Xuming Hu, Xiongshuai Zhang, Zhiwei Tai, Jianmin Wang

Abstract: Utilizing Visualization-oriented Natural Language Interfaces (V-NLI) as a complementary input modality to direct manipulation for visual analytics can provide an engaging user experience. It enables users to focus on their tasks rather than having to worry about how to operate visualization tools on the interface. In the past two decades, leveraging advanced natural language processing technologie… ▽ More Utilizing Visualization-oriented Natural Language Interfaces (V-NLI) as a complementary input modality to direct manipulation for visual analytics can provide an engaging user experience. It enables users to focus on their tasks rather than having to worry about how to operate visualization tools on the interface. In the past two decades, leveraging advanced natural language processing technologies, numerous V-NLI systems have been developed in academic research and commercial software, especially in recent years. In this article, we conduct a comprehensive review of the existing V-NLIs. In order to classify each paper, we develop categorical dimensions based on a classic information visualization pipeline with the extension of a V-NLI layer. The following seven stages are used: query interpretation, data transformation, visual mapping, view transformation, human interaction, dialogue management, and presentation. Finally, we also shed light on several promising directions for future work in the V-NLI community. △ Less

Submitted 4 February, 2022; v1 submitted 8 September, 2021; originally announced September 2021.

Comments: 20 pages, 15 figures, accepted by IEEE TVCG

arXiv:2102.07077 [pdf, other]

Model-Agnostic Graph Regularization for Few-Shot Learning

Authors: Ethan Shen, Maria Brbic, Nicholas Monath, Jiaqi Zhai, Manzil Zaheer, Jure Leskovec

Abstract: In many domains, relationships between categories are encoded in the knowledge graph. Recently, promising results have been achieved by incorporating knowledge graph as side information in hard classification tasks with severely limited data. However, prior models consist of highly complex architectures with many sub-components that all seem to impact performance. In this paper, we present a compr… ▽ More In many domains, relationships between categories are encoded in the knowledge graph. Recently, promising results have been achieved by incorporating knowledge graph as side information in hard classification tasks with severely limited data. However, prior models consist of highly complex architectures with many sub-components that all seem to impact performance. In this paper, we present a comprehensive empirical study on graph embedded few-shot learning. We introduce a graph regularization approach that allows a deeper understanding of the impact of incorporating graph information between labels. Our proposed regularization is widely applicable and model-agnostic, and boosts the performance of any few-shot learning model, including fine-tuning, metric-based, and optimization-based meta-learning. Our approach improves the performance of strong base learners by up to 2% on Mini-ImageNet and 6.7% on ImageNet-FS, outperforming state-of-the-art graph embedded methods. Additional analyses reveal that graph regularizing models result in a lower loss for more difficult tasks, such as those with fewer shots and less informative support examples. △ Less

Submitted 14 February, 2021; originally announced February 2021.

Comments: NeurIPS Workshop on Meta-Learning 2020

arXiv:1811.10789 [pdf, other]

Flexible Attributed Network Embedding

Authors: Enya Shen, Zhidong Cao, Changqing Zou, Jianmin Wang

Abstract: Network embedding aims to find a way to encode network by learning an embedding vector for each node in the network. The network often has property information which is highly informative with respect to the node's position and role in the network. Most network embedding methods fail to utilize this information during network representation learning. In this paper, we propose a novel framework, FA… ▽ More Network embedding aims to find a way to encode network by learning an embedding vector for each node in the network. The network often has property information which is highly informative with respect to the node's position and role in the network. Most network embedding methods fail to utilize this information during network representation learning. In this paper, we propose a novel framework, FANE, to integrate structure and property information in the network embedding process. In FANE, we design a network to unify heterogeneity of the two information sources, and define a new random walking strategy to leverage property information and make the two information compensate. FANE is conceptually simple and empirically powerful. It improves over the state-of-the-art methods on Cora dataset classification task by over 5%, more than 10% on WebKB dataset classification task. Experiments also show that the results improve more than the state-of-the-art methods as increasing training size. Moreover, qualitative visualization show that our framework is helpful in network property information exploration. In all, we present a new way for efficiently learning state-of-the-art task-independent representations in complex attributed networks. The source code and datasets of this paper can be obtained from https://github.com/GraphWorld/FANE. △ Less

Submitted 26 November, 2018; originally announced November 2018.

arXiv:1703.02014 [pdf, ps, other]

SoK: Cryptographically Protected Database Search

Authors: Benjamin Fuller, Mayank Varia, Arkady Yerukhimovich, Emily Shen, Ariel Hamlin, Vijay Gadepally, Richard Shay, John Darby Mitchell, Robert K. Cunningham

Abstract: Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies. However, there… ▽ More Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies. However, there is no best protected search system or set of techniques. Design of such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases. At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions: 1) An identification of the important primitive operations across database paradigms. We find there are a small number of base operations that can be used and combined to support a large number of database paradigms. 2) An evaluation of the current state of protected search systems in implementing these base operations. This evaluation describes the main approaches and tradeoffs for each base operation. Furthermore, it puts protected search in the context of unprotected search, identifying key gaps in functionality. 3) An analysis of attacks against protected search for different base queries. 4) A roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search. △ Less

Submitted 2 June, 2017; v1 submitted 6 March, 2017; originally announced March 2017.

Comments: 20 pages, to appear to IEEE Security and Privacy

arXiv:1512.07931 [pdf, other]

doi 10.1088/0967-3334/37/9/1404

Probabilistic Model-Based Approach for Heart Beat Detection

Authors: Hugh Chen, Yusuf Erol, Eric Shen, Stuart Russell

Abstract: Nowadays, hospitals are ubiquitous and integral to modern society. Patients flow in and out of a veritable whirlwind of paperwork, consultations, and potential inpatient admissions, through an abstracted system that is not without flaws. One of the biggest flaws in the medical system is perhaps an unexpected one: the patient alarm system. One longitudinal study reported an 88.8% rate of false alar… ▽ More Nowadays, hospitals are ubiquitous and integral to modern society. Patients flow in and out of a veritable whirlwind of paperwork, consultations, and potential inpatient admissions, through an abstracted system that is not without flaws. One of the biggest flaws in the medical system is perhaps an unexpected one: the patient alarm system. One longitudinal study reported an 88.8% rate of false alarms, with other studies reporting numbers of similar magnitudes. These false alarm rates lead to a number of deleterious effects that manifest in a significantly lower standard of care across clinics. This paper discusses a model-based probabilistic inference approach to identifying variables at a detection level. We design a generative model that complies with an overview of human physiology and perform approximate Bayesian inference. One primary goal of this paper is to justify a Bayesian modeling approach to increasing robustness in a physiological domain. We use three data sets provided by Physionet, a research resource for complex physiological signals, in the form of the Physionet 2014 Challenge set-p1 and set-p2, as well as the MGH/MF Waveform Database. On the extended data set our algorithm is on par with the other top six submissions to the Physionet 2014 challenge. △ Less

Submitted 24 December, 2015; originally announced December 2015.

arXiv:1305.1495 [pdf, ps, other]

doi 10.1109/TPAMI.2015.2414427

GReTA - a novel Global and Recursive Tracking Algorithm in three dimensions

Authors: Alessandro Attanasi, Andrea Cavagna, Lorenzo Del Castello, Irene Giardina, Asja Jelic, Stefania Melillo, Leonardo Parisi, Fabio Pellacini, Edward Shen, Edmondo Silvestri, Massimiliano Viale

Abstract: Tracking multiple moving targets allows quantitative measure of the dynamic behavior in systems as diverse as animal groups in biology, turbulence in fluid dynamics and crowd and traffic control. In three dimensions, tracking several targets becomes increasingly hard since optical occlusions are very likely, i.e. two featureless targets frequently overlap for several frames. Occlusions are particu… ▽ More Tracking multiple moving targets allows quantitative measure of the dynamic behavior in systems as diverse as animal groups in biology, turbulence in fluid dynamics and crowd and traffic control. In three dimensions, tracking several targets becomes increasingly hard since optical occlusions are very likely, i.e. two featureless targets frequently overlap for several frames. Occlusions are particularly frequent in biological groups such as bird flocks, fish schools, and insect swarms, a fact that has severely limited collective animal behavior field studies in the past. This paper presents a 3D tracking method that is robust in the case of severe occlusions. To ensure robustness, we adopt a global optimization approach that works on all objects and frames at once. To achieve practicality and scalability, we employ a divide and conquer formulation, thanks to which the computational complexity of the problem is reduced by orders of magnitude. We tested our algorithm with synthetic data, with experimental data of bird flocks and insect swarms and with public benchmark datasets, and show that our system yields high quality trajectories for hundreds of moving targets with severe overlap. The results obtained on very heterogeneous data show the potential applicability of our method to the most diverse experimental situations. △ Less

Submitted 17 April, 2015; v1 submitted 7 May, 2013; originally announced May 2013.

Comments: 13 pages, 6 figures, 3 tables. Version 3 was slightly shortened, and new comprative results on the public datasets (thermal infrared videos of flying bats) by Z. Wu and coworkers (2014) were included. in A. Attanasi et al., "GReTA - A Novel Global and Recursive Tracking Algorithm in Three Dimensions", IEEE Trans. Pattern Anal. Mach. Intell., vol.37 (2015)

arXiv:1301.7015 [pdf, other]

Mining Frequent Graph Patterns with Differential Privacy

Authors: Entong Shen, Ting Yu

Abstract: Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. {\em Differential privacy} has recently emerged as the {\em de facto} s… ▽ More Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. {\em Differential privacy} has recently emerged as the {\em de facto} standard for private data analysis due to its provable privacy guarantee. In this paper we propose the first differentially private algorithm for mining frequent graph patterns. We first show that previous techniques on differentially private discovery of frequent {\em itemsets} cannot apply in mining frequent graph patterns due to the inherent complexity of handling structural information in graphs. We then address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling based algorithm. Unlike previous work on frequent itemset mining, our techniques do not rely on the output of a non-private mining algorithm. Instead, we observe that both frequent graph pattern mining and the guarantee of differential privacy can be unified into an MCMC sampling framework. In addition, we establish the privacy and utility guarantee of our algorithm and propose an efficient neighboring pattern counting technique as well. Experimental results show that the proposed algorithm is able to output frequent patterns with good precision. △ Less

Submitted 1 March, 2013; v1 submitted 29 January, 2013; originally announced January 2013.

arXiv:1103.5170 [pdf, other]

Differentially Private Spatial Decompositions

Authors: Graham Cormode, Magda Procopiuc, Entong Shen, Divesh Srivastava, Ting Yu

Abstract: Differential privacy has recently emerged as the de facto standard for private data release. This makes it possible to provide strong theoretical guarantees on the privacy and utility of released data. While it is well-known how to release data based on counts and simple functions under this guarantee, it remains to provide general purpose techniques to release different kinds of data. In this pap… ▽ More Differential privacy has recently emerged as the de facto standard for private data release. This makes it possible to provide strong theoretical guarantees on the privacy and utility of released data. While it is well-known how to release data based on counts and simple functions under this guarantee, it remains to provide general purpose techniques to release different kinds of data. In this paper, we focus on spatial data such as locations and more generally any data that can be indexed by a tree structure. Directly applying existing differential privacy methods to this type of data simply generates noise. Instead, we introduce a new class of "private spatial decompositions": these adapt standard spatial indexing methods such as quadtrees and kd-trees to provide a private description of the data distribution. Equipping such structures with differential privacy requires several steps to ensure that they provide meaningful privacy guarantees. Various primitives, such as choosing splitting points and describing the distribution of points within a region, must be done privately, and the guarantees of the different building blocks composed to provide an overall guarantee. Consequently, we expose the design space for private spatial decompositions, and analyze some key examples. Our experimental study demonstrates that it is possible to build such decompositions efficiently, and use them to answer a variety of queries privately with high accuracy. △ Less

Submitted 13 March, 2012; v1 submitted 26 March, 2011; originally announced March 2011.

Comments: ICDE 2012 (supplementary acknowledgments)

Showing 1–17 of 17 results for author: Shen, E