-
A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs
Authors:
Yun Wang,
Chrysanthi Kosyfaki,
Sihem Amer-Yahia,
Reynold Cheng
Abstract:
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework,…
▽ More
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing in graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses in attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m- dimensional random walk that accounts for the paths specified in a hypothesis. We further optimize its time efficiency and propose PHASEopt. Experiments on real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling in terms of accuracy and time efficiency.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
From Large Language Models to Databases and Back: A discussion on research and education
Authors:
Sihem Amer-Yahia,
Angela Bonifati,
Lei Chen,
Guoliang Li,
Kyuseok Shim,
Jianliang Xu,
Xiaochun Yang
Abstract:
This discussion was conducted at a recent panel at the 28th International Conference on Database Systems for Advanced Applications (DASFAA 2023), held April 17-20, 2023 in Tianjin, China. The title of the panel was "What does LLM (ChatGPT) Bring to Data Science Research and Education? Pros and Cons". It was moderated by Lei Chen and Xiaochun Yang. The discussion raised several questions on how lar…
▽ More
This discussion was conducted at a recent panel at the 28th International Conference on Database Systems for Advanced Applications (DASFAA 2023), held April 17-20, 2023 in Tianjin, China. The title of the panel was "What does LLM (ChatGPT) Bring to Data Science Research and Education? Pros and Cons". It was moderated by Lei Chen and Xiaochun Yang. The discussion raised several questions on how large language models (LLMs) and database research and education can help each other and the potential risks of LLMs.
△ Less
Submitted 7 July, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
On Efficient Approximate Queries over Machine Learning Models
Authors:
Dujian Ding,
Sihem Amer-Yahia,
Laks VS Lakshmanan
Abstract:
The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate qu…
▽ More
The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
△ Less
Submitted 17 November, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Guided Exploration of Data Summaries
Authors:
Brit Youngmann,
Sihem Amer-Yahia,
Aurélien Personnaz
Abstract:
Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains k individually uniform sets that are collectively diverse to be representative. Uniformity addresses interpretability and diversity addresses representativity. Findin…
▽ More
Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains k individually uniform sets that are collectively diverse to be representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such as summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize Eda4Sum, the problem of guided exploration of data summaries that seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. EdA4Sum generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) Top1Sum which chooses the most useful summary at each step; (ii) RLSum which trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.
△ Less
Submitted 27 May, 2022;
originally announced May 2022.
-
INODE: Building an End-to-End Data Exploration System in Practice [Extended Vision]
Authors:
Sihem Amer-Yahia,
Georgia Koutrika,
Frederic Bastian,
Theofilos Belmpas,
Martin Braschler,
Ursin Brunner,
Diego Calvanese,
Maximilian Fabricius,
Orest Gkini,
Catherine Kosten,
Davide Lanti,
Antonis Litke,
Hendrik Lücke-Tieke,
Francesco Alessandro Massucci,
Tarcisio Mendes de Farias,
Alessandro Mosca,
Francesco Multari,
Nikolaos Papadakis,
Dimitris Papadopoulos,
Yogendra Patil,
Aurélien Personnaz,
Guillem Rull,
Ana Sima,
Ellery Smith,
Dimitrios Skoutas
, et al. (3 additional authors not shown)
Abstract:
A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user in the exploration process, by being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different domain and data science expertise. We introduce INODE -- an end-to-end data expl…
▽ More
A full-fledged data exploration system must combine different access modalities with a powerful concept of guiding the user in the exploration process, by being reactive and anticipative both for data discovery and for data linking. Such systems are a real opportunity for our community to cater to users with different domain and data science expertise. We introduce INODE -- an end-to-end data exploration system -- that leverages, on the one hand, Machine Learning and, on the other hand, semantics for the purpose of Data Management (DM). Our vision is to develop a classic unified, comprehensive platform that provides extensive access to open datasets, and we demonstrate it in three significant use cases in the fields of Cancer Biomarker Reearch, Research and Innovation Policy Making, and Astrophysics. INODE offers sustainable services in (a) data modeling and linking, (b) integrated query processing using natural language, (c) guidance, and (d) data exploration through visualization, thus facilitating the user in discovering new insights. We demonstrate that our system is uniquely accessible to a wide range of users from larger scientific communities to the public. Finally, we briefly illustrate how this work paves the way for new research opportunities in DM.
△ Less
Submitted 9 April, 2021;
originally announced April 2021.
-
DropLeaf: a precision farming smartphone application for measuring pesticide spraying methods
Authors:
Bruno Brandoli,
Gabriel Spadon,
Travis Esau,
Patrick Hennessy,
Andre C. P. L. Carvalho,
Jose F. Rodrigues-Jr,
Sihem Amer-Yahia
Abstract:
Pesticide application has been heavily used in the cultivation of major crops, contributing to the increase of crop production over the past decades. However, their appropriate use and calibration of machines rely upon evaluation methodologies that can precisely estimate how well the pesticides' spraying covered the crops. A few strategies have been proposed in former works, yet their elevated cos…
▽ More
Pesticide application has been heavily used in the cultivation of major crops, contributing to the increase of crop production over the past decades. However, their appropriate use and calibration of machines rely upon evaluation methodologies that can precisely estimate how well the pesticides' spraying covered the crops. A few strategies have been proposed in former works, yet their elevated costs and low portability do not permit their wide adoption. This work introduces and experimentally assesses a novel tool that functions over a smartphone-based mobile application, named DropLeaf - Spraying Meter. Tests performed using DropLeaf demonstrated that, notwithstanding its versatility, it can estimate the pesticide spraying with high precision. Our methodology is based on image analysis, and the assessment of spraying deposition measures is performed successfully over real and synthetic water-sensitive papers. The proposed tool can be extensively used by farmers and agronomists furnished with regular smartphones, improving the utilization of pesticides with well-being, ecological, and monetary advantages. DropLeaf can be easily used for spray drift assessment of different methods, including emerging UAV (Unmanned Aerial Vehicle) sprayers.
△ Less
Submitted 31 August, 2020;
originally announced September 2020.
-
Recommending Deployment Strategies for Collaborative Tasks
Authors:
Dong Wei,
Senjuti Basu Roy,
Sihem Amer-Yahia
Abstract:
Our work contributes to aiding requesters in deploying collaborative tasks in crowdsourcing. We initiate the study of recommending deployment strategies for collaborative tasks to requesters that are consistent with deployment parameters they desire: a lower-bound on the quality of the crowd contribution, an upper-bound on the latency of task completion, and an upper-bound on the cost incurred by…
▽ More
Our work contributes to aiding requesters in deploying collaborative tasks in crowdsourcing. We initiate the study of recommending deployment strategies for collaborative tasks to requesters that are consistent with deployment parameters they desire: a lower-bound on the quality of the crowd contribution, an upper-bound on the latency of task completion, and an upper-bound on the cost incurred by paying workers. A deployment strategy is a choice of value for three dimensions: Structure (whether to solicit the workforce sequentially or simultaneously), Organization (to organize it collaboratively or independently), and Style (to rely solely on the crowd or to combine it with machine algorithms). We propose StratRec, an optimization-driven middle layer that recommends deployment strategies and alternative deployment parameters to requesters by accounting for worker availability. Our solutions are grounded in discrete optimization and computational geometry techniques that produce results with theoretical guarantees. We present extensive experiments on Amazon Mechanical Turk and conduct synthetic experiments to validate the qualitative and scalability aspects of StratRec.
△ Less
Submitted 15 March, 2020;
originally announced March 2020.
-
Patient trajectory prediction in the Mimic-III dataset, challenges and pitfalls
Authors:
Jose F Rodrigues-Jr,
Gabriel Spadon,
Bruno Brandoli,
Sihem Amer-Yahia
Abstract:
Automated medical prognosis has gained interest as artificial intelligence evolves and the potential for computer-aided medicine becomes evident. Nevertheless, it is challenging to design an effective system that, given a patient's medical history, is able to predict probable future conditions. Previous works, mostly carried out over private datasets, have tackled the problem by using artificial n…
▽ More
Automated medical prognosis has gained interest as artificial intelligence evolves and the potential for computer-aided medicine becomes evident. Nevertheless, it is challenging to design an effective system that, given a patient's medical history, is able to predict probable future conditions. Previous works, mostly carried out over private datasets, have tackled the problem by using artificial neural network architectures that cannot deal with low-cardinality datasets, or by means of non-generalizable inference approaches. We introduce a Deep Learning architecture whose design results from an intensive experimental process. The final architecture is based on two parallel Minimal Gated Recurrent Unit networks working in bi-directional manner, which was extensively tested with the open-access Mimic-III dataset. Our results demonstrate significant improvements in automated medical prognosis, as measured with Recall@k. We summarize our experience as a set of relevant insights for the design of Deep Learning architectures. Our work improves the performance of computer-aided medicine and can serve as a guide in designing artificial neural networks used in prediction tasks.
△ Less
Submitted 28 November, 2019; v1 submitted 10 September, 2019;
originally announced September 2019.
-
Eliciting Worker Preference for Task Completion
Authors:
Mohammadreza Esfandiari,
Senjuti Basu Roy,
Sihem Amer-Yahia
Abstract:
Current crowdsourcing platforms provide little support for worker feedback. Workers are sometimes invited to post free text describing their experience and preferences in completing tasks. They can also use forums such as Turker Nation1 to exchange preferences on tasks and requesters. In fact, crowdsourcing platforms rely heavily on observing workers and inferring their preferences implicitly. In…
▽ More
Current crowdsourcing platforms provide little support for worker feedback. Workers are sometimes invited to post free text describing their experience and preferences in completing tasks. They can also use forums such as Turker Nation1 to exchange preferences on tasks and requesters. In fact, crowdsourcing platforms rely heavily on observing workers and inferring their preferences implicitly. In this work, we believe that asking workers to indicate their preferences explicitly improve their experience in task completion and hence, the quality of their contributions. Explicit elicitation can indeed help to build more accurate worker models for task completion that captures the evolving nature of worker preferences. We design a worker model whose accuracy is improved iteratively by requesting preferences for task factors such as required skills, task payment, and task relevance. We propose a generic framework, develop efficient solutions in realistic scenarios, and run extensive experiments that show the benefit of explicit preference elicitation over implicit ones with statistical significance.
△ Less
Submitted 9 January, 2018;
originally announced January 2018.
-
Exploration of User Groups in VEXUS
Authors:
Sihem Amer-Yahia,
Behrooz Omidvar-Tehrani,
Joao Comba,
Viviane Moreira,
Fabian Colque Zegarra
Abstract:
We introduce VEXUS, an interactive visualization framework for exploring user data to fulfill tasks such as finding a set of experts, forming discussion groups and analyzing collective behaviors. User data is characterized by a combination of demographics like age and occupation, and actions such as rating a movie, writing a paper, following a medical treatment or buying groceries. The ubiquity of…
▽ More
We introduce VEXUS, an interactive visualization framework for exploring user data to fulfill tasks such as finding a set of experts, forming discussion groups and analyzing collective behaviors. User data is characterized by a combination of demographics like age and occupation, and actions such as rating a movie, writing a paper, following a medical treatment or buying groceries. The ubiquity of user data requires tools that help explorers, be they specialists or novice users, acquire new insights. VEXUS lets explorers interact with user data via visual primitives and builds an exploration profile to recommend the next exploration steps. VEXUS combines state-of-the-art visualization techniques with appropriate indexing of user data to provide fast and relevant exploration.
△ Less
Submitted 10 December, 2017;
originally announced December 2017.
-
Querying Temporal Drifts at Multiple Granularities (Technical Report)
Authors:
Sofia Kleisarchaki,
Sihem Amer-Yahia,
Ahlame Douzal-Chouakria,
Vassilis Christophides
Abstract:
There exists a large body of work on online drift detection with the goal of dynamically finding and maintaining changes in data streams. In this paper, we adopt a query-based approach to drift detection. Our approach relies on {\em a drift index}, a structure that captures drift at different time granularities and enables flexible {\em drift queries}. We formalize different drift queries that rep…
▽ More
There exists a large body of work on online drift detection with the goal of dynamically finding and maintaining changes in data streams. In this paper, we adopt a query-based approach to drift detection. Our approach relies on {\em a drift index}, a structure that captures drift at different time granularities and enables flexible {\em drift queries}. We formalize different drift queries that represent real-world scenarios and develop query evaluation algorithms that use different materializations of the drift index as well as strategies for online index maintenance. We describe a thorough study of the performance of our algorithms on real-world and synthetic datasets with varying change rates.
△ Less
Submitted 13 May, 2016; v1 submitted 9 May, 2016;
originally announced May 2016.
-
Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns
Authors:
Martin Kirchgessner,
Vincent Leroy,
Sihem Amer-Yahia,
Shashwat Mishra
Abstract:
Understanding customer buying patterns is of great interest to the retail industry and has shown to benefit a wide variety of goals ranging from managing stocks to implementing loyalty programs. Association rule mining is a common technique for extracting correlations such as "people in the South of France buy rosé wine" or "customers who buy paté also buy salted butter and sour bread." Unfortunat…
▽ More
Understanding customer buying patterns is of great interest to the retail industry and has shown to benefit a wide variety of goals ranging from managing stocks to implementing loyalty programs. Association rule mining is a common technique for extracting correlations such as "people in the South of France buy rosé wine" or "customers who buy paté also buy salted butter and sour bread." Unfortunately, sifting through a high number of buying patterns is not useful in practice, because of the predominance of popular products in the top rules. As a result, a number of "interestingness" measures (over 30) have been proposed to rank rules. However, there is no agreement on which measures are more appropriate for retail data. Moreover, since pattern mining algorithms output thousands of association rules for each product, the ability for an analyst to rely on ranking measures to identify the most interesting ones is crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a framework that provides analysts with the ability to compare the outcome of interestingness measures applied to buying patterns in the retail industry. We report on how we used CAPA to compare 34 measures applied to over 1,800 stores of Intermarché, one of the largest food retailers in France.
△ Less
Submitted 15 March, 2016;
originally announced March 2016.
-
"The Whole Is Greater Than the Sum of Its Parts": Optimization in Collaborative Crowdsourcing
Authors:
Habibur Rahman,
Senjuti Basu Roy,
Saravanan Thirumuruganathan,
Sihem Amer-Yahia,
Gautam Das
Abstract:
In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science resort to this special form of human-based computing, where, crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collab…
▽ More
In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science resort to this special form of human-based computing, where, crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collaborative crowdsourcing process is the aspect of successful collaboration among the workers, which, for the first time, is formalized and then optimized in this work. Our formalism considers two main collaboration-related human factors, affinity and upper critical mass, appropriately adapted from organizational science and social theories. Our contributions are (a) proposing a comprehensive model for collaborative crowdsourcing optimization, (b) rigorous theoretical analyses to understand the hardness of the proposed problems, (c) an array of efficient exact and approximation algorithms with provable theoretical guarantees. Finally, we present a detailed set of experimental results stemming from two real-world collaborative crowdsourcing application us- ing Amazon Mechanical Turk, as well as conduct synthetic data analyses on scalability and qualitative aspects of our proposed algorithms. Our experimental results successfully demonstrate the efficacy of our proposed solutions.
△ Less
Submitted 12 April, 2015; v1 submitted 17 February, 2015;
originally announced February 2015.
-
Optimization in Knowledge-Intensive Crowdsourcing
Authors:
Senjuti Basu Roy,
Ioanna Lykourentzou,
Saravanan Thirumuruganathan,
Sihem Amer-Yahia,
Gautam Das
Abstract:
We present SmartCrowd, a framework for optimizing collaborative knowledge-intensive crowdsourcing. SmartCrowd distinguishes itself by accounting for human factors in the process of assigning tasks to workers. Human factors designate workers' expertise in different skills, their expected minimum wage, and their availability. In SmartCrowd, we formulate task assignment as an optimization problem, an…
▽ More
We present SmartCrowd, a framework for optimizing collaborative knowledge-intensive crowdsourcing. SmartCrowd distinguishes itself by accounting for human factors in the process of assigning tasks to workers. Human factors designate workers' expertise in different skills, their expected minimum wage, and their availability. In SmartCrowd, we formulate task assignment as an optimization problem, and rely on pre-indexing workers and maintaining the indexes adaptively, in such a way that the task assignment process gets optimized both qualitatively, and computation time-wise. We present rigorous theoretical analyses of the optimization problem and propose optimal and approximation algorithms. We finally perform extensive performance and quality experiments using real and synthetic data to demonstrate that adaptive indexing in SmartCrowd is necessary to achieve efficient high quality task assignment.
△ Less
Submitted 7 January, 2014;
originally announced January 2014.
-
Who Tags What? An Analysis Framework
Authors:
Mahashweta Das,
Saravanan Thirumuruganathan,
Sihem Amer-Yahia,
Gautam Das,
Cong Yu
Abstract:
The rise of Web 2.0 is signaled by sites such as Flickr, del.icio.us, and YouTube, and social tagging is essential to their success. A typical tagging action involves three components, user, item (e.g., photos in Flickr), and tags (i.e., words or phrases). Analyzing how tags are assigned by certain users to certain items has important implications in helping users search for desired information. I…
▽ More
The rise of Web 2.0 is signaled by sites such as Flickr, del.icio.us, and YouTube, and social tagging is essential to their success. A typical tagging action involves three components, user, item (e.g., photos in Flickr), and tags (i.e., words or phrases). Analyzing how tags are assigned by certain users to certain items has important implications in helping users search for desired information. In this paper, we explore common analysis tasks and propose a dual mining framework for social tagging behavior mining. This framework is centered around two opposing measures, similarity and diversity, being applied to one or more tagging components, and therefore enables a wide range of analysis scenarios such as characterizing similar users tagging diverse items with similar tags, or diverse users tagging similar items with diverse tags, etc. By adopting different concrete measures for similarity and diversity in the framework, we show that a wide range of concrete analysis problems can be defined and they are NP-Complete in general. We design efficient algorithms for solving many of those problems and demonstrate, through comprehensive experiments over real data, that our algorithms significantly out-perform the exact brute-force approach without compromising analysis result quality.
△ Less
Submitted 1 August, 2012;
originally announced August 2012.
-
SocialScope: Enabling Information Discovery on Social Content Sites
Authors:
Sihem Amer-Yahia,
Laks Lakshmanan,
Cong Yu
Abstract:
Recently, many content sites have started encouraging their users to engage in social activities such as adding buddies on Yahoo! Travel and sharing articles with their friends on New York Times. This has led to the emergence of {\em social content sites}, which is being facilitated by initiatives like OpenID (http://www.openid.net/) and OpenSocial (http://www.opensocial.org/). These community s…
▽ More
Recently, many content sites have started encouraging their users to engage in social activities such as adding buddies on Yahoo! Travel and sharing articles with their friends on New York Times. This has led to the emergence of {\em social content sites}, which is being facilitated by initiatives like OpenID (http://www.openid.net/) and OpenSocial (http://www.opensocial.org/). These community standards enable the open access to users' social profiles and connections by individual content sites and are bringing content-oriented sites and social networking sites ever closer. The integration of content and social information raises new challenges for {\em information management and discovery} over such sites. We propose a logical architecture, named \kw{SocialScope}, consisting of three layers, for tackling the challenges. The {\em content management} layer is responsible for integrating, maintaining and physically accessing the content and social data. The {\em information discovery} layer takes care of analyzing content to derive interesting new information, and interpreting and processing the user's information need to identify relevant information. Finally, the {\em information presentation} layer explores the discovered information and helps users better understand it in a principled way. We describe the challenges in each layer and propose solutions for some of those challenges. In particular, we propose a uniform algebraic framework, which can be leveraged to uniformly and flexibly specify many of the information discovery and analysis tasks and provide the foundation for the optimization of those tasks.
△ Less
Submitted 10 September, 2009;
originally announced September 2009.