-
GLANCE: Global Actions in a Nutshell for Counterfactual Explainability
Authors:
Ioannis Emiris,
Dimitris Fotakis,
Giorgos Giannopoulos,
Dimitrios Gunopulos,
Loukas Kavouras,
Kleopatra Markou,
Eleni Psaroudaki,
Dimitrios Rontogiannis,
Dimitris Sacharidis,
Nikolaos Theologitis,
Dimitrios Tomaras,
Konstantinos Tsopelas
Abstract:
Counterfactual explanations have emerged as an important tool to understand, debug, and audit complex machine learning models. To offer global counterfactual explainability, state-of-the-art methods construct summaries of local explanations, offering a trade-off among conciseness, counterfactual effectiveness, and counterfactual cost or burden imposed on instances. In this work, we provide a concise formulation of the problem of identifying global counterfactuals and establish principled criteria for comparing solutions, drawing inspiration from Pareto dominance. We introduce innovative algorithms designed to address the challenge of finding global counterfactuals for either the entire input space or specific partitions, employing clustering and decision trees as key components. Additionally, we conduct a comprehensive experimental evaluation, considering various instances of the problem and comparing our proposed algorithms with state-of-the-art methods. The results highlight the consistent capability of our algorithms to generate meaningful and interpretable global counterfactual explanations.
Submitted 29 May, 2024;
originally announced May 2024.
-
A Framework for Feasible Counterfactual Exploration incorporating Causality, Sparsity and Density
Authors:
Kleopatra Markou,
Dimitrios Tomaras,
Vana Kalogeraki,
Dimitrios Gunopulos
Abstract:
The imminent need to interpret the output of a Machine Learning model with counterfactual (CF) explanations, obtained via small perturbations to the input, has been notable in the research community. Although the variety of CF examples is important, not all of them are necessarily feasible at the same time. This work uses different benchmark datasets to examine whether CF examples that preserve the logical causal relations among their attributes can be generated after a small number of changes to the original input, be feasible, and be actually useful to the end-user in a real-world case. To achieve this, we used a black box model as a classifier to distinguish the desired class from the input class, and a Variational Autoencoder (VAE) to generate feasible CF examples. As an extension, we also extracted two-dimensional manifolds (one for each dataset) that located the majority of the feasible examples, a representation that adequately distinguished them from infeasible ones. For our experimentation we used three commonly used datasets, and we managed to generate feasible and at the same time sparse CF examples that satisfy all possible predefined causal constraints, by confirming their importance with the attributes in a dataset.
Submitted 20 April, 2024;
originally announced April 2024.
-
Towards Mobility Data Science (Vision Paper)
Authors:
Mohamed Mokbel,
Mahmoud Sakr,
Li Xiong,
Andreas Züfle,
Jussara Almeida,
Taylor Anderson,
Walid Aref,
Gennady Andrienko,
Natalia Andrienko,
Yang Cao,
Sanjay Chawla,
Reynold Cheng,
Panos Chrysanthis,
Xiqi Fei,
Gabriel Ghinita,
Anita Graser,
Dimitrios Gunopulos,
Christian Jensen,
Joon-Seok Kim,
Kyoung-Sook Kim,
Peer Kröger,
John Krumm,
Johannes Lauer,
Amr Magdy,
Mario Nascimento, et al. (23 additional authors not shown)
Abstract:
Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years.
Submitted 7 March, 2024; v1 submitted 21 June, 2023;
originally announced July 2023.
-
HTTE: A Hybrid Technique For Travel Time Estimation In Sparse Data Environments
Authors:
Nikolaos Zygouras,
Nikolaos Panagiotou,
Yang Li,
Dimitrios Gunopulos,
Leonidas Guibas
Abstract:
Travel time estimation is a critical task, useful to many urban applications at the individual citizen and the stakeholder level. This paper presents a novel hybrid algorithm for travel time estimation that leverages historical and sparse real-time trajectory data. Given a path and a departure time we estimate the travel time taking into account the historical information, the real-time trajectory data and the correlations among different road segments. We detect similar road segments using historical trajectories, and use a latent representation to model the similarities. Our experimental evaluation demonstrates the effectiveness of our approach.
Submitted 12 January, 2023;
originally announced January 2023.
-
A Novel Framework for Handling Sparse Data in Traffic Forecast
Authors:
Nikolaos Zygouras,
Dimitrios Gunopulos
Abstract:
The ever-increasing number of GPS-equipped vehicles provides valuable real-time traffic information for the roads traversed by the moving vehicles. In this way, a set of sparse and time-evolving traffic reports is generated for each road. These time series are a valuable asset for forecasting the future traffic condition. In this paper we present a deep learning framework that encodes the sparse recent traffic information and forecasts the future traffic condition. Our framework consists of a recurrent part and a decoder. The recurrent part employs an attention mechanism that encodes the traffic reports that are available at a particular time window. The decoder is responsible for forecasting the future traffic condition.
Submitted 12 January, 2023;
originally announced January 2023.
-
Particle-based Fast Jet Simulation at the LHC with Variational Autoencoders
Authors:
Mary Touranakou,
Nadezda Chernyavskaya,
Javier Duarte,
Dimitrios Gunopulos,
Raghav Kansal,
Breno Orzari,
Maurizio Pierini,
Thiago Tomei,
Jean-Roch Vlimant
Abstract:
We study how to use Deep Variational Autoencoders for a fast simulation of jets of particles at the LHC. We represent jets as a list of constituents, characterized by their momenta. Starting from a simulation of the jet before detector effects, we train a Deep Variational Autoencoder to return the corresponding list of constituents after detection. Doing so, we bypass both the time-consuming detector simulation and the collision reconstruction steps of a traditional processing chain, significantly speeding up the event generation workflow. Through model optimization and hyperparameter tuning, we achieve state-of-the-art precision on the jet four-momentum, while providing an accurate description of the constituents' momenta, and an inference time comparable to that of a rule-based fast simulation.
Submitted 1 March, 2022;
originally announced March 2022.
-
Particle Cloud Generation with Message Passing Generative Adversarial Networks
Authors:
Raghav Kansal,
Javier Duarte,
Hao Su,
Breno Orzari,
Thiago Tomei,
Maurizio Pierini,
Mary Touranakou,
Jean-Roch Vlimant,
Dimitrios Gunopulos
Abstract:
In high energy physics (HEP), jets are collections of correlated particles produced ubiquitously in particle collisions such as those at the CERN Large Hadron Collider (LHC). Machine learning (ML)-based generative models, such as generative adversarial networks (GANs), have the potential to significantly accelerate LHC jet simulations. However, despite jets having a natural representation as a set of particles in momentum-space, a.k.a. a particle cloud, there exist no generative models applied to such a dataset. In this work, we introduce a new particle cloud dataset (JetNet), and apply to it existing point cloud GANs. Results are evaluated using (1) 1-Wasserstein distances between high- and low-level feature distributions, (2) a newly developed Fréchet ParticleNet Distance, and (3) the coverage and (4) minimum matching distance metrics. Existing GANs are found to be inadequate for physics applications, hence we develop a new message passing GAN (MPGAN), which outperforms existing point cloud GANs on virtually every metric and shows promise for use in HEP. We propose JetNet as a novel point-cloud-style dataset for the ML community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models. Additionally, to facilitate research and improve accessibility and reproducibility in this area, we release the open-source JetNet Python package with interfaces for particle cloud datasets, implementations for evaluation and loss metrics, and more tools for ML in HEP development.
Submitted 21 January, 2022; v1 submitted 22 June, 2021;
originally announced June 2021.
-
Graph Generative Adversarial Networks for Sparse Data Generation in High Energy Physics
Authors:
Raghav Kansal,
Javier Duarte,
Breno Orzari,
Thiago Tomei,
Maurizio Pierini,
Mary Touranakou,
Jean-Roch Vlimant,
Dimitrios Gunopulos
Abstract:
We develop a graph generative adversarial network to generate sparse data sets like those produced at the CERN Large Hadron Collider (LHC). We demonstrate this approach by training on and generating sparse representations of MNIST handwritten digit images and jets of particles in proton-proton collisions like those at the LHC. We find the model successfully generates sparse MNIST digits and particle jet data. We quantify agreement between real and generated data with a graph-based Fréchet Inception distance, and the particle and jet feature-level 1-Wasserstein distance for the MNIST and jet datasets respectively.
Submitted 30 January, 2021; v1 submitted 30 November, 2020;
originally announced December 2020.
-
Infant Mortality Prediction using Birth Certificate Data
Authors:
Antonia Saravanou,
Clemens Noelke,
Nicholas Huntington,
Dolores Acevedo-Garcia,
Dimitrios Gunopulos
Abstract:
The Infant Mortality Rate (IMR) is the number of infants per 1000 that do not survive until their first birthday. It is an important metric that provides information about infant health, but it also measures a society's general health status. Despite the high level of prosperity in the U.S.A., the country's IMR is higher than that of many other developed countries. Additionally, the U.S.A. exhibits persistent inequalities in the IMR across different racial and ethnic groups. In this paper, we study infant mortality prediction using features extracted from birth certificates. We are interested in training classification models to decide whether an infant will survive or not. We focus on exploring and understanding the importance of features in subsets of the population; we compare models trained for individual races to general models. Our evaluation shows that our methodology outperforms standard classification methods used by epidemiology researchers.
Submitted 25 July, 2019; v1 submitted 21 July, 2019;
originally announced July 2019.
-
Low-Rank Methods in Event Detection and Subsampled Point-to-Subspace Proximity Tests
Authors:
Jakub Marecek,
Stathis Maroulis,
Vana Kalogeraki,
Dimitrios Gunopulos
Abstract:
Monitoring of streamed data to detect abnormal behaviour (variously known as event detection, anomaly detection, change detection, or outlier detection) underlies many applications of the Internet of Things. There, one often collects data from a variety of sources, with asynchronous sampling and missing data. In this setting, one can predict abnormal behaviour using low-rank techniques. In particular, we assume that normal observations come from a low-rank subspace, prior to being corrupted by uniformly distributed noise. Correspondingly, we aim to recover a representation of the subspace, and perform event detection by running a point-to-subspace distance query for incoming data. Specifically, we use a variant of low-rank factorisation, which considers interval uncertainty sets around "known entries", on a suitable flattening of the input data to obtain a low-rank model. On-line, we compute the distance of incoming data to the low-rank normal subspace and update the subspace to keep it consistent with the seasonal changes present. For the distance computation, we suggest considering subsampling. We bound the one-sided error as a function of the number of coordinates employed using techniques from learning theory and computational geometry. In our experimental evaluation, we have tested the ability of the proposed algorithm to identify samples of abnormal behaviour in induction-loop data from Dublin, Ireland.
Submitted 29 July, 2021; v1 submitted 10 February, 2018;
originally announced February 2018.
-
Social Event Scheduling
Authors:
Nikos Bikakis,
Vana Kalogeraki,
Dimitrios Gunopulos
Abstract:
A major challenge for social event organizers (e.g., event planning and marketing companies, venues) is attracting the maximum number of participants, since it has great impact on the success of the event, and, consequently, the expected gains (e.g., revenue, artist/brand publicity). In this paper, we introduce the Social Event Scheduling (SES) problem, which schedules a set of social events considering user preferences and behavior, events' spatiotemporal conflicts, and competing events, in order to maximize the overall number of attendees. We show that SES is strongly NP-hard, even in highly restricted instances. To cope with the hardness of the SES problem we design a greedy approximation algorithm. Finally, we evaluate our method experimentally using a dataset from the Meetup event-based social network.
Submitted 6 March, 2018; v1 submitted 30 January, 2018;
originally announced January 2018.
-
Anima: Adaptive Personalized Software Keyboard
Authors:
Panos Sakkos,
Dimitrios Kotsakos,
Ioannis Katakis,
Dimitrios Gunopulos
Abstract:
We present a Software Keyboard for smart touchscreen devices that learns its owner's unique dictionary in order to produce personalized typing predictions. The learning process is accelerated by analysing the user's past typed communication. Moreover, personal temporal user behaviour is captured and exploited in the prediction engine. Computational and storage issues are addressed by dynamically forgetting words that the user no longer types. A prototype implementation is available on the Google Play Store.
Submitted 29 August, 2015; v1 submitted 22 January, 2015;
originally announced January 2015.
-
Elastic Processing of Analytical Query Workloads on IaaS Clouds
Authors:
Herald Kllapi,
Panos Sakkos,
Alex Delis,
Dimitrios Gunopulos,
Yannis Ioannidis
Abstract:
Many modern applications require the evaluation of analytical queries on large amounts of data. Such queries entail joins and heavy aggregations that often include user-defined functions (UDFs). The most efficient way to process this specific type of query is using tree execution plans. In this work, we develop an engine for analytical query processing and a suite of specialized techniques that collectively take advantage of the tree form of such plans. The engine executes these tree plans in an elastic IaaS cloud infrastructure and dynamically adapts by allocating and releasing pertinent resources based on the query workload monitored over a sliding time window. The engine offers its services for a fee according to service-level agreements (SLAs) associated with the incoming queries; its management of cloud resources aims at maximizing the profit after removing the costs of using these resources. We have fully implemented our algorithms in the Exareme dataflow processing system. We present an extensive evaluation that demonstrates that our approach is very efficient (exhibiting fast response times), elastic (successfully adjusting the cloud resources it uses as the engine continually adapts to query workload changes), and profitable (approximating very well the maximum difference between SLA-based income and cloud-based expenses).
Submitted 5 January, 2015;
originally announced January 2015.
-
On The Spatiotemporal Burstiness of Terms
Authors:
Theodoros Lappas,
Marcos R. Vieira,
Dimitrios Gunopulos,
Vassilis J. Tsotras
Abstract:
Thousands of documents are made available to the users via the web on a daily basis. One of the most extensively studied problems in the context of such document streams is burst identification. Given a term t, a burst is generally exhibited when an unusually high frequency is observed for t. While spatial and temporal burstiness have been studied individually in the past, our work is the first to simultaneously track and measure spatiotemporal term burstiness. In addition, we use the mined burstiness information toward an efficient document-search engine: given a user's query of terms, our engine returns a ranked list of documents discussing influential events with a strong spatiotemporal impact. We demonstrate the efficiency of our methods with an extensive experimental evaluation on real and synthetic datasets.
Submitted 30 May, 2012;
originally announced May 2012.