-
Accurate and fast anomaly detection in industrial processes and IoT environments
Authors:
Simone Tonini,
Andrea Vandin,
Francesca Chiaromonte,
Daniele Licari,
Fernando Barsacchi
Abstract:
We present a novel, simple and widely applicable semi-supervised procedure for anomaly detection in industrial and IoT environments, SAnD (Simple Anomaly Detection). SAnD comprises 5 steps, each leveraging well-known statistical tools, namely: smoothing filters, variance inflation factors, the Mahalanobis distance, threshold selection algorithms and feature importance techniques. To our knowledge, SAnD is the first procedure that integrates these tools to identify anomalies and help decipher their putative causes. We show how each step contributes to tackling technical challenges that practitioners face when detecting anomalies in industrial contexts, where signals can be highly multicollinear, have unknown distributions, and intertwine short-lived noise with the long(er)-lived actual anomalies. The development of SAnD was motivated by a concrete case study from our industrial partner, which we use here to show its effectiveness. We also evaluate the performance of SAnD by comparing it with a selection of semi-supervised methods on public datasets from the literature on anomaly detection. We conclude that SAnD is effective, broadly applicable, and outperforms existing approaches in both anomaly detection and runtime.
Submitted 27 April, 2024;
originally announced April 2024.
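As a rough illustration of one tool named in the abstract above, the Mahalanobis distance, the sketch below scores points against a small reference sample of two strongly correlated signals. It is not the SAnD procedure: the function name `mahalanobis_2d`, the toy data, and the restriction to two signals (which allows a closed-form 2x2 inverse and keeps the example dependency-free) are all assumptions made for illustration.

```python
from math import sqrt
from statistics import fmean


def mahalanobis_2d(train, point):
    """Mahalanobis distance of `point` from the two-signal sample `train`.

    Hypothetical helper for illustration only; handles exactly two signals.
    """
    xs = [p[0] for p in train]
    ys = [p[1] for p in train]
    mx, my = fmean(xs), fmean(ys)
    n = len(train)
    # Entries of the unbiased sample covariance matrix S = [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in train) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form (dx, dy) S^{-1} (dx, dy)^T using the closed-form 2x2 inverse
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d2)


# Toy "normal" data: the second signal is roughly twice the first (assumed values)
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]

on_trend = mahalanobis_2d(train, (5.0, 10.0))   # far from the mean, but follows the correlation
off_trend = mahalanobis_2d(train, (3.0, 9.0))   # close to the mean in x, but breaks the correlation
print(on_trend, off_trend)
```

Because the distance accounts for the covariance between signals, the point that breaks the correlation scores far higher than the point that merely sits at the edge of the observed range. This is one reason the distance suits multicollinear data.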
-
White-box validation of quantitative product lines by statistical model checking and process mining
Authors:
Roberto Casaluce,
Andrea Burattin,
Francesca Chiaromonte,
Alberto Lluch Lafuente,
Andrea Vandin
Abstract:
We propose a novel methodology for validating software product line (PL) models by integrating Statistical Model Checking (SMC) with Process Mining (PM). Our approach focuses on the feature-oriented language QFLan in the PL engineering domain, allowing modeling of PLs with rich cross-tree and quantitative constraints, as well as aspects of dynamic PLs like staged configurations. This richness leads to models with infinite state-space, requiring simulation-based analysis techniques like SMC. For instance, we illustrate with a running example involving infinite state space. SMC involves generating samples of system dynamics to estimate properties such as event probabilities or expected values. On the other hand, PM uses data-driven techniques on execution logs to identify and reason about the underlying execution process. In this paper, we propose, for the first time, applying PM techniques to SMC simulations' byproducts to enhance the utility of SMC analyses. Typically, when SMC results are unexpected, modelers must determine whether they stem from actual system characteristics or model bugs in a black-box manner. We improve on this by using PM to provide a white-box perspective on the observed system dynamics. Samples from SMC are fed into PM tools, producing a compact graphical representation of observed dynamics. The mined PM model is then transformed into a QFLan model, accessible to PL engineers. Using two well-known PL models, we demonstrate the effectiveness and scalability of our methodology in pinpointing issues and suggesting fixes. Additionally, we show its generality by applying it to the security domain.
Submitted 23 January, 2024;
originally announced January 2024.
-
Venture Capital investments through the lens of Network and Functional Data Analysis
Authors:
Christian Esposito,
Marco Gortan,
Lorenzo Testa,
Francesca Chiaromonte,
Giorgio Fagiolo,
Andrea Mina,
Giulio Rossetti
Abstract:
In this paper we characterize the performance of venture capital-backed firms based on their ability to attract investment. The aim of the study is to identify relevant predictors of success built from the network structure of firms' and investors' relations. Focusing on deal-level data for the health sector, we first create a bipartite network among firms and investors, and then apply functional data analysis (FDA) to derive progressively more refined indicators of success captured by a binary, a scalar and a functional outcome. More specifically, we use different network centrality measures to capture the role of early investments for the success of the firm. Our results, which are robust to different specifications, suggest that success has a strong positive association with centrality measures of the firm and of its large investors, and a weaker but still detectable association with centrality measures of small investors and features describing firms as knowledge bridges. Finally, based on our analyses, success is not associated with firms' and investors' spreading power (harmonic centrality), nor with the tightness of investors' community (clustering coefficient) and spreading ability (VoteRank).
Submitted 10 August, 2022; v1 submitted 25 February, 2022;
originally announced February 2022.
-
Can you always reap what you sow? Network and functional data analysis of VC investments in health-tech companies
Authors:
Christian Esposito,
Marco Gortan,
Lorenzo Testa,
Francesca Chiaromonte,
Giorgio Fagiolo,
Andrea Mina,
Giulio Rossetti
Abstract:
"Success" of firms in venture capital markets is hard to define, and its determinants are still poorly understood. We build a bipartite network of investors and firms in the healthcare sector, describing its structure and its communities. Then, we characterize "success" introducing progressively more refined definitions, and we find a positive association between such definitions and the centrality of a company. In particular, we are able to cluster funding trajectories of firms into two groups capturing different "success" regimes and to link the probability of belonging to one or the other to their network features (in particular their centrality and the one of their investors). We further investigate this positive association by introducing scalar as well as functional "success" outcomes, confirming our findings and their robustness.
Submitted 9 November, 2021;
originally announced November 2021.
-
Epsilon Consistent Mixup: Structural Regularization with an Adaptive Consistency-Interpolation Tradeoff
Authors:
Vincent Pisztora,
Yanglan Ou,
Xiaolei Huang,
Francesca Chiaromonte,
Jia Li
Abstract:
In this paper we propose $ε$-Consistent Mixup ($ε$mu). $ε$mu is a data-based structural regularization technique that combines Mixup's linear interpolation with consistency regularization in the Mixup direction, by compelling a simple adaptive tradeoff between the two. This learnable combination of consistency and interpolation induces a more flexible structure on the evolution of the response across the feature space and is shown to improve semi-supervised classification accuracy on the SVHN and CIFAR10 benchmark datasets, yielding the largest gains in the most challenging low label-availability scenarios. Empirical studies comparing $ε$mu and Mixup are presented and provide insight into the mechanisms behind $ε$mu's effectiveness. In particular, $ε$mu is found to produce more accurate synthetic labels and more confident predictions than Mixup.
Submitted 29 September, 2021; v1 submitted 19 April, 2021;
originally announced April 2021.
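For orientation, the linear interpolation that standard Mixup applies to a pair of labelled examples can be sketched as follows. This shows plain Mixup only, not $ε$mu's adaptive consistency-interpolation tradeoff, which the abstract does not specify in enough detail to reproduce. The function name `mixup_pair`, the default `alpha`, and the toy inputs are assumptions.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=1.0, rng=random):
    """Blend two (features, one-hot label) pairs with a Beta(alpha, alpha) weight.

    Hypothetical helper illustrating standard Mixup interpolation.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x_mix, y_mix, lam


rng = random.Random(0)  # seeded only so the sketch is repeatable
x_mix, y_mix, lam = mixup_pair([0.0, 1.0], [1.0, 0.0], [4.0, 3.0], [0.0, 1.0], rng=rng)
print(lam, x_mix, y_mix)
```

The synthetic label `y_mix` is a convex combination of the two one-hot labels, so its entries still sum to one.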
-
Automated and Distributed Statistical Analysis of Economic Agent-Based Models
Authors:
Andrea Vandin,
Daniele Giachini,
Francesco Lamperti,
Francesca Chiaromonte
Abstract:
We propose a novel approach to the statistical analysis of stochastic simulation models and, especially, agent-based models (ABMs). Our main goal is to provide fully automated, model-independent and tool-supported techniques and algorithms to inspect simulations and perform counterfactual analysis. Our approach: (i) is easy-to-use by the modeller, (ii) improves reproducibility of results, (iii) optimizes running time given the modeller's machine, (iv) automatically chooses the number of required simulations and simulation steps to reach user-specified statistical confidence, and (v) automates a variety of statistical tests. In particular, our techniques are designed to distinguish the transient dynamics of the model from its steady-state behaviour (if any), estimate properties in both 'phases', and provide indications on the (non-)ergodic nature of the simulated processes - which, in turn, allows one to gauge the reliability of a steady-state analysis. Estimates are equipped with statistical guarantees, allowing for robust comparisons across computational experiments. To demonstrate the effectiveness of our approach, we apply it to two models from the literature: a large-scale macro-financial ABM and a small scale prediction market model. Compared to prior analyses of these models, we obtain new insights and we are able to identify and fix some erroneous conclusions.
Submitted 8 November, 2023; v1 submitted 10 February, 2021;
originally announced February 2021.
-
An Efficient Semi-smooth Newton Augmented Lagrangian Method for Elastic Net
Authors:
Tobia Boschi,
Matthew Reimherr,
Francesca Chiaromonte
Abstract:
Feature selection is an important and active research area in statistics and machine learning. The Elastic Net is often used to perform selection when the features present non-negligible collinearity or practitioners wish to incorporate additional known structure. In this article, we propose a new Semi-smooth Newton Augmented Lagrangian Method to efficiently solve the Elastic Net in ultra-high dimensional settings. Our new algorithm exploits both the sparsity induced by the Elastic Net penalty and the sparsity due to the second order information of the augmented Lagrangian. This greatly reduces the computational cost of the problem. Using simulations on both synthetic and real datasets, we demonstrate that our approach outperforms its best competitors by at least an order of magnitude in terms of CPU time. We also apply our approach to a Genome Wide Association Study on childhood obesity.
Submitted 6 June, 2020;
originally announced June 2020.
-
The relationship between human mobility and viral transmissibility during the COVID-19 epidemics in Italy
Authors:
Paolo Cintia,
Luca Pappalardo,
Salvatore Rinzivillo,
Daniele Fadda,
Tobia Boschi,
Fosca Giannotti,
Francesca Chiaromonte,
Pietro Bonato,
Francesco Fabbri,
Francesco Penone,
Marcello Savarese,
Francesco Calabrese,
Giorgio Guzzetta,
Flavia Riccardo,
Valentina Marziano,
Piero Poletti,
Filippo Trentini,
Antonino Bella,
Xanthi Andrianou,
Martina Del Manso,
Massimo Fabiani,
Stefania Bellino,
Stefano Boros,
Alberto Mateo Urdiales,
Maria Fenicia Vescio, et al. (7 additional authors not shown)
Abstract:
In 2020, countries affected by the COVID-19 pandemic implemented various non-pharmaceutical interventions to counter the spread of the virus and its impact on their healthcare systems and economies. Using Italian data at different geographic scales, we investigate the relationship between human mobility, which subsumes many facets of the population's response to the changing situation, and the spread of COVID-19. Leveraging mobile phone data from February through September 2020, we find a striking relationship between the decrease in mobility flows and the net reproduction number. We find that the time needed to switch off mobility and bring the net reproduction number below the critical threshold of 1 is about one week. Moreover, we observe a strong relationship between the number of days spent above such threshold before the lockdown-induced drop in mobility flows and the total number of infections per 100k inhabitants. Estimating the statistical effect of mobility flows on the net reproduction number over time, we document a positive association at a 2-week lag, strong in March and April, and weaker but still significant in June. Our study demonstrates the value of big mobility data to monitor the epidemic and inform control interventions during its unfolding.
Submitted 1 April, 2021; v1 submitted 4 June, 2020;
originally announced June 2020.
-
Give more data, awareness and control to individual citizens, and they will help COVID-19 containment
Authors:
Mirco Nanni,
Gennady Andrienko,
Albert-László Barabási,
Chiara Boldrini,
Francesco Bonchi,
Ciro Cattuto,
Francesca Chiaromonte,
Giovanni Comandé,
Marco Conti,
Mark Coté,
Frank Dignum,
Virginia Dignum,
Josep Domingo-Ferrer,
Paolo Ferragina,
Fosca Giannotti,
Riccardo Guidotti,
Dirk Helbing,
Kimmo Kaski,
Janos Kertesz,
Sune Lehmann,
Bruno Lepri,
Paul Lukowicz,
Stan Matwin,
David Megías Jiménez,
Anna Monreale, et al. (14 additional authors not shown)
Abstract:
The rapid dynamics of COVID-19 calls for quick and effective tracking of virus transmission chains and early detection of outbreaks, especially in phase 2 of the pandemic, when lockdown and other restriction measures are progressively withdrawn, in order to avoid or minimize contagion resurgence. For this purpose, contact-tracing apps are being proposed for large scale adoption by many countries. A centralized approach, where data sensed by the app are all sent to a nation-wide server, raises concerns about citizens' privacy and needlessly strong digital surveillance, thus alerting us to the need to minimize personal data collection and avoid location tracking. We advocate the conceptual advantage of a decentralized approach, where both contact and location data are collected exclusively in individual citizens' "personal data stores", to be shared separately and selectively, voluntarily, only when the citizen has tested positive for COVID-19, and with a privacy-preserving level of granularity. This approach better protects the personal sphere of citizens and affords multiple benefits: it allows for detailed information gathering for infected people in a privacy-preserving fashion, which in turn enables both contact tracing and the early detection of outbreak hotspots on a more finely granulated geographic scale. Our recommendation is two-fold. First, to extend existing decentralized architectures with a light touch, in order to manage the collection of location data locally on the device, and allow the user to share spatio-temporal aggregates - if and when they want, for specific aims - with health authorities, for instance. Second, we favour a longer-term pursuit of realizing a Personal Data Store vision, giving users the opportunity to contribute to collective good in the measure they want, enhancing self-awareness, and cultivating collective efforts for rebuilding society.
Submitted 16 April, 2020; v1 submitted 10 April, 2020;
originally announced April 2020.
-
Knowledge and Social Relatedness Shape Research Portfolio Diversification
Authors:
Giorgio Tripodi,
Francesca Chiaromonte,
Fabrizio Lillo
Abstract:
Scientific discovery is shaped by scientists' choices and thus by their career patterns. The increasing knowledge required to work at the frontier of science makes it harder for an individual to embark on unexplored paths. Yet collaborations can reduce learning costs -- albeit at the expense of increased coordination costs. In this article, we use data on the publication histories of a very large sample of physicists to measure the effects of knowledge and social relatedness on their diversification strategies. Using bipartite networks, we compute a measure of topics similarity and a measure of social proximity. We find that scientists' strategies are not random, and that they are significantly affected by both. Knowledge relatedness across topics explains $\approx 10\%$ of logistic regression deviances and social relatedness as much as $\approx 30\%$, suggesting that science is an eminently social enterprise: when scientists move out of their core specialization, they do so through collaborations. Interestingly, we also find a significant negative interaction between knowledge and social relatedness, suggesting that the farther scientists move from their specialization, the more they rely on collaborations. Our results provide a starting point for broader quantitative analyses of scientific diversification strategies, which could also be extended to the domain of technological innovation -- offering insights from a comparative and policy perspective.
Submitted 25 September, 2020; v1 submitted 15 February, 2020;
originally announced February 2020.
-
On the bias of H-scores for comparing biclusters, and how to correct it
Authors:
Jacopo Di Iorio,
Francesca Chiaromonte,
Marzia A. Cremona
Abstract:
In the last two decades several biclustering methods have been developed as new unsupervised learning techniques to simultaneously cluster rows and columns of a data matrix. These algorithms play a central role in contemporary machine learning and in many applications, e.g. to computational biology and bioinformatics. The H-score is the evaluation score underlying the seminal biclustering algorithm by Cheng and Church, as well as many other subsequent biclustering methods. In this paper, we characterize a potentially troublesome bias in this score, that can distort biclustering results. We prove, both analytically and by simulation, that the average H-score increases with the number of rows/columns in a bicluster. This makes the H-score, and hence all algorithms based on it, biased towards small clusters. Based on our analytical proof, we are able to provide a straightforward way to correct this bias, allowing users to accurately compare biclusters.
Submitted 24 July, 2019;
originally announced July 2019.
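To make the H-score in the abstract above concrete, the sketch below computes the mean squared residue of Cheng and Church for a bicluster given as a list of rows, then averages it over pure-noise matrices of two sizes. The function names, the Gaussian noise model, the matrix sizes and the trial count are assumptions chosen for illustration; the paper's bias correction is not reproduced here.

```python
import random


def h_score(matrix):
    """Mean squared residue (H-score) of a bicluster given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    # Residue of each entry after removing its row mean, column mean and overall mean
    total = sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


def mean_noise_h_score(n_rows, n_cols, trials, rng):
    """Average H-score over `trials` matrices of i.i.d. standard Gaussian noise."""
    scores = []
    for _ in range(trials):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
        scores.append(h_score(noise))
    return sum(scores) / trials


# A perfectly additive bicluster (row effect + column effect) has zero residue
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
print(h_score(additive))

rng = random.Random(0)  # seeded only so the sketch is repeatable
small = mean_noise_h_score(2, 2, trials=200, rng=rng)
large = mean_noise_h_score(10, 10, trials=200, rng=rng)
print(small, large)
```

On noise alone, the smaller matrices receive a lower average score than the larger ones. This is consistent with the abstract's statement that the average H-score grows with bicluster size and therefore favours small biclusters.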
-
Linear Contour Learning: A Method for Supervised Dimension Reduction
Authors:
Bing Li,
Hongyuan Zha,
Francesca Chiaromonte
Abstract:
We propose a novel approach to sufficient dimension reduction in regression, based on estimating contour directions of negligible variation for the response surface. These directions span the orthogonal complement of the minimal space relevant for the regression, and can be extracted according to a measure of the variation in the response, leading to General Contour Regression (GCR). In comparison to existing sufficient dimension reduction techniques, this contour-based methodology guarantees exhaustive estimation of the central space under ellipticity of the predictor distribution and very mild additional assumptions, while maintaining $\sqrt{n}$-consistency and computational ease. Moreover, it proves to be robust to departures from ellipticity. We also establish some useful population properties for GCR. Simulations to compare performance with that of standard techniques such as ordinary least squares, sliced inverse regression, principal Hessian directions, and sliced average variance estimation confirm the advantages anticipated by theoretical analyses. We also demonstrate the use of contour-based methods on a data set concerning grades of students from Massachusetts colleges.
Submitted 13 August, 2014;
originally announced August 2014.