-
Dynamic Cluster Analysis to Detect and Track Novelty in Network Telescopes
Authors:
Kai Huang,
Luca Gioacchini,
Marco Mellia,
Luca Vassio
Abstract:
In the context of cybersecurity, tracking the activities of coordinated hosts over time is a daunting task because both participants and their behaviours evolve at a fast pace. We address this scenario by solving a dynamic novelty discovery problem with the aim of both re-identifying patterns seen in the past and highlighting new patterns. We focus on traffic collected by Network Telescopes, a pri…
▽ More
In the context of cybersecurity, tracking the activities of coordinated hosts over time is a daunting task because both participants and their behaviours evolve at a fast pace. We address this scenario by solving a dynamic novelty discovery problem with the aim of both re-identifying patterns seen in the past and highlighting new patterns. We focus on traffic collected by Network Telescopes, a primary and noisy source for cybersecurity analysis. We propose a 3-stage pipeline: (i) we learn compact representations (embeddings) of hosts through their traffic in a self-supervised fashion; (ii) via clustering, we distinguish groups of hosts performing similar activities; (iii) we track the cluster temporal evolution to highlight novel patterns. We apply our methodology to 20 days of telescope traffic during which we observe more than 8 thousand active hosts. Our results show that we efficiently identify 50-70 well-shaped clusters per day, 60-70% of which we associate with already analysed cases, while we pinpoint 10-20 previously unseen clusters per day. These correspond to activity changes and new incidents, of which we document some. In short, our novelty discovery methodology enormously simplifies the manual analysis the security analysts have to conduct to gain insights to interpret novel coordinated activities.
△ Less
Submitted 17 May, 2024;
originally announced May 2024.
-
Generic Multi-modal Representation Learning for Network Traffic Analysis
Authors:
Luca Gioacchini,
Idilio Drago,
Marco Mellia,
Zied Ben Houidi,
Dario Rossi
Abstract:
Network traffic analysis is fundamental for network management, troubleshooting, and security. Tasks such as traffic classification, anomaly detection, and novelty discovery are fundamental for extracting operational information from network data and measurements. We witness the shift from deep packet inspection and basic machine learning to Deep Learning (DL) approaches where researchers define a…
▽ More
Network traffic analysis is fundamental for network management, troubleshooting, and security. Tasks such as traffic classification, anomaly detection, and novelty discovery are fundamental for extracting operational information from network data and measurements. We witness the shift from deep packet inspection and basic machine learning to Deep Learning (DL) approaches where researchers define and test a custom DL architecture designed for each specific problem. We here advocate the need for a general DL architecture flexible enough to solve different traffic analysis tasks. We test this idea by proposing a DL architecture based on generic data adaptation modules, followed by an integration module that summarises the extracted information into a compact and rich intermediate representation (i.e. embeddings). The result is a flexible Multi-modal Autoencoder (MAE) pipeline that can solve different use cases. We demonstrate the architecture with traffic classification (TC) tasks since they allow us to quantitatively compare results with state-of-the-art solutions. However, we argue that the MAE architecture is generic and can be used to learn representations useful in multiple scenarios. On TC, the MAE performs on par or better than alternatives while avoiding cumbersome feature engineering, thus streamlining the adoption of DL solutions for traffic analysis.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
Privacy Policies and Consent Management Platforms: Growth and Users' Interactions over Time
Authors:
Nikhil Jha,
Martino Trevisan,
Marco Mellia,
Daniel Fernandez,
Rodrigo Irarrazaval
Abstract:
In response to growing concerns about user privacy, legislators have introduced new regulations and laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) that force websites to obtain user consent before activating personal data collection, fundamental to providing targeted advertising. The cornerstone of this consent-seeking process involves the…
▽ More
In response to growing concerns about user privacy, legislators have introduced new regulations and laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) that force websites to obtain user consent before activating personal data collection, fundamental to providing targeted advertising. The cornerstone of this consent-seeking process involves the use of Privacy Banners, the technical mechanism to collect users' approval for data collection practices. Consent management platforms (CMPs) have emerged as practical solutions to make it easier for website administrators to properly manage consent, allowing them to outsource the complexities of managing user consent and activating advertising features.
This paper presents a detailed and longitudinal analysis of the evolution of CMPs spanning nine years. We take a twofold perspective: Firstly, thanks to the HTTP Archive dataset, we provide insights into the growth, market share, and geographical spread of CMPs. Noteworthy observations include the substantial impact of GDPR on the proliferation of CMPs in Europe. Secondly, we analyse millions of user interactions with a medium-sized CMP present in thousands of websites worldwide. We observe how even small changes in the design of Privacy Banners have a critical impact on the user's giving or denying their consent to data collection. For instance, over 60% of users do not consent when offered a simple "one-click reject-all" option. Conversely, when opting out requires more than one click, about 90% of users prefer to simply give their consent. The main objective is in fact to eliminate the annoying privacy banner rather the make an informed decision. Curiously, we observe iOS users exhibit a higher tendency to accept cookies compared to Android users, possibly indicating greater confidence in the privacy offered by Apple devices.
△ Less
Submitted 29 February, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Benchmarking Evolutionary Community Detection Algorithms in Dynamic Networks
Authors:
Giordano Paoletti,
Luca Gioacchini,
Marco Mellia,
Luca Vassio,
Jussara M. Almeida
Abstract:
In dynamic complex networks, entities interact and form network communities that evolve over time. Among the many static Community Detection (CD) solutions, the modularity-based Louvain, or Greedy Modularity Algorithm (GMA), is widely employed in real-world applications due to its intuitiveness and scalability. Nevertheless, addressing CD in dynamic graphs remains an open problem, since the evolut…
▽ More
In dynamic complex networks, entities interact and form network communities that evolve over time. Among the many static Community Detection (CD) solutions, the modularity-based Louvain, or Greedy Modularity Algorithm (GMA), is widely employed in real-world applications due to its intuitiveness and scalability. Nevertheless, addressing CD in dynamic graphs remains an open problem, since the evolution of the network connections may poison the identification of communities, which may be evolving at a slower pace. Hence, naively applying GMA to successive network snapshots may lead to temporal inconsistencies in the communities. Two evolutionary adaptations of GMA, sGMA and $α$GMA, have been proposed to tackle this problem. Yet, evaluating the performance of these methods and understanding to which scenarios each one is better suited is challenging because of the lack of a comprehensive set of metrics and a consistent ground truth. To address these challenges, we propose (i) a benchmarking framework for evolutionary CD algorithms in dynamic networks and (ii) a generalised modularity-based approach (NeGMA). Our framework allows us to generate synthetic community-structured graphs and design evolving scenarios with nine basic graph transformations occurring at different rates. We evaluate performance through three metrics we define, i.e. Correctness, Delay, and Stability. Our findings reveal that $α$GMA is well-suited for detecting intermittent transformations, but struggles with abrupt changes; sGMA achieves superior stability, but fails to detect emerging communities; and NeGMA appears a well-balanced solution, excelling in responsiveness and instantaneous transformations detection.
△ Less
Submitted 11 January, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Sound-skwatter (Did You Mean: Sound-squatter?) AI-powered Generator for Phishing Prevention
Authors:
Rodolfo Valentim,
Idilio Drago,
Marco Mellia,
Federico Cerutti
Abstract:
Sound-squatting is a phishing attack that tricks users into malicious resources by exploiting similarities in the pronunciation of words. Proactive defense against sound-squatting candidates is complex, and existing solutions rely on manually curated lists of homophones. We here introduce Sound-skwatter, a multi-language AI-based system that generates sound-squatting candidates for proactive defen…
▽ More
Sound-squatting is a phishing attack that tricks users into malicious resources by exploiting similarities in the pronunciation of words. Proactive defense against sound-squatting candidates is complex, and existing solutions rely on manually curated lists of homophones. We here introduce Sound-skwatter, a multi-language AI-based system that generates sound-squatting candidates for proactive defense. Sound-skwatter relies on an innovative multi-modal combination of Transformers Networks and acoustic models to learn sound similarities. We show that Sound-skwatter can automatically list known homophones and thousands of high-quality candidates. In addition, it covers cross-language sound-squatting, i.e., when the reader and the listener speak different languages, supporting any combination of languages. We apply Sound-skwatter to network-centric phishing via squatted domain names. We find ~ 10% of the generated domains exist in the wild, the vast majority unknown to protection solutions. Next, we show attacks on the PyPI package manager, where ~ 17% of the popular packages have at least one existing candidate. We believe Sound-skwatter is a crucial asset to mitigate the sound-squatting phenomenon proactively on the Internet. To increase its impact, we publish an online demo and release our models and code as open source.
△ Less
Submitted 10 October, 2023;
originally announced October 2023.
-
LogPrécis: Unleashing Language Models for Automated Malicious Log Analysis
Authors:
Matteo Boffa,
Rodolfo Vieira Valentim,
Luca Vassio,
Danilo Giordano,
Idilio Drago,
Marco Mellia,
Zied Ben Houidi
Abstract:
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain…
▽ More
The collection of security-related logs holds the key to understanding attack behaviors and diagnosing vulnerabilities. Still, their analysis remains a daunting challenge. Recently, Language Models (LMs) have demonstrated unmatched potential in understanding natural and programming languages. The question arises whether and how LMs could be also useful for security experts since their logs contain intrinsically confused and obfuscated information. In this paper, we systematically study how to benefit from the state-of-the-art in LM to automatically analyze text-like Unix shell attack logs. We present a thorough design methodology that leads to LogPrécis. It receives as input raw shell sessions and automatically identifies and assigns the attacker tactic to each portion of the session, i.e., unveiling the sequence of the attacker's goals. We demonstrate LogPrécis capability to support the analysis of two large datasets containing about 400,000 unique Unix shell attacks. LogPrécis reduces them into about 3,000 fingerprints, each grouping sessions with the same sequence of tactics. The abstraction it provides lets the analyst better understand attacks, identify fingerprints, detect novelty, link similar attacks, and track families and mutations. Overall, LogPrécis, released as open source, paves the way for better and more responsive defense against cyberattacks.
△ Less
Submitted 22 March, 2024; v1 submitted 17 July, 2023;
originally announced July 2023.
-
On the Robustness of Topics API to a Re-Identification Attack
Authors:
Nikhil Jha,
Martino Trevisan,
Emilio Leonardi,
Marco Mellia
Abstract:
Web tracking through third-party cookies is considered a threat to users' privacy and is supposed to be abandoned in the near future. Recently, Google proposed the Topics API framework as a privacy-friendly alternative for behavioural advertising. Using this approach, the browser builds a user profile based on navigation history, which advertisers can access. The Topics API has the possibility of…
▽ More
Web tracking through third-party cookies is considered a threat to users' privacy and is supposed to be abandoned in the near future. Recently, Google proposed the Topics API framework as a privacy-friendly alternative for behavioural advertising. Using this approach, the browser builds a user profile based on navigation history, which advertisers can access. The Topics API has the possibility of becoming the new standard for behavioural advertising, thus it is necessary to fully understand its operation and find possible limitations.
This paper evaluates the robustness of the Topics API to a re-identification attack where an attacker reconstructs the user profile by accumulating user's exposed topics over time to later re-identify the same user on a different website. Using real traffic traces and realistic population models, we find that the Topics API mitigates but cannot prevent re-identification to take place, as there is a sizeable chance that a user's profile is unique within a website's audience. Consequently, the probability of correct re-identification can reach 15-17%, considering a pool of 1,000 users. We offer the code and data we use in this work to stimulate further studies and the tuning of the Topic API parameters.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
Recommendation Systems in Libraries: an Application with Heterogeneous Data Sources
Authors:
Alessandro Speciale,
Greta Vallero,
Luca Vassio,
Marco Mellia
Abstract:
The Reading&Machine project exploits the support of digitalization to increase the attractiveness of libraries and improve the users' experience. The project implements an application that helps the users in their decision-making process, providing recommendation system (RecSys)-generated lists of books the users might be interested in, and showing them through an interactive Virtual Reality (VR)-…
▽ More
The Reading&Machine project exploits the support of digitalization to increase the attractiveness of libraries and improve the users' experience. The project implements an application that helps the users in their decision-making process, providing recommendation system (RecSys)-generated lists of books the users might be interested in, and showing them through an interactive Virtual Reality (VR)-based Graphical User Interface (GUI). In this paper, we focus on the design and testing of the recommendation system, employing data about all users' loans over the past 9 years from the network of libraries located in Turin, Italy. In addition, we use data collected by the Anobii online social community of readers, who share their feedback and additional information about books they read. Armed with this heterogeneous data, we build and evaluate Content Based (CB) and Collaborative Filtering (CF) approaches. Our results show that the CF outperforms the CB approach, improving by up to 47\% the relevant recommendations provided to a reader. However, the performance of the CB approach is heavily dependent on the number of books the reader has already read, and it can work even better than CF for users with a large history. Finally, our evaluations highlight that the performances of both approaches are significantly improved if the system integrates and leverages the information from the Anobii dataset, which allows us to include more user readings (for CF) and richer book metadata (for CB).
△ Less
Submitted 21 March, 2023;
originally announced March 2023.
-
Operationalizing AI in Future Networks: A Bird's Eye View from the System Perspective
Authors:
Qiong Liu,
Tianzhu Zhang,
Masoud Hemmatpour,
Han Qiu,
Dong Zhang,
Chung Shue Chen,
Marco Mellia,
Armen Aghasaryan
Abstract:
Modern Artificial Intelligence (AI) technologies, led by Machine Learning (ML), have gained unprecedented momentum over the past decade. Following this wave of "AI summer", the network research community has also embraced AI/ML algorithms to address many problems related to network operations and management. However, compared to their counterparts in other domains, most ML-based solutions have yet…
▽ More
Modern Artificial Intelligence (AI) technologies, led by Machine Learning (ML), have gained unprecedented momentum over the past decade. Following this wave of "AI summer", the network research community has also embraced AI/ML algorithms to address many problems related to network operations and management. However, compared to their counterparts in other domains, most ML-based solutions have yet to receive large-scale deployment due to insufficient maturity for production settings. This article concentrates on the practical issues of developing and operating ML-based solutions in real networks. Specifically, we enumerate the key factors hindering the integration of AI/ML in real networks and review existing solutions to uncover the missing considerations. Further, we highlight a promising direction, i.e., Machine Learning Operations (MLOps), that can close the gap. We believe this paper spotlights the system-related considerations on implementing \& maintaining ML-based solutions and invigorate their full adoption in future networks.
△ Less
Submitted 25 June, 2024; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Understanding mobility in networks: A node embedding approach
Authors:
Matheus F. C. Barros,
Carlos H. G. Ferreira,
Bruno Pereira dos Santos,
Lourenço A. P. Júnior,
Marco Mellia,
Jussara M. Almeida
Abstract:
Motivated by the growing number of mobile devices capable of connecting and exchanging messages, we propose a methodology aiming to model and analyze node mobility in networks. We note that many existing solutions in the literature rely on topological measurements calculated directly on the graph of node contacts, aiming to capture the notion of the node's importance in terms of connectivity and m…
▽ More
Motivated by the growing number of mobile devices capable of connecting and exchanging messages, we propose a methodology aiming to model and analyze node mobility in networks. We note that many existing solutions in the literature rely on topological measurements calculated directly on the graph of node contacts, aiming to capture the notion of the node's importance in terms of connectivity and mobility patterns beneficial for prototyping, design, and deployment of mobile networks. However, each measure has its specificity and fails to generalize the node importance notions that ultimately change over time. Unlike previous approaches, our methodology is based on a node embedding method that models and unveils the nodes' importance in mobility and connectivity patterns while preserving their spatial and temporal characteristics. We focus on a case study based on a trace of group meetings. The results show that our methodology provides a rich representation for extracting different mobility and connectivity patterns, which can be helpful for various applications and services in mobile networks.
△ Less
Submitted 11 November, 2021;
originally announced November 2021.
-
On the Dynamics of Political Discussions on Instagram: A Network Perspective
Authors:
Carlos H. G. Ferreira,
Fabricio Murai,
Ana P. C. Silva,
Jussara M. Almeida,
Martino Trevisan,
Luca Vassio,
Marco Mellia,
Idilio Drago
Abstract:
Instagram has been increasingly used as a source of information especially among the youth. As a result, political figures now leverage the platform to spread opinions and political agenda. We here analyze online discussions on Instagram, notably in political topics, from a network perspective. Specifically, we investigate the emergence of communities of co-commenters, that is, groups of users who…
▽ More
Instagram has been increasingly used as a source of information especially among the youth. As a result, political figures now leverage the platform to spread opinions and political agenda. We here analyze online discussions on Instagram, notably in political topics, from a network perspective. Specifically, we investigate the emergence of communities of co-commenters, that is, groups of users who often interact by commenting on the same posts and may be driving the ongoing online discussions. In particular, we are interested in salient co-interactions, i.e., interactions of co-commenters that occur more often than expected by chance and under independent behavior. Unlike casual and accidental co-interactions which normally happen in large volumes, salient co-interactions are key elements driving the online discussions and, ultimately, the information dissemination. We base our study on the analysis of 10 weeks of data centered around major elections in Brazil and Italy, following both politicians and other celebrities. We extract and characterize the communities of co-commenters in terms of topological structure, properties of the discussions carried out by community members, and how some community properties, notably community membership and topics, evolve over time. We show that communities discussing political topics tend to be more engaged in the debate by writing longer comments, using more emojis, hashtags and negative words than in other subjects. Also, communities built around political discussions tend to be more dynamic, although top commenters remain active and preserve community membership over time. Moreover, we observe a great diversity in discussed topics over time: whereas some topics attract attention only momentarily, others, centered around more fundamental political discussions, remain consistently active over time.
△ Less
Submitted 13 September, 2022; v1 submitted 19 September, 2021;
originally announced September 2021.
-
The Internet with Privacy Policies: Measuring The Web Upon Consent
Authors:
Nikhil Jha,
Martino Trevisan,
Luca Vassio,
Marco Mellia
Abstract:
To protect users' privacy, legislators have regulated the usage of tracking technologies, mandating the acquisition of users' consent before collecting data. Consequently, websites started showing more and more consent management modules -- i.e., Privacy Banners -- the visitors have to interact with to access the website content. They challenge the automatic collection of Web measurements, primari…
▽ More
To protect users' privacy, legislators have regulated the usage of tracking technologies, mandating the acquisition of users' consent before collecting data. Consequently, websites started showing more and more consent management modules -- i.e., Privacy Banners -- the visitors have to interact with to access the website content. They challenge the automatic collection of Web measurements, primarily to monitor the extensiveness of tracking technologies but also to measure Web performance in the wild. Privacy Banners in fact limit crawlers from observing the actual website content. In this paper, we present a thorough measurement campaign focusing on popular websites in Europe and the US, visiting both landing and internal pages from different countries around the world. We engineer Priv-Accept, a Web crawler able to accept the privacy policies, as most users would do in practice. This let us compare how webpages change before and after. Our results show that all measurements performed not dealing with the Privacy Banners offer a very biased and partial view of the Web. After accepting the privacy policies, we observe an increase of up to 70 trackers, which in turn slows down the webpage load time by a factor of 2x-3x.
△ Less
Submitted 13 September, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
z-anonymity: Zero-Delay Anonymization for Data Streams
Authors:
Nikhil Jha,
Thomas Favale,
Luca Vassio,
Martino Trevisan,
Marco Mellia
Abstract:
With the advent of big data and the birth of the data markets that sell personal information, individuals' privacy is of utmost importance. The classical response is anonymization, i.e., sanitizing the information that can directly or indirectly allow users' re-identification. The most popular solution in the literature is the k-anonymity. However, it is hard to achieve k-anonymity on a continuous…
▽ More
With the advent of big data and the birth of the data markets that sell personal information, individuals' privacy is of utmost importance. The classical response is anonymization, i.e., sanitizing the information that can directly or indirectly allow users' re-identification. The most popular solution in the literature is the k-anonymity. However, it is hard to achieve k-anonymity on a continuous stream of data, as well as when the number of dimensions becomes high.In this paper, we propose a novel anonymization property called z-anonymity. Differently from k-anonymity, it can be achieved with zero-delay on data streams and it is well suited for high dimensional data. The idea at the base of z-anonymity is to release an attribute (an atomic information) about a user only if at least z - 1 other users have presented the same attribute in a past time window. z-anonymity is weaker than k-anonymity since it does not work on the combinations of attributes, but treats them individually. In this paper, we present a probabilistic framework to map the z-anonymity into the k-anonymity property. Our results show that a proper choice of the z-anonymity parameters allows the data curator to likely obtain a k-anonymized dataset, with a precisely measurable probability. We also evaluate a real use case, in which we consider the website visits of a population of users and show that z-anonymity can work in practice for obtaining the k-anonymity too.
△ Less
Submitted 14 June, 2021;
originally announced June 2021.
-
RL-IoT: Reinforcement Learning to Interact with IoT Devices
Authors:
Giulia Milan,
Luca Vassio,
Idilio Drago,
Marco Mellia
Abstract:
Our life is getting filled by Internet of Things (IoT) devices. These devices often rely on closed or poorly documented protocols, with unknown formats and semantics. Learning how to interact with such devices in an autonomous manner is the key for interoperability and automatic verification of their capabilities. In this paper, we propose RL-IoT, a system that explores how to automatically intera…
▽ More
Our life is getting filled by Internet of Things (IoT) devices. These devices often rely on closed or poorly documented protocols, with unknown formats and semantics. Learning how to interact with such devices in an autonomous manner is the key for interoperability and automatic verification of their capabilities. In this paper, we propose RL-IoT, a system that explores how to automatically interact with possibly unknown IoT devices. We leverage reinforcement learning (RL) to recover the semantics of protocol messages and to take control of the device to reach a given goal, while minimizing the number of interactions. We assume to know only a database of possible IoT protocol messages, whose semantics are however unknown. RL-IoT exchanges messages with the target IoT device, learning those commands that are useful to reach the given goal. Our results show that RL-IoT is able to solve both simple and complex tasks. With properly tuned parameters, RL-IoT learns how to perform actions with the target device, a Yeelight smart bulb in our case study, completing non-trivial patterns with as few as 400 interactions. RL-IoT paves the road for automatic interactions with poorly documented IoT protocols, thus enabling interoperable systems.
△ Less
Submitted 10 September, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
Campus Traffic and e-Learning during COVID-19 Pandemic
Authors:
Thomas Favale,
Francesca Soro,
Martino Trevisan,
Idilio Drago,
Marco Mellia
Abstract:
The COVID-19 pandemic led to the adoption of severe measures to counteract the spread of the infection. Social distancing and lockdown measures modifies people's habits, while the Internet gains a major role to support remote working, e-teaching, online collaboration, gaming, video streaming, etc. All these sudden changes put unprecedented stress on the network. In this paper we analyze the impact…
▽ More
The COVID-19 pandemic led to the adoption of severe measures to counteract the spread of the infection. Social distancing and lockdown measures modifies people's habits, while the Internet gains a major role to support remote working, e-teaching, online collaboration, gaming, video streaming, etc. All these sudden changes put unprecedented stress on the network. In this paper we analyze the impact of the lockdown enforcement on the Politecnico di Torino campus network. Right after the school shutdown on the 25th of February, PoliTO deployed its own in-house solution for virtual teaching. Ever since, the university provides about 600 virtual classes daily, serving more than 16,000 students per day. Here, we report a picture of how the pandemic changed PoliTO's network traffic. We first focus on the usage of remote working and collaborative platforms. Given the peculiarity of PoliTO in-house online teaching solution, we drill down on it, characterizing both the audience and the network footprint. Overall, we present a snapshot of the abrupt changes on campus traffic and learning due to COVID-19, and testify how the Internet has proved robust to successfully cope with challenges and maintain the university operations.
△ Less
Submitted 8 May, 2020; v1 submitted 28 April, 2020;
originally announced April 2020.
-
EXPLAIN-IT: Towards Explainable AI for Unsupervised Network Traffic Analysis
Authors:
Andrea Morichetta,
Pedro Casas,
Marco Mellia
Abstract:
The application of unsupervised learning approaches, and in particular of clustering techniques, represents a powerful exploration means for the analysis of network measurements. Discovering underlying data characteristics, grouping similar measurements together, and identifying eventual patterns of interest are some of the applications which can be tackled through clustering. Being unsupervised,…
▽ More
The application of unsupervised learning approaches, and in particular of clustering techniques, represents a powerful exploration means for the analysis of network measurements. Discovering underlying data characteristics, grouping similar measurements together, and identifying eventual patterns of interest are some of the applications which can be tackled through clustering. Being unsupervised, clustering does not always provide precise and clear insight into the produced output, especially when the input data structure and distribution are complex and difficult to grasp. In this paper we introduce EXPLAIN-IT, a methodology which deals with unlabeled data, creates meaningful clusters, and suggests an explanation to the clustering results for the end-user. EXPLAIN-IT relies on a novel explainable Artificial Intelligence (AI) approach, which allows to understand the reasons leading to a particular decision of a supervised learning-based model, additionally extending its application to the unsupervised learning domain. We apply EXPLAIN-IT to the problem of YouTube video quality classification under encrypted traffic scenarios, showing promising results.
△ Less
Submitted 3 March, 2020;
originally announced March 2020.
-
A Survey on Big Data for Network Traffic Monitoring and Analysis
Authors:
Alessandro D'Alconzo,
Idilio Drago,
Andrea Morichetta,
Marco Mellia,
Pedro Casas
Abstract:
Network Traffic Monitoring and Analysis (NTMA) represents a key component for network management, especially to guarantee the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, it becomes difficult to design scalable NTMA applications. Applications such as traffic classification and policing require…
▽ More
Network Traffic Monitoring and Analysis (NTMA) represents a key component for network management, especially to guarantee the correct operation of large-scale networks such as the Internet. As the complexity of Internet services and the volume of traffic continue to increase, it becomes difficult to design scalable NTMA applications. Applications such as traffic classification and policing require real-time and scalable approaches. Anomaly detection and security mechanisms require to quickly identify and react to unpredictable events while processing millions of heterogeneous events. At last, the system has to collect, store, and process massive sets of historical data for post-mortem analysis. Those are precisely the challenges faced by general big data approaches: Volume, Velocity, Variety, and Veracity. This survey brings together NTMA and big data. We catalog previous work on NTMA that adopt big data approaches to understand to what extent the potential of big data is being explored in NTMA. This survey mainly focuses on approaches and technologies to manage the big NTMA data, additionally briefly discussing big data analytics (e.g., machine learning) for the sake of NTMA. Finally, we provide guidelines for future work, discussing lessons learned, and research directions.
△ Less
Submitted 3 March, 2020;
originally announced March 2020.
-
IPPO: A Privacy-Aware Architecture for Decentralized Data-sharing
Authors:
Maurizio Aiello,
Enrico Cambiaso,
Roberto Canonico,
Leonardo Maccari,
Marco Mellia,
Antonio Pescapè,
Ivan Vaccari
Abstract:
Online trackers personalize ads campaigns, exponentially increasing their efficacy compared to traditional channels. The downside of this is that thousands of mostly unknown systems own our profiles and violate our privacy without our awareness. IPPO turns the table and re-empower users of their data, through anonymised data publishing via a Blockchain-based Decentralized Data Marketplace. We also…
▽ More
Online trackers personalize ads campaigns, exponentially increasing their efficacy compared to traditional channels. The downside of this is that thousands of mostly unknown systems own our profiles and violate our privacy without our awareness. IPPO turns the table and re-empower users of their data, through anonymised data publishing via a Blockchain-based Decentralized Data Marketplace. We also propose a service based on machine learning and big data analytics to automatically identify web trackers and build Privacy Labels (PLs), based on the nutrition labels concept. This paper describes the motivation, the vision, the architecture and the research challenges related to IPPO.
△ Less
Submitted 17 January, 2020;
originally announced January 2020.
-
Towards Understanding Political Interactions on Instagram
Authors:
Martino Trevisan,
Luca Vassio,
Idilio Drago,
Marco Mellia,
Fabricio Murai,
Flavio Figueiredo,
Ana Paula Couto da Silva,
Jussara M. Almeida
Abstract:
Online Social Networks (OSNs) allow personalities and companies to communicate directly with the public, bypassing filters of traditional medias. As people rely on OSNs to stay up-to-date, the political debate has moved online too. We witness the sudden explosion of harsh political debates and the dissemination of rumours in OSNs. Identifying such behaviour requires a deep understanding on how peo…
▽ More
Online Social Networks (OSNs) allow personalities and companies to communicate directly with the public, bypassing filters of traditional medias. As people rely on OSNs to stay up-to-date, the political debate has moved online too. We witness the sudden explosion of harsh political debates and the dissemination of rumours in OSNs. Identifying such behaviour requires a deep understanding on how people interact via OSNs during political debates. We present a preliminary study of interactions in a popular OSN, namely Instagram. We take Italy as a case study in the period before the 2019 European Elections. We observe the activity of top Italian Instagram profiles in different categories: politics, music, sport and show. We record their posts for more than two months, tracking "likes" and comments from users. Results suggest that profiles of politicians attract markedly different interactions than other categories. People tend to comment more, with longer comments, debating for longer time, with a large number of replies, most of which are not explicitly solicited. Moreover, comments tend to come from a small group of very active users. Finally, we witness substantial differences when comparing profiles of different parties.
△ Less
Submitted 4 May, 2021; v1 submitted 26 April, 2019;
originally announced April 2019.
-
You, the Web and Your Device: Longitudinal Characterization of Browsing Habits
Authors:
Luca Vassio,
Idilio Drago,
Marco Mellia,
Zied Ben Houidi,
Mohamed Lamine Lamali
Abstract:
Understanding how people interact with the web is key for a variety of applications, e.g., from the design of effective web pages to the definition of successful online marketing campaigns. Browsing behavior has been traditionally represented and studied by means of clickstreams, i.e., graphs whose vertices are web pages, and edges are the paths followed by users. Obtaining large and representativ…
▽ More
Understanding how people interact with the web is key for a variety of applications, e.g., from the design of effective web pages to the definition of successful online marketing campaigns. Browsing behavior has been traditionally represented and studied by means of clickstreams, i.e., graphs whose vertices are web pages, and edges are the paths followed by users. Obtaining large and representative data to extract clickstreams is however challenging. The evolution of the web questions whether browsing behavior is changing and, by consequence, whether properties of clickstreams are changing. This paper presents a longitudinal study of clickstreams in from 2013 to 2016. We evaluate an anonymized dataset of HTTP traces captured in a large ISP, where thousands of households are connected. We first propose a methodology to identify actual URLs requested by users from the massive set of requests automatically fired by browsers when rendering web pages. Then, we characterize web usage patterns and clickstreams, taking into account both the temporal evolution and the impact of the device used to explore the web. Our analyses precisely quantify various aspects of clickstreams and uncover interesting patterns, such as the typical short paths followed by people while navigating the web, the fast increasing trend in browsing from mobile devices and the different roles of search engines and social networks in promoting content. Finally, we contribute a dataset of anonymized clickstreams to the community to foster new studies (anonymized clickstreams are available to the public at http://bigdata.polito.it/clickstream).
△ Less
Submitted 4 May, 2021; v1 submitted 19 June, 2018;
originally announced June 2018.
-
Uncovering the Flop of the EU Cookie Law
Authors:
Martino Trevisan,
Stefano Traverso,
Hassan Metwalley,
Marco Mellia
Abstract:
In 2002, the European Union (EU) introduced the ePrivacy Directive to regulate the usage of online tracking technologies. Its aim is to make tracking mechanisms explicit while increasing privacy awareness in users. It mandates websites to ask for explicit consent before using any kind of profiling methodology, e.g., cookies. Starting from 2013 the Directive is mandatory, and now most of European w…
▽ More
In 2002, the European Union (EU) introduced the ePrivacy Directive to regulate the usage of online tracking technologies. Its aim is to make tracking mechanisms explicit while increasing privacy awareness in users. It mandates websites to ask for explicit consent before using any kind of profiling methodology, e.g., cookies. Starting from 2013 the Directive is mandatory, and now most of European websites embed a "Cookie Bar" to explicitly ask user's consent. To the best of our knowledge, no study focused in checking whether a website respects the Directive. For this, we engineer CookieCheck, a simple tool that makes this check automatic. We use it to run a measurement campaign on more than 35,000 websites. Results depict a dramatic picture: 65% of websites do not respect the Directive and install tracking cookies before the user is even offered the accept button. In few words, we testify the failure of the ePrivacy Directive. Among motivations, we identify the absence of rules enabling systematic auditing procedures, the lack of tools to verify its implementation by the deputed agencies, and the technical difficulties of webmasters in implementing it.
△ Less
Submitted 2 May, 2019; v1 submitted 24 May, 2017;
originally announced May 2017.
-
WeBrowse: Mining HTTP logs online for network-based content recommendation
Authors:
Giuseppe Scavo,
Zied Ben Houidi,
Stefano Traverso,
Renata Teixeira,
Marco Mellia
Abstract:
A powerful means to help users discover new content in the overwhelming amount of information available today is sharing in online communities such as social networks or crowdsourced platforms. This means comes short in the case of what we call communities of a place: people who study, live or work at the same place. Such people often share common interests but either do not know each other or fai…
▽ More
A powerful means to help users discover new content in the overwhelming amount of information available today is sharing in online communities such as social networks or crowdsourced platforms. This means comes short in the case of what we call communities of a place: people who study, live or work at the same place. Such people often share common interests but either do not know each other or fail to actively engage in submitting and relaying information. To counter this effect, we propose passive crowdsourced content discovery, an approach that leverages the passive observation of web-clicks as an indication of users' interest in a piece of content. We design, implement, and evaluate WeBrowse , a passive crowdsourced system which requires no active user engagement to promote interesting content to users of a community of a place. Instead, it extracts the URLs users visit from traffic traversing a network link to identify popular and interesting pieces of information. We first prototype WeBrowse and evaluate it using both ground-truths and real traces from a large European Internet Service Provider. Then, we deploy WeBrowse in a campus of 15,000 users, and in a neighborhood. Evaluation based on our deployments shows the feasibility of our approach. The majority of WeBrowse's users welcome the quality of content it promotes. Finally, our analysis of popular topics across different communities confirms that users in the same community of a place share common interests, compared to users from different communities, thus confirming the promise of WeBrowse's approach.
△ Less
Submitted 25 February, 2016; v1 submitted 22 February, 2016;
originally announced February 2016.
-
A First Look at Anycast CDN Traffic
Authors:
Danilo Cicalese,
Danilo Giordano,
Alessandro Finamore,
Marco Mellia,
Maurizio Munafò,
Dario Rossi,
Diana Joumblatt
Abstract:
Anycast routing is an IP solution that allows packets to be routed to the topologically nearest server. Over the last years it has been commonly adopted to manage some services running on top of UDP, e.g., public DNS resolvers, multicast rendez-vous points, etc. However, recently the Internet have witnessed the growth of new Anycast-enabled Content Delivery Networks (A-CDNs) such as CloudFlare and…
▽ More
Anycast routing is an IP solution that allows packets to be routed to the topologically nearest server. Over the last years it has been commonly adopted to manage some services running on top of UDP, e.g., public DNS resolvers, multicast rendez-vous points, etc. However, recently the Internet have witnessed the growth of new Anycast-enabled Content Delivery Networks (A-CDNs) such as CloudFlare and EdgeCast, which provide their web services (i.e., TCP traffic) entirely through anycast.
To the best of our knowledge, little is known in the literature about the nature and the dynamic of such traffic. For instance, since anycast depends on the routing, the question is how stable are the paths toward the nearest server. To bring some light on this question, in this work we provide a first look at A-CDNs traffic by combining active and passive measurements. In particular, building upon our previous work, we use active measurements to identify and geolocate A-CDNs caches starting from a large set of IP addresses related to the top-100k Alexa websites. We then look at the traffic of those caches in the wild using a large passive dataset collected from a European ISP.
We find that several A-CDN servers are encountered on a daily basis when browsing the Internet. Routes to A-CDN servers are very stable, with few changes that are observed on a monthly-basis (in contrast to more the dynamic traffic policies of traditional CDNs). Overall, A-CDNs are a reality worth further investigations.
△ Less
Submitted 12 March, 2021; v1 submitted 5 May, 2015;
originally announced May 2015.
-
YouLighter: An Unsupervised Methodology to Unveil YouTube CDN Changes
Authors:
Danilo Giordano,
Stefano Traverso,
Luigi Grimaudo,
Marco Mellia,
Elena Baralis,
Alok Tongaonkar,
Sabyasachi Saha
Abstract:
YouTube relies on a massively distributed Content Delivery Network (CDN) to stream the billions of videos in its catalogue. Unfortunately, very little information about the design of such CDN is available. This, combined with the pervasiveness of YouTube, poses a big challenge for Internet Service Providers (ISPs), which are compelled to optimize end-users' Quality of Experience (QoE) while having…
▽ More
YouTube relies on a massively distributed Content Delivery Network (CDN) to stream the billions of videos in its catalogue. Unfortunately, very little information about the design of such CDN is available. This, combined with the pervasiveness of YouTube, poses a big challenge for Internet Service Providers (ISPs), which are compelled to optimize end-users' Quality of Experience (QoE) while having no control on the CDN decisions.
This paper presents YouLighter, an unsupervised technique to identify changes in the YouTube CDN. YouLighter leverages only passive measurements to cluster co-located identical caches into edge-nodes. This automatically unveils the structure of YouTube's CDN. Further, we propose a new metric, called Constellation Distance, that compares the clustering obtained from two different time snapshots, to pinpoint sudden changes. While several approaches allow comparison between the clustering results from the same dataset, no technique allows to measure the similarity of clusters from different datasets. Hence, we develop a novel methodology, based on the Constellation Distance, to solve this problem.
By running YouLighter over 10-month long traces obtained from two ISPs in different countries, we pinpoint both sudden changes in edge-node allocation, and small alterations to the cache allocation policies which actually impair the QoE that the end-users perceive.
△ Less
Submitted 18 March, 2015;
originally announced March 2015.
-
CrowdSurf: Empowering Informed Choices in the Web
Authors:
Hassan Metwalley,
Stefano Traverso,
Marco Mellia,
Stanislav Miskovic,
Mario Baldi
Abstract:
When surfing the Internet, individuals leak personal and corporate information to third parties whose (legitimate or not) businesses revolve around the value of collected data. The implications are serious, from a person unwillingly exposing private information to an unknown third party, to a company unable to manage the flow of its information to the outside world. The point is that individuals a…
▽ More
When surfing the Internet, individuals leak personal and corporate information to third parties whose (legitimate or not) businesses revolve around the value of collected data. The implications are serious, from a person unwillingly exposing private information to an unknown third party, to a company unable to manage the flow of its information to the outside world. The point is that individuals and companies are more and more kept out of the loop when it comes to control private data. With the goal of empowering informed choices in information leakage through the Internet, we propose CROWDSURF, a system for comprehensive and collaborative auditing of data that flows to Internet services. Similarly to open-source efforts, we enable users to contribute in building awareness and control over privacy and communication vulnerabilities. CROWDSURF provides the core infrastructure and algorithms to let individuals and enterprises regain control on the information exposed on the web. We advocate CROWDSURF as a data processing layer positioned right below HTTP in the host protocol stack. This enables the inspection of clear-text data even when HTTPS is deployed and the application of processing rules that are customizable to fit any need. Preliminary results obtained executing a prototype implementation on ISP traffic traces demonstrate the feasibility of CROWDSURF.
△ Less
Submitted 25 February, 2015;
originally announced February 2015.