
Learning Cellular Network Connection Quality with Conformal

Authors: Hanyang Jiang, Elizabeth Belding, Ellen Zegura, Yao Xie

Abstract: In this paper, we address the problem of uncertainty quantification for cellular network speed. It is a well-known fact that the actual internet speed experienced by a mobile phone can fluctuate significantly, even when remaining in a single location. This high degree of variability underscores that mere point estimation of network speed is insufficient. Rather, it is advantageous to establish a prediction interval that can encompass the expected range of speed variations. In order to build an accurate network estimation map, numerous mobile data need to be collected at different locations. Currently, public datasets rely on users to upload data through apps. Although massive data has been collected, the datasets suffer from significant noise due to the nature of cellular networks and various other factors. Additionally, the uneven distribution of population density affects the spatial consistency of data collection, leading to substantial uncertainty in the network quality maps derived from this data. We focus our analysis on large-scale internet-quality datasets provided by Ookla to construct an estimated map of connection quality. To improve the reliability of this map, we introduce a novel conformal prediction technique to build an uncertainty map. We identify regions with heightened uncertainty to prioritize targeted, manual data collection. In addition, the uncertainty map quantifies how reliable the prediction is in different areas. Our method also leads to a sampling strategy that guides researchers to selectively gather high-quality data that best complement the current dataset to improve the overall accuracy of the prediction model.

Submitted 4 June, 2024; originally announced July 2024.

Comments: arXiv admin note: text overlap with arXiv:2311.05641

arXiv:2405.18657 [pdf, other]

The Efficacy of the Connect America Fund in Addressing US Internet Access Inequities

Authors: Haarika Manda, Varshika Srinivasavaradhan, Laasya Koduru, Kevin Zhang, Xuanhe Zhou, Udit Paul, Elizabeth Belding, Arpit Gupta, Tejas N. Narechania

Abstract: Residential fixed broadband internet access in the United States (US) has long been distributed inequitably, drawing significant attention from researchers and policymakers. This paper evaluates the efficacy of the Connect America Fund (CAF), a key policy intervention aimed at addressing disparities in US internet access. CAF subsidizes the creation of new regulated broadband monopolies in underserved areas, aiming to provide comparable internet access, in terms of price and speed, to that available in urban regions. Oversight of CAF largely relies on data self-reported by internet service providers (ISPs), which is often questionable. We use the broadband-plan querying tool (BQT) to curate a novel dataset that complements ISP-reported information with ISP-advertised broadband plan details (download speed and monthly cost) on publicly accessible websites. Specifically, we query advertised broadband plans for 687k residential addresses across 15 states, certified as served by ISPs to regulators. Our analysis reveals significant discrepancies between ISP-reported data and actual broadband availability. We find that the serviceability rate -- defined as the fraction of addresses ISPs actively serve out of the total queried, weighted by the number of CAF addresses in a census block group -- is only 55%, dropping to as low as 18% in some states. Additionally, the compliance rate -- defined as the weighted fraction of addresses where ISPs actively serve and advertise download speeds above the FCC's 10 Mbps threshold -- is only 33%. We also observe that in a subset of census blocks, CAF-funded addresses receive higher broadband speeds than their monopoly-served neighbors. These results indicate that while a few users have benefited from this multi-billion dollar program, it has largely failed to achieve its intended goal, leaving many targeted rural communities with inadequate or no broadband connectivity.

Submitted 12 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

arXiv:2311.05641 [pdf, other]

Mobile Internet Quality Estimation using Self-Tuning Kernel Regression

Authors: Hanyang Jiang, Henry Shaowu Yuchi, Elizabeth Belding, Ellen Zegura, Yao Xie

Abstract: Modeling and estimation for spatial data are ubiquitous in real life, frequently appearing in weather forecasting, pollution detection, and agriculture. Spatial data analysis often involves processing datasets of enormous scale. In this work, we focus on large-scale internet-quality open datasets from Ookla. We look into estimating mobile (cellular) internet quality at the scale of a state in the United States. In particular, we aim to conduct estimation based on highly imbalanced data: Most of the samples are concentrated in limited areas, while very few are available in the rest, posing significant challenges to modeling efforts. We propose a new adaptive kernel regression approach that employs self-tuning kernels to alleviate the adverse effects of data imbalance in this problem. Through comparative experimentation on two distinct mobile network measurement datasets, we demonstrate that the proposed self-tuning kernel regression method produces more accurate predictions, with the potential to be applied in other applications.

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.16136 [pdf, other]

Analyzing Disparity and Temporal Progression of Internet Quality through Crowdsourced Measurements with Bias-Correction

Authors: Hyeongseong Lee, Udit Paul, Arpit Gupta, Elizabeth Belding, Mengyang Gu

Abstract: Crowdsourced speedtest measurements are an important tool for studying internet performance from the end user perspective. Nevertheless, despite the accuracy of individual measurements, simplistic aggregation of these data points is problematic due to their intrinsic sampling bias. In this work, we utilize a dataset of nearly 1 million individual Ookla Speedtest measurements, correlate each datapoint with 2019 Census demographic data, and develop new methods to present a novel analysis to quantify regional sampling bias and the relationship of internet performance to demographic profile. We find that the crowdsourced Ookla Speedtest data points contain significant sampling bias across different census block groups based on a statistical test of homogeneity. We introduce two methods to correct the regional bias by the population of each census block group. Whereas the sampling bias leads to a small discrepancy in the overall cumulative distribution function of internet speed in a city between estimation from original samples and bias-corrected estimation, the discrepancy is much smaller compared to the size of the sampling heterogeneity across regions. Further, we show that the sampling bias is strongly associated with a few demographic variables, such as income, education level, age, and ethnic distribution. Through regression analysis, we find that regions with higher income, younger populations, and lower representation of Hispanic residents tend to measure faster internet speeds along with substantial collinearity amongst socioeconomic attributes and ethnic composition. Finally, we find that average internet speed increases over time based on both linear and nonlinear analysis from state space models, though the regional sampling bias may result in a small overestimation of the temporal increase of internet speed.

Submitted 7 December, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

arXiv:2309.00686 [pdf]

Watching Stars in Pixels: The Interplay of Traffic Shaping and YouTube Streaming QoE over GEO Satellite Networks

Authors: Jiamo Liu, David Lerner, Jae Chung, Udit Paul, Arpit Gupta, Elizabeth Belding

Abstract: Geosynchronous satellite (GEO) networks are a crucial option for users beyond terrestrial connectivity. However, unlike terrestrial networks, GEO networks exhibit high latency and deploy TCP proxies and traffic shapers. The deployment of proxies effectively mitigates the impact of high network latency in GEO networks, while traffic shapers help realize customer-controlled data-saver options that optimize data usage. It is unclear how the interplay between GEO networks' high latency, TCP proxies, and traffic-shaping policies affects the quality of experience (QoE) for commonly used video applications. To fill this gap, we analyze the quality of over 2k YouTube video sessions streamed across a production GEO network with a 900 Kbps shaping rate. Given the average bit rates for the selected videos, we expected seamless streaming at 360p or lower resolutions. However, our analysis reveals that this is not the case: 28% of TCP sessions and 18% of gQUIC sessions experience rebuffering events, while the median average resolution is only 380p for TCP and 299p for gQUIC. Our analysis identifies two key factors contributing to sub-optimal performance: (i) unlike TCP, gQUIC only utilizes 63% of network capacity; and (ii) YouTube's imperfect chunk request pipelining. As a result of our study, the partner GEO ISP discontinued support for the low-bandwidth data-saving option in U.S. business and residential markets to avoid potential degradation of video quality -- highlighting the practical significance of our findings.

Submitted 1 September, 2023; originally announced September 2023.

arXiv:2302.14216 [pdf]

Decoding the Divide: Analyzing Disparities in Broadband Plans Offered by Major US ISPs

Authors: Udit Paul, Vinothini Gunasekaran, Jiamo Liu, Tejas N. Narechania, Arpit Gupta, Elizabeth Belding

Abstract: Digital equity in Internet access is often measured along three axes: availability, affordability, and adoption. Most prior work focuses on availability; the other two aspects have received little attention. In this paper, we study broadband affordability in the US. Specifically, we focus on the nature of broadband plans offered by major ISPs across the US. To this end, we develop a broadband plan querying tool (BQT) that obtains broadband plans (upload/download speed and price) offered by seven major ISPs for any street address in the US. We then use this tool to curate a dataset, querying broadband plans for over 837k street addresses in thirty cities for seven major wireline broadband ISPs. We use a plan's carriage value, the Mbps of a user's traffic that an ISP carries for one dollar, to compare plans. Our analysis provides us with the following new insights: (1) ISP plans vary inter-city. Specifically, the fraction of census block groups that receive high and low carriage value plans varies widely by city; (2) ISP plans intra-city are spatially clustered, and the carriage value can vary as much as 600% within a city; (3) Cable-based ISPs offer up to 30% more carriage value to users when competing with fiber-based ISPs in a block group; and (4) Average income in a block group plays a critical role in dictating who gets a fiber deployment (i.e., a better carriage value) in the US. While we hope our tool, dataset, and analysis in their current form are helpful for policymakers at different levels (city, county, state), they are only a small step toward understanding digital equity. Based on our learnings, we conclude with recommendations to continue to advance our understanding of broadband affordability.

Submitted 27 February, 2023; originally announced February 2023.

Comments: 15 Pages

arXiv:2110.12038 [pdf, ps, other]

Characterizing Performance Inequity Across U.S. Ookla Speedtest Users

Authors: Udit Paul, Jiamo Liu, Vivek Adarsh, Mengyang Gu, Arpit Gupta, Elizabeth Belding

Abstract: The Internet has become indispensable to daily activities, such as work, education and health care. Many of these activities require Internet access data rates that support real-time video conferencing. However, digital inequality persists across the United States, not only in who has access but in the quality of that access. Speedtest by Ookla allows users to run network diagnostic tests to better understand the current performance of their network. In this work, we leverage an Internet performance dataset from Ookla, together with an ESRI demographic dataset, to conduct a comprehensive analysis that characterizes performance differences between Speedtest users across the U.S. Our analysis shows that median download speeds for Speedtest users can differ by over 150Mbps between states. Further, there are important distinctions between user categories. For instance, all but one state showed statistically significant differences in performance between Speedtest users in urban and rural areas. The difference also exists in urban areas between high and low income users in 27 states. Our analysis reveals that states that demonstrate this disparity in Speedtest results are geographically bigger, more populous and have a wider dispersion of median household income. We conclude by highlighting several challenges to the complex problem space of digital inequality characterization and provide recommendations for furthering research on this topic.

Submitted 22 October, 2021; originally announced October 2021.

Comments: 10 pages, 5 figures, 3 tables

arXiv:2102.07288 [pdf, other]

A Tale of Three Datasets: Towards Characterizing Mobile Broadband Access in the United States

Authors: Tarun Mangla, Esther Showalter, Vivek Adarsh, Kipp Jones, Morgan Vigil-Hayes, Elizabeth Belding, Ellen Zegura

Abstract: Understanding and improving mobile broadband deployment is critical to bridging the digital divide and targeting future investments. Yet accurately mapping mobile coverage is challenging. In 2019, the Federal Communications Commission (FCC) released a report on the progress of mobile broadband deployment in the United States. This report received a significant amount of criticism with claims that the cellular coverage, mainly available through Long-Term Evolution (LTE), was over-reported in some areas, especially those that are rural and/or tribal [12]. We evaluate the validity of this criticism using a quantitative analysis of both the dataset from which the FCC based its report and a crowdsourced LTE coverage dataset. Our analysis is focused on the state of New Mexico, a region characterized by diverse mix of demographics-geography and poor broadband access. We then performed a controlled measurement campaign in northern New Mexico during May 2019. Our findings reveal significant disagreement between the crowdsourced dataset and the FCC dataset regarding the presence of LTE coverage in rural and tribal census blocks, with the FCC dataset reporting higher coverage than the crowdsourced dataset. Interestingly, both the FCC and the crowdsourced data report higher coverage compared to our on-the-ground measurements. Based on these findings, we discuss our recommendations for improved LTE coverage measurements, whose importance has only increased in the COVID-19 era of performing work and school from home, especially in rural and tribal areas.

Submitted 14 February, 2021; originally announced February 2021.

Comments: 9 pages, 4 figures, submitted to: Communications of the ACM - Contributed Article

arXiv:2005.07926 [pdf, other]

Measuring and Characterizing Hate Speech on News Websites

Authors: Savvas Zannettou, Mai ElSherief, Elizabeth Belding, Shirin Nilizadeh, Gianluca Stringhini

Abstract: The Web has become the main source for news acquisition. At the same time, news discussion has become more social: users can post comments on news articles or discuss news articles on other platforms like Reddit. These features empower and enable discussions among the users; however, they also act as the medium for the dissemination of toxic discourse and hate speech. The research community lacks a general understanding on what type of content attracts hateful discourse and the possible effects of social networks on the commenting activity on news articles. In this work, we perform a large-scale quantitative analysis of 125M comments posted on 412K news articles over the course of 19 months. We analyze the content of the collected articles and their comments using temporal analysis, user-based analysis, and linguistic analysis, to shed light on what elements attract hateful comments on news articles. We also investigate commenting activity when an article is posted on either 4chan's Politically Incorrect board (/pol/) or six selected subreddits. We find statistically significant increases in hateful commenting activity around real-world divisive events like the "Unite the Right" rally in Charlottesville and political events like the second and third 2016 US presidential debates. Also, we find that articles that attract a substantial number of hateful comments have different linguistic characteristics when compared to articles that do not attract hateful comments. Furthermore, we observe that the posting of a news article on either /pol/ or the six subreddits is correlated with an increase of (hateful) commenting activity on the news article.

Submitted 16 May, 2020; originally announced May 2020.

Comments: Accepted at WebSci'20

arXiv:1911.03642 [pdf, other]

Towards Understanding Gender Bias in Relation Extraction

Authors: Andrew Gaut, Tony Sun, Shirlyn Tang, Yuxin Huang, Jing Qian, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, William Yang Wang

Abstract: Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction (AKBC). While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to our knowledge to evaluate social biases in NRE systems. We create WikiGenderBias, a distantly supervised dataset with a human annotated test set. WikiGenderBias has sentences specifically curated to analyze gender bias in relation extraction systems. We use WikiGenderBias to evaluate systems for bias and find that NRE systems exhibit gender biased predictions and lay groundwork for future evaluation of bias in NRE. We also analyze how name anonymization, hard debiasing for word embeddings, and counterfactual data augmentation affect gender bias in predictions and performance.

Submitted 8 August, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

arXiv:1909.04251 [pdf, other]

A Benchmark Dataset for Learning to Intervene in Online Hate Speech

Authors: Jing Qian, Anna Bethke, Yinyin Liu, Elizabeth Belding, William Yang Wang

Abstract: Countering online hate speech is a critical yet challenging task, but one which can be aided by the use of Natural Language Processing (NLP) techniques. Previous research has primarily focused on the development of NLP methods to automatically and effectively detect online hate speech while disregarding further action needed to calm and discourage individuals from using hate speech in the future. In addition, most existing hate speech datasets treat each post as an isolated instance, ignoring the conversational context. In this paper, we propose a novel task of generative hate speech intervention, where the goal is to automatically generate responses to intervene during online conversations that contain hate speech. As a part of this work, we introduce two fully-labeled large-scale hate speech intervention datasets collected from Gab and Reddit. These datasets provide conversation segments, hate speech labels, as well as intervention responses written by Mechanical Turk Workers. In this paper, we also analyze the datasets to understand the common intervention strategies and explore the performance of common automatic response generation methods on these new datasets to provide a benchmark for future research.

Submitted 9 September, 2019; originally announced September 2019.

arXiv:1906.08976 [pdf]

Mitigating Gender Bias in Natural Language Processing: Literature Review

Authors: Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, William Yang Wang

Abstract: As Natural Language Processing (NLP) and Machine Learning (ML) tools rise in popularity, it becomes increasingly vital to recognize the role they play in shaping societal biases and stereotypes. Although NLP models have shown success in modeling various applications, they propagate and may even amplify gender bias found in text corpora. While the study of bias in artificial intelligence is not new, methods to mitigate gender bias in NLP are relatively nascent. In this paper, we review contemporary studies on recognizing and mitigating gender bias in NLP. We discuss gender bias based on four forms of representation bias and analyze methods recognizing gender bias. Furthermore, we discuss the advantages and drawbacks of existing gender debiasing methods. Finally, we discuss future studies for recognizing and mitigating gender bias in NLP.

Submitted 21 June, 2019; originally announced June 2019.

Comments: Accepted to ACL 2019

arXiv:1904.02418 [pdf, other]

Learning to Decipher Hate Symbols

Authors: Jing Qian, Mai ElSherief, Elizabeth Belding, William Yang Wang

Abstract: Existing computational models to understand hate speech typically frame the problem as a simple classification task, bypassing the understanding of hate symbols (e.g., 14 words, kigy) and their secret connotations. In this paper, we propose a novel task of deciphering hate symbols. To do this, we leverage the Urban Dictionary and collected a new, symbol-rich Twitter corpus of hate speech. We investigate neural network latent context models for deciphering hate symbols. More specifically, we study Sequence-to-Sequence models and show how they are able to crack the ciphers based on context. Furthermore, we propose a novel Variational Decipher and show how it can generalize better to unseen hate symbols in a more challenging testing setting.

Submitted 4 April, 2019; originally announced April 2019.

arXiv:1811.01147 [pdf, other]

SafeRoute: Learning to Navigate Streets Safely in an Urban Environment

Authors: Sharon Levy, Wenhan Xiong, Elizabeth Belding, William Yang Wang

Abstract: Recent studies show that 85% of women have changed their traveled route to avoid harassment and assault. Despite this, current mapping tools do not empower users with information to take charge of their personal safety. We propose SafeRoute, a novel solution to the problem of navigating cities and avoiding street harassment and crime. Unlike other street navigation applications, SafeRoute introduces a new type of path generation via deep reinforcement learning. This enables us to successfully optimize for multi-criteria path-finding and incorporate representation learning within our framework. Our agent learns to pick favorable streets to create a safe and short path with a reward function that incorporates safety and efficiency. Given access to recent crime reports in many urban cities, we train our model for experiments in Boston, New York, and San Francisco. We test our model on areas of these cities, specifically the populated downtown regions where tourists and those unfamiliar with the streets walk. We evaluate SafeRoute and successfully improve over state-of-the-art methods by up to 17% in local average distance from crimes while decreasing path length by up to 7%.

Submitted 2 November, 2018; originally announced November 2018.

Comments: 8 pages

arXiv:1809.00088 [pdf, other]

Hierarchical CVAE for Fine-Grained Hate Speech Classification

Authors: Jing Qian, Mai ElSherief, Elizabeth Belding, William Yang Wang

Abstract: Existing work on automated hate speech detection typically focuses on binary classification or on differentiating among a small set of categories. In this paper, we propose a novel method on a fine-grained hate speech classification task, which focuses on differentiating among 40 hate groups of 13 different hate group categories. We first explore the Conditional Variational Autoencoder (CVAE) as a discriminative model and then extend it to a hierarchical architecture to utilize the additional hate category information for more accurate prediction. Experimentally, we show that incorporating the hate category information for training can significantly improve the classification performance and our proposed model outperforms commonly-used discriminative models.

Submitted 31 August, 2018; originally announced September 2018.

arXiv:1804.04649 [pdf, other]

Peer to Peer Hate: Hate Speech Instigators and Their Targets

Authors: Mai ElSherief, Shirin Nilizadeh, Dana Nguyen, Giovanni Vigna, Elizabeth Belding

Abstract: While social media has become an empowering agent to individual voices and freedom of expression, it also facilitates anti-social behaviors including online harassment, cyberbullying, and hate speech. In this paper, we present the first comparative study of hate speech instigators and target users on Twitter. Through a multi-step classification process, we curate a comprehensive hate speech dataset capturing various types of hate. We study the distinctive characteristics of hate instigators and targets in terms of their profile self-presentation, activities, and online visibility. We find that hate instigators target more popular and high profile Twitter users, and that participating in hate speech can result in greater online visibility. We conduct a personality analysis of hate instigators and targets and show that both groups have eccentric personality facets that differ from the general Twitter population. Our results advance the state of the art of understanding online hate speech engagement.

Submitted 12 April, 2018; originally announced April 2018.

Journal ref: ICWSM 2018

arXiv:1804.04257 [pdf, other]

Hate Lingo: A Target-based Linguistic Analysis of Hate Speech in Social Media

Authors: Mai ElSherief, Vivek Kulkarni, Dana Nguyen, William Yang Wang, Elizabeth Belding

Abstract: While social media empowers freedom of expression and individual voices, it also enables anti-social behavior, online harassment, cyberbullying, and hate speech. In this paper, we deepen our understanding of online hate speech by focusing on a largely neglected but crucial aspect of hate speech -- its target: either "directed" towards a specific person or entity, or "generalized" towards a group of people sharing a common protected characteristic. We perform the first linguistic and psycholinguistic analysis of these two forms of hate speech and reveal the presence of interesting markers that distinguish these types of hate speech. Our analysis reveals that Directed hate speech, in addition to being more personal and directed, is more informal, angrier, and often explicitly attacks the target (via name calling) with fewer analytic words and more words suggesting authority and influence. Generalized hate speech, on the other hand, is dominated by religious hate, is characterized by the use of lethal words such as murder, exterminate, and kill; and quantity words such as million and many. Altogether, our work provides a data-driven analysis of the nuances of online-hate speech that enables not only a deepened understanding of hate speech and its social implications but also its detection.

Submitted 11 April, 2018; originally announced April 2018.

Comments: 10 pages, 7 figures. ICWSM-2018 accepted

arXiv:1804.03124 [pdf, other]

Leveraging Intra-User and Inter-User Representation Learning for Automated Hate Speech Detection

Authors: Jing Qian, Mai ElSherief, Elizabeth M. Belding, William Yang Wang

Abstract: Hate speech detection is a critical, yet challenging problem in Natural Language Processing (NLP). Despite the existence of numerous studies dedicated to the development of NLP hate speech detection approaches, the accuracy is still poor. The central problem is that social media posts are short and noisy, and most existing hate speech detection solutions take each post as an isolated input instance, which is likely to yield high false positive and negative rates. In this paper, we radically improve automated hate speech detection by presenting a novel model that leverages intra-user and inter-user representation learning for robust hate speech detection on Twitter. In addition to the target Tweet, we collect and analyze the user's historical posts to model intra-user Tweet representations. To suppress the noise in a single Tweet, we also model the similar Tweets posted by all other users with reinforced inter-user representation learning techniques. Experimentally, we show that leveraging these two representations can significantly improve the f-score of a strong bidirectional LSTM baseline model by 10.1%.

Submitted 13 September, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

arXiv:1705.02004 [pdf]

A Rural Lens on a Research Agenda for Intelligent Infrastructure

Authors: Ellen Zegura, Beki Grinter, Elizabeth Belding, Klara Nahrstedt

Abstract: A National Agenda for Intelligent Infrastructure is not complete without explicit consideration of the needs of rural communities. While the American population has urbanized, the United States depends on rural communities for agriculture, fishing, forestry, manufacturing and mining. Approximately 20% of the US population lives in rural areas with a skew towards aging adults. Further, nearly 25% of Veterans live in rural America. And yet, when intelligent infrastructure is imagined, it is often done so with implicit or explicit bias towards cities. In this brief we describe the unique opportunities for rural communities and offer an inclusive vision of intelligent infrastructure research. In this paper, we argue for a set of coordinated actions to ensure that rural Americans are not left behind in this digital revolution. These technological platforms and applications, supported by appropriate policy, will address key issues in transportation, energy, agriculture, public safety and health. We believe that rather than being a set of needs, the rural United States presents a set of exciting possibilities for novel innovation benefiting not just those living there, but the American economy more broadly.

Submitted 4 May, 2017; originally announced May 2017.

Comments: A Computing Community Consortium (CCC) white paper, 6 pages
