subscribe to arXiv mailings

Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech

Authors: Neemesh Yadav, Sarah Masud, Vikram Goyal, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty

Abstract: Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflic… ▽ More Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: 17 Pages, 5 Figures, 13 Tables, ACL Findings 2024

arXiv:2402.02144 [pdf, other]

Probing Critical Learning Dynamics of PLMs for Hate Speech Detection

Authors: Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty

Abstract: Despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs' use in hate speech detection. We deep dive into comparing different pretrai… ▽ More Despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs' use in hate speech detection. We deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection. △ Less

Submitted 3 February, 2024; originally announced February 2024.

Comments: 20 pages, 9 figures, 14 tables. Accepted at EACL'24

arXiv:2311.09834 [pdf, other]

Overview of the HASOC Subtrack at FIRE 2023: Identification of Tokens Contributing to Explicit Hate in English by Span Detection

Authors: Sarah Masud, Mohammad Aflah Khan, Md. Shad Akhtar, Tanmoy Chakraborty

Abstract: As hate speech continues to proliferate on the web, it is becoming increasingly important to develop computational methods to mitigate it. Reactively, using black-box models to identify hateful content can perplex users as to why their posts were automatically flagged as hateful. On the other hand, proactive mitigation can be achieved by suggesting rephrasing before a post is made public. However,… ▽ More As hate speech continues to proliferate on the web, it is becoming increasingly important to develop computational methods to mitigate it. Reactively, using black-box models to identify hateful content can perplex users as to why their posts were automatically flagged as hateful. On the other hand, proactive mitigation can be achieved by suggesting rephrasing before a post is made public. However, both mitigation techniques require information about which part of a post contains the hateful aspect, i.e., what spans within a text are responsible for conveying hate. Better detection of such spans can significantly reduce explicitly hateful content on the web. To further contribute to this research area, we organized HateNorm at HASOC-FIRE 2023, focusing on explicit span detection in English Tweets. A total of 12 teams participated in the competition, with the highest macro-F1 observed at 0.58. △ Less

Submitted 16 November, 2023; originally announced November 2023.

Comments: 8 pages, 1 figure, 4 Tables

arXiv:2309.11896 [pdf, other]

Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit Hate Speech Detection

Authors: Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty

Abstract: Although pre-trained large language models (PLMs) have achieved state-of-the-art on many NLP tasks, they lack understanding of subtle expressions of implicit hate speech. Such nuanced and implicit hate is often misclassified as non-hate. Various attempts have been made to enhance the detection of (implicit) hate content by augmenting external context or enforcing label separation via distance-base… ▽ More Although pre-trained large language models (PLMs) have achieved state-of-the-art on many NLP tasks, they lack understanding of subtle expressions of implicit hate speech. Such nuanced and implicit hate is often misclassified as non-hate. Various attempts have been made to enhance the detection of (implicit) hate content by augmenting external context or enforcing label separation via distance-based metrics. We combine these two approaches and introduce FiADD, a novel Focused Inferential Adaptive Density Discrimination framework. FiADD enhances the PLM finetuning pipeline by bringing the surface form of an implicit hate speech closer to its implied form while increasing the inter-cluster distance among various class labels. We test FiADD on three implicit hate datasets and observe significant improvement in the two-way and three-way hate classification tasks. We further experiment on the generalizability of FiADD on three other tasks, namely detecting sarcasm, irony, and stance, in which surface and implied forms differ, and observe similar performance improvement. We analyze the generated latent space to understand its evolution under FiADD, which corroborates the advantage of employing FiADD for implicit hate speech detection. △ Less

Submitted 21 September, 2023; originally announced September 2023.

Comments: 21 pages, 6 Figures and 9 Tables

arXiv:2306.01105 [pdf, other]

Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Authors: Atharva Kulkarni, Sarah Masud, Vikram Goyal, Tanmoy Chakraborty

Abstract: Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the… ▽ More Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the real-world variability of hate warrants further investigation. To this end, we present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. GOTHate is neutrally seeded, encompassing different languages and topics. We conduct detailed comparisons of GOTHate with the existing hate speech datasets, highlighting its novelty. We benchmark it with 10 recent baselines. Our extensive empirical and benchmarking experiments suggest that GOTHate is hard to classify in a text-only setup. Thus, we investigate how adding endogenous signals enhances the hate speech detection task. We augment GOTHate with the user's timeline information and ego network, bringing the overall data source closer to the real-world setup for understanding hateful content. Our proposed solution HEN-mBERT is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals from history, topology, and exemplars. HEN-mBERT transcends the best baseline by 2.5% and 5% in overall macro-F1 and hate class F1, respectively. Inspired by our experiments, in partnership with Wipro AI, we are developing a semi-automated pipeline to detect hateful content as a part of their mission to tackle online harm. △ Less

Submitted 15 June, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 15 pages, 4 figures, 11 tables. Accepted at SIGKDD'23

arXiv:2302.07964 [pdf, other]

On Rank Energy Statistics via Optimal Transport: Continuity, Convergence, and Change Point Detection

Authors: Matthew Werenski, Shoaib Bin Masud, James M. Murphy, Shuchin Aeron

Abstract: This paper considers the use of recently proposed optimal transport-based multivariate test statistics, namely rank energy and its variant the soft rank energy derived from entropically regularized optimal transport, for the unsupervised nonparametric change point detection (CPD) problem. We show that the soft rank energy enjoys both fast rates of statistical convergence and robust continuity prop… ▽ More This paper considers the use of recently proposed optimal transport-based multivariate test statistics, namely rank energy and its variant the soft rank energy derived from entropically regularized optimal transport, for the unsupervised nonparametric change point detection (CPD) problem. We show that the soft rank energy enjoys both fast rates of statistical convergence and robust continuity properties which lead to strong performance on real datasets. Our theoretical analyses remove the need for resampling and out-of-sample extensions previously required to obtain such rates. In contrast the rank energy suffers from the curse of dimensionality in statistical estimation and moreover can signal a change point from arbitrarily small perturbations, which leads to a high rate of false alarms in CPD. Additionally, under mild regularity conditions, we quantify the discrepancy between soft rank energy and rank energy in terms of the regularization parameter. Finally, we show our approach performs favorably in numerical experiments compared to several other optimal transport-based methods as well as maximum mean discrepancy. △ Less

Submitted 15 February, 2023; originally announced February 2023.

Comments: 36 pages, 5 figures

arXiv:2206.04007 [pdf, other]

Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization

Authors: Sarah Masud, Manjot Bedi, Mohammad Aflah Khan, Md Shad Akhtar, Tanmoy Chakraborty

Abstract: Curbing online hate speech has become the need of the hour; however, a blanket ban on such activities is infeasible for several geopolitical and cultural reasons. To reduce the severity of the problem, in this paper, we introduce a novel task, hate speech normalization, that aims to weaken the intensity of hatred exhibited by an online post. The intention of hate speech normalization is not to sup… ▽ More Curbing online hate speech has become the need of the hour; however, a blanket ban on such activities is infeasible for several geopolitical and cultural reasons. To reduce the severity of the problem, in this paper, we introduce a novel task, hate speech normalization, that aims to weaken the intensity of hatred exhibited by an online post. The intention of hate speech normalization is not to support hate but instead to provide the users with a stepping stone towards non-hate while giving online platforms more time to monitor any improvement in the user's behavior. To this end, we manually curated a parallel corpus - hate texts and their normalized counterparts (a normalized text is less hateful and more benign). We introduce NACL, a simple yet efficient hate speech normalization model that operates in three stages - first, it measures the hate intensity of the original sample; second, it identifies the hate span(s) within it; and finally, it reduces hate intensity by paraphrasing the hate spans. We perform extensive experiments to measure the efficacy of NACL via three-way evaluation (intrinsic, extrinsic, and human-study). We observe that NACL outperforms six baselines - NACL yields a score of 0.1365 RMSE for the intensity prediction, 0.622 F1-score in the span identification, and 82.27 BLEU and 80.05 perplexity for the normalized text generation. We further show the generalizability of NACL across other platforms (Reddit, Facebook, Gab). An interactive prototype of NACL was put together for the user study. Further, the tool is being deployed in a real-world setting at Wipro AI as a part of its mission to tackle harmful content on online platforms. △ Less

Submitted 8 June, 2022; originally announced June 2022.

Comments: 11 pages, 4 figures, 12 tables. Accepted at KDD 2022 (ADS Track)

arXiv:2202.00126 [pdf, other]

Handling Bias in Toxic Speech Detection: A Survey

Authors: Tanmay Garg, Sarah Masud, Tharun Suresh, Tanmoy Chakraborty

Abstract: Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors such as the context, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can thus lead to a sidelining of the various groups… ▽ More Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors such as the context, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can thus lead to a sidelining of the various groups they aim to help in the first place. It has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, complete literature is chaotic in its terminologies, techniques, and findings. In this paper, we put together a systematic study of the limitations and challenges of existing methods for mitigating bias in toxicity detection. We look closely at proposed methods for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study to introduce the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of the critical challenges, research gaps, and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models. △ Less

Submitted 15 January, 2023; v1 submitted 26 January, 2022; originally announced February 2022.

Comments: Accepted in ACM Computing Surveys, 30 pages, 5 figures, 7 tables

arXiv:2201.00961 [pdf, other]

Nipping in the Bud: Detection, Diffusion and Mitigation of Hate Speech on Social Media

Authors: Tanmoy Chakraborty, Sarah Masud

Abstract: Since the proliferation of social media usage, hate speech has become a major crisis. Hateful content can spread quickly and create an environment of distress and hostility. Further, what can be considered hateful is contextual and varies with time. While online hate speech reduces the ability of already marginalised groups to participate in discussion freely, offline hate speech leads to hate cri… ▽ More Since the proliferation of social media usage, hate speech has become a major crisis. Hateful content can spread quickly and create an environment of distress and hostility. Further, what can be considered hateful is contextual and varies with time. While online hate speech reduces the ability of already marginalised groups to participate in discussion freely, offline hate speech leads to hate crimes and violence against individuals and communities. The multifaceted nature of hate speech and its real-world impact have already piqued the interest of the data mining and machine learning communities. Despite our best efforts, hate speech remains an evasive issue for researchers and practitioners alike. This article presents methodological challenges that hinder building automated hate mitigation systems. These challenges inspired our work in the broader area of combating hateful content on the web. We discuss a series of our proposed solutions to limit the spread of hate speech on social media. △ Less

Submitted 3 January, 2022; originally announced January 2022.

Comments: Submitted for publication in ACM SIGWEB Newsletter

arXiv:2112.06267 [pdf, other]

DiVA: A Scalable, Interactive and Customizable Visual Analytics Platform for Information Diffusion on Large Networks

Authors: Dhruv Sahnan, Vasu Goel, Sarah Masud, Chhavi Jain, Vikram Goyal, Tanmoy Chakraborty

Abstract: With an increasing outreach of digital platforms in our lives, researchers have taken a keen interest to study different facets of social interactions that seem to be evolving rapidly. Analysing the spread of information (aka diffusion) has brought forth multiple research areas such as modelling user engagement, determining emerging topics, forecasting virality of online posts and predicting infor… ▽ More With an increasing outreach of digital platforms in our lives, researchers have taken a keen interest to study different facets of social interactions that seem to be evolving rapidly. Analysing the spread of information (aka diffusion) has brought forth multiple research areas such as modelling user engagement, determining emerging topics, forecasting virality of online posts and predicting information cascades. Despite such ever-increasing interest, there remains a vacuum among easy-to-use interfaces for large-scale visualisation of diffusion models. In this paper, we introduce DiVA -- Diffusion Visualisation and Analysis, a tool that provides a scalable web interface and extendable APIs to analyse various diffusion trends on networks. DiVA uniquely offers support for simultaneous comparison of two competing diffusion models and even the comparison with the ground-truth results, both of which help develop a coherent understanding of real-world scenarios. Along with performing an exhaustive feature comparison and system evaluation of DiVA against publicly-available web interfaces for information diffusion, we conducted a user study to understand the strengths and limitations of DiVA. We noticed that evaluators had a seamless user experience, especially when analysing diffusion on large networks. △ Less

Submitted 21 August, 2022; v1 submitted 12 December, 2021; originally announced December 2021.

Comments: 33 pages, 12 figures, 11 tables

arXiv:2111.00047 [pdf, other]

Robust and efficient change point detection using novel multivariate rank-energy GoF test

Authors: Shoaib Bin Masud, Shuchin Aeron

Abstract: In this paper, we use and further develop upon a recently proposed multivariate, distribution-free Goodness-of-Fit (GoF) test based on the theory of Optimal Transport (OT) called the Rank Energy (RE) [1], for non-parametric and unsupervised Change Point Detection (CPD) in multivariate time series data. We show that directly using RE leads to high sensitivity to very small changes in distributions… ▽ More In this paper, we use and further develop upon a recently proposed multivariate, distribution-free Goodness-of-Fit (GoF) test based on the theory of Optimal Transport (OT) called the Rank Energy (RE) [1], for non-parametric and unsupervised Change Point Detection (CPD) in multivariate time series data. We show that directly using RE leads to high sensitivity to very small changes in distributions (causing high false alarms) and it requires large sample complexity and huge computational cost. To alleviate these drawbacks, we propose a new GoF test statistic called as soft-Rank Energy (sRE) that is based on entropy regularized OT and employ it towards CPD. We discuss the advantages of using sRE over RE and demonstrate that the proposed sRE based CPD outperforms all the existing methods in terms of Area Under the Curve (AUC) and F1-score on real and synthetic data sets. △ Less

Submitted 29 October, 2021; originally announced November 2021.

Comments: 6 pages, 1 figure

arXiv:2111.00043 [pdf, other]

Multivariate rank via entropic optimal transport: sample efficiency and generative modeling

Authors: Shoaib Bin Masud, Matthew Werenski, James M. Murphy, Shuchin Aeron

Abstract: The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper,… ▽ More The framework of optimal transport has been leveraged to extend the notion of rank to the multivariate setting while preserving desirable properties of the resulting goodness-of-fit (GoF) statistics. In particular, the rank energy (RE) and rank maximum mean discrepancy (RMMD) are distribution-free under the null, exhibit high power in statistical testing, and are robust to outliers. In this paper, we point to and alleviate some of the practical shortcomings of these proposed GoF statistics, namely their high computational cost, high statistical sample complexity, and lack of differentiability with respect to the data. We show that all these practically important issues are addressed by considering entropy-regularized optimal transport maps in place of the rank map, which we refer to as the soft rank. We consequently propose two new statistics, the soft rank energy (sRE) and soft rank maximum mean discrepancy (sRMMD), which exhibit several desirable properties. Given $n$ sample data points, we provide non-asymptotic convergence rates for the sample estimate of the entropic transport map to its population version that are essentially of the order $n^{-1/2}$ when the starting measure is subgaussian and the target measure has compact support. This result is novel compared to existing results which achieve a rate of $n^{-1}$ but crucially rely on both measures having compact support. We leverage this result to demonstrate fast convergence of sample sRE and sRMMD to their population version making them useful for high-dimensional GoF testing. Our statistics are differentiable and amenable to popular machine learning frameworks that rely on gradient methods. We leverage these properties towards showcasing the utility of the proposed statistics for generative modeling on two important problems: image generation and generating valid knockoffs for controlled feature selection. △ Less

Submitted 25 November, 2022; v1 submitted 29 October, 2021; originally announced November 2021.

Comments: 46 pages, 10 figures. Replacement note: Substantial revision over V2: Title change, first authors contribution change, new improved theoretical results relaxing compactness assumptions

arXiv:2103.08811 [pdf, other]

Soft and subspace robust multivariate rank tests based on entropy regularized optimal transport

Authors: Shoaib Bin Masud, Boyang Lyu, Shuchin Aeron

Abstract: In this paper, we extend the recently proposed multivariate rank energy distance, based on the theory of optimal transport, for statistical testing of distributional similarity, to soft rank energy distance. Being differentiable, this in turn allows us to extend the rank energy to a subspace robust rank energy distance, dubbed Projected soft-Rank Energy distance, which can be computed via optimiza… ▽ More In this paper, we extend the recently proposed multivariate rank energy distance, based on the theory of optimal transport, for statistical testing of distributional similarity, to soft rank energy distance. Being differentiable, this in turn allows us to extend the rank energy to a subspace robust rank energy distance, dubbed Projected soft-Rank Energy distance, which can be computed via optimization over the Stiefel manifold. We show via experiments that using projected soft rank energy one can trade-off the detection power vs the false alarm via projections onto an appropriately selected low dimensional subspace. We also show the utility of the proposed tests on unsupervised change point detection in multivariate time series data. All codes are publicly available at the link provided in the experiment section. △ Less

Submitted 17 April, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Comments: 14 pages, 5 figures

arXiv:2101.11425 [pdf, other]

Fake News Detection System using XLNet model with Topic Distributions: CONSTRAINT@AAAI2021 Shared Task

Authors: Akansha Gautam, Venktesh V, Sarah Masud

Abstract: With the ease of access to information, and its rapid dissemination over the internet (both velocity and volume), it has become challenging to filter out truthful information from fake ones. The research community is now faced with the task of automatic detection of fake news, which carries real-world socio-political impact. One such research contribution came in the form of the Constraint@AAA1202… ▽ More With the ease of access to information, and its rapid dissemination over the internet (both velocity and volume), it has become challenging to filter out truthful information from fake ones. The research community is now faced with the task of automatic detection of fake news, which carries real-world socio-political impact. One such research contribution came in the form of the Constraint@AAA12021 Shared Task on COVID19 Fake News Detection in English. In this paper, we shed light on a novel method we proposed as a part of this shared task. Our team introduced an approach to combine topical distributions from Latent Dirichlet Allocation (LDA) with contextualized representations from XLNet. We also compared our method with existing baselines to show that XLNet + Topic Distributions outperforms other approaches by attaining an F1-score of 0.967. △ Less

Submitted 12 January, 2021; originally announced January 2021.

Comments: Accepted at CONSTRAINT@AAAI2021 Shared Task for the CONSTRAINT workshop, collocated with AAAI 2021

arXiv:2010.04377 [pdf, other]

Hate is the New Infodemic: A Topic-aware Modeling of Hate Speech Diffusion on Twitter

Authors: Sarah Masud, Subhabrata Dutta, Sakshi Makkar, Chhavi Jain, Vikram Goyal, Amitava Das, Tanmoy Chakraborty

Abstract: Online hate speech, particularly over microblogging platforms like Twitter, has emerged as arguably the most severe issue of the past decade. Several countries have reported a steep rise in hate crimes infuriated by malicious hate campaigns. While the detection of hate speech is one of the emerging research areas, the generation and spread of topic-dependent hate in the information network remain… ▽ More Online hate speech, particularly over microblogging platforms like Twitter, has emerged as arguably the most severe issue of the past decade. Several countries have reported a steep rise in hate crimes infuriated by malicious hate campaigns. While the detection of hate speech is one of the emerging research areas, the generation and spread of topic-dependent hate in the information network remain under-explored. In this work, we focus on exploring user behaviour, which triggers the genesis of hate speech on Twitter and how it diffuses via retweets. We crawl a large-scale dataset of tweets, retweets, user activity history, and follower networks, comprising over 161 million tweets from more than $41$ million unique users. We also collect over 600k contemporary news articles published online. We characterize different signals of information that govern these dynamics. Our analyses differentiate the diffusion dynamics in the presence of hate from usual information diffusion. This motivates us to formulate the modelling problem in a topic-aware setting with real-world knowledge. For predicting the initiation of hate speech for any given hashtag, we propose multiple feature-rich models, with the best performing one achieving a macro F1 score of 0.65. Meanwhile, to predict the retweet dynamics on Twitter, we propose RETINA, a novel neural architecture that incorporates exogenous influence using scaled dot-product attention. RETINA achieves a macro F1-score of 0.85, outperforming multiple state-of-the-art models. Our analysis reveals the superlative power of RETINA to predict the retweet dynamics of hateful content compared to the existing diffusion models. △ Less

Submitted 9 October, 2020; originally announced October 2020.

Comments: 6 table, 9 figures, Full paper in 37th International Conference on Data Engineering (ICDE)

arXiv:2006.07812 [pdf, other]

Deep Exogenous and Endogenous Influence Combination for Social Chatter Intensity Prediction

Authors: Subhabrata Dutta, Sarah Masud, Soumen Chakrabarti, Tanmoy Chakraborty

Abstract: Modeling user engagement dynamics on social media has compelling applications in user-persona detection and political discourse mining. Most existing approaches depend heavily on knowledge of the underlying user network. However, a large number of discussions happen on platforms that either lack any reliable social network or reveal only partially the inter-user ties (Reddit, Stackoverflow). Many… ▽ More Modeling user engagement dynamics on social media has compelling applications in user-persona detection and political discourse mining. Most existing approaches depend heavily on knowledge of the underlying user network. However, a large number of discussions happen on platforms that either lack any reliable social network or reveal only partially the inter-user ties (Reddit, Stackoverflow). Many approaches require observing a discussion for some considerable period before they can make useful predictions. In real-time streaming scenarios, observations incur costs. Lastly, most models do not capture complex interactions between exogenous events (such as news articles published externally) and in-network effects (such as follow-up discussions on Reddit) to determine engagement levels. To address the three limitations noted above, we propose a novel framework, ChatterNet, which, to our knowledge, is the first that can model and predict user engagement without considering the underlying user network. Given streams of timestamped news articles and discussions, the task is to observe the streams for a short period leading up to a time horizon, then predict chatter: the volume of discussions through a specified period after the horizon. ChatterNet processes text from news and discussions using a novel time-evolving recurrent network architecture that captures both temporal properties within news and discussions, as well as the influence of news on discussions. We report on extensive experiments using a two-month-long discussion corpus of Reddit, and a contemporaneous corpus of online news articles from the Common Crawl. ChatterNet shows considerable improvements beyond recent state-of-the-art models of engagement prediction. Detailed studies controlling observation and prediction windows, over 43 different subreddits, yield further useful insights. △ Less

Submitted 14 June, 2020; originally announced June 2020.

Comments: 6 figures, 7 tables, Accepted in SIGKDD 2020

arXiv:1310.7297 [pdf, other]

Scalable Visibility Color Map Construction in Spatial Databases

Authors: Farhana Murtaza Choudhury, Mohammed Eunus Ali, Sarah Masud, Suman Nath, Ishat E Rabban

Abstract: Recent advances in 3D modeling provide us with real 3D datasets to answer queries, such as "What is the best position for a new billboard?" and "Which hotel room has the best view?" in the presence of obstacles. These applications require measuring and differentiating the visibility of an object (target) from different viewpoints in a dataspace, e.g., a billboard may be seen from two viewpoints bu… ▽ More Recent advances in 3D modeling provide us with real 3D datasets to answer queries, such as "What is the best position for a new billboard?" and "Which hotel room has the best view?" in the presence of obstacles. These applications require measuring and differentiating the visibility of an object (target) from different viewpoints in a dataspace, e.g., a billboard may be seen from two viewpoints but is readable only from the viewpoint closer to the target. In this paper, we formulate the above problem of quantifying the visibility of (from) a target object from (of) the surrounding area with a visibility color map (VCM). A VCM is essentially defined as a surface color map of the space, where each viewpoint of the space is assigned a color value that denotes the visibility measure of the target from that viewpoint. Measuring the visibility of a target even from a single viewpoint is an expensive operation, as we need to consider factors such as distance, angle, and obstacles between the viewpoint and the target. Hence, a straightforward approach to construct the VCM that requires visibility computation for every viewpoint of the surrounding space of the target, is prohibitively expensive in terms of both I/Os and computation, especially for a real dataset comprising of thousands of obstacles. We propose an efficient approach to compute the VCM based on a key property of the human vision that eliminates the necessity of computing the visibility for a large number of viewpoints of the space. To further reduce the computational overhead, we propose two approximations; namely, minimum bounding rectangle and tangential approaches with guaranteed error bounds. Our extensive experiments demonstrate the effectiveness and efficiency of our solutions to construct the VCM for real 2D and 3D datasets. △ Less

Submitted 27 October, 2013; originally announced October 2013.

Comments: 12 pages, 14 figures

Showing 1–17 of 17 results for author: Masud, S