Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management

Seid Muhie Yimam
University of Hamburg
Daryna Dementieva
Technical University of Munich
Tim Fischer
University of Hamburg
Daniil Moskovskiy
Skoltech
Naquee Rizwan
IIT Kharagpur
Punyajoy Saha
IIT Kharagpur
Sarthak Roy
IIT Kharagpur
Martin Semmann
University of Hamburg
Alexander Panchenko
Skoltech
Chris Biemann
University of Hamburg
Animesh Mukherjee
IIT Kharagpur

Abstract

Despite regulations imposed by nations and social media platforms, such as recent EU regulations targeting digital violence, abusive content persists as a significant challenge. Existing approaches primarily rely on binary solutions, such as outright blocking or banning, yet fail to address the complex nature of abusive speech. In this work, we propose a more comprehensive approach called Demarcation, which scores abusive speech on four aspects: (i) severity scale; (ii) presence of a target; (iii) context scale; and (iv) legal scale. It then suggests a wider range of actions, such as detoxification, counterspeech generation, blocking, or, as a final measure, human intervention. Through a thorough analysis of abusive speech regulations across diverse jurisdictions, platforms, and research papers, we highlight the gap in preventive measures and advocate for tailored proactive steps to combat the multifaceted manifestations of abusive speech. Our work aims to inform future strategies for effectively addressing abusive speech online.

1 Introduction

AI continues to advance rapidly across various domains, offering diverse applications. Among these, leveraging AI for positive societal impact Shi et al. (2020) is becoming an extremely important direction to explore. Specifically, in the field of NLP Jin et al. (2021), one of the important societal applications lies in mitigating digital violence Kaye (2019). Digital violence persists as a pressing issue in online social environments, posing tangible risks to users Barbieri et al. (2019); Kara et al. (2022). It involves using information and communication technologies to hurt, humiliate, disturb, frighten, exclude, and victimize individuals. This often results in increased anxiety, sadness, tension, and a loss of motivation at work. It includes harmful online activities such as abusive behavior, hate speech, toxic speech, and offensive language, which significantly affect an individual’s professional and social effectiveness and efficiency Özsungur (2022). While this domain encompasses diverse forms of digital abuse (stalking, unauthorized photo sharing, profile hacking, and direct threats), our focus in this study is solely on text-based abusive content.

Traditional automated moderation methods rely mostly on blocking offensive messages MacAvaney et al. (2019); Cobbe (2021) to handle abusive behaviour online. Most platform companies, such as Meta and X, have a blanket hate speech policy of their own, which amounts to either deletion of the post or suspension of the account. This policy is applied uniformly to everything that is flagged as hateful on the platform. Such an approach has been shown to be ineffective in curbing abusive behavior in the long term Parker and Ruths (2023). A more nuanced treatment of this serious problem is therefore needed. In recent times, counterspeech Alsagheer et al. (2022) has emerged as an alternative approach to mitigating hate speech and has demonstrated efficacy in addressing such harmful discourse Kulenović (2023). However, the automatic generation of counterspeech remains underexplored. Text detoxification has also been introduced as an approach to prevent toxic speech Nogueira dos Santos et al. (2018); Logacheva et al. (2022); however, this method has not yet been explored in the wild either.

In this work, we aim to consolidate various proactive measures for mitigating digital violence into a unified pipeline incorporating insights from several jurisdictions and NLP research in this domain. Thus, our contributions are the following.

•

We provide a comprehensive survey of the state of hate speech definitions and mitigation strategies from three main “pillars”: (i) regulations across nations; (ii) policies of social media platforms; (iii) NLP research papers;
•

We perform a thorough empirical analysis across a large array of documents available in these three pillars;
•

Based on our analysis, we propose our recommendation called “Demarcation”: a unified pipeline for automated, multi-step proactive mitigation of abusive speech that consists of text detoxification, counterspeech generation, and banning via human moderation as a final step.

All the questionnaires and responses utilized for this survey have been published anonymously for this submission (https://anonymous.4open.science/r/demarked-tacl-D3ED/).

2 Background

Violence is an umbrella term that refers to words or actions causing harm to an individual or a community. Digital violence is a special form that relies on digital technologies, where the harm is typically spread through electronic devices like computers, smartphones and IoT sensors. This form of violence can take place publicly on social media platforms or privately on one’s personal devices, as well as in alternative digital environments like the metaverse. Quite naturally, the individuals or groups who are most vulnerable in the physical world are also the most vulnerable to online abuse and harassment. In this paper, we deal only with textual forms of digital abuse and their nuances.

The work by Banko et al. (2020) classified all types of harmful content as either abusive or online harm and offered a corresponding typology. The typology includes four categories of harmful content: hate and harassment (content aimed at tormenting, demeaning, or intimidating specific individuals or groups), self-inflicted harm (content promoting self-harm), ideological harm (the dissemination of beliefs potentially harmful to society over time), and exploitation (using content to exploit others financially, sexually, or physically). The study by Lewandowska-Tomaszczyk et al. (2023) categorizes such harmful content as offensive speech, which includes 17 categories and sub-categories such as taboo, insulting, hate speech, harassment, and toxic. The taxonomy also encompasses aspects such as hostile, discredit, and racist, among others. When it comes specifically to defining hate speech, there is no consensus among legislators, platform operators, and researchers. Furthermore, its definition has become increasingly vague amidst recent ethical and communication challenges. In addition, according to Hietanen and Eddebo (2023), the definition of hate speech is now often intertwined with negative speech, which encompasses expressions of discontent, resentment, and blame concerning virtually any issue. One of the most comprehensive definitions, which is also widely followed in the computer science literature, is the one proposed by the United Nations: “any kind of communication in speech, writing or behaviour, that attacks or uses pejorative or discriminatory language with reference to a person or a group based on who they are, in other words, based on their religion, ethnicity, nationality, race, colour, descent, gender or other identity factor.” To address instances of textual digital violence, conventional practice often involves the implementation of content moderation measures.
Content moderation, both human and algorithmic, involves overseeing user-generated content to align with legal standards, community norms, and platform policies Banko et al. (2020); Hietanen and Eddebo (2023). Algorithmic moderation, primarily aimed at removing or banning non-compliant content, boosts online safety, curbs abuse, and swiftly detects serious infractions, thus reducing the limitations of depending entirely on human moderators.

Recent studies Kulenović (2023) advocate a more sustainable method of counterspeech or counter-hate to mitigate the negative effects of hate speech. Counterspeech offers an approach to combating hateful content by challenging stereotypes and misinformation through reasoned arguments, thereby supporting the principle of free speech Yu et al. (2022); Zheng et al. (2023); Gupta et al. (2023). A third strategy for mitigating hate speech is detoxification, which seeks to decrease the toxicity level of the text while maintaining content and fluency as much as feasible Nogueira dos Santos et al. (2018); Dementieva et al. (2021). Despite being criticized by advocates of free speech, detoxification is designed to create a more civil digital environment for various groups, including children, but it can only be used to handle explicit toxicity Ziems et al. (2022).

In the remainder of this paper, we will compile various methods of mitigating abusive speech in one unified pipeline of proactive moderation.
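To make the idea of a unified pipeline more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the four aspects named in the abstract (severity, presence of a target, context, legal) are already available as per-message scores. The class and field names, the 0 to 1 score ranges, the order of the checks, and every threshold are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Moderation options mentioned in the abstract, plus a no-op."""
    NO_ACTION = "no_action"
    DETOXIFY = "detoxify"
    COUNTERSPEECH = "counterspeech"
    BLOCK = "block"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class AspectScores:
    """Hypothetical per-message scores; names and ranges are assumptions."""
    severity: float   # 0.0 (benign) .. 1.0 (most severe)
    has_target: bool  # True if a person or group is targeted
    context: float    # 0.0 .. 1.0, how strongly the context aggravates the message
    legal: float      # 0.0 .. 1.0, estimated legal relevance


def route(scores: AspectScores) -> Action:
    """Map aspect scores to one action. All thresholds are illustrative only."""
    if scores.legal >= 0.8:
        # Potentially unlawful content is escalated to a human moderator.
        return Action.HUMAN_REVIEW
    if scores.severity >= 0.8 and scores.has_target:
        return Action.BLOCK
    if scores.has_target and scores.severity >= 0.4:
        return Action.COUNTERSPEECH
    if scores.severity >= 0.4 or scores.context >= 0.6:
        return Action.DETOXIFY
    return Action.NO_ACTION


if __name__ == "__main__":
    example = AspectScores(severity=0.5, has_target=True, context=0.2, legal=0.1)
    print(route(example).value)
```

The point of the sketch is only that a graded score vector admits more than two outcomes, in contrast to the binary block-or-keep decision discussed above; how the scores are produced and where the cut-offs lie is left open here.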

3 Related work

Automatic abusive speech detection

Moderation is a fundamental element of social media platforms, involving various measures to limit the visibility of abusive content. These measures range from deleting and hiding posts to issuing warnings or blocking users who fail to adhere to platform policies Trujillo et al. (2023); Arora et al. (2024). Recently, significant research efforts have been focused on gathering datasets to develop automatic hate speech classification models Fortuna et al. (2020); Mathew et al. (2021), including for low-resource languages such as Amharic Ayele et al. (2023, 2024), Arabic Magnossão de Paula et al. (2022); Alzubi et al. (2022), code-mixed Hindi Bohra et al. (2018); Ousidhoum et al. (2019), and others.

Automatic counterspeech generation

While restricting access to messages remains a popular strategy endorsed by both platform owners and government policies to combat harmful content, countering hate speech is increasingly favored Mun et al. (2024). This strategy is frequently advocated with the motto “countering rather than censoring”, which is generally viewed as preferable since it avoids interfering with the principle of free speech Yu et al. (2023); Bonaldi et al. (2024). The work by Yu et al. (2023) distinguishes two approaches to counterspeech: countering the author and countering the hate content, where the former is regarded as a less robust form of countering. Moreover, besides reducing online hatred, counterspeech is used to encourage positive transformations in online communities by facilitating dialogue among users and nurturing a sense of community Buerger (2022, 2021).

Automatic text detoxification

Another line of research in abusive language processing involves message detoxification. This process is crucial for eliminating or minimizing offensive or harmful content in sentences, while preserving the original meaning as much as possible Logacheva et al. (2022); Dementieva et al. (2021); Tran et al. (2020). Detoxification enhances the quality of online interactions by making them more respectful and less toxic Tran et al. (2020). Various models applied to detoxification can produce diverse, yet non-toxic and acceptable outcomes.

Examples of automatic mitigation strategies in deployment

The work by Chung et al. (2021) developed a tool for Twitter (now X) designed to continuously monitor and respond to hateful content related to Islamophobia. The tool was used by non-governmental organization (NGO) operators, and the counter-narrative feature has been highly praised for its potential to significantly impact the fight against online Islamophobia. This feature, previously managed manually, was tested for partial automation, which proved effective in enhancing both the volume and usability of the counter-narratives produced by NGO operators. The study by Arora et al. (2024), using a method somewhat similar to ours, examined state-of-the-art (SOTA) research on hate speech and related platform moderation policies. The findings reveal a notable discrepancy between the focus of research and the needs of platform policies. While topics like misinformation and political propaganda receive extensive research attention, as shown by the high research-to-policy coverage ratio, critical issues such as sexual solicitation and graphic content are significantly under-researched. This mismatch underscores a gap between the types of content platforms need to moderate and the solutions offered by current harmful content detection research. To address this disconnect, we propose a dynamic, demarcation-based moderation framework that integrates multiple intervention options tailored to specific contexts and regulatory requirements.

4 Methodology

As a first step toward unifying the existing hate speech regulations, our approach incorporates two strategic elements. Initially, we consider three perspectives: country-specific regulations, social media platforms’ policies, and the definitions and label descriptions mentioned in research papers on hate speech. For each of these dimensions, we developed specific selection criteria to obtain representative samples. Subsequently, we crafted a series of questions designed to analyze and gain deeper insights into each area. This dual strategy ensures a comprehensive examination of the regulatory landscape. In the following sections, we outline the selection criteria and the rationale for our questions.

4.1 Country-specific regulations

In this section, we examine the national regulations on hate speech. Hate speech can manifest in various forms and requires different regulatory approaches. Our analysis aims to identify common expressions of hatred and link these findings to the digital world.

Selection criteria

We defined selection criteria to ensure a diverse and representative sample of countries, prioritizing those we are most familiar with. The criteria included:

•

Home country of the co-authors: We initially selected the home countries of all co-authors to leverage their familiarity with the regulatory framework, ensuring a detailed and contextually rich analysis.
•

Geographic representation: To ensure diversity, we included at least one country from each continent. Countries were chosen based on population to capture a wide range of regulatory approaches and perspectives.
•

Online presence: We selected countries that have a widespread online presence and whose nationals frequently express their opinions and viewpoints on social media platforms.
•

Focus on hate speech regulation: We included countries where incidents of hate speech are commonplace to gain insights into their regulatory responses to this widespread and serious issue.

Questions

The questions formulated for regulatory analysis were guided by specific rationales aimed at extracting key insights from each country’s approach to hate speech regulation. Each question served a distinct purpose in understanding the regulatory landscape.

•

Freedom of speech: These questions aimed to assess the extent to which hate speech is protected under freedom of speech provisions, providing insights into a country’s tolerance for expression that may incite hatred or discrimination.
•

Hate speech definition: Identification of hate speech definitions within regulations was deemed crucial, as it reflects the country’s conceptualization and legal stance on hate speech, including distinctions between online and offline manifestations.
•

Punishment: Examination of punitive measures for hate speech offenses, including monetary fines and imprisonment, provided insights into the severity of regulatory responses and the enforcement mechanisms in place.
•

Regulation of social media platforms: Inquiry into the regulation of social media platforms and specific provisions addressing hate speech online shed light on the extent to which countries recognize and address hate speech as a digital phenomenon.
•

Preventive measures: Evaluation of regulatory encouragement for counterspeech and message detoxification initiatives reflected a country’s proactive approach to combating hate speech, indicating a nuanced understanding of the issue beyond mere censorship.

In total, we selected the EU and 14 other countries across the globe based on the above selection criteria and analyzed them using our comprehensive questionnaire. The complete questionnaire is available in Appendix A.1 and at this anonymous link. The insights derived from the countries’ perspectives on hate speech are discussed in Section 5.1.

4.2 Platform policies

Moving closer to the digital sphere, our next step involves analyzing the policies implemented by social media platforms. Our approach was structured to provide a comprehensive understanding of platform policies, considering factors such as accessibility, content moderation practices, and preventive measures.

Selection criteria

Our selection criteria were designed to ensure a thorough examination of policies across globally popular social media platforms while also accounting for regional variations. These criteria encompassed the following.

•

Global popularity: Platforms were selected based on their monthly active user count, prioritizing the most widely used platforms worldwide to ensure broad coverage and relevance.
•

Regional relevance: Importance was given to the popularity of platforms within the countries mentioned in Section 4.1.

Questions

The questions designed for policy analysis were crafted with specific rationales to extract crucial insights from platform policies on hate speech regulation. Each set of questions targeted a distinct aspect of platform functionalities and their strategies for addressing hate speech. The key questions are categorized as follows.

•

Hate speech definition: Identifying the platform’s definition of hate speech was deemed paramount, as it forms the foundation for content moderation and enforcement actions. Understanding how platforms conceptualize hate speech informs subsequent analyses of their policy effectiveness.
•

Platform access & verification: These questions sought to understand the mechanisms for user access and verification, including age restrictions and verification processes. Understanding these aspects is crucial for discerning the demographics of platform users and the potential vulnerability of certain groups, such as children, to hate speech.
•

Regulation accessibility: Inquiry into the accessibility and language of platform regulations aimed to assess the transparency and user-friendliness of policy documents. In addition, examination of policy alignment with country-specific regulations provided insights into platform compliance and adaptability to legal frameworks.
•

Content moderation: These questions delved into the mechanisms and actors involved in content moderation, including user-driven moderation, automated systems, and employee-led moderation teams. Insights into content moderation practices shed light on the platform’s capacity to mitigate hate speech effectively.
•

Preventive measures: Evaluation of preventive measures focused on the platform’s efforts to empower users in reporting hate speech, as well as initiatives aimed at promoting counterspeech and detoxification of harmful content. Understanding these measures is essential for gauging the platform’s commitment to combating hate speech proactively.
•

Data access: Inquiry into data access policies aimed to assess the platform’s transparency and willingness to collaborate with researchers and law enforcement agencies in hate speech investigations. Access to platform data is critical for conducting comprehensive research and ensuring accountability.

In total, 15 social media platforms (X, Facebook, Telegram, WhatsApp, Instagram, Reddit, VK, Odnoklassniki, TikTok, YouTube, LinkedIn, Snapchat, GAB, ShareChat, Koo) were selected based on the established selection criteria and analyzed through our detailed questionnaire. The complete questionnaire is available in Appendix A.2 and at this anonymous link. The findings, which elucidate the platforms’ approaches to hate speech, are presented in Section 5.2.

Figure 1(a): Regulations summary for various countries and the EU.

4.3 Research datasets

In our third pillar, we bridge the gap with NLP research by examining the current state of automatic abusive speech detection. Our focus is on datasets designed for fine-tuning machine learning models, and we aim to cover the landscape across diverse languages.

Selection criteria

Our selection criteria were crafted to ensure the inclusion of diverse perspectives while maintaining a high standard of relevance and credibility. These criteria included the following points.

•

Language inclusivity: We aimed to encompass a wide array of languages prevalent in the countries considered in Section 4.1.
•

Citations: We prioritized dataset papers that have significantly influenced the academic community, as indicated by their citation metrics. For low-resource languages, we included the majority or all of the available datasets to ensure comprehensive representation in our analysis.
•

Publication venue: Preference was given to papers published in esteemed NLP venues such as ACL Anthology, AAAI, LREC, COLING or WOAH, ensuring a standard of quality and rigor in the selected dataset papers.
•

Cross-verification: To further bolster the credibility of our selection, we cross-checked our choices with established repositories such as hatespeechdatasets.com, thus validating the inclusion of well-established datasets.

Questions

The formulation of specific questions served as a scaffold for our analysis, enabling a detailed examination of the dataset papers. Each question was designed to extract key insights essential for a comprehensive understanding of hate speech datasets. The rationales guiding our questions were as follows:

•

Hate speech definition: Given the complex nature of hate speech, exploring how researchers conceptualize and define it presents valuable opportunities for deeper analysis.
•

Annotation process: Investigating the annotation process sheds light on the methodologies employed, including the existence of guidelines, pilot annotations, and quality control measures, which are crucial for evaluating the quality and reliability of the dataset.
•

Labels: Investigating the labels used for annotation and their descriptions provided insights into the granularity and depth of the dataset’s understanding of hate speech nuances.
•

Annotator demographics: Exploring the demographics of annotators, encompassing factors such as age, gender, religion, and race, facilitated an assessment of dataset inclusivity and annotator suitability.
•

Dataset material: Querying aspects such as data source, modality, size, and availability is vital for understanding the dataset’s scope and applicability in hate speech research.

We selected 38 dataset papers spanning 20 languages based on our criteria and analyzed them using our comprehensive questionnaire. The complete questionnaire is available in Appendix A.3 and in this anonymous link. The results from this analysis are presented in Section 5.3.

5 Results and analysis

In this section, we will discuss the outcomes of our investigation across three key areas aimed at mitigating hate speech: country regulations, platform policies, and research datasets.

5.1 Regulation results

As stated earlier, we selected 14 countries from all over the world in order to obtain a comprehensive picture of how hate speech and related issues are regulated at the governmental level. We have at least one country from each continent, and for the European region we analyzed the regulations established by the European Union, which all member states need to abide by. The results from our investigation are summarized in Figure 1(a).

Relevance of the regulations

First of all, we note that all of the countries considered regulate hate speech in one way or another. Moreover, the vast majority of the regulations have been updated within the last four years, keeping the nations up-to-date with current hate speech challenges.

Definition of hate speech

Despite the widespread recognition of the need to address hate speech at the governmental level, there is no single, universally accepted definition of what constitutes hate speech. Different countries have developed their own definitions, which can lead to inconsistencies and challenges in addressing hate speech across borders. This variation in definitions highlights the need for international cooperation and dialogue to develop a shared understanding of hate speech and its consequences.

Online hate speech regulations

While most of the countries have laws regulating hate speech, only 60% have specific definitions related to online hate speech. Countries such as the USA, Russia, and Ukraine do not address online hate speech separately at the legislative level, and in the USA hate speech is protected under freedom of speech.

Punishments for hate speech

Most countries have a tiered approach to punishing hate speech, with penalties ranging from fines and community service to imprisonment. While imprisonment is a possible consequence, the length of imprisonment is generally relatively short, with only a few countries imposing sentences exceeding 5 years. Moreover, regulations in some countries stipulate harsher penalties for repeat offenses linked to hate crimes.

Methods to pro-actively mitigate hate speech

At both national and regional levels, specific laws addressing counterspeech and detoxification are lacking. However, many countries have emphasized the creation of a safe environment through proactive methods, which appears to be a first positive move in this direction.

5.2 Platform results

As stated earlier, we selected 15 social media platforms with the highest popularity measured in terms of monthly active users. We strategically formulated 25 questions to study the community guidelines provided by the respective platforms in terms of offensive content and their mitigation strategies. The questions can be grouped into 5 categories namely platform access and verification, regulations, content moderation, preventive measures and miscellaneous. The overall results from our investigation are summarized in Figure 1(b).

Platform access and verification

The majority of the platforms have an age limit for account creation and some sort of parental control. Only 3 of the 15 platforms we studied (Facebook, Instagram and YouTube) apply age verification methods. A mandatory phone number or some other form of ID verification is required by only 8 of the 15 platforms. None of the platforms allow the creation of completely anonymous accounts, but 10 platforms allow the creation of pseudonymous accounts, i.e., accounts that use a fictitious name or alias to protect the user’s digital identity.

Regulations

All the platforms except GAB have made their regulations or community guidelines accessible from the home page. X, Telegram and GAB do not automatically adjust the language of the regulations according to the user’s geographical location. Only 9 platforms have adjusted their regulations based on country-specific regulations related to hate speech. Platforms like Telegram, WhatsApp, TikTok and GAB do not even have a strict definition of hate speech in their regulations.

Content moderation

Platform users play an important role in content moderation: on all platforms except Snapchat and TikTok, administrators or moderators can moderate their respective groups or communities. Platform employees play the most crucial role as content moderators; on all platforms except GAB, they act either independently or on content that users have flagged as offensive or inappropriate. Content moderation is a subjective job that depends heavily on the moderator’s social and cultural context and demographics. Only six platforms (Facebook, Instagram, TikTok, ShareChat, YouTube and Koo) have moderators with demographic diversity. A common solution to this challenge is auto-moderation, which is adopted by all platforms except Telegram, WhatsApp and GAB.

Preventive measures

As the primary preventive measure, all platforms have a reporting functionality where users can report content that they find inappropriate. Users generally flag the reported content according to the category labels provided by the platform. Platforms like WhatsApp, VK, Odnoklassniki, TikTok, ShareChat and Koo do not provide a label for offensive or sensitive content when reporting. Counterspeech is one of the widely accepted countermeasures against offensive speech, yet very few platforms, such as Facebook, VK and Odnoklassniki, encourage it. Another preventive measure is the detoxification of offensive content, but none of the platforms have adopted this policy.

Miscellaneous

Platforms like VK, Odnoklassniki, GAB and ShareChat do not share data with law enforcement agencies for the investigation of hateful or offensive speech. Detection and mitigation of hate speech require a high volume of data, which can generally be obtained from platforms via API requests. Some of the platforms, such as Telegram, WhatsApp, VK, Odnoklassniki, Snapchat, GAB and ShareChat, do not share data for research purposes. Influential accounts such as public figures, organizations and media companies get their identities verified by the platforms, but only a few platforms, such as X, Facebook, WhatsApp, Instagram, TikTok, LinkedIn and Snapchat, apply extra rules or regulations to these types of users.
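The per-platform findings reported in this section can be tallied with a short script. The sketch below is an illustration only: it encodes the platform lists exactly as they are named in the text above, and it assumes that lists introduced with "like" or "such as" are exhaustive, which the text does not guarantee. The dictionary keys and the helper name `coverage` are hypothetical.

```python
# The 15 platforms analyzed in Section 4.2.
PLATFORMS = frozenset({
    "X", "Facebook", "Telegram", "WhatsApp", "Instagram", "Reddit", "VK",
    "Odnoklassniki", "TikTok", "YouTube", "LinkedIn", "Snapchat", "GAB",
    "ShareChat", "Koo",
})

# Platforms that HAVE each property, derived from the lists in the text.
# Assumption: "like" / "such as" lists are treated as complete.
FINDINGS = {
    "age_verification": {"Facebook", "Instagram", "YouTube"},
    "auto_moderation": PLATFORMS - {"Telegram", "WhatsApp", "GAB"},
    "diverse_moderators": {
        "Facebook", "Instagram", "TikTok", "ShareChat", "YouTube", "Koo",
    },
    "encourages_counterspeech": {"Facebook", "VK", "Odnoklassniki"},
    "detoxification": set(),
    "shares_data_with_law_enforcement": PLATFORMS - {
        "VK", "Odnoklassniki", "GAB", "ShareChat",
    },
    "shares_data_for_research": PLATFORMS - {
        "Telegram", "WhatsApp", "VK", "Odnoklassniki", "Snapchat", "GAB",
        "ShareChat",
    },
}


def coverage(members):
    """Return (count, share of all platforms) for a set of platform names."""
    return len(members), len(members) / len(PLATFORMS)


for name, members in FINDINGS.items():
    # Guard against typos: every name must be one of the 15 platforms.
    assert members <= PLATFORMS, f"unknown platform in {name}"
    count, share = coverage(members)
    print(f"{name:34s} {count:2d}/{len(PLATFORMS)} ({share:.0%})")
```

Keeping the findings as sets makes follow-up questions cheap, for example intersecting `auto_moderation` with `shares_data_for_research` to see which platforms have both properties.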

5.3 Results based on research datasets

Our analysis of various hate speech dataset papers has yielded several key findings that provide insights into the landscape of hate speech research and dataset construction.

Hate speech definition

Of the surveyed papers, 65% present a clear definition of hate speech. We believe that conceptual clarity about hate speech is highly important, especially for annotation tasks and dataset papers. Consequently, we expected almost all papers to include a definition of hate speech, which unfortunately was not the case.

Compliance with regulations

Our analysis reveals that only 15% of the papers have cross-checked their definition with hate speech regulations at the national level, and only three papers reference platform-specific or data-source-specific regulations. This lack of alignment with regulatory frameworks highlights potential discrepancies between academic definitions and legal or platform-specific interpretations of hate speech.

Formulation of recommendations

Only two of the 38 surveyed papers formulate recommendations on addressing hate speech or leveraging their work, datasets, or annotations for combating hate speech. We believe it is a missed opportunity for academic research to inform practical interventions and policy-making efforts in the fight against hate speech.

Imbalance in investigated data sources

Our analysis also reveals a notable imbalance in the investigated data sources: X accounts for over 50% of the studies, while other platforms such as YouTube, Instagram, Reddit, and WhatsApp are each explored in less than 10% of the papers. This discrepancy is significant since the over-representation of certain platforms in research does not correspond to their actual usage; for example, Facebook, with its 3 billion users, far exceeds X, which has only 611 million users.

While our survey is not fully representative, these findings underscore important trends and gaps in hate speech research, emphasizing the need for greater alignment with regulatory frameworks, formulation of actionable recommendations, and diversification of investigated data sources to capture the landscape of online hate speech more accurately.

Further investigation into hate speech dataset papers revealed a nuanced understanding of hate speech as a multi-faceted phenomenon. Through the analysis of hate speech definitions and descriptions of labels used for annotation, several key aspects emerged that can be considered for the classification of hate speech. We outline these aspects below.

• Target: Understanding the target of hate speech is essential for contextualizing its impact. Inflammatory messages directed at individuals or groups are often considered hate speech, while undirected messages are not.
• Discrimination: Hate speech often manifests through discriminatory language targeting characteristics such as race, sex, gender, nationality, and religion, among others.
• Intent of the perpetrator: Malicious intent, ranging from mocking and causing emotional harm to issuing threats or inciting violence, is typical of hate speech. However, humorous, sarcastic, or trolling messages are often not considered hate speech.
• Language usage: Hate speech can take diverse linguistic forms, from threatening, dehumanizing, and fear-inducing speech to overtly violent or obscene language. Again, sarcastic or humorous language is often not considered hate speech.
• Emotions of the victim/target: Understanding the emotional impact on hate speech victims is crucial for assessing its harm, as it often induces sadness, anger, fear, and outgroup prejudice.
• Frequency: Hate speech can manifest as isolated incidents or as persistent harassment, such as mobbing or bullying. Analyzing attack frequency helps gauge the severity of hate speech.
• Time: Hate speech may reference past events, current circumstances, or future actions. Messages that incite violent actions in the near future are especially dangerous, so the temporal dimension should not be neglected.
• Fact-checking: Hate speech often relies on misinformation or distorted facts to perpetuate harmful narratives. Identifying disinformation can aid hate speech detection and inform the severity assessment.
• Topic and context: Hate speech targets various topics, from political ideologies to social identities, and contextual factors must be considered in its assessment.

Our analysis underscores the complexity of hate speech, highlighting the need for nuanced approaches to effectively identify, classify, and mitigate its harmful effects.

6 Actionable recommendations

To tackle the challenges of addressing abusive speech more effectively across various levels, we introduce Demarcation, a novel pipeline aimed at automating the process of gradually addressing harmful speech. The overall schematic of our approach is presented in Figure 2. In the rest of this section, we describe the main components of the proposed pipeline, together with examples of its potential deployment and implications for future research in automatic abusive speech mitigation.

6.1 Demarcation score

The Demarcation approach consists of several aspects that should be assessed simultaneously and combined into one final objective. The aspects that we propose to be considered are as follows.

Severity scale $\alpha$: First, we can detect the specific type of abusive speech (profane, offensive, hateful, racist, cyberbullying, etc., or simply normal) and how severe it is. Alongside the label, severity can be estimated either from the scores of a text classification model or with a separately fine-tuned classifier or regressor.
Target extraction $\beta$: Alongside the label, it is essential to determine whether the abusive speech targets a specific victim community. The text may use rude language but remain neutral and target no one, or it could be hateful and aimed at an individual or an entire group of people.
Context scale $\gamma$: Another important factor is context. A conversation can be private and may consist of jokes between individuals. The crucial situations to handle are those where abusive speech appears in public communication. If children may be present, even lexical-level profanity should not be shown. If the communication is between adults, we should mitigate hate speech, including racism, sexism, and other group-targeted aggression.
Legal scale $\delta$: Finally, we need to take into account the platform guidelines and the nationality of the perpetrator. Some expressions of hate can lead to a ban from a social platform or even to prosecution under the corresponding country’s laws. Previous work already shows examples where nations have legal regulations on online hate and platforms have special policies for cooperating with investigations. Thus, potential law violations should be specially marked and require further human intervention.

Ultimately, we derive an overall demarcation score $D$ for a given text instance. This computation is intended to be dynamic and may vary depending on the specific implementation. For instance, the final result could be the summation of the products of the different scales $\alpha$, $\beta$, $\gamma$, and $\delta$.
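As one possible reading of this combination, the sketch below computes the score as a weighted sum, i.e., a sum of weight-times-scale products. The normalization of every scale to [0, 1], the weight values, and the names `Aspects` and `demarcation_score` are illustrative assumptions rather than part of the proposal; a deployment may well choose a different combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aspects:
    """Four aspect values for one text, each assumed to be normalized to [0, 1]."""
    severity: float  # alpha: how severe the abusive content is
    target: float    # beta: whether a person or group is targeted
    context: float   # gamma: how sensitive the setting is (public, minors present)
    legal: float     # delta: likelihood of a policy or law violation


# Hypothetical weights; a platform would tune these to its own needs.
DEFAULT_WEIGHTS = Aspects(severity=0.4, target=0.2, context=0.2, legal=0.2)


def demarcation_score(a: Aspects, w: Aspects = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the four scales (one assumed form of the score D)."""
    for name in ("severity", "target", "context", "legal"):
        value = getattr(a, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (
        w.severity * a.severity
        + w.target * a.target
        + w.context * a.context
        + w.legal * a.legal
    )


if __name__ == "__main__":
    # Profane wording in a semi-public setting, with no target and no legal concern.
    profane_untargeted = Aspects(severity=0.5, target=0.0, context=0.5, legal=0.0)
    print(round(demarcation_score(profane_untargeted), 2))
```

Keeping the weights in the same structure as the aspect values makes it easy for a platform to swap in its own weighting without touching the scoring function.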

6.2 Demarcation steps

Once the demarcation score $D$ has been estimated, demarcation steps can be taken to reduce text toxicity and address its hatefulness. Here, we provide examples of several types of abusive speech and suggest ways to address them. However, it is important that stakeholders such as social media providers tailor these steps to their specific needs when preparing for deployment.

No counter actions: If a text sample carries no abusive intent (for instance, a text may express negative sentiment without being abusive), no moderation action should be taken.
Text detoxification: If a message contains obscene lexical elements, text detoxification can be performed as the first step of demarcation. This can take the form of a suggestion to the user to rephrase their profane language in a civil register. This step thus helps mitigate the overall toxicity of the message.
Counterspeech: If a message also contains aggression against some target group, we should mitigate such hate speech with counterspeech. This can even be a multi-step dialogue explaining why such speech is inappropriate in the given context.
Blocking of a message or a user: If, after proactive moderation efforts, a user continues to ignore recommendations and repeatedly posts severely abusive content that violates the platform’s code of conduct or the country’s rules and regulations, the published message should be blocked or deleted. The user account may also be suspended. This measure should likewise be enforced if the content directly incites violence.
Human moderator intervention: At any of the previous steps, if the automatic moderation does not perform as expected, it should be possible to notify human moderators so that they can intervene. This process could be refined further, e.g., according to the classifier’s confidence Pavlopoulos et al. (2017).
Authorities intervention: Finally, if messages contain severe cyberbullying or other dangerous threats, high-level platform moderation should be initiated. In extreme cases, these messages should also be passed on to the appropriate department within national security forces.
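The steps above can be sketched as a simple routing function. This is only a minimal illustration: the `Assessment` fields, the `Action` names, the order in which the checks are applied, and the confidence threshold of 0.7 are all assumptions made for the example, and a real deployment would derive them from its own classifiers, policies, and legal requirements.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no counter action"
    DETOXIFY = "text detoxification"
    COUNTERSPEECH = "counterspeech"
    BLOCK = "block message or user"
    HUMAN = "human moderator intervention"
    AUTHORITIES = "authorities intervention"


@dataclass(frozen=True)
class Assessment:
    """Hypothetical classifier outputs for one message."""
    abusive: bool           # any abusive content detected
    targeted: bool          # aimed at a person or group
    incites_violence: bool  # directly incites violence
    severe_threat: bool     # severe cyberbullying or dangerous threat
    repeat_offender: bool   # keeps posting severe abuse despite recommendations
    confidence: float       # classifier confidence in [0, 1]


def choose_action(a: Assessment, min_confidence: float = 0.7) -> Action:
    """Pick one demarcation step; ordering and threshold are assumptions."""
    if a.confidence < min_confidence:
        return Action.HUMAN  # uncertain predictions go to a moderator
    if a.severe_threat:
        return Action.AUTHORITIES
    if a.incites_violence or (a.abusive and a.repeat_offender):
        return Action.BLOCK
    if not a.abusive:
        return Action.NONE
    if a.targeted:
        return Action.COUNTERSPEECH
    return Action.DETOXIFY


if __name__ == "__main__":
    message = Assessment(
        abusive=True, targeted=True, incites_violence=False,
        severe_threat=False, repeat_offender=False, confidence=0.9,
    )
    print(choose_action(message).value)
```

Checking the confidence first reflects the idea that human moderators can step in at any stage; a platform could equally apply that check only to the more severe actions.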

6.3 Improved data annotation guidelines

Within the NLP community, there are already established guidelines and data statements outlining the responsible annotation and publication of datasets Bender and Friedman (2018); Rogers et al. (2021); ARR (2022). Building on our “three pillars” survey and the Demarcation steps, we would like to suggest extensions to these responsible dataset practices specifically for abusive speech annotation and publication, to promote more proactive and realistic deployment of technologies based on such data.

Country/platform alignment: From the survey results in Section 5.3, we can observe that the majority of the datasets were built from samples taken from popular social networks. However, only a limited number of studies have cross-referenced their definitions of abusive speech with the guidelines set by social media platforms. We recommend that future dataset collection efforts include definitions of abusive speech from both the social media platforms hosting the data and the regulations of the countries where the target languages are predominantly spoken.
Context consideration: Together with text instances, it would be extremely beneficial for dataset descriptions to specify the sources of these samples, whether private conversations, interactions with bots, or public comment sections. Including additional context, such as the preceding and following dialogue turns, would further enhance the precision of automatic detection and moderation systems.
Moderation actions suggestion: Finally, it is essential that authors not only provide definitions of abusive speech labels but also suggest approaches for addressing each label during moderation. Furthermore, they should specify how to use the dataset responsibly when building models for moderating social media platforms, ensuring ethical applications and effective practices.
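To make these three recommendations concrete, the sketch below shows one way a dataset description could record them in machine-readable form and report what is still missing. The class `AbusiveSpeechDatasetCard`, its field names, and the example values are hypothetical and are not taken from any existing data statement standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AbusiveSpeechDatasetCard:
    """Hypothetical metadata record covering the three recommendations."""
    name: str
    platforms: List[str]
    # Country/platform alignment: where the label definitions were cross-checked.
    platform_policies: Dict[str, str] = field(default_factory=dict)
    national_regulations: Dict[str, str] = field(default_factory=dict)
    # Context consideration: e.g. "private chat", "bot interaction", "public comments".
    source_contexts: List[str] = field(default_factory=list)
    # Moderation actions suggestion: label -> recommended handling.
    label_actions: Dict[str, str] = field(default_factory=dict)

    def missing_items(self, labels: List[str]) -> List[str]:
        """List the recommendations this card does not yet satisfy."""
        missing = []
        for platform in self.platforms:
            if platform not in self.platform_policies:
                missing.append(f"platform policy for {platform}")
        if not self.national_regulations:
            missing.append("national regulations")
        if not self.source_contexts:
            missing.append("source contexts")
        for label in labels:
            if label not in self.label_actions:
                missing.append(f"moderation action for label '{label}'")
        return missing


if __name__ == "__main__":
    card = AbusiveSpeechDatasetCard(name="example-corpus", platforms=["X"])
    card.source_contexts.append("public comments")
    card.label_actions["offensive"] = "text detoxification"
    for item in card.missing_items(labels=["offensive", "hate"]):
        print("missing:", item)
```

A check of this kind could run before a dataset is released, so that gaps in policy alignment, context description, or moderation guidance are visible to authors and reviewers.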

7 Conclusion

This work introduces Demarcation, a novel pipeline for proactive abusive speech moderation. It is based on the demarcation score, which consists of several scales: (i) severity scale; (ii) target extraction scale; (iii) context scale; (iv) legal scale. Further demarcation steps to address the impact of abusive speech should include more granular measures such as text detoxification, counterspeech, and, as ultimate mitigation actions, blocking or intervention by humans or authorities. We firmly believe that implementing a more comprehensive moderation and demarcation pipeline will enhance proactive measures against abusive speech and decrease its occurrence in the future. We performed a thorough survey of abusive speech regulations and mitigation strategies across three main “pillars”: (i) countries’ regulations; (ii) social platform policies; (iii) NLP research papers, with the aim of aligning the real needs of healthy communication with NLP research directions. Our findings shed light on the lack of proactive moderation implementations while stressing the need for such technology at both national and social platform levels. For this reason, we also provide novel recommendations for the responsible collection and publication of research data specifically on abusive speech. We firmly believe that enhancing the annotation and publication standards of such datasets will contribute to the development of more accurate automatic proactive moderation models.

Limitations and Ethics statement

While we made diligent efforts to meticulously document our research process, findings, and recommendations, it is important to acknowledge that our study has certain limitations. 1) Only text-based content: We only took into consideration textual expressions of digital violence. We acknowledge that abuse can also be extremely harmful in other modalities such as images, voice recordings, and videos. Our approach to abuse mitigation does not encompass such cases. 2) Only human-written content: Our mitigation pipeline was initially tailored to address only human-authored messages and comments. However, as text generation systems become more prevalent, there is a growing influx of machine-generated content on social media platforms. It is imperative to incorporate additional measures to detect and address bots and other machine-generated texts that may pose greater risks of inciting hatred. 3) Only digital content: Finally, we performed our studies only in the realm of digital violence. Nevertheless, digital abuse can transcend virtual platforms and manifest in real-world scenarios through various means. For this reason, we include an ‘authorities’ intervention’ step in our demarcation pipeline.

Ethics statement: We are committed to upholding freedom of speech and respect the autonomy of stakeholders in deploying moderation technologies tailored to their specific domain, context, and requirements. Our aim is to offer a broader perspective on potential automatic proactive moderation strategies, providing novel insights and recommendations.

References

Abebaw et al. (2022) Zeleke Abebaw, Andreas Rauber, and Solomon Atnafu. 2022. Multi-channel convolutional neural network for hate speech detection in social media. In Advances of Science and Technology, pages 603–618, Cham. Springer International Publishing.
Albadi et al. (2018) Nuha Albadi, Maram Kurdi, and Shivakant Mishra. 2018. Are they our brothers? analysis and detection of religious hate speech in the arabic twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76.
Alsagheer et al. (2022) Dana Alsagheer, Hadi Mansourifar, and Weidong Shi. 2022. Counter hate speech in social media: A survey. CoRR, abs/2203.03584.
Alzubi et al. (2022) Salaheddin Alzubi, Thiago Castro Ferreira, Lucas Pavanelli, and Mohamed Al-Badrashiny. 2022. aiXplain at Arabic hate speech 2022: An ensemble based approach to detecting offensive tweets. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection, pages 214–217, Marseille, France. European Language Resources Association.
Arora et al. (2024) Arnav Arora, Preslav Nakov, Momchil Hardalov, Sheikh Muhammad Sarwar, Vibha Nayak, Yoan Dinkov, Dimitrina Zlatkova, Kyle Dent, Ameya Bhatawdekar, Guillaume Bouchard, and Isabelle Augenstein. 2024. Detecting harmful content on online platforms: What platforms need vs. where research efforts go. ACM Comput. Surv., 56(3):72:1–72:17.
ARR (2022) ACL Rolling Review ARR. 2022. ACL rolling review responsible checklist. https://aclrollingreview.org/responsibleNLPresearch. Accessed: 2024-05-01.
Assenmacher et al. (2021) Dennis Assenmacher, Marco Niemann, Kilian Müller, Moritz Seiler, Dennis Riehle, and Heike Trautmann. 2021. Rp-mod & rp-crowd: Moderator- and crowd-annotated german news comment datasets. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1.
Ayele et al. (2022) Abinew Ali Ayele, Skadi Dinter, Tadesse Destaw Belay, Tesfa Tegegne Asfaw, Seid Muhie Yimam, and Chris Biemann. 2022. The 5js in ethiopia: Amharic hate speech data annotation using toloka crowdsourcing platform. In 2022 International Conference on Information and Communication Technology for Development for Africa (ICT4DA), pages 114–120.
Ayele et al. (2024) Abinew Ali Ayele, Esubalew Alemneh Jalew, Adem Chanie Ali, Seid Muhie Yimam, and Chris Biemann. 2024. Exploring boundaries and intensities in offensive and hate speech: Unveiling the complex spectrum of social media discourse. In Proceedings of The Fourth Workshop on Threat, Aggression & Cyberbullying.
Ayele et al. (2023) Abinew Ali Ayele, Seid Muhie Yimam, Tadesse Destaw Belay, Tesfa Asfaw, and Chris Biemann. 2023. Exploring Amharic hate speech data collection and classification approaches. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 49–59, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Banko et al. (2020) Michele Banko, Brendon MacKeen, and Laurie Ray. 2020. A unified taxonomy of harmful content. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 125–137, Online. Association for Computational Linguistics.
Barbieri et al. (2019) Davide Barbieri, Charlotte Dahin, Brianna Guidorzi, Zuzana Madarova, Marre Karu, Blandine Mollard, Jolanta Reingardė, Lina Salanauskaitė, Monika Natter, Renate Haupfleisch, Katja Korolkova, Monica Barbovschi, Liza Tsaliki, Brian O’Neill, Clara Faulí, Federica Porcu, Francisco Lupiáñez Villanueva, and Alexandra Theben. 2019. Gender equality and youth: opportunities and risks of digitalisation – main report. Technical report, The European Institute for Gender Equality.
Basile et al. (2019) Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Bender and Friedman (2018) Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Trans. Assoc. Comput. Linguistics, 6:587–604.
Bhardwaj et al. (2020) Mohit Bhardwaj, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2020. Hostility detection dataset in hindi.
Bohra et al. (2018) Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A dataset of Hindi-English code-mixed social media text for hate speech detection. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pages 36–41, New Orleans, Louisiana, USA. Association for Computational Linguistics.
Bonaldi et al. (2024) Helena Bonaldi, Yi-Ling Chung, Gavin Abercrombie, and Marco Guerini. 2024. NLP for Counterspeech against Hate: A Survey and How-To Guide. arXiv preprint arXiv:2403.20103.
Buerger (2021) Catherine Buerger. 2021. #iamhere: Collective Counterspeech and the Quest to Improve Online Discourse. Social Media + Society, 7(4).
Buerger (2022) Catherine Buerger. 2022. Why They Do It: Counterspeech Theories of Change. SSRN Electronic Journal.
Burtenshaw and Kestemont (2021) Ben Burtenshaw and Mike Kestemont. 2021. A Dutch dataset for cross-lingual multilabel toxicity detection. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 75–79, Online (Virtual Mode). INCOMA Ltd.
Carmona et al. (2018) Miguel Angel Carmona, Estefanía Guzmán-Falcón, Manuel Montes, Hugo Jair Escalante, Luis Villaseñor-Pineda, Veronica Reyes-Meza, and Antonio Rico-Sulayes. 2018. Overview of mex-a3t at ibereval 2018: Authorship and aggressiveness analysis in mexican spanish tweets.
Chiril et al. (2019) Patricia Chiril, Farah Benamara Zitoune, Véronique Moriceau, Marlène Coulomb-Gully, and Abhishek Kumar. 2019. Multilingual and multitarget hate speech detection in tweets. In Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts, pages 351–360, Toulouse, France. ATALA.
Chung et al. (2021) Yi-Ling Chung, Serra Sinem Tekiroglu, Sara Tonelli, and Marco Guerini. 2021. Empowering ngos in countering online hate messages. Online Soc. Networks Media, 24:100150.
Cobbe (2021) Jennifer Cobbe. 2021. Algorithmic censorship by social platforms: Power and resistance. Philosophy & Technology, 34(4):739–766.
Corazza et al. (2019) Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. Cross-Platform Evaluation for Italian Hate Speech Detection. In CLiC-it 2019 - 6th Annual Conference of the Italian Association for Computational Linguistics, Bari, Italy.
Das et al. (2022) Mithun Das, Somnath Banerjee, Punyajoy Saha, and Animesh Mukherjee. 2022. Hate speech and offensive language detection in Bengali. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 286–296, Online only. Association for Computational Linguistics.
Davidson et al. (2017) Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. Proceedings of the International AAAI Conference on Web and Social Media, 11(1):512–515.
Del Vigna et al. (2017) Fabio Del Vigna, Andrea Cimino, Felice Dell’Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on facebook.
Dementieva et al. (2021) Daryna Dementieva, Daniil Moskovskiy, Varvara Logacheva, David Dale, Olga Kozlova, Nikita Semenov, and Alexander Panchenko. 2021. Methods for Detoxification of Texts for the Russian Language. Multimodal Technol. Interact., 5(9):54.
Demus et al. (2022) Christoph Demus, Jonas Pitz, Mina Schütz, Nadine Probol, Melanie Siegel, and Dirk Labudde. 2022. Detox: A comprehensive dataset for German offensive language and conversation analysis. In Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH), pages 143–153, Seattle, Washington (Hybrid). Association for Computational Linguistics.
Fortuna et al. (2019) Paula Fortuna, João Rocha da Silva, Juan Soler-Company, Leo Wanner, and Sérgio Nunes. 2019. A hierarchically-labeled Portuguese hate speech dataset. In Proceedings of the Third Workshop on Abusive Language Online, pages 94–104, Florence, Italy. Association for Computational Linguistics.
Fortuna et al. (2020) Paula Fortuna, Juan Soler, and Leo Wanner. 2020. Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6786–6794, Marseille, France. European Language Resources Association.
Founta et al. (2018) Antigoni Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of twitter abusive behavior. Proceedings of the International AAAI Conference on Web and Social Media, 12(1).
Gupta et al. (2023) Rishabh Gupta, Shaily Desai, Manvi Goel, Anil Bandhakavi, Tanmoy Chakraborty, and Md. Shad Akhtar. 2023. Counterspeeches up my sleeve! intent distribution learning and persistent fusion for intent-conditioned counterspeech generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5792–5809, Toronto, Canada. Association for Computational Linguistics.
Haddad et al. (2019) Hatem Haddad, Hala Mulki, and Asma Oueslati. 2019. T-hsab: A tunisian hate speech and abusive dataset. In Arabic Language Processing: From Theory to Practice, pages 251–263, Cham. Springer International Publishing.
Hietanen and Eddebo (2023) Mika Hietanen and Johan Eddebo. 2023. Towards a definition of hate speech—with a focus on online contexts. Journal of communication Inquiry, 47(4):440–458.
Jeong et al. (2022) Younghoon Jeong, Juhyun Oh, Jaimeen Ahn, Jongwon Lee, Jihyung Moon, Sungjoon Park, and Alice Oh. 2022. Kold: Korean offensive language dataset.
Jiang et al. (2022) Aiqi Jiang, Xiaohan Yang, Yang Liu, and Arkaitz Zubiaga. 2022. Swsr: A chinese dataset and lexicon for online sexism detection. Online Social Networks and Media, 27:100182.
Jin et al. (2021) Zhijing Jin, Geeticka Chauhan, Brian Tse, Mrinmaya Sachan, and Rada Mihalcea. 2021. How good is NLP? a sober look at NLP tasks through the lens of social impact. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3099–3113, Online. Association for Computational Linguistics.
Kara et al. (2022) Ergün Kara, Gülşen Kirpik, and Attila Kaya. 2022. A research on digital violence in social media. In Handbook of research on digital violence and discrimination studies, pages 270–290. IGI Global.
Karim et al. (2021) Md. Rezaul Karim, Sumon Kanti Dey, Tanhim Islam, Sagor Sarker, Mehadi Hasan Menon, Kabir Hossain, Md. Azam Hossain, and Stefan Decker. 2021. Deephateexplainer: Explainable hate speech detection in under-resourced bengali language. In 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10.
Kaye (2019) David Kaye. 2019. Speech Police: The Global Struggle to Govern the Internet. Columbia Global Reports.
Kulenović (2023) Enes Kulenović. 2023. Should democracies ban hate speech? hate speech laws and counterspeech. Ethical Theory and Moral Practice, 26(4):511–532.
Lewandowska-Tomaszczyk et al. (2023) Barbara Lewandowska-Tomaszczyk, Slavko Žitnik, Chaya Liebeskind, Giedre Valunaite Oleskevicienė, Anna Bączkowska, Paul A Wilson, Marcin Trojszczak, Ivana Brač, Lobel Filipić, Ana Ostroški Anić, et al. 2023. Annotation scheme and evaluation: The case of offensive language. Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje, 49(1).
Ljubešić et al. (2021) Nikola Ljubešić, Darja Fišer, Tomaž Erjavec, and Ajda Šulc. 2021. Offensive language dataset of croatian, english and slovenian comments FRENK 1.1. Slovenian language resource repository CLARIN.SI.
Logacheva et al. (2022) Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. 2022. ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, Dublin, Ireland. Association for Computational Linguistics.
MacAvaney et al. (2019) Sean MacAvaney, Hao-Ren Yao, Eugene Yang, Katina Russell, Nazli Goharian, and Ophir Frieder. 2019. Hate speech detection: Challenges and solutions. PloS one, 14(8):e0221152.
Mandl et al. (2019) Thomas Mandl, Sandip Modha, Prasenjit Majumder, Daksh Patel, Mohana Dave, Chintak Mandlia, and Aditya Patel. 2019. Overview of the hasoc track at fire 2019: Hate speech and offensive content identification in indo-european languages. In Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE ’19, pages 14–17, New York, NY, USA. Association for Computing Machinery.
Mathew et al. (2021) Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2021. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14867–14875.
Mathur et al. (2018) Puneet Mathur, Rajiv Shah, Ramit Sawhney, and Debanjan Mahata. 2018. Detecting offensive tweets in Hindi-English code-switched language. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 18–26, Melbourne, Australia. Association for Computational Linguistics.
Mossie and Wang (2020) Zewdie Mossie and Jenq-Haur Wang. 2020. Vulnerable community identification using hate speech detection on social media. Information Processing & Management, 57(3):102087.
Mulki et al. (2019) Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, and Halima Alshabani. 2019. L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118, Florence, Italy. Association for Computational Linguistics.
Mun et al. (2024) Jimin Mun, Cathy Buerger, Jenny T. Liang, Joshua Garland, and Maarten Sap. 2024. Counterspeakers’ perspectives: Unveiling barriers and AI needs in the fight against online hate. CoRR, abs/2403.00179.
Nurce et al. (2022) Erida Nurce, Jorgel Keci, and Leon Derczynski. 2022. Detecting abusive albanian.
Ousidhoum et al. (2019) Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, and Dit-Yan Yeung. 2019. Multilingual and multi-aspect hate speech analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4675–4684, Hong Kong, China. Association for Computational Linguistics.
Parker and Ruths (2023) Sara Parker and Derek Ruths. 2023. Is hate speech detection the solution the world wants? Proceedings of the National Academy of Sciences, 120(10):e2209384120.
Magnossão de Paula et al. (2022) Angel Felipe Magnossão de Paula, Paolo Rosso, Imene Bensalem, and Wajdi Zaghouani. 2022. UPV at the Arabic hate speech 2022 shared task: Offensive language and hate speech detection using transformers and ensemble models. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur’an QA and Fine-Grained Hate Speech Detection, pages 181–185, Marseille, France. European Language Resources Association.
Pavlopoulos et al. (2017) John Pavlopoulos, Prodromos Malakasiotis, and Ion Androutsopoulos. 2017. Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135, Copenhagen, Denmark. Association for Computational Linguistics.
Ptaszynski et al. (2019) Michal Ptaszynski, Agata Pieciukiewicz, and Paweł Dybała. 2019. Results of the PolEval 2019 Shared Task 6: first dataset and Open Shared Task for automatic cyberbullying detection in Polish Twitter.
Pérez et al. (2023) Juan Manuel Pérez, Franco M. Luque, Demian Zayat, Martín Kondratzky, Agustín Moro, Pablo Santiago Serrati, Joaquín Zajac, Paula Miguel, Natalia Debandi, Agustín Gravano, and Viviana Cotik. 2023. Assessing the impact of contextual information in hate speech detection. IEEE Access, 11:30575–30590.
Rizwan et al. (2020) Hammad Rizwan, Muhammad Haroon Shakeel, and Asim Karim. 2020. Hate-speech and offensive language detection in Roman Urdu. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2512–2522, Online. Association for Computational Linguistics.
Rogers et al. (2021) Anna Rogers, Timothy Baldwin, and Kobi Leins. 2021. ‘just what do you think you’re doing, dave?’ a checklist for responsible data use in NLP. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4821–4833, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Romim et al. (2021) Nauros Romim, Mosahed Ahmed, Hriteshwar Talukder, and Md. Saiful Islam. 2021. Hate speech detection in the bengali language: A dataset and its baseline evaluation. In Proceedings of International Joint Conference on Advances in Computational Intelligence, pages 457–468, Singapore. Springer Singapore.
Sanguinetti et al. (2018) Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter corpus of hate speech against immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Nogueira dos Santos et al. (2018) Cicero Nogueira dos Santos, Igor Melnyk, and Inkit Padhi. 2018. Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 189–194, Melbourne, Australia. Association for Computational Linguistics.
Shekhar et al. (2022) Ravi Shekhar, Mladen Karan, and Matthew Purver. 2022. CoRAL: a context-aware Croatian abusive language dataset. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022, pages 217–225, Online only. Association for Computational Linguistics.
Shi et al. (2020) Zheyuan Ryan Shi, Claire Wang, and Fei Fang. 2020. Artificial intelligence for social good: A survey. CoRR, abs/2001.01818.
Sigurbergsson and Derczynski (2020) Gudbjartur Ingi Sigurbergsson and Leon Derczynski. 2020. Offensive language and hate speech detection for Danish. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3498–3508, Marseille, France. European Language Resources Association.
Sprugnoli et al. (2018) Rachele Sprugnoli, Stefano Menini, Sara Tonelli, Filippo Oncini, and Enrico Piras. 2018. Creating a WhatsApp dataset to study pre-teen cyberbullying. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 51–59, Brussels, Belgium. Association for Computational Linguistics.
Tran et al. (2020) Minh Tran, Yipeng Zhang, and Mohammad Soleymani. 2020. Towards a friendly online community: An unsupervised style transfer framework for profanity redaction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2107–2114, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Trujillo et al. (2023) Amaury Trujillo, Tiziano Fagni, and Stefano Cresci. 2023. The DSA transparency database: Auditing self-reported moderation actions by social media. CoRR, abs/2312.10269.
Yu et al. (2022) Xinchen Yu, Eduardo Blanco, and Lingzi Hong. 2022. Hate speech and counter speech detection: Conversational context does matter. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5918–5930, Seattle, United States. Association for Computational Linguistics.
Yu et al. (2023) Xinchen Yu, Ashley Zhao, Eduardo Blanco, and Lingzi Hong. 2023. A fine-grained taxonomy of replies to hate speech. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7275–7289, Singapore. Association for Computational Linguistics.
Zheng et al. (2023) Yi Zheng, Björn Ross, and Walid Magdy. 2023. What makes good counterspeech? A comparison of generation approaches and evaluation metrics. In Proceedings of the 1st Workshop on CounterSpeech for Online Abuse (CS4OA), pages 62–71, Prague, Czechia. Association for Computational Linguistics.
Ziems et al. (2022) Caleb Ziems, Minzhi Li, Anthony Zhang, and Diyi Yang. 2022. Inducing positive perspectives with text reframing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3682–3700, Dublin, Ireland. Association for Computational Linguistics.
Zueva et al. (2020) Nadezhda Zueva, Madina Kabirova, and Pavel Kalaidin. 2020. Reducing unintended identity bias in Russian hate speech detection. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 65–69, Online. Association for Computational Linguistics.
Özsungur (2022) Fahri Özsungur. 2022. Handbook of Research on Digital Violence and Discrimination Studies: A volume in the Advances in Human and Social Aspects of Technology (AHSAT) Book Series.

Appendix A Full questionnaire for “Three Pillars”

Below we list all the questions formulated for the three areas: country-level regulations, social platforms’ hate speech regulations, and research papers.

A.1 Countries Regulations

Q1
Hate speech legal definition
- Q1.1
  
  Year of the last update to hate speech regulations (Numerical)
- Q1.2
  
  Is there any regulation of hate speech in the country? (Yes/No)
- Q1.3
  
  Is “Hate speech” a legal term in the law of the country? (Yes/No)
- Q1.4
  
  Is online hate speech defined in the regulation? (Yes/No)
- Q1.5
  
  Is the hate speech definition mentioned in the criminal code? (Yes/No)
- Q1.6
  
  Is hate speech an independent criminal offense? (Yes/No)
- Q1.7
  
  Is hate speech protected by freedom of speech? (Yes/No)
Q2
Hate speech liability
- Q2.1
  
  Does the regulation set any kind of punishment? (Yes/No)
- Q2.2
  
  Is there social or community service as punishment? (Yes/No)
- Q2.3
  
  Is there monetary punishment? (Yes/No)
- Q2.4
  
  Is there imprisonment as punishment? (Yes/No)
- Q2.5
  
  Are there extra charges for repeat offenders? (Yes/No)
- Q2.6
  
  Are there special punishments for online hate speech? (Yes/No)
Q3
Preventive measures
- Q3.1
  
  Do the regulations encourage counter hate speech? (Yes/No)
- Q3.2
  
  Do the regulations encourage message rewriting/detoxification? (Yes/No)
Q4
Social media platforms
- Q4.1
  
  Are there social media platform-specific regulations? (Yes/No)
- Q4.2
  
  Do they have social media-specific regulation on hate speech? (Yes/No)
- Q4.3
  
  Is a time frame specified in the regulation in which a hate speech message has to be dealt with? (Yes/No)
- Q4.4
  
  Was the regulation updated in the last 2 years? (Yes/No)
- Q4.5
  
  Is the regulation of online hate speech in line with international regulations/law? (Yes/No)
- Q4.6
  
  Do they have regulation of hate speech for broadcast and print media (TV, radio, printed newspapers)? (Yes/No)

A.2 Social platforms regulations

Q1
Hate speech legal definition
- Q1.1
  
  Country of the company’s headquarters (Text)
- Q1.2
  
  Monthly Active Users (MAU) (Numerical)
Q2
Platform access & regulations
- Q2.1
  
  Is there an age limit for account creation on the platform? (Yes/No)
- Q2.2
  
  Is there content adjusted for children? (parental control?) (Yes/No)
- Q2.3
  
  Is there age verification? (Yes/No)
- Q2.4
  
  Is there phone or ID verification? (Yes/No)
- Q2.5
  
  Does the platform allow users to create a pseudonymous account? (Yes/No)
- Q2.6
  
  Do they allow you to create an anonymous account? (Yes/No)
Q3
Regulations accessibility
- Q3.1
  
  Are the regulations accessible from the front page? (Yes/No)
- Q3.2
  
  Is the language of the regulations automatically adjusted to the user’s location? (Yes/No)
- Q3.3
  
  Is the platform policy adjusted to the regulations of the countries? (Yes/No)
- Q3.4
  
  Is there a definition of hate speech? (Yes/No)
Q4
Content moderation policies
- Q4.1
  
  Are there unmoderated, private groups, channels, or chats? (Yes/No)
- Q4.2
  
  Is the platform moderated by users or groups? (self-moderation) (Yes/No)
- Q4.3
  
  Is the platform moderated by platform employees? (Yes/No)
- Q4.4
  
  Does the platform have a dedicated team of moderators for the specific country? (Yes/No)
- Q4.5
  
  Is there auto-moderation? (pro-active moderation) (Yes/No)
- Q4.6
  
  Does the platform have community guidelines? (in addition to terms of service?) (Yes/No)
Q5
Preventive measures
- Q5.1
  
  Is there a reporting functionality? (Yes/No)
- Q5.2
  
  Do they label content as offensive / sensitive? (Yes/No)
- Q5.3
  
  Do they encourage counter-hate speech? (Yes/No)
- Q5.4
  
  Do they have message detoxification functionality? (Yes/No)
Q6
Other questions
- Q6.1
  
  Is it possible to create a group without administrator approval? (Yes/No)
- Q6.2
  
  Can you request data from the platform for hate speech case investigation? (usually called "Law Enforcement") (Yes/No)
- Q6.3
  
  Is the data accessible through API for Research? (Yes/No)
- Q6.4
  
  Is there verification of public persons / organizations / media companies? (Yes/No)
- Q6.5
  
  Are there extra rules for verified organizations, etc.? (Yes/No)

A.3 NLP Datasets Papers

Q1
Hate speech definition and alignment
- Q1.1
  
  Is a definition of hate speech mentioned? (Yes/No)
- Q1.2
  
  What is the percentage of hateful samples? (Unknown/Not Applicable/Percentage)
- Q1.3
  
  Does the paper mention alignment with countries’ regulations of corresponding languages? (Yes/No)
- Q1.4
  
  Does the paper mention alignment with corresponding data source’s (platform) hate speech regulations? (Yes/No)
Q2
Dataset Details
- Q2.1
  
  Is the data source of the dataset mentioned? (Yes/No)
- Q2.2
  
  What are the data sources? (Unknown/List of platforms)
- Q2.3
  
  What is the time period covered in the data? (Not mentioned/Years’ Range)
- Q2.4
  
  Are the target groups of the dataset specified? (Yes/No)
- Q2.5
  
  Is there a clear dataset splitting strategy into train/validation/test? (Yes/No)
- Q2.6
  
  Is the dataset publicly available? (Yes/On Request/No)
- Q2.7
  
  What is the dataset size (number of samples)? (Number)
Q3
Label details
- Q3.1
  
  Do they provide definitions for the labels? (e.g. for binary classification, is the positive class explained?) (Yes/No)
- Q3.2
  
  Are the labels binary? (e.g., hate and no hate) (Yes/No)
- Q3.3
  
  Are the labels fine-grained? (e.g., are there more annotations than just hate and no hate?) (Yes/No)
- Q3.4
  
  List out all the labels. (List of labels)
- Q3.5
  
  Does the paper mention recommendations on how the labeled data should be used? (Yes/No)
Q4
Annotation Details
- Q4.1
  
  Do they mention the annotation tool? (Yes/No)
- Q4.2
  
  What was the annotation platform? (Unknown/Custom/None/System Name e.g. Toloka)
- Q4.3
  
  Is the annotation conducted using crowd-sourcing? (Yes/No)
- Q4.4
  
  Do they mention a pilot annotation? (Yes/No)
- Q4.5
  
  Is there an annotation guideline? (Yes/No)
- Q4.6
  
  Is the annotation guideline published? (Yes/No)
- Q4.7
  
  What is the number of annotators per sample? (Unknown/Number)
- Q4.8
  
  Are there at least 3 annotators? (Yes/No/Unknown)
- Q4.9
  
  Do they report annotation agreement? (No/Number+Metric)
Q5
Annotator details
- Q5.1
  
  Is the payment or reward mentioned for the annotators? (Yes/No)
- Q5.2
  
  Is the age of the annotators specified? (Yes/No)
- Q5.3
  
  Is the gender of the annotators specified? (Yes/No)
- Q5.4
  
  Is the religion of the annotators specified? (Yes/No)
- Q5.5
  
  Is the race of the annotators specified? (Yes/No)
- Q5.6
  
  Is the education of the annotators specified? (Yes/No)
- Q5.7
  
  Is the language proficiency of the annotators specified? (Yes/No)
- Q5.8
  
  Were the annotators representative of the target groups? (Yes/No)
- Q5.9
  
  Do they cover therapy for the annotators? (Yes/No)
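Some answers in A.3 can be derived from others. As a minimal sketch, assuming that Q4.8 is computed from the Q4.7 answer rather than annotated independently (the questionnaire does not state this), the mapping could look as follows. The function name and the use of `None` for "Unknown" are illustrative choices, not part of the original questionnaire.

```python
from typing import Optional


def at_least_three_annotators(annotators_per_sample: Optional[int]) -> str:
    """Derive the Q4.8 answer (Yes/No/Unknown) from the Q4.7 answer.

    `None` stands for an "Unknown" Q4.7 answer (hypothetical encoding).
    """
    if annotators_per_sample is None:
        return "Unknown"
    return "Yes" if annotators_per_sample >= 3 else "No"


if __name__ == "__main__":
    for n in (None, 2, 3, 5):
        print(n, "->", at_least_three_annotators(n))
```

Deriving the answer this way keeps Q4.7 and Q4.8 consistent by construction, at the cost of assuming the two questions were never meant to be answered separately.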

Hate speech legal definition
Questions	Ethiopia	India	Sri Lanka	Russia	Ukraine	South Africa	UAE	UK	EU	China	USA	Brazil	Colombia	Australia
Q1	2020	2022	2024	2022	2022	2018	2023	2023	2024	2022	NA	2023	2019	2024
Q1.1	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes
Q1.2	No	Yes	No	No	No	Yes	Yes	No	No	No	No	Yes		No
Q1.3	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes
Q1.4	Yes	Yes	No	No	No	Yes	No	No	Yes	No	No	No	Yes	Yes
Q1.5	Yes	Yes	No	Yes	Yes	Yes	Yes	No	Yes	Yes	No	Yes	Yes	Yes
Q1.6	Yes	No	No	No	No	Yes	Yes	No	No	No	No	Yes	Yes	No
Q1.7	No	No	No	No	No	No	No	No	No	No	Yes	No	No	No
Hate speech liability
Q2.1	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes, but	Yes	No	Yes	Yes	Yes
Q2.2	Yes	No	No	Yes	No	Yes	No	No	Varies	?	No	Yes	No	No
Q2.3	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes	Varies	Yes	No	Yes	Yes	Yes
Q2.4	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Varies	Yes	No	Yes	Yes	Yes
Q2.5	No	NA	Yes	No	No	NA	NA	?	Varies	?	No	Yes	?	?
Q2.6	?	No	No	No	No	No	No	No	Varies	?	No	No	No	?
Preventive measures
Q3.1	Yes	No	No	No	No	No	No	No	No	No	Yes	Yes	No	No
Q3.2	NA	No	No	No	NA	No	No	No	No	?	Yes	No	No	No
Social media platforms
Q4.1	No	Yes	Yes	Yes, partially	No	No	No	Yes	Yes	Yes	No	Yes	Yes	Yes
Q4.2	Yes	Yes	No	Yes	No	No	No	Yes	Yes	Yes	No	No	Yes	Yes
Q4.3	Yes	Yes	No	NA	NA	NA	No	No	Yes	Yes	No	No	Yes	Yes
Q4.4	No	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	No	Yes	No	No (from 2021)
Q4.5	Yes	Yes	Yes	Yes	Yes	Yes	No	?	Yes	No	No	No	Yes	Yes
Q4.6	Yes	Yes	No	Yes	Yes	No	Yes	Yes	Yes	No	Yes	No	Yes	Yes

Table 1: Results of exploring country-level regulations on hate speech and digital violence. We selected countries and unions with the aim of covering all continents.
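The rows of Table 1 can be tallied programmatically. The following is a minimal sketch, not the authors' analysis code: it copies the column order and the two "Preventive measures" rows (Q3.1 and Q3.2) from the table and lists the jurisdictions answering "Yes". The names `JURISDICTIONS`, `ROWS`, `tally`, and `jurisdictions_answering` are hypothetical.

```python
from collections import Counter
from typing import List

# Column order copied from the header row of Table 1.
JURISDICTIONS = [
    "Ethiopia", "India", "Sri Lanka", "Russia", "Ukraine", "South Africa",
    "UAE", "UK", "EU", "China", "USA", "Brazil", "Colombia", "Australia",
]

# The two "Preventive measures" rows, copied from Table 1.
ROWS = {
    "Q3.1": "Yes No No No No No No No No No Yes Yes No No".split(),
    "Q3.2": "NA No No No NA No No No No ? Yes No No No".split(),
}


def tally(question: str) -> Counter:
    """Count how often each answer string occurs in one row."""
    return Counter(ROWS[question])


def jurisdictions_answering(question: str, answer: str = "Yes") -> List[str]:
    """List the jurisdictions whose cell equals `answer` for `question`."""
    cells = ROWS[question]
    if len(cells) != len(JURISDICTIONS):
        # Guard against rows with a missing or extra cell.
        raise ValueError(
            f"{question}: expected {len(JURISDICTIONS)} cells, got {len(cells)}"
        )
    return [name for name, cell in zip(JURISDICTIONS, cells) if cell == answer]


if __name__ == "__main__":
    for q in ROWS:
        print(q, dict(tally(q)), jurisdictions_answering(q))
```

Answers such as "NA" and "?" are kept as distinct strings rather than folded into "No", so the tally does not overstate how many jurisdictions were confirmed to lack a measure.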

Year of publication	Dataset research papers
2017 (2)	Davidson et al. (2017); Del Vigna et al. (2017)
2018 (7)	Albadi et al. (2018); Founta et al. (2018); Bohra et al. (2018); Mathur et al. (2018); Sanguinetti et al. (2018); Sprugnoli et al. (2018); Carmona et al. (2018)
2019 (9)	Mulki et al. (2019); Haddad et al. (2019); Chiril et al. (2019); Ousidhoum et al. (2019); Mandl et al. (2019); Corazza et al. (2019); Ptaszynski et al. (2019); Fortuna et al. (2019); Basile et al. (2019)
2020 (5)	Mossie and Wang (2020); Sigurbergsson and Derczynski (2020); Bhardwaj et al. (2020); Rizwan et al. (2020); Zueva et al. (2020)
2021 (6)	Karim et al. (2021); Romim et al. (2021); Ljubešić et al. (2021); Burtenshaw and Kestemont (2021); Mathew et al. (2021); Assenmacher et al. (2021)
2022 (8)	Nurce et al. (2022); Abebaw et al. (2022); Ayele et al. (2022); Jeong et al. (2022); Das et al. (2022); Jiang et al. (2022); Shekhar et al. (2022); Demus et al. (2022)
2023 (1)	Pérez et al. (2023)

Table 2: The explored dataset research papers, arranged in ascending chronological order. Numbers in brackets denote the number of explored dataset papers published in the corresponding year.