
Showing 1–50 of 77 results for author: Zampieri, M

  1. arXiv:2405.06922  [pdf, other]

    cs.CL

    EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection

    Authors: Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

    Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages.…

    Submitted 11 May, 2024; originally announced May 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2310.18387, arXiv:2310.18023

  2. arXiv:2405.06078  [pdf, ps, other]

    cs.CY cs.HC

    Collaborative Design for Job-Seekers with Autism: A Conceptual Framework for Future Research

    Authors: Sungsoo Ray Hong, Marcos Zampieri, Brittany N. Hand, Vivian Motti, Dongjun Chung, Ozlem Uzuner

    Abstract: The success of employment is highly related to a job seeker's capability of communicating and collaborating with others. While leveraging one's network during the job-seeking process is intuitive to the neurotypical, this can be challenging for people with autism. Recent empirical findings have started to show how facilitating collaboration between people with autism and their social surroundings…

    Submitted 9 May, 2024; originally announced May 2024.

  3. arXiv:2404.16116  [pdf, other]

    cs.CL cs.AI

    Classifying Human-Generated and AI-Generated Election Claims in Social Media

    Authors: Alphaeus Dmonte, Marcos Zampieri, Kevin Lybarger, Massimiliano Albanese, Genya Coulter

    Abstract: Politics is one of the most prevalent topics discussed on social media platforms, particularly during major election cycles, where users engage in conversations about candidates and electoral processes. Malicious actors may use this opportunity to disseminate misinformation to undermine trust in the electoral process. The emergence of Large Language Models (LLMs) exacerbates this issue by enabling…

    Submitted 25 April, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

  4. arXiv:2404.11470  [pdf, other]

    cs.CL cs.LG

    A Federated Learning Approach to Privacy Preserving Offensive Language Identification

    Authors: Marcos Zampieri, Damith Premasiri, Tharindu Ranasinghe

    Abstract: The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models trained to detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most soci…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted to TRAC 2024 (Fourth Workshop on Threat, Aggression and Cyberbullying) at LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

  5. arXiv:2404.02540  [pdf, ps, other]

    cs.CL

    CSEPrompts: A Benchmark of Introductory Computer Science Prompts

    Authors: Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri

    Abstract: Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and profe…

    Submitted 4 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  6. arXiv:2403.14990  [pdf, other]

    cs.CL

    MasonTigers at SemEval-2024 Task 1: An Ensemble Approach for Semantic Textual Relatedness

    Authors: Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Al Nahian Bin Emran, Amrita Ganguly, Marcos Zampieri

    Abstract: This paper presents the MasonTigers entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches across 14 different languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21…

    Submitted 5 April, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  7. arXiv:2403.14982  [pdf, other]

    cs.CL

    MasonTigers at SemEval-2024 Task 9: Solving Puzzles with an Ensemble of Chain-of-Thoughts

    Authors: Md Nishat Raihan, Dhiman Goswami, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Amrita Ganguly, Marcos Zampieri

    Abstract: Our paper presents team MasonTigers' submission to SemEval-2024 Task 9, which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain f…

    Submitted 3 April, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  8. arXiv:2402.14972  [pdf, ps, other]

    cs.CL cs.AI

    MultiLS: A Multi-task Lexical Simplification Framework

    Authors: Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

    Abstract: Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility to various target demographics, including children, second language learners, individuals with reading disabilities or low literacy. Several datasets exist for LS. The…

    Submitted 22 February, 2024; originally announced February 2024.

  9. arXiv:2402.01967  [pdf, other]

    cs.CL

    MasonPerplexity at Multimodal Hate Speech Event Detection 2024: Hate Speech and Target Detection Using Transformer Ensembles

    Authors: Amrita Ganguly, Al Nahian Bin Emran, Sadiya Sayara Chowdhury Puspo, Md Nishat Raihan, Dhiman Goswami, Marcos Zampieri

    Abstract: The automatic identification of offensive language such as hate speech is important to keep discussions civil in online communities. Identifying hate speech in multimodal content is a particularly challenging task because offensiveness can be manifested in either words or images or a juxtaposition of the two. This paper presents the MasonPerplexity submission for the Shared Task on Multimodal Hate…

    Submitted 18 February, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

  10. arXiv:2401.15043  [pdf, other]

    cs.CL cs.AI cs.LG

    Health Text Simplification: An Annotated Corpus for Digestive Cancer Education and Novel Strategies for Reinforcement Learning

    Authors: Md Mushfiqur Rahman, Mohammad Sabik Irbaz, Kai North, Michelle S. Williams, Marcos Zampieri, Kevin Lybarger

    Abstract: Objective: The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass the reading level and complexity of widely accepted standards. There is a critical need for high-performing text simplification models in health information to enhance d…

    Submitted 29 March, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

  11. arXiv:2312.03379  [pdf, other]

    cs.CL

    A Text-to-Text Model for Multilingual Offensive Language Identification

    Authors: Tharindu Ranasinghe, Marcos Zampieri

    Abstract: The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their c…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Accepted to Findings of IJCNLP-AACL 2023

  12. arXiv:2311.15032  [pdf, other]

    cs.CL

    nlpBDpatriots at BLP-2023 Task 2: A Transfer Learning Approach to Bangla Sentiment Analysis

    Authors: Dhiman Goswami, Md Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

    Abstract: In this paper, we discuss the nlpBDpatriots entry to the shared task on Sentiment Analysis of Bangla Social Media Posts organized at the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The main objective of this task is to identify the polarity of social media content using a Bangla dataset annotated with positive, neutral, and negative labels provided by the shared task…

    Submitted 25 November, 2023; originally announced November 2023.

  13. arXiv:2311.15029  [pdf, other]

    cs.CL

    nlpBDpatriots at BLP-2023 Task 1: A Two-Step Classification for Violence Inciting Text Detection in Bangla

    Authors: Md Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Marcos Zampieri

    Abstract: In this paper, we discuss the nlpBDpatriots entry to the shared task on Violence Inciting Text Detection (VITD) organized as part of the first workshop on Bangla Language Processing (BLP) co-located with EMNLP. The aim of this task is to identify and classify violent threats that provoke further unlawful violent acts. Our best-performing approach for the task is two-step classification using…

    Submitted 25 November, 2023; originally announced November 2023.

  14. arXiv:2311.15023  [pdf, other]

    cs.CL

    Offensive Language Identification in Transliterated and Code-Mixed Bangla

    Authors: Md Nishat Raihan, Umma Hani Tanmoy, Anika Binte Islam, Kai North, Tharindu Ranasinghe, Antonios Anastasopoulos, Marcos Zampieri

    Abstract: Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introdu…

    Submitted 25 November, 2023; originally announced November 2023.

  15. arXiv:2311.04551  [pdf, other]

    cs.CE

    Earth Observation based multi-scale analysis of crop diversity in the European Union: first insights for agro-environmental policies

    Authors: Melissande Machefer, Matteo Zampieri, Marijn van der Velde, Frank Dentener, Martin Claverie, Raphaël d'Andrimont

    Abstract: To understand the resilience of farms and the agricultural sector, as well as the provision of ecosystem services, we need to characterize and quantify crop diversity. Using a 10m resolution satellite-derived product, we created datasets of crop diversity across spatial and administrative scales for 27 EU countries and the UK in 2018. We define local crop diversity, or $α$-diversity, at a 1km scal…

    Submitted 30 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

  16. arXiv:2310.18387  [pdf, other]

    cs.CL cs.AI

    OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification

    Authors: Dhiman Goswami, Md Nishat Raihan, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

    Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several works have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper,…

    Submitted 25 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2310.18023

  17. arXiv:2310.18023  [pdf, other]

    cs.CL

    SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis

    Authors: Md Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

    Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most datasets available contain code-mixing between only two languages. In this paper, we introduce SentMix-3L, a novel d…

    Submitted 29 November, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

  18. arXiv:2305.20080  [pdf, other]

    cs.CL

    Findings of the VarDial Evaluation Campaign 2023

    Authors: Noëmi Aepli, Çağrı Çöltekin, Rob Van Der Goot, Tommi Jauhiainen, Mourhaf Kazzaz, Nikola Ljubešić, Kai North, Barbara Plank, Yves Scherrer, Marcos Zampieri

    Abstract: This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR),…

    Submitted 31 May, 2023; originally announced May 2023.

    Journal ref: In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 251-261, Dubrovnik, Croatia. Association for Computational Linguistics

  19. arXiv:2305.12000  [pdf, other]

    cs.CL

    Deep Learning Approaches to Lexical Simplification: A Survey

    Authors: Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

    Abstract: Lexical Simplification (LS) is the task of replacing complex words with simpler ones in a sentence whilst preserving the sentence's original meaning. LS is the lexical component of Text Simplification (TS) with the aim of making texts more accessible to various target populations. A past survey (Paetzold and Specia, 2017) has provided a detailed overview of LS. Since this survey, however, the AI/NLP com…

    Submitted 19 May, 2023; originally announced May 2023.

  20. Lexical Complexity Prediction: An Overview

    Authors: Kai North, Marcos Zampieri, Matthew Shardlow

    Abstract: The occurrence of unknown words in texts significantly hinders reading comprehension. To improve accessibility for specific target populations, computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives. In this paper, we present an overview of computational approaches to lexical complexity prediction focusing on the work carried out on…

    Submitted 8 March, 2023; originally announced March 2023.

    ACM Class: A.1

    Journal ref: ACM Computing Surveys 55, 9, Article 179 (January 2023), 40 pages

  21. arXiv:2303.01490  [pdf, other]

    cs.CL

    Language Variety Identification with True Labels

    Authors: Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera

    Abstract: Language identification is an important first step in many IR and NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and…

    Submitted 2 March, 2023; originally announced March 2023.

  22. arXiv:2302.02888  [pdf, other]

    cs.CL cs.LG

    Findings of the TSAR-2022 Shared Task on Multilingual Lexical Simplification

    Authors: Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, Matthew Shardlow, Kai North, Marcos Zampieri

    Abstract: We report findings of the TSAR-2022 shared task on multilingual lexical simplification, organized as part of the Workshop on Text Simplification, Accessibility, and Readability TSAR-2022 held in conjunction with EMNLP 2022. The task called on the Natural Language Processing research community to contribute methods to advance the state of the art in multilingual lexical simplification for English…

    Submitted 6 February, 2023; originally announced February 2023.

  23. arXiv:2301.12534  [pdf, other]

    cs.CL cs.CY cs.LG

    Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive

    Authors: Tharindu Cyril Weerasooriya, Sujan Dutta, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh

    Abstract: Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model…

    Submitted 9 November, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: Accepted to appear at EMNLP 2023

  24. arXiv:2212.00851  [pdf]

    cs.CL cs.AI cs.LG cs.SI

    SOLD: Sinhala Offensive Language Dataset

    Authors: Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri

    Abstract: The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (…

    Submitted 28 March, 2024; v1 submitted 1 December, 2022; originally announced December 2022.

    Comments: Accepted to Language Resources and Evaluation, Springer

  25. arXiv:2211.12570  [pdf]

    cs.CL cs.AI cs.CY cs.LG cs.SI

    Predicting the Type and Target of Offensive Social Media Posts in Marathi

    Authors: Marcos Zampieri, Tharindu Ranasinghe, Mrinal Chaudhari, Saurabh Gaikwad, Prajwal Krishna, Mayuresh Nene, Shrunali Paygude

    Abstract: The presence of offensive language on social media is very common, motivating platforms to invest in strategies to make communities safer. This includes developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high resource langu…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: This is a preprint of an article published in the Journal of Intelligent Information Systems, Springer. The final authenticated version is available online at https://link.springer.com/article/10.1007/s13278-022-00906-8

  26. arXiv:2211.10163  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG cs.SI

    Overview of the HASOC Subtrack at FIRE 2022: Offensive Language Identification in Marathi

    Authors: Tharindu Ranasinghe, Kai North, Damith Premasiri, Marcos Zampieri

    Abstract: The spread of offensive content online has become a reason for great concern in recent years, motivating researchers to develop robust systems capable of identifying such content automatically. With the goal of carrying out a fair evaluation of these systems, several international competitions have been organized, providing the community with important benchmark data and evaluation methods for…

    Submitted 18 November, 2022; originally announced November 2022.

  27. arXiv:2209.09034  [pdf, other]

    cs.CL cs.AI

    ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

    Authors: Kai North, Marcos Zampieri, Tharindu Ranasinghe

    Abstract: Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions.…

    Submitted 9 February, 2024; v1 submitted 19 September, 2022; originally announced September 2022.

  28. arXiv:2209.05301  [pdf, ps, other]

    cs.CL

    Lexical Simplification Benchmarks for English, Portuguese, and Spanish

    Authors: Sanja Stajner, Daniel Ferres, Matthew Shardlow, Kai North, Marcos Zampieri, Horacio Saggion

    Abstract: Even in highly-developed countries, as many as 15-30% of the population can only understand texts written using a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task th…

    Submitted 12 September, 2022; originally announced September 2022.

  29. arXiv:2112.09301  [pdf]

    cs.CL cs.AI cs.SI

    Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

    Authors: Thomas Mandl, Sandip Modha, Gautam Kishore Shahi, Hiren Madhu, Shrey Satapara, Prasenjit Majumder, Johannes Schaefer, Tharindu Ranasinghe, Marcos Zampieri, Durgesh Nandini, Amit Kumar Jaiswal

    Abstract: The spread of offensive content online such as hate speech poses a growing societal problem. AI tools are necessary for supporting the moderation process at online platforms. For the evaluation of these identification tools, continuous experimentation with data sets in different languages is necessary. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to develop…

    Submitted 16 December, 2021; originally announced December 2021.

  30. arXiv:2109.05074  [pdf, other]

    cs.CL cs.AI cs.LG cs.SI

    FBERT: A Neural Transformer for Identifying Offensive Content

    Authors: Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia

    Abstract: Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive…

    Submitted 10 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP Findings

  31. arXiv:2109.03552  [pdf, other]

    cs.CL cs.AI cs.LG cs.NE cs.SI

    Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

    Authors: Saurabh Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan

    Abstract: The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind co…

    Submitted 8 September, 2021; originally announced September 2021.

    Comments: Accepted to RANLP 2021

  32. An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

    Authors: Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, Emily Hill

    Abstract: This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine-learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: 18 pages. arXiv admin note: text overlap with arXiv:2007.08033

    Journal ref: in IEEE Transactions on Software Engineering, vol. , no. 01, pp. 1-1, 5555

  33. arXiv:2108.00057  [pdf, other]

    cs.CL cs.AI cs.NE cs.SI

    WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments

    Authors: Skye Morgan, Tharindu Ranasinghe, Marcos Zampieri

    Abstract: This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval-2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning.…

    Submitted 30 July, 2021; originally announced August 2021.

    Comments: Accepted to GermEval-2021

  34. arXiv:2106.00473  [pdf, ps, other]

    cs.CL

    SemEval-2021 Task 1: Lexical Complexity Prediction

    Authors: Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, Marcos Zampieri

    Abstract: This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five point Likert scale. SemEval-2021 Task 1 featur…

    Submitted 1 June, 2021; originally announced June 2021.

  35. arXiv:2105.14888  [pdf, other]

    cs.CL

    An Exploratory Analysis of the Relation Between Offensive Language and Mental Health

    Authors: Ana-Maria Bucur, Marcos Zampieri, Liviu P. Dinu

    Abstract: In this paper, we analyze the interplay between the use of offensive language and mental health. We acquired publicly available datasets created for offensive language identification and depression detection and we train computational models to compare the use of offensive language in social media posts written by groups of individuals with and without self-reported depression diagnosis. We also l…

    Submitted 24 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

    Comments: Accepted to Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

  36. arXiv:2105.08780  [pdf, ps, other]

    cs.CL

    LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction

    Authors: Abhinandan Desai, Kai North, Marcos Zampieri, Christopher M. Homan

    Abstract: This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five point Likert scale. Our system uses logistic regression and a wid…

    Submitted 18 May, 2021; originally announced May 2021.

  37. arXiv:2105.05996  [pdf, other]

    cs.CL cs.AI cs.LG cs.SI

    Multilingual Offensive Language Identification for Low-resource Languages

    Authors: Tharindu Ranasinghe, Marcos Zampieri

    Abstract: Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain Engl…

    Submitted 20 May, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

    Comments: Accepted to ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). This is an extended version of a paper accepted to EMNLP. arXiv admin note: substantial text overlap with arXiv:2010.05324

  38. arXiv:2104.04630  [pdf, other]

    cs.CL cs.AI cs.LG

    WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

    Authors: Tharindu Ranasinghe, Diptanu Sarkar, Marcos Zampieri, Alexander Ororbia

    Abstract: In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic pos…

    Submitted 27 May, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Accepted to SemEval-2021

  39. arXiv:2104.00041  [pdf, other]

    cs.CL

    Domain-specific MT for Low-resource Languages: The case of Bambara-French

    Authors: Allahsera Auguste Tapo, Michael Leventhal, Sarah Luger, Christopher M. Homan, Marcos Zampieri

    Abstract: Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data. In this paper we address the issue of domain-specific MT for Bambara, an under-resourced Mande language spoken in Mali. We present the first domain-specific parallel dataset for MT of Bambara into and from French. We discuss challenges in working with small quantities…

    Submitted 31 March, 2021; originally announced April 2021.

  40. arXiv:2103.05552  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    Comparing Approaches to Dravidian Language Identification

    Authors: Tommi Jauhiainen, Tharindu Ranasinghe, Marcos Zampieri

    Abstract: This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at the VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages: Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Ba…

    Submitted 9 March, 2021; originally announced March 2021.

    Comments: Accepted to VarDial 2021 @ EACL 2021

  41. arXiv:2102.09665  [pdf, other]

    cs.CL cs.AI cs.LG

    MUDES: Multilingual Detection of Offensive Spans

    Authors: Tharindu Ranasinghe, Marcos Zampieri

    Abstract: The interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post level annotations. However, identifying offensive spans is useful in many ways. To help cope with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for…

    Submitted 18 April, 2021; v1 submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to NAACL-HLT 2021

  42. Predicting Lexical Complexity in English Texts: The Complex 2.0 Dataset

    Authors: Matthew Shardlow, Richard Evans, Marcos Zampieri

    Abstract: Identifying words which may cause difficulty for a reader is an essential step in most lexical text simplification systems prior to lexical substitution and can also be used for assessing the readability of a text. This task is commonly referred to as Complex Word Identification (CWI) and is often modelled as a supervised classification problem. For training such systems, annotated datasets in whi…

    Submitted 3 November, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

    Journal ref: Lang Resources and Evaluation 56, 1153-1194 (2022)

  43. arXiv:2011.05284  [pdf, other]

    cs.CL

    Neural Machine Translation for Extremely Low-Resource African Languages: A Case Study on Bambara

    Authors: Allahsera Auguste Tapo, Bakary Coulibaly, Sébastien Diarra, Christopher Homan, Julia Kreutzer, Sarah Luger, Arthur Nagashima, Marcos Zampieri, Michael Leventhal

    Abstract: Low-resource languages present unique challenges to (neural) machine translation. We discuss the case of Bambara, a Mande language for which training data is scarce and requires significant amounts of pre-processing. More than the linguistic situation of Bambara itself, the socio-cultural context within which Bambara speakers live poses challenges for automated processing of this language. In this…

    Submitted 10 November, 2020; originally announced November 2020.

  44. arXiv:2011.00559  [pdf, other]

    cs.CL cs.AI cs.LG

    WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

    Authors: Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

    Abstract: This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020. The HASOC 2020 organizers provided participants with annotated datasets containing code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English). We participated in task 1: Offensive comment identification in Code-mixed…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted to FIRE 2020

  45. arXiv:2010.05324  [pdf, other]

    cs.CL cs.LG

    Multilingual Offensive Language Identification with Cross-lingual Embeddings

    Authors: Tharindu Ranasinghe, Marcos Zampieri

    Abstract: Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English partially because most annotated datasets available contain Engli…

    Submitted 11 October, 2020; originally announced October 2020.

    Comments: Accepted to EMNLP 2020

  46. arXiv:2006.07235  [pdf, ps, other]

    cs.CL

    SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

    Authors: Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, Çağrı Çöltekin

    Abstract: We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages: English, Arabic, Danish, Greek, and Turkish for Subtask A. In addition, En…

    Submitted 30 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: Proceedings of the International Workshop on Semantic Evaluation (SemEval-2020)

    MSC Class: 68T50; 68T07 ACM Class: I.2.7

  47. arXiv:2005.12443  [pdf, other]

    cs.CL

    MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

    Authors: Farhad Akhbardeh, Travis Desell, Marcos Zampieri

    Abstract: Maintenance record logbooks are an emerging text type in NLP. They typically consist of free text documents with many domain specific technical terms, abbreviations, as well as non-standard spelling and grammar, which poses difficulties to NLP pipelines trained on standard corpora. Analyzing and annotating such documents is of particular importance in the development of predictive maintenance syst…

    Submitted 25 May, 2020; originally announced May 2020.

  48. arXiv:2004.14454  [pdf, other]

    cs.CL

    SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

    Authors: Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov

    Abstract: The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited…

    Submitted 24 September, 2021; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: offensive language, hate speech, cyberbullying, cyber-aggression, taxonomy for offensive language identification

    MSC Class: 68T50; 68T07 ACM Class: F.2.2; I.2.7

    Journal ref: ACL-2021 (Findings)

  49. arXiv:2004.00068  [pdf, ps, other]

    cs.CL

    Assessing Human Translations from French to Bambara for Machine Learning: a Pilot Study

    Authors: Michael Leventhal, Allahsera Tapo, Sarah Luger, Marcos Zampieri, Christopher M. Homan

    Abstract: We present novel methods for assessing the quality of human-translated aligned texts for learning machine translation models of under-resourced languages. Malian university students translated French texts, producing either written or oral translations to Bambara. Our results suggest that similar quality can be obtained from either written or spoken translations for certain kinds of texts. They al…

    Submitted 31 March, 2020; originally announced April 2020.

  50. arXiv:2003.07459  [pdf, other]

    cs.CL

    Offensive Language Identification in Greek

    Authors: Zeses Pitenis, Marcos Zampieri, Tharindu Ranasinghe

    Abstract: As offensive language has become a rising issue for online communities and social media platforms, researchers have been investigating ways of coping with abusive content and developing systems to detect its different types: cyberbullying, hate speech, aggression, etc. With a few notable exceptions, most research on this topic so far has dealt with English. This is mostly due to the availability o…

    Submitted 18 March, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

    Comments: Accepted to LREC 2020