-
Mitigating Translationese in Low-resource Languages: The Storyboard Approach
Authors:
Garry Kuwanto,
Eno-Abasi E. Urua,
Priscilla Amondi Amuok,
Shamsuddeen Hassan Muhammad,
Anuoluwapo Aremu,
Verrah Otiende,
Loice Emma Nanyanga,
Teresiah W. Nyoike,
Aniefon D. Akpan,
Nsima Ab Udouboh,
Idongesit Udeme Archibong,
Idara Effiong Moses,
Ifeoluwatayo A. Ige,
Benjamin Ajibade,
Olumide Benjamin Awokoya,
Idris Abdulmumin,
Saminu Mohammad Aliyu,
Ruqayya Nasir Iro,
Ibrahim Said Ahmad,
Deontae Smith,
Praise-EL Michaels,
David Ifeoluwa Adelani,
Derry Tanti Wijaya,
Anietie Andy
Abstract:
Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent a…
▽ More
Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency. Human annotators and quantitative metrics were used to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates worse accuracy but better fluency in the language focused.
△ Less
Submitted 14 July, 2024;
originally announced July 2024.
-
A multilingual dataset for offensive language and hate speech detection for hausa, yoruba and igbo languages
Authors:
Saminu Mohammad Aliyu,
Gregory Maksha Wajiga,
Muhammad Murtala
Abstract:
The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for…
▽ More
The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90\%. To further support research in offensive language detection, we plan to make the dataset and our models publicly available.
△ Less
Submitted 5 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
Analyzing COVID-19 Vaccination Sentiments in Nigerian Cyberspace: Insights from a Manually Annotated Twitter Dataset
Authors:
Ibrahim Said Ahmad,
Lukman Jibril Aliyu,
Abubakar Auwal Khalid,
Saminu Muhammad Aliyu,
Shamsuddeen Hassan Muhammad,
Idris Abdulmumin,
Bala Mairiga Abduljalil,
Bello Shehu Bello,
Amina Imam Abubakar
Abstract:
Numerous successes have been achieved in combating the COVID-19 pandemic, initially using various precautionary measures like lockdowns, social distancing, and the use of face masks. More recently, various vaccinations have been developed to aid in the prevention or reduction of the severity of the COVID-19 infection. Despite the effectiveness of the precautionary measures and the vaccines, there…
▽ More
Numerous successes have been achieved in combating the COVID-19 pandemic, initially using various precautionary measures like lockdowns, social distancing, and the use of face masks. More recently, various vaccinations have been developed to aid in the prevention or reduction of the severity of the COVID-19 infection. Despite the effectiveness of the precautionary measures and the vaccines, there are several controversies that are massively shared on social media platforms like Twitter. In this paper, we explore the use of state-of-the-art transformer-based language models to study people's acceptance of vaccines in Nigeria. We developed a novel dataset by crawling multi-lingual tweets using relevant hashtags and keywords. Our analysis and visualizations revealed that most tweets expressed neutral sentiments about COVID-19 vaccines, with some individuals expressing positive views, and there was no strong preference for specific vaccine types, although Moderna received slightly more positive sentiment. We also found out that fine-tuning a pre-trained LLM with an appropriate dataset can yield competitive results, even if the LLM was not initially pre-trained on the specific language of that dataset.
△ Less
Submitted 23 January, 2024;
originally announced January 2024.
-
HausaNLP at SemEval-2023 Task 10: Transfer Learning, Synthetic Data and Side-Information for Multi-Level Sexism Classification
Authors:
Saminu Mohammad Aliyu,
Idris Abdulmumin,
Shamsuddeen Hassan Muhammad,
Ibrahim Said Ahmad,
Saheed Abdullahi Salahudeen,
Aliyu Yusuf,
Falalu Ibrahim Lawan
Abstract:
We present the findings of our participation in the SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) task, a shared task on offensive language (sexism) detection on English Gab and Reddit dataset. We investigated the effects of transferring two language models: XLM-T (sentiment classification) and HateBERT (same domain -- Reddit) for multi-level classification into Sexist or not…
▽ More
We present the findings of our participation in the SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) task, a shared task on offensive language (sexism) detection on English Gab and Reddit dataset. We investigated the effects of transferring two language models: XLM-T (sentiment classification) and HateBERT (same domain -- Reddit) for multi-level classification into Sexist or not Sexist, and other subsequent sub-classifications of the sexist data. We also use synthetic classification of unlabelled dataset and intermediary class information to maximize the performance of our models. We submitted a system in Task A, and it ranked 49th with F1-score of 0.82. This result showed to be competitive as it only under-performed the best system by 0.052% F1-score.
△ Less
Submitted 28 April, 2023;
originally announced May 2023.
-
HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource TweetData for Sentiment Analysis
Authors:
Saheed Abdullahi Salahudeen,
Falalu Ibrahim Lawan,
Ahmad Mustapha Wali,
Amina Abubakar Imam,
Aliyu Rabiu Shuaibu,
Aliyu Yusuf,
Nur Bala Rabiu,
Musa Bello,
Shamsuddeen Umaru Adamu,
Saminu Mohammad Aliyu,
Murja Sani Gadanya,
Sanah Abdullahi Muaz,
Mahmoud Said Ahmad,
Abdulkadir Abdullahi,
Abdulmalik Yusuf Jamoh
Abstract:
We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using Twitter dataset. The task featured three subtasks; subtask A is monolingual sentiment classification with 12 tracks which are all monolingual languages, subtask B is multilingual sentiment classification using the tracks in subtask A and subtask C is a zero-shot sentiment c…
▽ More
We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using Twitter dataset. The task featured three subtasks; subtask A is monolingual sentiment classification with 12 tracks which are all monolingual languages, subtask B is multilingual sentiment classification using the tracks in subtask A and subtask C is a zero-shot sentiment classification. We present the results and findings of subtask A, subtask B and subtask C. We also release the code on github. Our goal is to leverage low-resource tweet data using pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT) and BERT models for sentiment analysis of 14 African languages. The datasets for these subtasks consists of a gold standard multi-class labeled Twitter datasets from these languages. Our results demonstrate that Afro-xlmr-large model performed better compared to the other models in most of the languages datasets. Similarly, Nigerian languages: Hausa, Igbo, and Yoruba achieved better performance compared to other languages and this can be attributed to the higher volume of data present in the languages.
△ Less
Submitted 26 April, 2023;
originally announced April 2023.
-
HERDPhobia: A Dataset for Hate Speech against Fulani in Nigeria
Authors:
Saminu Mohammad Aliyu,
Gregory Maksha Wajiga,
Muhammad Murtala,
Shamsuddeen Hassan Muhammad,
Idris Abdulmumin,
Ibrahim Said Ahmad
Abstract:
Social media platforms allow users to freely share their opinions about issues or anything they feel like. However, they also make it easier to spread hate and abusive content. The Fulani ethnic group has been the victim of this unfortunate phenomenon. This paper introduces the HERDPhobia - the first annotated hate speech dataset on Fulani herders in Nigeria - in three languages: English, Nigerian…
▽ More
Social media platforms allow users to freely share their opinions about issues or anything they feel like. However, they also make it easier to spread hate and abusive content. The Fulani ethnic group has been the victim of this unfortunate phenomenon. This paper introduces the HERDPhobia - the first annotated hate speech dataset on Fulani herders in Nigeria - in three languages: English, Nigerian-Pidgin, and Hausa. We present a benchmark experiment using pre-trained languages models to classify the tweets as either hateful or non-hateful. Our experiment shows that the XML-T model provides better performance with 99.83% weighted F1. We released the dataset at https://github.com/hausanlp/HERDPhobia for further research.
△ Less
Submitted 28 November, 2022;
originally announced November 2022.
-
Deep Sequence Models for Text Classification Tasks
Authors:
Saheed Salahudeen Abdullahi,
Sun Yiming,
Shamsuddeen Hassan Muhammad,
Abdulrasheed Mustapha,
Ahmad Muhammad Aminu,
Abdulkadir Abdullahi,
Musa Bello,
Saminu Mohammad Aliyu
Abstract:
The exponential growth of data generated on the Internet in the current information age is a driving force for the digital economy. Extraction of information is the major value in an accumulated big data. Big data dependency on statistical analysis and hand-engineered rules machine learning algorithms are overwhelmed with vast complexities inherent in human languages. Natural Language Processing (…
▽ More
The exponential growth of data generated on the Internet in the current information age is a driving force for the digital economy. Extraction of information is the major value in an accumulated big data. Big data dependency on statistical analysis and hand-engineered rules machine learning algorithms are overwhelmed with vast complexities inherent in human languages. Natural Language Processing (NLP) is equipping machines to understand these human diverse and complicated languages. Text Classification is an NLP task which automatically identifies patterns based on predefined or undefined labeled sets. Common text classification application includes information retrieval, modeling news topic, theme extraction, sentiment analysis, and spam detection. In texts, some sequences of words depend on the previous or next word sequences to make full meaning; this is a challenging dependency task that requires the machine to be able to store some previous important information to impact future meaning. Sequence models such as RNN, GRU, and LSTM is a breakthrough for tasks with long-range dependencies. As such, we applied these models to Binary and Multi-class classification. Results generated were excellent with most of the models performing within the range of 80% and 94%. However, this result is not exhaustive as we believe there is room for improvement if machines are to compete with humans.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.