Showing 1–33 of 33 results for author: Muhammad, H

  1. arXiv:2407.10152

    cs.CL

    Mitigating Translationese in Low-resource Languages: The Storyboard Approach

    Authors: Garry Kuwanto, Eno-Abasi E. Urua, Priscilla Amondi Amuok, Shamsuddeen Hassan Muhammad, Anuoluwapo Aremu, Verrah Otiende, Loice Emma Nanyanga, Teresiah W. Nyoike, Aniefon D. Akpan, Nsima Ab Udouboh, Idongesit Udeme Archibong, Idara Effiong Moses, Ifeoluwatayo A. Ige, Benjamin Ajibade, Olumide Benjamin Awokoya, Idris Abdulmumin, Saminu Mohammad Aliyu, Ruqayya Nasir Iro, Ibrahim Said Ahmad, Deontae Smith, Praise-EL Michaels, David Ifeoluwa Adelani, Derry Tanti Wijaya, Anietie Andy

    Abstract: Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach for data collection by leveraging storyboards to elicit more fluent a…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: published at LREC-COLING 2024

    ACM Class: I.2.7

    Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 11349-11360

  2. arXiv:2406.09948

    cs.CL

    BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

    Authors: Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh

    Abstract: Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food…

    Submitted 14 June, 2024; originally announced June 2024.

  3. arXiv:2406.03368

    cs.CL cs.AI

    IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

    Authors: David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, et al. (1 additional author not shown)

    Abstract: Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoB…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Under review

  4. arXiv:2405.05376

    cs.CL

    Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

    Authors: Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur, Stephen D. Richardson, Kenton Murray

    Abstract: A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We pr…

    Submitted 13 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: NAACL 2024

  5. arXiv:2403.18933

    cs.CL

    SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages

    Authors: Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M. Mohammad

    Abstract: We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. The…

    Submitted 17 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: SemEval 2024 Task Description Paper. arXiv admin note: text overlap with arXiv:2402.08638

  6. arXiv:2402.08638

    cs.CL

    SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

    Authors: Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, et al. (2 additional authors not shown)

    Abstract: Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dat…

    Submitted 31 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Comments: Accepted to the Findings of ACL 2024

  7. arXiv:2401.13133

    cs.CL cs.SI

    Analyzing COVID-19 Vaccination Sentiments in Nigerian Cyberspace: Insights from a Manually Annotated Twitter Dataset

    Authors: Ibrahim Said Ahmad, Lukman Jibril Aliyu, Abubakar Auwal Khalid, Saminu Muhammad Aliyu, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Bala Mairiga Abduljalil, Bello Shehu Bello, Amina Imam Abubakar

    Abstract: Numerous successes have been achieved in combating the COVID-19 pandemic, initially using various precautionary measures like lockdowns, social distancing, and the use of face masks. More recently, various vaccinations have been developed to aid in the prevention or reduction of the severity of the COVID-19 infection. Despite the effectiveness of the precautionary measures and the vaccines, there…

    Submitted 23 January, 2024; originally announced January 2024.

  8. arXiv:2311.12179

    cs.CL

    Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages

    Authors: Idris Abdulmumin, Auwal Abubakar Khalid, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Lukman Jibril Aliyu, Babangida Sani, Bala Mairiga Abduljalil, Sani Ahmad Hassan

    Abstract: The importance of qualitative parallel data in machine translation has long been determined but it has always been very difficult to obtain such in sufficient quantity for the majority of world languages, mainly because of the associated cost and also the lack of accessibility to these languages. Despite the potential for obtaining parallel datasets from online articles using automatic approaches,…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: To appear in the proceedings of ICCAIT 2023. 6 pages, 2 figures

  9. arXiv:2311.09828

    cs.CL

    AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

    Authors: Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Tosin Adewumi, Hamam Mokayed, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, et al. (33 additional authors not shown)

    Abstract: Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of eval…

    Submitted 23 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted by NAACL 2024

  10. arXiv:2308.09768

    cs.CL

    NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

    Authors: Anuoluwapo Aremu, Jesujoba O. Alabi, Daud Abolade, Nkechinyere F. Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani

    Abstract: In this paper, we create NaijaRC: a new multi-choice Reading Comprehension dataset for three native Nigerian languages that is based on high-school reading comprehension examinations. We provide baseline results by performing cross-lingual transfer using the existing English RACE and Belebele training datasets based on a pre-trained encoder-only model. Additionally, we provide results by prompting large…

    Submitted 19 May, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted to AfricaNLP Workshop at ICLR 2024 (non-archival)

  11. arXiv:2305.17690

    cs.CL

    HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

    Authors: Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi

    Abstract: This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fa…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted at ACL 2023 as a long paper (Findings)

  12. arXiv:2305.13989

    cs.CL

    MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

    Authors: Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate, et al. (19 additional authors not shown)

    Abstract: In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-l…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 (Main conference)

  13. arXiv:2305.06897

    cs.CL cs.AI cs.IR

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    Authors: Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira, Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia Siro, Steven Arthur, et al. (27 additional authors not shown)

    Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create…

    Submitted 11 May, 2023; originally announced May 2023.

  14. arXiv:2305.00076

    cs.CL

    HausaNLP at SemEval-2023 Task 10: Transfer Learning, Synthetic Data and Side-Information for Multi-Level Sexism Classification

    Authors: Saminu Mohammad Aliyu, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Saheed Abdullahi Salahudeen, Aliyu Yusuf, Falalu Ibrahim Lawan

    Abstract: We present the findings of our participation in the SemEval-2023 Task 10: Explainable Detection of Online Sexism (EDOS) task, a shared task on offensive language (sexism) detection on English Gab and Reddit datasets. We investigated the effects of transferring two language models: XLM-T (sentiment classification) and HateBERT (same domain -- Reddit) for multi-level classification into Sexist or not…

    Submitted 28 April, 2023; originally announced May 2023.

    Comments: 5 pages, 3 figures

  15. arXiv:2304.09972

    cs.CL

    MasakhaNEWS: News Topic Classification for African languages

    Authors: David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, sana al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, et al. (40 additional authors not shown)

    Abstract: African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks. While there are individual language specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African…

    Submitted 20 September, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: Accepted to IJCNLP-AACL 2023 (main conference)

  16. arXiv:2304.06845

    cs.CL

    SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)

    Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Seid Muhie Yimam, David Ifeoluwa Adelani, Ibrahim Sa'id Ahmad, Nedjma Ousidhoum, Abinew Ayele, Saif M. Mohammad, Meriem Beloucif, Sebastian Ruder

    Abstract: We present the first Africentric SemEval Shared task, Sentiment Analysis for African Languages (AfriSenti-SemEval). The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023. AfriSenti-SemEval is a sentiment classification challenge in 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oro…

    Submitted 1 May, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: 19 pages, 5 figures, 6 tables

  17. arXiv:2302.08956

    cs.CL

    AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

    Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dário Mário António Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, et al. (1 additional author not shown)

    Abstract: Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti…

    Submitted 4 November, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

    Comments: 14 pages, 3 Figures, 10 Tables

  18. arXiv:2211.15262

    cs.CL

    HERDPhobia: A Dataset for Hate Speech against Fulani in Nigeria

    Authors: Saminu Mohammad Aliyu, Gregory Maksha Wajiga, Muhammad Murtala, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Ibrahim Said Ahmad

    Abstract: Social media platforms allow users to freely share their opinions about issues or anything they feel like. However, they also make it easier to spread hate and abusive content. The Fulani ethnic group has been the victim of this unfortunate phenomenon. This paper introduces HERDPhobia, the first annotated hate speech dataset on Fulani herders in Nigeria, in three languages: English, Nigerian…

    Submitted 28 November, 2022; originally announced November 2022.

    Comments: To appear in the Proceedings of the Sixth Workshop on Widening Natural Language Processing at EMNLP2022

  19. arXiv:2211.05100

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  20. arXiv:2210.12391

    cs.CL

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    Authors: David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, et al. (20 additional authors not shown)

    Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity r…

    Submitted 15 November, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 (updated Github link)

  21. arXiv:2210.10692

    cs.CL

    Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

    Authors: Idris Abdulmumin, Michael Beukman, Jesujoba O. Alabi, Chris Emezue, Everlyn Asiko, Tosin Adewumi, Shamsuddeen Hassan Muhammad, Mofetoluwa Adeyemi, Oreen Yousuf, Sahib Singh, Tajuddeen Rabiu Gwadabe

    Abstract: We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier that was built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e. high-quality parallel sentences) from a gold-standar…

    Submitted 20 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted at the Seventh Conference on Machine Translation (WMT22)

  22. Deep Sequence Models for Text Classification Tasks

    Authors: Saheed Salahudeen Abdullahi, Sun Yiming, Shamsuddeen Hassan Muhammad, Abdulrasheed Mustapha, Ahmad Muhammad Aminu, Abdulkadir Abdullahi, Musa Bello, Saminu Mohammad Aliyu

    Abstract: The exponential growth of data generated on the Internet in the current information age is a driving force for the digital economy. Extraction of information is the major value in accumulated big data. Big data dependency on statistical analysis and hand-engineered rules machine learning algorithms are overwhelmed with vast complexities inherent in human languages. Natural Language Processing (…

    Submitted 18 July, 2022; originally announced July 2022.

    ACM Class: I.2.7

    Journal ref: In: 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE). IEEE, 2021. p. 1-6

  23. arXiv:2205.06512

    cs.CV

    FontNet: Closing the gap to font designer performance in font synthesis

    Authors: Ammar Ul Hassan Muhammad, Jaeyoung Choi

    Abstract: Font synthesis has been a very active topic in recent years because manual font design requires domain expertise and is a labor-intensive and time-consuming job. While remarkably successful, existing methods for font synthesis have major shortcomings; they require finetuning for unobserved font style with large reference images, the recent few-shot font synthesis methods are either designed for sp…

    Submitted 13 May, 2022; originally announced May 2022.

    Comments: 5 pages, 2 Figures, 3 Tables. Accepted paper for AI4CC 2022 (https://ai4cc.net/)

  24. arXiv:2205.02022

    cs.CL

    A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

    Authors: David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, et al. (20 additional authors not shown)

    Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models…

    Submitted 22 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022 (added evaluation data for amh, kin, nya, sna, xho)

  25. arXiv:2205.01133

    cs.CL cs.CV cs.LG

    Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation

    Authors: Idris Abdulmumin, Satya Ranjan Dash, Musa Abdullahi Dawud, Shantipriya Parida, Shamsuddeen Hassan Muhammad, Ibrahim Sa'id Ahmad, Subhadarshi Panda, Ondřej Bojar, Bashir Shehu Galadanci, Bello Shehu Bello

    Abstract: Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations. The visual information can serve as a valuable piece of context information to decrease the ambiguity of input sentences. Despite the increasing popularity of such a technique, good and sizeable datasets are scarce, limiting the full extent of their potential. Hausa, a Chadic languag…

    Submitted 6 May, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

    Comments: Accepted at Language Resources and Evaluation Conference 2022 (LREC2022)

  26. arXiv:2201.08277

    cs.CL cs.AI

    NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

    Authors: Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, Pavel Brazdil

    Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin…

    Submitted 18 June, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: Submitted to LREC 2022, 13 pages, 2 figures

  27. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  28. arXiv:2101.11085

    cs.CV

    EPIC-Survival: End-to-end Part Inferred Clustering for Survival Analysis, Featuring Prognostic Stratification Boosting

    Authors: Hassan Muhammad, Chensu Xie, Carlie S. Sigel, Michael Doukas, Lindsay Alpert, William R. Jarnagin, Amber Simpson, Thomas J. Fuchs

    Abstract: Histopathology-based survival modelling has two major hurdles. Firstly, a well-performing survival model has minimal clinical application if it does not contribute to the stratification of a cancer patient cohort into different risk groups, preferably driven by histologic morphologies. In the clinical setting, individuals are not given specific prognostic predictions, but are rather predicted to l…

    Submitted 9 July, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

    Comments: co-first authors: Hassan Muhammad and Chensu Xie

  29. arXiv:2010.02353

    cs.CL cs.AI cs.LG

    Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

    Authors: Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, et al. (23 additional authors not shown)

    Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communicat…

    Submitted 6 November, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020; updated benchmarks

  30. arXiv:1903.05257

    cs.CV q-bio.TO stat.ML

    Towards Unsupervised Cancer Subtyping: Predicting Prognosis Using A Histologic Visual Dictionary

    Authors: Hassan Muhammad, Carlie S. Sigel, Gabriele Campanella, Thomas Boerner, Linda M. Pak, Stefan Büttner, Jan N. M. IJzermans, Bas Groot Koerkamp, Michael Doukas, William R. Jarnagin, Amber Simpson, Thomas J. Fuchs

    Abstract: Unlike common cancers, such as those of the prostate and breast, tumor grading in rare cancers is difficult and largely undefined because of small sample sizes, the sheer volume of time needed to undertake such a task, and the inherent difficulty of extracting human-observed patterns. One of the most challenging examples is intrahepatic cholangiocarcinoma (ICC), a primary liver cancer arising f…

    Submitted 12 March, 2019; originally announced March 2019.

    Comments: 10 pages, 6 figures

  31. arXiv:1608.00842

    cs.LG

    Mitochondria-based Renal Cell Carcinoma Subtyping: Learning from Deep vs. Flat Feature Representations

    Authors: Peter J. Schüffler, Judy Sarungbam, Hassan Muhammad, Ed Reznik, Satish K. Tickoo, Thomas J. Fuchs

    Abstract: Accurate subtyping of renal cell carcinoma (RCC) is of crucial importance for understanding disease progression and for making informed treatment decisions. New discoveries of significant alterations to mitochondria between subtypes make immunohistochemical (IHC) staining based image classification an imperative. Until now, accurate quantification and subtyping was made impossible by huge IHC vari…

    Submitted 2 August, 2016; originally announced August 2016.

    Comments: Presented at 2016 Machine Learning and Healthcare Conference (MLHC 2016), Los Angeles, CA

  32. arXiv:0804.4750

    cs.RO

    The Numerical Control Design for a Pair of Dubins Vehicles

    Authors: Heru Tjahjana, Iwan Pranoto, Hari Muhammad, J. Naiborhu, Miswanto

    Abstract: In this paper, a model of a pair of Dubins vehicles is considered. The vehicles move from an initial position and orientation to a final position and orientation. Along the motion, the two vehicles are not allowed to collide; however, the two vehicles cannot be too far from each other. The optimal control of the vehicle is found using Pontryagin's Maximum Principle (PMP). This PMP leads to a Hamiltonian sy…

    Submitted 30 April, 2008; originally announced April 2008.

    Comments: Uploaded by ICIUS2007 Conference Organizer on behalf of the author(s). 3 pages, 2 figures

    ACM Class: I.2.8

    Journal ref: Proceedings of the International Conference on Intelligent Unmanned System (ICIUS 2007), Bali, Indonesia, October 24-25, 2007, Paper No. ICIUS2007-C003

  33. arXiv:0804.3879

    cs.RO

    Effects of Leaders Position and Shape on Aerodynamic Performances of V Flight Formation

    Authors: H. P. Thien, M. A. Moelyadi, H. Muhammad

    Abstract: The influences of the leader in a group of V flight formation are dealt with. The investigation is focused on the effect of its position and shape on the aerodynamic performance of a given V flight formation. Vortices generated at the wing tip of the leader move downstream, forming a pair of oppositely rotating line vortices. These vortices are generally undesirable because they create a downwash that…

    Submitted 24 April, 2008; originally announced April 2008.

    Comments: Uploaded by ICIUS2007 Conference Organizer on behalf of the author(s). 7 pages, 15 figures

    ACM Class: J.2

    Journal ref: Proceedings of the International Conference on Intelligent Unmanned System (ICIUS 2007), Bali, Indonesia, October 24-25, 2007, Paper No. ICIUS2007-A008