Showing 1–50 of 90 results for author: Klakow, D

  1. arXiv:2407.08597  [pdf, other]

    cs.SE cs.LG

    Learning Program Behavioral Models from Synthesized Input-Output Pairs

    Authors: Tural Mammadov, Dietrich Klakow, Alexander Koller, Andreas Zeller

    Abstract: We introduce Modelizer - a novel framework that, given a black-box program, learns a _model from its input/output behavior_ using _neural machine translation_. The resulting model _mocks_ the original program: Given an input, the model predicts the output that would have been produced by the program. However, the model is also _reversible_ - that is, the model can predict the input that would have…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: 42 pages, 6 figures, 8 tables

    MSC Class: 68T07 (Primary); 68N30 (Secondary); 68Q42 ACM Class: D.2.5; D.2.7; I.2.6; F.1.1; F.4.3

  2. arXiv:2406.13842  [pdf, other]

    cs.CL cs.SD eess.AS

    Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control

    Authors: Alexander Blatt, Aravind Krishnan, Dietrich Klakow

    Abstract: Utilizing air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization, respectively speaker role detection (SRD) to divide the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches take on these tas…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  3. arXiv:2406.12618  [pdf, other]

    cs.CL

    From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP

    Authors: Marius Mosbach, Vagrant Gautam, Tomás Vergara-Browne, Dietrich Klakow, Mor Geva

    Abstract: Interpretability and analysis (IA) research is a growing subfield within NLP with the goal of developing a deeper understanding of the behavior or inner workings of NLP systems and methods. Despite growing interest in the subfield, a commonly voiced criticism is that it lacks actionable insights and therefore has little impact on NLP. In this paper, we seek to quantify the impact of IA research on…

    Submitted 18 June, 2024; originally announced June 2024.

  4. arXiv:2406.11598  [pdf, other]

    cs.CL cs.CY

    Understanding "Democratization" in NLP and ML Research

    Authors: Arjun Subramonian, Vagrant Gautam, Dietrich Klakow, Zeerak Talat

    Abstract: Recent improvements in natural language processing (NLP) and machine learning (ML) and increased mainstream adoption have led to researchers frequently discussing the "democratization" of artificial intelligence. In this paper, we seek to clarify how democratization is understood in NLP and ML publications, through large-scale mixed-methods analyses of papers using the keyword "democra*" published…

    Submitted 17 June, 2024; originally announced June 2024.

  5. arXiv:2406.09855  [pdf, other]

    cs.CL

    On the Encoding of Gender in Transformer-based ASR Representations

    Authors: Aravind Krishnan, Badr M. Abdullah, Dietrich Klakow

    Abstract: While existing literature relies on performance differences to uncover gender biases in ASR models, a deeper analysis is essential to understand how gender is encoded and utilized during transcript generation. This work investigates the encoding and utilization of gender in the latent representations of two transformer-based ASR models, Wav2Vec2 and HuBERT. Using linear erasure, we demonstrate the…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  6. arXiv:2404.14122  [pdf, other]

    cs.CL

    Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?

    Authors: Dawei Zhu, Pinzhen Chen, Miaoran Zhang, Barry Haddow, Xiaoyu Shen, Dietrich Klakow

    Abstract: Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of all these factors. We find that LLMs display strong translation capability after being fine-tuned on as fe…

    Submitted 22 April, 2024; originally announced April 2024.

  7. arXiv:2404.11288  [pdf, other]

    cs.CL

    A Preference-driven Paradigm for Enhanced Translation with Large Language Models

    Authors: Dawei Zhu, Sony Trenous, Xiaoyu Shen, Dietrich Klakow, Bill Byrne, Eva Hasler

    Abstract: Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a platea…

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted to NAACL 2024 (long, main)

  8. arXiv:2404.03134  [pdf, other]

    cs.CL cs.CY

    Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?

    Authors: Vagrant Gautam, Eileen Bingert, Dawei Zhu, Anne Lauscher, Dietrich Klakow

    Abstract: Robust, faithful and harm-free pronoun use for individuals is an important goal for language models as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun…

    Submitted 1 May, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  9. arXiv:2404.01490  [pdf, other]

    cs.CL

    AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness

    Authors: Miaoran Zhang, Mingyang Wang, Jesujoba O. Alabi, Dietrich Klakow

    Abstract: This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of lim…

    Submitted 7 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: SemEval-2024

  10. arXiv:2403.13737  [pdf, ps, other]

    cs.CL

    EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation

    Authors: Atnafu Lambebo Tonja, Israel Abebe Azime, Tadesse Destaw Belay, Mesay Gemeda Yigezu, Moges Ahmed Mehamed, Abinew Ali Ayele, Ebrahim Chekol Jibril, Michael Melese Woldeyohannis, Olga Kolesnikova, Philipp Slusallek, Dietrich Klakow, Shengwu Xiong, Seid Muhie Yimam

    Abstract: Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages are still lagging behind current state-of-the-art (SOTA) developments in the field of NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassin…

    Submitted 23 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-Coling 2024

  11. arXiv:2403.13537  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    What explains the success of cross-modal fine-tuning with ORCA?

    Authors: Paloma García-de-Herreros, Vagrant Gautam, Philipp Slusallek, Dietrich Klakow, Marius Mosbach

    Abstract: ORCA (Shen et al., 2023) is a recent technique for cross-modal fine-tuning, i.e., applying pre-trained transformer models to modalities beyond their training data. The technique consists primarily of training an embedder and fine-tuning the embedder and model. Despite its high performance on a variety of downstream tasks, we do not understand precisely how each of these components contribute to OR…

    Submitted 20 March, 2024; originally announced March 2024.

  12. arXiv:2402.13137  [pdf, other]

    cs.CL

    The Hidden Space of Transformer Language Adapters

    Authors: Jesujoba O. Alabi, Marius Mosbach, Matan Eyal, Dietrich Klakow, Mor Geva

    Abstract: We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source language the model was trained on, while the target language becomes pronounced only in the very last layers of the model. Moreover, the adaptation process is gradu…

    Submitted 10 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL 2024 (main conference)

  13. arXiv:2402.12976  [pdf, other]

    cs.CL cs.AI

    The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

    Authors: Miaoran Zhang, Vagrant Gautam, Mingyang Wang, Jesujoba O. Alabi, Xiaoyu Shen, Dietrich Klakow, Marius Mosbach

    Abstract: In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address…

    Submitted 7 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024 findings

  14. arXiv:2312.07338  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification

    Authors: Mohammed Maqsood Shaik, Dietrich Klakow, Badr M. Abdullah

    Abstract: Pre-trained Transformer-based speech models have shown striking performance when fine-tuned on various downstream tasks such as automatic speech recognition and spoken language identification (SLID). However, the problem of domain mismatch remains a challenge in this area, where the domain of the pre-training data might differ from that of the downstream labeled data used for fine-tuning. In multi…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Submitted to ICASSP 2024

  15. arXiv:2311.10920  [pdf, other]

    cs.CL cs.AI

    Understanding and Mitigating Classification Errors Through Interpretable Token Patterns

    Authors: Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, Jilles Vreeken

    Abstract: State-of-the-art NLP methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those patterns of tokens that distinguish correct and erroneous predictions as t…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: Extended abstract at BlackboxNLP'23

  16. arXiv:2311.04547  [pdf, other]

    cs.CL

    Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures

    Authors: Julius Steuer, Marius Mosbach, Dietrich Klakow

    Abstract: Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training…

    Submitted 8 November, 2023; originally announced November 2023.

  17. arXiv:2310.19403  [pdf, other]

    cs.CL

    A Lightweight Method to Generate Unanswerable Questions in English

    Authors: Vagrant Gautam, Miaoran Zhang, Dietrich Klakow

    Abstract: If a question cannot be answered with the available information, robust systems for question answering (QA) should know _not_ to answer. One way to build QA models that do this is with additional training data comprised of unanswerable questions, created either by employing annotators or through automated methods for unanswerable question generation. To show that the model complexity of existing a…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to Findings of EMNLP 2023

    ACM Class: I.2.7

  18. arXiv:2308.04885  [pdf, other]

    cs.CL

    Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists

    Authors: Julius Steuer, Badr Abdullah, Johann-Mattis List, Dietrich Klakow

    Abstract: We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word-forms in the analysi…

    Submitted 9 August, 2023; originally announced August 2023.

    Comments: Presented at SIGTYP at EACL 2023

  19. arXiv:2306.06892  [pdf, other]

    cs.CL

    On the N-gram Approximation of Pre-trained Language Models

    Authors: Aravind Krishnan, Jesujoba Alabi, Dietrich Klakow

    Abstract: Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks, particularly in low-resource settings. Nevertheless, their potential in Automatic Speech Recognition (ASR) remains largely unexplored. This study investigates the potential usage of PLMs for language modelling in ASR. We compare the application of large-scale text s…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  20. arXiv:2306.02405  [pdf, other]

    cs.CL

    An Information-Theoretic Analysis of Self-supervised Discrete Representations of Speech

    Authors: Badr M. Abdullah, Mohammed Maqsood Shaik, Bernd Möbius, Dietrich Klakow

    Abstract: Self-supervised representation learning for speech often involves a quantization step that transforms the acoustic input into discrete units. However, it remains unclear how to characterize the relationship between these discrete units and abstract phonetic categories such as phonemes. In this paper, we develop an information-theoretic framework whereby we represent each phonetic category as a dis…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted in Interspeech 2023

  21. arXiv:2305.17442  [pdf, other]

    cs.CL

    Weaker Than You Think: A Critical Look at Weakly Supervised Learning

    Authors: Dawei Zhu, Xiaoyu Shen, Marius Mosbach, Andreas Stephan, Dietrich Klakow

    Abstract: Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this pa…

    Submitted 17 September, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: ACL 2023, oral presentation

  22. arXiv:2305.16938  [pdf, other]

    cs.CL

    Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation

    Authors: Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, Yanai Elazar

    Abstract: Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the t…

    Submitted 30 May, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  23. arXiv:2305.13989  [pdf, other]

    cs.CL

    MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

    Authors: Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate, et al. (19 additional authors not shown)

    Abstract: In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the UD (universal dependencies) guidelines. We conducted extensive POS baseline experiments using conditional random field and several multilingual pre-trained language models. We applied various cross-l…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 (Main conference)

  24. arXiv:2303.17972  [pdf, other]

    cs.CL

    $\varepsilon$ KÚ <MASK>: Integrating Yorùbá cultural greetings into machine translation

    Authors: Idris Akinade, Jesujoba Alabi, David Adelani, Clement Odoje, Dietrich Klakow

    Abstract: This paper investigates the performance of massively multilingual neural machine translation (NMT) systems in translating Yorùbá greetings ($\varepsilon$ kú [MASK]), which are a big part of Yorùbá language and culture, into English. To evaluate these models, we present IkiniYorùbá, a Yorùbá-English translation dataset containing some Yorùbá greetings, and sample use cases. We analysed the performa…

    Submitted 24 April, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: C3NLP Workshop @ EACL2023 and AfricaNLP workshop @ ICLR2023

  25. arXiv:2301.03012  [pdf, other]

    cs.CL

    Analyzing the Representational Geometry of Acoustic Word Embeddings

    Authors: Badr M. Abdullah, Dietrich Klakow

    Abstract: Acoustic word embeddings (AWEs) are vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their use in speech technology applications such as spoken term discovery and keyword spotting, AWE models have been adopted as models of spoken-word processing in several cognitively motivated studies and have been shown to…

    Submitted 8 January, 2023; originally announced January 2023.

    Comments: In BlackboxNLP workshop, EMNLP 2022 [ oral presentation ]

  26. arXiv:2211.04054  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

    Authors: Juan Zuluaga-Gomez, Karel Veselý, Igor Szöke, Alexander Blatt, Petr Motlicek, Martin Kocour, Mickael Rigault, Khalid Choukri, Amrutha Prasad, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Claudia Cevenini, Pavel Kolčárek, Allan Tart, Jan Černocký, Dietrich Klakow

    Abstract: Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-h…

    Submitted 15 June, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Manuscript under review; The code is available at: https://github.com/idiap/atco2-corpus

  27. arXiv:2210.12391  [pdf, other]

    cs.CL

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    Authors: David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, et al. (20 additional authors not shown)

    Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity r…

    Submitted 15 November, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 (updated Github link)

  28. arXiv:2210.10252  [pdf, other]

    cs.CL cs.SD eess.AS

    A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification

    Authors: Anupama Chingacham, Vera Demberg, Dietrich Klakow

    Abstract: In noisy environments, speech can be hard to understand for humans. Spoken dialog systems can help to enhance the intelligibility of their output, either by modifying the speech synthesis (e.g., imitate Lombard speech) or by optimizing the language generation. We here focus on the second type of approach, by which an intended message is realized with words that are more intelligible in a specific…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  29. arXiv:2209.06633  [pdf, other]

    cs.CL eess.AS

    Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings

    Authors: Badr M. Abdullah, Bernd Möbius, Dietrich Klakow

    Abstract: Models of acoustic word embeddings (AWEs) learn to map variable-length spoken word segments onto fixed-dimensionality vector representations such that different acoustic exemplars of the same word are projected nearby in the embedding space. In addition to their speech technology applications, AWE models have been shown to predict human performance on a variety of auditory lexical processing tasks…

    Submitted 18 September, 2022; v1 submitted 14 September, 2022; originally announced September 2022.

    Comments: Accepted in INTERSPEECH 2022

  30. arXiv:2208.02402  [pdf, other]

    cs.CL cs.LG

    Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models

    Authors: Vilém Zouhar, Marius Mosbach, Dietrich Klakow

    Abstract: Although masked language models are highly performant and widely adopted by NLP practitioners, they can not be easily used for autoregressive language modelling (next word prediction and sequence probability estimation). We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g. concatenation) to obtain a richer co…

    Submitted 5 August, 2022; v1 submitted 3 August, 2022; originally announced August 2022.

    Comments: Submitted to PBML. Code & experiment repository: https://github.com/zouharvi/sentence-embd-fusion

  31. arXiv:2206.07841  [pdf, other]

    cs.CL

    TOKEN is a MASK: Few-shot Named Entity Recognition with Pre-trained Language Models

    Authors: Ali Davody, David Ifeoluwa Adelani, Thomas Kleinbauer, Dietrich Klakow

    Abstract: Transferring knowledge from one domain to another is of practical importance for many tasks in natural language processing, especially when the amount of available data in the target domain is limited. In this work, we propose a novel few-shot approach to domain adaptation in the context of Named Entity Recognition (NER). We propose a two-step approach consisting of a variable base module and a te…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Accepted to 25th International Conference on Text, Speech and Dialogue (TSD 2022)

  32. arXiv:2206.01476  [pdf, ps, other]

    cs.CL

    Task-Adaptive Pre-Training for Boosting Learning With Noisy Labels: A Study on Text Classification for African Languages

    Authors: Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, Dietrich Klakow

    Abstract: For high-resource languages like English, text classification is a well-studied task. The performance of modern NLP models easily achieves an accuracy of more than 90% in many standard datasets for text classification in English (Xie et al., 2019; Yang et al., 2019; Zaheer et al., 2020). However, text classification in low-resource languages is still challenging due to the lack of annotated data.…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: AfricaNLP Workshop @ ICLR2022

  33. arXiv:2205.14036  [pdf, other]

    cs.CL

    StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes

    Authors: Awantee Deshpande, Dana Ruiter, Marius Mosbach, Dietrich Klakow

    Abstract: Analyzing ethnic or religious bias is important for improving fairness, accountability, and transparency of natural language processing models. However, many techniques rely on human-compiled lists of bias terms, which are expensive to create and are limited in coverage. In this study, we present a fully data-driven pipeline for generating a knowledge graph (KG) of cultural knowledge and stereotyp…

    Submitted 27 May, 2022; originally announced May 2022.

    Comments: 12 pages, 2 figures, accepted as a long paper at WOAH at NAACL 2022

  34. arXiv:2205.10399  [pdf, other]

    cs.CL cs.LG

    Multilingual Normalization of Temporal Expressions with Masked Language Models

    Authors: Lukas Lange, Jannik Strötgen, Heike Adel, Dietrich Klakow

    Abstract: The detection and normalization of temporal expressions is an important task and preprocessing step for many applications. However, prior work on normalization is rule-based, which severely limits the applicability in real-world multilingual settings, due to the costly creation of new rules. We propose a novel neural method for normalizing temporal expressions based on masked language modeling. Ou…

    Submitted 10 February, 2023; v1 submitted 20 May, 2022; originally announced May 2022.

    Comments: Accepted at EACL 2023

  35. arXiv:2205.08814  [pdf, other]

    cs.CL

    Exploiting Social Media Content for Self-Supervised Style Transfer

    Authors: Dana Ruiter, Thomas Kleinbauer, Cristina España-Bonet, Josef van Genabith, Dietrich Klakow

    Abstract: Recent research on style transfer takes inspiration from unsupervised neural machine translation (UNMT), learning from large amounts of non-parallel data by exploiting cycle consistency loss, back-translation, and denoising autoencoders. By contrast, the use of self-supervised NMT (SSNMT), which leverages (near) parallel instances hidden in non-parallel data more efficiently than UNMT, has not yet…

    Submitted 18 May, 2022; originally announced May 2022.

    Comments: 13 pages, 2 figures, accepted as a long paper at SocialNLP 2022 (@NAACL)

  36. arXiv:2205.07290  [pdf, other]

    cs.CL

    Meta Self-Refinement for Robust Learning with Weak Supervision

    Authors: Dawei Zhu, Xiaoyu Shen, Michael A. Hedderich, Dietrich Klakow

    Abstract: Training deep neural networks (DNNs) under weak supervision has attracted increasing research attention as it can significantly reduce the annotation cost. However, labels from weak supervision can be noisy, and the high capacity of DNNs enables them to easily overfit the label noise, resulting in poor generalization. Recent methods leverage self-training to build noise-resistant models, in which…

    Submitted 30 April, 2023; v1 submitted 15 May, 2022; originally announced May 2022.

    Comments: EACL 2023 (long paper)

  37. arXiv:2205.02022  [pdf, other]

    cs.CL

    A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

    Authors: David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, et al. (20 additional authors not shown)

    Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models…

    Submitted 22 August, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: Accepted to NAACL 2022 (added evaluation data for amh, kin, nya, sna, xho)

  38. arXiv:2204.13400  [pdf, other]

    cs.CL

    Placing M-Phasis on the Plurality of Hate: A Feature-Based Corpus of Hate Online

    Authors: Dana Ruiter, Liane Reiners, Ashwin Geet D'Sa, Thomas Kleinbauer, Dominique Fohr, Irina Illina, Dietrich Klakow, Christian Schemer, Angeliki Monnier

    Abstract: Even though hate speech (HS) online has been an important object of research in the last decade, most HS-related corpora over-simplify the phenomenon of hate by attempting to label user comments as "hate" or "neutral". This ignores the complex and subjective nature of HS, which limits the real-life applicability of classifiers trained on these corpora. In this study, we present the M-Phasis corpus…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: 14 pages, 4 figures, accepted at LREC 2022 (Full Paper)

  39. arXiv:2204.10931  [pdf, other]

    cs.CL

    MCSE: Multimodal Contrastive Learning of Sentence Embeddings

    Authors: Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, Dietrich Klakow

    Abstract: Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: Accepted by NAACL 2022 main conference (short paper), 11 pages

  40. arXiv:2204.09371

    cs.CL

    Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification

    Authors: Dawei Zhu, Michael A. Hedderich, Fangzhou Zhai, David Ifeoluwa Adelani, Dietrich Klakow

    Abstract: Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques - by modeling, cleaning or filtering the noisy instances - are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP m…

    Submitted 20 April, 2022; originally announced April 2022.

    Comments: Accepted at Workshop on Insights from Negative Results in NLP 2022 @ACL 2022

  41. arXiv:2204.06487

    cs.CL

    Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning

    Authors: Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, Dietrich Klakow

    Abstract: Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is _language adaptive fine-tuning_ (LA…

    Submitted 18 October, 2022; v1 submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted to COLING 2022

  42. arXiv:2204.06309

    cs.CL cs.SD eess.AS

    Call-sign recognition and understanding for noisy air-traffic transcripts using surveillance information

    Authors: Alexander Blatt, Martin Kocour, Karel Veselý, Igor Szöke, Dietrich Klakow

    Abstract: Air traffic control (ATC) relies on communication via speech between pilot and air-traffic controller (ATCO). The call-sign, as unique identifier for each flight, is used to address a specific pilot by the ATCO. Extracting the call-sign from the communication is a challenge because of the noisy ATC voice channel and the additional noise introduced by the receiver. A low signal-to-noise ratio (SNR)…

    Submitted 13 April, 2022; originally announced April 2022.

    Comments: Accepted by ICASSP 2022

  43. arXiv:2204.02906

    cs.IR cs.CL

    Knowledge Base Index Compression via Dimensionality and Precision Reduction

    Authors: Vilém Zouhar, Marius Mosbach, Miaoran Zhang, Dietrich Klakow

    Abstract: Recently neural network based approaches to knowledge-intensive NLP tasks, such as question answering, started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB) which requires significant memory and compute resources, especially when scaled up. On HotpotQA we systematically investigate reducing the size of…

    Submitted 18 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: To be presented at Spa-NLP workshop at ACL 2022

  44. arXiv:2201.09651

    cs.CL cs.IR

    Artefact Retrieval: Overview of NLP Models with Knowledge Base Access

    Authors: Vilém Zouhar, Marius Mosbach, Debanjali Biswas, Dietrich Klakow

    Abstract: Many NLP models gain performance by having access to a knowledge base. A lot of research has been devoted to devising and improving the way the knowledge base is accessed and incorporated into the model, resulting in a number of mechanisms and pipelines. Despite the diversity of proposed mechanisms, there are patterns in the designs of such systems. In this paper, we systematically describe the ty…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 11 pages of main content, 7 pages of appendix; presented at AKBC CSRR 2021

  45. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain

    Authors: Lukas Lange, Heike Adel, Jannik Strötgen, Dietrich Klakow

    Abstract: The field of natural language processing (NLP) has recently seen a large change towards using pre-trained language models for solving almost any task. Despite showing great improvements in benchmark datasets for various tasks, these models often perform sub-optimal in non-standard domains like the clinical domain where a large gap between pre-training documents and target documents is observed. In…

    Submitted 20 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: This article has been accepted for publication in Bioinformatics ©: 2022 The Author(s). Published by Oxford University Press. All rights reserved. The published manuscript can be found here: https://doi.org/10.1093/bioinformatics/btac297

  46. arXiv:2110.14350

    cs.LG cs.AI

    Enhancing Reinforcement Learning with discrete interfaces to learn the Dyck Language

    Authors: Florian Dietz, Dietrich Klakow

    Abstract: Even though most interfaces in the real world are discrete, no efficient way exists to train neural networks to make use of them, yet. We enhance an Interaction Network (a Reinforcement Learning architecture) with discrete interfaces and train it on the generalized Dyck language. This task requires an understanding of hierarchical structures to solve, and has long proven difficult for neural netwo…

    Submitted 27 October, 2021; originally announced October 2021.

  47. arXiv:2110.09599

    cs.LG cs.CL

    Label-Descriptive Patterns and Their Application to Characterizing Classification Errors

    Authors: Michael Hedderich, Jonas Fischer, Dietrich Klakow, Jilles Vreeken

    Abstract: State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (i.e., patterns) that strongly correl…

    Submitted 17 June, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

    Comments: Accepted at ICML 2022

  48. arXiv:2109.10179

    cs.CL

    How Familiar Does That Sound? Cross-Lingual Representational Similarity Analysis of Acoustic Word Embeddings

    Authors: Badr M. Abdullah, Iuliia Zaitova, Tania Avgustinova, Bernd Möbius, Dietrich Klakow

    Abstract: How do neural networks "perceive" speech sounds from unknown languages? Does the typological similarity between the model's training language (L1) and an unknown language (L2) have an impact on the model representations of L2 speech signals? To answer these questions, we present a novel experimental design based on representational similarity analysis (RSA) to analyze acoustic word embeddings (AWE…

    Submitted 21 September, 2021; originally announced September 2021.

    Comments: BlackboxNLP 2021

  49. arXiv:2109.09133

    cs.CL

    Preventing Author Profiling through Zero-Shot Multilingual Back-Translation

    Authors: David Ifeoluwa Adelani, Miaoran Zhang, Xiaoyu Shen, Ali Davody, Thomas Kleinbauer, Dietrich Klakow

    Abstract: Documents as short as a single sentence may inadvertently reveal sensitive information about their authors, including e.g. their gender or ethnicity. Style transfer is an effective way of transforming texts in order to remove any information that enables author profiling. However, for a number of current state-of-the-art approaches the improved privacy is accompanied by an undesirable drop in the…

    Submitted 19 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021 (Main Conference), 9 pages

  50. arXiv:2107.08772

    cs.CL

    Integrating Unsupervised Data Generation into Self-Supervised Neural Machine Translation for Low-Resource Languages

    Authors: Dana Ruiter, Dietrich Klakow, Josef van Genabith, Cristina España-Bonet

    Abstract: For most language combinations, parallel data is either scarce or simply unavailable. To address this, unsupervised machine translation (UMT) exploits large amounts of monolingual data by using synthetic data generation techniques such as back-translation and noising, while self-supervised NMT (SSNMT) identifies parallel sentences in smaller comparable data and trains on them. To date, the inclusi…

    Submitted 19 July, 2021; originally announced July 2021.

    Comments: 11 pages, 8 figures, accepted at MT-Summit 2021 (Research Track)