Showing 1–50 of 52 results for author: Dabre, R

  1. arXiv:2407.05841  [pdf, other]

    cs.CL cs.LG

    An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

    Authors: Nandini Mundra, Aditya Nanda Kishore, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra

    Abstract: Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model's tokenizer, leading to inadequate representation of new languages…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Under review

  2. arXiv:2406.13332  [pdf, other]

    cs.CL

    How effective is Multi-source pivoting for Translation of Low Resource Indian Languages?

    Authors: Pranav Gaikwad, Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

    Abstract: Machine Translation (MT) between linguistically dissimilar languages is challenging, especially due to the scarcity of parallel corpora. Prior works suggest that pivoting through a high-resource language can help translation into a related low-resource language. However, existing works tend to discard the source sentence when pivoting. Taking the case of English to Indian language MT, this paper e…

    Submitted 19 June, 2024; originally announced June 2024.

  3. arXiv:2406.05967  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

    Authors: David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed, Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo Tonja, Bontu Fufa Balcha, Chenxi Whitehouse, Christian Salamea, Dan John Velasco, David Ifeoluwa Adelani, David Le Meur, Emilio Villa-Cueva, Fajri Koto, Fauzan Farooqui, Frederico Belcavello, Ganzorig Batnasan, Gisela Vallejo, Grainne Caulfield, Guido Ivetta, Haiyue Song, et al. (50 additional authors not shown)

    Abstract: Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason on knowledge present in both visual and textual data. However, most of the current VQA models use datasets that are primarily focused on English and a few major world languages, with images that are typically Western-centric. While recen…

    Submitted 9 June, 2024; originally announced June 2024.

  4. arXiv:2406.03893  [pdf, other]

    cs.CL

    How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?

    Authors: Anushka Singh, Ananya B. Sai, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra

    Abstract: While machine translation evaluation has been studied primarily for high-resource languages, there has been a recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we focus on a zero-shot evaluation setting focusing on low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-…

    Submitted 6 June, 2024; originally announced June 2024.

  5. arXiv:2405.05376  [pdf, other]

    cs.CL

    Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

    Authors: Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur, Stephen D. Richardson, Kenton Murray

    Abstract: A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We pr…

    Submitted 13 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: NAACL 2024

  6. arXiv:2404.04530  [pdf, other]

    cs.CL

    A Morphology-Based Investigation of Positional Encodings

    Authors: Poulami Ghosh, Shikhar Vashishth, Raj Dabre, Pushpak Bhattacharyya

    Abstract: Contemporary deep learning models effectively handle languages with diverse morphology despite morphology not being directly integrated into them. Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings. This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization…

    Submitted 30 May, 2024; v1 submitted 6 April, 2024; originally announced April 2024.

    Comments: Work in Progress

  7. arXiv:2403.13638  [pdf, other]

    cs.CL

    Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

    Authors: Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

    Abstract: In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Ind…

    Submitted 21 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  8. arXiv:2403.06350  [pdf, other]

    cs.CL

    IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

    Authors: Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

    Abstract: Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-re…

    Submitted 10 March, 2024; originally announced March 2024.

  9. arXiv:2401.15006  [pdf, other]

    cs.CL cs.AI

    Airavata: Introducing Hindi Instruction-tuned LLM

    Authors: Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan

    Abstract: We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additional…

    Submitted 26 February, 2024; v1 submitted 26 January, 2024; originally announced January 2024.

    Comments: Work in progress

  10. arXiv:2401.14280  [pdf, other]

    cs.CL cs.AI

    RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

    Authors: Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan

    Abstract: This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages that use non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like…

    Submitted 23 June, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024

  11. arXiv:2401.13249  [pdf, other]

    eess.AS cs.MM

    MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction

    Authors: Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li, Raj Dabre, Yi Zhao, Tatsuya Kawahara

    Abstract: Automatic Mean Opinion Score (MOS) prediction is employed to evaluate the quality of synthetic speech. This study extends the application of predicted MOS to the task of Fake Audio Detection (FAD), as we expect that MOS can be used to assess how close synthesized speech is to the natural human voice. We propose MOS-FAD, where MOS can be leveraged at two key points in FAD: training data selection a…

    Submitted 24 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted in ICASSP2024

  12. arXiv:2401.12097  [pdf, other]

    cs.CL

    An Empirical Study of In-context Learning in LLMs for Machine Translation

    Authors: Pranjal A. Chitale, Jay Gala, Raj Dabre

    Abstract: Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence the said quality. To this end, we perform the first of its kind, an exhaustive study of in-contex…

    Submitted 4 June, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted to ACL 2024 Findings

  13. arXiv:2401.07078  [pdf, other]

    cs.CL

    PUB: A Pragmatics Understanding Benchmark for Assessing LLMs' Pragmatics Capabilities

    Authors: Settaluri Lakshmi Sravanthi, Meet Doshi, Tankala Pavan Kalyan, Rudra Murthy, Pushpak Bhattacharyya, Raj Dabre

    Abstract: LLMs have demonstrated remarkable capability for understanding semantics, but they often struggle with understanding pragmatics. To demonstrate this fact, we release a Pragmatics Understanding Benchmark (PUB) dataset consisting of fourteen tasks in four pragmatics phenomena, namely, Implicature, Presupposition, Reference, and Deixis. We curated high-quality test sets for each task, consisting of M…

    Submitted 13 January, 2024; originally announced January 2024.

  14. arXiv:2401.05632  [pdf, other]

    cs.CL

    Natural Language Processing for Dialects of a Language: A Survey

    Authors: Aditya Joshi, Raj Dabre, Diptesh Kanojia, Zhuang Li, Haolan Zhan, Gholamreza Haffari, Doris Dippold

    Abstract: State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectic datasets and its implications for the equity of language technologies, we surv…

    Submitted 28 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: The paper is under review at ACM Computing Surveys. Please reach out to the authors in the case of feedback

  15. arXiv:2311.03696  [pdf, other]

    cs.CL

    Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

    Authors: Haiyue Song, Raj Dabre, Chenhui Chu, Atsushi Fujita, Sadao Kurohashi

    Abstract: Lecture transcript translation helps learners understand online courses; however, building a high-quality lecture machine translation system is hindered by the lack of publicly available parallel corpora. To address this, we examine a framework for parallel corpus mining, which provides a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. To create the parallel corpora, we…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Submitted to the Journal of Information Processing (JIP). arXiv admin note: text overlap with arXiv:1912.11739

  16. arXiv:2310.19567  [pdf, other]

    cs.CL cs.AI

    CreoleVal: Multilingual Multitask Benchmarks for Creoles

    Authors: Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Ruth-Ann Armstrong, Abee Eijansantos, Catriona Malau, Hans Erik Heje, Ernests Lavrinovics, Diptesh Kanojia, Paul Belony, Marcel Bollmann, Loïc Grobol, Miryam de Lhoneux, Daniel Hershcovich, Michel DeGraff, Anders Søgaard, Johannes Bjerva

    Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning…

    Submitted 6 May, 2024; v1 submitted 30 October, 2023; originally announced October 2023.

    Comments: Accepted to TACL

  17. SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

    Authors: Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, Eiichiro Sumita

    Abstract: Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient as they require parallel corpora, days to train and hours to decode. This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: Accepted to TALLIP journal

  18. arXiv:2307.14743  [pdf, other]

    cs.CL

    Turning Whisper into Real-Time Transcription System

    Authors: Dominik Macháček, Raj Dabre, Ondřej Bojar

    Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to e…

    Submitted 21 September, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: IJCNLP-AACL 2023 system demonstration

  19. arXiv:2306.03491  [pdf, other]

    cs.CV cs.CL

    SciCap+: A Knowledge Augmented Dataset to Study the Challenges of Scientific Figure Captioning

    Authors: Zhishen Yang, Raj Dabre, Hideki Tanaka, Naoaki Okazaki

    Abstract: In scholarly documents, figures provide a straightforward way of communicating scientific findings to readers. Automating figure caption generation helps move model understandings of scientific documents beyond text and will help authors write informative captions that facilitate communicating scientific findings. Unlike previous studies, we reframe scientific figure captioning as a knowledge-augm…

    Submitted 6 June, 2023; originally announced June 2023.

    Comments: Published in SDU workshop at AAAI23

  20. arXiv:2305.16894  [pdf, other]

    cs.CL

    Robustness of Multi-Source MT to Transcription Errors

    Authors: Dominik Macháček, Peter Polák, Ondřej Bojar, Raj Dabre

    Abstract: Automatic speech translation is sensitive to speech recognition errors, but in a multilingual scenario, the same content may be available in various languages via simultaneous interpreting, dubbing or subtitling. In this paper, we hypothesize that leveraging multiple sources will improve translation quality if the sources complement one another in terms of correct information they contain. To this…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: ACL 2023 Findings

  21. arXiv:2305.16307  [pdf]

    cs.CL

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    Authors: Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

    Abstract: India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India (referred to as scheduled languages), are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, ther…

    Submitted 20 December, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted at TMLR

  22. arXiv:2305.14105  [pdf, other]

    cs.CL cs.AI

    CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation

    Authors: Aswanth Kumar, Ratish Puduppully, Raj Dabre, Anoop Kunchukuttan

    Abstract: Large language models have demonstrated the capability to perform machine translation when the input is prompted with a few examples (in-context learning). Translation quality depends on various features of the selected examples, such as their quality and relevance, but previous work has predominantly focused on individual features in isolation. In this paper, we propose a general framework for…

    Submitted 21 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP 2023 findings

  23. arXiv:2305.13085  [pdf, other]

    cs.CL

    Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models

    Authors: Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre, Ai Ti Aw, Nancy F. Chen

    Abstract: This study investigates machine translation between related languages i.e., languages within the same family that share linguistic characteristics such as word order and lexical similarity. Machine translation through few-shot prompting leverages a small set of translation pair examples to generate translations for test sentences. This procedure requires the model to learn how to generate translat…

    Submitted 22 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 (Main, Long paper)

  24. arXiv:2305.10190  [pdf, other]

    cs.CL

    Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation

    Authors: Zhuoyuan Mao, Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi

    Abstract: The language-independency of encoded representations within multilingual neural machine translation (MNMT) models is crucial for their generalization ability on zero-shot translation. Neural interlingua representations have been shown as an effective method for achieving this. However, fixed-length neural interlingua representations introduced in previous work can limit its flexibility and represe…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to Multi3Generation workshop (held in conjunction with EAMT 2023)

  25. arXiv:2305.09312  [pdf, other]

    cs.CL

    Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

    Authors: Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, Sadao Kurohashi

    Abstract: This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) have revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may ov…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 main conference

  26. arXiv:2305.07491  [pdf, other]

    cs.CL

    A Comprehensive Analysis of Adapter Efficiency

    Authors: Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra

    Abstract: Adapters have been positioned as a parameter-efficient fine-tuning (PEFT) approach, whereby a minimal number of parameters are added to the model and fine-tuned. However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility. Through extensive experiments on many adapters, tasks, and languages i…

    Submitted 12 May, 2023; originally announced May 2023.

  27. arXiv:2304.09388  [pdf, other]

    cs.CL cs.AI

    An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models

    Authors: Varun Gumma, Raj Dabre, Pratyush Kumar

    Abstract: Knowledge distillation (KD) is a well-known method for compressing neural models. However, works focusing on distilling knowledge from large multilingual neural machine translation (MNMT) models into smaller ones are practically nonexistent, despite the popularity and superiority of MNMT. This paper bridges this gap by presenting an empirical investigation of knowledge distillation for compressing…

    Submitted 18 April, 2023; originally announced April 2023.

    Comments: accepted at EAMT 2023

  28. arXiv:2212.10180  [pdf, other]

    cs.CL

    IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

    Authors: Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

    Abstract: The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over…

    Submitted 3 July, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: ACL 2023 long paper

  29. arXiv:2211.08633  [pdf, other]

    cs.CL cs.AI

    MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation

    Authors: Dominik Macháček, Ondřej Bojar, Raj Dabre

    Abstract: There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST) but their correlations with human ratings of SST, which have recently been collected as Continuous Ratings (CR), are unclear. In this p…

    Submitted 1 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

    Comments: IWSLT 2023

  30. arXiv:2206.02421  [pdf, other]

    cs.CL

    MorisienMT: A Dataset for Mauritian Creole Machine Translation

    Authors: Raj Dabre, Aneerav Sukhoo

    Abstract: In this paper, we describe MorisienMT, a dataset for benchmarking machine translation quality of Mauritian Creole. Mauritian Creole (Morisien) is the lingua franca of the Republic of Mauritius and is a French-based creole language. MorisienMT consists of a parallel corpus between English and Morisien, French and Morisien and a monolingual corpus for Morisien. We first give an overview of Morisien…

    Submitted 6 June, 2022; originally announced June 2022.

    Comments: Work in progress! (obviously) Dataset is here: https://huggingface.co/datasets/prajdabre/MorisienMT

  31. arXiv:2204.12165  [pdf, other]

    cs.CL

    When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?

    Authors: Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan, Sadao Kurohashi

    Abstract: Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not been explored in many-to-many NMT. This work propose…

    Submitted 26 April, 2022; originally announced April 2022.

    Comments: NAACL 2022 findings

  32. arXiv:2204.04855  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Fusion of Self-supervised Learned Models for MOS Prediction

    Authors: Zhengdong Yang, Wangjin Zhou, Chenhui Chu, Sheng Li, Raj Dabre, Raphael Rubino, Yi Zhao

    Abstract: We participated in the mean opinion score (MOS) prediction challenge, 2022. This challenge aims to predict MOS scores of synthetic speech on two tracks, the main track and a more challenging sub-track: out-of-domain (OOD). To improve the accuracy of the predicted scores, we have explored several model fusion-related strategies and proposed a fused framework in which seven pretrained self-supervise…

    Submitted 10 April, 2022; originally announced April 2022.

    Comments: MOS 2022 shared task system description paper

  33. arXiv:2203.05437  [pdf]

    cs.CL cs.AI

    IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

    Authors: Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush Kumar

    Abstract: Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation…

    Submitted 26 October, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

    Comments: Accepted at EMNLP 2022

  34. arXiv:2112.08789  [pdf, other]

    cs.CL

    Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

    Authors: Diptesh Kanojia, Raj Dabre, Shubham Dewangan, Pushpak Bhattacharyya, Gholamreza Haffari, Malhar Kulkarni

    Abstract: Cognates are variants of the same lexical form across different languages; for example 'fonema' in Spanish and 'phoneme' in English are cognates, both of which mean 'a unit of sound'. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we d…

    Submitted 16 December, 2021; originally announced December 2021.

    Comments: Published at COLING 2020

  35. IndicBART: A Pre-trained Model for Indic Natural Language Generation

    Authors: Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar

    Abstract: In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate Indic…

    Submitted 26 October, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

    Comments: Published at ACL 2022, 15 pages

  36. arXiv:2108.11126  [pdf, other]

    cs.CL cs.AI

    YANMTT: Yet Another Neural Machine Translation Toolkit

    Authors: Raj Dabre, Eiichiro Sumita

    Abstract: In this paper we present our open-source neural machine translation (NMT) toolkit called "Yet Another Neural Machine Translation Toolkit", abbreviated as YANMTT, which is built on top of the Transformers library. Despite the growing importance of sequence-to-sequence pre-training, there are surprisingly few, if any, well-established toolkits that allow users to easily do pre-training. Toolkits such…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: Submitted to EMNLP 2021 Demo Track

  37. arXiv:2106.10002  [pdf, other]

    cs.CL

    Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

    Authors: Raj Dabre, Atsushi Fujita

    Abstract: In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations which in turn improves the quality of the network's prediction. Conventionally, each layer in the stack has its own parameters which leads to a significant increase in the number of model parameters. In t…

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: 22 pages. Under review. Work in progress. Extended version of https://ojs.aaai.org//index.php/AAAI/article/view/4590 which is an extension of arXiv:1807.05353 . The focus is on analyzing the limitations of recurrently stacked layers and methods to overcome said limitations

  38. arXiv:2104.07410  [pdf, other]

    cs.CL

    Simultaneous Multi-Pivot Neural Machine Translation

    Authors: Raj Dabre, Aizhan Imankulova, Masahiro Kaneko, Abhisek Chakrabarty

    Abstract: Parallel corpora are indispensable for training neural machine translation (NMT) models, and parallel corpora for most language pairs do not exist or are scarce. In such cases, pivot language NMT can be helpful where a pivot language is used such that there exist parallel corpora between the source and pivot and pivot and target languages. Naturally, the quality of pivot language translation is mo…

    Submitted 15 April, 2021; originally announced April 2021.

    Comments: preliminary work. pardon the messy writing and mistakes. will be submitted to emnlp after major overhaul

  39. arXiv:2009.09372  [pdf, other]

    cs.CL cs.AI

    Softmax Tempering for Training Neural Machine Translation Models

    Authors: Raj Dabre, Atsushi Fujita

    Abstract: Neural machine translation (NMT) models are typically trained using a softmax cross-entropy loss where the softmax distribution is compared against smoothed gold labels. In low-resource scenarios, NMT models tend to over-fit because the softmax distribution quickly approaches the gold label distribution. To address this issue, we propose to divide the logits by a temperature coefficient, prior to…

    Submitted 20 September, 2020; originally announced September 2020.

    Comments: The paper is about prediction smoothing for improving sequence to sequence performance. Related to but not the same as label smoothing. Work in progress. Updates with deeper analyses and comparisons to related methods to follow. Rejected from EMNLP 2020

  40. arXiv:2005.03361  [pdf, other]

    cs.CL

    JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation

    Authors: Zhuoyuan Mao, Fabien Cromieres, Raj Dabre, Haiyue Song, Sadao Kurohashi

    Abstract: Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel co…

    Submitted 7 May, 2020; originally announced May 2020.

    Comments: LREC 2020

  41. arXiv:2002.08614  [pdf, other]

    cs.CL cs.LG

    Balancing Cost and Benefit with Tied-Multi Transformers

    Authors: Raj Dabre, Raphael Rubino, Atsushi Fujita

    Abstract: We propose and evaluate a novel procedure for training multiple Transformers with tied parameters which compresses multiple models into one enabling the dynamic choice of the number of encoder and decoder layers during decoding. In sequence-to-sequence modeling, typically, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is…

    Submitted 20 February, 2020; originally announced February 2020.

    Comments: Extended version of our previous manuscript available at arXiv:1908.10118

  42. arXiv:2001.08353  [pdf, other]

    cs.CL cs.LG

    Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation

    Authors: Haiyue Song, Raj Dabre, Zhuoyuan Mao, Fei Cheng, Sadao Kurohashi, Eiichiro Sumita

    Abstract: Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks in low-resource settings. However, large monolingual corpora might not always be available for the languages of interest (LOI). To this end, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. A case s…

    Submitted 22 January, 2020; originally announced January 2020.

    Comments: Work in progress. Submitted to a conference

  43. arXiv:2001.01115  [pdf, other]

    cs.CL cs.AI cs.LG

    A Comprehensive Survey of Multilingual Neural Machine Translation

    Authors: Raj Dabre, Chenhui Chu, Anoop Kunchukuttan

    Abstract: We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years. MNMT has been useful in improving translation quality as a result of translation knowledge transfer (transfer learning). MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations…

    Submitted 7 January, 2020; v1 submitted 4 January, 2020; originally announced January 2020.

    Comments: This is an extended version of our survey paper on multilingual NMT. The previous version [arXiv:1905.05395] is rather condensed and is useful for speed-reading, whereas this version is more beginner-friendly. Under review at the Computing Surveys journal. We have intentionally decided to maintain both short and long versions of our survey paper for different reader groups

  44. arXiv:1912.11739  [pdf, other]

    cs.CL cs.AI cs.LG

    Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

    Authors: Haiyue Song, Raj Dabre, Atsushi Fujita, Sadao Kurohashi

    Abstract: Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a language independent framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine tr…

    Submitted 13 January, 2020; v1 submitted 25 December, 2019; originally announced December 2019.

    Comments: 10 pages, 1 figure, 9 tables, under review by LREC2020

  45. arXiv:1908.10118  [pdf, other]

    cs.CL cs.LG

    Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

    Authors: Raj Dabre, Atsushi Fujita

    Abstract: This paper proposes a novel procedure for training an encoder-decoder based deep neural network which compresses NxM models into a single model enabling us to dynamically choose the number of encoder and decoder layers for decoding. Usually, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute softmax loss. I…

    Submitted 28 August, 2019; v1 submitted 27 August, 2019; originally announced August 2019.

    Comments: Fixed numeric typos and corresponding explanations in the running text in the paper

  46. arXiv:1907.03060  [pdf, ps, other]

    cs.CL

    Exploiting Out-of-Domain Parallel Data through Multilingual Transfer Learning for Low-Resource Neural Machine Translation

    Authors: Aizhan Imankulova, Raj Dabre, Atsushi Fujita, Kenji Imamura

    Abstract: This paper proposes a novel multilingual multistage fine-tuning approach for low-resource neural machine translation (NMT), taking a challenging Japanese--Russian pair for benchmarking. Although there are many solutions for low-resource scenarios, such as multilingual NMT and back-translation, we have empirically confirmed their limited success when restricted to in-domain data. We therefore propo…

    Submitted 5 July, 2019; originally announced July 2019.

    Comments: Accepted at the 17th Machine Translation Summit

  47. arXiv:1906.07978  [pdf, ps, other]

    cs.CL

    Multilingual Multi-Domain Adaptation Approaches for Neural Machine Translation

    Authors: Chenhui Chu, Raj Dabre

    Abstract: In this paper, we propose two novel methods for domain adaptation for the attention-only neural machine translation (NMT) model, i.e., the Transformer. Our methods focus on training a single translation model for multiple domains by either learning domain specialized hidden state representations or predictor biases for each domain. We combine our methods with a previously proposed black-box method…

    Submitted 20 June, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

  48. arXiv:1905.05395  [pdf, other]

    cs.CL

    A Brief Survey of Multilingual Neural Machine Translation

    Authors: Raj Dabre, Chenhui Chu, Anoop Kunchukuttan

    Abstract: We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years. MNMT has been useful in improving translation quality as a result of knowledge transfer. MNMT is more promising and interesting than its statistical machine translation counterpart because end-to-end modeling and distributed representations open new avenues. Many approache…

    Submitted 4 January, 2020; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: We have substantially expanded this paper for a journal submission to Computing Surveys [arXiv:2001.01115]

  49. arXiv:1807.05353  [pdf, other]

    cs.CL

    Recurrent Stacking of Layers for Compact Neural Machine Translation Models

    Authors: Raj Dabre, Atsushi Fujita

    Abstract: In neural machine translation (NMT), the most common practice is to stack a number of recurrent or feed-forward layers in the encoder and the decoder. As a result, the addition of each new layer improves the translation quality significantly. However, this also leads to a significant increase in the number of parameters. In this paper, we propose to share parameters across all the layers thereby l…

    Submitted 17 July, 2018; v1 submitted 14 July, 2018; originally announced July 2018.

    Comments: Version 2 (Current): Fixed Typos. Additional Results for models using back-translated data. Resized the figure. Better explanations of some parts. Version 1: Initial version

  50. arXiv:1710.01025  [pdf, ps, other]

    cs.CL

    MMCR4NLP: Multilingual Multiway Corpora Repository for Natural Language Processing

    Authors: Raj Dabre, Sadao Kurohashi

    Abstract: Multilinguality is gradually becoming ubiquitous in the sense that more and more researchers have successfully shown that using additional languages helps improve the results in many Natural Language Processing tasks. Multilingual Multiway Corpora (MMC) contain the same sentence in multiple languages. Such corpora have been primarily used for Multi-Source and Pivot Language Machine Translation but…

    Submitted 14 February, 2019; v1 submitted 3 October, 2017; originally announced October 2017.

    Comments: V2: Fixed broken URLs. V1: 4 pages, Language Resources Paper, Submitted to LREC 2018, parallel corpora, multilingual multiway corpora, machine translation, resource