Skip to main content

Showing 1–18 of 18 results for author: Clark, J H

  1. arXiv:2402.17934  [pdf, other

    cs.CL cs.AI

    Multitask Multilingual Model Adaptation with Featurized Low-Rank Mixtures

    Authors: Chu-Cheng Lin, Xinyi Wang, Jonathan H. Clark, Han Lu, Yun Zhu, Chenxi Whitehouse, Hongkun Yu

    Abstract: Adapting pretrained large language models (LLMs) to various downstream tasks in tens or hundreds of human languages is computationally expensive. Parameter-efficient fine-tuning (PEFT) significantly reduces the adaptation cost, by tuning only a small amount of parameters. However, directly applying PEFT methods such as LoRA (Hu et al., 2022) on diverse dataset mixtures could lead to suboptimal per… ▽ More

    Submitted 27 February, 2024; originally announced February 2024.

  2. arXiv:2309.04663  [pdf, other

    cs.CL cs.AI

    FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning

    Authors: Xinyi Wang, John Wieting, Jonathan H. Clark

    Abstract: Learning paradigms for large language models (LLMs) currently tend to fall within either in-context learning (ICL) or full fine-tuning. Each of these comes with their own trade-offs based on available data, model size, compute cost, ease-of-use, and final quality with neither solution performing well across-the-board. In this article, we first describe ICL and fine-tuning paradigms in a way that h… ▽ More

    Submitted 12 September, 2023; v1 submitted 8 September, 2023; originally announced September 2023.

  3. arXiv:2308.07286  [pdf, other

    cs.CL cs.LG

    The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation

    Authors: Patrick Fernandes, Daniel Deutsch, Mara Finkelstein, Parker Riley, André F. T. Martins, Graham Neubig, Ankush Garg, Jonathan H. Clark, Markus Freitag, Orhan Firat

    Abstract: Automatic evaluation of machine translation (MT) is a critical tool driving the rapid iterative development of MT systems. While considerable progress has been made on estimating a single scalar quality score, current metrics lack the informativeness of more detailed schemes that annotate individual errors, such as Multidimensional Quality Metrics (MQM). In this paper, we help fill this gap by pro… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: 19 pages

  4. arXiv:2305.14332  [pdf, other

    cs.CL

    Evaluating and Modeling Attribution for Cross-Lingual Question Answering

    Authors: Benjamin Muller, John Wieting, Jonathan H. Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Baldini Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang

    Abstract: Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those that do not speak these languages. The leap forward in cross-lingual modeling quality offered by generative language models offers much promise, yet their raw generations often fall short in factuality. To improve tr… ▽ More

    Submitted 15 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Published as a long paper at EMNLP 2023

  5. XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages

    Authors: Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean-Michel A. Sarr, Xinyi Wang, John Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L. Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David I. Adelani, Vera Axelrod, Isaac Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson , et al. (2 additional authors not shown)

    Abstract: Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP re-search is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot;… ▽ More

    Submitted 24 May, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

  6. arXiv:2305.10403  [pdf, other

    cs.CL cs.AI

    PaLM 2 Technical Report

    Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego , et al. (103 additional authors not shown)

    Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on… ▽ More

    Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  7. arXiv:2305.06897  [pdf, other

    cs.CL cs.AI cs.IR

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    Authors: Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira, Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia Siro, Steven Arthur , et al. (27 additional authors not shown)

    Abstract: African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create… ▽ More

    Submitted 11 May, 2023; originally announced May 2023.

  8. arXiv:2212.10726  [pdf, other

    cs.CL cs.LG

    Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval

    Authors: John Wieting, Jonathan H. Clark, William W. Cohen, Graham Neubig, Taylor Berg-Kirkpatrick

    Abstract: Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in $N$ languages and, through an approxi… ▽ More

    Submitted 4 June, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: Published as a long paper at ACL 2023

  9. arXiv:2207.00758  [pdf, other

    cs.CL

    MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages

    Authors: Akari Asai, Shayne Longpre, Jungo Kasai, Chia-Hsuan Lee, Rui Zhang, Junjie Hu, Ikuya Yamada, Jonathan H. Clark, Eunsol Choi

    Abstract: We present the results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, evaluating cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets in 14 typologically diverse languages, and newly annotated open-retrieval QA data in 2 underrepresented langu… ▽ More

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: NAACL Workshop on Multilingual Information Access

  10. arXiv:2203.17189  [pdf, other

    cs.LG cs.CL

    Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$

    Authors: Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen , et al. (18 additional authors not shown)

    Abstract: Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we presen… ▽ More

    Submitted 31 March, 2022; originally announced March 2022.

  11. arXiv:2203.10752  [pdf, other

    cs.CL

    XTREME-S: Evaluating Cross-lingual Speech Representations

    Authors: Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan Van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

    Abstract: We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as w… ▽ More

    Submitted 13 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"

  12. arXiv:2110.13254  [pdf, other

    cs.CV cs.LG

    Pediatric Otoscopy Video Screening with Shift Contrastive Anomaly Detection

    Authors: Weiyao Wang, Aniruddha Tamhane, Christine Santos, John R. Rzasa, James H. Clark, Therese L. Canares, Mathias Unberath

    Abstract: Ear related concerns and symptoms represents the leading indication for seeking pediatric healthcare attention. Despite the high incidence of such encounters, the diagnostic process of commonly encountered disease of the middle and external presents significant challenge. Much of this challenge stems from the lack of cost effective diagnostic testing, which necessitating the presence or absence of… ▽ More

    Submitted 25 October, 2021; originally announced October 2021.

  13. arXiv:2110.10329  [pdf, other

    cs.CL cs.LG

    SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training

    Authors: Ankur Bapna, Yu-an Chung, Nan Wu, Anmol Gulati, Ye Jia, Jonathan H. Clark, Melvin Johnson, Jason Riesa, Alexis Conneau, Yu Zhang

    Abstract: Unsupervised pre-training is now the predominant approach for both text and speech understanding. Self-attention models pre-trained on large amounts of unannotated data have been hugely successful when fine-tuned on downstream tasks from a variety of domains and languages. This paper takes the universality of unsupervised language pre-training one step further, by unifying speech and text pre-trai… ▽ More

    Submitted 19 October, 2021; originally announced October 2021.

  14. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

    Authors: Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

    Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a m… ▽ More

    Submitted 18 May, 2022; v1 submitted 11 March, 2021; originally announced March 2021.

    Comments: TACL Final Version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 73--91

  15. arXiv:2011.04264  [pdf, other

    cs.CL cs.CV

    CapWAP: Captioning with a Purpose

    Authors: Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H. Clark, Regina Barzilay

    Abstract: The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with a Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rath… ▽ More

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: EMNLP 2020

  16. arXiv:2010.12707  [pdf, other

    cs.CL

    Learning to Recognize Dialect Features

    Authors: Dorottya Demszky, Devyani Sharma, Jonathan H. Clark, Vinodkumar Prabhakaran, Jacob Eisenstein

    Abstract: Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in "He {} running". In this paper, we introduce the task of dialect feature detection… ▽ More

    Submitted 6 May, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: NAACL camera-ready

  17. arXiv:2010.11856  [pdf, other

    cs.CL

    XOR QA: Cross-lingual Open-Retrieval Question Answering

    Authors: Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, Hannaneh Hajishirzi

    Abstract: Multilingual question answering tasks typically assume answers exist in the same language as the question. Yet in practice, many languages face both information scarcity -- where languages have few reference articles -- and information asymmetry -- where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questio… ▽ More

    Submitted 13 April, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Published as a conference paper at NAACL-HLT 2021 (long)

  18. arXiv:2003.05002  [pdf

    cs.CL cs.LG

    TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

    Authors: Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, Jennimaria Palomaki

    Abstract: Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA---a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology---the set of linguistic features each language expresses---such that we expect models performing well on t… ▽ More

    Submitted 10 March, 2020; originally announced March 2020.

    Comments: To appear in Transactions of the Association for Computational Linguistics (TACL) 2020. Please use this as the citation