
Showing 1–20 of 20 results for author: Thrush, T

  1. arXiv:2402.04492  [pdf, other]

    cs.CV cs.CL

    ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation

    Authors: Jirayu Burapacheep, Ishan Gaur, Agam Bhatia, Tristan Thrush

    Abstract: This paper introduces the ColorSwap dataset, designed to assess and improve the proficiency of multimodal models in matching objects with their colors. The dataset is comprised of 2,000 unique image-caption pairs, grouped into 1,000 examples. Each example includes a caption-image pair, along with a "color-swapped" pair. We follow the Winoground schema: the two captions in an example have the sam…

    Submitted 6 February, 2024; originally announced February 2024.

  2. arXiv:2401.05300  [pdf, other]

    cs.CL cs.AI

    I am a Strange Dataset: Metalinguistic Tests for Language Models

    Authors: Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, Douwe Kiela

    Abstract: Statements involving metalinguistic self-reference ("This paper has six sections.") are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present "I am a Strange Dataset", a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like "The penultimate word in this sent…

    Submitted 10 January, 2024; originally announced January 2024.

  3. arXiv:2306.16410  [pdf, other]

    cs.CL cs.CV

    Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

    Authors: William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh

    Abstract: We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recog…

    Submitted 28 June, 2023; originally announced June 2023.

  4. arXiv:2303.03915  [pdf, other]

    cs.CL cs.AI

    The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

    Authors: Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, et al. (29 additional authors not shown)

    Abstract: As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the f…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: NeurIPS 2022, Datasets and Benchmarks Track

    ACM Class: I.2.7

  5. arXiv:2212.05129  [pdf, other]

    cs.AI cs.LG

    Measuring Data

    Authors: Margaret Mitchell, Alexandra Sasha Luccioni, Nathan Lambert, Marissa Gerchick, Angelina McMillan-Major, Ezinwanne Ozoani, Nazneen Rajani, Tristan Thrush, Yacine Jernite, Douwe Kiela

    Abstract: We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets. Similar to an object's height, width, and volume, data measurements quantify different attributes of data along common dimensions that support comparison. Several lines of research have proposed what we refer to as measurements, with differing terminology; we bring some of t…

    Submitted 13 February, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

  6. arXiv:2211.05100  [pdf, other]

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  7. arXiv:2210.01970  [pdf, other]

    cs.LG

    Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements

    Authors: Leandro von Werra, Lewis Tunstall, Abhishek Thakur, Alexandra Sasha Luccioni, Tristan Thrush, Aleksandra Piktus, Felix Marty, Nazneen Rajani, Victor Mustar, Helen Ngo, Omar Sanseviero, Mario Šaško, Albert Villanova, Quentin Lhoest, Julien Chaumond, Margaret Mitchell, Alexander M. Rush, Thomas Wolf, Douwe Kiela

    Abstract: Evaluation is a key part of machine learning (ML), yet there is a lack of support and tooling to enable its informed and systematic practice. We introduce Evaluate and Evaluation on the Hub, a set of tools to facilitate the evaluation of models and datasets in ML. Evaluate is a library to support best practices for measurements, metrics, and comparisons of data and models. Its goal is to support…

    Submitted 6 October, 2022; v1 submitted 30 September, 2022; originally announced October 2022.

  8. arXiv:2207.10062  [pdf, other]

    cs.LG

    DataPerf: Benchmarks for Data-Centric AI Development

    Authors: Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo, Bilge Acun, Lingjiao Chen, Mehul Smriti Raje, Max Bartolo, Sabri Eyuboglu, Amirata Ghorbani, Emmett Goodman, et al. (20 additional authors not shown)

    Abstract: Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing datase…

    Submitted 13 October, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: NeurIPS 2023 Datasets and Benchmarks Track

  9. arXiv:2204.03162  [pdf, other]

    cs.CV cs.CL

    Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

    Authors: Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross

    Abstract: We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annot…

    Submitted 22 April, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: CVPR 2022

  10. arXiv:2204.01906  [pdf, other]

    cs.CL cs.AI

    Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

    Authors: Tristan Thrush, Kushal Tirumala, Anmol Gupta, Max Bartolo, Pedro Rodriguez, Tariq Kane, William Gaviria Rojas, Peter Mattson, Adina Williams, Douwe Kiela

    Abstract: We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models, as well as for conducting model in the loop data collection with crowdworkers. Dynatask is integrated with Dynabench, a research platform for rethinking benchmarking in AI that facilitates human a…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: ACL System Demos 2022

  11. arXiv:2112.09062  [pdf, other]

    cs.CL

    Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

    Authors: Max Bartolo, Tristan Thrush, Sebastian Riedel, Pontus Stenetorp, Robin Jia, Douwe Kiela

    Abstract: In Dynamic Adversarial Data Collection (DADC), human annotators are tasked with finding examples that models struggle to predict correctly. Models trained on DADC-collected training data have been shown to be more robust in adversarial and out-of-domain settings, and are considerably harder for humans to fool. However, DADC is more time-consuming than traditional data collection and thus more cost…

    Submitted 17 May, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

  12. arXiv:2108.05921  [pdf, other]

    cs.CL cs.CY

    Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

    Authors: Hannah Rose Kirk, Bertram Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale

    Abstract: Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is an emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we…

    Submitted 6 May, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

    Journal ref: 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022)

  13. arXiv:2106.06052  [pdf, other]

    cs.CL cs.AI

    Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking

    Authors: Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, Douwe Kiela

    Abstract: We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibi…

    Submitted 20 May, 2021; originally announced June 2021.

  14. arXiv:2104.14337  [pdf, other]

    cs.CL cs.AI

    Dynabench: Rethinking Benchmarking in NLP

    Authors: Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, Adina Williams

    Abstract: We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary model…

    Submitted 7 April, 2021; originally announced April 2021.

    Comments: NAACL 2021

  15. Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation

    Authors: Max Bartolo, Tristan Thrush, Robin Jia, Sebastian Riedel, Pontus Stenetorp, Douwe Kiela

    Abstract: Despite recent progress, state-of-the-art question answering models remain vulnerable to a variety of adversarial attacks. While dynamic adversarial data collection, in which a human annotator tries to write examples that fool a model-in-the-loop, can improve model robustness, this process is expensive which limits the scale of the collected data. In this work, we are the first to use synthetic ad…

    Submitted 15 March, 2022; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: EMNLP 2021

    Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p.8830-8848. Association for Computational Linguistics

  16. arXiv:2103.03395  [pdf, other]

    cs.RO cs.CV

    Rover Relocalization for Mars Sample Return by Virtual Template Synthesis and Matching

    Authors: Tu-Hoa Pham, William Seto, Shreyansh Daftry, Barry Ridge, Johanna Hansen, Tristan Thrush, Mark Van der Merwe, Gerard Maggiolino, Alexander Brinkman, John Mayo, Yang Cheng, Curtis Padgett, Eric Kulczycki, Renaud Detry

    Abstract: We consider the problem of rover relocalization in the context of the notional Mars Sample Return campaign. In this campaign, a rover (R1) needs to be capable of autonomously navigating and localizing itself within an area of approximately 50 x 50 m using reference images collected years earlier by another rover (R0). We propose a visual localizer that exhibits robustness to the relatively barren…

    Submitted 4 March, 2021; originally announced March 2021.

    Comments: To appear in IEEE Robotics and Automation Letters (RA-L) and IEEE International Conference on Robotics and Automation (ICRA 2021)

  17. arXiv:2012.15761  [pdf, other]

    cs.CL cs.LG

    Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection

    Authors: Bertie Vidgen, Tristan Thrush, Zeerak Waseem, Douwe Kiela

    Abstract: We present a human-and-model-in-the-loop process for dynamically generating datasets and training better performing and more robust hate detection models. We provide a new dataset of ~40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. It includes ~15,000 challenging perturbations and each hateful entry has fine-grained labels for the type and ta…

    Submitted 3 June, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

  18. arXiv:2011.02417  [pdf, other]

    cs.CL cs.LG

    Investigating Novel Verb Learning in BERT: Selectional Preference Classes and Alternation-Based Syntactic Generalization

    Authors: Tristan Thrush, Ethan Wilcox, Roger Levy

    Abstract: Previous studies investigating the syntactic abilities of deep learning models have not targeted the relationship between the strength of the grammatical generalization and the amount of evidence to which the model is exposed during training. We address this issue by deploying a novel word-learning paradigm to test BERT's few-shot learning capabilities for two aspects of English verbs: alternation…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: Accepted to BlackboxNLP 2020

  19. arXiv:2010.12729  [pdf, other]

    cs.CL

    ANLIzing the Adversarial Natural Language Inference Dataset

    Authors: Adina Williams, Tristan Thrush, Douwe Kiela

    Abstract: We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds. We propose a fine-grained annotation scheme of the different aspects of inference that are responsible for the gold classification labels, and use it to hand-code all three of the ANLI development sets. We…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: 33 pages, 1 figure, 24 tables

  20. arXiv:2002.08899  [pdf, other]

    cs.CL

    Compositional Neural Machine Translation by Removing the Lexicon from Syntax

    Authors: Tristan Thrush

    Abstract: The meaning of a natural language utterance is largely determined from its syntax and words. Additionally, there is evidence that humans process an utterance by separating knowledge about the lexicon from syntax knowledge. Theories from semantics and neuroscience claim that complete word meanings are not encoded in the representation of syntax. In this paper, we propose neural units that can enfor…

    Submitted 6 February, 2020; originally announced February 2020.

    Comments: natural language processing; adversarial neural networks; machine translation; aphasia; neural attention