Showing 1–10 of 10 results for author: Papadimitriou, I

  1. arXiv:2401.06416  [pdf, other]

    cs.CL cs.AI cs.LG

    Mission: Impossible Language Models

    Authors: Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, Christopher Potts

    Abstract: Chomsky and others have very directly claimed that large language models (LLMs) are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data w…

    Submitted 12 January, 2024; originally announced January 2024.

  2. arXiv:2311.06440  [pdf, other]

    cs.CL cs.LG

    Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text

    Authors: Isaac Caswell, Lisa Wang, Isabel Papadimitriou

    Abstract: Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A typical and insidious issue, affecting both training data and model output, is data that is repetitive and dominated by linguistically uninteresting boilerplate, such as price catalogs or computer-generated log…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: Accepted to GEM workshop 2023; 6 pages

  3. arXiv:2304.13060  [pdf, other]

    cs.CL

    Injecting structural hints: Using language models to study inductive biases in language learning

    Authors: Isabel Papadimitriou, Dan Jurafsky

    Abstract: Both humans and large language models are able to learn language without explicit structural supervision. What inductive biases make this learning possible? We address this fundamental cognitive question by leveraging transformer language models: we inject inductive bias into language models by pretraining on formally-structured data, and then evaluate the biased learners' ability to learn typolog…

    Submitted 29 October, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

    Comments: Findings of EMNLP 2023

  4. arXiv:2210.05619  [pdf, other]

    cs.CL

    Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models

    Authors: Isabel Papadimitriou, Kezia Lopez, Dan Jurafsky

    Abstract: While multilingual language models can improve NLP performance on low-resource languages by leveraging higher-resource languages, they also reduce average performance on all languages (the 'curse of multilinguality'). Here we show another problem with multilingual models: grammatical structures in higher-resource languages bleed into lower-resource languages, a phenomenon we call grammatical struc…

    Submitted 13 April, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: Findings of EACL 2023

  5. arXiv:2203.06204  [pdf, other]

    cs.CL

    When classifying grammatical role, BERT doesn't care about word order... except when it matters

    Authors: Isabel Papadimitriou, Richard Futrell, Kyle Mahowald

    Abstract: Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words chopped, chef, and onion are more likely used to convey "The chef chopped the onion," not "The onion chopped the chef." Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural pr…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  6. arXiv:2202.12312  [pdf, other]

    cs.CL

    Oolong: Investigating What Makes Transfer Learning Hard with Controlled Studies

    Authors: Zhengxuan Wu, Alex Tamkin, Isabel Papadimitriou

    Abstract: When we transfer a pretrained language model to a new language, there are many axes of variation that change at once. To disentangle the impact of different factors like syntactic similarity and vocabulary similarity, we propose a set of controlled transfer studies: we systematically transform the language of the GLUE benchmark, altering one axis of crosslingual variation at a time, and then measu…

    Submitted 23 January, 2024; v1 submitted 24 February, 2022; originally announced February 2022.

    Comments: EMNLP 2023

  7. arXiv:2108.07258  [pdf, other]

    cs.LG cs.AI cs.CY

    On the Opportunities and Risks of Foundation Models

    Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, et al. (89 additional authors not shown)

    Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their cap…

    Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

  8. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    Authors: Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Andre Niyongabo Rubungo, Toan Q. Nguyen, Mathias Müller, André Müller, et al. (27 additional authors not shown)

    Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have system…

    Submitted 21 February, 2022; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted at TACL; pre-MIT Press publication version

    Journal ref: Transactions of the Association for Computational Linguistics (2022) 10: 50-72

  9. arXiv:2101.11043  [pdf, other]

    cs.CL

    Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT

    Authors: Isabel Papadimitriou, Ethan A. Chi, Richard Futrell, Kyle Mahowald

    Abstract: We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a "subject") is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecth…

    Submitted 26 January, 2021; originally announced January 2021.

    Comments: EACL 2021

  10. arXiv:2004.14601  [pdf, other]

    cs.CL

    Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models

    Authors: Isabel Papadimitriou, Dan Jurafsky

    Abstract: We propose transfer learning as a method for analyzing the encoding of grammatical structure in neural language models. We train LSTMs on non-linguistic data and evaluate their performance on natural language to assess which kinds of data induce generalizable structural features that LSTMs can use for natural language. We find that training on non-linguistic data with latent structure (MIDI music…

    Submitted 30 October, 2020; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: EMNLP 2020