
Showing 1–22 of 22 results for author: Fadaee, M

  1. arXiv:2407.01490  [pdf, other]

    cs.CL cs.AI cs.LG

    LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

    Authors: Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

    Abstract: The widespread adoption of synthetic data raises new questions about how models generating the data can influence other large language models (LLMs) via distilled data. To start, our work exhaustively characterizes the impact of passive inheritance of model properties by systematically studying the consequences of synthetic data integration. We provide one of the most comprehensive studies to-date…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2406.18682  [pdf, other]

    cs.CL cs.AI cs.LG

    The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm

    Authors: Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker

    Abstract: A key concern with the concept of "alignment" is the implicit question of "alignment to what?". AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches…

    Submitted 26 June, 2024; originally announced June 2024.

  3. arXiv:2405.15032  [pdf, other]

    cs.CL

    Aya 23: Open Weight Releases to Further Multilingual Progress

    Authors: Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, Sara Hooker

    Abstract: This technical report introduces Aya 23, a family of multilingual language models. Aya 23 builds on the recent release of the Aya model (Üstün et al., 2024), focusing on pairing a highly performant pre-trained model with the recently released Aya collection (Singh et al., 2024). The result is a powerful multilingual large language model serving 23 languages, expanding state-of-art language modelin…

    Submitted 31 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  4. arXiv:2402.14740  [pdf, other]

    cs.LG

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Authors: Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker

    Abstract: AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that mos…

    Submitted 26 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: 27 pages, 7 figures, 2 tables

    ACM Class: I.2.7

  5. arXiv:2402.07827  [pdf, other]

    cs.CL

    Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model

    Authors: Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, Sara Hooker

    Abstract: Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages of which over 50% are considered as lower-resourced. Aya outperforms mT0 and BLOOM…

    Submitted 12 February, 2024; originally announced February 2024.

  6. arXiv:2402.06619  [pdf, other]

    cs.CL cs.AI

    Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

    Authors: Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, et al. (8 additional authors not shown)

    Abstract: Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets.…

    Submitted 9 February, 2024; originally announced February 2024.

  7. arXiv:2311.17295  [pdf, other]

    cs.CL cs.AI

    Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

    Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker, Marzieh Fadaee

    Abstract: In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fund…

    Submitted 28 November, 2023; originally announced November 2023.

    Comments: 22 pages, 7 figures, 2 tables. Revised version of the paper accepted at GEM Workshop, EMNLP 2023

  8. arXiv:2310.14424  [pdf, other]

    cs.CL cs.AI

    Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

    Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

    Abstract: Human evaluation is increasingly critical for assessing large language models, capturing linguistic nuances, and reflecting user preferences more accurately than traditional automated metrics. However, the resource-intensive nature of this type of annotation process poses significant challenges. The key question driving our work: "is it feasible to minimize human-in-the-loop feedback by prioritizi…

    Submitted 22 October, 2023; originally announced October 2023.

    Comments: 37 pages, 8 figures

  9. arXiv:2309.04564  [pdf, other]

    cs.CL cs.LG

    When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

    Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

    Abstract: Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web text. To date, efforts to prune these datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters. In this work…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: 14 pages, 8 figures

  10. arXiv:2301.01820  [pdf, ps, other]

    cs.IR cs.AI

    InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

    Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

    Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset…

    Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

  11. arXiv:2212.06121  [pdf, other]

    cs.IR cs.CL

    In Defense of Cross-Encoders for Zero-Shot Retrieval

    Authors: Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization…

    Submitted 12 December, 2022; originally announced December 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2206.02873

  12. arXiv:2206.02873  [pdf, other]

    cs.IR cs.CL cs.PF

    No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

    Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of par…

    Submitted 12 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

  13. arXiv:2202.05144  [pdf, other]

    cs.CL

    InPars: Data Augmentation for Information Retrieval using Large Language Models

    Authors: Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Rodrigo Nogueira

    Abstract: The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity has enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has show…

    Submitted 10 February, 2022; originally announced February 2022.

  14. arXiv:2108.13897  [pdf, other]

    cs.CL cs.AI

    mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

    Authors: Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

    Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translat…

    Submitted 17 August, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

  15. arXiv:2102.10437  [pdf, other]

    cs.CL

    Understanding and Enhancing the Use of Context for Machine Translation

    Authors: Marzieh Fadaee

    Abstract: To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language which is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often…

    Submitted 20 February, 2021; originally announced February 2021.

    Comments: PhD dissertation defended on November 10th, 2020

  16. arXiv:2011.00061  [pdf, other]

    cs.CL cs.IR

    A New Neural Search and Insights Platform for Navigating and Organizing AI Research

    Authors: Marzieh Fadaee, Olga Gureenkova, Fernando Rejon Barrera, Carsten Schnober, Wouter Weerkamp, Jakub Zavrel

    Abstract: To provide AI researchers with modern tools for dealing with the explosive growth of the research literature in their field, we introduce a new platform, AI Research Navigator, that combines classical keyword search with neural retrieval to discover and organize relevant literature. The system provides search at multiple levels of textual granularity, from sentences to aggregations across document…

    Submitted 30 October, 2020; originally announced November 2020.

    Comments: Accepted to Workshop on Scholarly Document Processing (SDP) at EMNLP 2020

  17. arXiv:2005.12398  [pdf, other]

    cs.CL

    The Unreasonable Volatility of Neural Machine Translation Models

    Authors: Marzieh Fadaee, Christof Monz

    Abstract: Recent works have shown that Neural Machine Translation (NMT) models achieve impressive performance, however, questions about understanding the behavior of these models remain unanswered. We investigate the unexpected volatility of NMT models where the input is semantically and syntactically correct. We discover that with trivial modifications of source sentences, we can identify cases where \text…

    Submitted 25 May, 2020; originally announced May 2020.

    Comments: Accepted to Neural Generation and Translation Workshop (WNGT) at ACL 2020

  18. arXiv:1808.09006  [pdf, other]

    cs.CL

    Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation

    Authors: Marzieh Fadaee, Christof Monz

    Abstract: Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data using a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this wo…

    Submitted 21 September, 2018; v1 submitted 27 August, 2018; originally announced August 2018.

    Comments: 11 pages, 2 figures. Accepted at EMNLP 2018

  19. arXiv:1802.04681  [pdf, other]

    cs.CL

    Examining the Tip of the Iceberg: A Data Set for Idiom Translation

    Authors: Marzieh Fadaee, Arianna Bisazza, Christof Monz

    Abstract: Neural Machine Translation (NMT) has been widely used in recent years with significant improvements for many language pairs. Although state-of-the-art NMT systems are generating progressively better translations, idiom translation remains one of the open challenges in this field. Idioms, a category of multiword expressions, are an interesting language phenomenon where the overall meaning of the ex…

    Submitted 13 February, 2018; originally announced February 2018.

    Comments: Accepted at LREC 2018

  20. Learning Topic-Sensitive Word Representations

    Authors: Marzieh Fadaee, Arianna Bisazza, Christof Monz

    Abstract: Distributed word representations are widely used for modeling words in NLP tasks. Most of the existing models generate one representation per word and do not consider different meanings of a word. We present two approaches to learn multiple topic-sensitive representations per word by using Hierarchical Dirichlet Process. We observe that by modeling topics and integrating topic distributions for ea…

    Submitted 1 May, 2017; originally announced May 2017.

    Comments: 5 pages, 1 figure, Accepted at ACL 2017

  21. Data Augmentation for Low-Resource Neural Machine Translation

    Authors: Marzieh Fadaee, Arianna Bisazza, Christof Monz

    Abstract: The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthe…

    Submitted 1 May, 2017; originally announced May 2017.

    Comments: 5 pages, 2 figures, Accepted at ACL 2017

  22. Active Learning from Positive and Unlabeled Data

    Authors: Alireza Ghasemi, Hamid R. Rabiee, Mohsen Fadaee, Mohammad T. Manzuri, Mohammad H. Rohban

    Abstract: During recent years, active learning has evolved into a popular paradigm for utilizing user's feedback to improve accuracy of learning algorithms. Active learning works by selecting the most informative sample among unlabeled data and querying the label of that point from user. Many different methods such as uncertainty sampling and minimum risk sampling have been utilized to select the most infor…

    Submitted 24 February, 2016; originally announced February 2016.

    Comments: 6 pages, presented at IEEE ICDM 2011 Workshops