Showing 1–15 of 15 results for author: Maillard, J

  1. arXiv:2402.09611  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Towards Privacy-Aware Sign Language Translation at Scale

    Authors: Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, Jean Maillard

    Abstract: A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development…

    Submitted 14 February, 2024; originally announced February 2024.

  2. arXiv:2312.05187  [pdf, other]

    cs.CL cs.SD eess.AS

    Seamless: Multilingual Expressive and Streaming Speech Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, et al. (40 additional authors not shown)

    Abstract: Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4…

    Submitted 8 December, 2023; originally announced December 2023.

  3. arXiv:2308.11596  [pdf, other]

    cs.CL

    SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

    Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, et al. (43 additional authors not shown)

    Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded s…

    Submitted 24 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    ACM Class: I.2.7

  4. arXiv:2305.02176  [pdf, other]

    cs.CL

    Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

    Authors: Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard, Vedanuj Goswami

    Abstract: Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize t…

    Submitted 22 October, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted at Findings of EMNLP 2023

  5. arXiv:2302.05008  [pdf, other]

    cs.CL

    Language-Aware Multilingual Machine Translation with Self-Supervised Learning

    Authors: Haoran Xu, Jean Maillard, Vedanuj Goswami

    Abstract: Multilingual machine translation (MMT) benefits from cross-lingual transfer but is a challenging multitask optimization problem. This is partly because there is no clear framework to systematically learn language-specific parameters. Self-supervised learning (SSL) approaches that leverage large quantities of monolingual data (where parallel data is unavailable) have shown promise by improving tran…

    Submitted 9 February, 2023; originally announced February 2023.

    Comments: Findings of EACL 2023

  6. arXiv:2210.03070  [pdf, other]

    cs.CL

    Toxicity in Multilingual Machine Translation at Scale

    Authors: Marta R. Costa-jussà, Eric Smith, Christophe Ropers, Daniel Licht, Jean Maillard, Javier Ferrando, Carlos Escolano

    Abstract: Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact that they can have on users. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demogra…

    Submitted 5 April, 2023; v1 submitted 6 October, 2022; originally announced October 2022.

    ACM Class: I.2.7

  7. arXiv:2207.04672  [pdf]

    cs.CL cs.AI

    No Language Left Behind: Scaling Human-Centered Machine Translation

    Authors: NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, et al. (14 additional authors not shown)

    Abstract: Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200 language barrier while ensuring safe, high quality res…

    Submitted 25 August, 2022; v1 submitted 11 July, 2022; originally announced July 2022.

    Comments: 190 pages

    MSC Class: 68T50 ACM Class: I.2.7

  8. arXiv:2206.07861  [pdf, other]

    cs.CL

    Text normalization for low-resource languages: the case of Ligurian

    Authors: Stefano Lusito, Edoardo Ferrante, Jean Maillard

    Abstract: Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions or that have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language…

    Submitted 22 December, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

    Journal ref: In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, p. 98-103 (2023)

  9. arXiv:2202.13274  [pdf, other]

    cs.CL

    OCR Improves Machine Translation for Low-Resource Languages

    Authors: Oana Ignat, Jean Maillard, Vishrav Chaudhary, Francisco Guzmán

    Abstract: We aim to investigate the performance of current OCR systems on low resource languages and low resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse most common errors. We show that O…

    Submitted 13 March, 2022; v1 submitted 26 February, 2022; originally announced February 2022.

    Comments: Accepted at ACL Findings 2022

  10. arXiv:2101.00117  [pdf, other]

    cs.CL

    Multi-task Retrieval for Knowledge-Intensive Tasks

    Authors: Jean Maillard, Vladimir Karpukhin, Fabio Petroni, Wen-tau Yih, Barlas Oğuz, Veselin Stoyanov, Gargi Ghosh

    Abstract: Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be universal and perform robustly on a wide va…

    Submitted 31 December, 2020; originally announced January 2021.

  11. arXiv:2009.13655  [pdf, other]

    cs.CL cs.LG

    Conversational Semantic Parsing

    Authors: Armen Aghajanyan, Jean Maillard, Akshat Shrivastava, Keith Diedrick, Mike Haeger, Haoran Li, Yashar Mehdad, Ves Stoyanov, Anuj Kumar, Mike Lewis, Sonal Gupta

    Abstract: The structured representation for semantic parsing in task-oriented assistant systems is geared towards simple understanding of one-turn queries. Due to the limitations of the representation, the session-based properties such as co-reference resolution and context carryover are processed downstream in a pipelined system. In this paper, we propose a semantic representation for such task-oriented co…

    Submitted 28 September, 2020; originally announced September 2020.

  12. arXiv:2009.02252  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    KILT: a Benchmark for Knowledge Intensive Language Tasks

    Authors: Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel

    Abstract: Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research…

    Submitted 27 May, 2021; v1 submitted 4 September, 2020; originally announced September 2020.

    Comments: accepted at NAACL 2021

  13. arXiv:2001.00393  [pdf, other]

    math.CO cs.SC

    Stieltjes moment sequences for pattern-avoiding permutations

    Authors: Alin Bostan, Andrew Elvey Price, Anthony John Guttmann, Jean-Marie Maillard

    Abstract: A small set of combinatorial sequences have coefficients that can be represented as moments of a nonnegative measure on $[0, \infty)$. Such sequences are known as Stieltjes moment sequences. This article focuses on some classical sequences in enumerative combinatorics, denoted $Av(\mathcal{P})$, and counting permutations of $\{1, 2, \ldots, n \}$ that avoid some given pattern $\mathcal{P}$. For in…

    Submitted 17 October, 2020; v1 submitted 2 January, 2020; originally announced January 2020.

    Comments: 59 pages, 11 figures

    MSC Class: Primary 44A60; 68W30; 33F10; 15B52; Secondary 05A15; 05A10; 11B65; 60B20; 11F03; 11F12; 33A30; 33C05; 34A05

    Journal ref: The Electronic Journal of Combinatorics, 2020

  14. Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing

    Authors: Jean Maillard, Stephen Clark

    Abstract: Latent tree learning models represent sentences by composing their words according to an induced parse tree, all based on a downstream task. These models often outperform baselines which use (externally provided) syntax trees to drive the composition order. This work contributes (a) a new latent tree learning model based on shift-reduce parsing, with competitive downstream performance and non-triv…

    Submitted 3 June, 2018; originally announced June 2018.

    Comments: ACL 2018 workshop on Relevance of Linguistic Structure in Neural Architectures for NLP

    Journal ref: Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, ACL 2018

  15. Jointly Learning Sentence Embeddings and Syntax with Unsupervised Tree-LSTMs

    Authors: Jean Maillard, Stephen Clark, Dani Yogatama

    Abstract: We introduce a neural network that represents sentences by composing their words according to induced binary parse trees. We use Tree-LSTM as our composition function, applied along a tree structure found by a fully differentiable natural language chart parser. Our model simultaneously optimises both the composition function and the parser, thus eliminating the need for externally-provided parse t…

    Submitted 25 May, 2017; originally announced May 2017.

    Journal ref: Natural Language Engineering 25, no. 4 (2019): 433-49