Showing 1–37 of 37 results for author: Yvon, F

  1. arXiv:2406.06263

    cs.CL

    MaskLID: Code-Switching Language Identification through Iterative Masking

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: ACL 2024

  2. arXiv:2405.15070

    cs.CL

    Optimizing example selection for retrieval-augmented machine translation with translation memories

    Authors: Maxime Bouthors, Josep Crego, François Yvon

    Abstract: Retrieval-augmented machine translation leverages examples from a translation memory by retrieving similar instances. These examples are used to condition the predictions of a neural decoder. We aim to improve the upstream retrieval step and consider a fixed downstream edit-based model: the multi-Levenshtein Transformer. The task consists of finding a set of examples that maximizes the overall cov…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: TALN conference, French, 10 pages, 7 figures

  3. arXiv:2405.14782

    cs.CL

    Lessons from the Trenches on Reproducible Evaluation of Language Models

    Authors: Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Jeffrey Hsu, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, Aviya Skowron, Samson Tan, et al. (5 additional authors not shown)

    Abstract: Effective evaluation of language models remains an open challenge in NLP. Researchers and engineers face methodological issues such as the sensitivity of models to evaluation setup, difficulty of proper comparisons across methods, and the lack of reproducibility and transparency. In this paper we draw on three years of experience in evaluating large language models to provide guidance and lessons…

    Submitted 29 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  4. arXiv:2404.02835

    cs.CL

    Retrieving Examples from Memory for Retrieval Augmented Neural Machine Translation: A Systematic Comparison

    Authors: Maxime Bouthors, Josep Crego, Francois Yvon

    Abstract: Retrieval-Augmented Neural Machine Translation (RAMT) architectures retrieve examples from memory to guide the generation process. While most works in this trend explore new ways to exploit the retrieved examples, the upstream retrieval step is mostly unexplored. In this paper, we study the effect of varying retrieval methods for several translation architectures, to better understand the interpla…

    Submitted 3 April, 2024; originally announced April 2024.

  5. arXiv:2402.00786

    cs.CL cs.LG

    CroissantLLM: A Truly Bilingual French-English Language Model

    Authors: Manuel Faysse, Patrick Fernandes, Nuno M. Guerreiro, António Loison, Duarte M. Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro H. Martins, Antoni Bigata Casademunt, François Yvon, André F. T. Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo

    Abstract: We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a cust…

    Submitted 29 March, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  6. GlotLID: Language Identification for Low-Resource Languages

    Authors: Amir Hossein Kargaran, Ayyoob Imani, François Yvon, Hinrich Schütze

    Abstract: Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide covera…

    Submitted 2 July, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023

  7. arXiv:2310.14124

    cs.CL

    Structural generalization in COGS: Supertagging is (almost) all you need

    Authors: Alban Petit, Caio Corro, François Yvon

    Abstract: In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based semantic parsing framework in several ways to a…

    Submitted 21 October, 2023; originally announced October 2023.

    Comments: accepted at EMNLP 2023

  8. arXiv:2310.08967

    cs.CL

    Towards Example-Based NMT with Multi-Levenshtein Transformers

    Authors: Maxime Bouthors, Josep Crego, François Yvon

    Abstract: Retrieval-Augmented Machine Translation (RAMT) is attracting growing attention. This is because RAMT not only improves translation metrics, but is also assumed to implement some form of domain adaptation. In this contribution, we study another salient trait of RAMT, its ability to make translation decisions more transparent by allowing users to go back to examples that contributed to these decisio…

    Submitted 13 October, 2023; originally announced October 2023.

    Comments: 17 pages, EMNLP 2023 submission

  9. arXiv:2309.13320

    cs.CL

    GlotScript: A Resource and Tool for Low Resource Writing System Identification

    Authors: Amir Hossein Kargaran, François Yvon, Hinrich Schütze

    Abstract: We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it ret…

    Submitted 27 March, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

    Comments: LREC-COLING 2024

  10. arXiv:2306.00400

    cs.CL

    BiSync: A Bilingual Editor for Synchronized Monolingual Texts

    Authors: Josep Crego, Jitao Xu, François Yvon

    Abstract: In our globalized world, a growing number of situations arise where people are required to communicate in one or several foreign languages. In the case of written communication, users with a good command of a foreign language may find assistance from computer-aided translation (CAT) technologies. These technologies often allow users to access external resources, such as dictionaries, terminologies…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ACL 2023 System Demo

  11. arXiv:2305.19689

    cs.CL

    Assessing Word Importance Using Models Trained for Semantic Tasks

    Authors: Dávid Javorský, Ondřej Bojar, François Yvon

    Abstract: Many NLP tasks require automatically identifying the most significant words in a text. In this work, we derive word significance from models trained to solve semantic tasks: Natural Language Inference and Paraphrase Identification. Using an attribution method aimed to explain the predictions of these models, we derive importance scores for each input token. We evaluate their relevance using a so-ca…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Published in the Findings of ACL 2023

  12. Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    Authors: Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

    Abstract: The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages an…

    Submitted 26 May, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  13. arXiv:2303.01911

    cs.CL

    Investigating the Translation Performance of a Large Multilingual Language Model: the Case of BLOOM

    Authors: Rachel Bawden, François Yvon

    Abstract: The NLP community recently saw the release of a new large open-access multilingual language model, BLOOM (BigScience et al., 2022) covering 46 languages. We focus on BLOOM's multilingual ability by evaluating its machine translation performance across several datasets (WMT, Flores-101 and DiaBLa) and language pairs (high- and low-resourced). Our results show that 0-shot performance suffers from ov…

    Submitted 9 May, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

    Comments: Accepted at EAMT 2023

  14. arXiv:2211.05100

    cs.CL

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Authors: BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, et al. (369 additional authors not shown)

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access…

    Submitted 27 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  15. arXiv:2210.13163

    cs.CL

    Bilingual Synchronization: Restoring Translational Relationships with Editing Operations

    Authors: Jitao Xu, Josep Crego, François Yvon

    Abstract: Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting which assumes an initial target sequence, that must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. For this bilingual synchronization task, we conside…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 main conference

  16. arXiv:2210.09840

    cs.CL

    Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

    Authors: Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

    Abstract: Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-reso…

    Submitted 31 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  17. arXiv:2210.06020

    cs.CL

    Integrating Translation Memories into Non-Autoregressive Machine Translation

    Authors: Jitao Xu, Josep Crego, François Yvon

    Abstract: Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does n…

    Submitted 17 February, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: Accepted at EACL 2023 main conference

  18. arXiv:2205.09360

    cs.CL

    Evaluating Subtitle Segmentation for End-to-end Generation Systems

    Authors: Alina Karakanta, François Buet, Mauro Cettolo, François Yvon

    Abstract: Subtitles appear on screen as short pieces of text, segmented based on formal constraints (length) and syntactic/semantic criteria. Subtitle segmentation can be evaluated with sequence segmentation metrics against a human reference. However, standard segmentation metrics cannot be applied when systems generate outputs different than the reference, e.g. with end-to-end subtitling systems. In this p…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Accepted at LREC 2022

  19. arXiv:2205.06522

    cs.CL

    Joint Generation of Captions and Subtitles with Dual Decoding

    Authors: Jitao Xu, François Buet, Josep Crego, Elise Bertin-Lemée, François Yvon

    Abstract: As the amount of audio-visual content increases, the need to develop automatic captioning and subtitling solutions to match the expectations of a growing international audience appears as the only viable way to boost throughput and lower the related post-production costs. Automatic captioning and subtitling often need to be tightly intertwined to achieve an appropriate level of consistency and syn…

    Submitted 13 May, 2022; originally announced May 2022.

    Comments: Accepted at IWSLT 2022

  20. arXiv:2203.08654

    cs.CL

    Graph Neural Networks for Multiparallel Word Alignment

    Authors: Ayyoob Imani, Lütfi Kerem Şenel, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

    Abstract: After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pa…

    Submitted 10 August, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Report number: ACL 2022 Findings

  21. Screening Gender Transfer in Neural Machine Translation

    Authors: Guillaume Wisniewski, Lichao Zhu, Nicolas Ballier, François Yvon

    Abstract: This paper aims at identifying the information flow in state-of-the-art machine translation systems, taking as example the transfer of gender when translating from French into English. Using a controlled set of examples, we experiment with several ways to investigate how gender information circulates in an encoder-decoder architecture considering both probing techniques as well as interventions on the i…

    Submitted 25 February, 2022; originally announced February 2022.

    Comments: Accepted at BlackBoxNLP'2021

  22. arXiv:2109.10197

    cs.CL

    One Source, Two Targets: Challenges and Rewards of Dual Decoding

    Authors: Jitao Xu, François Yvon

    Abstract: Machine translation is generally understood as generating one target text from an input source document. In this paper, we consider a stronger requirement: to jointly generate two texts so that each output side effectively depends on the other. As we discuss, such a device serves several practical purposes, from multi-target machine translation to the generation of controlled variations of the tar…

    Submitted 21 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP 2021

  23. arXiv:2109.06283

    cs.CL

    Graph Algorithms for Multiparallel Word Alignment

    Authors: Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze

    Abstract: With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilin…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021

  24. arXiv:2105.04846

    cs.CL

    Can You Traducir This? Machine Translation for Code-Switched Input

    Authors: Jitao Xu, François Yvon

    Abstract: Code-Switching (CSW) is a common phenomenon that occurs in multilingual geographic or social contexts, which raises challenging problems for natural language processing tools. We focus here on Machine Translation (MT) of CSW texts, where we aim to simultaneously disentangle and translate the two mixed languages. Due to the lack of actual translated CSW data, we generate artificial training data fr…

    Submitted 11 May, 2021; originally announced May 2021.

    Journal ref: Workshop on Computational Approaches to Linguistic Code Switching, Jun 2021, Online, United States

  25. arXiv:2009.13117

    cs.CL cs.LG

    Generative latent neural models for automatic word alignment

    Authors: Anh Khoa Ngo Ho, François Yvon

    Abstract: Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems or to perform quality estimation. Variational autoencoders have been recently used in various areas of natural language processing to learn in an unsupervised way latent representations that are usef…

    Submitted 28 September, 2020; originally announced September 2020.

    Journal ref: The Association for Machine Translation in the Americas, Oct 2020, Florida, United States

  26. arXiv:2009.13116

    cs.CL cs.LG

    Neural Baselines for Word Alignment

    Authors: Anh Khoa Ngo Ho, François Yvon

    Abstract: Word alignments identify translational correspondences between words in a parallel sentence pair and are used, for instance, to learn bilingual dictionaries, to train statistical machine translation systems, or to perform quality estimation. In most areas of natural language processing, neural network models nowadays constitute the preferred approach, a situation that might also apply to word alig…

    Submitted 28 September, 2020; originally announced September 2020.

    Comments: The 16th International Workshop on Spoken Language Translation, Nov 2019, Hong Kong, Hong Kong SAR China

  27. arXiv:2004.08728

    cs.CL

    SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

    Authors: Masoud Jalili Sabet, Philipp Dufter, François Yvon, Hinrich Schütze

    Abstract: Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that re…

    Submitted 16 April, 2021; v1 submitted 18 April, 2020; originally announced April 2020.

    Comments: EMNLP (Findings) 2020

  28. arXiv:2003.13833

    cs.CL cs.AI cs.DL

    The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

    Authors: Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, et al. (22 additional authors not shown)

    Abstract: Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitu…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear

  29. arXiv:1910.08418

    cs.CL

    Controlling Utterance Length in NMT-based Word Segmentation with Attention

    Authors: Pierre Godard, Laurent Besacier, Francois Yvon

    Abstract: One of the basic tasks of computational language documentation (CLD) is to identify word boundaries in an unsegmented phonemic stream. While several unsupervised monolingual word segmentation algorithms exist in the literature, they are challenged in real-world CLD settings by the small amount of available data. A possible remedy is to take advantage of glosses or translation in a foreign, well-re…

    Submitted 18 October, 2019; originally announced October 2019.

    Comments: Accepted to IWSLT 2019 (Hong-Kong)

  30. arXiv:1903.11437

    cs.CL

    Using Monolingual Data in Neural Machine Translation: a Systematic Study

    Authors: Franck Burlot, François Yvon

    Abstract: Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be…

    Submitted 27 March, 2019; originally announced March 2019.

    Comments: Published in the Proceedings of the Third Conference on Machine Translation (Research Papers), 2018

  31. arXiv:1806.06734

    cs.CL cs.AI

    Unsupervised Word Segmentation from Speech with Attention

    Authors: Pierre Godard, Marcely Zanon-Boito, Lucas Ondel, Alexandre Berard, François Yvon, Aline Villavicencio, Laurent Besacier

    Abstract: We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-ph…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: Interspeech 2018

  32. arXiv:1802.06053

    cs.CL

    Bayesian Models for Unit Discovery on a Very Low Resource Language

    Authors: Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, Sanjeev Khudanpur

    Abstract: Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show tha…

    Submitted 20 February, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

    Comments: Accepted to ICASSP 2018

  33. arXiv:1710.03501

    cs.CL

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Authors: P. Godard, G. Adda, M. Adda-Decker, J. Benjumea, L. Besacier, J. Cooper-Leavitt, G-N. Kouarata, L. Lamel, H. Maynard, M. Mueller, A. Rialland, S. Stueker, F. Yvon, M. Zanon-Boito

    Abstract: Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages do not have such resources or stable orthography. Systems constructed under these almost zero resource conditions are not only promising for speech technology but also for computational language documentation. The goal of computational language documentation i…

    Submitted 15 February, 2018; v1 submitted 10 October, 2017; originally announced October 2017.

    Comments: accepted to LREC 2018

  34. One file to share them all: Using the COMBINE Archive and the OMEX format to share all information about a modeling project

    Authors: Frank T. Bergmann, Richard Adams, Stuart Moodie, Jonathan Cooper, Mihai Glont, Martin Golebiewski, Michael Hucka, Camille Laibe, Andrew K. Miller, David P. Nickerson, Brett G. Olivier, Nicolas Rodriguez, Herbert M. Sauro, Martin Scharm, Stian Soiland-Reyes, Dagmar Waltemath, Florent Yvon, Nicolas Le Novère

    Abstract: Background: With the ever increasing use of computational models in the biosciences, the need to share models and reproduce the results of published studies efficiently and easily is becoming more important. To this end, various standards have been proposed that can be used to describe models, simulations, data or other essential information in a consistent fashion. These constitute various separa…

    Submitted 30 September, 2014; v1 submitted 18 July, 2014; originally announced July 2014.

    Comments: 3 figures, 1 table

    Journal ref: BMC Bioinformatics 15 (2014) 369

  35. Efficient Learning of Sparse Conditional Random Fields for Supervised Sequence Labelling

    Authors: Nataliya Sokolovska, Thomas Lavergne, Olivier Cappé, François Yvon

    Abstract: Conditional Random Fields (CRFs) constitute a popular and efficient approach for supervised sequence labelling. CRFs can cope with large description spaces and can integrate some form of structural dependency between labels. In this contribution, we address the issue of efficient feature selection for CRFs based on imposing sparsity through an L1 penalty. We first show how sparsity of the parame…

    Submitted 3 January, 2010; v1 submitted 7 September, 2009; originally announced September 2009.

  36. Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

    Authors: Loïs Rigouste, Olivier Cappé, François Yvon

    Abstract: In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification or information extraction. The model considered in this contribution consists of a mixture of multinomial distributions over the wor…

    Submitted 14 June, 2006; originally announced June 2006.

    Journal ref: Information Processing & Management 43, 5 (01/09/2007) 1260–1280

  37. arXiv:cmp-lg/9608006

    cs.CL

    Grapheme-to-Phoneme Conversion using Multiple Unbounded Overlapping Chunks

    Authors: Francois Yvon

    Abstract: We present in this paper an original extension of two data-driven algorithms for the transcription of a sequence of graphemes into the corresponding sequence of phonemes. In particular, our approach generalizes the algorithm originally proposed by Dedina and Nusbaum (D&N) (1991), which had originally been promoted as a model of the human ability to pronounce unknown words by analogy to familiar…

    Submitted 14 August, 1996; originally announced August 1996.

    Comments: 11 pages, Postscript only, Proceedings of NeMLaP II