Unsupervised Word Segmentation from Speech with Attention
Authors:
Pierre Godard,
Marcely Zanon-Boito,
Lucas Ondel,
Alexandre Berard,
François Yvon,
Aline Villavicencio,
Laurent Besacier
Abstract:
We present a first attempt to perform attentional word segmentation directly from the speech signal, with the ultimate goal of automatically identifying lexical units in a low-resource, unwritten language (UL). Our methodology assumes that recordings in the UL are paired with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones, which is then segmented using the soft alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
Submitted 18 June, 2018;
originally announced June 2018.
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Authors:
P. Godard,
G. Adda,
M. Adda-Decker,
J. Benjumea,
L. Besacier,
J. Cooper-Leavitt,
G-N. Kouarata,
L. Lamel,
H. Maynard,
M. Mueller,
A. Rialland,
S. Stueker,
F. Yvon,
M. Zanon-Boito
Abstract:
Most speech and language technologies are trained with massive amounts of speech and text information. However, most of the world's languages have neither such resources nor a stable orthography. Systems built under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages. Example tasks include automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We describe how the data was collected, cleaned, and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation.
Submitted 15 February, 2018; v1 submitted 10 October, 2017;
originally announced October 2017.