-
MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition
Authors:
Jarod Duret,
Mickael Rouvier,
Yannick Estève
Abstract:
In this work, we detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge. This challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction. We concentrated our efforts on Task 1, which involves the categorical classification of eight emotional states using data from the MSP-Podcast dataset. Our approach employs an ensemble of models, each trained independently and then fused at the score level using a Support Vector Machine (SVM) classifier. The models were trained using various strategies, including Self-Supervised Learning (SSL) fine-tuning across different modalities: speech alone, text alone, and a combined speech and text approach. This joint training methodology aims to enhance the system's ability to accurately classify emotional states. Thus, the system obtained an F1-macro of 0.35% on the development set.
Submitted 8 July, 2024;
originally announced July 2024.
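The F1-macro metric reported in the abstract above is the unweighted mean of the per-class F1 scores, so rare emotion classes count as much as frequent ones. The following is a minimal sketch of that computation, not the challenge's official scoring script; the function name `macro_f1` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted wrongly
            fn[t] += 1          # class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only
reference = ["angry", "angry", "happy", "happy"]
predicted = ["angry", "happy", "happy", "happy"]
print(macro_f1(reference, predicted))
```

Because every class contributes equally to the average, a system that ignores minority emotions is penalized more heavily than it would be under plain accuracy.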
-
Zero-Shot End-To-End Spoken Question Answering In Medical Domain
Authors:
Yanis Labrak,
Adel Moumen,
Richard Dufour,
Mickael Rouvier
Abstract:
In the rapidly evolving landscape of spoken question-answering (SQA), the integration of large language models (LLMs) has emerged as a transformative development. Conventional approaches often entail the use of separate models for question audio transcription and answer selection, resulting in significant resource utilization and error accumulation. To tackle these challenges, we explore the effectiveness of end-to-end (E2E) methodologies for SQA in the medical domain. Our study introduces a novel zero-shot SQA approach and compares it to traditional cascade systems. Through a comprehensive evaluation conducted on a new open benchmark of 8 medical tasks and 48 hours of synthetic audio, we demonstrate that our approach requires up to 14.7 times fewer resources than a 1.3B-parameter LLM combined with a 1.55B-parameter ASR model, while improving average accuracy by 0.5%. These findings underscore the potential of E2E methodologies for SQA in resource-constrained contexts.
Submitted 9 June, 2024;
originally announced June 2024.
-
Asymmetric and trial-dependent modeling: the contribution of LIA to SdSV Challenge Task 2
Authors:
Pierre-Michel Bousquet,
Mickael Rouvier
Abstract:
The SdSV Challenge Task 2 provided an opportunity to assess the efficiency and robustness of modern text-independent speaker verification systems. It also made it possible to test new approaches capable of taking into account the main issues of this challenge (duration, language, ...). This paper describes the contributions of our laboratory to the speaker recognition field. These contributions highlight two other challenges in addition to short duration and language: the mismatch between enrollment and test data, and the mismatch between subsets of the evaluation trial dataset. The proposed approaches experimentally show their relevance and efficiency on the SdSV evaluation, and could be of interest in many real-life applications.
Submitted 28 March, 2024;
originally announced March 2024.
-
Probing the Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems
Authors:
Quentin Raymondaud,
Mickael Rouvier,
Richard Dufour
Abstract:
Deep learning architectures have made significant progress in terms of performance in many research areas. The automatic speech recognition (ASR) field has thus benefited from these scientific and technological advances, particularly for acoustic modeling, which now integrates deep neural network architectures. However, these performance gains have translated into increased complexity regarding the information learned and conveyed through these black-box architectures. Following much research on neural network interpretability, we propose in this article a protocol that aims to determine which information is encoded in an ASR acoustic model (AM) and where it is located. To do so, we propose to evaluate AM performance on a determined set of tasks using intermediate representations (here, at different layer levels). Based on the performance variation and the targeted tasks, we can formulate hypotheses about which information is enhanced or perturbed at different steps of the architecture. Experiments are performed on speaker verification, acoustic environment classification, gender classification, tempo-distortion detection and speech sentiment/emotion identification. The analysis showed that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment or speaker identity. The low-level hidden layers appear globally useful for structuring information, while the upper ones tend to delete information that is useless for phoneme recognition.
Submitted 29 February, 2024;
originally announced February 2024.
-
How Important Is Tokenization in French Medical Masked Language Models?
Authors:
Yanis Labrak,
Adrien Bazoge,
Beatrice Daille,
Mickael Rouvier,
Richard Dufour
Abstract:
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair Encoding (BPE) and was later followed by the adoption of SentencePiece and WordPiece. While subword tokenization consistently outperforms character and word-level tokenization, the precise factors contributing to its success remain unclear. Key aspects such as the optimal segmentation granularity for diverse tasks and languages, the influence of data sources on tokenizers, and the role of morphological information in Indo-European languages remain insufficiently explored. This is particularly pertinent for biomedical terminology, characterized by specific rules governing morpheme combinations. Despite the agglutinative nature of biomedical terminology, existing language models do not explicitly incorporate this knowledge, leading to inconsistent tokenization strategies for common terms. In this paper, we delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks and pinpoint areas where further enhancements can be made. We analyze classical tokenization algorithms, including BPE and SentencePiece, and introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
Submitted 9 June, 2024; v1 submitted 22 February, 2024;
originally announced February 2024.
-
DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain
Authors:
Yanis Labrak,
Adrien Bazoge,
Oumaima El Khettari,
Mickael Rouvier,
Pacome Constant dit Beaufils,
Natalia Grabar,
Beatrice Daille,
Solen Quiniou,
Emmanuel Morin,
Pierre-Antoine Gourraud,
Richard Dufour
Abstract:
The biomedical domain has sparked significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of the intrinsic qualities of PLMs from various perspectives. Although still limited to a few languages, this initiative has been undertaken in the biomedical field, notably for English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark, called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-the-art pre-trained masked language models (MLMs) on general and biomedical-specific data, as well as English-specific MLMs to assess their cross-lingual capabilities. Our experiments reveal that no single model excels across all tasks, while generalist models are sometimes still competitive.
Submitted 20 February, 2024;
originally announced February 2024.
-
BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
Authors:
Yanis Labrak,
Adrien Bazoge,
Emmanuel Morin,
Pierre-Antoine Gourraud,
Mickael Rouvier,
Richard Dufour
Abstract:
Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated this benchmark into 7 other languages and evaluated the models on it. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.
Submitted 9 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Jeffreys divergence-based regularization of neural network output distribution applied to speaker recognition
Authors:
Pierre-Michel Bousquet,
Mickael Rouvier
Abstract:
A new loss function for speaker recognition with deep neural networks is proposed, based on the Jeffreys divergence. Adding this divergence to the cross-entropy loss function makes it possible to maximize the target value of the output distribution while smoothing the non-target values. This objective function provides highly discriminative features. Beyond this effect, we propose a theoretical justification of its effectiveness and try to understand how this loss function affects the model, in particular its impact on dataset types (i.e. in-domain or out-of-domain w.r.t. the training corpus). Our experiments show that the Jeffreys loss consistently outperforms the state-of-the-art for speaker recognition, especially on out-of-domain data, and helps limit false alarms.
Submitted 28 December, 2023;
originally announced December 2023.
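For readers unfamiliar with the Jeffreys divergence mentioned in the abstract above: it is the symmetrized Kullback-Leibler divergence, J(p, q) = KL(p||q) + KL(q||p) = sum_i (p_i - q_i) log(p_i / q_i). The sketch below computes it and adds it to a cross-entropy term. The abstract does not say which distributions the authors compare or how the term is weighted, so the smoothed one-hot reference, the `weight` and `smoothing` parameters, and the function names are all assumptions made for illustration.

```python
import math


def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetrized KL divergence: sum_i (p_i - q_i) * log(p_i / q_i)."""
    # eps only guards the logarithm against zero probabilities
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_loss(logits, target, weight=0.1, smoothing=0.1):
    """Cross-entropy plus a weighted Jeffreys term (assumed form, not the paper's)."""
    probs = softmax(logits)
    n = len(probs)
    # Assumption: reference distribution is a smoothed one-hot target (needs n >= 2)
    ref = [smoothing / (n - 1)] * n
    ref[target] = 1.0 - smoothing
    cross_entropy = -math.log(probs[target])
    return cross_entropy + weight * jeffreys_divergence(ref, probs)


print(jeffreys_divergence([0.5, 0.5], [0.9, 0.1]))
print(regularized_loss([2.0, 0.0, 0.0], target=0))
```

Unlike plain KL, the Jeffreys divergence is symmetric in its two arguments and is zero only when the two distributions coincide.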
-
SynVox2: Towards a privacy-friendly VoxCeleb2 dataset
Authors:
Xiaoxiao Miao,
Xin Wang,
Erica Cooper,
Junichi Yamagishi,
Nicholas Evans,
Massimiliano Todisco,
Jean-François Bonastre,
Mickael Rouvier
Abstract:
The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already been questioned on account of the ethical, privacy, and legal concerns that arise when using large-scale datasets of natural speech collected from real human speakers. For example, the widely-used VoxCeleb2 dataset for speaker recognition is no longer accessible from the official website. To mitigate these concerns, this work presents an initiative to generate a privacy-friendly synthetic VoxCeleb2 dataset that ensures the quality of the generated speech in terms of privacy, utility, and fairness. We also discuss the challenges of using synthetic data for the downstream task of speaker verification.
Submitted 12 September, 2023;
originally announced September 2023.
-
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Authors:
Titouan Parcollet,
Ha Nguyen,
Solene Evain,
Marcely Zanon Boito,
Adrien Pupier,
Salima Mdhaffar,
Hang Le,
Sina Alisamir,
Natalia Tomashenko,
Marco Dinarelli,
Shucong Zhang,
Alexandre Allauzen,
Maximin Coavoux,
Yannick Esteve,
Mickael Rouvier,
Jerome Goulian,
Benjamin Lecouteux,
Francois Portet,
Solange Rossato,
Fabien Ringeval,
Didier Schwab,
Laurent Besacier
Abstract:
Self-supervised learning (SSL) is at the origin of unprecedented improvements in many different domains, including computer vision and natural language processing. Speech processing has drastically benefitted from SSL, as most of the current domain-related tasks are now being approached with pre-trained models. This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies. It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of heterogeneous speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community, and an evaluation protocol made of six downstream tasks to complement existing benchmarks. LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for speech with the investigation of frozen versus fine-tuned downstream models, task-agnostic versus task-specific pre-trained models, as well as a discussion on the carbon footprint of large-scale model training. Overall, the newly introduced models trained on 14,000 hours of French speech outperform multilingual and previous LeBenchmark SSL models across the benchmark, but also required up to four times more energy for pre-training.
Submitted 18 March, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks
Authors:
Yanis Labrak,
Mickael Rouvier,
Richard Dufour
Abstract:
We evaluate four state-of-the-art instruction-tuned large language models (LLMs) -- ChatGPT, Flan-T5 UL2, Tk-Instruct, and Alpaca -- on a set of 13 real-world clinical and biomedical natural language processing (NLP) tasks in English, such as named-entity recognition (NER), question-answering (QA), and relation extraction (RE). Our overall results demonstrate that the evaluated LLMs begin to approach the performance of state-of-the-art models in zero- and few-shot scenarios for most tasks, and particularly well for the QA task, even though they have never seen examples from these tasks before. However, we observed that performance on the classification and RE tasks is below what can be achieved with a model specifically trained for the medical field, such as PubMedBERT. Finally, we noted that no LLM outperforms all the others on all the studied tasks, with some models being better suited for certain tasks than others.
Submitted 9 June, 2024; v1 submitted 22 July, 2023;
originally announced July 2023.
-
FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain
Authors:
Yanis Labrak,
Adrien Bazoge,
Richard Dufour,
Mickael Rouvier,
Emmanuel Morin,
Béatrice Daille,
Pierre-Antoine Gourraud
Abstract:
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
Submitted 9 April, 2023;
originally announced April 2023.
-
DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains
Authors:
Yanis Labrak,
Adrien Bazoge,
Richard Dufour,
Mickael Rouvier,
Emmanuel Morin,
Béatrice Daille,
Pierre-Antoine Gourraud
Abstract:
In recent years, pre-trained language models (PLMs) have achieved the best performance on a wide range of natural language processing (NLP) tasks. While the first models were trained on general domain data, specialized ones have emerged to more effectively treat specific domains. In this paper, we propose an original study of PLMs in the medical domain for the French language. We compare, for the first time, the performance of PLMs trained on both public data from the web and private data from healthcare establishments. We also evaluate different learning strategies on a set of biomedical tasks. In particular, we show that we can take advantage of already existing biomedical PLMs in a foreign language by further pre-training them on our targeted data. Finally, we release the first specialized PLMs for the biomedical field in French, called DrBERT, as well as the largest corpus of medical data under free license on which these models are trained.
Submitted 4 May, 2023; v1 submitted 3 April, 2023;
originally announced April 2023.
-
I4U System Description for NIST SRE'20 CTS Challenge
Authors:
Kong Aik Lee,
Tomi Kinnunen,
Daniele Colibro,
Claudio Vair,
Andreas Nautsch,
Hanwu Sun,
Liang He,
Tianyu Liang,
Qiongqiong Wang,
Mickael Rouvier,
Pierre-Michel Bousquet,
Rohan Kumar Das,
Ignacio Viñals Bailo,
Meng Liu,
Héctor Delgado,
Xuechen Liu,
Md Sahidullah,
Sandro Cumani,
Boning Zhang,
Koji Okabe,
Hitoshi Yamamoto,
Ruijie Tao,
Haizhou Li,
Alfonso Ortega Giménez,
Longbiao Wang
, et al. (1 additional author not shown)
Abstract:
This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U submission resulted from active collaboration among researchers across eight research teams - I$^2$R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (China). The submission was based on the fusion of top-performing sub-systems and sub-fusion systems contributed by individual teams. Efforts were devoted to the use of common development and validation sets, a shared submission schedule and milestones, and minimizing inconsistency in trial list and score file formats across sites.
Submitted 2 November, 2022;
originally announced November 2022.
-
On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding
Authors:
Gaëlle Laperrière,
Valentin Pelloin,
Mickaël Rouvier,
Themos Stafylakis,
Yannick Estève
Abstract:
In this paper we examine the use of semantically-aligned speech representations for end-to-end spoken language understanding (SLU). We employ the recently-introduced SAMU-XLSR model, which is designed to generate a single embedding that captures the semantics at the utterance level, semantically aligned across different languages. This model combines the acoustic frame-level speech representation learning model (XLS-R) with the Language Agnostic BERT Sentence Embedding (LaBSE) model. We show that using the SAMU-XLSR model instead of the initial XLS-R model significantly improves performance in the framework of end-to-end SLU. Finally, we present the benefits of using this model towards language portability in SLU.
Submitted 11 October, 2022;
originally announced October 2022.
-
Speech Resources in the Tamasheq Language
Authors:
Marcely Zanon Boito,
Fethi Bougares,
Florentin Barbier,
Souhir Gahbiche,
Loïc Barrault,
Mickael Rouvier,
Yannick Estève
Abstract:
In this paper we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller 17-hour parallel corpus of audio recordings in Tamasheq, with utterance-level translations in the French language. All this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.
Submitted 11 April, 2022; v1 submitted 13 January, 2022;
originally announced January 2022.
-
Studying squeeze-and-excitation used in CNN for speaker verification
Authors:
Mickael Rouvier,
Pierre-Michel Bousquet
Abstract:
In speaker verification, the extraction of voice representations is mainly based on the Residual Neural Network (ResNet) architecture. ResNet is built upon convolution layers which learn filters to capture local spatial patterns along all the input, then generate feature maps that jointly encode the spatial and channel information. Unfortunately, all feature maps in a convolution layer are learnt independently (the convolution layer does not exploit the dependencies between feature maps) and locally. This problem was first tackled in image processing. A channel attention mechanism, called squeeze-and-excitation (SE), has recently been proposed in convolution layers and applied to speaker verification. This mechanism re-weights the information extracted across feature maps. In this paper, we first propose an original qualitative study of the influence and the role of the SE mechanism applied to the speaker verification task at different stages of the ResNet, and then evaluate several SE architectures. We finally propose to improve the SE approach with a new pooling variant based on the concatenation of mean- and standard-deviation pooling. Results showed that applying SE only on the first stages of the ResNet better captures speaker information for the verification task, and that significant discrimination gains on the Voxceleb1-E, Voxceleb1-H and SITW evaluation tasks were obtained using the proposed pooling variant.
Submitted 13 September, 2021;
originally announced September 2021.
-
Study on the temporal pooling used in deep neural networks for speaker verification
Authors:
Mickael Rouvier,
Pierre-Michel Bousquet,
Jarod Duret
Abstract:
The x-vector architecture has recently achieved state-of-the-art results on the speaker verification task. This architecture incorporates a central layer, referred to as temporal pooling, which stacks statistical parameters of the acoustic frame distribution. This work highlights the significant effect of the temporal pooling content on the training dynamics and task performance. An evaluation with different pooling layers is conducted, that is, including different statistical measures of central tendency. Notably, 3rd and 4th moment-based statistics (skewness and kurtosis) are also tested to complete the usual mean and standard-deviation parameters. Our experiments show the influence of the pooling layer content in terms of speaker verification performance, but also for several classification tasks (speaker, channel or text related), and better reveal the presence of information external to the speaker identity depending on the layer content.
Submitted 10 May, 2021;
originally announced May 2021.
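The temporal pooling described in the abstract above turns a variable-length sequence of frame vectors into one fixed-size vector by stacking per-dimension statistics. Below is a minimal pure-Python sketch of that idea with mean, standard deviation, skewness and kurtosis. It is not the authors' implementation: the function name `stats_pooling`, the use of population (biased) moments, non-excess kurtosis, and the zero fallback for constant dimensions are assumptions made for illustration.

```python
import math


def stats_pooling(frames, use_higher_moments=True):
    """Pool T frame vectors of size D into one vector of size 2*D or 4*D."""
    t = len(frames)
    dims = len(frames[0])
    means, stds, skews, kurts = [], [], [], []
    for d in range(dims):
        column = [frame[d] for frame in frames]
        mu = sum(column) / t
        # Population variance (divides by T, not T - 1)
        sd = math.sqrt(sum((x - mu) ** 2 for x in column) / t)
        means.append(mu)
        stds.append(sd)
        if use_higher_moments:
            if sd > 0:
                # Standardized 3rd and 4th central moments
                skews.append(sum((x - mu) ** 3 for x in column) / t / sd ** 3)
                kurts.append(sum((x - mu) ** 4 for x in column) / t / sd ** 4)
            else:
                # Assumed fallback when a dimension is constant over time
                skews.append(0.0)
                kurts.append(0.0)
    # Layout: all means, then all stds, then skewness and kurtosis if requested
    return means + stds + skews + kurts


# Hypothetical toy input: 2 frames, 2 feature dimensions
print(stats_pooling([[0.0, 1.0], [2.0, 1.0]]))
```

Setting `use_higher_moments=False` gives the usual mean and standard-deviation pooling, which makes it easy to compare the two pooling contents on the same frames.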
-
ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task
Authors:
Ha Nguyen,
Natalia Tomashenko,
Marcely Zanon Boito,
Antoine Caubriere,
Fethi Bougares,
Mickael Rouvier,
Laurent Besacier,
Yannick Esteve
Abstract:
This paper describes the ON-TRAC Consortium translation systems developed for the end-to-end model task of the IWSLT Evaluation 2019 for the English-to-Portuguese language pair. The ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). A single end-to-end model, built as a neural encoder-decoder architecture with an attention mechanism, was used for two primary submissions corresponding to the two EN-PT evaluation sets: (1) TED (MuST-C) and (2) How2. In this paper, we notably investigate the impact of pooling heterogeneous corpora for training, of target tokenization (characters or BPEs), and of speech input segmentation. We also compare our best end-to-end model (BLEU of 26.91 on MuST-C and 43.82 on How2 validation sets) to a pipeline (ASR+MT) approach.
Submitted 30 October, 2019;
originally announced October 2019.
-
I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences
Authors:
Kong Aik Lee,
Ville Hautamaki,
Tomi Kinnunen,
Hitoshi Yamamoto,
Koji Okabe,
Ville Vestman,
Jing Huang,
Guohong Ding,
Hanwu Sun,
Anthony Larcher,
Rohan Kumar Das,
Haizhou Li,
Mickael Rouvier,
Pierre-Michel Bousquet,
Wei Rao,
Qing Wang,
Chunlei Zhang,
Fahimeh Bahmaninezhad,
Hector Delgado,
Jose Patino,
Qiongqiong Wang,
Ling Guo,
Takafumi Koshinaka,
Jiacen Zhang,
Koichi Shinoda
, et al. (21 additional authors not shown)
Abstract:
The I4U consortium was established to facilitate a joint entry to the NIST speaker recognition evaluations (SRE). The latest edition of this joint submission was in SRE 2018, in which the I4U submission was among the best-performing systems. SRE'18 also marks the 10-year anniversary of the I4U consortium's participation in the NIST SRE series of evaluations. The primary objective of the current paper is to summarize the results and lessons learned from the twelve sub-systems and their fusion submitted to SRE'18. It is also our intention to present a shared view on the advancements, progress, and major paradigm shifts that we have witnessed as SRE participants over the past decade, from SRE'08 to SRE'18. In this regard, we have seen, among others, a paradigm shift from supervector representation to deep speaker embedding, and a shift in research challenge from channel compensation to domain adaptation.
Submitted 15 April, 2019;
originally announced April 2019.
-
Building a robust sentiment lexicon with (almost) no resource
Authors:
Mickael Rouvier,
Benoit Favre
Abstract:
Creating sentiment polarity lexicons is labor-intensive. Automatically translating them from resource-rich languages requires in-domain machine translation systems, which rely on large quantities of bi-texts. In this paper, we propose to replace machine translation by transferring words from the lexicon through word embeddings aligned across languages with a simple linear transform. The approach leads to no degradation, compared to machine translation, when tested on sentiment polarity classification of tweets in four languages.
Submitted 15 December, 2016;
originally announced December 2016.
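The idea of aligning word embeddings across languages with a simple linear transform, as summarized in the abstract above, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the least-squares fit on seed translation pairs, the nearest-neighbour transfer by cosine similarity, and all names (`fit_linear_map`, `transfer_lexicon`, the toy vocabulary) are hypothetical choices for this example and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: more varied seed pairs are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(src: Sequence[Vector], tgt: Sequence[Vector]) -> Matrix:
    """Least-squares W with W @ src[i] ~= tgt[i], via the normal equations."""
    d_in, d_out = len(src[0]), len(tgt[0])
    xtx = [[sum(x[i] * x[j] for x in src) for j in range(d_in)] for i in range(d_in)]
    xty = [[sum(x[i] * y[j] for x, y in zip(src, tgt)) for j in range(d_out)]
           for i in range(d_in)]
    w_t = solve(xtx, xty)  # d_in x d_out, i.e. W transposed
    return [[w_t[i][j] for i in range(d_in)] for j in range(d_out)]


def apply_map(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def cosine(u: Vector, v: Vector) -> float:
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else sum(a * b for a, b in zip(u, v)) / (nu * nv)


def transfer_lexicon(lexicon: Dict[str, int], src_emb: Dict[str, Vector],
                     tgt_emb: Dict[str, Vector], w: Matrix) -> Dict[str, int]:
    """Give each source polarity to the nearest target word after mapping."""
    transferred: Dict[str, int] = {}
    for word, polarity in lexicon.items():
        if word not in src_emb:
            continue  # no embedding available for this lexicon entry
        mapped = apply_map(w, src_emb[word])
        nearest = max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))
        transferred[nearest] = polarity
    return transferred


if __name__ == "__main__":
    # Toy 2-D embeddings: the target space is the source rotated by 90 degrees.
    src_emb = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.5, 0.5]}
    tgt_emb = {"bon": [0.0, 1.0], "mauvais": [0.0, -1.0], "chat": [-1.0, 0.0], "chien": [-0.5, 0.5]}
    seeds = [("cat", "chat"), ("dog", "chien")]
    w = fit_linear_map([src_emb[s] for s, _ in seeds], [tgt_emb[t] for _, t in seeds])
    print(transfer_lexicon({"good": 1, "bad": -1}, src_emb, tgt_emb, w))
```

In this toy setup the seed pairs contain no sentiment words, yet the fitted map still carries the polarity entries to their counterparts, which is the property the approach relies on.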
-
LIA system description for NIST SRE 2016
Authors:
Mickael Rouvier,
Pierre-Michel Bousquet,
Moez Ajili,
Waad Ben Kheder,
Driss Matrouf,
Jean-François Bonastre
Abstract:
This paper describes the LIA speaker recognition system developed for the Speaker Recognition Evaluation (SRE) campaign. Eight sub-systems are developed, all based on a state-of-the-art approach: i-vector/PLDA, which represents the mainstream technique in text-independent speaker recognition. These sub-systems differ in the acoustic feature extraction front-end (MFCC, PLP), in the i-vector extraction stage (UBM, DNN or two-feats posteriors), and in the data-shifting method (IDVC, mean-shifting). The submitted system is a score-level fusion of these eight sub-systems.
Submitted 15 December, 2016;
originally announced December 2016.
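The abstract above states that the sub-systems are fused at the score level but does not say how. One common and simple option, shown here purely as an assumption, is to z-normalize each sub-system's trial scores and take a weighted sum. The names `znorm` and `fuse_scores` and the equal default weights are hypothetical; in practice fusion weights are usually trained on a development set.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def znorm(scores: Sequence[float]) -> List[float]:
    """Shift and scale one sub-system's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # constant scores carry no information
    return [(s - mu) / sigma for s in scores]


def fuse_scores(subsystem_scores: Sequence[Sequence[float]],
                weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted linear fusion of per-trial scores from several sub-systems.

    subsystem_scores[k][t] is the score of sub-system k on trial t.
    Equal weights are an assumption of this sketch.
    """
    n_sys = len(subsystem_scores)
    n_trials = len(subsystem_scores[0])
    if any(len(s) != n_trials for s in subsystem_scores):
        raise ValueError("all sub-systems must score the same trials")
    if weights is None:
        weights = [1.0 / n_sys] * n_sys
    if len(weights) != n_sys:
        raise ValueError("one weight per sub-system is required")
    normed = [znorm(s) for s in subsystem_scores]
    return [sum(w * sys[t] for w, sys in zip(weights, normed))
            for t in range(n_trials)]


if __name__ == "__main__":
    # Two toy sub-systems with different score ranges on three trials.
    print(fuse_scores([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]))
```

Normalizing before summing keeps a sub-system with a wide score range from dominating the fused score.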