doi 10.1109/ICASSP40776.2020.9054709

Fusion approaches for emotion recognition from speech using acoustic and text-based features

Authors: Leonardo Pepino, Pablo Riera, Luciana Ferrer, Agustin Gravano

Abstract: In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.

Submitted 27 March, 2024; originally announced March 2024.

Comments: 5 pages. Accepted in ICASSP 2020

arXiv:2401.03051 [pdf, other]

On the Stability of a non-hyperbolic nonlinear map with non-bounded set of non-isolated fixed points with applications to Machine Learning

Authors: Roberta Hansen, Matias Vera, Lautaro Estienne, Luciana Ferrer, Pablo Piantanida

Abstract: This paper deals with the convergence analysis of the SUCPA (Semi Unsupervised Calibration through Prior Adaptation) algorithm, defined from first-order non-linear difference equations and first developed to correct the scores output by a supervised machine learning classifier. The convergence analysis is addressed as a dynamical system problem, by studying the local and global stability of the nonlinear map derived from the algorithm. This map, which is defined by a composition of exponential and rational functions, turns out to be non-hyperbolic with a non-bounded set of non-isolated fixed points. Hence, a non-standard method for solving the convergence analysis is used, consisting of an ad-hoc geometrical approach. For a binary classification problem (two-dimensional map), we rigorously prove that the map is globally asymptotically stable. Numerical experiments on real-world applications are performed to support the theoretical results by means of two different classification problems: Sentiment Polarity performed with a Large Language Model and Cat-Dog Image classification. For a greater number of classes, the numerical evidence shows the same behavior of the algorithm, and this is illustrated with a Natural Language Inference example. The experiment codes are publicly accessible online at the following repository: https://github.com/LautaroEst/sucpa-convergence

Submitted 25 April, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

arXiv:2309.07391 [pdf, other]

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Authors: Leonardo Pepino, Pablo Riera, Luciana Ferrer

Abstract: The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.

Submitted 20 May, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2307.16324 [pdf, other]

Mispronunciation detection using self-supervised speech representations

Authors: Jazmin Vidal, Pablo Riera, Luciana Ferrer

Abstract: In recent years, self-supervised learning (SSL) models have produced promising results in a variety of speech-processing tasks, especially in contexts of data scarcity. In this paper, we study the use of SSL models for the task of mispronunciation detection for second language learners. We compare two downstream approaches: 1) training the model for phone recognition (PR) using native English data, and 2) training a model directly for the target task using non-native English data. We compare the performance of these two approaches for various SSL representations as well as a representation extracted from a traditional DNN-based speech recognition model. We evaluate the models on L2Arctic and EpaDB, two datasets of non-native speech annotated with pronunciation labels at the phone level. Overall, we find that using a downstream model trained for the target task gives the best performance and that most upstream models perform similarly for the task.

Submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.06713 [pdf, other]

doi 10.26615/issn.2603-2821.2023_002

Unsupervised Calibration through Prior Adaptation for Text Classification using Large Language Models

Authors: Lautaro Estienne, Luciana Ferrer, Matías Vera, Pablo Piantanida

Abstract: A wide variety of natural language tasks are currently being addressed with large-scale language models (LLMs). These models are usually trained with a very large amount of unsupervised text data and adapted to perform a downstream natural language task using methods like fine-tuning, calibration or in-context learning. In this work, we propose an approach to adapt the prior class distribution to perform text classification tasks without the need for labelled samples and with only a few in-domain sample queries. The proposed approach treats the LLM as a black box, adding a stage where the model posteriors are calibrated to the task. Results show that these methods outperform the un-adapted model for different numbers of training shots in the prompt, as well as a previous approach where calibration is performed without using any adaptation data.

Submitted 9 August, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

Journal ref: In Proceedings of the RANLP 2023 Student Research Workshop

arXiv:2303.12540 [pdf, other]

Deployment of Image Analysis Algorithms under Prevalence Shifts

Authors: Patrick Godau, Piotr Kalinowski, Evangelia Christodoulou, Annika Reinke, Minu Tizabi, Luciana Ferrer, Paul Jäger, Lena Maier-Hein

Abstract: Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between class frequencies in the data used for a method's development/validation and that in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of missing prevalence handling by analyzing (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population as a function of the discrepancy between development and deployment prevalence. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments based on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice.

Submitted 24 July, 2023; v1 submitted 22 March, 2023; originally announced March 2023.
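
Illustrative note: the abstract above mentions adjusting a trained classifier using estimated deployment prevalences. A minimal sketch of one standard way to do this is Bayes-rule prior re-weighting of the posteriors. This is not necessarily the workflow proposed in the paper; it assumes the class-conditional distributions are unchanged between environments and that the posteriors are calibrated for the development prevalences. The function name `adjust_posteriors` and the numbers are hypothetical.

```python
def adjust_posteriors(posteriors, dev_priors, deploy_priors):
    """Re-weight class posteriors for a new class prevalence.

    Assumption (not from the listing): only the class priors change
    between development and deployment, so each posterior is scaled
    by deploy_prior / dev_prior and the result is renormalized.
    """
    weighted = [
        p * new / old
        for p, old, new in zip(posteriors, dev_priors, deploy_priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical example: a sample the classifier finds ambiguous under
# balanced development data, deployed where class 0 is far more common.
adjusted = adjust_posteriors(
    posteriors=[0.5, 0.5],
    dev_priors=[0.5, 0.5],
    deploy_priors=[0.9, 0.1],
)
print(adjusted)
```

With an uninformative posterior, the adjusted output simply follows the deployment prevalence, which is the behavior one would expect from this kind of correction.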

arXiv:2302.14055 [pdf, other]

doi 10.1109/ICASSPW59220.2023.10193460

Phone and speaker spatial organization in self-supervised speech representations

Authors: Pablo Riera, Manuela Cerdeiro, Leonardo Pepino, Luciana Ferrer

Abstract: Self-supervised representations of speech are currently being widely used for a large number of applications. Recently, some efforts have been made in trying to analyze the type of information present in each of these representations. Most such work uses downstream models to test whether the representations can be successfully used for a specific task. The downstream models, though, typically perform nonlinear operations on the representation extracting information that may not have been readily available in the original representation. In this work, we analyze the spatial organization of phone and speaker information in several state-of-the-art speech representations using methods that do not require a downstream model. We measure how different layers encode basic acoustic parameters such as formants and pitch using representation similarity analysis. Further, we study the extent to which each representation clusters the speech samples by phone or speaker classes using non-parametric statistical testing. Our results indicate that models represent these speech attributes differently depending on the target task used during pretraining.

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.01790 [pdf, other]

doi 10.1038/s41592-023-02150-0

Understanding metric-related pitfalls in image analysis validation

Authors: Annika Reinke, Minu D. Tizabi, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Emre Kavur, Tim Rädsch, Carole H. Sudre, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, Florian Buettner, M. Jorge Cardoso, Veronika Cheplygina, Jianxu Chen, Evangelia Christodoulou, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, et al. (53 additional authors not shown)

Abstract: Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.

Submitted 23 February, 2024; v1 submitted 3 February, 2023; originally announced February 2023.

Comments: Shared first authors: Annika Reinke and Minu D. Tizabi; shared senior authors: Lena Maier-Hein and Paul F. Jäger. Published in Nature Methods. arXiv admin note: text overlap with arXiv:2206.01653

Journal ref: Nature methods, 1-13 (2024)

arXiv:2209.05355 [pdf, other]

Analysis and Comparison of Classification Metrics

Authors: Luciana Ferrer

Abstract: A variety of different performance metrics are commonly used in the machine learning literature for the evaluation of classification systems. Some of the most common ones for measuring quality of hard decisions are standard and balanced accuracy, standard and balanced error rate, F-beta score, and Matthews correlation coefficient (MCC). In this document, we review the definition of these and other metrics and compare them with the expected cost (EC), a metric introduced in every statistical learning course but rarely used in the machine learning literature. We show that both the standard and balanced error rates are special cases of the EC. Further, we show its relation with F-beta score and MCC and argue that EC is superior to these traditional metrics for being based on first principles from statistics, and for being more general, interpretable, and adaptable to any application scenario. The metrics mentioned above measure the quality of hard decisions. Yet, most modern classification systems output continuous scores for the classes which we may want to evaluate directly. Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk, among others. The last three metrics are special cases of a family of metrics given by the expected value of proper scoring rules (PSRs). We review the theory behind these metrics, showing that they are a principled way to measure the quality of the posterior probabilities produced by a system.
Finally, we show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE), arguing that calibration loss based on PSRs is superior to the ECE for being more interpretable, more general, and directly applicable to the multi-class case, among other reasons.

Submitted 20 September, 2023; v1 submitted 12 September, 2022; originally announced September 2022.
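
Illustrative note: the abstract above states that the standard and balanced error rates are special cases of the expected cost (EC). A minimal sketch of that relationship follows, assuming the usual definition EC = sum over true class i and decision j of cost[i][j] * P_i * R_ij, where P_i is the empirical prior of class i and R_ij is the fraction of class-i samples decided as j. The notation and the helper names (`expected_cost`, `zero_one_costs`, `balanced_costs`) are assumptions made for this example and may differ from the paper's.

```python
def expected_cost(labels, decisions, costs):
    """Empirical expected cost for integer labels 0..K-1."""
    k = len(costs)
    n = len(labels)
    counts = [0] * k
    confusion = [[0] * k for _ in range(k)]
    for y, d in zip(labels, decisions):
        counts[y] += 1
        confusion[y][d] += 1
    ec = 0.0
    for i in range(k):
        if counts[i] == 0:
            continue  # class absent from the data contributes nothing
        prior = counts[i] / n
        for j in range(k):
            rate = confusion[i][j] / counts[i]
            ec += costs[i][j] * prior * rate
    return ec


def zero_one_costs(k):
    """Cost 1 for every error, 0 for correct decisions."""
    return [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]


def balanced_costs(priors):
    """Error costs inversely proportional to the class priors."""
    k = len(priors)
    return [
        [0.0 if i == j else 1.0 / (k * priors[i]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical toy data: 8 samples of class 0, 2 samples of class 1.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
decisions = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
priors = [0.8, 0.2]

error_rate = expected_cost(labels, decisions, zero_one_costs(2))
balanced_error_rate = expected_cost(labels, decisions, balanced_costs(priors))
print(error_rate, balanced_error_rate)
```

With 0-1 costs the EC reduces to the fraction of misclassified samples; with costs of 1/(K * P_i) the priors cancel and the EC becomes the average of the per-class error rates.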

arXiv:2206.01653 [pdf, other]

doi 10.1038/s41592-023-02151-z

Metrics reloaded: Recommendations for image analysis validation

Authors: Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, et al. (49 additional authors not shown)

Abstract: Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks.
To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.

Submitted 23 February, 2024; v1 submitted 3 June, 2022; originally announced June 2022.

Comments: Shared first authors: Lena Maier-Hein, Annika Reinke. Published in Nature Methods. arXiv admin note: substantial text overlap with arXiv:2104.05642

Journal ref: Nature methods, 1-18 (2024)

arXiv:2204.12649 [pdf, other]

Study on the Fairness of Speaker Verification Systems on Underrepresented Accents in English

Authors: Mariel Estevez, Luciana Ferrer

Abstract: Speaker verification (SV) systems are currently being used to make sensitive decisions like giving access to bank accounts or deciding whether the voice of a suspect coincides with that of the perpetrator of a crime. Ensuring that these systems are fair and do not disfavor any particular group is crucial. In this work, we analyze the performance of several state-of-the-art SV systems across groups defined by the accent of the speakers when speaking English. To this end, we curated a new dataset based on the VoxCeleb corpus where we carefully selected samples from speakers with accents from different countries. We use this dataset to evaluate system performance for several SV systems trained with VoxCeleb data. We show that, while discrimination performance is reasonably robust across accent groups, calibration performance degrades dramatically on some accents that are not well represented in the training data. Finally, we show that a simple data balancing approach mitigates this undesirable bias, being particularly effective when applied to our recently-proposed discriminative condition-aware backend.

Submitted 26 April, 2022; originally announced April 2022.

Comments: 5 pages, 2 figures, submitted to INTERSPEECH

arXiv:2201.01364 [pdf, other]

doi 10.1109/TASLP.2022.3190736

A Discriminative Hierarchical PLDA-based Model for Spoken Language Recognition

Authors: Luciana Ferrer, Diego Castan, Mitchell McLaren, Aaron Lawson

Abstract: Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample. SLR is an important task in its own right, for example, as a tool to analyze or categorize large amounts of multi-lingual data. Further, it is also an essential tool for selecting downstream applications in a workflow, for example, to choose appropriate speech recognition or machine translation models. SLR systems are usually composed of two stages, one where an embedding representing the audio sample is extracted and a second one which computes the final scores for each language. In this work, we approach the SLR task as a detection problem and implement the second stage as a probabilistic linear discriminant analysis (PLDA) model. We show that discriminative training of the PLDA parameters gives large gains with respect to the usual generative training. Further, we propose a novel hierarchical approach where two PLDA models are trained, one to generate scores for clusters of highly-related languages and a second one to generate scores conditional to each cluster. The final language detection scores are computed as a combination of these two sets of scores. The complete model is trained discriminatively to optimize a cross-entropy objective. We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages, in many cases by large margins.
We train our systems on a collection of datasets including over 100 languages, and test them both on matched and mismatched conditions, showing that the gains are robust to condition mismatch.

Submitted 11 August, 2022; v1 submitted 4 January, 2022; originally announced January 2022.

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2396-2410, 2022

arXiv:2112.12843 [pdf, other]

Impact of class imbalance on chest x-ray classifiers: towards better evaluation practices for discrimination and calibration performance

Authors: Candelaria Mosquera, Luciana Ferrer, Diego Milone, Daniel Luna, Enzo Ferrante

Abstract: This work aims to analyze standard evaluation practices adopted by the research community when assessing chest x-ray classifiers, particularly focusing on the impact of class imbalance in such appraisals. Our analysis considers a comprehensive definition of model performance, covering not only discriminative performance but also model calibration, a topic of research that has received increasing attention during the last years within the machine learning community. Firstly, we conducted a literature study to analyze common scientific practices and confirmed that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest x-ray classifiers, despite their importance in the context of healthcare. Secondly, we perform a systematic experiment on two major chest x-ray datasets to explore the behavior of several performance metrics under different class ratios and show that widely adopted metrics can conceal the performance in the minority class. Finally, we recommend the inclusion of complementary metrics to better reflect the system's performance in such scenarios. Our study indicates that current evaluation practices adopted by the research community for chest x-ray computer-aided diagnosis systems may not reflect their performance in real clinical scenarios, and suggests alternatives to improve this situation.

Submitted 14 March, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

Comments: Conference on Health, Inference, and Learning (CHIL) 2022 - Invited non-archival presentation

arXiv:2111.11873 [pdf, other]

doi 10.1088/1361-6560/ac7e17

Deformable image registration with deep network priors: a study on longitudinal PET images

Authors: Constance Fourcade, Ludovic Ferrer, Noemie Moreau, Gianmarco Santini, Aishlinn Brennan, Caroline Rousseau, Marie Lacombe, Vincent Fleury, Mathilde Colombié, Pascal Jézéquel, Mario Campone, Mathieu Rubeaux, Diana Mateus

Abstract: Longitudinal image registration is challenging and has not yet benefited from major performance improvements thanks to deep-learning. Inspired by Deep Image Prior, this paper introduces a different use of deep architectures as regularizers to tackle the image registration question. We propose a subject-specific deformable registration method called MIRRBA, relying on a deep pyramidal architecture to be the prior parametric model constraining the deformation field. Diverging from the supervised learning paradigm, MIRRBA does not require a learning database, but only the pair of images to be registered to optimize the network's parameters and provide a deformation field. We demonstrate the regularizing power of deep architectures and present new elements to understand the role of the architecture in deep learning methods for registration. Hence, to study the impact of the network parameters, we ran our method with different architectural configurations on a private dataset of 110 metastatic breast cancer full-body PET images with manual segmentations of the brain, bladder and metastatic lesions. We compared it against conventional iterative registration approaches and supervised deep learning-based models. Global and local registration accuracies were evaluated using the detection rate and the Dice score respectively, while registration realism was evaluated using the Jacobian's determinant. Moreover, we computed the ability of the different methods to shrink vanishing lesions with the disappearing rate.
MIRRBA significantly improves the organ and lesion Dice scores of supervised models. Regarding the disappearing rate, MIRRBA more than doubles the best performing conventional approach SyNCC score. Our work therefore proposes an alternative way to bridge the performance gap between conventional and deep learning-based methods and demonstrates the regularizing power of deep architectures.

Submitted 30 March, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Comments: 11 pages, 3 figures in the main article, 2 tables in the main article, 2 figures in supplementary material

arXiv:2111.00976 [pdf, other]

A transfer learning based approach for pronunciation scoring

Authors: Marcelo Sancinetti, Jazmin Vidal, Cyntia Bonomi, Luciana Ferrer

Abstract: Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system is 20% better than the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.

Submitted 9 May, 2023; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: ICASSP 2022

arXiv:2110.06999 [pdf, other]

doi 10.1109/ICASSP43922.2022.9747742

Study of positional encoding approaches for Audio Spectrogram Transformers

Authors: Leonardo Pepino, Pablo Riera, Luciana Ferrer

Abstract: Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, in order for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.

Submitted 13 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022. 5 pages, 3 figures

arXiv:2104.05642 [pdf, other]

Common Limitations of Image Processing Metrics: A Picture Story

Authors: Annika Reinke, Minu D. Tizabi, Carole H. Sudre, Matthias Eisenmann, Tim Rädsch, Michael Baumgartner, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Peter Bankhead, Arriel Benis, Matthew Blaschko, Florian Buettner, M. Jorge Cardoso, Jianxu Chen, Veronika Cheplygina, Evangelia Christodoulou, Beth Cimini, Gary S. Collins, Sandy Engelhardt, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken , et al. (68 additional authors not shown)

Abstract: While the importance of automatic image analysis is continuously increasing, recent meta-research revealed major flaws with respect to algorithm validation. Performance metrics are particularly key for meaningful, objective, and transparent performance assessment and validation of the used automatic algorithms, but relatively little attention has been given to the practical pitfalls when using specific metrics for a given image analysis task. These are typically related to (1) the disregard of inherent metric properties, such as the behaviour in the presence of class imbalance or small target structures, (2) the disregard of inherent data set properties, such as the non-independence of the test cases, and (3) the disregard of the actual biomedical domain interest that the metrics should reflect. This living, dynamically updated document aims to illustrate important limitations of performance metrics commonly applied in the field of image analysis. In this context, it focuses on biomedical image analysis problems that can be phrased as image-level classification, semantic segmentation, instance segmentation, or object detection tasks. The current version is based on a Delphi process on metrics conducted by an international consortium of image analysis experts from more than 60 institutions worldwide.

Submitted 6 December, 2023; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Shared first authors: Annika Reinke and Minu D. Tizabi. This is a dynamic paper on limitations of commonly used metrics. It discusses metrics for image-level classification, semantic and instance segmentation, and object detection. For missing use cases, comments or questions, please contact a.reinke@dkfz.de. Substantial contributions to this document will be acknowledged with a co-authorship

arXiv:2104.03502 [pdf, other]

Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings

Authors: Leonardo Pepino, Pablo Riera, Luciana Ferrer

Abstract: Emotion recognition datasets are relatively small, making the use of the more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model. Further, we compare performance using two different wav2vec 2.0 models, with and without finetuning for speech recognition. We evaluate our proposed approaches on two standard emotion databases, IEMOCAP and RAVDESS, showing superior performance compared to results in the literature.

Submitted 8 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures. Submitted to Interspeech 2021

arXiv:2104.00732 [pdf, other]

Out of a hundred trials, how many errors does your speaker verifier make?

Authors: Niko Brümmer, Luciana Ferrer, Albert Swart

Abstract: Out of a hundred trials, how many errors does your speaker verifier make? For the user this is an important, practical question, but researchers and vendors typically sidestep it and supply instead the conditional error-rates that are given by the ROC/DET curve. We posit that the user's question is answered by the Bayes error-rate. We present a tutorial to show how to compute the error-rate that results when making Bayes decisions with calibrated likelihood ratios, supplied by the verifier, and a hypothesis prior, supplied by the user. For perfect calibration, the Bayes error-rate is upper bounded by min(EER,P,1-P), where EER is the equal-error-rate and P, 1-P are the prior probabilities of the competing hypotheses. The EER represents the accuracy of the verifier, while min(P,1-P) represents the hardness of the classification problem. We further show how the Bayes error-rate can be computed also for non-perfect calibration and how to generalize from error-rate to expected cost. We offer some criticism of decisions made by direct score thresholding. Finally, we demonstrate by analyzing error-rates of the recently published DCA-PLDA speaker verifier.

Submitted 1 April, 2021; originally announced April 2021.

Comments: Submitted to Interspeech 2021
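
The bound min(EER,P,1-P) stated in the abstract above can be checked numerically with a minimal sketch. It is not the paper's code: it assumes a toy verifier whose log-likelihood ratios are Gaussian, with mean +mu and variance 2*mu for targets and mean -mu and variance 2*mu for non-targets, a textbook model that is perfectly calibrated by construction. The function names and the parameter `mu` are hypothetical.

```python
import math


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def error_rates(threshold, mu):
    """Miss and false-alarm rates when accepting scores above `threshold`.

    Assumed toy model (not from the paper): target LLRs ~ N(+mu, 2*mu),
    non-target LLRs ~ N(-mu, 2*mu), which makes the LLRs calibrated.
    """
    sigma = math.sqrt(2.0 * mu)
    p_miss = phi((threshold - mu) / sigma)        # targets rejected
    p_fa = 1.0 - phi((threshold + mu) / sigma)    # non-targets accepted
    return p_miss, p_fa


def bayes_error_rate(prior, mu):
    """Total error rate of Bayes decisions for target prior `prior`."""
    # With calibrated LLRs, the Bayes threshold is minus the prior log-odds.
    threshold = -math.log(prior / (1.0 - prior))
    p_miss, p_fa = error_rates(threshold, mu)
    return prior * p_miss + (1.0 - prior) * p_fa


def equal_error_rate(mu):
    """EER of the toy model; by symmetry it is reached at threshold 0."""
    p_miss, _ = error_rates(0.0, mu)
    return p_miss


if __name__ == "__main__":
    mu = 4.0
    eer = equal_error_rate(mu)
    for prior in (0.01, 0.1, 0.5, 0.9, 0.99):
        err = bayes_error_rate(prior, mu)
        print(f"P={prior:.2f}  Bayes error={err:.4f}  bound={min(eer, prior, 1 - prior):.4f}")
```

In this toy model the Bayes error-rate equals the EER at P = 0.5 and falls below min(P,1-P) as the prior becomes more extreme, which is consistent with the reading of EER as verifier accuracy and min(P,1-P) as problem hardness.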

arXiv:2102.09370 [pdf, other]

A Study on the Manifestation of Trust in Speech

Authors: Lara Gauder, Leonardo Pepino, Pablo Riera, Silvina Brussino, Jazmín Vidal, Agustín Gravano, Luciana Ferrer

Abstract: Research has shown that trust is an essential aspect of human-computer interaction directly determining the degree to which the person is willing to use a system. An automatic prediction of the level of trust that a user has in a certain system could be used to attempt to correct potential distrust by having the system take relevant actions like, for example, apologizing or explaining its decisions. In this work, we explore the feasibility of automatically detecting the level of trust that a user has in a virtual assistant (VA) based on their speech. We developed a novel protocol for collecting speech data from subjects induced to have different degrees of trust in the skills of a VA. The protocol consists of an interactive session where the subject is asked to respond to a series of factual questions with the help of a virtual assistant. In order to induce subjects to either trust or distrust the VA's skills, they are first informed that the VA was previously rated by other users as being either good or bad; subsequently, the VA answers the subjects' questions consistently with its alleged abilities. All interactions are speech-based, with subjects and VAs communicating verbally, which allows the recording of speech produced under different trust conditions. Using this protocol, we collected a speech corpus in Argentine Spanish. We show clear evidence that the protocol effectively succeeded in influencing subjects into the desired mental state of either trusting or distrusting the agent's skills, and present results of a perceptual study of the degree of trust performed by expert listeners. Finally, we found that the subject's speech can be used to detect which type of VA they were using, which could be considered a proxy for the user's trust toward the VA's abilities, with an accuracy up to 76%, compared to a random baseline of 50%.

Submitted 9 February, 2021; originally announced February 2021.

Comments: arXiv admin note: text overlap with arXiv:2007.15711, arXiv:2006.05977

arXiv:2102.01760 [pdf, other]

doi 10.1016/j.csl.2021.101258

A Speaker Verification Backend with Robust Performance across Conditions

Authors: Luciana Ferrer, Mitchell McLaren, Niko Brummer

Abstract: In this paper, we address the problem of speaker verification in conditions unseen or unknown during development. A standard method for speaker verification consists of extracting speaker embeddings with a deep neural network and processing them through a backend composed of probabilistic linear discriminant analysis (PLDA) and global logistic regression score calibration. This method is known to result in systems that work poorly on conditions different from those used to train the calibration model. We propose to modify the standard backend, introducing an adaptive calibrator that uses duration and other automatically extracted side-information to adapt to the conditions of the inputs. The backend is trained discriminatively to optimize binary cross-entropy. When trained on a number of diverse datasets that are labeled only with respect to speaker, the proposed backend consistently and, in some cases, dramatically improves calibration, compared to the standard PLDA approach, on a number of held-out datasets, some of which are markedly different from the training data. Discrimination performance is also consistently improved. We show that joint training of the PLDA and the adaptive calibrator is essential -- the same benefits cannot be achieved when freezing PLDA and fine-tuning the calibrator. To our knowledge, the results in this paper are the first evidence in the literature that it is possible to develop a speaker verification system with robust out-of-the-box performance on a large variety of conditions.

Submitted 17 August, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

Journal ref: Computer Speech and Language, Volume 71, 2021

arXiv:2007.15711 [pdf, other]

Detecting Distrust Towards the Skills of a Virtual Assistant Using Speech

Authors: Leonardo Pepino, Pablo Riera, Lara Gauder, Agustín Gravano, Luciana Ferrer

Abstract: Research has shown that trust is an essential aspect of human-computer interaction directly determining the degree to which the person is willing to use the system. An automatic prediction of the level of trust that a user has in a certain system could be used to attempt to correct potential distrust by having the system take relevant actions like, for example, explaining its actions more thoroughly. In this work, we explore the feasibility of automatically detecting the level of trust that a user has in a virtual assistant (VA) based on their speech. We use a dataset collected for this purpose, containing human-computer speech interactions where subjects were asked to answer various factual questions with the help of a virtual assistant, which they were led to believe was either very reliable or unreliable. We find that the subject's speech can be used to detect which type of VA they were using, which could be considered a proxy for the user's trust toward the VA's abilities, with an accuracy up to 76%, compared to a random baseline of 50%. These results are obtained using features that have been previously found useful for detecting speech directed to infants and non-native speakers.

Submitted 30 July, 2020; originally announced July 2020.

arXiv:2006.05977 [pdf, other]

Trust-UBA: A Corpus for the Study of the Manifestation of Trust in Speech

Authors: Lara Gauder, Pablo Riera, Leonardo Pepino, Silvina Brussino, Jazmín Vidal, Luciana Ferrer, Agustín Gravano

Abstract: This paper describes a novel protocol for collecting speech data from subjects induced to have different degrees of trust in the skills of a conversational agent. The protocol consists of an interactive session where the subject is asked to respond to a series of factual questions with the help of a virtual assistant. In order to induce subjects to either trust or distrust the agent's skills, they are first informed that it was previously rated by other users as being either good or bad; subsequently, the agent answers the subjects' questions consistently with its alleged abilities. All interactions are speech-based, with subjects and agents communicating verbally, which allows the recording of speech produced under different trust conditions. We collected a speech corpus in Argentine Spanish using this protocol, which we are currently using to study the feasibility of predicting the degree of trust from speech. We find clear evidence that the protocol effectively succeeded in influencing subjects into the desired mental state of either trusting or distrusting the agent's skills, and present preliminary results of a perceptual study of the degree of trust performed by expert listeners. The collected speech dataset will be made publicly available once ready.

Submitted 30 July, 2020; v1 submitted 10 June, 2020; originally announced June 2020.

arXiv:2002.03802 [pdf, other]

A Speaker Verification Backend for Improved Calibration Performance across Varying Conditions

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: In a recent work, we presented a discriminative backend for speaker verification that achieved good out-of-the-box calibration performance on most tested conditions containing varying levels of mismatch to the training conditions. This backend mimics the standard PLDA-based backend process used in most current speaker verification systems, including the calibration stage. All parameters of the backend are jointly trained to optimize the binary cross-entropy for the speaker verification task. Calibration robustness is achieved by making the parameters of the calibration stage a function of vectors representing the conditions of the signal, which are extracted using a model trained to predict condition labels. In this work, we propose a simplified version of this backend where the vectors used to compute the calibration parameters are estimated within the backend, without the need for a condition prediction model. We show that this simplified method provides similar performance to the previously proposed method while being simpler to implement, and having fewer requirements on the training data. Further, we provide an analysis of different aspects of the method including the effect of initialization, the nature of the vectors used to compute the calibration parameters, and the effect that the random seed and the number of training epochs have on performance. We also compare the proposed method with the trial-based calibration (TBC) method that, to our knowledge, was the state-of-the-art for achieving good calibration across varying conditions. We show that the proposed method outperforms TBC while also being several orders of magnitude faster to run, comparable to the standard PLDA baseline.

Submitted 5 February, 2020; originally announced February 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1911.11622

arXiv:1911.11622 [pdf, other]

A discriminative condition-aware backend for speaker verification

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: We present a scoring approach for speaker verification that mimics the standard PLDA-based backend process used in most current speaker verification systems. However, unlike the standard backends, all parameters of the model are jointly trained to optimize the binary cross-entropy for the speaker verification task. We further integrate the calibration stage inside the model, making the parameters of this stage depend on metadata vectors that represent the conditions of the signals. We show that the proposed backend has excellent out-of-the-box calibration performance on most of our test sets, making it an ideal approach for cases in which the test conditions are not known and development data is not available for training a domain-specific calibration model.

Submitted 26 November, 2019; originally announced November 2019.

Journal ref: Proceedings of ICASSP 2020

arXiv:1803.10554 [pdf, other]

Joint PLDA for Simultaneous Modeling of Two Factors

Authors: Luciana Ferrer, Mitchell McLaren

Abstract: Probabilistic linear discriminant analysis (PLDA) is a method used for biometric problems like speaker or face recognition that models the variability of the samples using two latent variables, one that depends on the class of the sample and another one that is assumed independent across samples and models the within-class variability. In this work, we propose a generalization of PLDA that enables joint modeling of two sample-dependent factors: the class of interest and a nuisance condition. The approach does not change the basic form of PLDA but rather modifies the training procedure to consider the dependency across samples of the latent variable that models within-class variability. While the identity of the nuisance condition is needed during training, it is not needed during testing since we propose a scoring procedure that marginalizes over the corresponding latent variable. We show results on a multilingual speaker-verification task, where the language spoken is considered a nuisance condition. We show that the proposed joint PLDA approach leads to significant performance gains in this task for two different datasets, in particular when the training data contains mostly or only monolingual speakers.

Submitted 28 March, 2018; originally announced March 2018.

Comments: Submitted to Journal of Machine Learning Research

Journal ref: Journal of Machine Learning Research, January, 2019

arXiv:1803.03684 [pdf, ps, other]

Scoring Formulation for Multi-Condition Joint PLDA

Authors: Luciana Ferrer

Abstract: The joint PLDA model is a generalization of PLDA where the nuisance variable is no longer considered independent across samples, but potentially shared (tied) across samples that correspond to the same nuisance condition. The original work considered a single nuisance condition, deriving the EM and scoring formulas for this scenario. In this document, we show how to obtain likelihood ratios for scoring when multiple nuisance conditions are allowed in the model.

Submitted 9 March, 2018; originally announced March 2018.

arXiv:1704.02346 [pdf, ps, other]

Joint Probabilistic Linear Discriminant Analysis

Authors: Luciana Ferrer

Abstract: Standard probabilistic linear discriminant analysis (PLDA) for speaker recognition assumes that the sample's features (usually, i-vectors) are given by a sum of three terms: a term that depends on the speaker identity, a term that models the within-speaker variability and is assumed independent across samples, and a final term that models any remaining variability and is also independent across samples. In this work, we propose a generalization of this model where the within-speaker variability is not necessarily assumed independent across samples but dependent on another discrete variable. This variable, which we call the channel variable as in the standard PLDA approach, could be, for example, a discrete category for the channel characteristics, the language spoken by the speaker, the type of speech in the sample (conversational, monologue, read), etc. The value of this variable is assumed to be known during training but not during testing. Scoring is performed, as in standard PLDA, by computing a likelihood ratio between the null hypothesis that the two sides of a trial belong to the same speaker versus the alternative hypothesis that the two sides belong to different speakers. The two likelihoods are computed by marginalizing over two hypotheses about the channels in both sides of a trial: that they are the same and that they are different. This way, we expect that the new model will be better at coping with same-channel versus different-channel trials than standard PLDA, since knowledge about the channel (or language, or speech style) is used during training and implicitly considered during scoring.

Submitted 16 January, 2018; v1 submitted 7 April, 2017; originally announced April 2017.

Comments: Technical report

arXiv:1611.08947 [pdf, other]

Navigable videos for presenting scientific data on head-mounted displays

Authors: Jacqueline Chu, Leonardo Ferrer, Min Shih, Kwan-Liu Ma

Abstract: Immersive, stereoscopic viewing enables scientists to better analyze the spatial structures of visualized physical phenomena. However, their findings cannot be properly presented in traditional media, which lack these core attributes. Creating a presentation tool that captures this environment poses unique challenges, namely related to poor viewing accessibility. Immersive scientific renderings often require high-end equipment, which can be impractical to obtain. We address these challenges with our authoring tool and navigational interface, which is designed for affordable head-mounted displays. With the authoring tool, scientists can show salient data features as connected 360° video paths, resulting in a "choose-your-own-adventure" experience. Our navigational interface features bidirectional video playback for added viewing control when users traverse the tailor-made content. We evaluate our system's benefits by authoring case studies on several data sets and conducting a usability study on the navigational interface's design. In summary, our approach provides scientists an immersive medium to visually present their research to the intended audience--spanning from students to colleagues--on affordable virtual reality headsets.

Submitted 27 November, 2016; originally announced November 2016.

Showing 1–29 of 29 results for author: Ferrer, L