Showing 1–28 of 28 results for author: Scharenborg, O

  1. arXiv:2406.10284  [pdf, other]

    cs.CL cs.SD eess.AS

    Improving child speech recognition with augmented child-like speech

    Authors: Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

    Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingua…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure. Accepted to INTERSPEECH 2024

  2. Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

    Authors: Zhaofeng Lin, Tanvina Patel, Odette Scharenborg

    Abstract: Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To address the data scarcity issue, we use a signal processin…

    Submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted to ASRU 2023

  3. arXiv:2309.08348  [pdf, other]

    eess.AS cs.SD

    The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

    Authors: Shilong Wu, Chenxi Wang, Hang Chen, Yusheng Dai, Chenyue Zhang, Ruoyu Wang, Hongbo Lan, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Marco Siniscalchi, Odette Scharenborg, Zhong-Qiu Wang, Jia Pan, Jianqing Gao

    Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  4. arXiv:2307.02009  [pdf, other]

    cs.CL

    Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems

    Authors: Tanvina Patel, Odette Scharenborg

    Abstract: Speech technology has improved greatly for norm speakers, i.e., adult native speakers of a language without speech impediments or strong accents. However, non-norm or diverse speaker groups show a distinct performance gap with norm speakers, which we refer to as bias. In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. For an end-to-end (E2E) ASR syst…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: 5 Pages, 2 Figures, 5 Tables

  5. arXiv:2303.06326  [pdf, other]

    cs.MM

    The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

    Authors: Zhe Wang, Shilong Wu, Hang Chen, Mao-Kui He, Jun Du, Chin-Hui Lee, Jingdong Chen, Shinji Watanabe, Sabato Siniscalchi, Odette Scharenborg, Diyuan Liu, Baocai Yin, Jia Pan, Jianqing Gao, Cong Liu

    Abstract: The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when" using both audio…

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: 5 pages, 4 figures, to be published in ICASSP2023

  6. arXiv:2206.12489  [pdf, other]

    eess.AS cs.SD

    Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models

    Authors: Hang Ji, Tanvina Patel, Odette Scharenborg

    Abstract: In this work, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on their ability to capture articulatory features (AF) information and their subsequent prediction of phone recognition performance for within and across language scenarios. Specifically, we compared CPC, wav2vec 2.0, and HuBert. First, frame-level…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Submitted to INTERSPEECH 2022

  7. arXiv:2203.17072  [pdf, other]

    cs.SD cs.CL eess.AS

    Manipulation of oral cancer speech using neural articulatory synthesis

    Authors: Bence Mark Halpern, Teja Rebernik, Thomas Tienkamp, Rob van Son, Michiel van den Brekel, Martijn Wieling, Max Witjes, Odette Scharenborg

    Abstract: We present an articulatory synthesis framework for the synthesis and manipulation of oral cancer speech for clinical decision making and alleviation of patient stress. Objective and subjective evaluations demonstrate that the framework has acceptable naturalness and is worth further investigation. A subsequent subjective vowel and consonant identification experiment showed that the articulatory sy…

    Submitted 31 March, 2022; originally announced March 2022.

    Comments: 5 pages, 4 tables, 1 figure. Submitted to Interspeech 2022

  8. arXiv:2203.06937  [pdf, ps, other]

    cs.CL

    Modelling word learning and recognition using visually grounded speech

    Authors: Danny Merkx, Sebastiaan Scholten, Stefan L. Frank, Mirjam Ernestus, Odette Scharenborg

    Abstract: Background: Computational models of speech recognition often assume that the set of target words is already given. This implies that these models do not learn to recognise speech from scratch without prior knowledge and explicit supervision. Visually grounded speech models learn to recognise speech without prior knowledge by exploiting statistical dependencies between spoken and visual input. Whil…

    Submitted 14 March, 2022; originally announced March 2022.

  9. arXiv:2201.11207  [pdf, other]

    cs.SD cs.CL eess.AS

    Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

    Authors: Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

    Abstract: The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. Wh…

    Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: Accepted for publication in Computer Speech and Language

  10. arXiv:2201.04908  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

    Authors: Luke Prananta, Bence Mark Halpern, Siyuan Feng, Odette Scharenborg

    Abstract: In this paper, we investigate several existing and a new state-of-the-art generative adversarial network-based (GAN) voice conversion method for enhancing dysarthric speech for improved dysarthric speech recognition. We compare key components of existing methods as part of a rigorous ablation study to find the most effective solution to improve dysarthric speech recognition. We find that straightf…

    Submitted 13 January, 2022; originally announced January 2022.

    Comments: Extended version of paper to be submitted to Interspeech 2022. 6 pages, 2 tables

  11. arXiv:2110.08213  [pdf, other]

    cs.SD cs.CL eess.AS q-bio.QM

    Towards Identity Preserving Normal to Dysarthric Voice Conversion

    Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette Scharenborg, Tomoki Toda

    Abstract: We present a voice conversion framework that converts normal speech into dysarthric speech while preserving the speaker identity. Such a framework is essential for (1) clinical decision making processes and alleviation of patient stress, (2) data augmentation for dysarthric speech recognition. This is an especially challenging task since the converted samples should capture the severity of dysarth…

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  12. arXiv:2107.00308  [pdf, other]

    cs.SD cs.CL eess.AS

    An Objective Evaluation Framework for Pathological Speech Synthesis

    Authors: Bence Mark Halpern, Julian Fritsch, Enno Hermann, Rob van Son, Odette Scharenborg, Mathew Magimai.-Doss

    Abstract: The development of pathological speech systems is currently hindered by the lack of a standardised objective evaluation framework. In this work, (1) we utilise existing detection and analysis techniques to propose a general framework for the consistent evaluation of synthetic pathological speech. This framework evaluates the voice quality and the intelligibility aspects of speech and is shown to b…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: 4 pages, 4 figures. Accepted to the ITG Conference on Speech Communication | 29.09.2021 - 01.10.2021 | Kiel

  13. arXiv:2106.08427  [pdf, other]

    cs.SD cs.CL eess.AS

    Pathological voice adaptation with autoencoder-based voice conversion

    Authors: Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez, Odette Scharenborg

    Abstract: In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021)

  14. arXiv:2104.00994  [pdf, other]

    eess.AS cs.CL cs.SD

    Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

    Abstract: This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. I…

    Submitted 7 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted for publication in INTERSPEECH 2021

  15. arXiv:2103.15122  [pdf, other]

    eess.AS cs.CL cs.SD

    Quantifying Bias in Automatic Speech Recognition

    Authors: Siyuan Feng, Olya Kudina, Bence Mark Halpern, Odette Scharenborg

    Abstract: Automatic speech recognition (ASR) systems promise to deliver objective interpretation of human speech. Practice and recent evidence suggest that the state-of-the-art (SotA) ASRs struggle with the large variation in speech due to e.g., gender, age, speech impairment, race, and accents. Many factors can cause the bias of an ASR system. Our overarching goal is to uncover bias in ASR systems to work…

    Submitted 1 April, 2021; v1 submitted 28 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH (IS) 2021. This preprint version differs slightly from the version submitted to IS 2021: Figure 1 is not included in IS 2021

  16. arXiv:2012.09544  [pdf, other]

    eess.AS cs.CL cs.SD

    The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

    Authors: Siyuan Feng, Odette Scharenborg

    Abstract: This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (…

    Submitted 28 April, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

    Comments: 18 pages (including 1 page as supplementary material), 13 figures. Accepted for publication in IEEE Open Journal of Signal Processing (OJ-SP)

  17. arXiv:2011.06239  [pdf, other]

    eess.AS cs.SD

    The CUHK-TUDELFT System for The SLT 2021 Children Speech Recognition Challenge

    Authors: Si-Ioi Ng, Wei Liu, Zhiyuan Peng, Siyuan Feng, Hing-Pang Huang, Odette Scharenborg, Tan Lee

    Abstract: This technical report describes our submission to the 2021 SLT Children Speech Recognition Challenge (CSRC) Track 1. Our approach combines the use of a joint CTC-attention end-to-end (E2E) speech recognition framework, transfer learning, data augmentation and development of various language models. Procedures of data pre-processing, the background and the course of system development are described…

    Submitted 12 November, 2020; originally announced November 2020.

    Comments: Submitted to 2021 SLT Children Speech Recognition Challenge (CSRC)

  18. arXiv:2010.12267  [pdf, other]

    cs.CV cs.CL

    Show and Speak: Directly Synthesize Spoken Description of Images

    Authors: Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

    Abstract: This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtai…

    Submitted 17 November, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

  19. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu…

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  20. arXiv:2007.15916  [pdf]

    cs.CL cs.CV

    Evaluating Automatically Generated Phoneme Captions for Images

    Authors: Justin van der Hout, Zoltán D'Haese, Mark Hasegawa-Johnson, Odette Scharenborg

    Abstract: Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions wer…

    Submitted 31 July, 2020; originally announced July 2020.

    Comments: Accepted at Interspeech2020

  21. arXiv:2007.14205  [pdf, other]

    eess.AS cs.LG cs.SD

    Detecting and analysing spontaneous oral cancer speech in the wild

    Authors: Bence Mark Halpern, Rob van Son, Michiel van den Brekel, Odette Scharenborg

    Abstract: Oral cancer speech is a disease which impacts more than half a million people worldwide every year. Analysis of oral cancer speech has so far focused on read speech. In this paper, we 1) present and 2) analyse a three-hour long spontaneous oral cancer speech dataset collected from YouTube. 3) We set baselines for an oral cancer speech detection task on this dataset. The analysis of these explainab…

    Submitted 28 July, 2020; originally announced July 2020.

    Comments: Accepted to Interspeech 2020

  22. Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling

    Authors: Siyuan Feng, Odette Scharenborg

    Abstract: This study addresses unsupervised subword modeling, i.e., learning feature representations that can distinguish subword units of a language. The proposed approach adopts a two-stage bottleneck feature (BNF) learning framework, consisting of autoregressive predictive coding (APC) as a front-end and a DNN-BNF model as a back-end. APC pretrained features are set as input features to a DNN-BNF model.…

    Submitted 6 August, 2020; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: 5 pages, 3 figures. Accepted for publication in INTERSPEECH 2020, Shanghai, China

  23. arXiv:2006.00512  [pdf, other]

    cs.CL

    Learning to Recognise Words using Visually Grounded Speech

    Authors: Sebastiaan Scholten, Danny Merkx, Odette Scharenborg

    Abstract: We investigated word recognition in a Visually Grounded Speech model. The model has been trained on pairs of images and spoken captions to create visually grounded embeddings which can be used for speech to image retrieval and vice versa. We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents. We in…

    Submitted 31 May, 2020; originally announced June 2020.

  24. arXiv:2005.08118  [pdf, other]

    eess.AS cs.CL cs.SD

    That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

    Authors: Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages

  25. arXiv:2005.06968  [pdf, other]

    cs.LG cs.CL cs.CV

    S2IGAN: Speech-to-Image Generation via Adversarial Learning

    Authors: Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

    Abstract: An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from…

    Submitted 15 September, 2020; v1 submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech2020

  26. arXiv:1803.05058  [pdf]

    cs.SD cs.CL eess.AS

    Investigating the Effect of Music and Lyrics on Spoken-Word Recognition

    Authors: Odette Scharenborg, Martha Larson

    Abstract: Background music in social interaction settings can hinder conversation. Yet, little is known of how specific properties of music impact speech processing. This paper addresses this knowledge gap by investigating 1) whether the masking effect of background music with lyrics is larger than that of music without lyrics, and 2) whether the masking effect is larger for more complex music. To answer th…

    Submitted 13 March, 2018; originally announced March 2018.

    Comments: Preliminary study

  27. arXiv:1802.06053  [pdf, ps, other]

    cs.CL

    Bayesian Models for Unit Discovery on a Very Low Resource Language

    Authors: Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, Sanjeev Khudanpur

    Abstract: Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show tha…

    Submitted 20 February, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

    Comments: Accepted to ICASSP 2018

  28. arXiv:1802.05092  [pdf, other]

    cs.CL

    Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop

    Authors: Odette Scharenborg, Laurent Besacier, Alan Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stueker, Pierre Godard, Markus Mueller, Lucas Ondel, Shruti Palaskar, Philip Arthur, Francesco Ciannella, Mingxing Du, Elin Larsen, Danny Merkx, Rachid Riad, Liming Wang, Emmanuel Dupoux

    Abstract: We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.

    Submitted 14 February, 2018; originally announced February 2018.

    Comments: Accepted to ICASSP 2018