Showing 1–17 of 17 results for author: Richmond, K

  1. arXiv:2406.08911  [pdf, other]

    cs.CL eess.AS

    An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

    Authors: Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

    Abstract: Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  2. arXiv:2312.14398  [pdf, other]

    cs.SD eess.AS

    ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

    Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi

    Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. In most cases, TTS systems are built using a single speaker's voice. However, there is growing interest in developing systems that can synthesize voices…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: 13 pages, 5 figures

  3. arXiv:2309.05423  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

    Authors: Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

    Abstract: In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. In the first stage, we use contrastive pretraining of Speech-Silenc…

    Submitted 11 June, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

  4. Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks

    Authors: Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter

    Abstract: Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text.…

    Submitted 22 September, 2022; originally announced September 2022.

    Journal ref: Proceedings of INTERSPEECH 2022

  5. arXiv:2105.15162  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD eess.IV

    Automatic audiovisual synchronisation for ultrasound tongue imaging

    Authors: Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised. Synchronisation is achieved using specialis…

    Submitted 31 May, 2021; originally announced May 2021.

    Comments: 18 pages, 10 figures. Manuscript accepted at Speech Communication

  6. arXiv:2103.00333  [pdf, other]

    eess.AS cs.CL cs.SD q-bio.QM

    Silent versus modal multi-speaker speech recognition from ultrasound and video

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode misma…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 5 pages, 5 figures, Submitted to Interspeech 2021

  7. arXiv:2103.00324  [pdf, ps, other]

    eess.AS cs.CL cs.SD q-bio.NC

    Exploiting ultrasound tongue imaging for the automatic detection of speech articulation errors

    Authors: Manuel Sam Ribeiro, Joanne Cleland, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Speech sound disorders are a common communication impairment in childhood. Because speech disorders can negatively affect the lives and the development of children, clinical intervention is often recommended. To help with diagnosis and treatment, clinicians use instrumented methods such as spectrograms or ultrasound tongue imaging to analyse speech articulations. Analysis with these methods can be…

    Submitted 27 February, 2021; originally announced March 2021.

    Comments: 15 pages, 9 figures, 6 tables

    Journal ref: Speech Communication, Volume 128, April 2021, Pages 24-34

  8. arXiv:2011.09804  [pdf, other]

    eess.AS cs.CL cs.CV cs.SD eess.IV

    TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos

    Authors: Manuel Sam Ribeiro, Jennifer Sanger, Jing-Xuan Zhang, Aciel Eshky, Alan Wrench, Korin Richmond, Steve Renals

    Abstract: We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of audio, ultrasound tongue imaging, and lip videos. TaL consists of two parts: TaL1 is a set of six recording sessions of one professional voice talent, a male native speaker of English; TaL80 is a set of recording sessions of 81 native speakers of English without voice talent experience. Overall, the corpus contains 24 hours of…

    Submitted 19 November, 2020; originally announced November 2020.

    Comments: 8 pages, 4 figures, Accepted to SLT2021, IEEE Spoken Language Technology Workshop

  9. arXiv:1907.01413  [pdf, other]

    eess.AS cs.CL cs.CV cs.LG cs.SD eess.IV

    Speaker-independent classification of phonetic segments from raw ultrasound in child speech

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 4 figures, published in ICASSP2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019)

  10. arXiv:1907.00835  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS eess.IV

    UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions

    Authors: Aciel Eshky, Manuel Sam Ribeiro, Joanne Cleland, Korin Richmond, Zoe Roxburgh, James Scobbie, Alan Wrench

    Abstract: We introduce UltraSuite, a curated repository of ultrasound and acoustic data, collected from recordings of child speech therapy sessions. This release includes three data collections, one from typically developing children and two from children with speech sound disorders. In addition, it includes a set of annotations, some manual and some automatically produced, and software tools to process, tr…

    Submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure, 3 tables; accepted to Interspeech 2018: 19th Annual Conference of the International Speech Communication Association (ISCA)

  11. arXiv:1907.00818  [pdf, other]

    eess.AS cs.CL cs.SD eess.IV

    Ultrasound tongue imaging for diarization and alignment of child speech therapy sessions

    Authors: Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals

    Abstract: We investigate the automatic processing of child speech therapy sessions using ultrasound visual biofeedback, with a specific focus on complementing acoustic features with ultrasound images of the tongue for the tasks of speaker diarization and time-alignment of target words. For speaker diarization, we propose an ultrasound-based time-domain signal which we call estimated tongue activity. For wor…

    Submitted 15 August, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 3 figures, Accepted for publication at Interspeech 2019

  12. arXiv:1907.00758  [pdf, other]

    cs.CL cs.CV cs.LG cs.SD eess.AS eess.IV

    Synchronising audio and ultrasound by learning cross-modal embeddings

    Authors: Aciel Eshky, Manuel Sam Ribeiro, Korin Richmond, Steve Renals

    Abstract: Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the s…

    Submitted 27 November, 2019; v1 submitted 1 July, 2019; originally announced July 2019.

    Comments: 5 pages, 1 figure, 4 tables; Interspeech 2019 with the following edits: 1) Loss and accuracy upon convergence were accidentally reported from an older model. Now updated with model described throughout the paper. All other results remain unchanged. 2) Max true offset in the training data corrected from 179ms to 1789ms. 3) Detectability "boundary/range" renamed to detectability "thresholds"

  13. arXiv:1810.13048  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Attentive Filtering Networks for Audio Replay Attack Detection

    Authors: Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King

    Abstract: An attacker may use a variety of techniques to fool an automatic speaker verification system into accepting them as a genuine user. Anti-spoofing methods meanwhile aim to make the system robust against such attacks. The ASVspoof 2017 Challenge focused specifically on replay attacks, with the intention of measuring the limits of replay attack detection as well as developing countermeasures against…

    Submitted 30 October, 2018; originally announced October 2018.

    Comments: Submitted to ICASSP 2019

  14. A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

    Authors: Alexander Hewer, Stefanie Wuhrer, Ingmar Steiner, Korin Richmond

    Abstract: We present a multilinear statistical model of the human tongue that captures anatomical and tongue pose related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech related vocal tract configurations. The extraction is performed by using a minimally supervised method that uses as basis an image segmentation approach and a templa…

    Submitted 17 April, 2018; v1 submitted 15 December, 2016; originally announced December 2016.

    Journal ref: Computer Speech & Language 51 (2018) 68-92

  15. arXiv:1602.07679  [pdf, other]

    cs.CV

    A statistical shape space model of the palate surface trained on 3D MRI scans of the vocal tract

    Authors: Alexander Hewer, Ingmar Steiner, Timo Bolkart, Stefanie Wuhrer, Korin Richmond

    Abstract: We describe a minimally-supervised method for computing a statistical shape space model of the palate surface. The model is created from a corpus of volumetric magnetic resonance imaging (MRI) scans collected from 12 speakers. We extract a 3D mesh of the palate from each speaker, then train the model using principal component analysis (PCA). The palate model is then tested using 3D MRI from anothe…

    Submitted 4 September, 2015; originally announced February 2016.

    Comments: Proceedings of the 18th International Congress of Phonetic Sciences, Aug 2015, Glasgow, United Kingdom. 2015, http://www.icphs2015.info/

  16. arXiv:1310.8585  [pdf, other]

    cs.HC q-bio.QM

    Speech animation using electromagnetic articulography as motion capture data

    Authors: Ingmar Steiner, Korin Richmond, Slim Ouni

    Abstract: Electromagnetic articulography (EMA) captures the position and orientation of a number of markers, attached to the articulators, during speech. As such, it performs the same function for speech that conventional motion capture does for full-body movements acquired with optical modalities, a long-time staple technique of the animation industry. In this paper, EMA data is processed from a motion-cap…

    Submitted 30 October, 2013; originally announced October 2013.

    Journal ref: AVSP - 12th International Conference on Auditory-Visual Speech Processing - 2013 (2013) 55-60

  17. arXiv:1209.4982  [pdf, other]

    cs.HC cs.GR

    Using multimodal speech production data to evaluate articulatory animation for audiovisual speech synthesis

    Authors: Ingmar Steiner, Korin Richmond, Slim Ouni

    Abstract: The importance of modeling speech articulation for high-quality audiovisual (AV) speech synthesis is widely acknowledged. Nevertheless, while state-of-the-art, data-driven approaches to facial animation can make use of sophisticated motion capture techniques, the animation of the intraoral articulators (viz. the tongue, jaw, and velum) typically makes use of simple rules or viseme morphing, in sta…

    Submitted 22 September, 2012; originally announced September 2012.

    Journal ref: 3rd International Symposium on Facial Analysis and Animation (2012)