Showing 1–50 of 57 results for author: Khudanpur, S

  1. arXiv:2406.02560  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improving its suitability for forced alignment generation, by leve…

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. GitHub repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  2. arXiv:2405.05376  [pdf, other]

    cs.CL

    Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages

    Authors: Nathaniel R. Robinson, Raj Dabre, Ammon Shurtz, Rasul Dent, Onenamiyi Onesi, Claire Bizon Monroc, Loïc Grobol, Hasan Muhammad, Ashi Garg, Naome A. Etori, Vijay Murari Tiyyala, Olanrewaju Samuel, Matthew Dean Stutzman, Bismarck Bamfo Odoom, Sanjeev Khudanpur, Stephen D. Richardson, Kenton Murray

    Abstract: A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We pr…

    Submitted 13 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: NAACL 2024

  3. arXiv:2401.15676  [pdf, other]

    eess.AS cs.SD

    On Speaker Attribution with SURT

    Authors: Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, it was demonstrated that SURT can be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework furth…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: 8 pages, 6 figures, 6 tables. Submitted to Odyssey 2024

  4. arXiv:2309.16953  [pdf, other]

    eess.AS cs.SD

    Enhancing Code-switching Speech Recognition with Interactive Language Biases

    Authors: Hexin Liu, Leibny Paola Garcia, Xiangyu Zhang, Andy W. H. Khong, Sanjeev Khudanpur

    Abstract: Languages usually switch within a multilingual speech signal, especially in a bilingual society. This phenomenon is referred to as code-switching (CS), making automatic speech recognition (ASR) challenging under a multilingual scenario. We propose to improve CS-ASR by biasing the hybrid CTC/attention ASR model with multi-level language information comprising frame- and token-level language posteri…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  5. arXiv:2309.15796  [pdf, other]

    eess.AS cs.CL cs.LG

    Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

    Authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

    Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. Thi…

    Submitted 26 September, 2023; originally announced September 2023.

  6. arXiv:2309.15686  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization

    Authors: Amir Hussein, Brian Yan, Antonios Anastasopoulos, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Incorporating longer context has been shown to benefit machine translation, but the inclusion of context in end-to-end speech translation (E2E-ST) remains under-studied. To bridge this gap, we introduce target language context in E2E-ST, enhancing coherence and overcoming memory constraints of extended audio segments. Additionally, we propose context dropout to ensure robustness to the absence of…

    Submitted 27 September, 2023; originally announced September 2023.

  7. arXiv:2309.15674  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speech collage: code-switched audio generation by collaging monolingual corpora

    Authors: Amir Hussein, Dorsa Zeinali, Ondřej Klejch, Matthew Wiesner, Brian Yan, Shammur Chowdhury, Ahmed Ali, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: Designing effective automatic speech recognition (ASR) systems for Code-Switching (CS) often depends on the availability of the transcribed CS resources. To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. We further improve the smoothness quality of audio generation using an overlap-add approach. We…

    Submitted 27 September, 2023; originally announced September 2023.

  8. arXiv:2306.13734  [pdf, other]

    eess.AS cs.CL cs.SD

    The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

    Authors: Samuele Cornell, Matthew Wiesner, Shinji Watanabe, Desh Raj, Xuankai Chang, Paola Garcia, Matthew Maciejewski, Yoshiki Masuyama, Zhong-Qiu Wang, Stefano Squartini, Sanjeev Khudanpur

    Abstract: The CHiME challenges have played a significant role in the development and evaluation of robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR (DASR) task, within the 7th CHiME challenge. This task comprises joint ASR and diarization in far-field settings with multiple, and possibly heterogeneous, recording devices. Different from previous challenges, we evaluate…

    Submitted 14 July, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

  9. arXiv:2306.11252  [pdf, other]

    cs.CL cs.LG

    HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

    Authors: Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

    Abstract: We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verb…

    Submitted 19 June, 2023; originally announced June 2023.

  10. arXiv:2306.10559  [pdf, other]

    eess.AS cs.SD

    SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage and omission related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academ…

    Submitted 19 September, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: 13 pages, 7 figures. To appear in IEEE TASLP. Project webpage: https://sites.google.com/view/surt2

  11. arXiv:2306.01031  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts

    Authors: Dongji Gao, Matthew Wiesner, Hainan Xu, Leibny Paola Garcia, Daniel Povey, Sanjeev Khudanpur

    Abstract: This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in human-annotated speech corpora, which degrades the performance of ASR models. To address this problem, we propose Bypass Temporal Classification (BTC) as an expansion of the Connectionist Temporal Classification (CTC) cr…

    Submitted 1 June, 2023; originally announced June 2023.

  12. arXiv:2305.18925  [pdf, other]

    eess.AS cs.CL cs.SD

    Investigating model performance in language identification: beyond simple error statistics

    Authors: Suzy J. Styles, Victoria Y. H. Chua, Fei Ting Woon, Hexin Liu, Leibny Paola Garcia Perera, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels

    Abstract: Language development experts need tools that can automatically identify languages from fluent, conversational speech, and provide reliable estimates of usage rates at the level of an individual recording. However, language identification systems are typically evaluated on metrics such as equal error rate and balanced accuracy, applied at the level of an entire speech corpus. These overview metrics…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023, 5 pages, 5 figures

  13. arXiv:2212.05271  [pdf, other]

    eess.AS cs.SD

    GPU-accelerated Guided Source Separation for Meeting Transcription

    Authors: Desh Raj, Daniel Povey, Sanjeev Khudanpur

    Abstract: Guided source separation (GSS) is a type of target-speaker extraction method that relies on pre-computed speaker activities and blind source separation to perform front-end enhancement of overlapped speech signals. It was first proposed during the CHiME-5 challenge and provided significant improvements over the delay-and-sum beamforming baseline. Despite its strengths, however, the method has seen…

    Submitted 13 August, 2023; v1 submitted 10 December, 2022; originally announced December 2022.

    Comments: 7 pages, 4 figures. To appear at InterSpeech 2023. Code available at https://github.com/desh2608/gss

  14. arXiv:2211.17196  [pdf, other]

    cs.CL cs.SD eess.AS

    EURO: ESPnet Unsupervised ASR Open-source Toolkit

    Authors: Dongji Gao, Jiatong Shi, Shun-Po Chuang, Leibny Paola Garcia, Hung-yi Lee, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO adopts the state-of-the-art UASR learning method introduced by the Wav2vec-U, originally implemented at FAIRSEQ, which leverages self-supervised speech representations and adversarial training. In addition to wav2vec2, EURO extend…

    Submitted 20 May, 2023; v1 submitted 30 November, 2022; originally announced November 2022.

  15. arXiv:2211.00482  [pdf, other]

    eess.AS cs.SD

    Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

    Authors: Zili Huang, Desh Raj, Paola García, Sanjeev Khudanpur

    Abstract: Self-supervised learning (SSL) methods which learn representations of data without explicit supervision have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often have degraded performance for multi-talker scenarios -- possibly due to the domain mismatch -- which severely limits their use for such applications. In this paper, we inve…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: submitted to ICASSP 2023

  16. arXiv:2210.14567  [pdf, other]

    eess.AS cs.SD

    Reducing Language confusion for Code-switching Speech Recognition with Token-level Language Diarization

    Authors: Hexin Liu, Haihua Xu, Leibny Paola Garcia, Andy W. H. Khong, Yi He, Sanjeev Khudanpur

    Abstract: Code-switching (CS) refers to the phenomenon that languages switch within a speech signal and leads to language confusion for automatic speech recognition (ASR). This paper aims to address language confusion for improving CS-ASR from two perspectives: incorporating and disentangling language information. We incorporate language information in the CS-ASR model by dynamically biasing the model with…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  17. arXiv:2204.03851  [pdf, other]

    eess.AS cs.CR cs.SD

    Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser

    Authors: Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, Najim Dehak

    Abstract: Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint mod…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  18. arXiv:2203.03218  [pdf, other]

    eess.AS cs.CL cs.SD

    Enhance Language Identification using Dual-mode Model with Knowledge Distillation

    Authors: Hexin Liu, Leibny Paola Garcia Perera, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles, Sanjeev Khudanpur

    Abstract: In this paper, we propose to employ a dual-mode framework on the x-vector self-attention (XSA-LID) model with knowledge distillation (KD) to enhance its language identification (LID) performance for both long and short utterances. The dual-mode XSA-LID model is trained by jointly optimizing both the full and short modes with their respective inputs being the full-length speech and its short clip e…

    Submitted 7 March, 2022; originally announced March 2022.

    Comments: Submitted to Odyssey 2022

  19. arXiv:2201.02550  [pdf, other]

    cs.CL cs.SD eess.AS

    Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition

    Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali, Sanjeev Khudanpur

    Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with mono…

    Submitted 11 January, 2023; v1 submitted 7 January, 2022; originally announced January 2022.

  20. arXiv:2110.12561  [pdf, other]

    cs.SD eess.AS

    Lhotse: a speech data representation library for the modern deep learning ecosystem

    Authors: Piotr Żelasko, Daniel Povey, Jan "Yenda" Trmal, Sanjeev Khudanpur

    Abstract: Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes an…

    Submitted 24 October, 2021; originally announced October 2021.

    Comments: Accepted for presentation at NeurIPS 2021 Data-Centric AI (DCAI) Workshop

  21. arXiv:2110.04863  [pdf, other]

    eess.AS cs.CL

    Injecting Text and Cross-lingual Supervision in Few-shot Learning from Self-Supervised Models

    Authors: Matthew Wiesner, Desh Raj, Sanjeev Khudanpur

    Abstract: Self-supervised model pre-training has recently garnered significant interest, but relatively few efforts have explored using additional resources in fine-tuning these models. We demonstrate how universal phoneset acoustic models can leverage cross-lingual supervision to improve transfer of pretrained self-supervised representations to new languages. We also show how target-language text can be us…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  22. arXiv:2106.06909  [pdf, other]

    cs.SD cs.CL eess.AS

    GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio

    Authors: Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Yujun Wang, Zhao You, Zhiyong Yan

    Abstract: This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous sp…

    Submitted 13 June, 2021; originally announced June 2021.

  23. arXiv:2104.01954  [pdf, other]

    eess.AS cs.SD

    Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem

    Authors: Desh Raj, Sanjeev Khudanpur

    Abstract: We recently proposed DOVER-Lap, a method for combining overlap-aware speaker diarization system outputs. DOVER-Lap improved upon its predecessor DOVER by using a label mapping method based on globally-informed greedy search. In this paper, we analyze this label mapping in the framework of a maximum orthogonal graph partitioning problem, and present three inferences. First, we show that DOVER-Lap l…

    Submitted 3 June, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: 5 pages, 3 figures. Accepted at INTERSPEECH 2021

  24. arXiv:2103.17122  [pdf, ps, other]

    eess.AS cs.CR cs.SD

    Adversarial Attacks and Defenses for Speech Recognition Systems

    Authors: Piotr Żelasko, Sonal Joshi, Yiwen Shao, Jesus Villalba, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

    Abstract: The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer enc…

    Submitted 31 March, 2021; originally announced March 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  25. arXiv:2103.09063  [pdf, other]

    cs.SD eess.AS

    An Asynchronous WFST-Based Decoder For Automatic Speech Recognition

    Authors: Hang Lv, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in the one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with on-the-fly composition decoder which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design where it has two fronts, with…

    Submitted 16 March, 2021; originally announced March 2021.

    Comments: 5 pages, 5 figures, ICASSP

  26. arXiv:2103.06968  [pdf, other]

    cs.CL

    Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Large web-crawled corpora represent an excellent resource for improving the performance of Neural Machine Translation (NMT) systems across several language pairs. However, since these corpora are typically extremely noisy, their use is fairly limited. Current approaches to dealing with this problem mainly focus on filtering using heuristics or single features such as language model scores or bi-li…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 10 pages, 2 figures

  27. arXiv:2103.06964  [pdf, other]

    cs.CL

    Learning Policies for Multilingual Training of Neural Machine Translation Systems

    Authors: Gaurav Kumar, Philipp Koehn, Sanjeev Khudanpur

    Abstract: Low-resource Multilingual Neural Machine Translation (MNMT) is typically tasked with improving the translation performance on one or more language pairs with the aid of high-resource language pairs. In this paper, we propose two simple search based curricula -- orderings of the multilingual training data -- which help improve translation performance in conjunction with existing techniques such as…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: 7 pages, 2 figures

  28. arXiv:2103.05081  [pdf, other]

    eess.AS cs.CL cs.SD

    A Parallelizable Lattice Rescoring Strategy with Neural Language Models

    Authors: Ke Li, Daniel Povey, Sanjeev Khudanpur

    Abstract: This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition. First, lattices from first-pass decoding are expanded by the proposed posterior-based lattice expansion algorithm. Second, each expanded lattice is converted into a minimal list of hypotheses that c…

    Submitted 8 March, 2021; originally announced March 2021.

    Comments: To appear at ICASSP 2021. 5 pages, 1 figure

  29. arXiv:2102.04488  [pdf, other]

    cs.CL cs.SD eess.AS

    Wake Word Detection with Streaming Transformers

    Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Tr…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted at IEEE ICASSP 2021. 5 pages, 3 figures

  30. arXiv:2102.01363  [pdf, other]

    eess.AS cs.CL cs.SD

    The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

    Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After…

    Submitted 2 February, 2021; originally announced February 2021.

  31. arXiv:2012.01392  [pdf, other]

    cs.CV

    Fine-grained activity recognition for assembly videos

    Authors: Jonathan D. Jones, Cathryn Cortesa, Amy Shelton, Barbara Landau, Sanjeev Khudanpur, Gregory D. Hager

    Abstract: In this paper we address the task of recognizing assembly actions as a structure (e.g. a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting…

    Submitted 2 December, 2020; originally announced December 2020.

    Comments: 8 pages, 6 figures. Submitted to RA-L/ICRA 2021

  32. arXiv:2011.02900  [pdf, other]

    eess.AS cs.SD

    Multi-class Spectral Clustering with Overlaps for Speaker Diarization

    Authors: Desh Raj, Zili Huang, Sanjeev Khudanpur

    Abstract: This paper describes a method for overlap-aware speaker diarization. Given an overlap detector and a speaker embedding extractor, our method performs spectral clustering of segments informed by the output of the overlap detector. This is achieved by transforming the discrete clustering problem into a convex optimization problem which is solved by eigen-decomposition. Thereafter, we discretize the…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted at IEEE SLT 2021

  33. arXiv:2011.02090  [pdf, other]

    eess.AS cs.SD

    Frustratingly Easy Noise-aware Training of Acoustic Models

    Authors: Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

    Abstract: Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-fold data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of a…

    Submitted 2 February, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: 6 + 3 (Appendix) pages

  34. arXiv:2011.01997  [pdf, other]

    eess.AS cs.SD

    DOVER-Lap: A Method for Combining Overlap-aware Diarization Outputs

    Authors: Desh Raj, Leibny Paola Garcia-Perera, Zili Huang, Shinji Watanabe, Daniel Povey, Andreas Stolcke, Sanjeev Khudanpur

    Abstract: Several advances have been made recently towards handling overlapping speech for speaker diarization. Since speech and natural language tasks often benefit from ensemble techniques, we propose an algorithm for combining outputs from such diarization systems through majority voting. Our method, DOVER-Lap, is inspired by the recently proposed DOVER algorithm, but is designed to handle overlapping…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  35. arXiv:2010.12430  [pdf, other]

    eess.AS cs.SD

    Training Noisy Single-Channel Speech Separation With Noisy Oracle Sources: A Large Gap and A Small Step

    Authors: Matthew Maciejewski, Jing Shi, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: As the performance of single-channel speech separation systems has improved, there has been a desire to move to more challenging conditions than the clean, near-field speech that initial systems were developed on. When training deep learning separation models, a need for ground truth leads to training on synthetic mixtures. As such, training in noisy conditions requires either using noise syntheti…

    Submitted 22 February, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  36. arXiv:2008.02385  [pdf, other]

    cs.CL

    Efficient MDI Adaptation for n-gram Language Models

    Authors: Ruizhe Huang, Ke Li, Ashish Arora, Dan Povey, Sanjeev Khudanpur

    Abstract: This paper presents an efficient algorithm for n-gram language model adaptation under the minimum discrimination information (MDI) principle, where an out-of-domain language model is adapted to satisfy the constraints of marginal probabilities of the in-domain data. The challenge for MDI language model adaptation is its computational complexity. By taking advantage of the backoff structure of n-gr…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: To appear in INTERSPEECH 2020. Appendix A of this full version will be filled soon

  37. arXiv:2006.07898  [pdf, other]

    eess.AS cs.SD

    The JHU Multi-Microphone Multi-Speaker ASR System for the CHiME-6 Challenge

    Authors: Ashish Arora, Desh Raj, Aswin Shanmugam Subramanian, Ke Li, Bar Ben-Yair, Matthew Maciejewski, Piotr Żelasko, Paola García, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper summarizes the JHU team's efforts in tracks 1 and 2 of the CHiME-6 challenge for distant multi-microphone conversational speech diarization and recognition in everyday home environments. We explore multi-array processing techniques at each stage of the pipeline, such as multi-array guided source separation (GSS) for enhancement and acoustic model training data, posterior fusion for spee…

    Submitted 14 June, 2020; originally announced June 2020.

    Comments: Presented at the CHiME-6 workshop (colocated with ICASSP 2020)

  38. arXiv:2005.09824  [pdf, other]

    eess.AS cs.CL cs.SD

    PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

    Authors: Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

    Abstract: We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into new ASR proje…

    Submitted 19 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  39. arXiv:2005.08347  [pdf, other]

    eess.AS cs.CL cs.SD

    Wake Word Detection with Alignment-Free Lattice-Free MMI

    Authors: Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

    Abstract: Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-tra…

    Submitted 28 July, 2020; v1 submitted 17 May, 2020; originally announced May 2020.

    Comments: Accepted at Interspeech 2020. 5 pages, 3 figures

  40. arXiv:2004.09249  [pdf, other]

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  41. arXiv:2002.06220  [pdf, other

    eess.AS cs.SD

    Speaker Diarization with Region Proposal Network

    Authors: Zili Huang, Shinji Watanabe, Yusuke Fujita, Paola Garcia, Yiwen Shao, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speaker diarization is an important pre-processing step for many speech applications, and it aims to solve the "who spoke when" problem. Although the standard diarization systems can achieve satisfactory results in various scenarios, they are composed of several independently-optimized modules and cannot deal with the overlapped speech. In this paper, we propose a novel speaker diarization method:…

    Submitted 14 February, 2020; originally announced February 2020.

    Comments: Accepted to ICASSP 2020

  42. arXiv:1909.08723  [pdf, other

    cs.CL cs.SD eess.AS

    Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

    Authors: Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language…

    Submitted 14 October, 2019; v1 submitted 18 September, 2019; originally announced September 2019.

    Comments: Accepted to ASRU 2019

  43. Probing the Information Encoded in X-vectors

    Authors: Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

    Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about…

    Submitted 30 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2019): 726-733

  44. arXiv:1812.03919  [pdf, other

    eess.AS cs.CL cs.SD

    Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings

    Authors: Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Liu, Najim Dehak, Sanjeev Khudanpur

    Abstract: We explore training attention-based encoder-decoder ASR in low-resource settings. These models perform poorly when trained on small amounts of transcribed speech, in part because they depend on having sufficient target-side text to train the attention and decoder networks. In this paper we address this shortcoming by pretraining our network parameters using only text-based data and transcribed spe…

    Submitted 2 August, 2019; v1 submitted 10 December, 2018; originally announced December 2018.

  45. arXiv:1811.02641  [pdf, other

    cs.CL

    Building Corpora for Single-Channel Speech Separation Across Multiple Domains

    Authors: Matthew Maciejewski, Gregory Sell, Leibny Paola Garcia-Perera, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: To date, the bulk of research on single-channel speech separation has been conducted using clean, near-field, read speech, which is not representative of many modern applications. In this work, we develop a procedure for constructing high-quality synthetic overlap datasets, necessary for most deep learning-based separation frameworks. We produced datasets that are more representative of realistic…

    Submitted 6 November, 2018; originally announced November 2018.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  46. arXiv:1807.06204  [pdf, other

    cs.CL

    Low-Resource Contextual Topic Identification on Speech

    Authors: Chunxi Liu, Matthew Wiesner, Shinji Watanabe, Craig Harman, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

    Abstract: In topic identification (topic ID) on real-world unstructured audio, an audio instance of variable topic shifts is first broken into sequential segments, and each segment is independently classified. We first present a general purpose method for topic ID on spoken segments in low-resource languages, using a cascade of universal acoustic modeling, translation lexicons to English, and English-langua…

    Submitted 28 September, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

    Comments: Accepted for publication at 2018 IEEE Workshop on Spoken Language Technology (SLT)

  47. arXiv:1804.03243  [pdf, other

    cs.CL

    A GPU-based WFST Decoder with Exact Lattice Generation

    Authors: Zhehuai Chen, Justin Luitjens, Hainan Xu, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

    Abstract: We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token passing scheduling among GPU threads. We also redesi…

    Submitted 27 July, 2018; v1 submitted 9 April, 2018; originally announced April 2018.

    Comments: Accepted by INTERSPEECH 2018

    MSC Class: 68T10; ACM Class: I.2.7

  48. arXiv:1802.08731  [pdf, other

    cs.CL

    Automatic Speech Recognition and Topic Identification for Almost-Zero-Resource Languages

    Authors: Matthew Wiesner, Chunxi Liu, Lucas Ondel, Craig Harman, Vimal Manohar, Jan Trmal, Zhongqiang Huang, Najim Dehak, Sanjeev Khudanpur

    Abstract: Automatic speech recognition (ASR) systems often need to be developed for extremely low-resource languages to serve end-uses such as audio content categorization and search. While universal phone recognition is natural to consider when no transcribed speech is available to train an ASR system in a language, adapting universal phone models using very small amounts (minutes rather than hours) of tra…

    Submitted 18 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

    Comments: Accepted for publication at Interspeech 2018

  49. arXiv:1802.06053  [pdf, ps, other

    cs.CL

    Bayesian Models for Unit Discovery on a Very Low Resource Language

    Authors: Lucas Ondel, Pierre Godard, Laurent Besacier, Elin Larsen, Mark Hasegawa-Johnson, Odette Scharenborg, Emmanuel Dupoux, Lukas Burget, François Yvon, Sanjeev Khudanpur

    Abstract: Developing speech technologies for low-resource languages has become a very active research field over the last decade. Among others, Bayesian models have shown some promising results on artificial examples but still lack in situ experiments. Our work applies state-of-the-art Bayesian models to unsupervised Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also show tha…

    Submitted 20 February, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

    Comments: Accepted to ICASSP 2018

  50. arXiv:1706.03747  [pdf, other

    cs.CL

    Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework

    Authors: Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur

    Abstract: Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect o…

    Submitted 12 June, 2017; originally announced June 2017.