-
Less Peaky and More Accurate CTC Forced Alignment by Label Priors
Authors:
Ruizhe Huang,
Xiaohui Zhang,
Zhaoheng Ni,
Li Sun,
Moto Hira,
Jeff Hwang,
Vimal Manohar,
Vineel Pratap,
Matthew Wiesner,
Shinji Watanabe,
Daniel Povey,
Sanjeev Khudanpur
Abstract:
Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior of CTC and improving its suitability for forced alignment generation, by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens besides their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
Submitted 15 June, 2024; v1 submitted 22 April, 2024;
originally announced June 2024.
-
TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch
Authors:
Jeff Hwang,
Moto Hira,
Caroline Chen,
Xiaohui Zhang,
Zhaoheng Ni,
Guangzhi Sun,
Pingchuan Ma,
Ruizhe Huang,
Vineel Pratap,
Yuekai Zhang,
Anurag Kumar,
Chin-Yun Yu,
Chuang Zhu,
Chunxi Liu,
Jacob Kahn,
Mirco Ravanelli,
Peng Sun,
Shinji Watanabe,
Yangyang Shi,
Yumeng Tao,
Robin Scheibler,
Samuele Cornell,
Sean Kim,
Stavros Petridis
Abstract:
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
Submitted 26 October, 2023;
originally announced October 2023.
-
Scaling Speech Technology to 1,000+ Languages
Authors:
Vineel Pratap,
Andros Tjandra,
Bowen Shi,
Paden Tomasello,
Arun Babu,
Sayani Kundu,
Ali Elkahky,
Zhaoheng Ni,
Apoorv Vyas,
Maryam Fazel-Zarandi,
Alexei Baevski,
Yossi Adi,
Xiaohui Zhang,
Wei-Ning Hsu,
Alexis Conneau,
Michael Auli
Abstract:
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Submitted 22 May, 2023;
originally announced May 2023.
-
Flashlight: Enabling Innovation in Tools for Machine Learning
Authors:
Jacob Kahn,
Vineel Pratap,
Tatiana Likhomanenko,
Qiantong Xu,
Awni Hannun,
Jeff Cai,
Paden Tomasello,
Ann Lee,
Edouard Grave,
Gilad Avidov,
Benoit Steiner,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward; we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together. Flashlight is available at https://github.com/flashlight/flashlight .
Submitted 22 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Star Temporal Classification: Sequence Classification with Partially Labeled Data
Authors:
Vineel Pratap,
Awni Hannun,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC), which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of the supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.
Submitted 3 March, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Word Order Does Not Matter For Speech Recognition
Authors:
Vineel Pratap,
Qiantong Xu,
Tatiana Likhomanenko,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
In this paper, we study the training of an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using the LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves a word error rate of 2.3%/4.6% on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
Submitted 18 October, 2021; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Parallel Composition of Weighted Finite-State Transducers
Authors:
Shubho Sengupta,
Vineel Pratap,
Awni Hannun
Abstract:
Finite-state transducers (FSTs) are frequently used in speech recognition. Transducer composition is an essential operation for combining different sources of information at different granularities. However, composition is also one of the more computationally expensive operations. Due to the heterogeneous structure of FSTs, parallel algorithms for composition are suboptimal in efficiency, generality, or both. We propose an algorithm for parallel composition and implement it on graphics processing units. We benchmark our parallel algorithm on the composition of random graphs and the composition of graphs commonly used in speech recognition. The parallel composition scales better with the size of the input graphs and for large graphs can be as much as 10 to 30 times faster than a sequential CPU algorithm.
Submitted 6 October, 2021;
originally announced October 2021.
-
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Authors:
Wei-Ning Hsu,
Anuroop Sriram,
Alexei Baevski,
Tatiana Likhomanenko,
Qiantong Xu,
Vineel Pratap,
Jacob Kahn,
Ann Lee,
Ronan Collobert,
Gabriel Synnaeve,
Michael Auli
Abstract:
Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled pre-training data differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.
Submitted 8 September, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
MLS: A Large-Scale Multilingual Dataset for Speech Research
Authors:
Vineel Pratap,
Qiantong Xu,
Anuroop Sriram,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.
Submitted 19 December, 2020; v1 submitted 6 December, 2020;
originally announced December 2020.
-
Rethinking Evaluation in ASR: Are Our Models Robust Enough?
Authors:
Tatiana Likhomanenko,
Qiantong Xu,
Vineel Pratap,
Paden Tomasello,
Jacob Kahn,
Gilad Avidov,
Ronan Collobert,
Gabriel Synnaeve
Abstract:
Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset transfer to other (possibly out-of-domain) datasets. We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data. Finally, we show that training a single acoustic model on the most widely-used datasets - combined - reaches competitive performance on both research and real-world benchmarks.
Submitted 2 May, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
-
Differentiable Weighted Finite-State Transducers
Authors:
Awni Hannun,
Vineel Pratap,
Jacob Kahn,
Wei-Ning Hsu
Abstract:
We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time. Through the separation of graphs from operations on graphs, this framework enables the exploration of new structured loss functions which in turn eases the encoding of prior knowledge into learning algorithms. We show how the framework can combine pruning and back-off in transition models with various sequence-level loss functions. We also show how to learn over the latent decomposition of phrases into word pieces. Finally, to demonstrate that WFSTs can be used in the interior of a deep neural network, we propose a convolutional WFST layer which maps lower-level representations to higher-level representations and can be used as a drop-in replacement for a traditional convolution. We validate these algorithms with experiments in handwriting recognition and speech recognition.
Submitted 2 October, 2020;
originally announced October 2020.
-
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Authors:
Vineel Pratap,
Anuroop Sriram,
Paden Tomasello,
Awni Hannun,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and overall simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data per language (from 100 hours to 1100 hours). We compare three variants of multilingual training: a single joint model without knowing the input language, a joint model using this information, and multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages. We see average relative WER reductions of 20.9%, 23% and 28.8% compared to monolingual baselines for the joint model, the joint model with language input, and the multi-head model, respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.
Submitted 7 July, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Scaling Up Online Speech Recognition Using ConvNets
Authors:
Vineel Pratap,
Qiantong Xu,
Jacob Kahn,
Gilad Avidov,
Tatiana Likhomanenko,
Awni Hannun,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, and accuracy, and discuss how these metrics can be tuned based on the user requirements.
Submitted 27 January, 2020;
originally announced January 2020.
-
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
Authors:
Gabriel Synnaeve,
Qiantong Xu,
Jacob Kahn,
Tatiana Likhomanenko,
Edouard Grave,
Vineel Pratap,
Anuroop Sriram,
Vitaliy Liptchinsky,
Ronan Collobert
Abstract:
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.
Submitted 14 July, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
wav2letter++: The Fastest Open-source Speech Recognition System
Authors:
Vineel Pratap,
Awni Hannun,
Qiantong Xu,
Jeff Cai,
Jacob Kahn,
Gabriel Synnaeve,
Vitaliy Liptchinsky,
Ronan Collobert
Abstract:
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.
Submitted 18 December, 2018;
originally announced December 2018.