Showing 1–50 of 53 results for author: Dehak, N

  1. arXiv:2402.19355  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

    Authors: Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak

    Abstract: Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectur…

    Submitted 29 February, 2024; originally announced February 2024.

  2. arXiv:2311.06170  [pdf, other]

    cs.LG

    Time Scale Network: A Shallow Neural Network For Time Series Data

    Authors: Trevor Meyer, Camden Shultz, Najim Dehak, Laureano Moro-Velazquez, Pedro Irazoqui

    Abstract: Time series data is often composed of information at multiple time scales, particularly in biomedical data. While numerous deep learning strategies exist to capture this information, many make networks larger, require more data, are more demanding to compute, and are difficult to interpret. This limits their usefulness in real-world applications facing even modest computational or data constraints…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: 8 pages, 5 figures, preprint

  3. arXiv:2310.04567  [pdf, other]

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  4. arXiv:2309.04628  [pdf, other]

    eess.AS cs.SD

    Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

    Authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

    Abstract: Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP att…

    Submitted 8 September, 2023; originally announced September 2023.

  5. arXiv:2303.04187  [pdf, other]

    cs.LG

    Stabilized training of joint energy-based models and their practical applications

    Authors: Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distr…

    Submitted 7 March, 2023; originally announced March 2023.

  6. arXiv:2208.05445  [pdf, other]

    eess.AS cs.AI cs.LG

    Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

    Authors: Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-s…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: EARLY ACCESS of IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

  7. arXiv:2208.05413  [pdf, other]

    eess.AS cs.LG

    Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

    Authors: Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

    Abstract: Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples b…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted at Interspeech 2022

  8. arXiv:2204.03851  [pdf, other]

    eess.AS cs.CR cs.SD

    Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser

    Authors: Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, Najim Dehak

    Abstract: Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint mod…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  9. arXiv:2204.03848  [pdf, ps, other]

    eess.AS cs.CR cs.SD

    AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification

    Authors: Sonal Joshi, Saurabh Kataria, Jesus Villalba, Najim Dehak

    Abstract: Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adversarial perturbation. First, we prove our claim th…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to InterSpeech 2022

  10. arXiv:2203.16614  [pdf, other]

    eess.AS cs.SD

    Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed par…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: submitted to Interspeech 2022

  11. arXiv:2201.11207  [pdf, other]

    cs.SD cs.CL eess.AS

    Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition

    Authors: Piotr Żelasko, Siyuan Feng, Laureano Moro Velazquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak

    Abstract: The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. Wh…

    Submitted 27 January, 2022; v1 submitted 26 January, 2022; originally announced January 2022.

    Comments: Accepted for publication in Computer Speech and Language

  12. arXiv:2201.02550  [pdf, other]

    cs.CL cs.SD eess.AS

    Textual Data Augmentation for Arabic-English Code-Switching Speech Recognition

    Authors: Amir Hussein, Shammur Absar Chowdhury, Ahmed Abdelali, Najim Dehak, Ahmed Ali, Sanjeev Khudanpur

    Abstract: The pervasiveness of intra-utterance code-switching (CS) in spoken content requires that speech recognition (ASR) systems handle mixed language. Designing a CS-ASR system has many challenges, mainly due to data scarcity, grammatical structure complexity, and domain mismatch. The most common method for addressing CS is to train an ASR system with the available transcribed CS speech, along with mono…

    Submitted 11 January, 2023; v1 submitted 7 January, 2022; originally announced January 2022.

  13. arXiv:2110.02345  [pdf, other]

    eess.AS cs.SD

    Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learn…

    Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.02170

  14. arXiv:2109.13425  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    The JHU submission to VoxSRC-21: Track 3

    Authors: Jaejin Cho, Jesus Villalba, Najim Dehak

    Abstract: This technical report describes Johns Hopkins University speaker recognition system submitted to Voxceleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the proposed one from the first place team in the last year's VoxSRC2020 challenge. The main difference is a recently proposed non-contrastive self-supervised m…

    Submitted 27 September, 2021; originally announced September 2021.

  15. arXiv:2109.06112  [pdf, other]

    cs.CL cs.SD eess.AS

    Beyond Isolated Utterances: Conversational Emotion Recognition

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU 2021

  16. arXiv:2109.06103  [pdf, other]

    cs.CL cs.LG

    Joint prediction of truecasing and punctuation for conversational speech in low-resource scenarios

    Authors: Raghavendra Pappagari, Piotr Żelasko, Agnieszka Mikołajczyk, Piotr Pęzik, Najim Dehak

    Abstract: Capitalization and punctuation are important cues for comprehending written texts and conversational transcripts. Yet, many ASR systems do not produce punctuated and case-formatted speech transcripts. We propose to use a multi-task system that can exploit the relations between casing and punctuation to improve their prediction performance. Whereas text data for predicting punctuation and truecasin…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU 2021

  17. arXiv:2107.02294  [pdf, other]

    cs.CL

    What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition

    Authors: Piotr Żelasko, Raghavendra Pappagari, Najim Dehak

    Abstract: Dialog acts can be interpreted as the atomic units of a conversation, more fine-grained than utterances, characterized by a specific communicative function. The ability to structure a conversational transcript as a sequence of dialog acts -- dialog act recognition, including the segmentation -- is critical for understanding dialog. We apply two pre-trained transformer models, XLNet and Longformer,…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted for publication in Transactions of the Association of Computational Linguistics. This is a pre-MIT Press publication version and it is subject to change

  18. arXiv:2106.09660  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

    Authors: Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan

    Abstract: This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts to the original WaveGrad vocoder which conditi…

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Proceedings of INTERSPEECH

  19. arXiv:2106.05885  [pdf, other]

    cs.CL cs.SD eess.AS

    Balanced End-to-End Monolingual pre-training for Low-Resourced Indic Languages Code-Switching Speech Recognition

    Authors: Amir Hussein, Shammur Chowdhury, Najim Dehak, Ahmed Ali

    Abstract: The success in designing Code-Switching (CS) ASR often depends on the availability of the transcribed CS resources. Such dependency harms the development of ASR in low-resourced languages such as Bengali and Hindi. In this paper, we exploit the transfer learning approach to design End-to-End (E2E) CS ASR systems for the two low-resourced language pairs using different monolingual speech data and a…

    Submitted 15 February, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

  20. arXiv:2106.02170  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive pr…

    Submitted 3 June, 2021; originally announced June 2021.

  21. arXiv:2103.17122  [pdf, ps, other]

    eess.AS cs.CR cs.SD

    Adversarial Attacks and Defenses for Speech Recognition Systems

    Authors: Piotr Żelasko, Sonal Joshi, Yiwen Shao, Jesus Villalba, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

    Abstract: The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer enc…

    Submitted 31 March, 2021; originally announced March 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  22. arXiv:2101.08909  [pdf, other]

    eess.AS cs.SD

    Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems

    Authors: Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose def…

    Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  23. arXiv:2011.01210  [pdf, other]

    eess.AS cs.LG

    Focus on the present: a regularization method for the ASR source-target attention layer

    Authors: Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak

    Abstract: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both, CTC and source-target attention, are acting on the same encoder representations. To understand the functionality of the attention, CTC is applie…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: submitted to ICASSP2021. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  24. arXiv:2010.14602  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    CopyPaste: An Augmentation Method for Speech Emotion Recognition

    Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictat…

    Submitted 11 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP2021

  25. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu…

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  26. arXiv:2010.11860  [pdf, other]

    eess.AS cs.SD

    Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

    Authors: Saurabh Kataria, Jesús Villalba, Najim Dehak

    Abstract: Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classifica…

    Submitted 22 October, 2020; originally announced October 2020.

  27. arXiv:2010.11221  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Speaker Embedding from Text-to-Speech

    Authors: Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

    Abstract: Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesi…

    Submitted 21 October, 2020; originally announced October 2020.

  28. arXiv:2007.13033  [pdf, other]

    eess.AS cs.LG cs.SD

    Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

    Abstract: Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phon…

    Submitted 25 July, 2020; originally announced July 2020.

  29. arXiv:2005.08331  [pdf, ps, other]

    eess.AS cs.SD

    Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Paola García-Perera, Jesús Villalba, Najim Dehak

    Abstract: We investigated an enhancement and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation approach, we trained a Cycle Consistent Generative Ad…

    Submitted 17 May, 2020; originally announced May 2020.

    Comments: submitted to INTERSPEECH 2020

  30. arXiv:2005.08118  [pdf, other]

    eess.AS cs.CL cs.SD

    That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

    Authors: Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages

  31. arXiv:2004.05985  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?

    Authors: Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak

    Abstract: Automatic Speech Recognition (ASR) systems introduce word errors, which often confuse punctuation prediction models, turning punctuation restoration into a challenging task. These errors usually take the form of homonyms. We show how retrofitting of the word embeddings on the domain-specific data can mitigate ASR errors. Our main contribution is a method for better alignment of homonym embeddings…

    Submitted 13 April, 2020; originally announced April 2020.

    Comments: submitted to INTERSPEECH'20

  32. arXiv:2002.05039  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    x-vectors meet emotions: A study on dependencies between emotion and speaker recognition

    Authors: Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim Dehak

    Abstract: In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features…

    Submitted 12 February, 2020; originally announced February 2020.

    Comments: 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

  33. arXiv:2002.00139  [pdf, other]

    eess.AS cs.SD

    Analysis of Deep Feature Loss based Enhancement for Speaker Verification

    Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Najim Dehak

    Abstract: Data augmentation is conventionally used to inject robustness in Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called de…

    Submitted 27 April, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

    Comments: 8 pages; accepted in Odyssey2020 workshop

  34. arXiv:1912.00938  [pdf]

    eess.AS cs.SD

    Speaker detection in the wild: Lessons learned from JSALT 2019

    Authors: Paola Garcia, Jesus Villalba, Herve Bredin, Jun Du, Diego Castan, Alejandrina Cristia, Latane Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Leo Galmant, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak

    Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker dete…

    Submitted 2 December, 2019; originally announced December 2019.

    Comments: Submitted to ICASSP 2020

  35. arXiv:1911.04908  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

    Authors: Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

    Abstract: Recently very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put it into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FM…

    Submitted 6 April, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

  36. arXiv:1910.11915  [pdf, ps, other]

    eess.AS cs.SD

    Unsupervised Feature Enhancement for speaker verification

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Paola García-Perera, Najim Dehak

    Abstract: The task of making speaker verification systems robust to adverse scenarios remains a challenging and active area of research. We developed an unsupervised feature enhancement approach in log-filter bank domain with the end goal of improving speaker verification performance. We experimented with using both real speech recorded in adverse environments and degraded speech obtained by simulation to…

    Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: 5 pages; accepted in ICASSP 2020

  37. arXiv:1910.11909  [pdf, other]

    eess.AS cs.SD

    Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-GANs

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Najim Dehak

    Abstract: Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Networks (CycleGAN) has received a lot of attention. CycleGAN…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: 8 pages, accepted to ASRU 2019

  38. arXiv:1910.11905  [pdf, ps, other]

    eess.AS cs.SD

    Feature Enhancement with Deep Feature Losses for Speaker Verification

    Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Nanxin Chen, Paola García, Najim Dehak

    Abstract: Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage on the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker emb…

    Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: 5 pages, accepted in ICASSP 2020

  39. arXiv:1910.10781  [pdf, ps, other]

    cs.CL cs.LG stat.ML

    Hierarchical Transformers for Long Document Classification

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

    Abstract: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: 4 figures, 7 pages

    Journal ref: Automatic Speech Recognition and Understanding Workshop, 2019

  40. arXiv:1906.03588  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method

    Authors: Zheng-Hua Tan, Achintya kr. Sarkar, Najim Dehak

    Abstract: This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the s…

    Submitted 11 January, 2022; v1 submitted 9 June, 2019; originally announced June 2019.

    Journal ref: Computer Speech & Language, volume 59, January 2020, Pages 1-21

  41. arXiv:1904.11641  [pdf, other]

    cs.SD cs.CL eess.AS

    Speaker Sincerity Detection based on Covariance Feature Vectors and Ensemble Methods

    Authors: Mohammed Senoussaoui, Patrick Cardinal, Najim Dehak, Alessandro Lameiras Koerich

    Abstract: Automatic measuring of speaker sincerity degree is a novel research problem in computational paralinguistics. This paper proposes covariance-based feature vectors to model speech and ensembles of support vector regressors to estimate the degree of sincerity of a speaker. The elements of each covariance vector are pairwise statistics between the short-term feature components. These features are use…

    Submitted 25 April, 2019; originally announced April 2019.

  42. arXiv:1904.04240  [pdf, other]

    eess.AS cs.SD

    MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation

    Authors: Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass

    Abstract: The Multi-target Challenge aims to assess how well current speech technology is able to determine whether or not a recorded utterance was spoken by one of a large number of blacklisted speakers. It is a form of multi-target speaker detection based on real-world telephone conversations. Data recordings are generated from call center customer-agent conversations. The task is to measure how accuratel…

    Submitted 7 April, 2019; originally announced April 2019.

    Comments: http://mce.csail.mit.edu . arXiv admin note: text overlap with arXiv:1807.06663

  43. arXiv:1904.01120  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

    Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak

    Abstract: We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT). Anti-spoofing has gathered more and more attention since the inauguration of the ASVspoof Challenges, and ASVspoof 2019 dedicates to address attacks from all three major types: text-to-speech, voice conversion, and replay. Built upon previous research work on Dee…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria

  44. arXiv:1812.03919  [pdf, other]

    eess.AS cs.CL cs.SD

    Pretraining by Backtranslation for End-to-end ASR in Low-Resource Settings

    Authors: Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Liu, Najim Dehak, Sanjeev Khudanpur

    Abstract: We explore training attention-based encoder-decoder ASR in low-resource settings. These models perform poorly when trained on small amounts of transcribed speech, in part because they depend on having sufficient target-side text to train the attention and decoder networks. In this paper we address this shortcoming by pretraining our network parameters using only text-based data and transcribed spe…

    Submitted 2 August, 2019; v1 submitted 10 December, 2018; originally announced December 2018.

  45. arXiv:1811.02162  [pdf, other]

    eess.AS cs.SD

    Language model integration based on memory control for sequence to sequence speech recognition

    Authors: Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, Najim Dehak

    Abstract: In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. Th…

    Submitted 5 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 figure, 5 tables, submitted to ICASSP 2019

  46. arXiv:1810.13048  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Attentive Filtering Networks for Audio Replay Attack Detection

    Authors: Cheng-I Lai, Alberto Abad, Korin Richmond, Junichi Yamagishi, Najim Dehak, Simon King

    Abstract: An attacker may use a variety of techniques to fool an automatic speaker verification system into accepting them as a genuine user. Anti-spoofing methods meanwhile aim to make the system robust against such attacks. The ASVspoof 2017 Challenge focused specifically on replay attacks, with the intention of measuring the limits of replay attack detection as well as developing countermeasures against…

    Submitted 30 October, 2018; originally announced October 2018.

    Comments: Submitted to ICASSP 2019

  47. arXiv:1807.06663  [pdf, other]

    eess.AS cs.SD

    MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation (MCE) Plan, Dataset and Baseline System

    Authors: Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass

    Abstract: The Multitarget Challenge aims to assess how well current speech technology is able to determine whether or not a recorded utterance was spoken by one of a large number of 'blacklisted' speakers. It is a form of multi-target speaker detection based on real-world telephone conversations. Data recordings are generated from call center customer-agent conversations. Each conversation is represented by…

    Submitted 17 July, 2018; originally announced July 2018.

    Comments: MCE 2018 Plan (http://mce.csail.mit.edu)

  48. arXiv:1807.06204  [pdf, other]

    cs.CL

    Low-Resource Contextual Topic Identification on Speech

    Authors: Chunxi Liu, Matthew Wiesner, Shinji Watanabe, Craig Harman, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

    Abstract: In topic identification (topic ID) on real-world unstructured audio, an audio instance of variable topic shifts is first broken into sequential segments, and each segment is independently classified. We first present a general purpose method for topic ID on spoken segments in low-resource languages, using a cascade of universal acoustic modeling, translation lexicons to English, and English-langua…

    Submitted 28 September, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

    Comments: Accepted for publication at 2018 IEEE Workshop on Spoken Language Technology (SLT)

  49. arXiv:1807.00543  [pdf, other]

    cs.CL

    Punctuation Prediction Model for Conversational Speech

    Authors: Piotr Żelasko, Piotr Szymański, Jan Mizgajski, Adrian Szymczak, Yishay Carmiel, Najim Dehak

    Abstract: An ASR system usually does not predict any punctuation or capitalization. Lack of punctuation causes problems in result presentation and confuses both the human reader and off-the-shelf natural language processing algorithms. To overcome these limitations, we train two variants of Deep Neural Network (DNN) sequence labelling models - a Bidirectional Long Short-Term Memory (BLSTM) and a Convolutiona…

    Submitted 2 July, 2018; originally announced July 2018.

    Comments: Accepted for Interspeech 2018 Conference

  50. arXiv:1802.08731  [pdf, other]

    cs.CL

    Automatic Speech Recognition and Topic Identification for Almost-Zero-Resource Languages

    Authors: Matthew Wiesner, Chunxi Liu, Lucas Ondel, Craig Harman, Vimal Manohar, Jan Trmal, Zhongqiang Huang, Najim Dehak, Sanjeev Khudanpur

    Abstract: Automatic speech recognition (ASR) systems often need to be developed for extremely low-resource languages to serve end-uses such as audio content categorization and search. While universal phone recognition is natural to consider when no transcribed speech is available to train an ASR system in a language, adapting universal phone models using very small amounts (minutes rather than hours) of tra…

    Submitted 18 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

    Comments: Accepted for publication at Interspeech 2018