subscribe to arXiv mailings

doi 10.1109/TITS.2021.3132227

Digital Video Manipulation Detection Technique Based on Compression Algorithms

Authors: Edgar Gonzalez Fernandez, Ana Lucila Sandoval Orozco, Luis Javier Garcia Villalba

Abstract: Digital images and videos play a very important role in everyday life. Nowadays, people have access the affordable mobile devices equipped with advanced integrated cameras and powerful image processing applications. Technological development facilitates not only the generation of multimedia content, but also the intentional modification of it, either with recreational or malicious purposes. This i… ▽ More Digital images and videos play a very important role in everyday life. Nowadays, people have access the affordable mobile devices equipped with advanced integrated cameras and powerful image processing applications. Technological development facilitates not only the generation of multimedia content, but also the intentional modification of it, either with recreational or malicious purposes. This is where forensic techniques to detect manipulation of images and videos become essential. This paper proposes a forensic technique by analysing compression algorithms used by the H.264 coding. The presence of recompression uses information of macroblocks, a characteristic of the H.264-MPEG4 standard, and motion vectors. A Vector Support Machine is used to create the model that allows to accurately detect if a video has been recompressed. △ Less

Submitted 3 February, 2024; originally announced March 2024.

Journal ref: IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 3, pp. 2596-2605, December 2021

arXiv:2402.19355 [pdf, other]

Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

Authors: Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak

Abstract: Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectur… ▽ More Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectures. Additionally, we introduce a method for identifying the victim model on which the adversarial attack is carried out. To achieve this, we generate a new dataset containing multiple attacks performed against various victim models. We achieve an AUC of 0.982 for attack detection, with no more than a 0.03 drop in performance for unknown attacks. Our attack classification accuracy (excluding benign) reaches 86.48% across eight attack types using our LightResNet34 architecture, while our victim model classification accuracy reaches 72.28% across four victim models. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.07714 [pdf]

doi 10.1016/j.swevo.2017.07.002

Adaptive Artificial Immune Networks for Mitigating DoS flooding Attacks

Authors: Jorge Maestre Vidal, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Abstract: Denial of service attacks pose a threat in constant growth. This is mainly due to their tendency to gain in sophistication, ease of implementation, obfuscation and the recent improvements in occultation of fingerprints. On the other hand, progress towards self-organizing networks, and the different techniques involved in their development, such as software-defined networking, network-function virt… ▽ More Denial of service attacks pose a threat in constant growth. This is mainly due to their tendency to gain in sophistication, ease of implementation, obfuscation and the recent improvements in occultation of fingerprints. On the other hand, progress towards self-organizing networks, and the different techniques involved in their development, such as software-defined networking, network-function virtualization, artificial intelligence or cloud computing, facilitates the design of new defensive strategies, more complete, consistent and able to adapt the defensive deployment to the current status of the network. In order to contribute to their development, in this paper, the use of artificial immune systems to mitigate denial of service attacks is proposed. The approach is based on building networks of distributed sensors suited to the requirements of the monitored environment. These components are capable of identifying threats and reacting according to the behavior of the biological defense mechanisms in human beings. It is accomplished by emulating the different immune reactions, the establishment of quarantine areas and the construction of immune memory. For their assessment, experiments with public domain datasets (KDD'99, CAIDA'07 and CAIDA'08) and simulations on various network configurations based on traffic samples gathered by the University Complutense of Madrid and flooding attacks generated by the tool DDoSIM were performed. △ Less

Submitted 12 February, 2024; originally announced February 2024.

Journal ref: J. Maestre Vidal, A. L. Sandoval Orozco, L. J. García Villalba: Adaptive Artificial Immune Networks for Mitigating DoS Flooding Attacks. Swarm and Evolutionary Computation. Vol. 38, pp. 3894-108, February 2018

arXiv:2402.06669 [pdf]

doi 10.1016/j.eswa.2020.114515

Compression effects and scene details on the source camera identification of digital videos

Authors: Raquel Ramos López, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Abstract: The continuous growth of technologies like 4G or 5G has led to a massive use of mobile devices such as smartphones and tablets. This phenomenon, combined with the fact that people use mobile phones for a longer period of time, results in mobile phones becoming the main source of creation of visual information. However, its reliability as a true representation of reality cannot be taken for granted… ▽ More The continuous growth of technologies like 4G or 5G has led to a massive use of mobile devices such as smartphones and tablets. This phenomenon, combined with the fact that people use mobile phones for a longer period of time, results in mobile phones becoming the main source of creation of visual information. However, its reliability as a true representation of reality cannot be taken for granted due to the constant increase in editing software. This makes it easier to alter original content without leaving a noticeable trace in the modification. Therefore, it is essential to introduce forensic analysis mechanisms to guarantee the authenticity or integrity of a certain digital video, particularly if it may be considered as evidence in legal proceedings. This paper explains the branch of multimedia forensic analysis that allows to determine the identification of the source of acquisition of a certain video by exploiting the unique traces left by the camera sensor of the mobile device in visual content. To do this, a technique that performs the identification of the source of acquisition of digital videos from mobile devices is presented. It involves 3 stages: (1) Extraction of the sensor fingerprint by applying the block-based technique. (2) Filtering the strong component of the PRNU signal to improve the quality of the sensor fingerprint. (3) Classification of digital videos in an open scenario, that is, where the forensic analyst does not need to have access to the device that recorded the video to find out the origin of the video. The main contribution of the proposed technique eliminates the details of the scene to improve the PRNU fingerprint. It should be noted that these techniques are applied to digital images and not to digital videos. △ Less

Submitted 7 February, 2024; originally announced February 2024.

Journal ref: Expert Systems with Applications, Vol. 170, pp. 114515, May 2021

arXiv:2402.06661 [pdf]

doi 10.1016/j.future.2020.02.044

Authentication and integrity of smartphone videos through multimedia container structure analysis

Authors: Carlos Quinto Huamán, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Abstract: Nowadays, mobile devices have become the natural substitute for the digital camera, as they capture everyday situations easily and quickly, encouraging users to express themselves through images and videos. These videos can be shared across different platforms exposing them to any kind of intentional manipulation by criminals who are aware of the weaknesses of forensic techniques to accuse an inno… ▽ More Nowadays, mobile devices have become the natural substitute for the digital camera, as they capture everyday situations easily and quickly, encouraging users to express themselves through images and videos. These videos can be shared across different platforms exposing them to any kind of intentional manipulation by criminals who are aware of the weaknesses of forensic techniques to accuse an innocent person or exonerate a guilty person in a judicial process. Commonly, manufacturers do not comply 100% with the specifications of the standards for the creation of videos. Also, videos shared on social networks, and instant messaging applications go through filtering and compression processes to reduce their size, facilitate their transfer, and optimize storage on their platforms. The omission of specifications and results of transformations carried out by the platforms embed a features pattern in the multimedia container of the videos. These patterns make it possible to distinguish the brand of the device that generated the video, social network, and instant messaging application that was used for the transfer. Research in recent years has focused on the analysis of AVI containers and tiny video datasets. This work presents a novel technique to detect possible attacks against MP4, MOV, and 3GP format videos that affect their integrity and authenticity. The method is based on the analysis of the structure of video containers generated by mobile devices and their behavior when shared through social networks, instant messaging applications, or manipulated by editing programs. The objectives of the proposal are to verify the integrity of videos, identify the source of acquisition and distinguish between original and manipulated videos. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Journal ref: Quinto Huamán, A. L. Sandoval Orozco, L. J. García Villalba: Authentication and Integrity of Smartphone Videos Through Multimedia Container Structure Analysis. Future Generation Computer Systems. Vol. 108, pp. 15-33, July 2020

arXiv:2402.03562 [pdf]

doi 10.1016/j.knosys.2018.03.018

A novel pattern recognition system for detecting Android malware by analyzing suspicious boot sequences

Authors: Jorge Maestre Vidal, Marco Antonio Sotelo Monge, Luis Javier García Villalba

Abstract: This paper introduces a malware detection system for smartphones based on studying the dynamic behavior of suspicious applications. The main goal is to prevent the installation of the malicious software on the victim systems. The approach focuses on identifying malware addressed against the Android platform. For that purpose, only the system calls performed during the boot process of the recently… ▽ More This paper introduces a malware detection system for smartphones based on studying the dynamic behavior of suspicious applications. The main goal is to prevent the installation of the malicious software on the victim systems. The approach focuses on identifying malware addressed against the Android platform. For that purpose, only the system calls performed during the boot process of the recently installed applications are studied. Thereby the amount of information to be considered is reduced, since only activities related with their initialization are taken into account. The proposal defines a pattern recognition system with three processing layers: monitoring, analysis and decision-making. First, in order to extract the sequences of system calls, the potentially compromised applications are executed on a safe and isolated environment. Then the analysis step generates the metrics required for decision-making. This level combines sequence alignment algorithms with bagging, which allow scoring the similarity between the extracted sequences considering their regions of greatest resemblance. At the decision-making stage, the Wilcoxon signed-rank test is implemented, which determines if the new software is labeled as legitimate or malicious. The proposal has been tested in different experiments that include an in-depth study of a particular use case, and the evaluation of its effectiveness when analyzing samples of well-known public datasets. Promising experimental results have been shown, hence demonstrating that the approach is a good complement to the strategies of the bibliography. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Journal ref: Knowledge-Based Systems. Vol. 150, pp. 198-217, June 2018

arXiv:2402.03555 [pdf]

doi 10.1016/j.comcom.2021.03.008

A security framework for Ethereum smart contracts

Authors: Antonio López Vivar, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Abstract: The use of blockchain and smart contracts have not stopped growing in recent years. Like all software that begins to expand its use, it is also beginning to be targeted by hackers who will try to exploit vulnerabilities in both the underlying technology and the smart contract code itself. While many tools already exist for analyzing vulnerabilities in smart contracts, the heterogeneity and variety… ▽ More The use of blockchain and smart contracts have not stopped growing in recent years. Like all software that begins to expand its use, it is also beginning to be targeted by hackers who will try to exploit vulnerabilities in both the underlying technology and the smart contract code itself. While many tools already exist for analyzing vulnerabilities in smart contracts, the heterogeneity and variety of approaches and differences in providing the analysis data makes the learning curve for the smart contract developer steep. In this article the authors present ESAF (Ethereum Security Analysis Framework), a framework for analysis of smart contracts that aims to unify and facilitate the task of analyzing smart contract vulnerabilities which can be used as a persistent security monitoring tool for a set of target contracts as well as a classic vulnerability analysis tool among other uses. △ Less

Submitted 5 February, 2024; originally announced February 2024.

Journal ref: Computer Communications. Vol. 172, pp. 119-129, April 2021

arXiv:2402.02240 [pdf]

doi 10.1145/3447773

Recommendations on Statistical Randomness Test Batteries for Cryptographic Purposes

Authors: Elena Almaraz Luengo, Luis Javier García Villalba

Abstract: Security in different applications is closely related to the goodness of the sequences generated for such purposes. Not only in Cryptography but also in other areas, it is necessary to obtain long sequences of random numbers or that, at least, behave as such. To decide whether the generator used produces sequences that are random, unpredictable and independent, statistical checks are needed. Diffe… ▽ More Security in different applications is closely related to the goodness of the sequences generated for such purposes. Not only in Cryptography but also in other areas, it is necessary to obtain long sequences of random numbers or that, at least, behave as such. To decide whether the generator used produces sequences that are random, unpredictable and independent, statistical checks are needed. Different batteries of hypothesis tests have been proposed for this purpose. In this work, a survey of the main test batteries is presented, indicating their pros and cons, giving some guidelines for their use and presenting some practical examples. △ Less

Submitted 3 February, 2024; originally announced February 2024.

Journal ref: ACM Computing Surveys, Vol. 54, No. 80, pp. 12420, May 2021

arXiv:2401.09464 [pdf]

Floating Point HUB Adder for RISC-V Sargantana Processor

Authors: Gerardo Bandera, Javier Salamero, Miquel Moreto, Julio Villalba

Abstract: HUB format is an emerging technique to improve the hardware and time requirement when round to nearest is needed. On the other hand, RISC-V is an open-source ISA that many companies currently use in their designs. This paper presents a tailored floating point HUB adder implemented in the Sargantana RISC-V processor. HUB format is an emerging technique to improve the hardware and time requirement when round to nearest is needed. On the other hand, RISC-V is an open-source ISA that many companies currently use in their designs. This paper presents a tailored floating point HUB adder implemented in the Sargantana RISC-V processor. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: RISC-V Summit Europe, Barcelona, 5-9th June 2023

arXiv:2309.04628 [pdf, other]

Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

Authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

Abstract: Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP att… ▽ More Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and utilize both the pretrained image and text encoders. Despite using both, they noticed a substantial drop in retrieval performance. We proposed Segmental SpeechCLIP which used a hierarchical segmental speech encoder to generate sequences of word-like units. We used the pretrained CLIP text encoder on top of these word-like unit representations and showed significant improvements over the cascaded variant of SpeechCLIP. Segmental SpeechCLIP directly learns the word embeddings as input to the CLIP text encoder bypassing the vocabulary embeddings. Here, we explore mapping audio to CLIP vocabulary embeddings via regularization and quantization. As our objective is to distill semantic information into the speech encoders, we explore the usage of large unimodal pretrained language models as the text encoders. Our method enables us to bridge image and text encoders e.g. DINO and RoBERTa trained with uni-modal data. Finally, we extend our framework in audio-only settings where only pairs of semantically related audio are available. Experiments show that audio-only systems perform close to the audio-visual system. △ Less

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2303.04187 [pdf, other]

Stabilized training of joint energy-based models and their practical applications

Authors: Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

Abstract: The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distr… ▽ More The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distribution $p(x)$ generated by means of Stochastic Gradient Langevin Dynamics (SGLD). Unfortunately, SGLD often fails to deliver negative samples of sufficient quality during the standard JEM training, which causes a very unbalanced contribution from the positive and negative examples when calculating gradients for JEM updates. As a consequence, the standard JEM training is quite unstable requiring careful tuning of hyper-parameters and frequent restarts when the training starts diverging. This makes it difficult to apply JEM to different neural network architectures, modalities, and tasks. In this work, we propose a training procedure that stabilizes SGLD-based JEM training (ST-JEM) by balancing the contribution from the positive and negative examples. We also propose to add an additional "regularization" term to the training objective -- MI between the input observations $x$ and output labels $y$ -- which encourages the JEM classifier to make more certain decisions about output labels. We demonstrate the effectiveness of our approach on the CIFAR10 and CIFAR100 tasks. We also consider the task of classifying phonemes in a speech signal, for which we were not able to train JEM without the proposed stabilization. We show that a convincing speech can be generated from the trained model. Alternatively, corrupted speech can be de-noised by bringing it closer to the modeled speech distribution using a few SGLD iterations. We also propose and discuss additional applications of the trained model. △ Less

Submitted 7 March, 2023; originally announced March 2023.

arXiv:2208.05445 [pdf, other]

doi 10.1109/JSTSP.2022.3197315

Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

Authors: Jaejin Cho, Jes'us Villalba, Laureano Moro-Velazquez, Najim Dehak

Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-s… ▽ More In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the models are usually large. There are also SSL techniques to learn utterance-level representation. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (anchor). However, this does not ensure that all the negative samples belong to classes different from the anchor class without labels. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to x-vector trained in a supervised manner. When transferred to down-stream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed x-vector. We studied the influence of several aspects during transfer learning such as dividing the fine-tuning process into steps, chunk lengths, or augmentation. During fine-tuning, tuning the last affine layers first and then the whole network surpassed fine-tuning all at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying speech segments at least with a specific length are required for better performance per application. Augmentation was helpful in SER. △ Less

Submitted 10 August, 2022; originally announced August 2022.

Comments: EARLY ACCESS of IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

arXiv:2208.05413 [pdf, other]

Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

Authors: Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

Abstract: Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples b… ▽ More Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples belong to classes different from the anchor class without labels. This paper applies a non-contrastive self-supervised learning method on an unlabeled speech corpus to learn utterance-level embeddings. We used DIstillation with NO labels (DINO), proposed in computer vision, and adapted it to the speech domain. Unlike the contrastive methods, DINO does not require negative sampling. These embeddings were evaluated on speaker verification and emotion recognition. In speaker verification, the unsupervised DINO embedding with cosine scoring provided 4.38% EER on the VoxCeleb1 test trial. This outperforms the best contrastive self-supervised method by 40% relative in EER. An iterative pseudo-labeling training pipeline, not requiring speaker labels, further improved the EER to 1.89%. In emotion recognition, the DINO embedding performed 60.87, 79.21, and 56.98% in micro-f1 score on IEMOCAP, Crema-D, and MSP-Podcast, respectively. The results imply the generality of the DINO embedding to different speech applications. △ Less

Submitted 10 August, 2022; originally announced August 2022.

Comments: Accepted at Interspeech 2022

arXiv:2204.03851 [pdf, other]

Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser

Authors: Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, Najim Dehak

Abstract: Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint mod… ▽ More Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint model of ASR and denoiser. Our evaluation shows denoiser pre-processor (trained on offline adversarial examples) fails to defend against adaptive white-box attacks. However, adversarially fine-tuning the denoiser using a tandem model of denoiser and ASR offers more robustness. We evaluate two variants of this defense--one updating parameters of both models and the second keeping ASR frozen. The joint model offers a mean absolute decrease of 19.3\% ground truth (GT) WER with reference to baseline against fast gradient sign method (FGSM) attacks with different $L_\infty$ norms. The joint model with frozen ASR parameters gives the best defense against projected gradient descent (PGD) with 7 iterations, yielding a mean absolute increase of 22.3\% GT WER with reference to baseline; and against PGD with 500 iterations, yielding a mean absolute decrease of 45.08\% GT WER and an increase of 68.05\% adversarial target WER. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to Interspeech 2022

arXiv:2204.03848 [pdf, ps, other]

AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification

Authors: Sonal Joshi, Saurabh Kataria, Jesus Villalba, Najim Dehak

Abstract: Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adversarial perturbation. First, we prove our claim th… ▽ More Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adversarial perturbation. First, we prove our claim that training the representation learning network using adversarial perturbations as opposed to adversarial examples (consisting of the combination of clean signal and adversarial perturbation) is beneficial because it eliminates nuisance information. At inference time, we use a time-domain denoiser to estimate the adversarial perturbations from adversarial examples. Using our improved representation learning approach to obtain attack embeddings (signatures), we evaluate their performance for three applications: known attack classification, attack verification, and unknown attack detection. We show that common attacks in the literature (Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), Carlini-Wagner (CW) with different Lp threat models) can be classified with an accuracy of ~96%. We also detect unknown attacks with an equal error rate (EER) of ~9%, which is absolute improvement of ~12% from our previous work. △ Less

Submitted 8 April, 2022; originally announced April 2022.

Comments: Submitted to InterSpeech 2022

arXiv:2203.16614 [pdf, other]

Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification

Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak

Abstract: Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed par… ▽ More Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed parallel and non-parallel learning solutions which utilize both paired and unpaired data. First, we first discuss joint and disjoint training of multiple generative models for our tasks. Then, we propose a two-stage learning solution where we use a pre-trained domain adaptation system for pre-processing in bandwidth extension training. We evaluated our schemes on a Speaker Verification downstream task. We used the JHU-MIT experimental setup for NIST SRE21, which comprises SRE16, SRE-CTS Superset and SRE21. Our results provide the first evidence that learning both tasks is better than learning just one. On SRE16, our best system achieves 22% relative improvement in Equal Error Rate w.r.t. a direct learning baseline and 8% w.r.t. a strong bandwidth expansion system. △ Less

Submitted 30 March, 2022; originally announced March 2022.

Comments: submitted to Interspeech 2022

arXiv:2110.02345 [pdf, other]

Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Abstract: Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learn… ▽ More Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learning, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework to model the signal structure at a higher level, e.g., phone level. A convolutional neural network learns frame-level representation from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE to learn segment representations. The differentiable boundary detector allows us to train frame-level and segment-level encoders jointly. Experiments show that our single model outperforms existing phone and word segmentation methods on TIMIT and Buckeye datasets. We discover that phone class impacts the boundary detection performance, and the boundaries between successive vowels or semivowels are the most difficult to identify. Finally, we use SCPC to extract speech features at the segment level rather than at uniformly spaced frame level (e.g., 10 ms) and produce variable rate representations that change according to the contents of the utterance. We can lower the feature extraction rate from the typical 100 Hz to as low as 14.5 Hz on average while still outperforming the MFCC features on the linear phone classification task. △ Less

Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2106.02170

arXiv:2109.13425 [pdf, ps, other]

The JHU submission to VoxSRC-21: Track 3

Authors: Jejin Cho, Jesus Villalba, Najim Dehak

Abstract: This technical report describes Johns Hopkins University speaker recognition system submitted to Voxceleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the proposed one from the first place team in the last year's VoxSRC2020 challenge. The main difference is a recently proposed non-contrastive self-supervised m… ▽ More This technical report describes Johns Hopkins University speaker recognition system submitted to Voxceleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the proposed one from the first place team in the last year's VoxSRC2020 challenge. The main difference is a recently proposed non-contrastive self-supervised method in computer vision (CV), distillation with no labels (DINO), is used to train our initial model, which outperformed the last year's contrastive learning based on momentum contrast (MoCo). Also, this requires only a few iterations in the iterative clustering stage, where pseudo labels for supervised embedding learning are updated based on the clusters of the embeddings generated from a model that is continually fine-tuned over iterations. In the final stage, Res2Net50 is trained on the final pseudo labels from the iterative clustering stage. This is our best submitted model to the challenge, showing 1.89, 6.50, and 6.89 in EER(%) in voxceleb1 test o, VoxSRC-21 validation, and test trials, respectively. △ Less

Submitted 27 September, 2021; originally announced September 2021.

arXiv:2109.06112 [pdf, other]

Beyond Isolated Utterances: Conversational Emotion Recognition

Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches… ▽ More Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches for CER by treating it as a sequence labeling task. We investigated transformer architecture for CER and, compared it with ResNet-34 and BiLSTM architectures in both contextual and context-less scenarios using IEMOCAP corpus. Based on the inner workings of the self-attention mechanism, we proposed DiverseCatAugment (DCA), an augmentation scheme, which improved the transformer model performance by an absolute 3.3% micro-f1 on conversations and 3.6% on isolated utterances. We further enhanced the performance by introducing an interlocutor-aware transformer model where we learn a dictionary of interlocutor index embeddings to exploit diarized conversations. △ Less

Submitted 13 September, 2021; originally announced September 2021.

Comments: Accepted for ASRU 2021

arXiv:2106.02170 [pdf, other]

Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Abstract: Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive pr… ▽ More Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level e.g. at the phoneme level. In this framework, a convolutional neural network learns frame-level representation from the raw waveform via noise-contrastive estimation (NCE). A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE to learn segment representations. The differentiable boundary detector allows us to train frame-level and segment-level encoders jointly. Typically, phoneme and word segmentation are treated as separate tasks. We unify them and experimentally show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets. We analyze the impact of boundary threshold and when is the right time to include the segmental loss in the learning process. △ Less

Submitted 3 June, 2021; originally announced June 2021.

arXiv:2103.17122 [pdf, ps, other]

Adversarial Attacks and Defenses for Speech Recognition Systems

Authors: Piotr Żelasko, Sonal Joshi, Yiwen Shao, Jesus Villalba, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

Abstract: The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer enc… ▽ More The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer encoder-decoder model. We investigate two threat models: a denial-of-service scenario where fast gradient-sign method (FGSM) or weak projected gradient descent (PGD) attacks are used to degrade the model's word error rate (WER); and a targeted scenario where a more potent imperceptible attack forces the system to recognize a specific phrase. We find that the attack transferability across the investigated ASR systems is limited. To defend the model, we use two preprocessing defenses: randomized smoothing and WaveGAN-based vocoder, and find that they significantly improve the model's adversarial robustness. We show that a WaveGAN vocoder can be a useful countermeasure to adversarial attacks on ASR systems - even when it is jointly attacked with the ASR, the target phrases' word error rate is high. △ Less

Submitted 31 March, 2021; originally announced March 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2101.08909 [pdf, other]

Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems

Authors: Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

Abstract: Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose def… ▽ More Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose defenses that can protect the systems against these attacks. Addressing these concerns, this paper at first investigates how state-of-the-art x-vector based SR systems are affected by white-box adversarial attacks, i.e., when the adversary has full knowledge of the system. x-Vector based SR systems are evaluated against white-box adversarial attacks common in the literature like fast gradient sign method (FGSM), basic iterative method (BIM)--a.k.a. iterative-FGSM--, projected gradient descent (PGD), and Carlini-Wagner (CW) attack. To mitigate against these attacks, the paper proposes four pre-processing defenses. It evaluates them against powerful adaptive white-box adversarial attacks, i.e., when the adversary has full knowledge of the system, including the defense. The four pre-processing defenses--viz. randomized smoothing, DefenseGAN, variational autoencoder (VAE), and Parallel WaveGAN vocoder (PWG) are compared against the baseline defense of adversarial training. Conclusions indicate that SR systems were extremely vulnerable under BIM, PGD, and CW attacks. Among the proposed pre-processing defenses, PWG combined with randomized smoothing offers the most protection against the attacks, with accuracy averaging 93% compared to 52% in the undefended system and an absolute improvement >90% for BIM attacks with $L_\infty>0.001$ and CW attack. △ Less

Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2011.02090 [pdf, other]

Frustratingly Easy Noise-aware Training of Acoustic Models

Authors: Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

Abstract: Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-folds data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of a… ▽ More Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-folds data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of acoustic models in hybrid ASR. Our noise vectors are obtained by combining the means of speech frames and silence frames in the utterance, where the speech/silence labels may be obtained from a GMM-HMM model trained for ASR alignments, such that no extra computation is required beyond averaging of feature vectors. We show through experiments on AMI and Aurora-4 that this simple adaptation technique can result in 6-7% relative WER improvement. We implement several embedding-based adaptation baselines proposed in literature, and show that our method outperforms them on both the datasets. Finally, we extend our method to the online ASR setting by using frame-level maximum likelihood for the mean estimation. △ Less

Submitted 2 February, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: 6 + 3 (Appendix) pages

arXiv:2011.01210 [pdf, other]

Focus on the present: a regularization method for the ASR source-target attention layer

Authors: Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak

Abstract: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both, CTC and source-target attention, are acting on the same encoder representations. To understand the functionality of the attention, CTC is applie… ▽ More This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both, CTC and source-target attention, are acting on the same encoder representations. To understand the functionality of the attention, CTC is applied to compute the token posteriors given the attention outputs. We found that the source-target attention heads are able to predict several tokens ahead of the current one. Inspired by the observation, a new regularization method is proposed which leverages CTC to make source-target attention more focused on the frames corresponding to the output token being predicted by the decoder. Experiments reveal stable improvements up to 7\% and 13\% relatively with the proposed regularization on TED-LIUM 2 and LibriSpeech. △ Less

Submitted 2 November, 2020; originally announced November 2020.

Comments: submitted to ICASSP2021. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

arXiv:2010.14602 [pdf, ps, other]

CopyPaste: An Augmentation Method for Speech Emotion Recognition

Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictat… ▽ More Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictates a speaker's overall perceived emotion in a recording, concatenation of an emotional (emotion E) and a neutral utterance can still be labeled with emotion E. We hypothesize that SER performance can be improved using these concatenated utterances in model training. To verify this, three CopyPaste schemes are tested on two deep learning models: one trained independently and another using transfer learning from an x-vector model, a speaker recognition model. We observed that all three CopyPaste schemes improve SER performance on all the three datasets considered: MSP-Podcast, Crema-D, and IEMOCAP. Additionally, CopyPaste performs better than noise augmentation and, using them together improves the SER performance further. Our experiments on noisy test sets suggested that CopyPaste is effective even in noisy test conditions. △ Less

Submitted 11 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: Accepted at ICASSP2021

arXiv:2010.11860 [pdf, other]

Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

Authors: Saurabh Kataria, Jesús Villalba, Najim Dehak

Abstract: Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classifica… ▽ More Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classification, acoustic model, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (w/o PERL) using Conformer Transformer Networks on the popular enhancement benchmark called VCTK-DEMAND. Using auxiliary models one at a time, we find acoustic event and self-supervised model PASE+ to be most effective. Our best model (PERL-AE) only uses acoustic event model (utilizing AudioSet) to outperform state-of-the-art methods on major perceptual metrics. To explore if denoising can leverage full framework, we use all networks but find that our seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report a critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses. △ Less

Submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.11221 [pdf, other]

Learning Speaker Embedding from Text-to-Speech

Authors: Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

Abstract: Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesi… ▽ More Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information since the TTS decoder will obtain that information from the textual input. TTS reconstruction can also be combined with speaker classification to enhance these embeddings further. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts. The latter allows us to train embeddings on datasets without manual transcripts. We compared ASR transcripts and Kaldi phone alignments as TTS inputs, showing that the latter performed better due to their finer resolution. Unsupervised TTS embeddings improved EER by 2.06\% absolute with regard to i-vectors for the LibriTTS dataset. TTS with speaker classification loss improved EER by 0.28\% and 0.73\% absolutely from a model using only speaker classification loss in LibriTTS and Voxceleb1 respectively. △ Less

Submitted 21 October, 2020; originally announced October 2020.

arXiv:2007.13033 [pdf, other]

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

Abstract: Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phon… ▽ More Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phonetic properties of a frame more than other factors of variability. We achieve this via a self-expressing autoencoder framework. It consists of a single encoder and two decoders with shared weights. The encoder projects the input features into a latent representation. One of the decoders tries to reconstruct the input from these latent representations and the other from the self-expressed version of them. We use the obtained features to segment and cluster the speech data. We evaluate the performance of the proposed method in the Zero Resource 2020 challenge unit discovery task. The proposed system consistently outperforms the baseline, demonstrating the usefulness of the method in learning representations. △ Less

Submitted 25 July, 2020; originally announced July 2020.

arXiv:2005.08331 [pdf, ps, other]

Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild

Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Paola García-Perera, Jesús Villalba, Najim Dehak

Abstract: We investigated an enhancement and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation approach, we trained a Cycle Consistent Generative Ad… ▽ More We investigated an enhancement and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation approach, we trained a Cycle Consistent Generative Adversarial Network (CycleGAN), which maps features from far-field domain to the speaker embedding training domain. This was trained on unpaired data in an unsupervised manner. Both networks, termed Supervised Enhancement Network (SEN) and Domain Adaptation Network (DAN) respectively, were trained with multi-task objectives in (filter-bank) feature domain. On a simulated test setup, we first note the benefit of using feature mapping (FM) loss along with adversarial loss in SEN. Then, we tested both supervised and unsupervised approaches on several real noisy datasets. We observed relative improvements ranging from 2% to 31% in terms of DCF. Using three training schemes, we also establish the effectiveness of the novel DAN approach. △ Less

Submitted 17 May, 2020; originally announced May 2020.

Comments: submitted to INTERSPEECH 2020

arXiv:2002.05039 [pdf, ps, other]

x-vectors meet emotions: A study on dependencies between emotion and speaker recognition

Authors: Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim Dehak

Abstract: In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features… ▽ More In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models such as the x-vector model. Then, we improve emotion recognition performance by fine-tuning for emotion classification. We evaluated our experiments on three different types of datasets: IEMOCAP, MSP-Podcast, and Crema-D. By fine-tuning, we obtained 30.40%, 7.99%, and 8.61% absolute improvement on IEMOCAP, MSP-Podcast, and Crema-D respectively over baseline model with no pre-training. Finally, we present results on the effect of emotion on speaker verification. We observed that speaker verification performance is prone to changes in test speaker emotions. We found that trials with angry utterances performed worst in all three datasets. We hope our analysis will initiate a new line of research in the speaker recognition community. △ Less

Submitted 12 February, 2020; originally announced February 2020.

Comments: 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

arXiv:2002.00139 [pdf, other]

Analysis of Deep Feature Loss based Enhancement for Speaker Verification

Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Najim Dehak

Abstract: Data augmentation is conventionally used to inject robustness in Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called de… ▽ More Data augmentation is conventionally used to inject robustness in Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called deep feature loss, greatly improved over the state-of-the-art conventional x-vector based system on a children speech dataset called BabyTrain. This work analyzes various facets of that approach and asks few novel questions in that context. We first search for optimal number of auxiliary network activations, training data, and enhancement feature dimension. Experiments reveal the importance of Signal-to-Noise Ratio filtering that we employ to create a large, clean, and naturalistic corpus for enhancement network training. To counter the "mismatch" problem in enhancement, we find enhancing front-end (x-vector network) data helpful while harmful for the back-end (Probabilistic Linear Discriminant Analysis (PLDA)). Importantly, we find enhanced signals contain complementary information to original. Established by combining them in front-end, this gives ~40% relative improvement over the baseline. We also do an ablation study to remove a noise class from x-vector data augmentation and, for such systems, we establish the utility of enhancement regardless of whether it has seen that noise class itself during training. Finally, we design several dereverberation schemes to conclude ineffectiveness of deep feature loss enhancement scheme for this task. △ Less

Submitted 27 April, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

Comments: 8 pages; accepted in Odyssey2020 workshop

arXiv:1912.00938 [pdf]

Speaker detection in the wild: Lessons learned from JSALT 2019

Authors: Paola Garcia, Jesus Villalba, Herve Bredin, Jun Du, Diego Castan, Alejandrina Cristia, Latane Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Leo Galmant, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak

Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker dete… ▽ More This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection; but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering; and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarizarion to have a better understanding of the problem and we present the final results for speaker detection. △ Less

Submitted 2 December, 2019; originally announced December 2019.

Comments: Submitted to ICASSP 2020

arXiv:1911.04908 [pdf, other]

doi 10.1109/LSP.2020.3044547

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

Authors: Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

Abstract: Recently very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put it into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FM… ▽ More Recently very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put it into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both unmasked context and input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which starts from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show the possibility to train such a non-autoregressive network for ASR. Especially in Aishell, the proposed method outperformed the Kaldi ASR system and it matches the performance of the state-of-the-art autoregressive transformer with 7x speedup. Pretrained models and code will be made available after publication. △ Less

Submitted 6 April, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

arXiv:1910.11915 [pdf, ps, other]

Unsupervised Feature Enhancement for speaker verification

Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Paola García-Perera, Najim Dehak

Abstract: The task of making speaker verification systems robust to adverse scenarios remain a challenging and an active area of research. We developed an unsupervised feature enhancement approach in log-filter bank domain with the end goal of improving speaker verification performance. We experimented with using both real speech recorded in adverse environments and degraded speech obtained by simulation to… ▽ More The task of making speaker verification systems robust to adverse scenarios remain a challenging and an active area of research. We developed an unsupervised feature enhancement approach in log-filter bank domain with the end goal of improving speaker verification performance. We experimented with using both real speech recorded in adverse environments and degraded speech obtained by simulation to train the enhancement systems. The effectiveness of the approach was shown by testing on several real, simulated noisy, and reverberant test sets. The approach yielded significant improvements on both real and simulated sets when data augmentation was not used in speaker verification pipeline or augmentation was used only during x-vector training. When data augmentation was used for x-vector and PLDA training, our enhancement approach yielded slight improvements. △ Less

Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: 5 pages; accepted in ICASSP 2020

arXiv:1910.11909 [pdf, other]

Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-GANs

Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Najim Dehak

Abstract: Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Netorks (CycleGAN) has received a lot of attention. CycleGAN… ▽ More Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Netorks (CycleGAN) has received a lot of attention. CycleGAN learn mappings between features of two domains given non-parallel data. We investigate their effectiveness in low resource scenario i.e. when limited amount of target domain data is available for adaptation, a case unexplored in previous works. We experiment with two adaptation tasks: microphone to telephone and a novel reverberant to clean adaptation with the end goal of improving speaker recognition performance. Number of speakers present in source and target domains are 7000 and 191 respectively. By adding noise to the target domain during CycleGAN training, we were able to achieve better performance compared to the adaptation system whose CycleGAN was trained on a larger target data. On reverberant to clean adaptation task, our models improved EER by 18.3% relative on VOiCES dataset compared to a system trained on clean data. They also slightly improved over the state-of-the-art Weighted Prediction Error (WPE) de-reverberation algorithm. △ Less

Submitted 25 October, 2019; originally announced October 2019.

Comments: 8 pages, accepted to ASRU 2019

arXiv:1910.11905 [pdf, ps, other]

Feature Enhancement with Deep Feature Losses for Speaker Verification

Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Nanxin Chen, Paola García, Najim Dehak

Abstract: Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage on the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker emb… ▽ More Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage on the recent advancements made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker embedding network. We experimentally verify the approach on simulated and real data. A simulated testing setup is created using various noise types at different SNR levels. For evaluation on real data, we choose BabyTrain corpus which consists of children recordings in uncontrolled environments. We observe consistent gains in every condition over the state-of-the-art augmented Factorized-TDNN x-vector system. On BabyTrain corpus, we observe relative gains of 10.38% and 12.40% in minDCF and EER respectively. △ Less

Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

Comments: 5 pages, accepted in ICASSP 2020

arXiv:1910.10781 [pdf, ps, other]

Hierarchical Transformers for Long Document Classification

Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

Abstract: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We… ▽ More BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We segment the input into smaller chunks and feed each of them into the base model. Then, we propagate each output through a single recurrent layer, or another transformer, followed by a softmax activation. We obtain the final classification decision after the last segment has been consumed. We show that both BERT extensions are quick to fine-tune and converge after as little as 1 epoch of training on a small, domain-specific data set. We successfully apply them in three different tasks involving customer call satisfaction prediction and topic classification, and obtain a significant improvement over the baseline models in two of them. △ Less

Submitted 23 October, 2019; originally announced October 2019.

Comments: 4 figures, 7 pages

Journal ref: Automatic Speech Recognition and Understanding Workshop, 2019

arXiv:1904.01120 [pdf, other]

ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak

Abstract: We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT). Anti-spoofing has gathered more and more attention since the inauguration of the ASVspoof Challenges, and ASVspoof 2019 dedicates to address attacks from all three major types: text-to-speech, voice conversion, and replay. Built upon previous research work on Dee… ▽ More We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT). Anti-spoofing has gathered more and more attention since the inauguration of the ASVspoof Challenges, and ASVspoof 2019 dedicates to address attacks from all three major types: text-to-speech, voice conversion, and replay. Built upon previous research work on Deep Neural Network (DNN), ASSERT is a pipeline for DNN-based approach to anti-spoofing. ASSERT has four components: feature engineering, DNN models, network optimization and system combination, where the DNN models are variants of squeeze-excitation and residual networks. We conducted an ablation study of the effectiveness of each component on the ASVspoof 2019 corpus, and experimental results showed that ASSERT obtained more than 93% and 17% relative improvements over the baseline systems in the two sub-challenges in ASVspooof 2019, ranking ASSERT one of the top performing systems. Code and pretrained models will be made publicly available. △ Less

Submitted 1 April, 2019; originally announced April 2019.

Comments: Submitted to Interspeech 2019, Graz, Austria

arXiv:1811.02162 [pdf, other]

Language model integration based on memory control for sequence to sequence speech recognition

Authors: Jaejin Cho, Shinji Watanabe, Takaaki Hori, Murali Karthick Baskar, Hirofumi Inaguma, Jesus Villalba, Najim Dehak

Abstract: In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. Th… ▽ More In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and the memory cell state is updated by the LM unlike the prior studies. This means the memory retained by the main seq2seq would be adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and the use of memory cell and hidden states which directly affects the final label inference. We performed the experiments to show the effectiveness of the proposed methods in a mono-lingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resourced language. In Librispeech, our best model improved WER by 3.7%, 2.4% for test clean, test other relatively to the shallow fusion baseline, with multi-level decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on eval set by 9.9%, 9.8% in CER, WER relatively to the 2-stage transfer baseline. △ Less

Submitted 5 November, 2018; originally announced November 2018.

Comments: 4 pages, 1 figure, 5 tables, submitted to ICASSP 2019

arXiv:1101.5411 [pdf, ps, other]

Efficient Algorithms for Searching Optimal Shortened Cyclic Single-Burst-Correcting Codes

Authors: Luis Javier García Villalba, José René Fuentes Cortez, Ana Lucila Sandoval Orozco, Mario Blaum

Abstract: In a previous work it was shown that the best measure for the efficiency of a single burst-correcting code is obtained using the Gallager bound as opposed to the Reiger bound. In this paper, an efficient algorithm that searches for the best (shortened) cyclic burst-correcting codes is presented. Using this algorithm, extensive tables that either tie existing constructions or improve them are obtai… ▽ More In a previous work it was shown that the best measure for the efficiency of a single burst-correcting code is obtained using the Gallager bound as opposed to the Reiger bound. In this paper, an efficient algorithm that searches for the best (shortened) cyclic burst-correcting codes is presented. Using this algorithm, extensive tables that either tie existing constructions or improve them are obtained for burst lengths up to b=10. △ Less

Submitted 27 January, 2011; originally announced January 2011.

Showing 1–40 of 40 results for author: Villalba, J