Showing 1–16 of 16 results for author: Denisov, P

  1. arXiv:2404.10922  [pdf, other]

    cs.CL cs.SD eess.AS

    Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness th…

    Submitted 16 April, 2024; originally announced April 2024.

    Comments: NAACL Findings 2024

  2. arXiv:2310.17499  [pdf, other]

    cs.CL cs.LG eess.AS

    The IMS Toucan System for the Blizzard Challenge 2023

    Authors: Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

    Abstract: For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes to spectrograms as intermediate representations using a fast and efficient non-autoregressive synt…

    Submitted 26 October, 2023; originally announced October 2023.

    Comments: Published at the Blizzard Challenge Workshop 2023, colocated with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023

  3. arXiv:2310.06103  [pdf, other]

    cs.CL cs.SD eess.AS

    Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models, however their evaluation often lacks multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2023

  4. arXiv:2309.15800  [pdf, other]

    cs.CL cs.SD eess.AS

    Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

    Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang

    Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning repre…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Submitted to IEEE ICASSP 2024

  5. arXiv:2210.07002  [pdf, other]

    cs.SD cs.CL eess.AS

    Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy

    Authors: Sarina Meyer, Pascal Tilli, Pavel Denisov, Florian Lux, Julia Koch, Ngoc Thang Vu

    Abstract: In order to protect the privacy of speech data, speaker anonymization aims for hiding the identity of a speaker by changing the voice in speech recordings. This typically comes with a privacy-utility trade-off between protection of individuals and usability of the data for downstream applications. One of the challenges in this context is to create non-existent voices that sound as natural as possi…

    Submitted 20 October, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: IEEE Spoken Language Technology Workshop 2022

  6. arXiv:2207.04834  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Speaker Anonymization with Phonetic Intermediate Representations

    Authors: Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu

    Abstract: In this work, we propose a speaker anonymization pipeline that leverages high quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near complete elimination of speaker identity information from the input while preserving the original phonetic co…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  7. arXiv:2111.14706  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SLU: Advancing Spoken Language Understanding through ESPnet

    Authors: Siddhant Arora, Siddharth Dalmia, Pavel Denisov, Xuankai Chang, Yushi Ueda, Yifan Peng, Yuekai Zhang, Sujay Kumar, Karthik Ganesan, Brian Yan, Ngoc Thang Vu, Alan W Black, Shinji Watanabe

    Abstract: As Automatic Speech Processing (ASR) systems are getting better, there is an increasing interest of using the ASR output to do downstream Natural Language Processing (NLP) tasks. However, there are few open source toolkits that can be used to generate reproducible results on different Spoken Language Understanding (SLU) benchmarks. Hence, there is a need to build an open source standard that can b…

    Submitted 3 March, 2022; v1 submitted 29 November, 2021; originally announced November 2021.

    Comments: Accepted at ICASSP 2022 (5 pages)

  8. arXiv:2108.12881  [pdf, other]

    cs.CL

    Investigations on Speech Recognition Systems for Low-Resource Dialectal Arabic-English Code-Switching Speech

    Authors: Injy Hamed, Pavel Denisov, Chia-Yu Li, Mohamed Elmahdy, Slim Abdennadher, Ngoc Thang Vu

    Abstract: Code-switching (CS), defined as the mixing of languages in conversations, has become a worldwide phenomenon. The prevalence of CS has been recently met with a growing demand and interest to build CS ASR systems. In this paper, we present our work on code-switched Egyptian Arabic-English automatic speech recognition (ASR). We first contribute in filling the huge gap in resources by collecting, anal…

    Submitted 29 August, 2021; originally announced August 2021.

    Comments: To be published in Computer Speech and Language Journal

  9. arXiv:2106.16055  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task

    Authors: Pavel Denisov, Manuel Mager, Ngoc Thang Vu

    Abstract: This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by IMS team. We utilize state-of-the-art models combined with several data augmentation, multi-task and transfer learning approaches for the automatic speech recognition (ASR) and machine translation (MT) steps of our cascaded system. Moreover, we also explore the feasibility of a full end-to-end spee…

    Submitted 30 June, 2021; originally announced June 2021.

    Comments: IWSLT 2021

  10. arXiv:2011.02014  [pdf, other]

    eess.AS cs.SD

    Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

    Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

    Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  11. arXiv:2007.01836  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. These components are optimized independently to allow usage of available data, but the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings to process acoustic feat…

    Submitted 11 August, 2020; v1 submitted 3 July, 2020; originally announced July 2020.

    Comments: Interspeech 2020

  12. arXiv:2005.01777  [pdf, other]

    cs.CL cs.AI

    ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents

    Authors: Chia-Yu Li, Daniel Ortega, Dirk Väth, Florian Lux, Lindsey Vanderlyn, Maximilian Schmidt, Michael Neumann, Moritz Völkel, Pavel Denisov, Sabrina Jenne, Zorica Kacarevic, Ngoc Thang Vu

    Abstract: We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically exper…

    Submitted 4 May, 2020; originally announced May 2020.

    Comments: All authors contributed equally. Accepted to be presented at ACL - System demonstrations - 2020

  13. arXiv:1908.04743  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    IMS-Speech: A Speech to Text Tool

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: We present the IMS-Speech, a web based tool for German and English speech transcription aiming to facilitate research in various disciplines which require accesses to lexical information in spoken language materials. This tool is based on modern open source software stack, advanced speech recognition methods and public data resources and is freely available for academic researchers. The utilized m…

    Submitted 13 August, 2019; originally announced August 2019.

    Comments: ESSV 2019

  14. arXiv:1908.04737  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

    Authors: Pavel Denisov, Ngoc Thang Vu

    Abstract: This paper presents our latest investigation on end-to-end automatic speech recognition (ASR) for overlapped speech. We propose to train an end-to-end system conditioned on speaker embeddings and further improved by transfer learning from clean speech. This proposed framework does not require any parallel non-overlapped speech materials and is independent of the number of speakers. Our experimenta…

    Submitted 13 August, 2019; originally announced August 2019.

    Comments: Interspeech 2019

  15. arXiv:1902.11060  [pdf, other]

    cs.CL

    Context-aware Neural-based Dialog Act Classification on Automatically Generated Transcriptions

    Authors: Daniel Ortega, Chia-Yu Li, Gisela Vallejo, Pavel Denisov, Ngoc Thang Vu

    Abstract: This paper presents our latest investigations on dialog act (DA) classification on automatically generated transcriptions. We propose a novel approach that combines convolutional neural networks (CNNs) and conditional random fields (CRFs) for context modeling in DA classification. We explore the impact of transcriptions generated from different automatic speech recognition systems such as hybrid T…

    Submitted 28 February, 2019; originally announced February 2019.

    Comments: 5 pages, 1 figure, ICASSP 2019, dialog act classification, automatic speech recognition

  16. arXiv:1807.11284  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Unsupervised Domain Adaptation by Adversarial Learning for Robust Speech Recognition

    Authors: Pavel Denisov, Ngoc Thang Vu, Marc Ferras Font

    Abstract: In this paper, we investigate the use of adversarial learning for unsupervised adaptation to unseen recording conditions, more specifically, single microphone far-field speech. We adapt neural networks based acoustic models trained with close-talk clean speech to the new recording conditions using untranscribed adaptation data. Our experimental results on Italian SPEECON data set show that our pro…

    Submitted 30 July, 2018; originally announced July 2018.

    Comments: 5 pages, 2 figures, the 13th ITG conference on Speech Communication