Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2211.00482 (eess)

[Submitted on 1 Nov 2022]

Title: Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

Authors: Zili Huang, Desh Raj, Paola García, Sanjeev Khudanpur


Abstract: Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often perform worse in multi-talker scenarios, possibly due to domain mismatch, which severely limits their use in such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative word error rate (WER) improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models on real conversational mixtures through experiments on the AMI dataset.
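
For readers unfamiliar with the metric, a relative WER improvement is conventionally the reduction in WER divided by the baseline WER. The following is a minimal sketch of that arithmetic, not code from the paper; the function name `relative_wer_improvement` and the example values are hypothetical, and the absolute WERs behind the 9.1% and 42.1% figures are not stated in the abstract.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction of new_wer over baseline_wer, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (assumed, not taken from the paper):
# a baseline WER of 20.0% reduced to 15.0%.
improvement = relative_wer_improvement(baseline_wer=20.0, new_wer=15.0)
print(f"{improvement:.1f}% relative improvement")  # prints "25.0% relative improvement"
```

Note that a relative improvement is not an absolute one: in the illustrative case, the 25.0% relative gain corresponds to an absolute WER reduction of 5.0 percentage points.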

Comments:	submitted to ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2211.00482 [eess.AS]
	(or arXiv:2211.00482v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2211.00482

Submission history

From: Zili Huang
[v1] Tue, 1 Nov 2022 14:16:16 UTC (1,322 KB)
