Qiuming Zhao¹, Guangzhi Sun², Chao Zhang¹, Mingxing Xu¹, Thomas Fang Zheng¹*

SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

Abstract

Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters in MoE. Specifically, SAML is applied to quantised and personalised end-to-end automatic speech recognition (ASR) models, where it is combined with test-time speaker adaptation to improve the performance of heavily compressed models in speaker-specific scenarios. Experiments were performed on the LibriSpeech and the TED-LIUM 3 corpora. Remarkably, with a 7× reduction in model size, 29.1% and 31.1% relative word error rate (WER) reductions were achieved on the quantised Whisper model and the Conformer-based attention-based encoder-decoder ASR model respectively, compared to the original full-precision models.

keywords:
mixture-of-experts, LoRA, quantisation, speaker adaptation, end-to-end ASR

1 Introduction

Transformer or Conformer-based end-to-end neural network models have achieved state-of-the-art performance in Automatic Speech Recognition (ASR) tasks [1, 2, 3]. While these end-to-end ASR models are becoming more capable and generalisable via large-scale training, the model sizes increase significantly, motivating efficient training approaches to be explored when there is limited task-specific data [4, 5]. Speaker adaptation is a typical task of this kind, where a generic ASR system is to be adapted to perform better for a certain speaker by providing only a handful of annotated speech data from that speaker. Previous work has explored using the low-rank adaptation (LoRA) [6] in combination with model quantisation for speaker adaptation [5]. However, a single static set of adaptation parameters to handle speaker variability may yield a sub-optimal solution in speaker adaptation [7, 8, 9], necessitating the use of a dynamic network design for enhanced adaptation performance, such as Mixture-of-Experts approaches.

Mixture-of-Experts (MoE) Transformer-based models have received extensive research attention in fields such as natural language processing [10, 11, 12], speech processing [13, 14], and computer vision [15, 16]. Concretely, the MoE is a family of neural network architectures that enables conditional computation through multiple experts that are activated based on a gating network, referred to as the router. This mechanism effectively enhances model representation power and expands model capacity. Furthermore, sparse MoE [10] activates only one sub-network for each input data, improving training and inference efficiency. However, these advantages come at the cost of dramatic increases in model size, which not only increases operational costs on the server but also presents significant challenges in deploying them on resource-constrained edge devices.

To combine the advantages of both LoRA and MoE for speaker adaptation, this paper proposes the Speaker Adaptive Mixture of LoRA experts (SAML) approach that adopts LoRA modules as experts. As a LoRA-based approach, SAML significantly relieves the parameter burden of MoE. Specifically, SAML is integrated into the personalisation for a quantised model (PQM) framework [5]. In PQM, block-wise NormalFloat4 (NF4) quantisation [17] is adopted to achieve model compression, which incurs a smaller performance loss compared to conventional uniform quantisation. The SAML-based speaker adaptation is applied on top of the quantised models to compensate for the degradation due to quantisation. This is motivated by the fact that the edge devices on which quantised models are deployed are often personalised. For these devices, such as personalised voice assistants or smart door locks, improving performance for the target speaker is the critical objective, rather than performance on other speakers.

The SAML approach was implemented for the Conformer attention-based encoder-decoder (AED) model and the Whisper model as two examples of end-to-end ASR models in this paper. Experiments performed on the LibriSpeech and TED-LIUM 3 datasets demonstrated that, with nearly a 7× compression of the model in the PQM framework, using SAML achieves a relative WER reduction of 29.1% and 31.1% on quantised Whisper and Conformer AED models respectively, compared to the original full-precision models. The main contributions of this paper can be summarised as follows.

  • We propose SAML as the first LoRA-based MoE approach for speaker adaptation.

  • We integrate SAML into the PQM framework to further compensate for the degradation incurred in the quantisation.

  • SAML has been validated on both Conformer AED and Whisper models across two datasets, with superior performance over single LoRA-based adaptation methods.

Figure 1: Overview of the SAML integrated into PQM framework.

2 Related work

2.1 Mixture-of-Experts

MoE is an effective method to expand model capacity. Recently, some studies investigated the scaling properties [18, 19, 20] of MoE models. Routing algorithms have also been studied extensively. Classic routing algorithms include soft routing [21, 22] and top-k routing [10, 11], among others [23, 24]. Moreover, MoE models have been applied to multimodal [25, 26] and multitask [27, 28] learning, illustrating their adaptability across diverse domains. To ease the deployment of MoE models, several studies have applied techniques such as quantisation [29], pruning [30], and distillation [10] to reduce the size and memory footprint of MoE models. Our objectives align with theirs, but we primarily propose the SAML approach to achieve a lightweight MoE model.

2.2 Speaker adaptation

The objective of speaker adaptation is to minimize the mismatch between speakers in training and testing conditions. Embedding-based adaptation methods map speakers into a continuous space using techniques like i-vectors [31] or neural network bottlenecks [32]. Model-based adaptation methods [33, 34, 35] adjust the model structure and parameters to individual speakers. Recent LoRA-based adaptation methods [5, 36, 37] have been widely applied in many tasks. Compared to full fine-tuning, LoRA adjusts only the low-rank subspace parameters of the model, thereby achieving higher computational efficiency and lower costs for computation and storage.

3 Methodology

3.1 Preliminaries

3.1.1 Mixture-of-Experts

An MoE layer consists of a router network $G$ and a set of $n$ expert networks $E_1, \dots, E_n$. The output $h$ of the MoE layer can be expressed as follows:

$$h = E_{\text{mix}} = \sum_{i=1}^{n} G(x)_i E_i(x) \tag{1}$$

For soft routing, $G(\cdot)$ calculates scores for each expert based on the input $x$:

$$G(x) = \text{Softmax}(W_g \cdot x) \tag{2}$$

where $W_g$ is the weight matrix of the router $G$.
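As an illustration of Eqs. (1) and (2), the following is a minimal NumPy sketch of a soft-routed MoE layer. The dimensions, the random weights, and the use of plain linear maps as experts are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_moe(x, W_g, experts):
    """Eq. (1)-(2): h = sum_i G(x)_i * E_i(x), with G(x) = Softmax(W_g x)."""
    gates = softmax(W_g @ x)                       # shape (n,)
    outputs = np.stack([E(x) for E in experts])    # shape (n, d_out)
    return gates @ outputs, gates

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 3                           # illustrative sizes (assumed)
W_g = rng.standard_normal((n, d_in))               # router weight matrix
# Hypothetical experts: one linear map per expert
expert_weights = [rng.standard_normal((d_out, d_in)) for _ in range(n)]
experts = [lambda v, W=W: W @ v for W in expert_weights]

x = rng.standard_normal(d_in)
h, gates = soft_moe(x, W_g, experts)
print(h.shape, gates.sum())
```

With soft routing every expert contributes to the output, weighted by its gate value, and the gate values sum to one.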

3.1.2 Low-Rank Adaptation

LoRA [6] is a parameter-efficient fine-tuning method. For the pretrained model with weight matrix $W_0 \in \mathbb{R}^{d \times k}$, its forward pass yields:

$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x \tag{3}$$

where $\alpha$ is a scaling factor that adjusts the magnitude of the changes to the original $W_0$ made by the LoRA module, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.
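The forward pass in Eq. (3) can be sketched as follows. The sizes $d = k = 512$ and $r = 1$ mirror the Whisper configuration described later in the paper; the value of $\alpha$ and the random weights are assumptions for illustration only. Initialising $B$ to zero follows the usual LoRA initialisation from [6], so that $\Delta W = 0$ before training.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha):
    """Eq. (3): h = W0 x + (alpha / r) * B A x."""
    r = A.shape[0]
    # Computing B @ (A @ x) avoids forming the full d x k matrix B @ A
    return W0 @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d, k, r = 512, 512, 1
alpha = 1.0                                  # assumed value, not from the paper
W0 = rng.standard_normal((d, k))             # frozen pretrained weight
A = rng.standard_normal((r, k))              # trainable
B = np.zeros((d, r))                         # trainable, zero-initialised

x = rng.standard_normal(k)
h = lora_forward(x, W0, A, B, alpha)

full_params = d * k                          # parameters in W0
lora_params = r * (d + k)                    # parameters in A and B
print(full_params, lora_params)
```

For a 512 by 512 matrix at rank 1, the LoRA module holds 1024 trainable values against 262144 in the frozen weight matrix.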

3.2 Speaker Adaptive Mixture of LoRA Experts

MoE models dynamically select and weigh experts based on input data through a dynamic routing mechanism, significantly enhancing the model representation power and scaling up the model capacity with only a minor computation overhead. However, previous works [10, 11] adopted dense feed-forward networks as experts, leading to dramatic increases in model size. Consequently, we propose the SAML, which adopts parameter-efficient LoRA modules as experts, significantly reducing the parameter burden in MoE models. The specific details of the SAML architecture are shown in Figure 2.

The output $h$ of the SAML layer is:

$$h = W_0 x + \Delta W x = W_0 x + E_{\text{mix}} \tag{4}$$

where $E_{\text{mix}}$ is the mixture of LoRA experts by soft routing:

$$E_{\text{mix}} = \frac{\alpha}{r} \Big(\sum_{i=1}^{n} G(x)_i B_i\Big) \Big(\sum_{i=1}^{n} G(x)_i A_i\Big) x \tag{5}$$

Rather than multiplying each pair $B_i$ and $A_i$ and then summing the products, we first form the router-weighted sums of the $B_i$ and of the $A_i$ and then multiply the two sums. This is more efficient in terms of GPU memory because it avoids a separate matrix multiplication for each LoRA module. Additionally, we quantise both the router and the experts to further reduce the model size.
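A minimal NumPy sketch of the SAML layer in Eqs. (4) and (5) is given below, assuming the expert matrices are stacked along a leading axis. The function name `saml_forward`, the sizes, and the random weights are hypothetical and chosen only for illustration; quantisation of the router and experts is omitted.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def saml_forward(x, W0, W_g, A_stack, B_stack, alpha):
    """Eq. (4)-(5). A_stack: (n, r, k), B_stack: (n, d, r), W_g: (n, k)."""
    r = A_stack.shape[1]
    g = softmax(W_g @ x)                          # router scores, shape (n,)
    A_mix = np.tensordot(g, A_stack, axes=1)      # weighted sum of A_i, (r, k)
    B_mix = np.tensordot(g, B_stack, axes=1)      # weighted sum of B_i, (d, r)
    # One low-rank product after mixing, instead of n separate products
    return W0 @ x + (alpha / r) * (B_mix @ (A_mix @ x))

rng = np.random.default_rng(0)
d, k, r, n, alpha = 6, 5, 2, 3, 4.0               # illustrative sizes (assumed)
W0 = rng.standard_normal((d, k))
W_g = rng.standard_normal((n, k))
A_stack = rng.standard_normal((n, r, k))
B_stack = rng.standard_normal((n, d, r))
x = rng.standard_normal(k)

h = saml_forward(x, W0, W_g, A_stack, B_stack, alpha)
print(h.shape)
```

Because the gate values sum to one, the layer reduces to a single LoRA module when all experts are identical. In general, mixing before multiplying introduces cross terms $B_i A_j$, so the result differs from a weighted sum of the individual products $B_i A_i x$.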

3.3 SAML integrated into PQM framework

The SAML integrated into the PQM framework is illustrated in Figure 1, which is divided into three stages. In stage 1, we apply block-wise NF4 quantisation to the base model’s primary weight parameters. In stage 2, we pretrain the router and the LoRA experts’ parameters using data from a large number of speakers, providing a more robust starting point for subsequent speaker adaptation. In stage 3, we perform SAML-based speaker adaptation on speaker-specific data.

Block-wise NF4 quantisation is adopted in the PQM framework. Standard floating-point quantisation applies the same set of quantisation bins to all weight matrices without taking the dynamic range of the parameter values into account, which results in heavily unbalanced quantisation bins. NF4, in contrast, ensures that each bin holds an equal number of values by estimating the quantiles of the input matrices using the empirical cumulative normal distribution. This leverages the fact that the parameters of a weight matrix, in general, follow a normal distribution [17].

To reduce the influence of extreme values in weight matrices (i.e. outliers) on the maximum absolute value normalisation, block-wise quantisation is applied which divides the weight matrices into small blocks and quantises each block with separate normalisation factors. In this way, outliers in the input tensor are confined to individual blocks, reducing their overall impact on quantisation. As a result, block-wise quantisation allows for individual normalisation factors for each block, resulting in a more fine-grained overall quantisation.
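The effect of block-wise absolute-maximum normalisation can be illustrated with a small NumPy sketch. The 16-level codebook below is built from evenly spaced quantiles of a standard normal distribution and is only a stand-in for the idea; it is not the exact NF4 table from [17]. The block size of 64 and all function names are assumptions made for this example.

```python
import numpy as np
from statistics import NormalDist

def quantile_codebook(bits=4):
    """Hypothetical codebook: normal quantiles rescaled to [-1, 1] (not the NF4 table)."""
    n = 2 ** bits
    probs = (np.arange(n) + 0.5) / n
    q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return q / np.abs(q).max()

def quantise_blockwise(w, codebook, block_size=64):
    flat = w.reshape(-1)
    pad = (-flat.size) % block_size               # pad so blocks divide evenly
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # one absmax per block
    scales[scales == 0] = 1.0
    normed = blocks / scales
    # Index of the nearest codebook entry for every value
    idx = np.abs(normed[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, scales

def dequantise_blockwise(idx, scales, codebook, shape):
    flat = (codebook[idx] * scales).reshape(-1)
    return flat[: int(np.prod(shape))].reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w[0, 0] = 50.0                                    # a single outlier
cb = quantile_codebook()

idx_b, s_b = quantise_blockwise(w, cb, block_size=64)
w_block = dequantise_blockwise(idx_b, s_b, cb, w.shape)
idx_t, s_t = quantise_blockwise(w, cb, block_size=w.size)   # one scale for the whole tensor
w_tensor = dequantise_blockwise(idx_t, s_t, cb, w.shape)

err_block = np.abs(w - w_block).mean()
err_tensor = np.abs(w - w_tensor).mean()
print(err_block, err_tensor)
```

In this toy setting the outlier inflates the scale of only one block, so the block-wise reconstruction error is lower than with a single tensor-wide normalisation factor.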

Although the target speaker data is always limited, in reality, the target domain data of other speakers is usually available. Therefore, PQM leverages those data to find a better initialisation point for SAML weights before performing speaker adaptation, referred to as SAML pretraining. In speaker adaptation, the base model is frozen, and only the router and the LoRA experts’ parameters corresponding to each speaker are updated.

Figure 2: The SAML architecture. Each attention layer is replaced with the SAML layer, and a LoRA module is added to each feed-forward layer.

4 Experimental setup

4.1 Data

LibriSpeech is an English audiobook dataset. We selected 5 male speakers and 5 female speakers with the largest number of utterances from train-clean-360 as speaker adaptation data. Each speaker contributes approximately 150 utterances, resulting in a total speech duration of roughly 25 minutes. For SAML pretraining, the train-clean-100 set was used which does not have any speaker overlap with the selected speakers.

TED-LIUM 3 (TL3) is a TED talks dataset. We selected 16 speakers from the test set as speaker adaptation data. On average, each speaker has 161 utterances (14 minutes).

Speaker adaptation data for LibriSpeech and TL3 was divided randomly, with 2/5 assigned to the train set, 1/5 to the dev set, and 2/5 to the test set. On average, each speaker has 6-10 minutes of training data, while the dev and test data remain constant across all experiments. We denote the partitioned test sets as LibriSpeech-SA and TL3-SA respectively in the results. (Footnote 1: Code and data partition: https://github.com/qmgzhao/SAML.git)
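A random 2/5, 1/5, 2/5 split of one speaker's utterances could be sketched as below. This is only an illustration of the ratio; the actual partition used in the paper is the one released in the repository above, and the function name, seed, and utterance IDs here are hypothetical.

```python
import random

def split_speaker_utterances(utt_ids, seed=0):
    """Randomly split utterance IDs into train/dev/test with a 2:1:2 ratio."""
    ids = list(utt_ids)
    random.Random(seed).shuffle(ids)              # seeded for reproducibility
    n = len(ids)
    n_train = (2 * n) // 5
    n_dev = n // 5
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]                  # remainder goes to test
    return train, dev, test

# Hypothetical utterance IDs for one speaker with 150 utterances
utts = [f"utt_{i:03d}" for i in range(150)]
train, dev, test = split_speaker_utterances(utts)
print(len(train), len(dev), len(test))
```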

4.2 Model and training specifications

To verify the effectiveness of SAML, we use two widely used models as examples: Whisper and Conformer AED.

Whisper is a Transformer-based AED model released by OpenAI trained on 680k hours of audio. The base.en model with a full model size of 278MB was used. The encoder has 6 Transformer blocks with 2048 hidden dimensions, and the output size is 512. The decoder has 6 Transformer blocks with 2048 hidden dimensions. The Transformer-related weight matrices are all 512 by 512 dimensional. Feature processing and model training followed [1, 4, 5].

Conformer AED is a hybrid CTC/attention-based encoder-decoder model, whose FP32 model size is about 131MB. The training follows ESPnet [38] with 0.3 CTC weight and 80-dim FBank features. The Conformer encoder has 12 blocks with 1024 hidden dimensions. The decoder uses a 6-block Transformer architecture with 2048-dim linear units. The Transformer-related weight matrices are all 256 by 256.

During the pretraining and adaptation stages, we conduct joint interleaved training of the router and experts. The default number of experts for all layers is uniformly set to 10, with the LoRA rank set to 1 for Whisper and 4 for Conformer. Furthermore, each group of experts is initialised with the LoRA parameters pretrained on the data of a single speaker from the train-clean-100 set. Models are evaluated using WER averaged across all utterances from the test set speakers.

5 Evaluation results and analysis

First, we apply block-wise NF4 quantisation to the primary weight parameters of the model, including linear, convolution, and embedding layers, resulting in a 7× reduction in the size of both Whisper and Conformer models. For detailed WER and model compression ratios of systems after quantising different parts, please refer to [5]. On LibriSpeech-SA, NF4 quantisation increased the WER by 1.20% absolute for Whisper and by only 0.34% absolute for Conformer. This suggests that models trained on smaller datasets are more robust to the quantisation noise introduced by NF4 quantisation.

Table 1: WER on the LibriSpeech-SA and TL3-SA using quantised Whisper models. Parameter size lists the size (MB) of total parameters and trainable parameters (in parentheses). FFT refers to full fine-tuning which trains all model parameters. LoRA refers to a single LoRA and SAML refers to the mixture of LoRA experts.

| System | Param. Size (MB) | WER (%) LibriSpeech-SA | WER (%) TL3-SA |
|---|---|---|---|
| Whisper-FP32 | 277.8 | 10.02 | 5.93 |
| Whisper-NF4 | 38.3 | 11.22 | 7.71 |
| Whisper-FFT-FP32 | 277.8 (277.8) | 9.05 | 6.43 |
| Whisper-FFT-NF4 | 38.3 (38.3) | 10.59 | 7.20 |
| Whisper-LoRA-FP32 | 38.6 (0.3) | 8.48 | 6.87 |
| Whisper-LoRA-NF4 | 38.4 (0.1) | 8.51 | 6.72 |
| Whisper-SAML-pretrain-FP32 | 44.6 (6.3) | 7.90 | 5.49 |
| Whisper-SAML-pretrain-NF4 | 39.5 (1.2) | 7.99 | 5.30 |
| Whisper-SAML-adaptation-FP32 | 44.6 (6.3) | 6.94 | 4.95 |
| Whisper-SAML-adaptation-NF4 | 39.5 (1.2) | 7.10 | 4.72 |

Table 1 shows the performance of the SAML approach on the Whisper base.en model. The WER reduction achieved by fine-tuning all model parameters at full precision on the pretraining data and target speaker data was largely lost after model quantisation. As a result, compared to Whisper-NF4, the Whisper-FFT-NF4 model only achieved around 5.6% relative WER reduction on LibriSpeech-SA and 6.6% relative WER reduction on TL3-SA. Because LoRA updates only a small number of low-rank subspace parameters, it is more robust to quantisation, and there is almost no performance degradation after quantisation. LoRA at NF4 precision achieved 24.2% and 12.8% relative WER reductions respectively. When SAML was applied to the model, the performance with pretraining alone already surpassed that of both FFT and LoRA. Moreover, with speaker adaptation, the improvements at NF4 precision were further enlarged, resulting in 36.7% and 38.8% relative WER reductions on the LibriSpeech-SA and TL3-SA sets respectively. Note that the SAML pretraining for TL3-SA was cross-dataset, as the pretraining was done on the LibriSpeech train-clean-100 set and the pretrained model was then directly applied to the TL3-SA data for speaker adaptation. This underscores the effectiveness of the pretrained SAML approach.
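The relative WER reductions quoted above follow from the Table 1 values as (baseline − system) / baseline. A short Python check, using only numbers from Table 1 (the helper name `rel_wer_reduction` is ours):

```python
def rel_wer_reduction(baseline, system):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - system) / baseline

# WER (%) values copied from Table 1: (LibriSpeech-SA, TL3-SA)
whisper_fp32 = (10.02, 5.93)
whisper_nf4 = (11.22, 7.71)
fft_nf4 = (10.59, 7.20)
lora_nf4 = (8.51, 6.72)
saml_adapt_nf4 = (7.10, 4.72)

for name, system in [("FFT-NF4", fft_nf4), ("LoRA-NF4", lora_nf4),
                     ("SAML-adaptation-NF4", saml_adapt_nf4)]:
    libri, tl3 = (rel_wer_reduction(b, s) for b, s in zip(whisper_nf4, system))
    print(f"{name} vs Whisper-NF4: {libri:.1f}% / {tl3:.1f}%")

# Against the full-precision baseline on LibriSpeech-SA
vs_fp32 = rel_wer_reduction(whisper_fp32[0], saml_adapt_nf4[0])
print(f"SAML-adaptation-NF4 vs Whisper-FP32: {vs_fp32:.1f}%")   # 29.1%
```

The last line reproduces the 29.1% figure reported against the original full-precision Whisper model.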

Table 2: WER on the LibriSpeech-SA using quantised Conformer models. FFT, LoRA and SAML follow the same definition as Table 1.

| System | Param. Size (MB) | WER (%) |
|---|---|---|
| Conformer-FP32 | 130.9 | 12.43 |
| Conformer-NF4 | 19.1 | 12.77 |
| Conformer-FFT-FP32 | 130.9 (130.9) | 8.46 |
| Conformer-FFT-NF4 | 19.1 (19.1) | 10.52 |
| Conformer-LoRA-FP32 | 19.9 (0.8) | 9.53 |
| Conformer-LoRA-NF4 | 19.3 (0.2) | 9.54 |
| Conformer-SAML-pretrain-FP32 | 28.0 (8.9) | 9.99 |
| Conformer-SAML-pretrain-NF4 | 20.8 (1.7) | 9.98 |
| Conformer-SAML-adaptation-FP32 | 28.0 (8.9) | 8.55 |
| Conformer-SAML-adaptation-NF4 | 20.8 (1.7) | 8.56 |

The same set of experiments was also performed for the Conformer model as shown in Table 2. Note that as the Conformer AED is trained on train-clean-100 already, we selected 250 speakers from LibriSpeech train-clean-360 for SAML pretraining. As before, the Conformer-SAML-pretrain-NF4 achieved a 21.8% relative WER reduction compared to the Conformer-NF4. The Conformer-SAML-adaptation-NF4 model achieved a further WER reduction, resulting in a relative 33.0% WER reduction.

Table 3: WER on the LibriSpeech-SA using Whisper-SAML-pretrain-FP32 with different numbers of experts.

| System | WER (%) |
|---|---|
| Whisper-SAML-5experts | 8.02 |
| Whisper-SAML-10experts | 7.90 |
| Whisper-SAML-15experts | 7.85 |
| Whisper-SAML-20experts | 7.82 |

Next, Table 3 shows the performance of the Whisper-SAML-pretrain-FP32 model with different numbers of experts. The results demonstrate that performance consistently improves with an increasing number of experts, though at a diminishing rate for further increases.

Table 4: WER on the LibriSpeech-SA using Whisper-SAML-pretrain-FP32 with MoE pruning.

| System | WER (%) |
|---|---|
| complete | 7.90 |
| pruning (delete non-collapsed experts & router) | 7.90 |
| keep top1 expert & router | 8.27 |
| keep top1 expert | 15.83 |

In our experiments, we observed that some SAML layers exhibit load imbalance and model collapse [39], relying heavily on, or even collapsing into, a single expert. We suggest that the collapsed layers might not require the complexity of the MoE architecture, since a single expert seems capable of handling their tasks. Therefore, we prune the collapsed layers by deleting all non-collapsed experts and the router, so that each collapsed layer degenerates into a single LoRA layer. Table 4 shows the results of MoE pruning. The second row indicates that this pruning is lossless. Moreover, for layers with load imbalance, we also attempt to keep only the top1 expert and the router. The third row shows that keeping the top1 expert and the router results in only a slight decrease in performance. The fourth row shows that further deleting the router leads to a sharp decline in performance. This indicates that the dynamic scaling applied by the router contributes significantly to performance, even when it acts on a single LoRA expert.
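The pruning of collapsed layers can be sketched as follows. The paper does not state how collapse is detected, so the criterion here (average router weight of the top expert above a threshold of 0.99) and the function name `prune_if_collapsed` are assumptions for illustration only.

```python
import numpy as np

def prune_if_collapsed(gates, A_stack, B_stack, threshold=0.99):
    """gates: (T, n) router outputs over T inputs; A_stack: (n, r, k); B_stack: (n, d, r).

    If one expert receives (almost) all routing weight, keep only that expert
    and drop the router, leaving a single LoRA module. The threshold is an
    assumed value, not taken from the paper.
    """
    usage = gates.mean(axis=0)                    # average weight per expert
    top = int(usage.argmax())
    if usage[top] >= threshold:
        return {"collapsed": True, "expert": top,
                "A": A_stack[top], "B": B_stack[top]}
    return {"collapsed": False, "expert": None, "A": A_stack, "B": B_stack}

rng = np.random.default_rng(0)
n, r, k, d, T = 3, 2, 5, 6, 10                    # illustrative sizes (assumed)
A_stack = rng.standard_normal((n, r, k))
B_stack = rng.standard_normal((n, d, r))

collapsed_gates = np.tile(np.array([0.0, 1.0, 0.0]), (T, 1))   # all weight on expert 1
balanced_gates = np.full((T, n), 1.0 / n)

pruned = prune_if_collapsed(collapsed_gates, A_stack, B_stack)
kept = prune_if_collapsed(balanced_gates, A_stack, B_stack)
print(pruned["collapsed"], pruned["A"].shape, kept["collapsed"])
```

When the router output is exactly one-hot, the mixed matrices in Eq. (5) equal those of the selected expert, which is consistent with the pruning being lossless in Table 4.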

Figure 3: t-SNE visualisation of Whisper-SAML and Whisper-LoRA encoder outputs, with different colours for each speaker. Panels: (a) SAML; (b) Single LoRA.

Finally, the t-SNE visualisation results of speaker adaptation are displayed in Figure 3. As shown, after speaker adaptation, both SAML and single LoRA have effectively captured speaker features. Furthermore, for each speaker cluster, SAML achieves clearer separation, indicating that the experts in SAML provided better representation that enhanced speaker adaptive capabilities compared to the single LoRA.

6 Conclusions

This paper proposes the SAML approach and integrates it into the PQM framework. SAML, which uses LoRA modules as experts, is applied to both the Conformer-based AED and the Whisper ASR models. Experiments on LibriSpeech and TL3 datasets showed that SAML can largely reduce the WERs of the quantised models. Compared to the original full precision models, using SAML, 29.1% and 31.1% relative WER reductions were achieved on quantised Whisper and Conformer-based AED models respectively.

References

  • [1] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, Hawaii, 2023.
  • [2] A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, Shanghai, 2020.
  • [3] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, Vancouver, 2020.
  • [4] G. Sun, X. Zheng, C. Zhang, and P. C. Woodland, “Can contextual biasing remain effective with Whisper and GPT-2?” in Proc. Interspeech, Dublin, 2023.
  • [5] Q. Zhao, G. Sun, C. Zhang, M. Xu, and T. F. Zheng, “Enhancing quantised end-to-end ASR models via personalisation,” arXiv preprint arXiv:2309.09136, 2023.
  • [6] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, Virtual Event, 2022.
  • [7] T. Tan, Y. Qian, M. Yin, Y. Zhuang, and K. Yu, “Cluster adaptive training for deep neural network,” in Proc. ICASSP, Brisbane, 2015.
  • [8] K. Yu and M. J. Gales, “Discriminative cluster adaptive training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1694–1703, 2006.
  • [9] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 8, no. 6, pp. 695–707, 2000.
  • [10] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, 2022.
  • [11] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proc. ICLR, Toulon, 2017.
  • [12] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in Proc. ICML, Baltimore, 2022.
  • [13] Z. You, S. Feng, D. Su, and D. Yu, “Speechmoe: Scaling to large acoustic models with dynamic routing mixture of experts,” in Proc. Interspeech, Brno, 2021.
  • [14] M. Perez, Z. Aldeneh, and E. M. Provost, “Aphasic speech recognition using a mixture of speech intelligibility experts,” in Proc. Interspeech, Shanghai, 2020.
  • [15] C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Susano Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,” in Proc. NeurIPS, Virtual Event, 2021.
  • [16] T. Chen, X. Chen, X. Du, A. Rashwan, F. Yang, H. Chen, Z. Wang, and Y. Li, “Adamv-moe: Adaptive multi-task vision mixture-of-experts,” in Proc. ICCV, Paris, 2023.
  • [17] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Proc. NeurIPS, New Orleans, 2023.
  • [18] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus, “St-moe: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022.
  • [19] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, “Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale,” in Proc. ICML, Baltimore, 2022.
  • [20] A. Clark, D. De Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud et al., “Unified scaling laws for routed language models,” in Proc. ICML, Baltimore, 2022.
  • [21] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
  • [22] J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,” arXiv preprint arXiv:2308.00951, 2023.
  • [23] S. Roller, S. Sukhbaatar, J. Weston et al., “Hash layers for large sparse models,” in Proc. NeurIPS, Virtual Event, 2021.
  • [24] M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, “Base layers: Simplifying training of large, sparse models,” in Proc. ICML, Virtual Event, 2021.
  • [25] B. Mustafa, C. Riquelme, J. Puigcerver, R. Jenatton, and N. Houlsby, “Multimodal contrastive learning with limoe: the language-image mixture of experts,” in Proc. NeurIPS, New Orleans, 2022.
  • [26] H. Akbari, D. Kondratyuk, Y. Cui, R. Hornung, H. Wang, and H. Adam, “Alternating gradient descent and mixture-of-experts for integrated multimodal perception,” in Proc. NeurIPS, New Orleans, 2023.
  • [27] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling task relationships in multi-task learning with multi-gate mixture-of-experts,” in Proc. KDD, London, 2018.
  • [28] Z. Chen, Y. Shen, M. Ding, Z. Chen, H. Zhao, E. G. Learned-Miller, and C. Gan, “Mod-squad: Designing mixtures of experts as modular multi-task learners,” in Proc. CVPR, Vancouver, 2023.
  • [29] E. Frantar and D. Alistarh, “Qmoe: Practical sub-1-bit compression of trillion-parameter models,” arXiv preprint arXiv:2310.16795, 2023.
  • [30] Y. J. Kim, A. A. Awan, A. Muzio, A. F. C. Salinas, L. Lu, A. Hendy, S. Rajbhandari, Y. He, and H. H. Awadalla, “Scalable and efficient moe training for multitask multilingual models,” arXiv preprint arXiv:2109.10465, 2021.
  • [31] L. Sarı, N. Moritz, T. Hori, and J. Le Roux, “Unsupervised speaker adaptation using attention-based speaker memory for end-to-end ASR,” in Proc. ICASSP, Barcelona, 2020.
  • [32] Z. Yue, H. Christensen, and J. Barker, “Autoencoder bottleneck features with multi-task optimisation for improved continuous dysarthric speech recognition,” in Proc. Interspeech, Shanghai, 2020.
  • [33] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
  • [34] C. Zhang and P. C. Woodland, “DNN speaker adaptation using parameterised sigmoid and ReLU hidden activation functions,” in Proc. ICASSP, Shanghai, 2016.
  • [35] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. ICASSP, Vancouver, 2013.
  • [36] C.-P. Hsieh, S. Ghosh, and B. Ginsburg, “Adapter-based extension of multi-speaker text-to-speech model for new speakers,” arXiv preprint arXiv:2211.00585, 2022.
  • [37] C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E.-S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” in Proc. NeurIPS, New Orleans, 2023.
  • [38] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018.
  • [39] Z. Chen, Y. Deng, Y. Wu, Q. Gu, and Y. Li, “Towards understanding the mixture-of-experts layer in deep learning,” in Proc. NeurIPS, New Orleans, 2022.