Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

Zhenyi Lu1,2 Chenghao Fan1,2 Wei Wei1,2 Xiaoye Qu1 Dangyang Chen3 Yu Cheng4
1 School of Computer Science & Technology, Huazhong University of Science and Technology,
2 Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL),
3 Ping An Property & Casualty Insurance Company of China, Ltd.,
4 The Chinese University of Hong Kong.
{luzhenyi529,facicofan}@gmail.com, {weiw, xiaoye}@hust.edu.cn,
chendangyang273@pingan.com.cn, chengyu@cse.cuhk.edu.hk
 Equal contribution. Corresponding authors.
Abstract

In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on 12 datasets for both discriminative and generative tasks demonstrate the effectiveness of our method, showing an average improvement of 28.34% in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks. Our implementation is available at https://github.com/LZY-the-boys/Twin-Merging.

1 Introduction

In recent years, Large Language Models (LLMs) have demonstrated notable success across various Natural Language Processing (NLP) tasks [9, 49, 52], including code generation [17, 44], solving math problems [35, 2], multilingualism [38], etc. These models, with billions of parameters, excel in various downstream tasks [27, 19, 56] but require extensive training on large datasets using thousands of GPUs. The considerable computational and energy costs [43] limit their specialization and deployment in resource-constrained environments [30].

Figure 1: Subfigure (I) shows that in conventional merging methods, parameters from different task-specific models and a pre-trained model are weighted-summed into a single multitask model for inference. Subfigure (II) illustrates that our Twin-Merging method first isolates shared knowledge, then extracts exclusive knowledge by identifying differences between task experts and the shared model. This exclusive knowledge is then compressed into sparse vectors. Subfigure (III) shows that during testing, Twin-Merging dynamically merges shared and compressed specialized knowledge based on test inputs to form the final inference model.

To tackle this challenge, model fusion has emerged as a promising solution [29]. One notable paradigm is model merging [22, 26, 59, 60], where multiple task-specific models, or “experts”, are combined into a single unified model. This unified model can quickly adapt to new tasks without the need to retrain a large model. Various techniques, such as parameter averaging [5, 58], weight interpolation [37, 26], and advanced strategies like task arithmetic [22, 41, 60, 51], have been developed for model merging. These techniques have been proven effective, enabling the integration of fine-tuned knowledge from diverse tasks into a multi-task model without additional training.

However, merging models from different domains often sacrifices specific task performance, leading to a large performance gap compared to the individual experts [24, 59]. Two major causes prevent the existing merging methods from reaching the theoretical upper-bound performance of individual experts: (1) Interference between models. Previous research shows that parameter redundancy and sign discrepancies [59], as well as the distribution gap between tasks [24], hinder effective model merging. We demonstrate that task-specific models often contain mixed knowledge, where the expertise in one model may be exclusive or detrimental to others. This redundancy or interference can obstruct the integration of expertise across models [7]. (2) Heterogeneity of data at test time. Previous methods pursue a single, static optimal solution for various tasks. While a one-size-fits-all model avoids introducing new parameters, it might be inadequate or suboptimal due to the unpredictable nature of test inputs [60]. This limits the utilization of complementary knowledge and leads to deteriorated performance [55].

To address the above issues, in this paper, we introduce Twin-Merging, involving two principal stages: (1) Knowledge Modularization: Unlike previous research that mitigates merging interference in a parameter-wise manner or searches for merging coefficients, we decompose the knowledge possessed by experts into shared knowledge and exclusive task-specific knowledge, as shown in Figure 1 (II). First, we compress common knowledge into a shared expert, which captures and consolidates common knowledge across varying tasks. Then we isolate exclusive knowledge based on the difference between the task experts and the shared expert, allowing diverse knowledge to be decomposed more finely. (2) Dynamic Merging: Inspired by Mixture of Experts (MoE), we simplify the parameter merging problem into a conditional composition problem. Instead of pre-determining the best parameter combination for heterogeneous data at test time, as illustrated in Figure 1 (III), we introduce a router to dynamically merge shared and exclusive knowledge based on the test inputs. The shared model serves as the foundation, and task-specific knowledge is conditionally injected according to the router.

We demonstrate the effectiveness of our proposed Twin-Merging method through extensive experiments on 12 datasets, covering both discriminative and generative tasks, various model architectures, and in-domain and out-of-domain setups. As shown in Figure 2(b), Twin-Merging consistently outperforms other merging methods across all datasets, surpassing the strongest baseline by an average of 28.34% in normalized scores for discriminative tasks and 3.86% for generative tasks on the scaled model (Qwen-14B). We validate the scalability, extensibility, generalization, and storage efficiency of Twin-Merging (Figure 2(a)). Remarkably, even with a 99.9% reduction in parameters, our method only experiences a slight 14% performance degradation. Our results establish Twin-Merging as a powerful and effective method for combining multiple fine-tuned models into a single multi-task model.

To summarize, our contributions are as follows: (1) We introduce Twin-Merging, a novel model fusion method that reduces the performance gap between traditional model merging and fine-tuned models while enhancing adaptability to diverse data. (2) We investigate the impact of shared and exclusive task-specific knowledge on merging performance, presenting innovative techniques for knowledge disentanglement and dynamic merging. (3) Twin-Merging is simple to implement with minimal hyperparameters, improves multi-task performance without retraining expert models, and can be combined with other merging methods for further gains. Our approach scales well with model size and task numbers and is storage-efficient.

(a) The average performance on generative tasks vs. the number of parameters of Twin-Merging compared to various merging baselines, with different storage sizes indicated by circle size.
(b) Comparison of absolute accuracy (%) of individual tasks for the NLP benchmarks on RoBERTa and Qwen, covering 4 discriminative and 8 generative tasks.
Figure 2: The effectiveness of Twin-Merging in terms of performance and parameter-efficiency.

2 Related Work

In this section, we focus on model merging research; for additional related work on multi-task learning and Mixture of Experts, please see Appendix B. Model merging aims to fuse multiple fine-tuned task-specific models into one comprehensive multi-task model without additional training. FisherMerging [37] and RegMean [26] use straightforward weight averaging but require extra data and computation. Some works [54, 46, 16, 1, 47] bring models into a single low-loss basin and interpolate between them based on the linear mode connectivity (LMC) theory [11, 15, 13]. Weight permutations [1] and optimal transport [46] are utilized to better interpolate neural networks. However, recent studies [63] suggest that LMC might not always hold for fine-tuned models. Task-Arithmetic [21, 41] extends averaging to arithmetic operations in the parameter space for finer control over model behaviors, but the interference between the multiple models can be an issue. To tackle this challenge, advanced merging methods like Ties-Merging [59], AdaMerging [60] and DARE [61] have been proposed. These methods aim to reduce task conflicts by addressing parameter redundancy or disagreements in signs, finding optimal merging coefficients, and reducing weight density, respectively. Jiang et al. [25] assume that test tasks are known and use task-specific knowledge to improve performance. However, this assumption is often unrealistic since real-world data distributions are unpredictable. In contrast, our method addresses merging interference by modularizing shared and task-specific knowledge. We handle heterogeneous test data scenarios by introducing dynamic merging techniques.

3 Methodology

3.1 Analysis of the Performance Gap in Model Merging

In this paper, following the settings of model merging [22, 59, 61], we consider the case of $T$ tasks, where training for each task $t$ starts from the pre-trained model weight $\bm{\theta}_{0}$ and fine-tunes on $\mathcal{D}^{train}_{t}$ to obtain the task-specific model $\bm{\theta}_{t}$. Let $f(\bm{x};\bm{\theta})$ be a language model accepting inputs $\bm{x}\in\mathcal{X}$ and parameterized by weights $\bm{\theta}\in\Theta$. Because real data distributions are diverse and challenging to represent with a single task, previous methods typically model them as a mixture of the $T$ task test distributions: $\mathcal{D}=\sum_{t=1}^{T}\alpha_{t}\mathcal{D}_{t}$, where $\sum_{t=1}^{T}\alpha_{t}=1,\ \alpha_{t}>0\ \forall t$.
Model merging considers the problem where we have $T$ fine-tuned expert models $\{f_{t}(\bm{x};\bm{\theta}_{t})\}_{t=1}^{T}$ and the pre-trained weight $\bm{\theta}_{0}$, and compose a multitask model $\bm{\theta}^{*}$ to approximate the optimal solution:

$$\bm{\theta}_{opt}\approx\bm{\theta}^{*}=\mathcal{F}(\bm{\theta}_{0},\bm{\theta}_{1},\cdots,\bm{\theta}_{T}) \tag{1}$$

Here $\mathcal{F}$ represents an arbitrary merging function. For example, in Task Arithmetic [21], $\bm{\theta}^{*}=\bm{\theta}_{0}+\sum_{t=1}^{T}\gamma_{t}(\bm{\theta}_{t}-\bm{\theta}_{0})$.
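As an illustration of the Task Arithmetic formula, the following is a minimal NumPy sketch. It assumes each model is a dictionary mapping parameter names to arrays; the function name `task_arithmetic` and the toy weights are hypothetical and not taken from the released implementation.

```python
import numpy as np

def task_arithmetic(theta_0, experts, gammas):
    """Return theta_0 + sum_t gamma_t * (theta_t - theta_0), per parameter tensor."""
    merged = {}
    for name, base in theta_0.items():
        # The task vector of each expert is its offset from the pre-trained weight.
        delta = sum(g * (expert[name] - base) for expert, g in zip(experts, gammas))
        merged[name] = base + delta
    return merged

# Toy weights: one 2x2 matrix per model (illustrative values only).
theta_0 = {"w": np.zeros((2, 2))}
experts = [
    {"w": np.array([[1.0, 0.0], [0.0, 0.0]])},
    {"w": np.array([[0.0, 0.0], [0.0, 2.0]])},
]
theta_star = task_arithmetic(theta_0, experts, gammas=[0.5, 0.5])
```

With equal coefficients $\gamma_t = 0.5$, each expert contributes half of its offset from $\bm{\theta}_{0}$, so the coefficients are fixed once and do not depend on the test input.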

Table 1: Merging without parameter interference and merging between similar tasks both cause performance degradation (note: these two experiments use different datasets).

| Task | Normalized Score (Equation (4)) |
|---|---|
| With parameter interference | |
| Fine-tuned | 100.00 |
| Merging | 85.43 |
| Without parameter interference | |
| Non-overlap Fine-tuned | 100.00 |
| Non-overlap Merging | 82.21 [↓ 3.21] |
| Similar tasks | |
| Fine-tuned | 100.00 |
| Similar-Tasks Merging | 91.58 [↓ 8.42] |

Figure 3: The impact of different ratios of shared knowledge and exclusive knowledge.

Although existing merging methods, like Task Arithmetic, can combine multiple task-specific models efficiently, they often exhibit significant performance gaps compared to single-task models. Previous studies attribute this to parameter redundancy and sign discrepancies, denoted as parameter interference [59], leading to the loss of task-specific information. Furthermore, differences between tasks can cause interference in the merged weights, denoted as task interference [24]. To investigate the causes of performance degradation, we designed two experiments using Task Arithmetic. First, we injected task-specific knowledge into non-overlapping parameter sets, fine-tuning Qwen-14B with LoRA on different modules for each task (detailed in Appendix D.4). Despite avoiding parameter interference, merging resulted in an 82.21% normalized score, a drop of 3.21% compared to the overlapping version. Second, we merged models fine-tuned on similar tasks (e.g., XSUM and CNN-DailyMail for summarization). This experiment yields an 8.42% lower normalized score compared to the individually fine-tuned models, indicating persistent interference. In summary, our results show that interference in model merging is not limited to parameter-wise and task-wise issues.

3.2 Interpreting Interference From the Perspective of Knowledge

To tackle the challenge of interference, we examine the merging process from a knowledge perspective. We identify two types of critical knowledge: (1) Shared knowledge, which benefits multiple tasks, and (2) Exclusive knowledge, which is useful only for a specific task. Single-task models often contain both types, complicating the merging process and leading to interference. To validate our hypotheses, we conduct experiments that vary the ratio of task-specific and shared knowledge.

To examine the impact of shared knowledge, we conducted full fine-tuning on each model for its specific task. Excessive fine-tuning epochs can lead to catastrophic forgetting [14], a phenomenon where the model retains task-specific knowledge but loses general knowledge. As the fine-tuning epochs increase, the shared knowledge gradually decreases. The top section of Figure 3 illustrates that as the epoch count increases, merging performance significantly deteriorates, even though the fine-tuned model performs well on its task. This underscores the crucial role of shared knowledge in merging performance.

To explore the impact of exclusive knowledge, we merge a single task-specific model into the base model. We apply a sparsity method (e.g., SVD) to reduce the ratios of task-specific weights in the merging model from 100% (standard merging) to 0% (base model). As shown in the lower part of Figure 3, performance remains stable up to 90% sparsity. Notably, even with a 99% sparsity rate, a single-merged model outperforms multi-model merging, confirming the existence of exclusive knowledge, which is more pronounced with more models. This also underscores the value of unmerged task-specific knowledge, since the fine-tuning performance can be effectively restored by preserving unmerged task-specific information.

To summarize, both shared knowledge and un-merged task-specific knowledge play a vital role in merging performance. The exclusive nature of task-specific knowledge hinders the effectiveness of merging methods. Different types of knowledge need to be separated and modularized to achieve optimal performance. Thus, the first step of our Twin-Merging approach is to explicitly partition the weights into an expert containing shared knowledge and weights holding task-exclusive knowledge before merging. Formally, we denote the shared expert as $\bm{\theta}_{s}$ and the exclusive task-specific knowledge as $\{\bm{v}_{t}\}_{t=1}^{T}$. The details of our method are given in the following section.

3.3 Twin Merging

Algorithm 1 Twin-Merging
1: Input: language model $f(\bm{x};\bm{\theta})$, pre-trained weight $\bm{\theta}_{0}$ and $T$ task-specific fine-tuned weights $\{\bm{\theta}_{t}\}_{t=1}^{T}$, trained router $\mathcal{R}$ parameterized by a fully connected layer $\bm{\phi}$, embedding $\text{Emb}$, compression rank $r$ and pre-specified weights $\{\gamma_{t}\}_{t=1}^{T}$
2:
3: Pre-calculation: $\triangleright$ Only execute once
4: Compute the shared expert $\bm{\theta}_{s}$:
5:     $\bm{\theta}_{s}\leftarrow\bm{\theta}_{0}+\sum_{t=1}^{T}\gamma_{t}(\bm{\theta}_{t}-\bm{\theta}_{0})$
6: Extract exclusive knowledge vectors for each task-specific weight:
7:     $\bm{v}_{t}\leftarrow\text{SVD}_{r}(\bm{\theta}_{t}-\bm{\theta}_{s})$, for $t=1,\ldots,T$
8:
9: Inference: $\triangleright$ Main loop
10: Initialize output $\bm{Y}$
11: for each input $\bm{x}$ in inputs $\bm{X}$ do
12:     Calculate router weights:
13:         $[w_{1},\cdots,w_{T}]\leftarrow\text{softmax}(\mathcal{R}(\text{Emb}(\bm{x});\bm{\phi}))$
14:     Merge into a single expert $\bm{\theta}^{*}$:
15:         $\bm{\theta}^{*}\leftarrow\bm{\theta}_{s}+\sum_{t=1}^{T}w_{t}\bm{v}_{t}$
16:     Perform model inference to produce the output:
17:         $\bm{Y}\leftarrow\bm{Y}\cup f(\bm{x};\bm{\theta}^{*})$
18: end for
19:
20: Output $\bm{Y}$ for input $\bm{X}$.

Our proposed Twin-Merging employs two main stages: knowledge modularization and dynamic merging. These stages are designed to narrow the performance gap and enhance adaptive knowledge composition. Building on the formulation in Equation (2), Twin-Merging preprocesses experts into shared experts, isolates and compresses exclusive knowledge into vectors, and dynamically composes them during inference.

The preprocessing stage comprises three steps: (1) Shared Expert: To separate shared knowledge across different models, we consider the pre-merged model as a natural placeholder to encapsulate common knowledge that is important to all tasks (denoted as $\bm{\theta}_{s}$). By leveraging established merging techniques such as Task Arithmetic, we can readily extract the shared expert from the initial merged model. (2) Exclusive Knowledge: To convey task-specific information while separating common knowledge, we calculate the difference vector: $\bm{v}_{t}=\bm{\theta}_{t}-\bm{\theta}_{s}$. This subtraction vector preserves un-merged task-specific information while discarding the shared knowledge. (3) Compressed exclusive vectors: For practical use and distribution, we apply singular value decomposition (SVD) to further compress the above exclusive knowledge into vectors for each task.
Assuming $\bm{v}_{t}$ has a rank-$m$ decomposition, $\bm{v}_{t}=\mathbf{U}_{t}\mathbf{\Sigma}_{t}\mathbf{V}_{t}^{T}$, we achieve a low-rank task space by selecting the top-$r$ singular values, resulting in $\mathbf{U}_{t}(r)\mathbf{\Sigma}_{t}(r)\mathbf{V}_{t}(r)^{T}$.
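The three preprocessing steps can be sketched for a single weight matrix as follows. This is a minimal NumPy illustration under stated assumptions, not the released implementation: each model is reduced to one 2-D matrix, the shared expert is obtained with Task Arithmetic, and the helper names (`truncated_svd`, `reconstruct`, `modularize`) and toy sizes are hypothetical.

```python
import numpy as np

def truncated_svd(matrix, r):
    """Return the top-r singular triplets (U_r, S_r, Vt_r) of a 2-D matrix."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :r], S[:r], Vt[:r, :]

def reconstruct(factors):
    """Rebuild the rank-r matrix U_r diag(S_r) Vt_r from its factors."""
    U, S, Vt = factors
    return (U * S) @ Vt

def modularize(theta_0, experts, gammas, r):
    """Split experts into a shared expert and rank-r exclusive factors."""
    # Step 1: shared expert via Task Arithmetic.
    theta_s = theta_0 + sum(g * (th - theta_0) for th, g in zip(experts, gammas))
    # Steps 2-3: difference to the shared expert, then top-r SVD compression.
    exclusive = [truncated_svd(th - theta_s, r) for th in experts]
    return theta_s, exclusive

# Toy setup: random 32x16 weights standing in for one layer of three experts.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(32, 16))
experts = [theta_0 + 0.1 * rng.normal(size=(32, 16)) for _ in range(3)]
theta_s, exclusive = modularize(theta_0, experts, gammas=[0.3, 0.3, 0.3], r=4)

# Storage per task: r * (rows + cols + 1) numbers instead of rows * cols.
stored = 4 * (32 + 16 + 1)
dense = 32 * 16
```

Keeping only the factors is what makes the exclusive knowledge cheap to store: in this toy case each task needs 196 numbers rather than 512, and the gap widens as the matrix grows relative to $r$.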

In the inference stage, adapting to unforeseen challenges is difficult, especially with varied test data. For example, if most of the data consists of a certain type (denoted as $\mathcal{D}_{u}$), we should tailor the merged model for that specific task to get the best results. Instead of pre-defining the best parameters, we propose a new approach that combines shared expertise with exclusive knowledge. Our method uses the input $\bm{x}$ to dynamically adjust to the current data, enabling us to utilize shared knowledge and apply specialized expertise based on the inputs.

$$\bm{\theta}^{*}=\mathcal{F}(\underbrace{\bm{\theta}_{s}}_{\text{shared knowledge}},\underbrace{\bm{v}_{1},\cdots,\bm{v}_{T}}_{\text{exclusive knowledge}},\bm{x}) \tag{2}$$

For inference, we train a small fuser $\mathcal{R}$ (the router), parameterized by $\bm{\phi}$, through empirical risk minimization on a small validation dataset. This fuser is trained to dynamically select the specific task experts, replacing the need for complex optimization algorithms to determine fusion coefficients. The merged model is obtained by:

$$\bm{\theta}^{*}=\bm{\theta}_{s}+\sum_{t=1}^{T}w_{t}\cdot\text{SVD}_{r}(\bm{\theta}_{t}-\bm{\theta}_{s}) \tag{3}$$
$$\{w_{1},\cdots,w_{T}\}=\text{softmax}\bigl(\mathcal{R}(\text{Emb}(\bm{x});\bm{\phi})\bigr)$$

Here, $\text{Emb}(\bm{x})$ represents the sequence of the last-layer token embeddings from the shared expert $f(\bm{x};\bm{\theta}_{s})$.
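The routing and merging step can be sketched as below. This is a hedged NumPy illustration: the text does not specify how the token-embedding sequence is reduced before the linear layer, so mean pooling is an assumption made here, and `router_weights`, `dynamic_merge`, the random router parameters, and the dense exclusive matrices are all hypothetical stand-ins.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def router_weights(token_embeddings, W, b):
    """Assumed router: mean-pool the token embeddings, then one linear layer."""
    pooled = token_embeddings.mean(axis=0)   # shape (hidden,)
    return softmax(W @ pooled + b)           # shape (T,)

def dynamic_merge(theta_s, exclusive, weights):
    """theta* = theta_s + sum_t w_t * v_t for one weight matrix."""
    return theta_s + sum(w * v for w, v in zip(weights, exclusive))

rng = np.random.default_rng(0)
T, hidden = 3, 8
W, b = rng.normal(size=(T, hidden)), np.zeros(T)    # stand-in for trained phi
emb = rng.normal(size=(5, hidden))                  # 5 tokens of one input
theta_s = rng.normal(size=(4, 4))                   # shared expert (toy)
exclusive = [rng.normal(size=(4, 4)) for _ in range(T)]

w = router_weights(emb, W, b)
theta_star = dynamic_merge(theta_s, exclusive, w)
```

Because the weights come from a softmax, they are positive and sum to one; when the router is confident about a single task, the merged weight approaches the shared expert plus that task's exclusive vector.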

4 Experiments

4.1 Merging Experiment

Baselines

We compare Twin-Merging with several training-free model-merging methods, including weight averaging, Task Arithmetic [21], Ties-Merging [59], and DARE Merging [61]. Details on these baselines are provided in Appendix D. Additionally, we include individually fine-tuned models and the pre-trained model as upper and lower bounds on performance, respectively. Performance is assessed using the average score normalized by that of the fine-tuned models, to mitigate the effects of different task-specific score ranges. The normalized score of the merged model $\bm{\theta}^{*}$ is calculated as:

$$\text{Normalized Score}=\frac{1}{T}\sum_{t=1}^{T}\frac{\underset{\bm{x}\sim\mathcal{D}_{t}}{\operatorname{Score}}\left[f(\bm{x};\bm{\theta}^{*})\right]}{\underset{\bm{x}\sim\mathcal{D}_{t}}{\operatorname{Score}}\left[f_{t}(\bm{x};\bm{\theta}_{t})\right]} \tag{4}$$
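A small worked example of Equation (4) is given below. The raw per-task scores are made up for illustration and are not taken from the paper; the factor of 100 is an assumption that matches the tables, where the fine-tuned models are reported as 100.00.

```python
def normalized_score(merged_scores, expert_scores):
    """Average over tasks of merged score / fine-tuned expert score, in percent."""
    if len(merged_scores) != len(expert_scores):
        raise ValueError("need one merged and one expert score per task")
    ratios = [m / e for m, e in zip(merged_scores, expert_scores)]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up raw scores for three tasks (not taken from the paper).
merged = [80.0, 45.0, 60.0]
expert = [100.0, 50.0, 60.0]
score = normalized_score(merged, expert)   # ratios 0.8, 0.9, 1.0 average to 90%
```

Dividing by the expert score per task before averaging keeps a task with a small raw score range (45 out of 50 here) from being dominated by tasks with larger ranges.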

We evaluate our method on both discriminative and generative NLP benchmarks.

Discriminative Tasks

For discriminative tasks, following [59, 61], we use RoBERTa [34] as the backbone and evaluate on the 8-task GLUE benchmark [53]. More details are in Appendix D.2.

Generative Tasks

For our generative tasks, we use Qwen-14B [3] as the primary model to demonstrate the effectiveness of our approach on large-scale language models. To reduce deployment costs, we utilize task-specific checkpoints fine-tuned with the LoRA method [20] (See Appendix A for details on adapting Twin-Merging to LoRA). We evaluate our model on four scenarios: general knowledge (MMLU benchmark [18]), factualness (TruthfulQA [32]), safety (BBQ [42]), and summarization (CNN-DailyMail [39]). Detailed information is provided in Appendix D.2.

Table 2: Performance on 8 Discriminative Tasks (RoBERTa) and 4 Generative Tasks (Qwen-14B)

| Method | 8 Discriminative Tasks | 4 Generative Tasks | Avg. |
|---|---|---|---|
| Pretrained | 41.69 | 91.06 | 66.37 |
| Fine-tuned | 100.00 | 100.00 | 100.00 |
| Weight Averaging | 52.56 | 95.74 | 74.15 |
| Task Arithmetic | 67.80 | 96.61 | 82.20 |
| Task Arithmetic (w/ DARE) | 64.66 | 98.52 | 81.59 |
| Ties-Merging | 63.68 | 92.67 | 78.17 |
| Ties-Merging (w/ DARE) | 65.58 | 91.92 | 78.75 |
| Twin-Merging (Best Storage) | 86.00 | 100.96 | 93.48 |
| Twin-Merging (Ours) | 96.14 | 102.38 | 99.26 |

Main Results

Table 2 presents the results for all discriminative and generative benchmarks. A comparison of each task is illustrated in Figure 2(b) (detailed statistics are provided in Table 8 and Table 9 in Appendix D.7). Twin-Merging consistently outperforms weight averaging, Task Arithmetic, Ties-Merging, and DARE Merging, leading to significant performance gains across settings. For discriminative tasks, it approaches the upper bound of fine-tuned performance on the GLUE benchmark. Specifically, our method improves over Task Arithmetic by 28.34%, Ties-Merging by 32.46%, and DARE-Merging by 30.56% in absolute normalized score. In Figure 2(b), we observe that, especially on the CoLA task, where conventional merging methods fail to improve the result, our approach can still approach the upper bound of the CoLA expert.

Similar to discriminative tasks, Twin-Merging achieves the best results on generative benchmarks, improving over Task Arithmetic and DARE Merging by 5.77% and 3.86%, respectively. We observe two interesting findings: (1) The merging gains on Qwen-14B for generative tasks are lower than those on RoBERTa for discriminative tasks. Pretrained RoBERTa exhibits only about half of its fine-tuned capabilities, while Qwen-14B achieves 91.06% of its fine-tuned performance without any fine-tuning. This suggests that smaller models like RoBERTa benefit more from task-specific biases, whereas large models like Qwen-14B already perform well without additional task-specific knowledge. Consequently, merging task-specific experts significantly improves RoBERTa, but has limited effect on Qwen-14B. (2) On the generative benchmark, Twin-Merging even surpasses the original upper bound of the fine-tuned experts. This likely stems from the vast knowledge within Qwen-14B. Although the merged model is not itself fine-tuned, proper knowledge modularization and dynamic merging can further unlock its capabilities. This suggests a promising direction for pushing the limits of LLMs without retraining.

Table 3: Our method scalability (72B)
Method TruthfulQA BBQ
Pretrained-72B 94.48 89.51
Fine-tuned 100 100
Task Arithmetic 98.70 95.40
Twin Merging 99.30 97.14
Table 4: Our method extensibility to other model merging methods
Method RoBERTa Qwen
Weight Average 52.56 95.74
Twin-Merging + Weight Average 96.23 100.08
Task-Arithmetic 67.80 98.52
Twin-Merging + Task-Arithmetic 96.14 102.38
Ties-Merging 63.68 92.67
Twin-Merging + Ties-Merging 96.34 102.35

Scalability of Twin-Merging

Our method remains effective with scaled models (e.g., 72B parameters), as shown in Table 3. To manage high deployment costs, we limited our evaluation and merged experts to two tasks: BBQ and TruthfulQA. Twin-Merging consistently surpasses both the scaled pre-trained model and Task Arithmetic, highlighting our approach’s scalability.

Collaborating with Other Merging Methods

To evaluate the compatibility of Twin-Merging with other merging methods, we conducted experiments using different techniques to create the shared expert, followed by dynamically merging the twin vectors. The results in Table 4 demonstrate that our method integrates seamlessly with the primary merging techniques, leading to significant improvements. For example, when combined with our approach, the baseline Weight Average method improves from 52.56 to 96.23 on GLUE, approaching the performance of fine-tuned experts. Notably, our method complements Ties-Merging particularly well, suggesting that better isolation of shared knowledge enhances the overall performance of Twin-Merging.

Table 5: Performance on unseen tasks
Method QNLI+MNLI+RTE MMLU
Task Arithmetic 53.92 62.02
Task Arithmetic (w/ DARE) 54.27 63.09
Ties Merging 54.09 64.62
Ties Merging (w/ DARE) 54.72 63.13
Twin-Merging 55.86 65.98
Table 6: Ablation study of Twin-Merging
Task RoBERTa Qwen
Twin-Merging 96.14 102.38
-- shared expert 81.47 87.77
-- dynamic Merging 67.80 96.61

4.2 Unseen Generalization

As shown in Table 5, the Twin-Merging method benefits from complementary collaboration among different experts. Since the corresponding task-specific experts are unavailable, we directly use the average of the unnormalized scores as the metric. In the GLUE benchmark, when the QNLI, MNLI, and RTE experts are absent, our approach still outperforms traditional baselines. Details on the expert combination for QNLI can be found in Figure 5(a). For complex tasks like MMLU, which involves multiple-choice QA across 57 categories, Twin-Merging demonstrates superior performance using the combined knowledge from the TruthfulQA, BBQ, and CNN-DailyMail domains.

4.3 Ablation Studies

To demonstrate the effectiveness of our modularization approach using twin vectors and the dynamic merging strategy, we conducted ablation studies for Twin-Merging, detailed in Table 6.

To assess the impact of the shared expert strategy, we replace the shared expert with a randomly chosen task-specific expert. Twin-Merging’s performance degrades significantly without the shared expert, emphasizing its importance in capturing common knowledge. Additionally, to evaluate the dynamic merging strategy, we remove the dynamic experts, leaving only a single shared expert. This leads to a consistent drop in performance, confirming that dynamically merged experts are necessary in our method.

We observe that removing dynamic experts causes a significant performance drop for RoBERTa while it is less critical than replacing the shared expert for Qwen-14B. This suggests that for smaller models like RoBERTa, task-specific biases are more important than common knowledge. In contrast, for large generative models like Qwen-14B, the extensive general knowledge within the model allows it to handle most tasks without fine-tuning. Therefore, the shared expert is more crucial for Qwen-14B than task-specific knowledge. Our approach effectively merges fine-tuned and shared experts, adapting seamlessly to both scenarios. These findings demonstrate the effectiveness of our fine-grained expert merging strategy.

4.4 Scale to More Tasks

Figure 4: Averaged normalized accuracy vs. the number of tasks for various benchmarks. Twin-Merging maintains performance regardless of task number and compresses the fine-tuned checkpoints.

In the left panel of Figure 4, we examine the impact of the number of tasks on model merging performance. Conventional model merging methods degrade notably, especially with many tasks, nearly reaching pre-trained levels. However, Twin-Merging consistently outperforms other methods, approaching fine-tuned performance, with greater gains as the task count rises.

The right panel of Figure 4 shows the performance-storage trade-offs. While model merging methods have a constant storage cost, their performance remains low. In contrast, maintaining individual task-specific models guarantees strong performance but requires excessive storage. Twin-Merging achieves nearly 100% normalized accuracy across various tasks, balancing performance and storage efficiency by maintaining task-specific parameters with shared experts. This makes Twin-Merging a viable solution for scenarios demanding a balance between performance and storage efficiency.

4.5 Router Analysis

(a) The routing result on the QNLI dataset using different numbers of GLUE experts, ranging from 2 twin vectors ($\bm{v}_{\text{CoLA}}$ and $\bm{v}_{\text{SST-2}}$) to 7 twin vectors ($\bm{v}_{\text{CoLA}}$, $\bm{v}_{\text{SST-2}}$, $\bm{v}_{\text{MRPC}}$, $\bm{v}_{\text{STS-B}}$, $\bm{v}_{\text{QQP}}$, $\bm{v}_{\text{MNLI}}$, and $\bm{v}_{\text{QNLI}}$). The router weights are Softmax normalized.
(b) The routing weights of Qwen experts ($\bm{v}_{\text{MMLU}}$, $\bm{v}_{\text{TruthfulQA}}$, $\bm{v}_{\text{BBQ}}$, $\bm{v}_{\text{CNN-DailyMail}}$) on four generative tasks (MMLU, TruthfulQA, BBQ, CNN-DailyMail).
Figure 5: Twin-Merging routing decisions of the experts for various tasks.

Figure 5 shows the results of routing decisions among experts for the QNLI dataset and four generative benchmarks. As shown in Figure 5(a), the router maximizes the use of limited expert knowledge to address QNLI, a task where the goal is to determine if the context sentence contains the answer to the input question. For example, with only $\bm{v}_{\text{CoLA}}$ and $\bm{v}_{\text{SST-2}}$ available, the router primarily uses $\bm{v}_{\text{CoLA}}$, which provides knowledge of sentence and word relations, while $\bm{v}_{\text{SST-2}}$ is focused on irrelevant sentiment classification. With six experts ranging from $\bm{v}_{\text{CoLA}}$ to $\bm{v}_{\text{MNLI}}$, the router mainly leverages $\bm{v}_{\text{MNLI}}$ for textual entailment and $\bm{v}_{\text{QQP}}$ for question-answering capabilities. When $\bm{v}_{\text{QNLI}}$ is included, the router naturally relies on QNLI-specific knowledge. These results demonstrate the flexibility and adaptability of our Twin-Merging method, providing good interpretability. For larger models like Qwen-14B, as shown in Figure 5(b), the router plays a crucial role in selecting and combining specific knowledge. When experts have overlapping task-specific knowledge, such as $\bm{v}_{\text{TruthfulQA}}$ and $\bm{v}_{\text{MMLU}}$, the router may assign them similar weights.
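To make the routing step concrete, the following is a minimal sketch, under stated assumptions, of how Softmax-normalized router weights could combine twin vectors with a shared expert, following the form $\bm{\theta}^{*} = \bm{\theta}_s + \sum_t w_t \bm{v}_t$ used in Appendix A. The router logits, the toy parameter vectors, and the function names are hypothetical; the actual router is a small trained network that this sketch does not attempt to reproduce.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_merge(shared: List[float],
                  twin_vectors: Dict[str, List[float]],
                  router_logits: Dict[str, float]) -> List[float]:
    """Return shared + sum_t w_t * v_t, with w = softmax(router logits)."""
    names = list(twin_vectors)
    weights = softmax([router_logits[n] for n in names])
    merged = list(shared)
    for w, name in zip(weights, names):
        for i, value in enumerate(twin_vectors[name]):
            merged[i] += w * value
    return merged


# Toy example with 3 parameters and two experts (all numbers are made up).
theta_shared = [0.0, 1.0, 2.0]
twins = {"CoLA": [1.0, 0.0, 0.0], "SST-2": [0.0, 1.0, 0.0]}
logits = {"CoLA": 2.0, "SST-2": 0.0}  # hypothetical router output for one input

weights = softmax([logits[n] for n in twins])
theta_star = dynamic_merge(theta_shared, twins, logits)
print(weights, theta_star)
```

Because the weights are recomputed for each input, the merged parameters change with the input, which is what distinguishes this scheme from a single static merge.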

4.6 Compression and Speed Analysis

Compression Analysis

Figure 6: Twin-Merging performance vs. different sparsity levels and techniques for GLUE

In the left panel of Figure 6, we explore sparsity rates from 0% to 100%. Appendix E provides a detailed qualitative analysis of various merging methods. Remarkably, our Twin-Merging method maintains 86.4% performance even at a 99.8% compression rate. This suggests that performance relies on a small fraction of task-specific parameters, aligning with previous findings [59, 61]. Our results also validate our hypothesis that redundant parameters can obscure critical knowledge, leading to performance degradation. Consequently, we primarily use a 90% sparsity rate in our experiments to preserve performance while reducing storage costs. We also conducted an ablation study on sparsity methods, shown in the right panel of Figure 6. SVD retains task-specific information better than Magnitude [59] and Bernoulli Dropout [61]. As SVD is applied only once during preprocessing, it does not become an inference bottleneck.
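As an illustration of SVD-based compression, the sketch below truncates a toy weight-difference matrix to rank $r$ with NumPy and reports the relative reconstruction error together with the fraction of numbers that must be stored. It is a sketch under stated assumptions: the matrix is synthetic, the rank is arbitrary, and the helper names `svd_compress` and `reconstruct` are hypothetical rather than taken from the released code.

```python
import numpy as np


def svd_compress(delta: np.ndarray, rank: int):
    """Keep the top-`rank` singular triplets of a 2-D weight difference."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


def reconstruct(u: np.ndarray, s: np.ndarray, vt: np.ndarray) -> np.ndarray:
    """Rebuild the low-rank approximation from its factors."""
    return (u * s) @ vt


rng = np.random.default_rng(0)
# Synthetic "twin vector": a rank-4 signal plus small dense noise.
signal = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
delta = signal + 0.01 * rng.standard_normal((64, 48))

u, s, vt = svd_compress(delta, rank=4)
approx = reconstruct(u, s, vt)

rel_error = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
stored = u.size + s.size + vt.size   # numbers kept after compression
ratio = stored / delta.size          # fraction of the dense matrix size
print(f"relative error: {rel_error:.4f}, stored fraction: {ratio:.3f}")
```

The factorization is computed once offline, so only the small factors need to be kept per task, and no SVD is required at inference time.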

Table 7: Compute-performance tradeoff in the generative benchmark.
Method Training Tokens Training Cost Inference Cost (/1000 items) Performance
Multi-Task Learning 536.35M 10h32min 236s 94.31
Model Merging 0 0 236s 96.61
Twin-Merging 0.57M 183s 275s 102.38

Speed Analysis

Table 7 presents the time cost for Twin-Merging on the generative benchmarks. Although the training stage uses only about 0.1% of the training tokens required by multi-task learning, Twin-Merging significantly improves general capabilities in comparison. Twin-Merging does not retrain all task experts; instead, it reuses experts (e.g., downloaded from model hubs like Huggingface [57]) and trains a small router to fuse these experts. Compared to conventional model merging methods, Twin-Merging spends a minimal router training budget and slightly reduces inference speed, because the twin vectors are composed dynamically, in exchange for superior performance. In summary, our approach strikes a better balance between compute and performance.
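The two ratios behind this trade-off can be recomputed from Table 7. The snippet below is a simple arithmetic sketch; the variable names are illustrative only.

```python
# Figures copied from Table 7.
mtl_tokens_m = 536.35    # multi-task learning training tokens (millions)
twin_tokens_m = 0.57     # Twin-Merging router training tokens (millions)
merge_infer_s = 236      # conventional model merging, seconds per 1000 items
twin_infer_s = 275       # Twin-Merging, seconds per 1000 items

# Share of the multi-task training tokens that the router needs.
token_fraction = twin_tokens_m / mtl_tokens_m
# Relative increase in inference time caused by dynamic composition.
inference_overhead = twin_infer_s / merge_infer_s - 1

print(f"router training tokens: {token_fraction:.2%} of multi-task learning")
print(f"inference slowdown:     {inference_overhead:.1%}")
```

Under these figures the router consumes roughly a thousandth of the multi-task training tokens, while inference takes about one sixth longer than with a static merge.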

5 Conclusions

In this paper, we introduce Twin-Merging to merge language models, aiming to close the performance gap between conventional model merging techniques and fine-tuned models, while improving adaptability to data heterogeneity. By modularizing and dynamically merging shared and task-specific knowledge, Twin-Merging significantly outperforms existing model-merging methods and approaches the performance of fine-tuned models across various settings and domains. Our study highlights the impact of shared and exclusive task-specific knowledge on merging performance. We show that Twin-Merging benefits even strong scaled models like Qwen-72B, which already perform well across domains. It extends to more tasks and merging methods, demonstrating better generalization on unseen data. By utilizing SVD, our solution retains 86% of the performance with only 0.1% of the parameters, approaching upper-bound performance with minimal storage increase as tasks grow, achieving a better tradeoff between computation and performance.

References

  • Ainsworth et al. [2023] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023.
  • Azerbayev et al. [2024] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics, 2024.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
  • Chen et al. [2018] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
  • Choshen et al. [2022] Leshem Choshen, Elad Venezian, Noam Slonim, and Yoav Katz. Fusing finetuned models for better pretraining, 2022.
  • Clark et al. [2022] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057–4086. PMLR, 2022.
  • Dai et al. [2024] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models, 2024.
  • Dettmers et al. [2024] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Dong et al. [2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1166. URL https://aclanthology.org/P15-1166.
  • Draxler et al. [2018] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.
  • Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • Frankle et al. [2020] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
  • French [1999] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • Garipov et al. [2018] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
  • Gueta et al. [2023] Almog Gueta, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, and Leshem Choshen. Knowledge is a region in weight space for fine-tuned language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1350–1370, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.95. URL https://aclanthology.org/2023.findings-emnlp.95.
  • Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
  • Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
  • Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.
  • Hu et al. [2021] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
  • Ilharco et al. [2022] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2022.
  • Ilharco et al. [2023] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6t0Kwf8-jrj.
  • Jiang et al. [2024a] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024a.
  • Jiang et al. [2023] Junguang Jiang, Baixu Chen, Junwei Pan, Ximei Wang, Dapeng Liu, Mingsheng Long, et al. Forkmerge: Mitigating negative transfer in auxiliary-task learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Jiang et al. [2024b] Weisen Jiang, Baijiong Lin, Han Shi, Yu Zhang, Zhenguo Li, and James T. Kwok. Byom: Building your own multi-task model for free, 2024b.
  • Jin et al. [2022] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In The Eleventh International Conference on Learning Representations, 2022.
  • Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
  • Lepikhin et al. [2021] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021.
  • Li et al. [2023] Weishi Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen. Deep model fusion: A survey, 2023.
  • Liebenwein et al. [2021] Lucas Liebenwein, Cenk Baykal, Brandon Carter, David Gifford, and Daniela Rus. Lost in pruning: The effects of pruning neural networks beyond test accuracy. Proceedings of Machine Learning and Systems, 3:93–138, 2021.
  • Lin and Hovy [2003] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157, 2003. URL https://aclanthology.org/N03-1020.
  • Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
  • Liu et al. [2021] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 2021.
  • Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. 2019.
  • Luo et al. [2023] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2023.
  • Maninis et al. [2019] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1851–1860, 2019.
  • Matena and Raffel [2022] Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 2022.
  • Nakamura et al. [2024] Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, and Sampo Pyysalo. Aurora-m: The first open source multilingual language model red-teamed according to the u.s. executive order, 2024.
  • Nallapati et al. [2016] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
  • Navon et al. [2022] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pages 16428–16446. PMLR, 2022.
  • Ortiz-Jimenez et al. [2023] Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=0A9f2jZDGW.
  • Parrish et al. [2022] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. Bbq: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022.
  • Patterson et al. [2021] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training, 2021.
  • Rozière et al. [2024] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024.
  • Sanh et al. [2022] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
  • Singh and Jaggi [2020] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
  • Stoica et al. [2023] George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training, 2023.
  • Sukhbaatar et al. [2024] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, et al. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. arXiv preprint arXiv:2403.07816, 2024.
  • Sun et al. [2023] Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, and Guoyin Wang. Pushing the limits of chatgpt on nlp tasks, 2023.
  • Tang et al. [2024a] Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, and Dacheng Tao. Merging multi-task models via weight-ensembling mixture of experts, 2024a.
  • Tang et al. [2024b] Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter-efficient multi-task model fusion with partial linearization. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=iynRvVVAmH.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
  • Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
  • Wang et al. [2019] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2019.
  • Wang et al. [2023] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric P Xing, and Mikhail Yurochkin. Fusing models with complementary expertise. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models, 2023.
  • Wei et al. [2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
  • Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: State-of-the-art natural language processing, 2020.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, 2022.
  • Yadav et al. [2023] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Yang et al. [2024] Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=nZP6NgD3QY.
  • Yu et al. [2024] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024.
  • Zhou et al. [2022] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 2022.
  • Zhou et al. [2024] Zhanpeng Zhou, Zijun Chen, Yilan Chen, Bo Zhang, and Junchi Yan. Cross-task linearity emerges in the pretraining-finetuning paradigm, 2024.
  • Zoph [2022] Barret Zoph. Designing effective sparse expert models. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1044–1044, 2022. doi: 10.1109/IPDPSW55747.2022.00171.

Appendix A Twin Merge on LoRA

Here, we demonstrate that our Twin-Merging method can be seamlessly applied to LoRA modules [20], where the base model is fixed and additional task-specific information is injected through low-rank matrices, i.e., $\bm{\theta}_t = \bm{\theta}_0 + \text{LoRA}_t$, where $\text{LoRA}_t$ represents the fine-tuned LoRA module for the $t$-th task. Letting $\bm{\theta}_s = \bm{\theta}_0 + \text{LoRA}_s$, we can show that Twin-Merging on $\bm{\theta}$ is equivalent to Twin-Merging on the LoRA modules:

$$
\begin{aligned}
\bm{\theta}^{*} &= \underbrace{\bm{\theta}_s + \sum_{t=1}^{T} w_t * \text{SVD}_r(\bm{\theta}_t - \bm{\theta}_s)}_{\text{Twin-Merging on } \bm{\theta}} \\
&= \bm{\theta}_0 + \text{LoRA}_s + \sum_{t=1}^{T} w_t * \text{SVD}_r\Bigl((\bm{\theta}_0 + \text{LoRA}_t) - (\bm{\theta}_0 + \text{LoRA}_s)\Bigr) \\
&= \bm{\theta}_0 + \underbrace{\text{LoRA}_s + \sum_{t=1}^{T} w_t * \text{SVD}_r(\text{LoRA}_t - \text{LoRA}_s)}_{\text{Twin-Merging on LoRA}} \\
&= \bm{\theta}_0 + \text{LoRA}^{*}
\end{aligned}
\tag{5}
$$

where we denote $\text{LoRA}^{*} = \text{LoRA}_s + \sum_{t=1}^{T} w_t * \text{SVD}_r(\text{LoRA}_t - \text{LoRA}_s)$.
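The equivalence can be checked numerically on small random matrices. The following is a minimal sketch, not the authors' implementation: the dimensions, the fixed weights `w`, and the choice of the shared module as a plain average of the task LoRA updates are all assumptions made for illustration, and `svd_r` is a hypothetical helper that keeps the top-$r$ singular components.

```python
import numpy as np

def svd_r(matrix, r):
    """Rank-r truncation of a matrix via its singular value decomposition."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]

rng = np.random.default_rng(0)
d_out, d_in, lora_rank, r, T = 6, 5, 2, 1, 3   # toy sizes (assumed)

theta_0 = rng.normal(size=(d_out, d_in))        # frozen base weights
loras = [rng.normal(size=(d_out, lora_rank)) @ rng.normal(size=(lora_rank, d_in))
         for _ in range(T)]                     # dense form of each LoRA update
lora_s = sum(loras) / T                         # assumed shared LoRA: plain average
w = [0.5, 0.3, 0.2]                             # assumed fusion weights

# Twin-Merging applied to the full weights theta_t = theta_0 + LoRA_t.
theta_s = theta_0 + lora_s
theta_star = theta_s + sum(w[t] * svd_r((theta_0 + loras[t]) - theta_s, r)
                           for t in range(T))

# Twin-Merging applied to the LoRA modules only.
lora_star = lora_s + sum(w[t] * svd_r(loras[t] - lora_s, r) for t in range(T))

max_gap = float(np.max(np.abs(theta_star - (theta_0 + lora_star))))
```

Because $\bm{\theta}_0$ cancels inside the difference, `max_gap` is zero up to floating-point rounding, so only the LoRA modules need to be stored and merged.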

Appendix B More Related Research

Multi-Task Learning.

Multi-task training typically learns multi-task features by simultaneously optimizing task-specific objectives, facilitating the integration of diverse knowledge into the model. Existing works mainly focus on mitigating task conflicts [33] and catastrophic forgetting [14] by sharing parameters [36], adjusting objectives [10, 45], finding suitable task weightings [4, 40], and minimizing negative transfer [24]. As models grow larger and the number of task scenarios increases, a more cost-effective approach to multi-task learning is needed. We therefore focus on multi-task scenarios that require neither acquiring or integrating multi-task data nor further updating existing experts.

Mixture of Experts.

To enhance model scalability without increasing computational costs, the mixture of experts (MoE) paradigm introduces conditional routing of inputs to a subset of learnable parameters. Several efforts have extended feedforward networks (FFNs) within Transformers to incorporate MoE layers, such as GShard [28] and Switch Transformer [12]. These models typically employ learnable top-2 or top-1 routing strategies to scale MoE language models to an extremely large size [23]. Recent studies have focused on challenges such as load balancing of experts [6, 62], training instability [64], expert specialization [7, 50], and synchronization reduction [48]. However, these methods often require substantial multi-task data and costly joint training. In contrast, our approach directly reuses task-specific experts, leading to the natural specialization of experts in different domains. We only require minimal fine-tuning for a small router to calculate fusion weights, making our method highly efficient.

Appendix C The Merging Interference and Limited Generalization

To illustrate the challenge of determining the optimal merging coefficient and the limitations of pre-specified coefficients on unpredictable data, we consider COLA and SST-2 as in-domain experts. We merge them using Task Arithmetic and evaluate on the eight discriminative tasks from the GLUE benchmark. Only COLA and SST-2 are seen tasks, while the others are unseen. Since the merging coefficient is crucial for performance [60, 41], we conduct an extensive grid search over coefficients ranging from $-2$ to $2$.
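A grid search of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: parameters are flat lists, the step size of 0.5 is assumed (the text only gives the range), and `evaluate` is a hypothetical stand-in for running a validation benchmark on the merged model.

```python
import itertools

def task_arithmetic(theta_0, task_vectors, gammas):
    """Return theta_0 + sum_t gamma_t * tau_t for flat parameter lists."""
    merged = list(theta_0)
    for tau, gamma in zip(task_vectors, gammas):
        merged = [m + gamma * x for m, x in zip(merged, tau)]
    return merged

def grid_search(theta_0, task_vectors, evaluate, grid):
    """Try every coefficient combination and keep the best-scoring one."""
    best_gammas, best_score = None, float("-inf")
    for gammas in itertools.product(grid, repeat=len(task_vectors)):
        score = evaluate(task_arithmetic(theta_0, task_vectors, gammas))
        if score > best_score:
            best_gammas, best_score = gammas, score
    return best_gammas, best_score

# Toy setup: two experts and a hypothetical score that peaks at a known point.
grid = [-2.0 + 0.5 * i for i in range(9)]      # -2.0, -1.5, ..., 2.0 (assumed step)
theta_0 = [0.0, 0.0]
tau_cola, tau_sst2 = [1.0, 0.0], [0.0, 1.0]
target = [0.5, -1.0]                            # hypothetical optimum

def evaluate(theta):
    return -sum((a - b) ** 2 for a, b in zip(theta, target))

best_gammas, best_score = grid_search(theta_0, [tau_cola, tau_sst2], evaluate, grid)
num_evaluations = len(grid) ** 2                # grows as len(grid) ** T
```

The number of evaluations grows exponentially with the number of experts, which is the scaling issue a fixed-coefficient search runs into.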

Figure 7: Normalized performance across eight GLUE tasks, highlighting the impact of combining expertise from the COLA and SST-2 domains (experts indicated by red vectors) through Task Arithmetic. Performance scores are normalized, with the unmerged pretrained model set to zero and other results scaled to the $[-1, 1]$ range. The x-axis ($\gamma_{\text{COLA}}$) and y-axis ($\gamma_{\text{SST-2}}$) represent the merging weights for COLA and SST-2 expertise. Blue regions indicate improved performance over the pretrained model, while red regions indicate deterioration.

A large dark-blue region indicates consistent optimal performance, which is why Task Arithmetic can work with various weights. Conventional methods search this region for optimal performance across all in-domain tasks, avoiding the red region. However, this is computationally expensive and does not scale well with an increasing number of tasks. Additionally, it cannot handle unseen tasks, as the same coefficients can produce different patterns across tasks. For example, setting the coefficients $\gamma_{\text{COLA}}$ and $\gamma_{\text{SST-2}}$ to $1$ leads to performance drops on MRPC and QNLI, but gains on MNLI, QQP, and RTE. (In fact, MNLI and QNLI are very similar Natural Language Inference (NLI) tasks [53]. This demonstrates that task similarity does not guarantee similar merging performance patterns.)

Furthermore, merging performance does not always form a single cluster. For example, within the range $[-2, 2]$, STS-B and QNLI already show complex patterns, making it difficult to find an optimal weight for all tasks when task-specific experts are limited. Although Yang et al. [60] propose unsupervised entropy minimization to find optimal coefficients, this method is limited to classification tasks and has limited adaptability.

To address this, we propose reformulating the problem of fusing models as a supervised learning task. Specifically, we train a router to dynamically merge task-specific experts, as detailed in Section 3.3.
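To make the idea of input-dependent merging concrete, here is a minimal sketch. It assumes that the router emits one logit per expert for the current input and that the fusion weights are a softmax over those logits; the router itself, its training, and the compression of the exclusive vectors are omitted, and all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_merge(theta_s, exclusive_vectors, router_logits):
    """Return theta_s + sum_t w_t * v_t with w = softmax(router_logits)."""
    weights = softmax(router_logits)
    merged = list(theta_s)
    for w_t, v_t in zip(weights, exclusive_vectors):
        merged = [m + w_t * x for m, x in zip(merged, v_t)]
    return merged, weights

# Toy example: a shared expert and two exclusive task vectors (flat lists).
theta_s = [1.0, 1.0]
v_cola, v_sst2 = [2.0, 0.0], [0.0, 4.0]

# Equal logits stand in for an input the router finds equally related to both tasks.
merged, weights = dynamic_merge(theta_s, [v_cola, v_sst2], [0.0, 0.0])
```

Unlike a grid search, the weights are recomputed for each input, so no single coefficient setting has to fit every task.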

Appendix D Experiment Details

Here we describe the settings of our experiments in detail.

D.1 Compute Resources Used and Runtimes

We ran all experiments on Nvidia A100 GPUs with 80GB of memory. Training single-task LoRA models for Qwen-14B on the four generative tasks took 1-2 hours per task, and single-task LoRA for Qwen-72B took 10 hours on a single GPU, while the multitask vector took around 10 hours on a single GPU for 500M tokens. The RoBERTa model needs 15 minutes per task on the GLUE datasets. Merging experiments were efficient, with evaluations taking less than 2 minutes. Inference is generally fast: within 4 minutes per 1000 items for generative tasks and less than 30 seconds per 1000 items for discriminative tasks. A detailed comparison of the training and inference costs of the different methods is given in Table 4.6.

D.2 Employed Datasets and Associated Licences

Discriminative Tasks.

We conduct experiments on the GLUE benchmark [53] with eight discriminative tasks, all of which are classification tasks except STS-B, which is a regression task. Details of the eight datasets can be found in Wang et al. [53]. Consistent with prior research [61], we split off 10% of the training set as a validation set and use the original validation data as the test set.
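The split can be sketched as below. The shuffling, the seed, and the helper name `split_train_validation` are assumptions for illustration; the text only states the 10% proportion.

```python
import random

def split_train_validation(examples, validation_fraction=0.1, seed=0):
    """Hold out a random fraction of the training examples for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)         # assumed: seeded random split
    n_val = int(round(len(examples) * validation_fraction))
    val_indices = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_indices]
    validation = [examples[i] for i in indices[:n_val]]
    return train, validation

examples = list(range(100))                      # placeholder training set
train, validation = split_train_validation(examples)
```

The original GLUE validation split is then kept untouched and used only for final testing.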

QNLI, COLA, and STS-B are licensed under CC-BY-SA. QQP is licensed under MIT. SST-2 and MRPC are licensed under Apache 2.0. MNLI is licensed under OANC. RTE is licensed under CC BY 4.0. Thus, these datasets in GLUE are available for non-commercial research purposes.

Generative Tasks.

We conducted experiments on four benchmarks:

  1. MMLU [18]: This benchmark tests general and STEM knowledge across 57 subjects, from elementary to professional levels. We used Exact-Match as the metric.

  2. TruthfulQA [32]: This benchmark assesses the truthfulness of language models with 817 questions spanning 38 categories like health, law, finance, and politics. Exact-Match was used as the metric.

  3. BBQ [42]: This dataset highlights social biases against protected classes in nine social dimensions relevant to U.S. English-speaking contexts. Exact-Match was the metric.

  4. CNN-DailyMail [39]: This dataset is used for text summarization, requiring models to generate summaries of news stories. ROUGE-2 scores [31] were used for evaluation.

We evaluated these tasks using the HELM benchmark (https://github.com/stanford-crfm/helm) in a few-shot setting.

For MMLU and TruthfulQA, which lack official training sets, we used the Dolly-15k dataset (https://huggingface.co/datasets/databricks/databricks-dolly-15k) for MMLU and the BigBench-sampled dataset for TruthfulQA.

The GSM8K and MMLU datasets are under the MIT License. TruthfulQA and CNN-DailyMail are under the Apache-2.0 License. BBQ is under the CC-BY 4.0 License. These datasets are available for non-commercial research purposes.

D.3 Language Model Backbone

For discriminative tasks, we used RoBERTa-base (https://huggingface.co/FacebookAI/roberta-base) [34] as our pre-trained backbone and fine-tuned it separately on each dataset for 10 epochs to create supervised models. Our hyperparameters were a batch size of 64 and a learning rate of 1e-5.

For generative tasks, we employed Qwen-14B (https://huggingface.co/Qwen/Qwen-14B) as the backbone and applied LoRA [20] for task-specific fine-tuning, with a rank of 32, a batch size of 128, and a learning rate of 2e-4 for 3 epochs. For Qwen-72B, we use the same settings with the QLoRA technique [8].

D.4 Non-Overlapping Merging

To separate out the impact of parameter-wise interference, we design the non-overlapping experiment based on Qwen LoRA modules as follows: (1) First, we obtain standard merging experts by injecting LoRA modules into both the “w1” and “c_proj” weights of the Qwen-based model and fine-tuning them on two different tasks, resulting in two distinct models. We then combine them into a single model to obtain the standard merging results. (2) Next, we perform non-overlapping fine-tuning by injecting LoRA only into “w1” for one task and only into “c_proj” for the other, producing two models with task-specific knowledge in different modules. (3) Finally, we combine the non-overlapping checkpoints to obtain the merged results. Since task-specific knowledge is injected into separate modules, parameter-wise interference is minimized. The results are shown in the upper section of Table 3.
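The difference between the two settings can be sketched with toy per-module updates. This is an illustration only: each LoRA update is represented as a short flat list keyed by module name, the numbers are placeholders, and combining by plain summation is an assumption since the text does not specify the merge rule here.

```python
def merge_lora_deltas(deltas_a, deltas_b):
    """Sum two per-module updates and report modules that both experts modify."""
    merged, overlapping = {}, []
    for name in sorted(set(deltas_a) | set(deltas_b)):
        if name in deltas_a and name in deltas_b:
            overlapping.append(name)             # both experts write to this module
            merged[name] = [a + b for a, b in zip(deltas_a[name], deltas_b[name])]
        else:
            merged[name] = list(deltas_a.get(name, deltas_b.get(name)))
    return merged, overlapping

# (1) Standard setting: both experts inject LoRA into "w1" and "c_proj".
task_a_std = {"w1": [0.2, -0.1], "c_proj": [0.3, 0.0]}
task_b_std = {"w1": [-0.2, 0.4], "c_proj": [0.1, 0.1]}
merged_std, overlap_std = merge_lora_deltas(task_a_std, task_b_std)

# (2) Non-overlapping setting: one expert touches only "w1", the other only "c_proj".
task_a_w1 = {"w1": [0.2, -0.1]}
task_b_c_proj = {"c_proj": [0.1, 0.1]}
merged_non, overlap_non = merge_lora_deltas(task_a_w1, task_b_c_proj)
```

In the non-overlapping case every merged module is an unmodified copy of a single expert's update, so any remaining performance gap cannot come from parameter-wise interference.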

D.5 Sparsification Methods Details

In Figure 6, we conduct a comparative analysis employing various sparsification methods. The specifics of each method are outlined below:

  • Magnitude. Following the setting in Ties-Merging [59], we retain only the $k\%$ largest-magnitude values and reset the remaining values to zero.

  • Bernoulli-Dropout. Following the methodology introduced in DARE [61], we use a parameterized Bernoulli distribution to sample a sparse mask $\bm{m}^t$. This mask is then applied to the parameters $\bm{\delta}$, which are subsequently rescaled with respect to the mask rate $k$:

    $$
    \begin{gathered}
    \bm{m}^t \sim \operatorname{Bernoulli}(k), \\
    \widetilde{\bm{\delta}}^t = \bm{m}^t \odot \bm{\delta}^t, \\
    \hat{\bm{\delta}}^t = \widetilde{\bm{\delta}}^t / (1 - k).
    \end{gathered}
    \tag{6}
    $$

  • Singular value decomposition (SVD). Assume that the matrix $\mathbf{M}$ has a rank-$m$ decomposition $\mathbf{M} = \mathbf{U}_t \mathbf{\Sigma}_t \mathbf{V}_t^T$, where $\mathbf{U}_t \in \mathbb{R}^{d_{out} \times m}$, $\mathbf{\Sigma}_t \in \mathbb{R}^{m \times m}$, and $\mathbf{V}_t \in \mathbb{R}^{d_{in} \times m}$. We compress $\mathbf{M}$ by keeping only the top-$r$ singular values of $\mathbf{\Sigma}_t$, giving $\mathbf{M}_r = \mathbf{U}_t(r) \mathbf{\Sigma}_t(r) \mathbf{V}_t(r)^T$. Here, $\mathbf{U}_t(r) \in \mathbb{R}^{d_{out} \times r}$, $\mathbf{\Sigma}_t(r) \in \mathbb{R}^{r \times r}$, and $\mathbf{V}_t(r) \in \mathbb{R}^{d_{in} \times r}$ are sub-matrices of $\mathbf{U}_t$, $\mathbf{\Sigma}_t$, and $\mathbf{V}_t$. This reduces the number of task-specific parameters from $m \times (d_{out} + d_{in} + 1)$ to $r \times (d_{out} + d_{in} + 1)$. Since the maximum $m$ typically equals the hidden size of the language model (e.g., $m = 768$ for RoBERTa-base and $m = 4096$ for Qwen-14B) and $r$ can be reduced to 1, this yields a significant reduction in parameters and storage.
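The three sparsification schemes can be sketched with NumPy as follows. This is a minimal illustration, not the code used for Figure 6. Two assumptions are worth flagging: `keep_fraction` stands for $k\%$ expressed as a fraction, and the Bernoulli variant reads $k$ as the fraction of entries dropped, so each entry survives with probability $1 - k$ and the $1/(1-k)$ rescaling keeps the expected value unchanged.

```python
import numpy as np

def magnitude_sparsify(delta, keep_fraction):
    """Keep the largest-magnitude entries and zero the rest (ties may keep extra)."""
    flat = np.abs(delta).ravel()
    n_keep = max(1, int(round(keep_fraction * flat.size)))
    threshold = np.sort(flat)[-n_keep]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def bernoulli_dropout(delta, drop_rate, rng):
    """Drop each entry with probability drop_rate, then rescale the survivors."""
    keep_mask = rng.random(delta.shape) >= drop_rate
    return keep_mask * delta / (1.0 - drop_rate)

def svd_truncate(delta, r):
    """Return the top-r factors (U_r, singular values, V_r^T) of a matrix."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :r], s[:r], vt[:r, :]

delta = np.array([[1.0, -5.0], [0.5, 3.0]])      # toy task-specific update

pruned = magnitude_sparsify(delta, keep_fraction=0.5)
dropped = bernoulli_dropout(delta, drop_rate=0.5, rng=np.random.default_rng(0))
u_r, s_r, vt_r = svd_truncate(delta, r=1)

# Stored values for the rank-r factors: r * (d_out + d_in + 1).
stored = u_r.size + s_r.size + vt_r.size
```

For the 2-by-2 toy matrix with $r = 1$, `stored` is 5, matching $r \times (d_{out} + d_{in} + 1)$.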

D.6 Baselines Details

Here we will elaborate on the baselines utilized in our main comparison experiment, as outlined in Table 2 and Figure 2(b).

  • Individual means that each task uses the corresponding fine-tuned model, which has no interference between tasks but cannot perform multiple tasks simultaneously. It serves as the upper-bound performance for each specific task.

  • Weight Averaging [5, 58] is the simplest form of model merging, which straightforwardly averages the parameters of multiple models. It serves as a lower bound for model merging.

  • Task Arithmetic [21] first introduces the concept of “task vectors” and merges them into the pre-trained model to execute multi-task learning.

  • Ties-Merging [59] addresses task conflicts by eliminating redundant parameters. The process involves three steps: Trim, Elect Sign, and Disjoint Merge.

  • Task Arithmetic (w/ DARE) [61] applies the Bernoulli-Dropout technique for 70% sparsification before merging with Task Arithmetic [21].

  • Ties-Merging (w/ DARE) [61] similarly applies Bernoulli-Dropout for 70% sparsification, followed by Ties-Merging [59] for the merging process.

The coefficients for Task Arithmetic and Ties-Merging are determined by a small-scale grid search on the validation datasets. A coefficient of 0.7 is consistently applied for DARE Merging, following the previous paper [61].
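Two of these baselines can be sketched on flat parameter vectors as follows. This is a simplified illustration rather than the reference implementations: `keep_fraction` and `gamma` are assumed names for the trim ratio and the merging coefficient, and the Ties-Merging sketch follows the Trim, Elect Sign, and Disjoint Merge steps named above in their simplest form.

```python
import numpy as np

def weight_averaging(finetuned):
    """Element-wise mean of the fine-tuned parameter vectors."""
    return np.mean(np.stack(finetuned), axis=0)

def trim(tau, keep_fraction):
    """Trim: keep only the largest-magnitude entries of a task vector."""
    n_keep = max(1, int(round(keep_fraction * tau.size)))
    threshold = np.sort(np.abs(tau))[-n_keep]
    return np.where(np.abs(tau) >= threshold, tau, 0.0)

def ties_merging(theta_0, finetuned, keep_fraction, gamma=1.0):
    """Trim each task vector, elect a sign per entry, average agreeing entries."""
    taus = np.stack([trim(theta - theta_0, keep_fraction) for theta in finetuned])
    elected_sign = np.sign(taus.sum(axis=0))                  # Elect Sign
    agree = (np.sign(taus) == elected_sign) & (taus != 0)
    counts = np.maximum(agree.sum(axis=0), 1)                 # avoid division by zero
    merged_tau = (taus * agree).sum(axis=0) / counts          # Disjoint Merge
    return theta_0 + gamma * merged_tau

theta_0 = np.zeros(4)                                         # toy pretrained weights
expert_a = np.array([2.0, 0.1, -1.0, 0.0])
expert_b = np.array([1.0, -0.1, 3.0, 0.0])

averaged = weight_averaging([expert_a, expert_b])
ties = ties_merging(theta_0, [expert_a, expert_b], keep_fraction=0.5)
```

On the third entry the two experts disagree in sign: averaging shrinks it to 1.0, while the sign election keeps the dominant value 3.0.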

D.7 Detailed Results

In Table 2, we present only the average normalized scores across various tasks. In this section, we detail the statistical performance of all tasks, with discriminative results displayed in Table 8 and generative results shown in Table 9.

Table 8: Detailed statistics of different merging methods on 8 discriminative tasks. Bold numbers indicate the best average performance across different model merging methods.

| Model | COLA | SST-2 | MRPC | STS-B | QQP | QNLI | MNLI | RTE | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Pre-trained | 0.00 | 53.76 | 85.01 | 4.01 | 37.48 | 53.05 | 37.09 | 71.19 | 41.69 |
| Fine-tuned | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Weight Averaging | 0.00 | 59.21 | 85.79 | 46.99 | 45.37 | 63.94 | 48.00 | 71.19 | 52.56 |
| Task Arithmetic | 8.35 | 88.26 | 89.57 | 32.84 | 82.03 | 85.40 | 75.54 | 80.43 | 67.80 |
| Ties-Merging | 31.76 | 88.86 | 86.18 | 10.94 | 61.05 | 85.94 | 83.01 | 69.56 | 64.66 |
| Task Arithmetic (w/ DARE) | 0.00 | 88.14 | 86.61 | 30.19 | 84.33 | 79.09 | 63.95 | 77.16 | 63.68 |
| Ties-Merging (w/ DARE) | 11.82 | 95.52 | 85.75 | 9.43 | 86.77 | 88.67 | 83.13 | 63.59 | 65.58 |
| Twin-Merging (Rank-1) | 51.24 | 98.67 | 89.20 | 76.31 | 92.16 | 93.24 | 96.45 | 90.76 | 86.00 |
| Twin-Merging (90% compressed) | 101.01 | 99.88 | 99.41 | 79.89 | 99.14 | 99.67 | 96.68 | 93.47 | 96.14 |

Table 9: Detailed statistics of different merging methods on 4 generative tasks. Bold numbers indicate the best average performance across different model merging methods. Underlines indicate the second-best performance on each task across different model merging methods.

| Model | MMLU | TruthfulQA | BBQ | CNN-DailyMail | Avg. |
|---|---|---|---|---|---|
| Pretrained | 101.37 | 94.35 | 86.27 | 82.24 | 91.06 |
| Fine-tuned | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Weight Averaging | 99.63 | 92.04 | 88.01 | 103.28 | 95.74 |
| Task Arithmetic | 98.93 | 98.23 | 83.65 | 105.62 | 96.61 |
| Task Arithmetic (w/ DARE) | 99.22 | 96.90 | 88.56 | 109.40 | 98.52 |
| Ties-Merging | 99.88 | 92.04 | 89.92 | 88.83 | 92.67 |
| Ties-Merging (w/ DARE) | 101.41 | 97.66 | 86.81 | 81.80 | 91.92 |
| Twin-Merging (rank-1) | 99.40 | 95.58 | 93.46 | 115.39 | 100.96 |
| Twin-Merging (rank-16) | 99.87 | 98.23 | 97.00 | 114.43 | 102.38 |
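
As a worked check, the Avg. column is the plain mean of the per-task normalized scores. The sketch below recomputes it for two rows of Table 8; the helper name `average_score` is hypothetical, and the per-task values are copied from the table.

```python
def average_score(per_task_scores):
    """Plain mean of the per-task normalized scores."""
    return sum(per_task_scores) / len(per_task_scores)

# Rows copied from Table 8 (COLA, SST-2, MRPC, STS-B, QQP, QNLI, MNLI, RTE).
task_arithmetic_row = [8.35, 88.26, 89.57, 32.84, 82.03, 85.40, 75.54, 80.43]
twin_merging_row = [101.01, 99.88, 99.41, 79.89, 99.14, 99.67, 96.68, 93.47]

avg_task_arithmetic = round(average_score(task_arithmetic_row), 2)
avg_twin_merging = round(average_score(twin_merging_row), 2)
gap = round(avg_twin_merging - avg_task_arithmetic, 2)
```

The recomputed averages are 67.80 and 96.14, and their difference of 28.34 points appears to correspond to the absolute improvement on discriminative tasks quoted in the abstract.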

Appendix E Efficiency Analysis

Assume we have $T$ tasks and the fine-tuned model has $P = P_f + P_a$ parameters, where $P_f$ are frozen and $P_a$ are activated.

Parameter Count and Storage Cost

Assuming each float parameter uses 16 bits (either fp16 or bf16), fine-tuned models require $2(TP_a + P_f)$ bytes of storage. Pretrained models, including those merged with Weight Averaging, Task Arithmetic, Ties-Merging, and DARE Merging, each need $2P$ bytes of storage per model. For Twin-Merging, with a router of $P_r$ parameters ($P_r \ll P$) and a compression rate of $k\%$, we need to store $2TkP_a + 2P + P_r$ bytes, covering a shared expert, the compressed exclusive task-specific vectors, and the router. We can select $k$ so that the model matrices are compressed to rank $1$ for the lowest storage cost. These strategies enhance the accessibility and sustainability of task-specific models, fostering wider advancements and applications. Visual representations can be found in Figure 2(a) and Figure 4.
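
The storage expressions above can be evaluated directly. The sketch below does so with purely hypothetical sizes (eight tasks, a fully fine-tuned 125M-parameter model, a 1M-parameter router); $k$ is treated as the retained fraction, as the term $2TkP_a$ suggests, and the router term is kept exactly as written in the formula.

```python
def finetuned_storage_bytes(T, P_a, P_f):
    """T separately fine-tuned models sharing the frozen part: 2(T*P_a + P_f)."""
    return 2 * (T * P_a + P_f)

def merged_storage_bytes(P):
    """A single pretrained or statically merged model: 2P."""
    return 2 * P

def twin_merging_storage_bytes(T, k, P_a, P, P_r):
    """Shared expert, compressed exclusive vectors, router: 2*T*k*P_a + 2P + P_r."""
    return 2 * T * k * P_a + 2 * P + P_r

# Hypothetical sizes, chosen only to illustrate the arithmetic.
T, P_a, P_f = 8, 125_000_000, 0
P = P_a + P_f
P_r, k = 1_000_000, 0.1

finetuned = finetuned_storage_bytes(T, P_a, P_f)      # 2,000,000,000 bytes
merged = merged_storage_bytes(P)                      # 250,000,000 bytes
twin = twin_merging_storage_bytes(T, k, P_a, P, P_r)  # about 451,000,000 bytes
```

Under these assumed sizes, Twin-Merging stores less than a quarter of what keeping all eight fine-tuned models would require, while a single statically merged model remains the smallest option.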

Appendix F Limitations and Future Work

Our approach shares common limitations with existing merging methods: (1) The underlying theory behind why and when weight interpolation works is not fully understood, though recent works [63, 41] have made interesting observations about weight disentanglement and cross-task linearity. (2) Currently, merging is limited to models with the same architecture and it may be difficult to find a suitable fine-tuned model with specific capacities.

Additionally, while our method focuses on shared and exclusive task-specific knowledge, providing a way to approach fine-tuned model performance and potentially surpass it without additional training, we observe that there may be other types of knowledge that remain unexplored: (1) Evil knowledge: useless for any task and distracting to the model, obscuring critical knowledge during merging. (2) Irrelevant knowledge: has no impact on merging performance. Our experiments validate the existence of irrelevant knowledge, since we demonstrate that dropping 90% of the parameters retains most of the fine-tuned performance, but we have not investigated evil knowledge. Future work may include further investigating and decomposing these different types of knowledge to better unlock the model’s full potential without retraining.

Appendix G Broader Impacts

This paper presents work whose goal is to advance the field of machine learning and model merging research. In terms of positive social impact, twin-merging techniques can achieve multi-task performance of foundation models without retraining expert models, significantly reducing computational and energy costs. Our proposed knowledge modularization and compression techniques make the task-specific enhanced model more accessible and sustainable, paving the way for broader applications and advancements in the field. These techniques effectively align unaligned models by leveraging experts, thus mitigating the harmfulness and biases present in the original models. Additionally, model merging allows the unified model to benefit from the strengths of each task-specific model, even for tasks with private or inaccessible data, enhancing commercial and safety benefits. However, improper merging of biased models may contaminate the merged model. This issue can be addressed by merging a de-bias expert or using sparsity techniques to minimize the impact.