Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

Zhenyi Lu¹,² Chenghao Fan¹,² Wei Wei¹,² Xiaoye Qu¹ Dangyang Chen³ Yu Cheng⁴
¹ School of Computer Science & Technology, Huazhong University of Science and Technology,
² Joint Laboratory of HUST and Pingan Property & Casualty Research (HPL),
³ Ping An Property & Casualty Insurance Company of China, Ltd.,
⁴ The Chinese University of Hong Kong.
{luzhenyi529,facicofan}@gmail.com, {weiw, xiaoye}@hust.edu.cn,
chendangyang273@pingan.com.cn, chengyu@cse.cuhk.edu.hk Equal contribution. Corresponding authors.

Abstract

In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on $12$ datasets for both discriminative and generative tasks demonstrate the effectiveness of our method, showing an average improvement of $28.34\%$ in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks. Our implementation is available at https://github.com/LZY-the-boys/Twin-Merging

1 Introduction

In recent years, Large Language Models (LLMs) have demonstrated notable success across various Natural Language Processing (NLP) tasks [9, 49, 52], including code generation [17, 44], solving math problems [35, 2], multilingualism [38], etc. These models, with billions of parameters, excel in various downstream tasks [27, 19, 56] but require extensive training on large datasets using thousands of GPUs. The considerable computational and energy costs [43] limit their specialization and deployment in resource-constrained environments [30].

Figure 1: Subfigure (I) shows that in conventional merging methods, parameters from different task-specific models and a pre-trained model are weighted-summed into a single multitask model for inference. Subfigure (II) illustrates that our Twin-Merging method first isolates shared knowledge, then extracts exclusive knowledge by identifying differences between task experts and the shared model. This exclusive knowledge is then compressed into sparse vectors. Subfigure (III) shows that during testing, Twin-Merging dynamically merges shared and compressed specialized knowledge based on test inputs to form the final inference model.

To tackle this challenge, model fusion has emerged as a promising solution [29]. One notable paradigm is model merging [22, 26, 59, 60], where multiple task-specific models, or “experts”, are combined into a single unified model. This unified model can quickly adapt to new tasks without the need to retrain a large model. Various techniques, such as parameter averaging [5, 58], weight interpolation [37, 26], and advanced strategies like task arithmetic [22, 41, 60, 51], have been developed for model merging. These techniques have been proven effective, enabling the integration of fine-tuned knowledge from diverse tasks into a multi-task model without additional training.

However, merging models from different domains often sacrifices specific task performance, leading to a large performance gap compared to the individual experts [24, 59]. Two major causes prevent existing merging methods from reaching the theoretical upper-bound performance of individual experts: (1) Interference between models. Previous research shows that parameter redundancy and sign discrepancies [59], as well as the distribution gap between tasks [24], hinder effective model merging. We demonstrate that task-specific models often contain mixed knowledge, where the expertise in one model may be exclusive or detrimental to others. This redundancy or interference can obstruct the integration of expertise across models [7]. (2) Heterogeneity of data at test time. Previous methods pursue a single, static optimal solution for various tasks. While a one-size-fits-all model avoids introducing new parameters, it might be inadequate or suboptimal due to the unpredictable nature of test inputs [60]. It limits the utilization of complementary knowledge and leads to deteriorated performance [55].

To address the above issues, in this paper, we introduce Twin-Merging, involving two principal stages: (1) Knowledge Modularization: Unlike previous research that mitigates merging interference in a parameter-wise manner or searches for merging coefficients, we decompose the knowledge possessed by experts into shared knowledge and exclusive task-specific knowledge, as shown in Figure 1 (II). First, we compress common knowledge into a shared expert, which captures and consolidates common knowledge across varying tasks. Then we isolate exclusive knowledge based on the difference between the task experts and the shared expert, allowing diverse knowledge to be decomposed more finely. (2) Dynamic Merging: Inspired by Mixture of Experts (MoE), we simplify the parameter merging problem into a conditional composition problem. Instead of pre-determining the best parameter combination for heterogeneous data at test time, as illustrated in Figure 1 (III), we introduce a router to dynamically merge shared and exclusive knowledge based on the test inputs. The shared model serves as the foundation, and task-specific knowledge is conditionally injected according to the router.

We demonstrate the effectiveness of our proposed Twin-Merging method through extensive experiments on $12$ datasets, covering both discriminative and generative tasks, various model architectures, and in-domain and out-of-domain setups. As shown in Figure 2(b), Twin-Merging consistently outperforms other merging methods across all datasets, surpassing the strongest baseline by an average of $28.34\%$ in normalized scores for discriminative tasks and $3.86\%$ for generative tasks on the scaled model (Qwen-14B). We validate the scalability, extensibility, generalization, and storage efficiency of Twin-Merging (Figure 2(a)). Remarkably, even with a $99.9\%$ reduction in parameters, our method only experiences a slight $14\%$ performance degradation. Our results establish Twin-Merging as a powerful and effective method for combining multiple fine-tuned models into a single multi-task model.

To summarize, our contributions are as follows: (1) We introduce Twin-Merging, a novel model fusion method that reduces the performance gap between traditional model merging and fine-tuned models while enhancing adaptability to diverse data. (2) We investigate the impact of shared and exclusive task-specific knowledge on merging performance, presenting innovative techniques for knowledge disentanglement and dynamic merging. (3) Twin-Merging is simple to implement with minimal hyperparameters, improves multi-task performance without retraining expert models, and can be combined with other merging methods for further gains. Our approach scales well with model size and task numbers and is storage-efficient.

2 Related Work

In this section, we focus on model merging research; for additional related work on multi-task learning and Mixture of Experts, please see Appendix B. Model merging aims to fuse multiple fine-tuned task-specific models into one comprehensive multi-task model without additional training. FisherMerging [37] and RegMean [26] use straightforward weight averaging but require extra data and computation. Some works [54, 46, 16, 1, 47] bring models into a single low-loss basin and interpolate between them based on the linear mode connectivity (LMC) theory [11, 15, 13]. Weight permutations [1] and optimal transport [46] are utilized to better interpolate neural networks. However, recent studies [63] suggest that LMC might not always hold for fine-tuned models. Task-Arithmetic [21, 41] extends averaging to arithmetic operations in the parameter space for finer control over model behaviors, but the interference between the multiple models can be an issue. To tackle this challenge, advanced merging methods like Ties-Merging [59], AdaMerging [60] and DARE [61] have been proposed. These methods aim to reduce task conflicts by addressing parameter redundancy or disagreements in signs, finding optimal merging coefficients, and reducing weight density, respectively. Jiang et al. [25] assume that test tasks are known and use task-specific knowledge to improve performance. However, this assumption is often unrealistic since real-world data distributions are unpredictable. In contrast, our method addresses merging interference by modularizing shared and task-specific knowledge, and handles heterogeneous test data by introducing dynamic merging techniques.

3 Methodology

3.1 Analysis of the Performance Gap in Model Merging

In this paper, following the settings of model merging [22, 59, 61], we consider the case of $T$ tasks, where training for each task $t$ starts from pre-trained model weight $\bm{\theta}_{0}$ and fine-tunes on $\mathcal{D}^{train}_{t}$ to obtain the task-specific model $\bm{\theta}_{t}$. Let $f(\bm{x};\bm{\theta})$ be a language model accepting inputs $\bm{x}\in\mathcal{X}$ and parameterized by weights $\bm{\theta}\in\Theta$. Because real data distributions are diverse and hard to represent with a single task, previous methods typically model them as a mixture of the $T$ task test distributions: $\mathcal{D}=\sum_{t=1}^{T}\alpha_{t}\mathcal{D}_{t}$, where $\sum_{t=1}^{T}\alpha_{t}=1,\alpha_{t}>0\ \forall t$. Model merging considers the setting where we have $T$ fine-tuned expert models $\{f_{t}(\bm{x};\bm{\theta}_{t})\}_{t=1}^{T}$ and the pre-trained weight $\bm{\theta}_{0}$, and compose a multitask model $\bm{\theta}^{*}$ to approximate the optimal solution:

	$\displaystyle\bm{\theta}_{opt}\approx\bm{\theta}^{*}=\mathcal{F}(\bm{\theta}_{0},\bm{\theta}_{1},\cdots,\bm{\theta}_{T})$		(1)

Here $\mathcal{F}$ represents an arbitrary merging function. For example, in Task Arithmetic [21], $\bm{\theta}^{*}=\bm{\theta}_{0}+\sum_{t=1}^{T}\gamma_{t}(\bm{\theta}_{t}-\bm{\theta}_{0})$.
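As an illustration of one concrete choice of $\mathcal{F}$, the Task Arithmetic rule above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming each model is flattened into a single weight vector; the function name `task_arithmetic` and the toy values are ours and are not taken from the released implementation.

```python
import numpy as np

def task_arithmetic(theta_0, experts, gammas):
    """Return theta_0 + sum_t gamma_t * (theta_t - theta_0) for flat weight vectors."""
    merged = theta_0.copy()
    for theta_t, gamma_t in zip(experts, gammas):
        merged += gamma_t * (theta_t - theta_0)  # add the scaled task vector
    return merged

# Toy example: a 3-parameter "model" and two hypothetical experts.
theta_0 = np.array([1.0, 1.0, 1.0])
experts = [np.array([2.0, 1.0, 1.0]), np.array([1.0, 3.0, 1.0])]
merged = task_arithmetic(theta_0, experts, gammas=[0.5, 0.5])
print(merged.tolist())  # [1.5, 2.0, 1.0]
```

Each expert only moves the coordinates it changed during fine-tuning, and the coefficients $\gamma_{t}$ scale how far the merged model moves in each task direction.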

Task	Normalized Score (Equation (4))
With parameter interference
Fine-tuned	100.00
Merging	85.43
Without parameter interference
Non-overlap Fine-tuned	100.00
Non-overlap Merging	82.21 [$\downarrow 3.21$]
Similar tasks
Fine-tuned	100.00
Similar-Tasks Merging	91.58 [$\downarrow 8.42$]

Although existing merging methods, like Task Arithmetic, can combine multiple task-specific models efficiently, they often exhibit significant performance gaps compared to single-task models. Previous studies attribute this to parameter redundancy and sign discrepancies, denoted as parameter interference [59], which lead to the loss of task-specific information. Furthermore, differences between tasks can cause interference in the merged weights, denoted as task interference [24]. To investigate the causes of performance degradation, we designed two experiments using Task Arithmetic. First, we injected task-specific knowledge into non-overlapping parameter sets by fine-tuning Qwen-14B with LoRA on different modules for each task (detailed in Appendix D.4). Despite avoiding parameter interference, merging resulted in an $82.21\%$ normalized score, a drop of $3.21\%$ compared to the overlapping version. Second, we merged models fine-tuned on similar tasks (e.g., XSUM and CNN-DailyMail for summarization). This experiment yields an $8.42\%$ lower normalized score compared to the individually fine-tuned models, indicating persistent interference. In summary, our results show that interference in model merging persists even when parameter-wise and task-wise interference are removed or reduced, so it is not limited to these two issues.

3.2 Interpreting Interference From the Perspective of Knowledge

To tackle the challenge of interference, we examine the merging process from a knowledge perspective. We identify two types of critical knowledge: (1) Shared knowledge, which benefits multiple tasks, and (2) Exclusive knowledge, which is useful only for a specific task. Single-task models often contain both types, complicating the merging process and leading to interference. To validate our hypotheses, we conduct experiments that vary the ratio of task-specific and shared knowledge.

To examine the impact of shared knowledge, we conducted full fine-tuning on each model for its specific task. Excessive fine-tuning epochs can lead to catastrophic forgetting [14], a phenomenon where the model retains task-specific knowledge but loses general knowledge. As the fine-tuning epochs increase, the shared knowledge gradually decreases. The top section of Figure 3 illustrates that as the epoch count increases, merging performance significantly deteriorates, even though the fine-tuned model performs well on its task. This underscores the crucial role of shared knowledge in merging performance.

To explore the impact of exclusive knowledge, we merge a single task-specific model into the base model. We apply a sparsity method (e.g., SVD) to reduce the ratios of task-specific weights in the merging model from $100\%$ (standard merging) to $0\%$ (base model). As shown in the lower part of Figure 3, performance remains stable up to $90\%$ sparsity. Notably, even with a $99\%$ sparsity rate, a single-merged model outperforms multi-model merging, confirming the existence of exclusive knowledge, which is more pronounced with more models. This also underscores the value of unmerged task-specific knowledge, since the fine-tuning performance can be effectively restored by preserving unmerged task-specific information.

To summarize, both shared knowledge and un-merged task-specific knowledge play a vital role in merging performance. The exclusive nature of task-specific knowledge hinders the effectiveness of merging methods. Different types of knowledge need to be separated and modularized to achieve optimal performance. Thus, the first step of our Twin-Merging approach is to explicitly partition the weights into an expert containing shared knowledge and weights holding task-exclusive knowledge before merging. Formally, we denote the shared expert as $\bm{\theta}_{s}$ and the exclusive task-specific knowledge as $\{\bm{v}_{t}\}_{t=1}^{T}$. The details of our method are described in the following section.

3.3 Twin Merging

Algorithm 1 Twin-Merging

Input: language model $f(\bm{x};\bm{\theta})$, pre-trained weight $\bm{\theta}_{0}$ and $T$ task-specific fine-tuned weights $\{\bm{\theta}_{t}\}_{t=1}^{T}$, trained router $\mathcal{R}$ parameterized by a fully connected layer $\bm{\phi}$, embedding $\text{Emb}$, compression rank $r$, and pre-specified weights $\{\gamma_{t}\}_{t=1}^{T}$
Pre-calculation: ($\triangleright$ only executed once)
    Compute the shared expert $\bm{\theta}_{s}$:
        $\bm{\theta}_{s}\leftarrow\bm{\theta}_{0}+\sum_{t=1}^{T}\gamma_{t}(\bm{\theta}_{t}-\bm{\theta}_{0})$
    Extract exclusive knowledge vectors for each task-specific weight:
        $\bm{v}_{t}\leftarrow\text{SVD}_{r}(\bm{\theta}_{t}-\bm{\theta}_{s})$, for $t=1,\ldots,T$
Inference: ($\triangleright$ main loop)
    Initialize output $\bm{Y}$
    for each input $\bm{x}$ in inputs $\bm{X}$ do
        Calculate router weights:
            $[w_{1},\cdots,w_{T}]\leftarrow\text{softmax}(\mathcal{R}(\text{Emb}(\bm{x});\bm{\phi}))$
        Merge into a single expert $\bm{\theta}^{*}$:
            $\bm{\theta}^{*}\leftarrow\bm{\theta}_{s}+\sum_{t=1}^{T}w_{t}\bm{v}_{t}$
        Perform model inference to produce the output:
            $\bm{Y}\leftarrow\bm{Y}\cup f(\bm{x};\bm{\theta}^{*})$
    end for
Output: $\bm{Y}$ for input $\bm{X}$

Our proposed Twin-Merging employs two main stages: knowledge modularization and dynamic merging. These stages are designed to narrow the performance gap and enhance adaptive knowledge composition. Building on the formulation in Equation (2), Twin-Merging preprocesses experts into shared experts, isolates and compresses exclusive knowledge into vectors, and dynamically composes them during inference.

The preprocessing stage comprises three steps: (1) Shared Expert: To separate shared knowledge across different models, we consider the pre-merged model as a natural placeholder to encapsulate common knowledge that is important to all tasks (denoted as $\bm{\theta}_{s}$). By leveraging established merging techniques such as Task Arithmetic, we can readily obtain the shared expert as the initial merged model. (2) Exclusive Knowledge: To convey task-specific information while separating common knowledge, we calculate the difference vector: $\bm{v}_{t}=\bm{\theta}_{t}-\bm{\theta}_{s}$. This subtraction vector preserves un-merged task-specific information while discarding the shared knowledge. (3) Compressed exclusive vectors: For practical use and distribution, we apply singular value decomposition (SVD) to further compress the above exclusive knowledge into vectors for each task. Assuming $\bm{v}_{t}$ has a rank-$m$ decomposition, $\bm{v}_{t}=\mathbf{U}_{t}\mathbf{\Sigma}_{t}\mathbf{V}_{t}^{\top}$, we obtain a low-rank task space by keeping only the top-$r$ singular values and their singular vectors, resulting in $\mathbf{U}_{t}(r)\mathbf{\Sigma}_{t}(r)\mathbf{V}_{t}(r)^{\top}$.
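The three preprocessing steps can be sketched as follows. This is a minimal sketch under simplifying assumptions, not the released implementation: a single 2-D weight matrix stands in for the whole model, the shared expert is built with Task Arithmetic, and the helper names (`truncated_svd`, `preprocess`), the coefficients, and the matrix shapes are hypothetical.

```python
import numpy as np

def truncated_svd(matrix, rank):
    """Keep the top-`rank` singular triplets of a 2-D matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]

def preprocess(theta_0, experts, gammas, rank):
    """Build the shared expert and one compressed exclusive vector per task."""
    # (1) Shared expert via Task Arithmetic.
    theta_s = theta_0 + sum(g * (th - theta_0) for g, th in zip(gammas, experts))
    # (2) + (3) Exclusive knowledge theta_t - theta_s, stored as low-rank factors.
    factors = [truncated_svd(th - theta_s, rank) for th in experts]
    return theta_s, factors

# Toy data: one 8 x 6 weight matrix and three hypothetical experts.
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(8, 6))
experts = [theta_0 + 0.1 * rng.normal(size=(8, 6)) for _ in range(3)]
theta_s, factors = preprocess(theta_0, experts, gammas=[0.3, 0.3, 0.3], rank=2)
u, s, vt = factors[0]
print(u.shape, s.shape, vt.shape)  # (8, 2) (2,) (2, 6)
```

Storing the factors rather than the reconstructed matrix is what yields the storage saving; the dense $\bm{v}_{t}$ can be rebuilt on demand as `(u * s) @ vt`.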

In the inference stage, adapting to unforeseen challenges is difficult, especially with varied test data. For example, if most of the data consists of a certain type (denoted as $\mathcal{D}_{u}$), we should tailor the merged model to that specific task to get the best results. Instead of pre-defining the best parameters, we propose a new approach that combines shared expertise with exclusive knowledge. Our method uses the input $\bm{x}$ to dynamically adjust to the current data, enabling us to utilize shared knowledge and apply specialized expertise based on the inputs.

	$\displaystyle\bm{\theta}^{*}=\mathcal{F}(\underbrace{\bm{\theta}_{s}}_{\text{shared knowledge}},\underbrace{\bm{v}_{1},\cdots,\bm{v}_{T}}_{\text{exclusive knowledge}},\bm{x})$		(2)

For inference, we train a small fuser (the router) $\mathcal{R}$ parameterized by $\bm{\phi}$ through empirical risk minimization on a small validation dataset. This fuser is trained to dynamically select the specific task experts, replacing the need for complex optimization algorithms to determine fusion coefficients. The merged model is obtained by:

	$\displaystyle\bm{\theta}^{*}=\bm{\theta}_{s}+\sum_{t=1}^{T}w_{t}\,\text{SVD}_{r}(\bm{\theta}_{t}-\bm{\theta}_{s})$		(3)
	$\displaystyle\{w_{1},\cdots,w_{T}\}=\text{softmax}\Bigl(\mathcal{R}(\text{Emb}(\bm{x});\bm{\phi})\Bigr)$

Here, $\text{Emb}(\bm{x})$ represents the sequence of the last-layer token embeddings from the shared expert ( $f(\bm{x};\bm{\theta}_{s})$ ).
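The per-input composition can be sketched as below. This is a minimal sketch under stated assumptions: the router is a single linear layer applied to one pooled embedding vector (the text describes a sequence of last-layer token embeddings and does not specify the pooling), each exclusive vector is kept as low-rank factors `(U, S, Vt)`, and all names and shapes are illustrative rather than taken from the released code.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def dynamic_merge(theta_s, factors, embedding, phi_w, phi_b):
    """Compose theta* = theta_s + sum_t w_t * v_t for a single input."""
    weights = softmax(phi_w @ embedding + phi_b)  # one weight per task
    theta_star = theta_s.copy()
    for w_t, (u, s, vt) in zip(weights, factors):
        theta_star += w_t * ((u * s) @ vt)        # rebuild v_t from its factors
    return theta_star, weights

# Toy setup: three tasks, rank-2 exclusive vectors, 4-dimensional embedding.
rng = np.random.default_rng(0)
theta_s = rng.normal(size=(8, 6))
factors = [(rng.normal(size=(8, 2)), np.ones(2), rng.normal(size=(2, 6)))
           for _ in range(3)]
embedding = rng.normal(size=4)                       # stand-in for pooled Emb(x)
phi_w, phi_b = rng.normal(size=(3, 4)), np.zeros(3)  # untrained router, shapes only
theta_star, weights = dynamic_merge(theta_s, factors, embedding, phi_w, phi_b)
print(theta_star.shape)  # (8, 6)
```

Because the weights come from a softmax, they are positive and sum to one, so the merged model always stays within the span of the shared expert plus a convex combination of the exclusive vectors.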

4 Experiments

4.1 Merging Experiment

Baselines

We compare Twin-Merging with several training-free model-merging methods, including weight averaging, Task Arithmetic [21], Ties-Merging [59], and DARE Merging [61]. Details on these baselines are provided in Appendix D. Additionally, we include individually fine-tuned models and the pre-trained model as upper and lower bounds on performance, respectively. Performance is assessed using the average score normalized by that of the fine-tuned models, which mitigates the effects of different task-specific score ranges. The normalized score of the merged model $\bm{\theta}^{*}$ is calculated as:

	$\displaystyle\text{Normalized Score}=\frac{1}{T}\sum_{t=1}^{T}\frac{\underset{x\sim\mathcal{D}_{t}}{\operatorname{Score}}\left[f(\bm{x};\bm{\theta}^{*})\right]}{\underset{x\sim\mathcal{D}_{t}}{\operatorname{Score}}\left[f_{t}(\bm{x};\bm{\theta}_{t})\right]}$		(4)
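Equation (4) amounts to averaging per-task ratios between the merged model's score and the corresponding expert's score. The sketch below illustrates this; it assumes the per-task scores have already been computed, multiplies by 100 to match the percentages reported in the tables (the equation itself carries no such factor), and uses made-up raw scores rather than values from the paper.

```python
def normalized_score(merged_scores, expert_scores):
    """Average of per-task ratios merged/expert, reported as a percentage."""
    if len(merged_scores) != len(expert_scores):
        raise ValueError("need one merged and one expert score per task")
    ratios = [m / e for m, e in zip(merged_scores, expert_scores)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical raw scores for three tasks (not taken from the paper).
merged = [0.80, 0.45, 0.90]
expert = [0.90, 0.50, 0.88]
print(round(normalized_score(merged, expert), 2))  # 93.72
```

A merged model that beats an expert on its own task contributes a ratio above 1, which is how an average above 100 can arise.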

We evaluate our method on both discriminative and generative NLP benchmarks.

Discriminative Tasks

For discriminative tasks, following [59, 61], we use RoBERTa [34] as the backbone and evaluate on the 8-task GLUE benchmark [53]. More details are in Appendix D.2.

Generative Tasks

For our generative tasks, we use Qwen-14B [3] as the primary model to demonstrate the effectiveness of our approach on large-scale language models. To reduce deployment costs, we utilize task-specific checkpoints fine-tuned with the LoRA method [20] (See Appendix A for details on adapting Twin-Merging to LoRA). We evaluate our model on four scenarios: general knowledge (MMLU benchmark [18]), factualness (TruthfulQA [32]), safety (BBQ [42]), and summarization (CNN-DailyMail [39]). Detailed information is provided in Appendix D.2.

Table 2: Performance on 8 Discriminative Tasks (RoBERTa) and 4 Generative Tasks (Qwen-14B)

Method	8 Discriminative Tasks	4 Generative Tasks	Avg.
Pretrained	41.69	91.06	66.37
Fine-tuned	100.00	100.00	100.00
Weight Averaging	52.56	95.74	74.15
Task Arithmetic	67.80	96.61	82.20
Task Arithmetic (w/ DARE)	64.66	98.52	81.59
Ties-Merging	63.68	92.67	78.17
Ties-Merging (w/ DARE)	65.58	91.92	78.75
Twin-Merging (Best Storage)	86.00	100.96	93.48
Twin-Merging (Ours)	96.14	102.38	99.26

Main Results

Table 2 presents the results for all discriminative and generative benchmarks. A comparison of each task is illustrated in Figure 2(b) (detailed statistics are provided in Table 8 and Table 9 in Appendix D.7). Twin-Merging consistently outperforms weight averaging, Task Arithmetic, Ties-Merging, and DARE Merging, leading to significant performance gains across settings. For discriminative tasks, it approaches the upper bound of fine-tuned performance on the GLUE benchmark. Specifically, our method improves over Task Arithmetic by $28.34\%$, Ties-Merging by $32.46\%$, and DARE-Merging by $30.56\%$ in absolute normalized score. In Figure 2(b), we observe that on the CoLA task in particular, where conventional merging methods fail to improve the result, our approach still approaches the upper bound of the CoLA expert.
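The absolute gains quoted above can be recomputed directly from the discriminative column of Table 2. The short check below copies the scores from the table; the only interpretive step is matching the $30.56$ figure to a row, and by subtraction it corresponds to the Ties-Merging (w/ DARE) entry.

```python
# Discriminative (RoBERTa) normalized scores copied from Table 2.
baselines = {
    "Task Arithmetic": 67.80,
    "Ties-Merging": 63.68,
    "Ties-Merging (w/ DARE)": 65.58,
}
twin_merging = 96.14

# Absolute gain in normalized score over each baseline.
gains = {name: round(twin_merging - score, 2) for name, score in baselines.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```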

Similar to discriminative tasks, Twin-Merging achieves the best results on generative benchmarks, improving over Task Arithmetic and DARE Merging by $5.77\%$ and $3.86\%$, respectively. We observe two interesting findings: (1) The merging gains on Qwen-14B for generative tasks are lower than those on RoBERTa for discriminative tasks. We observe that pretrained RoBERTa exhibits only about half of its fine-tuned capabilities, while Qwen-14B achieves $91.06\%$ of its performance without fine-tuning. This suggests that smaller models like RoBERTa benefit more from task-specific biases, whereas large models like Qwen-14B already perform well without additional task-specific knowledge. Consequently, merging task-specific experts significantly improves RoBERTa, but has limited effect on Qwen-14B. (2) On the generative benchmark, Twin-Merging even surpasses the original upper bound of fine-tuned experts. This likely stems from the vast knowledge within Qwen-14B. Although the merged model is not specifically fine-tuned, proper knowledge modularization and dynamic merging in our method can further unlock its capabilities. This suggests a promising direction for pushing the limits of LLMs without retraining.

Table 3: Scalability of our method (72B)

Method	TruthfulQA	BBQ
Pretrained-72B	94.48	89.51
Fine-tuned	100	100
Task Arithmetic	98.70	95.40
Twin Merging	99.30	97.14

Table 4: Extensibility of our method to other model merging methods

Method	RoBERTa	Qwen
Weight Average	52.56	95.74
Twin-Merging + Weight Average	96.23	100.08
Task-Arithmetic	67.80	98.52
Twin-Merging + Task-Arithmetic	96.14	102.38
Ties-Merging	63.68	92.67
Twin-Merging + Ties-Merging	96.34	102.35

Scalability of Twin-Merging

Our method remains effective with scaled models (e.g., 72B parameters), as shown in Table 3. To manage high deployment costs, we limited our evaluation and merged experts to two tasks: BBQ and TruthfulQA. Twin-Merging consistently surpasses scaled pre-trained models and Task Arithmetic, highlighting our approach’s scalability.

Collaborating with Other Merging Method

To evaluate the compatibility of Twin-Merging with other merging methods, we conducted experiments using different techniques to create a shared expert, followed by dynamically merging the twin vectors. The results in Table 4 demonstrate that our method integrates seamlessly with primary merging techniques, leading to significant improvements. For example, when combined with our approach, the baseline Weight Average method improves from $52.56$ to $96.23$ on GLUE, approaching the performance of fine-tuned experts. Notably, our method complements Ties-Merging particularly well, suggesting that better isolation of shared knowledge enhances the overall performance of Twin-Merging.
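Swapping the base merging method only changes how the shared expert is built; the twin vectors are then taken relative to whichever shared expert results. The sketch below illustrates this with two base methods. It is a minimal sketch with hypothetical helper names and toy matrices, it leaves the twin vectors uncompressed, and it checks only an algebraic property of the formulation: with routing weights $(1,0,\ldots,0)$ the composed model equals the first expert regardless of the shared expert. It makes no claim about the empirical results in Table 4.

```python
import numpy as np

def weight_average(experts):
    """Shared expert as the plain mean of the expert weights."""
    return sum(experts) / len(experts)

def task_arithmetic(theta_0, experts, gamma):
    """Shared expert as theta_0 plus scaled task vectors (one gamma for all tasks)."""
    return theta_0 + gamma * sum(th - theta_0 for th in experts)

def twin_vectors(theta_s, experts):
    """Uncompressed exclusive vectors relative to the chosen shared expert."""
    return [th - theta_s for th in experts]

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(4, 4))
experts = [theta_0 + 0.1 * rng.normal(size=(4, 4)) for _ in range(3)]

for theta_s in (weight_average(experts), task_arithmetic(theta_0, experts, gamma=0.3)):
    v = twin_vectors(theta_s, experts)
    # With routing weights (1, 0, 0) the merged model equals expert 0 again.
    assert np.allclose(theta_s + v[0], experts[0])
```

Once the twin vectors are compressed with SVD, this recovery becomes approximate, and the quality of the shared expert determines how much information the truncated vectors must carry.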

Table 5: Performance on unseen tasks

Method	QNLI+MNLI+RTE	MMLU
Task Arithmetic	53.92	62.02
Task Arithmetic (w/ DARE)	54.27	63.09
Ties Merging	54.09	64.62
Ties Merging (w/ DARE)	54.72	63.13
Twin-Merging	55.86	65.98

Table 6: Ablation study of Twin-Merging

Task	RoBERTa	Qwen
Twin-Merging	96.14	102.38
$-$ shared expert	81.47	87.77
$-$ dynamic Merging	67.80	96.61

4.2 Unseen Generalization

As shown in Table 5, the Twin-Merging method benefits from complementary collaboration among different experts. Since the corresponding task-specific experts are unavailable, we directly use the average of the unnormalized scores as the metric. In the GLUE benchmark, when QNLI, MNLI, and RTE experts are absent, our approach still outperforms traditional baselines. Details on the expert combination for QNLI can be found in Figure 5(a). For complex tasks like MMLU, which involves multiple-choice QA tasks across 57 categories, Twin-Merging demonstrates superior performance using the combined knowledge from TruthfulQA, BBQ, and CNN-DailyMail domains.

4.3 Ablation Studies

To demonstrate the effectiveness of our modularization approach using twin vectors and the dynamic merging strategy, we conducted ablation studies for Twin-Merging, detailed in Table 6.

To assess the impact of the shared expert strategy, we replace the shared expert with a randomly chosen task-specific expert. Twin-Merging’s performance significantly degrades without the shared expert, emphasizing its importance in capturing common knowledge. Additionally, to evaluate the dynamic merging strategy, we remove the dynamic experts, leaving only a single shared expert. This leads to a consistent drop in performance, confirming that dynamically merged experts are necessary in our method.

We observe that removing dynamic experts causes a significant performance drop for RoBERTa while it is less critical than replacing the shared expert for Qwen-14B. This suggests that for smaller models like RoBERTa, task-specific biases are more important than common knowledge. In contrast, for large generative models like Qwen-14B, the extensive general knowledge within the model allows it to handle most tasks without fine-tuning. Therefore, the shared expert is more crucial for Qwen-14B than task-specific knowledge. Our approach effectively merges fine-tuned and shared experts, adapting seamlessly to both scenarios. These findings demonstrate the effectiveness of our fine-grained expert merging strategy.

4.4 Scale to More Tasks

In the left panel of Figure 4, we examine the impact of the number of tasks on model merging performance. Conventional model merging methods degrade notably, especially with many tasks, nearly reaching pre-trained levels. However, Twin-Merging consistently outperforms other methods, approaching fine-tuned performance, with greater gains as the task count rises.

The right panel of Figure 4 shows the performance-storage trade-offs. While model merging methods have a constant storage cost, their performance remains low. In contrast, maintaining individual task-specific models guarantees strong performance but requires excessive storage. Twin-Merging achieves nearly 100% normalized accuracy across various tasks, balancing performance and storage efficiency by maintaining task-specific parameters with shared experts. This makes Twin-Merging a viable solution for scenarios demanding a balance between performance and storage efficiency.

4.5 Router Analysis

Figure 5 shows the results of routing decisions among experts for the QNLI dataset and four generative benchmarks. As shown in Figure 5(a), the router maximizes the use of limited expert knowledge to address QNLI, a task where the goal is to determine if the context sentence contains the answer to the input question. For example, with only $\bm{v}_{\text{CoLA}}$ and $\bm{v}_{\text{SST-2}}$ available, the router primarily uses $\bm{v}_{\text{CoLA}}$ , which provides knowledge of sentence and word relations, while $\bm{v}_{\text{SST-2}}$ is focused on irrelevant sentiment classification. With six experts ranging from $\bm{v}_{\text{CoLA}}$ to $\bm{v}_{\text{MNLI}}$ , the router mainly leverages $\bm{v}_{\text{MNLI}}$ for textual entailment and $\bm{v}_{\text{QQP}}$ for question-answering capabilities. When $\bm{v}_{\text{QNLI}}$ is included, the router naturally relies on QNLI-specific knowledge. These results demonstrate the flexibility and adaptability of our Twin-Merging method, providing good interpretability. For larger models like Qwen-14B, as shown in Figure 5(b), the router plays a crucial role in selecting and combining specific knowledge. When experts have overlapping task-specific knowledge, such as $\bm{v}_{\text{TruthfulQA}}$ and $\bm{v}_{\text{MMLU}}$ , the router may assign them similar weights.

4.6 Compression and Speed Analysis

Compression Analysis

In the left panel of Figure 6, we explore sparsity rates from $0\%$ to $100\%$. Appendix E provides a detailed qualitative analysis of various merging methods. Remarkably, our Twin-Merging method maintains $86.4\%$ performance even at a $99.8\%$ compression rate. This suggests that performance relies on a small fraction of task-specific parameters, aligning with previous findings [59, 61]. Our results also validate our hypothesis that redundant parameters can obscure critical knowledge, leading to performance degradation. Consequently, we primarily use a $90\%$ sparsity rate in our experiments to preserve performance while reducing storage costs. We also conducted an ablation study on sparsity methods, shown on the right side of Figure 6. SVD better retains task-specific information compared to Magnitude [59] and Bernoulli Dropout [61]. As SVD is applied only once during preprocessing, it does not become an inference bottleneck.
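To give a feel for how rank relates to compression, the sketch below counts the parameters needed to store rank-$r$ SVD factors of a single matrix against its dense size. This is a back-of-the-envelope sketch under assumptions of ours: the compression rate is taken to be the fraction of dense entries saved by storing the factors (the text does not spell out its exact accounting), and the $4096\times 4096$ shape and the ranks are hypothetical.

```python
def low_rank_params(rows, cols, rank):
    """Entries needed for U (rows x rank), `rank` singular values, and V (cols x rank)."""
    return rank * (rows + cols + 1)

def compression_rate(rows, cols, rank):
    """Fraction of the dense matrix's entries that is saved by the factors."""
    return 1.0 - low_rank_params(rows, cols, rank) / (rows * cols)

# Hypothetical 4096 x 4096 weight matrix; ranks chosen for illustration only.
for rank in (1, 4, 64, 205):
    print(rank, f"{compression_rate(4096, 4096, rank):.4%}")
```

Under this accounting the stored size grows linearly with the rank, so very high compression rates correspond to keeping only a handful of singular directions per matrix.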

Table 7: Compute-performance tradeoff in the generative benchmark.

Method	Training Tokens	Training Cost	Inference Cost (/1000 items)	Performance
Multi-Task Learning	536.35M	10h32min	236s	94.31
Model Merging	0	0	236s	96.61
Twin-Merging	0.57M	183s	275s	102.38

Speed Analysis

Table 7 presents the time cost for Twin-Merging in generative benchmarks. Although the training stage uses only 0.1% of the total training budget, Twin-Merging significantly improves general capabilities compared to multi-task learning. Twin-Merging does not retrain all task experts; instead, it reuses experts (e.g., downloaded from model hubs like Huggingface [57]) and trains a small router to fuse these experts. Compared to conventional model merging methods, Twin-Merging spends a minimal router-training budget and slightly reduces inference speed, because it dynamically composes the twin vectors, in exchange for superior performance. In summary, our approach strikes a better balance between compute and performance.
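The budget figures can be checked with a few lines of arithmetic on Table 7. The values below are copied from the table; only the derived ratios are computed here.

```python
# Figures copied from Table 7.
mtl_tokens, twin_tokens = 536.35e6, 0.57e6
mtl_train_s = 10 * 3600 + 32 * 60        # 10h32min in seconds
twin_train_s = 183
merge_infer_s, twin_infer_s = 236, 275   # seconds per 1000 items

token_ratio = twin_tokens / mtl_tokens
time_ratio = twin_train_s / mtl_train_s
infer_overhead = twin_infer_s / merge_infer_s - 1

print(f"training tokens: {token_ratio:.2%} of multi-task learning")
print(f"training time:   {time_ratio:.2%} of multi-task learning")
print(f"inference time:  +{infer_overhead:.1%} vs. static merging")
```

The token ratio is where the roughly 0.1% training-budget figure comes from, while the extra inference time reflects the per-input router call and weight composition.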

5 Conclusions

In this paper, we introduce Twin-Merging, a method for merging language models that aims to close the performance gap between conventional model merging techniques and fine-tuned models while improving adaptability to data heterogeneity. By modularizing and dynamically merging shared and task-specific knowledge, Twin-Merging significantly outperforms existing model-merging methods and approaches the performance of fine-tuned models across various settings and domains. Our study highlights the impact of shared and exclusive task-specific knowledge on merging performance. We show that Twin-Merging benefits even strong, large-scale models such as Qwen-72B, which already perform well across domains. It extends to more tasks and merging methods, demonstrating better generalization on unseen data. By utilizing SVD, our solution retains $86\%$ of the performance with only $0.1\%$ of the parameters, approaching upper-bound performance with minimal storage increase as tasks grow and achieving a better tradeoff between computation and performance.

References

Ainsworth et al. [2023] Samuel Ainsworth, Jonathan Hayase, and Siddhartha Srinivasa. Git re-basin: Merging models modulo permutation symmetries. In The Eleventh International Conference on Learning Representations, 2023.
Azerbayev et al. [2024] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics, 2024.
Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023.
Chen et al. [2018] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
Choshen et al. [2022] Leshem Choshen, Elad Venezian, Noam Slonim, and Yoav Katz. Fusing finetuned models for better pretraining, 2022.
Clark et al. [2022] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057–4086. PMLR, 2022.
Dai et al. [2024] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models, 2024.
Dettmers et al. [2024] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Dong et al. [2015] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1166. URL https://aclanthology.org/P15-1166.
Draxler et al. [2018] Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In International conference on machine learning, pages 1309–1318. PMLR, 2018.
Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
Frankle et al. [2020] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR, 2020.
French [1999] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
Garipov et al. [2018] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
Gueta et al. [2023] Almog Gueta, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, and Leshem Choshen. Knowledge is a region in weight space for fine-tuned language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1350–1370, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.95. URL https://aclanthology.org/2023.findings-emnlp.95.
Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.
Hu et al. [2021] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
Ilharco et al. [2022] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2022.
Ilharco et al. [2023] Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=6t0Kwf8-jrj.
Jiang et al. [2024a] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024a.
Jiang et al. [2023] Junguang Jiang, Baixu Chen, Junwei Pan, Ximei Wang, Dapeng Liu, Mingsheng Long, et al. Forkmerge: Mitigating negative transfer in auxiliary-task learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Jiang et al. [2024b] Weisen Jiang, Baijiong Lin, Han Shi, Yu Zhang, Zhenguo Li, and James T. Kwok. Byom: Building your own multi-task model for free, 2024b.
Jin et al. [2022] Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. Dataless knowledge fusion by merging weights of language models. In The Eleventh International Conference on Learning Representations, 2022.
Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
Lepikhin et al. [2021] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021.
Li et al. [2023] Weishi Li, Yong Peng, Miao Zhang, Liang Ding, Han Hu, and Li Shen. Deep model fusion: A survey, 2023.
Liebenwein et al. [2021] Lucas Liebenwein, Cenk Baykal, Brandon Carter, David Gifford, and Daniela Rus. Lost in pruning: The effects of pruning neural networks beyond test accuracy. Proceedings of Machine Learning and Systems, 3:93–138, 2021.
Lin and Hovy [2003] Chin-Yew Lin and Eduard Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157, 2003. URL https://aclanthology.org/N03-1020.
Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
Liu et al. [2021] Bo Liu, Xingchao Liu, Xiaojie Jin, Peter Stone, and Qiang Liu. Conflict-averse gradient descent for multi-task learning. Advances in Neural Information Processing Systems, 2021.
Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. 2019.
Luo et al. [2023] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct, 2023.
Maninis et al. [2019] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1851–1860, 2019.
Matena and Raffel [2022] Michael S Matena and Colin A Raffel. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 2022.
Nakamura et al. [2024] Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolo Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, and Sampo Pyysalo. Aurora-m: The first open source multilingual language model red-teamed according to the u.s. executive order, 2024.
Nallapati et al. [2016] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016.
Navon et al. [2022] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-task learning as a bargaining game. In International Conference on Machine Learning, pages 16428–16446. PMLR, 2022.
Ortiz-Jimenez et al. [2023] Guillermo Ortiz-Jimenez, Alessandro Favero, and Pascal Frossard. Task arithmetic in the tangent space: Improved editing of pre-trained models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=0A9f2jZDGW.
Parrish et al. [2022] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. Bbq: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, 2022.
Patterson et al. [2021] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training, 2021.
Rozière et al. [2024] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024.
Sanh et al. [2022] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
Singh and Jaggi [2020] Sidak Pal Singh and Martin Jaggi. Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
Stoica et al. [2023] George Stoica, Daniel Bolya, Jakob Bjorner, Taylor Hearn, and Judy Hoffman. Zipit! merging models from different tasks without training, 2023.
Sukhbaatar et al. [2024] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, et al. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. arXiv preprint arXiv:2403.07816, 2024.
Sun et al. [2023] Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, and Guoyin Wang. Pushing the limits of chatgpt on nlp tasks, 2023.
Tang et al. [2024a] Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, and Dacheng Tao. Merging multi-task models via weight-ensembling mixture of experts, 2024a.
Tang et al. [2024b] Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, and Dacheng Tao. Parameter-efficient multi-task model fusion with partial linearization. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=iynRvVVAmH.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023.
Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018.
Wang et al. [2019] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. In International Conference on Learning Representations, 2019.
Wang et al. [2023] Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric P Xing, and Mikhail Yurochkin. Fusing models with complementary expertise. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models, 2023.
Wei et al. [2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Huggingface’s transformers: State-of-the-art natural language processing, 2020.
Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, 2022.
Yadav et al. [2023] Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. Ties-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Yang et al. [2024] Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=nZP6NgD3QY.
Yu et al. [2024] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch, 2024.
Zhou et al. [2022] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 2022.
Zhou et al. [2024] Zhanpeng Zhou, Zijun Chen, Yilan Chen, Bo Zhang, and Junchi Yan. Cross-task linearity emerges in the pretraining-finetuning paradigm, 2024.
Zoph [2022] Barret Zoph. Designing effective sparse expert models. In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1044–1044, 2022. doi: 10.1109/IPDPSW55747.2022.00171.

Appendix A Twin Merge on LoRA

Here we show that Twin-Merging applies seamlessly to LoRA modules [20], where the base model is frozen and task-specific information is injected through low-rank matrices, i.e., $\bm{\theta}_{t}=\bm{\theta}_{0}+\text{LoRA}_{t}$, where $\text{LoRA}_{t}$ denotes the fine-tuned LoRA module for the $t$-th task. Letting $\bm{\theta}_{s}=\bm{\theta}_{0}+\text{LoRA}_{s}$, we can show that Twin-Merging on $\bm{\theta}$ is equivalent to Twin-Merging on the LoRA modules:

\begin{aligned}
\bm{\theta}^{*} &= \underbrace{\bm{\theta}_{s}+\sum_{t=1}^{T}w_{t}*\text{SVD}_{r}(\bm{\theta}_{t}-\bm{\theta}_{s})}_{\text{Twin-Merging on }\bm{\theta}} \\
&= \bm{\theta}_{0}+\text{LoRA}_{s}+\sum_{t=1}^{T}w_{t}*\text{SVD}_{r}\Bigl((\bm{\theta}_{0}+\text{LoRA}_{t})-(\bm{\theta}_{0}+\text{LoRA}_{s})\Bigr) \\
&= \bm{\theta}_{0}+\underbrace{\text{LoRA}_{s}+\sum_{t=1}^{T}w_{t}*\text{SVD}_{r}(\text{LoRA}_{t}-\text{LoRA}_{s})}_{\text{Twin-Merging on LoRA}} \\
&= \bm{\theta}_{0}+\text{LoRA}^{*}
\end{aligned}

(5)

where we denote $\text{LoRA}^{*}=\text{LoRA}_{s}+\sum_{t=1}^{T}w_{t}*\text{SVD}_{r}(\text{LoRA}_{t}-\text{LoRA}_{s})$.
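The equivalence can be checked numerically on random matrices. The following is a minimal sketch, assuming dense stand-ins for the LoRA updates, arbitrary fusion weights, and a helper `svd_r` that performs rank-$r$ truncation; none of these names come from the original implementation.

```python
import numpy as np

def svd_r(mat, r):
    """Rank-r truncation of `mat` via SVD."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r, :]

rng = np.random.default_rng(0)
d_out, d_in, T, r = 6, 5, 3, 2

theta_0 = rng.normal(size=(d_out, d_in))                      # frozen base weights
lora_s = rng.normal(size=(d_out, d_in))                       # shared LoRA update (dense form)
lora_t = [rng.normal(size=(d_out, d_in)) for _ in range(T)]   # task-specific LoRA updates
w = np.array([0.5, 0.3, 0.2])                                 # hypothetical fusion weights

theta_s = theta_0 + lora_s
theta_t = [theta_0 + lora for lora in lora_t]

# Twin-Merging applied to the full parameters.
merged_full = theta_s + sum(w[t] * svd_r(theta_t[t] - theta_s, r) for t in range(T))

# Twin-Merging applied to the LoRA modules only, then added to the base.
lora_star = lora_s + sum(w[t] * svd_r(lora_t[t] - lora_s, r) for t in range(T))
merged_lora = theta_0 + lora_star

print("max abs difference:", np.abs(merged_full - merged_lora).max())
```

The two results agree up to floating-point error because $\bm{\theta}_{0}$ cancels inside every difference before the SVD is taken.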

Appendix B More Related Research

Multi-Task Learning.

Multi-task training typically learns multi-task features by simultaneously optimizing task-specific objectives, which facilitates the integration of diverse knowledge into the model. Existing works mainly focus on mitigating task conflicts [33] and catastrophic forgetting [14] through parameter sharing [36], adjusting objectives [10, 45], finding suitable task weightings [4, 40], and minimizing negative transfer [24]. As models grow larger and the number of task scenarios increases, a more cost-effective approach to multi-task learning is needed. We therefore focus on multi-task scenarios that require neither acquiring or integrating multi-task data nor additional updates to existing experts.

Mixture of Experts.

To enhance model scalability without increasing computational costs, the mixture of experts (MoE) paradigm introduces conditional routing of inputs to a subset of learnable parameters. Several efforts have extended feedforward networks (FFNs) within Transformers to incorporate MoE layers, such as GShard [28] and Switch Transformer [12]. These models typically employ learnable top-2 or top-1 routing strategies to scale MoE language models to an extremely large size [23]. Recent studies have focused on challenges such as load balancing of experts [6, 62], training instability [64], expert specialization [7, 50], and synchronization reduction [48]. However, these methods often require substantial multi-task data and costly joint training. In contrast, our approach directly reuses task-specific experts, leading to the natural specialization of experts in different domains. We only require minimal fine-tuning for a small router to calculate fusion weights, making our method highly efficient.

Appendix C The Merging Interference and Limited Generalization

To illustrate the challenge in determining the optimal merging coefficient and the limitations of pre-specified coefficients with unpredictable data, we consider COLA and SST-2 as in-domain experts. We merge them using Task Arithmetic and evaluate on the eight discriminative tasks from the GLUE benchmark. Only COLA and SST-2 are seen tasks, while the others are unseen. Since the merging coefficient is crucial for performance [60, 41], we conduct an extensive grid search for coefficients ranging from $-2$ to $2$ .
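The grid search described above can be sketched as follows. This is a toy illustration only: parameters are flat lists, `score_fn` is a hypothetical stand-in for validation performance, and the grid resolution is an assumption rather than the setting used in the experiments.

```python
import itertools

def task_arithmetic(theta_0, task_vectors, coeffs):
    """Return theta_0 + sum_t gamma_t * v_t, with parameters as flat lists."""
    merged = list(theta_0)
    for gamma, vec in zip(coeffs, task_vectors):
        merged = [m + gamma * v for m, v in zip(merged, vec)]
    return merged

def grid_search(theta_0, task_vectors, score_fn, lo=-2.0, hi=2.0, steps=9):
    """Score every coefficient combination on a uniform grid over [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    best = max(
        itertools.product(grid, repeat=len(task_vectors)),
        key=lambda c: score_fn(task_arithmetic(theta_0, task_vectors, c)),
    )
    return best, len(grid) ** len(task_vectors)

# Toy stand-in for validation accuracy: negative squared distance to a target.
theta_0 = [0.0, 0.0]
v_cola, v_sst2 = [1.0, 0.0], [0.0, 1.0]
target = [0.5, 1.0]

def score(theta):
    return -sum((a - b) ** 2 for a, b in zip(theta, target))

best, n_evals = grid_search(theta_0, [v_cola, v_sst2], score)
print("best coefficients:", best, "after", n_evals, "evaluations")
```

The number of evaluations is `steps ** T`, which is why exhaustive search becomes impractical as the number of merged experts grows.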

A large dark-blue region indicates consistent optimal performance, which is why Task Arithmetic can work with various weights. Conventional methods search this region for optimal performance across all in-domain tasks, avoiding the red region. However, this is computationally expensive and does not scale well with an increasing number of tasks. Additionally, it cannot handle unseen tasks, as the same coefficients can produce different patterns across tasks. For example, setting coefficients $\gamma_{\text{COLA}}$ and $\gamma_{\text{SST-2}}$ to $1$ leads to performance drops in MRPC and QNLI, but gains in MNLI, QQP, and RTE. (In fact, MNLI and QNLI are very similar Natural Language Inference (NLI) tasks [53].) This demonstrates that task similarity does not guarantee similar merging performance patterns.

Furthermore, merging performance is not always a single cluster. For example, within the range of $[-2,2]$ , STS-B and QNLI already show complex patterns, making it difficult to find an optimal weight for all tasks when task-specific experts are limited. Although Yang et al. [60] propose unsupervised entropy minimization to find optimal coefficients, this method is limited to classification tasks and has limited adaptability.

To address this, we propose reformulating the problem of fusing models as a supervised learning task. Specifically, we train a router to dynamically merge task-specific experts, as detailed in Section 3.3.
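A minimal sketch of input-dependent merging is given below. It assumes a single linear router followed by a softmax and flattened parameter vectors; the router here is untrained and all names are illustrative, so this shows only the forward composition, not the training procedure of Section 3.3.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def dynamic_merge(theta_s, exclusive_vectors, router_weight, x_emb):
    """Return theta_s + sum_t w_t * v_t with w = softmax(router_weight @ x_emb)."""
    w = softmax(router_weight @ x_emb)
    merged = theta_s + sum(w_t * v_t for w_t, v_t in zip(w, exclusive_vectors))
    return merged, w

rng = np.random.default_rng(0)
n_params, emb_dim, n_experts = 8, 4, 3
theta_s = rng.normal(size=n_params)                        # shared expert (flattened)
exclusive_vectors = [rng.normal(size=n_params) for _ in range(n_experts)]
router_weight = rng.normal(size=(n_experts, emb_dim))      # untrained, illustration only
x_emb = rng.normal(size=emb_dim)                           # embedding of one input

merged, w = dynamic_merge(theta_s, exclusive_vectors, router_weight, x_emb)
print("fusion weights:", np.round(w, 3))
```

Because the weights are recomputed for every input, no single set of coefficients has to fit all tasks at once.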

Appendix D Experiment Details

Here we describe our experimental settings in detail.

D.1 Compute Resources Used and Runtimes

We ran all experiments on Nvidia A100 GPUs with 80GB of memory. Training a single-task LoRA model for Qwen-14B took 1-2 hours per task on the four generative tasks, and a single-task LoRA for Qwen-72B took 10 hours on a single GPU. Training the multi-task vector on roughly 500M tokens took around 10 hours on a single GPU. The RoBERTa model needs 15 minutes per task on the GLUE datasets. Merging experiments were efficient, with evaluations taking less than 2 minutes. Inference is generally fast: within 4 minutes per 1000 items for generative tasks and less than 30 seconds per 1000 items for discriminative tasks. A detailed comparison of the training and inference costs of the different methods is given in Table 7.

D.2 Employed Datasets and Associated Licences

Discriminative Tasks.

We conduct experiments on the eight discriminative tasks of the GLUE benchmark [53], all of which are classification tasks except STS-B, which is a regression task. Details of the eight datasets can be found in Wang et al. [53]. Consistent with prior research [61], we hold out 10% of the training set as a validation set and use the original validation data as the test set.
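The split can be sketched as below. The shuffling, the seed, and the function name are assumptions for illustration; the original text specifies only the 10% proportion.

```python
import random

def split_train_validation(examples, val_fraction=0.1, seed=0):
    """Hold out a random `val_fraction` of `examples` as a validation set."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(examples) * val_fraction))
    validation = [examples[i] for i in indices[:n_val]]
    train = [examples[i] for i in indices[n_val:]]
    return train, validation

examples = [f"example-{i}" for i in range(1000)]
train, validation = split_train_validation(examples)
print(len(train), len(validation))
```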

QNLI, COLA, and STS-B are licensed under CC-BY-SA. QQP is licensed under MIT. SST-2 and MRPC are licensed under Apache 2.0. MNLI is licensed under OANC. RTE is licensed under CC BY 4.0. Thus, these datasets in GLUE are available for non-commercial research purposes.

Generative Tasks.

We conducted experiments on four benchmarks:

1. MMLU [18]: This benchmark tests general and STEM knowledge across 57 subjects, from elementary to professional levels. We used Exact-Match as the metric.

2. TruthfulQA [32]: This benchmark assesses the truthfulness of language models with 817 questions spanning 38 categories like health, law, finance, and politics. Exact-Match was used as the metric.

3. BBQ [42]: This dataset highlights social biases against protected classes in nine social dimensions relevant to U.S. English-speaking contexts. Exact-Match was the metric.

4. CNN-DailyMail [39]: This dataset is used for text summarization, requiring models to generate summaries of news stories. ROUGE-2 scores [31] were used for evaluation.
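For readers unfamiliar with the two metrics named in the list above, a simplified sketch follows. It assumes whitespace tokenization, lowercasing, and an F1 formulation of ROUGE-2; the evaluation harness used in the experiments may normalize text differently, so this is not a reproduction of its scoring.

```python
from collections import Counter

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def bigrams(text):
    """Multiset of adjacent token pairs under whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge_2_f1(prediction, reference):
    """F1 over clipped bigram overlap between prediction and reference."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match(" B ", "b"))
print(rouge_2_f1("the cat sat", "the cat ran"))
```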

We evaluated these tasks using the HELM benchmark (https://github.com/stanford-crfm/helm) in a few-shot setting.

For MMLU and TruthfulQA, which lack official training sets, we used the Dolly-15k dataset (https://huggingface.co/datasets/databricks/databricks-dolly-15k) for MMLU and the BigBench-sampled dataset for TruthfulQA.

The GSM8K and MMLU datasets are under the MIT License. TruthfulQA and CNN-DailyMail are under the Apache-2.0 License. BBQ is under the CC-BY 4.0 License. These datasets are available for non-commercial research purposes.

D.3 Language Model Backbone

For discriminative tasks, we used RoBERTa-base (https://huggingface.co/FacebookAI/roberta-base) [34] as the pre-trained backbone and fine-tuned it separately on each dataset for $10$ epochs to create supervised models, with a batch size of $64$ and a learning rate of $1\times 10^{-5}$.

For generative tasks, we employed Qwen-14B (https://huggingface.co/Qwen/Qwen-14B) as the backbone and applied LoRA [20] for task-specific fine-tuning, with a rank of $32$, a batch size of $128$, and a learning rate of $2\times 10^{-4}$ for $3$ epochs. For Qwen-72B, we use the same settings with the QLoRA technique [8].

D.4 Non-Overlapping Merging

To isolate the impact of parameter-wise interference, we design the non-overlapping experiment based on Qwen LoRA modules as follows: (1) First, we obtain standard merging experts by injecting LoRA modules into both the “w1” and “c_proj” weights of the Qwen-based model and fine-tuning them on two different tasks, resulting in two distinct models. We then combine them into a single model to obtain the standard merging results. (2) Next, we perform non-overlapping fine-tuning by injecting LoRA only into “w1” for one task and only into “c_proj” for the other, producing two models with task-specific knowledge in different modules. (3) Finally, we combine the non-overlapping checkpoints to obtain the merged results. Since task-specific knowledge is injected into separate modules, parameter-wise interference is minimized. The results are shown in the upper section of Table 3.
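The difference between the two settings can be illustrated with a toy example. Scalars stand in for per-module LoRA updates, and the merge is assumed to be a plain sum; both are simplifications chosen for illustration, and the helper names are hypothetical.

```python
def merge_lora(deltas_a, deltas_b):
    """Sum per-module updates; a module missing from one model contributes 0."""
    names = set(deltas_a) | set(deltas_b)
    return {n: deltas_a.get(n, 0.0) + deltas_b.get(n, 0.0) for n in names}

def overlapping_modules(deltas_a, deltas_b):
    """Modules updated by both models, i.e. where interference can occur."""
    return sorted(set(deltas_a) & set(deltas_b))

# (1) Standard setting: both tasks update both modules.
std_a = {"w1": 0.8, "c_proj": -0.3}
std_b = {"w1": -0.5, "c_proj": 0.6}

# (2) Non-overlapping setting: each task updates a different module.
non_a = {"w1": 0.8}
non_b = {"c_proj": 0.6}

print("standard overlap:       ", overlapping_modules(std_a, std_b))
print("non-overlapping overlap:", overlapping_modules(non_a, non_b))
print("non-overlapping merge:  ", merge_lora(non_a, non_b))
```

In the non-overlapping case each module of the merged model carries exactly one task's update, so no parameter receives conflicting contributions.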

D.5 Sparsification Methods Details

In Figure 6, we conduct a comparative analysis employing various sparsification methods. The specifics of each method are outlined below:

• Magnitude. Following the setting in Ties-Merging [59], we retain only the $k\%$ largest-magnitude values and reset the remaining values to zero.

• Bernoulli-Dropout. Following the methodology introduced in DARE [61], we sample a sparse mask $\bm{m}^{t}$ from a parameterized Bernoulli distribution. This mask is then applied to the parameters $\bm{\delta}^{t}$, which are subsequently rescaled with respect to the mask rate $k$.

\begin{gathered}\bm{m}^{t}\sim\operatorname{Bernoulli}(k),\\ \widetilde{\bm{\delta}}^{t}=(1-\bm{m}^{t})\odot\bm{\delta}^{t},\\ \hat{\bm{\delta}}^{t}=\widetilde{\bm{\delta}}^{t}/(1-k).\end{gathered}

(6)

• Singular value decomposition (SVD). Assume that the matrix $\mathbf{M}$ has a rank-$m$ decomposition $\mathbf{M}=\mathbf{U}_{t}\mathbf{\Sigma}_{t}\mathbf{V}_{t}^{T}$, where $\mathbf{U}_{t}\in\mathbb{R}^{d_{out}\times m}$, $\mathbf{\Sigma}_{t}\in\mathbb{R}^{m\times m}$, and $\mathbf{V}_{t}\in\mathbb{R}^{d_{in}\times m}$. We compress $\mathbf{M}$ by keeping only the top-$r$ singular values of $\mathbf{\Sigma}_{t}$, giving $\mathbf{M}_{r}=\mathbf{U}_{t}(r)\mathbf{\Sigma}_{t}(r)\mathbf{V}_{t}(r)^{T}$. Here, $\mathbf{U}_{t}(r)\in\mathbb{R}^{d_{out}\times r}$, $\mathbf{\Sigma}_{t}(r)\in\mathbb{R}^{r\times r}$, and $\mathbf{V}_{t}(r)\in\mathbb{R}^{d_{in}\times r}$ are sub-matrices of $\mathbf{U}_{t}$, $\mathbf{\Sigma}_{t}$, and $\mathbf{V}_{t}$. This reduces the number of task-specific parameters from $m\times(d_{out}+d_{in}+1)$ to $r\times(d_{out}+d_{in}+1)$. Since the maximum $m$ typically equals the hidden size of the language model (e.g., $m=768$ for RoBERTa-base and $m=4096$ for Qwen-14B) and $r$ can be reduced to 1, this yields a substantial reduction in parameters and storage.
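The three sparsification methods above can be sketched with NumPy as follows. The function names are illustrative, the Bernoulli variant treats the rate as the probability of dropping an entry (one reading of the $1/(1-k)$ rescaling), and the square $768\times 768$ layer in the storage example is an assumption.

```python
import numpy as np

def magnitude_prune(delta, keep_frac):
    """Keep the `keep_frac` largest-magnitude entries and zero out the rest."""
    k = max(1, int(round(keep_frac * delta.size)))
    threshold = np.sort(np.abs(delta), axis=None)[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def bernoulli_dropout(delta, drop_rate, rng):
    """Drop each entry with probability `drop_rate`; rescale survivors by 1/(1-drop_rate)."""
    drop = rng.random(delta.shape) < drop_rate
    return np.where(drop, 0.0, delta) / (1.0 - drop_rate)

def svd_truncate(delta, r):
    """Return the rank-r factors (U, singular values, V^T) of `delta`."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :r], s[:r], vt[:r, :]

def stored_values(rank, d_out, d_in):
    """Number of stored values for a rank-`rank` factorization."""
    return rank * (d_out + d_in + 1)

rng = np.random.default_rng(0)
delta = rng.normal(size=(16, 16))
pruned = magnitude_prune(delta, keep_frac=0.1)
dropped = bernoulli_dropout(delta, drop_rate=0.9, rng=rng)
u, s, vt = svd_truncate(delta, r=1)
rank1 = (u * s) @ vt

ratio = stored_values(1, 768, 768) / stored_values(768, 768, 768)
print(f"rank-1 storage relative to rank-768: {ratio:.4%}")
```

Unlike the two masking methods, the SVD factors are stored in compact form, so the saving applies to storage directly rather than depending on a sparse-matrix format.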

D.6 Baselines Details

Here we will elaborate on the baselines utilized in our main comparison experiment, as outlined in Table 2 and Figure 2(b).

• Individual means that each task uses the corresponding fine-tuned model, which has no interference between tasks but cannot perform multiple tasks simultaneously. It serves as the upper-bound performance for each specific task.

• Weight Averaging [5, 58] is the simplest form of model merging, which straightforwardly averages the parameters of multiple models. It serves as a lower bound for model merging.

• Task Arithmetic [21] first introduces the concept of “task vectors” and merges them into the pre-trained model to execute multi-task learning.

• Ties-Merging [59] addresses task conflicts by eliminating redundant parameters. The process involves three steps: Trim, Elect Sign, and Disjoint Merge.

• Task Arithmetic (w/ DARE) [61]: This variant incorporates the Bernoulli-Dropout technique for 70% sparsification before employing Task Arithmetic [21] for merging.

• Ties-Merging (w/ DARE) [61]: Similar to the previous approach, this variant integrates Bernoulli-Dropout for 70% sparsification, followed by Ties-Merging [59] for the merging process.

The coefficients for Task Arithmetic and Ties-Merging are chosen by a small-scale grid search on the validation datasets. A coefficient of 0.7 is applied consistently for DARE merging, following the previous paper [61].
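The three static merging baselines can be sketched on flattened parameter vectors as below. This is a simplified illustration: the function names, the global per-task trimming threshold, and the toy inputs are assumptions, and the DARE variants (which first apply Bernoulli-Dropout to the task vectors) are omitted.

```python
import numpy as np

def weight_averaging(thetas):
    """Element-wise mean of the model parameters."""
    return np.mean(thetas, axis=0)

def task_arithmetic(theta_0, thetas, coeff):
    """Add the scaled sum of task vectors (theta_t - theta_0) to the base model."""
    task_vectors = [theta - theta_0 for theta in thetas]
    return theta_0 + coeff * np.sum(task_vectors, axis=0)

def ties_merging(theta_0, thetas, coeff, keep_frac):
    """Trim, Elect Sign, and Disjoint Merge on flattened task vectors."""
    tvs = np.stack([theta - theta_0 for theta in thetas])      # shape (T, n)
    # Trim: keep the top `keep_frac` entries of each task vector by magnitude.
    k = max(1, int(round(keep_frac * tvs.shape[1])))
    thresh = np.sort(np.abs(tvs), axis=1)[:, -k][:, None]
    trimmed = np.where(np.abs(tvs) >= thresh, tvs, 0.0)
    # Elect Sign: the sign of the summed trimmed values decides each entry.
    elected = np.sign(trimmed.sum(axis=0))
    # Disjoint Merge: average only the nonzero values agreeing with that sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tv = (trimmed * agree).sum(axis=0) / counts
    return theta_0 + coeff * merged_tv

theta_0 = np.zeros(4)
theta_a = np.array([1.0, -2.0, 0.1, 3.0])
theta_b = np.array([2.0, 0.5, -0.1, -1.0])

avg = weight_averaging([theta_a, theta_b])
ta = task_arithmetic(theta_0, [theta_a, theta_b], coeff=0.5)
ties = ties_merging(theta_0, [theta_a, theta_b], coeff=1.0, keep_frac=0.5)
print(avg, ta, ties)
```

With $T$ models, Task Arithmetic with coefficient $1/T$ reduces to Weight Averaging, which the toy inputs above confirm for $T=2$.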

D.7 Detailed Results

In Table 2, we present only the average normalized scores across tasks. In this section, we report the per-task results, with discriminative results in Table 8 and generative results in Table 9.

Table 8: Detailed per-task performance of different merging methods on 8 discriminative tasks. Bold numbers indicate the best average performance across model merging methods.

Model	CoLA	SST-2	MRPC	STS-B	QQP	QNLI	MNLI	RTE	Avg.
Pre-trained	0.00	53.76	85.01	4.01	37.48	53.05	37.09	71.19	41.69
Fine-tuned	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00	100.00
Weight Averaging	0.00	59.21	85.79	46.99	45.37	63.94	48.00	71.19	52.56
Task Arithmetic	8.35	88.26	89.57	32.84	82.03	85.40	75.54	80.43	67.80
Ties-Merging	31.76	88.86	86.18	10.94	61.05	85.94	83.01	69.56	64.66
Task Arithmetic (w/ DARE)	0.00	88.14	86.61	30.19	84.33	79.09	63.95	77.16	63.68
Ties-Merging (w/ DARE)	11.82	95.52	85.75	9.43	86.77	88.67	83.13	63.59	65.58
Twin-Merging (Rank-1)	51.24	98.67	89.20	76.31	92.16	93.24	96.45	90.76	86.00
Twin-Merging (90% compressed)	101.01	99.88	99.41	79.89	99.14	99.67	96.68	93.47	96.14

Table 9: Detailed per-task performance of different merging methods on 4 generative tasks. Bold numbers indicate the best average performance across model merging methods. Underlines indicate the second-best performance on each task across model merging methods.

Model	MMLU	TruthfulQA	BBQ	CNN-DailyMail	Avg.
Pre-trained	101.37	94.35	86.27	82.24	91.06
Fine-tuned	100.00	100.00	100.00	100.00	100.00
Weight Averaging	99.63	92.04	88.01	103.28	95.74
Task Arithmetic	98.93	98.23	83.65	105.62	96.61
Task Arithmetic (w/ DARE)	99.22	96.90	88.56	109.40	98.52
Ties-Merging	99.88	92.04	89.92	88.83	92.67
Ties-Merging (w/ DARE)	101.41	97.66	86.81	81.80	91.92
Twin-Merging (rank-1)	99.40	95.58	93.46	115.39	100.96
Twin-Merging (rank-16)	99.87	98.23	97.00	114.43	102.38

Appendix E Efficiency Analysis

Assume we have $T$ tasks and that the fine-tuned model has $P=P_{f}+P_{a}$ parameters, where $P_{f}$ are frozen and $P_{a}$ are activated.

Parameter Count and Storage Cost

Assuming each float parameter uses 16 bits (either fp16 or bf16): storing all fine-tuned models requires $2(TP_{a}+P_{f})$ bytes. A pre-trained model, as well as a single model merged with Weight Averaging, Task Arithmetic, Ties-Merging, or DARE Merging, needs $2P$ bytes. For Twin-Merging, with a router of $P_{r}$ parameters ($P_{r}\ll P$) and a compression rate of $k\%$, we need to store $2TkP_{a}+2P+P_{r}$ bytes, covering the shared expert, the compressed exclusive task-specific vectors, and the router. We can select $k$ so that each matrix is compressed to rank $1$ for the lowest storage cost. These strategies enhance the accessibility and sustainability of task-specific models, fostering wider advancements and applications. Visual representations can be found in Figure 2(a) and Figure 4.
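As a worked illustration of these storage expressions, the sketch below plugs in hypothetical parameter counts. The function names and all numbers are assumptions chosen only to make the arithmetic visible; $k$ is treated here as the fraction of activated parameters retained after compression, and the router term is used exactly as written in the text.

```python
def finetuned_bytes(T, P_a, P_f):
    """Storage for T fine-tuned models sharing the frozen part: 2(T*P_a + P_f)."""
    return 2 * (T * P_a + P_f)

def merged_bytes(P):
    """Storage for a single pre-trained or merged model: 2P."""
    return 2 * P

def twin_merging_bytes(T, P_a, P_f, P_r, k):
    """Storage for Twin-Merging: 2*T*k*P_a + 2P + P_r (router term as in the text)."""
    P = P_a + P_f
    return 2 * T * k * P_a + 2 * P + P_r

# Hypothetical setting: 8 tasks, 100M activated and 25M frozen parameters,
# a 1M-parameter router, and 10% of each task-specific vector retained.
T, P_a, P_f, P_r, k = 8, 100_000_000, 25_000_000, 1_000_000, 0.1

individual = finetuned_bytes(T, P_a, P_f)
single = merged_bytes(P_a + P_f)
twin = twin_merging_bytes(T, P_a, P_f, P_r, k)

for name, size in [("fine-tuned", individual), ("merged", single), ("twin", twin)]:
    print(f"{name}: {size / 1e6:.0f} MB")
```

Under these assumed numbers, Twin-Merging sits between a single merged model and the full set of fine-tuned models, and its overhead over a single model grows linearly in $T$ and $k$.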

Appendix F Limitations and Future Work

Our approach shares common limitations with existing merging methods: (1) The underlying theory behind why and when weight interpolation works is not fully understood, though recent works [63, 41] have made interesting observations about weight disentanglement and cross-task linearity. (2) Currently, merging is limited to models with the same architecture and it may be difficult to find a suitable fine-tuned model with specific capacities.

Additionally, while our method focuses on shared and exclusive task-specific knowledge, providing a way to approach fine-tuned model performance and potentially surpass it without additional training, we observe there may be other types of knowledge that remain unexplored: (1) Evil knowledge: Useless for any task and distracts the model, obscuring critical knowledge during merging. (2) Irrelevant knowledge: Has no impact on merging performance. Our experiments validate the existence of the irrelevant knowledge since we demonstrate that dropping $90\%$ of parameters retains most of the fine-tuned performance, but we have not investigated evil knowledge. Future work may include further investigation and decomposing these different types of knowledge to better ignite the model’s full potential without retraining.

Appendix G Broader Impacts

This paper presents work whose goal is to advance the field of machine learning and model merging research. In terms of positive social impact, Twin-Merging techniques can achieve multi-task performance of foundation models without retraining expert models, significantly reducing computational and energy costs. Our proposed knowledge modularization and compression techniques make the task-specific enhanced model more accessible and sustainable, paving the way for broader applications and advancements in the field. These techniques can help align unaligned models by leveraging experts, thus mitigating the harmfulness and biases present in the original models. Additionally, model merging allows the unified model to benefit from the strengths of each task-specific model, even for tasks with private or inaccessible data, enhancing commercial and safety benefits. However, improper merging of biased models may contaminate the merged model. This issue can be addressed by merging a de-bias expert or using sparsity techniques to minimize the impact.