Visual Perception by Large Language Model’s Weights
Abstract
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.
1 Introduction
Large language models (LLMs) [54, 61, 44] have achieved promising performance on most natural language tasks and have shown great generalization ability in solving real-world problems. Derived from LLMs, multimodal large language models (MLLMs) [34, 62, 4, 59, 52, 45] take a step toward artificial general intelligence (AGI) by perceiving visual information from the real world. Therefore, the way of perceiving visual information is the key to moving from LLM to MLLM.
To perceive visual information, recent MLLMs follow an input space alignment paradigm that aligns visual features with the input space of LLM and concatenates visual tokens with text tokens to form a unified sequence as input for LLM. For instance, LLaVA [34] uses CLIP-ViT-L-14 [47] as the visual encoder and introduces a linear projector to align the visual tokens with the input space of LLM. Monkey [29] divides input images into uniform patches and equips individual adapters for each patch to handle high-resolution images. Recent work [53] also identifies the visual shortcomings of CLIP for MLLMs as “CLIP-blind pairs” and integrates vision self-supervised learning features with MLLM to address this issue. DeepSeek-VL [39] and Sphinx [30] also adopt hybrid vision encoders. Vary [55] identifies that a fixed vision vocabulary limits the dense and fine-grained visual perception and introduces a new vocabulary to address this issue.
Despite these efforts to advance MLLM in visual perception, the paradigm of input space alignment remains unchanged, which can result in computational inefficiency for both training and inference. The computational cost of MLLM is concentrated on the attention mechanism of LLM, which is $O(n^2)$ when the length of the input sequence is $n$. Using ViT-L-14 as the vision encoder, a 224×224 low-resolution image results in 256 visual tokens, and the length increases to 576 when the resolution rises slightly to 336×336. Considering high-resolution images, some works [30, 33, 29, 11] split an image into multiple sub-images for capturing fine-grained information, leading to a significantly higher number of visual tokens. For instance, Sphinx-2k [30] adopts 2,890 visual tokens, while InternLM-Xcomposer2-4KHD [11] even uses up to 8,737 visual tokens. Concatenating such a long sequence of visual tokens to text tokens results in a dramatic increase in computational overhead for both training and inference. Specifically, current MLLMs are usually pre-trained on web-crawled image-text pairs, which typically have very short texts, with an average word count of 10.95 for LAION-2B [48] and 8.99 for LAION-COCO [1]. As a result, the number of visual tokens during the pre-training stage is about 20 to 50 times the number of text tokens, which suggests that the involvement of visual tokens seriously affects the efficiency of pre-training. Some works [25, 9, 22] employ resamplers to reduce the number of visual tokens to a fixed count but still follow the input space alignment paradigm and introduce extra visual tokens for LLMs.
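The visual token counts quoted for ViT-L-14 follow from the patch grid of the encoder. The following is a minimal sketch, assuming square images, 14×14 patches (the "14" in ViT-L-14), and no class token; the function name is illustrative and not taken from any library.

```python
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one side of the image
    return side * side


for size in (224, 336):
    # 224 -> 16 x 16 patches, 336 -> 24 x 24 patches
    print(size, num_visual_tokens(size))
```

Because the count grows with the square of the resolution, and attention cost grows with the square of the sequence length, a modest increase in resolution is amplified twice before it reaches the LLM.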
To address this issue, we explore a novel parameter space alignment paradigm where visual information is represented as LLM’s weights. As shown in Fig. 1, for an input image, we use a vision encoder to extract visual features. Then, the visual features are converted to perceptual weights, which represent visual information as model weights. The perceptual weights can be directly merged with LLM’s weights. Thus, the visual information is merged into LLM in the form of weights, eliminating the need for visual tokens in the LLM’s input and significantly improving efficiency. Building on this paradigm, we introduce VLoRA, which contains the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights. Because LLMs usually contain a large number of parameters, perceptual weights are designed with a low-rank property for feasibility and efficiency. Thus, the generated perceptual weights take a form similar to LoRA weights.
Our contributions are summarised as follows:
1. We explore a novel paradigm for MLLMs that aligns visual features with the parameter space of LLMs, which greatly improves the efficiency of MLLMs.
2. Based on this paradigm, we propose VLoRA and design the perceptual weights generator that generates low-rank perceptual weights.
3. Experimental results demonstrate the effectiveness and efficiency of our approach. We obtain results comparable to those of state-of-the-art MLLMs on various benchmarks, including MMBench, ScienceQA, HallusionBench, and MMMU.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x1.png)
2 Related Works
Multimodal Large Language Models. Current MLLMs are developed from LLMs by aligning visual features into the input space of LLMs. Many efforts have been made to explore introducing visual perception capability for LLMs. LLaVA [34] connects the visual encoder of CLIP to Vicuna [61] with a linear projector. Further research that follows this paradigm focuses on improving MLLMs from the perspective of the vision encoder and projector. DeepSeek-VL [39] uses SigLip [58] to extract high-level semantic features and SAM-B [20] to process low-level features. Tong et al. [53] find that visually distinct images can be encoded as similar due to the shortcomings of CLIP and integrate vision self-supervised learning features with CLIP features. Sphinx [30] ensembles various vision backbones that have different architectures, pre-training paradigms, and information granularities. These works input the entire visual token sequence into the LLM, which can lead to a high computational cost during training and inference. Specifically, LLaVA [32] and DeepSeek-VL [39] utilize 576 visual tokens, Sphinx-2k [30] employs 2,890 visual tokens, and InternLM-XComposer2-4KHD [11] uses up to 8,737 tokens. Some works consider adopting a cross-attention architecture as the projector to improve efficiency. MiniGPT4-v1 [62] and the BLIP series [25, 9] adopt Q-Former as the projector, which reduces the length of visual tokens to a fixed number of 64. Qwen-VL [5] uses a single-layer cross-attention module incorporated with 2D absolute positional encodings to avoid the potential loss of positional details. However, these improvements still follow the paradigm of aligning visual features to the input space of LLM, introducing extra computational overhead on LLM inference. Different from previous work, our VLoRA aligns visual features with the parameter space of LLM. The visual information can be represented as perceptual weights in LoRA format and merged into LLM’s weights during inference.
Parameter-Efficient Fine-Tuning. Parameter-efficient fine-tuning (PEFT) is a key technique for fine-tuning large pre-trained models, including LLMs and MLLMs. PEFT methods freeze the backbone and only fine-tune a small number of parameters, and can typically be categorized into three classes: adapters [16, 46, 51, 60], prefix-tuning [27, 24, 36], and Low-Rank Adaptation (LoRA) [17, 35, 10]. In the field of language models, Houlsby et al. [16] design bottleneck adapters and insert two adapters into the transformer layers, one after the attention module and one after the feed-forward network. Prefix-tuning [27] prepends a set of learnable prefix vectors to the keys and values of the self-attention module in every layer. Prompt-tuning proposes to only prepend learnable vectors to the input prompt, with no intermediate-layer prefixes. LoRA [17] uses learnable low-rank matrices to approximate the backbone’s weight updates, and the low-rank matrices can be merged with the backbone during inference without extra inference burden. Considering the pre-training stage, current MLLMs usually freeze the unimodal backbones and project visual tokens through a learnable projector, then prepend the visual tokens to the input sequence of LLMs, which can be seen as a prefix-tuning method. Our VLoRA is closer to the style of LoRA. Specifically, VLoRA generates low-rank perceptual weights, which can be seen as a generated visual parameter matrix multiplied by a learnable matrix. Similar to LoRA, the perceptual weights can be injected into LLMs’ weights without introducing extra inference overhead.
3 Method
3.1 Preliminaries
In this subsection, we review the details of the decoder block in the current LLM. As shown in Fig. 2, the decoder block of LLM contains a self-attention module and a feed-forward network.
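Self-attention. As shown in Fig. 2 (b), the self-attention module contains four types of linear layers: query $W_q \in \mathbb{R}^{h \times d}$, key $W_k \in \mathbb{R}^{h \times d}$, value $W_v \in \mathbb{R}^{h \times d}$, and output $W_o \in \mathbb{R}^{h \times h}$. Here, $h$ represents the dimension of the hidden states of the LLM, and $d$ represents the dimension of each attention head. Each input token $x_i$ in the input sequence $X = (x_1, \dots, x_n)$ is multiplied by the linear layers $W_q$, $W_k$, and $W_v$, obtaining $q_i$, $k_i$, and $v_i$. Then, the attention operation is executed along the sequence dimension as follows, where $Q$, $K$, and $V$ stack the $q_i$, $k_i$, and $v_i$ of all tokens:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V \tag{1}$$
The self-attention mechanism is performed on each head, and the outputs from different heads are concatenated and multiplied by the output linear layer with weights $W_o$.
Feed-forward Network. As shown in Fig. 2 (c), the feed-forward network is an MLP with two fully connected layers and a non-linear activation function. The formulation can be written as follows:
$$\mathrm{FFN}(x) = \sigma(x W_1)\, W_2 \tag{2}$$
where $x$ is the input token, $\sigma$ is the activation function, and $W_1$ and $W_2$ are the weights of the two fully connected layers. To summarize, the decoder block of the LLM has five types of weights, including $W_q$, $W_k$, $W_v$, $W_o$ from the self-attention module, and $W_1$, $W_2$ from the feed-forward network, with the two feed-forward matrices counted together as the MLP type.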
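The two modules reviewed above can be sketched in a few lines. This is a minimal, dependency-free illustration on plain Python lists, assuming a single attention head, no causal mask, no normalization or residual connections, and ReLU as a stand-in activation; all function names are illustrative.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: list[float]) -> list[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention over a sequence x of shape (n, h)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(w_q[0])  # head dimension
    scores = matmul(q, transpose(k))  # (n, n): this is the quadratic term
    probs = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(probs, v)


def ffn(x: Matrix, w_1: Matrix, w_2: Matrix) -> Matrix:
    """Two fully connected layers with ReLU in between (stand-in activation)."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w_1)]
    return matmul(hidden, w_2)


# Toy usage: two tokens, hidden size 2, identity weights.
eye = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
attn_out = attention(tokens, eye, eye, eye)
ffn_out = ffn(tokens, eye, eye)
```

The `scores` matrix has one entry per pair of tokens, which is why every visual token appended to the input enlarges the cost of every layer.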
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x2.png)
3.2 Visual Perception by LLM’s Weights
Previous MLLMs follow the paradigm of aligning the visual features with the input space of LLM and require additional visual tokens as LLM’s input, which can lead to computational inefficiency. This inefficiency becomes more pronounced when encountering high-resolution or multiple images as the number of tokens increases drastically. To address this issue, we propose to align visual features with LLM’s parameter space without introducing extra tokens into LLM’s input.
To achieve this goal, we represent the visual information of the input image as perceptual weights and integrate them into the weights of the LLM. This approach allows the LLM to perceive visual information without introducing extra tokens into the input. As mentioned in Sect. 3.1, the LLM’s decoder blocks have five types of weights. We use $W \in \mathbb{R}^{h \times h}$ to denote a weight matrix of the LLM. For an input image $I$, we first adopt a vision encoder $f(\cdot)$ to extract the visual features $z = f(I)$, where $z \in \mathbb{R}^{c \times d_v}$, $c$ is the number of visual tokens, and $d_v$ is the dimension of visual features. Then, we design a perceptual weights generator $g(\cdot)$ to convert the visual features to perceptual weights $\Delta W = g(z) \in \mathbb{R}^{h \times h}$. It is worth noting that, since we want the LLM to perceive visual information while preserving its language capabilities, $\Delta W$ is a low-rank matrix, which also helps to reduce the computation cost of the perceptual weights generator. With the generated perceptual weights $\Delta W$, we can directly merge them into the LLM’s weights as:
$$\hat{W} = W + \Delta W \tag{3}$$
By integrating the weights transferred from the visual features into the LLM’s weights, the LLM is naturally equipped with visual perception ability. After merging the weights, no extra inference burden is introduced for the LLM. For any weights $W$ in each decoder block of the LLM, we can generate the corresponding perceptual weights $\Delta W$ and integrate them into the LLM’s weights.
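The merge step can be illustrated on a toy example. The sketch below assumes tiny 2×2 matrices with made-up values and uses plain Python lists; `delta_w` stands in for the perceptual weights of one image and is purely hypothetical. It checks that applying the merged weights gives the same output as keeping the perceptual weights as a separate branch, while the input still contains only text tokens.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def merge(w: Matrix, delta_w: Matrix) -> Matrix:
    """Merged weights: the LLM weight plus the perceptual weight."""
    return add(w, delta_w)


w = [[1.0, 2.0], [3.0, 4.0]]           # a frozen LLM weight (toy values)
delta_w = [[0.5, 0.0], [0.0, -0.5]]    # hypothetical perceptual weights for one image
text_tokens = [[1.0, 1.0], [2.0, 0.0]]  # two text tokens, no visual tokens

merged_out = matmul(text_tokens, merge(w, delta_w))
branch_out = add(matmul(text_tokens, w), matmul(text_tokens, delta_w))
print(merged_out)  # same values as branch_out
```

The sequence length seen by the layer is the number of text tokens in both cases; the image only changes the weights.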
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x3.png)
3.3 Perceptual Weights Generator
To convert visual features $z$ to perceptual weights $\Delta W$, we propose the perceptual weights generator. Since each layer and each type of weight in the LLM focus on different visual information, our perceptual weights generator needs to be able to flexibly generate weights corresponding to each of the LLM weights.
Inspired by DETR [6] and BLIP-2 [25], we design the perceptual weights generator as a decoder-only architecture with cross-attention layers to generate $\Delta W$. As shown in Fig. 3 (a), the perceptual weights generator contains $N$ blocks, each comprising a self-attention module, a cross-attention module, and a feed-forward network. The hidden states dimension of the perceptual weights generator is $h_p$, where $h_p \ll h$. We set $k$ learnable perceptual queries, corresponding to the number of decoder blocks where we want to insert perceptual weights. For each block, the perceptual queries first pass through the self-attention module, then interact with the visual features in the cross-attention module, and finally go through a feed-forward network. After $N$ blocks, we obtain $k$ features $P \in \mathbb{R}^{h_p}$. The features should be mapped to the target shape of the perceptual weights, $h \times h$. However, due to $h_p \ll h$, directly mapping the dimensions of $P$ from $h_p$ to $h \times h$ with a linear layer would introduce a large number of parameters, dramatically reducing the feasibility. Therefore, we consider introducing the low-rank property in this process. We adopt a shared linear layer $W_s \in \mathbb{R}^{h_p \times hr}$ to map all features $P$ from $h_p$ to $h \times r$ as follows:
$$\Delta W_A = P\, W_s \tag{4}$$
where $r$ is the rank of the perceptual weights and $\Delta W_A$ is the visual parameter.
We then reshape the output as $\Delta W_A \in \mathbb{R}^{h \times r}$. When ascending to the target dimension $h \times h$, $k$ independent linear layers $W_B \in \mathbb{R}^{r \times h}$ are used, one for each visual parameter $\Delta W_A$, to obtain $k$ perceptual weights $\Delta W$. This process can be formulated as follows:
$$\Delta W = \Delta W_A\, W_B \tag{5}$$
Substituting Eq. 5 into Eq. 3, we get:
$$\hat{W} = W + \Delta W = W + \Delta W_A\, W_B \tag{6}$$
Considering the low-rank property of $\Delta W_A$ and $W_B$, we can observe that Eq. 6 and LoRA [17] are of the same form, where $\Delta W_A$ corresponds to $A$ and $W_B$ corresponds to $B$. As illustrated in Fig. 3 (b), our perceptual weights generator can be seen as a “LoRA weights generator” from the perspective of LoRA. This is because it generates $\Delta W_A$ and $W_B$ for the weights $W$ of the LLM. Our perceptual weights generator generates one type of perceptual weights for $k$ decoder blocks at a time. For generating multiple types of weights, we employ multiple perceptual weights generators.
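The output mapping of the generator can be sketched as follows. This is a dependency-free illustration under stated assumptions: the query features are given as random numbers rather than produced by real attention blocks, dimensions are tiny, and all names (`perceptual_weights`, `param_counts`, `rand_matrix`) are hypothetical. It shows the shared down-projection, the reshape, and the per-query up-projection, and compares the parameter count of a direct mapping with the low-rank one.

```python
import random

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reshape(flat: list[float], rows: int, cols: int) -> Matrix:
    return [flat[i * cols:(i + 1) * cols] for i in range(rows)]


def perceptual_weights(p: Matrix, w_share: Matrix, w_b_list: list[Matrix],
                       h: int, r: int) -> list[Matrix]:
    """Map k query features of shape (k, h_p) to k matrices of shape (h, h)."""
    flat = matmul(p, w_share)  # (k, h * r): shared linear layer
    out = []
    for row, w_b in zip(flat, w_b_list):
        delta_w_a = reshape(row, h, r)       # (h, r): generated visual parameter
        out.append(matmul(delta_w_a, w_b))   # (h, h): rank at most r
    return out


def param_counts(h: int, h_p: int, r: int, k: int) -> tuple[int, int]:
    """Parameters of a direct h_p -> h*h map versus the low-rank mapping."""
    direct = h_p * h * h
    low_rank = h_p * h * r + k * r * h  # shared layer plus k independent layers
    return direct, low_rank


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
k, h_p, h, r = 2, 2, 3, 1  # toy sizes
delta_ws = perceptual_weights(
    rand_matrix(k, h_p),                     # stand-in for the query features
    rand_matrix(h_p, h * r),                 # shared linear layer
    [rand_matrix(r, h) for _ in range(k)],   # one up-projection per query
    h, r,
)

# Settings reported in the experiments; h = 4096 is Vicuna-7B's hidden size.
print(param_counts(h=4096, h_p=512, r=64, k=8))
```

Under these settings the direct mapping would need about 8.6 billion parameters for a single weight type, against roughly 136 million for the low-rank mapping, which is the feasibility argument made above.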
3.4 Analysis of the Computational Cost
By not introducing additional visual tokens in the input of the LLM, our VLoRA achieves higher computational efficiency for both training and inference. We only consider the computational cost of the LLM, as the computational overhead of our perceptual weights generator is negligible in comparison. We assume the LLM has $L$ blocks and a hidden states dimension of $h$, the input text length is $T$, and the number of visual tokens is $c$. For convenience, we only consider the computational cost of the self-attention module and the feed-forward network in the LLM. For a sequence of length $n$, the FLOPs of the self-attention module and the feed-forward network are $8nh^2 + 4n^2h$ and $16nh^2$, respectively. For previous MLLMs that align visual features to the input space of the LLM, $n = T + c$, and the FLOPs of the LLM are $L\,(24(T+c)h^2 + 4(T+c)^2h)$. For our VLoRA, $n = T$, and the extra computational cost occurs in Eq. 6, where $\Delta W_A$ is multiplied with $W_B$. We assume that perceptual weights are generated for all five types of weights in the decoder blocks. During training, we do not merge the perceptual weights with the LLM weights but use them as branches of the LLM weights; for inference, the perceptual weights can be merged into the LLM. The resulting FLOPs for both cases, together with the details of the calculation, are given in Appendix A. There is a small increase in the overhead of training compared to inference, and we compare by the training FLOPs. In Fig. 4, we compare the FLOPs of LLaVA and VLoRA. Our approach does not introduce additional computation as the number of visual tokens increases, and our FLOPs are only 8% of LLaVA-v1.5’s when the text length is 32.
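The cost of the input-space paradigm can be estimated with a short calculation. The sketch below assumes the common dense-transformer estimate of $8nh^2 + 4n^2h$ FLOPs for self-attention and $16nh^2$ for a feed-forward network with a $4h$ intermediate size, and assumes 32 blocks with hidden size 4096 (Vicuna-7B's configuration). It models only the LLM on a sequence of text plus visual tokens; it does not model VLoRA's own perceptual-weight overhead.

```python
def llm_flops(seq_len: int, hidden: int = 4096, blocks: int = 32) -> int:
    """Estimated FLOPs of the attention and feed-forward modules of an LLM."""
    attn = 8 * seq_len * hidden**2 + 4 * seq_len**2 * hidden
    ffn = 16 * seq_len * hidden**2
    return blocks * (attn + ffn)


TEXT_LEN = 32  # text length used for the comparison in the paper

for num_visual_tokens in (32, 256, 576):
    gflops = llm_flops(TEXT_LEN + num_visual_tokens) / 1e9
    print(num_visual_tokens, round(gflops, 1))
```

With a text length of 32, this estimate lands within 1 GFLOP of the GFLOPs column of Tab. 1 for the models with 32, 256, and 576 visual tokens. Dividing the two Tab. 1 entries for VLoRA and LLaVA-v1.5, 619 / 8027 ≈ 0.077, is consistent with the 8% figure quoted above.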
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x4.png)
4 Experiments
4.1 Implementation Details
Model Settings. We use Vicuna-7b-v1.5 [61] as our foundational LLM and CLIP-ViT-L-14 [47] as the vision encoder. The perceptual weights generator is initialized randomly. For the perceptual weights generator, we set the hidden size to 512 and the number of blocks to 8. The rank of the perceptual weights is 64. The number of perceptual queries is 8, which means that we insert perceptual weights in only 8 blocks; in the implementation, for Vicuna-7b-v1.5 with 32 blocks, we insert them every 4 blocks. For better visual perception ability, we insert perceptual weights for all five types of weights in the LLM. It is worth noting that the last linear layers of the perceptual weights generator are zero-initialized, as they are equivalent to the $B$ matrix of LoRA weights, which is initialized to zero for training stability.
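Two consequences of these settings can be checked with a small sketch. The block offset is an assumption: the text only says perceptual weights are inserted every 4 blocks, so starting at block 0 is hypothetical. The matrices are toy-sized and the generated values are arbitrary.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


NUM_LLM_BLOCKS = 32
NUM_QUERIES = 8
stride = NUM_LLM_BLOCKS // NUM_QUERIES  # 4

# Assumed offset 0; the source does not state which block is first.
target_blocks = list(range(0, NUM_LLM_BLOCKS, stride))
print(target_blocks)

# Zero-initialized last linear layer: whatever the generator produces,
# the perceptual weights start at zero, so the merged weights equal the
# original LLM weights at the first training step.
generated_a = [[0.3, -1.2], [0.7, 0.5], [1.1, 0.0]]  # (h=3, r=2), arbitrary values
zero_b = [[0.0] * 3 for _ in range(2)]               # (r=2, h=3), zero-initialized
delta_w = matmul(generated_a, zero_b)
```

Starting from an all-zero perturbation means the frozen LLM behaves exactly as before at initialization, which is the same stability argument LoRA uses for its zero-initialized matrix.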
Pre-training Data. During pre-training, we use image-text pairs to train our model. Specifically, we use a subset of CapsFusion-120M [56] with 30 million image-text pairs. CapsFusion-120M randomly collects image-text pairs from LAION-COCO [1], which contains both web-crawled and synthetic captions generated by BLIP [26]. Then, a fine-tuned LLM is used to integrate both types of captions.
Pre-training Configuration. We freeze the weights of the LLM and the visual encoder in the pre-training stage, making only the perceptual weights generator trainable. We use the AdamW [38] optimizer with a learning rate of 5e-5, which follows a linear warm-up and then a cosine decay schedule. The pre-training is conducted with a total batch size of 768 for 40,000 iterations. The input images are resized to a resolution of 336×336. The pre-training stage uses 24 NVIDIA H800 GPUs for 7 hours.
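A linear warm-up followed by cosine decay can be written as a small function. This is a sketch under assumptions: the peak learning rate is taken to be 5e-5, and the warm-up length (2,000 steps) and the final learning rate (0) are hypothetical, since the text does not state them.

```python
import math


def learning_rate(step: int, total_steps: int = 40_000, peak_lr: float = 5e-5,
                  warmup_steps: int = 2_000, min_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 21_000, 40_000):
    print(step, learning_rate(step))

# Samples seen during pre-training: batch size times iterations.
samples_seen = 768 * 40_000
```

With a batch size of 768 and 40,000 iterations, the model sees about 30.7 million samples, roughly one pass over the 30 million image-text pairs used for pre-training.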
Fine-tuning Data. For supervised fine-tuning, we adopt the same data as LLaVA-v1.5. Specifically, the supervised fine-tuning data is constructed with VQAv2 [13], GQA [18], OKVQA [42], OCRVQA [43], A-OKVQA [49], TextCaps [50], RefCOCO [41, 19], Visual Genome [21], ShareGPT [2], and LLaVA-Instruct [34], with a total of 665K conversation data.
Fine-tuning Configuration. During the fine-tuning stage, we freeze the vision encoder and update the weights of the perceptual weights generator and the LLM. The learning rate is set to 5e-5, and the learning rate schedule is the same as in the pre-training stage. The global batch size is 128. We train for one epoch on 8 NVIDIA H800 GPUs, which takes 2 hours.
4.2 Benchmarks for Evaluation
MMBench & CCBench. MMBench [37] is a comprehensive multimodal benchmark designed to evaluate the performance of MLLMs. It includes over 3,000 multiple-choice questions; the evaluation is divided into perception and reasoning dimensions, which are subdivided into 20 ability categories. CCBench [37], released by the MMBench team, is designed for evaluating MLLMs in the domain of Chinese Culture.
MME. MME [12] also measures advanced MLLMs in terms of perception and cognition, with a total of 14 subtasks. To minimize the influence of prompt engineering on MLLMs, the instructions of MME are designed to elicit simple binary responses: “please answer yes or no”.
ScienceQA. ScienceQA [40] is constructed from elementary and high school science curricula. Questions of ScienceQA span three subjects: natural science, language science, and social science. We use samples with images from the validation set to evaluate MLLMs.
HallusionBench. HallusionBench [14] is designed for evaluating image-context reasoning, including 346 images paired with 1129 questions crafted by human experts. Unlike other benchmarks [15, 28, 31] that focus on object hallucinations with limited topics and visual input types, HallusionBench considers both language hallucinations and visual illusions across a diverse range of topics.
MMMU. MMMU [57] collects 11.5K multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines, spanning 30 subjects and 183 subfields, and comprising 30 heterogeneous image types. MMMU is more challenging than existing benchmarks due to the demand for college-level domain-specific knowledge.
Model | Size | # vis. tok. | GFLOPs | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
---|---|---|---|---|---|---|---|---|---|
InstructBLIP [9] | 8B | 32 | 827 | 36.0 | 1137.1 | 54.7 | 31.2 | 30.6 | 12.7 |
MiniGPT-4-v1 [62] | 7B | 32 | 827 | 12.2 | 770.6 | 39.0 | 31.9 | 23.6 | 1.8 |
MiniGPT-4-v2 [7] | 7B | 256 | 3754 | 24.3 | 708.4 | 54.1 | 30.0 | 25.0 | 1.4 |
Idefics-instruct [23] | 9B | 64 | 1362 | 48.2 | 942 | 51.6 | 27.3 | 18.4 | 7.8 |
OpenFlamingo v2 [3, 4] | 9B | 64 | 1362 | 6.6 | 535 | 45.7 | 29.4 | 28.2 | 6.3 |
Qwen-VL [5] | 9.6B | 256 | 3754 | 38.2 | 334.1 | 57.7 | 29.9 | 29.6 | 6.1 |
Qwen-VL-Chat [5] | 9.6B | 256 | 3754 | 60.6 | 1467.8 | 65.5 | 36.8 | 37.0 | 41.2 |
LLaVA-v1.5 [32] | 7.2B | 576 | 8027 | 64.3 | 1510.7 | 66.8 | 27.6 | 35.7 | 27.5 |
VLoRA | 7.8B | 0 | 619 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
4.3 Comparison with State-of-the-arts
Tab. 1 compares our VLoRA with other state-of-the-art MLLMs on six MLLM benchmarks. The results are obtained from OpenCompass [8]. Unlike other MLLMs, our VLoRA does not require any visual tokens during LLM inference and has only 8% of the computational overhead of LLaVA-v1.5 when the text length is 32. On most benchmarks, VLoRA outperforms InstructBLIP, MiniGPT-4, Idefics-instruct, and OpenFlamingo v2. Compared with Qwen-VL-Chat, which is pre-trained on 1.4B image-text pairs, VLoRA scores 2.8 points higher on MMBench (63.4 vs. 60.6) and 0.9 points higher on ScienceQA (66.4 vs. 65.5). Compared with LLaVA-v1.5, VLoRA achieves comparable performance on MMBench, ScienceQA, and HallusionBench and even better performance on MMMU and CCBench. However, the results on MME fall short of LLaVA-v1.5, since our perceptual weights generator is randomly initialized and necessitates more image-text pair data during the pre-training stage. To verify this, in Tab. 2, we reproduce LLaVA-v1.5 by replacing the projector with a randomly initialized Q-Former and achieve similar results on MME. Our VLoRA achieves comparable performance to state-of-the-art MLLMs without introducing visual tokens as LLM inputs, drastically reducing computational overhead.
5 Ablation Study
Currently, the performance of MLLMs is significantly affected by the foundational LLMs and the training data, including pre-training data and supervised fine-tuning data. To explore the effectiveness of our proposed paradigm and model, we perform a fair comparison with LLaVA-v1.5 [34] by adopting the same foundation LLM and training data in this section. Then, with this setting, we also explore the impact of different settings of each component on performance.
Model | PT data | # vis. tok. | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
---|---|---|---|---|---|---|---|---|
LLaVA-7b-v1.5 | blip-558k | 576 | 64.3 | 1510.7 | 66.8 | 27.6 | 35.7 | 27.5 |
LLaVA-7b-v1.5 | CapsFus-30m | 576 | 64.6 | 1470.0 | 67.7 | 27.4 | 33.8 | 25.3 |
LLaVA-7b-v1.5-QFormer | CapsFus-30m | 128 | 60.7 | 1241.5 | 67.3 | 26.7 | 33.8 | 25.3 |
VLoRA | CapsFus-30m | 0 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
5.1 Comparison with LLaVA-v1.5
To ensure a fair comparison with LLaVA-v1.5, we reproduce LLaVA-v1.5 with the same setting as our VLoRA, including the pre-training and supervised fine-tuning data. Furthermore, to eliminate the influence of the difference in the projector, we replace the projector of LLaVA-v1.5 with a randomly initialized Q-Former, which has the same number of blocks and hidden size as our perceptual weights generator. The training is conducted using the same pre-training and fine-tuning data as VLoRA.
In Tab. 2, the second row shows the results of LLaVA-v1.5 pre-trained on CapsFus-30m. With more pre-training data, LLaVA-v1.5 does not achieve significant improvement on MLLM benchmarks but rather drops on MME, HallusionBench, MMMU, and CCBench. Our VLoRA is still comparable with LLaVA-v1.5 trained on the same data. The third row shows the results of LLaVA-v1.5 with Q-Former, which is pre-trained on CapsFus-30m. We set the number of learnable queries to 128, so the number of visual tokens is 128. Except for being slightly lower on ScienceQA and HallusionBench, our VLoRA is significantly better on the other MLLM benchmarks. These results demonstrate that our approach is comparable to or even better than LLaVA-v1.5 under consistent settings.
Weights type | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
---|---|---|---|---|---|---|
qkvom | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
qkvm | 59.6 | 1227.5 | 64.6 | 23.4 | 34.7 | 24.9 |
qkv | 59.4 | 1267.9 | 65.8 | 23.2 | 33.9 | 28.8 |
qko | 57.2 | 1240.5 | 64.0 | 23.4 | 34.6 | 24.9 |
qk | 53.3 | 1169.8 | 65.0 | 23.5 | 36.7 | 21.8 |
Rank | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
---|---|---|---|---|---|---|
16 | 59.4 | 1212.7 | 67.1 | 22.9 | 39.3 | 24.5 |
32 | 60.7 | 1235.6 | 67.2 | 23.5 | 36.0 | 25.3 |
64 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
128 | 61.0 | 1228.4 | 68.0 | 23.8 | 33.4 | 26.7 |
Blocks | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
---|---|---|---|---|---|---|
4 | 60.7 | 1289.3 | 63.9 | 24.4 | 32.0 | 26.7 |
8 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
> 8 | 61.3 | 1289.3 | 67.1 | 25.5 | 34.7 | 30.2 |
5.2 Analysis of each component
To further analyze VLoRA, we explore the impact of each component, including the types of LLM weights equipped with perceptual weights, the rank of the perceptual weights, and the number of blocks in the perceptual weights generator.
The types of weights equipped with perceptual weights. As we mentioned in Sect. 3.1, there are five types of weights in the decoder block of the LLM: query, key, value, output, and mlp. We explore the impact of inserting perceptual weights for different types of LLM weights. As shown in Tab. 3, we compare different combinations, including qkvom, qkvm, qkv, qko, and qk. The model equipped with perceptual weights for all types of weights achieves the best performance on most benchmarks. We notice that the performance of qkv is much better than that of qk. This suggests that the value matrix is essential for visual perception, since the value vectors are what the attention weights combine and therefore directly form the output of the self-attention module.
The rank of perceptual weights. The rank of the generated perceptual weights represents the degree of visual information compression. The smaller the rank, the more compressed the visual information. We compare the performance of ranks from 16 to 128 in Tab. 4. When the rank is 16, the visual information is compressed severely in the perceptual weights. However, an LLM with such low-rank perceptual weights can still perceive visual information. From rank 16 to rank 64, the performance on MMBench, MME, HallusionBench, and CCBench improves with increasing rank. Specifically, the score on MMBench increases from 59.4 to 63.4, and the score on MME increases from 1212.7 to 1311.3. When the rank reaches 128, VLoRA’s performance declines across these benchmarks. The reason might be that the visual information becomes redundant, and a large rank may introduce noise into the perceptual weights, which hurts the LLM’s capability.
The number of blocks of perceptual weights generator. To explore the influence of the perceptual weights generator, we perform experiments with different numbers of blocks in the perceptual weights generator. In Tab. 5, we observe that the performance of the weights generator with 8 blocks is better than with 4 blocks. However, with more than 8 blocks, the scores on ScienceQA and CCBench are higher than with 8 blocks, but performance drops on the other benchmarks. This suggests that while a stronger perceptual weights generator can achieve better performance, there is no benefit to increasing the number of blocks after a threshold is reached.
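The link between rank and compression can be made concrete by counting how many values the generator has to produce per weight matrix. This is a rough sketch, assuming a hidden size of 4096 (Vicuna-7B) and a generated visual parameter of shape hidden size × rank; the function name is illustrative.

```python
def generated_values(rank: int, hidden: int = 4096) -> int:
    """Number of image-dependent values in one generated (hidden x rank) matrix."""
    return hidden * rank


FULL_MATRIX = 4096 * 4096  # entries of a full hidden x hidden weight

for rank in (16, 32, 64, 128):
    values = generated_values(rank)
    # Fraction of a full weight matrix that is actually generated per image.
    print(rank, values, f"{100 * values / FULL_MATRIX:.2f}%")
```

Even at rank 128 the image-dependent part is only a few percent of a full weight matrix, so the ranks compared here all sit in a heavily compressed regime.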
6 Conclusion
In this paper, instead of aligning visual features with the input space of LLM, we propose VLoRA to align visual features with the parameter space of LLM. By not introducing visual tokens into the LLM, our VLoRA enables the LLM to perceive visual information without the computational overhead of a longer input sequence. To convert visual features into perceptual weights, we propose the perceptual weights generator to generate low-rank perceptual weights for any weights of the LLM. Due to the low-rank property, the perceptual weights can be seen as LoRA weights in which the $A$ matrix is generated and the $B$ matrix is learnable. We perform comprehensive experiments on six MLLM benchmarks, and VLoRA achieves performance comparable to LLaVA-v1.5 on most benchmarks while incurring only about 8% of LLaVA-v1.5’s computational cost. In the ablation study, we reproduce LLaVA-v1.5 under the same settings and show that our method can achieve better performance.
7 Limitations
Despite VLoRA’s promising performance on various benchmarks, it still has some limitations. 1) Representing images as model weights is a previously unexplored practice, and the extracted features from existing CLIP models may not be suitable to be converted into model weights. It is necessary to explore a vision encoder that is more suitable for this paradigm. 2) We use one perceptual weights generator for one type of weight, which may lead to an insufficient correlation between different types of generated perceptual weights. It may be better to use the same perceptual weights generator to produce weights for all types at once.
References
- [1] Laion coco: 600m synthetic captions from laion2b-en. https://laion.ai/blog/laion-coco, 2022.
- [2] Sharegpt. https://sharegpt.com, 2023.
- [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- [4] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
- [5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- [7] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- [8] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [10] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
- [11] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
- [12] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [13] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [14] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023.
- [15] Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143, 2024.
- [16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
- [17] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021.
- [18] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
- [19] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
- [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [21] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- [22] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
- [23] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
- [24] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
- [25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
- [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
- [27] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.
- [28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [29] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024.
- [30] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
- [31] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
- [32] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- [33] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
- [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
- [35] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
- [36] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. AI Open, 2023.
- [37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023.
- [38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
- [39] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
- [40] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [41] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
- [42] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3195–3204, 2019.
- [43] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
- [44] OpenAI. Gpt-4 technical report, 2023.
- [45] OpenAI. Gpt-4v(ision) system card. 2023.
- [46] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020.
- [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022.
- [49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
- [50] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
- [51] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
- [52] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [53] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024.
- [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [55] Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023.
- [56] Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023.
- [57] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
- [58] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [59] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
- [60] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- [61] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- [62] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.
Appendix A Analysis of VLoRA computational overhead
In this appendix, we give a detailed calculation of the computational overhead of VLoRA. Similar to Sect. 3.4, we assume the LLM has $L$ blocks and a hidden-state dimension of $d$, the input text length is $k$, and the number of visual tokens is $n$. For a sequence of length $k$, the FLOPs of the self-attention module and the feed-forward network (with a $4d$ intermediate dimension) in one block are $8kd^2 + 4k^2d$ and $16kd^2$, respectively. Since VLoRA introduces no visual tokens, $n$ does not enter the sequence length, and the LLM has a computational overhead of $L(24kd^2 + 4k^2d)$ for the text token sequence input.

For training, we use the perceptual weights as branches of the LLM weights. Writing $r$ for the rank of a perceptual weight of size $d \times d$, the extra computation per perceptual weight comes from three parts: 1) the matrix multiplication of the two low-rank perceptual weight factors, with FLOPs of $2rd^2$; 2) the multiplication of the text tokens and the perceptual weights, with FLOPs of $2kd^2$; 3) the addition of the output of the perceptual weights to the output of the LLM weights, with FLOPs of $kd$. Therefore, the total FLOPs of VLoRA during training are $L(24kd^2 + 4k^2d)$ plus $2rd^2 + 2kd^2 + kd$ for each perceptual weight.

For inference, we merge the perceptual weights with the LLM’s weights. The extra computation per perceptual weight comes from two parts: 1) the matrix multiplication of the two low-rank perceptual weight factors, with FLOPs of $2rd^2$, which is the same as in training; 2) the addition of the perceptual weights to the LLM weights, with FLOPs of $d^2$. The total FLOPs during inference are $L(24kd^2 + 4k^2d)$ plus $2rd^2 + d^2$ for each perceptual weight.
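The accounting above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it uses the common per-block estimates for a decoder block (a feed-forward network with a 4d intermediate dimension, 2 FLOPs per multiply-add), treats every perceptual weight as a d × d matrix built from two rank-`rank` factors, and introduces the hypothetical names `llm_flops`, `vlora_extra_flops`, `rank`, and `num_weights` (the number of LLM weight matrices per block that receive a perceptual weight). The example values at the bottom are made up for illustration.

```python
def llm_flops(num_blocks, d, seq_len):
    """Approximate FLOPs of a decoder-only LLM on a sequence of seq_len tokens.

    Assumption: standard per-block estimates, FFN intermediate size 4*d.
    """
    attn = 8 * seq_len * d**2 + 4 * seq_len**2 * d  # projections + attention maps
    ffn = 16 * seq_len * d**2                        # two d <-> 4d linear layers
    return num_blocks * (attn + ffn)


def vlora_extra_flops(num_blocks, d, k, rank, num_weights, merged):
    """Extra FLOPs from perceptual weights (hypothetical helper).

    merged=False: training-style branch next to the LLM weight.
    merged=True:  inference-style merge into the LLM weight.
    """
    build = 2 * rank * d**2          # multiply the two low-rank factors
    if merged:
        per_weight = build + d**2    # add the perceptual weight to the LLM weight
    else:
        per_weight = build + 2 * k * d**2 + k * d  # token matmul + output add
    return num_blocks * num_weights * per_weight


# Hypothetical configuration, chosen only to exercise the functions.
L, d, k, n = 32, 4096, 64, 576
rank, num_weights = 64, 4

with_visual_tokens = llm_flops(L, d, k + n)
vlora_inference = llm_flops(L, d, k) + vlora_extra_flops(
    L, d, k, rank, num_weights, merged=True
)
print(f"relative cost: {vlora_inference / with_visual_tokens:.3f}")
```

Under these assumptions the merged (inference) cost no longer depends on the number of visual tokens, and the extra term is independent of the text length, which is the source of the efficiency gain described above.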
Appendix B Visualization Results
VLoRA achieves promising results on various MLLM benchmarks, but these benchmarks consist of either multiple-choice or yes/no judgment questions. To demonstrate VLoRA’s capabilities further, we show some real-world samples in Fig. 5. The first example suggests that VLoRA can accurately count the steaks in the image. The second example shows that VLoRA has sufficient common sense. In the third example, VLoRA demonstrates the ability to reason and to hold long text conversations.
![Refer to caption](https://cdn.statically.io/img/arxiv.org/x5.png)
Appendix C Broader Impacts
Our proposed paradigm significantly improves the training and inference efficiency of multimodal large models and reduces their computational overhead. For research, this lowers the resource threshold for working on multimodal large models, which encourages active exploration by researchers in related fields. For practical applications, it reduces the cost of large-scale deployment and helps to reduce resource consumption.