Visual Perception by Large Language Model’s Weights

Feipeng Ma1, Hongwei Xue1,3, Guangting Wang2, Yizhou Zhou2,‡, Fengyun Rao2
Shilin Yan4, Yueyi Zhang1,‡, Siying Wu5, Mike Zheng Shou3, Xiaoyan Sun1,5

1University of Science and Technology of China  2WeChat, Tencent Inc.
3Show Lab, National University of Singapore   4Fudan University
5Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
{mafp,xuehongwei}@mail.ustc.edu.cn
harryizzhou@tencent.com, {zhyuey,sunxiaoyan}@ustc.edu.cn
This work was performed while Feipeng Ma and Hongwei Xue were interns at WeChat, Tencent Inc. Project Leader. ‡Corresponding authors.
Abstract

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.

1 Introduction

Large language models (LLMs) [54, 61, 44] have achieved promising performance on most natural language tasks and have shown great generalization ability in solving real-world problems. Derived from LLMs, multimodal large language models (MLLMs) [34, 62, 4, 59, 52, 45] take a step toward artificial general intelligence (AGI) by perceiving visual information from the real world. Therefore, the way of perceiving visual information is the key to moving from LLM to MLLM.

To perceive visual information, recent MLLMs follow an input space alignment paradigm that aligns visual features with the input space of LLM and concatenates visual tokens with text tokens to form a unified sequence as input for LLM. For instance, LLaVA [34] uses CLIP-ViT-L-14 [47] as the visual encoder and introduces a linear projector to align the visual tokens with the input space of LLM. Monkey [29] divides input images into uniform patches and equips individual adapters for each patch to handle high-resolution images. Recent work [53] also identifies the visual shortcomings of CLIP for MLLMs as “CLIP-blind pairs” and integrates vision self-supervised learning features with MLLM to address this issue. DeepSeek-VL [39] and Sphinx [30] also adopt hybrid vision encoders. Vary [55] identifies that a fixed vision vocabulary limits the dense and fine-grained visual perception and introduces a new vocabulary to address this issue.

Despite these efforts to advance MLLM in visual perception, the paradigm of input space alignment remains unchanged, which can result in computational inefficiency for both training and inference. The computational cost of MLLM is concentrated on the attention mechanism of LLM, which is $O(n^2)$ when the length of the input sequence is $n$. Using ViT-L-14 as the vision encoder, a low-resolution 224×224 image results in 256 visual tokens, and the number rises to 576 when the resolution increases slightly to 336×336. Considering high-resolution images, some works [30, 33, 29, 11] split an image into multiple sub-images for capturing fine-grained information, leading to a significantly higher number of visual tokens. For instance, Sphinx-2k [30] adopts 2,890 visual tokens, while InternLM-XComposer2-4KHD [11] even uses up to 8,737 visual tokens. Concatenating such a long sequence of visual tokens to text tokens results in a dramatic increase in computational overhead for both training and inference. Specifically, current MLLMs are usually pre-trained on web-crawled image-text pairs, which usually have very short texts, with an average word count of 10.95 for LAION-2B [48] and 8.99 for LAION-COCO [1]. As a result, the number of visual tokens during the pre-training stage is about 20 to 50 times the number of text tokens, which suggests that the involvement of visual tokens seriously affects the efficiency of the pre-training. Some works [25, 9, 22] employ resamplers to reduce the number of visual tokens to a fixed count but still follow the input space alignment paradigm and introduce extra visual tokens for LLMs.
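The token counts quoted above follow from simple patch arithmetic. The sketch below is only an illustration: the helper name `num_visual_tokens` is hypothetical, and it assumes a square image, a patch size of 14 (as in ViT-L-14), and no class token.

```python
def num_visual_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (class token not counted)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


for size in (224, 336):
    # 224 -> 16 x 16 patches, 336 -> 24 x 24 patches
    print(size, num_visual_tokens(size))
```

Because attention cost grows quadratically with sequence length, going from 256 to 576 visual tokens more than doubles the length that the quadratic term is applied to.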

To address this issue, we explore a novel parameter space alignment paradigm where visual information is represented as LLM’s weights. As shown in Fig. 1, for an input image, we use a vision encoder to extract visual features. Then, the visual features are converted to perceptual weights, which represent visual information as model weights. The perceptual weights can be directly merged with LLM’s weights. Thus, the visual information is merged into LLM in the form of weights, eliminating the need for visual tokens in the LLM’s input and significantly improving efficiency. Building on this paradigm, we introduce VLoRA, which contains the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights. Because LLMs usually contain a large number of parameters, the perceptual weights are designed with a low-rank property for feasibility and efficiency. The generated perceptual weights are thus similar in form to LoRA weights.

Our contributions are summarised as follows:

  1. We explore a novel paradigm for MLLMs that aligns visual features with the parameter space of LLMs, which substantially improves the efficiency of MLLMs.

  2. Based on this paradigm, we propose VLoRA and design the perceptual weights generator that generates low-rank perceptual weights.

  3. Experimental results demonstrate the effectiveness and efficiency of our approach. We obtain results comparable to those of state-of-the-art MLLMs on various benchmarks, including MMBench, ScienceQA, HallusionBench, and MMMU.

Figure 1: Overview of the input space alignment and the parameter space alignment paradigms. The input space alignment paradigm aligns visual features with the input space of LLM and concatenates visual tokens with text tokens as input for LLM. Our proposed VLoRA follows the parameter space alignment paradigm that aligns visual features with the parameters of LLM and merges perceptual weights generated by the perceptual weights generator with LLM’s weights.

2 Related Works

Multimodal Large Language Models. Current MLLMs are developed from LLMs by aligning visual features into the input space of LLMs. Many efforts have been made to explore introducing visual perception capability for LLMs. LLaVA [34] connects the visual encoder of CLIP to Vicuna [61] with a linear projector. Further research that follows this paradigm focuses on improving MLLMs from the perspective of the vision encoder and projector. DeepSeek-VL [39] uses SigLIP [58] to extract high-level semantic features and SAM-B [20] to process low-level features. Tong et al. [53] find that visually distinct images can be encoded as similar due to the shortcomings of CLIP and integrate vision self-supervised learning features with CLIP features. Sphinx [30] ensembles various vision backbones that have different architectures, pre-training paradigms, and information granularities. These works input the entire visual token sequence into the LLM, which can lead to a high computational cost during training and inference. Specifically, LLaVA [32] and DeepSeek-VL [39] utilize 576 visual tokens, Sphinx-2k [30] employs 2,890 visual tokens, and InternLM-XComposer2-4KHD [11] uses up to 8,737 tokens. Some works consider adopting a cross-attention architecture as the projector to improve efficiency. MiniGPT4-v1 [62] and the BLIP series [25, 9] adopt Q-Former as the projector, which reduces the length of visual tokens to a fixed number of 64. Qwen-VL [5] uses a single-layer cross-attention module incorporated with 2D absolute positional encodings to avoid the potential loss of positional details. However, these improvements still follow the paradigm of aligning visual features to the input space of LLM, introducing extra computational overhead on LLM inference. Different from previous work, our VLoRA aligns visual features with the parameter space of LLM. The visual information can be represented as perceptual weights in LoRA format and merged into LLM’s weights during inference.

Parameter-Efficient Fine-Tuning. Parameter-efficient fine-tuning (PEFT) is a key technique for fine-tuning large pre-trained models, including LLMs and MLLMs. PEFT methods freeze the backbone and fine-tune only a small number of parameters; they can typically be categorized into three classes: adapters [16, 46, 51, 60], prefix-tuning [27, 24, 36], and Low-Rank Adaptation (LoRA) [17, 35, 10]. In the field of language models, Houlsby et al. [16] design bottleneck adapters and insert two adapters into the transformer layers, one after the attention module and one after the feed-forward network. Prefix-tuning [27] prepends a set of learnable prefix vectors at the query and key of the self-attention module for every layer. Prompt-tuning proposes to only prepend learnable vectors to the input prompt with no intermediate-layer prefixes. LoRA [17] uses learnable low-rank matrices to approximate the backbone’s weight updates, and the low-rank matrices can be merged with the backbone during inference without extra inference burden. Considering the pre-training stage, current MLLMs usually freeze the unimodal backbones and project visual tokens through a learnable projector, then prepend visual tokens into the input sequence of LLMs, which can be seen as prefix-tuning methods. Our VLoRA is closer to the style of LoRA. Specifically, VLoRA generates low-rank perceptual weights, which can be seen as a generated visual parameters matrix $\Delta W_A \in \mathbb{R}^{h \times r}$ multiplied with a learnable matrix $\Delta W_B \in \mathbb{R}^{r \times h}$.
Similar to LoRA, the perceptual weights can be injected into LLMs’ weights without introducing extra inference overhead.

3 Method

3.1 Preliminaries

In this subsection, we review the details of the decoder block in the current LLM. As shown in Fig. 2, the decoder block of LLM contains a self-attention module and a feed-forward network.

Self-attention. As shown in Fig. 2 (b), the self-attention module contains four types of linear layers: query $W_Q \in \mathbb{R}^{h \times d}$, key $W_K \in \mathbb{R}^{h \times d}$, value $W_V \in \mathbb{R}^{h \times d}$, and output $W_O \in \mathbb{R}^{h \times h}$. Here, $h$ represents the dimension of the hidden states of LLM, and $d$ represents the dimension of each attention head.
For each input token $x_i \in \mathbb{R}^{h}$ in the input sequence $X = (x_1, x_2, \ldots, x_N)$, it is multiplied by linear layers $W_Q$, $W_K$, $W_V$, obtaining $X^q = XW_Q$, $X^k = XW_K$ and $X^v = XW_V$. Then, the attention operation is executed along the sequence dimension as follows:

$$\mathrm{Attention}(X^q, X^k, X^v) = \mathrm{softmax}\left(\frac{X^q {X^k}^T}{\sqrt{d}}\right) X^v. \tag{1}$$

The self-attention mechanism is performed on each head, and the outputs from different heads are concatenated and multiplied by the output linear layer with weights $W_O$.
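As a rough illustration of Eq. 1, the following NumPy sketch computes attention for a single head on a toy sequence. The sizes `N`, `h`, `d` and the random weights are placeholders chosen for the example; causal masking, multiple heads, and the output projection $W_O$ are left out for brevity.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def single_head_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for one head, following Eq. 1."""
    d = W_Q.shape[1]                      # per-head dimension
    Xq, Xk, Xv = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Xq @ Xk.T / np.sqrt(d), axis=-1)  # shape (N, N)
    return weights @ Xv                   # shape (N, d)


rng = np.random.default_rng(0)
N, h, d = 5, 16, 4                        # toy sizes, not the paper's
X = rng.normal(size=(N, h))
W_Q, W_K, W_V = (rng.normal(size=(h, d)) for _ in range(3))
out = single_head_attention(X, W_Q, W_K, W_V)
print(out.shape)
```

The `(N, N)` attention matrix is where the quadratic dependence on sequence length comes from, which is why extra visual tokens in the input are costly.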

Feed-forward Network. As shown in Fig. 2 (c), the feed-forward network is an MLP with two fully connected layers and a non-linear activation function. The formulation can be written as follows:

$$\mathrm{FFN}(x_i) = \phi(x_i W_1) W_2, \tag{2}$$

where $x_i$ is the input token, $\phi$ is the activation function, and $W_1$ and $W_2$ are the weights of two fully connected layers. To summarize, the decoder block of LLM has five types of weights, including $W_Q$, $W_K$, $W_V$, $W_O$ from the self-attention module, and $W_1$, $W_2$ from the feed-forward network.

Figure 2: Details of the LLM Decoder Block. (a) illustrates the details of the LLM decoder block, including the multi-head self-attention module and the feed-forward network. (b) provides a detailed view of the multi-head self-attention module, which incorporates four types of weights: $W_Q$, $W_K$, $W_V$, and $W_O$. (c) depicts the feed-forward network, which consists of the weights $W_1$ and $W_2$.

3.2 Visual Perception by LLM’s Weights

Previous MLLMs follow the paradigm of aligning the visual features with the input space of LLM and require additional visual tokens as LLM’s input, which can lead to computational inefficiency. This inefficiency becomes more pronounced when encountering high-resolution or multiple images as the number of tokens increases drastically. To address this issue, we propose to align visual features with LLM’s parameter space without introducing extra tokens into LLM’s input.

To achieve this goal, we represent the visual information of the input image as perceptual weights and integrate them into the weights of LLM. This approach allows LLM to perceive visual information without introducing extra tokens into the input. As mentioned in Sect. 3.1, LLM’s decoder blocks have five types of weights. We use $W \in \mathbb{R}^{h \times h}$ to denote the weight matrix of LLM. For an input image $I$, we first adopt a vision encoder $f(\cdot)$ to extract the visual features $z = f(I)$, where $z \in \mathbb{R}^{c \times d_v}$, $c$ is the number of visual tokens, and $d_v$ is the dimension of visual features. Then, we design a perceptual weights generator $g(\cdot)$ to convert the visual features to perceptual weights $\Delta W \in \mathbb{R}^{h \times h}$. It is worth noting that, given that we want LLM to perceive visual information while preserving its language capabilities, $\Delta W$ is a low-rank matrix, which also helps to reduce the computation cost of the perceptual weights generator. With the generated perceptual weights $\Delta W$, we can directly merge it into the LLM’s weights as:

$$\hat{W} = W + \Delta W. \tag{3}$$

By integrating the weights generated from the visual features into the LLM’s weights, the LLM naturally acquires visual perception ability. After merging the weights, no extra inference burden is introduced for LLM. For any weights in each decoder block of LLM, we can generate the corresponding perceptual weights and integrate them into LLM’s weights.
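The merge in Eq. 3 can be illustrated with a small NumPy sketch. Everything here is a toy stand-in: the sizes `h`, `r`, `C` are arbitrary, and `delta_W` is a random low-rank matrix rather than the output of an actual perceptual weights generator.

```python
import numpy as np

rng = np.random.default_rng(0)
h, r, C = 32, 4, 6                      # toy hidden size, rank, text length

W = rng.normal(size=(h, h))             # one frozen LLM weight matrix
# Stand-in for image-conditioned perceptual weights with rank at most r.
delta_W = rng.normal(size=(h, r)) @ rng.normal(size=(r, h))

W_hat = W + delta_W                     # Eq. 3: merge into the LLM weights

X_text = rng.normal(size=(C, h))        # text tokens only, no visual tokens
out_merged = X_text @ W_hat
out_branch = X_text @ W + X_text @ delta_W   # same result, unmerged form

print(out_merged.shape, np.linalg.matrix_rank(delta_W))
```

The input keeps its original length `C`: the image influences the output only through `delta_W`, and the merged layer has the same shape and cost as the original one.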

Figure 3: Perceptual Weights Generator. Figure (a) illustrates the pipeline of our perceptual weights generator. We set $k$ learnable perceptual queries, which interact with image features in $N$ decoder blocks, and obtain $k$ visual parameters. Then, a shared linear layer and $k$ independent linear layers are used to convert these visual parameters to perceptual weights $\Delta W$. Figure (b) demonstrates that our approach is formally consistent with LoRA.

3.3 Perceptual Weights Generator

To convert visual features to perceptual weights $\Delta W \in \mathbb{R}^{h \times h}$, we propose the perceptual weights generator. Since each layer and each type of weight in LLM focus on different visual information, our perceptual weights generator needs to be able to generate weights corresponding to each of the LLM weights flexibly.

Inspired by DETR [6] and BLIP-2 [25], we design the perceptual weights generator as a decoder-only architecture with cross-attention layers to generate $\Delta W \in \mathbb{R}^{h \times h}$. As shown in Fig. 3 (a), the perceptual weights generator contains $N$ blocks, each comprising a self-attention module, a cross-attention module, and a feed-forward network. The hidden states dimension of the perceptual weights generator is $h_p$, where $h_p \ll h \cdot h$. We set $k$ learnable perceptual queries corresponding to the number of decoder blocks where we want to insert perceptual weights. For each block, the perceptual queries first pass through the self-attention module, then interact with visual features in the cross-attention module, and finally go through a feed-forward network. After $N$ blocks, we obtain $k$ features $p_v \in \mathbb{R}^{h_p}$. The features $p_v$ should be mapped to the target shape of perceptual weights $\Delta W \in \mathbb{R}^{h \times h}$.
However, since $h_p \ll h \cdot h$, directly mapping the dimension of $p_v$ from $h_p$ to $h \cdot h$ with a linear layer would introduce a large number of parameters, dramatically reducing the feasibility. Therefore, we consider introducing the low-rank property in this process. We adopt a shared linear layer $W_{share} \in \mathbb{R}^{h_p \times h \cdot r}$ to map all features $p_v$ from $h_p$ to $h \cdot r$ as follows:

$$W_v = p_v W_{share}, \tag{4}$$

where $r$ is the rank for perceptual weights and $W_v \in \mathbb{R}^{h \cdot r}$ is the visual parameter.

We then reshape the output $W_v$ to $h \times r$. To map up to the target dimension $h \times h$, $k$ independent linear layers $W_s \in \mathbb{R}^{r \times h}$ are used, one for each visual parameter, yielding $k$ perceptual weights $\Delta W$. This process can be formulated as follows:

$$\Delta W = W_v W_s. \tag{5}$$

Substituting Eq. 5 into Eq. 3, we get:

$$\hat{W} = W + \Delta W = W + W_v W_s. \tag{6}$$

Considering the low-rank property of $W_v$ and $W_s$, we can observe that Eq. 6 and LoRA [17] are of the same form, where $W_v$ corresponds to $\Delta W_A$ and $W_s$ corresponds to $\Delta W_B$. As illustrated in Fig. 3 (b), our perceptual weights generator can be seen as a “LoRA weights generator” from the perspective of LoRA. This is because it generates $\Delta W_A$ and $\Delta W_B$ for weights of LLM. Our perceptual weights generator generates one type of perceptual weights for $k$ decoder blocks at a time. For generating multiple types of weights, we employ multiple perceptual weights generators.
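The mapping in Eqs. 4 to 6 can be traced with a short NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the sizes are toy values, `p_v` is random noise standing in for the output of the $N$ generator blocks, and biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
k, h_p, h, r = 3, 8, 16, 2               # toy sizes chosen for the example

p_v = rng.normal(size=(k, h_p))           # stand-in for the k query features
W_share = rng.normal(size=(h_p, h * r))   # shared linear layer (Eq. 4)
W_s = rng.normal(size=(k, r, h))          # k independent linear layers (Eq. 5)

W_v = (p_v @ W_share).reshape(k, h, r)    # Eq. 4, then reshape to h x r
delta_W = W_v @ W_s                       # Eq. 5, batched: shape (k, h, h)

# Eq. 6 for the first target block: merged form vs. LoRA-style branch form.
W = rng.normal(size=(h, h))
x = rng.normal(size=(5, h))
merged = x @ (W + delta_W[0])
branch = x @ W + (x @ W_v[0]) @ W_s[0]

ranks = np.linalg.matrix_rank(delta_W)    # one rank per generated matrix
print(delta_W.shape, ranks)
```

Each generated matrix has rank at most `r`, and the branch form never materializes the full `h x h` update, which is the same trade-off LoRA makes.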

3.4 Analysis of the Computational Cost

By not introducing additional visual tokens in the input of the LLM, our VLoRA achieves higher computational efficiency for both training and inference. We only consider the computational cost of LLM, as the computational overhead of our perceptual weights generator is negligible in comparison. We assume the LLM has $d$ blocks (here $d$ denotes the number of blocks, not the head dimension of Sect. 3.1) and a hidden states dimension of $h$, the input text length is $C$, and the number of visual tokens is $L$. For convenience, we only consider the computational cost of the self-attention module and feed-forward network in LLM. For a sequence of length $L$, the FLOPs of the self-attention module and the feed-forward network are $8Lh^2 + 4L^2h$ and $16Lh^2$, respectively. For previous MLLMs that align visual features to the input space of LLM, the FLOPs of LLM are $24(L+C)dh^2 + 4(L+C)^2dh$. For our VLoRA, the extra computational cost occurs in Eq. 6, where $\Delta W_A$ is multiplied with $\Delta W_B$. We assume that perceptual weights are generated for all 5 types of weights in $k$ decoder blocks. During training, we do not merge the perceptual weights with the LLM weights but use them as branches of the LLM weights.
Therefore, the FLOPs are $24Cdh^2 + 4C^2dh + 24krh^2 + 12Ckh^2 + 14Ckh$. For inference, the perceptual weights can be merged into the LLM, and the FLOPs are $24Cdh^2 + 4C^2dh + 24krh^2 + 12kh^2$. Details of the FLOPs calculation are in Appendix A. Training incurs slightly more overhead than inference, so we use the training FLOPs for comparison. In Fig. 4, we compare the FLOPs of LLaVA and VLoRA. Our approach does not introduce additional computation as the number of visual tokens increases, and our FLOPs are only 8% of LLaVA-v1.5’s when the text length is 32.
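The FLOPs expressions above can be evaluated directly. The sketch below plugs in values that are partly assumptions: `d = 32` blocks and `h = 4096` are taken as the Vicuna-7B configuration, `L = 576` is the LLaVA-v1.5 visual token count mentioned in Sect. 2, and `k = 8`, `r = 64` follow the model settings in Sect. 4.1. The function names are hypothetical.

```python
def llava_flops(L, C, d, h):
    """LLM FLOPs when L visual tokens are concatenated with C text tokens."""
    n = L + C
    return 24 * n * d * h**2 + 4 * n**2 * d * h


def vlora_train_flops(C, d, h, k, r):
    """VLoRA training FLOPs (perceptual weights kept as branches)."""
    return (24 * C * d * h**2 + 4 * C**2 * d * h
            + 24 * k * r * h**2 + 12 * C * k * h**2 + 14 * C * k * h)


def vlora_infer_flops(C, d, h, k, r):
    """VLoRA inference FLOPs (perceptual weights merged into the LLM)."""
    return 24 * C * d * h**2 + 4 * C**2 * d * h + 24 * k * r * h**2 + 12 * k * h**2


d, h = 32, 4096      # assumed Vicuna-7B depth and hidden size
k, r = 8, 64         # target blocks and rank from Sect. 4.1
C, L = 32, 576       # text length and LLaVA-v1.5 visual token count

ratio = vlora_train_flops(C, d, h, k, r) / llava_flops(L, C, d, h)
print(f"training FLOPs ratio VLoRA / LLaVA: {ratio:.3f}")
```

Under these assumptions the ratio comes out at roughly 0.08, in line with the 8% figure quoted above, and the VLoRA terms contain no dependence on `L`.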

Figure 4: Comparison of FLOPs. This figure shows the FLOPs of LLaVA and VLoRA with different numbers of input visual tokens. The left subplot illustrates the change in GFLOPs, the right subplot plots the ratio of GFLOPs for VLoRA to LLaVA, and C denotes the number of text tokens.

4 Experiments

4.1 Implementation Details

Model Settings. We use Vicuna-7b-v1.5 [61] as our foundational LLM and CLIP-ViT-L-14 [47] as the vision encoder. The perceptual weights generator is initialized randomly. For the perceptual weights generator, we set the hidden size $h_p$ to 512 and the number of blocks $N$ to 8. The rank $r$ of perceptual weights is 64. The number of perceptual queries is 8, which means that we insert perceptual weights $\Delta W$ on only 8 blocks; in the implementation, for Vicuna-7b-v1.5 with 32 blocks, we insert $\Delta W$ every 4 blocks. For better visual perceptual ability, we insert $\Delta W$ for all five types of weights in LLM. It is worth noting that the last $k$ linear layers of the perceptual weights generator are zero-initialized as they are equivalent to the $\Delta W_B$ of LoRA weights, which are initialized as zero for training stability.
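To make these settings concrete, the sketch below lists one possible choice of target blocks and compares parameter counts for the output mapping of a single generator. Two things are assumptions rather than facts from the text: the insertion is taken to start at block 0 (the text only says every 4 blocks), and `h = 4096` is taken as the hidden size of Vicuna-7B. Biases are ignored.

```python
num_llm_blocks, k = 32, 8
stride = num_llm_blocks // k                      # 4, i.e. every 4 blocks
# Assumption: start at block 0; the exact offsets are not stated.
target_blocks = list(range(0, num_llm_blocks, stride))

h, h_p, r = 4096, 512, 64                         # h is an assumed value

# A single linear layer mapping h_p directly to h * h.
direct_params = h_p * h * h
# Low-rank route: shared W_share (h_p x h*r) plus k matrices W_s (r x h).
low_rank_params = h_p * h * r + k * r * h

print(target_blocks)
print(direct_params, low_rank_params, direct_params / low_rank_params)
```

Under these assumptions the direct mapping would need billions of parameters for one weight type, while the low-rank route stays in the low hundreds of millions, which is the feasibility argument made in Sect. 3.3.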

Pre-training Data. During pre-training, we use image-text pairs to train our model. Specifically, we use a subset of CapsFusion-120M [56] with 30 million image-text pairs. CapsFusion-120M randomly collects image-text pairs from LAION-COCO [1], which contains both web-crawled and synthetic captions generated by BLIP [26]. Then, a fine-tuned LLM is used to integrate both types of captions.

Pre-training Configuration. We freeze the weights of LLM and visual encoder in the pre-training stage, making only the perceptual weights generator trainable. We use the AdamW [38] optimizer with a learning rate of 5e-5, which follows a linear warm-up and then a cosine decay schedule. The pre-training is conducted with a total batch size of 768 for 40,000 iterations. The input images are resized to a resolution of 336×336. The pre-training stage uses 24 NVIDIA H800 GPUs for 7 hours.
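A minimal sketch of such a schedule is given below. The peak learning rate and iteration count come from the text, while `warmup_steps = 1_000` and the decay to exactly zero are hypothetical choices, since neither is stated. The last line checks how many samples the stated batch size and iteration count cover.

```python
import math


def lr_at(step, total_steps=40_000, peak_lr=5e-5, warmup_steps=1_000):
    """Linear warm-up followed by cosine decay (warmup_steps is hypothetical)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 999, 20_000, 40_000):
    print(step, lr_at(step))

# Samples seen during pre-training: batch size times iterations.
samples_seen = 768 * 40_000
print(samples_seen)
```

The sample count is close to the 30 million image-text pairs in the pre-training subset, so the configuration corresponds to roughly one pass over the data.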

Fine-tuning Data. For supervised fine-tuning, we adopt the same data as LLaVA-v1.5. Specifically, the supervised fine-tuning data is constructed from VQAv2 [13], GQA [18], OKVQA [42], OCRVQA [43], A-OKVQA [49], TextCaps [50], RefCOCO [41, 19], Visual Genome [21], ShareGPT [2], and LLaVA-Instruct [34], for a total of 665K conversations.

Fine-tuning Configuration. During the fine-tuning stage, we freeze the vision encoder and update the weights of the perceptual weights generator and the LLM. The learning rate is set to 5e-5 and the learning rate schedule is the same as in the pre-training stage. The global batch size is 128. We train for one epoch on 8 NVIDIA H800 GPUs, which takes 2 hours.

4.2 Benchmarks for Evaluation

MMBench & CCBench. MMBench [37] is a comprehensive multimodal benchmark designed to evaluate the performance of MLLMs. It includes over 3,000 multiple-choice questions, organized into perception and reasoning dimensions that are further subdivided into 20 ability categories. CCBench [37], released by the MMBench team, is designed for evaluating MLLMs in the domain of Chinese culture.

MME. MME [12] also measures advanced MLLMs in terms of perception and cognition, with a total of 14 subtasks. To minimize the influence of prompt engineering on MLLMs, the instructions of MME are designed to elicit simple binary responses: “please answer yes or no”.

ScienceQA. ScienceQA [40] is constructed from elementary and high school science curricula. Questions of ScienceQA span three subjects: natural science, language science, and social science. We use samples with images from the validation set to evaluate MLLMs.

HallusionBench. HallusionBench [14] is designed for evaluating image-context reasoning, including 346 images paired with 1129 questions crafted by human experts. Unlike other benchmarks [15, 28, 31] that focus on object hallucinations with limited topics and visual input types, HallusionBench considers both language hallucinations and visual illusions across a diverse range of topics.

MMMU. MMMU [57] collects 11.5K multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines, spanning 30 subjects and 183 subfields, and comprising 30 heterogeneous image types. MMMU is more challenging than existing benchmarks due to the demand for college-level domain-specific knowledge.

Table 1: Comparisons on six MLLM benchmarks, including MMBench, MME, ScienceQA, HallusionBench, MMMU, and CCBench. vis. tok. denotes the number of visual tokens involved in the LLM. Bolded numbers indicate the best results, and underlined numbers are the second-best results. GFLOPs denotes the overhead of the LLM part when the number of input text tokens is 32.
| Model | Size | # vis. tok. | GFLOPs | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InstructBLIP [9] | 8B | 32 | 827 | 36.0 | 1137.1 | 54.7 | 31.2 | 30.6 | 12.7 |
| MiniGPT-4-v1 [62] | 7B | 32 | 827 | 12.2 | 770.6 | 39.0 | 31.9 | 23.6 | 1.8 |
| MiniGPT-4-v2 [7] | 7B | 256 | 3754 | 24.3 | 708.4 | 54.1 | 30.0 | 25.0 | 1.4 |
| Idefics-instruct [23] | 9B | 64 | 1362 | 48.2 | 942 | 51.6 | 27.3 | 18.4 | 7.8 |
| OpenFlamingo v2 [3, 4] | 9B | 64 | 1362 | 6.6 | 535 | 45.7 | 29.4 | 28.2 | 6.3 |
| Qwen-VL [5] | 9.6B | 256 | 3754 | 38.2 | 334.1 | 57.7 | 29.9 | 29.6 | 6.1 |
| Qwen-VL-Chat [5] | 9.6B | 256 | 3754 | 60.6 | 1467.8 | 65.5 | 36.8 | 37.0 | 41.2 |
| LLaVA-v1.5 [32] | 7.2B | 576 | 8027 | 64.3 | 1510.7 | 66.8 | 27.6 | 35.7 | 27.5 |
| VLoRA | 7.8B | 0 | 619 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |

4.3 Comparison with State-of-the-arts

Tab. 1 compares VLoRA with other state-of-the-art MLLMs on six MLLM benchmarks. The results are obtained from OpenCompass [8]. Unlike other MLLMs, VLoRA does not require any visual tokens during LLM inference and incurs only about 8% of the computational overhead of LLaVA-v1.5 when the text length is 32. On most benchmarks, VLoRA outperforms InstructBLIP, MiniGPT-4, Idefics-instruct, and OpenFlamingo v2. Compared with Qwen-VL-Chat, which is pre-trained on 1.4B image-text pairs, VLoRA scores 2.8 points higher on MMBench and 0.9 points higher on ScienceQA. Compared with LLaVA-v1.5, VLoRA achieves comparable performance on MMBench, ScienceQA, and HallusionBench, and slightly better performance on MMMU and CCBench. However, its MME score falls short of LLaVA-v1.5; we attribute this to the perceptual weights generator being randomly initialized and therefore requiring more image-text pairs during the pre-training stage. To verify this, in Tab. 2, we reproduce LLaVA-v1.5 with its projector replaced by a randomly initialized Q-Former and obtain similar results on MME. Overall, VLoRA achieves performance comparable to state-of-the-art MLLMs without introducing visual tokens as LLM inputs, drastically reducing computational overhead.
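The efficiency claim can be checked directly against the GFLOPs column of Tab. 1. The short sketch below copies the number of visual tokens and the GFLOPs for three rows and expresses each cost relative to LLaVA-v1.5; the dictionary layout and variable names are illustrative assumptions.

```python
# (number of visual tokens, LLM GFLOPs at 32 text tokens), copied from Tab. 1.
tab1_cost = {
    "Qwen-VL-Chat": (256, 3754),
    "LLaVA-v1.5": (576, 8027),
    "VLoRA": (0, 619),
}

baseline_gflops = tab1_cost["LLaVA-v1.5"][1]
relative_cost = {
    name: gflops / baseline_gflops for name, (_, gflops) in tab1_cost.items()
}

for name, ratio in relative_cost.items():
    print(f"{name}: {ratio:.1%} of LLaVA-v1.5's LLM FLOPs")
```

For VLoRA the ratio is about 7.7%, which is the figure rounded to 8% in the text.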

5 Ablation Study

Currently, the performance of MLLMs is significantly affected by the foundational LLM and the training data, including both pre-training data and supervised fine-tuning data. To assess the effectiveness of our proposed paradigm and model, in this section we perform a fair comparison with LLaVA-v1.5 [32] by adopting the same foundational LLM and training data. Under this setting, we also explore how different settings of each component affect performance.

Table 2: Comparison to LLaVA-v1.5 with various settings on six MLLM benchmarks, including MMBench, MME, ScienceQA, HallusionBench, MMMU, and CCBench. PT data represents the pre-training data. vis. tok. denotes the number of visual tokens involved in LLM.
| Model | PT data | # vis. tok. | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-7b-v1.5 | blip-558k | 576 | 64.3 | 1510.7 | 66.8 | 27.6 | 35.7 | 27.5 |
| LLaVA-7b-v1.5 | CapsFus-30m | 576 | 64.6 | 1470.0 | 67.7 | 27.4 | 33.8 | 25.3 |
| LLaVA-7b-v1.5-QFormer | CapsFus-30m | 128 | 60.7 | 1241.5 | 67.3 | 26.7 | 33.8 | 25.3 |
| VLoRA | CapsFus-30m | 0 | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |

5.1 Comparison with LLaVA-v1.5

To ensure a fair comparison with LLaVA-v1.5, we reproduce LLaVA-v1.5 with the same settings as VLoRA, including the pre-training and supervised fine-tuning data. Furthermore, to eliminate the influence of the difference in the projector, we replace the projector of LLaVA-v1.5 with a randomly initialized Q-Former, which has the same number of blocks and hidden size as our perceptual weights generator.

In Tab. 2, the second row shows the results of LLaVA-v1.5 pre-trained on CapsFus-30m. With more pre-training data, LLaVA-v1.5 does not improve significantly on the MLLM benchmarks; instead, its scores drop on MME, HallusionBench, MMMU, and CCBench. VLoRA remains comparable with LLaVA-v1.5 trained on the same data. The third row shows the results of LLaVA-v1.5 with a Q-Former, also pre-trained on CapsFus-30m. We set the number of learnable queries to 128, so the number of visual tokens is 128. Except for being slightly lower on ScienceQA and HallusionBench, VLoRA is clearly better on the other MLLM benchmarks. These results demonstrate that our approach is comparable to or even better than LLaVA-v1.5 under consistent settings.

Table 3: The impact of the types of weights equipped with perceptual weights. q, k, v, and o denote the query, key, value, and output weights in the self-attention module, respectively. m denotes the weights of the feed-forward network.
| Weights type | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
| --- | --- | --- | --- | --- | --- | --- |
| qkvom | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
| qkvm | 59.6 | 1227.5 | 64.6 | 23.4 | 34.7 | 24.9 |
| qkv | 59.4 | 1267.9 | 65.8 | 23.2 | 33.9 | 28.8 |
| qko | 57.2 | 1240.5 | 64.0 | 23.4 | 34.6 | 24.9 |
| qk | 53.3 | 1169.8 | 65.0 | 23.5 | 36.7 | 21.8 |

Table 4: The impact of perceptual weights’ rank. The rank of the generated perceptual weights indicates the extent of visual information compression.
| Rank | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
| --- | --- | --- | --- | --- | --- | --- |
| $r=16$ | 59.4 | 1212.7 | 67.1 | 22.9 | 39.3 | 24.5 |
| $r=32$ | 60.7 | 1235.6 | 67.2 | 23.5 | 36.0 | 25.3 |
| $r=64$ | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
| $r=128$ | 61.0 | 1228.4 | 68.0 | 23.8 | 33.4 | 26.7 |

Table 5: The impact of different numbers of blocks of perceptual weights generator.
| Blocks | MMBench | MME | ScienceQA | HallusionBench | MMMU | CCBench |
| --- | --- | --- | --- | --- | --- | --- |
| $N=4$ | 60.7 | 1289.3 | 63.9 | 24.4 | 32.0 | 26.7 |
| $N=8$ | 63.4 | 1311.3 | 66.4 | 26.4 | 36.0 | 28.6 |
| $N=12$ | 61.3 | 1289.3 | 67.1 | 25.5 | 34.7 | 30.2 |

5.2 Analysis of each component

To further analyze VLoRA, we explore the impact of each component, including the types of weights equipped with perceptual weights, the rank of the perceptual weights, and the number of blocks in the perceptual weights generator.

The types of weights equipped with perceptual weights. As mentioned in Sect. 3.1, there are five types of weights in the decoder block of the LLM: query, key, value, output, and MLP. We explore the impact of inserting perceptual weights for different types of LLM weights. As shown in Tab. 3, we compare different combinations, including qkvom, qkvm, qkv, qko, and qk. The model equipped with perceptual weights for all types of weights achieves the best performance on most benchmarks. We notice that the performance of qkv is much better than that of qk. This suggests that the value matrix is essential for visual perception, since the outputs of the value matrix are weighted and summed to form the result of the self-attention module.
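One way to make the role of the value matrix concrete is to write out standard scaled dot-product attention with perceptual weights added to each projection. The expression below is a simplified single-head sketch without the causal mask; the symbols $X$ (text hidden states), $W_q, W_k, W_v, W_o$, and the per-type updates $\Delta W_q, \Delta W_k, \Delta W_v, \Delta W_o$ are notational assumptions introduced here for illustration.

```latex
\mathrm{Attn}(X) =
\underbrace{\mathrm{softmax}\!\left(
  \frac{X\,(W_q + \Delta W_q)\,\bigl(X\,(W_k + \Delta W_k)\bigr)^{\top}}{\sqrt{h}}
\right)}_{\text{mixing coefficients}}
\;
\underbrace{X\,(W_v + \Delta W_v)}_{\text{mixed content}}
\,(W_o + \Delta W_o)
```

In the qk setting, $\Delta W_v = \Delta W_o = 0$, so visual information can only change the mixing coefficients, and every output remains a re-weighted combination of the original, text-only value vectors. Adding $\Delta W_v$ lets visual information enter the content that is being mixed, which is consistent with the gap between qkv and qk in Tab. 3.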

The rank of perceptual weights. The rank of the generated perceptual weights represents the degree of visual information compression: the smaller the rank, the more compressed the visual information. We compare the performance for rank $r$ from 16 to 128 in Tab. 4. When $r=16$, the visual information is severely compressed in the perceptual weights, yet the LLM can still perceive visual information with such low-rank perceptual weights. From $r=16$ to $r=64$, the performance on MMBench, MME, HallusionBench, and CCBench improves with increasing rank. Specifically, the MMBench score increases from 59.4 to 63.4, and the MME score increases from 1212.7 to 1311.3. When the rank reaches 128, VLoRA’s performance declines across these benchmarks. The reason might be that the visual information becomes redundant, and a large rank may introduce noise into the perceptual weights, which hurts the LLM’s capability.
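To give a feel for how strongly each rank compresses the update, the sketch below counts the values needed to represent a rank-$r$ update of a square $h \times h$ weight and compares that with the full matrix. The hidden size of 4096 is the standard value for a 7B LLaMA-style model and is an assumption here, as it is not stated in this section; the feed-forward weights are not square, so this covers only the attention-style case.

```python
def low_rank_values(h, r):
    """Values stored by two factors of shape (h x r) and (r x h)."""
    return 2 * h * r


h = 4096  # assumed hidden size of the 7B LLM
full_values = h * h

compression = {r: low_rank_values(h, r) / full_values for r in (16, 32, 64, 128)}

for r, fraction in compression.items():
    print(f"r={r}: {low_rank_values(h, r)} values, {fraction:.2%} of a full {h}x{h} matrix")
```

Under these assumptions the fraction is simply $2r/h$, so even the largest rank in Tab. 4 uses only a few percent of the values of a full-rank update.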

The number of blocks of perceptual weights generator. To explore the influence of the perceptual weights generator, we perform experiments with different numbers of blocks in the perceptual weights generator. In Tab. 5, we observe that the performance of the weights generator with 8 blocks is better than with 4 blocks. However, when it comes to $N=12$, the scores on ScienceQA and CCBench are higher than with 8 blocks, but performance drops on other benchmarks. This suggests that while a stronger perceptual weights generator can achieve better performance, there is no benefit to increasing the number of blocks after the threshold is reached.

6 Conclusion

In this paper, instead of aligning visual features with the input space of the LLM, we propose VLoRA to align visual features with the parameter space of the LLM. By not introducing visual tokens into the LLM, VLoRA lets the LLM perceive visual information without the computational overhead of a longer input sequence. To convert visual features into perceptual weights, we propose the perceptual weights generator, which generates low-rank perceptual weights for any weights of the LLM. Due to the low-rank property, the perceptual weights can be seen as LoRA weights, where $\Delta W_A$ is generated from visual features and $\Delta W_B$ is learnable. We perform comprehensive experiments on six MLLM benchmarks, and VLoRA achieves performance comparable to LLaVA-v1.5 on most benchmarks while incurring only about 8% of LLaVA-v1.5’s computational cost at a text length of 32. In the ablation study, we reproduce LLaVA-v1.5 under the same settings and show that our method achieves comparable or better performance.

7 Limitations

Despite VLoRA’s promising performance on various benchmarks, it still has some limitations. 1) Representing images as model weights is a previously unexplored practice, and the extracted features from existing CLIP models may not be suitable to be converted into model weights. It is necessary to explore a vision encoder that is more suitable for this paradigm. 2) We use one perceptual weights generator for one type of weight, which may lead to an insufficient correlation between different types of generated perceptual weights. It may be better to use the same perceptual weights generator to produce weights for all types at once.

References

  • [1] Laion coco: 600m synthetic captions from laion2b-en. https://laion.ai/blog/laion-coco, 2022.
  • [2] Sharegpt. https://sharegpt.com, 2023.
  • [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • [4] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  • [5] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • [7] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
  • [8] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [10] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
  • [11] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
  • [12] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • [13] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [14] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023.
  • [15] Anisha Gunjal, Jihan Yin, and Erhan Bas. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143, 2024.
  • [16] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
  • [17] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021.
  • [18] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • [19] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [21] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  • [22] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246, 2024.
  • [23] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
  • [24] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  • [25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • [26] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
  • [27] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.
  • [28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  • [29] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024.
  • [30] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
  • [31] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
  • [32] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [33] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024.
  • [34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
  • [35] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.
  • [36] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. Gpt understands, too. AI Open, 2023.
  • [37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023.
  • [38] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • [39] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  • [40] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • [41] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
  • [42] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  • [43] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
  • [44] OpenAI. Gpt-4 technical report, 2023.
  • [45] OpenAI. Gpt-4v(ision) system card. 2023.
  • [46] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020.
  • [47] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022.
  • [49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
  • [50] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
  • [51] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
  • [52] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [53] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms, 2024.
  • [54] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [55] Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023.
  • [56] Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023.
  • [57] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  • [58] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  • [59] Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
  • [60] Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
  • [61] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  • [62] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.

Appendix A Analysis of VLoRA computational overhead

In this subsection, we give a detailed calculation of the computational overhead of VLoRA. Similar to Sect. 3.4, we assume the LLM has $d$ blocks and hidden states dimension of $h$, the input text length is $C$, and the number of visual tokens is $L$. Therefore, the FLOPs of the self-attention module and the feed-forward network are $8Lh^2 + 4L^2h$ and $16Lh^2$. Since visual tokens are not introduced, then LLM has a computational overhead of $24Cdh^2 + 4C^2dh$ for text token sequence input.

For training, we use perceptual weights as branches of LLM weights. The extra computation comes from three parts: 1) the matrix multiplication of the two perceptual weights with FLOPs of $24krh^2$. 2) The multiplication of the text token and the perceptual weights with FLOPs of $12Ckh^2$. 3) The output coming out of the perceptual weights is to be added to the output of the LLM weights with FLOPs of $14Ckh$. Therefore, the total FLOPs of VLoRA during training is $24Cdh^2 + 4C^2dh + 24krh^2 + 12Ckh^2 + 14Ckh$.

For inference, we merge the perceptual weights with LLM’s weights. The extra computation comes from two parts: 1) the matrix multiplication of the two perceptual weights with FLOPs of $24krh^2$, which is the same as training. 2) Adding perceptual weights to LLM weights with FLOPs of $12kh^2$. The total FLOPs during inference are $24Cdh^2 + 4C^2dh + 24krh^2 + 12kh^2$.
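The formulas above can be evaluated numerically as a sanity check. The sketch below plugs in values that are assumptions for a 7B LLaMA-style model ($d=32$ blocks, $h=4096$), together with the paper's settings of 32 text tokens, 576 visual tokens for LLaVA-v1.5, and rank 64. It also assumes that $k$ is the number of blocks equipped with perceptual weights, that is, 8. The function names are hypothetical.

```python
def llm_flops(seq_len, d, h):
    """LLM FLOPs for a sequence of seq_len tokens: 24*S*d*h^2 + 4*S^2*d*h."""
    return 24 * seq_len * d * h**2 + 4 * seq_len**2 * d * h


def vlora_training_flops(C, d, h, k, r):
    """Text-only LLM cost plus the three extra training terms."""
    return llm_flops(C, d, h) + 24 * k * r * h**2 + 12 * C * k * h**2 + 14 * C * k * h


def vlora_inference_flops(C, d, h, k, r):
    """Text-only LLM cost plus generating and merging the perceptual weights."""
    return llm_flops(C, d, h) + 24 * k * r * h**2 + 12 * k * h**2


d, h = 32, 4096      # assumed 7B LLaMA-style architecture
C, L = 32, 576       # text tokens; visual tokens used by LLaVA-v1.5
k, r = 8, 64         # assumed: blocks with perceptual weights; rank

llava_gflops = llm_flops(C + L, d, h) / 1e9
vlora_gflops = vlora_inference_flops(C, d, h, k, r) / 1e9
vlora_train_gflops = vlora_training_flops(C, d, h, k, r) / 1e9

print(f"LLaVA-v1.5 (visual + text tokens): {llava_gflops:.1f} GFLOPs")
print(f"VLoRA inference: {vlora_gflops:.1f} GFLOPs")
print(f"VLoRA training forward: {vlora_train_gflops:.1f} GFLOPs")
print(f"ratio: {vlora_gflops / llava_gflops:.1%}")
```

Under these assumptions the formulas give roughly 8028 GFLOPs for LLaVA-v1.5 and roughly 621 GFLOPs for VLoRA at inference, close to the 8027 and 619 GFLOPs listed in Tab. 1, and a ratio of just under 8%.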

Appendix B Visualization Results

VLoRA achieves promising results on various MLLM benchmarks, but these benchmarks consist of either multiple-choice or yes/no judgment questions. To further demonstrate VLoRA’s capabilities, we show some real-world samples in Fig. 5. The first example shows that VLoRA can accurately count the steaks in the image. The second shows that VLoRA has sufficient common sense. In the third, VLoRA demonstrates the ability to reason and to hold long text conversations.

Figure 5: Visualization results of VLoRA. This figure demonstrates the capabilities of our VLoRA in real-world scenarios, including accurate counting and common sense reasoning.

Appendix C Broader Impacts

Our proposed paradigm significantly improves the training and inference efficiency of multimodal large models and reduces their computational overhead. For research, this lowers the resource threshold for working on multimodal large models and encourages active exploration by researchers in related fields. For practical applications, it reduces the cost of large-scale deployment and helps to reduce resource consumption.