RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

Jiaming Liu¹ ^∗ ^†, Mengzhen Liu¹, Zhenyu Wang¹, Lily Lee¹, Kaichen Zhou¹,
Pengju An¹, Senqiao Yang¹, Renrui Zhang ^†, Yandong Guo², Shanghang Zhang¹ ^✉
¹National Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University, ²AI²Robotics
jiamingliu@stu.pku.edu.cn, 21251282@bjtu.edu.cn, shanghang@pku.edu.cn
^∗ Equal contribution, ^† Project Lead, ^✉ Corresponding author.

Abstract

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training, empowering our model with visual common sense and robot-related reasoning. To further equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time (20 minutes). In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 7 times faster than existing robot MLLMs. Our project web page: https://sites.google.com/view/robomamba-web

1 Introduction

The scaling up of data has significantly propelled research on Large Language Models (LLMs) [1, 2, 3], showcasing notable advancements in reasoning and generalization abilities within Natural Language Processing (NLP). To comprehend multimodal information, Multimodal Large Language Models (MLLMs) [4, 5, 6, 7, 8] have been introduced, empowering LLMs with the capability of visual instruction-following and scene understanding. Inspired by the strong capabilities of MLLMs in general settings, recent research aims to incorporate MLLMs into robot manipulation. On one hand, some works [9, 10, 11, 12] enable robots to comprehend natural language and visual scenes, automatically generating task plans. On the other hand, [13, 14, 15] effectively leverage the inherent capabilities of MLLMs, empowering them with the ability to predict manipulation poses.

Robot manipulation involves interacting with objects in dynamic environments, requiring human-like reasoning abilities to comprehend the semantic information of scenes [11, 16], alongside a robust low-level action prediction ability [17, 18]. While existing MLLM-based approaches can handle a range of basic tasks, they still face challenges in two aspects. First, the reasoning capabilities of pre-trained MLLMs [6, 19] in robotic scenarios are found to be insufficient. As shown in Figure 1 (reasoning example), this deficiency presents challenges for fine-tuned robot MLLMs when they encounter complex reasoning tasks. Second, fine-tuning MLLMs and using them to generate robot manipulation actions incur high computational costs due to their expensive attention-based LLMs [20, 21]. To balance reasoning ability and efficiency, several studies [22, 23, 24] have emerged in the field of NLP. Notably, Mamba [25] introduces the innovative selective State Space Model (SSM), promoting context-aware reasoning while maintaining linear complexity. Drawing inspiration from this, we raise a question: “Can we develop an efficient robot MLLM that possesses strong reasoning capabilities while also acquiring robot manipulation skills in a very cost-effective manner?”

Figure 1: Overview of RoboMamba. RoboMamba is an efficient robot MLLM that combines reasoning and manipulation capabilities. First, we innovatively integrate and align a vision encoder with the efficient Mamba language model, endowing our model with common sense and robot-related reasoning abilities. RoboMamba achieves competitive performance on general MLLM benchmarks (Part 3) while demonstrating long-horizon reasoning abilities in robotic tasks (Part 1). Subsequently, we introduce an extremely efficient fine-tuning strategy to equip RoboMamba with pose prediction abilities (Part 2), requiring only 20 minutes to fine-tune a simple policy head (3.7M parameters). More real-world downstream tasks are displayed in Figure 3.

To address this, we propose RoboMamba, an end-to-end robotic MLLM that fully leverages the efficiency of Mamba to achieve robust robotic reasoning and action capabilities. As shown in Figure 1, we initially integrate a vision encoder (e.g., CLIP [26]) with Mamba to empower RoboMamba with visual common sense and robot-related reasoning. Specifically, we proceed with alignment pre-training, activating the cross-modal connector [4, 19] to convert visual information into Mamba’s token embeddings. We then unfreeze Mamba for instruction co-training, utilizing its powerful sequence modeling to comprehend high-level robotic and general instruction data. On top of this, to equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. Notably, we discover that once RoboMamba possesses sufficient reasoning capabilities, it can acquire pose prediction skills with minimal parameter fine-tuning. The fine-tuned policy head constitutes only 0.1% of the model parameters, 10 times fewer than the parameters updated by existing robot MLLMs [15, 14]. In this way, RoboMamba can simultaneously generate robot reasoning using language responses and predict end-effector poses via the policy head.

To systematically evaluate our end-to-end RoboMamba, we conduct extensive experiments in both simulation and real-world scenarios. First, we assess its reasoning abilities on general and robotic evaluation benchmarks. As shown in Figure 1, RoboMamba, with only 3.2B parameters, achieves competitive performance on several MLLM benchmarks and also delivers promising results on RoboVQA (36.3 BLEU-4) [27]. With its strong reasoning abilities, RoboMamba achieves state-of-the-art (SOTA) manipulation performance in the SAPIEN simulation [28], requiring only a 7MB policy head and less than 20 minutes of fine-tuning on a single A100 GPU. Moreover, RoboMamba achieves an inference speed that is 7 times faster than the previous SOTA robot MLLM [15]. Additionally, we evaluate RoboMamba in real-world scenarios, where it can generate long-horizon planning and predict the end-effector pose for each atomic task. In summary, our contributions are as follows:

• We innovatively integrate a vision encoder with the efficient Mamba language model to construct our end-to-end RoboMamba, which possesses visual common sense and robot-related reasoning abilities.

• To equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy using a simple policy head. We find that once RoboMamba achieves sufficient reasoning capabilities, it can acquire pose prediction skills with minimal cost.

• In our extensive experiments, RoboMamba excels in reasoning on general and robotic evaluation benchmarks, and showcases impressive pose prediction results in both simulation and real-world experiments.

2 Related work

State Space Models (SSMs). SSMs have become effective substitutes for transformers and CNNs due to their linear scalability with sequence length [29, 23]. Recent works [22, 30, 31] use the state space to robustly establish dependencies across long sequences. In particular, Mamba [25] designs the SSM matrices to be functions of the input, creating a learnable selection mechanism that improves adaptability and reasoning capabilities. Several works [32, 33, 34, 35, 36] extend selective SSMs to vision and video tasks. Furthermore, MambaIR [37] focuses on image restoration, and PanMamba [38] addresses pan-sharpening, while DiS [39] integrates SSMs into diffusion models. These findings demonstrate that Mamba exhibits promising performance and efficiency in various visual downstream tasks. Building on these advances, we make the first attempt to introduce Mamba to address critical challenges in robotics, which demands efficient action capabilities.

Robot Manipulation. Traditional robotic manipulation employs state-based reinforcement learning [40, 41, 42, 43]. In contrast, other works [44, 11, 45, 46, 47] take the state together with visual observations as input and then make predictions. Specifically, Where2Act [48] takes visual observations and predicts actionable pixels and movable regions in objects. FlowBot3D [44] predicts point-wise motion flow on 3D objects. AnyGrasp [17] employs point cloud data to learn grasp poses on large-scale datasets. Inspired by the success of MLLMs in general scenarios [49, 50, 19, 51, 52], several studies [13, 16] explore utilizing their common sense reasoning capabilities to address manipulation problems. PaLM-E [10] integrates multimodal encodings with LLMs, training them end-to-end for manipulation planning. VoxPoser [11] extracts affordances and constraints from MLLMs to predict trajectories in a zero-shot manner. RoboFlamingo [14] fine-tunes an MLLM on a vision-language manipulation dataset to complete language-conditioned manipulation tasks. ManipLLM [15] introduces a specific training scheme for manipulation tasks that equips MLLMs with the ability to predict end-effector poses. ManipVQA [53] enhances robotic manipulation with physically grounded information processed by an MLLM. In this paper, instead of fine-tuning a pre-trained MLLM, we introduce a novel end-to-end MLLM that possesses both robot-related reasoning and pose prediction capabilities. Finally, due to space limitations, we provide additional related work in Appendix F.

3 RoboMamba

In Section 3.1, we introduce the preliminaries of our proposed robot MLLM, including the problem statement and a description of the language model. Subsequently, in Sections 3.2 and 3.3, we describe the architecture of RoboMamba and how we empower it with common sense and robot-related reasoning. In Section 3.4, we outline our proposed robot manipulation fine-tuning strategy, which equips our model with pose prediction skills using minimal fine-tuning parameters and time.

3.1 Preliminaries

Problem statement. For robot visual reasoning, our RoboMamba generates a language answer $L_{a}$ based on the image $I\in\mathbb{R}^{W\times H\times 3}$ and the language question $L_{q}$ , denoted as $L_{a}=R(I,L_{q})$ . The reasoning answer usually contains individual sub-tasks ( $L_{a}\rightarrow(L^{1}_{a},L^{2}_{a},\ldots,L^{n}_{a})$ ) for one problem $L_{q}$ . For example, when faced with a planning question like ‘How to clean the table?’, the response typically includes steps such as ‘Step 1: Pick up the object’ and ‘Step 2: Place the object in the box’. For action prediction, we utilize an efficient and simple policy head $\pi$ to predict an action $a=\pi(R(I,L_{q}))$ . Following previous works [54, 15], we use 6-DoF to express the end-effector pose of the Franka Emika Panda robot arm. The 6-DoF pose includes the end-effector position $a_{\mathrm{pos}}\in\mathbb{R}^{3}$ , representing a 3D coordinate, and the direction $a_{\mathrm{dir}}\in\mathbb{R}^{3\times 3}$ , representing a rotation matrix. When training for a grasping task, we add the gripper status to the pose prediction, resulting in 7-DoF control.

State Space Models (SSMs). In this paper, we select Mamba [25] as our language model. Mamba consists of numerous Mamba blocks, with the most crucial component being the SSM. SSMs [21] are designed based on continuous systems, projecting the 1D input sequence $x(t)\in\mathbb{R}^{L}$ into a 1D output sequence $y(t)\in\mathbb{R}^{L}$ through a hidden state $h(t)\in\mathbb{R}^{N}$ . An SSM consists of three key parameters: the state matrix ${\mathbf{A}}\in\mathbb{R}^{N\times N}$ , the input matrix ${\mathbf{B}}\in\mathbb{R}^{N\times 1}$ , and the output matrix ${\mathbf{C}}\in\mathbb{R}^{1\times N}$ . The SSM can be represented as follows:

$$h^{\prime}(t)={\mathbf{A}}h(t)+{\mathbf{B}}x(t),\qquad y(t)={\mathbf{C}}h(t). \tag{1}$$

Recent SSMs (e.g., Mamba [25]) are constructed as discretized continuous systems using a timescale parameter ${\mathbf{\Delta}}$ . This parameter transforms the continuous parameters ${\mathbf{A}}$ and ${\mathbf{B}}$ into their discrete counterparts $\overline{{\mathbf{A}}}$ and $\overline{{\mathbf{B}}}$ . The discretization employs the zero-order hold method, defined as follows:

$$\overline{{\mathbf{A}}}=\exp({\mathbf{\Delta}\mathbf{A}}), \tag{2}$$
$$\overline{{\mathbf{B}}}=({\mathbf{\Delta}\mathbf{A}})^{-1}\left(\exp({\mathbf{\Delta}\mathbf{A}})-{\mathbf{I}}\right)\cdot{\mathbf{\Delta}\mathbf{B}}, \tag{3}$$
$$h_{t}=\overline{{\mathbf{A}}}h_{t-1}+\overline{{\mathbf{B}}}x_{t},\qquad y_{t}={\mathbf{C}}h_{t}. \tag{4}$$
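As a minimal sketch of Eqs. (2)-(4), the snippet below discretizes a state matrix with zero-order hold and runs the recurrence over a short 1D sequence. It assumes a diagonal ${\mathbf{A}}$ with nonzero entries, so that the matrix exponential and inverse reduce to element-wise operations; the function names `zoh_discretize` and `ssm_scan` and the toy parameters are illustrative and are not taken from the Mamba codebase.

```python
import math


def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A (nonzero entries): returns (A_bar, B_bar)."""
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * (delta*b), element-wise for diagonal A
    b_bar = [(math.exp(delta * a) - 1.0) / (delta * a) * (delta * bi)
             for a, bi in zip(a_diag, b)]
    return a_bar, b_bar


def ssm_scan(a_diag, b, c, delta, xs):
    """Run h_t = A_bar h_{t-1} + B_bar x_t and y_t = C h_t over a 1D sequence."""
    a_bar, b_bar = zoh_discretize(a_diag, b, delta)
    h = [0.0] * len(a_diag)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Hypothetical toy parameters: N = 2 states, negative A so the state decays.
ys = ssm_scan(a_diag=[-1.0, -2.0], b=[1.0, 1.0], c=[1.0, 0.5],
              delta=0.1, xs=[1.0, 0.0, 0.0])
print(ys)
```

Each step touches the hidden state once, so the cost of the scan grows linearly with the sequence length, which is the property the text attributes to SSMs.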

Mamba introduces the Selective Scan Mechanism (S6) to form its SSM operator in each Mamba block. The SSM parameters become input-dependent, with ${\mathbf{B}}\in\mathbb{R}^{B\times L\times N}$ , ${\mathbf{C}}\in\mathbb{R}^{B\times L\times N}$ , and ${\mathbf{\Delta}}\in\mathbb{R}^{B\times L\times D}$ , where $B$ , $L$ , $D$ , and $N$ denote the batch size, sequence length, channel dimension, and state dimension, respectively. This input dependence yields better content-aware reasoning. The details of the Mamba block are shown in Figure 2.
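To illustrate what input-dependent parameters mean in practice, the following sketch recomputes ${\mathbf{\Delta}}$ , ${\mathbf{B}}$ , and ${\mathbf{C}}$ from every input element before applying the zero-order hold of Eqs. (2)-(3). It is a simplified single-channel version under stated assumptions: ${\mathbf{A}}$ is diagonal with nonzero entries, the projections are plain scalar multiplications, and the names `selective_scan`, `w_delta`, `w_b`, and `w_c` are hypothetical. It is not the hardware-aware kernel used by Mamba.

```python
import math


def softplus(z):
    """Smooth positive map, used here to keep the step size delta > 0."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a_diag, w_delta, w_b, w_c):
    """Single-channel scan where delta, B and C are recomputed from each x_t."""
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)   # input-dependent step size
        b_t = [w * x for w in w_b]      # input-dependent input matrix
        c_t = [w * x for w in w_c]      # input-dependent output matrix
        a_bar = [math.exp(delta * a) for a in a_diag]
        # Zero-order hold for diagonal A: (exp(delta*a) - 1) / a * b
        b_bar = [(ab - 1.0) / a * bi for ab, a, bi in zip(a_bar, a_diag, b_t)]
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys


# Hypothetical toy weights; the zero input in the middle yields a zero output
# because C_t vanishes, while the hidden state is still carried forward.
ys = selective_scan(xs=[1.0, 0.0, 2.0], a_diag=[-1.0, -0.5],
                    w_delta=0.5, w_b=[1.0, -1.0], w_c=[0.3, 0.7])
print(ys)
```

Because the step size and the projections change with each token, the model can decide per token how much of the input to write into the state and how much of the state to read out.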

3.2 RoboMamba architecture

To equip RoboMamba with both visual reasoning and manipulation abilities, we start from pre-trained Large Language Models (LLMs) [25] and visual models to construct an effective MLLM architecture. As shown in Figure 2, we utilize the CLIP visual encoder [26] to extract visual features $f_{v}\in\mathbb{R}^{B\times N\times 1024}$ from input images $I$ , where $B$ and $N$ represent the batch size and the number of tokens, respectively. In contrast to [55, 56], we do not adopt the vision encoder ensemble technique, which employs various backbones (e.g., DINOv2 [57], CLIP-ConvNeXt [58], CLIP-ViT) for image feature extraction. The ensemble introduces additional computational costs that severely impact the practicality of robot MLLMs in the real world. Instead, we demonstrate that a simple and straightforward model design can also achieve strong reasoning abilities when combined with high-quality data and appropriate training strategies. To enable the LLM to understand visual features, we connect the vision encoder to the LLM using a multilayer perceptron (MLP). Through this simple cross-modal connector, RoboMamba can convert visual information into the language embedding space $f^{L}_{v}\in\mathbb{R}^{B\times N\times 2560}$ . Note that model efficiency is crucial in the field of robotics, as robots need to respond quickly to human instructions. Therefore, we select Mamba as our language model due to its context-aware reasoning ability and linear computational complexity. Text prompts are encoded into the embedding space $f_{t}\in\mathbb{R}^{B\times N\times 2560}$ using the pre-trained tokenizer, then concatenated ( $cat$ ) with the visual tokens and input into Mamba. We leverage Mamba’s powerful sequence modeling to comprehend multimodal information and utilize effective training strategies to develop visual reasoning capabilities (as described in the next section). The output tokens $T_{a}$ are then detokenized ( $det$ ) to produce responses in natural language $L_{a}$ . The forward pass of the model can be represented as follows:

$$L_{a}=det(T_{a});\quad T_{a}=Mamba(cat(f^{L}_{v},f_{t}));\quad f^{L}_{v}=MLP(CLIP(I)). \tag{5}$$
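The shape bookkeeping behind Eq. (5) can be sketched as follows. The snippet only tracks tensor shapes and does not run any model; the helper names `project_visual` and `concat_tokens` and the token counts (256 visual tokens, 32 text tokens) are assumptions chosen for illustration, while the 1024 and 2560 feature widths come from the text.

```python
def project_visual(shape, out_dim=2560):
    """MLP connector: maps CLIP features (B, N_v, 1024) to (B, N_v, out_dim)."""
    batch, n_visual, dim = shape
    assert dim == 1024, "the text states 1024-dimensional CLIP features"
    return (batch, n_visual, out_dim)


def concat_tokens(visual_shape, text_shape):
    """cat(f_v^L, f_t): concatenate visual and text tokens along the token axis."""
    batch_v, n_visual, dim_v = visual_shape
    batch_t, n_text, dim_t = text_shape
    assert batch_v == batch_t and dim_v == dim_t, "batch and width must match"
    return (batch_v, n_visual + n_text, dim_v)


# Hypothetical sizes: batch of 2, 256 visual tokens, 32 text tokens.
f_v = (2, 256, 1024)   # CLIP(I)
f_t = (2, 32, 2560)    # embedded text prompt
mamba_input = concat_tokens(project_visual(f_v), f_t)
print(mamba_input)  # (2, 288, 2560)
```

The visual and text sequences may have different lengths; only the batch size and the embedding width need to agree before concatenation.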

To equip our model with both reasoning and manipulation abilities, we meticulously design a comprehensive training pipeline, which is divided into two stages. We introduce the training recipes of Stage 1 in Section 3.3 and present the robot manipulation fine-tuning in Section 3.4.

3.3 General and robot-related training

After constructing the RoboMamba architecture, the next goal is to train our model to learn general vision and robot-related reasoning abilities. As shown in Figure 2, we divide our Stage 1 training into two steps: alignment pre-training (Stage 1.1) and instruction co-training (Stage 1.2). Specifically, unlike previous MLLM training methods [19, 59, 55], we aim to enable RoboMamba to comprehend both common vision and robotic scenes. Given that the robotics field involves numerous complex and novel tasks, RoboMamba requires enhanced generalization capabilities. Therefore, we adopt a co-training strategy in Stage 1.2, combining high-level robotic data (e.g., task planning) with general instruction data. We find that co-training not only leads to more generalizable robot policies but also enhances general scene reasoning abilities due to the complex reasoning tasks embedded in the robotic data (demonstrated in Appendix C). The training details are shown below:

Stage 1.1: Alignment pre-training. We adopt the filtered 558K image-text paired dataset from LLaVA [4] for cross-modal alignment. As shown in Figure 2, we freeze the parameters of the CLIP encoder and the Mamba language model, and only update the projection layer. In this way, we align image features with the pre-trained Mamba word embeddings.

Stage 1.2: Instruction co-training. In this stage, we first follow previous MLLM works [4, 55, 56] for general vision instruction data collection. We adopt the 655K LLaVA mixed instruction dataset [4] and the 400K LRV-Instruct dataset [60], which aim to learn visual instruction following and mitigate hallucination, respectively. Note that mitigating hallucination plays an important role in robotic scenarios, as the robot MLLM needs to generate task plans based on real scenes instead of imagined ones. For example, existing MLLMs might formulaically answer “open the microwave” with “step 1: find the handle,” but many microwaves do not have handles. Next, we incorporate the 800K RoboVQA dataset [27] to learn high-level robotic skills, such as long-horizon planning, success classification, discriminative and generative affordance, past description, and future prediction. During co-training, as shown in Figure 2, we freeze the parameters of the CLIP encoder and fine-tune the projection layer and Mamba on the combined dataset of 1.8 million samples. All outputs from the Mamba language model are supervised using the cross-entropy loss.
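The two Stage 1 steps can be summarized as a small configuration table. This is only a sketch: the dictionary layout and the module names (`clip_encoder`, `projection`, `mamba`) are hypothetical labels, and the sample counts are the rounded figures quoted above.

```python
# Which modules are updated in each Stage 1 step, and on how much data.
STAGES = {
    "1.1 alignment pre-training": {
        "data": {"LLaVA filtered image-text pairs": 558_000},
        "trainable": {"projection"},
        "frozen": {"clip_encoder", "mamba"},
    },
    "1.2 instruction co-training": {
        "data": {
            "LLaVA mixed instruction": 655_000,
            "LRV-Instruct": 400_000,
            "RoboVQA": 800_000,
        },
        "trainable": {"projection", "mamba"},
        "frozen": {"clip_encoder"},
    },
}

co_training = STAGES["1.2 instruction co-training"]["data"]
co_training_total = sum(co_training.values())
robot_share = co_training["RoboVQA"] / co_training_total

print(co_training_total)       # 1855000
print(round(robot_share, 3))   # 0.431
```

With these rounded counts the co-training mix sums to about 1.86 million samples, consistent with the roughly 1.8 million stated in the text, and robot-related data accounts for a little over 40% of it.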

3.4 Robot manipulation fine-tuning

Building upon RoboMamba’s strong reasoning ability, we introduce our robot manipulation fine-tuning strategy in this section, termed Training Stage 2 in Figure 2. Existing MLLM-based manipulation methods [15, 14] require updating the projection layer and the entire LLM during the manipulation fine-tuning stage. While this paradigm can develop action prediction capabilities, it also breaks the inherent abilities of the MLLM and demands significant training resources. To address these challenges, we propose an efficient fine-tuning strategy, as shown in Figure 2. We freeze all the parameters of RoboMamba and introduce a simple policy head to model Mamba’s output tokens. The policy head contains two MLPs that separately learn the end-effector’s position $a_{\mathrm{pos}}$ and direction $a_{\mathrm{dir}}$ , collectively occupying 0.1% of the entire model parameters. Following [48], the position and direction losses are formulated as follows:

$$L_{pos}=\frac{1}{N}\sum_{i=1}^{N}\left\|a_{\mathrm{pos}}-a^{gt}_{\mathrm{pos}}\right\|, \tag{6}$$
$$L_{dir}=\frac{1}{N}\sum_{i=1}^{N}\arccos\left(\frac{Tr\left({a^{gt}_{\mathrm{dir}}}^{T}a_{\mathrm{dir}}\right)-1}{2}\right), \tag{7}$$

where $N$ represents the number of training samples and $Tr(A)$ denotes the trace of matrix $A$ . RoboMamba only predicts the 2D position ( $x$ , $y$ ) of the contact pixel in the image, which is then translated into 3D space using depth information. To evaluate this fine-tuning strategy, we generate a dataset of 10k end-effector pose predictions using the SAPIEN simulation [28]. After manipulation fine-tuning, we find that once RoboMamba possesses sufficient reasoning capabilities, it can acquire pose prediction skills with extremely efficient fine-tuning. Due to the minimal fine-tuning parameters (7MB) and efficient model design, we need only 20 minutes to acquire new manipulation skills. This finding highlights the importance of reasoning abilities for learning manipulation skills and presents a new perspective: we can efficiently equip an MLLM with manipulation abilities without compromising its inherent reasoning capabilities. Finally, RoboMamba can use language responses for common sense and robot-related reasoning, and the policy head for action pose prediction.
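A minimal sketch of the two losses in Eqs. (6)-(7) is given below, written with plain Python lists to stay dependency-free. It assumes the norm in Eq. (6) is Euclidean, which the text does not state explicitly, and it clamps the cosine to $[-1, 1]$ to guard against rounding error; the function names are illustrative.

```python
import math


def position_loss(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth positions."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def rotation_angle(r_pred, r_gt):
    """Angle arccos((Tr(R_gt^T R_pred) - 1) / 2) between two 3x3 rotations."""
    # Tr(R_gt^T R_pred) equals the sum of element-wise products.
    trace = sum(r_gt[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding error
    return math.acos(cos_theta)


def direction_loss(pred, gt):
    """Mean rotation angle (in radians) over all samples."""
    return sum(rotation_angle(p, g) for p, g in zip(pred, gt)) / len(pred)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

pos = position_loss([(0.0, 0.0)], [(3.0, 4.0)])   # 2D contact pixel example
ang = direction_loss([rot_z_90], [identity])      # 90 degree rotation about z
print(pos, ang)
```

The direction term is zero when the predicted and ground-truth rotations coincide and reaches $\pi$ for rotations that differ by a half turn, so it measures orientation error directly as an angle.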

4 Experiment

In Section 4.1, we introduce our experiment settings, including dataset, implementation, and evaluation benchmark details. Subsequently, we conduct extensive experiments to demonstrate RoboMamba’s reasoning and manipulation abilities in Sections 4.2 and 4.3, respectively. To thoroughly validate the effectiveness of each method design, we perform an ablation study in Section 4.4. Finally, the qualitative results of real-world experiments are presented in Section 4.5.

4.1 Experiment settings

Datasets (Stage 1) In the alignment pre-training stage, we utilize the LLaVA-LCS 558K dataset [59], which is a curated subset of the LAION-CC-SBU dataset, supplemented with captions. During the instruction co-training stage, we create a combined dataset totaling 1.8 million samples, including the LLaVA-v1.5 655K mix [59], LRV-INSTRUCT 400K [60], and the RoboVQA 800K dataset [27]. Detailed descriptions of the datasets are provided in Appendix B.

Datasets (Stage 2) For the robot manipulation fine-tuning stage, we use SAPIEN [28] to set up an interactive simulation environment with articulated objects from PartNet [45]. To generate data, we use the Franka Panda Robot to randomly interact with objects. When a successful manipulation occurs, we record the successful 6-DoF poses of the end-effector, which serve as ground-truth labels for training. In the training set, we collect 10K images across 20 categories. For testing, we use a set of 1.1K images that include both seen categories from the training set and unseen categories. The details of the categories are provided in Appendix B.

Implementation details Before training, RoboMamba loads a pre-trained CLIP/SigLIP ViT-Large [26, 61] as the visual encoder, and the 2.8/1.4B Mamba [25] model as the language model. During the alignment pre-training and instruction co-training, we conduct training for 1 epoch and 2 epochs, respectively. We utilize the AdamW optimizer with $(\beta_{1},\beta_{2})=(0.9,0.999)$ and a learning rate (LR) of 2e-5. The precision of floating-point calculations is set to 16-bit. For manipulation fine-tuning, we train the model for 5 epochs, setting the LR to 1e-5 and applying a weight decay of 0.1. The floating-point precision is set to 32-bit. All experiments are conducted on NVIDIA A100 GPUs.

Reasoning evaluation benchmarks To evaluate reasoning capabilities, we employ several popular benchmarks, including VQAv2 [62], OKVQA [63], RoboVQA [27], GQA [64], OCRVQA [65], VizWiz [66], POPE [67], MME [68], MMBench [69], and MM-Vet [70]. As detailed in Appendix E, we describe the key aspects each benchmark focuses on when assessing models in the field of robotics. Notably, we also directly evaluate RoboMamba’s robot-related reasoning abilities on the 18k validation dataset of RoboVQA, covering robotic tasks such as long-horizon planning, success classification, discriminative and generative affordance, past description, and future prediction.

Manipulation evaluation benchmarks To evaluate our model’s manipulation capabilities, we follow previous works [44, 54, 15] and test pulling accuracy exclusively in the simulator [28]. We use the predicted contact point and rotation to interact with objects. To measure the model’s performance, we use the classical manipulation success rate, defined as the ratio of successfully manipulated samples to the total test samples. A manipulation action is considered successful if the difference in the object’s joint state before and after interaction exceeds a threshold of 0.1 meters. In real-world experiments, we use the Franka Panda robot to manipulate several articulated objects.
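The success-rate metric described above can be written down in a few lines. This is a sketch under stated assumptions: each trial is represented as a hypothetical `(joint_before, joint_after)` pair, the comparison is on the absolute change in joint state, and the example values are made up for illustration.

```python
def is_success(joint_before, joint_after, threshold=0.1):
    """A trial succeeds if the joint state changes by more than the threshold."""
    return abs(joint_after - joint_before) > threshold


def success_rate(trials, threshold=0.1):
    """Ratio of successful trials to all trials; trials are (before, after) pairs."""
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(is_success(before, after, threshold) for before, after in trials)
    return hits / len(trials)


# Hypothetical joint states for four pulling attempts.
trials = [(0.0, 0.25), (0.0, 0.05), (0.1, 0.4), (0.2, 0.2)]
rate = success_rate(trials)
print(rate)  # 0.5
```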

Table 1: Comparison of general reasoning abilities with previous MLLMs on several benchmarks. Res. refers to the resolution of the input image.

Method	LLM Size	Res.	OKVQA	VQAV2	GQA	VizWiz	OCR-VQA	POPE	MME	MMB	MM-Vet
BLIP-2 [49]	7B	224	45.9	-	41.0	19.6	40.6	85.3	1293.8	-	22.4
InstructBLIP [71]	7B	224	-	-	49.5	33.4	44.8	-	-	36	26.2
LLaMA-AdapterV2 [51]	7B	336	49.6	70.7	45.1	39.8	-	-	1328.4	-	-
MiniGPT-v2 [72]	7B	448	57.8	-	60.1	53.6	-	-	-	-	-
Qwen-VL [73]	7B	448	58.6	79.5	59.3	35.2	75.7	-	-	38.2	-
LLaVA1.5 [59]	7B	336	-	78.5	62.0	50.0	-	85.9	1510.7	64.3	30.5
SPHINX [55]	7B	224	62.1	78.1	62.6	39.9	66.0	80.7	1476.1	66.9	36.0
LLaVA-Phi [74]	2.7B	336	-	71.4	-	35.9	-	85.0	1335.1	59.8	28.9
MobileVLM [75]	2.7B	336	-	-	59.0	-	-	84.9	1288.9	59.6	-
TinyLLaVA [76]	2.7B	336	-	77.7	61.0	-	-	86.3	1437.3	68.3	31.7
RoboMamba (Ours)	2.7B	224	63.1	80.3	62.4	55.0	62.5	85.3	1314.8	64.2	28.6
RoboMamba (Ours)	2.7B	384	62.4	79.1	64.4	55.0	66.7	86.9	1354.2	65.7	29.7

4.2 Reasoning quantitative result

General reasoning. As shown in Table 1, we compare RoboMamba with previous state-of-the-art (SOTA) MLLMs on general VQA and recent MLLM benchmarks. First, we find that RoboMamba achieves promising results across all VQA benchmarks, using only a 2.7B language model. The results demonstrate that our simple architecture design is effective. The alignment pre-training and proposed instruction co-training significantly enhance the MLLM’s reasoning capabilities. For example, due to the large amount of robot data introduced during the co-training stage, our model’s spatial identification performance on the GQA benchmark is improved. Meanwhile, we also test RoboMamba on recently proposed MLLM benchmarks. Compared to previous MLLMs, we observe that our model achieves competitive results across all benchmarks. Notably, our model achieves satisfactory results on the POPE benchmark due to the inclusion of the LRV-Instruct dataset during the co-training stage, which helps reduce failed robot actions caused by hallucinations. Although RoboMamba still trails LLaVA1.5 and SPHINX on some benchmarks, we prioritize a smaller and faster Mamba to keep the robotic model efficient. In the future, we plan to develop RoboMamba-7B for scenarios where resources are not limited.

Robot-related reasoning. To comprehensively compare RoboMamba’s robot-related reasoning abilities, we benchmark it against LLaMA-AdapterV2 [51] on the RoboVQA [27] validation set. We choose LLaMA-AdapterV2 as a baseline because it serves as the base model for the current SOTA robot MLLM, ManipLLM [15]. For a fair comparison, we load the baseline's pre-trained parameters and fine-tune it on the RoboVQA training set for two epochs, using its official instruction tuning method. As shown in Figure 3 a), RoboMamba achieves superior performance across BLEU-1 to BLEU-4. The results indicate that our model possesses advanced robot-related reasoning capabilities and confirm the effectiveness of our training strategy. In addition to higher accuracy, our model achieves inference speeds 7 times faster than LLaMA-AdapterV2 and ManipLLM, which can be attributed to the content-aware reasoning ability and efficiency of the Mamba language model [25]. Finally, we visualize the qualitative results in Figure 4.

Table 2: Comparison of the success rates between RoboMamba and baselines across various training (seen) and test (unseen) categories. The representation for each category icon is shown in Table 3.

Method	Seen Categories (first 16 of 20)
UMPNet [54]	0.28	0.41	0.25	0.20	0.49	0.20	0.35	0.57	0.51	0.25	0.66	0.17	0.17	0.26	0.27	0.40
FlowBot3D [44]	0.50	0.53	0.26	0.36	0.34	0.36	0.54	0.26	0.12	0.34	0.41	0.23	0.36	0.30	0.17	0.37
RoboFlamingo [14]	0.48	0.51	0.50	0.35	0.11	0.47	0.54	0.35	0.19	0.46	0.18	0.64	0.26	0.42	0.15	0.87
ManipLLM [15]	0.68	0.62	0.45	0.74	0.42	0.25	0.61	0.66	0.56	0.52	0.50	0.42	0.64	0.76	0.63	0.60
RoboMamba (Ours)	0.81	0.73	0.33	0.85	0.86	0.60	0.81	0.42	0.56	0.54	0.68	0.81	0.26	0.86	0.39	0.91
Method	Seen Categories (remaining 4)	AVG	Unseen Categories (10)	AVG
UMPNet [54]	0.27	0.37	0.19	0.60	0.34	0.32	0.36	0.18	0.37	0.21	0.12	0.04	0.53	0.28	0.13	0.26
FlowBot3D [44]	0.21	0.57	0.29	0.45	0.35	0.36	0.36	0.18	0.30	0.21	0.50	0.13	0.53	0.28	0.09	0.30
RoboFlamingo [14]	0.20	0.42	0.58	0.60	0.41	0.36	0.62	0.64	0.33	0.14	0.34	0.44	0.66	0.41	0.31	0.43
ManipLLM [15]	0.41	0.78	0.41	0.59	0.56	0.21	0.25	0.79	0.76	0.52	0.76	0.43	0.85	0.26	0.52	0.51
RoboMamba (Ours)	0.40	0.55	0.37	0.80	0.63	0.19	0.23	0.67	0.66	0.57	0.45	0.65	0.68	0.30	0.93	0.53

4.3 Manipulation quantitative result

Baselines. To evaluate RoboMamba’s manipulation abilities, we compare our model with four baselines: UMPNet [54], FlowBot3D [44], RoboFlamingo [14], and ManipLLM [15]. Before comparison, we reproduce all baselines and train them on our collected dataset. For UMPNet, we execute manipulation on the predicted contact point, with the orientation perpendicular to the object’s surface. FlowBot3D predicts motion direction on the point cloud, selecting the point with the largest flow magnitude as the interaction point and using the direction of the flow to represent the end-effector’s orientation. RoboFlamingo and ManipLLM load the pre-trained parameters of OpenFlamingo [50] and LLaMA-AdapterV2 [51], respectively, and follow their own fine-tuning and model updating strategies.

Results. As shown in Table 2, our RoboMamba achieves a 7.0% improvement on seen categories and a 2.0% improvement on unseen categories compared to the previous SOTA ManipLLM. Moreover, our method showcases SOTA performance across 14 of 20 seen categories, highlighting its effectiveness and stability in predicting action poses. For unseen categories, the three recent MLLM-based methods (RoboFlamingo, ManipLLM, and our method) all achieve promising performance. The results demonstrate that leveraging the strong generalization abilities of MLLMs can effectively improve the policy’s generalization ability while enhancing accuracy on unseen objects. Regarding efficiency, RoboFlamingo updates 35.5% (1.8B) of the model parameters, ManipLLM updates an adapter (41.3M) comprising 0.5% of the model parameters, whereas our fine-tuned simple policy head (3.7M) only constitutes 0.1% of the model parameters. RoboMamba thus updates 10 times fewer parameters than previous MLLM-based methods while achieving 7 times faster inference speeds. The results reveal that our RoboMamba not only possesses strong reasoning abilities but can also acquire manipulation capabilities in a cost-effective manner.
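The efficiency figures quoted above can be cross-checked with a few lines of arithmetic. The parameter counts are taken from the text (3.7M, 41.3M, 1.8B, and the 3.2B total quoted in the introduction); the variable names are illustrative, and the percentages in the text are rounded, so the computed values only need to agree approximately.

```python
policy_head = 3.7e6           # RoboMamba policy head parameters
manipllm_adapter = 41.3e6     # ManipLLM adapter parameters
roboflamingo_updated = 1.8e9  # RoboFlamingo updated parameters
robomamba_total = 3.2e9       # total RoboMamba parameters (introduction)

head_share = policy_head / robomamba_total
vs_manipllm = manipllm_adapter / policy_head
vs_roboflamingo = roboflamingo_updated / policy_head

print(f"policy head share: {head_share:.2%}")             # 0.12%
print(f"vs ManipLLM: {vs_manipllm:.1f}x fewer")            # 11.2x
print(f"vs RoboFlamingo: {vs_roboflamingo:.0f}x fewer")    # 486x
```

The policy head comes to roughly 0.1% of the full model, and the ratio to the ManipLLM adapter is a little above 10, in line with the "10 times fewer" statement.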

4.4 Ablation study

The impact of reasoning ability. We explore whether utilizing MLLMs with different reasoning abilities affects manipulation skill learning. For a fair comparison, we use the same manipulation fine-tuning strategy, injecting and fine-tuning a simple MLP policy head after the MLLM (while freezing other parameters). We compare our RoboMamba-2.7B (Ours-2.7B) with OpenFlamingo, LLaMA-AdapterV2, and our RoboMamba-1.4B. As shown in Figure 3 b), Ours-2.7B achieves promising results compared with the other models, with manipulation accuracy tracking reasoning ability. Meanwhile, Ours-2.7B (w/o C) denotes the variant trained without the instruction co-training method, omitting the robot-related RoboVQA dataset during fine-tuning. We find that this also impacts the accuracy of manipulation, especially reducing the model’s generalization ability when facing unseen objects. The results confirm our finding: fine-tuning an MLLM to learn robot skills does not require extensive resources; it only requires that the MLLM possess strong robot-related reasoning abilities. Additionally, we present more ablation studies in Appendix C, including explorations of training strategies and policy head design.

4.5 Real-world experiments

As shown in Figure 4, we visualize RoboMamba’s reasoning results across various robotic downstream tasks. For task planning, compared to LLaMA-AdapterV2, RoboMamba demonstrates more accurate and long-horizon planning abilities, thanks to its strong reasoning capabilities. For a fair comparison, we also fine-tuned the baseline LLaMA-AdapterV2 on the RoboVQA dataset. Additionally, RoboMamba accurately performs fundamental robotic tasks such as affordance generation and discrimination, showing that it can understand robotic scenes. Notably, our model also possesses past and future prediction capabilities, further highlighting its robust reasoning capabilities. For pose prediction, we use a Franka Emika robotic arm to interact with various household objects. We project RoboMamba’s predicted 3D pose back onto a 2D image, using a red dot to indicate the contact point and the end-effector to show the direction, as shown in the bottom right corner of the figure. More real-world demonstrations are provided in Appendix D and the supplementary video file.

5 Conclusion and future plan

In this paper, we introduce an end-to-end robotic MLLM named RoboMamba, which possesses both reasoning and manipulation capabilities. With RoboMamba, we can impart new manipulation skills to the model by fine-tuning a simple policy head (0.1% of the model) in approximately 20 minutes. This finding reveals how to efficiently equip an MLLM with manipulation abilities without compromising its inherent reasoning capabilities. Finally, RoboMamba excels in reasoning on both general and robot-related evaluation benchmarks and showcases impressive pose prediction results. For future work, we focus on two directions. 1) During robot manipulation fine-tuning, we aim to introduce continual learning techniques [77, 78, 79] to continuously enhance the model’s manipulation abilities in the real world. 2) We plan to construct a 3D robot MLLM [80, 81, 82], as 3D point clouds contain more geometric information that helps predict 3D manipulation poses.

References

[1] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[2] OpenAI. GPT-4 technical report, 2023.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[4] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[5] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[6] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[7] Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, et al. Sphinx-x: Scaling data and parameters for a family of multi-modal large language models. ICML 2024, 2024.
[8] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024.
[9] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[10] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[11] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
[12] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, et al. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023.
[13] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[14] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023.
[15] Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. arXiv preprint arXiv:2312.16217, 2023.
[16] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023.
[17] Hao-Shu Fang, Chenxi Wang, Hongjie Fang, Minghao Gou, Jirong Liu, Hengxu Yan, Wenhai Liu, Yichen Xie, and Cewu Lu. Anygrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023.
[18] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[19] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[21] Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, et al. State space model for new-generation network alternative to transformers: A survey. arXiv preprint arXiv:2404.09516, 2024.
[22] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
[23] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
[24] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
[25] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[27] Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, et al. Robovqa: Multimodal long-horizon reasoning for robotics. arXiv preprint arXiv:2311.00899, 2023.
[28] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[29] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
[30] Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022.
[31] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
[32] Ethan Baron, Itamar Zimerman, and Lior Wolf. 2-d ssm: A general spatial layer for visual transformers. arXiv preprint arXiv:2306.06635, 2023.
[33] Jiacheng Ruan and Suncheng Xiang. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491, 2024.
[34] Ziyang Wang and Chao Ma. Semi-mamba-unet: Pixel-level contrastive cross-supervised visual mamba-based unet for semi-supervised medical image segmentation. arXiv preprint arXiv:2402.07245, 2024.
[35] Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, and Lei Li. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079, 2024.
[36] Eric Nguyen, Karan Goel, Albert Gu, Gordon Downs, Preey Shah, Tri Dao, Stephen Baccus, and Christopher Ré. S4nd: Modeling images and videos as multidimensional signals with state spaces. Advances in neural information processing systems, 35:2846–2861, 2022.
[37] Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. Mambair: A simple baseline for image restoration with state-space model. arXiv preprint arXiv:2402.15648, 2024.
[38] Xuanhua He, Ke Cao, Keyu Yan, Rui Li, Chengjun Xie, Jie Zhang, and Man Zhou. Pan-mamba: Effective pan-sharpening with state space model. arXiv preprint arXiv:2402.12192, 2024.
[39] Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Scalable diffusion models with state space backbone. arXiv preprint arXiv:2402.05608, 2024.
[40] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
[41] Yiran Geng, Boshi An, Haoran Geng, Yuanpei Chen, Yaodong Yang, and Hao Dong. End-to-end affordance learning for robotic manipulation. In International Conference on Robotics and Automation (ICRA), 2023.
[42] Shirin Joshi, Sulabh Kumra, and Ferat Sahin. Robotic grasping using deep reinforcement learning. In 2020 IEEE 16th International Conference on Automation Science and Engineering (CASE), pages 1461–1466. IEEE, 2020.
[43] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
[44] Ben Eisner, Harry Zhang, and David Held. Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. arXiv preprint arXiv:2205.04382, 2022.
[45] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019.
[46] Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. arXiv preprint arXiv:2304.00464, 2023.
[47] Qianxu Wang, Haotong Zhang, Congyue Deng, Yang You, Hao Dong, Yixin Zhu, and Leonidas Guibas. Sparsedff: Sparse-view feature distillation for one-shot dexterous manipulation. arXiv preprint arXiv:2310.16838, 2023.
[48] Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2act: From pixels to actions for articulated 3d objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6813–6823, 2021.
[49] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
[50] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
[51] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
[52] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
[53] Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, and Hao Dong. Manipvqa: Injecting robotic affordance and physically grounded information into multi-modal large language models. arXiv preprint arXiv:2403.11289, 2024.
[54] Zhenjia Xu, Zhanpeng He, and Shuran Song. Universal manipulation policy network for articulated objects. IEEE Robotics and Automation Letters, 7(2):2447–2454, 2022.
[55] Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
[56] Han Zhao, Min Zhang, Wei Zhao, Pengxiang Ding, Siteng Huang, and Donglin Wang. Cobra: Extending mamba to multi-modal large language model for efficient inference. arXiv preprint arXiv:2403.14520, 2024.
[57] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[58] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.
[59] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[60] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations, 2023.
[61] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[62] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[63] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
[64] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[65] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
[66] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
[67] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[68] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[69] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[70] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[71] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[72] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
[73] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023.
[74] Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Llava-phi: Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.
[75] X Chu, L Qiao, X Lin, S Xu, Y Yang, Y Hu, F Wei, X Zhang, B Zhang, X Wei, et al. Mobilevlm: A fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
[76] Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A framework of small-scale large multimodal models. arXiv preprint arXiv:2402.14289, 2024.
[77] Jiaming Liu, Senqiao Yang, Peidong Jia, Ming Lu, Yandong Guo, Wei Xue, and Shanghang Zhang. Vida: Homeostatic visual domain adapter for continual test time adaptation. arXiv preprint arXiv:2306.04344, 2023.
[78] Senqiao Yang, Jiarui Wu, Jiaming Liu, Xiaoqi Li, Qizhe Zhang, Mingjie Pan, and Shanghang Zhang. Exploring sparse visual prompt for cross-domain semantic segmentation. arXiv preprint arXiv:2303.09792, 2023.
[79] Jiaming Liu, Ran Xu, Senqiao Yang, Renrui Zhang, Qizhe Zhang, Zehui Chen, Yandong Guo, and Shanghang Zhang. Adaptive distribution masked autoencoders for continual test-time adaptation. arXiv preprint arXiv:2312.12480, 2023.
[80] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. arXiv preprint arXiv:2307.12981, 2023.
[81] Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, and Shanghang Zhang. Lidar-llm: Exploring the potential of large language models for 3d lidar understanding. arXiv preprint arXiv:2312.14074, 2023.
[82] Yiwen Tang, Jiaming Liu, Dong Wang, Zhigang Wang, Shanghang Zhang, Bin Zhao, and Xuelong Li. Any2point: Empowering any-modality large models for efficient 3d understanding. arXiv preprint arXiv:2404.07989, 2024.
[83] https://sharegpt.com/. Sharegpt. 2023.
[84] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
[85] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
[86] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
[87] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
[88] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
[89] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[90] Jiageng Mao, Yuxi Qian, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
[91] Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes. arXiv preprint arXiv:2308.08769, 2023.
[92] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
[93] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. Mobilevlm v2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

Appendix A Appendix

Due to space limitations, we provide additional details of the proposed method in this supplementary material. In Appendix B, we offer a more detailed description of our training datasets, covering alignment pre-training, instruction co-training, and robot manipulation fine-tuning. Additional ablation studies are presented in Appendix C, which explore the impact of different training strategies on reasoning ability and the effect of different head designs on manipulation fine-tuning. In Appendix D, we show additional qualitative results across multiple robot-related downstream tasks. Finally, we provide the metric selection rationale and the usage of prompts during testing in Appendix E.

Appendix B Dataset description


Safe	Door	Display	Refrigerator	Laptop	Lighter	Microwave	Mouse	Box	Trashcan

Kitchen pot	Suitcase	Pliers	Storage	Remote	Bottle	Folding chair	Toaster	Lamp	Dispenser

Toilet	Scissors	Table	Stapler	Kettle	USB	Switch	Washing	Faucet	Phone

Table 3: Representation of each category icon.

Stage 1.1: Alignment pre-training dataset.

1) LLaVA-LCS 558K: This LLaVA Visual Instruct Pretrain LCS-558K dataset is a curated subset of the LAION/CC/SBU dataset, specifically filtered to achieve a more balanced distribution of concept coverage. Additionally, it includes captions paired with BLIP synthetic captions for reference purposes.

Stage 1.2: Instruction co-training dataset.

1) LLaVA-v1.5 655K: This dataset is a mixture of ten distinct datasets, including LLaVA [4], ShareGPT [83], VQAv2 [62], GQA [64], OKVQA [63], OCRVQA [65], A-OKVQA [84], TextCaps [85], RefCOCO [86, 87], and Visual Genome (VG) [88]. This mixed dataset is also one of the most widely used datasets for instruction tuning in several works [4, 59].

2) LRV-INSTRUCT 400K: Although recent work [55] suggests that increasing image resolution can effectively mitigate hallucination, processing high-resolution images affects the efficiency of robotic policy. Therefore, we introduced this dataset during the co-training process to mitigate hallucination. This dataset contains visual instructions generated by GPT-4, encompassing 16 different vision-and-language tasks with open-ended instructions and answers.

3) RoboVQA 800K: In co-training, we use this dataset to enhance our model’s robot-related reasoning abilities. RoboVQA [27] comprises realistic data collected by performing various user requests and using multiple embodiments, such as robots, humans, and humans with grasping tools. This dataset includes 5,246 long-horizon episodes and 92,948 medium-horizon episodes of robotic tasks, each paired with image and text prompt inputs.

Stage 2: Robot manipulation fine-tuning dataset.

Representation for Each Category Icon. In Table 3, we provide an overview of the meaning of each category icon presented in Table 2 of the main paper. These categories and their corresponding objects are sourced from PartNet-Mobility [89].

Simulator Data Collection. In the simulator, we use a Franka Panda Robot with a suction gripper as the robotic actuator. During data collection, we randomly select a contact point on the movable part of the object and orient the end-effector’s z-axis opposite to the object’s normal vector, with a random y-axis direction, to interact with the object. Interactions that succeed are recorded as successful samples and added to the training dataset. For the training set, we collect 10K images across 20 categories, including Safe, Door, Display, Refrigerator, Laptop, Lighter, Microwave, Mouse, Box, Trash Can, Kitchen Pot, Suitcase, Pliers, Storage Furniture, Remote, Bottle, Folding Chair, Toaster, Lamp, and Dispenser. For testing, we use a set of 1.1K images that include both seen categories from training and unseen categories, such as Toilet, Scissors, Table, Stapler, Kettle, USB, Switch, Washing Machine, Faucet, and Phone.
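The orientation sampling described above can be sketched in a few lines. This is a minimal illustration, not the data collection code used in the paper: it assumes the surface normal at the contact point is already known, it assumes a right-handed end-effector frame, and all function names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(c / norm for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def sample_end_effector_frame(surface_normal, rng=random):
    """Return (x, y, z) axes with z opposite to the normal and a random y."""
    # The z-axis points into the surface, i.e. opposite to the normal.
    z_axis = tuple(-c for c in normalize(surface_normal))
    # Use the world axis least aligned with z to build a perpendicular basis.
    world_axes = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    helper = min(world_axes, key=lambda axis: abs(dot(axis, z_axis)))
    u = normalize(cross(z_axis, helper))
    v = cross(z_axis, u)
    # A uniformly random angle in the plane perpendicular to z gives the y-axis.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    y_axis = tuple(math.cos(angle) * a + math.sin(angle) * b for a, b in zip(u, v))
    # Complete the right-handed frame: x = y cross z.
    x_axis = cross(y_axis, z_axis)
    return x_axis, y_axis, z_axis
```

Passing a seeded `random.Random` instance as `rng` makes the sampled frames reproducible, which is convenient when regenerating a dataset.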

Appendix C Additional ablation study

The impact of training strategies on reasoning abilities. As shown in Table 4, we explore the impact of different training strategies on common sense reasoning ability. Specifically, we conduct these experiments using $336\times 336$ input images. In this table, AP refers to alignment pre-training, and IC refers to instruction co-training with different dataset combinations. First, we observe that Ex2 outperforms Ex1 across all three metrics, validating the importance of alignment pre-training, consistent with conclusions drawn by mainstream methods [19, 4, 49]. Next, we find that incorporating the LRV-INSTRUCT 400K dataset (Ex3) indeed improves POPE accuracy, helping our model mitigate the negative effects of hallucinations. Finally, introducing robot-related datasets in co-training (Ex4) not only empowers our model with robot-related reasoning abilities but also enhances common sense reasoning performance, particularly on benchmarks like GQA that are related to spatial reasoning.

Table 4: Ablation study of training strategies on MLLM reasoning benchmarks.

	AP	IC(LLaVA-655K)	IC(LRV-400k)	IC(Robo-800k)	OKVQA	GQA	POPE
Ex1	-	✓	-	-	61.5	62.2	85.5
Ex2	✓	✓	-	-	62.3	62.7	85.9
Ex3	✓	✓	✓	-	62.0	62.6	86.6
Ex4	✓	✓	✓	✓	62.4	63.8	86.9

The impact of policy head designs on manipulation accuracy. As shown in Table 5, we explore the impact of different policy head designs on manipulation skill learning. In this table, MLP $\times$ 1 means using only one MLP head to predict both the position and direction of the end-effector pose. MLP $\times$ 2 means using one shared head to predict direction and another head to predict position. (SSM block+MLP) $\times$ 2 is similar to MLP $\times$ 2 but adds a State Space Model (SSM) block before the MLP to increase the parameter count of the policy head. The experimental results show that the manipulation accuracy across the three configurations is quite similar, indicating that the parameter count of the fine-tuned policy head has little impact on the results. Combined with Figure 3 b), this further supports our finding that once RoboMamba achieves sufficient robotic reasoning capabilities, it can acquire pose prediction skills at a low cost, regardless of the policy head design.

Table 5: Ablation study of policy head design on manipulation dataset.

result	MLP $\times$ 2	MLP $\times$ 1	(SSM block+MLP) $\times$ 2
Acc (Seen)	63.7%	62.1%	63.2%
Parameters	3.7M	1.8M	45.2M
Percentage	0.11%	0.05%	1.3%
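To make the link between head design and parameter count concrete, the sketch below counts the weights and biases of fully connected heads. The layer widths, the output dimensions, and the backbone size are hypothetical values chosen for illustration; they are not the configuration reported in Table 5 and are not expected to reproduce its numbers.

```python
def mlp_param_count(layer_sizes):
    """Count weights plus biases of an MLP with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def trainable_fraction(head_params, backbone_params):
    """Fraction of all parameters that are trainable when the backbone is frozen."""
    return head_params / (head_params + backbone_params)


# Hypothetical sizes (assumptions, not taken from the paper).
FEATURE_DIM = 2560                # width of the features fed to the heads
HIDDEN_DIM = 512                  # hidden width of each head
BACKBONE_PARAMS = 2_700_000_000   # frozen backbone, nominally "2.7B"

position_head = mlp_param_count([FEATURE_DIM, HIDDEN_DIM, 3])   # 3D contact point
direction_head = mlp_param_count([FEATURE_DIM, HIDDEN_DIM, 6])  # assumed 6D direction output

two_heads = position_head + direction_head
fraction = trainable_fraction(two_heads, BACKBONE_PARAMS)
print(f"two heads: {two_heads} parameters, {100 * fraction:.3f}% of the model")
```

Under these assumed sizes the heads remain a small fraction of a percent of the full model, which matches the order of magnitude of the percentages in Table 5.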

Appendix D Additional real-world experiments

We conduct real-world experiments involving interactions with various household objects using a Franka Emika robotic arm. We convert the finger gripper into a suction-style gripper by attaching double-sided tape, which gives the gripper head adhesive properties. The video demonstrations are included in the supplementary video file. As shown in Figure 5, we visualize our model’s reasoning results on a series of robotic downstream tasks, including long-horizon planning, discriminative affordance, generative affordance, past description, and future prediction. Additionally, we project RoboMamba’s predicted 3D pose back onto a 2D image using the camera parameters.
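The projection of a predicted 3D pose onto the image can be illustrated with a standard pinhole camera model. The sketch below is an assumption about how such a visualization could be produced, not the paper's code: it assumes the contact point and direction are already expressed in the camera frame, ignores lens distortion, and uses made-up intrinsics and function names.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def project_pose(contact_point, direction, intrinsics, arrow_length=0.05):
    """Return pixel positions of the contact point and of a short arrow tip.

    The arrow tip lies `arrow_length` metres from the contact point along
    `direction`, so the 2D segment indicates the end-effector direction.
    """
    tip = tuple(p + arrow_length * d for p, d in zip(contact_point, direction))
    return (
        project_point(contact_point, **intrinsics),
        project_point(tip, **intrinsics),
    )


# Hypothetical intrinsics for a 640x480 image (assumed values).
intrinsics = {"fx": 600.0, "fy": 600.0, "cx": 320.0, "cy": 240.0}

contact_px, tip_px = project_pose(
    contact_point=(0.1, -0.05, 0.5),
    direction=(0.0, 0.0, 1.0),
    intrinsics=intrinsics,
)
print("contact point (px):", contact_px)
print("arrow tip (px):", tip_px)
```

If the pose is predicted in the robot base frame instead, it would first need to be transformed into the camera frame with the camera extrinsics before calling `project_point`.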

Appendix E Reasoning evaluation benchmarks

To comprehensively evaluate our model’s capabilities in common sense reasoning, we select several general MLLM evaluation benchmarks, prioritizing those related to robotics. Below is a description of each benchmark:

• VQAv2 and OKVQA: These benchmarks assess the model’s proficiency in basic visual question answering, which is a foundational skill in embodied AI. This ability ensures that the model can understand and respond to visual content effectively.

• POPE and VizWiz: These benchmarks evaluate the model’s capability to answer questions without falling prey to visual illusions or ambiguities. This aspect is crucial for avoiding significant errors in robotic applications.

• GQA and OCRVQA: These benchmarks test the model’s ability to identify and comprehend the types and positions of important objects within an image. Such spatial identification skills are vital for tasks related to robotic manipulation and interaction with the environment.

• RoboVQA: This benchmark assesses the model’s ability to plan and understand actions based on both textual and visual inputs. This skill is indispensable in robotics, where understanding and executing complex actions is necessary.

• MM-Vet, MME and MMB: These benchmarks evaluate an MLLM’s integrated capabilities on complex multimodal tasks, including recognition, spatial awareness, OCR, and math. Each provides a rich set of evaluation indicators, such as perception and cognition, and therefore reflects the overall performance of an MLLM across different tasks.

Appendix F Additional related work

Multimodal Large Language Models. Large language models (LLMs) have exhibited remarkable reasoning capabilities across various downstream tasks [19, 90, 2]. When addressing complex multimodal reasoning challenges, multimodal large language models (MLLMs) have shown exceptional visual understanding, e.g., BLIP-2 [49], OpenFlamingo [50], LLaMA-Adapter [19, 51], and LLaVA [52]. Additionally, the introduction of 3D MLLMs [12, 91, 81] seeks to expand the reasoning and conversational capabilities of LLMs to include the 3D modality. However, deploying MLLMs is expensive due to their significant computational overhead, primarily caused by their billions of parameters. To mitigate these challenges, recent small-scale models [74, 92] demonstrate impressive performance while maintaining manageable computational costs. LLaVA-Phi [74] empowers the recently developed smaller LLM, Phi-2, for visual instruction tuning. TinyLLaVA [76] and MobileVLM V2 [93] demonstrate that high-quality training data and schemes can effectively compensate for the limited reasoning abilities of smaller MLLMs. Furthermore, Cobra [56] innovatively utilizes an SSM-based language model to reduce complexity. Different from previous works, our goal is to develop an efficient embodied MLLM using the Mamba language model. This model not only possesses common sense understanding but also has the capability to complete manipulation tasks effectively.