Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.05127 (cs)

[Submitted on 7 Jun 2024 (v1), last revised 27 Jun 2024 (this version, v2)]

Title: Towards Semantic Equivalence of Tokenization in Multimodal LLM

Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. A crux of MLLMs is vision tokenization, which involves efficiently transforming input visual signals into the feature representations that are most beneficial for LLMs. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm and flexibly determines the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim), equipped with SeTok, demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at this https URL.
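
The core idea of grouping patch features into a variable number of tokens can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, so the code below swaps in a simple greedy cosine-similarity clustering as a stand-in; the function names, the `threshold` parameter, and the mean-pooling step are all assumptions made for illustration, not SeTok's actual method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_patches(features, threshold=0.9):
    """Hypothetical stand-in for a dynamic vision tokenizer.

    Greedily assigns each patch feature to the most similar existing
    cluster if the similarity reaches `threshold`; otherwise it opens a
    new cluster. Returns one mean-pooled token per cluster, so the
    token count depends on how varied the patches are.
    """
    sums = []    # running per-cluster feature sums
    counts = []  # number of patches in each cluster
    for f in features:
        best, best_sim = -1, threshold
        for i, s in enumerate(sums):
            # Cosine similarity is scale-invariant, so comparing against
            # the running sum is equivalent to comparing against the mean.
            sim = cosine(f, s)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            sums.append(list(f))
            counts.append(1)
        else:
            sums[best] = [a + b for a, b in zip(sums[best], f)]
            counts[best] += 1
    return [[v / c for v in s] for s, c in zip(sums, counts)]


# A visually uniform "image": four identical patch features.
simple = [[1.0, 0.0]] * 4
# A more varied "image": patches from two distinct regions.
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05], [0.0, 1.0]]

simple_tokens = cluster_patches(simple)
mixed_tokens = cluster_patches(mixed)
print(len(simple_tokens), len(mixed_tokens))
```

In this toy setting the uniform input collapses to a single token while the varied input yields two, which mirrors the abstract's claim that the number of tokens adapts to image complexity. A fixed-grid tokenizer would instead emit four tokens in both cases.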

Comments:	Technical Report. The project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.05127 [cs.CV]
	(or arXiv:2406.05127v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.05127

Submission history

From: Xiangtai Li Dr
[v1] Fri, 7 Jun 2024 17:55:43 UTC (1,669 KB)
[v2] Thu, 27 Jun 2024 17:35:45 UTC (1,668 KB)
