-
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Authors:
Elmira Amirloo,
Jean-Philippe Fauconnier,
Christoph Roesmann,
Christian Kerl,
Rinu Boney,
Yusu Qian,
Zirui Wang,
Afshin Dehghan,
Yinfei Yang,
Zhe Gan,
Peter Grasch
Abstract:
Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO), and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve competitive performance to previously published alignment work for multimodal models across a range of benchmarks.
Submitted 2 July, 2024;
originally announced July 2024.
-
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Authors:
Yusu Qian,
Hanrong Ye,
Jean-Philippe Fauconnier,
Peter Grasch,
Yinfei Yang,
Zhe Gan
Abstract:
We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.
Submitted 1 July, 2024;
originally announced July 2024.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Authors:
Brandon McKinzie,
Zhe Gan,
Jean-Philippe Fauconnier,
Sam Dodge,
Bowen Zhang,
Philipp Dufter,
Dhruti Shah,
Xianzhi Du,
Futang Peng,
Floris Weers,
Anton Belyi,
Haotian Zhang,
Karanjeet Singh,
Doug Kang,
Ankur Jain,
Hongyu Hè,
Max Schwarzer,
Tom Gunter,
Xiang Kong,
Aonan Zhang,
Jianyu Wang,
Chong Wang,
Nan Du,
Tao Lei,
Sam Wiseman
, et al. (7 additional authors not shown)
Abstract:
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Submitted 18 April, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Model Stability with Continuous Data Updates
Authors:
Huiting Liu,
Avinesh P. V. S.,
Siddharth Patwardhan,
Peter Grasch,
Sachin Agarwal
Abstract:
In this paper, we study the "stability" of machine learning (ML) models within the context of larger, complex NLP systems with continuous training data updates. For this study, we propose a methodology for the assessment of model stability (which we refer to as jitter) under various experimental conditions. We find that model design choices, including network architecture and input representation, have a critical impact on stability through experiments on four text classification tasks and two sequence labeling tasks. In classification tasks, non-RNN-based models are observed to be more stable than RNN-based ones, while the encoder-decoder model is less stable in sequence labeling tasks. Moreover, input representations based on pre-trained fastText embeddings contribute to more stability than other choices. We also show that two learning strategies -- ensemble models and incremental training -- have a significant influence on stability. We recommend ML model designers account for trade-offs in accuracy and jitter when making modeling choices.
Submitted 14 January, 2022;
originally announced January 2022.
-
Noise Robust Named Entity Understanding for Voice Assistants
Authors:
Deepak Muralidharan,
Joel Ruben Antony Moniz,
Sida Gao,
Xiao Yang,
Justine Kao,
Stephen Pulman,
Atish Kothari,
Ray Shen,
Yinying Pan,
Vivek Kaul,
Mubarak Seyed Ibrahim,
Gang Xiang,
Nan Dun,
Yidan Zhou,
Andy O,
Yuan Zhang,
Pooja Chitkara,
Xuan Wang,
Alkesh Patel,
Kushal Tayal,
Roger Zheng,
Peter Grasch,
Jason D. Williams,
Lin Li
Abstract:
Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.
Submitted 10 August, 2021; v1 submitted 29 May, 2020;
originally announced May 2020.