Showing 1–19 of 19 results for author: Li, L H

  1. arXiv:2405.19315  [pdf, other]

    cs.CV cs.CL cs.LG

    Matryoshka Query Transformer for Large Vision-Language Models

    Authors: Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang

    Abstract: Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resource…

    Submitted 6 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

    Comments: Preprint. Our code and model are publicly available at https://github.com/gordonhu608/MQT-LLaVA

  2. arXiv:2311.02805  [pdf, other]

    cs.CL

    Tailoring Self-Rationalizers with Multi-Reward Distillation

    Authors: Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

    Abstract: Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In thi…

    Submitted 22 May, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

    Journal ref: The Twelfth International Conference on Learning Representations, 2024

  3. arXiv:2306.14060  [pdf, other]

    cs.CV cs.CL cs.LG

    DesCo: Learning Object Recognition with Rich Language Descriptions

    Authors: Liunian Harold Li, Zi-Yi Dou, Nanyun Peng, Kai-Wei Chang

    Abstract: Recent development in vision-language approaches has instigated a paradigm shift in learning visual recognition models from language supervision. These approaches align objects with language queries (e.g. "a photo of a cat") and improve the models' adaptability to identify novel objects and domains. Recently, several studies have attempted to query these models with complex language expressions th…

    Submitted 24 June, 2023; originally announced June 2023.

  4. arXiv:2306.14050  [pdf, other]

    cs.CL

    Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

    Authors: Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi

    Abstract: Chain-of-thought prompting (e.g., "Let's think step-by-step") primes large language models to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M–1.3B parameters) can still benefit from chain-of-th…

    Submitted 15 April, 2024; v1 submitted 24 June, 2023; originally announced June 2023.

    Comments: ACL 2023

  5. arXiv:2306.01311  [pdf, other]

    cs.CL

    MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

    Authors: Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin F. Yang, Kai-Wei Chang

    Abstract: Large-scale language models have shown the ability to adapt to a new task via conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothes…

    Submitted 2 June, 2023; originally announced June 2023.

  6. arXiv:2206.05836  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    GLIPv2: Unifying Localization and Vision-Language Understanding

    Authors: Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

    Abstract: We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, reg…

    Submitted 11 October, 2022; v1 submitted 12 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022; updated with reviewers' comments addressed; Code is released at https://github.com/microsoft/GLIP

  7. arXiv:2205.12617  [pdf, other]

    cs.CL cs.AI cs.CV

    DisinfoMeme: A Multimodal Dataset for Detecting Meme Intentionally Spreading Out Disinformation

    Authors: Jingnong Qu, Liunian Harold Li, Jieyu Zhao, Sunipa Dev, Kai-Wei Chang

    Abstract: Disinformation has become a serious problem on social media. In particular, given their short format, visual attraction, and humorous nature, memes have a significant advantage in dissemination among online communities, making them an effective vehicle for the spread of disinformation. We present DisinfoMeme to help detect disinformation memes. The dataset contains memes mined from Reddit covering…

    Submitted 25 May, 2022; originally announced May 2022.

  8. arXiv:2205.12247  [pdf, other]

    cs.CL

    GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models

    Authors: Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li, Kai-Wei Chang

    Abstract: Recent work has shown that Pre-trained Language Models (PLMs) store the relational knowledge learned from data and utilize it for performing downstream tasks. However, commonsense knowledge across different regions may vary. For instance, the color of bridal dress is white in American weddings whereas it is red in Chinese weddings. In this paper, we introduce a benchmark dataset, Geo-Diverse Commo…

    Submitted 29 November, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022. Code and data are released at https://github.com/WadeYin9712/GeoMLAMA/

  9. arXiv:2205.11502  [pdf, other]

    cs.CL cs.AI

    On the Paradox of Learning to Reason from Data

    Authors: Honghua Zhang, Liunian Harold Li, Tao Meng, Kai-Wei Chang, Guy Van den Broeck

    Abstract: Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accurac…

    Submitted 24 May, 2022; v1 submitted 23 May, 2022; originally announced May 2022.

    Comments: Table 1 & 2 numbers were out-dated in v1; we have updated them; the observations and conclusions remain unchanged

  10. arXiv:2204.08790  [pdf, other]

    cs.CV cs.CL cs.LG

    ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

    Authors: Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao

    Abstract: Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks. However, it remains challenging to evaluate the transferability of these models due to the lack of easy-to-use evaluation toolkits and public bench…

    Submitted 13 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2022 (Datasets and Benchmarks Track). The first two authors contribute equally. Benchmark page: https://computer-vision-in-the-wild.github.io/ELEVATER/

  11. arXiv:2112.09106  [pdf, other]

    cs.CV cs.AI cs.LG

    RegionCLIP: Region-based Language-Image Pretraining

    Authors: Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

    Abstract: Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without…

    Submitted 16 December, 2021; originally announced December 2021.

    Comments: Technical report

  12. arXiv:2112.08587  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

    Authors: Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

    Abstract: Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR), by jointly understanding visual objects and text tokens through layers of cross-mo…

    Submitted 15 December, 2021; originally announced December 2021.

    Comments: AAAI 2022

    Journal ref: AAAI 2022

  13. arXiv:2112.03857  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Grounded Language-Image Pre-training

    Authors: Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

    Abstract: This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can…

    Submitted 17 June, 2022; v1 submitted 7 December, 2021; originally announced December 2021.

    Comments: CVPR 2022; updated visualizations; fixed hyper-parameters in Appendix C.1

  14. arXiv:2109.06860  [pdf, other]

    cs.CL cs.AI cs.CV

    Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

    Authors: Da Yin, Liunian Harold Li, Ziniu Hu, Nanyun Peng, Kai-Wei Chang

    Abstract: Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally o…

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: EMNLP 2021. Code and data are available at https://github.com/WadeYin9712/GD-VCR

  15. arXiv:2108.04938  [pdf, other]

    cs.CV cs.AI cs.CL

    BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

    Authors: Masoud Monajatipoor, Mozhdeh Rouhsedaghat, Liunian Harold Li, Aichi Chien, C.-C. Jay Kuo, Fabien Scalzo, Kai-Wei Chang

    Abstract: Vision-and-language (V&L) models take image and text as input and learn to capture the associations between them. Prior studies show that pre-trained V&L models can significantly improve the model performance for downstream tasks such as Visual Question Answering (VQA). However, V&L models are less effective when applied in the medical domain (e.g., on X-ray images and clinical notes) due to the do…

    Submitted 10 August, 2021; originally announced August 2021.

    Comments: 10 pages, 8 figures, Accepted in ICCV workshop

  16. arXiv:2107.06383  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    How Much Can CLIP Benefit Vision-and-Language Tasks?

    Authors: Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

    Abstract: Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world. However, it has been observed that large-scale pretraining usually can result in better generalization performance, e.g., CLIP (Contrastive Language-Image Pre-training), trained on a massive amou…

    Submitted 13 July, 2021; originally announced July 2021.

    Comments: 14 pages

  17. arXiv:2010.12831  [pdf, other]

    cs.CL cs.CV cs.LG

    Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions

    Authors: Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, Kai-Wei Chang

    Abstract: Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate if a strong V&L representation model can be learned through u…

    Submitted 11 April, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

    Comments: NAACL 2021 Camera Ready

  18. arXiv:1908.03557  [pdf, other]

    cs.CV cs.CL cs.LG

    VisualBERT: A Simple and Performant Baseline for Vision and Language

    Authors: Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang

    Abstract: We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an associated input image with self-attention. We further propose two visually-grounded language model objectives for pre-training VisualBERT on image caption data. Experim…

    Submitted 9 August, 2019; originally announced August 2019.

    Comments: Work in Progress

  19. arXiv:1902.11269  [pdf, ps, other]

    cs.CL cs.LG

    Efficient Contextual Representation Learning Without Softmax Layer

    Authors: Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang

    Abstract: Contextual representation models have achieved great success in improving various downstream tasks. However, these language-model-based encoders are difficult to train due to the large parameter sizes and high computational complexity. By carefully examining the training procedure, we find that the softmax layer (the output layer) causes significant inefficiency due to the large vocabulary size. T…

    Submitted 28 February, 2019; originally announced February 2019.

    Comments: Work in progress