Captioning Visualizations with Large Language Models (CVLLM): A Tutorial

Giuseppe Carenini
Professor and Director of the Master in Data Science, Department of Computer Science
University of British Columbia
V6T 1Z4, Vancouver, BC, Canada
carenini@cs.ubc.ca
Jordon Johnson
Lecturer, Department of Computer Science
University of British Columbia
V6T 1Z4, Vancouver, BC, Canada
jordon@cs.ubc.ca
Ali Salamatian
Undergraduate Student Research Awards Recipient
University of British Columbia
V6T 1Z4, Vancouver, BC, Canada
alisalam@students.cs.ubc.ca
Abstract

Automatically captioning visualizations is not new, but recent advances in large language models (LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.

1 Introduction

It is well established that visualizations have advantages over text-based representations for a number of analysis tasks, since they more fully leverage our innate visual processing capabilities. However, it has also been found that visualizations can be well supported by textual augmentations such as captions [1]. Further, recent advances in large language models (LLMs) have resulted in their incorporation into an unprecedented number of applications and domains. With this in mind, this tutorial aims to provide: (1) an overview of captioning visualizations and key concepts in Information Visualization (InfoVis), (2) an introduction to neural networks and transformers, (3) an exploration of the limitations of LLMs and recent developments in the field, and (4) the latest research on InfoVis captioning using LLMs and Large Vision-Language Models (LVLMs).

We will begin with an overview of key concepts in InfoVis and captioning visualizations, including marks, channels, and content characterization. Following this, we will delve into the underlying mechanisms of LLMs, specifically neural networks and transformers. Finally, we will connect these concepts to discuss recent advancements in visualization captioning using LLMs and LVLMs, exploring their limitations and highlighting promising future research directions.

2 Past Editions, Similar Initiatives and Target Audience

This tutorial was first presented at AVI 2024.

A related tutorial, "NLP+Vis: NLP Meets Visualization," was offered at EMNLP 2023 and InfoVis 2022, covering a broader scope including InfoVis for NLP model interpretability and text analytics [2]. Our tutorial narrows the focus to the background knowledge and the latest methods used in textual support for InfoVis.

The target audience is researchers and practitioners in visual interfaces who want to understand the fundamental concepts and techniques involved in LLM-based textual support for visualizations.

3 Organization and Duration

The tutorial is presented in two parts, each lasting 90 minutes, with a 30-minute break in between.

3.1 Part 1

The main goal of this part is to provide the background knowledge required to understand InfoVis and LLMs.

3.1.1 Key InfoVis Concepts: Abstractions, Marks, Channels

In visualization design, it is desirable to identify and work with data abstractions and intended user tasks [3]. This section explores the importance of task abstraction and methods to achieve it using marks (geometric primitives) and channels that control the appearance of these marks.
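As a minimal sketch of these ideas, a chart design can be written down as a mark type plus a mapping from visual channels to data attributes. The class and field names below (`Encoding`, `ChartSpec`, `describe`) and the example attributes are hypothetical, chosen only for illustration; they do not come from any particular visualization library.

```python
from dataclasses import dataclass, field


@dataclass
class Encoding:
    """Maps one visual channel (e.g. x position, color) to a data attribute."""
    channel: str      # e.g. "x", "y", "color", "size"
    attribute: str    # name of the data column shown by this channel
    attr_type: str    # "quantitative", "ordinal", or "categorical"


@dataclass
class ChartSpec:
    """A chart as a mark (geometric primitive) plus channel encodings."""
    mark: str                                   # e.g. "point", "line", "bar"
    encodings: list = field(default_factory=list)

    def describe(self) -> str:
        """Return a plain-text description of how data maps onto the chart."""
        parts = [f"{e.attribute} ({e.attr_type}) on the {e.channel} channel"
                 for e in self.encodings]
        return f"A chart using {self.mark} marks, showing " + ", ".join(parts) + "."


# Hypothetical example: a bar chart of house prices.
spec = ChartSpec(
    mark="bar",
    encodings=[
        Encoding("y", "house", "categorical"),
        Encoding("x", "asking_price", "quantitative"),
    ],
)
description = spec.describe()
print(description)
```

Writing the design down this way makes the abstraction explicit: the same data could be re-encoded by changing only the mark or the channel assignments.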

Figure 1: "A four-level model of semantic content for accessible visualization. Levels are defined by the semantic content conveyed by natural language descriptions of visualizations." [4]
Figure 2: "The Y-axis identifies the houses in the three charts. In the left chart, house prices are shown along the X-axis. The house’s selling price is shown by the left edge of the bar, whereas the house’s asking price is shown by the right edge of the bar…" [5].

3.1.2 Captioning visualizations

Visualizations with numerous attributes are often difficult to understand completely until explained. Captions provide interpretations of visualizations that help readers understand the purpose of a visualization. They can improve recall and comprehension of depicted data and are crucial for individuals with visual impairments. Moreover, text augmentation of charts can be used in search and question answering.

In this tutorial, we explain the four levels of semantics used in captions (as depicted in Figure 1), and examine the extent to which each level is covered, from early visualization captioning work (e.g., shown in Figure 2) to more recent advancements such as LSTM encoder-decoder models, transformers, and captions generated by language models (e.g., shown in Figure 3).
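To make the distinction between levels concrete, the following is a small sketch that encodes the four levels of Lundgard and Satyanarayan [4] as an enumeration and produces template captions for the two lowest levels from a data table. The function `template_caption`, the toy data, and the caption wording are assumptions made for illustration; they are not the method of any system discussed in this tutorial.

```python
from enum import IntEnum
from statistics import mean


class SemanticLevel(IntEnum):
    """The four levels of semantic content described in [4]."""
    ELEMENTAL_ENCODED = 1       # chart type, axes, encodings
    STATISTICAL_RELATIONAL = 2  # extrema, averages, comparisons
    PERCEPTUAL_COGNITIVE = 3    # trends, patterns, exceptions
    CONTEXTUAL_DOMAIN = 4       # domain-specific and contextual insights


def template_caption(data, x_label, y_label, level):
    """Build a simple template caption for a bar chart at levels 1 or 2."""
    if level == SemanticLevel.ELEMENTAL_ENCODED:
        return f"A bar chart of {y_label} by {x_label} with {len(data)} bars."
    if level == SemanticLevel.STATISTICAL_RELATIONAL:
        top = max(data, key=data.get)          # category with the largest value
        avg = mean(list(data.values()))
        return f"{top} has the highest {y_label} ({data[top]}); the mean is {avg:.1f}."
    # Levels 3 and 4 need perceptual or domain reasoning that templates lack.
    raise NotImplementedError("Only levels 1 and 2 are covered by this sketch.")


# Toy data (made up for illustration).
data = {"A": 10, "B": 30, "C": 20}
l1 = template_caption(data, "category", "sales", SemanticLevel.ELEMENTAL_ENCODED)
l2 = template_caption(data, "category", "sales", SemanticLevel.STATISTICAL_RELATIONAL)
print(l1)
print(l2)
```

Levels 1 and 2 can be computed directly from the chart specification and the data, whereas levels 3 and 4 are deliberately left unimplemented here, since they require perceptual judgment or external knowledge.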

3.1.3 Neural Networks and the Transformer architecture

This section explains a range of fundamental concepts. First, neural networks are introduced in the context of next-token prediction [6]. We then cover methods that improve on simple neural models, such as the attention mechanism [7], and discuss the transformer architecture [8] used in generic LLMs [9].
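As an illustrative sketch of the attention mechanism, the code below computes scaled dot-product attention (the form commonly used in transformers) for a single query vector, using only the Python standard library. The vectors are made-up toy values, and the function names are our own; a real model would use learned projections and batched matrix operations.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of key/value vectors."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt of the dimension.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy example with three token vectors (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]
output, weights = attention(query, keys, values)
print(weights, output)
```

In this toy example the first and third keys are equally similar to the query, so they receive equal weight, and the second key, which is orthogonal to the query, receives the least.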

3.2 Part 2

The main goal of this part is to explain the limitations of LLMs, explore methods to mitigate these limitations, and discuss the latest developments in visualization captioning.

3.2.1 Large Language Models: Limitations and Recent Developments [9]

In this section, we show that although LLMs have become remarkably proficient in language competence, they are far weaker in functional competence, such as solving arithmetic and novel planning problems. They are also prone to hallucinations [10], which can negatively impact InfoVis captioning. Moreover, because LLMs are essentially very large neural networks, they also suffer from a lack of interpretability.

We conclude this section by discussing some of the latest techniques developed to address the above limitations, such as Chain-of-Thought (CoT) [11, 12, 13], Retrieval-Augmented Generation (RAG) [14], and Reinforcement Learning from Human Feedback (RLHF) [15, 16]. Lastly, we touch upon advancements in LVLMs [17] and multimodal models [18, 19], which can be effectively applied to InfoVis captioning.
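To give a flavor of how two of these techniques might be combined for captioning, the sketch below assembles a prompt that prepends retrieved context (a toy stand-in for RAG) and ends with the zero-shot CoT cue from [12]. Everything here is an assumption for illustration: the helper names (`table_to_text`, `retrieve`, `build_prompt`), the word-overlap retrieval, the prompt layout, and the data are hypothetical, and no actual LLM is called.

```python
def table_to_text(rows):
    """Serialize a list of dict rows as a pipe-delimited table."""
    header = list(rows[0].keys())
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(row[h]) for h in header))
    return "\n".join(lines)


def retrieve(query, documents, k=1):
    """Toy retrieval: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]


def build_prompt(rows, chart_title, documents):
    """Assemble a captioning prompt with retrieved context and a CoT cue."""
    context = retrieve(chart_title, documents)
    return "\n".join([
        "Context:", *context,
        "Chart title: " + chart_title,
        "Data table:", table_to_text(rows),
        "Write a caption for this chart.",
        "Let's think step by step.",     # zero-shot CoT cue [12]
    ])


# Made-up data and background documents.
rows = [{"year": 2021, "sales": 10}, {"year": 2022, "sales": 30}]
documents = ["Sales figures are reported in millions of dollars.",
             "The museum opens at nine."]
prompt = build_prompt(rows, "Annual sales by year", documents)
print(prompt)
```

A practical system would replace the word-overlap ranking with a learned retriever and pass the resulting prompt to a language model; the sketch only shows how the pieces fit together.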

3.2.2 Recent Advances and Challenges in InfoVis Captioning: A Review of Key Papers

In this section, we review six recent papers, five of which were selected from the survey and accompanying GitHub page by Huang et al., which list key works in the field [20]. These papers illustrate significant advancements in InfoVis captioning. Our review covers essential steps in recent research progress, including the creation of novel large datasets and the development and testing of new techniques designed to enhance the quality of generated captions, many of which were discussed in the previous section.

We begin by discussing the significant contributions of Kantharaj et al. (2022) [21], who introduced a large dataset of 44,096 items, including charts, data tables, and captions (primarily at level 2 of the semantic content described in Figure 1), along with a benchmark for chart captioning. Next, we examine the contribution of Tang et al. (2023) [22], which includes a dataset of 12,441 items comprising charts, scene graphs, data tables, and structured captions. Notably, this dataset incorporates captions at levels 2 and 3 (see Figure 1) obtained through crowdsourcing. As a result, their proposed models, fine-tuned on this dataset, can generate semantically rich captions, as shown in Figure 3. After that, we highlight the two tasks introduced by Li et al. (2024) [23]: multiple-figure captioning and contextualized captioning. Subsequently, we explore how previously mentioned techniques, such as Chain-of-Thought (CoT) and context retrieval, have enabled Liu et al. (2024) [24] to perform step-by-step learning more effectively and answer relevant questions more accurately. Moreover, we show that, through a novel RLHF method, Singh et al. (2023) [25] have optimized a generative figure-to-caption model for reader preferences. Finally, we present the comprehensive typology of factual errors by Huang et al. (2024) [26] and their finding that state-of-the-art vision-language models, including GPT-4V, frequently produce captions containing factual inaccuracies, as demonstrated in Figure 4.

In the concluding part of this section (which also concludes the tutorial), we discuss open issues in InfoVis captioning. These include:

  1. Handling domain-specific visualizations (e.g., pathway flowcharts in chemistry) and more complex visualizations (e.g., involving both spatial and temporal features)

  2. Developing more robust and comprehensive evaluation metrics

  3. Enhancing the interpretability of captioning models

  4. Advancing multilingual chart captioning capabilities

Figure 3: "The scene-graph model’s output L1 caption and L2/L3 caption for a VisText bar chart…" [22]
Figure 4: "Error distribution for different models on VisText and Pew." [26]

4 Presenters’ Past Experiences

Giuseppe Carenini has taught dozens of undergraduate and graduate courses in CS, AI, and NLP during his academic career. He has also created and given the following conference tutorials:

  • NLP for Conversations: Sentiment, Summarization, and Group Dynamics; Developed in collaboration with G. Murray and S. Joty – given by G. Murray at COLING 2018, Santa Fe, New Mexico, August 20, 2018 [27]

  • Discourse Processing and Its Applications in Text Mining; Developed in collaboration with G. Murray and S. Joty [28]

    • given by G. Carenini and S. Joty at ICDM 2018, Singapore, November 18, 2018

    • given by G. Carenini and S. Joty at ACL 2019, July 28, 2019

Jordon Johnson is a Lecturer in the UBC CS department, and he has been teaching there since 2018. He has received a teaching award from the CS department and repeated recognition letters from the UBC Faculty of Science. As a graduate student, he received a Killam Graduate Teaching Assistant award, which is the highest such award granted at UBC. While he has taught a variety of courses spanning all undergraduate levels, his most frequently taught courses include CPSC 322 (Introduction to Artificial Intelligence) and CPSC 422 (Intelligent Systems).

References

  • [1] Chase Stokes and Marti Hearst. Why more text is (often) better: Themes from reader preferences for integration of charts and text, 2022.
  • [2] Shafiq Joty, Enamul Hoque, and Jesse Vig. NLP+Vis: NLP meets visualization. In Qi Zhang and Hassan Sajjad, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pages 1–6, Singapore, December 2023. Association for Computational Linguistics.
  • [3] Tamara Munzner. Visualization Analysis and Design. CRC Press, 2014.
  • [4] Alan Lundgard and Arvind Satyanarayan. Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content. IEEE Transactions on Visualization & Computer Graphics (Proc. IEEE VIS), 2022.
  • [5] Vibhu O. Mittal, Johanna D. Moore, Giuseppe Carenini, and Steven Roth. Describing complex charts in natural language: A caption generation system. Computational Linguistics, 24(3):431–467, 1998.
  • [6] David L. Poole and Alan K. Mackworth. Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press, 3rd edition, 2023.
  • [7] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall, 3rd edition, 2023.
  • [8] Seán Clarke, Dan Milmo, and Garry Blight. How AI chatbots like ChatGPT or Bard work – visual explainer. The Guardian, Nov 2023.
  • [9] Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2024.
  • [10] OpenAI. GPT-4 technical report, 2024.
  • [11] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
  • [12] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023.
  • [13] Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. Can large language models reason about medical questions?, 2023.
  • [14] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.
  • [15] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
  • [16] Christopher D. Manning. Academic NLP research in the age of LLMs: Nothing but blue skies! In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, December 2023. Keynote Talk.
  • [17] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
  • [18] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM, 2023.
  • [19] Gemini Team at Google. Gemini: A family of highly capable multimodal models, 2024.
  • [20] Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, and Heng Ji. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models, 2024.
  • [21] Shankar Kantharaj, Rixie Tiffany Ko Leong, Xiang Lin, Ahmed Masry, Megh Thakkar, Enamul Hoque, and Shafiq Joty. Chart-to-text: A large-scale benchmark for chart summarization, 2022.
  • [22] Benny J. Tang, Angie Boggust, and Arvind Satyanarayan. VisText: A benchmark for semantically rich chart captioning, 2023.
  • [23] Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal ArXiv: A dataset for improving scientific comprehension of large vision-language models, 2024.
  • [24] Mengsha Liu, Daoyuan Chen, Yaliang Li, Guian Fang, and Ying Shen. ChartThinker: A contextual chain-of-thought approach to optimized chart summarization, 2024.
  • [25] Ashish Singh, Prateek Agarwal, Zixuan Huang, Arpita Singh, Tong Yu, Sungchul Kim, Victor Bursztyn, Nikos Vlassis, and Ryan A. Rossi. FigCaps-HF: A figure-to-caption generative framework and benchmark with human feedback, 2023.
  • [26] Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan, Yi R. Fung, Zhenhailong Wang, Lingyu Zhang, Shih-Fu Chang, and Heng Ji. Do LVLMs understand charts? Analyzing and correcting factual errors in chart captioning, 2024.
  • [27] Gabriel Murray, Giuseppe Carenini, and Shafiq Joty. NLP for conversations: Sentiment, summarization, and group dynamics. In Donia Scott, Marilyn Walker, and Pascale Fung, editors, Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts, pages 1–4, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics.
  • [28] Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. Discourse analysis and its applications. In Preslav Nakov and Alexis Palmer, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 12–17, Florence, Italy, July 2019. Association for Computational Linguistics.