Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang 1,4     Mang Ling Ada Fok 2,*     Yan Xia 2,4    Yansong Tang 3
Daniel Cremers 2,4    Philip Torr 5    Volker Tresp 1,4     Jindong Gu 5
1 LMU Munich    2 TU Munich    3 Tsinghua University
4 Munich Center for Machine Learning (MCML)    5 University of Oxford
zhang@dbs.ifi.lmu.de    ada.fok@tum.de
* Equal contribution
Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the image's semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization. Our project is available at https://icq-benchmark.github.io/.

1 Introduction

Videos are the prevailing data medium on the Internet and a common multimodal interface when we interact with the world. User-centric applications, such as video search engines and video highlight/moment recommendations, are increasingly popular on streaming media and short video platforms. Moreover, large foundation models are expected to process videos as input data to understand surroundings and make decisions. Consequently, video understanding has been a long-standing research topic and has recently gained increased attention.

However, videos are inherently dynamic and contain multiple events [66, 75] that are sparsely distributed. This redundancy makes processing and understanding dense videos labor-intensive and computationally demanding for both human users and deep-learning models. As a result, localizing events in videos becomes essential [28, 48].

Figure 1: Localizing events in videos with semantic queries: so far, the community has focused only on natural language query-based video event localization, as in [29]. Our benchmark ICQ addresses a more general scenario: localizing events in videos with multimodal queries.

Localizing events in videos encompasses a broad spectrum of related tasks. From a practical perspective, particularly in user-centric applications like video search and recommendation, tasks including video moment retrieval [17, 18, 39] and highlight detection [2, 29, 45] focus on identifying and retrieving video segments of interest based on textual queries within extensive, long-range videos. For video foundation models that aim to understand and reason about video content, video temporal grounding [11, 12, 13, 15, 24, 56] with a given natural language query can not only reduce the video processing duration but also elucidate the reasoning process.

A series of benchmarks [4, 16, 29, 54] has been established for exploring video event localization using natural language queries as semantic queries. Building on these foundations, existing models have primarily focused on this natural language query setting [1, 6, 7, 8, 10, 9, 12, 13, 16, 19, 29, 60]. However, with the increasing need for human users to efficiently process massive video data online and the advent of large-scale foundation models in recent years, multimodal interaction with videos is a promising scenario. In other words, text should not be the only possible query for localizing events in videos. As the saying goes, “A picture is worth a thousand words”: images are a non-verbal language that can express rich semantic meaning and describe events in videos.

Multimodal queries, also known as composed queries [23, 57], bring practical benefits to video event localization. From a pragmatic perspective, using queries such as user-input “scribble images” can facilitate a more natural human-computer interaction. Users often write brief, simple text queries rather than detailed, lengthy paragraphs when searching videos semantically, so a text query can be ambiguous. Where text fails to convey the intended meaning, an image can often succeed. Meanwhile, localizing events in videos with multimodal queries can serve as an important module of video foundation models, as in temporal grounding and episodic memory search [21, 27, 50]. This relates to grounding an event triggered by a similar scene, which resembles the common cognitive phenomenon of déjà vu.

Since using multimodal queries to semantically search for events in videos remains largely unexplored, we propose a new task: localizing events in videos with multimodal queries. We introduce a new benchmark, ICQ, for localizing events in videos using Image-Text Composed Queries as multimodal queries. Our benchmark evaluates how well models localize events in videos given multimodal queries that consist of reference images and refinement texts. Alongside this benchmark, we propose a new evaluation dataset, ICQ-Highlight, as a testbed for our task. Given that reference images may have a significant distribution shift from the video data in style and that refinement texts should alter the semantic meaning of reference images in various aspects, our dataset covers 4 reference image styles and 5 refinement text types.

We evaluate a broad spectrum of existing video localization models, from specialized models to LLM-based video foundation models, on ICQ. To bridge the gap between current natural language query-based models and multimodal queries, we propose 3 adaptation methods: Captioning, Summarization, and Visual Query Encoding. Our results demonstrate that existing models can be effectively adapted to our new benchmark with the aid of Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs), despite performance declines and instability of varying degree. These adapted models should serve as solid baselines for future studies. Additionally, our findings reveal that multimodal semantic queries can successfully localize events in videos, suggesting that multimodal queries have promising applications for video localization.

Our contributions are summarized as follows:

  1. We introduce a new evaluation benchmark, ICQ, and a new evaluation dataset, ICQ-Highlight, for analyzing event localization in videos with multimodal queries;

  2. We propose 3 adaptation methods and evaluate 10 models ranging from specialized models to large-scale video foundation models;

  3. Our comprehensive experiments show that our adaptation methods are simple yet effective baselines for adapting existing models to ICQ;

  4. We argue that using multimodal queries for video event localization is a practical and feasible scenario with broad prospects.

2 Related Work

2.1 Localizing Events in Videos with Natural Language Queries

Query-based video temporal localization has been a long-standing research topic and is an umbrella term for several related tasks. Depending on scenario and motivation, it can be divided into several similar but slightly different tasks. Video moment retrieval [32, 38, 42, 43, 41, 68, 71, 74] aims to localize a video segment based on a textual caption query that describes events in the video. Video temporal grounding/localization [14, 22, 34, 35, 46, 47, 67, 70, 72] with natural language queries aims to determine the video segment that corresponds to a textual description; it usually serves downstream question-answering tasks [3, 63, 69, 76] by providing the relevant segments in videos. Other similar yet less relevant tasks include video highlight detection [2, 29, 45, 54] and action detection; these tasks also involve localizing video segments but with an implicit query or a category-level action label. Our benchmark takes a step towards localizing video events with multimodal queries. Unlike prior work, the query here is composed of an image and text, and the task is a semantic search for the events in a video that the multimodal query describes.

Regarding methodology, one line of work focuses on video moment retrieval and video temporal grounding. This includes two-stage (i.e., proposal-based) models [33], which first generate moment candidates and then filter for the moment that matches the query, and one-stage (i.e., proposal-free) models [7, 52, 70], which integrate moment generation and moment localization into a unified framework. Among one-stage models, DETR [5] has been widely employed for video temporal localization, as in [25, 29, 45, 44, 55, 64]. More recent works [31, 40, 65, 62] attempt to unify multiple video localization tasks, including video moment retrieval and highlight detection, in a single framework, which again shows how closely these tasks are related. In addition, with large-scale video foundation models and MLLMs gaining increasing attention, temporal grounding has also become a core module in models such as SeViLA [69], InternVideo2 [61], and VideoPrism [78], among others [51, 73].

2.2 Multimodal Query for Image/Video Understanding

Using multimodal queries is a practical and important scenario for image/video understanding [57, 58]. However, it is crucial to note that video event localization with multimodal queries differs from image/video retrieval tasks, which primarily involve instance-level similarity matching. Temporal localization requires dense video processing, which significantly increases the complexity of the task.

For video localization tasks, [77] is the first work to use image queries to localize unseen activities in videos. More recently, [20] proposes to ground videos spatio-temporally using images or texts, although their queries are still limited to object or action levels. To the best of our knowledge, our work is the first to attempt localizing events in videos using multimodal semantic queries.

3 ICQ: Video Event Localization with Multimodal Queries

In this section, we detail the benchmark ICQ and the new evaluation dataset ICQ-Highlight for benchmarking video event localization with multimodal queries.

Figure 2: Examples of ICQ-Highlight: Multimodal queries consist of a reference image and a refinement text. We consider 4 different reference image styles: scribble, cartoon, cinematic, and realistic. They describe a target event that corresponds to moments or segments in the original videos and are equivalent to the natural language queries in the original dataset [29]. Refinement texts add either complementary information when reference images are minimal, as with scribble images, or corrective information when reference images are more complex.

3.1 Task Definition

We define the multimodal query $q_m$ as a reference image $v_{ref}$ accompanied by a refinement text $t_{ref}$ that makes minor adjustments; together they are used to localize the target event in a video that corresponds semantically to the query. The reference image captures the broad semantics of the target event, while the refinement text provides supplementary information that can be either complementary or corrective. We believe that this setting is more adaptable and general in real-world applications.

Given the query $q_m$, the model predicts all the relevant segments or moments $[time_{start}, time_{end}]$. Following the common setting for video moment retrieval, we use Recall (R) and mean Average Precision (mAP) as the evaluation metrics.

Reference Image Styles Reference images $v_{ref}$ visually describe the semantics of an event in a video. They can be simple scribble images with minimal strokes that describe an event succinctly, effectively summarizing it as a non-verbal semantic query for video localization, or more detailed images that depict semantically relevant scenes in a video. As illustrated in Fig. 2, reference images depict scenes that are semantically similar to the target videos yet may differ from them in details. In practice, visual queries can differ in style, which may impact model performance. Therefore, we explore multiple reference image styles, as detailed in the subsequent section, to assess whether a model maintains consistent performance across styles as an indicator of its robustness.

Refinement Texts Refinement texts are simple phrases or sentences that complement or correct descriptions that are either missing or contradictory in the reference images. This is particularly practical in real-world applications, as reference images often do not align perfectly with the semantics of the target video event. We identify 5 different types of refinement texts that can be applied to various aspects of the reference image semantics: “object”, “action”, “relation”, “attribute”, and “environment”, plus a residual category “others”, as shown in Fig. 3. This categorization was originally designed for the elements of a semantic scene graph [26]; we borrow it to summarize the different semantic elements of the multimodal queries.

3.2 Dataset Construction

We introduce our new evaluation dataset, ICQ-Highlight, as a testbed for ICQ. This dataset is built upon the validation set of QVHighlight [29], a popular natural-language query-based video localization dataset. For each original query in QVHighlight, we construct multimodal semantic queries that incorporate reference images paired with refinement texts. Considering the reference image style distribution discussed earlier, ICQ-Highlight features 4 variants based on different image styles. In total, the dataset comprises 1515 videos and 1546 test samples on average for each style. The exact numbers may vary slightly across styles and are provided in the Appendix.

Reference Image Generation We generate reference images based on the original natural language queries and refinement texts using a suite of state-of-the-art Text-to-Image models, including DALL-E-2 (https://openai.com/index/dall-e-2/) and Stable Diffusion (https://stability.ai/stable-image). For the reference image styles mentioned earlier, we select 4 representative styles: scribble, cartoon, cinematic, and realistic. These styles effectively capture a variety of real-world scenarios such as user inputs, book illustrations, television shows, and actual photographs, where images are often used as queries.

Data Annotation and Preprocessing We emphasize the meticulous crowd-sourced data curation and annotation effort applied to QVHighlight for 2 main reasons: (1) To introduce refinement texts, we purposefully modify the original semantics of text queries in QVHighlight to generate queries that are similar yet subtly different; (2) Given that the original queries in QVHighlight can be too simple and ambiguous to generate reasonable reference images, we add necessary annotations to ensure that the generated image queries are more relevant to the original video semantics. We employed human annotators to annotate and modify the natural language queries. Each query is annotated and reviewed by different annotators to ensure consistency. Further details can be found in the Appendix.

Data Curation and Quality Check Image generation can suffer from significant imperfections in terms of semantic consistency and content safety. To address these issues, we implement a quality check in two stages: (1) We calculate the semantic similarity between the generated images and the text queries using BLIP2 [30] encoders, eliminating samples that score lower than 0.2; (2) We perform a human sanity check to replace images that i) are semantically misaligned with the text, ii) do not match the required reference image style, or iii) contain sensitive or unpleasant content (e.g., violent, racial, sexual content), counterintuitive elements, or obvious generation artifacts.
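
The first stage of this quality check amounts to a threshold filter on image-text similarity. The snippet below is a minimal sketch of that idea, assuming the score is a cosine similarity between embedding vectors that have already been computed; the function names and the representation of embeddings as plain lists are hypothetical, and only the 0.2 threshold comes from the text above.

```python
import math
from typing import List, Sequence, Tuple

SIM_THRESHOLD = 0.2  # threshold stated in the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as unrelated
    return dot / (norm_a * norm_b)


def filter_samples(
    pairs: Sequence[Tuple[Sequence[float], Sequence[float]]],
    threshold: float = SIM_THRESHOLD,
) -> List[int]:
    """Return indices of (image, text) embedding pairs scoring at least the threshold."""
    kept = []
    for idx, (image_emb, text_emb) in enumerate(pairs):
        if cosine_similarity(image_emb, text_emb) >= threshold:
            kept.append(idx)
    return kept
```

Samples that fall below the threshold would be dropped before the human sanity check in stage (2).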

Figure 3: Distribution of refinement text types. Refinement texts are designed to either complement or correct the original semantics of reference images. We identify 5 major types of refinement texts, each targeting a different semantic aspect (object, action, relation, attribute, and environment), plus a residual category, others.

3.3 Baseline Selection

We have selected and benchmarked 10 models specifically designed for video event localization with natural language queries. We assess the zero-shot performance of these models using checkpoints that have been fine-tuned on the original QVHighlight dataset. This evaluation allows us to understand their effectiveness for multimodal queries straight out of the box.

In particular, we categorize them as follows and compare the models along different dimensions in the Appendix: (1) Specialized models use natural language as a semantic query and target video moment retrieval tasks. We have selected a series of these models, including Moment-DETR [29], QD-DETR [45], EaTR [25], CG-DETR [44], and TR-DETR [55]; (2) Unified frameworks aim to solve multiple video localization tasks within one model, such as moment retrieval, highlight detection, and video summarization. We have selected UMT [40], UniVTG [31], and UVCOM [62] as strong baselines; (3) LLM-based models leverage the power of Large Language Models, which have proven to be a powerful and general head for varied video tasks. We have selected SeViLA [69] as a representative.

3.4 Adaptation Methods

Most existing video localization methods utilize natural language as input queries and are not readily adaptable to composed queries. Thus, we propose 3 adaptation methods: Captioning (Cap), Summarization (Sum), and Visual Query Encoding (VisEnc), as illustrated in Fig. 4. For Cap and Sum, we aim to leverage the power of LLMs and MLLMs to caption reference images $v_{ref}$ and integrate refinement texts $t_{ref}$: Cap uses an MLLM as a captioner to caption reference images and an LLM as a modifier to integrate refinement texts. In contrast, Sum uses an MLLM to directly summarize reference images and refinement texts in one step. The generated texts $t_{query}$ can be used directly by existing models. For VisEnc, we explore using only reference images and employing visual encoders to embed the reference images as query embeddings $e_{query}$. This is possible because all the models we have selected employ a dual-stream encoder that embeds image-text pairs in a joint feature space.
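
The data flow of the three adaptation methods can be sketched as follows. This is an illustrative outline only: the class `MultimodalQuery`, the callable types, and the function names are hypothetical placeholders for the MLLM, LLM, and visual encoder, not the interfaces used in our implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MultimodalQuery:
    reference_image: str  # identifier or path of v_ref (hypothetical representation)
    refinement_text: str  # t_ref


# Hypothetical callables standing in for the actual models.
Captioner = Callable[[str], str]             # MLLM: image -> caption
Modifier = Callable[[str, str], str]         # LLM: (caption, t_ref) -> t_query
Summarizer = Callable[[str, str], str]       # MLLM: (image, t_ref) -> t_query
ImageEncoder = Callable[[str], List[float]]  # visual encoder: image -> e_query


def adapt_cap(query: MultimodalQuery, captioner: Captioner, modifier: Modifier) -> str:
    """Cap: caption the reference image, then merge the refinement text (two steps)."""
    caption = captioner(query.reference_image)
    return modifier(caption, query.refinement_text)


def adapt_sum(query: MultimodalQuery, summarizer: Summarizer) -> str:
    """Sum: produce the text query from image and refinement text in one step."""
    return summarizer(query.reference_image, query.refinement_text)


def adapt_visenc(query: MultimodalQuery, image_encoder: ImageEncoder) -> List[float]:
    """VisEnc: embed the reference image only; the refinement text is not used."""
    return image_encoder(query.reference_image)
```

Cap and Sum both return a text query that a natural language query-based localization model can consume unchanged, whereas VisEnc returns an embedding that takes the place of the text query embedding.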

Figure 4: Adaptation methods: We propose 3 adaptation methods to bridge the current gap between natural language query-based models and our multimodal query-based benchmark: Captioning (Cap), Summarization (Sum), and Visual Query Encoding (VisEnc). For brevity, we refer to them by their abbreviations.

4 Experiments

4.1 Experimental Setup

Implementation We employ a state-of-the-art MLLM, LLaVA-mistral [36, 37], as the captioner and GPT-3.5 as the modifier in our Cap adaptation. For a fair comparison, we also utilize LLaVA-mistral for the Sum adaptation. We believe that the performance of these models is representative of the SOTA capabilities of MLLMs. For VisEnc, we utilize the corresponding CLIP [49] Visual Encoder, as all models typically employ the CLIP Text Encoder for text query encoding. In this adaptation method, we omit refinement texts and only use the reference image.

Evaluation Metrics We evaluate models on our new testbed ICQ-Highlight. We report Recall R@1 with IoU thresholds 0.5 and 0.7, mean Average Precision (mAP) with IoU threshold 0.5, and the average mAP over multiple IoU thresholds [0.5:0.05:0.95]. These are standard metrics for video moment retrieval and localization [29, 69]; the IoU (Intersection over Union) threshold determines whether a predicted temporal window counts as positive.
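
To make the role of the IoU threshold concrete, the following minimal sketch computes temporal IoU and Recall@1 only (mAP is omitted). It assumes that a query counts as a hit when its top-1 predicted window reaches the threshold against at least one ground-truth window; the function names are hypothetical, and the official evaluation script may differ in details.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) in seconds

# IoU thresholds 0.5, 0.55, ..., 0.95, i.e. [0.5:0.05:0.95]
IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection over Union of two temporal windows."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_predictions: Sequence[Window],
    ground_truths: Sequence[Sequence[Window]],
    iou_threshold: float,
) -> float:
    """Percentage of queries whose top-1 window reaches the IoU threshold
    against at least one ground-truth window (assumed matching rule)."""
    if len(top1_predictions) != len(ground_truths):
        raise ValueError("one ground-truth list is required per prediction")
    if not top1_predictions:
        return 0.0
    hits = 0
    for pred, gts in zip(top1_predictions, ground_truths):
        best = max((temporal_iou(pred, gt) for gt in gts), default=0.0)
        if best >= iou_threshold:
            hits += 1
    return 100.0 * hits / len(top1_predictions)
```

For instance, a prediction of (0, 10) against a ground truth of (5, 15) has an intersection of 5 seconds and a union of 15 seconds, so its IoU is 1/3 and it is counted as a miss at both the 0.5 and 0.7 thresholds.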

4.2 Results & Analysis

We present the pairwise performance of 10 models combined with 3 adaptation methods on ICQ in Tab. 1 and Tab. 2. For the Cap and Sum methods, we have conducted multiple runs with different prompts for captioning and summarization and report the average performance and standard deviation.
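
Each Cap and Sum entry in the tables has the form "mean (± standard deviation)" over the prompt runs. The aggregation can be sketched as below; whether the sample or the population standard deviation is reported is not specified here, so the use of the sample standard deviation is an assumption, and the function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric over several prompt runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # assumption: sample (n - 1) standard deviation
    return mean, std


def format_entry(scores: Sequence[float]) -> str:
    """Format one table entry as 'mean (± std)'."""
    mean, std = summarize_runs(scores)
    return f"{mean:.2f} (± {std:.1f})"
```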

Table 1: Model performance (Recall) on ICQ. We highlight the best score in bold for each adaptation method and reference image style. For Cap and Sum, we also report the standard deviation of 3 runs with different prompts. † indicates the usage of additional audio modality.

| Adaptation | Model | scribble R1@0.5 | scribble R1@0.7 | cartoon R1@0.5 | cartoon R1@0.7 | cinematic R1@0.5 | cinematic R1@0.7 | realistic R1@0.5 | realistic R1@0.7 |
|---|---|---|---|---|---|---|---|---|---|
| Captioning | Moment-DETR (2021) | 44.83 (± 2.7) | 27.97 (± 2.2) | 46.02 (± 1.5) | 29.36 (± 0.9) | 46.89 (± 0.7) | 30.35 (± 1.2) | 47.16 (± 1.5) | 30.53 (± 0.8) |
| | QD-DETR (2023) | 48.92 (± 4.1) | 33.57 (± 3.3) | 52.87 (± 0.8) | 36.01 (± 1.3) | 54.01 (± 0.7) | 37.29 (± 0.5) | 53.07 (± 0.8) | 37.53 (± 1.1) |
| | QD-DETR† (2023) | 50.15 (± 4.6) | 34.67 (± 3.9) | 53.53 (± 1.3) | 38.30 (± 1.2) | 53.37 (± 0.6) | 37.93 (± 0.5) | 53.39 (± 1.0) | 38.47 (± 0.8) |
| | EaTR (2023) | 49.20 (± 3.2) | 34.82 (± 3.5) | 50.50 (± 0.6) | 35.27 (± 0.7) | 51.76 (± 0.5) | 36.92 (± 0.7) | 52.33 (± 0.5) | 37.01 (± 0.3) |
| | CG-DETR (2023) | 50.65 (± 3.5) | 36.37 (± 2.9) | 56.26 (± 0.7) | 40.82 (± 0.7) | 54.53 (± 0.9) | 39.32 (± 0.8) | 56.72 (± 0.7) | 41.79 (± 1.2) |
| | TR-DETR (2024) | 50.99 (± 3.3) | 35.55 (± 3.7) | 55.37 (± 1.0) | 39.92 (± 2.0) | 56.03 (± 1.0) | 40.69 (± 0.9) | 56.94 (± 0.5) | 41.99 (± 0.3) |
| | UMT† (2022) | 44.76 (± 3.5) | 29.41 (± 3.0) | 48.15 (± 1.7) | 32.18 (± 1.6) | 49.96 (± 0.9) | 33.90 (± 0.9) | 48.83 (± 1.0) | 34.09 (± 1.2) |
| | UniVTG (2023) | 47.50 (± 3.1) | 31.58 (± 3.0) | 49.50 (± 0.8) | 33.09 (± 1.1) | 50.98 (± 0.2) | 33.36 (± 0.6) | 51.42 (± 1.1) | 43.75 (± 0.2) |
| | UVCOM (2023) | 50.99 (± 3.6) | 37.36 (± 3.1) | 54.39 (± 0.5) | 40.06 (± 1.0) | 55.88 (± 0.7) | 40.88 (± 0.5) | 54.92 (± 0.9) | 41.08 (± 0.9) |
| | SeViLA (2023) | 17.37 (± 1.3) | 10.56 (± 0.8) | 22.72 (± 0.8) | 15.31 (± 0.7) | 25.94 (± 0.1) | 16.99 (± 0.3) | 26.83 (± 0.8) | 16.83 (± 0.6) |
| Summarization | Moment-DETR (2021) | 42.00 (± 3.3) | 25.14 (± 3.0) | 44.56 (± 2.4) | 27.24 (± 2.1) | 43.73 (± 2.0) | 27.00 (± 1.8) | 44.34 (± 2.6) | 27.74 (± 2.0) |
| | QD-DETR (2023) | 45.56 (± 3.3) | 30.44 (± 3.0) | 49.09 (± 3.8) | 33.64 (± 3.2) | 48.89 (± 3.5) | 32.66 (± 3.1) | 47.83 (± 4.1) | 32.86 (± 3.8) |
| | QD-DETR† (2023) | 46.57 (± 3.8) | 32.52 (± 3.6) | 49.30 (± 4.3) | 34.12 (± 4.2) | 48.83 (± 3.2) | 34.16 (± 3.4) | 49.13 (± 4.4) | 33.83 (± 3.1) |
| | EaTR (2023) | 45.79 (± 3.0) | 32.67 (± 2.9) | 48.45 (± 2.9) | 32.96 (± 2.7) | 48.24 (± 3.8) | 33.35 (± 3.5) | 48.69 (± 3.7) | 33.85 (± 2.5) |
| | CG-DETR (2023) | 47.07 (± 4.2) | 33.14 (± 4.1) | 51.46 (± 3.1) | 36.49 (± 2.7) | 50.59 (± 3.4) | 36.08 (± 3.6) | 51.91 (± 3.5) | 36.58 (± 2.4) |
| | TR-DETR (2024) | 46.44 (± 4.4) | 33.23 (± 3.8) | 51.35 (± 3.2) | 36.14 (± 2.3) | 51.92 (± 3.8) | 36.29 (± 3.7) | 52.87 (± 4.0) | 36.77 (± 3.4) |
| | UMT† (2022) | 43.88 (± 3.4) | 29.28 (± 1.9) | 45.39 (± 2.8) | 29.98 (± 2.4) | 45.37 (± 2.3) | 30.01 (± 2.2) | 46.35 (± 2.0) | 30.27 (± 1.0) |
| | UniVTG (2023) | 44.98 (± 3.3) | 27.99 (± 2.7) | 46.19 (± 3.5) | 30.37 (± 2.4) | 47.22 (± 3.3) | 29.90 (± 2.5) | 50.39 (± 3.3) | 30.33 (± 2.4) |
| | UVCOM (2023) | 46.62 (± 3.8) | 33.40 (± 3.4) | 51.48 (± 4.1) | 36.92 (± 3.7) | 50.91 (± 5.3) | 36.58 (± 4.5) | 51.18 (± 3.7) | 36.23 (± 3.4) |
| | SeViLA (2023) | 17.89 (± 1.9) | 10.65 (± 1.5) | 27.47 (± 3.5) | 16.98 (± 1.9) | 27.76 (± 2.5) | 17.77 (± 1.5) | 28.61 (± 3.3) | 17.30 (± 2.0) |
| Visual Query Enc. | Moment-DETR (2021) | 12.55 | 5.69 | 13.38 | 6.59 | 14.36 | 6.01 | 14.88 | 6.53 |
| | QD-DETR (2023) | 15.91 | 9.12 | 14.88 | 8.62 | 13.90 | 8.49 | 14.62 | 8.36 |
| | QD-DETR† (2023) | 15.65 | 10.03 | 12.60 | 6.79 | 12.34 | 6.72 | 12.34 | 7.44 |
| | EaTR (2023) | 19.86 | 13.00 | 19.91 | 12.99 | 21.15 | 13.45 | 21.48 | 13.38 |
| | CG-DETR (2023) | 22.90 | 13.00 | 24.93 | 13.58 | 23.24 | 13.12 | 24.74 | 14.23 |
| | TR-DETR (2024) | 17.92 | 11.19 | 17.36 | 11.10 | 15.14 | 9.86 | 15.60 | 9.53 |
| | UMT† (2022) | 5.43 | 2.85 | 4.77 | 2.09 | 5.22 | 2.35 | 4.57 | 2.42 |
| | UniVTG (2023) | 21.93 | 13.00 | 23.89 | 13.64 | 22.78 | 13.19 | 22.52 | 12.79 |
| | UVCOM (2023) | 17.08 | 9.77 | 16.78 | 10.97 | 17.36 | 11.68 | 17.10 | 11.23 |

Table 2: Model performance (mAP) on ICQ. We highlight the best score in bold for each adaptation method and reference image style. For Cap and Sum, we also report the standard deviation of 3 runs with different prompts. † indicates the usage of additional audio modality.

| Adaptation | Model | scribble mAP@0.5 | scribble Avg. | cartoon mAP@0.5 | cartoon Avg. | cinematic mAP@0.5 | cinematic Avg. | realistic mAP@0.5 | realistic Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Captioning | Moment-DETR (2021) | 46.98 (± 2.3) | 26.15 (± 1.5) | 48.14 (± 1.2) | 27.22 (± 0.7) | 48.98 (± 0.4) | 27.96 (± 0.4) | 49.00 (± 0.82) | 27.72 (± 0.5) |
| | QD-DETR (2023) | 50.69 (± 3.1) | 31.01 (± 2.4) | 54.15 (± 0.9) | 33.04 (± 0.9) | 55.32 (± 0.9) | 34.06 (± 0.7) | 54.75 (± 0.7) | 34.31 (± 0.7) |
| | QD-DETR† (2023) | 50.78 (± 3.9) | 31.44 (± 3.0) | 53.91 (± 1.2) | 33.94 (± 1.0) | 54.06 (± 0.5) | 34.67 (± 0.3) | 53.82 (± 0.8) | 34.18 (± 0.7) |
| | EaTR (2023) | 52.11 (± 2.8) | 32.88 (± 2.6) | 53.23 (± 0.7) | 33.60 (± 0.7) | 54.00 (± 0.7) | 34.54 (± 0.3) | 54.36 (± 0.8) | 34.73 (± 0.3) |
| | CG-DETR (2023) | 51.13 (± 3.0) | 32.13 (± 2.1) | 56.15 (± 0.8) | 36.08 (± 0.6) | 55.15 (± 1.0) | 35.22 (± 0.7) | 56.63 (± 0.8) | 36.57 (± 0.9) |
| | TR-DETR (2024) | 51.07 (± 2.5) | 32.15 (± 2.1) | 55.72 (± 1.1) | 35.98 (± 1.2) | 55.87 (± 0.8) | 36.29 (± 0.5) | 56.32 (± 0.4) | 36.76 (± 0.5) |
| | UMT† (2022) | 42.35 (± 2.7) | 26.47 (± 2.0) | 45.03 (± 1.3) | 28.64 (± 1.0) | 46.43 (± 0.8) | 30.01 (± 0.7) | 45.93 (± 0.8) | 29.67 (± 0.8) |
| | UniVTG (2023) | 40.68 (± 2.5) | 24.71 (± 1.9) | 42.68 (± 0.7) | 26.03 (± 0.6) | 43.53 (± 0.4) | 26.43 (± 0.5) | 43.64 (± 0.8) | 26.76 (± 0.5) |
| | UVCOM (2023) | 51.27 (± 3.2) | 33.39 (± 2.5) | 54.40 (± 0.7) | 36.50 (± 0.7) | 55.99 (± 0.7) | 37.11 (± 0.3) | 54.98 (± 0.8) | 36.83 (± 0.6) |
| | SeViLA (2023) | 14.45 (± 0.8) | 9.30 (± 0.6) | 19.52 (± 0.5) | 13.12 (± 0.4) | 22.16 (± 0.3) | 14.64 (± 0.4) | 22.48 (± 0.6) | 14.55 (± 0.5) |
| Summarization | Moment-DETR (2021) | 44.40 (± 2.5) | 23.96 (± 1.8) | 47.31 (± 2.1) | 26.03 (± 1.4) | 46.62 (± 1.9) | 25.55 (± 1.3) | 47.29 (± 2.2) | 26.07 (± 1.3) |
| | QD-DETR (2023) | 47.09 (± 2.8) | 28.27 (± 2.4) | 51.06 (± 3.3) | 30.90 (± 2.5) | 50.89 (± 3.3) | 30.52 (± 2.8) | 50.05 (± 3.6) | 30.49 (± 2.7) |
| | QD-DETR† (2023) | 48.10 (± 3.2) | 29.49 (± 2.9) | 50.72 (± 3.3) | 31.11 (± 3.0) | 49.94 (± 2.8) | 31.38 (± 2.4) | 50.30 (± 3.8) | 30.85 (± 2.6) |
| | EaTR (2023) | 49.07 (± 2.6) | 30.92 (± 2.0) | 50.82 (± 2.6) | 31.38 (± 1.7) | 50.71 (± 3.2) | 31.34 (± 2.7) | 51.37 (± 3.0) | 32.02 (± 2.0) |
| | CG-DETR (2023) | 48.41 (± 3.5) | 29.86 (± 2.9) | 52.31 (± 2.9) | 33.21 (± 2.3) | 51.59 (± 2.8) | 32.34 (± 2.5) | 52.31 (± 3.1) | 32.91 (± 2.0) |
| | TR-DETR (2024) | 46.69 (± 3.6) | 29.72 (± 2.8) | 52.41 (± 2.6) | 33.48 (± 1.9) | 52.39 (± 3.1) | 33.14 (± 2.6) | 52.87 (± 3.1) | 33.57 (± 2.5) |
| | UMT† (2022) | 40.99 (± 2.7) | 25.88 (± 1.8) | 43.03 (± 2.0) | 27.02 (± 1.5) | 42.88 (± 2.0) | 26.73 (± 1.6) | 43.89 (± 1.3) | 27.38 (± 1.0) |
| | UniVTG (2023) | 38.86 (± 2.7) | 22.76 (± 1.8) | 40.13 (± 2.8) | 24.43 (± 1.7) | 40.73 (± 2.7) | 24.02 (± 1.9) | 40.20 (± 2.4) | 24.11 (± 1.6) |
| | UVCOM (2023) | 47.33 (± 3.2) | 30.75 (± 2.5) | 52.22 (± 3.4) | 34.00 (± 2.7) | 51.37 (± 4.2) | 33.36 (± 3.1) | 51.64 (± 3.8) | 33.52 (± 2.6) |
| | SeViLA (2023) | 14.54 (± 1.7) | 9.24 (± 1.3) | 22.13 (± 1.8) | 14.07 (± 1.1) | 22.17 (± 1.4) | 14.52 (± 0.9) | 22.87 (± 1.8) | 14.45 (± 1.3) |
| Visual Query Enc. | Moment-DETR (2021) | 14.95 | 6.67 | 16.51 | 7.21 | 17.00 | 7.39 | 17.41 | 7.66 |
| | QD-DETR (2023) | 19.48 | 10.11 | 19.57 | 10.18 | 18.07 | 9.54 | 18.88 | 9.94 |
| | QD-DETR† (2023) | 18.22 | 9.74 | 14.31 | 7.30 | 15.18 | 7.45 | 14.71 | 7.66 |
| | EaTR (2023) | 25.27 | 13.98 | 25.95 | 14.21 | 26.83 | 14.70 | 26.65 | 14.49 |
| | CG-DETR (2023) | 30.24 | 15.57 | 30.78 | 15.70 | 30.07 | 15.48 | 30.98 | 15.83 |
| | TR-DETR (2024) | 21.09 | 11.67 | 20.87 | 11.71 | 19.62 | 11.02 | 19.72 | 10.76 |
| | UMT† (2022) | 5.57 | 2.81 | 4.66 | 1.96 | 5.60 | 2.46 | 4.59 | 2.23 |
| | UniVTG (2023) | 24.30 | 13.02 | 20.80 | 11.56 | 19.85 | 10.99 | 19.42 | 10.95 |
| | UVCOM (2023) | 20.13 | 11.15 | 20.19 | 11.96 | 20.67 | 12.37 | 20.73 | 12.03 |

Best adaptation methods We find that Cap achieves the best performance, outperforming the other adaptation methods by an average margin of 3.6% across all styles, and is more robust to different prompts. Although Cap and Sum both rely on MLLMs to interpret reference images, Sum performs worse than Cap and is more sensitive to prompts for all reference styles, as its higher standard deviation shows. This suggests that asking an MLLM to caption the image and integrate the refinement text in a single step is less controllable. In conclusion, captioning images remains the most dependable approach, since MLLMs and LLMs are powerful enough to generate faithful captions.

Comparing models Models that perform well with one adaptation method tend to perform well with the others. For example, UVCOM and TR-DETR consistently show high performance across the Cap, Sum, and VisEnc methods. We observe that more recent models retain their advantage on ICQ: the latest models, including UVCOM, TR-DETR, and CG-DETR, tend to perform better across different adaptation methods and reference image styles, whereas older models like Moment-DETR consistently show lower performance. The LLM-based model, SeViLA, cannot compete with the specialized models, which aligns with its subpar performance on natural language query-based benchmarks. In the next section, we find that model performance on ICQ correlates highly with that on the natural language query-based benchmark QVHighlight. This shows that (1) our multimodal queries share semantics with the original benchmark, and (2) the adaptation methods and models can understand the semantics of multimodal queries.

Comparing styles We find that all adaptation methods perform consistently across different styles, which suggests that they can understand the multimodal semantic queries well. For the cartoon, cinematic, and realistic styles in particular, model performance is close. For scribble, all models perform marginally worse, and both Cap and Sum show a larger standard deviation, which indicates a stronger dependence on the prompts. This can be explained by the fact that scribble images are more minimal and abstract in semantics and more challenging to interpret. Nevertheless, despite being more abstract and simpler, scribble reference images yield performance close to that of the other reference image styles. This demonstrates the potential of using scribbles as multimodal queries in real-world video event localization applications like video search.

Comparing refinement text types In addition, we calculate model performance on different subsets of refinement texts, as shown in Fig. 5. We conclude that even though models perform similarly across reference image styles, their performance varies across refinement text types, and this variation differs by style. For the scribble style, models generally perform better on “relation”. For the cartoon style, models demonstrate a more balanced performance across all types. For the cinematic style, performance is notably higher on “environment” and “attribute”. Finally, for the realistic style, models yield better performance on “object” and “environment”.

Figure 5: Model performance on different subsets of refinement text types for (a) scribble, (b) cartoon, (c) cinematic, and (d) realistic reference images. We observe that model performance with different refinement text types varies across styles.

4.3 Ablation studies

Multimodal Query-based vs. Natural Language Query-based Performance We compare model performance on the multimodal query-based ICQ-Highlight and the original natural language query-based QVHighlights using Spearman’s rank correlation coefficient [53] on R1@0.5. For scribble, the coefficients are 0.89 (Cap) and 0.93 (Sum). The cartoon style yields 0.98 (Cap) and 0.94 (Sum), the cinematic style 0.93 for both Cap and Sum, and the realistic style 0.96 (Cap) and 0.95 (Sum). These high coefficients indicate a strong positive correlation between the two benchmarks, suggesting that their queries share common semantics and supporting the reliability of our benchmark.
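As a minimal sketch of how such a coefficient can be computed, the snippet below ranks two lists of per-model scores and takes the Pearson correlation of the ranks, which is the standard definition of Spearman's coefficient. The function names and the toy scores are illustrative assumptions, not values or code from our benchmark.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical R1@0.5 scores of five models on two benchmarks (toy values).
scores_text_queries = [55.0, 60.0, 62.0, 65.0, 67.0]
scores_multimodal_queries = [45.0, 50.0, 51.0, 54.0, 53.0]
rho = spearman(scores_text_queries, scores_multimodal_queries)
print(f"Spearman's rho = {rho:.2f}")  # prints "Spearman's rho = 0.90"
```

In this toy example only the two best models swap places between the benchmarks, so the coefficient stays close to 1. A coefficient near 1 means the model ranking is largely preserved, independent of the absolute score levels.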

Model Performance With vs. Without Refinement Texts To assess the impact of refinement texts on video event localization with multimodal queries, we evaluate model performance using only reference images as queries, omitting the refinement texts. We employ the Cap adaptation without the modifier that integrates refinement texts. Tab. 3 reports the model performance together with the relative performance drop, in percent, compared to the setting with refinement texts. Models show performance drops of different magnitudes, which indicates that refinement texts help refine the semantics of reference images and localize the events. Additionally, the performance drop is less pronounced for scribble images than for other reference image styles, as these images are inherently minimalistic and less reliant on detailed semantics.

Table 3: Model performance without refinement texts. We employ Cap for methods without considering refinement texts. The performance drop shown in parentheses indicates that refinement texts in ICQ-Highlight can help refine the semantics of the reference images and localize the events better.
| Model | scribble R1@0.5 | scribble R1@0.7 | cartoon R1@0.5 | cartoon R1@0.7 | cinematic R1@0.5 | cinematic R1@0.7 | realistic R1@0.5 | realistic R1@0.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Moment-DETR | 45.15 (-2.7%) | 28.72 (-3.3%) | 43.60 (-7.1%) | 27.94 (-5.8%) | 44.06 (-7.3%) | 29.70 (-2.8%) | 44.06 (-9.3%) | 28.98 (-6.5%) |
| QD-DETR | 49.81 (-4.0%) | 33.70 (-5.4%) | 49.87 (-6.6%) | 34.33 (-6.3%) | 49.67 (-9.3%) | 34.73 (-8.1%) | 50.52 (-5.7%) | 35.25 (-7.4%) |
| QD-DETR† | 51.29 (-3.9%) | 36.03 (-3.8%) | 48.69 (-10.8%) | 33.88 (-13.4%) | 49.48 (-8.5%) | 34.99 (-9.0%) | 49.93 (-7.5%) | 35.05 (-10.4%) |
| EaTR | 52.01 (+0.5%) | 37.77 (+1.2%) | 47.45 (-6.7%) | 33.09 (-8.0%) | 48.56 (-7.0%) | 34.33 (-5.1%) | 49.61 (-6.1%) | 35.64 (-3.0%) |
| CG-DETR | 51.42 (-4.0%) | 37.84 (-1.7%) | 49.35 (-13.0%) | 35.90 (-13.4%) | 48.89 (-10.3%) | 34.79 (-11.3%) | 51.04 (-10.5%) | 36.55 (-14.0%) |
| TR-DETR | 52.01 (-2.4%) | 37.19 (-2.9%) | 51.04 (-9.2%) | 36.62 (-11.2%) | 50.00 (-11.8%) | 36.03 (-12.5%) | 52.28 (-8.8%) | 37.53 (-10.6%) |
| UMT† | 46.25 (-3.0%) | 31.57 (-1.0%) | 45.82 (-6.9%) | 30.61 (-7.1%) | 46.34 (-8.6%) | 29.96 (-13.7%) | 46.08 (-6.2%) | 31.85 (-7.1%) |
| UniVTG | 47.87 (-3.8%) | 33.76 (-2.2%) | 45.56 (-9.4%) | 29.24 (-11.5%) | 45.43 (-11.2%) | 29.05 (-13.9%) | 46.80 (-9.3%) | 30.42 (-12.4%) |
| UVCOM | 52.26 (-1.7%) | 39.39 (+1.0%) | 51.50 (-6.1%) | 37.99 (-6.6%) | 50.98 (-9.4%) | 36.75 (-11.3%) | 51.70 (-7.6%) | 37.53 (-10.5%) |
| SeViLA | 13.15 (-30.3%) | 8.06 (-29.3%) | 11.89 (-49.8%) | 6.89 (-57.0%) | 13.26 (-49.0%) | 8.32 (-51.5%) | 13.65 (-49.1%) | 8.22 (-51.1%) |
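The relative changes in parentheses can be read as the difference between the score without and with refinement texts, divided by the score with refinement texts. The sketch below illustrates this computation; the formula is our reading of "relative performance drop in percentage", and the function name and the example scores are hypothetical.

```python
def relative_change_percent(with_refinement: float, without_refinement: float) -> float:
    """Relative change (in percent) of a score when refinement texts are removed."""
    if with_refinement == 0:
        raise ValueError("reference score must be non-zero")
    return (without_refinement - with_refinement) / with_refinement * 100.0


# Hypothetical scores: 50.0 with refinement texts, 45.0 without.
change = relative_change_percent(50.0, 45.0)
print(f"{change:+.1f}%")  # prints "-10.0%"
```

A negative value corresponds to a drop when refinement texts are omitted, and a positive value to a gain.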

5 Conclusion

Limitations and Future Work As the first multimodal query-based video event localization benchmark, ICQ still has several limitations: (1) the selection of LLM-based models is limited because few open-source models were available at the time of this work; (2) our current benchmark uses generated multimodal queries and, as a result, can suffer from generation artifacts. Additionally, although our benchmark serves as a practical testbed, fine-tuning models with unlabeled videos [43, 59] for this new setting remains an open question, particularly because of the lack of training data.

Societal Impacts We believe that using multimodal semantic queries for video event localization holds promise for real-world applications, such as serving illiterate or pre-literate users, or non-speakers in cross-lingual situations, as it allows them to interact with videos through simple scribble images, a more accessible and convenient approach. However, reference images could contain intentionally harmful content, which may pose new threats to AI safety and privacy.

In this work, we introduce a new benchmark, ICQ, marking an initial step towards using multimodal semantic queries for video event localization. We find that MLLM/LLM-enhanced adaptation methods can adapt conventional models to multimodal queries, serving as a simple yet effective baseline for this novel setting. Our findings confirm that using multimodal queries for video event localization is practical and feasible. Nonetheless, the field remains open to innovative model architectures and training paradigms for multimodal queries. We believe our work paves the way for real-world applications that leverage multimodal queries to interact with video content.

References

  • Anne Hendricks et al. [2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017.
  • Badamdorj et al. [2022] Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, and Li Cheng. Contrastive learning for unsupervised video highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14042–14052, 2022.
  • Bai et al. [2024] Ziyi Bai, Ruiping Wang, and Xilin Chen. Glance and focus: Memory prompting for multi-event video question answering. Advances in Neural Information Processing Systems, 36, 2024.
  • Caba Heilbron et al. [2015] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970, 2015.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229, 2020.
  • Chen et al. [2018] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally grounding natural sentence in video. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 162–171, 2018.
  • Chen et al. [2019a] Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. Localizing natural language in videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8175–8182, 2019a.
  • Chen and Jiang [2019] Shaoxiang Chen and Yu-Gang Jiang. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8199–8206, 2019.
  • Chen and Jiang [2020] Shaoxiang Chen and Yu-Gang Jiang. Hierarchical visual-textual graph for temporal activity localization via language. In Computer Vision–ECCV 2020: Proceedings, Part XX 16, pages 601–618, 2020.
  • Chen et al. [2020] Shaoxiang Chen, Wenhao Jiang, Wei Liu, and Yu-Gang Jiang. Learning modality interaction for temporal sentence localization and event captioning in videos. In Computer Vision–ECCV 2020: Proceedings, Part IV 16, pages 333–351, 2020.
  • Chen et al. [2021] Yi-Wen Chen, Yi-Hsuan Tsai, and Ming-Hsuan Yang. End-to-end multi-modal video temporal grounding. Advances in Neural Information Processing Systems, 34:28442–28453, 2021.
  • Chen et al. [2019b] Zhenfang Chen, Lin Ma, Wenhan Luo, and Kwan-Yee K Wong. Weakly-supervised spatio-temporally grounding natural sentence in video. arXiv preprint arXiv:1906.02549, 2019b.
  • Escorcia et al. [2019] Victor Escorcia, Mattia Soldan, Josef Sivic, Bernard Ghanem, and Bryan Russell. Temporal localization of moments in video collections with natural language. 2019.
  • Fang et al. [2023] Xiang Fang, Daizong Liu, Pan Zhou, and Guoshun Nan. You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2448–2460, 2023.
  • Gao et al. [2021] Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, and Bernard Ghanem. Relation-aware video reading comprehension for temporal language grounding. arXiv preprint arXiv:2110.05717, 2021.
  • Gao et al. [2017] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.
  • Gao and Xu [2021a] Junyu Gao and Changsheng Xu. Fast video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1523–1532, 2021a.
  • Gao and Xu [2021b] Junyu Gao and Changsheng Xu. Learning video moment retrieval without a single annotated video. IEEE Transactions on Circuits and Systems for Video Technology, 32(3):1646–1657, 2021b.
  • Ge et al. [2019] Runzhou Ge, Jiyang Gao, Kan Chen, and Ram Nevatia. Mac: Mining activity concepts for language-based temporal localization. In 2019 IEEE winter conference on applications of computer vision (WACV), pages 245–253. IEEE, 2019.
  • Goyal et al. [2023] Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, and Du Tran. Minotaur: Multi-task video grounding from multimodal queries. arXiv preprint arXiv:2302.08063, 2023.
  • Grauman et al. [2022] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
  • Hao et al. [2023] Jiachang Hao, Haifeng Sun, Pengfei Ren, Yiming Zhong, Jingyu Wang, Qi Qi, and Jianxin Liao. Fine-grained text-to-video temporal grounding from coarse boundary. ACM Transactions on Multimedia Computing, Communications and Applications, 19(5):1–21, 2023.
  • Hosseinzadeh and Wang [2020] Mehrdad Hosseinzadeh and Yang Wang. Composed query image retrieval using locally bounded features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3596–3605, 2020.
  • Hou et al. [2022] Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, and Nan Duan. Cone: An efficient coarse-to-fine alignment framework for long video temporal grounding. arXiv preprint arXiv:2209.10918, 2022.
  • Jang et al. [2023] Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13846–13856, 2023.
  • Ji et al. [2020] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Carlos Niebles. Action genome: Actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10236–10247, 2020.
  • Jiang et al. [2024] Hanwen Jiang, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Single-stage visual query localization in egocentric videos. Advances in Neural Information Processing Systems, 36, 2024.
  • Krishna et al. [2017] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.
  • Lei et al. [2021] Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021.
  • Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  • Lin et al. [2023] Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023.
  • Lin et al. [2020] Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. Weakly-supervised video moment retrieval via semantic completion network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11539–11546, 2020.
  • Liu et al. [2021a] Daizong Liu, Xiaoye Qu, Jianfeng Dong, and Pan Zhou. Adaptive proposal generation network for temporal sentence localization in videos. arXiv preprint arXiv:2109.06398, 2021a.
  • Liu et al. [2021b] Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Yu Cheng, Wei Wei, Zichuan Xu, and Yulai Xie. Context-aware biaffine localizing network for temporal sentence grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11235–11244, 2021b.
  • Liu et al. [2021c] Daizong Liu, Xiaoye Qu, and Pan Zhou. Progressively guide to attend: An iterative alignment framework for temporal sentence grounding. arXiv preprint arXiv:2109.06400, 2021c.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a.
  • Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023b.
  • Liu et al. [2018] Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. Cross-modal moment localization in videos. In Proceedings of the 26th ACM international conference on Multimedia, pages 843–851, 2018.
  • Liu et al. [2023c] Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, and Yong Rui. A survey on video moment localization. ACM Computing Surveys, 55(9):1–37, 2023c.
  • Liu et al. [2022] Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3042–3051, 2022.
  • Luo et al. [2023] Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu. Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23045–23055, 2023.
  • Ma et al. [2020] Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, and Chang D Yoo. Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In Computer Vision–ECCV 2020: Proceedings, Part XXVIII 16, pages 156–171, 2020.
  • Mithun et al. [2019] Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K Roy-Chowdhury. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11592–11601, 2019.
  • Moon et al. [2023a] WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae-Pil Heo. Correlation-guided query-dependency calibration in video representation learning for temporal grounding. arXiv preprint arXiv:2311.08835, 2023a.
  • Moon et al. [2023b] WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23023–23033, 2023b.
  • Mun et al. [2020] Jonghwan Mun, Minsu Cho, and Bohyung Han. Local-global video-text interactions for temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10810–10819, 2020.
  • Nam et al. [2021] Jinwoo Nam, Daechul Ahn, Dongyeop Kang, Seong Jong Ha, and Jonghyun Choi. Zero-shot natural language video localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1470–1479, 2021.
  • Pan et al. [2023] Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, and Deli Zhao. Scanning only once: An end-to-end framework for fast temporal grounding in long videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13767–13777, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Ramakrishnan et al. [2023] Santhosh Kumar Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Spotem: efficient video search for episodic memory. In International Conference on Machine Learning, pages 28618–28636. PMLR, 2023.
  • Ren et al. [2023] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. arXiv preprint arXiv:2312.02051, 2023.
  • Rodriguez et al. [2020] Cristian Rodriguez, Edison Marrese-Taylor, Fatemeh Sadat Saleh, Hongdong Li, and Stephen Gould. Proposal-free temporal moment localization of a natural-language query in video using guided attention. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2464–2473, 2020.
  • Spearman [1961] Charles Spearman. The proof and measurement of association between two things. 1961.
  • Sul et al. [2024] Jinhwan Sul, Jihoon Han, and Joonseok Lee. Mr. hisum: A large-scale dataset for video highlight detection and summarization. Advances in Neural Information Processing Systems, 36, 2024.
  • Sun et al. [2024] Hao Sun, Mingyao Zhou, Wenjing Chen, and Wei Xie. Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection. arXiv preprint arXiv:2401.02309, 2024.
  • Tan et al. [2023] Chaolei Tan, Zihang Lin, Jian-Fang Hu, Wei-Shi Zheng, and Jianhuang Lai. Hierarchical semantic correspondence networks for video paragraph grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18973–18982, 2023.
  • Ventura et al. [2024] Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. Covr: Learning composed video retrieval from web video captions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5270–5279, 2024.
  • Vo et al. [2019] Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, and James Hays. Composing text and image for image retrieval-an empirical odyssey. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6439–6448, 2019.
  • Wang et al. [2023a] Lan Wang, Gaurav Mittal, Sandra Sajeev, Ye Yu, Matthew Hall, Vishnu Naresh Boddeti, and Mei Chen. Protégé: Untrimmed pretraining for video temporal grounding by video temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6575–6585, 2023a.
  • Wang et al. [2023b] Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ran Cheng, and Ping Luo. Learning grounded vision-language representation for versatile understanding in untrimmed videos. arXiv preprint arXiv:2303.06378, 2023b.
  • Wang et al. [2024] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv preprint arXiv:2403.15377, 2024.
  • Xiao et al. [2023] Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, and Xiu Li. Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. arXiv preprint arXiv:2311.16464, 2023.
  • Xiong et al. [2016] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. arXiv preprint arXiv:1611.01604, 2016.
  • Xu et al. [2023] Yifang Xu, Yunzhuo Sun, Yang Li, Yilei Shi, Xiaoxiang Zhu, and Sidan Du. Mh-detr: Video moment and highlight detection with cross-modal transformer. arXiv preprint arXiv:2305.00355, 2023.
  • Yan et al. [2023] Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang, Weina Ge, David Ross, and Cordelia Schmid. Unloc: A unified framework for video localization tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13623–13633, 2023.
  • Yang et al. [2023a] Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023a.
  • Yang et al. [2023b] Lijin Yang, Quan Kong, Hsuan-Kung Yang, Wadim Kehl, Yoichi Sato, and Norimasa Kobori. Deco: Decomposition and reconstruction for compositional temporal grounding via coarse-to-fine contrastive ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23130–23140, 2023b.
  • Yoon et al. [2023] Sunjae Yoon, Gwanhyeong Koo, Dahyun Kim, and Chang D Yoo. Scanet: Scene complexity aware network for weakly-supervised video moment retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13576–13586, 2023.
  • Yu et al. [2023] Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering. arXiv preprint arXiv:2305.06988, 2023.
  • Yuan et al. [2019] Yitian Yuan, Tao Mei, and Wenwu Zhu. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9159–9166, 2019.
  • Zala et al. [2023] Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oguz, Yashar Mehdad, and Mohit Bansal. Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23056–23065, 2023.
  • Zeng et al. [2020] Runhao Zeng, Haoming Xu, Wenbing Huang, Peihao Chen, Mingkui Tan, and Chuang Gan. Dense regression network for video grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10287–10296, 2020.
  • Zhang et al. [2023a] Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmentation. arXiv preprint arXiv:2311.04498, 2023a.
  • Zhang et al. [2019a] Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1247–1257, 2019a.
  • Zhang et al. [2023b] Gengyuan Zhang, Jisen Ren, Jindong Gu, and Volker Tresp. Multi-event video-text retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22113–22123, 2023b.
  • Zhang et al. [2021] Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, and Rick Siow Mong Goh. Natural language video localization: A revisit in span-based question answering framework. IEEE transactions on pattern analysis and machine intelligence, 44(8):4252–4266, 2021.
  • Zhang et al. [2019b] Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, and Deng Cai. Localizing unseen activities in video via image query. arXiv preprint arXiv:1906.12165, 2019b.
  • Zhao et al. [2024] Long Zhao, Nitesh B Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, et al. Videoprism: A foundational visual encoder for video understanding. arXiv preprint arXiv:2402.13217, 2024.

Appendix A Appendix

In this Appendix, we present the following:

  • Additional information about the dataset ICQ-Highlight and licenses for the datasets and models we have used;

  • Additional technical implementations, including prompts, of the benchmark ICQ;

  • Extended experimental results omitted from the main part due to page limits.

A.1 Dataset: ICQ-Highlight

A.1.1 License

The dataset and code are publicly accessible. We use standard licenses from the community and provide the following links to the non-commercial licenses for the datasets we used in this paper.

A.1.2 Construction Pipeline

We base our dataset on the original annotations from QVHighlights [29]. The whole pipeline, shown in Fig. 6, consists of three stages. (1) Annotation: we first conduct a quality check on the annotations in the original dataset and filter out a few samples (details can be found in Sec. A.1.4). To generate more relevant reference images, we manually augment the original captions by adding new visual details based on three frames extracted from the raw videos. To introduce refinement texts, we purposely alter certain details of the captions to generate new ones. All annotations are carried out by two individuals and checked by a third party for accuracy. (2) Reference image generation: we use the augmented and altered captions to generate reference images in 4 styles with a suite of text-to-image models, including DALL-E 2 and Stable Diffusion XL. (3) Quality check: we apply an additional quality check to all generated images to eliminate and regenerate images that might contain unsafe or counterintuitive content. We employ BLIP2 [30] to filter out generated images whose semantic similarity with the augmented captions is below 0.2 and conduct a manual sanity check to control image quality.
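The similarity-based part of the quality check can be pictured as a simple threshold filter. The following is a minimal sketch under the assumption that an image-caption similarity score has already been computed for each generated image; the function name, image ids, and scores are hypothetical, and only the 0.2 threshold comes from the text.

```python
SIMILARITY_THRESHOLD = 0.2  # threshold stated in the text


def split_by_similarity(scored_images, threshold=SIMILARITY_THRESHOLD):
    """Split (image_id, similarity) pairs into kept and to-regenerate lists."""
    kept, regenerate = [], []
    for image_id, similarity in scored_images:
        if similarity < threshold:
            # Too dissimilar from the augmented caption: flag for regeneration.
            regenerate.append(image_id)
        else:
            kept.append(image_id)
    return kept, regenerate


# Hypothetical image ids and image-caption similarity scores.
scored = [("img_001", 0.31), ("img_002", 0.12), ("img_003", 0.20)]
kept, regenerate = split_by_similarity(scored)
print(kept)        # ['img_001', 'img_003']
print(regenerate)  # ['img_002']
```

Images that pass this filter would still go through the manual sanity check described above.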

Figure 6: Dataset Construction Pipeline: We base our dataset on the original annotations from QVHighlights and introduce a pipeline consisting of annotation, reference image generation, and quality check.

A.1.3 Statistics

Tab. 4 presents the statistics for various reference image styles in terms of the number of queries, videos, and the presence of refinement texts. Tab. 5 breaks down the statistics of refinement texts for different reference image styles across various query types: object, action, relation, attribute, environment, and others. The numbers of each type can vary slightly depending on the different styles.

Table 4: Statistics of Different Reference Image Styles
| Reference Image Style | #Queries | #Videos | #With Refinement Texts | #Without Refinement Texts |
| --- | --- | --- | --- | --- |
| scribble | 1546 | 1515 | / | 5 |
| cinematic | 1532 | 1502 | 1445 | 5 |
| cartoon | 1532 | 1501 | 1444 | 5 |
| realistic | 1532 | 1501 | 1446 | 4 |
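Table 5: Statistics of Refinement Texts
| Reference Image Style | #Queries (Object) | #Queries (Action) | #Queries (Relation) | #Queries (Attribute) | #Queries (Environment) | #Queries (Others) |
| --- | --- | --- | --- | --- | --- | --- |
| scribble | 594 | 242 | 50 | 162 | 343 | 70 |
| cinematic | 588 | 239 | 50 | 162 | 343 | 66 |
| cartoon | 590 | 239 | 48 | 161 | 341 | 68 |
| realistic | 586 | 241 | 50 | 161 | 341 | 70 |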

A.1.4 Details of Deleted Data

We removed four entries from the QVHighlights dataset whose original natural language queries could lead to violent, sexual, sensitive, or graphic content during generation:

  • “A graph depicts penis size.” (qid: 9737)

  • “People mess with the bull statues testicles.” (qid: 7787)

  • “People butcher meat from a carcass.” (qid: 4023)

  • “Woman films herself wearing black lingerie in the bathroom.” (qid: 7685)

A.2 Benchmark Details

A.2.1 Model Comparison

Tab. 6 compares our selected baseline models. Query Encoder denotes the text encoder each model uses to encode natural language queries. Source denotes the modalities of the source data, where V and A refer to “Video” and “Audio”, respectively. All models have been fine-tuned on QVHighlights.

Table 6: Comparison of selected baseline models. We only list the model head for the localization task if the model has multiple heads for different tasks.
| Model | Visual Encoder | Query Encoder | Localization Decoder | Source |
| --- | --- | --- | --- | --- |
| Moment-DETR (2021) | ViT-B/32 + SlowFast | CLIP Text | DETR | V |
| QD-DETR (2023) | ViT-B/32 + SlowFast | CLIP Text | DETR | V, V+A |
| EaTR (2023) | ViT-B/32 + SlowFast | CLIP Text | DETR | V |
| CG-DETR (2023) | ViT-B/32 + SlowFast | CLIP Text | DETR | V |
| TR-DETR (2024) | ViT-B/32 + SlowFast | CLIP Text | DETR | V, V+A |
| UMT (2022) | ViT-B/32 + SlowFast | CLIP Text | Transformer | V+A |
| UniVTG (2023) | ViT-B/32 + SlowFast | CLIP Text | Conv. Heads | V |
| UVCOM (2023) | ViT-B/32 + SlowFast | CLIP Text | Transformer Heads | V, V+A |
| SeViLA (2023) | ViT-G | CLIP Text | Multimodal LLM (BLIP2) | V |

A.2.2 Prompt Engineering

Since the performance may highly depend on the wording in a prompt, we use 3 different prompts for Cap and Sum adaptation methods. In Tab. 7, the prompts are divided into “Prompts For Style cartoon/cinematic/realistic” and “Prompts for scribble”. This distinction arises because refining scribble images with complementary texts involves adding new details, slightly differing from other scenarios. Despite this minor variation, the prompt style remains consistent, simulating 3 different user query styles.

Table 7: Prompts for Cap and Sum. We use 3 different prompts and report the average performance and standard deviation in other tables.
| # | Prompts For Style cartoon/cinematic/realistic | Prompts For Style scribble |
| --- | --- | --- |
| 1 | I have a caption {INPUT DATA}, adjust the {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}. The revised caption should remain coherent and logical without introducing any additional details. | I have a caption {INPUT DATA}. Modify it by adding {NEW TYPE} {NEW DETAIL}. The revised caption should remain coherent and logical without introducing any other additional details. |
| 2 | Read this {INPUT DATA}! Change the {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}. Then, write a new caption that fits and doesn’t add new stuff. Only give the caption, no extra words. | Read this {INPUT DATA}! Add the {NEW TYPE} {NEW DETAIL} to it. Then, write a new caption that fits and doesn’t add new stuff. Only give the caption, no extra words. |
| 3 | Here’s a caption {INPUT DATA}. Can you change {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}? After that, make a new caption that makes sense and doesn’t add anything extra. Just write the caption, no explanations needed. | Here’s a caption {INPUT DATA}. Can you add {NEW TYPE} {NEW DETAIL}? After that, make a new caption that makes sense and doesn’t add anything extra. Just write the caption, no explanations needed. |
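To illustrate how such a prompt template might be instantiated, the sketch below fills the placeholders of prompt 1 for both the scribble and the non-scribble case. The placeholder names are rewritten as valid Python format fields, and the helper name, caption, and refinement details are hypothetical examples rather than entries from ICQ-Highlight.

```python
# Prompt 1 from Tab. 7; placeholder names are adapted to valid Python format fields.
PROMPT_1_MODIFY = (
    "I have a caption {input_data}, adjust the {modification_type} from "
    "{modified_detail} to {original_detail}. The revised caption should remain "
    "coherent and logical without introducing any additional details."
)
PROMPT_1_ADD = (
    "I have a caption {input_data}. Modify it by adding {new_type} {new_detail}. "
    "The revised caption should remain coherent and logical without introducing "
    "any other additional details."
)


def build_prompt(style: str, caption: str, **fields: str) -> str:
    """Pick the scribble or non-scribble template and fill in its placeholders."""
    template = PROMPT_1_ADD if style == "scribble" else PROMPT_1_MODIFY
    return template.format(input_data=caption, **fields)


# Hypothetical caption and refinement fields, for illustration only.
prompt = build_prompt(
    "cartoon",
    '"A man in a red jacket walks a dog"',
    modification_type="attribute",
    modified_detail="red jacket",
    original_detail="blue jacket",
)
print(prompt)
```

For the scribble style, the same helper would be called with `new_type` and `new_detail` instead of the three modification fields.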

A.3 Extended Results

A.3.1 Model Performance with Different Prompts

We report the results of the 3 prompts across different models and metrics, including R1@0.5, R1@0.7, mAP@0.5, mAP@0.7, and Avg, for both the Cap and Sum methods.
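For reference, R1@0.5 and R1@0.7 are commonly computed as the share of queries whose top-1 predicted window overlaps the ground-truth window with a temporal IoU of at least 0.5 or 0.7. The sketch below illustrates this under the simplifying assumption of a single ground-truth window per query; the function names and toy windows are hypothetical, and this is not the benchmark's evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) windows in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, iou_threshold):
    """Percentage of queries whose top-1 window reaches the IoU threshold."""
    if len(top1_predictions) != len(ground_truths) or not ground_truths:
        raise ValueError("need one prediction per ground truth, and at least one query")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


# Two hypothetical queries: a perfect match and a partial overlap (IoU = 1/3).
predictions = [(0.0, 10.0), (0.0, 10.0)]
targets = [(0.0, 10.0), (5.0, 15.0)]
print(recall_at_1(predictions, targets, iou_threshold=0.5))  # prints 50.0
```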

In Fig. 7, the results indicate that the performance is consistent across different metrics, demonstrating the robustness of the models when using Cap. The models generally maintain similar performance levels regardless of the specific metric, suggesting their stability and reliability in Cap. In contrast, Fig. 8 illustrates the performance of the same models but using Sum. It is evident that prompt 1 consistently outperforms prompts 2 and 3 across all metrics. This indicates that the models are more sensitive to the formulation of the prompt in Sum.

Figure 7: Model Performance of Cap on ICQ.
Figure 8: Model Performance of Sum on ICQ.
Table 8: Performance comparison between the original query and corrupted text. The performance drop, shown in parentheses, indicates that the modifications to the natural language query are non-trivial. † indicates the use of an additional audio modality.
Method original
R1@0.5 R1@0.7 mAP@0.5 mAP@0.7 Avg.
Moment-DETR (2021) 54.92 (-4.6%) 36.87 (-3.3%) 55.95 (-4.2%) 31.59 (-4.5%) 32.54 (-3.8%)
QD-DETR (2023) 62.87 (-8.6%) 46.70 (-12.5%) 62.66 (-7.6%) 41.59 (-12.4%) 41.23 (-10.3%)
QD-DETR† (2023) 63.71 (-6.2%) 47.67 (-8.1%) 62.9 (-5.6%) 42.07 (-6.6%) 41.73 (-6.4%)
EaTR (2023) 60.93 (-8.0%) 46.12 (-9.5%) 62.01 (-5.9%) 42.11 (-7.6%) 41.39 (-6.7%)
CG-DETR (2023) 67.27 (-8.9%) 51.94 (-13.6%) 65.48 (-7.6%) 45.64 (-12.4%) 44.88 (-11.3%)
TR-DETR (2024) 67.08 (-7.5%) 51.36 (-8.3%) 66.20 (-7.3%) 46.28 (-9.3%) 44.99 (-8.1%)
UMT† (2022) 60.22 (-10.0%) 44.24 (-14.1%) 56.62 (-9.5%) 39.85 (-15.2%) 38.54 (-12.9%)
UniVTG (2023) 59.70 (-8.7%) 40.82 (-7.2%) 51.22 (-8.0%) 32.84 (-9.9%) 32.53 (-9.0%)
UVCOM (2023) 65.01 (-5.6%) 51.75 (-8.0%) 64.88 (-5.3%) 46.96 (-9.0%) 45.83 (-8.2%)
SeViLA (2023) 56.57 (-56.2%) 40.45 (-62.1%) 47.14 (-56.8%) 32.69 (-62.3%) 33.10 (-60.6%)
Table 9: Model performance (Recall) of Captioning without refinement text and Visual Query Enc. on ICQ. We highlight the best score in bold for each method and reference image style.

Model scribble cartoon cinematic realistic
R1@0.5 R1@0.7 R1@0.5 R1@0.7 R1@0.5 R1@0.7 R1@0.5 R1@0.7
Captioning only
Moment-DETR (2021) 45.15 28.72 43.60 27.94 44.06 29.70 44.06 28.98
QD-DETR (2023) 49.81 33.70 49.87 34.33 49.67 34.73 50.52 35.25
QD-DETR† (2023) 51.29 36.03 48.69 33.88 49.48 34.99 49.93 35.05
EaTR (2023) 52.01 37.77 47.45 33.09 48.56 34.33 49.61 35.64
CG-DETR (2023) 51.42 37.84 49.35 35.90 48.89 34.79 51.04 36.55
TR-DETR (2024) 52.01 37.19 51.04 36.62 50.00 36.03 52.28 37.53
UMT† (2022) 46.25 31.57 45.82 30.61 46.34 29.96 46.08 31.85
UniVTG (2023) 47.87 33.76 45.56 29.24 45.43 29.05 46.80 30.42
UVCOM (2023) 52.26 39.39 51.50 37.99 50.98 36.75 51.70 37.53
Visual Query Enc.
Moment-DETR (2021) 12.55 5.69 13.38 6.59 14.36 6.01 14.88 6.53
QD-DETR (2023) 15.91 9.12 14.88 8.62 13.90 8.49 14.62 8.36
QD-DETR† (2023) 15.65 10.03 12.60 6.79 12.34 6.72 12.34 7.44
EaTR (2023) 19.86 13.00 19.91 12.99 21.15 13.45 21.48 13.38
CG-DETR (2023) 22.90 13.00 24.93 13.58 23.24 13.12 24.74 14.23
TR-DETR (2024) 17.92 11.19 17.36 11.10 15.14 9.86 15.60 9.53
UMT† (2022) 5.43 2.85 4.77 2.09 5.22 2.35 4.57 2.42
UniVTG (2023) 21.93 13.00 23.89 13.64 22.78 13.19 22.52 12.79
UVCOM (2023) 17.08 9.77 16.78 10.97 17.36 11.68 17.10 11.23

A.3.2 Captioning Without Refinement Text vs. Visual Query Encoding

We compare model performance between Cap without refinement text and VisEnc in Tab. 9. Both methods use only reference images as queries, without refinement texts. Overall, Cap without refinement texts still outperforms pure VisEnc by a wide margin, highlighting the effectiveness of image captioning. With Cap, TR-DETR and UVCOM perform best across all styles.

A.3.3 Original vs. Corrupted Text in Our Annotation

We evaluate model performance on the original queries in QVHighlights and on our corrupted texts to assess the significance of the refinement texts and the sensitivity of different models to natural language queries. [45] points out that the impact of the natural language query may be minimal for some existing models, such as Moment-DETR. As shown in Tab. 8, Moment-DETR exhibits relatively small drops across all metrics (3.3% to 4.6%), supporting this claim. In contrast, more recent models such as CG-DETR and TR-DETR experience larger drops, indicating a higher sensitivity to query modifications. SeViLA is extremely sensitive to query modifications, with declines of 56.2% to 62.3% across all evaluated metrics. Overall, the considerable performance decline across models demonstrates that our modifications significantly alter the original queries. It also shows that our refinement texts are not semantically trivial for localizing with multimodal queries.
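The parenthesized changes in Tab. 8 can be read as differences between scores on corrupted and original queries. The sketch below assumes that the change is computed relative to the original score; the source does not state the exact formula, so this is an assumption, and the function name and example numbers are hypothetical.

```python
def relative_change(original: float, corrupted: float) -> float:
    """Change of the corrupted-query score relative to the original, in percent.

    Assumption: the change is normalized by the original score.
    """
    return 100.0 * (corrupted - original) / original


# Made-up scores for illustration only (not values from Tab. 8).
change = relative_change(original=50.0, corrupted=45.0)
print(f"{change:.1f}%")
```

Under this reading, a negative value means the model scores lower once the query text has been corrupted.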