Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang ^1,4 Mang Ling Ada Fok ^2,* Yan Xia ^2,4 Yansong Tang ^3
Daniel Cremers ^2,4 Philip Torr ^5 Volker Tresp ^1,4 Jindong Gu ^5
^1 LMU Munich ^2 TU Munich ^3 Tsinghua University
^4 Munich Center for Machine Learning (MCML) ^5 University of Oxford
zhang@dbs.ifi.lmu.de ada.fok@tum.de * Equal contribution

Abstract

Video understanding is a pivotal task in the digital era, yet the dynamic and multi-event nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically natural language descriptions of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset, ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text that adjusts the image's semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization. Our project is available at https://icq-benchmark.github.io/.

1 Introduction

Videos are the prevailing data medium on the Internet and a common multimodal interface when we interact with the world. User-centric applications, such as video search engines and video highlight/moment recommendations, are increasingly popular on streaming media and short video platforms. Moreover, large foundation models are expected to process videos as input data to understand surroundings and make decisions. Consequently, video understanding has been a long-standing research topic and has recently gained increased attention.

However, videos are inherently dynamic and contain multiple events [66, 75] that are sparsely distributed. This redundancy makes processing and understanding dense videos labor-intensive and computationally demanding for both human users and deep-learning models. As a result, localizing events in videos becomes essential [28, 48].

Figure 1: Localizing events in videos with semantic queries: so far, the community has focused only on natural language query-based video event localization, as in [29]. Our benchmark ICQ focuses on a more general scenario: localizing events in videos with multimodal queries.

Localizing events in videos encompasses a broad spectrum of related tasks. From a practical perspective, particularly in user-centric applications like video search and recommendation, tasks including video moment retrieval [17, 18, 39] and highlight detection [2, 29, 45] focus on identifying and retrieving video segments of interest based on textual queries within extensive, long-range videos. For video foundation models that aim to understand and reason about video content, video temporal grounding [11, 12, 13, 15, 24, 56] with a given natural language query can not only reduce the video processing duration but also elucidate the reasoning process.

A series of benchmarks [4, 16, 29, 54] has been established for exploring video event localization using natural language queries as semantic queries. Building on these foundations, existing models have primarily focused on this natural language query setting [1, 6, 7, 8, 10, 9, 12, 13, 16, 19, 29, 60]. However, with the increasing need for human users to efficiently process massive video data online and the advent of large-scale foundation models in recent years, multimodal interaction with videos is a promising scenario. In other words, text should not be the only possible query for localizing events in videos. As the saying goes, “A picture is worth a thousand words”: images are a non-verbal language that can express rich semantic meaning and describe events in videos.

Multimodal queries, also known as composed queries [23, 57], bring practical benefits to video event localization. From a pragmatic perspective, using queries such as user-input “scribble images” can facilitate a more natural human-computer interaction. As users, we often opt for brief and simple text queries rather than detailed and lengthy paragraphs when searching videos semantically, and thus a text query can be ambiguous. Text sometimes fails to convey the intended message where an image succeeds. Meanwhile, grounding/localizing events in videos with multimodal queries can serve as an important module of video foundation models, as in temporal grounding and episodic memory search [21, 27, 50]. This relates to grounding an event triggered by a similar scene, akin to the common cognitive phenomenon of déjà vu.

Since using multimodal queries for semantically searching events in videos remains largely unexplored, we propose a new task: localizing events in videos with multimodal queries. We introduce a new benchmark, ICQ, for localizing events in videos using Image-Text Composed Queries as multimodal queries. Our benchmark is targeted at evaluating model performance in localizing events in videos with multimodal queries consisting of reference images and refinement texts. Alongside this benchmark, we propose a new evaluation dataset, ICQ-Highlight, as a testbed for our task. Given that reference images may have a significant distribution shift from the video data in style and that refinement texts should alter the semantic meaning of reference images in various aspects, our dataset covers 4 reference image styles and 5 refinement text types.

We evaluate a broad spectrum of existing video localization models, from specialized models to LLM-based video foundation models, on ICQ. To bridge the gap between current natural language query-based models and multimodal queries, we propose 3 adaptation methods: Captioning, Summarization, and Visual Query Encoding. Our results demonstrate that existing models can be effectively adapted to our new benchmark with the aid of Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs), despite varying degrees of performance decline and instability. These adapted models should serve as a solid baseline for future studies. Additionally, our findings reveal that multimodal semantic queries can successfully localize events in videos, suggesting multimodal queries have promising applications for video localization.

Our contributions are summarized as follows:

1. We introduce a new evaluation benchmark, ICQ, and a new evaluation dataset, ICQ-Highlight, for analyzing event localization in videos with multimodal queries;
2. We propose 3 adaptation methods and evaluate 10 models ranging from specialized models to large-scale video foundation models;
3. Our comprehensive experiments show that our adaptation methods are simple yet effective baselines for adapting existing models to ICQ;
4. We claim that using multimodal queries for video event localization is a practical and feasible scenario with broad prospects.

2 Related Work

2.1 Localizing Event in Videos with Natural Language Queries

Query-based video temporal localization has been a long-standing research topic and is an umbrella term for several related tasks. According to their scenarios and motivation, they can be further classified into several similar but slightly different tasks. Video moment retrieval [32, 38, 42, 43, 41, 68, 71, 74] aims to localize a video segment based on a textual caption query that describes events in the video. Video temporal grounding/localization [14, 22, 34, 35, 46, 47, 67, 70, 72] with natural language queries aims to determine the video segment that corresponds to a textual description and usually serves downstream question-answering tasks [3, 63, 69, 76] by providing the relevant segments in videos. Other similar yet less relevant tasks include video highlight detection [2, 29, 45, 54] and action detection; these tasks also involve localizing video segments but with an implicit query or a category-level action label. Our benchmark steps towards localizing video events with multimodal queries. Such a query is composed of images and text, which distinguishes our work from others: it performs a semantic search for events in videos described by multimodal queries.

Regarding the methodology, a line of work focuses on video moment retrieval/video temporal grounding tasks: this includes two-stage (i.e., proposal-based) models [33] that first generate moment candidates and then filter out the matched moment based on the query, and one-stage (i.e., proposal-free) models [7, 52, 70] that integrate moment generation and moment localization into a unified framework. Within the one-stage models, DETR [5] has been widely employed for video temporal localization, as in [25, 29, 45, 44, 55, 64]. More recent works [31, 40, 65, 62] attempt to unify multiple video localization tasks, including video moment retrieval and highlight detection, in a single framework. This again shows the correlation among video temporal localization tasks. In addition, with large-scale video foundation models and MLLMs gaining increasing attention, temporal grounding has also become a core module in models like SeViLA [69], InternVideo2 [61], VideoPrism [78], etc. [51, 73].

2.2 Multimodal Query for Image/Video Understanding

Using multimodal queries is a practical and important setting for image/video understanding [57, 58]. However, it is crucial to note that video event localization with multimodal queries differs from image/video retrieval tasks, which primarily involve instance-level similarity matching. Temporal localization requires dense video processing, significantly increasing the complexity of the task.

For video localization tasks, [77] is the first work to use image queries to localize unseen activities in videos. More recently, [20] proposes to ground videos spatio-temporally using images or texts, although their queries are still limited to object or action levels. To the best of our knowledge, our work is the first to attempt localizing events in videos using multimodal semantic queries.

3 ICQ: Video Event Localization with Multimodal Queries

In the following section, we will detail the benchmark ICQ and a new evaluation dataset ICQ-Highlight to benchmark video event localization with multimodal queries.

3.1 Task Definition

We define the multimodal query $q_{m}$ as consisting of a reference image $v_{ref}$ accompanied by a refinement text $t_{ref}$ for minor adjustments to localize the target event in a video that corresponds to the query semantically. The reference image captures the broad semantics of the target event, while the refinement text provides supplementary information that can be either complementary or corrective. We believe that this setting is more adaptable and general in real-world applications.

Given the query $q_{m}$, the model predicts all the relevant segments or moments $\left[time_{start},time_{end}\right]$. Similar to the metrics used in common-setting video moment retrieval, we utilize recall R and mean Average Precision as the evaluation metrics for video moment retrieval.

Reference Image Styles Reference images $v_{ref}$ visually describe the semantics of an event in a video. They can be simple scribble images with minimal strokes that summarize an event succinctly, serving as non-verbal semantic queries for video localization, or more detailed images that depict semantically relevant scenes in a video. As illustrated in Fig. 2, reference images describe semantically similar scenes yet may differ in details from the target videos. In practice, visual queries can differ in style, which may impact model performance. Therefore, we explore multiple reference image styles, as detailed in the subsequent section, to assess whether the model maintains consistent performance across various styles as an indicator of model robustness.

Refinement Texts Refinement texts are simple phrases or sentences that complement or correct descriptions that are either missing or contradictory in the reference images. This is particularly practical in real-world applications, as reference images often do not semantically align perfectly with the target video event. We identify 5 types of refinement texts that can be applied to various aspects of the reference image semantics: “object”, “action”, “relation”, “attribute”, and “environment”, plus a catch-all “others” category, as shown in Fig. 3. This categorization is adopted from the elements of a semantic scene graph [26] and summarizes the different semantic elements of the multimodal queries.
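As a rough illustration of the query structure described above, the following sketch represents one multimodal query as a small Python data structure. The class and field names are hypothetical and not taken from the benchmark's code; the style names come from the dataset description, and storing the reference image as a file path is an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ImageStyle(Enum):
    """The 4 reference image styles used in ICQ-Highlight."""
    SCRIBBLE = "scribble"
    CARTOON = "cartoon"
    CINEMATIC = "cinematic"
    REALISTIC = "realistic"


class RefinementType(Enum):
    """Refinement text categories, including the catch-all "others"."""
    OBJECT = "object"
    ACTION = "action"
    RELATION = "relation"
    ATTRIBUTE = "attribute"
    ENVIRONMENT = "environment"
    OTHERS = "others"


@dataclass(frozen=True)
class MultimodalQuery:
    """Hypothetical container for a query q_m = (v_ref, t_ref)."""
    reference_image_path: str                      # v_ref (assumed to be a file path)
    style: ImageStyle
    refinement_text: Optional[str] = None          # t_ref, None if the query has none
    refinement_type: Optional[RefinementType] = None

    def __post_init__(self) -> None:
        # A refinement text and its type must be given together or not at all.
        if (self.refinement_text is None) != (self.refinement_type is None):
            raise ValueError("refinement_text and refinement_type must be set together")


query = MultimodalQuery(
    reference_image_path="query_0001.png",         # illustrative file name
    style=ImageStyle.SCRIBBLE,
    refinement_text="the person is indoors, not outdoors",
    refinement_type=RefinementType.ENVIRONMENT,
)
```

Making the refinement fields optional reflects that a few queries in the dataset carry no refinement text.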

3.2 Dataset Construction

We introduce our new evaluation dataset, ICQ-Highlight, as a testbed for ICQ. This dataset is built upon the validation set of QVHighlight [29], a popular natural-language query-based video localization dataset. For each original query in QVHighlight, we construct multimodal semantic queries that incorporate reference images paired with refinement texts. Considering the reference image style distribution discussed earlier, ICQ-Highlight features 4 variants based on different image styles. In total, the dataset comprises 1515 videos and 1546 test samples on average for each style. The exact numbers may vary slightly across styles and are provided in the Appendix.

Reference Image Generation We generate reference images based on the original natural language queries and refinement texts using a suite of state-of-the-art Text-to-Image models, including DALL-E-2 (https://openai.com/index/dall-e-2/) and Stable Diffusion (https://stability.ai/stable-image). For the reference image styles mentioned earlier, we select 4 representative styles: scribble, cartoon, cinematic, and realistic. These styles effectively capture a variety of real-world scenarios such as user inputs, book illustrations, television shows, and actual photographs, where images are often used as queries.

Data Annotation and Preprocessing We emphasize the meticulous crowd-sourced data curation and annotation effort applied to QVHighlight for 2 main reasons: (1) To introduce refinement texts, we purposefully modify the original semantics of text queries in QVHighlight to generate queries that are similar yet subtly different; (2) Given that the original queries in QVHighlight can be too simple and ambiguous to generate reasonable reference images, we add necessary annotations to ensure that the generated image queries are more relevant to the original video semantics. We employed human annotators to annotate and modify the natural language queries. Each query is annotated and reviewed by different annotators to ensure consistency. Further details can be found in the Appendix.

Data Curation and Quality Check Image generation can suffer from significant imperfections in terms of semantic consistency and content safety. To address these issues, we implement a quality check in two stages: (1) We calculate the semantic similarity between the generated images and the text queries using BLIP2 [30] encoders, eliminating samples that score lower than 0.2; (2) We perform a human sanity check to replace images that are i) semantically misaligned with the text, ii) mismatched with the required reference image style, or iii) contain sensitive or unpleasant content (e.g., violent, racial, sexual content), counterintuitive elements, or obvious generation artifacts.
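The first, automatic stage of this quality check can be sketched as a simple threshold filter. The sketch below assumes that the similarity score is a cosine similarity between precomputed image and text embeddings; the text only states that BLIP2 encoders are used and that samples scoring below 0.2 are eliminated, so the choice of cosine similarity, the function names, and the toy vectors are all illustrative assumptions.

```python
import numpy as np

SIM_THRESHOLD = 0.2  # threshold stated in the text


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two 1-D vectors (assumed scoring function)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def filter_by_similarity(image_embs, text_embs, threshold=SIM_THRESHOLD):
    """Return indices of image-text pairs whose similarity is at least `threshold`."""
    kept = []
    for idx, (img, txt) in enumerate(zip(image_embs, text_embs)):
        if cosine_similarity(img, txt) >= threshold:
            kept.append(idx)
    return kept


# Toy embeddings standing in for encoder outputs (hypothetical values).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
kept_indices = filter_by_similarity(image_embs, text_embs)
```

Samples that pass this filter would still go through the human sanity check of the second stage.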

3.3 Baseline Selection

We have selected and benchmarked 10 models specifically designed for video event localization with natural language queries. We assess the zero-shot performance of these models using checkpoints that have been fine-tuned on the original QVHighlight dataset. This evaluation allows us to understand their effectiveness for multimodal queries straight out of the box.

In particular, we categorize them as follows and compare the models along different dimensions in the Appendix: (1) Specialized models use natural language as a semantic query and are targeted at video moment retrieval tasks. We have selected a series of these models, including Moment-DETR [29], QD-DETR [45], EaTR [25], CG-DETR [44], and TR-DETR [55]; (2) Unified frameworks aim to solve multiple video localization tasks within one model, such as moment retrieval, highlight detection, and video summarization. We have selected UMT [40], UniVTG [31], and UVCOM [62] as strong baselines; (3) LLM-based models feature the power of Large Language Models, which prove to be a powerful and general head for varied video tasks. We have selected SeViLA [69] as a representative.

3.4 Adaptation Methods

Most existing video localization methods utilize natural language as input queries and are not readily adaptable to composed queries. Thus, we propose 3 adaptation methods: Captioning (Cap), Summarization (Sum), and Visual Query Encoding (VisEnc), as illustrated in Fig. 4. For Cap and Sum, we aim to leverage the power of LLMs and MLLMs to caption reference images $v_{ref}$ and integrate refinement texts $t_{ref}$: Cap uses MLLMs as a captioner to caption reference images and LLMs as a modifier to integrate refinement texts. In contrast, Sum uses MLLMs to directly summarize reference images and refinement texts in one step. The generated texts $t_{query}$ can be seamlessly used by existing models. For VisEnc, we explore using only reference images and employing visual encoders to embed the reference images as query embeddings $e_{query}$. This is possible because all models we have selected employ a dual-stream encoder that embeds image-text pairs in a joint feature space.
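The control flow of the three adaptation methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the captioner, modifier, summarizer, and visual encoder are passed in as plain callables (hypothetical interfaces), and returning the unmodified caption when no refinement text exists is an assumption of this sketch.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces for the underlying models.
Captioner = Callable[[str], str]          # image path -> caption (MLLM)
Modifier = Callable[[str, str], str]      # caption, refinement text -> revised caption (LLM)
Summarizer = Callable[[str, str], str]    # image path, refinement text -> query text (MLLM)
VisualEncoder = Callable[[str], List[float]]  # image path -> query embedding


def adapt_cap(image: str, refinement: Optional[str],
              captioner: Captioner, modifier: Modifier) -> str:
    """Cap: caption the reference image, then integrate the refinement text."""
    caption = captioner(image)
    if refinement is None:
        return caption  # assumption: nothing to integrate
    return modifier(caption, refinement)


def adapt_sum(image: str, refinement: Optional[str], summarizer: Summarizer) -> str:
    """Sum: summarize image and refinement text in a single step."""
    return summarizer(image, refinement or "")


def adapt_visenc(image: str, visual_encoder: VisualEncoder) -> List[float]:
    """VisEnc: embed the reference image only; the refinement text is not used."""
    return visual_encoder(image)


# Stub models for demonstration only.
def stub_captioner(image: str) -> str:
    return "a man cooking outdoors"


def stub_modifier(caption: str, refinement: str) -> str:
    return f"{caption} ({refinement})"


def stub_summarizer(image: str, refinement: str) -> str:
    return f"summary of {image} with {refinement}"


t_query_cap = adapt_cap("ref.png", "indoors", stub_captioner, stub_modifier)
t_query_sum = adapt_sum("ref.png", "indoors", stub_summarizer)
e_query = adapt_visenc("ref.png", lambda image: [0.1, 0.2, 0.3])
```

Cap and Sum both return a text query that a natural-language localization model can consume directly, whereas VisEnc returns an embedding that replaces the text query embedding.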

4 Experiments

4.1 Experimental Setup

Implementation We employ a state-of-the-art MLLM, LLaVA-mistral [36, 37], as the captioner and GPT-3.5 as the modifier in our Cap adaptation. For a fair comparison, we also utilize LLaVA-mistral for the Sum adaptation. We believe that the performance of these models is representative of the SOTA capabilities of MLLMs. For VisEnc, we utilize the corresponding CLIP [49] Visual Encoder, as all models typically employ the CLIP Text Encoder for text query encoding. In this adaptation method, we omit refinement texts and use only the reference image.

Evaluation Metrics We evaluate models on our new testbed ICQ-Highlight. For evaluation, we report Recall R@1 with IoU thresholds 0.5 and 0.7, mean Average Precision (mAP) with IoU threshold 0.5, and the average over multiple IoU thresholds [0.5:0.05:0.95], as standard metrics for video moment retrieval and localization [29, 69], where IoU (Intersection over Union) thresholds determine if a predicted temporal window is positive.
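To make the role of the IoU threshold concrete, the sketch below computes the temporal IoU of two windows and R@1 over a set of queries. It is a simplified illustration rather than the official evaluation script: the function names are hypothetical, and counting a query as a hit when its top-1 prediction reaches the threshold against any ground-truth window is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Window = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection over Union of two temporal windows."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds: Sequence[Window],
                gt_windows: Sequence[Sequence[Window]],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction matches a ground-truth window."""
    if not top1_preds:
        return 0.0
    hits = 0
    for pred, gts in zip(top1_preds, gt_windows):
        best_iou = max((temporal_iou(pred, gt) for gt in gts), default=0.0)
        if best_iou >= iou_threshold:
            hits += 1
    return 100.0 * hits / len(top1_preds)


# IoU thresholds [0.5:0.05:0.95] used for the averaged mAP.
MAP_IOU_THRESHOLDS: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Two toy queries: a perfect prediction and a half-overlapping one.
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [[(0.0, 10.0)], [(5.0, 15.0)]]
r1_at_05 = recall_at_1(preds, gts, 0.5)
```

In the toy example, the second prediction has an IoU of 1/3 with its ground truth, so it counts as a miss at threshold 0.5 but would count as a hit at threshold 0.3.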

4.2 Results & Analysis

We present the pairwise performance of 10 models combined with 3 adaptation methods on ICQ in the tables below. For the Cap and Sum methods, we have conducted multiple runs with different prompts used for captioning and summarization and report the average performance and standard deviation.

Model	scribble R1@0.5	scribble R1@0.7	cartoon R1@0.5	cartoon R1@0.7	cinematic R1@0.5	cinematic R1@0.7	realistic R1@0.5	realistic R1@0.7
Moment-DETR	45.15 (-2.7%)	28.72 (-3.3%)	43.60 (-7.1%)	27.94 (-5.8%)	44.06 (-7.3%)	29.70 (-2.8%)	44.06 (-9.3%)	28.98 (-6.5%)
QD-DETR	49.81 (-4.0%)	33.70 (-5.4%)	49.87 (-6.6%)	34.33 (-6.3%)	49.67 (-9.3%)	34.73 (-8.1%)	50.52 (-5.7%)	35.25 (-7.4%)
QD-DETR†	51.29 (-3.9%)	36.03 (-3.8%)	48.69 (-10.8%)	33.88 (-13.4%)	49.48 (-8.5%)	34.99 (-9.0%)	49.93 (-7.5%)	35.05 (-10.4%)
EaTR	52.01 (+0.5%)	37.77 (+1.2%)	47.45 (-6.7%)	33.09 (-8.0%)	48.56 (-7.0%)	34.33 (-5.1%)	49.61 (-6.1%)	35.64 (-3.0%)
CG-DETR	51.42 (-4.0%)	37.84 (-1.7%)	49.35 (-13.0%)	35.90 (-13.4%)	48.89 (-10.3%)	34.79 (-11.3%)	51.04 (-10.5%)	36.55 (-14.0%)
TR-DETR	52.01 (-2.4%)	37.19 (-2.9%)	51.04 (-9.2%)	36.62 (-11.2%)	50.00 (-11.8%)	36.03 (-12.5%)	52.28 (-8.8%)	37.53 (-10.6%)
UMT†	46.25 (-3.0%)	31.57 (-1.0%)	45.82 (-6.9%)	30.61 (-7.1%)	46.34 (-8.6%)	29.96 (-13.7%)	46.08 (-6.2%)	31.85 (-7.1%)
UniVTG	47.87 (-3.8%)	33.76 (-2.2%)	45.56 (-9.4%)	29.24 (-11.5%)	45.43 (-11.2%)	29.05 (-13.9%)	46.80 (-9.3%)	30.42 (-12.4%)
UVCOM	52.26 (-1.7%)	39.39 (+1.0%)	51.50 (-6.1%)	37.99 (-6.6%)	50.98 (-9.4%)	36.75 (-11.3%)	51.70 (-7.6%)	37.53 (-10.5%)
SeViLA	13.15 (-30.3%)	8.06 (-29.3%)	11.89 (-49.8%)	6.89 (-57.0%)	13.26 (-49.0%)	8.32 (-51.5%)	13.65 (-49.1%)	8.22 (-51.1%)

Reference Image Style	#Queries	#Videos	#With Refinement Texts	#Without Refinement Texts
scribble	1546	1515	/	5
cinematic	1532	1502	1445	5
cartoon	1532	1501	1444	5
realistic	1532	1501	1446	4

Reference Image Style (#Queries per refinement text type)	Object	Action	Relation	Attribute	Environment	Others
scribble	594	242	50	162	343	70
cinematic	588	239	50	162	343	66
cartoon	590	239	48	161	341	68
realistic	586	241	50	161	341	70

Model	Visual Encoder	Query Encoder	Localization Decoder^∗	Source
Moment-DETR (2021)	ViT-B/32 + SlowFast	CLIP Text	DETR	V
QD-DETR (2023)	ViT-B/32 + SlowFast	CLIP Text	DETR	V, V+A
EaTR (2023)	ViT-B/32 + SlowFast	CLIP Text	DETR	V
CG-DETR (2023)	ViT-B/32 + SlowFast	CLIP Text	DETR	V
TR-DETR (2024)	ViT-B/32 + SlowFast	CLIP Text	DETR	V, V+A
UMT (2022)	ViT-B/32 + SlowFast	CLIP Text	Transformer	V+A
UniVTG (2023)	ViT-B/32 + SlowFast	CLIP Text	Conv. Heads	V
UVCOM (2023)	ViT-B/32 + SlowFast	CLIP Text	Transformer Heads	V, V+A
SeViLA (2023)	ViT-G	CLIP Text	Multimodal LLM (BLIP2)	V

	Prompts For Style cartoon/cinematic/realistic	Prompts For Style scribble
1	I have a caption {INPUT DATA}, adjust the {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}. The revised caption should remain coherent and logical without introducing any additional details.	I have a caption {INPUT DATA}. Modify it by adding {NEW TYPE} {NEW DETAIL}. The revised caption should remain coherent and logical without introducing any other additional details.
2	Read this {INPUT DATA}! Change the {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}. Then, write a new caption that fits and doesn’t add new stuff. Only give the caption, no extra words.	Read this {INPUT DATA}! Add the {NEW TYPE} {NEW DETAIL} to it. Then, write a new caption that fits and doesn’t add new stuff. Only give the caption, no extra words.
3	Here’s a caption {INPUT DATA}. Can you change {MODIFICATION TYPE} from {MODIFIED DETAIL} to {ORIGINAL DETAIL}? After that, make a new caption that makes sense and doesn’t add anything extra. Just write the caption, no explanations needed.	Here’s a caption {INPUT DATA}. Can you add {NEW TYPE} {NEW DETAIL}? After that, make a new caption that makes sense and doesn’t add anything extra. Just write the caption, no explanations needed.
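As an illustration of how such a prompt template might be instantiated before being sent to the modifier LLM, the sketch below fills the curly-brace placeholders of the first template with concrete values. The helper name `fill_prompt` and the example field values are hypothetical; only the template string itself is taken from the table above.

```python
import re
from typing import Dict

# Prompt 1 for the cartoon/cinematic/realistic styles, copied from the table.
MODIFY_TEMPLATE = (
    "I have a caption {INPUT DATA}, adjust the {MODIFICATION TYPE} from "
    "{MODIFIED DETAIL} to {ORIGINAL DETAIL}. The revised caption should remain "
    "coherent and logical without introducing any additional details."
)


def fill_prompt(template: str, fields: Dict[str, str]) -> str:
    """Replace every {PLACEHOLDER} in `template` with its value from `fields`."""
    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in fields:
            raise KeyError(f"missing value for placeholder {{{key}}}")
        return fields[key]

    # Placeholders are upper-case words separated by spaces, e.g. {INPUT DATA}.
    return re.sub(r"\{([A-Z ]+)\}", substitute, template)


prompt = fill_prompt(
    MODIFY_TEMPLATE,
    {
        "INPUT DATA": '"a woman walks a cat in the park"',  # illustrative caption
        "MODIFICATION TYPE": "object",
        "MODIFIED DETAIL": "cat",
        "ORIGINAL DETAIL": "dog",
    },
)
```

A regular expression is used instead of `str.format` so that a missing placeholder value produces an explicit error that names the placeholder.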

Method	original R1@0.5	original R1@0.7	original mAP@0.5	original mAP@0.7	original Avg.
Moment-DETR (2021)	54.92 (-4.6%)	36.87 (-3.3%)	55.95 (-4.2%)	31.59 (-4.5%)	32.54 (-3.8%)
QD-DETR (2023)	62.87 (-8.6%)	46.70 (-12.5%)	62.66 (-7.6%)	41.59 (-12.4%)	41.23 (-10.3%)
QD-DETR† (2023)	63.71 (-6.2%)	47.67 (-8.1%)	62.9 (-5.6%)	42.07 (-6.6%)	41.73 (-6.4%)
EaTR (2023)	60.93 (-8.0%)	46.12 (-9.5%)	62.01 (-5.9%)	42.11 (-7.6%)	41.39 (-6.7%)
CG-DETR (2023)	67.27 (-8.9%)	51.94 (-13.6%)	65.48 (-7.6%)	45.64 (-12.4%)	44.88 (-11.3%)
TR-DETR (2024)	67.08 (-7.5%)	51.36 (-8.3%)	66.20 (-7.3%)	46.28 (-9.3%)	44.99 (-8.1%)
UMT† (2022)	60.22 (-10.0%)	44.24 (-14.1%)	56.62 (-9.5%)	39.85 (-15.2%)	38.54 (-12.9%)
UniVTG (2023)	59.70 (-8.7%)	40.82 (-7.2%)	51.22 (-8.0%)	32.84 (-9.9%)	32.53 (-9.0%)
UVCOM (2023)	65.01 (-5.6%)	51.75 (-8.0%)	64.88 (-5.3%)	46.96 (-9.0%)	45.83 (-8.2%)
SeViLA (2023)	56.57 (-56.2%)	40.45 (-62.1%)	47.14 (-56.8%)	32.69 (-62.3%)	33.10 (-60.6%)