Short Film Dataset (SFD):
A Benchmark for Story-Level Video Understanding

Ridouane Ghermi
LIX, Ecole Polytechnique, IP Paris
ridouane.ghermi@inria.fr
Xi Wang
LIX, Ecole Polytechnique, IP Paris
xi.wang@lix.polytechnique.fr
Vicky Kalogeiton
LIX, Ecole Polytechnique, IP Paris
vicky.kalogeiton@polytechnique.edu
Ivan Laptev
MBZUAI
ivan.laptev@inria.fr

Abstract

Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos and frequently encounter data leakage given the use of movie forums and other resources in LLM training. To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts leading to the on-par performance of people and LLMs. We also show significantly lower performance of current models compared to people when using vision data alone.

Figure 1: VideoQA examples from three video domains: instructional videos, egocentric videos and movies. While instructional and egocentric videos usually depict one or two people performing a single task, movies present time-extended stories with a rich variety in terms of scenes, characters and interactions.

https://shortfilmdataset.github.io

1 Introduction

Recent advances in vision-language models [33, 31, 9, 10, 61, 60, 16, 15, 11, 68] show significant promise for enhancing machine perception. Despite this remarkable progress, current datasets are still constrained by some limitations. Most of them consist of short videos [20, 1, 13, 69, 72, 17, 55, 27, 32, 19, 65], often under one minute, and primarily target short-term tasks, e.g., action recognition and video retrieval, often requiring only a few seconds of visual content for problem-solving [39, 28]. For long-form video understanding, the community has turned to three video categories. (i) Egocentric videos, an emerging domain featuring extended first-person sequences that capture a person’s continuous actions [18, 12, 24, 34, 39, 52, 51] or chronicle daily activities, offering tasks from object manipulation to social interactions [18]. Nevertheless, while such datasets excel in video duration, they lack narrative depth, focusing mainly on immediate visuals rather than rich storytelling; (ii) Instructional videos [41, 71], which depict a wide variety of human tasks and focus on the procedural aspect, are often brief and deficient in storytelling; (iii) Movies, unlike egocentric and instructional videos, provide complex plots and extended duration, making them an ideal testing ground for developing models that can comprehend long-form video narratives, see Figure 1.

Several movie datasets have been proposed in the literature [48, 4, 56, 29, 53, 21, 64, 22, 59, 66, 49, 40, 44, 7, 6, 38] offering rich narrative-driven content and various tasks. These tasks include clip retrieval [4], video question answering [56, 29, 38], audio description [53, 21], semantic role labeling [49], scene segmentation [22, 44], and visual scene recognition [6]. However, current movie datasets suffer from three main limitations. (i) Accessibility: these datasets are not publicly available as they typically contain commercial copyrighted movies (see Figure 2). (ii) Clip duration: they often consist of incomplete and short clips [56, 29, 45, 64] (i.e., less than 4 minutes, see Figure 2), limiting their potential for long-form understanding with evolving storylines. Moreover, these datasets frequently suffer from ill-designed benchmark questions with insufficient narrative focus [54]. (iii) Data Leakage: as these datasets comprise well-known commercial movies, modern Large Language Models (LLMs) and Vision-Language Models (VLMs) have likely been exposed to some form of movie information (i.e., synopses, reviews, discussions, subtitles, blog posts). Figure 3 showcases this, where given only movie titles, modern LLMs can obtain high accuracy when answering questions from main movie datasets. This data leakage leads to biased benchmarks and ineffective training.

To address the above limitations, we introduce the Short Film Dataset (SFD), a novel video benchmark comprising 1,078 short films, totalling over 243 hours of video with an average duration of 13 minutes per film, significantly longer than existing datasets (see Figure 2). Unlike datasets built on copyrighted films, SFD includes publicly accessible amateur films of diverse genres from YouTube. Although these films are shorter than traditional movies, they feature complex narratives that unfold through key sequences of events and multiple character interactions. Most importantly, with minimal exposure to LLMs, our dataset is less prone to the data leakage issues plaguing other movie datasets (see Figure 3).

Aiming towards long-term story-level multimodal video understanding, we propose two question-answering tasks on SFD: (a) following the recent trend [56, 45], we propose Multiple-Choice Question answering (MCQ); (b) we also introduce Open-Ended Question answering (OEQ), a more generative and unrestricted QA format that more realistically and objectively measures a model’s ability to comprehend long-form videos. All questions and answers are generated by LLMs from movie descriptions and undergo thorough manual curation, ensuring they accurately reflect the settings, characters, storylines, and themes of the movies.

To validate the advantages of SFD, we conduct extensive experiments with state-of-the-art vision and language models to quantify data leakage, the benefits of long-term reasoning, and the gap to human performance. Our findings are: (i) Unlike existing VideoQA datasets, SFD provides story-level tasks defined on public data with minimal data leakage (Section 3.1); (ii) our long-term reasoning experiments in Section 3.2 confirm the advantage of movie-level video understanding for solving SFD tasks; (iii) the evaluation in Section 3.3 reveals a large gap between recent VideoQA methods and human performance in vision-only and multimodal settings. SFD, hence, provides a unique benchmark for evaluating and advancing new methods for story-level long-term video understanding. Our dataset, code, and models are publicly available at https://shortfilmdataset.github.io.

2 Short Film Dataset

Short films are motion pictures that typically last between 5 and 20 minutes. As illustrated in Figure 4, short films span a variety of styles (e.g., narrative fiction, documentary, animation) and genres (e.g., action, drama, comedy, horror). Filmmakers often use the short film format to explore new ideas, techniques, and storytelling styles. Short films, hence, have experimental rather than commercial value and are often made publicly available. Samples from the dataset can be found in Appendix H.

The number of publicly available short films has significantly increased over the last years (see Figure 6(c)). We take advantage of this development and create a Short Film Dataset (SFD) to foster research on story-level video understanding. In the remaining part of this section, we describe the procedure for collecting videos and corresponding metadata (Section 2.1) as well as generating question-answer pairs (Section 2.2). Section 2.3 presents the analysis and statistics of SFD. The overall pipeline for SFD creation is illustrated in Figure 5.

2.1 Data Collection

To build our dataset, we capitalize on the abundance of short films on video-sharing platforms. We specifically target the Omeleto YouTube channel (https://www.youtube.com/@Omeleto) and its subchannels, including @OmeletoComedy and @OmeletoDrama, among others. These channels promote high-quality and award-winning short films, some of which have been recognized by Oscars and BAFTA awards. We download videos, subtitles, and associated metadata from YouTube using yt-dlp (https://github.com/yt-dlp/yt-dlp).
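As a rough illustration of this download step, the sketch below assembles a yt-dlp command line that requests a video together with its subtitles and metadata. The helper name build_ytdlp_command, the output directory, the output template and the subtitle language are assumptions made for illustration; the exact options used to build SFD are not stated here.

```python
from typing import List


def build_ytdlp_command(url: str, out_dir: str = "sfd_raw", sub_lang: str = "en") -> List[str]:
    """Build a yt-dlp invocation that saves video, subtitles and metadata.

    Hypothetical helper: the options actually used for SFD are not
    specified in the text, so these are illustrative choices only.
    """
    return [
        "yt-dlp",
        "--write-subs",       # uploaded subtitles, when the video has them
        "--write-auto-subs",  # automatically generated captions as a fallback
        "--sub-langs", sub_lang,
        "--write-info-json",  # title, description and other metadata as JSON
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        url,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(" ".join(build_ytdlp_command("https://www.youtube.com/@Omeleto")))
```

Building the argument list separately from running it keeps the sketch testable without network access; videos lacking uploaded subtitles would still need a transcription pass such as the one described next.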

As speech transcripts were originally available only for a portion of the dataset, we have used WhisperX [3] to extract missing transcripts. The metadata accompanying each short film includes the movie title, a logline which is a concise one-sentence summary of the plot, and additional movie details such as genre, release year, region of origin, and language. Each film also comes with a detailed description that offers a deeper understanding of various elements such as a synopsis providing a brief narrative overview, the director’s inspirations, storytelling approach, actors’ performances, and other external information. To maintain clarity and objectivity in our dataset, we retain only the synopsis and exclude other subjective or ambiguous elements of the description.

2.2 Question and Answer Generation

To promote the development of long-form and story-oriented video understanding, we complement SFD with two Video Question Answering (VideoQA) tasks: Multiple-Choice Question answering (MCQ) and Open-Ended Question answering (OEQ). We first automatically generate question-answer pairs by prompting GPT-4 [42] with movie titles, loglines and synopses. We then manually curate questions and answers to ensure their relevance and correctness. We describe the generation process in more detail below.

Automatic QA pair generation. For each film we obtain human-written information, i.e. movie titles, loglines and synopses from Omeleto YouTube channels. We then prompt GPT-4 [42] to propose relevant question-answer pairs from movie information, with dedicated prompts to avoid common mistakes. In particular, we emphasise direct and clear questions, with brief and specific answers to avoid speculation, repetition and ambiguity. Our example prompts are available in Appendix I. This stage results in 7,104 question-answer (QA) pairs that we use for the open-ended question-answering task.

Automatic Distractor Generation. To construct the MCQ task, we use GPT-4 [42] to complement each question with 4 distractors. The distractors are crafted to provide plausible yet incorrect answers that fit within the movie’s context. To this end, our prompts are designed with specific criteria: distractors should be syntactically similar to the correct answer while being semantically different. Our distractors also incorporate diverse misdirections relative to movies (character confusion, plot adjustments), consider the movie genre and plausible scenarios, and include accurate but irrelevant information from other parts of the synopsis. This stage results in 7,104 Multiple-Choice Questions (MCQs), with 5 options per question with one correct and four incorrect answers.
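To make the resulting format concrete, here is a minimal sketch of how one correct answer and four distractors could be assembled into a 5-option question. The function build_mcq, the seeded shuffle and the example distractors are assumptions for illustration; only the question and its answer come from Table 2, and the text does not say how options are ordered.

```python
import random
from typing import Dict, List


def build_mcq(question: str, answer: str, distractors: List[str], seed: int = 0) -> Dict[str, object]:
    """Combine one correct answer with four distractors into a 5-option MCQ."""
    if len(distractors) != 4:
        raise ValueError("expected exactly 4 distractors")
    if answer in distractors:
        raise ValueError("a distractor duplicates the correct answer")
    options = [answer] + list(distractors)
    # Shuffle with a fixed seed so the position of the answer is reproducible
    # (the shuffling itself is an assumption, not a documented step).
    random.Random(seed).shuffle(options)
    return {"question": question, "options": options, "label": options.index(answer)}


if __name__ == "__main__":
    # Distractors below are made up for the example.
    mcq = build_mcq(
        "Which city is the backdrop?",
        "Liverpool.",
        ["Manchester.", "Glasgow.", "Dublin.", "Leeds."],
    )
    print(mcq["options"], "correct index:", mcq["label"])
```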

Manual QA Curation. To ensure the quality of questions and answers, we perform a thorough manual curation based on the following criteria. We remove questions that cannot be answered based on the synopsis, including factually incorrect, ambiguous, or subjective ones. Additionally, questions considered overly simplistic or answerable through external knowledge are excluded. This stage results in 4,885 MCQs from 1,078 movies.

2.3 Dataset Analysis and Statistics

Our SFD dataset contains a total of 1,078 unique films ranging in duration from 5 to 37 minutes and having an average length of 13 minutes (see Figure 6(a)). Overall, these movies amount to 243 hours and cover distinct genres, as illustrated in Figure 6(d). Each movie in the dataset is accompanied by a title, a logline which averages around 15 words, and a synopsis of about 97 words, equivalent to approximately 2 sentences, as shown in Figure 6(b). We use this metadata for question and answer generation as described in Section 2.2. The majority of films in SFD are in English and mainly originate from North America and Europe (see Figures 6(f) and 6(e)). Examples of the dataset’s diversity can be seen in Figure 4. Additional data preprocessing and statistics are available in Appendix C, including clips, captions, location tags, and face tracks.

In terms of video understanding tasks, SFD comprises 4,885 MCQs and OEQs, averaging 4.53 questions per film. Table 1 shows the comparison to both general video understanding (videoQA) and movie-oriented (movieQA) question-answering datasets. Notably, SFD exhibits the longest average video duration (821 seconds) among all datasets. Compared to videoQA datasets, SFD is multimodal and long-term in nature, similar to most movieQA datasets. Compared to other movieQA datasets, SFD offers advantages in public availability of videos, low data leakage (i.e., little exposure to LLMs), and story completeness (using full movies instead of trimmed clips, which can harm narrative coherence).

Table 1: Comparison of VideoQA and MovieQA datasets.

Dataset	Venue	Annotation	Avg. Length (s)	#QA Pairs	Multimodal	Long-Term	Accessible	Unknown to LLMs	Full Movies
General VideoQA datasets
MSRVTT-QA [67] (test)	ACM 2017	Auto	15	72,820	✗	✗	✓	✓	-
MSVD-QA [8] (test)	ACM 2017	Auto	10	13,156	✗	✗	✓	✓	-
TGIF-QA [23] (test)	CVPR 2017	Auto	3	25,751	✗	✗	✓	✓	-
ActivityNet-QA [75] (test)	AAAI 2019	Manual	180	8,000	✗	✗	✓	✓	-
How2QA [50] (test)	EMNLP 2020	Manual	60	4,400	✗	✗	✓	✓	-
NeXT-QA [65] (test)	CVPR 2021	Manual	44	9,178	✗	✗	✓	✓	-
iVQA [35]	ICCV 2021	Manual	18	10,000	✗	✗	✓	✓	-
EgoSchema [39]	NeurIPS 2023	Manual + Auto	180	5,000	✗	✓	✓	✓	-
MovieQA datasets
MovieQA [56] (test)	CVPR 2016	Manual	203	6,462	✓	✓	✗	✗	✗
TVQA [29] (test)	EMNLP 2018	Manual	76	15,253	✓	✗	✗	✓	✗
LVU [64] (test)	CVPR 2021	Manual	220	1,223	✓	✓	✓	✗	✗
MovieChat [54] (test)	CVPR 2024	Manual	459	2,417	✓	✓	✗	✓	✗
CinePile [45] (test)	arXiv 2024	Manual + Auto	160	4,940	✓	✓	✓	✓	✗
SFD (Ours)		Manual + Auto	821	4,885	✓	✓	✓	✓	✓

Question Analysis. The generated question-answer pairs are long and diverse: the median answer length is 10 words and the vocabulary contains 8,928 different words. Furthermore, to ensure that the generated QAs cover various aspects of movies, we categorize them into four types (see Table 2):
1. Setting-related questions focus on the location and time of the movie. They may pertain to geographical locations (e.g., a specific city or country), a type of place (e.g., home, office), a historical period (e.g., a specific era), or a time of day (e.g., morning, night).
2. Character-related questions focus on the individuals portrayed in the movie. They may concern personal information (age, gender, profession), traits, motivations, and relationships (siblings, friends). Note that character development, as part of the narrative, is excluded from this category.
3. Story-related questions focus on the key sequence of events and interactions that compose the plot and narrative arc of the movie.
4. Theme-related questions focus on the underlying message or central idea that the movie explores and conveys to its audience.

Table 2: Samples from SFD by question type.

Question Type	Questions	Answers
Setting	Where do John and Emma have their conversation?	On the bus.
	Which city is the backdrop?	Liverpool.
	What decade does the movie take place in?	1990s.
Character	Who is the main character?	Josh.
	What is the nationality of the soldier who becomes a prisoner?	German.
	How is Mr. Jones described?	As stern and harsh.
Story	Can you name a specific tactic Tommy uses against Tiny?	He steals Tiny’s food at dinner and distracts her at a school-wide relay race.
	What is the imminent threat to Greg’s residence?	The house is in foreclosure and the bank is about to repossess it.
	What triggers Carla to believe her husband Frank will surprise her for their anniversary?	A bouquet left on her doorstep.
Theme	What is a major theme?	The theme revolves around reconnecting and the complexities of honesty between exes.
	What main theme is explored through Sarah’s experience?	The theme of romantic expectations versus present realities.
	What is the main theme?	The processing of grief and reconciliation between siblings.

3 Experiments

3.1 Data Leakage

Modern LLMs, pre-trained on vast internet data, risk being exposed during training to information about commercial films, such as synopses, reviews, scripts, and transcripts. This issue may result in models recalling answers directly without analyzing the video content, leading to biased benchmarks. In this section, we quantitatively assess the extent of data leakage on movie datasets. Specifically, we prompt LLMs to answer open-ended questions using only the movie title and compute the accuracy of responses, following [37]. We experiment with 3 movie datasets: MovieQA [56], LVU [64] and our SFD; 5 open-source models: Gemma 2B [57], Mistral 7B/8x7B [25] and LLaMA-3 8B/70B [58]; and 4 commercial models: Claude 3 Haiku/Sonnet and GPT-3.5/4 [42]. Figure 3 reports the results of the data leakage experiment and further details can be found in Appendix G.

We observe that both MovieQA and LVU suffer from data leakage, reaching up to 71.3% and 76.0% accuracy, respectively. In contrast, thanks to the low presence of amateur films on the internet, SFD exhibits a maximum accuracy of 36.0%, indicating low leakage. Furthermore, as expected, the extent of data leakage correlates with the knowledge level reflected by the MMLU score [14] (see Appendix G for more experiments), indicating that LLMs with more knowledge exhibit higher levels of memorization, leading to worse leakage issues. For instance, for both MovieQA and LVU, the zero-shot accuracy increases from approximately random level (20%) to more than 70% as the model size rises. Meanwhile, SFD maintains stable and low accuracies ranging between 19.7% and 36.0%, regardless of LLM knowledge variation, further indicating its low data leakage. This experiment reveals that relying solely on existing datasets to evaluate new methods is insufficient. Instead, our SFD offers a more objective and reliable test bed for long-term video understanding.
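The title-only protocol of this section can be sketched as follows. This is a simplified illustration under stated assumptions: the prompt wording, the function names title_only_prompt and leakage_accuracy, and the answer_fn and judge_fn callables (standing in for the LLM under test and for the answer-matching step) are all hypothetical and not taken from the actual experimental setup.

```python
from typing import Callable, Iterable, Tuple


def title_only_prompt(title: str, question: str) -> str:
    """Build a prompt that exposes nothing but the movie title (assumed wording)."""
    return (
        f"Movie title: {title}\n"
        f"Question: {question}\n"
        "Answer briefly using only what you know about this movie."
    )


def leakage_accuracy(
    items: Iterable[Tuple[str, str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Percentage of questions answered correctly from the title alone.

    items: (title, question, reference_answer) triples.
    answer_fn: hypothetical wrapper around the LLM being probed.
    judge_fn: hypothetical correctness check on (prediction, reference).
    """
    items = list(items)
    if not items:
        raise ValueError("no questions to evaluate")
    correct = 0
    for title, question, reference in items:
        prediction = answer_fn(title_only_prompt(title, question))
        correct += bool(judge_fn(prediction, reference))
    return 100.0 * correct / len(items)
```

A high score under this protocol means the model can answer without watching anything, which is the signature of leakage measured in Figure 3.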

3.2 Temporal Window Study

To confirm that our tasks require long-form video understanding, we conduct a temporal window study with benchmarks at three levels: shot-, scene-, and movie-level. A shot is a continuous video clip, and a movie contains around 151 shots on average. At the Shot-Level, the model uses data from a single shot, including partial subtitles and visual data, with predictions aggregated by taking the maximum logits. Scene-Level inference aggregates data from approximately 10 shots in a similar manner. Movie-Level inference utilizes the entire film’s data, as detailed in Section 3.3. To further analyze the behavior of each modality, we ablate three combinations of modalities at each temporal level: Vision, Language, and multimodal Vision-Language. This study is conducted with two top-performing methods from Section 3.3: FrozenBiLM and LLoVi. Further details on the experiment can be found in Appendix E.
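One plausible reading of the max-logit aggregation used at the shot and scene levels is sketched below: each temporal window produces one logit per answer option, the element-wise maximum is taken over windows, and the option with the highest aggregated logit is predicted. The function names and the toy logits are assumptions; the exact procedure is described in Appendix E.

```python
from typing import List, Sequence


def aggregate_max_logits(window_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum of per-option logits over temporal windows."""
    if not window_logits:
        raise ValueError("need at least one window")
    n_options = len(window_logits[0])
    if any(len(w) != n_options for w in window_logits):
        raise ValueError("all windows must score the same number of options")
    return [max(w[i] for w in window_logits) for i in range(n_options)]


def predict_option(window_logits: Sequence[Sequence[float]]) -> int:
    """Index of the answer option with the highest aggregated logit."""
    aggregated = aggregate_max_logits(window_logits)
    return max(range(len(aggregated)), key=aggregated.__getitem__)


if __name__ == "__main__":
    # Toy logits for 3 shots and 5 answer options (made-up numbers).
    shots = [
        [0.1, 0.3, -0.2, 0.0, 0.2],
        [0.0, 0.1, 1.4, -0.5, 0.3],
        [0.2, 0.9, 0.1, 0.0, 0.1],
    ]
    print(aggregate_max_logits(shots))  # [0.2, 0.9, 1.4, 0.0, 0.3]
    print(predict_option(shots))        # 2
```

With this rule a single confident window decides the answer, so a question is solvable at the shot level only if some individual shot contains enough evidence on its own.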

Our results, shown in Figure 7, reveal the following trends: (1) Both Language-only and Vision-Language settings show substantial improvements with larger temporal windows. For instance, in the language-only setting, LLoVi’s accuracy increases from 38.5% at the shot level to 64.2% at the movie level, a gain of 25.7 percentage points. Similarly, FrozenBiLM shows an increasing trend but with a smaller gain in accuracy (+7.3 points). In the multimodal setting, LLoVi’s accuracy increases from 50.1% to 55.6% at the scene level but shows only marginal gains at the movie level. This plateau may be attributed to LLoVi’s naive fusion approach, which combines visual captions with subtitles, limiting further improvements; (2) The Vision-only modality exhibits low and stagnant performance across all levels, suggesting that the current handling of visual data is inadequate for this task. Additional results for other methods are presented in Appendix E.

In conclusion, broader temporal windows significantly enhance SFD task performance, especially in language-only settings, underscoring the importance of long-form understanding in our design.

3.3 Baseline Comparison

In this section, we benchmark recent videoQA methods on SFD and evaluate 7 open-source models in three categories: FrozenBiLM [5], mPLUG-Owl2 [74] and Video-LLaVA [33] are VLMs that project visual frames into the embedding space of an LLM with the help of adapters; MovieChat [54] and TimeChat [47] have more advanced and specific designs: the former introduces a memory mechanism to handle long sequences and the latter proposes a timestamp-aware frame encoder for time-sensitive tasks; LLoVi [79] and LangRepo [26] are text-based methods that both feed visual captions to an LLM. LangRepo additionally proposes summarization and chain-of-thought [62] prompting.

These models are tested in a zero-shot video question-answering setting, adapted for long-form video understanding by incorporating subtitles and sampling the maximum number of frames possible. In the Multiple-Choice Question answering (MCQ) benchmark, we calculate the accuracy score based on the logits of the predicted options where applicable [5, 33, 74]. For models that generate plain-text answers, we evaluate them using a method similar to the Open-Ended Question answering (OEQ) where we rely on GPT-3.5 to compute the similarity between the predicted and correct answers, as detailed in [37]. All methods are evaluated across three modalities to better assess the contribution of each: Vision-Only (V) with video frames, Language-Only (L) with subtitles, and Vision-Language (VL) combining both. See Appendix D for more details. The results are summarized in Table 3.

User Study. To verify the answerability of our multiple-choice questions and assess the upper limit of SFD, we conducted three user studies according to the aforementioned modality definitions: (1) Vision-Language (VL): full video with audio and subtitles; (2) Vision-only (V): muted videos; (3) Language-only (L): plain text subtitles. For each question, all participants were asked to select the correct answer. Table 3 (last row) reports the results. More details on the user study are available in Appendix F. We observe that when provided with the full multimodal information, participants answer questions with high accuracy (89.8%). As expected, removing modalities lowers accuracy. Specifically, when using only subtitles the performance is 70.9%, whereas the vision-only performance drops to 59.0%.

Table 3: Baselines. Multiple-choice question answering and Open-ended question answering performances when using Vision-Only (V), Language-Only (L) and Vision-Language (VL) information.

Method	Venue	% Accuracy
		Multiple-Choice QA			Open-Ended QA
		V	L	VL	V	L	VL
Random		20.0	20.0	20.0	-	-	-
FrozenBiLM [5]	NeurIPS 2021	23.4	38.2	38.6	-	-	-
mPLUG-Owl2 [74]	CVPR 2024	38.3	20.7	21.3	22.1	1.8	1.6
Video-LLaVA [33]	arXiv 2023	34.2	21.3	24.7	19.2	6.4	8.0
LLoVi [79]	arXiv 2023	30.8	64.2	55.6	16.2	40.3	24.7
LangRepo [26]	arXiv 2024	29.0	32.1	31.0	3.5	10.4	9.5
MovieChat [54]	CVPR 2024	8.4	6.4	8.0	14.0	15.7	11.8
TimeChat [47]	CVPR 2024	25.5	6.4	31.8	26.4	9.4	5.9
Human		59.0	70.9	89.8	-	-	-

Multiple-Choice Question answering (MCQ). Table 3 yields several observations. Overall, model performance is generally poor, with LLoVi being a notable exception in the language-only setting (64.2%), driven largely by subtitles. Some models, such as mPLUG-Owl2, perform adequately in the vision-only setting (38.3%), but their performance drops when combining both modalities (21.3%), indicating integration issues.

Compared to human performance, there is still substantial room for improvement. Accuracy is on par in the language-only setting, with only a 6.7% difference. However, as the best models mainly rely on subtitles, there is a significant gap in the multi-modal setting, where human performance reaches 89.8%. This discrepancy can be explained by the vision-only setting, where the gap reaches 20.7%. These results highlight that while models perform reasonably well with textual data, they struggle to effectively integrate visual information, underscoring the need for advancements in multi-modal fusion techniques to bridge the gap between human and model performance.
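The gaps quoted above follow directly from the MCQ columns of Table 3. The short worked example below recomputes them by taking, for each modality, the best-performing model and subtracting its accuracy from the human score; the data are copied from Table 3, while the function name human_model_gaps is only illustrative.

```python
from typing import Dict, Tuple

# MCQ accuracy (%) from Table 3: Vision-only, Language-only, Vision-Language.
MCQ_ACCURACY: Dict[str, Dict[str, float]] = {
    "FrozenBiLM": {"V": 23.4, "L": 38.2, "VL": 38.6},
    "mPLUG-Owl2": {"V": 38.3, "L": 20.7, "VL": 21.3},
    "Video-LLaVA": {"V": 34.2, "L": 21.3, "VL": 24.7},
    "LLoVi": {"V": 30.8, "L": 64.2, "VL": 55.6},
    "LangRepo": {"V": 29.0, "L": 32.1, "VL": 31.0},
    "MovieChat": {"V": 8.4, "L": 6.4, "VL": 8.0},
    "TimeChat": {"V": 25.5, "L": 6.4, "VL": 31.8},
}
HUMAN: Dict[str, float] = {"V": 59.0, "L": 70.9, "VL": 89.8}


def human_model_gaps(
    models: Dict[str, Dict[str, float]], human: Dict[str, float]
) -> Dict[str, Tuple[str, float]]:
    """For each modality, return (best model, human minus best-model accuracy)."""
    gaps = {}
    for modality, human_acc in human.items():
        best = max(models, key=lambda name: models[name][modality])
        gaps[modality] = (best, round(human_acc - models[best][modality], 1))
    return gaps


if __name__ == "__main__":
    for modality, (model, gap) in human_model_gaps(MCQ_ACCURACY, HUMAN).items():
        print(f"{modality}: best model {model}, gap to human {gap} points")
```

Running it gives a gap of 6.7 points for language-only (LLoVi), 20.7 points for vision-only (mPLUG-Owl2) and 34.2 points for vision-language (LLoVi).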

Open-Ended Question answering (OEQ). This task proves even more challenging, with lower accuracy across all models, ranging from 1.6% to 40.3% (Table 3). Subtitles-only models, particularly LLoVi, lead in performance, highlighting the dominance of language processing in understanding and answering complex questions. These findings emphasize the difficulty of the task and the need for improved methods to handle and integrate multi-modal data effectively.

Overall, our results suggest that (1) text (in the form of subtitles) is a stronger cue than visual information for movie question-answering; (2) modern multimodal methods (max accuracy at 55.6%, see Table 3) fall behind human evaluation (89.8%); and more importantly (3) there is large room for improvement in the visual aspect, where the gap in performance between modern methods (best accuracy 38.3%) and the user study (59.0% accuracy) is still very high.

4 Related Work

VideoQA benchmarks. Questions in such benchmarks can be framed to assess the reasoning, memory, and comprehension skills of both humans and algorithms. Hence, many datasets have been made available through the years, covering visual descriptions [69, 8], temporal action reasoning [65], compositional reasoning [19], social intelligence [76], instructional videos [71], egocentric videos [39, 24, 34], movies [56, 29, 38], among others [23, 67, 32, 75, 63]. However, these datasets feature short videos and typically require reasoning on only a few frames to solve the task at hand, as shown by temporal certificates in [39].

VideoQA methods. To address video question answering, most modern approaches are based on large-scale pre-trained models, by deploying either multimodal contrastive learning [15, 31, 61, 78, 77, 10, 9] or visually-conditioned large language models [30, 11, 68, 74, 2, 70, 73]. For instance, mPLUG-Owl2 [68] enhances LLMs with visual capabilities, while FrozenBiLM [73], a frozen bidirectional language model, uses lightweight adapters and masked language modelling to excel in zero-shot video question-answering without explicit supervision. More recently, Video-LLaVA [33] used a simple projection layer to connect a visual encoder and a large language model, fine-tuned with visual instruction tuning data.

Long-form video understanding. As new long-context models emerge [46], we need more complex benchmarks. Long-form video understanding evaluates the long-term reasoning capabilities of models. Some datasets have been introduced in recent years, mostly for egocentric videos [39] and movies [64, 53, 21, 56]. LVU [64] involves long movie clips, but its tasks are relatively limited and focus on aspects that are irrelevant to story understanding, such as 'like' ratio and view count prediction. [39] proposes multiple-choice questions on 3-minute videos that require on average 100 seconds of video watching to be answered correctly. [53, 21] introduce an audio description task, which is currently tackled at the clip level but could be extended over the whole movie. Finally, closer to our work, [56] and [54] propose movie question-answering datasets focused on movie plots, settings and characters. However, they do not release any video data because of copyright issues.

5 Discussion

We introduced the Short Film Dataset (SFD), a long-form video understanding benchmark featuring complete and narrative-driven amateur short movies. It includes two types of QA tasks, multiple-choice and open-ended, that are automatically generated by LLMs and manually curated by human annotators. Compared to other long-form video datasets, SFD stands out by offering richer, story-oriented content with tasks specifically tailored for long-term reasoning, as validated by our experiments. Unlike most movie datasets that use copyrighted commercial films prone to data leakage with LLMs, SFD relies on amateur films, which are publicly accessible and have a limited online presence. Furthermore, our analysis indicates that while language-based methods achieve performance levels comparable to humans, state-of-the-art vision-based and multimodal methods fall behind human evaluation and still have considerable room for improvement, highlighting the need for further advancements. In conclusion, we believe that our SFD paves the way for accessible and comprehensive long-term movie understanding that is not known a priori by LLMs, thus helping the community develop robust methods. Our dataset and code will be made publicly available.

6 Acknowledgement

This work was supported by ANR-22-CE23-0007, Hi!Paris grant and fellowship, and was granted access to the High-Performance Computing (HPC) resources of IDRIS under the allocation 2023-AD011014489 made by GENCI. We would like to thank Nicolas Dufour, Robin Courant, Pierre Vassal, and the anonymous reviewers for their insightful comments and suggestions. We also express our gratitude to the annotators of the human study for their help.

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv, 2016.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[3] Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. Whisperx: Time-accurate speech transcription of long-form audio. In INTERSPEECH, 2023.
[4] Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. Condensed movies: Story based retrieval with contextual embeddings. In ACCV, 2020.
[5] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
[6] Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Haoyang Zhang, Yin Cui, Kree Cole-McLaughlin, Huisheng Wang, and Shrikanth Narayanan. Movieclip: Visual scene recognition in movies. In WACV, 2022.
[7] Boris Chen, Amir Ziai, Rebecca Tucker, and Yuchen Xie. Match cutting: Finding cuts with smooth visual transitions. In WACV, 2022.
[8] David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, 2011.
[9] Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, and Jing Liu. Valor: Vision-audio-language omni-perception pretraining model and dataset. In arXiv, 2023.
[10] Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset. In NeurIPS, 2023.
[11] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. In NeurIPS, 2023.
[12] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018.
[13] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jurgen Gall, Rainer Stiefelhagen, and Luc Van Gool. Large scale holistic video understanding. In ECCV, 2020.
[14] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In ICCV, 2021.
[15] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. In arXiv, 2022.
[16] Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. An empirical study of end-to-end video-language transformers with masked visual modeling. In CVPR, 2023.
[17] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzyńska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau, Ingo Bax, and Roland Memisevic. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[18] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei Huang, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, and Jitendra Malik. Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
[19] Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In CVPR, 2021.
[20] Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
[21] Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, and Andrew Zisserman. Autoad: Movie description in context. In CVPR, 2023.
[22] Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. Movienet: A holistic dataset for movie understanding. In ECCV, 2020.
[23] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In CVPR, 2017.
[24] Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. Egotaskqa: Understanding human tasks in egocentric videos. In NeurIPS, 2022.
[25] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. In arXiv, 2023.
[26] Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S Ryoo. Language repository for long video understanding. In arXiv, 2024.
[27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. In arXiv, 2017.
[28] Jie Lei, Tamara L. Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. In ACL, 2022.
[29] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. Tvqa: Localized, compositional video question answering. In EMNLP, 2019.
[30] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proc. ICML, 2023.
[31] Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In ICCV, 2023.
[32] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. Hero: Hierarchical encoder for video+language omni-representation pre-training. In ACL, 2020.
[33] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In arXiv, 2023.
[34] Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, and Mike Zheng Shou. Egocentric video-language pretraining. In NeurIPS, 2022.
[35] Feng Liu, Tao Xiang, Timothy M Hospedales, Wankou Yang, and Changyin Sun. ivqa: Inverse visual question answering. In CVPR, 2018.
[36] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.
[37] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In arXiv, 2023.
[38] Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, and Christopher Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In CVPR, 2017.
[39] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS Datasets and Benchmarks Track, 2023.
[40] Marcin Marszałek, Ivan Laptev, and Cordelia Schmid. Actions in context. In CVPR, 2009.
[41] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.
[42] OpenAI. Introducing chatgpt. 2022.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proc. ICML, 2021.
[44] Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A local-to-global approach to multi-modal movie scene segmentation. In CVPR, 2020.
[45] Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, and Tom Goldstein. Cinepile: A long video question answering dataset and benchmark. In arXiv, 2024.
[46] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. In arXiv, 2024.
[47] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In CVPR, 2023.
[48] Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. Movie description. In IJCV, 2016.
[49] Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, and Aniruddha Kembhavi. Visual semantic role labeling for video understanding. In CVPR, 2021.
[50] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: A large-scale dataset for multimodal language understanding. In NeurIPS, 2018.
[51] Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. In CVPR, 2017.
[52] Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Charades-ego: A large-scale dataset of paired third and first person videos. In arXiv, 2018.
[53] Mattia Soldan, Alejandro Pardo, Juan León Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, and Bernard Ghanem. Mad: A scalable dataset for language grounding in videos from movie audio descriptions. In CVPR, 2022.
[54] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, et al. Moviechat: From dense token to sparse memory for long video understanding. In CVPR, 2023.
[55] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. In arXiv, 2012.
[56] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, 2016.
[57] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. In arXiv, 2024.
[58] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. In arXiv, 2023.
[59] Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. Moviegraphs: Towards understanding human-centric situations from videos. In CVPR, 2018.
[60] Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. In CVPR, 2022.
[61] Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. In arXiv, 2022.
[62] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
[63] Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B. Tenenbaum, and Chuang Gan. STAR: A benchmark for situated reasoning in real-world videos. In NeurIPS Datasets and Benchmarks Track, 2021.
[64] Chao-Yuan Wu and Philipp Krähenbühl. Towards long-form video understanding. In CVPR, 2021.
[65] Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021.
[66] Yu Xiong, Qingqiu Huang, Lingfeng Guo, Hang Zhou, Bolei Zhou, and Dahua Lin. A graph-based framework to bridge movies and synopses. In ICCV, 2019.
[67] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Int. Conf. Multimedia, 2017.
[68] Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, and Jingren Zhou. mplug-2: A modularized multi-modal foundation model across text, image and video. In Proc. ICML, 2023.
[69] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, 2016.
[70] Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, and Jiahui Yu. Videococa: Video-text modeling with zero-shot transfer from contrastive captioners. In arXiv, 2023.
[71] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, 2021.
[72] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Learning to answer visual questions from web videos. In ICCV, 2022.
[73] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
[74] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality. In arXiv, 2023.
[75] Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, 2019.
[76] Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In CVPR, 2019.
[77] Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In CVPR, 2022.
[78] Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. In NeurIPS, 2021.
[79] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. In arXiv, 2023.

Appendix

In this appendix, we show additional materials of the paper Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding. We present the following items:

(A) Checklist
(B) Ethical considerations
(C) Data Preprocessing and Statistics
(D) Technical details on baselines
(E) Technical details on the temporal window study
(F) Human Study
(G) Technical details on data leakage experiment
(H) Samples from SFD
(I) Prompts
(J) Datasheet for Datasets

Appendix A Checklist

1. For all authors…
  (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
  (b) Did you describe the limitations of your work? [Yes]
  (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Supplementary Materials Section B
  (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] See Supplementary Materials Section B
2. If you are including theoretical results…
  (a) Did you state the full set of assumptions of all theoretical results? [N/A]
  (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments (e.g. for benchmarks)…
  (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] https://huggingface.co/datasets/rghermi/sfd
  (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Sections D, E, and G.
  (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Our experiments consist of benchmarking various aspects of our dataset. For a given compute, instead of performing one experiment several times (with different random seeds), we chose to experiment with several different multimodal methods and multiple LLMs. We observed that they all report similar performances and overall follow the same trend; hence, we argue that our findings are generalizable.
  (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 3.3
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
  (a) If your work uses existing assets, did you cite the creators? [Yes] See Section B.
  (b) Did you mention the license of the assets? [Yes] See Section B.
  (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
  (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] See Section B.
  (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects…
  (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] See Section F.
  (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
  (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix B Ethical considerations

1. Copyright and License:
  (a) We do not distribute raw video content; instead, we provide only URLs redirecting to YouTube, where the full copyright and licensing rights of creators are acknowledged. Sharing only the URLs also respects YouTube’s Terms of Service regarding exclusive rights (https://www.youtube.com/static?template=terms, Section: License to Other Users).
  (b) The dataset will be used solely for academic research purposes. To further protect the metadata and other information, we apply the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) license to our dataset.
2. Potential negative social impact:
  (a) Bias: The collected movies are primarily English-language productions from North America and Europe. Non-English-speaking populations and cultures are therefore under-represented, which may lead to unbalanced or biased training and testing of video understanding models.
  (b) Impact on creative process: Progress on movie understanding may encourage filmmakers to tailor their work to what algorithms favour, potentially stifling creative freedom and innovation.

Appendix C Data Preprocessing and Statistics

Table 4: Additional statistics. Statistics for the clip and face tracks in all movies of SFD.

(a) Clip statistics

	SFD
#Movies	1,078
#Clips	161,321
#Clips / Movie	151
Avg. Clip Len. (s)	5.43
#Captions	161,321
#Location Tags	161,321

(b) Face tracks

	SFD
#Face Instances	56,265
#Unique Faces	2,598
#Unique Faces / Movie	2.41
#Subtitles	23,820

SFD, alongside the videos and subtitles (see Section 2), also offers a comprehensive set of annotations: clips (shots), shot captions and scales, location tags, and face tracks. These resources, together with tools for downloading, preprocessing, and annotation, will be made available as part of the dataset release. Below, we detail their creation processes.

Clips (Shots). In the context of movies, a clip corresponds to a shot, characterized by its composition, length, and camera movement. We use PySceneDetect (https://www.scenedetect.com/) to segment movies in SFD into 161,321 shots, with each movie containing an average of 151 shots, as shown in Table 4(a). The number of shots varies by genre; for example, comedies average 163 shots, while dramas average about 146.

Clip information. For each shot, we extract detailed shot-level information. We generate shot captions using image and video captioning tools (LLaVA [36], BLIP-2 [30], Video-LLaVA [33]), and we identify locations using CLIP features [43] based on a curated list of locations from MovieCLIP [6].

Face Tracks. We use a facial recognition pipeline (https://github.com/serengil/deepface) to identify more than 56k face instances in the dataset, corresponding to about 2.6k unique characters, with an average of 2.41 characters per movie, as shown in Table 4(b).

These annotations enrich the dataset, providing deeper insights for tasks related to character analysis and interaction within the films.

In Figure 8, we perform a word cloud analysis for three video domains commonly used in the literature: instructional videos [71], egocentric videos [39], and movies (ours). We notice that instructional and egocentric videos focus on actions (verbs like make, go, cut, use, pick, put) and objects (e.g. table, sugar, oil, thing, stuff, paper), with an emphasis on body parts for egocentric videos (right hand, left hand). For movies, the vocabulary focuses more on characters (e.g. man, woman, boy, girl, young, old), relationships (e.g. mother, father, family, couple), time (e.g. day, night, year), and locations (e.g. home). This vocabulary reflects the story-driven nature of SFD.

Appendix D Technical details on baselines

In this section, we present additional details and results for the baselines discussed in Section 3.3 and Table 3 of the main paper, where we report the benchmark performance of state-of-the-art video understanding methods on SFD.

FrozenBiLM [73]. We use FrozenBiLM pre-trained on WebVid10M [5]. Following [39], we reformulate the question-answering task for this model with the following prompt: ‘Question: {question} Is it {option_i}?’. We obtain logits for the yes/no tokens for each option and take the option with the maximum logit as the final prediction. We report results with and without subtitles. We compute accuracy by comparing the predicted option index with the correct option index. As it is a masked language model, FrozenBiLM is not suited to open-ended question answering.
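The multiple-choice reformulation above can be sketched as follows. This is a minimal illustration, not the authors' code: `yes_logit` is a hypothetical stand-in for a model call that returns the logit of the "yes" token for a prompt, and scoring each option by its "yes" logit alone is an assumption, since the text does not spell out how the yes/no logits are combined.

```python
from typing import Callable, Sequence

# Prompt template from the text; the placeholder is named {option} here
# instead of {option_i} so that it can be filled with str.format.
PROMPT = "Question: {question} Is it {option}?"


def build_prompts(question: str, options: Sequence[str]) -> list[str]:
    """Create one yes/no prompt per answer option."""
    return [PROMPT.format(question=question, option=o) for o in options]


def predict_option(
    question: str,
    options: Sequence[str],
    yes_logit: Callable[[str], float],  # hypothetical model wrapper
) -> int:
    """Return the index of the option whose prompt gets the highest score."""
    scores = [yes_logit(p) for p in build_prompts(question, options)]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of questions where predicted and correct indices match."""
    if len(predicted) != len(correct) or not correct:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == c for p, c in zip(predicted, correct)) / len(correct)
```

Any scoring function with the same signature can be plugged in for `yes_logit`, which keeps the prompt construction and the accuracy computation independent of a particular model.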

mPLUG-Owl2 [74]. We use the instruction-tuned version (‘MAGAer13/mplug-owl2-llama2-7b’), based on ViT-L and LLaMa-7b. Multiple-choice inference is the same as for FrozenBiLM. In the open-ended question-answering setting, we only provide the question and record a free-form text output: ‘Question: {question} Please answer with a short sentence.’ Following [37], we use GPT-3.5 (‘gpt-3.5-turbo’) to compute the similarity between the predicted and the correct answer. We report this LLM-based accuracy.

Video-LLaVA [33]. We use the fine-tuned version (‘LanguageBind/Video-LLaVA-7B-hf’), based on LanguageBind and Vicuna-7B v1.5. Inference is the same as for mPLUG-Owl2, in both multiple-choice and open-ended question-answering settings.

LLoVi [79]. Following the official implementation, we use LLaVA (‘llava-hf/llava-1.5-13b-hf’) to extract captions at 0.5 FPS and GPT-3.5 (‘gpt-3.5-turbo’) as a backbone LLM. We experiment with the prompting strategy proposed in the original paper [79]. The prompt for multiple-choice question answering is as follows:

‘Please provide a single-letter answer (A, B, C, D, E) to the following multiple-choice question, and your answer must be one of the letters (A, B, C, D, or E). You must not provide any other response or explanation. You are given some language descriptions of a video. The video is {duration} seconds long. Each sentence describes a {clip_length}s clip. The descriptions are sequential and non-overlapping which cover the whole video exactly. Here are the descriptions: {captions}. You are going to answer a multiple-choice question based on the descriptions, and your answer should be a single letter chosen from the choices. Here is the question: {question}. Here are the choices. A: {option_0} B: {option_1} C: {option_2} D: {option_3} E: {option_4}’

LangRepo [26]. Similar to LLoVi, LangRepo is another LLM-backed video understanding method. We use the open-source official implementation with Mixtral 8$\times$7B (‘mistralai/Mixtral-8x7B-Instruct-v0.1’). Unlike LLoVi, LangRepo operates in two stages. The first is summarization, where LangRepo consolidates similar information from the input descriptions (captions, subtitles, or both), using the following prompt from the official implementation:

‘You are a helpful expert in video analysis. You are given a list of {num_to_rephrase} language descriptions for a video. Each sentence describes a {clip_length}s clip. Here are the descriptions as a list:{memory}. Please summarize and rephrase each item in the list as a single sentence of {num_words_in_rephrase} words. Keep the same original subject. Keep all information intact without leaving anything out. Return only the rephrased list of {num_to_rephrase} descriptions in the same order, without additional details.’

In the second stage, question answering is performed on the rephrased summaries with the following prompt:

‘Here are the descriptions: {narration}. You are going to answer a multiple-choice question based on the descriptions. Here is the question: {question}? You must select one of these choices as the answer: A: {option_0} B: {option_1} C: {option_2} D: {option_3} E: {option_4} The correct answer is:’

MovieChat [54]. We use the official implementation of MovieChat (‘rese1f/MovieChat’), which incorporates LLaMa2-7B (‘meta-llama/Llama-2-7b’) and the corresponding Video-LLaMa-7B (‘DAMO-NLP-SG/Video-LLaMA-2-7B-Finetuned’). As for the previous models, we use GPT-3.5 (‘gpt-3.5-turbo’) [42] to compute the accuracy of responses. Following the official implementation, we use the following prompt:

‘By watching the video and combining it with the caption/subtitle from video , you are required to answer a question with multiple options: Question: {question}; Option A: {option_0}; Option B: {option_1}; Option C: {option_2}; Option D: {option_3}; Option E: {option_4}; Answer to the multiple-choice question based on the textual information extracted from the movie. If you don’t know the answer, please output ‘Unknown’. Your output should be just one of A, B, C, D, E and nothing else.’

TimeChat [47]. We use the official implementation of TimeChat (‘ShuhuaiRen/TimeChat-7b’). Following the official implementation, the prompt and the accuracy computation details are the same as those in the previous MovieChat model.

For all models, we also incorporate subtitles as input. For that, we concatenate each query with the following text: ‘Subtitles: {subtitles}’. Subtitles are formatted as follows: ‘{start_timestamp} - {end_timestamp} {subtitle_i}’. We trim subtitles if they cannot fit into the input size. Results and analysis can be seen in Section 3.3 and Table 3.
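The subtitle formatting described above can be sketched as follows. The text only states that subtitles are trimmed when they do not fit into the input, so the trimming rule here is an assumption: the budget is counted in characters rather than tokens, and whole subtitle entries are dropped from the end. The `Subtitle` type and the function names are hypothetical.

```python
from typing import NamedTuple, Sequence


class Subtitle(NamedTuple):
    start: str  # start timestamp, e.g. "00:00:01"
    end: str    # end timestamp
    text: str   # subtitle text


def format_subtitles(subtitles: Sequence[Subtitle]) -> str:
    """Render entries as '{start_timestamp} - {end_timestamp} {subtitle_i}'."""
    return " ".join(f"{s.start} - {s.end} {s.text}" for s in subtitles)


def append_subtitles(
    query: str, subtitles: Sequence[Subtitle], max_chars: int
) -> str:
    """Concatenate the query with 'Subtitles: ...' under a character budget.

    Assumption: entries are dropped from the end until the result fits;
    if nothing fits, the query is returned unchanged.
    """
    kept = list(subtitles)
    while kept:
        candidate = f"{query} Subtitles: {format_subtitles(kept)}"
        if len(candidate) <= max_chars:
            return candidate
        kept.pop()
    return query
```

A token-based budget would follow the same structure, with `len(candidate)` replaced by the tokenizer's length for the model at hand.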

Appendix E Technical details on the temporal window study

In this section, we present additional details and results for the temporal window study (see Section 3.2 and Figure 7). The goal of the temporal window study is to evaluate the performance of various methods on SFD at different temporal levels—shot, scene, and movie—across all modalities. Along with the discussed FrozenBiLM and LLoVi, we examine Video-LLaVA and LangRepo.

At the shot level, we truncate the input prompt to include only information from a specific shot. For FrozenBiLM and Video-LLaVA, this means that we only include subtitles and frames extracted from the shot. For LLoVi and LangRepo, we only include subtitles and captions extracted from the shot. We thus obtain logits for each answer option and shot, aggregated in a matrix of shape $n_{shots}\times n_{options}$; the final prediction is the option that attains the maximum logit over both dimensions. Note that shot-level inference is too expensive to compute for LangRepo ($>2{,}000$ A100 GPU-hours per experiment), as this method involves many steps of rephrasing and summarizing; the corresponding entries in Table 6 are therefore left empty.

The scene-level inference is similar to the shot-level one. The main difference is that we constrain the input information (subtitles and frames/captions) to the scene level, where a scene is defined, somewhat arbitrarily, as a group of 10 shots. We end up with a logit matrix of size $n_{scenes}\times n_{options}$.

The movie-level inference is straightforward: it is the same as in Section D.
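The aggregation over temporal windows can be illustrated with a short sketch. It assumes that scenes are formed from consecutive shots (the text only says a scene corresponds to 10 shots) and that per-window logits are already available as nested lists; both function names are hypothetical.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def group_shots_into_scenes(
    shots: Sequence[T], shots_per_scene: int = 10
) -> list[list[T]]:
    """Split a movie's shots into scenes of `shots_per_scene` shots.

    Assumption: shots are grouped consecutively; the last scene may be shorter.
    """
    return [
        list(shots[i : i + shots_per_scene])
        for i in range(0, len(shots), shots_per_scene)
    ]


def predict_from_windows(logits: Sequence[Sequence[float]]) -> int:
    """Pick the option holding the maximum logit over windows and options.

    logits[w][o] is the logit of option o computed from window w (a shot or
    a scene). Movie-level inference is the special case of a single window.
    """
    if not logits or not logits[0]:
        raise ValueError("logits must be a non-empty n_windows x n_options matrix")
    # Column-wise maximum: best logit each option reaches in any window.
    best_per_option = [max(column) for column in zip(*logits)]
    return max(range(len(best_per_option)), key=best_per_option.__getitem__)
```

Taking the maximum over both dimensions means that a single confident window can decide the answer, which matches the idea of testing how much a short temporal context suffices.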

Table 6: Temporal window study.

		Vision	Language	Vision-Language
FrozenBiLM	Shot-Level	22.1	30.9	30.8
	Scene-Level	22.7	34.8	34.8
	Movie-Level	23.4	38.2	38.6
Video-LLaVA	Shot-Level	16.0	24.4	24.0
	Scene-Level	23.7	20.1	20.1
	Movie-Level	34.2	21.3	24.7
LLoVi	Shot-Level	32.8	38.5	50.1
	Scene-Level	34.2	51.2	55.4
	Movie-Level	30.8	64.2	55.6
LangRepo	Shot-Level	-	-	-
	Scene-Level	26.9	27.9	28.1
	Movie-Level	29.0	32.1	31.0
Human	Movie-Level	59.0	70.9	89.8

The numerical results of the temporal window study are reported in Table 6.

Along the temporal direction (i.e. row-wise for each method), we observe that: (i) most methods benefit from larger temporal windows, which confirms the importance of long-term understanding for our questions; and (ii) there are exceptions, such as Video-LLaVA in the Language and Vision-Language settings, likely because its accuracy is close to random chance (20%), which masks the effect of the temporal window.

For different modalities (i.e. column-wise), we make the following observations: (i) Corroborating the argument in the main paper (Section 3.3), language-only models generally achieve better performance; e.g. LLoVi reaches the best performance at the movie level in the language-only setting (64.2%). (ii) The vision-only setting generally results in the weakest performance and leaves a large gap to human performance (last row). This suggests significant potential for improvement in visual models. (iii) Current modality-fusion methods show limited effectiveness. VLMs like FrozenBiLM and Video-LLaVA show only marginal improvements when fusing vision and language at the movie level, with performance increasing from 38.2% to 38.6% for FrozenBiLM and from 21.3% to 24.7% for Video-LLaVA. In contrast, the LLM-based methods (LLoVi and LangRepo) regress at the movie level after modality fusion.

Overall, the human benchmark highlights that there remains a substantial gap between the best methods and human performance, particularly in vision-language integration over extended temporal windows.
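The movie-level comparisons discussed above can be reproduced directly from Table 6. The sketch below only restates numbers from the table; the names `fusion_gain` and `best_gap_to_human` are introduced here for illustration, with the fusion gain defined as vision-language accuracy minus language-only accuracy.

```python
# Movie-level accuracies (%) copied from Table 6, ordered as
# (vision, language, vision-language).
MOVIE_LEVEL = {
    "FrozenBiLM": (23.4, 38.2, 38.6),
    "Video-LLaVA": (34.2, 21.3, 24.7),
    "LLoVi": (30.8, 64.2, 55.6),
    "LangRepo": (29.0, 32.1, 31.0),
}
HUMAN = (59.0, 70.9, 89.8)
MODALITIES = ("vision", "language", "vision-language")


def fusion_gain(scores: tuple[float, float, float]) -> float:
    """Vision-language accuracy minus language-only accuracy, in points."""
    _, language, vision_language = scores
    return round(vision_language - language, 1)


def best_gap_to_human() -> dict[str, float]:
    """Per modality: human accuracy minus the best method's accuracy."""
    gaps = {}
    for i, modality in enumerate(MODALITIES):
        best = max(scores[i] for scores in MOVIE_LEVEL.values())
        gaps[modality] = round(HUMAN[i] - best, 1)
    return gaps


if __name__ == "__main__":
    for method, scores in MOVIE_LEVEL.items():
        print(f"{method}: fusion gain {fusion_gain(scores):+.1f} points")
    print(best_gap_to_human())
```

The fusion gains are positive for FrozenBiLM and Video-LLaVA and negative for LLoVi and LangRepo, and the gap between the best method and humans is widest in the vision-language setting.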

Appendix F Human Study

In this section, we present additional details for the conducted human study (see Section 3.3, ‘User Study’). We conducted a comprehensive human study to evaluate the upper-limit performance of our benchmark. Participants were asked to watch a movie and subsequently answer multiple-choice questions based on the content. To analyze the impact of different modalities, we implemented two ablation conditions: a muted version, where participants had access to the vision-only information without audio or subtitles, and a blind version, where participants were provided with the audio-only information, in the form of textual subtitles, without any video.

To avoid any bias, we ensured that different sets of participants were assigned to the full, vision-only, and audio-only settings, preventing any overlap and potential influence from experiencing multiple conditions. Movies were randomly assigned to users to ensure a diverse range of data. The entire study was facilitated through an integrated interface, which restricted access to external information such as movie loglines and synopses, with only the movie title being displayed (see Figure 9).

Appendix G Technical details on data leakage experiment

In this section, we include additional information on the data leakage study mentioned in the main paper Section 3.1. The objective of our data leakage study is to demonstrate that current large language models (LLMs) possess prior knowledge about commercial movies. This is because LLMs have been exposed to vast amounts of movie-related textual information during pretraining, including synopses, blog posts, news articles, and more. The experiment is set up as an open-ended question-answering task where the input consists of a movie title and a corresponding question, and the output is a free-form text generated by the LLM. We then evaluate the answer using an LLM to compute the similarity between the predicted and the correct answer, following [37].
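The protocol described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the prompt wording, the function names `build_title_only_prompt` and `leakage_rate`, and the sample dictionary keys are all assumptions. The answering model and the LLM judge are passed in as plain callables so that any backend can be plugged in.

```python
from typing import Callable, Dict, List


def build_title_only_prompt(movie_title: str, question: str) -> str:
    """Compose an open-ended prompt that exposes only the title and the question.

    The exact wording is a hypothetical choice; the paper does not give it here.
    """
    return (
        f"Movie title: {movie_title}\n"
        f"Question: {question}\n"
        "Answer in one or two sentences."
    )


def leakage_rate(
    samples: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Return the percentage of questions judged correct from the title alone.

    answer_fn maps a prompt to a free-form answer (the LLM under test).
    judge_fn(question, reference, prediction) returns True when the judge
    considers the prediction to match the reference answer.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    correct = 0
    for sample in samples:
        prompt = build_title_only_prompt(sample["title"], sample["question"])
        prediction = answer_fn(prompt)
        if judge_fn(sample["question"], sample["answer"], prediction):
            correct += 1
    return 100.0 * correct / len(samples)
```

In practice `answer_fn` would wrap a call to the model being probed and `judge_fn` a call to the judging LLM; for a quick dry run, both can be replaced by simple stubs such as a constant answer and exact string matching.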

We choose ten of the most recent LLMs, among which six are open-source models: Gemma 2B, Gemma 7B, LLaMA-3 8B, LLaMA-3 70B, Mistral 7B, Mixtral 8x7B; and four commercial models: Claude 3 Haiku, Claude 3 Sonnet, GPT-3.5, GPT-4. The open-source models are run using the Hugging Face transformers library (https://github.com/huggingface/transformers). The commercial models are run using their official APIs: the OpenAI API (https://openai.com/api) for GPT and the Anthropic API (https://www.anthropic.com/api) for Claude 3.

We choose three movie datasets: MovieQA [56], LVU [64], and SFD (ours). For MovieQA and SFD, the question-answer pairs are directly available. In the case of LVU, we re-formulated the classification tasks into templated questions to fit the open-ended question-answering format. For example, the director classification task is converted into the question ‘Who directed the movie {movie_title}?’.
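As an illustration of this conversion, the sketch below maps a classification task name to a question template. Only the director template comes from the text; the other two templates, the task keys, and the helper name `to_open_ended` are hypothetical and shown only to indicate how the mapping could be extended.

```python
# Templates for turning LVU-style classification tasks into open-ended questions.
LVU_QUESTION_TEMPLATES = {
    # Taken from the text above.
    "director": "Who directed the movie {movie_title}?",
    # Hypothetical examples, not taken from the paper.
    "genre": "What is the genre of the movie {movie_title}?",
    "release_year": "In which year was the movie {movie_title} released?",
}


def to_open_ended(task: str, movie_title: str) -> str:
    """Fill the template of the given task with the movie title."""
    if task not in LVU_QUESTION_TEMPLATES:
        raise ValueError(f"no template defined for task: {task!r}")
    return LVU_QUESTION_TEMPLATES[task].format(movie_title=movie_title)
```

The class label of the original task (for example, the director's name) then serves as the reference answer for the generated question.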

The detailed results of the data leakage experiment are presented in Table 7 and Figure 10. They show that LLMs exhibit significantly higher data leakage on the MovieQA and LVU datasets than on SFD. GPT-4 correctly answers more than 70% of the questions on both datasets, and Claude 3 Sonnet reaches 64.4% on MovieQA and 71.0% on LVU, without any context except the movie title. The scatter plot emphasizes a correlation between data leakage and model knowledge (i.e. MMLU performance), with higher MMLU scores correlating with increased leakage issues on compromised datasets. This trend is consistent across both open-source and commercial models, highlighting the vulnerability of larger and more advanced models to memorizing training data.

In sharp contrast, the relatively low and flat performance trend on SFD suggests it is a more robust benchmark for evaluating model capabilities. These findings emphasize the need for robust evaluation datasets to accurately assess model performance and mitigate the impact of data leakage.
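To make the "low and flat" observation concrete, the following sketch summarizes the per-model scores from the leakage results table by their mean and by their spread (highest minus lowest score) per dataset. The numbers are copied from that table; the summary statistics and the names `LEAKAGE` and `summarize` are our own illustrative choices, not part of the original analysis.

```python
from statistics import mean

DATASETS = ("SFD", "MovieQA", "LVU")

# Scores (%) per model, in the order given by DATASETS.
LEAKAGE = {
    "Gemma 2B": (19.7, 18.3, 28.8),
    "Gemma 7B": (20.7, 21.5, 52.4),
    "LLaMA 3 8B": (22.1, 34.5, 69.9),
    "LLaMA 3 70B": (24.1, 51.9, 64.1),
    "Mistral 7B": (28.9, 44.1, 61.4),
    "Mixtral 8x7B": (33.5, 55.4, 70.2),
    "Claude 3 Haiku": (28.9, 55.4, 68.5),
    "Claude 3 Sonnet": (36.0, 64.4, 71.0),
    "GPT-3.5": (26.3, 56.7, 75.0),
    "GPT-4": (31.5, 71.3, 76.0),
}


def summarize(table):
    """Return the mean score and the max-min spread for each dataset."""
    summary = {}
    for column, dataset in enumerate(DATASETS):
        scores = [row[column] for row in table.values()]
        summary[dataset] = {
            "mean": mean(scores),
            "spread": max(scores) - min(scores),
        }
    return summary


for dataset, stats in summarize(LEAKAGE).items():
    print(f"{dataset}: mean={stats['mean']:.1f}, spread={stats['spread']:.1f}")
```

On these numbers, the mean score is roughly 27% on SFD against roughly 47% on MovieQA and 64% on LVU, and the spread across models is about 16 points on SFD against about 53 and 47 points on the other two datasets.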

Appendix H Samples from SFD

In the following pages, we show some samples from SFD. For each sample, we display the title, the logline, the synopsis, a few frames, some subtitles, and the designed multiple-choice questions. See Figures 11, 12, 13, and 14.

Appendix I Prompts

We provide the prompts for the data generation process: question generation prompt (see Table 8), distractor generation prompt (see Table 9), and question classification prompt (see Table 10). As detailed in Section 2.2, those prompts were used with GPT-4 ('gpt-4-turbo-2024-04-09') through the OpenAI API.

Table 8: Question generation prompt.

Appendix J Datasheet for Datasets

Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

This dataset and benchmark were created to assess the long-form video understanding capabilities of modern multimodal systems. It differs from other video question-answering datasets in that it features longer videos and associated story-related questions. Additionally, it stands out as a dataset that is available online and not exposed to current large language models (limited data leakage issues).

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

This dataset was created by Ridouane Ghermi, Xi Wang, Vicky Kalogeiton (LIX, Ecole Polytechnique, CNRS, Institut Polytechnique de Paris), and Ivan Laptev (MBZUAI).

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

This project is funded by ANR-22-CE23-0007 (V. Kalogeiton) and a Hi!Paris project.

Any other comments?

Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

Each dataset instance consists of a short movie (around 5-20 minutes), some metadata (e.g. movie title, logline, synopsis, country of origin, language, release year), and a list of multiple-choice questions and answers.

How many instances are there in total (of each type, if appropriate)?

There are 1,078 movies and 4,885 questions. Each question comprises one correct answer and four wrong answers.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

The dataset is not a sample of instances from a larger set.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

Each instance includes a movie as a video file. Metadata comes as .csv files, in the form of text. We also provide processed features.

Is there a label or target associated with each instance? If so, please provide a description.

Each movie comes along with several multiple-choice questions, each with five possible options. The correct option is designated by an index (from 0 to 4).
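For readers who want to consume these labels programmatically, a possible in-memory representation is sketched below. The class name, the field names, and the `accuracy` helper are hypothetical and do not describe the released file schema; only the five options per question and the 0 to 4 answer index come from the description above.

```python
from dataclasses import dataclass
from typing import Sequence

NUM_OPTIONS = 5  # one correct answer and four wrong answers


@dataclass(frozen=True)
class MultipleChoiceQuestion:
    question_id: str
    video_id: str
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option, from 0 to 4

    def __post_init__(self):
        # Validate the structure described in the datasheet.
        if len(self.options) != NUM_OPTIONS:
            raise ValueError(f"expected {NUM_OPTIONS} options, got {len(self.options)}")
        if not 0 <= self.answer_index < NUM_OPTIONS:
            raise ValueError(f"answer_index out of range: {self.answer_index}")


def accuracy(questions: Sequence[MultipleChoiceQuestion],
             predictions: Sequence[int]) -> float:
    """Percentage of questions whose predicted index equals the answer index."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        raise ValueError("questions must not be empty")
    hits = sum(q.answer_index == p for q, p in zip(questions, predictions))
    return 100.0 * hits / len(questions)
```

With five options per question, uniform random guessing corresponds to an expected accuracy of 20%.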

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g. because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

No.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

We provide a dataframe that links question ids and video ids.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

SFD is a test set only.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

No, the dataset has been manually curated.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

All movies are available on YouTube; we only provide the corresponding URLs. They should remain constant over time as they come from YouTube channels that curate a movie catalogue. The dataset is distributed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) license.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

No.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

Yes, the dataset contains movies with actors.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

No.

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

Yes, it is possible to identify actors.

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

No, everything is fictional.

Any other comments?

Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

Movies are available online and downloaded from YouTube using yt-dlp, along with metadata. Multiple-choice questions are generated by a commercial LLM, GPT-4, and further manually curated.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

Video URLs are scraped using the official YouTube API. Videos and metadata are downloaded using yt-dlp. Questions are generated using GPT-4 through the official OpenAI API and further refined by manual curation.
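A minimal sketch of the download step using the yt-dlp Python API is given below. It is not the authors' script: the output directory, the file naming template, and the helper names are assumptions, and the options shown are only a plausible subset (video plus a JSON metadata file per movie).

```python
from typing import Any, Dict, Iterable


def build_ydl_options(output_dir: str) -> Dict[str, Any]:
    """Return yt-dlp options that save each video and its metadata.

    The naming scheme (<video id>.<extension>) is a hypothetical choice.
    """
    return {
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",  # one file per video id
        "writeinfojson": True,   # also store the metadata as <id>.info.json
        "ignoreerrors": True,    # skip unavailable videos instead of aborting
    }


def download_movies(urls: Iterable[str], output_dir: str = "sfd_videos") -> int:
    """Download the given YouTube URLs and return yt-dlp's exit code."""
    import yt_dlp  # imported lazily so that the options can be built without it

    with yt_dlp.YoutubeDL(build_ydl_options(output_dir)) as ydl:
        return ydl.download(list(urls))
```

Calling `download_movies` with the list of URLs distributed with the dataset would fetch the videos into `output_dir`; this requires network access and the `yt-dlp` package.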

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

Not applicable.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

Students voluntarily helped to curate the dataset.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

All data were collected and curated during the first half of 2024. Movies were uploaded to the video-sharing platform between 2017 and 2024.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

No.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

Not applicable.

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

Not applicable.

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

Not applicable.

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

Not applicable.

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

Not applicable.

Any other comments?

–

Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

We extracted several annotations from the dataset, see Appendix C.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

Yes.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.

Yes. For annotation extraction, we will provide the tools and scripts in the GitHub repository. For question generation, we will provide the prompts, but their usage requires access to the OpenAI API (which is not free).

Any other comments?

–

Uses

Has the dataset been used for any tasks already? If so, please provide a description.

The dataset is accompanied by two tasks: multiple-choice and open-ended question answering.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

Yes, we will make this available on the dataset GitHub repository.

What (other) tasks could the dataset be used for?

We strongly believe that, with appropriate annotations, this dataset can be extended for many challenging use cases, e.g. spatio-temporal localization, video grounding, movie summarization, causal reasoning.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

No.

Are there tasks for which the dataset should not be used? If so, please provide a description.

No.

Any other comments?

–

Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

Yes, but we do not distribute raw video content; instead, we provide only URLs redirecting to YouTube, where the full copyright and licensing rights of creators are acknowledged and YouTube’s Terms of Service regarding exclusive rights are respected. To further protect the metadata and other information, we apply the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) license to our dataset.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

The dataset is distributed on the HuggingFace platform and as a GitHub repository. We provide video URLs and annotations. Additionally, we provide code to download videos, pre-process the data and run baselines.

When will the dataset be distributed?

The dataset is already available.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

The dataset is distributed under the CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) license. The data was collected from publicly available sources. All rights and credits for the original movie content go to the respective owners. This dataset and any derivatives are intended for non-commercial use only and must be shared under the same license.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

Not applicable.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

Not applicable.

Any other comments?

–

Maintenance

Who will be supporting/hosting/maintaining the dataset?

The dataset is hosted on both GitHub and HuggingFace. It will be maintained by the authors.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

We provide a contact email address, and will also respond to any pull requests in the repository.

Is there an erratum? If so, please provide a link or other access point.

The erratum is available on the GitHub repository.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?

The dataset is a test benchmark; it will remain stable to ensure fair evaluation of future methods.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

Not applicable.

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.

Not applicable.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.

If others want to contribute to the dataset, by adding more tasks for example, we would gladly host them in our repository. Also, we will provide as many details and tools as possible to reproduce our annotation process and create more tasks.

Any other comments?

N/A.

Model	SFD (%)	MovieQA (%)	LVU (%)
Open-source models
Gemma 2B	19.7	18.3	28.8
Gemma 7B	20.7	21.5	52.4
LLaMA 3 8B	22.1	34.5	69.9
LLaMA 3 70B	24.1	51.9	64.1
Mistral 7B	28.9	44.1	61.4
Mixtral 8x7B	33.5	55.4	70.2
Commercial models
Claude 3 Haiku	28.9	55.4	68.5
Claude 3 Sonnet	36.0	64.4	71.0
GPT-3.5	26.3	56.7	75.0
GPT-4	31.5	71.3	76.0