Short Film Dataset (SFD)

A Benchmark for Story-Level Video Understanding

1 LIX, Ecole Polytechnique, IP Paris
2 MBZUAI

Abstract

Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often document the activities of one person in a single scene. Although some movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently suffer from data leakage, since movie forums and other resources are used in LLM training.

To address the above limitations, we propose the Short Film Dataset (SFD) with 1,078 publicly available amateur movies, a wide variety of genres, and minimal data leakage issues. SFD offers long-term story-oriented video tasks in the form of multiple-choice and open-ended question answering. Our extensive experiments emphasize the need for long-term reasoning to solve SFD tasks. Notably, we find strong signals in movie transcripts, which lead to on-par performance of people and LLMs. With vision data alone, however, current models perform significantly worse than people.

Dataset Overview

The Short Film Dataset (SFD) is a video benchmark designed for story-level video understanding, featuring 1,078 publicly accessible amateur short films that span various genres and total over 243 hours of content. SFD aims to address limitations in existing datasets by offering long-term, story-oriented tasks with minimal data leakage issues. The dataset includes both Multiple-Choice and Open-Ended Question Answering tasks, with a total of 4,885 curated questions. Films in the dataset average 13 minutes in duration, long enough to support complex storylines and character development.
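As a quick consistency check on the figures above, the following sketch derives per-film averages from the stated totals. It assumes a total of exactly 243 hours, whereas the text says "over 243 hours", so the average duration it prints is a slight underestimate. The function and constant names are illustrative only.

```python
# Totals as stated in the overview.
# Assumption: 243 hours is treated as exact, although the text says "over 243 hours".
NUM_FILMS = 1078
TOTAL_HOURS = 243
NUM_QUESTIONS = 4885


def per_film_averages(num_films, total_hours, num_questions):
    """Return (average minutes per film, average questions per film)."""
    avg_minutes = total_hours * 60 / num_films
    avg_questions = num_questions / num_films
    return avg_minutes, avg_questions


avg_minutes, avg_questions = per_film_averages(NUM_FILMS, TOTAL_HOURS, NUM_QUESTIONS)
print(f"{avg_minutes:.1f} min per film, {avg_questions:.1f} questions per film")
```

The result is roughly 13.5 minutes and 4.5 questions per film, in line with the stated 13-minute average.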

LLMs already know commercial movies because they have been exposed to them through reviews, synopses, news articles, and other sources, a phenomenon known as data leakage. For example, GPT-4 can correctly answer 76% and 71.3% of questions from MovieQA and LVU, respectively, when prompted with just the movie title. In contrast, for the Short Film Dataset (SFD), this number falls to 36%. These results underscore SFD's effectiveness as a benchmark for long-term video understanding, providing a more objective and reliable evaluation free from the biases found in existing commercial movie datasets.
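A title-only leakage probe of the kind described above could be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the prompt wording, the item fields (`title`, `question`, `options`, `answer`), and the `answer_fn` callable standing in for a model are all assumptions, and the toy items are invented placeholders rather than SFD data.

```python
from typing import Callable, Dict, List, Sequence

LETTERS = "ABCDEFGHIJ"


def build_title_only_prompt(title: str, question: str, options: Sequence[str]) -> str:
    """Format a multiple-choice question that exposes only the film title."""
    lines = [f"Movie title: {title}", f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {option}" for i, option in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def title_only_accuracy(items: List[Dict], answer_fn: Callable[[str], str]) -> float:
    """Fraction of items answered correctly when only the title is given."""
    if not items:
        return 0.0
    correct = 0
    for item in items:
        prompt = build_title_only_prompt(item["title"], item["question"], item["options"])
        # Keep only the first character of the reply as the predicted letter.
        predicted = answer_fn(prompt).strip().upper()[:1]
        correct += predicted == item["answer"]
    return correct / len(items)


# Hypothetical toy items; real questions would come from the benchmark files.
toy_items = [
    {"title": "Hypothetical Film A", "question": "Who finds the letter?",
     "options": ["The neighbor", "The sister"], "answer": "A"},
    {"title": "Hypothetical Film B", "question": "Why does the trip end early?",
     "options": ["A storm", "A phone call"], "answer": "B"},
]


def always_first(prompt: str) -> str:
    """Stub model that always picks option A; replace with a real model call."""
    return "A"


print(title_only_accuracy(toy_items, always_first))
```

Passing the model as a plain callable keeps the probe independent of any particular LLM API. A high score under this setup would suggest the model has already seen material about the films, while a score near chance would suggest little leakage.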

Dataset Statistics


Dataset Comparison


Experimental Results

Baselines


Temporal Window Study


BibTeX

@article{ghermi2024short,
  author  = {Ghermi, Ridouane and Wang, Xi and Kalogeiton, Vicky and Laptev, Ivan},
  title   = {Short Film Dataset (SFD): A Benchmark for Story-Level Video Understanding},
  journal = {arXiv},
  year    = {2024}
}