Haonan Wang, Chao Du, Tianyu Pang | 15 Mar, 2025
GitHub: https://github.com/haonan3/V1
Huggingface: https://huggingface.co/datasets/haonan3/V1-33K
Recent Large Reasoning Models (LRMs) such as DeepSeek-R1 have demonstrated impressive reasoning abilities, but those abilities remain confined to textual data. Such models capture only a small fraction of the rich, multimodal information that humans naturally rely on, which limits progress toward AGI.
To advance multimodal reasoning, we introduce a future prediction task and its corresponding dataset. Predicting the future is a deeply desired ability, yet forecasting upcoming events from historical video data presents significant challenges for current Multi-modal Large Models (MLMs). Our task pushes these models to infer future events based on the first part of a video, with the second part serving as open-ended ground truth for evaluation.
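To make the setup concrete, here is a minimal sketch of how a single example could be framed and scored. The example schema, the `model` and `judge` callables, and the prompt wording are all assumptions for illustration; the actual protocol is defined by the dataset and evaluation code in the repository.

```python
from dataclasses import dataclass

@dataclass
class FuturePredictionExample:
    """One future-prediction example (illustrative schema, not the released one)."""
    observed_clip: str       # path to the first part of the video (model input)
    future_clip: str         # path to the second part (held out from the model)
    future_description: str  # open-ended ground truth describing what happens next

def evaluate_example(model, judge, example: FuturePredictionExample) -> float:
    """Ask the model to forecast the future from the observed clip, then have
    an open-ended judge score the forecast against the held-out outcome.

    `model` and `judge` are hypothetical callables standing in for an MLM and
    an LLM-as-judge; they are not part of the released codebase.
    """
    prediction = model(
        video=example.observed_clip,
        prompt="Based on this video, predict what happens next and explain your reasoning.",
    )
    # Open-ended evaluation: there is no exact-match answer; the judge rates
    # semantic agreement between the prediction and what actually happened.
    return judge(prediction=prediction, reference=example.future_description)
```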
<aside> 🤔
Why isn’t factual QA ideal for video reasoning?
Research indicates that reasoning models like DeepSeek-R1 often “over-think”, which can lead to hallucinations on factual QA tasks. Similar pitfalls emerge on video data if the model is restricted to answering straightforward factual questions. For instance, querying “Where is the cat in the video?” might trigger an overly extended reasoning process, inadvertently increasing the risk of hallucinated outputs.
</aside>
<aside> 💡
Why is future prediction a compelling case for video reasoning?
Much like Doctor Strange’s foresight in Avengers: Infinity War (2018), predicting the future demands reasoning over multiple potential outcomes. This challenge is analogous to techniques such as Monte Carlo tree search (MCTS), which systematically explores a wide array of possible scenarios. The inherent complexity of future prediction makes it a powerful task for evaluating and enhancing video reasoning capabilities.
</aside>
<aside> 📽️
Video Future Prediction: A Self-Supervised Task for Multimodal Reasoning
This task is inherently self-supervised: it leverages the causal logic already present in video data. By dividing videos into sequential segments, we create implicit labels that embody the natural flow of cause and effect, allowing models to learn from the logical progression of events without manual annotation.
Like Image Contrastive Learning, which exploits inherent data structure to construct labels and guide what a model should capture, Video Future Prediction is grounded in the philosophy that real-world events unfold through chains of cause and effect. It drives the model to focus on the temporal and causal dimensions that underpin real-world scenarios, enhancing multimodal reasoning. By integrating visual cues, the model develops a holistic reasoning ability to more accurately predict and interpret the progression of complex events. A sketch of this label construction is given after this callout.
Moreover, as with other self-supervised and unsupervised learning tasks, data construction is relatively cheap, making this a scalable recipe for strengthening multimodal reasoning capabilities.
</aside>
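Below is a minimal sketch of that construction, assuming each video is split at a fixed fraction of its duration. The `split_ratio` hyperparameter and the `CauseEffectPair` record are illustrative; the released pipeline (described later) chooses split points more carefully, and actual clipping would be done with a tool such as ffmpeg.

```python
from dataclasses import dataclass

@dataclass
class CauseEffectPair:
    video_id: str
    observed_span: tuple[float, float]  # seconds: the "cause" segment shown to the model
    future_span: tuple[float, float]    # seconds: the "effect" segment used as the implicit label

def build_pairs(video_durations: dict[str, float], split_ratio: float = 0.5) -> list[CauseEffectPair]:
    """Turn raw, unlabeled videos into (cause, effect) pairs.

    No manual annotation is required: the later segment itself is the
    supervision signal. `split_ratio` is an illustrative hyperparameter.
    """
    pairs = []
    for video_id, duration in video_durations.items():
        split_at = duration * split_ratio
        pairs.append(CauseEffectPair(
            video_id=video_id,
            observed_span=(0.0, split_at),
            future_span=(split_at, duration),
        ))
    return pairs

# Example: two raw videos, no labels needed.
print(build_pairs({"vid_001": 42.0, "vid_002": 130.0}))
```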
| Dataset | Number of Videos |
|---|---|
| activitynet | 6,497 |
| Charades | 3,692 |
| ego4d | 863 |
| NextQA | 2,142 |
| youcook2 | 2,757 |
| youtube | 17,255 |

| Duration Range | Number of Videos |
|---|---|
| 0-30s | 8,294 |
| 30-60s | 8,832 |
| 1-2m | 8,832 |
| 2-3m | 7,248 |
| Total | 33,206 |
Our dataset is built upon lmms-lab/LLaVA-Video-178K and Ego4D; we thank both teams for their foundational work.
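If you want to explore the data, it can be pulled directly from the Hugging Face Hub with the `datasets` library. The split and field names are not shown here because they may differ from what you expect; check the dataset card linked above for the exact schema.

```python
from datasets import load_dataset

# Dataset ID from the Hugging Face link above; inspect the returned object to
# see which splits and columns are actually available.
ds = load_dataset("haonan3/V1-33K")
print(ds)                       # splits and number of rows
first_split = next(iter(ds))    # name of the first available split
print(ds[first_split][0])       # peek at the first record
```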
In the era of LLMs and MLMs, we believe that tasks shape a model's abilities, while data quality drives its performance. To that end, we propose a multi-stage data pipeline that transforms raw video content into refined predictive reasoning. The pipeline is organized into four sequential stages: Fact Extraction, Analysis, Segmentation, and Reasoning. Each stage builds upon the results of the previous one to improve overall accuracy and insight.
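The sketch below shows one way such a pipeline could be wired together. The `query_mlm` helper, the prompt wording, and the exact role assigned to each stage are assumptions inferred from the stage names, not the released implementation.

```python
def query_mlm(video: str, prompt: str) -> str:
    """Hypothetical call to a multimodal LLM; replace with your own client."""
    raise NotImplementedError

def run_pipeline(video: str) -> dict:
    """Four sequential stages; each stage conditions on the previous outputs."""
    # 1) Fact Extraction: ground the pipeline in what is actually visible.
    facts = query_mlm(video, "List the observable facts and events in this clip.")
    # 2) Analysis: interpret the facts (causal links, intentions, context).
    analysis = query_mlm(video, f"Given these facts:\n{facts}\nAnalyze the causal relations and likely intentions.")
    # 3) Segmentation: choose where to split the video into observed vs. future parts.
    segmentation = query_mlm(video, f"Based on this analysis:\n{analysis}\nPropose a split point that separates cause from consequence.")
    # 4) Reasoning: produce the predictive reasoning trace for the observed part.
    reasoning = query_mlm(video, f"Using the segmentation:\n{segmentation}\nReason step by step about what happens after the split.")
    return {"facts": facts, "analysis": analysis, "segmentation": segmentation, "reasoning": reasoning}
```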