StreamSpatial is a benchmark for evaluating the streaming spatial reasoning ability of (Multimodal) Large Language Models on egocentric / scene videos. Given a video and a query issued at a specific timestamp, a model must reason about space and time across three temporal perspectives:
- Now — reasoning about the current spatial state observed so far (relative direction, relative distance, object counting).
- History — recalling spatial relations from the past (direction, distance, state).
- Future — predicting upcoming spatial events (dynamic-human prediction, interaction, next object, risk).
All questions are 4-way multiple choice (A/B/C/D).
Each JSON file (under `Data/`) is a list of QA items. A representative item:

    {
      "query": "Who am I most likely to meet next? A. ... B. ... C. ... D. ...",
      "id": 0,
      "parent_task": "Future",
      "child_task": "dynamic_human",
      "query_time": "00:00:01",
      "gt_time": "00:00:12",
      "vid_source": "egolife",
      "vid_name": "DAY1_A1_JAKE_12530000_12533000",
      "ans": "The pink-haired girl wearing a white short-sleeved shirt and black pants.",
      "ans_id": "C",
      "options": ["...", "...", "...", "..."]
    }

`query_time` is the timestamp at which the question is asked; only frames up to `query_time` should be fed to the model. `ans_id` is the ground-truth option letter used for scoring.

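As a minimal sketch of how a QA file could be read, the snippet below loads one JSON list and converts each `query_time` to seconds. It assumes every timestamp follows the `HH:MM:SS` form of the representative item; `hms_to_seconds`, `load_items`, and the added `query_sec` key are hypothetical helpers for illustration, not part of the released code.

```python
import json
from pathlib import Path


def hms_to_seconds(ts: str) -> int:
    """Convert an "HH:MM:SS" timestamp string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in ts.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def load_items(json_path: str) -> list[dict]:
    """Load one QA file (a JSON list) and attach the query time in seconds."""
    items = json.loads(Path(json_path).read_text(encoding="utf-8"))
    for item in items:
        # "query_sec" is a helper key added by this sketch, not a dataset field.
        item["query_sec"] = hms_to_seconds(item["query_time"])
    return items
```

The resulting `query_sec` value can then serve as the cutoff when selecting frames for the model.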
The videos are obtained from the sources listed below.
| Dataset | Source (Hugging Face) |
|---|---|
| VSI-Bench | nyu-visionx/VSI-Bench |
| EgoLife | lmms-lab/EgoLife |
| RynnEC | Alibaba-DAMO-Academy/RynnEC-Bench |
The long EgoLife videos are produced by merging the original short clips into longer streaming videos with Code/trans_video.py.
Tested with Python 3.10. Install dependencies:

    cd Code
    pip install -r requirements.txt

Key packages (see `Code/requirements.txt` for exact versions):

| Package | Version | Purpose |
|---|---|---|
| torch | 2.6.0 | model runtime |
| transformers | 4.55.0 | open-source VLMs |
| accelerate | 1.7.0 | multi-GPU loading |
| qwen-vl-utils | 0.0.8 | Qwen-VL preprocessing |
| openai | 2.37.0 | API models |
| opencv-python | 4.10.0.84 | video frame extraction |
| vllm | (optional) | fast inference engine |
To evaluate models that do not yet support native streaming video input, we adopt two complementary evaluation settings (select via --eval_setting):
- Streaming Setting: All frames sampled at 1 fps from the start of the clip up to the question timestamp (`0s → query_time`) are provided to the model, simulating real-time inference.
- Sliding Window Setting: A 32-second time window before the question-asking time is captured, with frames extracted at 2 fps (`--window_size` controls the window length). This preserves recent temporal context while keeping the frame count bounded.
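The two settings above can be illustrated by computing which timestamps would be sampled. This is a sketch under stated assumptions, not the logic of `evaluate.py`: it assumes the frame at `query_time` itself is included, that the sliding window ends at the query time, and that an over-long frame list is capped by uniform subsampling. The function names are hypothetical.

```python
def streaming_timestamps(query_sec: float, fps: float = 1.0) -> list[float]:
    """Timestamps (seconds) sampled at `fps` from 0 up to and including query_sec."""
    step = 1.0 / fps
    count = int(query_sec * fps) + 1
    return [i * step for i in range(count)]


def sliding_window_timestamps(
    query_sec: float,
    window_size: float = 32.0,
    fps: float = 2.0,
    max_frames: int = 64,
) -> list[float]:
    """Timestamps sampled at `fps` inside the last `window_size` seconds."""
    start = max(0.0, query_sec - window_size)
    step = 1.0 / fps
    # At least one frame, so a query at t=0 still yields the first frame.
    count = max(1, int((query_sec - start) * fps))
    # Walk backwards from the query time, then restore chronological order.
    stamps = [query_sec - i * step for i in range(count)][::-1]
    if len(stamps) > max_frames:
        # Assumed capping strategy: keep max_frames evenly spaced entries.
        total = len(stamps)
        stamps = [stamps[int(i * total / max_frames)] for i in range(max_frames)]
    return stamps
```

With the default values, a 32-second window at 2 fps gives 64 timestamps, which matches the default `--max_frames` of 64.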
The pipeline has two steps: (1) generate predictions, then (2) compute accuracy.
Predictions are written to:
{save_root}/{now|future|history}/{task}/qa_results_{model_name}.json
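The path template above could be assembled as in the following sketch. Two points are assumptions rather than documented behavior: that the `{now|future|history}` folder is the lower-cased `parent_task`, with `{task}` being the `child_task`, and that the file name uses only the last component of a Hugging Face model id (so `Qwen/Qwen3-VL-8B-Instruct` becomes `Qwen3-VL-8B-Instruct`). `prediction_path` is a hypothetical helper.

```python
from pathlib import Path


def prediction_path(save_root: str, parent_task: str, child_task: str, model_name: str) -> Path:
    """Build {save_root}/{parent}/{task}/qa_results_{model_name}.json."""
    # Assumption: drop any "org/" prefix from the model id for the file name.
    short_name = model_name.split("/")[-1]
    # Assumption: the parent folder is the lower-cased parent_task value.
    return Path(save_root) / parent_task.lower() / child_task / f"qa_results_{short_name}.json"
```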
A single entry point evaluate.py handles both open-source local models and closed-source API models via --model_type.
By default the model runs in-process with Hugging Face `transformers`. If you have started a vLLM OpenAI-compatible server (`vllm serve ...`), pass its address via `--vllm_url` to run inference through that server instead.

    cd Code
    # Default: run the model in-process with transformers
    python evaluate.py \
        --model_type open \
        --model_name Qwen/Qwen3-VL-8B-Instruct \
        --eval_setting sliding_window --window_size 32 \
        --video_root /path/to/videos \
        --save_root ./results

    # Optional: route through a running vLLM server (faster, decoupled)
    # vllm serve Qwen/Qwen3-VL-8B-Instruct --port 8000
    python evaluate.py \
        --model_type open --vllm_url http://localhost:8000/v1 \
        --model_name Qwen/Qwen3-VL-8B-Instruct \
        --eval_setting sliding_window --window_size 32 \
        --video_root /path/to/videos \
        --save_root ./results

Closed-source models are accessed through the official OpenAI Python SDK (works with any OpenAI-compatible endpoint). Credentials are read from environment variables; nothing is hard-coded:

    export OPENAI_API_KEY=your_api_key
    export OPENAI_BASE_URL=https://api.openai.com/v1   # optional; override for other providers
    cd Code
    python evaluate.py \
        --model_type closed \
        --model_name gpt-4o \
        --eval_setting sliding_window --window_size 32 \
        --video_root /path/to/videos \
        --save_root ./results

Common arguments:
- `--model_type`: `open` (local open-source VLM) or `closed` (OpenAI-compatible API).
- `--vllm_url`: (`open` only) address of a vLLM OpenAI-compatible server; if omitted, the model runs in-process with transformers.
- `--eval_setting`: `streaming` (1 fps from start to `query_time`) or `sliding_window` (2 fps within the last `--window_size` seconds).
- `--max_frames`: upper cap on frames for `sliding_window` (default 64); ignored by `streaming`.
- `--window_size`: window length in seconds for `sliding_window` (default 32).
- `--image_size`: optional square resize for frames.
- `--data_root` / `--video_root` / `--save_root`: data, video and output directories.

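For readers wiring these options into their own scripts, the arguments listed above could be declared with `argparse` roughly as follows. This is an illustrative sketch, not the parser in `evaluate.py`: the argument types, which flags are required, and the `None` defaults for values the list does not specify are all assumptions. Only the defaults of 64 and 32 come from the list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="StreamSpatial evaluation (sketch)")
    parser.add_argument("--model_type", choices=["open", "closed"], required=True)
    parser.add_argument("--model_name", required=True)
    # Only meaningful for --model_type open; None means run in-process.
    parser.add_argument("--vllm_url", default=None)
    parser.add_argument("--eval_setting", choices=["streaming", "sliding_window"], required=True)
    # Documented defaults: 64 frames and a 32-second window.
    parser.add_argument("--max_frames", type=int, default=64)
    parser.add_argument("--window_size", type=int, default=32)
    # Defaults below are not documented; None is a placeholder assumption.
    parser.add_argument("--image_size", type=int, default=None)
    parser.add_argument("--data_root", default=None)
    parser.add_argument("--video_root", default=None)
    parser.add_argument("--save_root", default=None)
    return parser
```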
Then compute accuracy from the saved predictions:

    cd Code
    python compute_accuracy.py --save_root ./results --model_name Qwen3-VL-8B-Instruct

This prints the accuracy for every task, the aggregate for each parent category (Now / History / Future), and an overall score.
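The aggregation described above can be sketched as follows. The layout of the saved prediction files is not documented here, so the `pred` key holding the predicted option letter is a hypothetical field, and pooling all questions within a parent category (rather than averaging per-task accuracies) is an assumption about how `compute_accuracy.py` aggregates.

```python
from collections import defaultdict


def accuracy_report(records):
    """Compute per-task, per-parent, and overall accuracy from QA records.

    Each record needs "parent_task", "child_task", "ans_id", and a
    hypothetical "pred" key with the predicted option letter.
    """
    task_counts = defaultdict(lambda: [0, 0])  # (parent, child) -> [hits, total]
    for rec in records:
        key = (rec["parent_task"], rec["child_task"])
        correct = rec["pred"].strip().upper() == rec["ans_id"]
        task_counts[key][0] += int(correct)
        task_counts[key][1] += 1

    # Pool question counts per parent category (assumed aggregation).
    parent_counts = defaultdict(lambda: [0, 0])
    for (parent, _), (hit, total) in task_counts.items():
        parent_counts[parent][0] += hit
        parent_counts[parent][1] += total

    hits = sum(h for h, _ in task_counts.values())
    total = sum(t for _, t in task_counts.values())
    return {
        "task": {k: h / t for k, (h, t) in task_counts.items()},
        "parent": {k: h / t for k, (h, t) in parent_counts.items()},
        "overall": hits / total if total else 0.0,
    }
```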
This project is licensed under the Apache License 2.0 — see the [LICENSE](./LICENSE) file.
Copyright 2026 OPPO. All rights reserved.
