Skip to content

OPPO-Mente-Lab/StreamSpatial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

StreamSpatial [ECCV 2026]

StreamSpatial

StreamSpatial is a benchmark for evaluating the streaming spatial reasoning ability of (Multimodal) Large Language Models on egocentric / scene videos. Given a video and a query issued at a specific timestamp, a model must reason about space and time across three temporal perspectives:

  • Now — reasoning about the current spatial state observed so far (relative direction, relative distance, object counting).
  • History — recalling spatial relations from the past (direction, distance, state).
  • Future — predicting upcoming spatial events (dynamic-human prediction, interaction, next object, risk).

All questions are 4-way multiple choice (A/B/C/D).


Data

Each json file (under Data/) is a list of QA items. A representative item:

{
  "query": "Who am I most likely to meet next? A. ... B. ... C. ... D. ...",
  "id": 0,
  "parent_task": "Future",
  "child_task": "dynamic_human",
  "query_time": "00:00:01",
  "gt_time": "00:00:12",
  "vid_source": "egolife",
  "vid_name": "DAY1_A1_JAKE_12530000_12533000",
  "ans": "The pink-haired girl wearing a white short-sleeved shirt and black pants.",
  "ans_id": "C",
  "options": ["...", "...", "...", "..."]
}
  • query_time is the timestamp at which the question is asked; only frames up to query_time should be fed to the model.
  • ans_id is the ground-truth option letter used for scoring.

The videos are obtained from the sources listed below.

Dataset Source (Hugging Face)
VSI-Bench nyu-visionx/VSI-Bench
EgoLife lmms-lab/EgoLife
RynnEC Alibaba-DAMO-Academy/RynnEC-Bench

The long EgoLife videos are produced by merging the original short clips into longer streaming videos with Code/trans_video.py.


Environment

Tested with Python 3.10. Install dependencies:

cd Code
pip install -r requirements.txt

Key packages (see Code/requirements.txt for exact versions):

Package Version Purpose
torch 2.6.0 model runtime
transformers 4.55.0 open-source VLMs
accelerate 1.7.0 multi-GPU loading
qwen-vl-utils 0.0.8 Qwen-VL preprocessing
openai 2.37.0 API models
opencv-python 4.10.0.84 video frame extraction
vllm (optional) fast inference engine

Evaluation settings

To evaluate models that do not yet support native streaming video input, we adopt two complementary evaluation settings (select via --eval_setting):

  • Streaming Setting — All frames sampled at 1 fps from the start of the clip up to the question timestamp (0s → query_time) are provided to the model, simulating real-time inference.
  • Sliding Window Setting — A 32-second time window before the question-asking time is captured, with frames extracted at 2 fps (--window_size controls the window length). This preserves recent temporal context while keeping the frame count bounded.

How to evaluate

The pipeline has two steps: (1) generate predictions, then (2) compute accuracy.

Predictions are written to: {save_root}/{now|future|history}/{task}/qa_results_{model_name}.json

A single entry point evaluate.py handles both open-source local models and closed-source API models via --model_type.

1a. Open-source VLMs (--model_type open)

By default the model runs in-process with HuggingFace transformers. If you have started a vLLM OpenAI-compatible server (vllm serve ...), pass its address via --vllm_url to run inference through that server instead.

cd Code

# Default: run the model in-process with transformers
python evaluate.py \
    --model_type open \
    --model_name Qwen/Qwen3-VL-8B-Instruct \
    --eval_setting sliding_window --window_size 32 \
    --video_root /path/to/videos \
    --save_root ./results

# Optional: route through a running vLLM server (faster, decoupled)
#   vllm serve Qwen/Qwen3-VL-8B-Instruct --port 8000
python evaluate.py \
    --model_type open --vllm_url http://localhost:8000/v1 \
    --model_name Qwen/Qwen3-VL-8B-Instruct \
    --eval_setting sliding_window --window_size 32 \
    --video_root /path/to/videos \
    --save_root ./results

1b. Closed-source API models (--model_type closed)

Closed-source models are accessed through the official OpenAI Python SDK (works with any OpenAI-compatible endpoint). Credentials are read from environment variables; nothing is hard-coded:

export OPENAI_API_KEY=your_api_key
export OPENAI_BASE_URL=https://api.openai.com/v1   # optional; override for other providers

cd Code
python evaluate.py \
    --model_type closed \
    --model_name gpt-4o \
    --eval_setting sliding_window --window_size 32 \
    --video_root /path/to/videos \
    --save_root ./results

Common arguments:

  • --model_typeopen (local open-source VLM) or closed (OpenAI-compatible API).
  • --vllm_url — (open only) address of a vLLM OpenAI-compatible server; if omitted, the model runs in-process with transformers.
  • --eval_settingstreaming (1 fps from start to query_time) or sliding_window (2 fps within the last --window_size seconds).
  • --max_frames — upper cap on frames for sliding_window (default 64); ignored by streaming.
  • --window_size — window length in seconds for sliding_window (default 32).
  • --image_size — optional square resize for frames.
  • --data_root / --video_root / --save_root — data, video and output directories.

2. Compute accuracy

cd Code
python compute_accuracy.py --save_root ./results --model_name Qwen3-VL-8B-Instruct

This prints accuracy for every task, aggregated per parent category (Now / History / Future), and an overall score.


License

This project is licensed under the Apache License 2.0 — see the [LICENSE](./LICENSE) file.

Copyright 2026 OPPO. All rights reserved.

About

StreamSpatial-Bench, a testbed for continuous visual-spatial reasoning with fine-grained online temporal perspectives, dynamic multi-agent interactions, and comprehensive 3D spatial reasoning.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages