Scalable data generation for video reasoning models using AWS Lambda.
Deploy • About • Quick Start • Docs
VBVR DataFactory is a distributed data generation system built on AWS Lambda. It orchestrates 300+ generators from the VBVR-DataFactory GitHub organization to create high-quality training data for video reasoning models.
This project is part of the Very Big Video Reasoning (VBVR) initiative.
graph LR
A[CloudFormation] --> B[Submit Lambda]
B --> C[(SQS Queue)]
C --> D[Generator Lambda]
C -.-> E[(DLQ)]
D --> F[(S3 Bucket)]
D --> G[(DynamoDB)]
Deploy to your AWS account in minutes — no local setup required.
🔜 Coming Soon
| S3 Bucket | SQS Queue | Lambda | DLQ | DynamoDB |
|---|---|---|---|---|
| Output storage | Task queue | 300+ generators | Auto-retry | Dedup |
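These are the pieces the stack deploys. If you want to confirm from code that they all came up, a short boto3 sketch like the one below works. The bucket, queue, function, and table names here are placeholders; substitute the values from your own deployment.
# sanity_check.py - hedged sketch: verify the deployed resources are reachable.
import boto3

BUCKET = "vbvr-datafactory-123456789-us-east-2"   # placeholder
QUEUE_NAME = "vbvr-datafactory-pipeline-queue"    # placeholder
FUNCTION = "vbvr-datafactory-generator"           # placeholder (your function name may differ)
TABLE = "vbvr-param-hash"                         # placeholder

boto3.client("s3").head_bucket(Bucket=BUCKET)                                  # raises if missing
queue_url = boto3.client("sqs").get_queue_url(QueueName=QUEUE_NAME)["QueueUrl"]
boto3.client("lambda").get_function(FunctionName=FUNCTION)
boto3.client("dynamodb").describe_table(TableName=TABLE)
print("All resources reachable. Queue URL:", queue_url)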
After deployment — How to use
Option 1: Invoke Submit Lambda (Recommended)
Go to AWS Console → Lambda → {stack-name}-submit-tasks → Test with:
{
"generators": ["O-41_nonogram_data-generator", "O-42_object_permanence_data-generator"],
"samples": 10000,
"batch_size": 25
}
Or use AWS CLI:
aws lambda invoke \
--function-name vbvr-datafactory-submit-tasks \
--payload '{"samples": 10000}' \
response.json
Option 2: Send SQS Messages Directly
Go to AWS Console → SQS → {stack-name}-queue → Send message:
{
"type": "O-41_nonogram_data-generator",
"start_index": 0,
"num_samples": 25,
"seed": 42,
"output_format": "tar"
}
Download results:
# Download all generated data
aws s3 sync s3://{stack-name}-output-{account-id}/questions/ ./results/
# Results will be in:
# ./results/G-41_generator/task_name_task/task_name_00000000/
All generated data follows this standardized structure:
questions/
├── G-1_object_trajectory_data-generator/
│ └── object_trajectory_task/
│ ├── object_trajectory_00000000/
│ │ ├── first_frame.png
│ │ ├── final_frame.png
│ │ ├── ground_truth.mp4
│ │ ├── metadata.json
│ │ └── prompt.txt
│ ├── object_trajectory_00000001/
│ │ └── [same 5 files]
│ └── ... (continues with _00000002, _00000003, etc.)
│
├── G-2_another_generator/
│ └── another_task/
│ ├── another_00000000/
│ └── ...
│
└── O-41_nonogram_data-generator/
└── nonogram_task/
├── nonogram_00000000/
│ ├── first_frame.png
│ ├── final_frame.png
│ ├── ground_truth.mp4
│ ├── metadata.json
│ └── prompt.txt
└── ... (continues with _00000001, _00000002, etc.)
Structure breakdown:
- Root: questions/ - All generated data
- Generator: {G|O}-{N}_{task-name}_data-generator/ - Each generator has its own folder
- Task: {task-name}_task/ - Task-specific directory
- Instances: {task-name}_00000000/ - Individual samples with 8-digit zero-padded indices
- Files: Each instance contains 2-5 files (first_frame.png and prompt.txt are required; final_frame.png, ground_truth.mp4, and metadata.json are optional)
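As an illustration of how downstream code might consume this layout, here is a minimal Python sketch that walks a locally synced questions/ tree and pairs each instance with its prompt and (optional) metadata. It assumes only the directory layout above; the ./results path is a placeholder.
# walk_results.py - hedged sketch: iterate instances under a synced questions/ tree.
import json
from pathlib import Path

root = Path("./results")  # e.g., after `aws s3 sync .../questions/ ./results/`
for prompt_file in sorted(root.glob("*/*_task/*/prompt.txt")):
    instance_dir = prompt_file.parent                 # e.g., object_trajectory_00000000/
    prompt = prompt_file.read_text()
    meta_file = instance_dir / "metadata.json"        # optional file
    metadata = json.loads(meta_file.read_text()) if meta_file.exists() else {}
    video = instance_dir / "ground_truth.mp4"         # optional file
    print(instance_dir.name, "| has video:", video.exists(),
          "| seed:", metadata.get("generation", {}).get("seed"))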
metadata.json format:
{
"task_id": "object_trajectory_00000000",
"generator": "G-1_object_trajectory_data-generator",
"timestamp": "2026-02-17T06:15:55.000000",
"parameters": { ... },
"param_hash": "cdba87435dd16831",
"generation": {
"seed": 12345,
"git": { "commit": "...", "branch": "main", "repo": "..." }
}
}
- param_hash: First 16 hex characters of the SHA-256 hash of the task parameters (excluding seed), used for deduplication
- parameters: The generation parameters, recorded for reproducibility
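The exact canonicalization lives in the generators, but the idea is simple. The sketch below shows one way such a hash could be computed; serializing the parameters as sorted-key JSON is an assumption, not necessarily what the project does.
# param_hash sketch - hedged: hash task parameters (excluding seed) for dedup.
import hashlib
import json

def param_hash(parameters: dict) -> str:
    # Drop the seed so reruns with a new seed but identical parameters collide.
    params = {k: v for k, v in parameters.items() if k != "seed"}
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

print(param_hash({"grid_size": 10, "num_objects": 3, "seed": 12345}))  # -> 16 hex characters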
Tar Archive Format:
When using --output-format tar, files are packaged into compressed archives:
questions/
└── G-1_object_trajectory_data-generator_00000-00099.tar.gz
# Extract to see:
G-1_object_trajectory_data-generator/
└── object_trajectory_task/
├── object_trajectory_00000000/
│ └── [files]
├── object_trajectory_00000001/
└── ... (through _00000099)
- Tar files: {generator}_{start-index}-{end-index}.tar.gz
- Internal structure: Preserves the full {generator}/{task}_task/{samples}/ hierarchy
- Benefits: Efficient download, reduced S3 requests, maintains organization
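If you prefer Python over the shell for unpacking, a sketch along these lines works; the ./downloads and ./results paths are placeholders.
# extract_archives.py - hedged sketch: unpack downloaded .tar.gz batches locally.
import tarfile
from pathlib import Path

for archive in sorted(Path("./downloads").glob("*.tar.gz")):   # e.g., G-1_..._00000-00099.tar.gz
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(path="./results")                       # recreates {generator}/{task}_task/...
    print("extracted", archive.name)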
# Verify Python 3.11+ (required)
python3 --version # Should be 3.11 or higher
# Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate
# Install Node.js (required for AWS CDK CLI)
# Ubuntu/Debian:
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs
# macOS:
# brew install node
# Verify Node.js installation:
node --version # Should be v20.x or higher
npm --version
# Install AWS CDK CLI globally (required for deployment)
# Note: CLI version doesn't need to match aws-cdk-lib Python package version
sudo npm install -g aws-cdk@2.1100.3
# Verify CDK installation:
cdk --version # Should be 2.1100.3
# --- macOS (Homebrew) ---
# brew install awscli
# brew install gh
# Install Docker Desktop from: https://www.docker.com/products/docker-desktop/
# --- Ubuntu/Debian (apt) ---
# sudo apt update
# sudo apt install -y curl unzip git python3-venv
#
# AWS CLI:
# Option A (pip, compatible with boto3/botocore): pip install awscli==1.44.16
# Option B (v2, standalone - no Python dependencies):
# curl -L "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip
# unzip -q awscliv2.zip
# sudo ./aws/install
# rm -rf awscliv2.zip aws
#
# GitHub CLI:
# sudo apt install -y gh
# (If `gh` isn't available in your distro repos, install from https://cli.github.com/)
#
# Docker (No Docker Hub account needed - this project uses AWS ECR):
# sudo apt install -y ca-certificates curl
# sudo install -m 0755 -d /etc/apt/keyrings
# sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
# sudo chmod a+r /etc/apt/keyrings/docker.asc
# echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# sudo apt update # Required after adding Docker repository
# sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# sudo usermod -aG docker $USER # Add your Linux user to docker group (avoid needing sudo)
# newgrp docker # Or log out and back in to apply group changes
# Configure AWS credentials
aws configure
# It will ask for:
# - AWS Access Key ID
# - AWS Secret Access Key
# - Default region (use: us-east-2)
# - Default output format (use: json)
# Clone the repository
git clone https://github.com/video-reason/VBVR-DataFactory
cd VBVR-DataFactory
# Install the package with all dependencies
pip install -e ".[dev,cdk]"# Authenticate with GitHub (first time only)
gh auth login
# Download all generator repositories
cd scripts
./download_all_repos.sh
cd ..
# This downloads all O- and G- generators from VBVR-DataFactory org to ./generators/
# Make sure Docker is running first!
# - macOS/Windows: Docker Desktop
# - Linux: Docker Engine (dockerd)
# Ensure you're in the project root directory (VBVR-DataFactory/)
# If you're in the deployment subdirectory, go back:
# cd ..
# Bootstrap CDK (first time only)
cd deployment
cdk bootstrap
cd ..
# Deploy the infrastructure
cd deployment
cdk deploy
cd ..
# Wait for deployment to complete (~5-10 minutes)
# Save the outputs that appear at the end:
# - QueueUrl
# - BucketName
# - DlqUrl
# - DedupTableName
After deployment completes, you'll see:
Outputs:
VBVRDataFactoryPipelineStack.QueueUrl = https://sqs.us-east-2.amazonaws.com/123456789/vbvr-datafactory-pipeline-queue
VBVRDataFactoryPipelineStack.BucketName = vbvr-datafactory-123456789-us-east-2
VBVRDataFactoryPipelineStack.DlqUrl = https://sqs.us-east-2.amazonaws.com/123456789/vbvr-datafactory-pipeline-dlq
VBVRDataFactoryPipelineStack.DedupTableName = vbvr-param-hash
Copy these values! You'll need them in the next step.
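Instead of copying the outputs by hand, you can also read them from CloudFormation. The sketch below assumes the stack name shown above; yours may differ.
# stack_outputs.py - hedged sketch: pull the CDK outputs via CloudFormation.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-2")
stack = cfn.describe_stacks(StackName="VBVRDataFactoryPipelineStack")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}

print("export SQS_QUEUE_URL=" + outputs["QueueUrl"])
print("export OUTPUT_BUCKET=" + outputs["BucketName"])
print("export SQS_DLQ_URL=" + outputs["DlqUrl"])
# Pipe into your shell, e.g.:  eval "$(python stack_outputs.py)"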
# Go back to project root
cd ..
# Set the queue URL and bucket from CDK outputs
export SQS_QUEUE_URL="https://sqs.us-east-2.amazonaws.com/123456789/vbvr-datafactory-pipeline-queue"
export OUTPUT_BUCKET="vbvr-datafactory-123456789-us-east-2"
# Optional: Set DLQ URL for monitoring failed tasks
export SQS_DLQ_URL="https://sqs.us-east-2.amazonaws.com/123456789/vbvr-datafactory-pipeline-dlq"
# Optional: Save to .env file for persistence
echo "SQS_QUEUE_URL=$SQS_QUEUE_URL" > .env
echo "OUTPUT_BUCKET=$OUTPUT_BUCKET" >> .env
echo "SQS_DLQ_URL=$SQS_DLQ_URL" >> .env# Test with a single generator (100 samples)
python scripts/submit.py \
--generator G-1_object_trajectory_data-generator \
--samples 100 \
--batch-size 10
# This will:
# - Create 10 SQS messages (10 samples each)
# - Send them to the queue
# - Lambda will automatically process them
# Watch queue status in real-time
python scripts/monitor.py --watch
# You'll see:
# - Messages waiting in queue
# - Messages being processed
# - Progress percentage
# Once processing is complete, download the generated data
aws s3 sync s3://vbvr-datafactory-123456789-us-east-2/questions/ ./results/
# Results structure (files format):
# results/
# └── G-1_object_trajectory_data-generator/
# └── object_trajectory_task/
# ├── object_trajectory_00000000/
# │ ├── first_frame.png
# │ ├── final_frame.png
# │ ├── prompt.txt
# │ └── ground_truth.mp4
# ├── object_trajectory_00000001/
# └── ...
# For tar format, download and extract:
# aws s3 cp s3://vbvr-datafactory-123456789-us-east-2/questions/G-1_generator_00000-00099.tar.gz .
# tar -xzf G-1_generator_00000-00099.tar.gz
# Submit 10,000 samples for ALL generators
python scripts/submit.py \
--generator all \
--samples 10000 \
--batch-size 100 \
--seed 42
# With deduplication (recommended for large runs)
python scripts/submit.py \
--generator all \
--samples 20000 \
--batch-size 10 \
--dedup \
--bucket my-output-bucket
# Monitor progress
python scripts/monitor.py --watch --interval 10
# This creates 100,000+ SQS messages
# Lambda processes them in parallel (up to 990 concurrent)
# Estimated time: ~2-4 hours depending on generators
# Only O- generators
# First, edit scripts/download_all_repos.sh line 20:
# Change to: grep -E '^O-[0-9]+_'
cd scripts && ./download_all_repos.sh && cd ..
# Then submit tasks
python scripts/submit.py --generator all --samples 5000
# Monitor the Dead Letter Queue
python scripts/monitor.py --watch
# Look at the DLQ section
# If you see failed messages, they need investigation
You can import and use vbvrdatafactory in your own Python projects:
from vbvrdatafactory.core.models import TaskMessage
from vbvrdatafactory.sqs.submitter import TaskSubmitter
from vbvrdatafactory.core.config import config
# Method 1: Submit using the submitter class
submitter = TaskSubmitter(queue_url="https://sqs.us-east-2.amazonaws.com/...")
result = submitter.submit_tasks(
generators=["G-1_object_trajectory_data-generator"],
total_samples=1000,
batch_size=100,
seed=42,
)
print(f"Submitted {result['total_successful']} tasks")
# Method 2: Create individual task messages
task = TaskMessage(
type="G-1_object_trajectory_data-generator",
num_samples=100,
start_index=0,
seed=42,
output_format="files",
)
# Validate automatically with Pydantic
validated_json = task.model_dump_json()
# Use this JSON to send to SQS manually
When you run cdk deploy, it creates:
- S3 Bucket - Stores generated data
- SQS Queue - Distributes tasks to workers
- Lambda Function - Runs generators (3 GB memory, 15-minute timeout)
- Dead Letter Queue - Captures failed tasks for retry
- DynamoDB Table - Deduplication via param_hash (optional, enabled with --dedup)
- IAM Roles - Permissions for Lambda to access S3/SQS/DynamoDB
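For orientation, here is a heavily simplified aws-cdk-lib sketch of how these pieces could be wired together. Construct IDs, the image asset path, and the exact settings are illustrative; the project's real stack lives in deployment/cdk/stacks/pipeline_stack.py and will differ.
# pipeline_stack_sketch.py - hedged sketch, not the project's actual stack.
from aws_cdk import CfnOutput, Duration, Stack
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_sqs as sqs
from aws_cdk.aws_lambda_event_sources import SqsEventSource
from constructs import Construct

class PipelineStackSketch(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        bucket = s3.Bucket(self, "Output")                       # generated data
        dlq = sqs.Queue(self, "Dlq", retention_period=Duration.days(14))
        queue = sqs.Queue(
            self, "Queue",
            visibility_timeout=Duration.minutes(16),             # >= Lambda timeout
            dead_letter_queue=sqs.DeadLetterQueue(max_receive_count=3, queue=dlq),
        )
        dedup = dynamodb.Table(
            self, "Dedup",
            partition_key=dynamodb.Attribute(name="param_hash",
                                             type=dynamodb.AttributeType.STRING),
        )

        fn = _lambda.DockerImageFunction(
            self, "Generator",
            code=_lambda.DockerImageCode.from_image_asset("lambda"),  # illustrative path
            memory_size=3072,
            timeout=Duration.minutes(15),
            environment={"OUTPUT_BUCKET": bucket.bucket_name,
                         "DEDUP_TABLE": dedup.table_name},
        )
        bucket.grant_read_write(fn)
        dedup.grant_read_write_data(fn)
        fn.add_event_source(SqsEventSource(queue, batch_size=1, max_concurrency=990))

        CfnOutput(self, "QueueUrl", value=queue.queue_url)
        CfnOutput(self, "BucketName", value=bucket.bucket_name)
        CfnOutput(self, "DlqUrl", value=dlq.queue_url)
        CfnOutput(self, "DedupTableName", value=dedup.table_name)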
1. You run: python scripts/submit.py
↓
2. Creates task messages and sends to SQS Queue
↓
3. SQS automatically triggers Lambda (up to 990 concurrent)
↓
4. Lambda:
- Validates message with Pydantic
- Runs generator script
- (If --dedup) Checks param_hash against DynamoDB, regenerates duplicates
- Uploads results to S3
- Deletes message from queue
↓
5. If Lambda fails 3 times → message goes to DLQ
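The monitor.py script covers this for you, but if you want to poke at the queues yourself, the approximate-depth attributes on SQS are enough. This sketch reads the queue URLs from the environment variables set earlier.
# queue_depth.py - hedged sketch: check main queue and DLQ depth with boto3.
import os
import boto3

sqs = boto3.client("sqs")

def depth(queue_url: str) -> dict:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages",
                        "ApproximateNumberOfMessagesNotVisible"],
    )["Attributes"]
    return {"waiting": int(attrs["ApproximateNumberOfMessages"]),
            "in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"])}

print("queue:", depth(os.environ["SQS_QUEUE_URL"]))
print("dlq:  ", depth(os.environ["SQS_DLQ_URL"]))   # anything here failed 3 times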
{
"type": "G-1_object_trajectory_data-generator",
"num_samples": 100,
"start_index": 0,
"seed": 42,
"output_format": "files"
}
Output Format Options:
"files"(default) - Individual files uploaded to S3 with full directory structure"tar"- Compressed tar.gz archive per batch (e.g.,G-1_generator_00000-00099.tar.gz)
All fields are validated by Pydantic. Invalid messages are rejected immediately.
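To see what that validation looks like without deploying anything, here is a self-contained sketch using a Pydantic model that mirrors the fields above. The real TaskMessage in vbvrdatafactory.core.models may enforce more (or different) constraints.
# message_validation.py - hedged sketch: illustrative mirror of the task message schema.
import os
import boto3
from pydantic import BaseModel, Field, ValidationError

class TaskMessageSketch(BaseModel):                  # illustrative, not the project's model
    type: str
    num_samples: int = Field(gt=0)
    start_index: int = Field(ge=0)
    seed: int | None = None
    output_format: str = "files"

try:
    TaskMessageSketch(type="G-1_object_trajectory_data-generator",
                      num_samples=-1, start_index=0)
except ValidationError as err:
    print("rejected:", err.errors()[0]["msg"])       # invalid messages never reach a generator

msg = TaskMessageSketch(type="G-1_object_trajectory_data-generator",
                        num_samples=100, start_index=0, seed=42)
boto3.client("sqs").send_message(QueueUrl=os.environ["SQS_QUEUE_URL"],
                                 MessageBody=msg.model_dump_json())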
export SQS_QUEUE_URL="https://sqs.us-east-2.amazonaws.com/.../vbvr-datafactory-pipeline-queue"
export OUTPUT_BUCKET="vbvr-datafactory-123456789-us-east-2"
export AWS_REGION="us-east-2" # Default region
export SQS_DLQ_URL="https://sqs..." # For monitoring failed tasks
export GENERATORS_PATH="./generators" # Local path to generators
Edit deployment/cdk.json to adjust:
{
"context": {
"lambdaMemoryMB": 3072, // 3 GB
"lambdaTimeoutMinutes": 15, // 15 minutes
"sqsMaxConcurrency": 990 // Max parallel Lambdas
}
}
python scripts/submit.py --generator GENERATOR_NAME --samples NUM_SAMPLES
# Options:
# --generator, -g Generator name or "all" (required)
# --samples, -n Total samples per generator (default: 10000)
# --batch-size, -b Samples per Lambda task (default: 100)
# --seed, -s Random seed (optional)
# --output-format "files" or "tar" (default: files)
# --bucket Override output bucket (optional)
# --dedup Enable DynamoDB deduplication (optional)
# Examples:
python scripts/submit.py -g all -n 10000
python scripts/submit.py -g G-1_object_trajectory_data-generator -n 1000 --seed 42
python scripts/monitor.py
# Options:
# --watch, -w Continuous monitoring mode
# --interval, -i Refresh interval in seconds (default: 10)
# Example:
python scripts/monitor.py --watch --interval 5
cd scripts
./download_all_repos.sh
# This downloads all O- and G- generators from the VBVR-DataFactory org
# To download specific types, edit line 20 of the script
cd scripts
./collect_requirements.sh
# This collects requirements.txt from all generators
# and updates ../requirements-all.txt
# Run this when generators are added or updated
Error: Cannot connect to the Docker daemon
Solution:
- macOS/Windows: Start Docker Desktop
- Linux: Start Docker Engine (e.g., sudo systemctl start docker)
Error: ModuleNotFoundError: No module named 'pydantic'
Solution:
pip install -e ".[dev,cdk]"Error: Unable to locate credentials
Solution:
aws configure
# Enter your AWS Access Key ID and Secret Access Key
Error: SQS_QUEUE_URL environment variable not set
Solution:
export SQS_QUEUE_URL="https://sqs.us-east-2.amazonaws.com/.../vbvr-datafactory-pipeline-queue"
Get this value from the CDK outputs after deployment.
Error: Generator not found: ./generators/G-1_object_trajectory_data-generator
Solution:
cd scripts
./download_all_repos.sh
cd ..
Error: Node version 19 is end of life
Solution:
# macOS (Homebrew):
# brew install node@20
#
# Ubuntu/Debian:
# sudo apt update
# sudo apt install -y nodejs npm
#
# If your distro Node is too old, prefer installing Node 20 via nvm or NodeSource.
# Make changes to deployment/cdk/stacks/pipeline_stack.py
# Preview changes
cd deployment
cdk diff
# Apply changes
cdk deploy
cd ..
cd deployment
cdk destroy
# This deletes:
# - Lambda function
# - SQS queues
# - IAM roles
# Note: S3 bucket is retained (with your data)
ls generators/
# or
python scripts/submit.py --generator all --samples 0 # Will list and exit
Apache-2.0
Part of the Very Big Video Reasoning (VBVR) project
@article{vbvr2026,
title={A Very Big Video Reasoning Suite},
author={Wang, Maijunxian and Wang, Ruisi and Lin, Juyi and Ji, Ran and Wiedemer, Thaddäus and Gao, Qingying and Luo, Dezhi and Qian, Yaoyao and Huang, Lianyu and Hong, Zelong and Ge, Jiahui and Ma, Qianli and He, Hang and Zhou, Yifan and Guo, Lingzi and Mei, Lantao and Li, Jiachen and Xing, Hanwen and Zhao, Tianqi and Yu, Fengyuan and Xiao, Weihang and Jiao, Yizheng and Hou, Jianheng and Zhang, Danyang and Xu, Pengcheng and Zhong, Boyang and Zhao, Zehong and Fang, Gaoyun and Kitaoka, John and Xu, Yile and Xu, Hua and Blacutt, Kenton and Nguyen, Tin and Song, Siyuan and Sun, Haoran and Wen, Shaoyue and He, Linyang and Wang, Runming and Wang, Yanzhi and Yang, Mengyue and Ma, Ziqiao and Millière, Raphaël and Shi, Freda and Vasconcelos, Nuno and Khashabi, Daniel and Yuille, Alan and Du, Yilun and Liu, Ziming and Lin, Dahua and Liu, Ziwei and Kumar, Vikash and Li, Yijiang and Yang, Lei and Cai, Zhongang and Deng, Hokin},
year={2026}
}