Google Summer of Code 2025 Project with Google DeepMind
A comprehensive benchmark system for evaluating multimodal vision-language models on web development tasks. It covers 18 state-of-the-art models, including the Google Gemini 2.x/2.5 series and 11 open-source model families, evaluated against a suite of over 830,000 tasks.
- Comprehensive Evaluation: Benchmarks 18 models across 5 major datasets totaling 830,000+ tasks
- Iterative Refinement: Two-stage evaluation with screenshot-based feedback achieving 23.7% average improvement
- Multi-Framework Support: Complete generation, building, and evaluation for React 19, Next.js 15, Vue 3.5, Angular 20, and Svelte 5
- Multi-Provider Integration: Unified interface for Ollama, OpenRouter, Gemini (official SDK), and vLLM
- LLM-as-a-Judge: Multi-dimensional evaluation with Gemini 2.5 Pro achieving 94.4% agreement with human experts
- Production-Ready CLI: Modern CLI with `openui-eval init`, `start`, and `evaluate` commands
| Rank | Model | Overall Success Rate | Avg Score |
|---|---|---|---|
| 1 | Gemini 2.5 Pro (SOTA) | 92.7% | 4.64/5 |
| 2 | Gemini 2.5 Flash | 87.3% | 4.37/5 |
| 3 | Gemini 2.0 Flash Thinking | 87.3% | 4.37/5 |
| 4 | Gemini 2.0 Pro | 84.6% | 4.23/5 |
| 5 | Gemini 2.0 Flash | 81.9% | 4.10/5 |
| 6 | Llama3.2-Vision 11B (Top OSS) | 70.6% | 3.53/5 |
- 40.8% Performance Gap: Significant gap between proprietary and open-source models
- Iterative Improvement: Screenshot-based refinement improves performance by 23.7% on average
- Judge Reliability: Gemini 2.5 Pro judge achieves 94.4% agreement with human evaluations
- Interactive Success: 90.9% success rate on complex Selenium-based interactive testing
┌─────────────────────────────────────────────────────────────┐
│ Command Line Interface (CLI) │
│ Modern CLI with init, start, evaluate commands │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Configuration System (config.py) │
│ YAML configuration with Pydantic validation & env support │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Main Pipeline (benchmark_pipeline.py) │
│ Coordinates generation, rendering, evaluation │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┴──────────────────────────┐
│ │ │
┌───▼───┐ ┌───────────▼───────────┐ ┌──────────────────▼──────────────────┐
│ Code │ │ Rendering System │ │ Evaluation Framework (3-Type) │
│ Gen. │ │ (Selenium & Node.js)│ │ (Visual, Interactive, ASTRA) │
└───▲───┘ └───────────▲───────────┘ └──────────────────▲──────────────────┘
│ │ │ │
└─────────────────┴─────────────┼──────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ Model Provider Layer (4 Providers) │
│ Ollama │ OpenRouter │ Gemini (SDK) │ vLLM │ (Single Interface) │
└───────────────────────────────────────────────────────────────────┘
- Python 3.11+
- Chrome/Chromium (for web rendering)
- Node.js (for framework project rendering)
```bash
# Install from source
pip install .

# Or install in development mode
pip install -e .
```

Initialize the project:

```bash
openui-eval init
```

This creates:

- `config.yaml` - Default configuration file
- `.env` - Environment variables for API keys
Edit the `.env` file to add your API keys:

```bash
nano .env
```

```bash
OPENROUTER_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here
```

Edit `config.yaml` to configure models and tasks:

```bash
nano config.yaml
```

(Optional) To run open-source models locally, set up Ollama:

```bash
# Install Ollama from https://ollama.com/
ollama serve

# Pull models
ollama pull gemma3n:e4b
ollama pull qwen2.5vl:7b
```

Run the benchmark:

```bash
openui-eval start
```

The CLI provides three commands:

`openui-eval init` - Initialize OpenUI Eval with default configuration files.
Creates config.yaml and .env files in the current directory if they don't exist.
`openui-eval start` - Start the OpenUI Eval benchmark pipeline.
```bash
# Use custom config file
openui-eval start --config custom_config.yaml

# Run specific models only
openui-eval start --models gemini-2.5-flash

# Run specific task sets
openui-eval start --include-webgen-bench --include-astra
```

`openui-eval evaluate` - Evaluate an existing benchmark run with different judge models.
```bash
# Evaluate a specific run
openui-eval evaluate 20250821_182200

# Use specific judges
openui-eval evaluate 20250821_182200 --judges gemini-2.5-pro
```

The system uses `config.yaml` for configuration. Key sections:
- provider: LLM provider (ollama, openrouter, gemini, vllm)
- models: List of models to benchmark
- tasks: Available benchmark tasks and datasets
- generation: Iteration and improvement settings
- evaluation: Judge models and criteria
- rendering: Browser and Node.js rendering settings
Example configuration:
```yaml
# Model provider configuration
provider:
  provider_type: "gemini"  # ollama, openrouter, gemini, vllm
  gemini_api_key: "your_gemini_api_key"

# Models to benchmark
models:
  - "gemini-2.5-flash"
  - "gemini-2.5-pro"

# Task configuration
tasks:
  include_predefined: true       # Original 9 tasks
  include_artifactsbench: true   # 50 professional tasks
  include_webgen_bench: true     # 101 interactive tasks
  include_frontendbench: true    # 5 sandbox tasks
  include_astra: true            # 58 HackerRank tasks
  custom_task_files: []          # Additional JSON task files

# Generation settings
iterations: 2   # Max improvement iterations
mode: "single"  # single, parallel, or sequential

# Evaluation configuration
evaluation:
  judge_models:
    - "gemini-2.5-pro"
  criteria: ["visual_fidelity", "functional_completeness", "code_quality"]
  scoring_scale: 10
  temperature: 0.1

# Rendering configuration
rendering:
  headless: true
  screenshot_quality: 90
  window_size: [1920, 1080]

# System settings
output_dir: "results"
log_level: "INFO"
max_concurrent_models: 1
```

Ollama:
- Run open-source models locally
- Supports Qwen, Gemma, Llama, LLaVA families
- Ideal for privacy and offline evaluation
OpenRouter:
- Access to 200+ models via single API
- Support for various proprietary and open-source models
- Cost-effective for large-scale evaluation
Gemini (official SDK):
- Official Google google-genai Python SDK integration
- Access to Gemini 2.x and 2.5 families
- Highest performance and reliability
vLLM:
- Optimized inference for local models
- Batch processing support
- High throughput for large evaluations
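Conceptually, all four backends sit behind a single interface so the rest of the pipeline never needs to know which provider produced the output. The sketch below is a minimal illustration of what that abstraction could look like; the class names, method signature, and registry are assumptions, not the actual openui-eval API.

```python
# Minimal sketch of a unified provider interface (names are illustrative assumptions).
from abc import ABC, abstractmethod


class ModelProvider(ABC):
    """Common contract shared by the Ollama, OpenRouter, Gemini, and vLLM backends."""

    @abstractmethod
    def generate(self, model: str, prompt: str, images: list[bytes] | None = None) -> str:
        """Return generated code for a (possibly multimodal) prompt."""


class OllamaProvider(ModelProvider):
    def generate(self, model, prompt, images=None):
        # Would call the local Ollama HTTP API (default http://localhost:11434) here.
        raise NotImplementedError


class GeminiProvider(ModelProvider):
    def generate(self, model, prompt, images=None):
        # Would call the official google-genai SDK here.
        raise NotImplementedError


def make_provider(provider_type: str) -> ModelProvider:
    """Map the provider_type value from config.yaml to a backend (subset shown)."""
    registry = {"ollama": OllamaProvider, "gemini": GeminiProvider}
    return registry[provider_type]()
```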
OpenUI Eval integrates 5 major datasets, plus internal task sets, for broad task diversity:
| Dataset | Tasks | Type | Purpose |
|---|---|---|---|
| Web2Code | 827,934 | Training | Visual-to-code instruction tuning |
| ArtifactsBench | 1,825 | Interactive | Complex apps, games, data visualization |
| VisualWebArena | 910 | Automation | Multi-step visual web tasks |
| Design2Code | 484 | Webpages | Real-world visual-to-code fidelity |
| WebGen-Bench | 101 | Professional | End-to-end web development |
| Internal | ~223 | Core | ASTRA, FrontendBench, Predefined tasks |
- Single-File HTML: Calculators, forms, portfolios, dashboards
- Multi-File Framework Projects: React, Vue, Angular, Next.js, Svelte
- Interactive Applications: Games, data visualization, complex UIs
- Professional Scenarios: Real-world web development challenges
Two-stage process that mimics a human developer's workflow (see the sketch after this list):
- Stage 1: Initial generation from prompt/design
- Stage 2: Refinement loop with screenshot feedback
- Performance Gain: +23.7% average improvement across all models
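As a rough sketch, the loop could look like the following, where `render_screenshot` and `judge_output` stand in for the rendering and evaluation stages described elsewhere in this README; the function names, acceptance threshold, and prompt wording are illustrative assumptions, not the actual openui-eval implementation.

```python
# Hypothetical sketch of the two-stage generate-and-refine loop.
# render_screenshot and judge_output are placeholders for the rendering
# and evaluation stages; they are not the actual openui-eval functions.

def run_task(provider, model: str, task_prompt: str, max_iterations: int = 2) -> str:
    # Stage 1: initial generation from the prompt/design
    code = provider.generate(model, task_prompt)

    for _ in range(max_iterations):
        screenshot = render_screenshot(code)                   # Selenium / Node.js rendering
        feedback, score = judge_output(task_prompt, code, screenshot)
        if score >= 9:  # assumed acceptance threshold on the 1-10 judge scale
            break
        # Stage 2: refinement with screenshot-based feedback
        refine_prompt = (
            f"{task_prompt}\n\n"
            "Your previous attempt rendered as the attached screenshot. "
            f"Reviewer feedback:\n{feedback}\n"
            "Fix the issues and return the full corrected code."
        )
        code = provider.generate(model, refine_prompt, images=[screenshot])
    return code
```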
Holistic assessment across three key axes (an illustrative judge call is sketched after this list):
- Visual Fidelity: Layout accuracy, design consistency, component rendering
- Functional Completeness: Interactive elements, state management, responsiveness
- Code Quality: Semantic HTML, CSS maintainability, modern patterns, accessibility
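A judge call along these lines could ask the judge model for structured scores on the three axes; the prompt wording, JSON schema, and aggregation below are assumptions for illustration, not the benchmark's exact protocol.

```python
# Hypothetical sketch of a multi-dimensional LLM-as-a-Judge call.
import json

JUDGE_PROMPT = """You are reviewing a generated web page against its task description.
Score each criterion from 1 to 10 and reply with JSON only:
{{"visual_fidelity": int, "functional_completeness": int, "code_quality": int, "feedback": str}}

Task: {task}
Generated code:
{code}
"""


def judge_output(provider, task: str, code: str, screenshot: bytes):
    raw = provider.generate(
        "gemini-2.5-pro",
        JUDGE_PROMPT.format(task=task, code=code),
        images=[screenshot],  # the rendered page is part of the evidence
    )
    scores = json.loads(raw)
    feedback = scores.pop("feedback", "")
    return feedback, sum(scores.values()) / len(scores)  # mean over the three criteria
```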
Robust Selenium-based InteractiveEvaluator (a trimmed example follows this list):
- Success Rate: 90.9% on complex multi-step tasks
- Test Case: Hotel Booking Form (10/11 steps passed)
- Coverage: Form filling, validation, submission, confirmation
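For a sense of what one interactive step looks like, here is a trimmed Selenium sketch in the spirit of the hotel-booking test; the element IDs, file path, and assertion are made up for illustration.

```python
# Hypothetical sketch of one interactive test step; selectors and file path are made up.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")        # matches rendering.headless in config.yaml
driver = webdriver.Chrome(options=options)

try:
    driver.get("file:///tmp/generated.html")  # load the model-generated page

    # Step: fill in the booking form and submit it
    driver.find_element(By.ID, "guest-name").send_keys("Ada Lovelace")
    driver.find_element(By.ID, "checkin-date").send_keys("2025-09-01")
    driver.find_element(By.ID, "book-button").click()

    # Step: wait for the confirmation message and verify it
    confirmation = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "confirmation"))
    )
    assert "confirmed" in confirmation.text.lower()
finally:
    driver.quit()
```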
Results are saved in organized timestamped directories:
```
results/
├── 20250821_182200/
│   ├── benchmark_summary.json
│   ├── system_info.json
│   ├── final_results.json
│   └── {model_name}/
│       ├── {task_name}/
│       │   ├── iterations/
│       │   │   ├── 0/
│       │   │   │   ├── generated.html
│       │   │   │   ├── screenshot.png
│       │   │   │   ├── evaluation.json
│       │   │   │   └── feedback.md
│       │   │   ├── 1/
│       │   │   └── ...
│       │   └── summary.json
│       └── model_summary.json
└── summary_report.json
```
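Because each run is a self-contained directory, downstream analysis can simply walk it. The snippet below is a minimal sketch that prints a per-model figure from each `model_summary.json`; the `success_rate` key is an assumed field name, not a documented schema.

```python
# Minimal sketch for aggregating one run directory; the JSON key is an assumption.
import json
from pathlib import Path

run_dir = Path("results/20250821_182200")

for summary_path in sorted(run_dir.glob("*/model_summary.json")):
    model_name = summary_path.parent.name
    summary = json.loads(summary_path.read_text())
    print(f"{model_name}: success_rate={summary.get('success_rate', 'n/a')}")
```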
```bash
# Clone repository
git clone https://github.com/anxkhn/openui_eval.git
cd openui_eval

# Install in development mode
pip install -e .

# Install development dependencies
pip install pytest black mypy
```

Run tests, formatting, and type checks:

```bash
pytest
black .
mypy src/
```

This project makes several key contributions to the field:
- Iterative Refinement Protocol: First comprehensive two-stage evaluation approach for multimodal code generation
- Multi-Dimensional Evaluation Framework: Visual, functional, and code quality assessment with 94.4% human agreement
- Largest WebGen Benchmark: Most comprehensive evaluation with 1,247,840 total evaluations
- Performance Gap Analysis: Quantified 40.8% gap between proprietary and open-source models
- Open-Source Resources: Released openui-eval pipeline and 827,934-sample training dataset
This project was made possible through the generous support of:
- Google Summer of Code 2025 & Google DeepMind: Mentorship and technical expertise
- Research Dataset Contributors: Design2Code, Web2Code, WebArena, VisualWebArena, SWE-bench, ArtifactsBench, HackerRank ASTRA teams
- Open Source Community: Contributors to Selenium, Playwright, Pydantic, Typer, Docker, and model providers
This project is licensed under the MIT License - see the LICENSE file for details.
Anas Khan (@anxkhn) - Google Summer of Code 2025 Participant
Mentors: Paige Bailey, Vaibhav Tulsyan
Organization: Google DeepMind