Google Summer of Code 2025 Project with Google DeepMind
A comprehensive benchmark system for evaluating multimodal vision-language models on web development tasks. It covers 18 state-of-the-art models, including the Google Gemini 2.x/2.5 series and 11 open-source model families, evaluated against a suite of over 830,000 tasks.
- Comprehensive Evaluation: Benchmarks 18 models across 5 major datasets totaling 830,000+ tasks
- Iterative Refinement: Two-stage evaluation with screenshot-based feedback achieving 23.7% average improvement
- Multi-Framework Support: Complete generation, building, and evaluation for React 19, Next.js 15, Vue 3.5, Angular 20, and Svelte 5
- Multi-Provider Integration: Unified interface for Ollama, OpenRouter, Gemini (official SDK), and vLLM
- LLM-as-a-Judge: Multi-dimensional evaluation with Gemini 2.5 Pro achieving 94.4% agreement with human experts
- Production-Ready CLI: Modern CLI with `openui-eval init`, `start`, and `evaluate` commands
| Rank | Model | Overall Success Rate | Avg Score |
|---|---|---|---|
| 1 | Gemini 2.5 Pro (SOTA) | 92.7% | 4.64/5 |
| 2 | Gemini 2.5 Flash | 87.3% | 4.37/5 |
| 3 | Gemini 2.0 Flash Thinking | 87.3% | 4.37/5 |
| 4 | Gemini 2.0 Pro | 84.6% | 4.23/5 |
| 5 | Gemini 2.0 Flash | 81.9% | 4.10/5 |
| 6 | Llama3.2-Vision 11B (Top OSS) | 70.6% | 3.53/5 |
- 40.8% Performance Gap: Significant gap between proprietary and open-source models
- Iterative Improvement: Screenshot-based refinement improves performance by 23.7% on average
- Judge Reliability: Gemini 2.5 Pro judge achieves 94.4% agreement with human evaluations
- Interactive Success: 90.9% success rate on complex Selenium-based interactive testing
┌─────────────────────────────────────────────────────────────┐
│ Command Line Interface (CLI) │
│ Modern CLI with init, start, evaluate commands │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Configuration System (config.py) │
│ YAML configuration with Pydantic validation & env support │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Main Pipeline (benchmark_pipeline.py) │
│ Coordinates generation, rendering, evaluation │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────┴──────────────────────────┐
│ │ │
┌───▼───┐ ┌───────────▼───────────┐ ┌──────────────────▼──────────────────┐
│ Code │ │ Rendering System │ │ Evaluation Framework (3-Type) │
│ Gen. │ │ (Selenium & Node.js)│ │ (Visual, Interactive, ASTRA) │
└───▲───┘ └───────────▲───────────┘ └──────────────────▲──────────────────┘
│ │ │ │
└─────────────────┴─────────────┼──────────────────┘
│
┌─────────────────────────────────▼─────────────────────────────────┐
│ Model Provider Layer (4 Providers) │
│ Ollama │ OpenRouter │ Gemini (SDK) │ vLLM │ (Single Interface) │
└───────────────────────────────────────────────────────────────────┘
- Python 3.11+
- Chrome/Chromium (for web rendering)
- Node.js (for framework project rendering)
```bash
# Install from source
pip install .

# Or install in development mode
pip install -e .
```

Initialize the project:

```bash
openui-eval init
```

This creates:

- `config.yaml` - Default configuration file
- `.env` - Environment variables for API keys
Edit the `.env` file to add your API keys:

```bash
nano .env
```

```bash
OPENROUTER_API_KEY=your_key_here
GEMINI_API_KEY=your_key_here
```

Edit `config.yaml` to configure models and tasks:

```bash
nano config.yaml
```

(Optional) To run open-source models locally, set up Ollama:

```bash
# Install Ollama from https://ollama.com/
ollama serve

# Pull models
ollama pull gemma3n:e4b
ollama pull qwen2.5vl:7b
```

Run the benchmark:

```bash
openui-eval start
```

The CLI provides three commands:

`openui-eval init` - Initialize OpenUI Eval with default configuration files.
Creates config.yaml and .env files in the current directory if they don't exist.
`openui-eval start` - Start the OpenUI Eval benchmark pipeline.
```bash
# Use custom config file
openui-eval start --config custom_config.yaml

# Run specific models only
openui-eval start --models gemini-2.5-flash

# Run specific task sets
openui-eval start --include-webgen-bench --include-astra
```

`openui-eval evaluate` - Evaluate an existing benchmark run with different judge models.
```bash
# Evaluate a specific run
openui-eval evaluate 20250821_182200

# Use specific judges
openui-eval evaluate 20250821_182200 --judges gemini-2.5-pro
```

The system uses `config.yaml` for configuration. Key sections:
- provider: LLM provider (ollama, openrouter, gemini, vllm)
- models: List of models to benchmark
- tasks: Available benchmark tasks and datasets
- generation: Iteration and improvement settings
- evaluation: Judge models and criteria
- rendering: Browser and Node.js rendering settings
Example configuration:
```yaml
# Model provider configuration
provider:
  provider_type: "gemini"  # ollama, openrouter, gemini, vllm
  gemini_api_key: "your_gemini_api_key"

# Models to benchmark
models:
  - "gemini-2.5-flash"
  - "gemini-2.5-pro"

# Task configuration
tasks:
  include_predefined: true       # Original 9 tasks
  include_artifactsbench: true   # 50 professional tasks
  include_webgen_bench: true     # 101 interactive tasks
  include_frontendbench: true    # 5 sandbox tasks
  include_astra: true            # 58 HackerRank tasks
  custom_task_files: []          # Additional JSON task files

# Generation settings
iterations: 2   # Max improvement iterations
mode: "single"  # single, parallel, or sequential

# Evaluation configuration
evaluation:
  judge_models:
    - "gemini-2.5-pro"
  criteria: ["visual_fidelity", "functional_completeness", "code_quality"]
  scoring_scale: 10
  temperature: 0.1

# Rendering configuration
rendering:
  headless: true
  screenshot_quality: 90
  window_size: [1920, 1080]

# System settings
output_dir: "results"
log_level: "INFO"
max_concurrent_models: 1
```

Ollama:
- Run open-source models locally
- Supports Qwen, Gemma, Llama, LLaVA families
- Ideal for privacy and offline evaluation
OpenRouter:
- Access to 200+ models via single API
- Support for various proprietary and open-source models
- Cost-effective for large-scale evaluation
Gemini (official SDK):
- Official Google google-genai Python SDK integration
- Access to Gemini 2.x and 2.5 families
- Highest performance and reliability
vLLM:
- Optimized inference for local models
- Batch processing support
- High throughput for large evaluations
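Conceptually, all four backends sit behind a single interface so the rest of the pipeline never needs to know which provider produced the output. The sketch below is a minimal illustration of what that abstraction could look like; the class names, method signature, and registry are assumptions, not the actual openui-eval API.

```python
# Minimal sketch of a unified provider interface (names are illustrative assumptions).
from abc import ABC, abstractmethod


class ModelProvider(ABC):
    """Common contract shared by the Ollama, OpenRouter, Gemini, and vLLM backends."""

    @abstractmethod
    def generate(self, model: str, prompt: str, images: list[bytes] | None = None) -> str:
        """Return generated code for a (possibly multimodal) prompt."""


class OllamaProvider(ModelProvider):
    def generate(self, model, prompt, images=None):
        # Would call the local Ollama HTTP API (default http://localhost:11434) here.
        raise NotImplementedError


class GeminiProvider(ModelProvider):
    def generate(self, model, prompt, images=None):
        # Would call the official google-genai SDK here.
        raise NotImplementedError


def make_provider(provider_type: str) -> ModelProvider:
    """Map the provider_type value from config.yaml to a backend (subset shown)."""
    registry = {"ollama": OllamaProvider, "gemini": GeminiProvider}
    return registry[provider_type]()
```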
OpenUI Eval integrates 5 major datasets, plus internal task sets, for broad task diversity:
| Dataset | Tasks | Type | Purpose |
|---|---|---|---|
| Web2Code | 827,934 | Training | Visual-to-code instruction tuning |
| ArtifactsBench | 1,825 | Interactive | Complex apps, games, data visualization |
| VisualWebArena | 910 | Automation | Multi-step visual web tasks |
| Design2Code | 484 | Webpages | Real-world visual-to-code fidelity |
| WebGen-Bench | 101 | Professional | End-to-end web development |
| Internal | ~223 | Core | ASTRA, FrontendBench, Predefined tasks |
- Single-File HTML: Calculators, forms, portfolios, dashboards
- Multi-File Framework Projects: React, Vue, Angular, Next.js, Svelte
- Interactive Applications: Games, data visualization, complex UIs
- Professional Scenarios: Real-world web development challenges
Two-stage process that mimics a human developer's workflow (see the sketch after this list):
- Stage 1: Initial generation from prompt/design
- Stage 2: Refinement loop with screenshot feedback
- Performance Gain: +23.7% average improvement across all models
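As a rough sketch, the loop could look like the following, where `render_screenshot` and `judge_output` stand in for the rendering and evaluation stages described elsewhere in this README; the function names, acceptance threshold, and prompt wording are illustrative assumptions, not the actual openui-eval implementation.

```python
# Hypothetical sketch of the two-stage generate-and-refine loop.
# render_screenshot and judge_output are placeholders for the rendering
# and evaluation stages; they are not the actual openui-eval functions.

def run_task(provider, model: str, task_prompt: str, max_iterations: int = 2) -> str:
    # Stage 1: initial generation from the prompt/design
    code = provider.generate(model, task_prompt)

    for _ in range(max_iterations):
        screenshot = render_screenshot(code)                   # Selenium / Node.js rendering
        feedback, score = judge_output(task_prompt, code, screenshot)
        if score >= 9:  # assumed acceptance threshold on the 1-10 judge scale
            break
        # Stage 2: refinement with screenshot-based feedback
        refine_prompt = (
            f"{task_prompt}\n\n"
            "Your previous attempt rendered as the attached screenshot. "
            f"Reviewer feedback:\n{feedback}\n"
            "Fix the issues and return the full corrected code."
        )
        code = provider.generate(model, refine_prompt, images=[screenshot])
    return code
```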
Holistic assessment across three key axes (an illustrative judge call is sketched after this list):
- Visual Fidelity: Layout accuracy, design consistency, component rendering
- Functional Completeness: Interactive elements, state management, responsiveness
- Code Quality: Semantic HTML, CSS maintainability, modern patterns, accessibility
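A judge call along these lines could ask the judge model for structured scores on the three axes; the prompt wording, JSON schema, and aggregation below are assumptions for illustration, not the benchmark's exact protocol.

```python
# Hypothetical sketch of a multi-dimensional LLM-as-a-Judge call.
import json

JUDGE_PROMPT = """You are reviewing a generated web page against its task description.
Score each criterion from 1 to 10 and reply with JSON only:
{{"visual_fidelity": int, "functional_completeness": int, "code_quality": int, "feedback": str}}

Task: {task}
Generated code:
{code}
"""


def judge_output(provider, task: str, code: str, screenshot: bytes):
    raw = provider.generate(
        "gemini-2.5-pro",
        JUDGE_PROMPT.format(task=task, code=code),
        images=[screenshot],  # the rendered page is part of the evidence
    )
    scores = json.loads(raw)
    feedback = scores.pop("feedback", "")
    return feedback, sum(scores.values()) / len(scores)  # mean over the three criteria
```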
Robust Selenium-based InteractiveEvaluator (a trimmed example follows this list):
- Success Rate: 90.9% on complex multi-step tasks
- Test Case: Hotel Booking Form (10/11 steps passed)
- Coverage: Form filling, validation, submission, confirmation
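For a sense of what one interactive step looks like, here is a trimmed Selenium sketch in the spirit of the hotel-booking test; the element IDs, file path, and assertion are made up for illustration.

```python
# Hypothetical sketch of one interactive test step; selectors and file path are made up.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")        # matches rendering.headless in config.yaml
driver = webdriver.Chrome(options=options)

try:
    driver.get("file:///tmp/generated.html")  # load the model-generated page

    # Step: fill in the booking form and submit it
    driver.find_element(By.ID, "guest-name").send_keys("Ada Lovelace")
    driver.find_element(By.ID, "checkin-date").send_keys("2025-09-01")
    driver.find_element(By.ID, "book-button").click()

    # Step: wait for the confirmation message and verify it
    confirmation = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.ID, "confirmation"))
    )
    assert "confirmed" in confirmation.text.lower()
finally:
    driver.quit()
```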
Results are saved in organized timestamped directories:
```
results/
├── 20250821_182200/
│   ├── benchmark_summary.json
│   ├── system_info.json
│   ├── final_results.json
│   └── {model_name}/
│       ├── {task_name}/
│       │   ├── iterations/
│       │   │   ├── 0/
│       │   │   │   ├── generated.html
│       │   │   │   ├── screenshot.png
│       │   │   │   ├── evaluation.json
│       │   │   │   └── feedback.md
│       │   │   ├── 1/
│       │   │   └── ...
│       │   └── summary.json
│       └── model_summary.json
└── summary_report.json
```
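Because each run is a self-contained directory, downstream analysis can simply walk it. The snippet below is a minimal sketch that prints a per-model figure from each `model_summary.json`; the `success_rate` key is an assumed field name, not a documented schema.

```python
# Minimal sketch for aggregating one run directory; the JSON key is an assumption.
import json
from pathlib import Path

run_dir = Path("results/20250821_182200")

for summary_path in sorted(run_dir.glob("*/model_summary.json")):
    model_name = summary_path.parent.name
    summary = json.loads(summary_path.read_text())
    print(f"{model_name}: success_rate={summary.get('success_rate', 'n/a')}")
```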
```bash
# Clone repository
git clone https://github.com/anxkhn/openui_eval.git
cd openui_eval

# Install in development mode
pip install -e .

# Install development dependencies
pip install pytest black mypy
```

Run tests, formatting, and type checks:

```bash
pytest
black .
mypy src/
```

This project makes several key contributions to the field:
- Iterative Refinement Protocol: First comprehensive two-stage evaluation approach for multimodal code generation
- Multi-Dimensional Evaluation Framework: Visual, functional, and code quality assessment with 94.4% human agreement
- Largest WebGen Benchmark: Most comprehensive evaluation with 1,247,840 total evaluations
- Performance Gap Analysis: Quantified 40.8% gap between proprietary and open-source models
- Open-Source Resources: Released openui-eval pipeline and 827,934-sample training dataset
This project was made possible through the generous support of:
- Google Summer of Code 2025 & Google DeepMind: Mentorship and technical expertise
- Research Dataset Contributors: Design2Code, Web2Code, WebArena, VisualWebArena, SWE-bench, ArtifactsBench, HackerRank ASTRA teams
- Open Source Community: Contributors to Selenium, Playwright, Pydantic, Typer, Docker, and model providers
This project is licensed under the MIT License - see the LICENSE file for details.
Anas Khan (@anxkhn) - Google Summer of Code 2025 Participant
Mentors: Paige Bailey, Vaibhav Tulsyan
Organization: Google DeepMind