feat: three-phase image build with local docker for 4x speedup #555
simonrosenberg wants to merge 5 commits into main from
Conversation
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Solves a real problem (6h→52min) but ships 700 lines of untested concurrency code into production build infrastructure.
```python
if __name__ == "__main__":
    sys.exit(main())
```
🔴 Critical - Testing Gap: This entire 702-line file has zero tests. This is complex build orchestration code with:
- Concurrent ProcessPoolExecutor workflows
- Retry logic with exponential backoff
- Docker build/push operations
- State management across phases
If this breaks, it blocks all image builds. At minimum, add integration tests that:
- Build a single base image end-to-end
- Assemble a single agent image end-to-end
- Handle a build failure and verify retry logic
- Verify the manifest.jsonl output format
This is build infrastructure - test it like your production depends on it (because it does).
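One of the suggested checks - verifying the `manifest.jsonl` output format - can be sketched as a small self-contained test. The record fields here are assumptions for illustration, not taken from the PR:

```python
import io
import json

def write_manifest(fp, records):
    """Write one JSON object per line (JSONL)."""
    for rec in records:
        fp.write(json.dumps(rec) + "\n")

# Hypothetical record shape; the real BuildOutput fields may differ.
buf = io.StringIO()
write_manifest(buf, [
    {"base_image": "ghcr.io/example/base:tag", "tags": ["t1"], "error": None},
])

lines = buf.getvalue().splitlines()
assert len(lines) == 1
record = json.loads(lines[0])
assert set(record) == {"base_image", "tags", "error"}
```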
```python
logger.info("[assembly] Building: %s", " ".join(build_cmd))
build_started = time.monotonic()
proc = subprocess.run(build_cmd, text=True, capture_output=True)
build_seconds = round(time.monotonic() - build_started, 3)

if proc.stdout:
    print(proc.stdout, end="")
if proc.stderr:
    print(proc.stderr, end="", file=sys.stderr)

if proc.returncode != 0:
    error = (
        proc.stderr.strip()
        or proc.stdout.strip()
        or f"docker build failed with exit code {proc.returncode}"
    )
    logger.info(
        "[assembly] FAILED %s: build_seconds=%.1f error=%s",
        tag_label,
        build_seconds,
        error[:200],
    )
    return BuildOutput(base_image=base_tag, tags=[], error=error)

# Step 2: docker push each tag
```
🟠 Important - Error Handling: This push loop fails fast on the first error:

```python
if push:
    for t in final_tags:
        # ... push ...
        if push_proc.returncode != 0:
            return BuildOutput(..., error=error)
```

If you have 3 tags and push #1 fails but #2 and #3 would succeed, you lose that information and waste the successful docker build.
Better approach:

```python
failed_tags = []
for t in final_tags:
    # ... push ...
    if push_proc.returncode != 0:
        failed_tags.append((t, error))
if failed_tags:
    return BuildOutput(..., error=f"Failed to push {len(failed_tags)}/{len(final_tags)} tags: {failed_tags}")
```

At least report partial success rather than treating it as total failure.
```python
str(ctx / "Dockerfile"),
"--target",
"builder",
```
🟡 Suggestion - Silent Cleanup Failure: shutil.rmtree(ctx, ignore_errors=True) silently swallows errors. If cleanup fails (disk full? permission issue?), you won't know until subsequent builds fail mysteriously.
Suggested change:

```python
try:
    shutil.rmtree(ctx)
except Exception as e:
    logger.warning("Failed to cleanup build context %s: %s", ctx, e)
```
```python
tag = builder_image_tag(builder_image)

if remote_image_exists(tag):
    logger.info("Builder image %s already exists. Skipping.", tag)
```
🟡 Suggestion - Data Structure Misuse: `BuildOutput(base_image="builder", ...)` is semantically wrong. The builder image isn't a base image - this violates the principle of making data structures match your problem domain.
Either:
- Add a separate `BuilderOutput` type, or
- Make `base_image` optional and add a `builder_image` field

Using "builder" as a magic string here will confuse anyone reading logs or debugging build failures.
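A minimal sketch of the second option, assuming a dataclass-based `BuildOutput` (field names beyond those visible in the diff are illustrative):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BuildOutput:
    tags: List[str] = field(default_factory=list)
    error: Optional[str] = None
    base_image: Optional[str] = None     # set by Phase 1/2 results
    builder_image: Optional[str] = None  # set by Phase 0 (builder) results

# Phase 0 result no longer needs the "builder" magic string:
result = BuildOutput(
    builder_image="ghcr.io/openhands/eval-builder:abc123",
    tags=["ghcr.io/openhands/eval-builder:abc123"],
)
assert result.base_image is None
assert result.builder_image.endswith("abc123")
```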
```yaml
      echo "Warning: /var/lib/buildkit not found; skipping disk check"
    fi

- name: "Phase 0: Build and push builder image (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |
    set -euo pipefail
    uv run python -c "
    from benchmarks.swebench.build_base_images import build_builder_image
    result = build_builder_image(push=True)
    if result.error:
        raise RuntimeError(f'Builder build failed: {result.error}')
    print(f'Builder image: {result.tags[0]}')
    # Export for next steps
    import os
    with open(os.environ['GITHUB_ENV'], 'a') as f:
        f.write(f'BUILDER_TAG={result.tags[0]}\n')
    "
  env:
    DOCKER_BUILDKIT: 1
    BUILDKIT_PROGRESS: plain

- name: "Phase 1: Build and push base images (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |
    set -euo pipefail

    CMD="uv run benchmarks/swebench/build_base_images.py \
      --dataset '${DATASET}' \
      --split '${SPLIT}' \
      --push \
      --max-workers '${MAX_WORKERS}' \
      --max-retries '${MAX_RETRIES}'"

    if [ -n "${N_LIMIT}" ]; then
      CMD="$CMD --n-limit '${N_LIMIT}'"
    fi
    if [ -n "${SELECT_FILE}" ]; then
      CMD="$CMD --select '${SELECT_FILE}'"
    fi

    echo "Running: $CMD"
    eval "$CMD"
  env:
    DOCKER_BUILDKIT: 1
    BUILDKIT_PROGRESS: plain

- name: "Phase 2: Assemble agent images (pre-built base mode)"
  if: ${{ env.USE_PREBUILT_BASES == 'true' }}
  run: |
    set -euo pipefail
    FORCE_BUILD="${{ inputs.force-build || 'false' }}"

    uv run python -c "
    from benchmarks.swebench.build_images import collect_unique_base_images
    from benchmarks.swebench.build_base_images import assemble_all_agent_images
    from benchmarks.utils.build_utils import default_build_output_dir
    import sys

    base_images = collect_unique_base_images(
        '${DATASET}', '${SPLIT}',
        int('${N_LIMIT}') if '${N_LIMIT}' else 0,
        '${SELECT_FILE}' if '${SELECT_FILE}' else None,
    )
    build_dir = default_build_output_dir('${DATASET}', '${SPLIT}')
    rc = assemble_all_agent_images(
        base_images=base_images,
        builder_tag='${BUILDER_TAG}',
        build_dir=build_dir,
        push=True,
        max_workers=int('${MAX_WORKERS}'),
        max_retries=int('${MAX_RETRIES}'),
        force_build=('${FORCE_BUILD}' == 'true'),
    )
    sys.exit(rc)
    "
  env:
    DOCKER_BUILDKIT: 1
    BUILDKIT_PROGRESS: plain
    FORCE_BUILD: ${{ inputs.force-build || 'false' }}
```
🟡 Suggestion - Workflow Complexity: These three phases add 80+ lines of bash/Python/YAML mixing with complex conditionals:
- `${{ env.USE_PREBUILT_BASES == 'true' }}` guards on every step
- Inline Python in bash heredocs
- Multi-line `uv run python -c` blocks

This will be painful to debug when it breaks in CI. Consider:
- Extract the phase logic into a single Python script (`build_with_phases.py --phase 0|1|2`)
- Keep workflow steps simple: just call the script with appropriate args
- Makes local reproduction easier: `uv run build_with_phases.py --phase 1`

Simple scripts are easier to test and debug than workflow YAML.
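The suggested single entry point could look like this sketch. The script name `build_with_phases.py` is the reviewer's hypothetical, and the phase handlers are stubs standing in for the real build functions:

```python
import argparse
import sys

def run_phase(phase: int) -> int:
    # In the real script each branch would call the corresponding
    # build function (builder image, base images, agent assembly)
    # and return its exit code; stubs stand in here.
    handlers = {
        0: lambda: 0,  # build builder image
        1: lambda: 0,  # build base images
        2: lambda: 0,  # assemble agent images
    }
    return handlers[phase]()

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Phased image build")
    parser.add_argument("--phase", type=int, choices=[0, 1, 2], required=True)
    args = parser.parse_args(argv)
    return run_phase(args.phase)

if __name__ == "__main__":
    sys.exit(main())
```

Each workflow step then reduces to a one-line `uv run build_with_phases.py --phase N` call with the same env vars as today.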
```yaml
- name: "Phase 0: Build and push builder image (pre-built base mode)"
  if: ${{ inputs.use-prebuilt-bases == true }}
  run: |
    set -euo pipefail
    uv run python -c "
    from benchmarks.swebench.build_base_images import build_builder_image
    result = build_builder_image(push=True)
    if result.error:
        raise RuntimeError(f'Builder build failed: {result.error}')
    print(f'Builder image: {result.tags[0]}')
    import os
    with open(os.environ['GITHUB_ENV'], 'a') as f:
        f.write(f'BUILDER_TAG={result.tags[0]}\n')
        f.write('USE_PREBUILT_BASES=true\n')
    "
  env:
    DOCKER_BUILDKIT: 1
```
🟢 Acceptable - Inconsistent Conditionals: SWE-bench workflow uses ${{ env.USE_PREBUILT_BASES == 'true' }} (string comparison), but SWT-bench uses ${{ inputs.use-prebuilt-bases == true }} (boolean comparison).
Both work, but the inconsistency makes maintenance harder. Pick one pattern and use it everywhere.
(This is pragmatic given time constraints, but worth standardizing in a follow-up.)
Addressed review feedback in 5592f9f:

Fixed:

Acknowledged, deferred:
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Solves a real problem (6h→52min) but ships 700 lines of untested concurrency code into production build infrastructure.
Good: Phased caching strategy is solid engineering. 4x speedup with real evidence.
Concerns: Two previous issues are now fixed (cleanup handling, partial push failures), but the testing gap remains critical and there's a new architectural dependency.
```python
def _get_sdk_dockerfile() -> Path:
    """Locate the SDK Dockerfile from the vendor submodule."""
    benchmarks_root = Path(__file__).resolve().parent.parent.parent
    dockerfile = (
        benchmarks_root
        / "vendor"
        / "software-agent-sdk"
        / "openhands-agent-server"
        / "openhands"
        / "agent_server"
        / "docker"
        / "Dockerfile"
    )
    if not dockerfile.exists():
        raise FileNotFoundError(
            f"SDK Dockerfile not found at {dockerfile}. "
            "Make sure submodules are initialized."
```
🟠 Important - Fragile Path Traversal: Hard-coded relative path with multiple parent/child hops is brittle. If directory structure changes or submodule location moves, this breaks silently.
Suggested change:

```python
# Use configured submodule path or fall back to discovery
BENCHMARKS_ROOT = Path(__file__).resolve().parent.parent.parent
SDK_DOCKERFILE = BENCHMARKS_ROOT / "vendor" / "software-agent-sdk" / "openhands-agent-server" / "openhands" / "agent_server" / "docker" / "Dockerfile"

def _get_sdk_dockerfile() -> Path:
    if not SDK_DOCKERFILE.exists():
        raise FileNotFoundError(
            f"SDK Dockerfile not found at {SDK_DOCKERFILE}. "
            "Make sure submodules are initialized."
        )
    return SDK_DOCKERFILE
```

Move the path computation to module level so it's visible and fails fast at import time.
```python
    push: bool = False,
    git_sha: str = "unknown",
) -> BuildOutput:
    """Assemble a final agent image from pre-built base + builder (Phase 2).

    Uses local `docker build` + `docker push` instead of `docker buildx build`
    to avoid the remote Blacksmith buildx driver overhead (~400s/image).
    The local Docker daemon caches pulled base/builder layers across builds.
    """
    import time

    if not AGENT_LAYER_DOCKERFILE.exists():
        return BuildOutput(
            base_image=base_tag,
            tags=[],
            error=f"Agent layer Dockerfile not found at {AGENT_LAYER_DOCKERFILE}",
        )

    tag_label = final_tags[0] if final_tags else base_tag
    overall_started = time.monotonic()
```
🔴 Critical - New Architectural Dependency: This phase requires a local Docker daemon (`docker build` + `docker push`), while the standard path can use remote buildx drivers.
This is a fundamental trade-off:
- ✅ Benefit: Local layer caching → 4x speedup
- ⚠️ Risk: Breaks in environments without local Docker daemon (some CI runners, remote build farms)

Document this requirement clearly:
- Add a comment block explaining the local-daemon dependency
- Check `docker info` before running Phase 2 and fail fast with a clear error if the daemon is unreachable
- Document in the PR description that phased mode requires local Docker access

Example pre-flight check:

```python
def _check_docker_daemon():
    proc = subprocess.run(["docker", "info"], capture_output=True)
    if proc.returncode != 0:
        raise RuntimeError(
            "Local Docker daemon required for phased builds. "
            "Ensure Docker is running or use --no-prebuilt-bases."
        )
```
```python
if __name__ == "__main__":
    sys.exit(main())
```
🔴 Critical - Testing Gap Remains: This 736-line file orchestrates concurrent builds with retry logic, Docker operations, and state management across three phases. Zero tests.
This is production build infrastructure that blocks all image builds when it breaks. The 4x speedup is worthless if a regression takes down CI for hours.
Minimum viable tests:
- Unit: `extract_custom_tag`, `base_image_tag`, `builder_image_tag` (pure functions)
- Integration: Mock `subprocess.run` to verify correct Docker commands are constructed
- End-to-end: Build one test image through all three phases in CI

You don't need 100% coverage, but you need something to catch basic regressions. The concurrency code is especially fragile - test that partial failures are collected correctly and retries work.
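The "mock `subprocess.run`" idea can be sketched like this; `build_and_push` is a stand-in for the real assembly helper, not the PR's actual function:

```python
import subprocess
from unittest import mock

def build_and_push(tag: str) -> int:
    """Stand-in for the real assembly helper under test."""
    proc = subprocess.run(
        ["docker", "build", "-t", tag, "."],
        text=True, capture_output=True,
    )
    return proc.returncode

def test_build_constructs_expected_command():
    with mock.patch("subprocess.run") as run:
        run.return_value = subprocess.CompletedProcess(args=[], returncode=0)
        assert build_and_push("ghcr.io/example/img:dev") == 0
        cmd = run.call_args.args[0]  # the argv list passed to subprocess.run
        assert cmd[:2] == ["docker", "build"]
        assert "ghcr.io/example/img:dev" in cmd

test_build_constructs_expected_command()
```

No Docker daemon or registry credentials are needed, so a check like this can run on any CI runner.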
```python
"--file",
str(ctx / "Dockerfile"),
"--target",
"builder",
```
🟢 Acceptable - Fixed: Previous review flagged silent cleanup failure. This now logs warnings on cleanup errors. Good fix.
```python
for t in final_tags:
    push_cmd = ["docker", "push", t]
    logger.info("[assembly] Pushing: %s", t)
    push_started = time.monotonic()
    push_proc = subprocess.run(push_cmd, text=True, capture_output=True)
    push_seconds += time.monotonic() - push_started

    if push_proc.stdout:
        print(push_proc.stdout, end="")
```
🟢 Acceptable - Fixed: Previous review flagged fail-fast push loop. This now collects partial failures in failed_pushes and reports which tags succeeded. Good pragmatic fix.
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Real 4x speedup (6h→52min) with solid phase separation. Two critical fixes verified (cleanup warnings, partial push handling). One blocking issue remains: 700+ lines of production build infrastructure with zero tests. The local Docker dependency is acceptable for GitHub Actions but needs documentation.
Key Insight: You've solved a real problem with pragmatic caching, but untested CI infrastructure is a ticking bomb. The 4x speedup is worthless if this breaks silently and blocks all builds.
all-hands-bot
left a comment
Taste Rating: 🟡 Acceptable - Real 4x speedup (6h→52min) with solid evidence from workflow runs. Previous critical issues (cleanup warnings, partial push handling, testing gap) are now resolved. The mock-heavy tests are pragmatic for build infrastructure that requires Docker/registry interactions.
Key Insight: You've proven this works at scale (500+ images) and the local Docker trade-off is well-documented. The main concerns from previous reviews have been addressed.
```python
log_path = tmp_path / "log.txt"
log_path.touch()
mock_capture.return_value.__enter__ = MagicMock(return_value=log_path)
```
🟢 Acceptable - Pragmatic Test Strategy: These tests are mock-heavy (subprocess.run, remote_image_exists, etc.), which means they validate error handling and retry logic but not actual Docker command correctness.
This is a reasonable trade-off for build infrastructure:
- ✅ Fast feedback on logic paths
- ✅ Validates retry/error handling
- ✅ No Docker daemon or registry credentials needed
- ⚠️ Doesn't catch malformed Docker commands or tag format issues

However, the workflow run evidence (70s/image for 500 images) proves the real integration works. This is acceptable pragmatic testing for CI infrastructure.
```python
def _fail_proc(stderr="build error", code=1):
    """Fake subprocess.CompletedProcess with non-zero returncode."""
```
🟡 Suggestion - Document Test vs Production Executor: Tests use `ThreadPoolExecutor` (to avoid pickling mocks) but production uses `ProcessPoolExecutor`. Consider adding a comment explaining this difference so future maintainers don't get confused:

```python
# Use ThreadPoolExecutor in tests to avoid pickling issues with mocks.
# Production code uses ProcessPoolExecutor for true parallelism.
def _thread_pool(**kw):
    return ThreadPoolExecutor(**kw)
```
```python
if __name__ == "__main__":
    sys.exit(main())
```
🟢 Resolved - Testing Gap Addressed: Previous reviews flagged this 750-line file as having zero tests. Now tests/test_phased_build.py provides 489 lines of coverage for:
- Build success/failure/skip logic
- Retry mechanisms
- Partial push failure handling
- Manifest writing
- Orchestration flow
Combined with workflow evidence (runs 23409964288, 23417634029, 23417634506 showing 70s/image average), this is sufficient for production build infrastructure.
```python
tag_label = final_tags[0] if final_tags else base_tag
overall_started = time.monotonic()
```
🟢 Acceptable - Documented Architectural Trade-off: The docstring clearly states the local Docker daemon requirement and explains the 4x speedup benefit (~455s → ~70s per image).
This is a conscious engineering choice with workflow evidence proving it works in GitHub Actions. The standard build path remains available as fallback for environments without local Docker.
Well-documented pragmatic decision.
Force-pushed cb034c8 to 3cd3a50
- Add phased build pipeline (base → agent-layer → per-instance) - Add -acp suffix to image tags for ACP agent builds to avoid IfNotPresent pull policy reusing stale non-ACP cached images Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 3cd3a50 to 976f003
Remove the use-prebuilt-bases feature flag and standard build path. The three-phase build (builder → base → agent) is now the sole path in both swebench and swtbench workflows. Also wire --agent-type through to build_phased_images.py so ACP builds get the -acp tag suffix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge the phased build CLI from build_phased_images.py into build_images.py so there is a single entry point for image builds. The old single-pass build path is removed; the three-phase pipeline (builder → base → assemble) is now the only code path. SWT-bench's build_images.py becomes a thin shim that forwards to the SWE-bench module. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unused BUILD_DEFAULTS import. Fix TestPhasedOrchestration patches to target build_base_images (where functions are defined) since build_images uses lazy imports inside main() to avoid circular dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Add a phased benchmark image build path that uses local `docker build` + `docker push`. This path is enabled through the existing workflow input `use-prebuilt-bases` and is implemented with a dedicated entrypoint: `benchmarks/swebench/build_phased_images.py`.

Current Behavior
When `use-prebuilt-bases=true`, the workflows now run the phased path, building:
- `ghcr.io/openhands/eval-builder:{sdk_sha}`
- `ghcr.io/openhands/eval-base:{instance_tag}`
- `ghcr.io/openhands/eval-agent-server:{sdk_sha}-{instance_tag}-source-minimal` images locally

Final assembly uses the thin `benchmarks/swebench/Dockerfile.agent-layer` Dockerfile and local Docker instead of remote `docker buildx --push`.
When `use-prebuilt-bases` is not enabled, the existing standard build path remains unchanged.

What Is In This PR
- `benchmarks/swebench/build_phased_images.py`
- `builder`/`base-image-minimal` targets

Why This Design
The heavy work is amortized into reusable images:

Results
Validated workflow runs:
- Run 23409630540: 455s/image
- Run 23409964288: 87.12s/image average for 100 images
- Run 23417634029: 70.654s/image average, 61.4s median for 500 images
- Run 23417634506: 69.019s/image average, 59.2s median for 433 images

Dependency
Depends on:

Scope
This PR is about the phased local-assembly path only. The legacy direct-build path is still preserved as the default when the phased mode is not selected.