
feat: three-phase image build with local docker for 4x speedup#555

Open
simonrosenberg wants to merge 5 commits into main from feat/prebuilt-base-local-build

Conversation

@simonrosenberg (Collaborator) commented Mar 23, 2026

Summary

Add a phased benchmark image build path that uses:

  1. one shared builder image
  2. per-instance base images
  3. local final assembly with docker build + docker push

This path is enabled through the existing workflow input use-prebuilt-bases and is implemented with a dedicated entrypoint: benchmarks/swebench/build_phased_images.py.

Current Behavior

When use-prebuilt-bases=true, the workflows now run the phased path:

  • Phase 0: build and push ghcr.io/openhands/eval-builder:{sdk_sha}
  • Phase 1: build and push ghcr.io/openhands/eval-base:{instance_tag}
  • Phase 2: assemble final ghcr.io/openhands/eval-agent-server:{sdk_sha}-{instance_tag}-source-minimal images locally

Final assembly uses the thin benchmarks/swebench/Dockerfile.agent-layer file and the local Docker daemon instead of remote docker buildx --push.

When use-prebuilt-bases is not enabled, the existing standard build path remains unchanged.

What Is In This PR

  • New phased-build entrypoint: benchmarks/swebench/build_phased_images.py
  • Shared helper logic for:
    • building the shared builder image
    • building base images
    • assembling final images locally
  • Workflow integration for both SWE-bench and SWT-bench
  • Final-image wrapping still applied for repos that need the docutils/roman layer
  • Updated SDK submodule to the SDK branch that exposes the required builder / base-image-minimal targets

Why This Design

The heavy work is amortized into reusable images:

  • the builder image changes with the SDK
  • the base images change with the benchmark environments
  • the final assembly step becomes lightweight and can reuse local Docker layer state across many images in one job
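The amortization above boils down to a simple tag scheme. A minimal sketch (the helper names here are illustrative, not the actual functions in benchmarks/swebench/build_phased_images.py):

```python
def builder_tag(sdk_sha: str) -> str:
    # Phase 0: one shared builder image per SDK revision
    return f"ghcr.io/openhands/eval-builder:{sdk_sha}"


def base_tag(instance_tag: str) -> str:
    # Phase 1: one base image per benchmark instance environment
    return f"ghcr.io/openhands/eval-base:{instance_tag}"


def final_tag(sdk_sha: str, instance_tag: str) -> str:
    # Phase 2: the final image keys on both, assembled locally
    return f"ghcr.io/openhands/eval-agent-server:{sdk_sha}-{instance_tag}-source-minimal"
```

Because the builder tag depends only on the SDK SHA and the base tag only on the instance, each phase can skip work whenever its tag already exists in the registry.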

Results

Validated workflow runs:

  • Three-phase remote buildx assembly: run 23409630540
    • about 455s/image
  • Three-phase local assembly: run 23409964288
    • 87.12s/image average for 100 images
  • Three-phase local assembly: run 23417634029
    • 70.654s/image average, 61.4s median for 500 images
  • Three-phase local assembly: run 23417634506
    • 69.019s/image average, 59.2s median for 433 images

Dependency

Depends on:

Scope

This PR is about the phased local-assembly path only. The legacy direct-build path is still preserved as the default when the phased mode is not selected.

@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Solves a real problem (6h→52min) but ships 700 lines of untested concurrency code into production build infrastructure.



if __name__ == "__main__":
    sys.exit(main())

🔴 Critical - Testing Gap: This entire 702-line file has zero tests. This is complex build orchestration code with:

  • Concurrent ProcessPoolExecutor workflows
  • Retry logic with exponential backoff
  • Docker build/push operations
  • State management across phases

If this breaks, it blocks all image builds. At minimum, add integration tests that:

  1. Build a single base image end-to-end
  2. Assemble a single agent image end-to-end
  3. Handle a build failure and verify retry logic
  4. Verify the manifest.jsonl output format

This is build infrastructure - test it like your production depends on it (because it does).
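A minimal retry test in this spirit (assuming a hypothetical `run_with_retries` helper; the real module's retry/backoff API may differ):

```python
from unittest import mock


def run_with_retries(fn, max_retries=3):
    """Hypothetical retry helper: call fn until it succeeds or retries run out."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError as exc:
            last_exc = exc  # real code would sleep ~2**attempt seconds here
    raise last_exc


def test_retry_recovers_from_transient_failures():
    # Two transient failures, then success: the helper should absorb both.
    fn = mock.Mock(side_effect=[RuntimeError("push timeout"),
                                RuntimeError("push timeout"),
                                "ok"])
    assert run_with_retries(fn) == "ok"
    assert fn.call_count == 3
```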

Comment on lines +395 to +419
logger.info("[assembly] Building: %s", " ".join(build_cmd))
build_started = time.monotonic()
proc = subprocess.run(build_cmd, text=True, capture_output=True)
build_seconds = round(time.monotonic() - build_started, 3)

if proc.stdout:
    print(proc.stdout, end="")
if proc.stderr:
    print(proc.stderr, end="", file=sys.stderr)

if proc.returncode != 0:
    error = (
        proc.stderr.strip()
        or proc.stdout.strip()
        or f"docker build failed with exit code {proc.returncode}"
    )
    logger.info(
        "[assembly] FAILED %s: build_seconds=%.1f error=%s",
        tag_label,
        build_seconds,
        error[:200],
    )
    return BuildOutput(base_image=base_tag, tags=[], error=error)

# Step 2: docker push each tag

🟠 Important - Error Handling: This push loop fails fast on the first error:

if push:
    for t in final_tags:
        # ... push ...
        if push_proc.returncode != 0:
            return BuildOutput(..., error=error)

If you have 3 tags and push #1 fails but #2 and #3 would succeed, you lose that information and waste the successful docker build.

Better approach:

failed_tags = []
for t in final_tags:
    # ... push ...
    if push_proc.returncode != 0:
        failed_tags.append((t, error))

if failed_tags:
    return BuildOutput(..., error=f"Failed to push {len(failed_tags)}/{len(final_tags)} tags: {failed_tags}")

At least report partial success rather than treating it as total failure.

Comment on lines +318 to +320
str(ctx / "Dockerfile"),
"--target",
"builder",

🟡 Suggestion - Silent Cleanup Failure: shutil.rmtree(ctx, ignore_errors=True) silently swallows errors. If cleanup fails (disk full? permission issue?), you won't know until subsequent builds fail mysteriously.

Suggested change
str(ctx / "Dockerfile"),
"--target",
"builder",
try:
    shutil.rmtree(ctx)
except Exception as e:
    logger.warning("Failed to cleanup build context %s: %s", ctx, e)

tag = builder_image_tag(builder_image)

if remote_image_exists(tag):
    logger.info("Builder image %s already exists. Skipping.", tag)

🟡 Suggestion - Data Structure Misuse: BuildOutput(base_image="builder", ...) is semantically wrong. The builder image isn't a base image - this violates the principle of making data structures match your problem domain.

Either:

  1. Add a separate BuilderOutput type, or
  2. Make base_image optional and add a builder_image field

Using "builder" as a magic string here will confuse anyone reading logs or debugging build failures.
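A sketch of option 1, a dedicated result type (field names are illustrative, not the PR's actual definition):

```python
from dataclasses import dataclass, field


@dataclass
class BuilderOutput:
    # Phase 0 gets its own result type, so logs never show the builder
    # masquerading as a base image.
    builder_image: str
    tags: list = field(default_factory=list)
    error: str = ""
```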

Comment on lines +247 to +327
            echo "Warning: /var/lib/buildkit not found; skipping disk check"
          fi

      - name: Build and push SWE-Bench images
      - name: "Phase 0: Build and push builder image (pre-built base mode)"
        if: ${{ env.USE_PREBUILT_BASES == 'true' }}
        run: |
          set -euo pipefail
          uv run python -c "
          from benchmarks.swebench.build_base_images import build_builder_image
          result = build_builder_image(push=True)
          if result.error:
              raise RuntimeError(f'Builder build failed: {result.error}')
          print(f'Builder image: {result.tags[0]}')
          # Export for next steps
          import os
          with open(os.environ['GITHUB_ENV'], 'a') as f:
              f.write(f'BUILDER_TAG={result.tags[0]}\n')
          "
        env:
          DOCKER_BUILDKIT: 1
          BUILDKIT_PROGRESS: plain

      - name: "Phase 1: Build and push base images (pre-built base mode)"
        if: ${{ env.USE_PREBUILT_BASES == 'true' }}
        run: |
          set -euo pipefail

          CMD="uv run benchmarks/swebench/build_base_images.py \
            --dataset '${DATASET}' \
            --split '${SPLIT}' \
            --push \
            --max-workers '${MAX_WORKERS}' \
            --max-retries '${MAX_RETRIES}'"

          if [ -n "${N_LIMIT}" ]; then
            CMD="$CMD --n-limit '${N_LIMIT}'"
          fi
          if [ -n "${SELECT_FILE}" ]; then
            CMD="$CMD --select '${SELECT_FILE}'"
          fi

          echo "Running: $CMD"
          eval "$CMD"
        env:
          DOCKER_BUILDKIT: 1
          BUILDKIT_PROGRESS: plain

      - name: "Phase 2: Assemble agent images (pre-built base mode)"
        if: ${{ env.USE_PREBUILT_BASES == 'true' }}
        run: |
          set -euo pipefail
          FORCE_BUILD="${{ inputs.force-build || 'false' }}"

          uv run python -c "
          from benchmarks.swebench.build_images import collect_unique_base_images
          from benchmarks.swebench.build_base_images import assemble_all_agent_images
          from benchmarks.utils.build_utils import default_build_output_dir
          import sys

          base_images = collect_unique_base_images(
              '${DATASET}', '${SPLIT}',
              int('${N_LIMIT}') if '${N_LIMIT}' else 0,
              '${SELECT_FILE}' if '${SELECT_FILE}' else None,
          )
          build_dir = default_build_output_dir('${DATASET}', '${SPLIT}')
          rc = assemble_all_agent_images(
              base_images=base_images,
              builder_tag='${BUILDER_TAG}',
              build_dir=build_dir,
              push=True,
              max_workers=int('${MAX_WORKERS}'),
              max_retries=int('${MAX_RETRIES}'),
              force_build=('${FORCE_BUILD}' == 'true'),
          )
          sys.exit(rc)
          "
        env:
          DOCKER_BUILDKIT: 1
          BUILDKIT_PROGRESS: plain
          FORCE_BUILD: ${{ inputs.force-build || 'false' }}


🟡 Suggestion - Workflow Complexity: These three phases add 80+ lines of bash/Python/YAML mixing with complex conditionals:

  • ${{ env.USE_PREBUILT_BASES == 'true' }}
  • Inline Python in bash heredocs
  • Multi-line uv run python -c blocks

This will be painful to debug when it breaks in CI. Consider:

  1. Extract the phase logic into a single Python script (build_with_phases.py --phase 0|1|2)
  2. Keep workflow steps simple: just call the script with appropriate args
  3. Makes local reproduction easier: uv run build_with_phases.py --phase 1

Simple scripts are easier to test and debug than workflow YAML.
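A sketch of suggestion 1, assuming hypothetical flag names for the single-script interface:

```python
import argparse


def parse_args(argv=None):
    # Hypothetical CLI: one script, one --phase flag, so a CI step is
    # just "uv run build_with_phases.py --phase N --push".
    p = argparse.ArgumentParser(description="Run one phase of the image build.")
    p.add_argument("--phase", type=int, choices=[0, 1, 2], required=True,
                   help="0=builder, 1=base images, 2=local assembly")
    p.add_argument("--push", action="store_true")
    p.add_argument("--max-workers", type=int, default=4)
    return p.parse_args(argv)
```

The workflow step then shrinks to a one-line shell command, and the same command reproduces the phase locally.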

Comment on lines +167 to +183
      - name: "Phase 0: Build and push builder image (pre-built base mode)"
        if: ${{ inputs.use-prebuilt-bases == true }}
        run: |
          set -euo pipefail
          uv run python -c "
          from benchmarks.swebench.build_base_images import build_builder_image
          result = build_builder_image(push=True)
          if result.error:
              raise RuntimeError(f'Builder build failed: {result.error}')
          print(f'Builder image: {result.tags[0]}')
          import os
          with open(os.environ['GITHUB_ENV'], 'a') as f:
              f.write(f'BUILDER_TAG={result.tags[0]}\n')
              f.write('USE_PREBUILT_BASES=true\n')
          "
        env:
          DOCKER_BUILDKIT: 1

🟢 Acceptable - Inconsistent Conditionals: SWE-bench workflow uses ${{ env.USE_PREBUILT_BASES == 'true' }} (string comparison), but SWT-bench uses ${{ inputs.use-prebuilt-bases == true }} (boolean comparison).

Both work, but the inconsistency makes maintenance harder. Pick one pattern and use it everywhere.

(This is pragmatic given time constraints, but worth standardizing in a follow-up.)
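One way to standardize (step content here is illustrative): normalize the input into an env variable once, then compare it as a string in every step of both workflows:

```yaml
env:
  # Actions renders the boolean input as the string "true"/"false" here,
  # so every step can use the same string comparison.
  USE_PREBUILT_BASES: ${{ inputs.use-prebuilt-bases }}

steps:
  - name: "Phase 0: Build and push builder image"
    if: ${{ env.USE_PREBUILT_BASES == 'true' }}
    run: echo "prebuilt-base mode enabled"
```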

@simonrosenberg (Author) commented:

Addressed review feedback in 5592f9f:

Fixed:

Acknowledged, deferred:

@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Solves a real problem (6h→52min) but ships 700 lines of untested concurrency code into production build infrastructure.

Good: Phased caching strategy is solid engineering. 4x speedup with real evidence.

Concerns: Two previous issues are now fixed (cleanup handling, partial push failures), but the testing gap remains critical and there's a new architectural dependency.

Comment on lines +46 to +62
def _get_sdk_dockerfile() -> Path:
    """Locate the SDK Dockerfile from the vendor submodule."""
    benchmarks_root = Path(__file__).resolve().parent.parent.parent
    dockerfile = (
        benchmarks_root
        / "vendor"
        / "software-agent-sdk"
        / "openhands-agent-server"
        / "openhands"
        / "agent_server"
        / "docker"
        / "Dockerfile"
    )
    if not dockerfile.exists():
        raise FileNotFoundError(
            f"SDK Dockerfile not found at {dockerfile}. "
            "Make sure submodules are initialized."

🟠 Important - Fragile Path Traversal: Hard-coded relative path with multiple parent/child hops is brittle. If directory structure changes or submodule location moves, this breaks silently.

Suggested change
def _get_sdk_dockerfile() -> Path:
    """Locate the SDK Dockerfile from the vendor submodule."""
    benchmarks_root = Path(__file__).resolve().parent.parent.parent
    dockerfile = (
        benchmarks_root
        / "vendor"
        / "software-agent-sdk"
        / "openhands-agent-server"
        / "openhands"
        / "agent_server"
        / "docker"
        / "Dockerfile"
    )
    if not dockerfile.exists():
        raise FileNotFoundError(
            f"SDK Dockerfile not found at {dockerfile}. "
            "Make sure submodules are initialized."

# Use configured submodule path or fall back to discovery
BENCHMARKS_ROOT = Path(__file__).resolve().parent.parent.parent
SDK_DOCKERFILE = BENCHMARKS_ROOT / "vendor" / "software-agent-sdk" / "openhands-agent-server" / "openhands" / "agent_server" / "docker" / "Dockerfile"

def _get_sdk_dockerfile() -> Path:
    if not SDK_DOCKERFILE.exists():
        raise FileNotFoundError(
            f"SDK Dockerfile not found at {SDK_DOCKERFILE}. "
            "Make sure submodules are initialized."
        )
    return SDK_DOCKERFILE

Move the path computation to module level so it's visible and fails fast at import time.

Comment on lines +353 to +373
    push: bool = False,
    git_sha: str = "unknown",
) -> BuildOutput:
    """Assemble a final agent image from pre-built base + builder (Phase 2).

    Uses local `docker build` + `docker push` instead of `docker buildx build`
    to avoid the remote Blacksmith buildx driver overhead (~400s/image).
    The local Docker daemon caches pulled base/builder layers across builds.
    """
    import time

    if not AGENT_LAYER_DOCKERFILE.exists():
        return BuildOutput(
            base_image=base_tag,
            tags=[],
            error=f"Agent layer Dockerfile not found at {AGENT_LAYER_DOCKERFILE}",
        )

    tag_label = final_tags[0] if final_tags else base_tag
    overall_started = time.monotonic()


🔴 Critical - New Architectural Dependency: This phase requires a local Docker daemon (docker build + docker push), while the standard path can use remote buildx drivers.

This is a fundamental trade-off:

  • ✅ Benefit: Local layer caching → 4x speedup
  • ⚠️ Risk: Breaks in environments without local Docker daemon (some CI runners, remote build farms)

Document this requirement clearly:

  1. Add a comment block explaining the local-daemon dependency
  2. Check docker info before running Phase 2 and fail fast with a clear error if daemon is unreachable
  3. Document in the PR description that phased mode requires local Docker access

Example pre-flight check:

def _check_docker_daemon():
    proc = subprocess.run(["docker", "info"], capture_output=True)
    if proc.returncode != 0:
        raise RuntimeError(
            "Local Docker daemon required for phased builds. "
            "Ensure Docker is running or use --no-prebuilt-bases."
        )



if __name__ == "__main__":
    sys.exit(main())

🔴 Critical - Testing Gap Remains: This 736-line file orchestrates concurrent builds with retry logic, Docker operations, and state management across three phases. Zero tests.

This is production build infrastructure that blocks all image builds when it breaks. The 4x speedup is worthless if a regression takes down CI for hours.

Minimum viable tests:

  1. Unit: extract_custom_tag, base_image_tag, builder_image_tag (pure functions)
  2. Integration: Mock subprocess.run to verify correct Docker commands are constructed
  3. End-to-end: Build one test image through all three phases in CI

You don't need 100% coverage, but you need something to catch basic regressions. The concurrency code is especially fragile - test that partial failures are collected correctly and retries work.
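A sketch of test style 2, mocking subprocess.run to check command construction (the command-builder helper here is hypothetical; the real module assembles its docker invocations inline):

```python
import subprocess
from unittest import mock


def docker_build_cmd(dockerfile: str, tag: str, target: str = "") -> list:
    # Hypothetical helper that builds a docker command line.
    cmd = ["docker", "build", "--file", dockerfile, "--tag", tag]
    if target:
        cmd += ["--target", target]
    return cmd + ["."]


def test_builder_target_is_passed_through():
    # Patch subprocess.run so no Docker daemon is needed.
    with mock.patch("subprocess.run") as run:
        subprocess.run(docker_build_cmd("Dockerfile", "eval-builder:abc",
                                        target="builder"))
        (cmd,), _ = run.call_args
        assert cmd[:2] == ["docker", "build"]
        assert "--target" in cmd and "builder" in cmd
```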

"--file",
str(ctx / "Dockerfile"),
"--target",
"builder",

🟢 Acceptable - Fixed: Previous review flagged silent cleanup failure. This now logs warnings on cleanup errors. Good fix.

Comment on lines +420 to +428
for t in final_tags:
    push_cmd = ["docker", "push", t]
    logger.info("[assembly] Pushing: %s", t)
    push_started = time.monotonic()
    push_proc = subprocess.run(push_cmd, text=True, capture_output=True)
    push_seconds += time.monotonic() - push_started

    if push_proc.stdout:
        print(push_proc.stdout, end="")

🟢 Acceptable - Fixed: Previous review flagged fail-fast push loop. This now collects partial failures in failed_pushes and reports which tags succeeded. Good pragmatic fix.

@neubig (Contributor) left a comment

Basically LGTM!

@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Real 4x speedup (6h→52min) with solid phase separation. Two critical fixes verified (cleanup warnings, partial push handling). One blocking issue remains: 700+ lines of production build infrastructure with zero tests. The local Docker dependency is acceptable for GitHub Actions but needs documentation.

Key Insight: You've solved a real problem with pragmatic caching, but untested CI infrastructure is a ticking bomb. The 4x speedup is worthless if this breaks silently and blocks all builds.

@all-hands-bot (Collaborator) left a comment

Taste Rating: 🟡 Acceptable - Real 4x speedup (6h→52min) with solid evidence from workflow runs. Previous critical issues (cleanup warnings, partial push handling, testing gap) are now resolved. The mock-heavy tests are pragmatic for build infrastructure that requires Docker/registry interactions.

Key Insight: You've proven this works at scale (500+ images) and the local Docker trade-off is well-documented. The main concerns from previous reviews have been addressed.


log_path = tmp_path / "log.txt"
log_path.touch()
mock_capture.return_value.__enter__ = MagicMock(return_value=log_path)

🟢 Acceptable - Pragmatic Test Strategy: These tests are mock-heavy (subprocess.run, remote_image_exists, etc.), which means they validate error handling and retry logic but not actual Docker command correctness.

This is a reasonable trade-off for build infrastructure:

  • ✅ Fast feedback on logic paths
  • ✅ Validates retry/error handling
  • ✅ No Docker daemon or registry credentials needed
  • ⚠️ Doesn't catch malformed Docker commands or tag format issues

However, the workflow run evidence (70s/image for 500 images) proves the real integration works. This is acceptable pragmatic testing for CI infrastructure.



def _fail_proc(stderr="build error", code=1):
    """Fake subprocess.CompletedProcess with non-zero returncode."""

🟡 Suggestion - Document Test vs Production Executor: Tests use ThreadPoolExecutor (to avoid pickling mocks) but production uses ProcessPoolExecutor. Consider adding a comment explaining this difference so future maintainers don't get confused:

# Use ThreadPoolExecutor in tests to avoid pickling issues with mocks.
# Production code uses ProcessPoolExecutor for true parallelism.
def _thread_pool(**kw):
    return ThreadPoolExecutor(**kw)



if __name__ == "__main__":
    sys.exit(main())

🟢 Resolved - Testing Gap Addressed: Previous reviews flagged this 750-line file as having zero tests. Now tests/test_phased_build.py provides 489 lines of coverage for:

  • Build success/failure/skip logic
  • Retry mechanisms
  • Partial push failure handling
  • Manifest writing
  • Orchestration flow

Combined with workflow evidence (runs 23409964288, 23417634029, 23417634506 showing 70s/image average), this is sufficient for production build infrastructure.


tag_label = final_tags[0] if final_tags else base_tag
overall_started = time.monotonic()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Acceptable - Documented Architectural Trade-off: The docstring clearly states the local Docker daemon requirement and explains the 4x speedup benefit (~455s → ~70s per image).

This is a conscious engineering choice with workflow evidence proving it works in GitHub Actions. The standard build path remains available as fallback for environments without local Docker.

Well-documented pragmatic decision.

- Add phased build pipeline (base → agent-layer → per-instance)
- Add -acp suffix to image tags for ACP agent builds to avoid
  IfNotPresent pull policy reusing stale non-ACP cached images

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg force-pushed the feat/prebuilt-base-local-build branch from 3cd3a50 to 976f003 on March 23, 2026 at 23:58
Debug Agent and others added 4 commits March 23, 2026 21:04
Remove the use-prebuilt-bases feature flag and standard build path.
The three-phase build (builder → base → agent) is now the sole path
in both swebench and swtbench workflows.

Also wire --agent-type through to build_phased_images.py so ACP
builds get the -acp tag suffix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge the phased build CLI from build_phased_images.py into build_images.py
so there is a single entry point for image builds. The old single-pass build
path is removed; the three-phase pipeline (builder → base → assemble) is now
the only code path. SWT-bench's build_images.py becomes a thin shim that
forwards to the SWE-bench module.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unused BUILD_DEFAULTS import. Fix TestPhasedOrchestration patches
to target build_base_images (where functions are defined) since build_images
uses lazy imports inside main() to avoid circular dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>