Optimize SWT-bench build throughput: cache-mode=off, raise prune threshold #551

Open
simonrosenberg wants to merge 1 commit into main from optimize-swtbench-build-throughput

Conversation

@simonrosenberg
Collaborator

Summary

  • Switch default cache-mode from max to off — eliminates ~64s/image (16.4%) of cache export overhead. Registry cache imports still work; only export is skipped. Expensive cached layers (apt-get, npm install) persist in registry from previous runs.
  • Raise BUILDKIT_PRUNE_THRESHOLD_PCT from 60% to 85% — delays first prune from ~163 to ~286 images, eliminating one of two prune events (~20 min saved).
  • Lower BUILDKIT_PRUNE_KEEP_GB from 60 to 20 — maximizes post-prune headroom.
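The cache-mode=off change can be sketched as follows. This is a hedged illustration, not the workflow's actual build script: the registry ref, image tag, and flag wiring below are hypothetical stand-ins.

```shell
# Hedged sketch (hypothetical refs and wiring, not this repo's real script):
# with cache-mode=off we keep --cache-from so registry cache imports still
# work, and simply omit --cache-to, skipping the ~64s/image export step.
CACHE_REF="ghcr.io/example/swt-bench-cache"  # hypothetical registry ref
CACHE_MODE="off"

ARGS="--cache-from type=registry,ref=${CACHE_REF}"
if [ "$CACHE_MODE" != "off" ]; then
  # mode=max would export every layer back to the registry after the build
  ARGS="$ARGS --cache-to type=registry,ref=${CACHE_REF},mode=${CACHE_MODE}"
fi

# Echoed rather than executed so the sketch stays self-contained
echo "docker buildx build $ARGS -t example-image ."
```

With `CACHE_MODE=off` the `--cache-to` flag is never added, which is exactly why imports keep working while the export cost disappears.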

Context

Analysis of run #23382357696 (9h04m, 47.8 img/h for 433 images) showed the Dockerfile ARG cache fix (#544) is working (12-13 cached steps/image), but two overheads dominate:

  1. cache-mode=max cache export: avg 64s/image
  2. Two BuildKit prune events at 60% threshold: ~40 min dead time + post-prune growth spike

Expected improvement: ~47.8 → ~65-70 img/h (~6h15m for 433 images), recovering the late-February baseline.
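As a sanity check on those numbers (pure arithmetic on the figures quoted above, no outside assumptions):

```shell
# 9h04m for 433 images -> baseline img/h; projected hours at ~69 img/h.
# The quoted ~6h15m corresponds to roughly 69 img/h, near the top of the
# stated 65-70 img/h range.
BASELINE_RATE=$(awk 'BEGIN { printf "%.1f", 433 / (9 + 4/60) }')  # img/h
PROJECTED_H=$(awk 'BEGIN { printf "%.2f", 433 / 69 }')            # hours
echo "baseline: ${BASELINE_RATE} img/h, projected: ${PROJECTED_H} h"
```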

References: #530, #531, #544

Test plan

  • Trigger full SWT-bench build (433 images, force-build) and compare total time + throughput to run #23382357696
  • Verify at most 1 prune event occurs (was 2 before)
  • Verify per-image cache_export_seconds drops to near-zero
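Why at most one prune is expected: at the steady-state growth of 1.63 GiB/image quoted above, raising the threshold by 25 points delays the first prune by roughly 120 images. The ~800 GiB runner disk used below is an assumption inferred from the ~163 to ~286 shift, not a figure stated in the PR.

```shell
# Assumption: ~800 GiB runner disk (inferred, not stated in the PR).
# Extra images before the first prune when the threshold rises 60% -> 85%,
# at the quoted steady-state growth of 1.63 GiB/image:
EXTRA_IMAGES=$(awk 'BEGIN { printf "%d", (0.85 - 0.60) * 800 / 1.63 }')
echo "threshold raise buys ~${EXTRA_IMAGES} extra images"
```

The result (~122 extra images) is consistent with the first prune moving from ~163 to ~286 images, so a 433-image run should hit the threshold at most once.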

🤖 Generated with Claude Code

Optimize SWT-bench build throughput: cache-mode=off, raise prune threshold

Analysis of run #23382357696 (9h04m for 433 images at 47.8 img/h) showed
that despite the Dockerfile ARG cache fix working correctly (12-13 cached
steps per image), two major overheads remain:

1. cache-mode=max: avg 64s/image (16.4% of wall clock) spent exporting
   all layers back to registry — pure overhead. Issue #530 showed
   cache-mode=off gives ~43% higher throughput. Registry cache imports
   still work with mode=off; only the export is skipped. The expensive
   cached layers (apt-get, npm install) persist in the registry from
   previous runs and don't change between SDK bumps.

2. Two BuildKit prune events at 60% threshold: each prune took ~20 min
   and wiped local cache, causing a post-prune growth spike (3-5 GiB/image
   vs steady-state 1.63 GiB/image). Raising to 85% eliminates one prune
   (first prune after ~286 images instead of ~163), and lowering
   keep-storage to 20g maximizes headroom after pruning.

Expected improvement: ~47.8 → ~65-70 img/h (~6h15m for 433 images),
recovering the late-February baseline of ~6h.

References: #530, #531, #544

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Pragmatic optimization based on real measurements, but needs testing evidence before merge.

  description: 'BuildKit cache export mode (max, min, off). Default: off'
  required: false
- default: 'max'
+ default: 'off'

🔴 Critical - Breaking Change: Changing the default from max to off breaks existing behavior for anyone using this workflow without explicitly setting cache-mode. This violates the "never break userspace" principle.

Impact: Any automation or manual workflows relying on the previous default will see different caching behavior.

Required: one of the following:

  1. Add a migration notice in the PR description explaining the default change
  2. Bump a workflow version number to make this an opt-in change
  3. Provide evidence that no existing consumers rely on the max default

BUILDKIT_PROGRESS: plain
OPENHANDS_BUILDKIT_CACHE_MODE: ${{ inputs.cache-mode }}
BUILDKIT_PRUNE_THRESHOLD_PCT: '85'
BUILDKIT_PRUNE_KEEP_GB: '20'
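These two variables feed the runner's prune step. A minimal sketch of how such a threshold-triggered prune typically works follows; the actual workflow script is not shown in this PR, so the wiring below is an assumption.

```shell
# Hedged sketch: the real workflow's prune script may differ.
BUILDKIT_PRUNE_THRESHOLD_PCT="${BUILDKIT_PRUNE_THRESHOLD_PCT:-85}"
BUILDKIT_PRUNE_KEEP_GB="${BUILDKIT_PRUNE_KEEP_GB:-20}"

usage_pct() {
  # Percentage of the root filesystem in use, digits only (e.g. "73").
  # Requires GNU df (Linux runners).
  df --output=pcent / | tail -1 | tr -dc '0-9'
}

if [ "$(usage_pct)" -ge "$BUILDKIT_PRUNE_THRESHOLD_PCT" ]; then
  # Echoed rather than executed so the sketch is side-effect free
  echo "docker builder prune --force --keep-storage ${BUILDKIT_PRUNE_KEEP_GB}g"
fi
```

Raising the threshold moves the `if` trigger later; lowering keep-storage makes each prune reclaim more space when it does fire.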

🟠 Important - Needs Justification: Dropping BUILDKIT_PRUNE_KEEP_GB from 60GB to 20GB is aggressive (67% reduction).

Risk: If disk usage patterns vary, this could trigger more frequent prunes in some scenarios, defeating the purpose of raising the threshold to 85%.

Required: Explain the analysis that led to 20GB specifically. What's the typical post-prune disk usage in your measurements?
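One way to quantify the trade-off being questioned here (the 1.63 GiB/image figure comes from the PR; the rest is arithmetic): lowering keep-storage from 60 to 20 GiB frees an extra 40 GiB after each prune.

```shell
# Extra post-prune headroom from keep-storage 60 GiB -> 20 GiB,
# at the steady-state growth of 1.63 GiB/image quoted in the PR:
EXTRA_HEADROOM_GB=$((60 - 20))
EXTRA_IMAGES=$(awk -v gb="$EXTRA_HEADROOM_GB" 'BEGIN { printf "%d", gb / 1.63 }')
echo "+${EXTRA_HEADROOM_GB} GiB headroom ~ ${EXTRA_IMAGES} extra images per cycle"
```

That is roughly 24 additional images of headroom per prune cycle, which is the margin the 60 to 20 GiB change is buying; whether 20 GiB retains enough hot cache is the open question the reviewer raises.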

DOCKER_BUILDKIT: 1
BUILDKIT_PROGRESS: plain
OPENHANDS_BUILDKIT_CACHE_MODE: ${{ inputs.cache-mode }}
BUILDKIT_PRUNE_THRESHOLD_PCT: '85'

🟡 Suggestion - Verify Assumption: The PR description assumes "expensive cached layers persist in registry from previous runs."

Question: What happens in these scenarios?

  1. First-time build on a fresh runner
  2. Registry cache was evicted/expired
  3. Force-rebuild scenario

With cache-mode=off, these scenarios get no benefit from local cache exports. Verify this trade-off is acceptable.
