Optimize SWT-bench build throughput: cache-mode=off, raise prune threshold #551
simonrosenberg wants to merge 1 commit into main
Analysis of run #23382357696 (9h04m for 433 images at 47.8 img/h) showed that despite the Dockerfile ARG cache fix working correctly (12-13 cached steps per image), two major overheads remain:

1. cache-mode=max: avg 64s/image (16.4% of wall clock) spent exporting all layers back to the registry, which is pure overhead. Issue #530 showed cache-mode=off gives ~43% higher throughput. Registry cache imports still work with mode=off; only the export is skipped. The expensive cached layers (apt-get, npm install) persist in the registry from previous runs and don't change between SDK bumps.

2. Two BuildKit prune events at the 60% threshold: each prune took ~20 min and wiped the local cache, causing a post-prune growth spike (3-5 GiB/image vs steady-state 1.63 GiB/image). Raising the threshold to 85% eliminates one prune (first prune after ~286 images instead of ~163), and lowering keep-storage to 20g maximizes headroom after pruning.

Expected improvement: ~47.8 → ~65-70 img/h (~6h15m for 433 images), recovering the late-February baseline of ~6h.

References: #530, #531, #544

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all-hands-bot left a comment
🟡 Acceptable - Pragmatic optimization based on real measurements, but needs testing evidence before merge.
    description: 'BuildKit cache export mode (max, min, off). Default: off'
    required: false
    - default: 'max'
    + default: 'off'
🔴 Critical - Breaking Change: Changing the default from max to off breaks existing behavior for anyone using this workflow without explicitly setting cache-mode. This violates the "never break userspace" principle.
Impact: Any automation or manual workflows relying on the previous default will see different caching behavior.
Required: Either:
- Add a migration notice in the PR description explaining the default change
- Bump a workflow version number to make this an opt-in change
- Provide evidence that no existing consumers rely on the max default
    BUILDKIT_PROGRESS: plain
    OPENHANDS_BUILDKIT_CACHE_MODE: ${{ inputs.cache-mode }}
    BUILDKIT_PRUNE_THRESHOLD_PCT: '85'
    BUILDKIT_PRUNE_KEEP_GB: '20'
🟠 Important - Needs Justification: Dropping BUILDKIT_PRUNE_KEEP_GB from 60GB to 20GB is aggressive (67% reduction).
Risk: If disk usage patterns vary, this could trigger more frequent prunes in some scenarios, defeating the purpose of raising the threshold to 85%.
Required: Explain the analysis that led to 20GB specifically. What's the typical post-prune disk usage in your measurements?
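To make the threshold and keep-storage trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes the prune triggers when cache usage reaches BUILDKIT_PRUNE_THRESHOLD_PCT of total disk, that usage falls to BUILDKIT_PRUNE_KEEP_GB after a prune, and that GB and GiB can be treated as equal for this estimate. The 500 GiB disk size is purely hypothetical (the runner's disk size is not stated in this PR), and `images_between_prunes` is an illustrative helper, not part of the workflow.

```python
import math


def images_between_prunes(disk_gib, threshold_pct, keep_gib, growth_gib_per_image):
    """Estimate how many images fit between the end of one prune and the next trigger.

    Assumes linear cache growth per image and that a prune leaves exactly
    keep_gib of cache behind (a simplification of real BuildKit behaviour).
    """
    trigger_gib = disk_gib * threshold_pct / 100
    headroom_gib = trigger_gib - keep_gib
    return max(0, math.floor(headroom_gib / growth_gib_per_image))


DISK_GIB = 500        # hypothetical runner disk size, NOT taken from the PR
STEADY_STATE = 1.63   # GiB/image, steady-state growth quoted in the PR
SPIKE = 3.0           # GiB/image, low end of the post-prune spike quoted in the PR

old_steady = images_between_prunes(DISK_GIB, 60, 60, STEADY_STATE)
new_steady = images_between_prunes(DISK_GIB, 85, 20, STEADY_STATE)
new_spike = images_between_prunes(DISK_GIB, 85, 20, SPIKE)

print(f"old settings (60%, keep 60): ~{old_steady} images between prunes")
print(f"new settings (85%, keep 20): ~{new_steady} images between prunes")
print(f"new settings at spike growth: ~{new_spike} images between prunes")
```

Under these assumptions a lower keep value widens the gap between prunes rather than narrowing it, but the estimate depends on the real disk size and on how long the post-prune growth spike lasts, which is the measurement requested above.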
    DOCKER_BUILDKIT: 1
    BUILDKIT_PROGRESS: plain
    OPENHANDS_BUILDKIT_CACHE_MODE: ${{ inputs.cache-mode }}
    BUILDKIT_PRUNE_THRESHOLD_PCT: '85'
🟡 Suggestion - Verify Assumption: The PR description assumes "expensive cached layers persist in registry from previous runs."
Question: What happens in these scenarios?
- First-time build on a fresh runner
- Registry cache was evicted/expired
- Force-rebuild scenario
With cache-mode=off, these scenarios get no benefit from local cache exports. Verify this trade-off is acceptable.
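For reference, the import/export split under discussion can be sketched as follows. This is a minimal illustration of how a build script might translate the cache mode into `docker buildx build` flags; how the build tooling actually consumes OPENHANDS_BUILDKIT_CACHE_MODE is an assumption here, and both `buildx_cache_args` and the cache reference are hypothetical.

```python
def buildx_cache_args(cache_ref, cache_mode):
    """Return buildx cache flags for a registry cache.

    The import flag (--cache-from) is always emitted; the export flag
    (--cache-to) is emitted only when cache_mode is "min" or "max".
    """
    if cache_mode not in ("max", "min", "off"):
        raise ValueError(f"unsupported cache mode: {cache_mode!r}")

    # Importing from the registry cache is independent of the export mode.
    args = ["--cache-from", f"type=registry,ref={cache_ref}"]

    # With "off" the export step is skipped, so this run writes nothing back.
    if cache_mode != "off":
        args += ["--cache-to", f"type=registry,ref={cache_ref},mode={cache_mode}"]
    return args


# Hypothetical cache reference, for illustration only.
REF = "registry.example/swt-bench/buildcache:latest"
print(buildx_cache_args(REF, "off"))
print(buildx_cache_args(REF, "max"))
```

If the tooling behaves like this sketch, the three scenarios above still import whatever the registry holds, but a run with mode=off never repopulates it, so an evicted or expired cache stays cold until some run exports again.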
Summary
- cache-mode from max to off: eliminates ~64s/image (16.4%) of cache export overhead. Registry cache imports still work; only export is skipped. Expensive cached layers (apt-get, npm install) persist in registry from previous runs.
- BUILDKIT_PRUNE_THRESHOLD_PCT from 60% to 85%: delays first prune from ~163 to ~286 images, eliminating one of two prune events (~20 min saved).
- BUILDKIT_PRUNE_KEEP_GB from 60 to 20: maximizes post-prune headroom.

Context
Analysis of run #23382357696 (9h04m, 47.8 img/h for 433 images) showed the Dockerfile ARG cache fix (#544) is working (12-13 cached steps/image), but two overheads dominate:
- cache-mode=max cache export: avg 64s/image
- BuildKit prune events at the 60% threshold: two per run, ~20 min each

Expected improvement: ~47.8 → ~65-70 img/h (~6h15m for 433 images), recovering the late-February baseline.
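The throughput and wall-clock figures can be cross-checked with a few lines of arithmetic. This is only a sanity check of the quoted numbers, not a model of the build; `wall_clock` is an illustrative helper.

```python
def wall_clock(images, images_per_hour):
    """Convert an image count and a throughput into an 'XhYYm' wall-clock string."""
    total_minutes = round(images / images_per_hour * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


IMAGES = 433  # image count from run #23382357696

for rate in (47.8, 65, 70):
    print(f"{rate} img/h -> {wall_clock(IMAGES, rate)}")
```

At 47.8 img/h this reproduces the measured 9h04m. The projected 65-70 img/h range corresponds to roughly 6h40m down to 6h11m, so the ~6h15m estimate sits at the optimistic end of that range.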
References: #530, #531, #544
Test plan
- cache_export_seconds drops to near-zero

🤖 Generated with Claude Code