
Add benchmark-side Apptainer workspace support#509

Draft
neubig wants to merge 5 commits into main from docs/apptainer-benchmark-clarification

Conversation

@neubig
Contributor

@neubig neubig commented Mar 12, 2026

Summary

  • add benchmark-side --workspace apptainer support in the shared parser/models and the supported runners
  • introduce a reusable create_apptainer_workspace() helper for pre-built agent-server images, with configurable Apptainer runtime env vars
  • document Apptainer usage and limitations in the root and benchmark READMEs, plus add focused tests
  • clarify that Apptainer requires registry-pullable images built with --push, improve the error message for local-only builds, and reuse cached SIFs from APPTAINER_CACHE_DIR

Testing

  • uv run pre-commit run --files README.md benchmarks/utils/args_parser.py benchmarks/utils/models.py benchmarks/utils/image_utils.py benchmarks/gaia/run_infer.py benchmarks/commit0/run_infer.py benchmarks/multiswebench/run_infer.py benchmarks/swebench/run_infer.py benchmarks/swtbench/run_infer.py benchmarks/swebenchmultimodal/run_infer.py benchmarks/swefficiency/run_infer.py benchmarks/openagentsafety/run_infer.py benchmarks/swebench/README.md benchmarks/multiswebench/README.md benchmarks/swefficiency/README.md benchmarks/swebenchmultimodal/README.md tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pytest tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pre-commit run --files benchmarks/utils/image_utils.py benchmarks/swebench/README.md README.md tests/test_image_utils.py
  • uv run pytest tests/test_image_utils.py -q

Evidence

  • I attempted a minimal end-to-end benchmark run in this sandbox with a public dataset and a published agent image:
    • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1
  • That run reached ApptainerWorkspace initialization and resolved a published image successfully:
    • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal
  • The run then failed immediately in the current sandbox with:
    • [Errno 2] No such file or directory: 'apptainer'
  • Additional sandbox blockers remain:
    • apptainer is not installed
    • /dev/fuse is unavailable
    • /var/run/docker.sock is unavailable, so a local Docker fallback is not possible here either
  • There is still insufficient evidence to merge this PR, since it has not been run end-to-end.
  • End-to-end Apptainer validation remains pending human QA on a machine with a working Apptainer runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clear, honest documentation that solves a real problem.

This accurately reflects the current state: Apptainer is in the SDK but not wired into the benchmark CLI. The writing is pragmatic and gives users concrete paths forward on Docker-restricted systems. No bikeshedding, no pretending features exist that don't - just straightforward technical documentation.

Taste Rating: Elegant
Verdict: ✅ Ship it

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title Clarify Apptainer support in benchmark docs Add benchmark-side Apptainer workspace support Mar 13, 2026
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented Mar 16, 2026

Following up: I tried the same validation path with a public benchmark dataset instead of GAIA.

Command attempted:

  • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

This is stronger evidence than the GAIA attempt because it used a public dataset and a published image:

  • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal

The run reached ApptainerWorkspace initialization and then failed immediately with:

  • [Errno 2] No such file or directory: 'apptainer'

So at this point the blocker in this sandbox is no longer dataset access; it is the runtime environment itself. I still cannot complete Apptainer end-to-end validation here because:

  • apptainer is not installed
  • /dev/fuse is unavailable
  • /var/run/docker.sock is unavailable, so I cannot use a local Docker fallback here either

The PR remains a draft. There is still insufficient evidence to merge it, since the change has not been run end-to-end; Apptainer validation is pending human QA on a machine with a working runtime.

@neubig neubig marked this pull request as draft March 16, 2026 14:36
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented Mar 26, 2026

Fresh verification on 2026-03-26:

I re-ran the minimal public SWE-Bench validation path on the current PR branch after first running make build (the sandbox needed its submodules initialized and uv sync --dev before uv run worked at all).

Commands used:

  • make build
  • uv run swebench-infer /tmp/llm-config-eval.json --dataset princeton-nlp/SWE-bench_Lite --split test --select /tmp/swe-select-6jzr4ly4 --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

What this confirms live:

  • the benchmark CLI accepts --workspace apptainer
  • the run initializes normally with the real benchmark entrypoint
  • it downloads/loads the public princeton-nlp/SWE-bench_Lite dataset
  • it selects and starts processing astropy__astropy-12907
  • it reaches create_apptainer_workspace() / SDK ApptainerWorkspace initialization

Current blocking failure:

  • the run still stops at SDK workspace initialization with FileNotFoundError: [Errno 2] No such file or directory: 'apptainer'
  • stack trace shows the failure comes from openhands.workspace.apptainer.workspace.ApptainerWorkspace.model_post_init() when it runs apptainer version
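
That probe can be reproduced in isolation. A minimal sketch follows; `check_runtime` and the `RuntimeError` wrapper are illustrative rather than the SDK's actual API, but the failure mode matches the log above:

```python
import subprocess


def check_runtime(binary: str = "apptainer") -> str:
    """Probe the container runtime the way the SDK's model_post_init()
    reportedly does: run `<binary> version`. A missing binary raises
    FileNotFoundError (Errno 2), which we surface as a clearer error."""
    try:
        out = subprocess.run(
            [binary, "version"], capture_output=True, text=True, check=True
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"{binary} runtime not installed on PATH") from exc
    return out.stdout.strip()
```

On a machine without Apptainer this raises the wrapped error immediately, without ever reaching image resolution.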

Relevant sandbox checks from this environment:

  • which apptainer -> not found
  • /dev/fuse -> absent
  • /var/run/docker.sock -> absent
  • apt-cache search apptainer did not expose an installable runtime package here either
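
The three sandbox checks above can be bundled into one preflight routine. This is an illustrative sketch (`apptainer_preflight` is not an existing helper); the checkers are injectable so the logic can be exercised without a real sandbox:

```python
import os
import shutil


def apptainer_preflight(which=shutil.which, exists=os.path.exists) -> list[str]:
    """Return the list of environment blockers from the checks above;
    an empty list means the machine can attempt an Apptainer run."""
    problems = []
    if which("apptainer") is None:
        problems.append("apptainer binary not found on PATH")
    if not exists("/dev/fuse"):
        problems.append("/dev/fuse is unavailable")
    if not exists("/var/run/docker.sock"):
        problems.append("/var/run/docker.sock is unavailable (no Docker fallback)")
    return problems
```

In this sandbox all three checks fail, which is exactly the blocker list reported above.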

So this is stronger live evidence than the earlier comment: the benchmark path now runs far enough to prove the new CLI wiring is exercised, but I still do not have end-to-end Apptainer execution evidence in this sandbox. The PR should remain a draft pending human QA on a machine with a working Apptainer runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
