
Add benchmark-side Apptainer workspace support#509

Draft
neubig wants to merge 5 commits into main from docs/apptainer-benchmark-clarification

Conversation

@neubig
Contributor

@neubig neubig commented Mar 12, 2026

Summary

  • add benchmark-side --workspace apptainer support in the shared parser/models and the supported runners
  • introduce a reusable create_apptainer_workspace() helper for pre-built agent-server images, with configurable Apptainer runtime env vars
  • document Apptainer usage and limitations in the root and benchmark READMEs, plus add focused tests
  • clarify that Apptainer requires registry-pullable images built with --push, improve the error message for local-only builds, and reuse cached SIFs from APPTAINER_CACHE_DIR

Testing

  • uv run pre-commit run --files README.md benchmarks/utils/args_parser.py benchmarks/utils/models.py benchmarks/utils/image_utils.py benchmarks/gaia/run_infer.py benchmarks/commit0/run_infer.py benchmarks/multiswebench/run_infer.py benchmarks/swebench/run_infer.py benchmarks/swtbench/run_infer.py benchmarks/swebenchmultimodal/run_infer.py benchmarks/swefficiency/run_infer.py benchmarks/openagentsafety/run_infer.py benchmarks/swebench/README.md benchmarks/multiswebench/README.md benchmarks/swefficiency/README.md benchmarks/swebenchmultimodal/README.md tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pytest tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pre-commit run --files benchmarks/utils/image_utils.py benchmarks/swebench/README.md README.md tests/test_image_utils.py
  • uv run pytest tests/test_image_utils.py -q

Evidence

  • I attempted a minimal end-to-end benchmark run in this sandbox with a public dataset and a published agent image:
    • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1
  • That run reached ApptainerWorkspace initialization and resolved a published image successfully:
    • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal
  • The run then failed immediately in the current sandbox with:
    • [Errno 2] No such file or directory: 'apptainer'
  • Additional sandbox blockers remain:
    • apptainer is not installed
    • /dev/fuse is unavailable
    • /var/run/docker.sock is unavailable, so a local Docker fallback is not possible here either
  • There is still insufficient evidence to merge this PR, since it has not been run end-to-end.
  • End-to-end Apptainer validation remains pending human QA on a machine with a working Apptainer runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clear, honest documentation that solves a real problem.

This accurately reflects the current state: Apptainer is in the SDK but not wired into the benchmark CLI. The writing is pragmatic and gives users concrete paths forward on Docker-restricted systems. No bikeshedding, no pretending features exist that don't - just straightforward technical documentation.

Taste Rating: Elegant
Verdict: ✅ Ship it

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title Clarify Apptainer support in benchmark docs Add benchmark-side Apptainer workspace support Mar 13, 2026
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented Mar 16, 2026

Following up: I tried the same validation path with a public benchmark dataset instead of GAIA.

Command attempted:

  • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

This is stronger evidence than the GAIA attempt because it used a public dataset and a published image:

  • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal

The run reached ApptainerWorkspace initialization and then failed immediately with:

  • [Errno 2] No such file or directory: 'apptainer'

So at this point the blocker in this sandbox is no longer dataset access; it is the runtime environment itself. I still cannot complete Apptainer end-to-end validation here because:

  • apptainer is not installed
  • /dev/fuse is unavailable
  • /var/run/docker.sock is unavailable, so I cannot use a local Docker fallback here either

The PR remains a draft. There is still insufficient evidence to merge it, since the change has not been run end-to-end; Apptainer validation is pending human QA on a machine with a working runtime.

@neubig neubig marked this pull request as draft March 16, 2026 14:36
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor Author

neubig commented Mar 26, 2026

Fresh verification on 2026-03-26:

I re-ran the minimal public SWE-Bench validation path on the current PR branch after first running make build (the sandbox needed its submodules initialized and uv sync --dev before uv run worked at all).

Commands used:

  • make build
  • uv run swebench-infer /tmp/llm-config-eval.json --dataset princeton-nlp/SWE-bench_Lite --split test --select /tmp/swe-select-6jzr4ly4 --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

What this confirms live:

  • the benchmark CLI accepts --workspace apptainer
  • the run initializes normally with the real benchmark entrypoint
  • it downloads/loads the public princeton-nlp/SWE-bench_Lite dataset
  • it selects and starts processing astropy__astropy-12907
  • it reaches create_apptainer_workspace() / SDK ApptainerWorkspace initialization

Current blocking failure:

  • the run still stops at SDK workspace initialization with FileNotFoundError: [Errno 2] No such file or directory: 'apptainer'
  • stack trace shows the failure comes from openhands.workspace.apptainer.workspace.ApptainerWorkspace.model_post_init() when it runs apptainer version
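
That probe can be reproduced in isolation. A minimal sketch follows; `check_runtime` and the `RuntimeError` wrapper are illustrative rather than the SDK's actual API, but the failure mode matches the log above:

```python
import subprocess


def check_runtime(binary: str = "apptainer") -> str:
    """Probe the container runtime the way the SDK's model_post_init()
    reportedly does: run `<binary> version`. A missing binary raises
    FileNotFoundError (Errno 2), which we surface as a clearer error."""
    try:
        out = subprocess.run(
            [binary, "version"], capture_output=True, text=True, check=True
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"{binary} runtime not installed on PATH") from exc
    return out.stdout.strip()
```

On a machine without Apptainer this raises the wrapped error immediately, without ever reaching image resolution.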

Relevant sandbox checks from this environment:

  • which apptainer -> not found
  • /dev/fuse -> absent
  • /var/run/docker.sock -> absent
  • apt-cache search apptainer did not expose an installable runtime package here either
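
The three sandbox checks above can be bundled into one preflight routine. This is an illustrative sketch (`apptainer_preflight` is not an existing helper); the checkers are injectable so the logic can be exercised without a real sandbox:

```python
import os
import shutil


def apptainer_preflight(which=shutil.which, exists=os.path.exists) -> list[str]:
    """Return the list of environment blockers from the checks above;
    an empty list means the machine can attempt an Apptainer run."""
    problems = []
    if which("apptainer") is None:
        problems.append("apptainer binary not found on PATH")
    if not exists("/dev/fuse"):
        problems.append("/dev/fuse is unavailable")
    if not exists("/var/run/docker.sock"):
        problems.append("/var/run/docker.sock is unavailable (no Docker fallback)")
    return problems
```

In this sandbox all three checks fail, which is exactly the blocker list reported above.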

So this is stronger live evidence than the earlier comment: the benchmark path now runs far enough to prove the new CLI wiring is exercised, but I still do not have end-to-end Apptainer execution evidence in this sandbox. The PR should remain a draft pending human QA on a machine with a working Apptainer runtime.

Co-authored-by: openhands <openhands@all-hands.dev>
