Add MyPCBench as a supported benchmark #3
Open
ljang0 wants to merge 4 commits into
Run MACU's multi-agent loop on MyPCBench (184 rubric-graded personal-assistant CUA tasks on a seeded QEMU/KVM VM). MyPCBench's in-VM Control API is OSWorld-compatible and its tasks carry no OSWorld domain, so MACU drives it as a generic CUA benchmark under the apptainer (QEMU) provider -- each parallel CUA subagent gets its own CoW-overlay clone of VM state.

- utils/mypcbench_utils.py + scripts/convert_mypcbench_tasks.py: convert MyPCBench tasks to MACU CUA tasks (no_initial_setup), retaining grading rubrics; shared weighted-aggregation helpers.
- utils/vm_utils.py: MACU_APPTAINER_BASE_IMAGE override so overlay cloning can target the MyPCBench qcow2 (OSWorld path unchanged when unset).
- scripts/run_cua.py: gated post-boot prep (DNS pin, chat-app LLM-key injection, lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the controller; enabled with MACU_MYPCBENCH=1.
- evals/mypcbench_eval.py: offline rubric judge reusing the Odysseys full-trajectory per-rubric Gemini judge, with MyPCBench weighted scoring (Perfect % / Rubric %) so MACU runs are comparable to single-agent MyPCBench.
- README/evals/README docs + tests/test_mypcbench_utils.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
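The weighted Perfect % / Rubric % scoring mentioned in this commit might look roughly like the sketch below. The definitions are assumptions, not confirmed by this PR: each rubric is taken to carry a `weight` and a boolean `passed`, a task's score is the weight-normalized share of passed rubrics, Rubric % is the mean task score, and Perfect % is the share of tasks where every rubric passed. The function names are hypothetical and are not the helpers in utils/mypcbench_utils.py.

```python
from typing import Dict, List


def task_score(rubrics: List[Dict]) -> float:
    """Weight-normalized share of passed rubrics for one task (assumed definition)."""
    total = sum(float(r.get("weight", 1.0)) for r in rubrics)
    if total <= 0:
        return 0.0
    earned = sum(float(r.get("weight", 1.0)) for r in rubrics if r.get("passed"))
    return earned / total


def aggregate(tasks: Dict[str, List[Dict]]) -> Dict[str, float]:
    """Return Perfect % and Rubric % across tasks (assumed definitions)."""
    if not tasks:
        return {"perfect_pct": 0.0, "rubric_pct": 0.0}
    scores = [task_score(rubrics) for rubrics in tasks.values()]
    # A task counts as "perfect" only when all of its weight was earned.
    perfect = sum(1 for s in scores if s >= 1.0)
    return {
        "perfect_pct": 100.0 * perfect / len(scores),
        "rubric_pct": 100.0 * sum(scores) / len(scores),
    }


if __name__ == "__main__":
    demo = {
        "task_a": [{"weight": 2, "passed": True}, {"weight": 1, "passed": False}],
        "task_b": [{"weight": 1, "passed": True}],
    }
    print(aggregate(demo))
```

Keeping the two numbers separate shows how often a run fully completes a task versus how much partial credit it earns.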
MACU's only local-VM providers were vmware and apptainer (qemu inside a SIF). Hosts with bare QEMU+KVM but no apptainer -- e.g. GPU/compute nodes where unprivileged user namespaces are disabled, which is how MyPCBench's own harness boots -- had no way to run the multi-agent loop. Add a "qemu" provider that boots the qcow2 with host qemu-system-x86_64 directly.

- osworld/desktop_env/providers/qemu/: QemuProvider subclasses ApptainerProvider and overrides only the methods that shell out through `apptainer exec <sif>` (_launch_qemu, qemu-img info) to run on the host, with host OVMF discovery. Inherits the CoW-overlay + QMP state save/restore + sidecar reconnect, so per-subagent VM cloning still works. QemuVMManager mints overlays with host qemu-img.
- osworld/patches.py: register provider_name="qemu" (distinct sentinel, chained factory + DesktopEnv.__init__ patch, mirroring apptainer).
- osworld/.../apptainer/provider.py: extract the SIF preflight check into an overridable _check_runtime_available() (qemu checks for the qemu binary + KVM + OVMF instead).
- utils/vm_utils.py + utils/macu_runtime.py: treat "qemu" as a local-overlay provider like apptainer; mint overlays with host qemu-img when the qemu backend is selected.
- scripts/run_cua.py: allow --provider_name qemu; run MyPCBench prep via the Control API /execute (shell) instead of execute_python_command, which invokes `python` (the image ships python3 only).
- README: document --provider_name qemu as the bare-KVM path for MyPCBench.

Validated end-to-end on bare KVM with a local Qwen3.5-4B: run_macu --no-manager --provider_name qemu boots the MyPCBench image, runs the per-boot prep (DB seeding + LLM key), drives the CUA agent (8 steps + screenshots), and the run is scored by evals/mypcbench_eval.py against the local model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
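As an illustration of the host preflight described above (qemu binary + KVM + OVMF), a check could be sketched as follows. This is not the PR's `_check_runtime_available()`; the function names, the firmware candidate paths, and the choice to return a list of problems are all assumptions made for the example.

```python
import os
import shutil
from typing import Iterable, List, Optional

# Hypothetical candidate paths: distributions install OVMF firmware in different places.
DEFAULT_OVMF_CANDIDATES = (
    "/usr/share/OVMF/OVMF_CODE.fd",
    "/usr/share/edk2/ovmf/OVMF_CODE.fd",
    "/usr/share/qemu/OVMF.fd",
)


def find_ovmf(candidates: Iterable[str] = DEFAULT_OVMF_CANDIDATES) -> Optional[str]:
    """Return the first firmware file that exists, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def check_runtime_available(
    qemu_binary: str = "qemu-system-x86_64",
    kvm_device: str = "/dev/kvm",
    ovmf_candidates: Iterable[str] = DEFAULT_OVMF_CANDIDATES,
) -> List[str]:
    """Return human-readable problems; an empty list means the host looks usable."""
    problems = []
    if shutil.which(qemu_binary) is None:
        problems.append(f"{qemu_binary} not found on PATH")
    if not os.access(kvm_device, os.R_OK | os.W_OK):
        problems.append(f"{kvm_device} missing or not readable/writable")
    if find_ovmf(ovmf_candidates) is None:
        problems.append("no OVMF firmware file found")
    return problems


if __name__ == "__main__":
    for problem in check_runtime_available():
        print("preflight:", problem)
```

Returning every problem at once, rather than raising on the first, lets a user fix a misconfigured node in one pass.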
MACU's Qwen agent was computer-only (a single computer_use pyautogui tool), matching MyPCBench's qwen_cua appendix ablation. Add an optional bash tool so it matches the qwen_cuabash main agent (computer + bash).

- osworld/mm_agents/qwen35vl_agent.py: enable_bash flag advertises a bash tool in the system prompt as an XML `<function=bash><parameter=command>` form (ported from MyPCBench), gated so computer-only stays the default.
- scripts/run_cua.py: --cua_bash flag (or MACU_CUA_BASH=1). The runner parses the bash tool call out of the model response, executes it in the guest via the Control API /execute (shell=True), records a trajectory row, and feeds stdout/stderr back the next turn wrapped in <tool_response>. Bash-only turns continue the episode instead of ending it; applied to both the main and the manager-followup loops.
- README: document --cua_bash / MACU_CUA_BASH.

The bash tool is enabled per run via the flag; existing computer-only runs are unchanged. Threaded through MACU by exporting MACU_CUA_BASH=1 (run_cua subprocesses inherit it), no orchestrator change needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
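The parse-and-feed-back step for the bash tool could be sketched as below. The opening `<function=bash><parameter=command>` form and the `<tool_response>` wrapper come from the commit message; the closing `</parameter></function>` tags, the function names, and the exact layout of the response body are assumptions.

```python
import re
from typing import Optional

# Assumed full shape: <function=bash><parameter=command>...</parameter></function>
_BASH_CALL = re.compile(
    r"<function=bash>\s*<parameter=command>(.*?)</parameter>\s*</function>",
    re.DOTALL,
)


def parse_bash_call(response: str) -> Optional[str]:
    """Extract the first bash command from a model response, or None if absent."""
    match = _BASH_CALL.search(response)
    if match is None:
        return None
    command = match.group(1).strip()
    return command or None


def format_tool_response(stdout: str, stderr: str) -> str:
    """Wrap command output for the next turn (the body layout is hypothetical)."""
    parts = []
    if stdout:
        parts.append("stdout:\n" + stdout.rstrip())
    if stderr:
        parts.append("stderr:\n" + stderr.rstrip())
    body = "\n".join(parts) if parts else "(no output)"
    return "<tool_response>\n" + body + "\n</tool_response>"


if __name__ == "__main__":
    reply = (
        "<tool_call>\n<function=bash>\n<parameter=command>\n"
        "ls -la /tmp\n</parameter>\n</function>\n</tool_call>"
    )
    print(parse_bash_call(reply))
    print(format_tool_response("file_a\nfile_b\n", ""))
```

A turn whose response yields a command but no GUI action would then continue the episode, as the commit describes for bash-only turns.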
Match agent-harness/env.py's reference boot (virtio-blk, cache=unsafe, -vga virtio) instead of the -hda/IDE inherited from the apptainer provider. The MyPCBench image is built for virtio; IDE is slower and a different device model than the guest expects.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
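To make the difference between the two boot styles concrete, here is a minimal sketch of an argument builder. It is not the provider's `_launch_qemu`; the function name, the memory and CPU defaults, and the omission of firmware, networking, and QMP options are simplifications assumed for the example.

```python
from typing import List


def build_qemu_args(
    disk: str,
    memory_mb: int = 4096,  # placeholder value, not taken from the PR
    cpus: int = 2,          # placeholder value, not taken from the PR
    use_virtio: bool = True,
) -> List[str]:
    """Build a qemu-system-x86_64 argv with either a virtio-blk or a legacy IDE disk."""
    args = [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", str(memory_mb),
        "-smp", str(cpus),
    ]
    if use_virtio:
        # if=virtio attaches the disk as virtio-blk; cache=unsafe skips host flushes,
        # which is only acceptable for throwaway overlays.
        args += ["-drive", f"file={disk},if=virtio,format=qcow2,cache=unsafe"]
        args += ["-vga", "virtio"]
    else:
        # Legacy form: -hda attaches the disk on the emulated IDE bus.
        args += ["-hda", disk]
    return args


if __name__ == "__main__":
    print(" ".join(build_qemu_args("/tmp/overlay.qcow2")))
```

Because `cache=unsafe` can lose data if the host crashes, it suits disposable per-subagent overlays rather than a base image that must persist.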
What
Adds MyPCBench (mypcbench.com) as a fourth supported benchmark, so MACU's multi-agent loop can run the 184 rubric-graded personal-assistant CUA tasks and be scored offline.
Why it fits cleanly
- MyPCBench's in-VM Control API is OSWorld-compatible (`/screenshot`, `/execute`, `/setup/*`), so MACU's existing `DesktopEnv` + apptainer/QEMU provider drive it unchanged.
- Its tasks carry no OSWorld `domain`, so MACU already classifies them as generic CUA (`is_osworld=False`), honoring `no_initial_setup` and skipping the OSWorld evaluator, exactly like Odysseys / Online-Mind2Web.
- The Odysseys full-trajectory per-rubric judge lives in `evals/odysseys_eval.py`; the new judge reuses that judging core.

Each parallel CUA subagent gets its own CoW-overlay clone of VM state via the existing apptainer QMP save/restore.
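The per-subagent CoW overlay, combined with the `MACU_APPTAINER_BASE_IMAGE` override, can be sketched as follows. Only the environment variable name comes from this PR; the function names and the decision to return a command list instead of running it are assumptions, and this is not the code in `utils/vm_utils.py`.

```python
import os
from typing import List, Mapping, Optional


def resolve_base_image(default_image: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer MACU_APPTAINER_BASE_IMAGE when set; otherwise keep the default image."""
    env = os.environ if env is None else env
    override = env.get("MACU_APPTAINER_BASE_IMAGE", "").strip()
    return override or default_image


def overlay_create_command(base_image: str, overlay_path: str) -> List[str]:
    """qemu-img command that creates a qcow2 overlay backed by base_image."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",        # format of the new overlay
        "-b", base_image,     # backing file; it is read but never modified
        "-F", "qcow2",        # format of the backing file
        overlay_path,
    ]


if __name__ == "__main__":
    base = resolve_base_image("/images/osworld.qcow2")  # hypothetical default path
    print(" ".join(overlay_create_command(base, "/tmp/subagent-0.qcow2")))
```

Each subagent writes only to its own overlay, so parallel clones can share one read-only base image; the command could be executed with `subprocess.run(cmd, check=True)`.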
Changes
- `utils/mypcbench_utils.py` + `scripts/convert_mypcbench_tasks.py`: convert MyPCBench tasks into MACU CUA tasks (`no_initial_setup`), retain grading rubrics; shared weighted-aggregation helpers.
- `utils/vm_utils.py`: `MACU_APPTAINER_BASE_IMAGE` override so overlay cloning targets the MyPCBench qcow2 (OSWorld path unchanged when unset).
- `scripts/run_cua.py`: gated post-boot prep (DNS pin, chat-app LLM-key injection, lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the controller; enabled with `MACU_MYPCBENCH=1`.
- `evals/mypcbench_eval.py`: offline judge reusing the Odysseys full-trajectory per-rubric Gemini judge with MyPCBench weighted scoring (Perfect % / Rubric %), in a `scores.json`-compatible shape so MACU runs are directly comparable to single-agent MyPCBench runs.
- `evals/README.md` docs (incl. Qwen path + MACU-vs-single-agent comparison) and `tests/test_mypcbench_utils.py`.

Testing
- `tests/test_mypcbench_utils.py` (converter mapping, rubric normalization, weighted + Perfect%/Rubric% aggregation) passes; `test_vm_utils` unaffected.
- `bash -n`.

Out of scope
🤖 Generated with Claude Code