
Add MyPCBench as a supported benchmark#3

Open
ljang0 wants to merge 4 commits into kohjingyu:master from ljang0:add-mypcbench-support

ljang0 commented Jun 15, 2026


What

Adds MyPCBench (mypcbench.com) as a fourth supported benchmark, so MACU's multi-agent loop can run the 184 rubric-graded personal-assistant CUA tasks and the resulting runs can be scored offline.

Why it fits cleanly

  • MyPCBench's in-VM Control API is OSWorld-compatible (port 5000, same /screenshot, /execute, /setup/*), so MACU's existing DesktopEnv + apptainer/QEMU provider drive it unchanged.
  • MyPCBench tasks carry no OSWorld domain, so MACU already classifies them as generic CUA (is_osworld=False) — honoring no_initial_setup, skipping the OSWorld evaluator — exactly like Odysseys / Online-Mind2Web.
  • MyPCBench's rubric judge was ported from Odysseys, which this repo already ships as evals/odysseys_eval.py — the new judge reuses that judging core.

Each parallel CUA subagent gets its own CoW-overlay clone of VM state via the existing apptainer QMP save/restore.
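
For orientation, here is a minimal sketch of what "OSWorld-compatible" implies for a client of the in-VM Control API. It is not MACU's or MyPCBench's code: the helper names are hypothetical, and the JSON body shape (`command`, `shell`) is an assumption based on OSWorld's control server. Only the port and the `/screenshot` and `/execute` paths come from this PR.

```python
import json
import urllib.request

CONTROL_PORT = 5000  # port named in this PR


def build_execute_request(host: str, command: str, shell: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST to the in-VM /execute endpoint.

    Assumption: the body is JSON with "command" and "shell" keys, as in the
    OSWorld control server; MyPCBench may accept additional fields.
    """
    payload = json.dumps({"command": command, "shell": shell}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{CONTROL_PORT}/execute",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def screenshot_url(host: str) -> str:
    """Return the URL of the screenshot endpoint (assumed to be a plain GET)."""
    return f"http://{host}:{CONTROL_PORT}/screenshot"


def send(request: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from `send` lets the payload be inspected or unit-tested without a running VM.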

Changes

  • utils/mypcbench_utils.py + scripts/convert_mypcbench_tasks.py — convert MyPCBench tasks → MACU CUA tasks (no_initial_setup), retain grading rubrics; shared weighted-aggregation helpers.
  • utils/vm_utils.py: MACU_APPTAINER_BASE_IMAGE override so overlay cloning targets the MyPCBench qcow2 (OSWorld path unchanged when unset).
  • scripts/run_cua.py — gated post-boot prep (DNS pin, chat-app LLM-key injection, lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the controller; enabled with MACU_MYPCBENCH=1.
  • evals/mypcbench_eval.py — offline judge reusing the Odysseys full-trajectory per-rubric Gemini judge with MyPCBench weighted scoring (Perfect % / Rubric %), in a scores.json-compatible shape so MACU runs are directly comparable to single-agent MyPCBench runs.
  • README / evals/README.md docs (incl. Qwen path + MACU-vs-single-agent comparison) and tests/test_mypcbench_utils.py.
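
The weighted scoring mentioned in the changes above could look roughly like the sketch below. The definitions are assumptions, not taken from evals/mypcbench_eval.py: a task's score is taken to be the weight-normalized share of passed rubrics, Rubric % the mean task score, and Perfect % the share of tasks with every rubric passed. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricResult:
    weight: float  # relative importance of this rubric item
    passed: bool   # judge verdict for this rubric item


def task_score(rubrics: list[RubricResult]) -> float:
    """Weighted fraction of passed rubrics, in [0, 1]; 0.0 if no weight."""
    total = sum(r.weight for r in rubrics)
    if total <= 0:
        return 0.0
    return sum(r.weight for r in rubrics if r.passed) / total


def aggregate(tasks: dict[str, list[RubricResult]]) -> dict[str, float]:
    """Assumed definitions of Perfect % and Rubric % over a set of tasks."""
    if not tasks:
        return {"perfect_pct": 0.0, "rubric_pct": 0.0}
    scores = [task_score(rubrics) for rubrics in tasks.values()]
    # A task counts as perfect only if it has rubrics and all of them passed.
    perfect = sum(
        1 for rubrics in tasks.values() if rubrics and all(r.passed for r in rubrics)
    )
    return {
        "perfect_pct": 100.0 * perfect / len(tasks),
        "rubric_pct": 100.0 * sum(scores) / len(scores),
    }
```

Under these assumed definitions, a task that passes a weight-2 rubric and fails a weight-1 rubric scores 2/3 and does not count toward Perfect %.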

Testing

  • tests/test_mypcbench_utils.py (converter mapping, rubric normalization, weighted + Perfect%/Rubric% aggregation) passes; test_vm_utils unaffected.
  • Converter verified on the real 184-task MyPCBench file; generated guest prep script passes bash -n.
  • VM end-to-end (boot MyPCBench qcow2, prep, run, judge) requires a KVM host + the image and is validated separately — not run in CI.

Out of scope

  • No Docker backend (apptainer/QEMU only). No changes to OSWorld detection/eval. Image fetch stays in the MyPCBench repo.

🤖 Generated with Claude Code

ljang0 and others added 4 commits June 15, 2026 11:04
Run MACU's multi-agent loop on MyPCBench (184 rubric-graded personal-assistant
CUA tasks on a seeded QEMU/KVM VM). MyPCBench's in-VM Control API is
OSWorld-compatible and its tasks carry no OSWorld domain, so MACU drives it as a
generic CUA benchmark under the apptainer (QEMU) provider -- each parallel CUA
subagent gets its own CoW-overlay clone of VM state.

- utils/mypcbench_utils.py + scripts/convert_mypcbench_tasks.py: convert
  MyPCBench tasks to MACU CUA tasks (no_initial_setup), retaining grading
  rubrics; shared weighted-aggregation helpers.
- utils/vm_utils.py: MACU_APPTAINER_BASE_IMAGE override so overlay cloning can
  target the MyPCBench qcow2 (OSWorld path unchanged when unset).
- scripts/run_cua.py: gated post-boot prep (DNS pin, chat-app LLM-key injection,
  lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the
  controller; enabled with MACU_MYPCBENCH=1.
- evals/mypcbench_eval.py: offline rubric judge reusing the Odysseys
  full-trajectory per-rubric Gemini judge, with MyPCBench weighted scoring
  (Perfect % / Rubric %) so MACU runs are comparable to single-agent MyPCBench.
- README/evals/README docs + tests/test_mypcbench_utils.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MACU's only local-VM providers were vmware and apptainer (qemu inside a SIF).
Hosts with bare QEMU+KVM but no apptainer -- e.g. GPU/compute nodes where
unprivileged user namespaces are disabled, which is how MyPCBench's own harness
boots -- had no way to run the multi-agent loop. Add a "qemu" provider that
boots the qcow2 with host qemu-system-x86_64 directly.

- osworld/desktop_env/providers/qemu/: QemuProvider subclasses ApptainerProvider
  and overrides only the methods that shell out through `apptainer exec <sif>`
  (_launch_qemu, qemu-img info) to run on the host, with host OVMF discovery.
  Inherits the CoW-overlay + QMP state save/restore + sidecar reconnect, so
  per-subagent VM cloning still works. QemuVMManager mints overlays with host
  qemu-img.
- osworld/patches.py: register provider_name="qemu" (distinct sentinel, chained
  factory + DesktopEnv.__init__ patch, mirroring apptainer).
- osworld/.../apptainer/provider.py: extract the SIF preflight check into an
  overridable _check_runtime_available() (qemu checks for the qemu binary + KVM
  + OVMF instead).
- utils/vm_utils.py + utils/macu_runtime.py: treat "qemu" as a local-overlay
  provider like apptainer; mint overlays with host qemu-img when the qemu
  backend is selected.
- scripts/run_cua.py: allow --provider_name qemu; run MyPCBench prep via the
  Control API /execute (shell) instead of execute_python_command, which invokes
  `python` (the image ships python3 only).
- README: document --provider_name qemu as the bare-KVM path for MyPCBench.
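
As an illustration of the overridable preflight hook described above, the sketch below shows a subclass replacing only the runtime check. The class names, the injected probe callables, and the OVMF candidate paths are hypothetical stand-ins; the real providers live under osworld/desktop_env/providers/ and may differ.

```python
import os
import shutil

# Example firmware locations only; real paths vary by distribution.
OVMF_CANDIDATES = (
    "/usr/share/OVMF/OVMF_CODE.fd",
    "/usr/share/edk2/ovmf/OVMF_CODE.fd",
)


class ApptainerLikeProvider:
    """Stand-in for the base provider; only the preflight hook is modelled."""

    def _check_runtime_available(self) -> list[str]:
        """Return a list of missing prerequisites (empty means OK)."""
        return [] if shutil.which("apptainer") else ["apptainer binary"]


class QemuLikeProvider(ApptainerLikeProvider):
    """Overrides the hook to check host QEMU, KVM, and OVMF instead."""

    def __init__(
        self,
        which=shutil.which,
        exists=os.path.exists,
        accessible=lambda path: os.access(path, os.R_OK | os.W_OK),
    ):
        # Probes are injectable so the check can be tested without a KVM host.
        self._which = which
        self._exists = exists
        self._accessible = accessible

    def _check_runtime_available(self) -> list[str]:
        missing = []
        if self._which("qemu-system-x86_64") is None:
            missing.append("qemu-system-x86_64")
        if not self._accessible("/dev/kvm"):
            missing.append("/dev/kvm (read/write access)")
        if not any(self._exists(path) for path in OVMF_CANDIDATES):
            missing.append("OVMF firmware")
        return missing
```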

Validated end-to-end on bare KVM with a local Qwen3.5-4B: run_macu --no-manager
--provider_name qemu boots the MyPCBench image, runs the per-boot prep (DB
seeding + LLM key), drives the CUA agent (8 steps + screenshots), and the run is
scored by evals/mypcbench_eval.py against the local model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MACU's Qwen agent was computer-only (a single computer_use pyautogui tool),
matching MyPCBench's qwen_cua appendix ablation. Add an optional bash tool so it
matches the qwen_cuabash main agent (computer + bash).

- osworld/mm_agents/qwen35vl_agent.py: enable_bash flag advertises a bash tool in
  the system prompt as an XML `<function=bash><parameter=command>` form (ported
  from MyPCBench), gated so computer-only stays the default.
- scripts/run_cua.py: --cua_bash flag (or MACU_CUA_BASH=1). The runner parses the
  bash tool call out of the model response, executes it in the guest via the
  Control API /execute (shell=True), records a trajectory row, and feeds
  stdout/stderr back the next turn wrapped in <tool_response>. Bash-only turns
  continue the episode instead of ending it; applied to both the main and the
  manager-followup loops.
- README: document --cua_bash / MACU_CUA_BASH.

The bash tool is enabled per run via the flag; existing computer-only runs are
unchanged. Threaded through MACU by exporting MACU_CUA_BASH=1 (run_cua
subprocesses inherit it), no orchestrator change needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Match agent-harness/env.py's reference boot (virtio-blk, cache=unsafe,
-vga virtio) instead of the -hda/IDE inherited from the apptainer
provider. The MyPCBench image is built for virtio; IDE is slower and is not the
device model the guest expects.
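
For reference, a minimal sketch of a QEMU argument list using the virtio settings named above. The function name and the memory and CPU defaults are placeholders, not values from agent-harness/env.py; the real boot line also carries firmware, networking, and QMP options that are omitted here.

```python
def build_qemu_argv(overlay_path: str, memory_mb: int = 4096, cpus: int = 4) -> list[str]:
    """Build a qemu-system-x86_64 argv with a virtio disk and virtio VGA.

    memory_mb and cpus are placeholder defaults. overlay_path must not
    contain commas, which QEMU treats as option separators in -drive.
    """
    return [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", str(memory_mb),
        "-smp", str(cpus),
        # if=virtio attaches the disk as virtio-blk rather than IDE (-hda);
        # cache=unsafe makes QEMU ignore guest flush requests.
        "-drive", f"file={overlay_path},format=qcow2,if=virtio,cache=unsafe",
        "-vga", "virtio",
    ]
```

Because `cache=unsafe` can lose data if the host crashes, it is generally suitable only for disposable disks such as per-run overlays.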

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>