Add MyPCBench as a supported benchmark by ljang0 · Pull Request #3 · kohjingyu/multi-agent-computer-use

ljang0 · 2026-06-15T15:11:58Z

What

Adds MyPCBench (mypcbench.com) as a fourth supported benchmark, so MACU's multi-agent loop can run the 184 rubric-graded personal-assistant CUA tasks and be scored offline.

Why it fits cleanly

MyPCBench's in-VM Control API is OSWorld-compatible (port 5000, same /screenshot, /execute, /setup/*), so MACU's existing DesktopEnv + apptainer/QEMU provider drive it unchanged.
MyPCBench tasks carry no OSWorld domain, so MACU already classifies them as generic CUA (is_osworld=False) — honoring no_initial_setup, skipping the OSWorld evaluator — exactly like Odysseys / Online-Mind2Web.
MyPCBench's rubric judge was ported from Odysseys, which this repo already ships as evals/odysseys_eval.py — the new judge reuses that judging core.

Each parallel CUA subagent gets its own CoW-overlay clone of VM state via the existing apptainer QMP save/restore.
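As a rough illustration of what "OSWorld-compatible" means for a caller, the sketch below builds (without sending) requests against the in-VM Control API. The port and the /screenshot and /execute endpoints come from the description above; the helper names and the JSON body shape (`command`, `shell`) are assumptions modeled on OSWorld's server and should be checked against the image actually being driven.

```python
import json
from urllib import request

CONTROL_PORT = 5000  # port stated in the PR description


def control_url(vm_ip: str, endpoint: str) -> str:
    """Build a Control API URL such as http://<ip>:5000/screenshot."""
    return f"http://{vm_ip}:{CONTROL_PORT}/{endpoint.lstrip('/')}"


def build_execute_request(vm_ip: str, command: str) -> request.Request:
    """Prepare (but do not send) a POST to /execute that runs a shell command.

    Assumption: the body is {"command": ..., "shell": ...}, as in OSWorld's
    server. This helper is hypothetical and not part of MACU.
    """
    body = json.dumps({"command": command, "shell": True}).encode("utf-8")
    return request.Request(
        control_url(vm_ip, "/execute"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the request shape easy to inspect without a running VM.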

Changes

utils/mypcbench_utils.py + scripts/convert_mypcbench_tasks.py — convert MyPCBench tasks → MACU CUA tasks (no_initial_setup), retain grading rubrics; shared weighted-aggregation helpers.
utils/vm_utils.py — MACU_APPTAINER_BASE_IMAGE override so overlay cloning targets the MyPCBench qcow2 (OSWorld path unchanged when unset).
scripts/run_cua.py — gated post-boot prep (DNS pin, chat-app LLM-key injection, lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the controller; enabled with MACU_MYPCBENCH=1.
evals/mypcbench_eval.py — offline judge reusing the Odysseys full-trajectory per-rubric Gemini judge with MyPCBench weighted scoring (Perfect % / Rubric %), in a scores.json-compatible shape so MACU runs are directly comparable to single-agent MyPCBench runs.
README / evals/README.md docs (incl. Qwen path + MACU-vs-single-agent comparison) and tests/test_mypcbench_utils.py.
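The weighted aggregation mentioned in the list above could look roughly like the following. This is a sketch under stated assumptions, not the code in utils/mypcbench_utils.py: it assumes Rubric % is the mean over tasks of the weight-normalized fraction of passed rubrics, and Perfect % is the share of tasks in which every rubric passed. `RubricResult`, `task_score`, and `aggregate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    weight: float  # rubric weight carried over from the task file
    passed: bool   # verdict returned by the judge for this rubric


def task_score(rubrics: list[RubricResult]) -> float:
    """Weight-normalized fraction of passed rubrics for one task (0.0 to 1.0)."""
    total = sum(r.weight for r in rubrics)
    if total <= 0:
        return 0.0
    return sum(r.weight for r in rubrics if r.passed) / total


def aggregate(tasks: dict[str, list[RubricResult]]) -> dict[str, float]:
    """Assumed definitions: Perfect % counts fully passed tasks, Rubric % averages task scores."""
    if not tasks:
        return {"perfect_pct": 0.0, "rubric_pct": 0.0}
    scores = [task_score(rubrics) for rubrics in tasks.values()]
    perfect = sum(
        1 for rubrics in tasks.values() if rubrics and all(r.passed for r in rubrics)
    )
    return {
        "perfect_pct": 100.0 * perfect / len(tasks),
        "rubric_pct": 100.0 * sum(scores) / len(scores),
    }
```

With two tasks, one fully passed and one passing only a weight-1 rubric out of a total weight of 4, this yields a Perfect % of 50.0 and a Rubric % of 62.5.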

Testing

tests/test_mypcbench_utils.py (converter mapping, rubric normalization, weighted + Perfect%/Rubric% aggregation) passes; test_vm_utils unaffected.
Converter verified on the real 184-task MyPCBench file; generated guest prep script passes bash -n.
VM end-to-end (boot MyPCBench qcow2, prep, run, judge) requires a KVM host + the image and is validated separately — not run in CI.

Out of scope

No Docker backend (apptainer/QEMU only). No changes to OSWorld detection/eval. Image fetch stays in the MyPCBench repo.

🤖 Generated with Claude Code

Run MACU's multi-agent loop on MyPCBench (184 rubric-graded personal-assistant CUA tasks on a seeded QEMU/KVM VM). MyPCBench's in-VM Control API is OSWorld-compatible and its tasks carry no OSWorld domain, so MACU drives it as a generic CUA benchmark under the apptainer (QEMU) provider -- each parallel CUA subagent gets its own CoW-overlay clone of VM state.

- utils/mypcbench_utils.py + scripts/convert_mypcbench_tasks.py: convert MyPCBench tasks to MACU CUA tasks (no_initial_setup), retaining grading rubrics; shared weighted-aggregation helpers.
- utils/vm_utils.py: MACU_APPTAINER_BASE_IMAGE override so overlay cloning can target the MyPCBench qcow2 (OSWorld path unchanged when unset).
- scripts/run_cua.py: gated post-boot prep (DNS pin, chat-app LLM-key injection, lazy app-DB seeding) ported from MyPCBench's harness, run in-guest via the controller; enabled with MACU_MYPCBENCH=1.
- evals/mypcbench_eval.py: offline rubric judge reusing the Odysseys full-trajectory per-rubric Gemini judge, with MyPCBench weighted scoring (Perfect % / Rubric %) so MACU runs are comparable to single-agent MyPCBench.
- README/evals/README docs + tests/test_mypcbench_utils.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MACU's only local-VM providers were vmware and apptainer (qemu inside a SIF). Hosts with bare QEMU+KVM but no apptainer -- e.g. GPU/compute nodes where unprivileged user namespaces are disabled, which is how MyPCBench's own harness boots -- had no way to run the multi-agent loop. Add a "qemu" provider that boots the qcow2 with host qemu-system-x86_64 directly.

- osworld/desktop_env/providers/qemu/: QemuProvider subclasses ApptainerProvider and overrides only the methods that shell out through `apptainer exec <sif>` (_launch_qemu, qemu-img info) to run on the host, with host OVMF discovery. Inherits the CoW-overlay + QMP state save/restore + sidecar reconnect, so per-subagent VM cloning still works. QemuVMManager mints overlays with host qemu-img.
- osworld/patches.py: register provider_name="qemu" (distinct sentinel, chained factory + DesktopEnv.__init__ patch, mirroring apptainer).
- osworld/.../apptainer/provider.py: extract the SIF preflight check into an overridable _check_runtime_available() (qemu checks for the qemu binary + KVM + OVMF instead).
- utils/vm_utils.py + utils/macu_runtime.py: treat "qemu" as a local-overlay provider like apptainer; mint overlays with host qemu-img when the qemu backend is selected.
- scripts/run_cua.py: allow --provider_name qemu; run MyPCBench prep via the Control API /execute (shell) instead of execute_python_command, which invokes `python` (the image ships python3 only).
- README: document --provider_name qemu as the bare-KVM path for MyPCBench.

Validated end-to-end on bare KVM with a local Qwen3.5-4B: run_macu --no-manager --provider_name qemu boots the MyPCBench image, runs the per-boot prep (DB seeding + LLM key), drives the CUA agent (8 steps + screenshots), and the run is scored by evals/mypcbench_eval.py against the local model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
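To make the host-side overlay minting and preflight check in this commit more concrete, here is a minimal sketch. It is not the QemuProvider code: `overlay_create_cmd` and `runtime_problems` are hypothetical helpers, the OVMF lookup is omitted because firmware paths vary by distribution, and only the `qemu-img create -f qcow2 -b ... -F qcow2 ...` invocation is standard QEMU tooling.

```python
import os
import shutil


def overlay_create_cmd(base_image: str, overlay: str, qemu_img: str = "qemu-img") -> list[str]:
    """Return the argv that creates a copy-on-write qcow2 overlay on top of base_image.

    Absolute paths are advisable: qemu-img resolves a relative backing path
    relative to the overlay's directory, not the current working directory.
    """
    return [
        qemu_img, "create",
        "-f", "qcow2",       # format of the new overlay
        "-b", base_image,    # backing file (left unmodified)
        "-F", "qcow2",       # format of the backing file
        overlay,
    ]


def runtime_problems(
    qemu_binary: str = "qemu-system-x86_64",
    kvm_device: str = "/dev/kvm",
) -> list[str]:
    """Collect human-readable reasons why a bare QEMU+KVM boot would likely fail."""
    problems = []
    if shutil.which(qemu_binary) is None:
        problems.append(f"{qemu_binary} not found on PATH")
    if not os.path.exists(kvm_device):
        problems.append(f"{kvm_device} is missing (KVM unavailable)")
    return problems
```

The returned argv can be handed to `subprocess.run(..., check=True)`; returning a list rather than a shell string sidesteps quoting problems with image paths.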

MACU's Qwen agent was computer-only (a single computer_use pyautogui tool), matching MyPCBench's qwen_cua appendix ablation. Add an optional bash tool so it matches the qwen_cuabash main agent (computer + bash).

- osworld/mm_agents/qwen35vl_agent.py: enable_bash flag advertises a bash tool in the system prompt as an XML `<function=bash><parameter=command>` form (ported from MyPCBench), gated so computer-only stays the default.
- scripts/run_cua.py: --cua_bash flag (or MACU_CUA_BASH=1). The runner parses the bash tool call out of the model response, executes it in the guest via the Control API /execute (shell=True), records a trajectory row, and feeds stdout/stderr back the next turn wrapped in <tool_response>. Bash-only turns continue the episode instead of ending it; applied to both the main and the manager-followup loops.
- README: document --cua_bash / MACU_CUA_BASH.

The bash tool is enabled per run via the flag; existing computer-only runs are unchanged. Threaded through MACU by exporting MACU_CUA_BASH=1 (run_cua subprocesses inherit it), no orchestrator change needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
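The parse-and-feed-back step described in this commit can be sketched as follows. This is not the runner's code: the `<function=bash><parameter=command>` opening form and the `<tool_response>` wrapper are from the commit message, while the closing `</parameter>` and `</function>` tags, the `[stderr]` separator, and both helper names are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed closing tags: </parameter> and </function>.
_BASH_CALL = re.compile(
    r"<function=bash>\s*<parameter=command>(.*?)</parameter>\s*</function>",
    re.DOTALL,
)


def parse_bash_call(response: str) -> Optional[str]:
    """Return the first bash command found in a model response, or None."""
    match = _BASH_CALL.search(response)
    if match is None:
        return None
    command = match.group(1).strip()
    return command or None


def wrap_tool_response(stdout: str, stderr: str) -> str:
    """Wrap command output so it can be fed back to the model on the next turn."""
    body = stdout
    if stderr:
        body += f"\n[stderr]\n{stderr}"  # separator format is an assumption
    return f"<tool_response>\n{body}\n</tool_response>"
```

A turn where `parse_bash_call` returns a command and no computer action is present would, per the commit message, continue the episode rather than end it.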

Match agent-harness/env.py's reference boot (virtio-blk, cache=unsafe, -vga virtio) instead of the -hda/IDE inherited from the apptainer provider. The MyPCBench image is built for virtio; IDE is slower and a different device model than the guest expects.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
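For readers unfamiliar with the difference, the two disk attachments can be contrasted in a short sketch. The function names and the memory default are hypothetical, and the argv is deliberately incomplete (no OVMF firmware, networking, or QMP socket); only the `-hda`, `-drive ...,if=virtio,cache=unsafe`, `-vga virtio`, and `-enable-kvm` options are standard QEMU flags.

```python
def legacy_ide_args(overlay: str) -> list[str]:
    """Disk attached as an emulated IDE drive, as in the inherited boot line."""
    return ["qemu-system-x86_64", "-enable-kvm", "-hda", overlay]


def virtio_boot_args(overlay: str, memory_mb: int = 4096) -> list[str]:
    """Disk attached as virtio-blk with cache=unsafe and a virtio VGA device.

    memory_mb is an illustrative default, not a value taken from the PR.
    """
    return [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", str(memory_mb),
        # cache=unsafe skips host flushes: fast, acceptable for throwaway overlays.
        "-drive", f"file={overlay},if=virtio,format=qcow2,cache=unsafe",
        "-vga", "virtio",
    ]
```

Because cache=unsafe can lose guest writes on a host crash, it suits disposable per-subagent overlays better than a base image that must be preserved.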

ljang0 and others added 4 commits June 15, 2026 11:04

ljang0 wants to merge 4 commits into kohjingyu:master from ljang0:add-mypcbench-support
