kohjingyu · ljang0 · Jun 15, 2026 · Jun 15, 2026 · Jun 15, 2026 · Jun 16, 2026
diff --git a/.gitignore b/.gitignore
@@ -15,4 +15,10 @@ data/
 .vmware_lck
 .vmware_vms
 
+# Local QEMU/apptainer provider runtime state (overlays, instance dirs, SIFs).
+# Anchored to repo root so the provider source dir
+# (osworld/desktop_env/providers/apptainer/) is NOT ignored.
+/apptainer/
+/osworld_vms/
+
 .git.old
diff --git a/README.md b/README.md
@@ -74,6 +74,33 @@ curl -L \
   https://raw.githubusercontent.com/ljang0/Odysseys/main/data/odysseys.json
 ```
 
+### MyPCBench Data and VM Image
+
+[MyPCBench](https://mypcbench.com) is a personal-assistant CUA benchmark: 184
+rubric-graded tasks on a seeded QEMU/KVM Ubuntu VM (persona Michael Scott) whose
+in-VM Control API is OSWorld-compatible. MACU runs it as a generic CUA benchmark
+under one of its local QEMU providers (`qemu` or `apptainer`), so each parallel
+CUA subagent gets its own CoW-overlay clone of the VM state.
+
+Clone MyPCBench and fetch its qcow2 image (no Docker/root required):
+
+```bash
+git clone https://github.com/ljang0/MyPCBench ../MyPCBench
+bash ../MyPCBench/scripts/get-eval-image.sh --out ../MyPCBench/mypcbench-vm
+# Point MACU's apptainer overlay cloning at the MyPCBench image:
+export MACU_APPTAINER_BASE_IMAGE="$(realpath ../MyPCBench/mypcbench-vm/mypcbench.qcow2)"
+```
+
+Convert MyPCBench's task file into a MACU task list (also the rubric source for
+offline judging):
+
+```bash
+mkdir -p data/mypcbench
+python scripts/convert_mypcbench_tasks.py \
+  ../MyPCBench/tasks/final/all_tasks_with_grading.json \
+  data/mypcbench/mypcbench_tasks.json
+```
+
 ### API Keys
 
 Copy `.example_env` to a local `.env`, replace the dummy values, then export them before launching MACU:
@@ -134,6 +161,10 @@ To run Online-Mind2Web or Odysseys directly with `run_macu.py`, use the same com
 | --- | --- | --- |
 | Online-Mind2Web | `data/online_m2w/Online_Mind2Web.json` | `runs/online_m2w_macu` |
 | Odysseys | `data/odysseys/odysseys.json` | `runs/odysseys_macu` |
+| MyPCBench | `data/mypcbench/mypcbench_tasks.json` | `runs/mypcbench_macu` (see below) |
+
+MyPCBench runs on its own VM image under a local QEMU provider, so it uses a
+slightly different command; see [Running MyPCBench](#running-mypcbench).
 
 To use GPT-5.4-mini as the CUA subagent instead of vLLM-backed Qwen, switch the CUA provider and model:
 
@@ -151,6 +182,50 @@ python run_macu.py data/osworld/evaluation_examples/test_small.json \
 
 Use `--task-id <id>` to run one task from a larger task file. OSWorld runs call the OSWorld evaluator at the end.
 
+### Running MyPCBench
+
+MyPCBench ships its own seeded VM image, so it runs under one of MACU's local
+QEMU providers with `MACU_APPTAINER_BASE_IMAGE` pointing at the MyPCBench qcow2
+(see [MyPCBench Data and VM Image](#mypcbench-data-and-vm-image) for the one-time
+setup and task conversion). Pick the provider that matches your host:
+
+- **`--provider_name qemu`** — boots the image with bare `qemu-system-x86_64` +
+  KVM, like MyPCBench's own harness. Needs only `/dev/kvm`, QEMU, and host OVMF
+  firmware (the `ovmf` package). Best on GPU/compute nodes where apptainer's
+  unprivileged user namespaces are unavailable.
+- **`--provider_name apptainer`** — runs QEMU inside an apptainer SIF (no QEMU
+  on the host). Use when you already run OSWorld this way.
+
+Both give each parallel CUA subagent its own CoW-overlay clone of the VM. Export
+`MACU_MYPCBENCH=1` so each worker runs MyPCBench's per-boot prep (DNS pin,
+chat-app LLM-key injection, and lazy app-DB seeding) against the freshly cloned
+VM before the agent acts. Each parallel VM needs roughly 8–16 GB of RAM.
+
+```bash
+source .venv/bin/activate
+set -a && source .env && set +a
+export MACU_APPTAINER_BASE_IMAGE="$(realpath ../MyPCBench/mypcbench-vm/mypcbench.qcow2)"
+export MACU_MYPCBENCH=1
+
+# GPT-5.4-mini CUA subagents on bare KVM (--provider_name apptainer also works)
+python run_macu.py data/mypcbench/mypcbench_tasks.json \
+  --result-dir runs/mypcbench_macu_gpt54mini \
+  --osworld-root "$OSWORLD_ROOT" \
+  --manager-provider anthropic --manager-model claude-opus-4-6 \
+  --cua-provider openai \
+  --max-parallelism 4 \
+  -- --headless --provider_name qemu \
+     --model gpt-5.4-mini --max_steps 100 --sleep_after_execution 3.0
+```
+
+For local Qwen CUA workers, keep the vLLM server from the
+[Quick Start](#quick-start) running, export
+`OPENAI_BASE_URL=http://127.0.0.1:8000/v1`, and switch to `--cua-provider qwen`
+plus `--model Qwen/Qwen3.6-27B` (the `--result-dir` and everything else are the
+same). Use `--task-id <id>` to run a single MyPCBench task; start with one, since
+the full 184-task set is a long, API-cost-heavy run. MyPCBench scores are produced
+offline by the rubric judge (see [Evaluation](#evaluation)), not at the end of the run.
+
 ### Output Layout
 
 Each task writes a directory under the selected result root:
@@ -183,6 +258,14 @@ We have implemented several models to use as CUA subagents (more to come!):
 | `openai` | GPT-5.4 CUA | `gpt-5.4-mini` or `gpt-5.4` | Uses `OPENAI_API_KEY` |
 | `qwen` | Qwen CUA via OpenAI-compatible vLLM | `Qwen/Qwen3.6-27B` | Uses `OPENAI_BASE_URL`; `OPENAI_API_KEY` can be a dummy value for local vLLM |
 
+The Qwen CUA agent is computer-only by default. Pass `--cua_bash` (or export
+`MACU_CUA_BASH=1`) to also give it a **bash** tool — computer + bash, matching
+MyPCBench's `qwen_cuabash` main agent. The model invokes it as
+`<tool_call><function=bash><parameter=command>…</parameter></function></tool_call>`;
+the command runs in the guest via the Control API, and its stdout/stderr is fed
+back on the next turn. Without the flag, the agent remains the `qwen_cua`
+(computer-only) ablation.
+
 Manager providers are selected with `--manager-provider {anthropic,openai,google,huggingface}` and `--manager-model`.
 
 ## Evaluation
@@ -209,6 +292,24 @@ Odysseys-style runs can be judged per rubric:
   --max-concurrent-rubrics 4
 ```
 
+MyPCBench runs are judged offline against their weighted rubrics. The judge is
+the same Gemini full-trajectory per-rubric judge MyPCBench uses (it was ported
+from Odysseys), and reports MyPCBench's **Perfect %** and **Rubric %** metrics,
+so a MACU run is directly comparable to a non-MACU single-agent MyPCBench run:
+
+```bash
+.venv/bin/python evals/mypcbench_eval.py \
+  --runs-dir runs/mypcbench_macu_qwen \
+  --task-source-json data/mypcbench/mypcbench_tasks.json \
+  --model gemini-3.1-flash-lite-preview \
+  --num-workers 8 \
+  --max-concurrent-rubrics 4
+```
+
+To compare MACU's multi-agent loop against the single-agent baseline, run the
+same CUA model both ways (MACU here, single-agent in the MyPCBench repo) and
+judge both — the metrics line up. See `evals/README.md` for details.
+
 ### Benchmark Results
 
 We achieved the following results, as described in more detail in our [paper](https://arxiv.org/abs/2606.01533):
@@ -329,4 +430,4 @@ Useful flags:
 
 ## Acknowledgements
 
-Our CUA subagent implementations were based off the official [OSWorld](https://github.com/xlang-ai/OSWorld/tree/main) implementations. Our evaluation scripts were directly lifted from the official [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web), [WebTailBench](https://github.com/microsoft/fara/blob/main/webeval/src/webeval/benchmarks/webtailbench/webtailbench.py) and [Odysseys](https://github.com/ljang0/Odysseys/) repositories.
+Our CUA subagent implementations were based on the official [OSWorld](https://github.com/xlang-ai/OSWorld/tree/main) implementations. Our evaluation scripts were directly lifted from the official [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web), [WebTailBench](https://github.com/microsoft/fara/blob/main/webeval/src/webeval/benchmarks/webtailbench/webtailbench.py) and [Odysseys](https://github.com/ljang0/Odysseys/) repositories. MyPCBench support uses the [MyPCBench](https://github.com/ljang0/MyPCBench) VM image and tasks; its rubric judge shares the Odysseys full-trajectory per-rubric implementation.
diff --git a/evals/README.md b/evals/README.md
@@ -223,3 +223,64 @@ tokens but show zero cost.
 - **Unexpected low scores from missing end state** — increase
   `--max-steps` or `--max-images` if the decisive screenshot/action is
   being filtered out.
+
+## mypcbench_eval.py
+
+MyPCBench rubric evaluator. MyPCBench's grader was ported from Odysseys and is
+the same full-trajectory, per-rubric multimodal judge, so this script
+reuses `odysseys_eval.py`'s judging core and changes only the rubric source and
+the aggregation. Rubrics come from MyPCBench's `grading.rubrics` (per-rubric
+`weight`s that sum to 1.0), and scoring is **weighted**, reproducing MyPCBench's
+two headline metrics:
+
+- **Perfect %** (`perfect_rate`) — fraction of tasks where every rubric passed
+- **Rubric %** (`avg_score`) — weighted mean rubric score over all tasks
+  (judge-errored tasks count as 0)
+
+Because the judge model, prompt, and metrics match
+`MyPCBench/agent-harness/judge_results.py`, a MACU run and a non-MACU
+(single-agent) MyPCBench run are directly comparable.
+
+### What it reads
+
+- `--runs-dir` — a single MACU task run dir, or a parent dir of `<task_id>/`
+  run dirs (each with `final_traj/traj.jsonl` or `traj.jsonl`, plus screenshots).
+  Completion is gated on `final_results.json` or a numeric `result.txt`
+  (override with `--include-incomplete`).
+- `--task-source-json` *(required)* — the converted MyPCBench task file from
+  `scripts/convert_mypcbench_tasks.py` (carries `grading.rubrics`).
+
+### What it writes
+
+- `<runs-dir>/mypcbench_scores.json` by default (or `--output`): `summary`
+  (the metrics above + token/cost totals), `rows` (per-task `score`/`perfect`/
+  `status`), and `tasks` (full per-rubric judge detail).
+
+### Basic usage
+
+```bash
+set -a && source .env && set +a    # GEMINI_API_KEY for the paper judge config
+.venv/bin/python evals/mypcbench_eval.py \
+    --runs-dir runs/mypcbench_macu \
+    --task-source-json data/mypcbench/mypcbench_tasks.json \
+    --model gemini-3.1-flash-lite-preview \
+    --num-workers 8 \
+    --max-concurrent-rubrics 4
+```
+
+### Comparing MACU vs. single-agent MyPCBench
+
+Run the same model both ways and judge both with this script (or with
+MyPCBench's own `judge_results.py` — the metrics line up):
+
+```bash
+# MACU multi-agent (this repo)
+python evals/mypcbench_eval.py --runs-dir runs/mypcbench_macu_qwen \
+    --task-source-json data/mypcbench/mypcbench_tasks.json
+# Single-agent baseline (MyPCBench repo); then compare the two score files.
+python ../MyPCBench/agent-harness/judge_results.py --result_dir ../MyPCBench/results/qwen
+```
+
+Requirements and key flags match `odysseys_eval.py` (same judge backend selection
+via `--model`, `--gemini-api-key` / `--api-key`, `--max-images`, `--max-steps`,
+`--num-workers`, `--max-concurrent-rubrics`).
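
The weighted aggregation described in the `mypcbench_eval.py` section can be sketched in a few lines of Python. This is an illustrative sketch, not the script's actual code: `RubricResult`, `task_score`, and `summarize` are hypothetical names, and it assumes that each rubric verdict is a binary pass/fail and that a judge-errored task is represented as `None`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RubricResult:
    """Hypothetical container for one judged rubric (not the script's real type)."""
    weight: float  # per-rubric weight; the weights of one task sum to 1.0
    passed: bool   # assumed binary judge verdict


def task_score(rubrics: List[RubricResult]) -> float:
    """Weighted share of passed rubrics for a single task, in [0.0, 1.0]."""
    return sum(r.weight for r in rubrics if r.passed)


def summarize(tasks: Dict[str, Optional[List[RubricResult]]]) -> Dict[str, float]:
    """Compute Perfect % (perfect_rate) and Rubric % (avg_score) as fractions.

    A value of None marks a judge-errored task; it counts as score 0 and is
    never perfect, mirroring the "judge-errored tasks count as 0" rule.
    """
    if not tasks:
        return {"perfect_rate": 0.0, "avg_score": 0.0}

    total_score = 0.0
    perfect = 0
    for rubrics in tasks.values():
        if rubrics is None:
            continue  # contributes 0 to both metrics
        total_score += task_score(rubrics)
        if rubrics and all(r.passed for r in rubrics):
            perfect += 1

    n = len(tasks)  # errored tasks stay in the denominator
    return {"perfect_rate": perfect / n, "avg_score": total_score / n}


if __name__ == "__main__":
    demo = {
        "task_a": [RubricResult(0.5, True), RubricResult(0.5, True)],
        "task_b": [RubricResult(0.25, True), RubricResult(0.75, False)],
        "task_c": None,  # judge errored
    }
    print(summarize(demo))
```

In the demo, one of three tasks is perfect, so `perfect_rate` is 1/3, and `avg_score` is (1.0 + 0.25 + 0) / 3. Keeping errored tasks in the denominator is what makes a run with judge failures score lower rather than silently shrinking the task set.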