6 changes: 6 additions & 0 deletions .gitignore
@@ -15,4 +15,10 @@ data/
.vmware_lck
.vmware_vms

# Local QEMU/apptainer provider runtime state (overlays, instance dirs, SIFs).
# Anchored to repo root so the provider source dir
# (osworld/desktop_env/providers/apptainer/) is NOT ignored.
/apptainer/
/osworld_vms/

.git.old
103 changes: 102 additions & 1 deletion README.md
@@ -74,6 +74,33 @@ curl -L \
https://raw.githubusercontent.com/ljang0/Odysseys/main/data/odysseys.json
```

### MyPCBench Data and VM Image

[MyPCBench](https://mypcbench.com) is a personal-assistant CUA benchmark: 184
rubric-graded tasks on a seeded QEMU/KVM Ubuntu VM (persona Michael Scott) whose
in-VM Control API is OSWorld-compatible. MACU runs it as a generic CUA benchmark
under the **apptainer (QEMU)** provider, so each parallel CUA subagent gets its
own CoW-overlay clone of the VM state.

Clone MyPCBench and fetch its qcow2 image (no Docker/root required):

```bash
git clone https://github.com/ljang0/MyPCBench ../MyPCBench
bash ../MyPCBench/scripts/get-eval-image.sh --out ../MyPCBench/mypcbench-vm
# Point MACU's apptainer overlay cloning at the MyPCBench image:
export MACU_APPTAINER_BASE_IMAGE="$(realpath ../MyPCBench/mypcbench-vm/mypcbench.qcow2)"
```

Convert MyPCBench's task file into a MACU task list (also the rubric source for
offline judging):

```bash
mkdir -p data/mypcbench
python scripts/convert_mypcbench_tasks.py \
../MyPCBench/tasks/final/all_tasks_with_grading.json \
data/mypcbench/mypcbench_tasks.json
```
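After converting, it may be worth sanity-checking the rubric weights, since the evaluator docs describe per-rubric `weight`s that sum to 1.0. The sketch below is not part of the repo. It assumes the converted file is a list of task objects that each carry `grading.rubrics`, which is an assumption about the layout rather than something confirmed here.

```python
import math


def bad_weight_tasks(tasks: list[dict], tol: float = 1e-6) -> list[int]:
    """Return indices of tasks whose rubric weights do not sum to 1.0.

    Assumes (hypothetically) that each task looks like
    {"grading": {"rubrics": [{"weight": 0.4, ...}, ...]}}.
    """
    bad = []
    for index, task in enumerate(tasks):
        rubrics = task.get("grading", {}).get("rubrics", [])
        total = sum(rubric["weight"] for rubric in rubrics)
        if not math.isclose(total, 1.0, abs_tol=tol):
            bad.append(index)
    return bad
```

Loading `data/mypcbench/mypcbench_tasks.json` with `json.load` and passing the result to `bad_weight_tasks` should return an empty list if the assumed layout holds.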

### API Keys

Copy `.example_env` to a local `.env`, replace the dummy values, then export them before launching MACU:
@@ -134,6 +161,10 @@ To run Online-Mind2Web or Odysseys directly with `run_macu.py`, use the same com
| --- | --- | --- |
| Online-Mind2Web | `data/online_m2w/Online_Mind2Web.json` | `runs/online_m2w_macu` |
| Odysseys | `data/odysseys/odysseys.json` | `runs/odysseys_macu` |
| MyPCBench | `data/mypcbench/mypcbench_tasks.json` | `runs/mypcbench_macu` (see below) |

MyPCBench is run on its own VM image under the apptainer (QEMU) provider, so it
uses a slightly different command — see [Running MyPCBench](#running-mypcbench).

To use GPT-5.4-mini as the CUA subagent instead of vLLM-backed Qwen, switch the CUA provider and model:

@@ -151,6 +182,50 @@ python run_macu.py data/osworld/evaluation_examples/test_small.json \

Use `--task-id <id>` to run one task from a larger task file. OSWorld runs call the OSWorld evaluator at the end.

### Running MyPCBench

MyPCBench ships its own seeded VM image, so it runs under one of MACU's local
QEMU providers with `MACU_APPTAINER_BASE_IMAGE` pointing at the MyPCBench qcow2
(see [MyPCBench Data and VM Image](#mypcbench-data-and-vm-image) for the one-time
setup and task conversion). Pick the provider that matches your host:

- **`--provider_name qemu`** — boots the image with bare `qemu-system-x86_64` +
KVM, like MyPCBench's own harness. Needs only `/dev/kvm`, QEMU, and host OVMF
firmware (the `ovmf` package). Best on GPU/compute nodes where apptainer's
unprivileged user namespaces are unavailable.
- **`--provider_name apptainer`** — runs QEMU inside an apptainer SIF (no QEMU
on the host). Use when you already run OSWorld this way.

Both give each parallel CUA subagent its own CoW-overlay clone of the VM. Export
`MACU_MYPCBENCH=1` so each worker runs MyPCBench's per-boot prep (DNS pin,
chat-app LLM-key injection, and lazy app-DB seeding) against the freshly cloned
VM before the agent acts. Requires ~8–16 GB RAM per parallel VM.
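A quick preflight for the `qemu` provider's host requirements can catch a missing dependency before a failed boot. The helper below is a hypothetical sketch, not a MACU utility. It checks only `/dev/kvm` and the `qemu-system-x86_64` binary, and skips OVMF because its install path varies by distribution.

```python
import os
import shutil
from typing import Callable, Optional


def missing_qemu_prereqs(
    exists: Callable[[str], bool] = os.path.exists,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """List host requirements of the `qemu` provider that look unmet.

    `exists` and `which` are injectable so the check can be tested
    without touching the real host.
    """
    missing = []
    if not exists("/dev/kvm"):
        missing.append("/dev/kvm")
    if which("qemu-system-x86_64") is None:
        missing.append("qemu-system-x86_64")
    return missing
```

An empty list means both checked prerequisites were found. A non-empty list suggests trying `--provider_name apptainer` or installing what is missing.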

```bash
source .venv/bin/activate
set -a && source .env && set +a
export MACU_APPTAINER_BASE_IMAGE="$(realpath ../MyPCBench/mypcbench-vm/mypcbench.qcow2)"
export MACU_MYPCBENCH=1

# GPT-5.4-mini CUA subagents on bare KVM (--provider_name apptainer also works)
python run_macu.py data/mypcbench/mypcbench_tasks.json \
--result-dir runs/mypcbench_macu_gpt54mini \
--osworld-root "$OSWORLD_ROOT" \
--manager-provider anthropic --manager-model claude-opus-4-6 \
--cua-provider openai \
--max-parallelism 4 \
-- --headless --provider_name qemu \
--model gpt-5.4-mini --max_steps 100 --sleep_after_execution 3.0
```
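The `--max-parallelism` value in the command above is bounded by host memory, given the ~8–16 GB needed per parallel VM. A rough way to size it is sketched below. The function and the `reserve_gb` headroom for the host are illustrative assumptions, not MACU settings.

```python
def max_parallel_vms(
    host_ram_gb: float,
    per_vm_gb: float = 16.0,   # upper end of the ~8-16 GB per-VM estimate
    reserve_gb: float = 8.0,   # hypothetical headroom kept for the host itself
) -> int:
    """Estimate how many VMs fit in host RAM after reserving headroom."""
    usable = max(host_ram_gb - reserve_gb, 0.0)
    return int(usable // per_vm_gb)
```

For example, a 128 GB host with the conservative 16 GB per-VM figure leaves room for 7 VMs under these assumptions, so `--max-parallelism 4` fits comfortably.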

For local Qwen CUA workers, keep the vLLM server from the
[Quick Start](#quick-start) running, export
`OPENAI_BASE_URL=http://127.0.0.1:8000/v1`, and switch to `--cua-provider qwen`
plus `--model Qwen/Qwen3.6-27B` (the `--result-dir` and everything else are the
same). Use `--task-id <id>` to run a single MyPCBench task (start with one — the
full 184-task set is a long, API-cost-heavy run). MyPCBench scores are produced
offline by the rubric judge (next section), not at the end of the run.

### Output Layout

Each task writes a directory under the selected result root:
@@ -183,6 +258,14 @@ We have implemented several models to use as CUA subagents (more to come!):
| `openai` | GPT-5.4 CUA | `gpt-5.4-mini` or `gpt-5.4` | Uses `OPENAI_API_KEY` |
| `qwen` | Qwen CUA via OpenAI-compatible vLLM | `Qwen/Qwen3.6-27B` | Uses `OPENAI_BASE_URL`; `OPENAI_API_KEY` can be a dummy value for local vLLM |

The Qwen CUA agent is computer-only by default. Pass `--cua_bash` (or export
`MACU_CUA_BASH=1`) to also give it a **bash** tool. The resulting computer +
bash agent matches MyPCBench's `qwen_cuabash` main agent. The model invokes the
tool as
`<tool_call><function=bash><parameter=command>…</parameter></function></tool_call>`;
the command runs in the guest via the Control API, and its stdout and stderr are
fed back on the next turn. Without the flag, the agent remains the `qwen_cua`
(computer-only) ablation.
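To make the tool-call format concrete, here is a minimal sketch of pulling the command out of a model response. It is an illustration only and is not MACU's actual parser. The regex and the function name are assumptions.

```python
import re
from typing import Optional

# Matches the bash tool-call envelope described above; DOTALL lets the
# command span multiple lines.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=bash>\s*<parameter=command>(.*?)</parameter>"
    r"\s*</function>\s*</tool_call>",
    re.DOTALL,
)


def extract_bash_command(model_output: str) -> Optional[str]:
    """Return the first bash command in the output, or None if there is none."""
    match = TOOL_CALL_RE.search(model_output)
    return match.group(1).strip() if match else None
```

A response with no bash tool call yields `None`, which corresponds to a computer-only turn.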

Manager providers are selected with `--manager-provider {anthropic,openai,google,huggingface}` and `--manager-model`.

## Evaluation
@@ -209,6 +292,24 @@ Odysseys-style runs can be judged per rubric:
--max-concurrent-rubrics 4
```

MyPCBench runs are judged offline against their weighted rubrics. The judge is
the same Gemini full-trajectory, per-rubric judge that MyPCBench uses (ported
from Odysseys). It reports MyPCBench's **Perfect %** and **Rubric %** metrics,
so a MACU run is directly comparable to a non-MACU single-agent MyPCBench run.
Point `--runs-dir` at the `--result-dir` of the run being judged:

```bash
.venv/bin/python evals/mypcbench_eval.py \
--runs-dir runs/mypcbench_macu_qwen \
--task-source-json data/mypcbench/mypcbench_tasks.json \
--model gemini-3.1-flash-lite-preview \
--num-workers 8 \
--max-concurrent-rubrics 4
```

To compare MACU's multi-agent loop against the single-agent baseline, run the
same CUA model both ways (MACU here, single-agent in the MyPCBench repo) and
judge both — the metrics line up. See `evals/README.md` for details.

### Benchmark Results

We achieved the following results, as described in more detail in our [paper](https://arxiv.org/abs/2606.01533):
@@ -329,4 +430,4 @@ Useful flags:

## Acknowledgements

Our CUA subagent implementations were based on the official [OSWorld](https://github.com/xlang-ai/OSWorld/tree/main) implementations. Our evaluation scripts were directly lifted from the official [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web), [WebTailBench](https://github.com/microsoft/fara/blob/main/webeval/src/webeval/benchmarks/webtailbench/webtailbench.py) and [Odysseys](https://github.com/ljang0/Odysseys/) repositories. MyPCBench support uses the [MyPCBench](https://github.com/ljang0/MyPCBench) VM image and tasks; its rubric judge shares the Odysseys full-trajectory per-rubric implementation.
61 changes: 61 additions & 0 deletions evals/README.md
@@ -223,3 +223,64 @@ tokens but show zero cost.
- **Unexpected low scores from missing end state** — increase
`--max-steps` or `--max-images` if the decisive screenshot/action is
being filtered out.

## mypcbench_eval.py

MyPCBench rubric evaluator. MyPCBench's grader is the same full-trajectory,
per-rubric multimodal judge as Odysseys (it was ported from Odysseys), so this
script reuses the judging core of `odysseys_eval.py` and changes only the rubric
source and the aggregation. Rubrics come from MyPCBench's `grading.rubrics`
(per-rubric `weight`s that sum to 1.0), and scoring is **weighted**, reproducing
MyPCBench's two headline metrics:

- **Perfect %** (`perfect_rate`) — fraction of tasks where every rubric passed
- **Rubric %** (`avg_score`) — weighted mean rubric score over all tasks
(judge-errored tasks count as 0)
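The two metrics can be sketched as follows. This is an illustrative reading of the definitions above, not the code in `mypcbench_eval.py`. It assumes each judged rubric is a dict with a `weight` and a boolean `passed` field (hypothetical names), and that a judge-errored task is represented as `None`.

```python
from typing import Optional


def task_score(rubrics: list[dict]) -> float:
    """Weighted score for one task: sum of the weights of passed rubrics."""
    return sum(r["weight"] for r in rubrics if r["passed"])


def aggregate(tasks: dict[str, Optional[list[dict]]]) -> dict[str, float]:
    """Compute perfect_rate and avg_score over all tasks.

    `tasks` maps task_id -> judged rubrics, or None if the judge errored.
    """
    if not tasks:
        return {"perfect_rate": 0.0, "avg_score": 0.0}
    scores = []
    perfect = 0
    for rubrics in tasks.values():
        if rubrics is None:
            scores.append(0.0)  # judge-errored tasks count as 0
            continue
        scores.append(task_score(rubrics))
        perfect += all(r["passed"] for r in rubrics)
    n = len(tasks)
    return {"perfect_rate": perfect / n, "avg_score": sum(scores) / n}
```

Both metrics divide by the total number of tasks, so an errored task lowers both rates rather than being dropped from the denominator.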

Because the judge model, prompt, and metrics match
`MyPCBench/agent-harness/judge_results.py`, a MACU run and a non-MACU
(single-agent) MyPCBench run are directly comparable.

### What it reads

- `--runs-dir` — a single MACU task run dir, or a parent dir of `<task_id>/`
run dirs (each with `final_traj/traj.jsonl` or `traj.jsonl`, plus screenshots).
Completion is gated on `final_results.json` or a numeric `result.txt`
(override with `--include-incomplete`).
- `--task-source-json` *(required)* — the converted MyPCBench task file from
`scripts/convert_mypcbench_tasks.py` (carries `grading.rubrics`).
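The completion gate and trajectory lookup described above could look roughly like this. It is a sketch under the stated file names, not the evaluator's actual implementation, and both function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def is_complete(run_dir: Path) -> bool:
    """True if the run has final_results.json or a numeric result.txt."""
    if (run_dir / "final_results.json").is_file():
        return True
    result_txt = run_dir / "result.txt"
    if result_txt.is_file():
        try:
            float(result_txt.read_text().strip())
            return True
        except ValueError:
            return False
    return False


def find_trajectory(run_dir: Path) -> Optional[Path]:
    """Prefer final_traj/traj.jsonl, falling back to traj.jsonl."""
    for relative in ("final_traj/traj.jsonl", "traj.jsonl"):
        candidate = run_dir / relative
        if candidate.is_file():
            return candidate
    return None
```

Under this reading, a run dir with a trajectory but no completion marker would be skipped unless `--include-incomplete` is passed.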

### What it writes

- `<runs-dir>/mypcbench_scores.json` by default (or `--output`): `summary`
(the metrics above + token/cost totals), `rows` (per-task `score`/`perfect`/
`status`), and `tasks` (full per-rubric judge detail).
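A short sketch of reading the output file follows. It assumes `summary` holds `perfect_rate` and `avg_score` as fractions in [0, 1] and that `rows` is a list of per-task dicts. Both are assumptions about the exact layout and should be checked against a real `mypcbench_scores.json`.

```python
import json
from pathlib import Path


def load_scores(path: str) -> dict:
    """Load a scores file written by the evaluator."""
    return json.loads(Path(path).read_text())


def headline(scores: dict) -> str:
    """Format the two headline metrics as percentages."""
    summary = scores["summary"]
    return (
        f"Perfect % = {100 * summary['perfect_rate']:.1f}, "
        f"Rubric % = {100 * summary['avg_score']:.1f}"
    )


def imperfect_rows(scores: dict) -> list[dict]:
    """Rows for tasks that did not pass every rubric."""
    return [row for row in scores["rows"] if not row.get("perfect")]
```

For instance, `headline(load_scores("runs/mypcbench_macu/mypcbench_scores.json"))` would give a one-line summary suitable for comparing two runs side by side.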

### Basic usage

```bash
set -a && source .env && set +a # GEMINI_API_KEY for the paper judge config
.venv/bin/python evals/mypcbench_eval.py \
--runs-dir runs/mypcbench_macu \
--task-source-json data/mypcbench/mypcbench_tasks.json \
--model gemini-3.1-flash-lite-preview \
--num-workers 8 \
--max-concurrent-rubrics 4
```

### Comparing MACU vs. single-agent MyPCBench

Run the same model both ways and judge both with this script (or with
MyPCBench's own `judge_results.py` — the metrics line up):

```bash
# MACU multi-agent (this repo)
python evals/mypcbench_eval.py --runs-dir runs/mypcbench_macu_qwen \
--task-source-json data/mypcbench/mypcbench_tasks.json
# Single-agent baseline (MyPCBench repo), then compare the two scores.json
python ../MyPCBench/agent-harness/judge_results.py --result_dir ../MyPCBench/results/qwen
```

Requirements and key flags match `odysseys_eval.py` (same judge backend selection
via `--model`, `--gemini-api-key` / `--api-key`, `--max-images`, `--max-steps`,
`--num-workers`, `--max-concurrent-rubrics`).