[env][examples] Add tau-bench (retail) environment + eval by xyuzh · Pull Request #1853 · NovaSky-AI/SkyRL

xyuzh · 2026-06-30T23:15:17Z

Summary

Adds a tau_bench SkyRL-Gym environment for the retail domain of tau-bench, plus
an eval-only recipe to baseline a policy model on it.

The environment is multi-turn and tool-using: each turn the agent either calls a
retail tool or sends a message to a simulated user (an LLM served over an
OpenAI-compatible endpoint, separate from the policy engines). Reward is the
upstream retail reward — the final database state must match the gold trajectory's
and all required outputs must have been communicated — so a run reports pass@1
over the 115-task retail test split.
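The reward rule described above can be sketched as follows. This is a minimal illustration, not the vendored tau_core code: the function names (db_hash, retail_reward, pass_at_1), the use of SHA-256 over canonical JSON, and the case-insensitive substring check for required outputs are all assumptions made for the example.

```python
import hashlib
import json
from typing import Any, Sequence


def db_hash(db: dict[str, Any]) -> str:
    """Hash a JSON-serializable DB snapshot independently of key order (assumed scheme)."""
    canonical = json.dumps(db, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def retail_reward(
    final_db: dict[str, Any],
    gold_db: dict[str, Any],
    required_outputs: Sequence[str],
    agent_messages: Sequence[str],
) -> float:
    """Return 1.0 only if the DB state matches and every required output was communicated."""
    if db_hash(final_db) != db_hash(gold_db):
        return 0.0
    for output in required_outputs:
        # Assumption: an output counts as communicated if any single agent
        # message contains it, ignoring case.
        if not any(output.lower() in message.lower() for message in agent_messages):
            return 0.0
    return 1.0


def pass_at_1(rewards: Sequence[float]) -> float:
    """Fraction of tasks solved when each task is attempted once."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    return sum(1 for reward in rewards if reward == 1.0) / len(rewards)
```

With one rollout per task, pass@1 over the retail test split is simply the mean of these binary rewards across its 115 tasks.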

What's included

- skyrl-gym/skyrl_gym/envs/tau_bench/: TauBenchEnv (the SkyRL-Gym wrapper)
  and user_simulator.py (HTTP and scripted user simulators). The retail domain
  (tools, DB, tasks, wiki, reward) is vendored under tau_core/, with two changes
  vs. upstream: the litellm user simulator is replaced by an injectable one, and
  the pydantic types are converted to dataclasses (no new dependency). See
  tau_core/README.md for attribution.
- Registered as env_class="tau_bench"; TauBenchEnvConfig added to SkyRLGymConfig.
- examples/train/tau_bench/: dataset builder, eval launch script, and an
  Anyscale job using the eval-only entrypoint (main_generate) with async
  inference. Runs on a single 8×H100 node: user-sim on 2 GPUs, policy engines on
  the rest.
- Tests: skyrl-gym/tests/test_tau_bench.py (replaying a task's gold actions
  yields reward 1.0; an empty trajectory yields 0.0; tool-call parsing; max-turns).
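To illustrate what an injectable user simulator could look like, here is a rough sketch of a shared interface with a scripted implementation. The names UserSimulator and ScriptedUserSimulator, and the reset/step method signatures, are hypothetical and are not taken from user_simulator.py; only the ###STOP### marker comes from this PR.

```python
from typing import Protocol

STOP_TOKEN = "###STOP###"


class UserSimulator(Protocol):
    """Hypothetical interface the environment would depend on."""

    def reset(self, instruction: str) -> str:
        """Start a conversation and return the user's opening message."""
        ...

    def step(self, agent_message: str) -> str:
        """Return the user's reply to the agent's latest message."""
        ...


class ScriptedUserSimulator:
    """Replays canned replies; useful for deterministic tests without an LLM."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = list(replies)
        self._cursor = 0

    def reset(self, instruction: str) -> str:
        # The instruction is ignored here; an LLM-backed simulator would use it.
        self._cursor = 0
        return self._next()

    def step(self, agent_message: str) -> str:
        return self._next()

    def _next(self) -> str:
        # Once the script is exhausted, signal the end of the episode.
        if self._cursor >= len(self._replies):
            return STOP_TOKEN
        reply = self._replies[self._cursor]
        self._cursor += 1
        return reply
```

Because the environment would only rely on the interface, an HTTP-backed simulator that calls an OpenAI-compatible endpoint could be swapped in at construction time without changing the rollout loop.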

Action protocol

The SkyRL rollout is tag-based, so the protocol is defined in the system prompt:
each turn is either a tool call —
<tool_call>{"name": "<tool>", "arguments": {<json>}}</tool_call> — or a plain-text
message to the user. Episodes end on user ###STOP###, a terminating tool
(transfer_to_human_agents), or max_turns.
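A parser for this protocol might look like the sketch below. It is an illustration under stated assumptions, not the TauBenchEnv implementation: the Action dataclass, parse_action, and is_done are hypothetical names, and treating a malformed tool call as a plain-text message is an assumed fallback that the PR does not specify.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any, Optional

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
STOP_TOKEN = "###STOP###"
TERMINATING_TOOLS = {"transfer_to_human_agents"}


@dataclass
class Action:
    kind: str  # "tool" or "message"
    name: str = ""
    arguments: dict[str, Any] = field(default_factory=dict)
    content: str = ""


def parse_action(text: str) -> Action:
    """Parse one model turn into a tool call or a message to the user."""
    match = TOOL_CALL_RE.search(text)
    if match:
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            arguments = payload.get("arguments", {})
            if isinstance(arguments, dict):
                return Action("tool", name=payload["name"], arguments=arguments)
    # Assumed fallback: anything that is not a well-formed tool call is a message.
    return Action("message", content=text.strip())


def is_done(action: Action, user_reply: Optional[str], turn: int, max_turns: int) -> bool:
    """Apply the three stop conditions: terminating tool, user stop marker, turn limit."""
    if action.kind == "tool" and action.name in TERMINATING_TOOLS:
        return True
    if user_reply is not None and STOP_TOKEN in user_reply:
        return True
    return turn >= max_turns
```

For example, parse_action('<tool_call>{"name": "get_order_details", "arguments": {"order_id": "#W0000000"}}</tool_call>') yields a tool action, where the order ID is made up for illustration, while any untagged text yields a message action.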

Test plan

- uv run --isolated --extra dev pytest skyrl-gym/tests/test_tau_bench.py: 6 passed.
- Validated end-to-end on Anyscale: all 115 retail test tasks complete with 0 tool
  errors and a sensible pass@1.

Notes

The retail DB JSON + task sets (~2.2 MB) are vendored so the env is self-contained
at rollout time; tau_core/README.md documents the source and how to refresh.

Adds a `tau_bench` SkyRL-Gym environment for the retail domain of tau-bench (sierra-research/tau-bench), plus an eval-only recipe to baseline a policy on it.

- skyrl-gym: `TauBenchEnv`, a multi-turn, tool-calling agent driven against an LLM user simulator served over an OpenAI-compatible endpoint. Reward is the upstream retail reward (final DB-state match + required outputs). The retail domain (tools, data, tasks, wiki, reward) is vendored under `tau_core/`, with the litellm user simulator replaced by an injectable one and the pydantic types converted to dataclasses (no new dependency).
- Registered as env_class `tau_bench`; `TauBenchEnvConfig` added to `SkyRLGymConfig`.
- examples/train/tau_bench: dataset builder, eval launch script, and Anyscale job using the eval-only entrypoint (`main_generate`) with async inference.
- Tests: `skyrl-gym/tests/test_tau_bench.py` (gold trajectory -> reward 1.0).

xyuzh force-pushed the xinyu/taubench-eval branch from a5b2816 to d93130f on June 30, 2026 23:16

xyuzh wants to merge 1 commit into NovaSky-AI:main from xyuzh:xinyu/taubench-eval
