[env][examples] Add tau-bench (retail) environment + eval #1853

Draft
xyuzh wants to merge 1 commit into NovaSky-AI:main from xyuzh:xinyu/taubench-eval

@xyuzh xyuzh commented Jun 30, 2026

Summary

Adds a tau_bench SkyRL-Gym environment for the retail domain of
tau-bench, plus an eval-only recipe
to baseline a policy model on it.

The environment is multi-turn and tool-using: each turn the agent either calls a
retail tool or sends a message to a simulated user (an LLM served over an
OpenAI-compatible endpoint, separate from the policy engines). Reward is the
upstream retail reward — the final database state must match the gold trajectory's
and all required outputs must have been communicated — so a run reports pass@1
over the 115-task retail test split.
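
As a rough illustration of the reward described above, the sketch below assumes a binary per-task reward that is 1.0 only when both conditions hold, and averages it over tasks to obtain pass@1. The function names, the plain equality check on the database, and the case-insensitive substring check for outputs are assumptions for illustration, not the vendored tau_core code.

```python
from typing import Any, Iterable, Mapping, Sequence


def retail_reward(
    final_db: Mapping[str, Any],
    gold_db: Mapping[str, Any],
    required_outputs: Sequence[str],
    agent_messages: Sequence[str],
) -> float:
    """Return 1.0 only if the DB matches AND every required output was said.

    Hypothetical helper: the real reward lives in the vendored retail code.
    """
    # Condition 1: final database state equals the gold trajectory's state.
    db_ok = dict(final_db) == dict(gold_db)

    # Condition 2: each required output appears in some agent message.
    # (Assumption: case-insensitive substring match.)
    said = " ".join(agent_messages).lower()
    outputs_ok = all(out.lower() in said for out in required_outputs)

    return 1.0 if (db_ok and outputs_ok) else 0.0


def pass_at_1(rewards: Iterable[float]) -> float:
    """Mean of per-task binary rewards, with one rollout per task."""
    values = list(rewards)
    if not values:
        raise ValueError("need at least one task reward")
    return sum(values) / len(values)
```

Under these assumptions, a rollout that reaches the right database state but never tells the user a required value still scores 0.0, which is why pass@1 can be lower than the fraction of tasks with a correct final state.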

What's included

  • skyrl-gym/skyrl_gym/envs/tau_bench/: TauBenchEnv (the SkyRL-Gym wrapper)
    and user_simulator.py (HTTP + scripted user sims). The retail domain (tools,
    DB, tasks, wiki, reward) is vendored under tau_core/, with two changes vs
    upstream: the litellm user simulator is replaced by an injectable one, and the
    pydantic types are converted to dataclasses (no new dependency). See
    tau_core/README.md for attribution.
  • Registered as env_class="tau_bench"; TauBenchEnvConfig added to SkyRLGymConfig.
  • examples/train/tau_bench/ — dataset builder, eval launch script, and an
    Anyscale job using the eval-only entrypoint (main_generate) with async
    inference. Single 8×H100 node: user-sim on 2 GPUs, policy engines on the rest.
  • Tests: skyrl-gym/tests/test_tau_bench.py (replaying a task's gold actions
    yields reward 1.0; an empty trajectory yields 0.0; tool-call parsing; max-turns).

Action protocol

The SkyRL rollout is tag-based, so the protocol is defined in the system prompt:
each turn is either a tool call —
<tool_call>{"name": "<tool>", "arguments": {<json>}}</tool_call> — or a plain-text
message to the user. Episodes end on user ###STOP###, a terminating tool
(transfer_to_human_agents), or max_turns.
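
The protocol above can be sketched as a small parser plus a termination check. This is a minimal illustration under stated assumptions, not the code in TauBenchEnv: parse_action, episode_done, the returned dictionary shape, and the handling of malformed calls are all hypothetical. Only the tag format, the ###STOP### marker, transfer_to_human_agents, and max_turns come from the description.

```python
import json
import re
from typing import Any, Dict, Optional

STOP_TOKEN = "###STOP###"
TERMINATING_TOOLS = {"transfer_to_human_agents"}
_TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_action(text: str) -> Dict[str, Any]:
    """Classify one model turn as a tool call, a user message, or invalid."""
    match = _TOOL_CALL_RE.search(text)
    if match is None:
        # No tag present: treat the whole turn as a message to the user.
        return {"type": "message", "content": text.strip()}
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return {"type": "invalid", "error": "tool call body is not valid JSON"}
    if not isinstance(payload, dict):
        return {"type": "invalid", "error": "tool call body must be an object"}
    name = payload.get("name")
    arguments = payload.get("arguments", {})
    if not isinstance(name, str) or not isinstance(arguments, dict):
        return {"type": "invalid", "error": "bad 'name' or 'arguments' field"}
    return {"type": "tool", "name": name, "arguments": arguments}


def episode_done(
    action: Dict[str, Any],
    user_reply: Optional[str],
    turn: int,
    max_turns: int,
) -> bool:
    """Apply the three stop conditions: terminating tool, user stop, turn cap."""
    if action["type"] == "tool" and action["name"] in TERMINATING_TOOLS:
        return True
    if user_reply is not None and STOP_TOKEN in user_reply:
        return True
    return turn >= max_turns
```

Returning an explicit "invalid" action rather than raising is one possible design: it lets the environment feed the error back to the policy as an observation instead of aborting the rollout.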

Test plan

  • uv run --isolated --extra dev pytest skyrl-gym/tests/test_tau_bench.py — 6 passed.
  • Validated end-to-end on Anyscale: all 115 retail test tasks complete with 0 tool
    errors and a sensible pass@1.

Notes

  • The retail DB JSON + task sets (~2.2 MB) are vendored so the env is self-contained
    at rollout time; tau_core/README.md documents the source and how to refresh.

Adds a `tau_bench` SkyRL-Gym environment for the retail domain of tau-bench
(sierra-research/tau-bench), plus an eval-only recipe to baseline a policy on it.

- skyrl-gym: `TauBenchEnv` — multi-turn, tool-calling agent driven against an LLM
  user simulator served over an OpenAI-compatible endpoint. Reward is the upstream
  retail reward (final DB-state match + required outputs). The retail domain
  (tools, data, tasks, wiki, reward) is vendored under `tau_core/`, with the
  litellm user simulator replaced by an injectable one and the pydantic types
  converted to dataclasses (no new dependency).
- Registered as env_class `tau_bench`; `TauBenchEnvConfig` added to `SkyRLGymConfig`.
- examples/train/tau_bench: dataset builder, eval launch script, and Anyscale job
  using the eval-only entrypoint (`main_generate`) with async inference.
- Tests: `skyrl-gym/tests/test_tau_bench.py` (gold trajectory -> reward 1.0).

@xyuzh force-pushed the xinyu/taubench-eval branch from a5b2816 to d93130f on June 30, 2026 23:16