[env][examples] Add tau-bench (retail) environment + eval #1853
Draft
xyuzh wants to merge 1 commit into
Adds a `tau_bench` SkyRL-Gym environment for the retail domain of tau-bench (sierra-research/tau-bench), plus an eval-only recipe to baseline a policy on it.

- skyrl-gym: `TauBenchEnv`, a multi-turn, tool-calling agent driven against an LLM user simulator served over an OpenAI-compatible endpoint. Reward is the upstream retail reward (final DB-state match + required outputs). The retail domain (tools, data, tasks, wiki, reward) is vendored under `tau_core/`, with the litellm user simulator replaced by an injectable one and the pydantic types converted to dataclasses (no new dependency).
- Registered as env_class `tau_bench`; `TauBenchEnvConfig` added to `SkyRLGymConfig`.
- examples/train/tau_bench: dataset builder, eval launch script, and Anyscale job using the eval-only entrypoint (`main_generate`) with async inference.
- Tests: `skyrl-gym/tests/test_tau_bench.py` (gold trajectory -> reward 1.0).
Summary
Adds a `tau_bench` SkyRL-Gym environment for the retail domain of tau-bench, plus an eval-only recipe to baseline a policy model on it.

The environment is multi-turn and tool-using: each turn the agent either calls a retail tool or sends a message to a simulated user (an LLM served over an OpenAI-compatible endpoint, separate from the policy engines). Reward is the upstream retail reward: the final database state must match the state produced by the gold trajectory, and all required outputs must have been communicated. A run therefore reports `pass@1` over the 115-task retail test split.
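The two reward conditions and the `pass@1` aggregate can be sketched as follows. This is a minimal illustration, not the vendored reward code: `EpisodeResult`, its fields, and the case-insensitive substring check for required outputs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-task outcome; field names are illustrative only."""
    db_state_matches_gold: bool   # final DB equals the DB after the gold actions
    required_outputs: list[str]   # strings the agent must communicate to the user
    agent_messages: list[str]     # what the agent actually said to the user


def retail_reward(result: EpisodeResult) -> float:
    """Return 1.0 only if both conditions hold, otherwise 0.0."""
    said = " ".join(result.agent_messages).lower()
    # Assumption: an output counts as communicated if it appears verbatim
    # (case-insensitively) in any agent message.
    outputs_ok = all(out.lower() in said for out in result.required_outputs)
    return 1.0 if (result.db_state_matches_gold and outputs_ok) else 0.0


def pass_at_1(rewards: list[float]) -> float:
    """Fraction of tasks solved, with exactly one attempt per task."""
    if not rewards:
        raise ValueError("no rewards to average")
    return sum(1 for r in rewards if r == 1.0) / len(rewards)
```

Because the reward in this sketch is all-or-nothing per task, averaging it over the split and counting solved tasks give the same number.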
What's included
- `skyrl-gym/skyrl_gym/envs/tau_bench/`: `TauBenchEnv` (the SkyRL-Gym wrapper) and `user_simulator.py` (HTTP + scripted user sims). The retail domain (tools, DB, tasks, wiki, reward) is vendored under `tau_core/`, with two changes vs upstream: the litellm user simulator is replaced by an injectable one, and the pydantic types are converted to dataclasses (no new dependency). See `tau_core/README.md` for attribution.
- Registered as `env_class="tau_bench"`; `TauBenchEnvConfig` added to `SkyRLGymConfig`.
- `examples/train/tau_bench/`: dataset builder, eval launch script, and an Anyscale job using the eval-only entrypoint (`main_generate`) with async inference. Single 8×H100 node: user-sim on 2 GPUs, policy engines on the rest.
- Tests: `skyrl-gym/tests/test_tau_bench.py` (replaying a task's gold actions yields reward 1.0; an empty trajectory yields 0.0; tool-call parsing; max-turns).
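To illustrate what "injectable" means for the user simulator, here is a rough sketch of a shared interface with a scripted implementation of the kind a test could pass in. The names `UserSimulator`, `ScriptedUser`, `reset`, `step`, and `collect_user_turns` are hypothetical and are not taken from `user_simulator.py`; only the `###STOP###` token comes from this PR.

```python
from typing import Protocol

STOP_TOKEN = "###STOP###"


class UserSimulator(Protocol):
    """Hypothetical interface: anything with these two methods can be injected."""

    def reset(self, instruction: str) -> str: ...
    def step(self, agent_message: str) -> str: ...


class ScriptedUser:
    """Replays canned replies, then emits the stop token (no LLM, no network)."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = list(replies)
        self._cursor = 0

    def reset(self, instruction: str) -> str:
        self._cursor = 0
        return self._next()

    def step(self, agent_message: str) -> str:
        return self._next()

    def _next(self) -> str:
        if self._cursor >= len(self._replies):
            return STOP_TOKEN
        reply = self._replies[self._cursor]
        self._cursor += 1
        return reply


def collect_user_turns(
    user: UserSimulator, instruction: str, agent_messages: list[str]
) -> list[str]:
    """Drive any simulator with fixed agent messages until it stops."""
    turns = [user.reset(instruction)]
    for message in agent_messages:
        if STOP_TOKEN in turns[-1]:
            break
        turns.append(user.step(message))
    return turns
```

An HTTP-backed simulator would implement the same two methods by calling the OpenAI-compatible endpoint, so the environment code would not need to know which one it was given.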
Action protocol
The SkyRL rollout is tag-based, so the protocol is defined in the system prompt: each turn is either a tool call, `<tool_call>{"name": "<tool>", "arguments": {<json>}}</tool_call>`, or a plain-text message to the user. Episodes end on the user's `###STOP###`, a terminating tool (`transfer_to_human_agents`), or `max_turns`.

Test plan
- `uv run --isolated --extra dev pytest skyrl-gym/tests/test_tau_bench.py`: 6 passed.
- [...] errors and a sensible `pass@1`.

Notes
- [...] at rollout time; `tau_core/README.md` documents the source and how to refresh.
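For reviewers, the action protocol described under "Action protocol" can be sketched as a small parser plus a termination check. This is an illustration under assumptions, not the code in `TauBenchEnv`: the function names, the returned dict shape, and the choice to treat a malformed tool call as a plain message are all hypothetical.

```python
import json
import re
from typing import Optional

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
STOP_TOKEN = "###STOP###"
TERMINATING_TOOLS = {"transfer_to_human_agents"}


def parse_action(text: str) -> dict:
    """Classify one agent turn as a tool call or a message to the user."""
    match = TOOL_CALL_RE.search(text)
    if match:
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            args = payload.get("arguments", {})
            if isinstance(args, dict):
                return {"type": "tool", "name": payload["name"], "arguments": args}
    # Assumption: anything that is not a well-formed tool call is sent
    # to the user as plain text.
    return {"type": "message", "content": text.strip()}


def is_done(action: dict, user_reply: Optional[str], turn: int, max_turns: int) -> bool:
    """Check the three stop conditions; `turn` counts completed turns from 1."""
    if action["type"] == "tool" and action["name"] in TERMINATING_TOOLS:
        return True
    if user_reply is not None and STOP_TOKEN in user_reply:
        return True
    return turn >= max_turns
```

Falling back to a plain message on malformed JSON keeps the episode running instead of raising. Whether the actual environment does this, or instead returns an error observation to the agent, is not stated in this PR.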