Re-producing Tau-Bench Issue #29

@steventan0110

I am wondering whether an eval script for communicating with the served API is available. I am not able to reproduce the results from the paper. I served the model (8b version) with vLLM, but when I evaluate it, I run into the issues described below.

I tried 2 approaches:

  • In the first approach, I followed the official multi-turn script (https://github.com/SalesforceAIResearch/xLAM/blob/main/xLAM/client/xLAM.py) to generate results. However, I found that the model does not follow the format at all:
    sometimes it directly outputs plain strings;
    sometimes it directly outputs actions;
    in very few cases it outputs the thought and actions in JSON, as required by the FORMAT_INSTRUCTION.

This results in a very bad score. I suspect something is wrong with the prompt structuring, but I made sure that the prompt is created using the build_prompt function from the official multi-turn implementation. @jianguoz, do you have a solution for this? Is this the script used for evaluating tau-bench? Could you also share what might be causing the poor format following?
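To quantify how often each of the three output shapes described above occurs, the raw generations could be tallied with something like the following. This is only a minimal sketch: the key names `thought` and `tool_calls` are assumptions about what FORMAT_INSTRUCTION asks for and should be adjusted to match the actual prompt, and `classify_output` and `tally_formats` are hypothetical helpers, not part of the xLAM client.

```python
import json
from collections import Counter


def classify_output(raw, thought_key="thought", action_key="tool_calls"):
    """Label one raw generation as one of three shapes.

    thought_key / action_key are ASSUMED names for the JSON fields that
    FORMAT_INSTRUCTION requires; change them to match the real prompt.
    """
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Not valid JSON at all: the model answered with a bare string.
        return "plain_text"
    if isinstance(parsed, dict) and thought_key in parsed and action_key in parsed:
        # JSON object carrying both the thought and the actions.
        return "thought_and_action"
    if isinstance(parsed, (dict, list)):
        # Structured JSON, but without the thought wrapper.
        return "action_only"
    # Valid JSON scalar (number, quoted string, ...): treat as plain text.
    return "plain_text"


def tally_formats(outputs):
    """Count how many generations fall into each shape."""
    return Counter(classify_output(o) for o in outputs)
```

Running `tally_formats` over the saved generations for a full evaluation run would give a concrete breakdown to attach to the report, which may make the format-following problem easier to diagnose.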

  • In the second approach, I used my own structured messages, checking whether tool_calls and msg_content exist and processing them accordingly. Some example code is below:
import json

# Query the vLLM-served model through the OpenAI-compatible client.
response = self.xlam_client.chat.completions.create(
    model="/opt/tiger/voice_agent/ckpt/hf_models/Llama-xLAM-2-8b-fc-r",
    messages=self.messages,
    tools=tools,
    tool_choice="auto"
)
assist_message = response.choices[0].message
msg_content = assist_message.content
msg_tool_calls = assist_message.tool_calls
if msg_content is not None:
    # Text reply: forward it to the user and record it in the history.
    response_content = msg_content
    self.messages.append({"role": "assistant", "content": response_content})
else:
    # No text content, so the model must have produced a tool call.
    assert msg_tool_calls, "response has neither content nor tool calls"
    # Only the first tool call is used.
    function = msg_tool_calls[0].function
    cur_fc = {
        "name": function.name,
        "arguments": function.arguments
    }
    # The call is stored in the history as a JSON string in "content".
    function_str = json.dumps(cur_fc)
    self.messages.append({"role": "assistant", "content": function_str})
    response_content = cur_fc

This resolves the previous parsing error; however, the model's performance after evaluation is far below the reported result (~20-30%).
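For comparison with the snippet above, which records each tool call as a JSON string in the assistant message's `content`, the OpenAI chat-completions schema also allows the call to be stored in a structured `tool_calls` field, followed by a `tool` message carrying the result. The sketch below shows that alternative history layout. Whether the served chat template for this model expects this shape is an assumption that would need to be verified, and `assistant_tool_call_message`, `tool_result_message`, and the `get_order` tool name are hypothetical, used only for illustration.

```python
import json


def assistant_tool_call_message(call_id, name, arguments):
    """Assistant turn that carries a structured tool call.

    `arguments` is the JSON string returned in function.arguments.
    """
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments},
            }
        ],
    }


def tool_result_message(call_id, result):
    """Tool turn that returns the environment's result for a given call id."""
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    }


# Usage with a made-up tool call and result (both hypothetical).
history = [
    assistant_tool_call_message("call_0", "get_order", '{"order_id": "A1"}'),
    tool_result_message("call_0", {"status": "shipped"}),
]
```

Comparing scores between the two history layouts on a handful of tasks could help narrow down whether message formatting contributes to the gap.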

@jianguoz, can you share the official eval script so that the baseline can be reproduced consistently? Thanks!
