Re-producing Tau-Bench Issue #29

I am wondering whether an eval script for communicating with the served API is available. I am not able to reproduce the results from the paper. I served the model (8b version) with vLLM, but when I evaluate it, I run into the issues below.

I tried two approaches:

- In the first approach, I followed the official multi-turn script (https://github.com/SalesforceAIResearch/xLAM/blob/main/xLAM/client/xLAM.py) to generate results. However, **I found that the model does not follow the format at all**:
  - sometimes it directly outputs plain strings;
  - sometimes it directly outputs actions;
  - in very few cases it outputs the thought and actions in JSON, as required by the FORMAT_INSTRUCTION.

This results in a very bad score. I suspect something is wrong with the prompt structure, but I made sure that the prompt is created with the `build_prompt` function from the official multi-turn implementation. @jianguoz, do you have a solution for this? Is this the script that was used for evaluating tau-bench? Could you also give some feedback on what might cause the poor format following?
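To put numbers on the three behaviours above, a minimal sketch like the following could tally how often each one occurs in the raw generations. The key names `thought` and `tool_calls` are assumptions about what the FORMAT_INSTRUCTION asks for and should be checked against the actual prompt; `classify_output` is a hypothetical helper, not part of the xLAM client.

```python
import json
from collections import Counter
from typing import Any


def classify_output(raw: str) -> str:
    """Label one raw generation as 'thought_and_actions', 'actions_only', or 'plain_text'.

    Assumes (unverified) that the required JSON has the keys "thought" and "tool_calls".
    """
    try:
        parsed: Any = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Not JSON at all: the model answered with a bare string.
        return "plain_text"

    if isinstance(parsed, dict):
        if "thought" in parsed and "tool_calls" in parsed:
            return "thought_and_actions"
        if "tool_calls" in parsed or "name" in parsed:
            # Actions without the accompanying thought.
            return "actions_only"
    if isinstance(parsed, list) and parsed and all(
        isinstance(item, dict) and "name" in item for item in parsed
    ):
        # A bare list of calls, again without a thought.
        return "actions_only"
    return "plain_text"


# Illustrative inputs only; replace with the real generations.
samples = [
    'Sure, I can help with that.',
    '[{"name": "some_tool", "arguments": {}}]',
    '{"thought": "Need to look this up.", "tool_calls": []}',
]
print(Counter(classify_output(s) for s in samples))
```

Running this over a full evaluation log would show whether the failures are dominated by plain strings or by bare actions, which may help narrow down where the prompt goes wrong.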

- In the second approach, I used my own structured messages: I check whether `tool_calls` and the message content exist and process them accordingly. Some example code is below:

```python
import json

response = self.xlam_client.chat.completions.create(
    # Assuming `self.xlam_client` is an OpenAI-compatible client,
    # the keyword is `model` (not `model_name`).
    model="/opt/tiger/voice_agent/ckpt/hf_models/Llama-xLAM-2-8b-fc-r",
    messages=self.messages,
    tools=tools,
    tool_choice="auto",
)
assist_message = response.choices[0].message
msg_content = assist_message.content
msg_tool_calls = assist_message.tool_calls
if msg_content is not None:
    # Plain-text reply to the user.
    response_content = msg_content
    self.messages.append({"role": "assistant", "content": response_content})
else:
    # No text content, so a tool call is expected; only the first one is used.
    assert msg_tool_calls, "expected at least one tool call"
    function = msg_tool_calls[0].function
    cur_fc = {
        "name": function.name,
        "arguments": function.arguments,
    }
    # The call is stored in the history as a JSON string in `content`.
    function_str = json.dumps(cur_fc)
    self.messages.append({"role": "assistant", "content": function_str})
    response_content = cur_fc
```

This resolves the previous parsing error; however, the model's performance after evaluation is still far below the reported result (~20-30%).
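One thing that may be worth checking is how the tool call is written back into the history. The snippet above stores it as a JSON string in `content`, whereas the OpenAI chat format represents it with a `tool_calls` field on the assistant message, followed by a `tool` message carrying the result. The sketch below builds both messages; the helper names, the tool name, and the result payload are hypothetical, and whether the served model's chat template renders these fields as intended has not been verified here.

```python
import json
from typing import Any, Dict


def assistant_tool_call_message(call_id: str, name: str, arguments: str) -> Dict[str, Any]:
    """Assistant turn that records a tool call in the OpenAI chat format.

    `arguments` is the JSON string returned by the API (function.arguments).
    """
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments},
            }
        ],
    }


def tool_result_message(call_id: str, result: Any) -> Dict[str, Any]:
    """Tool turn that returns the result for the call with the same id."""
    content = result if isinstance(result, str) else json.dumps(result)
    return {"role": "tool", "tool_call_id": call_id, "content": content}


# Illustrative usage with made-up values.
messages = [{"role": "user", "content": "Where is my order?"}]
messages.append(
    assistant_tool_call_message("call_0", "hypothetical_tool", '{"order_id": "123"}')
)
messages.append(tool_result_message("call_0", {"status": "shipped"}))
print(json.dumps(messages, indent=2))
```

In the original snippet, `msg_tool_calls[0].id` would supply the `call_id`. Note also that the snippet keeps only the first tool call and ignores `tool_calls` whenever `content` is not `None`, so any response containing both is treated as a plain reply.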


@jianguoz, could you share the official eval script so that the baseline can be reproduced consistently? Thanks!
