I am wondering if it's possible to get the eval script for communicating with the served API. I am not able to reproduce the results from the paper. I served the model (8B version) with vLLM, but when I evaluate it, I run into the issues described below.
I tried two approaches:
- In the first approach, I followed the official multi-turn script (https://github.com/SalesforceAIResearch/xLAM/blob/main/xLAM/client/xLAM.py) to generate results. However, I found that the model does not follow the format at all:
  - sometimes it directly outputs plain strings
  - sometimes it directly outputs actions
  - in very few cases it outputs the thought and actions in the JSON required by FORMAT_INSTRUCTION
  This results in a very bad score. I feel like something is wrong with the prompt structure, but I made sure that the prompt is created using the build_prompt function from the official multi-turn implementation. @jianguoz, do you have a solution for this? Is this the script used for evaluating tau-bench? Can you also give some feedback on what might cause the poor format following?
- In the second approach, I used my own structured messages: I check whether tool_calls and msg_content exist and process them accordingly. Some example code is below:
    import json

    response = self.xlam_client.chat.completions.create(
        # chat.completions.create takes `model`, not `model_name`
        model="/opt/tiger/voice_agent/ckpt/hf_models/Llama-xLAM-2-8b-fc-r",
        messages=self.messages,
        tools=tools,
        tool_choice="auto",
    )
    assist_message = response.choices[0].message
    msg_content = assist_message.content
    msg_tool_calls = assist_message.tool_calls
    if msg_content is not None:
        # plain-text reply to the user
        response_content = msg_content
        self.messages.append({"role": "assistant", "content": response_content})
    else:
        # no text content, so a tool call is expected (tool_calls may be None)
        assert msg_tool_calls
        # print(msg_tool_calls)
        # only the first tool call is used
        function = msg_tool_calls[0].function
        cur_fc = {
            "name": function.name,
            "arguments": function.arguments,
        }
        # the call is written back as a JSON string in the assistant content
        function_str = json.dumps(cur_fc)
        self.messages.append({"role": "assistant", "content": function_str})
        response_content = cur_fc
This resolves the previous parsing error; however, the model's performance after evaluation is far below the reported result (~20-30%).
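One thing that may be worth ruling out in the second approach is how the tool call is written back into the history. The code above stores it as a JSON string in the assistant content, whereas the OpenAI-style chat format carries it in a tool_calls field and returns the result in a role "tool" message. Below is a minimal sketch of that bookkeeping, assuming the served chat template accepts these message shapes (not verified for xLAM-2). The helper names, the call id, and the example tool are hypothetical.

```python
import json


def assistant_tool_call_message(call_id, name, arguments):
    """Build an assistant message carrying one tool call (OpenAI chat format)."""
    # In this format `arguments` is a JSON-encoded string, not a dict.
    if not isinstance(arguments, str):
        arguments = json.dumps(arguments)
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments},
            }
        ],
    }


def tool_result_message(call_id, result):
    """Build the message that returns a tool result for the given call id."""
    content = result if isinstance(result, str) else json.dumps(result)
    return {"role": "tool", "tool_call_id": call_id, "content": content}


# Hypothetical example: one tool call and its result appended to the history.
messages = [{"role": "user", "content": "Where is order 42?"}]
messages.append(assistant_tool_call_message("call_0", "get_order", {"order_id": "42"}))
messages.append(tool_result_message("call_0", {"status": "shipped"}))
```

Whether this changes the score is an open question; it only keeps the history in the shape the tools API itself produces.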
@jianguoz, can you share the official eval script so that the baseline can be reproduced consistently? Thanks!
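To make the format problem from the first approach easier to compare across setups, the three output shapes can be tallied over the raw completions. This is a rough sketch only: the key names "thought" and "tool_calls" are an assumption about what FORMAT_INSTRUCTION asks for, and the sample outputs are made up for illustration.

```python
import json
from collections import Counter


def classify_output(raw):
    """Label one raw completion by its shape."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not valid JSON at all: the model answered with a plain string.
        return "plain_string"
    if isinstance(parsed, dict) and "tool_calls" in parsed:
        # ASSUMPTION: the required JSON uses "thought" and "tool_calls" keys.
        return "thought_and_actions" if "thought" in parsed else "actions_only"
    if isinstance(parsed, list):
        # ASSUMPTION: a bare JSON list is a list of actions with no wrapper.
        return "actions_only"
    # Valid JSON that is not an action structure (e.g. a number or string).
    return "plain_string"


# Made-up completions, one per shape.
outputs = [
    "Sure, I can help with that.",
    '[{"name": "get_order", "arguments": {"order_id": "42"}}]',
    '{"thought": "Look up the order.", "tool_calls": '
    '[{"name": "get_order", "arguments": {"order_id": "42"}}]}',
]
counts = Counter(classify_output(o) for o in outputs)
```

Reporting these counts per run would show whether a change to the prompt or the serving setup actually shifts the share of well-formed outputs.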