I am trying to reproduce the results reported in the paper but am consistently observing discrepancies in accuracy. I’d like to understand whether I am missing some setup detail or doing something incorrectly.
Environment Details:
- Workbench Machine: a2-ultragpu-2g (2x NVIDIA Tesla A100, 12 vCPUs, 340GB RAM)
- Python 3.11.13
- vllm: 0.10.1.1
- torch: 2.7.1+cu128
- CUDA: cuda_11.8.r11.8/compiler.31833905_0
- Taubench: latest commit 4754e6b (noted no major changes since Jan 22)
Results:
| Model | τ-Retail (local) | τ-Retail (published) | Scores across runs (Retail) | τ-Airline (local) | τ-Airline (published) | Scores across runs (Airline) |
|---|---|---|---|---|---|---|
| xLAM-2-3b-fc-r | 23.83 | 44.4 | [26.09, 23.48, 19.13, 24.35, 26.09] | 29 | 32 | [24.0, 26.0, 46.0, 20.0] |
| xLAM-2-8b-fc-r | 43.31 | 58.2 | [40.87, 40.0, 42.61, 45.22, 47.83] | 38 | 35.2 | [40.0, 42.0, 32.0] |
| Qwen-2.5-3B-Instruct | 6.96 | - | [6.96] | 14 | - | [14] |
| Llama-3.1-8b | 5.22 | - | [5.22] | 31.33 | - | [32.0, 28.0, 34.0] |
| gpt-4o | 57.39 | 60.3 | [57.39] | 53 | 42.8 | [52.0, 54.0] |
| gpt-4o-mini | 44 | 44 | [44] | 22.9 | 22.5 | [22.9] |
| Qwen-2.5-32B-Instruct | 51.73 | 24.4 | [51.3, 52.17] | 28 | 25 | [30.0, 26.0] |
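The "local" columns are plain means over the listed runs. As a rough sketch of how the Retail gaps compare with run-to-run spread, the snippet below uses only numbers copied from the table; the `summarize` helper and the `retail` dictionary are illustrative names, not part of tau-bench.

```python
from statistics import mean, stdev

# (per-run local scores, published score), copied from the Retail columns of the table.
retail = {
    "xLAM-2-3b-fc-r": ([26.09, 23.48, 19.13, 24.35, 26.09], 44.4),
    "xLAM-2-8b-fc-r": ([40.87, 40.0, 42.61, 45.22, 47.83], 58.2),
    "Qwen-2.5-32B-Instruct": ([51.3, 52.17], 24.4),
}


def summarize(runs, published):
    """Return (mean, sample std dev, mean minus published) for one model."""
    m = mean(runs)
    # Sample standard deviation needs at least two runs.
    s = stdev(runs) if len(runs) > 1 else 0.0
    return m, s, m - published


for name, (runs, published) in retail.items():
    m, s, gap = summarize(runs, published)
    print(f"{name}: mean={m:.2f} std={s:.2f} gap={gap:+.2f}")
```

If the gap is several times larger than the standard deviation across runs, sampling noise alone is unlikely to explain the difference, though with only two to five runs per model this is a rough indication at best.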
Observed Discrepancies:
1. xLAM models: Much lower accuracy for xLAM-2-3b-fc-r and xLAM-2-8b-fc-r than published results.
2. Qwen-2.5-32B-Instruct: Much higher accuracy than published results.
3. Tensor parallelism: Using --tensor-parallel-size 2 on xLAM-2-8b-fc-r drops accuracy drastically (from ~43% to ~12%).
My main concern is with [1] and [2]. I’d also like to understand [3], though it may be a separate issue.
Commands Used:
Serve Llama-xLAM-2-8b-fc-r:
vllm serve Salesforce/Llama-xLAM-2-8b-fc-r \
--enable-auto-tool-choice \
--tool-call-parser xlam \
--chat-template examples/tool_chat_template_xlam_llama.jinja
Serve xLAM-2-3b-fc-r:
vllm serve Salesforce/xLAM-2-3b-fc-r \
--enable-auto-tool-choice \
--tool-parser-plugin ./xlam_tool_call_parser.py \
--tool-call-parser xlam \
--chat-template examples/tool_chat_template_xlam_qwen.jinja
Serve Qwen2.5-32B-Instruct:
vllm serve Qwen/Qwen2.5-32B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
Optional optimization flags (used for Qwen models only):
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 \
--max-num-seqs 256
Note: Using --tensor-parallel-size 2 drops the accuracy of Llama-xLAM-2-8b-fc-r from the ~43% reported above to ~12%, which is strange. Also, running Qwen-2.5-3B-Instruct and xLAM-2-3b-fc-r without the above optimizations is very slow on τ-Retail (~2 hours), and I am not sure why. The results above for the Qwen models therefore use these optimizations, which bring the run time down to about 20 minutes.
Evaluation command:
python run.py --agent-strategy tool-calling --env retail \
--model "hosted_vllm/Salesforce/Llama-xLAM-2-8b-fc-r" --model-provider hosted_vllm \
--user-model gpt-4o --user-model-provider openai \
--max-concurrency 64
Question:
Could you please help me understand why my results diverge from the published values?
- Am I missing some key configuration or evaluation detail?
- Is there a recommended setup for reproducing the paper’s reported results?
Happy to provide more logs or details if helpful. Thanks!