Skip to content

Unable to Reproduce Reported Accuracy from Paper #37

@mithil-turing

Description

@mithil-turing

I am trying to reproduce the results reported in the paper but am consistently observing discrepancies in accuracy. I’d like to understand whether I am missing some setup detail or doing something incorrectly.

Environment Details:

  • Workbench Machine: a2-ultragpu-2g (2x NVIDIA Tesla A100, 12 vCPUs, 340GB RAM)
  • Python 3.11.13
  • vllm: 0.10.1.1
  • torch: 2.7.1+cu128
  • CUDA: cuda_11.8.r11.8/compiler.31833905_0
  • Taubench: latest commit 4754e6b (noted no major changes since Jan 22)

Results:

Model τ-Retail (local) τ-Retail (published) Scores across runs (Retail) τ-Airline (local) τ-Airline (published) Scores across runs (Airline)
xLAM-2-3b-fc-r 23.83 44.4 [26.09, 23.48, 19.13, 24.35, 26.09] 29 32 [24.0, 26.0, 46.0, 20.0]
xLAM-2-8b-fc-r 43.31 58.2 [40.87, 40.0, 42.61, 45.22, 47.83] 38 35.2 [40.0, 42.0, 32.0]
Qwen-2.5-3B-Instruct 6.96 - [6.96] 14 - [14]
Llama-3.1-8b 5.22 - [5.22] 31.33 - [32.0, 28.0, 34.0]
gpt-4o 57.39 60.3 [57.39] 53 42.8 [52.0, 54.0]
gpt-4o-mini 44 44 [44] 22.9 22.5 [22.9]
Qwen-2.5-32B-Instruct 51.73 24.4 [51.3, 52.17] 28 25 [30.0, 26.0]

Observed Discrepancies:

  1. xLAM models: Much lower accuracy for xLAM-2-3b-fc-r and xLAM-2-8b-fc-r than published results.
  2. Qwen-2.5-32B-Instruct: Much higher accuracy than published results.
  3. Tensor parallelism: Using --tensor-parallel-size 2 on xLAM-2-8b-fc-r drops accuracy drastically (from ~43% → ~12%).

My main concern is with [1] and [2]. I’d also like to understand [3], though it may be a separate issue.

Commands Used:

Serve Llama-xLAM-2-8b-fc-r:

vllm serve Salesforce/Llama-xLAM-2-8b-fc-r \
  --enable-auto-tool-choice \
  --tool-call-parser xlam \
  --chat-template examples/tool_chat_template_xlam_llama.jinja

Serve xLAM-2-3b-fc-r:

vllm serve Salesforce/xLAM-2-3b-fc-r \
--enable-auto-tool-choice \
--tool-parser-plugin ./xlam_tool_call_parser.py \
--tool-call-parser xlam \
--chat-template examples/tool_chat_template_xlam_qwen.jinja

Serve Qwen2.5-32B-Instruct:

vllm serve Qwen/Qwen2.5-32B-Instruct --enable-auto-tool-choice --tool-call-parser hermes

Optional optimization flags (used for Qwen models only):

--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 \
--max-num-seqs 256

Note: I have seen that using --tensor-parallel-size 2 drops the accuracy of Llama-xLAM-2-8b-fc-r from the above observed 43% to ~12%, which is strange. Also, running qwen-3B and xlam-3B without the above optimizations is very slow on retail taubench (~2 hours), again not sure why. So my above results are with the optimizations for the qwen models, as with these optimizations it takes only about 20 mins.

Evaluation command:

python run.py --agent-strategy tool-calling --env retail \
  --model "hosted_vllm/Salesforce/Llama-xLAM-2-8b-fc-r" --model-provider hosted_vllm \
  --user-model gpt-4o --user-model-provider openai \
  --max-concurrency 64

Question:

Could you please help me understand why my results diverge from the published values?

  • Am I missing some key configuration or evaluation detail?
  • Is there a recommended setup for reproducing the paper’s reported results?

Happy to provide more logs or details if helpful. Thanks!

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions