Unable to Reproduce Reported Accuracy from Paper

I am trying to reproduce the results reported in the paper but am consistently observing discrepancies in accuracy. I’d like to understand whether I am missing some setup detail or doing something incorrectly.

Environment Details:
- Workbench Machine: a2-ultragpu-2g (2x NVIDIA Tesla A100, 12 vCPUs, 340GB RAM)
- Python 3.11.13
- vllm: 0.10.1.1
- torch: 2.7.1+cu128
- CUDA: cuda_11.8.r11.8/compiler.31833905_0
- Taubench: latest commit 4754e6b (noted no major changes since Jan 22)

### Results:

| Model                 |   τ-Retail (local) | τ-Retail (published)   | Scores across runs (Retail)         |   τ-Airline (local) | τ-Airline (published)   | Scores across runs (Airline)   |
|:----------------------|-------------------:|:-----------------------|:------------------------------------|--------------------:|:------------------------|:-------------------------------|
| xLAM-2-3b-fc-r        |              23.83 | 44.4                   | [26.09, 23.48, 19.13, 24.35, 26.09] |               29    | 32                      | [24.0, 26.0, 46.0, 20.0]       |
| xLAM-2-8b-fc-r        |              43.31 | 58.2                   | [40.87, 40.0, 42.61, 45.22, 47.83]  |               38    | 35.2                    | [40.0, 42.0, 32.0]             |
| Qwen-2.5-3B-Instruct  |               6.96 | -                      | [6.96]                              |               14    | -                       | [14]                           |
| Llama-3.1-8b          |               5.22 | -                      | [5.22]                              |               31.33 | -                       | [32.0, 28.0, 34.0]             |
| gpt-4o                |              57.39 | 60.3                   | [57.39]                             |               53    | 42.8                    | [52.0, 54.0]                   |
| gpt-4o-mini           |              44    | 44                     | [44]                                 |               22.9  | 22.5                    | [22.9]                            |
| Qwen-2.5-32B-Instruct |              51.73 | 24.4                   | [51.3, 52.17]                       |               28    | 25                      | [30.0, 26.0]                   |

### Observed Discrepancies:

1. xLAM models: Much lower accuracy for `xLAM-2-3b-fc-r` and `xLAM-2-8b-fc-r` than published results.
2. `Qwen-2.5-32B-Instruct`: Much higher accuracy than published results.
3. Tensor parallelism: Using --tensor-parallel-size 2 on xLAM-2-8b-fc-r drops accuracy drastically (from ~43% → ~12%).

My main concern is with [1] and [2]. I’d also like to understand [3], though it may be a separate issue.

### Commands Used:

Serve `Llama-xLAM-2-8b-fc-r`:
```bash
vllm serve Salesforce/Llama-xLAM-2-8b-fc-r \
  --enable-auto-tool-choice \
  --tool-call-parser xlam \
  --chat-template examples/tool_chat_template_xlam_llama.jinja
```

Serve `xLAM-2-3b-fc-r`:
```bash
vllm serve Salesforce/xLAM-2-3b-fc-r \
--enable-auto-tool-choice \
--tool-parser-plugin ./xlam_tool_call_parser.py \
--tool-call-parser xlam \
--chat-template examples/tool_chat_template_xlam_qwen.jinja
```

Serve `Qwen2.5-32B-Instruct`:
```bash
vllm serve Qwen/Qwen2.5-32B-Instruct --enable-auto-tool-choice --tool-call-parser hermes
```

Optional optimization flags (used for Qwen models only):
```bash
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 \
--max-num-seqs 256
```
Note: I have seen that using `--tensor-parallel-size 2` drops the accuracy of `Llama-xLAM-2-8b-fc-r` from the above observed 43% to ~12%, which is strange. Also, running `qwen-3B` and `xlam-3B` without the above optimizations is very slow on retail taubench (~2 hours), again not sure why. So my above results are with the optimizations for the qwen models, as with these optimizations it takes only about 20 mins.

Evaluation command:
```bash
python run.py --agent-strategy tool-calling --env retail \
  --model "hosted_vllm/Salesforce/Llama-xLAM-2-8b-fc-r" --model-provider hosted_vllm \
  --user-model gpt-4o --user-model-provider openai \
  --max-concurrency 64
```

### Question:

Could you please help me understand why my results diverge from the published values?
- Am I missing some key configuration or evaluation detail?
- Is there a recommended setup for reproducing the paper’s reported results?

Happy to provide more logs or details if helpful. Thanks!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to Reproduce Reported Accuracy from Paper #37

Results:

Observed Discrepancies:

Commands Used:

Question:

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Model	τ-Retail (local)	τ-Retail (published)	Scores across runs (Retail)	τ-Airline (local)	τ-Airline (published)	Scores across runs (Airline)
xLAM-2-3b-fc-r	23.83	44.4	[26.09, 23.48, 19.13, 24.35, 26.09]	29	32	[24.0, 26.0, 46.0, 20.0]
xLAM-2-8b-fc-r	43.31	58.2	[40.87, 40.0, 42.61, 45.22, 47.83]	38	35.2	[40.0, 42.0, 32.0]
Qwen-2.5-3B-Instruct	6.96	-	[6.96]	14	-	[14]
Llama-3.1-8b	5.22	-	[5.22]	31.33	-	[32.0, 28.0, 34.0]
gpt-4o	57.39	60.3	[57.39]	53	42.8	[52.0, 54.0]
gpt-4o-mini	44	44	[44]	22.9	22.5	[22.9]
Qwen-2.5-32B-Instruct	51.73	24.4	[51.3, 52.17]	28	25	[30.0, 26.0]

Unable to Reproduce Reported Accuracy from Paper #37

Description

Results:

Observed Discrepancies:

Commands Used:

Question:

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions