
Seeking clarification: significantly lower ToolBench workflow scores on Colab vs reported ACBench results #3

@jakobahlberg

Description


Hi ACBench authors,
We are two students from the University of Copenhagen working on a bachelor project. We're trying to reproduce your ToolBench workflow results, but our scores are much lower than the numbers reported in your paper, "Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression".

We’d really appreciate guidance on what we may be doing wrong.

What we ran:

  • Environment: Google Colab

We used Qwen2.5-1.5B-Instruct, whereas the paper uses Qwen2.5-1.5B; we would expect the instruct variant to do at least as well on this task.

NOTE: We had quite a few dependency issues, which is why we install and uninstall several packages below.

```shell
!git clone https://github.com/pprp/ACBench.git
%cd /content/ACBench
!ls

!pip install -U pip
!pip install -r requirements.txt
!pip install -e .
!pip -q uninstall -y tensorflow google-ai-generativelanguage grpcio-status || true

%cd /content
!git clone https://github.com/zjunlp/WorfBench.git

!mkdir -p /content/ACBench/data
!cp -r /content/WorfBench/gold_traj /content/ACBench/data/gold_traj
!find /content/ACBench/data/gold_traj -maxdepth 2 -type f -name "graph_eval.json" | head -n 20

!pip uninstall -y flashinfer flashinfer-python nvidia-cutlass-dsl cutlass || true

%env VLLM_ATTENTION_BACKEND=TORCH_SDPA
%env VLLM_USE_V1=0

!python -m acbench.node_eval \
    --task gen_workflow \
    --task_type toolbench \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --gold_path /content/ACBench/data/gold_traj/toolbench/graph_eval.json \
    --pred_path /content/ACBench/outputs/qwen15_fp16_toolbench.json \
    --few_shot \
    --temperature 0 \
    --top_p 1 \
    --max_tokens 1024 \
    --dtype float16 \
    --tensor_parallel_size 1
```

Output we get:
We see repeated parser warnings: "edge_workflow is empty"
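Our current guess is that the model's answers are not in the edge format the evaluator expects, so the edge extraction comes back empty. A minimal sketch of that failure mode (our own illustration only; `parse_edges` is hypothetical and ACBench's actual parser may expect a different format entirely):

```python
import re

def parse_edges(model_output: str):
    """Extract 'A -> B' edges from a model's workflow answer.

    Hypothetical parser for illustration only -- not ACBench's
    real implementation.
    """
    edges = []
    for line in model_output.splitlines():
        m = re.match(r"\s*(.+?)\s*->\s*(.+?)\s*$", line)
        if m:
            edges.append((m.group(1), m.group(2)))
    return edges

# A well-formatted answer yields edges:
good = "START -> search_api\nsearch_api -> END"
print(parse_edges(good))  # [('START', 'search_api'), ('search_api', 'END')]

# Prose-style output parses to nothing, which would explain
# repeated "edge_workflow is empty" warnings:
bad = "First call the search API, then finish."
print(parse_edges(bad))   # []
```

If that guess is right, the fix may be a prompt/few-shot mismatch rather than a model issue, which is part of what we'd like to confirm.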

And final metrics:
Average Precision: 0.2861009847851952
Average Recall: 0.3667502088554719
Average F1_score: 0.3166924206397889
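For context on why empty edge predictions would drag the averages down this far, here is how we understand the scoring, sketched as plain set overlap (illustrative only; WorfBench's official scorer uses more involved graph matching):

```python
def prf1(pred, gold):
    """Set-based precision/recall/F1 over predicted vs. gold edges.

    Simplified stand-in for the real WorfBench metric, for
    illustration only.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("START", "a"), ("a", "b"), ("b", "END")]

# An empty prediction (the "edge_workflow is empty" case) scores zero:
print(prf1([], gold))  # (0.0, 0.0, 0.0)

# A partly correct graph gets partial credit:
print(prf1([("START", "a"), ("a", "END")], gold))
```

So if a sizeable fraction of samples hit the empty-edge warning, the averages would be pulled toward zero even when the remaining samples parse fine.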

From your reported results (Table 10), we expected something closer to:
Precision ≈ 0.77
Recall ≈ 0.74
F1 ≈ 0.73

Thanks a lot for your help, we’d really appreciate any clarification.

All the best,
Lasse and Jakob
