
Seeking clarification: significantly lower ToolBench workflow scores on Colab vs reported ACBench results #3

@jakobahlberg

Description


Hi ACBench authors,
We are two students from the University of Copenhagen working on a bachelor project. We're trying to reproduce your ToolBench workflow results, but our scores are much lower than the numbers reported in your paper, "Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression".

We’d really appreciate guidance on what we may be doing wrong.

What we ran:

  • Environment: Google Colab

We used Qwen2.5-1.5B-Instruct, whereas the paper uses Qwen2.5-1.5B; we would expect the instruct variant to do at least as well on this task.

NOTE: We had quite a few dependency issues, which is why we install and uninstall several packages below.

```shell
!git clone https://github.com/pprp/ACBench.git
%cd /content/ACBench
!ls

!pip install -U pip
!pip install -r requirements.txt
!pip install -e .
!pip -q uninstall -y tensorflow google-ai-generativelanguage grpcio-status || true

%cd /content
!git clone https://github.com/zjunlp/WorfBench.git

!mkdir -p /content/ACBench/data
!cp -r /content/WorfBench/gold_traj /content/ACBench/data/gold_traj
!find /content/ACBench/data/gold_traj -maxdepth 2 -type f -name "graph_eval.json" | head -n 20

!pip uninstall -y flashinfer flashinfer-python nvidia-cutlass-dsl cutlass || true

%env VLLM_ATTENTION_BACKEND=TORCH_SDPA
%env VLLM_USE_V1=0

!python -m acbench.node_eval \
    --task gen_workflow \
    --task_type toolbench \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --gold_path /content/ACBench/data/gold_traj/toolbench/graph_eval.json \
    --pred_path /content/ACBench/outputs/qwen15_fp16_toolbench.json \
    --few_shot \
    --temperature 0 \
    --top_p 1 \
    --max_tokens 1024 \
    --dtype float16 \
    --tensor_parallel_size 1
```

Output we get:
We see repeated parser warnings: "edge_workflow is empty"
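Our current guess is that the model's answers are not in the edge format the evaluator expects, so the edge extraction comes back empty. A minimal sketch of that failure mode (our own illustration only; `parse_edges` is hypothetical and ACBench's actual parser may expect a different format entirely):

```python
import re

def parse_edges(model_output: str):
    """Extract 'A -> B' edges from a model's workflow answer.

    Hypothetical parser for illustration only -- not ACBench's
    real implementation.
    """
    edges = []
    for line in model_output.splitlines():
        m = re.match(r"\s*(.+?)\s*->\s*(.+?)\s*$", line)
        if m:
            edges.append((m.group(1), m.group(2)))
    return edges

# A well-formatted answer yields edges:
good = "START -> search_api\nsearch_api -> END"
print(parse_edges(good))  # [('START', 'search_api'), ('search_api', 'END')]

# Prose-style output parses to nothing, which would explain
# repeated "edge_workflow is empty" warnings:
bad = "First call the search API, then finish."
print(parse_edges(bad))   # []
```

If that guess is right, the fix may be a prompt/few-shot mismatch rather than a model issue, which is part of what we'd like to confirm.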

And final metrics:
Average Precision: 0.2861009847851952
Average Recall: 0.3667502088554719
Average F1_score: 0.3166924206397889
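For context on why empty edge predictions would drag the averages down this far, here is how we understand the scoring, sketched as plain set overlap (illustrative only; WorfBench's official scorer uses more involved graph matching):

```python
def prf1(pred, gold):
    """Set-based precision/recall/F1 over predicted vs. gold edges.

    Simplified stand-in for the real WorfBench metric, for
    illustration only.
    """
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [("START", "a"), ("a", "b"), ("b", "END")]

# An empty prediction (the "edge_workflow is empty" case) scores zero:
print(prf1([], gold))  # (0.0, 0.0, 0.0)

# A partly correct graph gets partial credit:
print(prf1([("START", "a"), ("a", "END")], gold))
```

So if a sizeable fraction of samples hit the empty-edge warning, the averages would be pulled toward zero even when the remaining samples parse fine.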

From your reported results (Table 10), we expected something closer to:
Precision ≈ 0.77
Recall ≈ 0.74
F1 ≈ 0.73

Thanks a lot for your help, we’d really appreciate any clarification.

All the best,
Lasse and Jakob
