Hi ACBench authors,
We are two students at the University of Copenhagen working on a bachelor's project. We're trying to reproduce your ToolBench workflow results, but our scores come out much lower than the numbers reported in your paper, "Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression".
We'd really appreciate guidance on what we may be doing wrong.
What we ran:
- Environment: Google Colab
- Model: Qwen2.5-1.5B-Instruct (your paper uses Qwen2.5-1.5B, but we would expect the Instruct variant to do at least as well on this task)
- Note: we hit quite a few dependency issues, which is why the cell below installs and then uninstalls several packages.
```shell
!git clone https://github.com/pprp/ACBench.git
%cd /content/ACBench
!ls
!pip install -U pip
!pip install -r requirements.txt
!pip install -e .
!pip -q uninstall -y tensorflow google-ai-generativelanguage grpcio-status || true
%cd /content
!git clone https://github.com/zjunlp/WorfBench.git
!mkdir -p /content/ACBench/data
!cp -r /content/WorfBench/gold_traj /content/ACBench/data/gold_traj
!find /content/ACBench/data/gold_traj -maxdepth 2 -type f -name "graph_eval.json" | head -n 20
!pip uninstall -y flashinfer flashinfer-python nvidia-cutlass-dsl cutlass || true
%env VLLM_ATTENTION_BACKEND=TORCH_SDPA
%env VLLM_USE_V1=0
!python -m acbench.node_eval \
    --task gen_workflow \
    --task_type toolbench \
    --model_name Qwen/Qwen2.5-1.5B-Instruct \
    --gold_path /content/ACBench/data/gold_traj/toolbench/graph_eval.json \
    --pred_path /content/ACBench/outputs/qwen15_fp16_toolbench.json \
    --few_shot \
    --temperature 0 \
    --top_p 1 \
    --max_tokens 1024 \
    --dtype float16 \
    --tensor_parallel_size 1
```
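To see how widespread the parsing failure is, we also counted how many predictions end up with no parsed edges at all. This is only a rough sketch: the field name `"edges"` and the list-of-records layout are our guesses about the output file's schema, not something we confirmed against ACBench's actual format.

```python
# Hypothetical sanity check: how many predictions have an empty edge list?
# NOTE: the "edges" field name and record layout are assumptions on our part;
# the real ACBench output schema may use different keys.

def count_empty_edges(records):
    """Return (empty, total): how many records have no parsed edges."""
    empty = sum(1 for r in records if not r.get("edges"))
    return empty, len(records)

# Inline sample standing in for the real pred_path JSON:
sample = [
    {"id": 0, "edges": [["START", "search_api"], ["search_api", "END"]]},
    {"id": 1, "edges": []},  # would trigger an "edge_workflow is empty" warning
]
empty, total = count_empty_edges(sample)
print(f"{empty}/{total} predictions have no parsed edges")
```

If a large fraction of records look like the second one, that would explain the low scores and suggest the model's output format isn't being parsed, rather than the model simply performing poorly.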
Output we get:
We see repeated parser warnings of the form "edge_workflow is empty", which suggests the generated workflows are often not being parsed at all.
And final metrics:
Average Precision: 0.2861009847851952
Average Recall: 0.3667502088554719
Average F1_score: 0.3166924206397889
From your reported results (Table 10), we expected something closer to:
Precision ≈ 0.77
Recall ≈ 0.74
F1 ≈ 0.73
Thanks a lot for your help; we'd really appreciate any clarification.
All the best,
Lasse and Jakob