
fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation#468

Open: hy2850 wants to merge 2 commits into tirth8205:main from hy2850:feature/eval-configs-commit-updated


hy2850 commented May 11, 2026

Problem

The current token efficiency benchmark uses several test commits that are not suitable for reliable evaluation.

There are two main problems:

1. Some test commits are too small

Several configured commits change only one or a few files, with very small diffs. This makes the measured token savings less meaningful, because the benchmark does not exercise realistic code review workloads.

Token efficiency should be measured against commits with enough changed files and lines to show whether graph-based review context is actually useful at review scale.

2. Some commits are unavailable in the cloned test repositories

A few configured SHAs cannot be found in the local cloned repositories under evaluate/test_repos. When this happens, the benchmark falls back to testing against HEAD~1..HEAD instead of the intended commit.

This hurts reproducibility because the benchmark result then depends on the current repository state, not the commit declared in the eval config.
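The availability problem described above can be detected up front. The following is a minimal sketch, not the runner's actual code: `commit_exists` and `diff_range` are hypothetical helper names, and raising an error instead of silently falling back to HEAD~1..HEAD is an assumption about how a stricter check could behave.

```python
import subprocess
from pathlib import Path


def commit_exists(repo: Path, rev: str) -> bool:
    """Return True if `rev` resolves to a commit object in the local clone."""
    result = subprocess.run(
        ["git", "-C", str(repo), "cat-file", "-e", f"{rev}^{{commit}}"],
        capture_output=True,
    )
    return result.returncode == 0


def diff_range(repo: Path, sha: str, exists=commit_exists) -> str:
    """Build the range for a configured commit, refusing to fall back to HEAD."""
    # Both the commit and its first parent must be present to diff sha~1..sha.
    # In a shallow clone the oldest commits have no parent available locally.
    if not exists(repo, sha) or not exists(repo, f"{sha}~1"):
        raise LookupError(f"{sha} or its parent is missing from {repo}")
    return f"{sha}~1..{sha}"
```

The `exists` parameter is only there so the range logic can be exercised without a real repository; in normal use it defaults to the `git cat-file` check.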

  1. Test commits for fastapi are not found in the cloned repository

    name: fastapi
    url: https://github.com/tiangolo/fastapi
    commit: HEAD
    language: python
    size_category: medium
    test_commits:
      - sha: fa3588c38c7473aca7536b12d686102de4b0f407
        description: "Fix typo for client_secret in OAuth2 form docstrings"
        changed_files: 1
      - sha: 0227991a01e61bf5cdd93cc00e9e243f52b47a4a
        description: "Exclude spam comments from statistics in scripts/people.py"
        changed_files: 1

This is because the test repositories are cloned with git clone --depth 50, so commits older than the shallow history are not present locally.

https://github.com/hy2850/code-review-graph/blob/52cf3bc63ee77c8b204fb809791a5f212e83a2de/code_review_graph/eval/runner.py#L75-L78
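To see which configured SHAs fall outside a shallow clone, the config can be compared against the commits actually present locally. This is a rough sketch under stated assumptions: `local_shas` and `missing_shas` are hypothetical names that do not come from runner.py, and the prefix match is only a convenience for abbreviated SHAs.

```python
import subprocess
from pathlib import Path


def local_shas(repo: Path) -> set[str]:
    """List every commit SHA reachable from HEAD in the local (possibly shallow) clone."""
    out = subprocess.run(
        ["git", "-C", str(repo), "rev-list", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return set(out.split())


def missing_shas(configured: list[str], available: set[str]) -> list[str]:
    """Return the configured SHAs that match no local commit, preserving order."""
    # An abbreviated SHA counts as present if it is a prefix of a full local SHA.
    return [
        sha
        for sha in configured
        if not any(full.startswith(sha) for full in available)
    ]
```

For example, `missing_shas(shas_from_yaml, local_shas(repo_path))` would return the entries that the benchmark could not evaluate as declared, where `shas_from_yaml` and `repo_path` stand in for values read from the eval config.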

  2. In the eval config for nextjs, url pointed to code-review-graph, not nextjs

    name: nextjs
    url: https://github.com/tirth8205/code-review-graph

Fix

Commit f6b14e1 addresses this by replacing the problematic test commits with commits that:

  • exist in the corresponding cloned repositories
  • have their parent commit available locally
  • include larger diffs suitable for token efficiency evaluation
  • generally cover 10+ changed files and 1000+ total changed lines

This makes the token efficiency benchmark more representative and reproducible.
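The size criteria above can be checked mechanically from the output of `git diff --numstat <sha>~1 <sha>`. The sketch below is illustrative only: `DiffStats`, `parse_numstat`, and `is_review_scale` are hypothetical names, and the thresholds of 10 files and 1000 lines are taken from the guideline in this description rather than from any code in the repository.

```python
from typing import NamedTuple


class DiffStats(NamedTuple):
    changed_files: int
    added: int
    deleted: int


def parse_numstat(output: str) -> DiffStats:
    """Sum the per-file lines of `git diff --numstat` into totals."""
    files = added = deleted = 0
    for line in output.splitlines():
        if not line.strip():
            continue
        add, rem, _path = line.split("\t", 2)
        files += 1
        # Binary files are reported as "-<TAB>-<TAB>path": count the file, not lines.
        if add != "-":
            added += int(add)
        if rem != "-":
            deleted += int(rem)
    return DiffStats(files, added, deleted)


def is_review_scale(stats: DiffStats, min_files: int = 10, min_lines: int = 1000) -> bool:
    """Apply the 10+ changed files and 1000+ total changed lines guideline."""
    return (
        stats.changed_files >= min_files
        and stats.added + stats.deleted >= min_lines
    )
```

Because the description says the new commits "generally" meet these numbers, the thresholds are best treated as adjustable defaults rather than hard limits.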


As-is: current test commits (note how small changed_files and the diff sizes are)

| Config | SHA | changed_files | Diff size (+N -M) |
| --- | --- | --- | --- |
| express.yaml | 925a1dff1e42f1b393c977b8b77757fcf633e09f | 1 | +1 -1 |
| express.yaml | b4ab7d65d7724d9309b6faaaf82ad492da2a6d35 | 1 | +69 -0 |
| fastapi.yaml | fa3588c38c7473aca7536b12d686102de4b0f407 | 1 | not found locally |
| fastapi.yaml | 0227991a01e61bf5cdd93cc00e9e243f52b47a4a | 1 | not found locally |
| flask.yaml | fbb6f0bc4c60a0bada0e03c3480d0ccf30a3c1df | 10 | +194 -80 |
| flask.yaml | a29f88ce6f2f9843bd6fcbbfce1390a2071965d6 | 4 | +55 -8 |
| gin.yaml | 052d1a79aafe3f04078a2716f8e77d4340308383 | 5 | +76 -0 |
| gin.yaml | 472d086af2acd924cb4b9d7be0525f7d790f69bc | 2 | +159 -1 |
| gin.yaml | 5c00df8afadd06cc5be530dde00fe6d9fa4a2e4a | 2 | +38 -1 |
| httpx.yaml | ae1b9f66238f75ced3ced5e4485408435de10768 | 3 | +6 -1 |
| httpx.yaml | b55d4635701d9dc22928ee647880c76b078ba3f2 | 4 | +9 -9 |
| nextjs.yaml | 528801f (repo url in yaml was pointing to code-review-graph, not nextjs) | 3 | not found locally |
| nextjs.yaml | 84bde35 (repo url in yaml was pointing to code-review-graph, not nextjs) | 2 | not found locally |

To-be: replacement test commits (10+ changed files, 1000+ total changed lines)

| Config | SHA | changed_files | Diff size (+N -M) |
| --- | --- | --- | --- |
| express.yaml | f41d09a3cf0592b65a1359495b65d3d7cf949c50 | 15 | +822 -507 |
| express.yaml | cec5780db4f07a61e21e139e38af20b02dd5ae3a | 11 | +29 -1039 |
| fastapi.yaml | 22381558446c5d1ac376680a6581dd63b3a04119 | 23 | +1681 -37 |
| fastapi.yaml | 749cefdeb1428ba5c3911b03c4a72993f7eb3747 | 21 | +1168 -71 |
| flask.yaml | c2705ffd9ce1dc8476cb29eaf5ff5d4c719852d9 | 36 | +779 -1007 |
| flask.yaml | 0ec7f713d679ceed2c605e62ac5d38d579f29fa0 | 10 | +1622 -1353 |
| gin.yaml | 0a192fb0fa0127eac08cf24c624b92048ed823f6 | 26 | +1477 -86 |
| gin.yaml | ac0ad2fed865d40a0adc1ac3ccaadc3acff5db4b | 14 | +775 -615 |
| gin.yaml | 0feaf8cbd80da13be634b13fd28bfb2d6e357839 | 64 | +25 -2393 |
| httpx.yaml | 8e36f2bc685dfbe43cd7503bc1c422a6ed6e05a5 | 29 | +533 -947 |
| httpx.yaml | ee37a762ef6378ed16681a3452f494a5640d98de | 18 | +1215 -370 |
| nextjs.yaml | d81d5ab7dfbd003bd6b26390b75ee93d43729020 | 34 | +2989 -407 |
| nextjs.yaml | d86e19772824281969a6a619e7d91be43663f91d | 266 | +2789 -572 |

hy2850 changed the title from "fix(eval): Use larger, reproducible test commits for more reliable token efficiency evaluation" to "fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation" on May 11, 2026