fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation#468
Open
hy2850 wants to merge 2 commits into
Open
fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation#468hy2850 wants to merge 2 commits into
hy2850 wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Current token efficiency benchmark uses several test commits that are not suitable for reliable evaluation.
There are two main problems:
1. Some test commits are too small
Several configured commits only change one or a few files, with very small diffs. This makes the token saving efficiency result less meaningful because the benchmark does not test realistic code review workloads.
Token efficiency should be measured against commits with enough changed files and lines to show whether graph-based review context is actually useful at review scale.
2. Some commits are unavailable in the cloned test repositories
A few configured SHAs cannot be found in the local cloned repositories under evaluate/test_repos. When this happens, the benchmark falls back to testing against HEAD~1..HEAD instead of the intended commit.
This hurts reproducibility because the benchmark result then depends on the current repository state, not the commit declared in the eval config.
fastapiare not found in the cloned repositorycode-review-graph/code_review_graph/eval/configs/fastapi.yaml
Lines 1 to 13 in 52cf3bc
This is because
git clone --depth 50is used when cloning the test repositories.https://github.com/hy2850/code-review-graph/blob/52cf3bc63ee77c8b204fb809791a5f212e83a2de/code_review_graph/eval/runner.py#L75-L78
nextjs, url was pointing tocode-review-graph, notnextjscode-review-graph/code_review_graph/eval/configs/nextjs.yaml
Lines 1 to 2 in 52cf3bc
Fix
Commit f6b14e1 addresses this by replacing the problematic test commits with commits that:
This makes the token efficiency benchmark more representative and reproducible.
as-is) current test commits (see how small changed_files and diff size are)
to-be) fixed test commits (+10 file changes, +1000 total line changes)