fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation by hy2850 · Pull Request #468 · tirth8205/code-review-graph

hy2850 · 2026-05-11T14:45:46Z

Problem

The current token efficiency benchmark uses several test commits that are not suitable for reliable evaluation.

There are two main problems:

1. Some test commits are too small

Several configured commits change only one or a few files, with very small diffs. This makes the token efficiency result less meaningful because the benchmark does not exercise realistic code review workloads.

Token efficiency should be measured against commits with enough changed files and lines to show whether graph-based review context is actually useful at review scale.

2. Some commits are unavailable in the cloned test repositories

A few configured SHAs cannot be found in the local cloned repositories under evaluate/test_repos. When this happens, the benchmark falls back to testing against HEAD~1..HEAD instead of the intended commit.

This hurts reproducibility because the benchmark result then depends on the current repository state, not the commit declared in the eval config.
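The fallback described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: `select_diff_range` and the `available` set are hypothetical names, and the `<sha>~1..<sha>` range for the intended case is an assumption based on the need for a locally available parent commit.

```python
def select_diff_range(sha: str, available: set[str]) -> tuple[str, bool]:
    """Return (git diff range, fell_back) for a configured test commit.

    `available` stands in for "commits present in the local clone".
    """
    if sha in available:
        # Intended case: diff the configured commit against its parent.
        return f"{sha}~1..{sha}", False
    # Silent fallback: the result now depends on wherever HEAD happens to be.
    return "HEAD~1..HEAD", True


# SHAs taken from the configs discussed in this PR.
local = {"925a1dff1e42f1b393c977b8b77757fcf633e09f"}
print(select_diff_range("925a1dff1e42f1b393c977b8b77757fcf633e09f", local))
print(select_diff_range("fa3588c38c7473aca7536b12d686102de4b0f407", local))
```

The second call returns the `HEAD~1..HEAD` range with the fallback flag set, which is the case that makes results irreproducible.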

The test commits for fastapi are not found in the cloned repository:

code-review-graph/code_review_graph/eval/configs/fastapi.yaml

Lines 1 to 13 in 52cf3bc

    
    name: fastapi
    url: https://github.com/tiangolo/fastapi
    commit: HEAD
    language: python
    size_category: medium

    test_commits:
      - sha: fa3588c38c7473aca7536b12d686102de4b0f407
        description: "Fix typo for client_secret in OAuth2 form docstrings"
        changed_files: 1
      - sha: 0227991a01e61bf5cdd93cc00e9e243f52b47a4a
        description: "Exclude spam comments from statistics in scripts/people.py"
        changed_files: 1

This is because git clone --depth 50 is used when cloning the test repositories.

https://github.com/hy2850/code-review-graph/blob/52cf3bc63ee77c8b204fb809791a5f212e83a2de/code_review_graph/eval/runner.py#L75-L78
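One way to detect this situation before running the benchmark is to ask git whether a configured commit and its first parent exist in the shallow clone. The sketch below is not part of this PR; `commit_is_usable` and the injectable `run` parameter are hypothetical, and it relies only on `git cat-file -e`, which exits with status 0 when the object exists.

```python
import subprocess
from pathlib import Path
from typing import Callable


def _object_exists(repo: Path, rev: str, run: Callable = subprocess.run) -> bool:
    # `git cat-file -e <rev>` exits non-zero if <rev> cannot be resolved locally.
    result = run(
        ["git", "-C", str(repo), "cat-file", "-e", rev],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def commit_is_usable(repo: Path, sha: str, run: Callable = subprocess.run) -> bool:
    """True if both the commit and its first parent are present locally."""
    # `^{commit}` forces git to look the object up instead of echoing the SHA.
    return _object_exists(repo, f"{sha}^{{commit}}", run) and _object_exists(
        repo, f"{sha}~1^{{commit}}", run
    )
```

In a clone made with `--depth 50`, a commit older than the shallow boundary fails the first check, and the boundary commit itself fails the second because its parent was never fetched.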

In the eval config for nextjs, url was pointing to code-review-graph, not nextjs:

code-review-graph/code_review_graph/eval/configs/nextjs.yaml

Lines 1 to 2 in 52cf3bc

    
    name: nextjs
    url: https://github.com/tirth8205/code-review-graph
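A mismatch like this could be caught by a small sanity check that compares the config `name` with the repository name in `url`. The following is a hedged sketch rather than anything in the eval code: `url_matches_name` is a hypothetical helper, and it assumes config names are the repository name with punctuation removed.

```python
import re
from urllib.parse import urlparse


def _normalize(text: str) -> str:
    # Compare case-insensitively and ignore punctuation such as "." or "-".
    return re.sub(r"[^a-z0-9]", "", text.lower())


def url_matches_name(name: str, url: str) -> bool:
    """True if the last path segment of `url` matches the config `name`."""
    repo = urlparse(url).path.rstrip("/").split("/")[-1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return _normalize(name) == _normalize(repo)


print(url_matches_name("fastapi", "https://github.com/tiangolo/fastapi"))
print(url_matches_name("nextjs", "https://github.com/tirth8205/code-review-graph"))
```

The first call accepts the fastapi config, while the second rejects the nextjs config shown above.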

Fix

Commit f6b14e1 addresses this by replacing the problematic test commits with commits that:

- exist in the corresponding cloned repositories
- have their parent commit available locally
- include larger diffs suitable for token efficiency evaluation
- generally cover 10+ changed files and 1000+ total changed lines

This makes the token efficiency benchmark more representative and reproducible.
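The size criteria can be checked mechanically from `git diff --numstat <sha>~1 <sha>` output, which prints one `added<TAB>deleted<TAB>path` line per changed file. The sketch below is illustrative only: `diff_stats` and `is_review_scale` are hypothetical names, and the thresholds are the approximate 10 files and 1000 lines stated above, not values read from the eval code.

```python
def diff_stats(numstat: str) -> tuple[int, int, int]:
    """Parse `git diff --numstat` output into (files, added, deleted)."""
    files = added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        files += 1
        # Binary files are reported as "-\t-\tpath": count the file, not lines.
        if parts[0] != "-":
            added += int(parts[0])
        if parts[1] != "-":
            deleted += int(parts[1])
    return files, added, deleted


def is_review_scale(numstat: str, min_files: int = 10, min_lines: int = 1000) -> bool:
    """True if the diff meets both the file-count and changed-line thresholds."""
    files, added, deleted = diff_stats(numstat)
    return files >= min_files and added + deleted >= min_lines
```

By this measure, a commit with 15 changed files and a `+822 -507` diff passes with 1329 changed lines, while a one-file `+1 -1` commit does not.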

As-is: current test commits (note how small changed_files and diff size are)

Config	SHA	changed_files	Diff size
express.yaml	925a1dff1e42f1b393c977b8b77757fcf633e09f	1	+1 -1
express.yaml	b4ab7d65d7724d9309b6faaaf82ad492da2a6d35	1	+69 -0
fastapi.yaml	fa3588c38c7473aca7536b12d686102de4b0f407	1	not found locally
fastapi.yaml	0227991a01e61bf5cdd93cc00e9e243f52b47a4a	1	not found locally
flask.yaml	fbb6f0bc4c60a0bada0e03c3480d0ccf30a3c1df	10	+194 -80
flask.yaml	a29f88ce6f2f9843bd6fcbbfce1390a2071965d6	4	+55 -8
gin.yaml	052d1a79aafe3f04078a2716f8e77d4340308383	5	+76 -0
gin.yaml	472d086af2acd924cb4b9d7be0525f7d790f69bc	2	+159 -1
gin.yaml	5c00df8afadd06cc5be530dde00fe6d9fa4a2e4a	2	+38 -1
httpx.yaml	ae1b9f66238f75ced3ced5e4485408435de10768	3	+6 -1
httpx.yaml	b55d4635701d9dc22928ee647880c76b078ba3f2	4	+9 -9
nextjs.yaml	`528801f` (repo url in yaml was pointing to code-review-graph, not nextjs)	3	not found locally
nextjs.yaml	`84bde35` (repo url in yaml was pointing to code-review-graph, not nextjs)	2	not found locally

To-be: fixed test commits (10+ changed files, 1000+ total changed lines)

Config	SHA	changed_files	Diff size (+N -M)
express.yaml	f41d09a3cf0592b65a1359495b65d3d7cf949c50	15	+822 -507
express.yaml	cec5780db4f07a61e21e139e38af20b02dd5ae3a	11	+29 -1039
fastapi.yaml	22381558446c5d1ac376680a6581dd63b3a04119	23	+1681 -37
fastapi.yaml	749cefdeb1428ba5c3911b03c4a72993f7eb3747	21	+1168 -71
flask.yaml	c2705ffd9ce1dc8476cb29eaf5ff5d4c719852d9	36	+779 -1007
flask.yaml	0ec7f713d679ceed2c605e62ac5d38d579f29fa0	10	+1622 -1353
gin.yaml	0a192fb0fa0127eac08cf24c624b92048ed823f6	26	+1477 -86
gin.yaml	ac0ad2fed865d40a0adc1ac3ccaadc3acff5db4b	14	+775 -615
gin.yaml	0feaf8cbd80da13be634b13fd28bfb2d6e357839	64	+25 -2393
httpx.yaml	8e36f2bc685dfbe43cd7503bc1c422a6ed6e05a5	29	+533 -947
httpx.yaml	ee37a762ef6378ed16681a3452f494a5640d98de	18	+1215 -370
nextjs.yaml	d81d5ab7dfbd003bd6b26390b75ee93d43729020	34	+2989 -407
nextjs.yaml	d86e19772824281969a6a619e7d91be43663f91d	266	+2789 -572

hy2850 added 2 commits May 11, 2026 12:35

eval: use larger benchmark test commits

f6b14e1

eval: point nextjs config to upstream repo

9c8c016

hy2850 changed the title ~~fix(eval): Use larger, reproducible test commits for more reliable token efficiency evaluation~~ fix(eval): Use larger, reproducible test commits to fix unreliable token efficiency evaluation May 11, 2026

hy2850 wants to merge 2 commits into tirth8205:main from hy2850:feature/eval-configs-commit-updated
