chore(evals): increase nightly eval runs to 10 for precision baseline#23725
Draft
alisa-alisa wants to merge 2 commits into main
Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this. We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines. Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed. Thank you for your understanding and for being a part of our community!
Size Change: -4 B (0%) Total Size: 26.3 MB
Summary
Establishing a High-Fidelity Stability Baseline (Nightly Evals 3 → 10)
This PR increases the nightly evaluation run count from 3 to 10 attempts per model. This is a foundational shift designed to eliminate the "noisy floor" that currently bottlenecks our ability to detect regressions introduced by PRs in evals.
Details
The Problem: The "Toil of 33%"
Our current 3-run baseline is too coarse: a single random failure in a nightly run moves the stability score by 33.3 percentage points. This high variance makes it impossible for automated checks to distinguish between expected LLM non-determinism and genuine code regressions.
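As a rough worked illustration of the resolution argument: with n runs, one failure shifts the observed pass rate by 1/n. The second line assumes, purely for illustration, that runs are independent with a true pass rate p = 0.9; that figure is not taken from this PR.

```latex
\Delta_n = \frac{1}{n}
  \quad\Rightarrow\quad
  \Delta_3 = \tfrac{1}{3} \approx 33.3\ \text{points},
  \qquad
  \Delta_{10} = \tfrac{1}{10} = 10\ \text{points}

% Standard error of the observed pass rate, assuming independent runs and p = 0.9 (illustrative only)
\mathrm{SE}_n = \sqrt{\frac{p(1-p)}{n}}
  \quad\Rightarrow\quad
  \mathrm{SE}_3 \approx 0.173,
  \qquad
  \mathrm{SE}_{10} \approx 0.095
```

Under that assumption, going from 3 to 10 runs roughly halves the sampling noise on each nightly score.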
As a result, if we enable the automated check, engineers will be forced into "Investigation Toil":
main.
The Solution: Precision-Based "Smart Blocking"
By moving to 10 runs (10% resolution), we establish a high-fidelity stability baseline that enables Confidence-Based Blocking:
ALWAYS_PASSES faster.
Mitigating API Pressure:
To handle the increased volume of 60 total nightly runs (6 models x 10 attempts), this PR adds:
max-parallel: 10: Limits the number of concurrent jobs to prevent overwhelming the API backend.
The Outcome
This change trades marginal, asynchronous API costs for massive gains in developer productivity. We are "buying back" hours of engineering focus every week by ensuring that when a PR fails, it is for a real, actionable reason—not a statistical fluke.
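For reviewers, the matrix and concurrency settings described under "Mitigating API Pressure" could look roughly like the sketch below. This is a hypothetical excerpt, not the actual contents of the workflow: the job name, the model identifiers, and the step command are placeholders, and it assumes the models are a second matrix dimension. Only `run_attempt` and `max-parallel: 10` come from this PR.

```yaml
# Hypothetical excerpt: job name, model identifiers, and step command are placeholders.
jobs:
  evals:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 10  # at most 10 matrix jobs run at the same time
      matrix:
        # Placeholder names standing in for the 6 evaluated models
        model: [model-a, model-b, model-c, model-d, model-e, model-f]
        run_attempt: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
      - uses: actions/checkout@v4
      - name: Run evals (placeholder command)
        run: echo "model=${{ matrix.model }} attempt=${{ matrix.run_attempt }}"
```

With 6 models and 10 attempts the matrix expands to 60 jobs, and `max-parallel` caps how many of them hit the API concurrently.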
Related Issues
Related to #23169
How to Validate
Example run: https://github.com/google-gemini/gemini-cli/actions/runs/23516551306
In .github/workflows/evals-nightly.yml, the run_attempt matrix has been updated from [1, 2, 3] to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
Pre-Merge Checklist