chore(evals): increase nightly eval runs to 10 for precision baseline by alisa-alisa · Pull Request #23725 · google-gemini/gemini-cli

alisa-alisa · 2026-03-24T22:45:34Z

Summary

Establishing a High-Fidelity Stability Baseline (Nightly Evals 3 → 10)

This PR increases the nightly evaluation run count from 3 to 10 attempts per model. This is a foundational shift designed to eliminate the "noisy floor" that currently bottlenecks our ability to validate whether a PR has introduced regressions in evals.

Details

The Problem: The "Toil of 33%"

Our current 3-run baseline is mathematically too coarse. A single random failure in a nightly run moves the stability score by 33.3 percentage points. This high variance makes it impossible for automated checks to distinguish between expected LLM non-determinism and genuine code regressions.
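The granularity argument can be checked with a little arithmetic. The sketch below is illustrative only: it assumes each run is an independent pass/fail trial with a fixed true pass rate (a simplification, not something this PR measures), and the function names are hypothetical.

```python
from math import comb


def score_step(runs: int) -> float:
    """Change in the reported pass rate caused by a single failed run."""
    return 1.0 / runs


def prob_score_at_most(true_pass_rate: float, runs: int, threshold: float) -> float:
    """Probability that the observed pass rate is <= threshold.

    Assumes independent runs, so the number of passes is binomially distributed.
    """
    total = 0.0
    for passes in range(runs + 1):
        if passes / runs <= threshold:
            total += (
                comb(runs, passes)
                * true_pass_rate**passes
                * (1 - true_pass_rate) ** (runs - passes)
            )
    return total


for runs in (3, 10):
    step = score_step(runs)
    noisy = prob_score_at_most(0.9, runs, 0.7)
    print(f"runs={runs}: one failure moves the score by {step:.1%}; "
          f"a 90%-stable test reports <=70% with probability {noisy:.1%}")
```

Under these assumptions, a test that truly passes 90% of the time reports 70% or lower on about 27% of 3-run nights, versus about 7% of 10-run nights.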

As a result, if we enable the automated check, engineers will be forced into "Investigation Toil":

Chasing "Ghost" Regressions: Spending hours debugging PR failures that turn out to be pre-existing flaky tests.
Productivity Drain: Repeatedly re-running CI to "get a green light" breaks focus and delays critical merges.
Trust Decay: When CI "cries wolf" too often, we risk ignoring real red lights, allowing actual regressions to slip into main.

The Solution: Precision-Based "Smart Blocking"

By moving to 10 runs (10% resolution), we establish a high-fidelity stability baseline that enables Confidence-Based Blocking:

Identify Stable vs. Noisy: We can mathematically differentiate between tests that are 90%+ stable and those that are naturally 60% stable.
Automated Triage: We can now automate PR checks to ignore noise in known flaky tests while strictly blocking on significant drops in stable behaviors.
Faster Hardening: We gather roughly as much stability data in 2 nights (20 samples) as we previously did in a full week (21 samples), allowing us to harden the suite and promote tests to ALWAYS_PASSES faster.
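One way the "stable vs. noisy" distinction could feed a blocking decision is sketched below. This is a hypothetical rule, not the check this PR or the repository implements: it puts a Wilson score interval around the nightly baseline and blocks only when the PR's pass rate falls below the interval's lower bound. The names `wilson_interval` and `should_block` and the 95% confidence level are assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half_width, centre + half_width


def should_block(baseline_passes: int, baseline_runs: int,
                 pr_passes: int, pr_runs: int) -> bool:
    """Hypothetical rule: block only if the PR pass rate is below the
    lower bound of the baseline's interval."""
    lower, _ = wilson_interval(baseline_passes, baseline_runs)
    return pr_passes / pr_runs < lower


# Two nights of baseline data (20 samples) for a stable and a flaky test,
# each compared against a PR run that passed 2 of 3 attempts.
print(should_block(20, 20, 2, 3))  # stable baseline
print(should_block(12, 20, 2, 3))  # known-flaky baseline
```

With more baseline samples the interval narrows, which is what lets a rule like this tolerate noise in flaky tests while still reacting to drops in stable ones.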

Mitigating API Pressure:
To handle the increased volume of 60 total nightly runs (6 models x 10 attempts), this PR adds:

max-parallel: 10: Limits the number of concurrent jobs to prevent overwhelming the API backend.
Random Jitter: Introduces a 0-60 second sleep at the start of each job to prevent a "thundering herd" of simultaneous requests.
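The effect of the two mitigations can be illustrated with a small simulation. It is a simplified model, not the workflow itself: it assumes that without jitter every job in a wave of `MAX_PARALLEL` jobs begins hitting the API at the same instant, and it only looks at the first wave. The constants mirror the numbers above; the function names are hypothetical.

```python
import random
from collections import Counter

MODELS = 6
ATTEMPTS = 10
MAX_PARALLEL = 10          # mirrors max-parallel: 10
MAX_JITTER_SECONDS = 60    # mirrors the 0-60 second sleep


def first_wave_start_times(jitter: bool, rng: random.Random) -> list[float]:
    """Offsets (seconds) at which the first wave of jobs starts calling the API."""
    return [
        rng.uniform(0, MAX_JITTER_SECONDS) if jitter else 0.0
        for _ in range(MAX_PARALLEL)
    ]


def peak_starts_per_second(start_times: list[float]) -> int:
    """Largest number of jobs that start within the same one-second bucket."""
    buckets = Counter(int(t) for t in start_times)
    return max(buckets.values())


rng = random.Random(0)  # seeded so the sketch is repeatable
total_jobs = MODELS * ATTEMPTS
peak_without_jitter = peak_starts_per_second(first_wave_start_times(False, rng))
peak_with_jitter = peak_starts_per_second(first_wave_start_times(True, rng))

print(f"total nightly jobs: {total_jobs}")
print(f"peak simultaneous starts without jitter: {peak_without_jitter}")
print(f"peak simultaneous starts with jitter: {peak_with_jitter}")
```

In this model `max-parallel` caps a burst at 10 jobs instead of 60, and the jitter spreads those 10 starts across up to a minute.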

The Outcome

This change trades marginal, asynchronous API costs for massive gains in developer productivity. We are "buying back" hours of engineering focus every week by ensuring that when a PR fails, it is for a real, actionable reason—not a statistical fluke.

Related Issues

Related to #23169

How to Validate

Example run: https://github.com/google-gemini/gemini-cli/actions/runs/23516551306

Inspect .github/workflows/evals-nightly.yml.
Verify the run_attempt matrix has been updated from [1, 2, 3] to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
(Optional) Trigger the workflow manually to verify it spawns 10 runs per model, with at most 10 jobs running concurrently.

Pre-Merge Checklist

gemini-cli · 2026-03-24T22:45:47Z

Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

github-actions · 2026-03-24T22:50:04Z

Size Change: -4 B (0%)

Total Size: 26.3 MB

Filename	Size	Change
`./bundle/chunk-4LFPWAO4.js`	0 B	-3.4 kB (removed)	🏆
`./bundle/chunk-A5UY3AVS.js`	0 B	-14.6 MB (removed)	🏆
`./bundle/chunk-DVRBE5UN.js`	0 B	-3.64 MB (removed)	🏆
`./bundle/core-JYCZM4DA.js`	0 B	-43.4 kB (removed)	🏆
`./bundle/devtoolsService-2GJK6OKR.js`	0 B	-27.7 kB (removed)	🏆
`./bundle/gemini-DFSUL7EB.js`	0 B	-521 kB (removed)	🏆
`./bundle/interactiveCli-AFUYI74X.js`	0 B	-1.62 MB (removed)	🏆
`./bundle/oauth2-provider-JEMLLVVV.js`	0 B	-9.16 kB (removed)	🏆
`./bundle/chunk-HP4YP3TU.js`	14.6 MB	+14.6 MB (new file)	🆕
`./bundle/chunk-LM5L3U7F.js`	3.4 kB	+3.4 kB (new file)	🆕
`./bundle/chunk-NEKJ2TYT.js`	3.64 MB	+3.64 MB (new file)	🆕
`./bundle/core-6H4YNIZB.js`	43.4 kB	+43.4 kB (new file)	🆕
`./bundle/devtoolsService-WC2TOQ52.js`	27.7 kB	+27.7 kB (new file)	🆕
`./bundle/gemini-H34U7Q5Z.js`	521 kB	+521 kB (new file)	🆕
`./bundle/interactiveCli-XEITK4P6.js`	1.62 MB	+1.62 MB (new file)	🆕
`./bundle/oauth2-provider-KC54MEIC.js`	9.16 kB	+9.16 kB (new file)	🆕

ℹ️ View Unchanged

Filename	Size	Change
`./bundle/chunk-34MYV7JD.js`	2.45 kB	0 B
`./bundle/chunk-5AUYMPVF.js`	858 B	0 B
`./bundle/chunk-664ZODQF.js`	124 kB	0 B
`./bundle/chunk-DAHVX5MI.js`	206 kB	0 B
`./bundle/chunk-IUUIT4SU.js`	56.5 kB	0 B
`./bundle/chunk-PVQN7ZVP.js`	1.96 MB	0 B
`./bundle/chunk-RJTRUG2J.js`	39.8 kB	0 B
`./bundle/cleanup-QAKRYUSS.js`	0 B	-856 B (removed)	🏆
`./bundle/devtools-36NN55EP.js`	696 kB	0 B
`./bundle/dist-T73EYRDX.js`	356 B	0 B
`./bundle/gemini.js`	2.06 kB	0 B
`./bundle/getMachineId-bsd-TXG52NKR.js`	1.55 kB	0 B
`./bundle/getMachineId-darwin-7OE4DDZ6.js`	1.55 kB	0 B
`./bundle/getMachineId-linux-SHIFKOOX.js`	1.34 kB	0 B
`./bundle/getMachineId-unsupported-5U5DOEYY.js`	1.06 kB	0 B
`./bundle/getMachineId-win-6KLLGOI4.js`	1.72 kB	0 B
`./bundle/memoryDiscovery-WPGC7DAZ.js`	922 B	0 B
`./bundle/multipart-parser-KPBZEGQU.js`	11.7 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js`	221 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js`	227 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js`	11.5 kB	0 B
`./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js`	132 B	0 B
`./bundle/sandbox-macos-permissive-open.sb`	890 B	0 B
`./bundle/sandbox-macos-permissive-proxied.sb`	1.31 kB	0 B
`./bundle/sandbox-macos-restrictive-open.sb`	3.36 kB	0 B
`./bundle/sandbox-macos-restrictive-proxied.sb`	3.56 kB	0 B
`./bundle/sandbox-macos-strict-open.sb`	4.82 kB	0 B
`./bundle/sandbox-macos-strict-proxied.sb`	5.02 kB	0 B
`./bundle/src-QVCVGIUX.js`	47 kB	0 B
`./bundle/tree-sitter-7U6MW5PS.js`	274 kB	0 B
`./bundle/tree-sitter-bash-34ZGLXVX.js`	1.84 MB	0 B
`./bundle/cleanup-HYIC53OA.js`	856 B	+856 B (new file)	🆕

_{compressed-size-action}

chore(evals): increase nightly runs to 10 for precision baseline

9bc176b

gemini-cli bot added area/platform Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. labels Mar 24, 2026

chore(evals): add jitter and max-parallel to mitigate 500s

8ebc086

alisa-alisa wants to merge 2 commits into main from alisa/increase_nightly

1 participant
