
chore(evals): increase nightly eval runs to 10 for precision baseline #23725

Draft
alisa-alisa wants to merge 2 commits into main from alisa/increase_nightly

@alisa-alisa
Contributor

@alisa-alisa alisa-alisa commented Mar 24, 2026

Summary

Establishing a High-Fidelity Stability Baseline (Nightly Evals 3 → 10)

This PR increases the nightly evaluation run count from 3 to 10 attempts per model. This is a foundational shift designed to eliminate the "noisy floor" that currently bottlenecks our ability to detect regressions that PRs introduce in evals.

Details

The Problem: The "Toil of 33%"

Our current 3-run baseline is too coarse. A single random failure in a nightly run moves the stability score by 33.3 percentage points. This coarse resolution makes it impossible for automated checks to distinguish between expected LLM non-determinism and genuine code regressions.
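The arithmetic behind the resolution claim can be sketched as follows. This is a minimal illustration, not code from this PR: the function names are hypothetical, and the standard-error formula assumes each run is an independent pass/fail trial with a fixed underlying pass rate.

```python
from math import sqrt


def resolution(runs: int) -> float:
    """Smallest possible change in pass rate: one run flipping its outcome."""
    return 1.0 / runs


def standard_error(pass_rate: float, runs: int) -> float:
    """Binomial standard error of an observed pass rate (assumes independent runs)."""
    return sqrt(pass_rate * (1.0 - pass_rate) / runs)


if __name__ == "__main__":
    for runs in (3, 10):
        print(
            f"{runs} runs: resolution={resolution(runs):.1%}, "
            f"standard error at a 90% true pass rate={standard_error(0.9, runs):.3f}"
        )
```

Under these assumptions, moving from 3 to 10 runs shrinks both the step size of the score and the sampling noise around it.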

As a result, if we enable the automated check, engineers will be forced into "Investigation Toil":

  1. Chasing "Ghost" Regressions: Spending hours debugging PR failures that turn out to be pre-existing flaky tests.
  2. Productivity Drain: Repeatedly re-running CI to "get a green light" breaks focus and delays critical merges.
  3. Trust Decay: When CI "cries wolf" too often, we risk ignoring real red lights, allowing actual regressions to slip into main.

The Solution: Precision-Based "Smart Blocking"

By moving to 10 runs (10% resolution), we establish a high-fidelity stability baseline that enables Confidence-Based Blocking:

  • Identify Stable vs. Noisy: We can mathematically differentiate between tests that are 90%+ stable and those that are naturally 60% stable.
  • Automated Triage: We can now automate PR checks to ignore noise in known flaky tests while strictly blocking on significant drops in stable behaviors.
  • Faster Hardening: We gather nearly as much stability data in 2 nights (20 samples) as we previously did in a full week (21 samples), allowing us to harden the suite and promote tests to ALWAYS_PASSES faster.
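One way the confidence-based blocking described above could work is a one-sided binomial check: block only when the PR's pass count would be very unlikely under the nightly baseline rate. The sketch below is an assumption about how such a check might look, not the check this PR implements; `should_block`, the 3-run PR sample, and the 0.05 threshold are all hypothetical.

```python
from math import comb


def prob_at_most(passes: int, runs: int, pass_rate: float) -> float:
    """P(X <= passes) for X ~ Binomial(runs, pass_rate)."""
    return sum(
        comb(runs, k) * pass_rate**k * (1.0 - pass_rate) ** (runs - k)
        for k in range(passes + 1)
    )


def should_block(
    baseline_passes: int,
    baseline_runs: int,
    pr_passes: int,
    pr_runs: int,
    alpha: float = 0.05,  # hypothetical significance threshold
) -> bool:
    """Block when the PR result is improbable given the nightly baseline rate."""
    baseline_rate = baseline_passes / baseline_runs
    return prob_at_most(pr_passes, pr_runs, baseline_rate) < alpha


if __name__ == "__main__":
    # A test that is 90% stable at night and fails all 3 PR runs.
    print(should_block(9, 10, 0, 3))
    # A naturally noisy test (60% stable) that passes 1 of 3 PR runs.
    print(should_block(6, 10, 1, 3))
```

This uses the baseline point estimate directly, so a test with a perfect 10/10 baseline blocks on any PR failure; a production check would likely want a more forgiving estimate of the baseline rate.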

Mitigating API Pressure:
To handle the increased volume of 60 total nightly runs (6 models x 10 attempts), this PR adds:

  • max-parallel: 10: Limits the number of concurrent jobs to prevent overwhelming the API backend.
  • Random Jitter: Introduces a 0-60 second sleep at the start of each job to prevent a "thundering herd" of simultaneous requests.
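The effect of the two mitigations can be sketched in Python. This is an illustration only: the workflow itself applies them in .github/workflows/evals-nightly.yml, and every function name here is hypothetical. The batch count is a lower bound, since a runner would normally start a new job whenever a slot frees up rather than in strict batches.

```python
import math
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def jitter_seconds(max_jitter: float = 60.0, rng: Optional[random.Random] = None) -> float:
    """Pick a uniformly random delay between 0 and max_jitter seconds."""
    rng = rng if rng is not None else random.Random()
    return rng.uniform(0.0, max_jitter)


def min_batches(total_jobs: int, max_parallel: int) -> int:
    """Lower bound on sequential batches when at most max_parallel jobs run at once."""
    return math.ceil(total_jobs / max_parallel)


def start_with_jitter(
    job: Callable[[], T],
    max_jitter: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Sleep for a random delay, then run the job, so starts are spread out."""
    sleep(jitter_seconds(max_jitter))
    return job()


if __name__ == "__main__":
    # 6 models x 10 attempts, with at most 10 jobs running concurrently.
    print(min_batches(60, 10))
    print(f"example delay: {jitter_seconds():.1f}s")
```

The `sleep` parameter is injectable so the delay can be stubbed out when experimenting with the sketch.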

The Outcome

This change trades marginal, asynchronous API costs for significant gains in developer productivity. We are "buying back" hours of engineering focus every week by ensuring that when a PR fails, it is for a real, actionable reason, not a statistical fluke.

Related Issues

Related to #23169

How to Validate

Example run: https://github.com/google-gemini/gemini-cli/actions/runs/23516551306

  1. Inspect .github/workflows/evals-nightly.yml.
  2. Verify the run_attempt matrix has been updated from [1, 2, 3] to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].
  3. (Optional) Trigger the workflow manually to verify it spawns 10 runs per model (60 jobs in total, subject to the max-parallel: 10 limit).

Pre-Merge Checklist

  • Updated relevant documentation and README (if needed)
  • Added/updated tests (updated evaluation configuration)
  • Noted breaking changes (if any)
  • Validated on required platforms/methods:
    • MacOS
      • npm run
      • npx
      • Docker
      • Podman
      • Seatbelt

@gemini-cli
Contributor

gemini-cli bot commented Mar 24, 2026

Hi @alisa-alisa, thank you so much for your contribution to Gemini CLI! We really appreciate the time and effort you've put into this.

We're making some updates to our contribution process to improve how we track and review changes. Please take a moment to review our recent discussion post: Improving Our Contribution Process & Introducing New Guidelines.

Key Update: Starting January 26, 2026, the Gemini CLI project will require all pull requests to be associated with an existing issue. Any pull requests not linked to an issue by that date will be automatically closed.

Thank you for your understanding and for being a part of our community!

@github-actions

github-actions bot commented Mar 24, 2026

Size Change: -4 B (0%)

Total Size: 26.3 MB

Filename Size Change
./bundle/chunk-4LFPWAO4.js 0 B -3.4 kB (removed) 🏆
./bundle/chunk-A5UY3AVS.js 0 B -14.6 MB (removed) 🏆
./bundle/chunk-DVRBE5UN.js 0 B -3.64 MB (removed) 🏆
./bundle/core-JYCZM4DA.js 0 B -43.4 kB (removed) 🏆
./bundle/devtoolsService-2GJK6OKR.js 0 B -27.7 kB (removed) 🏆
./bundle/gemini-DFSUL7EB.js 0 B -521 kB (removed) 🏆
./bundle/interactiveCli-AFUYI74X.js 0 B -1.62 MB (removed) 🏆
./bundle/oauth2-provider-JEMLLVVV.js 0 B -9.16 kB (removed) 🏆
./bundle/chunk-HP4YP3TU.js 14.6 MB +14.6 MB (new file) 🆕
./bundle/chunk-LM5L3U7F.js 3.4 kB +3.4 kB (new file) 🆕
./bundle/chunk-NEKJ2TYT.js 3.64 MB +3.64 MB (new file) 🆕
./bundle/core-6H4YNIZB.js 43.4 kB +43.4 kB (new file) 🆕
./bundle/devtoolsService-WC2TOQ52.js 27.7 kB +27.7 kB (new file) 🆕
./bundle/gemini-H34U7Q5Z.js 521 kB +521 kB (new file) 🆕
./bundle/interactiveCli-XEITK4P6.js 1.62 MB +1.62 MB (new file) 🆕
./bundle/oauth2-provider-KC54MEIC.js 9.16 kB +9.16 kB (new file) 🆕
ℹ️ View Unchanged
Filename Size Change
./bundle/chunk-34MYV7JD.js 2.45 kB 0 B
./bundle/chunk-5AUYMPVF.js 858 B 0 B
./bundle/chunk-664ZODQF.js 124 kB 0 B
./bundle/chunk-DAHVX5MI.js 206 kB 0 B
./bundle/chunk-IUUIT4SU.js 56.5 kB 0 B
./bundle/chunk-PVQN7ZVP.js 1.96 MB 0 B
./bundle/chunk-RJTRUG2J.js 39.8 kB 0 B
./bundle/cleanup-QAKRYUSS.js 0 B -856 B (removed) 🏆
./bundle/devtools-36NN55EP.js 696 kB 0 B
./bundle/dist-T73EYRDX.js 356 B 0 B
./bundle/gemini.js 2.06 kB 0 B
./bundle/getMachineId-bsd-TXG52NKR.js 1.55 kB 0 B
./bundle/getMachineId-darwin-7OE4DDZ6.js 1.55 kB 0 B
./bundle/getMachineId-linux-SHIFKOOX.js 1.34 kB 0 B
./bundle/getMachineId-unsupported-5U5DOEYY.js 1.06 kB 0 B
./bundle/getMachineId-win-6KLLGOI4.js 1.72 kB 0 B
./bundle/memoryDiscovery-WPGC7DAZ.js 922 B 0 B
./bundle/multipart-parser-KPBZEGQU.js 11.7 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js 221 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js 227 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js 11.5 kB 0 B
./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js 132 B 0 B
./bundle/sandbox-macos-permissive-open.sb 890 B 0 B
./bundle/sandbox-macos-permissive-proxied.sb 1.31 kB 0 B
./bundle/sandbox-macos-restrictive-open.sb 3.36 kB 0 B
./bundle/sandbox-macos-restrictive-proxied.sb 3.56 kB 0 B
./bundle/sandbox-macos-strict-open.sb 4.82 kB 0 B
./bundle/sandbox-macos-strict-proxied.sb 5.02 kB 0 B
./bundle/src-QVCVGIUX.js 47 kB 0 B
./bundle/tree-sitter-7U6MW5PS.js 274 kB 0 B
./bundle/tree-sitter-bash-34ZGLXVX.js 1.84 MB 0 B
./bundle/cleanup-HYIC53OA.js 856 B +856 B (new file) 🆕

compressed-size-action

@gemini-cli gemini-cli bot added area/platform Issues related to Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt 🔒 maintainer only ⛔ Do not contribute. Internal roadmap item. labels Mar 24, 2026