5 changes: 5 additions & 0 deletions .github/commands/gemini-issue-fixer.toml
@@ -25,6 +25,11 @@ prompt = """
<step id="1" name="Understand Project Standards">
The initial context provided to you includes a file tree. If you see a `GEMINI.md` or `CONTRIBUTING.md` file, use the GitHub MCP `get_file_contents` tool to read it first. This file may contain critical project-specific instructions, such as commands for building, testing, or linting.
</step>
<step id="1.5" name="Validate Issue">
Critically evaluate the issue title and body.
- If the issue is too vague to understand or reproduce (e.g., "it's broken"), DO NOT attempt to fix it. Instead, skip to the final step and post a comment asking for specific details, logs, or reproduction steps.
- If the issue is clearly out of scope or impossible (e.g., "support IE6" for a modern app), DO NOT attempt to fix it. Post a comment explicitly stating that this request is out of scope or citing the technical limitation.
</step>
<step id="2" name="Acknowledge and Plan">
1. Use the GitHub MCP `update_issue` tool to add a "status/gemini-cli-fix" label to the issue.
2. Use the `gh issue comment` CLI tool command to post an initial comment. In this comment, you must:
5 changes: 5 additions & 0 deletions .github/commands/gemini-triage.toml
@@ -8,6 +8,11 @@ You are an issue triage assistant. Analyze the current GitHub issue and identify

- Only use labels that are from the list of available labels.
- You can choose multiple labels to apply.
- **Strictness**: Apply a label if the issue content clearly matches the label's purpose.
- **Functional Failures**: If a user reports that something is "broken", "not working", "crashing", or "stopped working", you should categorize it as a `bug`, even if they provide very few details.
- **Spam & Irrelevant Content**: Do not apply any labels to spam, advertisements, or content that is entirely irrelevant to the project.
- **Extreme Ambiguity**: If an issue is *completely* devoid of context (e.g., just says "Help", "Hi", or "asdf"), do not apply any labels.
- **Questions**: Use the `question` label only when the user is explicitly asking for information or instructions. Do not use it as a fallback for ambiguous issues.
- When generating shell commands, you **MUST NOT** use command substitution with `$(...)`, `<(...)`, or `>(...)`. This is a security measure to prevent unintended command execution.
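
The command-substitution rule above could also be checked mechanically before a generated command is executed. The following is a minimal sketch, not part of this change: `usesForbiddenSubstitution` is a hypothetical helper name, and the check is deliberately conservative (it also flags `$(` inside single-quoted strings and arithmetic expansion such as `$((1 + 2))`).

```typescript
// Hypothetical guard, not part of the repository: flags the three
// constructs the prompt forbids, namely $(...), <(...), and >(...).
const FORBIDDEN_SUBSTITUTION = /\$\(|<\(|>\(/;

function usesForbiddenSubstitution(command: string): boolean {
  // No `g` flag on the regex, so `test` keeps no state between calls.
  return FORBIDDEN_SUBSTITUTION.test(command);
}

// Usage: reject a generated command before running it.
const candidate = 'gh issue edit 1 --add-label bug';
if (usesForbiddenSubstitution(candidate)) {
  throw new Error(`Refusing to run command with substitution: ${candidate}`);
}
```

A plain pattern match like this cannot tell quoted text from live shell syntax, so it may reject harmless commands; for a security check, that bias toward rejection is usually acceptable.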

## Input Data
59 changes: 59 additions & 0 deletions .github/workflows/evals-nightly.yml
@@ -0,0 +1,59 @@
name: 'Nightly Evaluations'

on:
  schedule:
    - cron: '0 1 * * *' # 1 AM UTC
  workflow_dispatch:
    inputs:
      iterations:
        description: 'Number of iterations per test case'
        required: true
        default: '1'

jobs:
  evaluate:
    runs-on: 'ubuntu-latest'
    permissions:
      contents: 'read'
    strategy:
      matrix:
        model:
          [
            'gemini-3-pro-preview',
            'gemini-3-flash-preview',
            'gemini-2.5-pro',
            'gemini-2.5-flash',
            'gemini-2.5-flash-lite',
          ]
    name: 'Evaluate ${{ matrix.model }}'

    steps:
      - name: 'Checkout code'
        uses: 'actions/checkout@v4' # ratchet:exclude

      - name: 'Set up Node.js'
        uses: 'actions/setup-node@v4' # ratchet:exclude
        with:
          node-version: '20'
          cache: 'npm'

      - name: 'Install dependencies'
        run: |
          npm ci

      - name: 'Run Evaluations'
        env:
          GEMINI_API_KEY: '${{ secrets.GEMINI_API_KEY }}'
          GEMINI_MODEL: '${{ matrix.model }}'
        run: |
          npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json

      - name: 'Upload Results'
        uses: 'actions/upload-artifact@v4' # ratchet:exclude
        with:
          name: 'eval-results-${{ matrix.model }}'
          path: 'eval-results-${{ matrix.model }}.json'

      - name: 'Job Summary'
        run: |
          npx tsx scripts/aggregate_evals.ts "eval-results-${{ matrix.model }}.json" >> "$GITHUB_STEP_SUMMARY"
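
The contents of `scripts/aggregate_evals.ts` are not shown in this diff. As a rough sketch of what the "Job Summary" step might produce, the function below turns a Vitest JSON report into a Markdown table. It assumes the report exposes the Jest-compatible counters `numTotalTests`, `numPassedTests`, and `numFailedTests`; `EvalReport` and `summarizeReport` are hypothetical names, not the script's real API.

```typescript
// Assumed subset of the Vitest JSON reporter output (Jest-compatible fields).
interface EvalReport {
  numTotalTests: number;
  numPassedTests: number;
  numFailedTests: number;
}

// Hypothetical helper: renders one model's results as a Markdown table
// suitable for appending to $GITHUB_STEP_SUMMARY.
function summarizeReport(model: string, report: EvalReport): string {
  const passRate =
    report.numTotalTests === 0
      ? 0
      : (report.numPassedTests / report.numTotalTests) * 100;
  return [
    `### ${model}`,
    '',
    '| Total | Passed | Failed | Pass rate |',
    '| --- | --- | --- | --- |',
    `| ${report.numTotalTests} | ${report.numPassedTests} | ${report.numFailedTests} | ${passRate.toFixed(1)}% |`,
  ].join('\n');
}
```

Because every matrix job writes its own summary, a per-model table like this keeps the results of the five models separate on the workflow run page.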
48 changes: 48 additions & 0 deletions evals/README.md
@@ -0,0 +1,48 @@
# Gemini CLI Workflow Evaluations

This directory contains resources for evaluating and improving the example workflows using a TypeScript + Vitest framework.

## Goals

1. **Systematic Testing:** Ensure changes to prompts or configurations improve quality.
2. **Regression Testing:** Catch degradations in performance.
3. **Benchmarking:** Compare different models (e.g., `gemini-2.5-pro` vs `gemini-2.5-flash`).

## Structure

- `evals/`:
- `test-rig.ts`: Utility to set up a temporary environment for the CLI.
- `issue-triage.eval.ts`: Benchmark for the Issue Triage workflow.
- `pr-review.eval.ts`: Benchmark for the PR Review workflow.
- `issue-fixer.eval.ts`: Benchmark for the autonomous Issue Fixer.
- `gemini-assistant.eval.ts`: Benchmark for the interactive Assistant.
- `gemini-scheduled-triage.eval.ts`: Benchmark for batch triage.
- `data/*.json`: Gold-standard datasets for each workflow.
- `vitest.config.ts`: Configuration for the evaluation runner.

## How to Run

### Prerequisites

- `npm install`
- `gemini-cli` installed and available in your PATH.
- `GEMINI_API_KEY` environment variable set.

### Run Locally

```bash
npm run test:evals
```

To run against a specific model:

```bash
GEMINI_MODEL=gemini-2.5-flash npm run test:evals
```

## Adding New Evals

1. Create a new file in `evals/` ending in `.eval.ts`.
2. Add corresponding test data in `evals/data/`.
3. Use the `TestRig` to set up files, environment variables, and run the CLI.
4. Assert the expected behavior (e.g., check `GITHUB_ENV` output or tool calls captured in telemetry).
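
For step 4, an eval that inspects `GITHUB_ENV` output needs to read the environment-file format GitHub Actions uses: `NAME=value` lines and multiline `NAME<<DELIMITER` blocks. The parser below is a minimal sketch under that assumption; `parseGithubEnv` is a hypothetical helper and is not claimed to be part of `TestRig`.

```typescript
// Hypothetical helper: parses the text of a GITHUB_ENV file into a map.
function parseGithubEnv(content: string): Record<string, string> {
  const vars: Record<string, string> = {};
  const lines = content.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    // Multiline form: NAME<<DELIMITER, value lines, then DELIMITER alone.
    const heredoc = /^([A-Za-z_][A-Za-z0-9_]*)<<(.+)$/.exec(line);
    if (heredoc) {
      const name = heredoc[1];
      const delimiter = heredoc[2];
      const valueLines: string[] = [];
      i++;
      while (i < lines.length && lines[i] !== delimiter) {
        valueLines.push(lines[i]);
        i++;
      }
      vars[name] = valueLines.join('\n');
      continue;
    }
    // Single-line form: NAME=value (the value may itself contain '=').
    const eq = line.indexOf('=');
    if (eq > 0) {
      vars[line.slice(0, eq)] = line.slice(eq + 1);
    }
  }
  return vars;
}
```

An eval could then assert on individual variables rather than on the raw file text, which keeps assertions stable when the workflow writes additional variables.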
36 changes: 36 additions & 0 deletions evals/data/gemini-assistant.json
@@ -0,0 +1,36 @@
[
{
"id": "fix-typo",
"inputs": {
"TITLE": "Fix typo in utils.js",
"DESCRIPTION": "There is a typo in the helper function name.",
"EVENT_NAME": "issues",
"IS_PULL_REQUEST": "false",
"ISSUE_NUMBER": "10",
"REPOSITORY": "owner/repo",
"ADDITIONAL_CONTEXT": "Please fix it."
},
"expected_actions": ["AI Assistant: Plan of Action"],
"expected_plan_keywords": ["search", "grep", "read", "replace", "utils.js"]
},
{
"id": "add-feature",
"inputs": {
"TITLE": "Add login page",
"DESCRIPTION": "We need a login page.",
"EVENT_NAME": "issues",
"IS_PULL_REQUEST": "false",
"ISSUE_NUMBER": "11",
"REPOSITORY": "owner/repo",
"ADDITIONAL_CONTEXT": "Make it pretty."
},
"expected_actions": ["AI Assistant: Plan of Action"],
"expected_plan_keywords": [
"create",
"component",
"structure",
"design",
"implement"
]
}
]
19 changes: 19 additions & 0 deletions evals/data/gemini-scheduled-triage.json
@@ -0,0 +1,19 @@
[
{
"id": "batch-1",
"inputs": {
"AVAILABLE_LABELS": "bug,enhancement,priority/p0",
"ISSUES_TO_TRIAGE": "[{\"number\": 1, \"title\": \"Crash on start\", \"body\": \"It crashes immediately.\"}, {\"number\": 2, \"title\": \"Add help button\", \"body\": \"Users need help.\"}]"
},
"expected": [
{
"issue_number": 1,
"labels_to_set": ["bug", "priority/p0"]
},
{
"issue_number": 2,
"labels_to_set": ["enhancement"]
}
]
}
]
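
One way an eval might score a batch like the one above is to compare each issue's labels as an unordered set. The sketch below assumes the model's output has the same shape as the `expected` array; `TriageResult`, `sameLabels`, and `mismatchedIssues` are hypothetical names, and the actual eval may score differently (for example, by partial credit).

```typescript
// Shape taken from the `expected` entries in the dataset above.
interface TriageResult {
  issue_number: number;
  labels_to_set: string[];
}

// Order-insensitive comparison of two label lists.
function sameLabels(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && Array.from(setA).every((l) => setB.has(l));
}

// Returns the issue numbers whose labels differ from the expected ones.
// An issue missing from `actual` is treated as having no labels.
function mismatchedIssues(
  expected: TriageResult[],
  actual: TriageResult[],
): number[] {
  const actualByIssue = new Map<number, string[]>();
  for (const result of actual) {
    actualByIssue.set(result.issue_number, result.labels_to_set);
  }
  return expected
    .filter((e) => !sameLabels(e.labels_to_set, actualByIssue.get(e.issue_number) ?? []))
    .map((e) => e.issue_number);
}
```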
165 changes: 165 additions & 0 deletions evals/data/issue-fixer.json
@@ -0,0 +1,165 @@
[
{
"id": "new-page-request",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "1",
"ISSUE_TITLE": "Add a new landing page",
"ISSUE_BODY": "We need a landing page for the new product launch."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": ["explore", "create", "file", "add", "content"]
},
{
"id": "bug-fix-request",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "2",
"ISSUE_TITLE": "Fix login crash",
"ISSUE_BODY": "The app crashes when the user clicks 'forgot password'."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"search",
"reproduce",
"investigate",
"fix",
"logic"
]
},
{
"id": "dependency-update",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "5",
"ISSUE_TITLE": "Update lodash to the latest version",
"ISSUE_BODY": "We need to update lodash to address a known security vulnerability in older versions."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"npm",
"install",
"update",
"package.json",
"verify"
]
},
{
"id": "impossible-request",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "10",
"ISSUE_TITLE": "Fix the bug",
"ISSUE_BODY": "It's broken. Fix it now."
},
"expected_actions": ["gh issue comment"],
"expected_plan_keywords": ["details", "information", "reproduce"]
},
{
"id": "out-of-scope",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "11",
"ISSUE_TITLE": "Support Internet Explorer 6",
"ISSUE_BODY": "Our users are still on IE6, please make this modern React app work on it."
},
"expected_actions": ["gh issue comment"],
"expected_plan_keywords": ["unsupported", "limitation", "scope"]
},
{
"id": "security-vulnerability",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "12",
"ISSUE_TITLE": "Fix potential SQL injection in user search",
"ISSUE_BODY": "The user search query is constructed using string concatenation."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"security",
"injection",
"parameterized",
"sanitize"
]
},
{
"id": "cross-file-refactor",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "20",
"ISSUE_TITLE": "Refactor validation logic into a separate utility",
"ISSUE_BODY": "The validation logic in `UserForm.tsx` and `OrderForm.tsx` is identical. Move it to `src/utils/validation.ts` and update both forms."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"refactor",
"move",
"utility",
"update",
"UserForm",
"OrderForm"
]
},
{
"id": "complex-state-fix",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "21",
"ISSUE_TITLE": "Fix race condition in multi-step wizard",
"ISSUE_BODY": "In the multi-step checkout, if a user clicks 'Next' twice very quickly, they skip a step and end up in an invalid state. We need to disable the button during transition."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"race condition",
"disable",
"button",
"transition",
"state"
]
},
{
"id": "fix-flaky-test",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "30",
"ISSUE_TITLE": "Flaky test: UserProfile should load data",
"ISSUE_BODY": "The test `UserProfile should load data` fails about 10% of the time on CI. It seems to be timing out waiting for the network."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": ["flaky", "wait", "timeout", "mock", "network"]
},
{
"id": "migrate-deprecated-api",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "31",
"ISSUE_TITLE": "Migrate usage of deprecated 'fs.exists'",
"ISSUE_BODY": "`fs.exists` is deprecated. We should replace all occurrences with `fs.stat` or `fs.access`."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"deprecated",
"replace",
"fs.exists",
"fs.stat",
"fs.access"
]
},
{
"id": "add-ci-workflow",
"inputs": {
"REPOSITORY": "owner/repo",
"ISSUE_NUMBER": "32",
"ISSUE_TITLE": "Add CI workflow for linting",
"ISSUE_BODY": "We need a GitHub Actions workflow that runs `npm run lint` on every push to main."
},
"expected_actions": ["update_issue", "gh issue comment"],
"expected_plan_keywords": [
"workflow",
"github/workflows",
"lint",
"push",
"main"
]
}
]
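
The `expected_plan_keywords` field suggests a keyword-based check on the plan the fixer posts. As an illustration only, the sketch below computes the fraction of expected keywords found in a plan, ignoring case. `IssueFixerCase` mirrors the dataset entries above, while `keywordCoverage` and any pass threshold are assumptions rather than the eval's actual scoring.

```typescript
// Shape of one entry in evals/data/issue-fixer.json, as shown above.
interface IssueFixerCase {
  id: string;
  inputs: Record<string, string>;
  expected_actions: string[];
  expected_plan_keywords: string[];
}

// Hypothetical scorer: fraction of keywords that appear in the plan text.
// Matching is a case-insensitive substring test, so "fix" also matches "fixed".
function keywordCoverage(plan: string, keywords: string[]): number {
  if (keywords.length === 0) return 1;
  const haystack = plan.toLowerCase();
  const hits = keywords.filter((k) => haystack.includes(k.toLowerCase()));
  return hits.length / keywords.length;
}

// Usage with the "bug-fix-request" case from the dataset.
const bugFixCase: IssueFixerCase = {
  id: 'bug-fix-request',
  inputs: { REPOSITORY: 'owner/repo', ISSUE_NUMBER: '2' },
  expected_actions: ['update_issue', 'gh issue comment'],
  expected_plan_keywords: ['search', 'reproduce', 'investigate', 'fix', 'logic'],
};
const coverage = keywordCoverage(
  'I will search the code, reproduce the crash, and fix the logic.',
  bugFixCase.expected_plan_keywords,
);
```

Substring matching is lenient by design; a stricter eval could require whole-word matches or a minimum coverage per case.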