google-github-actions · cocosheng-g · Feb 4, 2026 · Feb 5, 2026
@@ -25,6 +25,11 @@ prompt = """
             <step id="1" name="Understand Project Standards">
                 The initial context provided to you includes a file tree. If you see a `GEMINI.md` or `CONTRIBUTING.md` file, use the GitHub MCP `get_file_contents` tool to read it first. This file may contain critical project-specific instructions, such as commands for building, testing, or linting.
             </step>
+            <step id="1.5" name="Validate Issue">
+                Critically evaluate the issue title and body.
+                - If the issue is too vague to understand or reproduce (e.g., "it's broken"), DO NOT attempt to fix it. Instead, skip to the final step and post a comment asking for specific details, logs, or reproduction steps.
+                - If the issue is clearly out of scope or impossible (e.g., "support IE6" for a modern app), DO NOT attempt to fix it. Post a comment explicitly stating that this request is out of scope or citing the technical limitation.
+            </step>
             <step id="2" name="Acknowledge and Plan">
                 1. Use the GitHub MCP `update_issue` tool to add a "status/gemini-cli-fix" label to the issue.
                 2. Use the `gh issue comment` CLI tool command to post an initial comment. In this comment, you must:

@@ -8,6 +8,11 @@ You are an issue triage assistant. Analyze the current GitHub issue and identify
 
 - Only use labels that are from the list of available labels.
 - You can choose multiple labels to apply.
+- **Strictness**: Apply a label only if the issue content clearly matches the label's purpose.
+- **Functional Failures**: If a user reports that something is "broken", "not working", "crashing", or "stopped working", you should categorize it as a `bug`, even if they provide very few details.
+- **Spam & Irrelevant Content**: Do not apply any labels to spam, advertisements, or content that is entirely irrelevant to the project.
+- **Extreme Ambiguity**: If an issue is *completely* devoid of context (e.g., just says "Help", "Hi", or "asdf"), do not apply any labels.
+- **Questions**: Use the `question` label only when the user is explicitly asking for information or instructions. Do not use it as a fallback for ambiguous issues.
 - When generating shell commands, you **MUST NOT** use command substitution with `$(...)`, `<(...)`, or `>(...)`. This is a security measure to prevent unintended command execution.
 
 ## Input Data

@@ -0,0 +1,59 @@
+name: 'Nightly Evaluations'
+
+on:
+  schedule:
+    - cron: '0 1 * * *' # 1 AM UTC
+  workflow_dispatch:
+    inputs:
+      iterations:
+        description: 'Number of iterations per test case'
+        required: true
+        default: '1'
+
+jobs:
+  evaluate:
+    runs-on: 'ubuntu-latest'
+    permissions:
+      contents: 'read'
+    strategy:
+      matrix:
+        model:
+          [
+            'gemini-3-pro-preview',
+            'gemini-3-flash-preview',
+            'gemini-2.5-pro',
+            'gemini-2.5-flash',
+            'gemini-2.5-flash-lite',
+          ]
+    name: 'Evaluate ${{ matrix.model }}'
+
+    steps:
+      - name: 'Checkout code'
+        uses: 'actions/checkout@v4' # ratchet:exclude
+
+      - name: 'Set up Node.js'
+        uses: 'actions/setup-node@v4' # ratchet:exclude
+        with:
+          node-version: '20'
+          cache: 'npm'
+
+      - name: 'Install dependencies'
+        run: |
+          npm ci
+
+      - name: 'Run Evaluations'
+        env:
+          GEMINI_API_KEY: '${{ secrets.GEMINI_API_KEY }}'
+          GEMINI_MODEL: '${{ matrix.model }}'
+        run: |
+          npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json
+
+      - name: 'Upload Results'
+        uses: 'actions/upload-artifact@v4' # ratchet:exclude
+        with:
+          name: 'eval-results-${{ matrix.model }}'
+          path: 'eval-results-${{ matrix.model }}.json'
+
+      - name: 'Job Summary'
+        run: |
+          npx tsx scripts/aggregate_evals.ts "eval-results-${{ matrix.model }}.json" >> "$GITHUB_STEP_SUMMARY"
@@ -0,0 +1,48 @@
+# Gemini CLI Workflow Evaluations
+
+This directory contains resources for evaluating and improving the example workflows using a TypeScript + Vitest framework.
+
+## Goals
+
+1.  **Systematic Testing:** Ensure changes to prompts or configurations improve quality.
+2.  **Regression Testing:** Catch degradations in performance.
+3.  **Benchmarking:** Compare different models (e.g., `gemini-2.5-pro` vs `gemini-2.5-flash`).
+
+## Structure
+
+- `evals/`:
+  - `test-rig.ts`: Utility to set up a temporary environment for the CLI.
+  - `issue-triage.eval.ts`: Benchmark for the Issue Triage workflow.
+  - `pr-review.eval.ts`: Benchmark for the PR Review workflow.
+  - `issue-fixer.eval.ts`: Benchmark for the autonomous Issue Fixer.
+  - `gemini-assistant.eval.ts`: Benchmark for the interactive Assistant.
+  - `gemini-scheduled-triage.eval.ts`: Benchmark for batch triage.
+  - `data/*.jsonl`: Gold-standard datasets for each workflow.
+  - `vitest.config.ts`: Configuration for the evaluation runner.
+
+## How to Run
+
+### Prerequisites
+
+- Dependencies installed via `npm install`.
+- `gemini-cli` installed and available in your PATH.
+- `GEMINI_API_KEY` environment variable set.
+
+### Run Locally
+
+```bash
+npm run test:evals
+```
+
+To run against a specific model:
+
+```bash
+GEMINI_MODEL=gemini-2.5-flash npm run test:evals
+```
+
+## Adding New Evals
+
+1. Create a new file in `evals/` ending in `.eval.ts`.
+2. Add corresponding test data in `evals/data/`.
+3. Use the `TestRig` to set up files, environment variables, and run the CLI.
+4. Assert the expected behavior (e.g., check `GITHUB_ENV` output or tool calls captured in telemetry).
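
Step 4 above mentions checking `GITHUB_ENV` output. As a rough illustration, the sketch below parses the text of a `GITHUB_ENV` file so that an eval can assert on individual variables. It handles the two forms GitHub Actions documents (`NAME=value` and the multiline `NAME<<DELIMITER` form). The function name `parseGithubEnv` and the variable names in the usage note are hypothetical; nothing here is taken from the PR's `TestRig`.

```typescript
// Hypothetical helper: parse GITHUB_ENV file text into key/value pairs.
// Supports `NAME=value` lines and the multiline `NAME<<DELIMITER` form.
function parseGithubEnv(content: string): Record<string, string> {
  const env: Record<string, string> = {};
  const lines = content.split(/\r?\n/);
  let i = 0;
  while (i < lines.length) {
    const line = lines[i];
    i += 1;
    if (line.trim() === "") continue;

    // Multiline form: NAME<<DELIMITER, body lines, then DELIMITER on its own line.
    const heredoc = line.match(/^([^=\s]+)<<(.+)$/);
    if (heredoc) {
      const name = heredoc[1];
      const delimiter = heredoc[2];
      const body: string[] = [];
      while (i < lines.length && lines[i] !== delimiter) {
        body.push(lines[i]);
        i += 1;
      }
      i += 1; // skip the closing delimiter line
      env[name] = body.join("\n");
      continue;
    }

    // Single-line form: everything after the first `=` is the value.
    const eq = line.indexOf("=");
    if (eq > 0) {
      env[line.slice(0, eq)] = line.slice(eq + 1);
    }
  }
  return env;
}
```

An eval could then read the file the workflow wrote and assert on, say, `parseGithubEnv(text).SELECTED_LABELS`, where `SELECTED_LABELS` stands in for whatever variable the workflow under test actually sets.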
@@ -0,0 +1,36 @@
+[
+  {
+    "id": "fix-typo",
+    "inputs": {
+      "TITLE": "Fix typo in utils.js",
+      "DESCRIPTION": "There is a typo in the helper function name.",
+      "EVENT_NAME": "issues",
+      "IS_PULL_REQUEST": "false",
+      "ISSUE_NUMBER": "10",
+      "REPOSITORY": "owner/repo",
+      "ADDITIONAL_CONTEXT": "Please fix it."
+    },
+    "expected_actions": ["AI Assistant: Plan of Action"],
+    "expected_plan_keywords": ["search", "grep", "read", "replace", "utils.js"]
+  },
+  {
+    "id": "add-feature",
+    "inputs": {
+      "TITLE": "Add login page",
+      "DESCRIPTION": "We need a login page.",
+      "EVENT_NAME": "issues",
+      "IS_PULL_REQUEST": "false",
+      "ISSUE_NUMBER": "11",
+      "REPOSITORY": "owner/repo",
+      "ADDITIONAL_CONTEXT": "Make it pretty."
+    },
+    "expected_actions": ["AI Assistant: Plan of Action"],
+    "expected_plan_keywords": [
+      "create",
+      "component",
+      "structure",
+      "design",
+      "implement"
+    ]
+  }
+]
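
One way the `expected_plan_keywords` field in the dataset above could be consumed is a case-insensitive substring check against the plan text the assistant posts. The sketch below is an assumption about how such a check might look, not the PR's actual assertion logic; `keywordCoverage` is a hypothetical name, and whether the real evals require every keyword or only a fraction of them is not stated in the diff.

```typescript
// Field names mirror the dataset entries above.
interface AssistantCase {
  id: string;
  inputs: Record<string, string>;
  expected_actions: string[];
  expected_plan_keywords: string[];
}

interface Coverage {
  matched: string[];
  missing: string[];
  ratio: number; // matched / total, or 1 when no keywords are listed
}

// Hypothetical helper: report which keywords occur in the plan (case-insensitive).
function keywordCoverage(plan: string, keywords: string[]): Coverage {
  const haystack = plan.toLowerCase();
  const matched = keywords.filter((k) => haystack.includes(k.toLowerCase()));
  const missing = keywords.filter((k) => !haystack.includes(k.toLowerCase()));
  const ratio = keywords.length === 0 ? 1 : matched.length / keywords.length;
  return { matched, missing, ratio };
}
```

Returning the missing keywords alongside the ratio keeps failure messages readable, and leaves the pass threshold to the individual eval.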
@@ -0,0 +1,19 @@
+[
+  {
+    "id": "batch-1",
+    "inputs": {
+      "AVAILABLE_LABELS": "bug,enhancement,priority/p0",
+      "ISSUES_TO_TRIAGE": "[{\"number\": 1, \"title\": \"Crash on start\", \"body\": \"It crashes immediately.\"}, {\"number\": 2, \"title\": \"Add help button\", \"body\": \"Users need help.\"}]"
+    },
+    "expected": [
+      {
+        "issue_number": 1,
+        "labels_to_set": ["bug", "priority/p0"]
+      },
+      {
+        "issue_number": 2,
+        "labels_to_set": ["enhancement"]
+      }
+    ]
+  }
+]
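
For the batch dataset above, the model's output has to be compared with `expected` per issue, and label order should presumably not matter. A minimal sketch of such a comparison follows, assuming the workflow's output can be parsed into the same `issue_number` / `labels_to_set` shape; `compareTriage` is a hypothetical name rather than a function from the PR.

```typescript
// Shape mirrors the entries under "expected" in the dataset above.
interface TriageResult {
  issue_number: number;
  labels_to_set: string[];
}

// Hypothetical helper: return one message per issue whose labels differ from
// the expectation. Label order is ignored.
function compareTriage(expected: TriageResult[], actual: TriageResult[]): string[] {
  const problems: string[] = [];
  const actualByIssue = new Map<number, TriageResult>();
  for (const result of actual) {
    actualByIssue.set(result.issue_number, result);
  }

  for (const want of expected) {
    const got = actualByIssue.get(want.issue_number);
    if (!got) {
      problems.push(`issue ${want.issue_number}: no result returned`);
      continue;
    }
    const wantSet = new Set(want.labels_to_set);
    const gotSet = new Set(got.labels_to_set);
    const missing = want.labels_to_set.filter((label) => !gotSet.has(label));
    const extra = got.labels_to_set.filter((label) => !wantSet.has(label));
    if (missing.length > 0 || extra.length > 0) {
      problems.push(
        `issue ${want.issue_number}: missing [${missing.join(", ")}], unexpected [${extra.join(", ")}]`,
      );
    }
  }
  return problems;
}
```

An eval would then assert that the returned list is empty, which reports every mismatched issue in the batch at once instead of stopping at the first.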
@@ -0,0 +1,165 @@
+[
+  {
+    "id": "new-page-request",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "1",
+      "ISSUE_TITLE": "Add a new landing page",
+      "ISSUE_BODY": "We need a landing page for the new product launch."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": ["explore", "create", "file", "add", "content"]
+  },
+  {
+    "id": "bug-fix-request",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "2",
+      "ISSUE_TITLE": "Fix login crash",
+      "ISSUE_BODY": "The app crashes when the user clicks 'forgot password'."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "search",
+      "reproduce",
+      "investigate",
+      "fix",
+      "logic"
+    ]
+  },
+  {
+    "id": "dependency-update",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "5",
+      "ISSUE_TITLE": "Update lodash to the latest version",
+      "ISSUE_BODY": "We need to update lodash to address a known security vulnerability in older versions."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "npm",
+      "install",
+      "update",
+      "package.json",
+      "verify"
+    ]
+  },
+  {
+    "id": "impossible-request",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "10",
+      "ISSUE_TITLE": "Fix the bug",
+      "ISSUE_BODY": "It's broken. Fix it now."
+    },
+    "expected_actions": ["gh issue comment"],
+    "expected_plan_keywords": ["details", "information", "reproduce"]
+  },
+  {
+    "id": "out-of-scope",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "11",
+      "ISSUE_TITLE": "Support Internet Explorer 6",
+      "ISSUE_BODY": "Our users are still on IE6, please make this modern React app work on it."
+    },
+    "expected_actions": ["gh issue comment"],
+    "expected_plan_keywords": ["unsupported", "limitation", "scope"]
+  },
+  {
+    "id": "security-vulnerability",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "12",
+      "ISSUE_TITLE": "Fix potential SQL injection in user search",
+      "ISSUE_BODY": "The user search query is constructed using string concatenation."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "security",
+      "injection",
+      "parameterized",
+      "sanitize"
+    ]
+  },
+  {
+    "id": "cross-file-refactor",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "20",
+      "ISSUE_TITLE": "Refactor validation logic into a separate utility",
+      "ISSUE_BODY": "The validation logic in `UserForm.tsx` and `OrderForm.tsx` is identical. Move it to `src/utils/validation.ts` and update both forms."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "refactor",
+      "move",
+      "utility",
+      "update",
+      "UserForm",
+      "OrderForm"
+    ]
+  },
+  {
+    "id": "complex-state-fix",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "21",
+      "ISSUE_TITLE": "Fix race condition in multi-step wizard",
+      "ISSUE_BODY": "In the multi-step checkout, if a user clicks 'Next' twice very quickly, they skip a step and end up in an invalid state. We need to disable the button during transition."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "race condition",
+      "disable",
+      "button",
+      "transition",
+      "state"
+    ]
+  },
+  {
+    "id": "fix-flaky-test",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "30",
+      "ISSUE_TITLE": "Flaky test: UserProfile should load data",
+      "ISSUE_BODY": "The test `UserProfile should load data` fails about 10% of the time on CI. It seems to be timing out waiting for the network."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": ["flaky", "wait", "timeout", "mock", "network"]
+  },
+  {
+    "id": "migrate-deprecated-api",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "31",
+      "ISSUE_TITLE": "Migrate usage of deprecated 'fs.exists'",
+      "ISSUE_BODY": "`fs.exists` is deprecated. We should replace all occurrences with `fs.stat` or `fs.access`."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "deprecated",
+      "replace",
+      "fs.exists",
+      "fs.stat",
+      "fs.access"
+    ]
+  },
+  {
+    "id": "add-ci-workflow",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "32",
+      "ISSUE_TITLE": "Add CI workflow for linting",
+      "ISSUE_BODY": "We need a GitHub Actions workflow that runs `npm run lint` on every push to main."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "workflow",
+      "github/workflows",
+      "lint",
+      "push",
+      "main"
+    ]
+  }
+]
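
In the dataset above, the `impossible-request` and `out-of-scope` cases list only `gh issue comment` in `expected_actions`, which lines up with the new "Validate Issue" step: the agent should comment without labeling the issue for a fix. The sketch below shows one possible reading of that field, under the assumption that omitting `update_issue` means the call must not happen. That interpretation, and the name `checkActions`, are not confirmed by the diff.

```typescript
// Assumption: "update_issue" is the action that marks an issue as being fixed,
// so its absence from expected_actions is treated as "must not be called".
const FIX_LABEL_ACTION = "update_issue";

// Hypothetical helper: compare the actions a run performed with the dataset's
// expected_actions and return a list of human-readable problems.
function checkActions(expected: string[], observed: string[]): string[] {
  const problems: string[] = [];
  for (const action of expected) {
    if (!observed.includes(action)) {
      problems.push(`expected action not observed: ${action}`);
    }
  }
  if (!expected.includes(FIX_LABEL_ACTION) && observed.includes(FIX_LABEL_ACTION)) {
    problems.push(`${FIX_LABEL_ACTION} was called for an issue that should only receive a comment`);
  }
  return problems;
}
```

Checking for the unwanted call as well as the wanted ones is what distinguishes the vague and out-of-scope cases from the rest: a run that labels the issue and also comments would otherwise pass.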