Azure · placerda · Mar 19, 2026 · Mar 19, 2026 · Mar 19, 2026 · Mar 19, 2026
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -36,7 +36,7 @@ Contribution guidelines live in `CONTRIBUTING.md` at the repo root.
   - Local evaluation via `azure-ai-evaluation` SDK (fallback)
 - **Secondary backend**: subprocess-based (generic)
 - **Azure SDK dependencies** (runtime, for Foundry backend):
-  - `azure-ai-projects>=2.0.0b1` — Foundry project client, `get_openai_client()`
+  - `azure-ai-projects>=2.0.1` — Foundry project client, `get_openai_client()`
   - `azure-ai-evaluation` — Local evaluator classes (SimilarityEvaluator, etc.)
   - `azure-identity` — `DefaultAzureCredential` authentication
   - `openai` — Evals API types (`DataSourceConfigCustom`, etc.)
@@ -233,14 +233,15 @@ Do not implement the following unless explicitly discussed:
 
 This repository also defines workflow-oriented Copilot skills under `.github/skills/`.
 
-- Use these skills for operational guidance on running evaluations, investigating regressions, and observability triage workflows.
+- Use these skills for operational guidance on running evaluations, investigating regressions, observability triage, and release management workflows.
 - Treat the CLI as the source of truth and keep planned/stubbed commands clearly marked as not yet implemented.
 - Do not duplicate architecture or code-structure guidance from this file inside workflow skills.
 
 When generating or modifying code:
 
 - **Read `docs/how-it-works.md` first** — it is the single source of truth for architecture
 - **Read `CONTRIBUTING.md`** for contribution rules and workflow
+- Treat the CLI as the source of truth and keep planned/stubbed commands clearly marked as not yet implemented.
 - Do not invent new concepts or commands
 - Prefer clarity and determinism over cleverness
 - Optimize for maintainability and CI usage

diff --git a/.github/extensions/agentops-skills/extension.mjs b/.github/extensions/agentops-skills/extension.mjs
@@ -0,0 +1,149 @@
+// Extension: agentops-skills
+// Injects AgentOps workflow skills as context when relevant prompts are detected.
+
+import { joinSession } from "@github/copilot-sdk/extension";
+
+const SKILLS = {
+    "run-evals": {
+        keywords: [
+            "run eval", "start agentops", "run.yaml", "regenerate report",
+            "evaluation results", "agentops init", "agentops eval", "agentops report",
+            "run an evaluation", "initialize agentops", "results.json", "report.md",
+            "eval run", "run config", "evaluation output",
+        ],
+        context: `## Skill: Run Evaluations
+
+### Purpose
+Guide through the implemented AgentOps evaluation workflow from workspace setup to report interpretation.
+
+### Available Commands
+- agentops init [--path <dir>] — Initialize workspace
+- agentops eval run — Execute evaluation
+- agentops report — Regenerate report from results.json
+
+### Typical Workflow
+1. Initialize workspace: agentops init
+2. Confirm run config exists (.agentops/run.yaml)
+3. Execute evaluation: agentops eval run
+4. Regenerate markdown report: agentops report
+5. Inspect outputs under .agentops/results/latest/
+
+### Outputs
+- results.json (machine-readable normalized results)
+- report.md (human-readable summary)
+- cloud_evaluation.json (cloud evaluation flows only)
+- Latest pointers: .agentops/results/latest/
+
+### Interpretation
+- Start with report.md for quick pass/fail narrative and threshold view.
+- Use results.json for metric-level details, row-level checks, and automation.
+- Distinguish: thresholds passing, threshold failures, runtime/config errors.
+
+### Guardrails
+- Do not invent commands or flags beyond documented CLI behavior.
+- Planned commands (compare, run-history) are stubbed — pivot to artifact inspection.`,
+    },
+
+    "investigate-regression": {
+        keywords: [
+            "regression", "score dropped", "threshold started failing",
+            "compare runs", "eval got worse", "debug evaluation",
+            "evaluation drift", "quality drop", "pass rate dropped",
+            "ci failing", "scores lower", "metrics degraded",
+        ],
+        context: `## Skill: Investigate Regression
+
+### Purpose
+Guide through regression investigation using currently available AgentOps outputs.
+
+### Available Commands
+- agentops eval run — Generate fresh artifacts
+- agentops report — Regenerate report
+
+### Planned (not implemented)
+- agentops eval compare --runs ID1,ID2
+
+### Investigation Steps
+1. Run fresh evaluation: agentops eval run
+2. Regenerate report: agentops report
+3. Compare current artifacts to baseline manually
+4. Report factual deltas, then propose controlled next steps
+
+### Required Inputs
+- At least one recent artifact set (results.json + report.md)
+- Preferably a baseline for side-by-side comparison
+- Context about what changed (prompt, model, dataset, bundle, backend, environment)
+
+### Interpretation
+- Separate observations (artifact-backed) from hypotheses (plausible causes).
+- Prioritize impact: which thresholds flipped, which metrics degraded most, broad vs concentrated failures.
+- End with actionable next checks (rerun with controlled changes, validate dataset, verify config).
+
+### Guardrails
+- agentops eval compare is NOT implemented — use manual artifact comparison.
+- Do not infer causality from correlation alone.
+- Keep remediation tied to reproducible checks.`,
+    },
+
+    "observability-triage": {
+        keywords: [
+            "tracing", "monitoring", "dashboard", "alerts", "triage",
+            "observability", "run health", "production triage",
+            "monitor evals", "set up tracing", "failed evaluation",
+            "quality monitoring",
+        ],
+        context: `## Skill: Observability Triage
+
+### Purpose
+Provide honest observability guidance: use current reporting artifacts today, frame tracing/monitoring as planned future work.
+
+### Available Commands (for triage today)
+- agentops eval run
+- agentops report
+
+### Planned/Stubbed (NOT implemented)
+- agentops trace init
+- agentops monitor setup
+- agentops monitor dashboard
+- agentops monitor alert
+
+### Current Triage Approach
+- Use report.md for quick operational triage (what failed, severity).
+- Use results.json for detailed metric and threshold inspection.
+- Keep run artifacts organized for future compare/monitor automation.
+
+### When Users Ask for Unimplemented Features
+1. State explicitly: planned/stubbed, not available yet.
+2. Provide immediate fallback: artifact-based troubleshooting.
+3. Suggest preparation: organize artifacts for future tooling.
+
+### Guardrails
+- Do not present tracing or monitoring commands as available.
+- Do not imply real-time dashboards/alerts exist in CLI.
+- Always pivot to concrete available outputs (results.json, report.md).`,
+    },
+};
+
+function matchSkills(prompt) {
+    const lower = prompt.toLowerCase();
+    const matched = [];
+    for (const [name, skill] of Object.entries(SKILLS)) {
+        if (skill.keywords.some((kw) => lower.includes(kw))) {
+            matched.push(skill.context);
+        }
+    }
+    return matched;
+}
+
+const session = await joinSession({
+    hooks: {
+        onUserPromptSubmitted: async (input) => {
+            const matched = matchSkills(input.prompt);
+            if (matched.length > 0) {
+                return {
+                    additionalContext: `<agentops_skills>\n${matched.join("\n\n---\n\n")}\n</agentops_skills>`,
+                };
+            }
+        },
+    },
+});
diff --git a/.github/plugins/agentops/skills/agentops-investigate-regression/SKILL.md b/.github/plugins/agentops/skills/agentops-investigate-regression/SKILL.md
@@ -0,0 +1,107 @@
+---
+name: agentops-investigate-regression
+description: Help users investigate evaluation regressions in AgentOps by comparing runs, analyzing row-level scores, and identifying root causes. Trigger when users say "regression", "score dropped", "threshold failed", "compare runs", "why did this eval get worse", "which rows failed", "debug evaluation", "quality degradation". Install agentops-toolkit via pip. Commands are agentops eval run, agentops eval compare, and agentops report.
+---
+
+# AgentOps Investigate Regression
+
+> **Prerequisite:** Install the AgentOps CLI with `pip install agentops-toolkit`.
+
+## Purpose
+Guide users through regression investigation using N-run comparison, row-level score analysis, and structured root cause identification.
+
+## When to Use
+- User reports lower scores versus previous runs.
+- User reports new threshold failures (PASS → FAIL).
+- User asks to compare current and prior evaluation outcomes.
+- CI gating changed from pass to fail and root cause is unclear.
+- User asks which specific rows or questions are failing.
+
+## Available Commands
+
+```bash
+agentops eval run [-c <config>] [-f md|html|all]                    # Generate fresh results
+agentops report [-f md|html|all]                                     # Regenerate report
+agentops eval compare --runs <id1>,<id2>[,...] [-f md|html|all]      # Compare N runs
+```
+
+Run identifiers for `--runs` can be:
+- Timestamped folder names (e.g. `2026-03-01_100000`)
+- The keyword `latest`
+- Absolute or relative paths to a `results.json` or a run directory
+
+## Investigation Workflow
+
+1. **Reproduce:** `agentops eval run -f html` to get fresh results with visual report.
+2. **Compare:** `agentops eval compare --runs <baseline>,latest -f html`
+3. **Check the verdict:** NO REGRESSIONS vs REGRESSIONS DETECTED
+4. **Read run config:** Check Status row — `FAIL (60% · 3/5)` tells you exactly how many rows failed.
+5. **Read Evaluators table:**
+   - ● green dot = Met threshold, ● red dot = Missed
+   - ↑ improved / ↓ regressed vs baseline
+   - `(3/5)` = row pass rate for this evaluator
+6. **Drill into Row Details:** Find exactly which rows scored below threshold and why.
+7. **Act:** Fix the identified issues (prompt tuning, dataset quality, model selection).
+
+## Understanding the Report
+
+### What REGRESSIONS DETECTED means
+A regression is detected ONLY when:
+- A run's overall status flips from **PASS to FAIL** vs baseline
+- A previously-passing **row** now fails
+
+A minor numeric decrease (e.g., latency 4.84s → 6.00s) that stays within the threshold (≤ 10s) is **NOT** a regression. The verdict focuses on threshold-breaking changes, not noise.
+
+### Comparison types
+The report auto-detects what's being compared:
+- **Model Comparison** — same dataset, different models → full row-level analysis valid
+- **Agent Comparison** — same dataset, different agents → full row-level analysis valid
+- **Dataset Coverage** — different datasets → row details skipped (rows aren't comparable)
+- **General** — multiple things vary
+
+### Evaluators table
+Each cell shows: `● score ↑ delta (n/n rows)`
+- **● dot** = Met (green) or Missed (red) vs the absolute threshold target
+- **↑↓ delta** = direction vs baseline run (improved/regressed/unchanged)
+- **(n/n)** = how many rows met the threshold out of total
+- **Green highlight** = best score across all runs
+- Metrics without thresholds (like `samples_evaluated`) show as plain informational numbers
+
+### Row Details table
+Each cell shows per-evaluator scores: `● SimilarityEvaluator: 2`
+- Green ● = this row met the threshold
+- Red ● = this row missed — **this is why the run failed**
+
+### Status
+`PASS (100% · 5/5)` = all rows met all thresholds
+`FAIL (60% · 3/5)` = 3 of 5 rows passed, 2 failed → the specific rows that failed explain the FAIL
+
+## Root Cause Checklist
+When you find regressions:
+
+1. **Which rows failed?** → Check Row Details for red ● dots
+2. **Which evaluator failed?** → The evaluator with red dots tells you what's weak
+3. **Is it the model?** → Compare same dataset across models to isolate
+4. **Is it the dataset?** → Some questions are inherently harder (real-time, ambiguous)
+5. **Is it the agent instructions?** → Compare agent versions on same dataset
+6. **Is it random variance?** → Run the same config 2-3 times and compare
+
+## Guardrails
+- Do not infer causality from correlation alone.
+- Separate observations (data from artifacts) from hypotheses (plausible causes).
+- Keep remediation advice tied to reproducible checks.
+- When comparing runs with different datasets, do NOT analyze row-level changes — they're different questions.
+
+## Examples
+- "My eval went from PASS to FAIL after changing model"
+  → `agentops eval compare --runs <old>,<new> -f html`. Check Evaluators for ↓ regressed metrics and Row Details for newly-failing rows.
+- "Which specific questions are failing?"
+  → Open the HTML report, scroll to Row Details — each row shows the actual score per evaluator with ● Met/Missed.
+- "Is gpt-4.1 better than gpt-5.1 for my use case?"
+  → Create two run.yaml files (same dataset, different model), run both, compare. The Evaluators table with row pass rates tells you which model handles your questions better.
+- "Why is CI failing now?"
+  → `agentops eval compare --runs <last_pass>,latest -f html`. The Status line shows `FAIL (80% · 4/5)` — one row regressed. Row Details shows which.
+
+## Learn More
+- Documentation: https://github.com/Azure/agentops
+- PyPI: https://pypi.org/project/agentops-toolkit/
diff --git a/.github/plugins/agentops/skills/agentops-observability-triage/SKILL.md b/.github/plugins/agentops/skills/agentops-observability-triage/SKILL.md
@@ -0,0 +1,113 @@
+---
+name: agentops-observability-triage
+description: Guide users on observability and triage workflows for AgentOps evaluations. Trigger when users ask about tracing, monitoring, dashboards, alerts, run health, production triage, or understanding evaluation outputs. Common phrases include "set up tracing", "monitor evals", "create alerts", "triage failed evaluations", "observability", "understand eval results", "what do these scores mean". Install agentops-toolkit via pip. Tracing and monitoring commands are planned for a future release.
+---
+
+# AgentOps Observability Triage
+
+> **Prerequisite:** Install the AgentOps CLI with `pip install agentops-toolkit`.
+
+## Purpose
+Provide practical observability guidance using current reporting artifacts. Frame tracing/monitoring as planned future features while showing what's available today — including HTML reports with visual indicators and N-run comparison dashboards.
+
+## When to Use
+- User asks how to monitor ongoing evaluation quality.
+- User asks for tracing, dashboards, or alerts.
+- User needs triage steps after an unexpected evaluation outcome.
+- User asks what the evaluation scores and indicators mean.
+
+## Available Commands
+
+```bash
+agentops eval run [-c <config>] [-f md|html|all]                    # Generate results
+agentops report [--in <results.json>] [-f md|html|all]              # Regenerate report
+agentops eval compare --runs <id1>,<id2>[,...] [-f md|html|all]      # Compare N runs
+```
+
+## Planned Commands (Not Yet Available)
+
+```bash
+agentops trace init             # Initialize tracing
+agentops monitor setup          # Set up monitoring
+agentops monitor dashboard      # Configure dashboards
+agentops monitor alert          # Configure alerts
+```
+
+## Triage Workflow
+
+### Quick triage (single run)
+1. `agentops eval run -f html` — run and generate HTML report
+2. Open `report.html` — check overall status, threshold checks, item verdicts
+3. If FAIL: look at which evaluator thresholds were missed
+
+### Deep triage (comparison)
+1. `agentops eval compare --runs <baseline>,latest -f html`
+2. Open `comparison.html` — visual dashboard with:
+   - **Status**: `PASS (100% · 5/5)` or `FAIL (60% · 3/5)` — immediate pass rate
+   - **Evaluators**: ● dots (Met/Missed), ↑↓ arrows (direction vs baseline), (n/n) row rates
+   - **Row Details**: per-row scores showing exactly which questions failed
+3. Check if regression is real (threshold flip) or noise (minor shift within threshold)
+
+### Multi-run trending
+1. Run the same config multiple times over days/weeks
+2. Compare all: `agentops eval compare --runs <oldest>,<middle>,<latest> -f html`
+3. The Evaluators table shows trend direction for each metric across all runs
+
+### Model selection
+1. Create run configs for each candidate model (same dataset + bundle)
+2. Run each: `agentops eval run -c <model-config> -f html`
+3. Compare: `agentops eval compare --runs <model1>,<model2>,<model3> -f html`
+4. Report auto-detects "Model Comparison" and shows side-by-side with best highlighting
+5. Pick the model that meets thresholds at the best quality/latency/cost ratio
+
+## Understanding Report Indicators
+
+### HTML visual indicators
+- **● green dot** — evaluator score Met the threshold target
+- **● red dot** — evaluator score Missed the threshold target
+- **↑ green arrow** — score improved vs baseline
+- **↓ red arrow** — score regressed vs baseline
+- **→ gray arrow** — unchanged
+- **Green highlighted cell** — best score across all compared runs
+- **(3/5)** — 3 out of 5 rows met this evaluator's threshold
+- **Muted gray text** — informational metric (no threshold, e.g., samples_evaluated)
+
+### Status
+- `PASS (100% · 5/5)` — all 5 rows met all thresholds
+- `FAIL (80% · 4/5)` — 4 of 5 rows passed, 1 failed
+- PASS = all row thresholds met · FAIL = one or more rows missed
+
+### Verdict
+- **NO REGRESSIONS** — no run's status flipped PASS→FAIL vs baseline
+- **REGRESSIONS DETECTED** — at least one run has newly-failing rows or status flipped
+
+### Comparison types (auto-detected)
+- **Model Comparison** — comparing different models on same dataset
+- **Agent Comparison** — comparing different agents on same dataset
+- **Dataset Coverage** — testing same model/agent on different datasets
+- **General** — multiple parameters vary
+
+## Report Formats
+- `-f md` — Markdown (default), good for PRs and CI logs
+- `-f html` — professional visual dashboard, best for analysis
+- `-f all` — generates both
+
+## Guardrails
+- Do not present tracing or monitoring commands as available today.
+- Do not imply real-time dashboards or alerts currently exist.
+- Always pivot to concrete available outputs when asked about unimplemented features.
+- The HTML report IS the current dashboard — it's self-contained, no server needed.
+
+## Examples
+- "How do I set up tracing?"
+  → Tracing (`agentops trace init`) is planned. For now, use `-f html` to generate visual reports with per-row score breakdowns.
+- "Can I monitor eval quality over time?"
+  → Run evals periodically and compare: `agentops eval compare --runs <old>,<mid>,<new> -f html`. The trend arrows show quality direction.
+- "What does FAIL (80% · 4/5) mean?"
+  → 4 of 5 dataset rows met all evaluator thresholds, 1 row missed. Check Row Details to see which row and which evaluator scored below target.
+- "What do the colored dots mean?"
+  → Green ● = score met the threshold target, Red ● = missed. In the Evaluators table, this is the aggregate score; in Row Details, it's per-row.
+
+## Learn More
+- Documentation: https://github.com/Azure/agentops
+- PyPI: https://pypi.org/project/agentops-toolkit/