Closed
31 commits
d563243
feat: implement eval compare, cloud eval fix, distributable Copilot s…
Dongbumlee Mar 19, 2026
0f66f64
docs: add baseline comparison and Copilot skills tutorials
Dongbumlee Mar 19, 2026
5bb5079
docs: enrich tutorials with model-vs-agent guidance and practical depth
Dongbumlee Mar 19, 2026
457f43d
docs: enrich model-direct and agent tutorials with depth and cross-re…
Dongbumlee Mar 19, 2026
5db66f7
feat: N-run comparison, HTML reports, smart comparison conditions, an…
Dongbumlee Mar 20, 2026
26f28ca
feat: GitOps release pipeline with TestPyPI staging and setuptools-scm
Dongbumlee Mar 23, 2026
05995f4
docs: add release process guide for new engineers
Dongbumlee Mar 23, 2026
60e4c1e
evaluation
placerda Mar 23, 2026
93cb885
Merge branch 'develop' into feature/gitops-release-pipeline
placerda Mar 23, 2026
5164836
Merge pull request #30 from Azure/feature/gitops-release-pipeline
placerda Mar 23, 2026
3a41a42
fix: move mid-file import to top to resolve ruff E402 lint error
Dongbumlee Mar 23, 2026
da92b4f
chore: add pre-commit with ruff lint and format hooks
Dongbumlee Mar 23, 2026
8c07ac7
ci: remove macOS from test matrix to avoid queue delays
Dongbumlee Mar 23, 2026
1327dee
Merge pull request #31 from Azure/feature/gitops-release-pipeline
Dongbumlee Mar 23, 2026
677f770
merge: resolve conflicts with develop branch
Dongbumlee Mar 23, 2026
9603575
Merge pull request #32 from Azure/feature/copilot-skill-baseline-comp…
Dongbumlee Mar 23, 2026
cf73554
ci: remove duplicate test runs on release branches
Dongbumlee Mar 23, 2026
b2df5ae
Merge pull request #33 from Azure/feature/copilot-skill-baseline-comp…
Dongbumlee Mar 23, 2026
4b59925
ci: add cut-release workflow and update release docs
Dongbumlee Mar 23, 2026
1e7584f
Merge pull request #34 from Azure/feature/copilot-skill-baseline-comp…
Dongbumlee Mar 23, 2026
08df77b
ci: auto-publish dev builds to TestPyPI on develop push
Dongbumlee Mar 23, 2026
d99aec2
Merge pull request #35 from Azure/feature/ci-dev-publish
Dongbumlee Mar 23, 2026
de5c110
fix: report command respects --format html parameter
Dongbumlee Mar 24, 2026
8e2f428
Merge pull request #36 from Azure/feature/copilot-skill-baseline-comp…
Dongbumlee Mar 24, 2026
4c62cd3
fix: -f all now generates both md and html reports
Dongbumlee Mar 24, 2026
fea1e64
Merge pull request #37 from Azure/feature/copilot-skill-baseline-comp…
Dongbumlee Mar 24, 2026
2917226
chore: stage all develop changes for release/0.2.0
placerda Mar 24, 2026
50dce6a
Merge remote-tracking branch 'origin/develop' into release/0.2.0
placerda Mar 24, 2026
bb7eef6
chore: prepare release 0.2.0
placerda Mar 24, 2026
b2c663e
chore: correct release version to 0.1.3
placerda Mar 24, 2026
91d8d18
fix: remove unused imports and variable flagged by ruff (F401, F841)
placerda Mar 24, 2026
5 changes: 3 additions & 2 deletions .github/copilot-instructions.md
@@ -36,7 +36,7 @@ Contribution guidelines live in `CONTRIBUTING.md` at the repo root.
- Local evaluation via `azure-ai-evaluation` SDK (fallback)
- **Secondary backend**: subprocess-based (generic)
- **Azure SDK dependencies** (runtime, for Foundry backend):
-- `azure-ai-projects>=2.0.0b1` — Foundry project client, `get_openai_client()`
+- `azure-ai-projects>=2.0.1` — Foundry project client, `get_openai_client()`
- `azure-ai-evaluation` — Local evaluator classes (SimilarityEvaluator, etc.)
- `azure-identity` — `DefaultAzureCredential` authentication
- `openai` — Evals API types (`DataSourceConfigCustom`, etc.)
@@ -233,14 +233,15 @@ Do not implement the following unless explicitly discussed:

This repository also defines workflow-oriented Copilot skills under `.github/skills/`.

-- Use these skills for operational guidance on running evaluations, investigating regressions, and observability triage workflows.
+- Use these skills for operational guidance on running evaluations, investigating regressions, observability triage, and release management workflows.
- Treat the CLI as the source of truth and keep planned/stubbed commands clearly marked as not yet implemented.
- Do not duplicate architecture or code-structure guidance from this file inside workflow skills.

When generating or modifying code:

- **Read `docs/how-it-works.md` first** — it is the single source of truth for architecture
- **Read `CONTRIBUTING.md`** for contribution rules and workflow
+- Treat the CLI as the source of truth and keep planned/stubbed commands clearly marked as not yet implemented.
- Do not invent new concepts or commands
- Prefer clarity and determinism over cleverness
- Optimize for maintainability and CI usage
149 changes: 149 additions & 0 deletions .github/extensions/agentops-skills/extension.mjs
@@ -0,0 +1,149 @@
// Extension: agentops-skills
// Injects AgentOps workflow skills as context when relevant prompts are detected.

import { joinSession } from "@github/copilot-sdk/extension";

const SKILLS = {
  "run-evals": {
    keywords: [
      "run eval", "start agentops", "run.yaml", "regenerate report",
      "evaluation results", "agentops init", "agentops eval", "agentops report",
      "run an evaluation", "initialize agentops", "results.json", "report.md",
      "eval run", "run config", "evaluation output",
    ],
    context: `## Skill: Run Evaluations

### Purpose
Guide through the implemented AgentOps evaluation workflow from workspace setup to report interpretation.

### Available Commands
- agentops init [--path <dir>] — Initialize workspace
- agentops eval run — Execute evaluation
- agentops report — Regenerate report from results.json

### Typical Workflow
1. Initialize workspace: agentops init
2. Confirm run config exists (.agentops/run.yaml)
3. Execute evaluation: agentops eval run
4. Regenerate markdown report: agentops report
5. Inspect outputs under .agentops/results/latest/

### Outputs
- results.json (machine-readable normalized results)
- report.md (human-readable summary)
- cloud_evaluation.json (cloud evaluation flows only)
- Latest pointers: .agentops/results/latest/

### Interpretation
- Start with report.md for quick pass/fail narrative and threshold view.
- Use results.json for metric-level details, row-level checks, and automation.
- Distinguish: thresholds passing, threshold failures, runtime/config errors.

### Guardrails
- Do not invent commands or flags beyond documented CLI behavior.
- Planned commands (compare, run-history) are stubbed — pivot to artifact inspection.`,
  },

  "investigate-regression": {
    keywords: [
      "regression", "score dropped", "threshold started failing",
      "compare runs", "eval got worse", "debug evaluation",
      "evaluation drift", "quality drop", "pass rate dropped",
      "ci failing", "scores lower", "metrics degraded",
    ],
    context: `## Skill: Investigate Regression

### Purpose
Guide through regression investigation using currently available AgentOps outputs.

### Available Commands
- agentops eval run — Generate fresh artifacts
- agentops report — Regenerate report

### Planned (not implemented)
- agentops eval compare --runs ID1,ID2

### Investigation Steps
1. Run fresh evaluation: agentops eval run
2. Regenerate report: agentops report
3. Compare current artifacts to baseline manually
4. Report factual deltas, then propose controlled next steps

### Required Inputs
- At least one recent artifact set (results.json + report.md)
- Preferably a baseline for side-by-side comparison
- Context about what changed (prompt, model, dataset, bundle, backend, environment)

### Interpretation
- Separate observations (artifact-backed) from hypotheses (plausible causes).
- Prioritize impact: which thresholds flipped, which metrics degraded most, broad vs concentrated failures.
- End with actionable next checks (rerun with controlled changes, validate dataset, verify config).

### Guardrails
- agentops eval compare is NOT implemented — use manual artifact comparison.
- Do not infer causality from correlation alone.
- Keep remediation tied to reproducible checks.`,
  },

  "observability-triage": {
    keywords: [
      "tracing", "monitoring", "dashboard", "alerts", "triage",
      "observability", "run health", "production triage",
      "monitor evals", "set up tracing", "failed evaluation",
      "quality monitoring",
    ],
    context: `## Skill: Observability Triage

### Purpose
Provide honest observability guidance: use current reporting artifacts today, frame tracing/monitoring as planned future work.

### Available Commands (for triage today)
- agentops eval run
- agentops report

### Planned/Stubbed (NOT implemented)
- agentops trace init
- agentops monitor setup
- agentops monitor dashboard
- agentops monitor alert

### Current Triage Approach
- Use report.md for quick operational triage (what failed, severity).
- Use results.json for detailed metric and threshold inspection.
- Keep run artifacts organized for future compare/monitor automation.

### When Users Ask for Unimplemented Features
1. State explicitly: planned/stubbed, not available yet.
2. Provide immediate fallback: artifact-based troubleshooting.
3. Suggest preparation: organize artifacts for future tooling.

### Guardrails
- Do not present tracing or monitoring commands as available.
- Do not imply real-time dashboards/alerts exist in CLI.
- Always pivot to concrete available outputs (results.json, report.md).`,
  },
};

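// Return the context block of every skill that has at least one keyword
// occurring (case-insensitively) as a substring of the prompt.
function matchSkills(prompt) {
  const lower = prompt.toLowerCase();
  const matched = [];
  // Only the skill values are needed; the skill names are not used here.
  for (const skill of Object.values(SKILLS)) {
    if (skill.keywords.some((kw) => lower.includes(kw))) {
      matched.push(skill.context);
    }
  }
  return matched;
}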

const session = await joinSession({
  hooks: {
    onUserPromptSubmitted: async (input) => {
      const matched = matchSkills(input.prompt);
      if (matched.length > 0) {
        return {
          additionalContext: `<agentops_skills>\n${matched.join("\n\n---\n\n")}\n</agentops_skills>`,
        };
      }
    },
  },
});
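
The keyword matching and context wrapping in this extension can be exercised without the Copilot SDK. The following is a minimal sketch, assuming a trimmed stand-in table (`DEMO_SKILLS`) and two illustrative helper names (`matchDemoSkills`, `buildAdditionalContext`) that are not part of the extension; it only mirrors the substring matching and the `<agentops_skills>` wrapper shown in the diff.

```javascript
// Trimmed stand-in for the SKILLS table; names and contexts are shortened for illustration.
const DEMO_SKILLS = {
  "run-evals": {
    keywords: ["run eval", "agentops report"],
    context: "## Skill: Run Evaluations",
  },
  "investigate-regression": {
    keywords: ["regression", "score dropped"],
    context: "## Skill: Investigate Regression",
  },
};

// Same case-insensitive substring matching as matchSkills in the extension.
function matchDemoSkills(prompt) {
  const lower = prompt.toLowerCase();
  return Object.values(DEMO_SKILLS)
    .filter((skill) => skill.keywords.some((kw) => lower.includes(kw)))
    .map((skill) => skill.context);
}

// Mirrors the wrapper string built inside the onUserPromptSubmitted hook.
function buildAdditionalContext(prompt) {
  const matched = matchDemoSkills(prompt);
  if (matched.length === 0) return undefined; // no skill matched, nothing injected
  return `<agentops_skills>\n${matched.join("\n\n---\n\n")}\n</agentops_skills>`;
}

console.log(buildAdditionalContext("Please RUN EVAL again, I suspect a regression"));
console.log(buildAdditionalContext("What is the weather?"));
```

Because matching is plain substring search, a broad keyword such as "monitoring" fires on any prompt containing that word, so several skills may be injected at once.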
@@ -0,0 +1,107 @@
---
name: agentops-investigate-regression
description: Help users investigate evaluation regressions in AgentOps by comparing runs, analyzing row-level scores, and identifying root causes. Trigger when users say "regression", "score dropped", "threshold failed", "compare runs", "why did this eval get worse", "which rows failed", "debug evaluation", "quality degradation". Install agentops-toolkit via pip. Commands are agentops eval run, agentops eval compare, and agentops report.
---

# AgentOps Investigate Regression

> **Prerequisite:** Install the AgentOps CLI with `pip install agentops-toolkit`.

## Purpose
Guide users through regression investigation using N-run comparison, row-level score analysis, and structured root cause identification.

## When to Use
- User reports lower scores versus previous runs.
- User reports new threshold failures (PASS → FAIL).
- User asks to compare current and prior evaluation outcomes.
- CI gating changed from pass to fail and root cause is unclear.
- User asks which specific rows or questions are failing.

## Available Commands

```bash
agentops eval run [-c <config>] [-f md|html|all] # Generate fresh results
agentops report [-f md|html|all] # Regenerate report
agentops eval compare --runs <id1>,<id2>[,...] [-f md|html|all] # Compare N runs
```

Run identifiers for `--runs` can be:
- Timestamped folder names (e.g. `2026-03-01_100000`)
- The keyword `latest`
- Absolute or relative paths to a `results.json` or a run directory
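
The three identifier forms can be told apart mechanically. The sketch below is an illustration only, not the CLI's actual resolution logic: `classifyRunId` and `parseRunsArgument` are hypothetical names, and the timestamp pattern is inferred from the single example `2026-03-01_100000`.

```javascript
// Assumed folder-name pattern, inferred from the example 2026-03-01_100000.
const TIMESTAMP_FOLDER = /^\d{4}-\d{2}-\d{2}_\d{6}$/;

// Hypothetical classifier for one --runs identifier.
function classifyRunId(id) {
  if (id === "latest") return "latest";
  if (TIMESTAMP_FOLDER.test(id)) return "timestamp";
  // Anything else is treated as a path to results.json or to a run directory.
  return "path";
}

// Split a comma-separated --runs value and classify each entry.
function parseRunsArgument(value) {
  return value
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((id) => ({ id, kind: classifyRunId(id) }));
}

console.log(parseRunsArgument("2026-03-01_100000,latest,./runs/a/results.json"));
```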

## Investigation Workflow

1. **Reproduce:** `agentops eval run -f html` to get fresh results with a visual report.
2. **Compare:** `agentops eval compare --runs <baseline>,latest -f html`
3. **Check the verdict:** NO REGRESSIONS vs REGRESSIONS DETECTED
4. **Read run config:** Check Status row — `FAIL (60% · 3/5)` tells you exactly how many rows failed.
5. **Read Evaluators table:**
- ● green dot = Met threshold, ● red dot = Missed
- ↑ improved / ↓ regressed vs baseline
- `(3/5)` = row pass rate for this evaluator
6. **Drill into Row Details:** Find exactly which rows scored below threshold and why.
7. **Act:** Fix the identified issues (prompt tuning, dataset quality, model selection).

## Understanding the Report

### What REGRESSIONS DETECTED means
A regression is detected ONLY when at least one of the following holds:
- A run's overall status flips from **PASS to FAIL** vs baseline
- A previously-passing **row** now fails

A minor numeric decrease (e.g., latency 4.84s → 6.00s) that stays within the threshold (≤ 10s) is **NOT** a regression. The verdict focuses on threshold-breaking changes, not noise.
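
The rule above can be sketched as a small check. This is a simplified illustration, not the toolkit's implementation: the run shape (`status` plus a `rows` map of row ID to pass/fail) and the function name `detectRegression` are assumptions made for the example.

```javascript
// Hypothetical run shape: { status: "PASS" | "FAIL", rows: { [rowId]: boolean } }
function detectRegression(baseline, candidate) {
  // Condition 1: overall status flips from PASS to FAIL.
  const statusFlipped = baseline.status === "PASS" && candidate.status === "FAIL";
  // Condition 2: a row that passed in the baseline now fails.
  const newlyFailingRows = Object.keys(baseline.rows).filter(
    (rowId) => baseline.rows[rowId] === true && candidate.rows[rowId] === false
  );
  return {
    regressed: statusFlipped || newlyFailingRows.length > 0,
    statusFlipped,
    newlyFailingRows,
  };
}

const baseline = { status: "PASS", rows: { r1: true, r2: true } };
const slower = { status: "PASS", rows: { r1: true, r2: true } }; // scores moved, thresholds still met
const broken = { status: "FAIL", rows: { r1: true, r2: false } };

console.log(detectRegression(baseline, slower).regressed);
console.log(detectRegression(baseline, broken).newlyFailingRows);
```

Note that raw scores never enter the check: only threshold outcomes do, which is why a shift that stays inside the threshold does not count.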

### Comparison types
The report auto-detects what's being compared:
- **Model Comparison** — same dataset, different models → full row-level analysis valid
- **Agent Comparison** — same dataset, different agents → full row-level analysis valid
- **Dataset Coverage** — different datasets → row details skipped (rows aren't comparable)
- **General** — multiple things vary
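
One way to picture the auto-detection is to check which run parameter varies. The sketch below assumes each run exposes `dataset`, `model`, and `agent` fields and uses a hypothetical function name, `detectComparisonType`; how the report actually detects the type is not shown in this document. Treating runs where nothing varies as "General" is also an assumption.

```javascript
// Hypothetical run metadata: { dataset, model, agent }
function detectComparisonType(runs) {
  // A parameter "varies" when the runs do not all share the same value.
  const varies = (key) => new Set(runs.map((run) => run[key])).size > 1;
  const changed = ["dataset", "model", "agent"].filter(varies);

  let type = "General"; // assumed fallback, also used when nothing varies
  if (changed.length === 1 && changed[0] === "model") type = "Model Comparison";
  if (changed.length === 1 && changed[0] === "agent") type = "Agent Comparison";
  if (changed.length === 1 && changed[0] === "dataset") type = "Dataset Coverage";

  // Rows are only comparable when every run used the same dataset.
  return { type, rowsComparable: !varies("dataset") };
}

console.log(
  detectComparisonType([
    { dataset: "qa.jsonl", model: "model-a", agent: null },
    { dataset: "qa.jsonl", model: "model-b", agent: null },
  ])
);
```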

### Evaluators table
Each cell shows: `● score ↑ delta (n/n rows)`
- **● dot** = Met (green) or Missed (red) vs the absolute threshold target
- **↑↓ delta** = direction vs baseline run (improved/regressed/unchanged)
- **(n/n)** = how many rows met the threshold out of total
- **Green highlight** = best score across all runs
- Metrics without thresholds (like `samples_evaluated`) show as plain informational numbers

### Row Details table
Each cell shows per-evaluator scores: `● SimilarityEvaluator: 2`
- Green ● = this row met the threshold
- Red ● = this row missed — **this is why the run failed**

### Status
- `PASS (100% · 5/5)` = all rows met all thresholds
- `FAIL (60% · 3/5)` = 3 of 5 rows passed, 2 failed → the specific rows that failed explain the FAIL
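
The Status string is simple arithmetic over row counts. The sketch below reproduces the format from the two examples above; `formatStatus` is a hypothetical helper, and both the rounding to a whole percent and the rejection of an empty row set are assumptions.

```javascript
// Hypothetical helper that rebuilds the Status string from row counts.
function formatStatus(rowsPassed, rowsTotal) {
  if (rowsTotal < 1) throw new RangeError("rowsTotal must be at least 1");
  // PASS only when every row met all thresholds.
  const verdict = rowsPassed === rowsTotal ? "PASS" : "FAIL";
  const percent = Math.round((rowsPassed / rowsTotal) * 100);
  return `${verdict} (${percent}% · ${rowsPassed}/${rowsTotal})`;
}

console.log(formatStatus(5, 5));
console.log(formatStatus(3, 5));
```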

## Root Cause Checklist
When you find regressions:

1. **Which rows failed?** → Check Row Details for red ● dots
2. **Which evaluator failed?** → The evaluator with red dots tells you what's weak
3. **Is it the model?** → Compare same dataset across models to isolate
4. **Is it the dataset?** → Some questions are inherently harder (real-time, ambiguous)
5. **Is it the agent instructions?** → Compare agent versions on same dataset
6. **Is it random variance?** → Run the same config 2-3 times and compare
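
For the variance check in step 6, a rough comparison of a new score against the spread of repeated baseline runs can help. This is a crude heuristic sketch, not a statistical test and not a toolkit feature; `summarizeRepeats` and `withinNoise` are hypothetical names, and the scores are made-up sample values.

```javascript
// Summarize the scores of one evaluator across repeated runs of the same config.
function summarizeRepeats(scores) {
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  return { mean, min, max, spread: max - min };
}

// Crude check: does the new score fall inside the range seen across the repeats?
function withinNoise(repeatedBaselineScores, candidateScore) {
  const { min, max } = summarizeRepeats(repeatedBaselineScores);
  return candidateScore >= min && candidateScore <= max;
}

const repeats = [4, 5, 3]; // illustrative scores from three runs of the same config
console.log(summarizeRepeats(repeats));
console.log(withinNoise(repeats, 3.5));
console.log(withinNoise(repeats, 2));
```

With only two or three repeats the range is a weak estimate, so treat a score outside it as a prompt to investigate rather than as proof of a regression.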

## Guardrails
- Do not infer causality from correlation alone.
- Separate observations (data from artifacts) from hypotheses (plausible causes).
- Keep remediation advice tied to reproducible checks.
- When comparing runs with different datasets, do NOT analyze row-level changes — they're different questions.

## Examples
- "My eval went from PASS to FAIL after changing model"
→ `agentops eval compare --runs <old>,<new> -f html`. Check Evaluators for ↓ regressed metrics and Row Details for newly-failing rows.
- "Which specific questions are failing?"
→ Open the HTML report, scroll to Row Details — each row shows the actual score per evaluator with ● Met/Missed.
- "Is gpt-4.1 better than gpt-5.1 for my use case?"
→ Create two run.yaml files (same dataset, different model), run both, compare. The Evaluators table with row pass rates tells you which model handles your questions better.
- "Why is CI failing now?"
→ `agentops eval compare --runs <last_pass>,latest -f html`. The Status line shows `FAIL (80% · 4/5)` — one row regressed. Row Details shows which.

## Learn More
- Documentation: https://github.com/Azure/agentops
- PyPI: https://pypi.org/project/agentops-toolkit/
113 changes: 113 additions & 0 deletions .github/plugins/agentops/skills/agentops-observability-triage/SKILL.md
@@ -0,0 +1,113 @@
---
name: agentops-observability-triage
description: Guide users on observability and triage workflows for AgentOps evaluations. Trigger when users ask about tracing, monitoring, dashboards, alerts, run health, production triage, or understanding evaluation outputs. Common phrases include "set up tracing", "monitor evals", "create alerts", "triage failed evaluations", "observability", "understand eval results", "what do these scores mean". Install agentops-toolkit via pip. Tracing and monitoring commands are planned for a future release.
---

# AgentOps Observability Triage

> **Prerequisite:** Install the AgentOps CLI with `pip install agentops-toolkit`.

## Purpose
Provide practical observability guidance using current reporting artifacts. Frame tracing/monitoring as planned future features while showing what's available today — including HTML reports with visual indicators and N-run comparison dashboards.

## When to Use
- User asks how to monitor ongoing evaluation quality.
- User asks for tracing, dashboards, or alerts.
- User needs triage steps after an unexpected evaluation outcome.
- User asks what the evaluation scores and indicators mean.

## Available Commands

```bash
agentops eval run [-c <config>] [-f md|html|all] # Generate results
agentops report [--in <results.json>] [-f md|html|all] # Regenerate report
agentops eval compare --runs <id1>,<id2>[,...] [-f md|html|all] # Compare N runs
```

## Planned Commands (Not Yet Available)

```bash
agentops trace init # Initialize tracing
agentops monitor setup # Set up monitoring
agentops monitor dashboard # Configure dashboards
agentops monitor alert # Configure alerts
```

## Triage Workflow

### Quick triage (single run)
1. `agentops eval run -f html` — run and generate HTML report
2. Open `report.html` — check overall status, threshold checks, item verdicts
3. If FAIL: look at which evaluator thresholds were missed

### Deep triage (comparison)
1. `agentops eval compare --runs <baseline>,latest -f html`
2. Open `comparison.html` — visual dashboard with:
- **Status**: `PASS (100% · 5/5)` or `FAIL (60% · 3/5)` — immediate pass rate
- **Evaluators**: ● dots (Met/Missed), ↑↓ arrows (direction vs baseline), (n/n) row rates
- **Row Details**: per-row scores showing exactly which questions failed
3. Check if regression is real (threshold flip) or noise (minor shift within threshold)

### Multi-run trending
1. Run the same config multiple times over days/weeks
2. Compare all: `agentops eval compare --runs <oldest>,<middle>,<latest> -f html`
3. The Evaluators table shows trend direction for each metric across all runs

### Model selection
1. Create run configs for each candidate model (same dataset + bundle)
2. Run each: `agentops eval run -c <model-config> -f html`
3. Compare: `agentops eval compare --runs <model1>,<model2>,<model3> -f html`
4. Report auto-detects "Model Comparison" and shows side-by-side with best highlighting
5. Pick the model that meets all thresholds with the best balance of quality, latency, and cost
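
The final selection step can be sketched as "filter by thresholds, then rank". Everything in this example is assumed for illustration: the candidate records, the field names, and the `qualityPerSecond` weighting are invented, and `pickModel` is not a toolkit function. How to weigh quality against latency and cost remains a project decision.

```javascript
// Keep only candidates whose run passed, then take the one with the highest score.
function pickModel(candidates, score) {
  const eligible = candidates.filter((c) => c.status === "PASS");
  if (eligible.length === 0) return null; // no candidate met the thresholds
  return eligible.reduce((best, c) => (score(c) > score(best) ? c : best));
}

// Illustrative data only; not real measurements.
const candidates = [
  { name: "model-a", status: "PASS", similarity: 4.2, latencySeconds: 6.0 },
  { name: "model-b", status: "FAIL", similarity: 4.6, latencySeconds: 3.0 },
  { name: "model-c", status: "PASS", similarity: 4.0, latencySeconds: 2.0 },
];

// Example weighting (an assumption): quality per second of latency.
const qualityPerSecond = (c) => c.similarity / c.latencySeconds;

console.log(pickModel(candidates, qualityPerSecond).name);
```

Here `model-b` has the highest raw quality but is excluded first because its run failed its thresholds.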

## Understanding Report Indicators

### HTML visual indicators
- **● green dot** — evaluator score Met the threshold target
- **● red dot** — evaluator score Missed the threshold target
- **↑ green arrow** — score improved vs baseline
- **↓ red arrow** — score regressed vs baseline
- **→ gray arrow** — unchanged
- **Green highlighted cell** — best score across all compared runs
- **(3/5)** — 3 out of 5 rows met this evaluator's threshold
- **Muted gray text** — informational metric (no threshold, e.g., samples_evaluated)
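
The dot, arrow, and row-rate indicators combine two independent comparisons: score versus threshold, and score versus baseline. The sketch below is a plain-text illustration with a hypothetical `describeCell` helper; the `higherIsBetter` flag and the inclusive threshold comparison are assumptions, since latency-style metrics improve when they go down.

```javascript
// Hypothetical helper producing a text form of one Evaluators cell.
function describeCell({ score, baselineScore, threshold, higherIsBetter = true, rowsMet, rowsTotal }) {
  // Dot: absolute check against the threshold target.
  const met = higherIsBetter ? score >= threshold : score <= threshold;

  // Arrow: direction relative to the baseline run.
  const delta = score - baselineScore;
  let arrow = "→"; // unchanged
  if (delta !== 0) {
    const improved = higherIsBetter ? delta > 0 : delta < 0;
    arrow = improved ? "↑" : "↓";
  }
  return `${met ? "Met" : "Missed"} ${score} ${arrow} (${rowsMet}/${rowsTotal})`;
}

// Latency rose from 4.84s to 6s but stays under a 10s limit: Met, yet trending worse.
console.log(
  describeCell({ score: 6, baselineScore: 4.84, threshold: 10, higherIsBetter: false, rowsMet: 5, rowsTotal: 5 })
);
```

A cell can therefore show a regressed arrow while still meeting its threshold, which is the "noise" case that does not trigger REGRESSIONS DETECTED.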

### Status
- `PASS (100% · 5/5)` — all 5 rows met all thresholds
- `FAIL (80% · 4/5)` — 4 of 5 rows passed, 1 failed
- PASS = all row thresholds met · FAIL = one or more rows missed

### Verdict
- **NO REGRESSIONS** — no run's status flipped PASS→FAIL vs baseline, and no previously passing row now fails
- **REGRESSIONS DETECTED** — at least one run has newly-failing rows or status flipped

### Comparison types (auto-detected)
- **Model Comparison** — comparing different models on same dataset
- **Agent Comparison** — comparing different agents on same dataset
- **Dataset Coverage** — testing same model/agent on different datasets
- **General** — multiple parameters vary

## Report Formats
- `-f md` — Markdown (default), good for PRs and CI logs
- `-f html` — professional visual dashboard, best for analysis
- `-f all` — generates both

## Guardrails
- Do not present tracing or monitoring commands as available today.
- Do not imply real-time dashboards or alerts currently exist.
- Always pivot to concrete available outputs when asked about unimplemented features.
- The HTML report IS the current dashboard — it's self-contained, no server needed.

## Examples
- "How do I set up tracing?"
→ Tracing (`agentops trace init`) is planned. For now, use `-f html` to generate visual reports with per-row score breakdowns.
- "Can I monitor eval quality over time?"
→ Run evals periodically and compare: `agentops eval compare --runs <old>,<mid>,<new> -f html`. The trend arrows show quality direction.
- "What does FAIL (80% · 4/5) mean?"
→ 4 of 5 dataset rows met all evaluator thresholds, 1 row missed. Check Row Details to see which row and which evaluator scored below target.
- "What do the colored dots mean?"
→ Green ● = score met the threshold target, Red ● = missed. In the Evaluators table, this is the aggregate score; in Row Details, it's per-row.

## Learn More
- Documentation: https://github.com/Azure/agentops
- PyPI: https://pypi.org/project/agentops-toolkit/