Merged
47 commits
0a5ecfb
evaluations
placerda Mar 27, 2026
3ef9f54
evaluations
placerda Apr 1, 2026
21455d3
feat: add OTLP tracing foundation for evaluation runs
Dongbumlee Apr 3, 2026
a9f0afe
docs: add OTLP telemetry to AGENTS.md and copilot-instructions
Dongbumlee Apr 3, 2026
f932d98
feat: extend Foundry cloud evaluator coverage to 22 built-in evaluato…
Dongbumlee Apr 7, 2026
ab2736a
fix: skip telemetry tests when opentelemetry is not installed
Dongbumlee Apr 7, 2026
5b5aa6e
Merge pull request #56 from Azure/feature/otlp-tracing
Dongbumlee Apr 7, 2026
500966d
merge: resolve CHANGELOG conflict with develop (OTLP tracing)
Dongbumlee Apr 7, 2026
46ede70
docs: align all documentation with current implementation
Dongbumlee Apr 7, 2026
f887f65
feat: implement bundle list/show and run list/show commands
Dongbumlee Apr 7, 2026
2d4a52c
refactor: split CLI into command modules
Dongbumlee Apr 7, 2026
6017f3a
refactor: remove planned.py, move stubs to their command files
Dongbumlee Apr 7, 2026
ce9b628
Merge pull request #57 from Azure/feature/issue-51-extend-evaluators
placerda Apr 13, 2026
4e81967
Merge branch 'develop' into feature/browse-commands
placerda Apr 13, 2026
ba9a465
Merge pull request #59 from Azure/feature/browse-commands
placerda Apr 13, 2026
267a274
evaluations
placerda Apr 13, 2026
1b81ad9
Merge branch 'main' of github.com:Azure/agentops into develop
placerda Apr 13, 2026
dd9172b
fix: remove duplicate _planned_command definition (ruff F811)
Dongbumlee Apr 13, 2026
6f18db6
feat(skills): add 3 new skills for full CLI coverage
Dongbumlee Apr 13, 2026
c6c7c79
feat(skills): add active workspace guard clauses to all downstream sk…
Dongbumlee Apr 13, 2026
e409dd0
feat(skills): add coverage for report show/export, model list, agent …
Dongbumlee Apr 13, 2026
42d5a9a
fix: remove duplicate _planned_command definition (ruff F811)
Dongbumlee Apr 13, 2026
a9653f2
style: apply ruff-format to comparison.py and test_cli_commands.py
Dongbumlee Apr 13, 2026
9d1f235
ci: integrate VSIX packaging with pre-release into CI/CD pipeline
Dongbumlee Apr 13, 2026
f2cd7ce
ci(vsix): add LICENSE to plugin package
Dongbumlee Apr 13, 2026
903be4b
ci(vsix): set publisher to AgentOpsToolkit and fix package name
Dongbumlee Apr 13, 2026
4f248d3
Merge pull request #68 from Azure/feature/agentops-skills
Dongbumlee Apr 13, 2026
daaf73e
Merge pull request #66 from Azure/fix/develop-lint-f811
Dongbumlee Apr 13, 2026
03b6b74
Merge pull request #67 from Azure/feature/skill-vsix-cicd
Dongbumlee Apr 13, 2026
e3b7640
ci(vsix): upload VSIX artifact from CI and staging pipelines (#69)
Dongbumlee Apr 13, 2026
60be078
ci(vsix): sync VSIX version from git tags in all pipelines (#70)
Dongbumlee Apr 13, 2026
b48765a
fix: resolve all mypy type errors across 6 source files (#71)
Dongbumlee Apr 14, 2026
9314553
docs: add CHANGELOG entries for mypy fixes and VSIX pipeline
Dongbumlee Apr 14, 2026
e608b36
fix: use global tag sort for VSIX version derivation
Dongbumlee Apr 14, 2026
f0aeffe
refactor: decouple skills installation from agentops init
placerda Apr 14, 2026
d0849a4
Merge remote-tracking branch 'origin/develop' into feature/evaluations
placerda Apr 14, 2026
e0a1753
fix: make release pipeline resilient to VSIX version conflicts
Dongbumlee Apr 14, 2026
61c9683
post-merge: unify skills, wire browse_commands, fix evaluator classif…
placerda Apr 14, 2026
f4a50c0
Merge remote-tracking branch 'origin/main' into develop
Dongbumlee Apr 14, 2026
bde17ff
Merge branch 'feature/evaluations' into develop
placerda Apr 14, 2026
77e283b
Merge remote-tracking branch 'origin/develop' into develop
placerda Apr 14, 2026
bdcf8e1
fix: resolve 18 ruff lint errors (F401/F811/F841) across 6 files
Dongbumlee Apr 14, 2026
bb29c2b
fix: resolve 31 mypy type errors and enforce mypy in CI
Dongbumlee Apr 14, 2026
3380d64
ci: upgrade GitHub Actions to Node.js 24 runtimes
Dongbumlee Apr 14, 2026
956d091
ci: disable uv cache on non-matrix jobs to fix race condition
Dongbumlee Apr 14, 2026
f04841a
style: apply ruff-format and normalize whitespace across source and w…
Dongbumlee Apr 14, 2026
98bf1eb
chore: prepare release 0.1.5
Dongbumlee Apr 14, 2026
307 changes: 224 additions & 83 deletions .github/copilot-instructions.md

Large diffs are not rendered by default.

16 changes: 8 additions & 8 deletions .github/extensions/agentops-skills/extension.mjs
@@ -7,7 +7,7 @@ const SKILLS = {
"run-evals": {
keywords: [
"run eval", "start agentops", "run.yaml", "regenerate report",
- "evaluation results", "agentops init", "agentops eval", "agentops report",
+ "evaluation results", "agentops init", "agentops eval", "agentops report generate",
"run an evaluation", "initialize agentops", "results.json", "report.md",
"eval run", "run config", "evaluation output",
],
@@ -19,13 +19,13 @@ Guide through the implemented AgentOps evaluation workflow from workspace setup
### Available Commands
- agentops init [--path <dir>] — Initialize workspace
- agentops eval run — Execute evaluation
- - agentops report — Regenerate report from results.json
+ - agentops report generate — Regenerate report from results.json

### Typical Workflow
1. Initialize workspace: agentops init
2. Confirm run config exists (.agentops/run.yaml)
3. Execute evaluation: agentops eval run
- 4. Regenerate markdown report: agentops report
+ 4. Regenerate markdown report: agentops report generate
5. Inspect outputs under .agentops/results/latest/

### Outputs
@@ -58,14 +58,14 @@ Guide through regression investigation using currently available AgentOps output

### Available Commands
- agentops eval run — Generate fresh artifacts
- - agentops report — Regenerate report
+ - agentops report generate — Regenerate report

### Planned (not implemented)
- agentops eval compare --runs ID1,ID2

### Investigation Steps
1. Run fresh evaluation: agentops eval run
- 2. Regenerate report: agentops report
+ 2. Regenerate report: agentops report generate
3. Compare current artifacts to baseline manually
4. Report factual deltas, then propose controlled next steps

@@ -99,13 +99,13 @@ Provide honest observability guidance: use current reporting artifacts today, fr

### Available Commands (for triage today)
- agentops eval run
- - agentops report
+ - agentops report generate

### Planned/Stubbed (NOT implemented)
- agentops trace init
- agentops monitor setup
- - agentops monitor dashboard
- - agentops monitor alert
+ - agentops monitor show
+ - agentops monitor configure

### Current Triage Approach
- Use report.md for quick operational triage (what failed, severity).
260 changes: 260 additions & 0 deletions .github/skills/agentops-config/SKILL.md
@@ -0,0 +1,260 @@
---
name: agentops-config
description: Infer evaluation scenario from codebase and generate run.yaml. Trigger when users ask to configure an evaluation, create a run config, detect the evaluation scenario, or choose a bundle. Common phrases include "configure", "run.yaml", "which bundle", "set up eval", "scenario", "endpoint", "agentops config", "create run config", "what should I evaluate". Install agentops-toolkit via pip.
---

# AgentOps Config

Generate a complete `.agentops/run.yaml` by inspecting the workspace. Infer everything possible — ask only for values that cannot be found.

## Step 0 — Prerequisites

1. Run `pip install agentops-toolkit` if `agentops` command is not available.
2. Run `agentops init` if `.agentops/` directory does not exist.

## Step 1 — Detect scenario

Analyze the codebase holistically to understand the agent's **primary purpose**:

1. Read the README, system prompt, main entry point, and tool/function definitions.
2. Identify which patterns are present:
- **Tool use**: `@tool`, `tool_definitions`, `function_call`, MCP tools, tool schemas
- **Retrieval**: search client, vector store, retriever, embeddings, index references, context fetching
- **Conversation**: chat history, multi-turn, session management, assistant persona
- **Direct model call**: completion API, no orchestration logic

3. Pick the scenario that best matches the agent's **primary job** — not just the first signal found:

| Primary purpose | `bundle.name` |
|---|---|
| Agent that orchestrates tools to complete tasks | `agent_workflow_baseline` |
| Agent that retrieves context to answer questions | `rag_quality_baseline` |
| Conversational assistant (chat, Q&A, persona) | `conversational_agent_baseline` |
| Direct model call with no agent logic | `model_quality_baseline` |

> A RAG agent that uses a search tool is still primarily RAG — pick `rag_quality_baseline`, not `agent_workflow_baseline`. The test is: *what is the agent's main job?*

4. State what you found: *"Detected RAG scenario — the agent's primary purpose is answering questions using retrieved context (found retriever logic in retriever.py)."*

5. **Responsible AI (optional)**: Ask *"Do you also want to include safety evaluators (violence, hate/unfairness, self-harm, protected material)?"* If yes, add the safety evaluators from `safe_agent_baseline` to the selected bundle.
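
As a rough illustration of the evidence-gathering part of Step 1, the sketch below records which files show each pattern family. The regular expressions and the `collect_signals` helper are hypothetical and cover only a few of the patterns listed above. The result is evidence for the judgment about the agent's primary job, not a replacement for it.

```python
import re

# Hypothetical, deliberately small pattern set; extend it for a real codebase.
SIGNALS = {
    "tool_use": re.compile(r"@tool|tool_definitions|function_call"),
    "retrieval": re.compile(r"retriever|vector.?store|embedding|search.?client", re.IGNORECASE),
    "conversation": re.compile(r"chat.?history|multi.?turn|session", re.IGNORECASE),
}


def collect_signals(files):
    """Map each signal family to the sorted list of paths whose text matches it.

    `files` is a dict of path -> source text (an assumed input shape).
    """
    found = {family: [] for family in SIGNALS}
    for path, text in sorted(files.items()):
        for family, pattern in SIGNALS.items():
            if pattern.search(text):
                found[family].append(path)
    return found
```

A codebase where both `tool_use` and `retrieval` show up still needs the primary-purpose test from item 3. The file lists only make it easier to state what was found and where.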

## Step 2 — Detect endpoint type

| Search for | `endpoint.kind` | `hosting` | `execution_mode` |
|---|---|---|---|
| `AIProjectClient`, `azure-ai-projects`, Foundry URL | `foundry_agent` | `foundry` | `remote` |
| FastAPI, Flask, Django, Express — JSON POST/response | `http` | `containerapps` / `aks` / `local` | `remote` |
| SSE/streaming, non-standard body, custom auth, no server | — | `local` / `containerapps` / `aks` | `local` (callable) |

Also check: `agent_id` references, Dockerfile, bicep, ACA manifests, `.env` files.

**Discover the endpoint URL** — search in this order, stop when found:
1. Env vars: `$env:AGENT_HTTP_URL`, `$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`
2. `.env` / `.env.local` in project root
3. `.azure/<env>/.env` files
4. Azure CLI (if hosting is `containerapps` or ACA-deployed):
```bash
az containerapp list -g $RG --subscription $SUB --query "[].{name:name, url:properties.configuration.ingress.fqdn}" -o json
```
5. Azure CLI (if hosting is App Service / webapp):
```bash
az webapp list -g $RG --subscription $SUB --query "[].{name:name, url:defaultHostName}" -o json
```

**Detect auth pattern** — search the codebase:
- `dapr-api-token` / `APP_API_TOKEN` → Dapr auth
- `X-API-KEY` / `api_key` / `API_KEY` → API key auth
- `Authorization` / `Bearer` → Bearer token auth
- Nothing found → assume no auth needed
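
The auth mapping above can be sketched as a small lookup. The `detect_auth` helper and its return labels are hypothetical. Checking the families in the order they are listed is an assumption, since the list does not say which one wins when several match.

```python
import re

# Checked in the order of the list above (assumed priority: first match wins).
AUTH_PATTERNS = [
    ("dapr", re.compile(r"dapr-api-token|APP_API_TOKEN")),
    ("api_key", re.compile(r"X-API-KEY|api_key|API_KEY")),
    ("bearer", re.compile(r"Authorization|Bearer")),
]


def detect_auth(source_text: str) -> str:
    """Return 'dapr', 'api_key', 'bearer', or 'none' for the given source text."""
    for kind, pattern in AUTH_PATTERNS:
        if pattern.search(source_text):
            return kind
    return "none"
```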

## Step 3 — Discover Azure values

Search these locations **in order** — stop as soon as each value is found:

1. Shell environment variables (`$env:AZURE_AI_FOUNDRY_PROJECT_ENDPOINT`, etc.)
2. `.env`, `.env.local` in project root
3. `.azure/<env>/.env` files (azd environments) — also read `AZURE_RESOURCE_GROUP`, `AZURE_SUBSCRIPTION_ID`
4. `.azure/config.json` for `defaultEnvironment` to pick the right env folder
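
A minimal sketch of this lookup order, assuming simple `KEY=VALUE` lines in the `.env` files and a `defaultEnvironment` key in `.azure/config.json`. The `parse_env_file` and `discover_value` helpers are hypothetical names, and the sketch only reads the default azd environment rather than scanning every folder under `.azure/`.

```python
import json
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; skips blanks, comments, and malformed lines."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def discover_value(name: str, root: Path, environ=None):
    """Return (value, source) from the first location defining `name`, else (None, None)."""
    environ = os.environ if environ is None else environ
    if environ.get(name):
        return environ[name], "env"
    for filename in (".env", ".env.local"):
        found = parse_env_file(root / filename).get(name)
        if found:
            return found, filename
    config = root / ".azure" / "config.json"
    if config.is_file():
        env_name = json.loads(config.read_text(encoding="utf-8")).get("defaultEnvironment")
        if env_name:
            found = parse_env_file(root / ".azure" / env_name / ".env").get(name)
            if found:
                return found, f".azure/{env_name}/.env"
    return None, None
```

Returning the source alongside the value makes it straightforward to fill the "Source" column of the confirmation table in Step 7.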

If values are **not found** in any file, run Azure CLI discovery:
```bash
# 1. Confirm auth and get subscription
az account show --query "{sub:id, tenant:tenantId}" -o json

# 2. Find AI Services / Foundry accounts and endpoints
az cognitiveservices account list -o json --query "[].{name:name, rg:resourceGroup, endpoint:properties.endpoint, kind:kind}"

# 3. Find model deployments
az cognitiveservices account deployment list --name $ACCOUNT -g $RG --subscription $SUB --query "[].{name:name, model:properties.model.name, version:properties.model.version}" -o json

# 4. Find Foundry projects
az resource list -g $RG --subscription $SUB --resource-type "Microsoft.CognitiveServices/accounts/projects" --query "[].name" -o tsv

# 5. Build endpoints from discovered names
# Foundry: https://<account>.services.ai.azure.com/api/projects/<project>
# OpenAI: https://<account>.openai.azure.com/
```

**Pre-warm Azure token** (prevents intermittent `AzureCliCredential.get_token failed` errors):
```bash
az account get-access-token --resource "https://cognitiveservices.azure.com" --query accessToken -o tsv
```
If this fails, Azure CLI auth is not active — ask the user to run `az login`.

**Only ask the user** if no `.azure/` dir exists AND no env vars are set.

## Step 4 — Pick evaluator model

Read the bundle YAML from `.agentops/bundles/<bundle-name>.yaml`. If it contains **any** evaluator with `source: foundry`, then an evaluator model is required.

Pick from available deployments (discovered in Step 3): `gpt-4.1-mini` > `gpt-4o-mini` > `gpt-4o` > `gpt-4.1`. **Never** use reasoning models (`o1`, `o3`, `o4`, `gpt-5`, `gpt-5-nano`).

If no suitable deployment was found, ask: *"Which model deployment should score your agent's responses? (e.g. gpt-4o-mini)"*
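
The preference order can be sketched as follows, assuming the deployments are available as dictionaries with the `name` and `model` keys produced by the deployment query in Step 3. The `pick_evaluator_model` helper is hypothetical. Because only models on the preference list are ever returned, reasoning models are excluded without a separate filter.

```python
# Preference order from Step 4, most preferred first.
PREFERRED_MODELS = ["gpt-4.1-mini", "gpt-4o-mini", "gpt-4o", "gpt-4.1"]


def pick_evaluator_model(deployments):
    """Return the deployment name of the most preferred model, or None.

    `deployments` is a list of dicts with 'name' (deployment name) and
    'model' (underlying model name) keys. None means: ask the user.
    """
    for preferred in PREFERRED_MODELS:
        for deployment in deployments:
            if deployment.get("model") == preferred:
                return deployment["name"]
    return None
```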

## Step 4.5 — Verify evaluator compatibility

After selecting the bundle, **verify every evaluator is importable** before writing run.yaml.

1. Read `.agentops/bundles/<bundle-name>.yaml` and extract all `class_name` values.
2. Run the import probe:
```bash
python -c "
evaluators = []
missing = []
for name in [<comma-separated class names as strings>]:
    try:
        getattr(__import__('azure.ai.evaluation', fromlist=[name]), name)
        evaluators.append(name)
    except (ImportError, AttributeError):
        missing.append(name)
print('available:', evaluators)
print('missing:', missing)
"
```
3. If any evaluators are missing, set `enabled: false` on them in the bundle and remove matching thresholds.
4. Warn the user: *"Disabled [X] — not available in your azure-ai-evaluation SDK version."*

**Key compatibility facts:**
- `F1ScoreEvaluator`, `BleuScoreEvaluator`, `RougeScoreEvaluator` are local text-overlap — they do not need Azure credentials.
- `TaskCompletionEvaluator`, `ToolCallAccuracyEvaluator`, `IntentResolutionEvaluator` are SDK-version-dependent — always verify.
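
The same probe can also be written as a small reusable function. This is a sketch rather than part of the AgentOps CLI: `probe_evaluators` is a hypothetical helper, and the `module_name` parameter exists only so the logic can be exercised without the Azure SDK installed.

```python
import importlib


def probe_evaluators(class_names, module_name="azure.ai.evaluation"):
    """Split class names into (available, missing) for the given module."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The whole package is absent, so every evaluator counts as missing.
        return [], list(class_names)
    available, missing = [], []
    for name in class_names:
        try:
            getattr(module, name)
            available.append(name)
        except (ImportError, AttributeError):
            missing.append(name)
    return available, missing
```

The `missing` list is what items 3 and 4 act on: those evaluators get `enabled: false` in the bundle and are named in the warning to the user.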

## Step 5 — Write run.yaml

Write `.agentops/run.yaml` using the exact structure below. Fill **every** value — no placeholders.

**Remote (Foundry agent):**
```yaml
version: 1
target:
  type: agent
  hosting: foundry
  execution_mode: remote
  endpoint:
    kind: foundry_agent
    agent_id: <DISCOVERED_OR_ASK>
    model: <DISCOVERED_MODEL>
    project_endpoint_env: AZURE_AI_FOUNDRY_PROJECT_ENDPOINT
bundle:
  name: <DETECTED_BUNDLE>
dataset:
  name: dataset
output:
  write_report: true
```

**Remote (HTTP):**
```yaml
version: 1
target:
  type: agent
  hosting: containerapps
  execution_mode: remote
  endpoint:
    kind: http
    url_env: AGENT_HTTP_URL
    request_field: message
    response_field: text
bundle:
  name: <DETECTED_BUNDLE>
dataset:
  name: dataset
output:
  write_report: true
```

**Local (callable adapter):**
```yaml
version: 1
target:
  type: agent
  hosting: local
  execution_mode: local
  local:
    callable: callable_adapter:run_evaluation
bundle:
  name: <DETECTED_BUNDLE>
dataset:
  name: dataset
output:
  write_report: true
```

## Step 6 — Write callable adapter (if execution_mode is local)

Create `callable_adapter.py` at the **project root**. Use ONLY stdlib (`urllib.request`, `json`, `os`).

```python
import json
import os
import urllib.request

ENDPOINT = os.environ["AGENT_HTTP_URL"]
# Auth: set APP_API_TOKEN, API_KEY, or remove the auth lines below.
AUTH_TOKEN = os.environ.get("APP_API_TOKEN", "")

def run_evaluation(input_text: str, context: dict) -> dict:
    body = json.dumps({"message": input_text}).encode()
    headers = {"Content-Type": "application/json"}
    if AUTH_TOKEN:
        headers["dapr-api-token"] = AUTH_TOKEN  # Change header name if using API_KEY or Bearer
    req = urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return {"response": data.get("text", data.get("response", ""))}
```

After writing the file, run: `python -c "from callable_adapter import run_evaluation; print('OK')"`

**Auth detection:** Search codebase for `dapr-api-token`/`APP_API_TOKEN` → Dapr header. `X-API-KEY`/`api_key`/`API_KEY` → API key header. `Authorization`/`Bearer` → recommend HTTP backend with `auth_header_env` instead. Nothing found → remove auth lines.

## Step 7 — Present and confirm

Present a **confirmation table** with all discovered values (do not ask each one separately):
```
┌─────────────────────────┬──────────────────────────────────────────┬────────┐
│ Setting │ Value │ Source │
├─────────────────────────┼──────────────────────────────────────────┼────────┤
│ Scenario │ RAG │ code │
│ Bundle │ rag_quality_baseline │ auto │
│ Endpoint kind │ http │ code │
│ Endpoint URL │ https://myapp.azurecontainerapps.io/chat │ .env │
│ Auth │ dapr-api-token (APP_API_TOKEN) │ code │
│ Evaluator model │ gpt-4o-mini │ Azure │
│ Project endpoint │ https://acct.services.ai.azure.com/... │ .env │
└─────────────────────────┴──────────────────────────────────────────┴────────┘
```

Ask: *"Everything look correct? (yes / edit)"*

Explain: scenario detected, endpoint type, evaluator model chosen, and any assumptions made.

## Rules

- **NEVER** include `backend:` key in run.yaml — it causes a runtime error.
- **NEVER** leave `<replace-...>` placeholders in run.yaml.
- **NEVER** fabricate `agent_id`, model names, or endpoint URLs.
- **NEVER** use dotted import paths like `.agentops.callable_adapter` — they fail.
- **NEVER** use a bundle without running the evaluator import probe first (Step 4.5).
- Do not generate datasets — delegate to `/agentops-dataset`.
- Do not run evaluations — delegate to `/agentops-eval`.
- Always state what you detected and what you assumed.
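
Some of these rules lend themselves to a mechanical check before the file is presented. The sketch below is a hypothetical line-based linter (`check_run_yaml` is not an AgentOps command) that uses only the standard library. It flags a `backend:` key, any leftover `<...>` placeholder, and a dotted module path in `callable:`. It does not parse YAML, so treat it as a quick sanity pass rather than validation.

```python
import re


def check_run_yaml(text: str) -> list:
    """Return a list of human-readable problems found in run.yaml text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if re.match(r"backend\s*:", stripped):
            problems.append(f"line {number}: 'backend:' key is not allowed")
        if re.search(r"<[^>\s]+>", stripped):
            problems.append(f"line {number}: unresolved placeholder")
        match = re.match(r"callable\s*:\s*(\S+)", stripped)
        # The module part is everything before ':'; it must not contain dots.
        if match and "." in match.group(1).split(":")[0]:
            problems.append(f"line {number}: dotted import path in callable")
    return problems
```

An empty list means none of the three checks fired. The remaining rules (no fabricated values, probe run first) still need to be confirmed by hand.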