Release v0.1.5 by Dongbumlee · Pull Request #73 · Azure/agentops

Dongbumlee · 2026-04-14T03:45:14Z

Release v0.1.5

Fixed

Make release pipeline resilient to VSIX version conflicts
Resolve 31 mypy type errors and enforce mypy in CI
Resolve 18 ruff lint errors (F401/F811/F841) across 6 files
Fix UV cache race condition in CI

Changed

Upgrade GitHub Actions to Node.js 24 runtimes
Apply ruff-format across source and workflows

See CHANGELOG.md for full details.

- Add utils/telemetry.py with lazy OTel imports and span context managers
- Instrument runner.py with three-layer schema (CICD + GenAI + agentops.eval)
- Root span per eval run, item spans per row, evaluator child spans
- Activated via AGENTOPS_OTLP_ENDPOINT env var (opt-in, zero overhead)
- Graceful no-op when opentelemetry-sdk is not installed
- 16 unit tests covering disabled, degraded, and enabled states

Refs: #14
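The opt-in, lazy-import pattern described above can be sketched as follows. This is a minimal illustration, not the contents of utils/telemetry.py: the helpers `get_tracer`, `span`, and `run_eval` and the `agentops.eval.*` attribute keys are hypothetical. The sketch checks only for the OpenTelemetry API package (`opentelemetry.trace`) and leaves out wiring an OTLP exporter to `AGENTOPS_OTLP_ENDPOINT`.

```python
import contextlib
import os
from typing import Any, Dict, Iterator, List, Optional

# Env var named in the commit message; tracing stays off unless it is set.
ENDPOINT_ENV = "AGENTOPS_OTLP_ENDPOINT"


def get_tracer() -> Optional[Any]:
    """Return an OpenTelemetry tracer, or None when tracing is disabled or unavailable."""
    if not os.environ.get(ENDPOINT_ENV):
        return None  # disabled: opentelemetry is never imported
    try:
        from opentelemetry import trace  # lazy import, only when opted in
    except ImportError:
        return None  # degraded: package missing, fall back to a no-op
    # Provider/exporter setup for the OTLP endpoint is omitted from this sketch.
    return trace.get_tracer("agentops")


@contextlib.contextmanager
def span(name: str, attributes: Optional[Dict[str, Any]] = None) -> Iterator[Optional[Any]]:
    """Open a span if tracing is active; otherwise yield None and do nothing."""
    tracer = get_tracer()
    if tracer is None:
        yield None
        return
    with tracer.start_as_current_span(name, attributes=attributes) as current:
        yield current


def run_eval(rows: List[Any]) -> int:
    """Hypothetical runner: one root span per run, one child span per row."""
    scored = 0
    with span("agentops.eval.run", {"agentops.eval.row_count": len(rows)}):
        for index, _row in enumerate(rows):
            with span("agentops.eval.item", {"agentops.eval.row_index": index}):
                scored += 1  # evaluator calls (and their child spans) would go here
    return scored
```

When the variable is unset, `opentelemetry` is never imported and each `with span(...)` block costs one environment lookup, which is one way to get the "disabled" and "degraded" states the unit tests cover.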

…rs (#51)

- Expand evaluator frozensets: add response_completeness, groundedness_pro, retrieval, tool_selection to existing sets
- Add new frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy, tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS (task_adherence)
- Fix NLP evaluator names (bleu_score, rouge_score, etc.) to match _to_builtin_evaluator_name conversion
- Add default initialization_parameters for RougeScoreEvaluator (rouge_type)
- Build item_schema dynamically: include tool_definitions and context_field when evaluators need them
- Refactor _default_foundry_input_mapping to frozenset-based routing
- Improve error handling: log evaluator errors when score is null, improve runner error message with --verbose hint
- Add CI/CD integration models documentation: PR gate, scheduled, post-deploy, multi-env promotion, Azure DevOps pipeline
- Add gating best practices: threshold design, evaluator selection by scenario
- Add supported evaluators reference table (22 evaluators by category)
- Add ~20 unit tests for all new evaluator data_mapping patterns
- All 22 evaluators verified end-to-end with live Foundry cloud evaluation

Closes #51
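A sketch of what frozenset-based routing and a dynamic item_schema might look like. The three frozenset names come from the commit messages, but each set here holds only the members the PR text names (the real sets are larger). The `{{item.query}}`-style placeholder strings and the helpers `default_input_mapping` and `build_item_schema` are assumptions for illustration, not the actual Foundry mapping syntax or the backend's code.

```python
from typing import Any, Dict, Iterable

# Partial sets: only members named in the PR text (the real sets hold more).
_EVALUATORS_NEEDING_CONTEXT = frozenset({"groundedness_pro"})
_EVALUATORS_NEEDING_TOOL_DEFS_ONLY = frozenset(
    {"tool_input_accuracy", "tool_output_utilization", "tool_call_success"}
)
_EVALUATORS_NEEDING_OUTPUT_ITEMS = frozenset({"task_adherence"})


def default_input_mapping(evaluator: str, context_field: str = "context") -> Dict[str, str]:
    """Route an evaluator to its data mapping via frozenset membership (hypothetical)."""
    # Placeholder strings are illustrative, not a confirmed mapping syntax.
    mapping = {"query": "{{item.query}}"}
    if evaluator in _EVALUATORS_NEEDING_OUTPUT_ITEMS:
        mapping["response"] = "{{sample.output_items}}"
    else:
        mapping["response"] = "{{sample.output_text}}"
    if evaluator in _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        mapping["tool_definitions"] = "{{item.tool_definitions}}"
    if evaluator in _EVALUATORS_NEEDING_CONTEXT:
        mapping["context"] = "{{item." + context_field + "}}"
    return mapping


def build_item_schema(evaluators: Iterable[str], context_field: str = "context") -> Dict[str, Any]:
    """Add optional columns only when some selected evaluator needs them."""
    names = set(evaluators)
    properties: Dict[str, Any] = {"query": {"type": "string"}}
    if names & _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        properties["tool_definitions"] = {"type": "array"}
    if names & _EVALUATORS_NEEDING_CONTEXT:
        properties[context_field] = {"type": "string"}
    return {"type": "object", "properties": properties, "required": ["query"]}
```

Membership tests against frozensets keep the routing declarative: moving an evaluator between categories becomes a one-line change to a set literal.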

TestSpanAttributesWhenEnabled requires opentelemetry to be installed because the code paths import SpanKind/StatusCode when tracing is enabled. Use pytest.importorskip to skip the class in CI where opentelemetry is not a declared dependency.

Add OTLP tracing support and documentation for evaluation runs

- Fix skill paths: plugins/agentops/skills/ (not .github/plugins/) across README, tutorial-copilot-skills (6 instances)
- Fix CLI contract: add eval compare and config cicd as implemented commands in AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix source tree listings: add cicd.py, comparison.py, telemetry.py, workflows/ across AGENTS.md, how-it-works.md
- Fix test listings: add test_cicd, test_cli_commands, test_comparison, test_telemetry across AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix agent_tools_baseline: TaskCompletionEvaluator + ToolCallAccuracyEvaluator (not SimilarityEvaluator placeholder) in README, AGENTS.md, how-it-works.md
- Fix JSONL path: data/<name>.jsonl (not datasets/) in ci-github-actions.md
- Fix init flag: --dir (not --path) in README
- Fix evaluator guidance: add frozenset names and NLP_DEFAULT_INIT_PARAMS to copilot-instructions.md
- Add context_field to dataset format docs in AGENTS.md
- Add rouge_type default note to evaluator reference doc
- Update planned command message to list all 5 available commands
- Add --format flag to CLI usage examples

- Add services/browse.py with list_bundles, show_bundle, list_runs, show_run
- Replace planned stubs with working implementations in cli/app.py
- bundle list: shows all bundles with evaluators and threshold count
- bundle show: displays full bundle detail (evaluators, thresholds, metadata)
- run list: shows all past runs with status, bundle, dataset, duration
- run show: displays full run detail (metrics, thresholds, items, Foundry URL)
- Add 16 unit tests (service + CLI) in test_browse.py
- All commands are read-only, no side effects, no Azure API calls
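A read-only listing like `run list` can be a plain directory scan that never writes or calls out. The sketch below assumes a hypothetical layout (`<results_dir>/<run_id>/results.json` with `status`, `bundle`, `dataset`, and `duration_s` keys) and a hypothetical `RunSummary` type; the actual files that services/browse.py reads are not described in this PR.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class RunSummary:
    run_id: str
    status: str
    bundle: Optional[str]
    dataset: Optional[str]
    duration_s: Optional[float]


def list_runs(results_dir: Path) -> List[RunSummary]:
    """Read-only scan of past runs (hypothetical layout: <results_dir>/<run_id>/results.json)."""
    runs: List[RunSummary] = []
    if not results_dir.is_dir():
        return runs  # no runs yet: return an empty list, create nothing
    for run_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        results_file = run_dir / "results.json"
        if not results_file.is_file():
            continue  # skip incomplete or unrelated directories
        data = json.loads(results_file.read_text(encoding="utf-8"))
        runs.append(
            RunSummary(
                run_id=run_dir.name,
                status=data.get("status", "unknown"),
                bundle=data.get("bundle"),
                dataset=data.get("dataset"),
                duration_s=data.get("duration_s"),
            )
        )
    return runs
```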

Split app.py (487 lines) into focused command modules:

- app.py (114 lines): root app, global callback, init, sub-app registration
- eval_commands.py (108 lines): eval run, eval compare
- report_commands.py (66 lines): report, report show/export stubs
- browse_commands.py (152 lines): bundle list/show, run list/show/view
- config_commands.py (56 lines): config cicd, config validate/show stubs
- planned.py (57 lines): dataset, monitor, trace, model, agent stubs
- _planned.py (12 lines): shared planned command helper

No behavior changes. All 96 tests pass.

- Move dataset stubs to dataset_commands.py (ready for Tier 2 implementation)
- Inline monitor/trace/model/agent stubs in app.py (1-2 commands each)
- Delete planned.py (no more catch-all stub file)

feat: extend Foundry cloud evaluator coverage to 22 built-in evaluators (#51)

feat: implement bundle list/show and run list/show commands

Add agentops-workspace-setup, agentops-browse-inspect, and agentops-dataset-management skills covering all remaining CLI commands not handled by existing evaluation-focused skills.

- agentops-workspace-setup: init, config cicd, config validate/show
- agentops-browse-inspect: bundle list/show, run list/show/view
- agentops-dataset-management: dataset creation, YAML/JSONL format, field mapping, planned validate/describe/import commands

…ills

Add '## Before You Start' section to 5 downstream skills enforcing workspace verification before proceeding:

- agentops-run-evals
- agentops-investigate-regression
- agentops-observability-triage
- agentops-browse-inspect
- agentops-dataset-management

Each skill now instructs the agent to check for .agentops/ directory and redirect to agentops-workspace-setup skill if missing. This provides soft enforcement at the skill layer, complementing the hard CLI enforcement (FileNotFoundError) already in place.
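The hard CLI enforcement mentioned above amounts to a precondition check that raises FileNotFoundError when .agentops/ is missing. A minimal sketch, assuming a hypothetical `require_workspace` helper that looks only in the given project root (the real CLI's lookup rules are not shown here):

```python
from pathlib import Path

WORKSPACE_DIR = ".agentops"


def require_workspace(project_root: Path) -> Path:
    """Return the workspace directory, or fail fast if it was never scaffolded."""
    workspace = project_root / WORKSPACE_DIR
    if not workspace.is_dir():
        raise FileNotFoundError(
            f"{workspace} not found. Run 'agentops init' to scaffold the workspace first."
        )
    return workspace
```

Raising a built-in exception keeps the check easy to catch in the CLI layer and easy to assert in unit tests.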

- ci.yml: add build-vsix validation job (package only, no publish)
- staging.yml: add publish-vsix-prerelease job (vsce publish --pre-release)
- release.yml: add publish-vsix stable job + attach VSIX to GitHub Release
- cut-release.yml: sync package.json version via jq, update PR body/checklist
- _build.yml: update header comments (Python-only, no VSIX logic)
- plugins/agentops: add README.md, CHANGELOG.md, .vscodeignore, package.json scripts

Requires VSCE_PAT secret in staging and release GitHub environments.

Feature/agentops skills

fix: remove duplicate _planned_command definition (ruff F811)

ci: integrate VSIX packaging with pre-release into CI/CD pipeline

* ci(vsix): upload VSIX artifact from CI and staging pipelines
* ci: publish VSIX pre-release to Marketplace on develop pushes

Add publish-vsix-dev job to ci.yml that publishes the VSIX as a pre-release to the VS Code Marketplace on every push to develop, mirroring the publish-dev job that pushes to TestPyPI.

- Gated on push to develop only (not PRs)
- Depends on lint, test, and build-vsix jobs
- Uses staging environment (VSCE_PAT secret)
- Packages with --pre-release flag
- Includes step summary with Marketplace link

* ci(vsix): sync VSIX version from git tags in all pipelines

Derive package.json version at CI time from the latest git tag using git describe + jq. Mimics setuptools-scm patch-increment behavior:

- On exact tag (release): use tag version directly (e.g. v0.2.0 -> 0.2.0)
- Off tag (develop/PR): increment patch (e.g. v0.1.0 + commits -> 0.1.1)

Applied to all 4 VSIX jobs:

- ci.yml: build-vsix, publish-vsix-dev
- staging.yml: publish-vsix-prerelease
- release.yml: publish-vsix

Also adds fetch-depth: 0 to checkout steps so git describe has access to the full tag history.

* fix(vsix): update Marketplace link placeholder in README
* docs(vsix): improve README: remove misleading Prerequisites, expand Usage examples
* docs(vsix): remove CLI install note; skills handle setup automatically
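The two version rules above, plus the jq step that writes the result into package.json, can be expressed in Python as a sketch. The pipelines themselves use git describe and jq; `derive_version`, `sync_package_json`, and the strict `vX.Y.Z` tag pattern are assumptions made for illustration.

```python
import json
import re
from pathlib import Path

# Accept "v0.1.0" or "0.1.0"; anything else is rejected in this sketch.
_TAG_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def derive_version(latest_tag: str, on_exact_tag: bool) -> str:
    """Exact tag -> tag version; commits after the tag -> bump the patch component."""
    match = _TAG_RE.match(latest_tag)
    if match is None:
        raise ValueError(f"unsupported tag format: {latest_tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if not on_exact_tag:
        patch += 1  # develop/PR builds: next patch release
    return f"{major}.{minor}.{patch}"


def sync_package_json(path: Path, version: str) -> None:
    """Write the derived version into package.json, keeping all other fields."""
    manifest = json.loads(path.read_text(encoding="utf-8"))
    manifest["version"] = version
    path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
```

For example, `derive_version("v0.2.0", on_exact_tag=True)` gives `"0.2.0"` and `derive_version("v0.1.0", on_exact_tag=False)` gives `"0.1.1"`, matching the two examples in the commit message.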

* ci(vsix): sync VSIX version from git tags in all pipelines

Derive package.json version at CI time from the latest git tag using git describe + jq. Mimics setuptools-scm patch-increment behavior:

- On exact tag (release): use tag version directly (e.g. v0.2.0 -> 0.2.0)
- Off tag (develop/PR): increment patch (e.g. v0.1.0 + commits -> 0.1.1)

Applied to all 4 VSIX jobs:

- ci.yml: build-vsix, publish-vsix-dev
- staging.yml: publish-vsix-prerelease
- release.yml: publish-vsix

Also adds fetch-depth: 0 to checkout steps so git describe has access to the full tag history.

* fix(vsix): update Marketplace link placeholder in README
* docs(vsix): improve README: remove misleading Prerequisites, expand Usage examples
* docs(vsix): remove CLI install note; skills handle setup automatically
* fix: resolve all mypy type errors across 6 source files

- foundry_backend.py: assert narrowing for Optional[str], Dict type widening
- config_loader.py: added BaseModel import and TypeVar bound
- reporter.py: removed conflicting annotations, renamed shadowed loop vars
- browse.py: split Path | None annotation into separate assignment
- comparison.py: fixed _compute_metric_direction return type, renamed loop vars
- runner.py: added imports, Pydantic model constructors

Replace git describe --tags --abbrev=0 with git tag -l --sort=-v:refname to find the latest tag across ALL branches, not just reachable ones. Root cause: v0.1.3 tag on main was not reachable from develop, so git describe found v0.1.2 and derived version 0.1.3, which already existed on the Marketplace. Also adds continue-on-error on dev/staging VSIX publish steps as a safety net against 'already exists' errors.
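The root cause is easy to reproduce with a version-aware sort over all tags. A sketch, assuming only `vX.Y.Z` tags matter (git's `v:refname` ordering also handles suffixed tags, which this ignores); the helpers `version_key`, `latest_tag`, and `next_patch` are illustrative, not pipeline code.

```python
import re
from typing import Iterable, List, Optional, Tuple

_TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def version_key(tag: str) -> Optional[Tuple[int, int, int]]:
    """Numeric sort key for vX.Y.Z tags; None for any other tag."""
    match = _TAG_RE.match(tag)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch))


def latest_tag(tags: Iterable[str]) -> str:
    """Highest version among all tags, like the first line of `git tag -l --sort=-v:refname`."""
    keyed: List[Tuple[Tuple[int, int, int], str]] = []
    for tag in tags:
        key = version_key(tag)
        if key is not None:
            keyed.append((key, tag))
    if not keyed:
        raise ValueError("no vX.Y.Z tags found")
    return max(keyed)[1]


def next_patch(tag: str) -> str:
    """Off-tag rule: bump the patch component of the given tag."""
    key = version_key(tag)
    if key is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    major, minor, patch = key
    return f"{major}.{minor}.{patch + 1}"
```

With only the tags reachable from develop (v0.1.2 at most), the next patch is 0.1.3, the version that already existed; adding v0.1.3 from main yields 0.1.4. Comparing integer tuples also orders v0.1.10 after v0.1.9, which a plain string sort would not.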

Skills are now managed exclusively via 'agentops skills install'. The 'init' command only scaffolds .agentops/ and prints guidance.

# Conflicts:
# .github/copilot-instructions.md
# AGENTS.md
# CHANGELOG.md
# README.md
# docs/ci-github-actions.md
# docs/foundry-evaluation-sdk-built-in-evaluators.md
# docs/how-it-works.md
# plugins/agentops/skills/agentops-investigate-regression/SKILL.md
# plugins/agentops/skills/agentops-observability-triage/SKILL.md
# plugins/agentops/skills/agentops-run-evals/SKILL.md
# src/agentops/backends/foundry_backend.py
# src/agentops/cli/app.py
# src/agentops/core/config_loader.py
# src/agentops/services/runner.py
# tests/unit/test_foundry_backend.py

- Add continue-on-error on 'Publish stable to VS Code Marketplace' step to tolerate 'already exists' errors from staging pre-release
- Decouple github-release job from publish-vsix result so GitHub Release proceeds when PyPI publish succeeds regardless of VSIX outcome
- Update CHANGELOG with v0.1.4 section and workflow fix entry

…ications

- Remove old develop-only plugin skills (workspace-setup, browse-inspect, dataset-management)
- Sync plugin skills from templates (8 canonical skills)
- Update plugin package.json to reference 8 skills
- Wire browse_commands.py into app.py (bundle list/show, run list/show/view)
- Port develop evaluator name fixes (bleu->bleu_score, rouge->rouge_score, etc.)
- Add _EVALUATORS_NEEDING_TOOL_DEFS_ONLY and _EVALUATORS_NEEDING_OUTPUT_ITEMS
- Add _NLP_DEFAULT_INIT_PARAMS for rouge_score
- Move groundedness_pro from _SAFETY_EVALUATORS to _EVALUATORS_NEEDING_CONTEXT
- Fix tests for new evaluator classifications
- Fix skills tests for init/skills decoupling
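The corrected names (bleu_score, rouge_score) follow a mechanical CamelCase-to-snake_case mapping from the SDK class names. The PR does not show how `_to_builtin_evaluator_name` is implemented, so `to_builtin_name` below is a hypothetical sketch of that kind of conversion, not the project's function.

```python
import re


def to_builtin_name(class_name: str) -> str:
    """Hypothetical conversion, e.g. 'BleuScoreEvaluator' -> 'bleu_score'."""
    suffix = "Evaluator"
    base = class_name[: -len(suffix)] if class_name.endswith(suffix) else class_name
    # Insert '_' before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", base).lower()
```

Acronym-style class names would split letter by letter under this rule, so a real implementation may need an override table for those cases.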

# Conflicts:
# .github/workflows/release.yml
# .github/workflows/staging.yml
# CHANGELOG.md

# Conflicts:
# .github/workflows/release.yml
# .github/workflows/staging.yml
# docs/release-process.md

# Conflicts:
# .github/workflows/release.yml
# .github/workflows/staging.yml

- reporter.py: rename shadowed loop variable t -> it
- subprocess_backend.py: add type: ignore for deprecated backend_config
- eval_engine.py: add assert for str|None narrowing
- foundry_backend.py: add asserts and fix Dict type annotations
- runner.py: import Backend type, use Pydantic model constructors
- ci.yml: remove continue-on-error from mypy step (now a hard gate)
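Two of the mypy fixes listed above, assert-based narrowing of `Optional[str]` and renaming a shadowed loop variable, are illustrated below. The functions `format_run_label` and `summarize` are hypothetical stand-ins, not code from the affected files.

```python
from typing import Dict, List, Optional


def format_run_label(bundle: str, run_id: Optional[str]) -> str:
    # Without the assert, mypy reports that "None" has no attribute "lower".
    assert run_id is not None, "run_id must be set before formatting"
    return f"{bundle}:{run_id.lower()}"  # run_id is narrowed to str here


def summarize(thresholds: Dict[str, float], items: List[Dict[str, float]]) -> List[str]:
    lines: List[str] = []
    for metric, t in thresholds.items():  # mypy infers t: float
        lines.append(f"{metric} >= {t}")
    # `for t in items` here would rebind t to Dict[str, float]; mypy reports an
    # incompatible assignment, so this loop uses a distinct name (cf. t -> it).
    for it in items:
        passed = all(it.get(metric, 0.0) >= limit for metric, limit in thresholds.items())
        lines.append("pass" if passed else "fail")
    return lines
```

Because `python -O` strips assert statements, this kind of assert documents an invariant for the type checker rather than validating user input.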

Upgrade all action versions across all 5 workflow files to resolve Node.js 20 deprecation warnings (forced Node.js 24 after June 2, 2026):

- actions/checkout v4 -> v6
- actions/upload-artifact v4 -> v7
- actions/download-artifact v4 -> v7
- astral-sh/setup-uv v6 -> v7
- actions/setup-node v4 -> v6
- actions/setup-python v5 -> v6
- Node.js runtime version 20 -> 22 (LTS)

pypa/gh-action-pypi-publish unchanged (Docker container action).

Add enable-cache: false to lint, coverage, and publish-dev jobs. These shared cache keys with test matrix entries, causing 'Unable to reserve cache' warnings during post-job cleanup. The test matrix jobs remain sole cache owners per (OS, Python) combo.

placerda and others added 30 commits March 27, 2026 01:42

evaluations

0a5ecfb

evaluations

3ef9f54

docs: add OTLP telemetry to AGENTS.md and copilot-instructions

a9f0afe

Merge pull request #56 from Azure/feature/otlp-tracing

5b5aa6e

merge: resolve CHANGELOG conflict with develop (OTLP tracing)

500966d

refactor: remove planned.py, move stubs to their command files

6017f3a

Merge pull request #57 from Azure/feature/issue-51-extend-evaluators

ce9b628

Merge branch 'develop' into feature/browse-commands

4e81967

Merge pull request #59 from Azure/feature/browse-commands

ba9a465

evaluations

267a274

Merge branch 'main' of github.com:Azure/agentops into develop

1b81ad9

fix: remove duplicate _planned_command definition (ruff F811)

dd9172b

feat(skills): add coverage for report show/export, model list, agent list planned commands

e409dd0

fix: remove duplicate _planned_command definition (ruff F811)

42d5a9a

style: apply ruff-format to comparison.py and test_cli_commands.py

a9653f2

ci(vsix): add LICENSE to plugin package

f2cd7ce

ci(vsix): set publisher to AgentOpsToolkit and fix package name

903be4b

Merge pull request #68 from Azure/feature/agentops-skills

4f248d3

Merge pull request #66 from Azure/fix/develop-lint-f811

daaf73e

Merge pull request #67 from Azure/feature/skill-vsix-cicd

03b6b74

Dongbumlee and others added 17 commits April 13, 2026 16:28

docs: add CHANGELOG entries for mypy fixes and VSIX pipeline

9314553

refactor: decouple skills installation from agentops init

f0aeffe

Merge remote-tracking branch 'origin/main' into develop

f4a50c0

Merge branch 'feature/evaluations' into develop

bde17ff

Merge remote-tracking branch 'origin/develop' into develop

77e283b

fix: resolve 18 ruff lint errors (F401/F811/F841) across 6 files

bdcf8e1

style: apply ruff-format and normalize whitespace across source and workflows

f04841a

chore: prepare release 0.1.5

98bf1eb

Dongbumlee temporarily deployed to staging on April 14, 2026 03:47 with GitHub Actions (status: Inactive)

Dongbumlee merged commit 73a768a into main Apr 14, 2026
5 checks passed

Dongbumlee deleted the release/v0.1.5 branch April 14, 2026 07:15

Dongbumlee merged 47 commits into main from release/v0.1.5

Dongbumlee commented Apr 14, 2026
