Release 0.1.3 by placerda · Pull Request #39 · Azure/agentops

placerda · 2026-03-24T03:01:01Z

Release 0.1.3

This PR merges release/0.1.3 into main to ship the 0.1.3 release of agentops-toolkit.

What's in this release

Added: HTTP backend (type: http), agentops eval compare for N-run regression detection, distributable Copilot skills under skills/, plugin.json for VS Code plugin install, marketplace.json for Copilot CLI plugin install.

Changed: Versioning migrated to setuptools-scm (tag-driven). Release pipeline redesigned into _build.yml, staging.yml, and release.yml.

Fixed: Cloud evaluation API path (Foundry Project Evals API). Comparison metric polarity for lower-is-better metrics. report --format html and -f all now generate correct outputs. CI secret PIPY_TOKEN corrected to PYPI_TOKEN.

TestPyPI verification

Tag v0.1.3 has been pushed. Verify after CI publishes:
pip install --index-url https://test.pypi.org/simple/ agentops-toolkit==0.1.3
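
After installing from TestPyPI, the installed version can also be checked from Python. The following is a minimal sketch, not part of the release tooling: the helper name `is_release_installed` and its `lookup` parameter are assumptions made for illustration, and only the standard library's `importlib.metadata` is used.

```python
from importlib import metadata


def is_release_installed(dist_name, expected, lookup=metadata.version):
    """Return True when the installed distribution reports the expected version.

    `lookup` defaults to importlib.metadata.version; it is a parameter only so
    the check can be exercised without installing anything (an assumption of
    this sketch, not something the project provides).
    """
    try:
        return lookup(dist_name) == expected
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return False


if __name__ == "__main__":
    ok = is_release_installed("agentops-toolkit", "0.1.3")
    print("agentops-toolkit 0.1.3 installed:", ok)
```

Run it in the same virtual environment where the `pip install` command above was executed.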

…kills (#13)
- Implement agentops eval compare --runs for baseline comparison
- Pydantic models: ComparisonResult, MetricDelta, ThresholdDelta, ItemDelta
- Comparison service with run discovery (timestamps, latest, paths)
- Comparison markdown report generator
- Exit codes: 0=no regressions, 2=regressions, 1=error
- Metric polarity: lower-is-better metrics (<=) correctly show improved
- Fix Foundry cloud evaluation to use Project Evals API
- Use {project_endpoint}/openai/evals?api-version=2025-11-15-preview
- Supports azure_ai_evaluator testing criteria (New Foundry Experience)
- Replaces OpenAI SDK path that lacked azure_ai_evaluator support
- Add distributable Copilot skills under .github/plugins/agentops/skills/
- agentops-run-evals, agentops-investigate-regression, agentops-observability-triage
- GitHub-based distribution (Channel 1) matching azure-skills pattern
- Remove .github/skills/ internal folder (superseded by plugins)
- Align azure-ai-projects version to >=2.0.1 across all files
- Update README, AGENTS.md, how-it-works.md, CHANGELOG
- 87 unit tests passing
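
The exit-code contract and the metric-polarity rule listed above can be illustrated with a short sketch. This is not the code from the PR: the function names `direction` and `exit_code` and their signatures are hypothetical, and only the three exit codes and the lower-is-better rule come from the commit message.

```python
EXIT_OK = 0          # no regressions
EXIT_ERROR = 1       # the comparison itself failed
EXIT_REGRESSION = 2  # at least one regression


def direction(baseline, current, lower_is_better=False):
    """Classify a metric change, honoring polarity.

    For lower-is-better metrics (for example latency), a decrease counts as
    an improvement; for all other metrics an increase does.
    """
    delta = current - baseline
    if delta == 0:
        return "unchanged"
    improved = delta < 0 if lower_is_better else delta > 0
    return "improved" if improved else "regressed"


def exit_code(compare_fn):
    """Map a comparison outcome to the documented exit codes.

    `compare_fn` is a hypothetical callable returning a collection of
    regressed metric names; any exception is reported as exit code 1.
    """
    try:
        regressions = compare_fn()
    except Exception:
        return EXIT_ERROR
    return EXIT_REGRESSION if regressions else EXIT_OK
```

In CI, a non-zero exit code would fail the step, and the distinction between 1 and 2 lets a pipeline tell a broken comparison apart from a real regression.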

- docs/tutorial-baseline-comparison.md: step-by-step comparison workflow, CI patterns, regression investigation guide
- docs/tutorial-copilot-skills.md: skill installation (GitHub, manual, project), usage examples, skill quality evaluation with AgentOps
- Update README docs section with new tutorial links

- tutorial-baseline-comparison.md: add model-direct vs agent evaluation target section with when-to-use, pros/cons, expected score differences, cross-target comparison guidance, detailed regression investigation patterns, and baseline management strategies
- tutorial-copilot-skills.md: add context on why skills matter, detailed usage examples showing before/after skill behavior, skill quality evaluation workflow using AgentOps itself

…ferences
- tutorial-model-direct.md: add when/why to use model-direct, how scores differ from agent, dataset writing guidance, transitioning to agent eval
- tutorial-basic-foundry-agent.md: add model-vs-agent decision guide, score expectations, named vs legacy agents, why both agent_id and model are needed, cross-scenario comparison guidance, evaluation scenarios table

…d updated Copilot skills
- Unified ComparisonResult model supporting 2+ runs
- HTML report format with modern light theme, visual indicators (dots, arrows, badges)
- --format md|html|all flag on eval run, eval compare, and report commands
- Comparison dimension detection (Model/Agent/Dataset Coverage/General)
- Conditions section showing fixed vs varying parameters
- Merged Evaluators table with dual evaluation (Met/Missed + direction)
- Row Details with per-row evaluator scores and Met/Missed
- Smart number formatting (integers without decimals)
- Met/Missed threshold terminology
- Status with pass rate (PASS 100% 5/5 / FAIL 80% 4/5)
- Regression detection based on threshold flips only (not numeric noise)
- Informational metrics (samples_evaluated) shown as plain values
- Foundry backend command strings enriched with target + model
- Updated all 3 Copilot skills with N-run workflows, HTML guide, model benchmarking
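
To make "threshold flips only" concrete, here is a hedged sketch of how a Met/Missed flip, the pass-rate status line, and integer-friendly number formatting could fit together. All function names are hypothetical, and the two-decimal fallback in `smart_number` is an assumption; only the behaviors named in the list above are taken from the commit message.

```python
def met(value, threshold, lower_is_better=False):
    """Return True when a score meets its threshold (Met), else False (Missed)."""
    return value <= threshold if lower_is_better else value >= threshold


def is_regression(baseline, current, threshold, lower_is_better=False):
    """Flag a regression only when the result flips from Met to Missed.

    A score that drops but stays on the Met side is treated as numeric
    noise and is not reported.
    """
    return met(baseline, threshold, lower_is_better) and not met(
        current, threshold, lower_is_better
    )


def smart_number(x):
    """Render integers without decimals; other values with two decimals (assumed)."""
    return str(int(x)) if float(x).is_integer() else f"{x:.2f}"


def status_line(passed, total):
    """Build a status string such as 'PASS 100% 5/5' or 'FAIL 80% 4/5'."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = 100 * passed / total
    label = "PASS" if passed == total else "FAIL"
    return f"{label} {smart_number(rate)}% {passed}/{total}"
```

With a threshold of 4.0, a drop from 4.8 to 4.2 is not a regression under this rule, while a drop from 4.2 to 3.9 is.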

- Migrate versioning from static pyproject.toml to setuptools-scm (version derived automatically from git tags, no manual bumps)
- Split release workflow into 3 files with reusable build:
  - _build.yml: reusable build workflow (test + package)
  - staging.yml: release/* branch -> TestPyPI + verify
  - release.yml: v* tag -> TestPyPI + verify -> PyPI (approval) -> GitHub Release
- CLI smoke test: agentops --version, --help, init in temp directory
- Fix secret reference PIPY_TOKEN -> PYPI_TOKEN, add TEST_PYPI_TOKEN
- Two GitHub environments: staging (TestPyPI) and release (PyPI, approval gate)
- Add consistent workflow index header across all CI/CD files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Step-by-step guide covering staging (TestPyPI) and production (PyPI) release workflows, setuptools-scm versioning, environment setup, release checklist, and troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

feat: GitOps release pipeline with TestPyPI staging and setuptools-scm

fix: move mid-file import to top to resolve ruff E402 lint error

…arison feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills

…arison ci: remove duplicate test runs on release branches, add pre-commit hooks

…arison ci: add cut-release workflow, remove duplicate CI, pre-commit hooks, docs updates

ci: auto-publish dev builds to TestPyPI on develop push

The default output path was hardcoded to report.md regardless of the --format flag. When -f html was passed, the file was still named report.md and the returned path did not reflect the actual format.
- Default output filename now uses the correct suffix based on report_format (report.html when html, report.md otherwise)
- Returned output_report_path now tracks which file was actually written
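
The filename rule described above is small enough to sketch directly. The helper name `default_report_path` and its parameters are assumptions for illustration; the actual change lives in the project's report code and may be structured differently.

```python
from pathlib import Path


def default_report_path(output_dir, report_format):
    """Pick the default report filename from the requested format.

    Mirrors the rule in the commit message: report.html when the format is
    'html', report.md otherwise.
    """
    suffix = ".html" if report_format == "html" else ".md"
    return Path(output_dir) / f"report{suffix}"
```

Before the fix, the equivalent of this function returned report.md unconditionally, which is why -f html produced a misnamed file.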

…arison fix: report command respects --format html parameter

ReportResult now carries an optional html_report_path so the CLI can display both output paths when report_format is 'all'.
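
A rough sketch of the shape this implies follows. The names `ReportResult`, `output_report_path`, `html_report_path`, and `report_format` come from the commit messages; everything else is assumed, including the use of a dataclass, the helper functions, and the choice of the Markdown file as the primary path when the format is 'all'.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ReportResult:
    # Path of the primary report that was written.
    output_report_path: Path
    # Set only when an HTML report was written alongside the primary one.
    html_report_path: Optional[Path] = None


def result_for(output_dir, report_format):
    """Hypothetical helper: compute the paths a report run would return."""
    out = Path(output_dir)
    md_path, html_path = out / "report.md", out / "report.html"
    if report_format == "html":
        return ReportResult(output_report_path=html_path)
    if report_format == "all":
        # Assumption: Markdown is the primary path, HTML the extra one.
        return ReportResult(output_report_path=md_path, html_report_path=html_path)
    return ReportResult(output_report_path=md_path)


def describe(result):
    """Lines a CLI could print; shows both paths when both exist."""
    lines = [f"Report: {result.output_report_path}"]
    if result.html_report_path is not None:
        lines.append(f"HTML report: {result.html_report_path}")
    return lines
```

Keeping `html_report_path` optional means callers that only request one format see the same result shape as before.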

…arison fix: -f all now generates both md and html reports

Dongbumlee and others added 30 commits March 19, 2026 12:18

evaluation

60e4c1e

Merge branch 'develop' into feature/gitops-release-pipeline

93cb885

Merge pull request #30 from Azure/feature/gitops-release-pipeline

5164836

fix: move mid-file import to top to resolve ruff E402 lint error

3a41a42

chore: add pre-commit with ruff lint and format hooks

da92b4f

ci: remove macOS from test matrix to avoid queue delays

8c07ac7

Merge pull request #31 from Azure/feature/gitops-release-pipeline

1327dee

merge: resolve conflicts with develop branch

677f770

Merge pull request #32 from Azure/feature/copilot-skill-baseline-comp…

9603575

ci: remove duplicate test runs on release branches

cf73554

Merge pull request #33 from Azure/feature/copilot-skill-baseline-comp…

b2df5ae

ci: add cut-release workflow and update release docs

4b59925

Merge pull request #34 from Azure/feature/copilot-skill-baseline-comp…

1e7584f

ci: auto-publish dev builds to TestPyPI on develop push

08df77b

Merge pull request #35 from Azure/feature/ci-dev-publish

d99aec2

Merge pull request #36 from Azure/feature/copilot-skill-baseline-comp…

8e2f428

fix: -f all now generates both md and html reports

4c62cd3

Merge pull request #37 from Azure/feature/copilot-skill-baseline-comp…

fea1e64

chore: stage all develop changes for release/0.2.0

2917226

Merge remote-tracking branch 'origin/develop' into release/0.2.0

50dce6a

chore: prepare release 0.2.0

bb7eef6

chore: correct release version to 0.1.3

b2c663e

fix: remove unused imports and variable flagged by ruff (F401, F841)

91d8d18

placerda temporarily deployed to staging March 24, 2026 03:07 — with GitHub Actions Inactive

placerda had a problem deploying to release March 24, 2026 03:08 — with GitHub Actions Failure

placerda closed this Mar 24, 2026

placerda deleted the release/0.1.3 branch March 24, 2026 17:28

placerda temporarily deployed to staging March 25, 2026 01:29 — with GitHub Actions Inactive

placerda had a problem deploying to release March 25, 2026 01:30 — with GitHub Actions Failure

placerda temporarily deployed to staging March 25, 2026 01:46 — with GitHub Actions Inactive

placerda had a problem deploying to release March 25, 2026 01:47 — with GitHub Actions Failure

placerda wants to merge 31 commits into main from release/0.1.3

2 participants