Release 0.1.3 #39

Closed
placerda wants to merge 31 commits into main from release/0.1.3

@placerda
Contributor

Release 0.1.3

This PR merges release/0.1.3 into main to ship the 0.1.3 release of agentops-toolkit.

What's in this release

Added: HTTP backend (type: http), agentops eval compare for N-run regression detection, distributable Copilot skills under skills/, plugin.json for VS Code plugin install, marketplace.json for Copilot CLI plugin install.

Changed: Versioning migrated to setuptools-scm (tag-driven). Release pipeline redesigned into _build.yml, staging.yml, and release.yml.

Fixed: Cloud evaluation API path (Foundry Project Evals API). Comparison metric polarity for lower-is-better metrics. report --format html and -f all now generate correct outputs. CI secret PIPY_TOKEN corrected to PYPI_TOKEN.

TestPyPI verification

Tag v0.1.3 has been pushed. Verify after CI publishes:
pip install --index-url https://test.pypi.org/simple/ agentops-toolkit==0.1.3
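Once the install succeeds, the installed version can also be checked from Python. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version_matches` is hypothetical and not part of agentops-toolkit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version_matches(package: str, expected: str) -> bool:
    """Return True if `package` is installed at exactly `expected`.

    Hypothetical helper for manual verification; returns False when the
    package is not installed in the current environment.
    """
    try:
        return version(package) == expected
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    # Package name and version are taken from the pip command above.
    ok = installed_version_matches("agentops-toolkit", "0.1.3")
    print("version check passed" if ok else "version check failed")
```

Run it inside the same virtual environment used for the TestPyPI install.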

Dongbumlee and others added 30 commits March 19, 2026 12:18
…kills (#13)

- Implement agentops eval compare --runs for baseline comparison
  - Pydantic models: ComparisonResult, MetricDelta, ThresholdDelta, ItemDelta
  - Comparison service with run discovery (timestamps, latest, paths)
  - Comparison markdown report generator
  - Exit codes: 0=no regressions, 2=regressions, 1=error
  - Metric polarity: lower-is-better metrics (<=) are correctly reported as improved when their value decreases

- Fix Foundry cloud evaluation to use Project Evals API
  - Use {project_endpoint}/openai/evals?api-version=2025-11-15-preview
  - Supports azure_ai_evaluator testing criteria (New Foundry Experience)
  - Replaces OpenAI SDK path that lacked azure_ai_evaluator support

- Add distributable Copilot skills under .github/plugins/agentops/skills/
  - agentops-run-evals, agentops-investigate-regression, agentops-observability-triage
  - GitHub-based distribution (Channel 1) matching azure-skills pattern
  - Remove .github/skills/ internal folder (superseded by plugins)

- Align azure-ai-projects version to >=2.0.1 across all files
- Update README, AGENTS.md, how-it-works.md, CHANGELOG
- 87 unit tests passing
- docs/tutorial-baseline-comparison.md: step-by-step comparison workflow,
  CI patterns, regression investigation guide
- docs/tutorial-copilot-skills.md: skill installation (GitHub, manual, project),
  usage examples, skill quality evaluation with AgentOps
- Update README docs section with new tutorial links
- tutorial-baseline-comparison.md: add model-direct vs agent evaluation
  target section with when-to-use, pros/cons, expected score differences,
  cross-target comparison guidance, detailed regression investigation
  patterns, and baseline management strategies
- tutorial-copilot-skills.md: add context on why skills matter, detailed
  usage examples showing before/after skill behavior, skill quality
  evaluation workflow using AgentOps itself
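The metric polarity fix in the comparison commit above can be illustrated with a short sketch. This is not the toolkit's implementation: `Direction`, `classify_delta`, and the use of the threshold operator string to infer polarity are assumptions made for illustration.

```python
from enum import Enum


class Direction(str, Enum):
    IMPROVED = "improved"
    REGRESSED = "regressed"
    UNCHANGED = "unchanged"


def classify_delta(baseline: float, current: float, criteria: str) -> Direction:
    """Classify a metric change while honouring polarity.

    `criteria` is a hypothetical threshold operator: ">=" is treated as
    higher-is-better and "<=" as lower-is-better.
    """
    if criteria not in (">=", "<="):
        raise ValueError(f"unsupported criteria: {criteria!r}")
    delta = current - baseline
    if delta == 0:
        return Direction.UNCHANGED
    higher_is_better = criteria == ">="
    # For lower-is-better metrics a negative delta is an improvement.
    got_better = delta > 0 if higher_is_better else delta < 0
    return Direction.IMPROVED if got_better else Direction.REGRESSED
```

With this rule, a latency-style metric that drops from 2.0 to 1.5 under a `<=` threshold is classified as improved rather than regressed.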
…ferences

- tutorial-model-direct.md: add when/why to use model-direct, how scores
  differ from agent, dataset writing guidance, transitioning to agent eval
- tutorial-basic-foundry-agent.md: add model-vs-agent decision guide,
  score expectations, named vs legacy agents, why both agent_id and model
  are needed, cross-scenario comparison guidance, evaluation scenarios table
…d updated Copilot skills

- Unified ComparisonResult model supporting 2+ runs
- HTML report format with modern light theme, visual indicators (dots, arrows, badges)
- --format md|html|all flag on eval run, eval compare, and report commands
- Comparison dimension detection (Model/Agent/Dataset Coverage/General)
- Conditions section showing fixed vs varying parameters
- Merged Evaluators table with dual evaluation (Met/Missed + direction)
- Row Details with per-row evaluator scores and Met/Missed
- Smart number formatting (integers without decimals)
- Met/Missed threshold terminology
- Status with pass rate (PASS 100% 5/5 / FAIL 80% 4/5)
- Regression detection based on threshold flips only (not numeric noise)
- Informational metrics (samples_evaluated) shown as plain values
- Foundry backend command strings enriched with target + model
- Updated all 3 Copilot skills with N-run workflows, HTML guide, model benchmarking
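Regression detection based on threshold flips, combined with the exit codes listed in the earlier comparison commit (0, 2, 1), might look roughly like the sketch below. The names `ThresholdOutcome`, `find_regressions`, and `exit_code_for` are hypothetical; the actual models in the PR are ComparisonResult, MetricDelta, ThresholdDelta, and ItemDelta, whose fields are not shown here.

```python
from dataclasses import dataclass

# Exit codes as stated in the commit message.
EXIT_OK = 0
EXIT_ERROR = 1
EXIT_REGRESSION = 2


@dataclass(frozen=True)
class ThresholdOutcome:
    """Hypothetical per-evaluator result for one run."""
    evaluator: str
    value: float
    met: bool


def find_regressions(
    baseline: list[ThresholdOutcome], current: list[ThresholdOutcome]
) -> list[str]:
    """Return evaluators whose threshold flipped from Met to Missed.

    Numeric changes that do not cross the threshold are ignored, so
    run-to-run noise is not reported as a regression.
    """
    before = {outcome.evaluator: outcome for outcome in baseline}
    regressed = []
    for outcome in current:
        previous = before.get(outcome.evaluator)
        if previous is not None and previous.met and not outcome.met:
            regressed.append(outcome.evaluator)
    return regressed


def exit_code_for(regressions: list[str]) -> int:
    """Map the regression list to a process exit code."""
    return EXIT_REGRESSION if regressions else EXIT_OK
```

In this sketch `EXIT_ERROR` would be returned by the caller when the comparison itself fails, for example when a run cannot be loaded.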
- Migrate versioning from static pyproject.toml to setuptools-scm
  (version derived automatically from git tags, no manual bumps)
- Split release workflow into 3 files with reusable build:
  - _build.yml: reusable build workflow (test + package)
  - staging.yml: release/* branch -> TestPyPI + verify
  - release.yml: v* tag -> TestPyPI + verify -> PyPI (approval) -> GitHub Release
- CLI smoke test: agentops --version, --help, init in temp directory
- Fix secret reference PIPY_TOKEN -> PYPI_TOKEN, add TEST_PYPI_TOKEN
- Two GitHub environments: staging (TestPyPI) and release (PyPI, approval gate)
- Add consistent workflow index header across all CI/CD files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Step-by-step guide covering staging (TestPyPI) and production (PyPI)
release workflows, setuptools-scm versioning, environment setup,
release checklist, and troubleshooting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
feat: GitOps release pipeline with TestPyPI staging and setuptools-scm
fix: move mid-file import to top to resolve ruff E402 lint error
…arison

feat: N-run comparison, HTML reports, smart comparison conditions, and updated Copilot skills
…arison

ci: remove duplicate test runs on release branches, add pre-commit hooks
…arison

ci: add cut-release workflow, remove duplicate CI, pre-commit hooks, docs updates
ci: auto-publish dev builds to TestPyPI on develop push
The default output path was hardcoded to report.md regardless of the
--format flag. When -f html was passed, the file was still named
report.md and the returned path did not reflect the actual format.

- Default output filename now uses the correct suffix based on
  report_format (report.html when html, report.md otherwise)
- Returned output_report_path now tracks which file was actually
  written
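The filename rule described in this fix can be sketched as follows. The function name `default_report_path` and its signature are assumptions; only the rule itself (report.html when the format is html, report.md otherwise) comes from the commit message.

```python
from pathlib import Path


def default_report_path(output_dir: Path, report_format: str) -> Path:
    """Pick the default report filename from the requested format.

    Hypothetical helper: "html" maps to report.html, anything else
    falls back to report.md, as the commit message describes.
    """
    suffix = "html" if report_format == "html" else "md"
    return output_dir / f"report.{suffix}"
```

Returning the computed path lets the caller report the file that was actually written instead of a hardcoded report.md.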
…arison

fix: report command respects --format html parameter
ReportResult now carries an optional html_report_path so the CLI
can display both output paths when report_format is 'all'.
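A rough sketch of how a result object could carry both paths is shown below. The field names `output_report_path` and `html_report_path` appear in the commit messages; modelling `ReportResult` as a dataclass and the `paths_to_display` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ReportResult:
    """Illustrative stand-in for the toolkit's ReportResult."""
    output_report_path: Path
    # Only set when an HTML report was written alongside the primary one.
    html_report_path: Optional[Path] = None


def paths_to_display(result: ReportResult) -> list[Path]:
    """Collect every report path the CLI should print."""
    paths = [result.output_report_path]
    if result.html_report_path is not None:
        paths.append(result.html_report_path)
    return paths
```

Keeping the HTML path optional means callers that request a single format do not need to change.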
…arison

fix: -f all now generates both md and html reports