Release v0.1.5#73

Merged
Dongbumlee merged 47 commits into main from release/v0.1.5
Apr 14, 2026
Conversation

@Dongbumlee
Collaborator

Release v0.1.5

Fixed

  • Make release pipeline resilient to VSIX version conflicts
  • Resolve 31 mypy type errors and enforce mypy in CI
  • Resolve 18 ruff lint errors (F401/F811/F841) across 6 files
  • Fix UV cache race condition in CI

Changed

  • Upgrade GitHub Actions to Node.js 24 runtimes
  • Apply ruff-format across source and workflows

See CHANGELOG.md for full details.

placerda and others added 30 commits March 27, 2026 01:42
- Add utils/telemetry.py with lazy OTel imports and span context managers
- Instrument runner.py with three-layer schema (CICD + GenAI + agentops.eval)
- Root span per eval run, item spans per row, evaluator child spans
- Activated via AGENTOPS_OTLP_ENDPOINT env var (opt-in, zero overhead)
- Graceful no-op when opentelemetry-sdk is not installed
- 16 unit tests covering disabled, degraded, and enabled states

Refs: #14
…rs (#51)

- Expand evaluator frozensets: add response_completeness, groundedness_pro,
  retrieval, tool_selection to existing sets
- Add new frozensets: _EVALUATORS_NEEDING_TOOL_DEFS_ONLY (tool_input_accuracy,
  tool_output_utilization, tool_call_success), _EVALUATORS_NEEDING_OUTPUT_ITEMS
  (task_adherence)
- Fix NLP evaluator names (bleu_score, rouge_score, etc.) to match
  _to_builtin_evaluator_name conversion
- Add default initialization_parameters for RougeScoreEvaluator (rouge_type)
- Build item_schema dynamically: include tool_definitions and context_field when
  evaluators need them
- Refactor _default_foundry_input_mapping to frozenset-based routing
- Improve error handling: log evaluator errors when score is null, improve
  runner error message with --verbose hint
- Add CI/CD integration models documentation: PR gate, scheduled, post-deploy,
  multi-env promotion, Azure DevOps pipeline
- Add gating best practices: threshold design, evaluator selection by scenario
- Add supported evaluators reference table (22 evaluators by category)
- Add ~20 unit tests for all new evaluator data_mapping patterns
- All 22 evaluators verified end-to-end with live Foundry cloud evaluation

Closes #51
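
As a rough sketch of what frozenset-based routing can look like: the set names and members below are taken from the commit messages in this PR, but the function name, the base mapping, and the placeholder values are assumptions made for illustration and are not the actual `_default_foundry_input_mapping` implementation.

```python
from typing import Dict

# Set names and members as listed in the commit messages of this PR.
_EVALUATORS_NEEDING_CONTEXT = frozenset({"groundedness_pro"})
_EVALUATORS_NEEDING_TOOL_DEFS_ONLY = frozenset(
    {"tool_input_accuracy", "tool_output_utilization", "tool_call_success"}
)
_EVALUATORS_NEEDING_OUTPUT_ITEMS = frozenset({"task_adherence"})


def default_input_mapping(evaluator: str) -> Dict[str, str]:
    """Pick extra input fields by set membership (hypothetical field names)."""
    # Base fields every evaluator receives in this sketch.
    mapping = {"query": "<query field>", "response": "<response field>"}
    if evaluator in _EVALUATORS_NEEDING_CONTEXT:
        mapping["context"] = "<context field>"
    if evaluator in _EVALUATORS_NEEDING_TOOL_DEFS_ONLY:
        mapping["tool_definitions"] = "<tool definitions field>"
    if evaluator in _EVALUATORS_NEEDING_OUTPUT_ITEMS:
        mapping["output_items"] = "<output items field>"
    return mapping
```

Adding an evaluator to a category then only requires extending the relevant frozenset, rather than editing a chain of name comparisons.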
TestSpanAttributesWhenEnabled requires opentelemetry to be installed
because the code paths import SpanKind/StatusCode when tracing is
enabled. Use pytest.importorskip to skip the class in CI where
opentelemetry is not a declared dependency.
Add OTLP tracing support and documentation for evaluation runs
- Fix skill paths: plugins/agentops/skills/ (not .github/plugins/)
  across README, tutorial-copilot-skills (6 instances)
- Fix CLI contract: add eval compare and config cicd as implemented
  commands in AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix source tree listings: add cicd.py, comparison.py, telemetry.py,
  workflows/ across AGENTS.md, how-it-works.md
- Fix test listings: add test_cicd, test_cli_commands, test_comparison,
  test_telemetry across AGENTS.md, copilot-instructions.md, how-it-works.md
- Fix agent_tools_baseline: TaskCompletionEvaluator + ToolCallAccuracyEvaluator
  (not SimilarityEvaluator placeholder) in README, AGENTS.md, how-it-works.md
- Fix JSONL path: data/<name>.jsonl (not datasets/) in ci-github-actions.md
- Fix init flag: --dir (not --path) in README
- Fix evaluator guidance: add frozenset names and NLP_DEFAULT_INIT_PARAMS
  to copilot-instructions.md
- Add context_field to dataset format docs in AGENTS.md
- Add rouge_type default note to evaluator reference doc
- Update planned command message to list all 5 available commands
- Add --format flag to CLI usage examples
- Add services/browse.py with list_bundles, show_bundle, list_runs, show_run
- Replace planned stubs with working implementations in cli/app.py
- bundle list: shows all bundles with evaluators and threshold count
- bundle show: displays full bundle detail (evaluators, thresholds, metadata)
- run list: shows all past runs with status, bundle, dataset, duration
- run show: displays full run detail (metrics, thresholds, items, Foundry URL)
- Add 16 unit tests (service + CLI) in test_browse.py
- All commands are read-only, no side effects, no Azure API calls
Split app.py (487 lines) into focused command modules:

- app.py (114 lines) — root app, global callback, init, sub-app registration
- eval_commands.py (108 lines) — eval run, eval compare
- report_commands.py (66 lines) — report, report show/export stubs
- browse_commands.py (152 lines) — bundle list/show, run list/show/view
- config_commands.py (56 lines) — config cicd, config validate/show stubs
- planned.py (57 lines) — dataset, monitor, trace, model, agent stubs
- _planned.py (12 lines) — shared planned command helper

No behavior changes. All 96 tests pass.
- Move dataset stubs to dataset_commands.py (ready for Tier 2 implementation)
- Inline monitor/trace/model/agent stubs in app.py (1-2 commands each)
- Delete planned.py — no more catch-all stub file
feat: extend Foundry cloud evaluator coverage to 22 built-in evaluators (#51)
feat: implement bundle list/show and run list/show commands
Add agentops-workspace-setup, agentops-browse-inspect, and
agentops-dataset-management skills covering all remaining CLI
commands not handled by existing evaluation-focused skills.

- agentops-workspace-setup: init, config cicd, config validate/show
- agentops-browse-inspect: bundle list/show, run list/show/view
- agentops-dataset-management: dataset creation, YAML/JSONL format,
  field mapping, planned validate/describe/import commands
…ills

Add '## Before You Start' section to 5 downstream skills enforcing
workspace verification before proceeding:
- agentops-run-evals
- agentops-investigate-regression
- agentops-observability-triage
- agentops-browse-inspect
- agentops-dataset-management

Each skill now instructs the agent to check for .agentops/ directory
and redirect to agentops-workspace-setup skill if missing. This
provides soft enforcement at the skill layer, complementing the hard
CLI enforcement (FileNotFoundError) already in place.
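
The hard enforcement mentioned above might look something like the following. This is a sketch under assumptions: `require_workspace` is a hypothetical helper name and the error text is invented; only the `.agentops/` directory and the `FileNotFoundError` come from the commit message.

```python
from pathlib import Path


def require_workspace(project_dir: Path) -> Path:
    """Return the .agentops/ directory or fail if the workspace is missing."""
    workspace = project_dir / ".agentops"
    if not workspace.is_dir():
        # Hypothetical message; the real CLI wording may differ.
        raise FileNotFoundError(
            f"No .agentops/ directory found in {project_dir}; "
            "set up the workspace first."
        )
    return workspace
```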
- ci.yml: add build-vsix validation job (package only, no publish)
- staging.yml: add publish-vsix-prerelease job (vsce publish --pre-release)
- release.yml: add publish-vsix stable job + attach VSIX to GitHub Release
- cut-release.yml: sync package.json version via jq, update PR body/checklist
- _build.yml: update header comments (Python-only, no VSIX logic)
- plugins/agentops: add README.md, CHANGELOG.md, .vscodeignore, package.json scripts

Requires VSCE_PAT secret in staging and release GitHub environments.
fix: remove duplicate _planned_command definition (ruff F811)
ci: integrate VSIX packaging with pre-release into CI/CD pipeline
* ci(vsix): upload VSIX artifact from CI and staging pipelines

* ci: publish VSIX pre-release to Marketplace on develop pushes

Add publish-vsix-dev job to ci.yml that publishes the VSIX as a
pre-release to the VS Code Marketplace on every push to develop,
mirroring the publish-dev job that pushes to TestPyPI.

- Gated on push to develop only (not PRs)
- Depends on lint, test, and build-vsix jobs
- Uses staging environment (VSCE_PAT secret)
- Packages with --pre-release flag
- Includes step summary with Marketplace link
Dongbumlee and others added 17 commits April 13, 2026 16:28
* ci(vsix): sync VSIX version from git tags in all pipelines

Derive package.json version at CI time from the latest git tag using
git describe + jq. Mimics setuptools-scm patch-increment behavior:
- On exact tag (release): use tag version directly (e.g. v0.2.0 -> 0.2.0)
- Off tag (develop/PR): increment patch (e.g. v0.1.0 + commits -> 0.1.1)

Applied to all 4 VSIX jobs:
- ci.yml: build-vsix, publish-vsix-dev
- staging.yml: publish-vsix-prerelease
- release.yml: publish-vsix

Also adds fetch-depth: 0 to checkout steps so git describe has
access to the full tag history.

* fix(vsix): update Marketplace link placeholder in README

* docs(vsix): improve README — remove misleading Prerequisites, expand Usage examples

* docs(vsix): remove CLI install note — skills handle setup automatically
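
The tag-to-version rule from the "sync VSIX version from git tags" commit can be illustrated with a small function. The workflows use git describe and jq; this Python version is only a sketch of the rule, and it assumes the input looks like the output of `git describe --tags` (for example `v0.1.0-5-gabc1234`). The function name `vsix_version` is hypothetical.

```python
import re

# Matches "v0.2.0" (exact tag) or "v0.1.0-5-gabc1234" (5 commits past the tag).
_DESCRIBE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-(\d+)-g[0-9a-f]+)?$")


def vsix_version(describe_output: str) -> str:
    """Derive a package.json version from a git-describe style string."""
    match = _DESCRIBE.match(describe_output.strip())
    if match is None:
        raise ValueError(f"Unrecognised describe output: {describe_output!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    commits_since_tag = int(match.group(4) or 0)
    if commits_since_tag > 0:
        # Off tag (develop/PR): bump the patch number.
        patch += 1
    return f"{major}.{minor}.{patch}"
```

With the examples from the commit message, `vsix_version("v0.2.0")` gives `0.2.0` and `vsix_version("v0.1.0-5-gabc1234")` gives `0.1.1`.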
* fix: resolve all mypy type errors across 6 source files

- foundry_backend.py: assert narrowing for Optional[str], Dict type widening
- config_loader.py: added BaseModel import and TypeVar bound
- reporter.py: removed conflicting annotations, renamed shadowed loop vars
- browse.py: split Path | None annotation into separate assignment
- comparison.py: fixed _compute_metric_direction return type, renamed loop vars
- runner.py: added imports, Pydantic model constructors
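
Two of the fix patterns named in this list, assert narrowing for `Optional[str]` and Dict type widening, can be shown in isolation. The function and field names here are hypothetical and are not taken from foundry_backend.py.

```python
from typing import Any, Dict, Optional


def build_payload(endpoint: Optional[str], threshold: float) -> Dict[str, Any]:
    """Hypothetical helper showing two common mypy fixes."""
    # Narrowing: after the assert, mypy treats endpoint as str, so calling
    # str methods on it no longer reports an Optional[str] error.
    assert endpoint is not None, "endpoint must be configured"
    # Widening: without the explicit annotation mypy infers Dict[str, str]
    # from the literal and rejects the float assignment below.
    payload: Dict[str, Any] = {"endpoint": endpoint.rstrip("/")}
    payload["threshold"] = threshold
    return payload
```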
Replace git describe --tags --abbrev=0 with git tag -l --sort=-v:refname
to find the latest tag across ALL branches, not just reachable ones.

Root cause: v0.1.3 tag on main was not reachable from develop, so
git describe found v0.1.2 and derived version 0.1.3, which already
existed on the Marketplace.

Also adds continue-on-error on dev/staging VSIX publish steps as a
safety net against 'already exists' errors.
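
To make the root cause concrete, here is a small sketch of picking the highest version tag from a full tag list and deriving the next patch version from it. It stands in for the effect of version-sorting all tags; the function names are hypothetical and tags are assumed to have the form `vMAJOR.MINOR.PATCH`.

```python
import re
from typing import Iterable, Optional, Tuple

_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def _version_key(tag: str) -> Optional[Tuple[int, int, int]]:
    match = _TAG.match(tag)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)), int(match.group(3)))


def latest_tag(tags: Iterable[str]) -> Optional[str]:
    """Return the highest vX.Y.Z tag, comparing numerically, or None."""
    versioned = []
    for tag in tags:
        key = _version_key(tag)
        if key is not None:  # skip tags that are not plain versions
            versioned.append((key, tag))
    return max(versioned)[1] if versioned else None


def next_dev_version(tags: Iterable[str]) -> str:
    """Patch-increment the latest tag to get a development version."""
    tag = latest_tag(tags)
    if tag is None:
        raise ValueError("no version tags found")
    key = _version_key(tag)
    assert key is not None  # latest_tag only returns parseable tags
    major, minor, patch = key
    return f"{major}.{minor}.{patch + 1}"
```

If only the tags reachable from develop are considered (up to `v0.1.2`), the derived version is `0.1.3`, which collides with the existing release. Once `v0.1.3` is included in the list, the derived version becomes `0.1.4`.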
Skills are now managed exclusively via 'agentops skills install'.
The 'init' command only scaffolds .agentops/ and prints guidance.
# Conflicts:
#	.github/copilot-instructions.md
#	AGENTS.md
#	CHANGELOG.md
#	README.md
#	docs/ci-github-actions.md
#	docs/foundry-evaluation-sdk-built-in-evaluators.md
#	docs/how-it-works.md
#	plugins/agentops/skills/agentops-investigate-regression/SKILL.md
#	plugins/agentops/skills/agentops-observability-triage/SKILL.md
#	plugins/agentops/skills/agentops-run-evals/SKILL.md
#	src/agentops/backends/foundry_backend.py
#	src/agentops/cli/app.py
#	src/agentops/core/config_loader.py
#	src/agentops/services/runner.py
#	tests/unit/test_foundry_backend.py
- Add continue-on-error on 'Publish stable to VS Code Marketplace' step
  to tolerate 'already exists' errors from staging pre-release
- Decouple github-release job from publish-vsix result so GitHub Release
  proceeds when PyPI publish succeeds regardless of VSIX outcome
- Update CHANGELOG with v0.1.4 section and workflow fix entry
…ications

- Remove old develop-only plugin skills (workspace-setup, browse-inspect, dataset-management)
- Sync plugin skills from templates (8 canonical skills)
- Update plugin package.json to reference 8 skills
- Wire browse_commands.py into app.py (bundle list/show, run list/show/view)
- Port develop evaluator name fixes (bleu->bleu_score, rouge->rouge_score, etc.)
- Add _EVALUATORS_NEEDING_TOOL_DEFS_ONLY and _EVALUATORS_NEEDING_OUTPUT_ITEMS
- Add _NLP_DEFAULT_INIT_PARAMS for rouge_score
- Move groundedness_pro from _SAFETY_EVALUATORS to _EVALUATORS_NEEDING_CONTEXT
- Fix tests for new evaluator classifications
- Fix skills tests for init/skills decoupling
# Conflicts:
#	.github/workflows/release.yml
#	.github/workflows/staging.yml
#	CHANGELOG.md
# Conflicts:
#	.github/workflows/release.yml
#	.github/workflows/staging.yml
#	docs/release-process.md
# Conflicts:
#	.github/workflows/release.yml
#	.github/workflows/staging.yml
- reporter.py: rename shadowed loop variable t -> it
- subprocess_backend.py: add type: ignore for deprecated backend_config
- eval_engine.py: add assert for str|None narrowing
- foundry_backend.py: add asserts and fix Dict type annotations
- runner.py: import Backend type, use Pydantic model constructors
- ci.yml: remove continue-on-error from mypy step (now a hard gate)
Upgrade action versions across all 5 workflow files to resolve
Node.js 20 deprecation warnings (Node.js 24 is forced after June 2, 2026):

- actions/checkout v4 -> v6
- actions/upload-artifact v4 -> v7
- actions/download-artifact v4 -> v7
- astral-sh/setup-uv v6 -> v7
- actions/setup-node v4 -> v6
- actions/setup-python v5 -> v6
- Node.js runtime version 20 -> 22 (LTS)

pypa/gh-action-pypi-publish unchanged (Docker container action).
Add enable-cache: false to the lint, coverage, and publish-dev jobs.
These jobs shared cache keys with test matrix entries, causing
'Unable to reserve cache' warnings during post-job cleanup.
The test matrix jobs remain the sole cache owners per (OS, Python) combo.
@Dongbumlee Dongbumlee merged commit 73a768a into main Apr 14, 2026
5 checks passed
@Dongbumlee Dongbumlee deleted the release/v0.1.5 branch April 14, 2026 07:15
2 participants