Build the evaluation harness, scorecards, and policy-aware audit plus redaction flow #67

@KSemenenko

Problem

Operators need a usable trust layer that combines transcript evaluation, readable scorecards, and safe audit exports with redaction control.

Scope

  • Track evaluation harness inputs, transcript scoring, scorecard history, audit exports, and redaction policy behavior
  • Cover how quality and trust state appears per agent profile or session
  • Keep sensitive data handling explicit
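
To make the scope items above concrete, here is a minimal sketch of how transcript scoring and per-profile scorecard history could fit together. Every name in it (`Scorecard`, `ScorecardHistory`, `evaluate_transcript`, the metric callables) is hypothetical and chosen for illustration only; it does not reflect the official evaluation packages or any decided design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical: a transcript is a list of {"role": ..., "content": ...} turns.
Transcript = list[dict[str, str]]
Evaluator = Callable[[Transcript], float]  # returns a score in [0, 1]


@dataclass(frozen=True)
class Scorecard:
    """One evaluation result for a single session (illustrative shape)."""
    session_id: str
    agent_profile: str
    scores: dict[str, float]
    created_at: datetime

    @property
    def overall(self) -> float:
        # Unweighted mean; a real harness might weight metrics differently.
        if not self.scores:
            return 0.0
        return sum(self.scores.values()) / len(self.scores)


@dataclass
class ScorecardHistory:
    """Append-only scorecard history, queryable per agent profile."""
    _cards: list[Scorecard] = field(default_factory=list)

    def record(self, card: Scorecard) -> None:
        self._cards.append(card)

    def for_profile(self, agent_profile: str) -> list[Scorecard]:
        return [c for c in self._cards if c.agent_profile == agent_profile]

    def latest(self, agent_profile: str) -> Optional[Scorecard]:
        cards = self.for_profile(agent_profile)
        return cards[-1] if cards else None


def evaluate_transcript(
    session_id: str,
    agent_profile: str,
    transcript: Transcript,
    evaluators: dict[str, Evaluator],
) -> Scorecard:
    """Run every evaluator over the transcript and collect one scorecard."""
    scores = {name: fn(transcript) for name, fn in evaluators.items()}
    return Scorecard(session_id, agent_profile, scores, datetime.now(timezone.utc))
```

Keeping the history append-only is one way to let operators compare a profile's current trust state against earlier runs; whether history is stored per profile, per session, or both is left open here.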

Out of scope

  • Low-level telemetry emission that belongs in earlier observability issues
  • Toolchain installation and provider health flows

Implementation notes

  • Make scorecards and audit exports actionable for operators
  • Align redaction with provider secrets, prompt data, tool arguments, and results
  • Keep the issue compatible with the official evaluation packages
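
As an illustration of the redaction note above, the following sketch applies a policy across the four data categories the issue names (provider secrets, prompt data, tool arguments, results) when building an audit export. The `Category` enum, `RedactionPolicy`, `export_audit_record`, and the `[REDACTED]` marker are all hypothetical names. The rule that provider secrets are always redacted, whatever the policy says, is an assumption of this sketch and not something the issue decides.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Data categories named in the issue (enum values are illustrative)."""
    PROVIDER_SECRET = "provider_secret"
    PROMPT_DATA = "prompt_data"
    TOOL_ARGUMENTS = "tool_arguments"
    TOOL_RESULTS = "tool_results"


REDACTED = "[REDACTED]"  # hypothetical placeholder marker


@dataclass(frozen=True)
class RedactionPolicy:
    """Explicit set of categories an operator has chosen to redact."""
    redact: frozenset[Category]


def export_audit_record(
    record: dict[Category, str], policy: RedactionPolicy
) -> dict[str, str]:
    """Return an export-safe copy of one audit record.

    Assumption: provider secrets are redacted even if the policy omits them.
    """
    effective = policy.redact | {Category.PROVIDER_SECRET}
    return {
        category.value: (REDACTED if category in effective else value)
        for category, value in record.items()
    }
```

For example, a policy of `RedactionPolicy(frozenset({Category.PROMPT_DATA}))` would leave tool arguments and results readable in the export while masking prompt data and, under the assumption above, provider secrets.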

Definition of Done

  • The issue defines the minimum trust surfaces: harness, scorecards, audit, and redaction
  • Later implementation can proceed without re-deciding how evaluations surface in the product

Verification

  • Review the issue against the feature spec, the evaluation-package issue, and the approval flows

Dependencies

Labels

enhancement
