
evaluate_handoff: path-walk failures should report ERROR, not CAUGHT #24

@aural-psynapse

Summary

When _get_by_json_path (or whatever resolves the claim's json_path against the indexed payload) cannot walk the path, the per-claim verdict comes back as result: "CAUGHT" with the parser error in detail, e.g.:

detail: json_path: 'expected dict at segment "['value'][0]['subject']", got list'

A consumer can't tell that apart from a real value mismatch — both look like result: "CAUGHT" to anyone reading per_claim. That has two downstream costs:

  1. KPI / metrics — counting "hallucinations caught" inflates the number with false positives every time the LLM picks a path the parser can't handle. Demos and customer dashboards over-report drift.
  2. Heal-loop classifiers — substitute / reprompt / fail decisions are made on result == "CAUGHT". An evaluator-can't-walk-this verdict triggers the wrong tier, retries that can never converge, and noisy traces.
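To make the ambiguity concrete, here is a minimal sketch of a naive consumer-side counter. The verdict shape (dicts with result and detail keys) follows the description above; the sample data, the wording of the real-mismatch detail, and the count_results helper are hypothetical.

```python
from collections import Counter

# Hypothetical per_claim verdicts; only the second detail string is taken
# from this report, the others are invented for illustration.
per_claim = [
    {"result": "CAUGHT", "detail": "claimed 'open', indexed 'closed'"},
    {"result": "CAUGHT",
     "detail": "json_path: 'expected dict at segment \"['value'][0]['subject']\", got list'"},
    {"result": "PASS", "detail": ""},
]


def count_results(verdicts):
    """Tally verdicts by result, the way a simple KPI dashboard might."""
    return Counter(v["result"] for v in verdicts)


counts = count_results(per_claim)
print(counts["CAUGHT"])  # counts both verdicts, though only one is real drift
```

Without parsing the detail string, the counter has no way to separate the path-walk failure from the genuine mismatch.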

Proposal

Emit result: "ERROR" (or a new "EVALUATOR_ERROR", whichever fits the existing taxonomy) for path-walk failures, with the same detail so debug info is preserved. The two cases that consumers want to disambiguate:

  • CAUGHT — path resolved, claimed value disagrees with indexed value (real drift)
  • ERROR — path could not be resolved (evaluator limitation, missing field, malformed path, etc.)
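A minimal sketch of the proposed split, assuming the path has already been parsed into segments. PathWalkError, resolve, and verdict_for are hypothetical names and not the SDK's actual API; the point is only that resolution failures and value mismatches take different branches.

```python
from typing import Any, List, Union

Segment = Union[str, int]


class PathWalkError(Exception):
    """Hypothetical: raised when a path cannot be resolved against the payload."""


def resolve(payload: Any, segments: List[Segment]) -> Any:
    """Walk pre-parsed segments (str keys, int indexes) through the payload."""
    current = payload
    for seg in segments:
        try:
            current = current[seg]
        except (KeyError, IndexError, TypeError) as exc:
            raise PathWalkError(f"cannot walk segment {seg!r}: {exc!r}") from exc
    return current


def verdict_for(payload: Any, segments: List[Segment], claimed: Any) -> dict:
    """Return ERROR for unresolvable paths and CAUGHT only for real mismatches."""
    try:
        actual = resolve(payload, segments)
    except PathWalkError as exc:
        # Evaluator could not reach the leaf: keep the detail, change the result.
        return {"result": "ERROR", "detail": str(exc)}
    if actual != claimed:
        return {"result": "CAUGHT", "detail": f"claimed {claimed!r}, indexed {actual!r}"}
    return {"result": "PASS", "detail": ""}
```

With this shape, a heal loop keyed on result == "CAUGHT" would only fire when the path resolved and the values disagree.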

Reproduction

This occurs with any tool whose response is a JSON array at the root, when the LLM emits Python-bracket-key paths (['value'][0]['subject']) or bracketed numeric indexes ([0].subject). The SDK's parser bails before reaching the leaf, and the verdict currently surfaces as CAUGHT instead of ERROR.

Workaround on the consumer side

We're patching this locally in customer-support-sdk-demo (evaluate_node.py, _patch_array_path_verdicts): if the SDK's verdict has an "expected … got list/dict" detail, we re-walk with a more permissive parser, mark the verdict as PASS when our local walk verifies the claim, and as ERROR when it can't. That logic belongs in the SDK so that every consumer doesn't have to reinvent it.
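For reference, a rough sketch of what such a re-walk could look like. This is not the demo's actual code: parse_path, walk, and patch_verdict are hypothetical names, and the regex is only a guess at what "more permissive" needs to cover (the two path styles mentioned under Reproduction). The sketch also assumes that a verdict stays CAUGHT when the local walk resolves but the value still disagrees, which the description above does not specify.

```python
import re

# Permissive tokenizer: accepts ['key'], ["key"], [0], and bare dotted keys.
# Anything else (dots, stray characters) is skipped silently.
_SEGMENT = re.compile(
    r"\[\s*'([^']*)'\s*\]"   # ['key']
    r'|\[\s*"([^"]*)"\s*\]'  # ["key"]
    r"|\[\s*(\d+)\s*\]"      # [0]
    r"|([A-Za-z_]\w*)"       # bare key, e.g. .subject
)

# Detail pattern for the SDK's path-walk failures, per the example in this report.
_WALK_FAILURE = re.compile(r"expected .* got (list|dict)")


def parse_path(path):
    """Split a path string into str keys and int indexes."""
    segments = []
    for match in _SEGMENT.finditer(path):
        single, double, index, bare = match.groups()
        if index is not None:
            segments.append(int(index))
        else:
            segments.append(next(g for g in (single, double, bare) if g is not None))
    return segments


def walk(payload, segments):
    """Follow segments; raises KeyError, IndexError, or TypeError on a bad step."""
    current = payload
    for seg in segments:
        current = current[seg]
    return current


def patch_verdict(verdict, payload, path, claimed):
    """Reclassify a CAUGHT verdict whose detail indicates a path-walk failure."""
    detail = verdict.get("detail", "")
    if verdict.get("result") != "CAUGHT" or not _WALK_FAILURE.search(detail):
        return verdict  # not a path-walk failure; leave untouched
    try:
        actual = walk(payload, parse_path(path))
    except (KeyError, IndexError, TypeError):
        return {**verdict, "result": "ERROR"}  # still cannot resolve the path
    if actual == claimed:
        return {**verdict, "result": "PASS"}  # local walk verifies the claim
    return verdict  # assumption: resolved but different is real drift
```

Keying on the detail string is brittle, which is part of why the classification would be better made inside the SDK.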

Related

Tracks alongside the existing array-indexing limitation in _get_by_json_path (which the consumer-side workaround was originally created to bridge). Resolving this report-classification issue is independent of fixing the underlying parser — even a path the SDK genuinely can't walk would be more honestly classified as ERROR than CAUGHT.
