fix: return ERROR not CAUGHT when evaluator cannot perform comparison #36

Open
MdSadiqMd wants to merge 1 commit into ProvablyAI:main from MdSadiqMd:fix/eval-path-walk-verdict

Closes #24

Summary

  • Path-walk failures and other "evaluator can't compare" cases in eval_modes.py returned CAUGHT, which inflated hallucination metrics and triggered heal-loop retries that could never converge
  • Changed six sites to return ERROR — aligning with the evaluator's own docstring: "ERROR — the evaluator could not actually evaluate. This is not evidence of tampering and must not be conflated with CAUGHT"

| Line | Situation | Comparison happened? | Before | After |
|------|-----------|----------------------|--------|-------|
| 36 | Path walk fails (KeyError/IndexError/TypeError/ValueError) | No, value never extracted | CAUGHT | ERROR |
| 46 | Unknown verification_mode | No, evaluator doesn't know the mode | CAUGHT | ERROR |
| 74 | expected_json_schema is empty for schema_type | No, nothing to validate against | CAUGHT | ERROR |
| 80 | SchemaError (the schema itself is invalid) | No, broken schema can't validate anything | CAUGHT | ERROR |
| 86 | range_min and range_max both None | No, no bounds to check against | CAUGHT | ERROR |
| 90 | Indexed value at path isn't numeric | No, can't perform numeric comparison | CAUGHT | ERROR |
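
To illustrate the distinction for reviewers, here is a minimal sketch of the path-walk case. It is not the code in eval_modes.py: `Verdict`, `walk_path`, `evaluate_field`, the dotted-path format, and the `PASS` value are all hypothetical names assumed for this example. The point is only that a failed extraction yields ERROR, while CAUGHT is reserved for a comparison that actually ran and found a mismatch.

```python
from enum import Enum
from typing import Any


class Verdict(Enum):
    # Hypothetical verdict set; only CAUGHT and ERROR are named in this PR.
    PASS = "PASS"
    CAUGHT = "CAUGHT"
    ERROR = "ERROR"


def walk_path(payload: Any, path: str) -> Any:
    """Follow a dotted path such as 'items.0.price' through dicts and lists."""
    current = payload
    for part in path.split("."):
        if isinstance(current, list):
            # int() raises ValueError for a non-numeric index;
            # indexing raises IndexError when out of range.
            current = current[int(part)]
        else:
            # Raises KeyError for a missing key, TypeError for a scalar.
            current = current[part]
    return current


def evaluate_field(payload: Any, path: str, expected: Any) -> Verdict:
    try:
        actual = walk_path(payload, path)
    except (KeyError, IndexError, TypeError, ValueError):
        # No value was extracted, so no comparison happened: not evidence of tampering.
        return Verdict.ERROR
    # A comparison happened; only a real mismatch counts as CAUGHT.
    return Verdict.PASS if actual == expected else Verdict.CAUGHT
```

With a payload of `{"items": [{"price": 10}]}`, the path `items.0.price` compared against 12 is a real mismatch (CAUGHT), whereas `items.3.price` cannot be walked at all (ERROR).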

Test plan

  • uv run pytest tests/unit/ -> 116 passed
  • uv run pytest tests/e2e/ -> 15 passed
  • Updated test_field_extraction_error_when_index_out_of_range to expect ERROR
  • Updated test_schema_type_missing_path_is_error to expect ERROR
  • Verified real-mismatch tests (test_evaluate_handoff_caught_on_mismatch, test_final_verify_failure_marks_caught, scenario B tampered claim) still return CAUGHT
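
The shape of these tests can be sketched with the range case from the table. This is an illustrative stand-in rather than the repository's tests: `check_range`, its string verdicts, and the choice to treat `bool` as non-numeric are assumptions made for the example.

```python
from numbers import Real
from typing import Any, Optional


def check_range(value: Any, range_min: Optional[float], range_max: Optional[float]) -> str:
    """Hypothetical range check returning 'PASS', 'CAUGHT', or 'ERROR'."""
    if range_min is None and range_max is None:
        # No bounds to check against: the evaluator cannot compare.
        return "ERROR"
    if isinstance(value, bool) or not isinstance(value, Real):
        # Non-numeric value (bool excluded by assumption): no numeric comparison possible.
        return "ERROR"
    if range_min is not None and value < range_min:
        return "CAUGHT"
    if range_max is not None and value > range_max:
        return "CAUGHT"
    return "PASS"


def test_missing_bounds_is_error():
    assert check_range(5, None, None) == "ERROR"


def test_non_numeric_value_is_error():
    assert check_range("five", 0, 10) == "ERROR"


def test_out_of_range_is_caught():
    # A real mismatch must still be reported as CAUGHT.
    assert check_range(11, 0, 10) == "CAUGHT"


def test_in_range_passes():
    assert check_range(5, 0, 10) == "PASS"
```

Keeping one test per verdict makes a regression in either direction visible: ERROR cases sliding back to CAUGHT, or genuine mismatches being downgraded to ERROR.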

Development

Successfully merging this pull request may close these issues.

evaluate_handoff: path-walk failures should report ERROR, not CAUGHT
