Update Tool Call Accuracy to output unified format #46319

Merged: m7md7sien merged 18 commits into main from mohessie/unify_output/tool_call_accuracy on Apr 26, 2026

@m7md7sien m7md7sien commented Apr 14, 2026

Description

Update Tool Call Accuracy to output unified format

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@github-actions github-actions Bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Apr 14, 2026
Copilot AI and others added 7 commits April 16, 2026 22:10
…ed properties handling (#46355)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…outputs (#46449)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien and others added 2 commits April 23, 2026 22:03
Update documentation to state deprecate 'gpt_' prefix
…t_applicable_result` (#46500)

* rename not_applicable to pass in _return_not_applicable_result and update tests

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* restore TODO comment above _return_not_applicable_result

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
@m7md7sien m7md7sien marked this pull request as ready for review April 26, 2026 18:10
@m7md7sien m7md7sien requested a review from a team as a code owner April 26, 2026 18:10
Copilot AI review requested due to automatic review settings April 26, 2026 18:10

Copilot AI left a comment

Pull request overview

This PR updates the Tool Call Accuracy evaluator to emit a more unified result format (e.g., *_score, *_properties, *_status) and aligns unit tests and the underlying prompty schema accordingly.

Changes:

  • Updated the Tool Call Accuracy prompty contract to output reason, score, status, and properties (including a new “skipped” status behavior).
  • Updated ToolCallAccuracyEvaluator to map prompty output into the unified SDK result shape (adds *_score, *_properties, *_status, *_passed, while keeping legacy keys for compatibility).
  • Updated unit tests to assert against the new unified output keys and “skipped” behavior.
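The mapping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: the function name `to_unified_result`, the exact key spellings, the `>=` threshold comparison, and the legacy key set (`tool_call_accuracy`, `gpt_tool_call_accuracy`, `tool_call_accuracy_result`) are assumptions inferred from the wording of this PR. The source does not say how `*_passed` is set for skipped results, so the sketch leaves it out of that branch.

```python
from typing import Any, Dict


def to_unified_result(
    prompty_output: Dict[str, Any],
    name: str = "tool_call_accuracy",  # assumed result-key prefix
    threshold: float = 3,              # assumed default threshold
) -> Dict[str, Any]:
    """Hypothetical helper: map a prompty output dict
    (reason/score/status/properties) into a unified result shape."""
    score = prompty_output.get("score")
    status = prompty_output.get("status")
    reason = prompty_output.get("reason")

    if status == "skipped" or score is None:
        # Skipped: score and properties are None, status is "skipped".
        return {
            f"{name}_score": None,
            f"{name}_reason": reason,
            f"{name}_status": "skipped",
            f"{name}_properties": None,
            f"{name}_threshold": threshold,
            # Legacy key; the PR's commits mention "pass" for not-applicable results.
            f"{name}_result": "pass",
        }

    passed = score >= threshold  # assumption: higher is better
    return {
        # Unified keys
        f"{name}_score": score,
        f"{name}_reason": reason,
        f"{name}_status": status,
        f"{name}_properties": prompty_output.get("properties"),
        f"{name}_passed": passed,
        f"{name}_threshold": threshold,
        # Legacy keys kept for backward compatibility (names assumed)
        name: score,
        f"gpt_{name}": score,
        f"{name}_result": "pass" if passed else "fail",
    }
```

Callers that still read the base key or the `gpt_`-prefixed key would keep working under this shape, while new code can rely on the `*_score`, `*_status`, and `*_properties` keys.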

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updates mock prompty outputs and assertions to use *_score/*_properties and new skipped behavior. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py | Updates ToolCallAccuracyEvaluator expectations to reflect skipped results (score/properties None, status skipped). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty | Renames output fields and adds a “skipped” status contract. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Implements unified output formatting, status handling, and properties packaging. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py | Adds a helper to return unified “not applicable/skipped” results. |

Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py:241

  • _convert_kwargs_to_eval_input never includes a response field in eval_input, so _is_intermediate_response(eval_input.get("response")) and the subsequent response preprocessing are effectively dead code and won't prevent evaluating intermediate assistant messages. Consider moving the intermediate-response check to _real_call (using the original kwargs["response"]) or removing these branches if response is not part of the prompty inputs.
        # Check for intermediate response
        if _is_intermediate_response(eval_input.get("response")):
            return self._return_not_applicable_result(
                "Intermediate response. Please provide the agent's final response for evaluation.",
                self.threshold,
            )

        # Preprocess messages if they are lists
        if isinstance(eval_input.get("response"), list):
            eval_input["response"] = _preprocess_messages(eval_input["response"])
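One way to read the first suggestion in the comment above (moving the check to `_real_call` and using the original `kwargs["response"]`) is sketched below. Everything here is a stand-in for illustration: `SketchEvaluator`, `is_intermediate_response`, and the rule it applies (the last message still contains a `tool_call` item) are assumptions and do not reproduce the SDK's `_is_intermediate_response` or the real evaluator class. The sketch is also synchronous for simplicity, which may differ from the SDK.

```python
from typing import Any, Dict


def is_intermediate_response(response: Any) -> bool:
    """Hypothetical stand-in: treat a response as intermediate when its
    last message still contains a tool_call content item."""
    if not isinstance(response, list) or not response:
        return False
    last = response[-1]
    content = last.get("content") if isinstance(last, dict) else None
    if not isinstance(content, list):
        return False
    return any(isinstance(item, dict) and item.get("type") == "tool_call" for item in content)


class SketchEvaluator:
    """Illustrative class only; not the SDK's ToolCallAccuracyEvaluator."""

    threshold = 3

    def _return_not_applicable_result(self, message: str, threshold: float) -> Dict[str, Any]:
        # Simplified skipped result; key names are assumptions.
        return {
            "tool_call_accuracy_score": None,
            "tool_call_accuracy_status": "skipped",
            "tool_call_accuracy_reason": message,
            "tool_call_accuracy_threshold": threshold,
        }

    def _convert_kwargs_to_eval_input(self, **kwargs: Any) -> Dict[str, Any]:
        # Mirrors the reported issue: "response" is not copied into eval_input.
        return {"query": kwargs.get("query"), "tool_calls": kwargs.get("tool_calls")}

    def _real_call(self, **kwargs: Any) -> Dict[str, Any]:
        # Check the caller's original response before building eval_input,
        # so the check cannot be bypassed by the conversion step.
        if is_intermediate_response(kwargs.get("response")):
            return self._return_not_applicable_result(
                "Intermediate response. Please provide the agent's final response for evaluation.",
                self.threshold,
            )
        eval_input = self._convert_kwargs_to_eval_input(**kwargs)
        return {"evaluated": True, "eval_input": eval_input}
```

With the check placed here, the `eval_input.get("response")` branches in the quoted snippet would no longer be needed, which matches the comment's alternative suggestion of removing them.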

@m7md7sien m7md7sien merged commit 6039d5c into main Apr 26, 2026
26 checks passed
@m7md7sien m7md7sien deleted the mohessie/unify_output/tool_call_accuracy branch April 26, 2026 18:44
fafhrd91 pushed a commit to fafhrd91/azure-sdk-for-python that referenced this pull request Apr 28, 2026
* Update Tool Call Accuracy to output unified format

* Update tests

* reformatting

* Refactor not applicable result method calls

* Fix test assertions for new unified output format and apply black formatting (Azure#46336)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (Azure#46355)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Fix tool call accuracy test for skipped output schema (Azure#46356)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Add back backward-compatible base result keys for tool call accuracy outputs (Azure#46449)

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Update documentation to state deprecate 'gpt_' prefix

Update documentation to state deprecate 'gpt_' prefix

* Rename `_result` value from `not_applicable` to `pass` in `_return_not_applicable_result` (Azure#46500)

* rename not_applicable to pass in _return_not_applicable_result and update tests

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* restore TODO comment above _return_not_applicable_result

Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d

Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>

* Add TODO for pass in _return_not_applicable_result

* Add back gpt_ key for backward compatibility.

Co-authored-by: Copilot <copilot@github.com>

---------

Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>
Labels

Evaluation Issues related to the client library for Azure AI Evaluation

6 participants