Update Tool Call Accuracy to output unified format by m7md7sien · Pull Request #46319 · Azure/azure-sdk-for-python

m7md7sien · 2026-04-14T22:49:43Z

Description

Update Tool Call Accuracy to output unified format

If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.

All SDK Contribution checklist:

The pull request does not introduce [breaking changes]
CHANGELOG is updated for new features, bug fixes or other significant changes.
I have read the contribution guidelines.

General Guidelines and Best Practices

Title of the pull request is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

Pull request includes test coverage for the included changes.

Copilot

Pull request overview

This PR updates the Tool Call Accuracy evaluator to emit a more unified result format (e.g., *_score, *_properties, *_status) and aligns unit tests and the underlying prompty schema accordingly.

Changes:

Updated the Tool Call Accuracy prompty contract to output reason, score, status, and properties (including a new “skipped” status behavior).
Updated ToolCallAccuracyEvaluator to map prompty output into the unified SDK result shape (adds *_score, *_properties, *_status, *_passed, while keeping legacy keys for compatibility).
Updated unit tests to assert against the new unified output keys and “skipped” behavior.
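
The mapping described in the list above can be pictured with a small sketch. It is not the SDK's implementation: the function name `to_unified_result`, the `tool_call_accuracy` key prefix, the `_reason` key, the `"completed"` default status, and the exact legacy key names are all assumptions made for illustration. Only the `*_score`, `*_properties`, `*_status`, and `*_passed` suffixes and the prompty fields `reason`, `score`, `status`, and `properties` come from the review summary.

```python
from typing import Any, Dict, Optional


def to_unified_result(
    prompty_output: Dict[str, Any],
    threshold: float,
    base_key: str = "tool_call_accuracy",  # assumed key prefix
) -> Dict[str, Any]:
    """Map a prompty-style output dict onto unified result keys (illustrative only)."""
    score: Optional[float] = prompty_output.get("score")
    status = prompty_output.get("status", "completed")  # default value is an assumption

    # Assumption: a result passes when its score meets the threshold;
    # with no score there is nothing to compare, so leave it undecided.
    passed = None if score is None else score >= threshold

    result: Dict[str, Any] = {
        f"{base_key}_score": score,
        f"{base_key}_reason": prompty_output.get("reason", ""),  # assumed key
        f"{base_key}_status": status,
        f"{base_key}_properties": prompty_output.get("properties"),
        f"{base_key}_passed": passed,
    }
    # Legacy keys kept for compatibility (exact names are assumed).
    result[base_key] = score
    result[f"gpt_{base_key}"] = score
    return result
```

Keeping the legacy keys next to the new ones lets existing consumers keep reading the old names while new code moves to the suffixed keys.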

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updates mock prompty outputs and assertions to use `_score`/`_properties` and new skipped behavior. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py | Updates ToolCallAccuracyEvaluator expectations to reflect skipped results (score/properties `None`, status `skipped`). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty | Renames output fields and adds a “skipped” status contract. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Implements unified output formatting, status handling, and properties packaging. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py | Adds a helper to return unified “not applicable/skipped” results. |
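
As a rough picture of the "not applicable/skipped" helper, the sketch below builds a skipped result from a message and a threshold, matching the two arguments passed to `_return_not_applicable_result` in the code quoted in this review. The `None` score and properties and the `skipped` status follow the test expectations summarized above, and the `pass` value for `_result` follows the commit titled "Rename `_result` value from `not_applicable` to `pass`". The function name `skipped_result`, the `tool_call_accuracy` prefix, and the `_reason` and `_threshold` keys are assumptions, not the SDK's actual code.

```python
from typing import Any, Dict


def skipped_result(
    reason: str,
    threshold: float,
    base_key: str = "tool_call_accuracy",  # assumed key prefix
) -> Dict[str, Any]:
    """Build a skipped ("not applicable") result dict (illustrative only)."""
    return {
        f"{base_key}_score": None,        # nothing was scored
        f"{base_key}_properties": None,   # no details to report
        f"{base_key}_status": "skipped",
        f"{base_key}_result": "pass",     # value named in the PR's commit history
        f"{base_key}_reason": reason,     # assumed key
        f"{base_key}_threshold": threshold,  # assumed key
    }
```

A caller could use it as `skipped_result("No tool calls found in response.", threshold=3)`; the message text there is only an example.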

Comments suppressed due to low confidence (1)

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py:241

_convert_kwargs_to_eval_input never includes a response field in eval_input, so _is_intermediate_response(eval_input.get("response")) and the subsequent response preprocessing are effectively dead code and won't prevent evaluating intermediate assistant messages. Consider moving the intermediate-response check to _real_call (using the original kwargs["response"]) or removing these branches if response is not part of the prompty inputs.

        # Check for intermediate response
        if _is_intermediate_response(eval_input.get("response")):
            return self._return_not_applicable_result(
                "Intermediate response. Please provide the agent's final response for evaluation.",
                self.threshold,
            )

        # Preprocess messages if they are lists
        if isinstance(eval_input.get("response"), list):
            eval_input["response"] = _preprocess_messages(eval_input["response"])
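
One way to read the suggestion above is to run the intermediate-response check against the caller's original `response` before the keyword arguments are converted. The sketch below is a minimal, standalone illustration of that ordering. `looks_intermediate`, `build_eval_input`, and `real_call` are hypothetical stand-ins, not the SDK's `_is_intermediate_response`, `_convert_kwargs_to_eval_input`, or `_real_call`, and the rule "the last message is not from the assistant" is an assumed definition of an intermediate response.

```python
from typing import Any, Dict

SKIP_MESSAGE = "Intermediate response. Please provide the agent's final response for evaluation."


def looks_intermediate(response: Any) -> bool:
    """Hypothetical check: a message list whose last entry is not an
    assistant message is treated as intermediate."""
    if not isinstance(response, list) or not response:
        return False
    last = response[-1]
    return not (isinstance(last, dict) and last.get("role") == "assistant")


def build_eval_input(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in converter that, like the one described above, never copies `response`."""
    return {k: kwargs.get(k) for k in ("query", "tool_calls", "tool_definitions")}


def real_call(**kwargs: Any) -> Dict[str, Any]:
    # Inspect the original response before conversion drops it.
    if looks_intermediate(kwargs.get("response")):
        return {"status": "skipped", "score": None, "reason": SKIP_MESSAGE}
    eval_input = build_eval_input(**kwargs)
    # A real evaluator would invoke the prompty flow here; this sketch just echoes.
    return {"status": "completed", "eval_input": eval_input}
```

Because the check reads `kwargs["response"]` directly, it no longer depends on whether the converted input carries a `response` field.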

* Update Tool Call Accuracy to output unified format
* Update tests
* reformatting
* Refactor not applicable result method calls
* Fix test assertions for new unified output format and apply black formatting (Azure#46336)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (Azure#46355)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Fix tool call accuracy test for skipped output schema (Azure#46356)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Add back backward-compatible base result keys for tool call accuracy outputs (Azure#46449)
  Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Update documentation to state deprecate 'gpt_' prefix
  Update documentation to state deprecate 'gpt_' prefix
* Rename `_result` value from `not_applicable` to `pass` in `_return_not_applicable_result` (Azure#46500)
  * rename not_applicable to pass in _return_not_applicable_result and update tests
    Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29
    Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
  * restore TODO comment above _return_not_applicable_result
    Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d
    Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
* Add TODO for pass in _return_not_applicable_result
* Add back gpt_ key for backward compatibility.
  Co-authored-by: Copilot <copilot@github.com>
---------
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Co-authored-by: Copilot <copilot@github.com>

Update Tool Call Accuracy to output unified format

3eb40a8

github-actions bot added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Apr 14, 2026

m7md7sien and others added 5 commits April 15, 2026 20:09

Update tests

d3c4092

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

d076d5c

reformatting

5032e26

Refactor not applicable result method calls

a525806

aprilk-ms reviewed Apr 16, 2026

Comment thread ...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py

aprilk-ms reviewed Apr 16, 2026

Comment thread ...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py Outdated

aprilk-ms reviewed Apr 16, 2026

Comment thread ...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py

Copilot AI and others added 7 commits April 16, 2026 22:10

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

83576b4

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

1893bc8

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

1a7f191

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

adff374

m7md7sien commented Apr 23, 2026

Comment thread ...evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py Outdated

m7md7sien and others added 2 commits April 23, 2026 22:03

Update documentation to state deprecate 'gpt_' prefix

3d2aaa1

aprilk-ms reviewed Apr 24, 2026

Comment thread ...evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py

aprilk-ms reviewed Apr 24, 2026

Comment thread ...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py Outdated

salma-elshafey reviewed Apr 26, 2026

m7md7sien and others added 3 commits April 26, 2026 19:54

Add TODO for pass in _return_not_applicable_result

a288559

Add back gpt_ key for backward compatibility.

4db82df

Co-authored-by: Copilot <copilot@github.com>

Merge branch 'main' into mohessie/unify_output/tool_call_accuracy

7c1243a

aprilk-ms approved these changes Apr 26, 2026

m7md7sien marked this pull request as ready for review April 26, 2026 18:10

m7md7sien requested a review from a team as a code owner April 26, 2026 18:10

Copilot AI review requested due to automatic review settings April 26, 2026 18:10

Copilot started reviewing on behalf of m7md7sien April 26, 2026 18:11 View session

m7md7sien enabled auto-merge (squash) April 26, 2026 18:12

Copilot AI reviewed Apr 26, 2026

Comment thread ...ure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py

Comment thread sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py

m7md7sien merged commit 6039d5c into main Apr 26, 2026
26 checks passed

m7md7sien deleted the mohessie/unify_output/tool_call_accuracy branch April 26, 2026 18:44

m7md7sien merged 18 commits into main from mohessie/unify_output/tool_call_accuracy


6 participants
