Update Tool Call Accuracy to output unified format#46319
Merged
Conversation
…matting (#46336) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
aprilk-ms
reviewed
Apr 16, 2026
…ed properties handling (#46355) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
…outputs (#46449) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
m7md7sien
commented
Apr 23, 2026
Update documentation to state deprecate 'gpt_' prefix
…t_applicable_result` (#46500) * rename not_applicable to pass in _return_not_applicable_result and update tests Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29 Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * restore TODO comment above _return_not_applicable_result Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com>
aprilk-ms
reviewed
Apr 24, 2026
Co-authored-by: Copilot <copilot@github.com>
aprilk-ms
approved these changes
Apr 26, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the Tool Call Accuracy evaluator to emit a more unified result format (e.g., *_score, *_properties, *_status) and aligns unit tests and the underlying prompty schema accordingly.
Changes:
- Updated the Tool Call Accuracy prompty contract to output
reason,score,status, andproperties(including a new “skipped” status behavior). - Updated
ToolCallAccuracyEvaluatorto map prompty output into the unified SDK result shape (adds*_score,*_properties,*_status,*_passed, while keeping legacy keys for compatibility). - Updated unit tests to assert against the new unified output keys and “skipped” behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_tool_call_accuracy_evaluator.py | Updates mock prompty outputs and assertions to use *_score/*_properties and new skipped behavior. |
| sdk/evaluation/azure-ai-evaluation/tests/unittests/test_agent_evaluators.py | Updates ToolCallAccuracyEvaluator expectations to reflect skipped results (score/properties None, status skipped). |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty | Renames output fields and adds a “skipped” status contract. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py | Implements unified output formatting, status handling, and properties packaging. |
| sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_prompty_eval.py | Adds a helper to return unified “not applicable/skipped” results. |
Comments suppressed due to low confidence (1)
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py:241
_convert_kwargs_to_eval_inputnever includes aresponsefield ineval_input, so_is_intermediate_response(eval_input.get("response"))and the subsequent response preprocessing are effectively dead code and won't prevent evaluating intermediate assistant messages. Consider moving the intermediate-response check to_real_call(using the originalkwargs["response"]) or removing these branches ifresponseis not part of the prompty inputs.
# Check for intermediate response
if _is_intermediate_response(eval_input.get("response")):
return self._return_not_applicable_result(
"Intermediate response. Please provide the agent's final response for evaluation.",
self.threshold,
)
# Preprocess messages if they are lists
if isinstance(eval_input.get("response"), list):
eval_input["response"] = _preprocess_messages(eval_input["response"])
fafhrd91
pushed a commit
to fafhrd91/azure-sdk-for-python
that referenced
this pull request
Apr 28, 2026
* Update Tool Call Accuracy to output unified format * Update tests * reformatting * Refactor not applicable result method calls * Fix test assertions for new unified output format and apply black formatting (Azure#46336) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/23f40ca5-7114-46ec-89be-a369e38ac971 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * Rename tool_call_accuracy reasoning output to reason and update skipped properties handling (Azure#46355) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/89b3b528-f2ac-4284-88fb-c484d4c0cce1 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * Fix tool call accuracy test for skipped output schema (Azure#46356) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/8ab1c161-c24f-4272-95ff-c8e595089e22 Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * Add back backward-compatible base result keys for tool call accuracy outputs (Azure#46449) Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/77f12326-0743-466c-9fda-8e4906364d4f Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * Update documentation to state deprecate 'gpt_' prefix Update documentation to state deprecate 'gpt_' prefix * Rename `_result` value from `not_applicable` to `pass` in `_return_not_applicable_result` (Azure#46500) * rename not_applicable to pass in _return_not_applicable_result and update tests Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/e94d600e-75a6-4b62-92cf-420fb1597e29 Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * restore TODO comment above _return_not_applicable_result Agent-Logs-Url: https://github.com/Azure/azure-sdk-for-python/sessions/1ac22d46-abad-4a51-9269-cc884c11835d Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> * Add TODO for pass in _return_not_applicable_result * Add back gpt_ key for backward compatibility. Co-authored-by: Copilot <copilot@github.com> --------- Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: m7md7sien <16615690+m7md7sien@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Update Tool Call Accuracy to output unified format
If an SDK is being regenerated based on a new API spec, a link to the pull request containing these API spec changes should be included above.
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines