Python: Foundry Evals integration for Python#4750

Draft
alliscode wants to merge 11 commits into microsoft:main from alliscode:af-foundry-evals-python

Conversation

@alliscode
Member

Add evaluation framework with local and Foundry-hosted evaluator support:

  • EvalItem/EvalResult core types with conversation splitting strategies
  • @evaluator decorator for defining custom evaluation functions
  • LocalEvaluator for running evaluations locally
  • FoundryEvals provider for Azure AI Foundry hosted evaluations
  • evaluate_agent() orchestration with expected values support
  • evaluate_workflow() for multi-agent workflow evaluation
  • Comprehensive test suite and evaluation samples

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@markwallace-microsoft added the documentation and python labels Mar 17, 2026
@github-actions bot changed the title from "Foundry Evals integration for Python" to "Python: Foundry Evals integration for Python" Mar 17, 2026
@alliscode force-pushed the af-foundry-evals-python branch from a0edd5f to fe9e621 March 17, 2026 21:21
@alliscode force-pushed the af-foundry-evals-python branch 6 times, most recently from 15d8640 to aad92ac March 19, 2026 20:41
@markwallace-microsoft
Member

markwallace-microsoft commented Mar 19, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| packages/azure-ai/agent_framework_azure_ai | | | | |
| _foundry_evals.py | 235 | 17 | 92% | 278, 309–313, 322–331, 757 |
| packages/core/agent_framework | | | | |
| _agents.py | 364 | 47 | 87% | 465, 469, 524, 949, 985, 1001, 1098–1102, 1157, 1185, 1318, 1334, 1336, 1349, 1355, 1391, 1393, 1402–1407, 1412, 1414, 1420–1421, 1428, 1430–1431, 1439–1440, 1443–1445, 1455–1460, 1464, 1469, 1471 |
| _evaluation.py | 627 | 96 | 84% | 217, 247, 262, 476, 478, 582–583, 662–664, 669, 706–709, 766–767, 770, 776–778, 782, 815–817, 869, 894–902, 907–908, 913–916, 921, 926, 932, 1028, 1140, 1456, 1458, 1466, 1476, 1480, 1520, 1524–1526, 1531–1534, 1538–1539, 1558, 1560, 1566–1569, 1571, 1641, 1647, 1662, 1666–1668, 1698, 1704–1708, 1742, 1763–1766, 1768, 1770–1772, 1782, 1790–1791, 1793, 1818–1819, 1824 |
| packages/core/agent_framework/_workflows | | | | |
| _agent_executor.py | 205 | 17 | 91% | 109, 133, 174, 200–201, 256–257, 259–260, 296–298, 300, 413–414, 479, 498 |
| _workflow.py | 271 | 19 | 92% | 88, 269–271, 273–274, 292, 296, 435, 623, 644, 700, 712, 718, 723, 743–745, 758 |
| TOTAL | 28166 | 3346 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 5522 | 21 💤 | 0 ❌ | 0 🔥 | 1m 27s ⏱️ |

@alliscode force-pushed the af-foundry-evals-python branch 2 times, most recently from af0ccf6 to 45527ee March 20, 2026 21:24
alliscode and others added 3 commits March 23, 2026 08:48
Merged and refactored eval module per Eduard's PR review:

- Merge _eval.py + _local_eval.py into single _evaluation.py
- Convert EvalItem from dataclass to regular class
- Rename to_dict() to to_eval_data()
- Convert _AgentEvalData to TypedDict
- Simplify check system: unified async pattern with isawaitable
- Parallelize checks and evaluators with asyncio.gather
- Add all/any mode to tool_called_check
- Fix bool(passed) truthy bug in _coerce_result
- Remove deprecated function_evaluator/async_function_evaluator aliases
- Remove _MinimalAgent, tighten evaluate_agent signature
- Set self.name in __init__ (LocalEvaluator, FoundryEvals)
- Limit FoundryEvals to AsyncOpenAI only
- Type project_client as AIProjectClient
- Remove NotImplementedError continuous eval code
- Add evaluation samples in 02-agents/ and 03-workflows/
- Update all imports and tests (167 passing)
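
The unified check pattern in this commit (sync and async checks handled through `isawaitable`, run concurrently with `asyncio.gather`, with verdicts coerced explicitly instead of via `bool()`) could be sketched roughly as follows. This is only an illustration: `run_checks`, `run_one`, `coerce_passed`, and the sample checks are hypothetical names, not the contents of `_evaluation.py`, and the truthiness bug is assumed to be of the "non-empty string is truthy" kind.

```python
import asyncio
import inspect
from collections.abc import Awaitable, Callable
from typing import Any, Union

# Hypothetical check type: takes a response string and returns a verdict,
# either directly or as an awaitable.
Check = Callable[[str], Union[Any, Awaitable[Any]]]


def coerce_passed(value: Any) -> bool:
    """Coerce a check verdict to bool without relying on truthiness."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # bool("false") is True, so strings are parsed explicitly.
        return value.strip().lower() in {"true", "pass", "passed", "yes"}
    if isinstance(value, (int, float)):
        return value > 0
    raise TypeError(f"Unsupported check result: {value!r}")


async def run_one(check: Check, response: str) -> bool:
    """Call a check and await the result only if it is awaitable."""
    result = check(response)
    if inspect.isawaitable(result):
        result = await result
    return coerce_passed(result)


async def run_checks(checks: list[Check], response: str) -> list[bool]:
    """Run all checks concurrently; results keep the order of `checks`."""
    return list(await asyncio.gather(*(run_one(c, response) for c in checks)))


def contains_hello(response: str) -> bool:  # synchronous check
    return "hello" in response


async def says_false(response: str) -> str:  # async check returning a string
    await asyncio.sleep(0)
    return "false"


results = asyncio.run(run_checks([contains_hello, says_false], "hello world"))
print(results)
```

With a plain `bool(result)`, the second check would count as passed because `"false"` is a non-empty string; the explicit coercion avoids that.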

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use cast(list[Any], x) with type: ignore[redundant-cast] comments to
satisfy both mypy (which considers casting Any redundant) and pyright
strict mode (which needs explicit casts to narrow Unknown types).

Also fix evaluator decorator check_name type annotation to be
explicitly str, resolving mypy str|Any|None mismatch.
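
For readers unfamiliar with the cast-plus-ignore pattern mentioned above, a minimal sketch is shown here. The function `tool_names` and the payload shape are hypothetical; whether a given checker actually flags this exact line depends on its configuration, so treat the comment placement as an illustration of the idea, not a copy of the PR's code.

```python
from typing import Any, cast


def tool_names(payload: dict[str, Any]) -> list[str]:
    # The explicit cast gives pyright (strict) a concrete list type to work
    # with; the ignore comment silences mypy where it reports the cast as
    # redundant. At runtime cast() simply returns its second argument.
    raw_tools = cast(list[Any], payload.get("tools", []))  # type: ignore[redundant-cast]
    return [str(tool["name"]) for tool in raw_tools]


names = tool_names({"tools": [{"name": "search"}, {"name": "weather"}]})
print(names)
```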

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…attr

- Apply pyupgrade: Sequence from collections.abc, remove forward-ref quotes
- Add @overload signatures to evaluator() for proper @evaluator usage
- Fix evaluate_workflow sample to use WorkflowBuilder(start_executor=) API
- Fix _workflow.py executor.reset() to use getattr pattern for pyright
- Remove unused EvalResults forward-ref string in default_factory lambda
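
As a rough illustration of why `@overload` signatures help a decorator that works both bare and with arguments, consider the sketch below. It assumes a simplified, hypothetical `evaluator` that only records a `check_name` attribute; the real decorator in this PR does more and may have a different signature.

```python
from collections.abc import Callable
from typing import Any, Optional, Union, overload

CheckFn = Callable[..., Any]


@overload
def evaluator(fn: CheckFn) -> CheckFn: ...


@overload
def evaluator(*, name: Optional[str] = None) -> Callable[[CheckFn], CheckFn]: ...


def evaluator(
    fn: Optional[CheckFn] = None, *, name: Optional[str] = None
) -> Union[CheckFn, Callable[[CheckFn], CheckFn]]:
    """Hypothetical decorator usable as @evaluator or @evaluator(name=...)."""

    def wrap(func: CheckFn) -> CheckFn:
        # check_name is annotated as str explicitly, mirroring the mypy fix
        # described in the earlier commit.
        check_name: str = name if name is not None else func.__name__
        setattr(func, "check_name", check_name)
        return func

    # Bare usage passes the function directly; parameterized usage returns
    # the inner decorator.
    return wrap(fn) if fn is not None else wrap


@evaluator
def is_nonempty(response: str) -> bool:
    return bool(response.strip())


@evaluator(name="mentions_weather")
def has_weather(response: str) -> bool:
    return "weather" in response.lower()


print(is_nonempty.check_name, has_weather.check_name)
```

The two overloads let a type checker tell the two call shapes apart, so the decorated function keeps a precise type in both cases.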

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@alliscode force-pushed the af-foundry-evals-python branch from 5c6ab9b to 5dccdc2 March 23, 2026 15:48
alliscode and others added 8 commits March 23, 2026 09:21
The test_configure_otel_providers_with_env_file_and_vs_code_port test
triggers gRPC OTLP exporter creation, but the grpc dependency is
optional and not installed by default. Add skipif decorator matching
the pattern used by all other gRPC exporter tests in the same file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move module docstrings before imports (after copyright header)
- Add -> None return type to all main() and helper functions
- Fix line-too-long in multiturn sample conversation data
- Add Workflow import for typed return in all_patterns_sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nings

- Simplify _ensure_async_result to direct await (async-only clients)
- Replace get_event_loop() with get_running_loop()
- Narrow _fetch_output_items exception handling to specific types
- Add warning log when _filter_tool_evaluators falls back to defaults
- Add DeprecationWarning to options alias in Agent.__init__
- Add DeprecationWarning to evaluate_response()
- Rename raw key to _raw_arguments in convert_message fallback
- Fix evaluate_agent_sample.py: replace evals.select() with FoundryEvals()
- Fix evaluate_multiturn_sample.py: use Message/Content/FunctionTool types
- Fix evaluate_workflow_sample.py: replace evals.select() with FoundryEvals()
- Update test mocks to use AsyncMock for awaited API calls
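
Two of the items above, `get_running_loop()` and the deprecation warnings, follow common standard-library patterns that can be sketched as below. The functions `evaluate_text`, `evaluate_text_legacy`, and `blocking_score` are hypothetical stand-ins and do not reflect the actual signatures of `evaluate_response()` or its replacement.

```python
import asyncio
import warnings


def blocking_score(text: str) -> int:
    """Stand-in for a blocking call (hypothetical)."""
    return len(text.split())


async def evaluate_text(text: str) -> int:
    # get_running_loop() must be called from inside a running event loop and
    # raises RuntimeError otherwise, which makes misuse visible early.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_score, text)


async def evaluate_text_legacy(text: str) -> int:
    """Deprecated alias: warn, then delegate to the new function."""
    warnings.warn(
        "evaluate_text_legacy() is deprecated; use evaluate_text() instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return await evaluate_text(text)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    score = asyncio.run(evaluate_text_legacy("one two three"))

print(score, [str(w.message) for w in caught])
```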

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add num_repetitions=2 positive test verifying 2×items and 4 agent calls
- Add _poll_eval_run tests: timeout, failed, and canceled paths
- Add evaluate_traces tests: validation error, response_ids path, trace_ids path
- Add evaluate_foundry_target happy-path test with target/query verification
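
The three `_poll_eval_run` paths under test (timeout, failed, canceled) suggest a polling loop along the lines of the sketch below. This is an assumption about the general shape only: `poll_until_done`, `fake_run`, and the status strings are hypothetical and are not taken from `_foundry_evals.py`.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable

# Hypothetical terminal statuses treated as errors.
TERMINAL_FAILURES = {"failed", "canceled"}


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[str]],
    timeout_s: float,
    interval_s: float = 0.01,
) -> str:
    """Poll until the run completes, fails, is canceled, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = await fetch_status()
        if status == "completed":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"Eval run ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Eval run still {status!r} after {timeout_s}s")
        await asyncio.sleep(interval_s)


def fake_run(statuses: list[str]) -> Callable[[], Awaitable[str]]:
    """Test double: yields each status once, then repeats the last one."""
    remaining = iter(statuses)
    last = statuses[-1]

    async def fetch() -> str:
        return next(remaining, last)

    return fetch


ok = asyncio.run(
    poll_until_done(fake_run(["queued", "in_progress", "completed"]), timeout_s=1.0)
)
print(ok)
```

A fake status source like `fake_run` keeps such tests fast and deterministic, since no real service call or long sleep is involved.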

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap implicit string concatenation in parens in evaluate_multiturn_sample.py
- Apply ruff formatter to 6 other files with minor formatting drift

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nch)

Reverts changes to _agents.py, _agent_executor.py, and _workflow.py
back to upstream/main. These fixes are now in a separate PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code fixes:
- Fix _normalize_queries inverted condition (single query now replicates
  to match expected_count)
- Fix substring match bug: 'end' in 'backend' matched; use exact set
  lookup for executor ID filtering
- Fix used_available_tools sample: tool_definitions→tools param, use
  FunctionTool attribute access instead of dict .get()
- Add None-check in _resolve_openai_client for misconfigured project
- Add Returns section to evaluate_workflow docstring
- Cache inspect.signature in @evaluator wrapper (avoid per-item reflection)
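
The first two fixes in this list can be illustrated with a short sketch. The names `SKIPPED_IDS`, `is_skipped`, and `normalize_queries` are hypothetical, and the replication rule is an assumption based on the commit text (a single query is repeated to match `expected_count`); the real helpers may behave differently in edge cases.

```python
from collections.abc import Sequence
from typing import Union

# Hypothetical executor IDs to filter out.
SKIPPED_IDS = {"start", "end"}


def is_skipped_buggy(executor_id: str) -> bool:
    # Substring test: "end" in "backend" is True, so "backend" is skipped.
    return any(marker in executor_id for marker in SKIPPED_IDS)


def is_skipped(executor_id: str) -> bool:
    # Exact set lookup: only IDs equal to an entry are skipped.
    return executor_id in SKIPPED_IDS


def normalize_queries(
    queries: Union[str, Sequence[str]], expected_count: int
) -> list[str]:
    """Return exactly expected_count queries, replicating a single one."""
    items = [queries] if isinstance(queries, str) else list(queries)
    if len(items) == 1:
        return items * expected_count
    if len(items) != expected_count:
        raise ValueError(f"Expected {expected_count} queries, got {len(items)}")
    return items


print(is_skipped_buggy("backend"), is_skipped("backend"))
print(normalize_queries("What is the weather?", 3))
```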

Architecture:
- Extract _evaluate_via_responses as module-level helper; evaluate_traces
  now calls it directly instead of creating a FoundryEvals instance
- Move Foundry-specific typed-content conversion out of core to_eval_data;
  core now returns plain role/content dicts, FoundryEvals applies
  AgentEvalConverter in _evaluate_via_dataset

Tests:
- evaluate_response() deprecation warning emission and delegation
- num_repetitions > 1 with expected_output and expected_tool_calls
- Mock output_items.list in test_evaluate_calls_evals_api
- Update to_eval_data assertions for plain-dict format
- Unknown param error now raised at @evaluator decoration time
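
Caching `inspect.signature` and raising on unknown parameters at decoration time fit together naturally: the signature is inspected once, validated, and then reused for every item. A minimal sketch of that idea follows, under the assumption of a simplified, hypothetical `evaluator` and a made-up `SUPPORTED_PARAMS` set; it is not the decorator shipped in this PR.

```python
import functools
import inspect
from typing import Any, Callable

# Hypothetical set of item fields a check function may ask for.
SUPPORTED_PARAMS = {"query", "response", "expected_output"}


def evaluator(fn: Callable[..., Any]) -> Callable[[dict[str, Any]], Any]:
    # Reflection happens once here, not on every evaluated item.
    params = tuple(inspect.signature(fn).parameters)
    unknown = set(params) - SUPPORTED_PARAMS
    if unknown:
        # Fail when the decorator is applied, before any evaluation runs.
        raise TypeError(
            f"{fn.__name__}: unsupported parameter(s) {sorted(unknown)}"
        )

    @functools.wraps(fn)
    def wrapper(item: dict[str, Any]) -> Any:
        # Pass only the fields the function declared, using the cached names.
        return fn(**{name: item[name] for name in params})

    return wrapper


@evaluator
def exact_match(response: str, expected_output: str) -> bool:
    return response.strip() == expected_output.strip()


passed = exact_match({"query": "2+2?", "response": "4 ", "expected_output": "4"})
print(passed)
```

Failing at decoration time surfaces a misspelled parameter at import, instead of midway through an evaluation run.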

Skipped (separate PR): executor reset loop, xfail removal, options alias

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels

documentation Improvements or additions to documentation python

3 participants