
Use -passed for result aggregation instead of _result #46833

Draft
Copilot wants to merge 2 commits into main from copilot/isolate-azure-evaluation-changes

Conversation

Contributor

Copilot AI commented May 11, 2026

This PR extracts only the _evaluate.py portion of #46436 into a standalone change targeting main. Scope is intentionally limited to schema-alignment updates in sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py.

  • Scope

    • Single-file change: azure/ai/evaluation/_evaluate/_evaluate.py
    • No changelog/version or cross-file edits
  • Binary aggregation alignment (_result → _passed)

    • Removed EVALUATION_PASS_FAIL_MAPPING import from this module
    • Updated binary-column discovery to aggregate outputs.*_passed columns
    • Updated related comments/docstring wording to reflect passed-state aggregation
  • Pass-count computation update

    • Replaced mapped string counting with direct boolean sum for pass count:
    pass_count = df[col].sum()
  • Metric exclusion update during mean aggregation

    • Updated exclusion set from *_result to *_passed
    • Renamed the local variable accordingly (threshold_and_result_cols → threshold_and_passed_cols)
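The updated aggregation path described above can be sketched as follows. This is an illustrative sketch, not the SDK implementation: the column names and data are made up, but the discovery of `outputs.*_passed` columns and the direct boolean sum mirror the change.

```python
import pandas as pd

# Illustrative dataframe: boolean *_passed columns replace string *_result columns.
df = pd.DataFrame({
    "outputs.relevance.relevance": [4, 5, 2, 4],
    "outputs.relevance.relevance_threshold": [3, 3, 3, 3],
    "outputs.relevance.relevance_passed": [True, True, False, True],
})

# Find all output columns that end with "_passed"
passed_columns = [
    col for col in df.columns
    if col.startswith("outputs.") and col.endswith("_passed")
]

results = {}
for col in passed_columns:
    # Extract the evaluator name (outputs.<evaluator>.<metric>_passed)
    parts = col.split(".")
    evaluator_name = parts[1] if len(parts) >= 3 else None
    total_rows = len(df)
    # Booleans sum directly to the pass count; no string mapping needed
    pass_count = df[col].sum()
    results[evaluator_name] = pass_count / total_rows if total_rows > 0 else 0.0

print(results)  # {'relevance': 0.75}
```

Summing a boolean column avoids the previous dependency on `EVALUATION_PASS_FAIL_MAPPING` to translate pass/fail strings before counting.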
Original prompt

This PR isolates the changes made to sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py in PR #46436 into a standalone PR targeting main.

Goal

Apply exactly the following changes (and only these changes) to the file sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py on a new branch created off main, then open a PR targeting main.

Do not modify any other file in the repository. Do not run formatters, linters, or any tooling that may alter other lines. Do not change the changelog or version files.

Exact diff to apply

Apply this unified diff verbatim to sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py (the file currently on main):

@@ -26,7 +26,6 @@
 
 from .._constants import (
     CONTENT_SAFETY_DEFECT_RATE_THRESHOLD_DEFAULT,
-    EVALUATION_PASS_FAIL_MAPPING,
     EvaluationMetrics,
     DefaultOpenEncoding,
     Prefixes,
@@ -251,7 +250,7 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
     """
     Aggregate binary output results (pass/fail) from evaluation dataframe.
 
-    For each evaluator, calculates the proportion of "pass" results.
+    For each evaluator, calculates the proportion of passed results.
 
     :param df: The dataframe of evaluation results.
     :type df: ~pandas.DataFrame
@@ -260,12 +259,12 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
     """
     results = {}
 
-    # Find all columns that end with "_result"
-    result_columns = [col for col in df.columns if col.startswith("outputs.") and col.endswith("_result")]
+    # Find all columns that end with "_passed"
+    passed_columns = [col for col in df.columns if col.startswith("outputs.") and col.endswith("_passed")]
 
-    for col in result_columns:
+    for col in passed_columns:
         # Extract the evaluator name from the column name
-        # (outputs.<evaluator>.<metric>_result)
+        # (outputs.<evaluator>.<metric>_passed)
         parts = col.split(".")
         evaluator_name = None
         if len(parts) >= 3:
@@ -292,7 +291,7 @@ def _aggregation_binary_output(df: pd.DataFrame) -> Dict[str, float]:
 
             # Calculate the proportion of EVALUATION_PASS_FAIL_MAPPING[True] results
             total_rows = len(df)
-            pass_count = value_counts.get(EVALUATION_PASS_FAIL_MAPPING[True], 0)
+            pass_count = df[col].sum()
             proportion = pass_count / total_rows if total_rows > 0 else 0.0
 
             # Set the result with the evaluator name as the key
@@ -384,10 +383,10 @@ def _aggregate_metrics(df: pd.DataFrame, evaluators: Dict[str, Callable]) -> Dic
     token_count_cols = _get_token_count_columns_to_exclude(df)
     handled_columns.extend(token_count_cols)
 
-    # Exclude threshold and result columns from aggregation
+    # Exclude threshold and passed columns from aggregation
     # These are per-row metadata, not metrics to be averaged
-    threshold_and_result_cols = [col for col in df.columns if col.endswith("_threshold") or col.endswith("_result")]
-    handled_columns.extend(threshold_and_result_cols)
+    threshold_and_passed_cols = [col for col in df.columns if col.endswith("_threshold") or col.endswith("_passed")]
+    handled_columns.extend(threshold_and_passed_cols)
 
     # For rest of metrics, we will calculate mean
     df.drop(columns=handled_columns, inplace=True)
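The final hunk of the diff excludes per-row metadata from mean aggregation. A minimal sketch of that step, with hypothetical column names and values:

```python
import pandas as pd

# Illustrative dataframe with one metric column plus per-row metadata
df = pd.DataFrame({
    "outputs.relevance.relevance": [4.0, 5.0, 3.0],
    "outputs.relevance.relevance_threshold": [3, 3, 3],
    "outputs.relevance.relevance_passed": [True, True, False],
})

# Exclude threshold and passed columns from aggregation:
# they are per-row metadata, not metrics to be averaged
threshold_and_passed_cols = [
    col for col in df.columns
    if col.endswith("_threshold") or col.endswith("_passed")
]
df = df.drop(columns=threshold_and_passed_cols)

# For the remaining metric columns, calculate the mean
metrics = df.mean().to_dict()
print(metrics)  # {'outputs.relevance.relevance': 4.0}
```

Before this change the exclusion matched `*_result`; with boolean `*_passed` columns, excluding them here keeps non-numeric-rate metadata out of the averaged metrics.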

Authoritative reference

The exact content the file should have AFTER applying the diff matches the version currently on the mohessie/standardize_output_schema branch. You can copy the file directly from there to ensure byte-for-byte equivalence:

Recommended approach: fetch the file content from the mohessie/standardize_output_schema branch and commit it to your new branch off main. Verify with git diff origin/main that the only changed file is _evaluate.py and the diff matches the one above exactly.

Branch and PR requirements

Validation

  • The PR's "Files changed" tab must show exactly one file changed: sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluate/_evaluate.py.
  • The diff must match the unified diff above exactly (no extra whitespace, no ref...

This pull request was created from Copilot chat.

Copilot AI changed the title [WIP] Isolate changes from PR #46436 for evaluation module [evaluation] Split _evaluate.py schema-alignment changes from #46436 May 11, 2026
Copilot AI requested a review from m7md7sien May 11, 2026 22:52
@m7md7sien m7md7sien changed the title [evaluation] Split _evaluate.py schema-alignment changes from #46436 USe -passed for result aggregation instead of _result May 11, 2026
@m7md7sien m7md7sien changed the title USe -passed for result aggregation instead of _result Use -passed for result aggregation instead of _result May 11, 2026