
Conversation

Contributor

@xmican10 xmican10 commented Jan 14, 2026

Description

Added the ability to set a list of expected_response values to handle the non-deterministic nature of LLM/LCORE API responses. Added unit tests as well. Updated README.md.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Support for multiple alternative expected responses per evaluation turn; evaluator tries alternatives, stops on first pass, and aggregates scores/reasons/token usage.
  • Documentation

    • README: added "Multiple expected responses" subsection with usage, behavior, and updated examples (placed before "Metrics override behavior").
  • Behavior

    • NLP-style metrics still require a single expected response (providing a list will error).
  • Refactor

    • Standardized metric result structure surfaced to evaluation outputs.
  • Tests

    • Added unit tests for validation and multi-response evaluator behavior.
  • Examples

    • Added sample evaluation data demonstrating multiple expected responses.


Contributor

coderabbitai bot commented Jan 14, 2026

Walkthrough

Turn-level models now accept expected_response as a string or list of strings with validation. Evaluator logic was refactored to handle multiple expected responses (when allowed by METRIC_REQUIREMENTS), aggregate judge token counts, and short-circuit on PASS. NLPMetrics now rejects list expected responses. README, tests, and sample YAML updated.

Changes

  • Docs & sample (README.md, config/evaluation_data_multiple_expected_responses.yaml): Added "Multiple expected responses" subsection, updated examples, and added sample YAML demonstrating an array of expected_responses.
  • Data models & public API (src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/models/__init__.py): TurnData.expected_response and EvaluationResult.expected_response changed to Optional[Union[str, list[str]]] with validators. Introduced and exported public MetricResult. EvaluationResult now inherits from MetricResult.
  • Evaluation engine (src/lightspeed_evaluation/pipeline/evaluation/evaluator.py): Centralized evaluation in _evaluate_wrapper; added support for multiple expected responses (when METRIC_REQUIREMENTS permit), per-alternative token accounting, PASS short-circuiting, aggregation of scores/reasons, and unified MetricResult propagation.
  • Metrics behavior (src/lightspeed_evaluation/core/metrics/nlp.py): NLPMetrics._extract_turn_data now raises MetricError if expected_response is a list (NLP metrics require a single string).
  • Tests (tests/unit/core/models/test_data.py, tests/unit/pipeline/evaluation/test_evaluator.py): Added tests for TurnData.expected_response validation (string/list/None and invalid cases). Expanded evaluator tests to cover string/list/None expected_response across frameworks, threshold outcomes, token accounting, and multi-alternative behavior.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as Caller (pipeline / test)
    participant Eval as MetricsEvaluator
    participant Scope as EvaluationScope
    participant Handler as Metric Handler
    Caller->>Eval: evaluate_metric(request, scope, ...)
    Eval->>Eval: _evaluate_wrapper(request, scope)
    Eval->>Eval: extract TurnData.expected_response
    alt expected_response is list and metric supports multiple
        loop alt in expected_responses
            Eval->>Scope: build alternate scope with alt
            Eval->>Handler: _evaluate(request, alternate_scope)
            Handler-->>Eval: MetricResult(score, reason, judge_in, judge_out)
            alt result == PASS
                Eval->>Eval: stop loop (short-circuit)
            end
        end
    else expected_response is string/None or metric doesn't support multiple
        Eval->>Handler: _evaluate(request, scope)
        Handler-->>Eval: MetricResult(score, reason, judge_in, judge_out)
    end
    Eval->>Eval: aggregate tokens, determine final status, compute execution_time
    Eval-->>Caller: return EvaluationResult (includes MetricResult fields + aggregated tokens)
```
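
For readers who prefer code, here is a minimal sketch of the flow in the diagram, assuming a hypothetical `evaluate_fn(alt)` callable that scores the response against a single alternative; the real evaluator works with MetricResult objects and EvaluationScope rather than bare tuples.

```python
def try_alternatives(evaluate_fn, alternatives, threshold):
    """Sketch of the flow above: try each alternative, stop on the first PASS.

    `evaluate_fn(alt)` is assumed to return (score, reason, judge_in, judge_out).
    If no alternative passes, the last alternative's score and reason are kept.
    """
    score, reason = None, ""
    total_in = total_out = 0

    for alt in alternatives:
        score, reason, judge_in, judge_out = evaluate_fn(alt)

        # Judge token usage is accumulated across every alternative that was tried.
        total_in += judge_in
        total_out += judge_out

        if score is not None and score >= threshold:
            break  # PASS: short-circuit, no need to try further alternatives

    return score, reason, total_in, total_out
```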

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title 'Leads 6 set expected responses' is vague and uses a non-descriptive pattern that doesn't clearly convey the main change; 'Leads 6' appears to be a branch/ticket prefix rather than descriptive language. Resolution: Consider a more descriptive title like 'Support multiple expected responses for metrics' or 'Enable list-based expected responses to handle non-deterministic LLM outputs' to clearly communicate the feature.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 96.00%, which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/core/metrics/nlp.py`:
- Around line 64-68: The code incorrectly concatenates list alternatives with
"".join(), corrupting text boundaries; change the handling of expected_response
so when it is a list you pick a single alternative (e.g., use
expected_response[0] if list non-empty) instead of joining, or implement the
evaluator.py pattern of scoring each alternative and taking the best—update the
block where expected_response is normalized (the expected_response list
handling) to either select the first element or iterate alternatives to compute
and return the best metric score.
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/models/data.py (1)

61-64: Type change looks good; consider adding validation for list contents.

The type update to Optional[Union[str, list[str]]] correctly supports multiple expected responses. However, unlike expected_keywords which has a validator to ensure non-empty strings, expected_response lacks similar validation when provided as a list.

This could allow edge cases like empty strings within the list or an empty list. If this is intentional (relying on downstream handling), this is acceptable. Otherwise, consider adding a @field_validator similar to validate_expected_keywords.
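
A hedged sketch of what such a validator could look like, modeled on the existing expected_keywords pattern mentioned above; apart from the expected_response field itself, the class and validator names are illustrative, not the project's actual code.

```python
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class TurnDataSketch(BaseModel):
    """Illustrative model only; the real TurnData has many more fields."""

    expected_response: Optional[Union[str, list[str]]] = None

    @field_validator("expected_response")
    @classmethod
    def validate_expected_response(cls, value):
        """Reject empty lists and blank entries when a list is provided."""
        if isinstance(value, list):
            if not value:
                raise ValueError("expected_response list must not be empty")
            if any(not isinstance(item, str) or not item.strip() for item in value):
                raise ValueError("expected_response entries must be non-empty strings")
        return value
```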

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

103-103: Minor: Unused variable from split.

The metric_name from the split on line 103 is unused here since it's extracted again in evaluate() at line 214. Consider using _ for the unused part or removing the split entirely and only extracting framework.

♻️ Proposed fix
             # Parse framework and metric
-            framework, _ = request.metric_identifier.split(":", 1)
+            framework = request.metric_identifier.split(":", 1)[0]
tests/unit/pipeline/evaluation/test_evaluator.py (1)

507-617: Good test coverage for the new evaluate() method.

The tests effectively cover the key scenarios:

  • List expected_response with non-requiring metric (no iteration)
  • List expected_response with requiring metric (iteration with early PASS break)
  • String expected_response
  • None expected_response

Consider adding tests for these edge cases:

  1. geval:* metrics which also trigger iteration per the implementation
  2. Empty list expected_response=[] to verify handling
  3. Conversation-level metric path (turn_data=None)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d05ac65 and 68ea282.

📒 Files selected for processing (5)
  • README.md
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Include type hints for all public functions and methods in Python
Use Google-style docstrings for all public APIs in Python

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Use custom exceptions from core.system.exceptions module for error handling
Use structured logging with appropriate log levels in Python code

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
🧠 Learnings (8)
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Do not add new features to deprecated `lsc_agent_eval/` directory - add to `src/lightspeed_evaluation/` instead

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_evaluator.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationScope (468-479)
  • TurnData (36-316)
  • EvaluationRequest (482-535)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate (207-256)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
🔇 Additional comments (3)
src/lightspeed_evaluation/core/models/data.py (1)

444-447: LGTM!

The type change in EvaluationResult.expected_response mirrors the change in TurnData, maintaining consistency across the data models.

README.md (1)

231-231: LGTM!

The documentation correctly reflects the updated type for expected_response. Consider adding a brief note in the Examples section (around line 243) explaining how multiple expected responses are evaluated (e.g., "evaluation passes if any alternative matches").

tests/unit/pipeline/evaluation/test_evaluator.py (1)

471-505: Good helper method for reducing test boilerplate.

The _setup_evaluate_test helper cleanly handles both single return values and side_effect lists, reducing duplication across the new tests.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch from ce37c95 to 43a308f on January 14, 2026 at 16:27
Contributor

@VladimirKadlec VladimirKadlec left a comment


Looks good 💪, I added some comments.

```python
            if score is not None and self._determine_status(score, threshold) == "PASS":
                break

        return score, reason
```
Contributor


When all the expected responses fail, we return just the score+reason from the last one. If this is the desired behavior, I'd at least describe it in the documentation.

Another solution might be to return a list of scores and reasons, but that would cause issues when computing statistics -- one would need to select a single score (max, min, ...?).

Maybe we could return the maximum score and add a list of all the scores to the reason.
@asamal4 ^^

Collaborator


Agree. Ultimately we probably need separate entries in the final output, but for now taking the max makes sense for the score field. For the reason field we can combine all the reasons with their individual scores.

Additionally, we need to add the token usage for LLM-as-Judge metrics.

This is one of the old stories from when the framework was a lot simpler. With the addition of GEval and other metrics plus token counts, this has become a little bit tricky now.
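
A tiny illustration (with made-up placeholder scores) of the aggregation proposed here: keep the maximum score and fold every alternative's score and reason into a single reason string that fits one CSV row; the exact separator and format are open choices.

```python
# Placeholder (score, reason) pairs from three failed alternatives.
scores_and_reasons = [
    (0.42, "missed the key fact"),
    (0.71, "partially correct"),
    (0.65, "wrong emphasis"),
]

# The max score becomes the reported score; all per-alternative results go into the reason.
final_score = max(score for score, _ in scores_and_reasons)
final_reason = " | ".join(f"score={score}: {reason}" for score, reason in scores_and_reasons)
```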

Contributor Author

@xmican10 xmican10 Jan 15, 2026


I will modify the logic to return the max score, with the reason field holding all the reasons and scores.

For the judge token counts: since I didn't touch that code, I would assume that the token_tracker in evaluate_metric() sums it all up. I will definitely verify that.

Contributor Author

@xmican10 xmican10 Jan 15, 2026


@VladimirKadlec summary of our backlog refinement meeting discussion:

  • expected_response list is not supported for GEval; document it
  • make the evaluate() logic more modular so it can be reused for Panel of Judges
  • if all responses in expected_response fail, return the max score and the merged scores and reasons as one line in the output CSV file
  • verify the judge LLM token counts and, if possible, sum them up; if not, investigate possible solutions
  • add a test evaluation_data2.yaml with expected_response being a list (see the sketch below)
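
A hedged Python sketch of what such a data entry could look like when expressed through the TurnData model (the import path is taken from the files listed in this review, and the field values plus the minimal field set are illustrative; the actual YAML sample is not reproduced in this thread):

```python
from lightspeed_evaluation.core.models.data import TurnData

# Illustrative turn with alternative expected responses; the evaluation is meant
# to pass if any single alternative satisfies the metric's threshold.
turn = TurnData(
    turn_id="1",
    query="Who are you?",
    response="I am an AI assistant. How can I help you today?",
    expected_response=[
        "I am an AI assistant; how can I help you today?",
        "I'm an AI-powered assistant here to help with your questions.",
    ],
    contexts=["Introductory conversation with the assistant."],
)
```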

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@config/evaluation_data_multiple_expected_responses.yaml`:
- Around line 15-16: The expected_response array (symbol: expected_response)
contains a typo ("assistent") and a contradictory phrase "I am human assistent"
that conflicts with the other entry and the sample context; update the entries
to consistent, correctly spelled alternatives (e.g., change "I am human
assistent, how can i help you today?" to "I am an AI assistant; how can I help
you today?" or remove the human-asserting entry) so both options correctly
reflect an AI assistant and fix capitalization/punctuation as needed.

In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 537-603: The parametrized decorator and long test function
signatures in test_evaluate_with_expected_response_list (and its param list) are
not Black-formatted; run Black (or reflow the lines) to wrap the
`@pytest.mark.parametrize`(...) arguments and the function parameter list to
conform to Black's line-length rules, ensuring the decorator string list
["ragas:context_recall", "custom:answer_correctness", "nlp:rouge"] and the
multi-line def header for test_evaluate_with_expected_response_list are split
across lines as Black would produce; then reformat the nearby
test_evaluate_with_expected_response_list_fail function signature similarly and
commit the formatted file.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)

470-533: Mock ScriptEvalMetrics to keep the helper isolated.

The helper docstring states it returns a script handler, but ScriptEvalMetrics isn’t patched and isn’t included in mock_handlers. This can allow real handler initialization and breaks the helper contract.

♻️ Proposed update
         mock_custom = create_mock_handler(mocker, mock_return)
         mocker.patch(
             "lightspeed_evaluation.pipeline.evaluation.evaluator.CustomMetrics",
             return_value=mock_custom,
         )

+        mock_script = create_mock_handler(mocker, mock_return)
+        mocker.patch(
+            "lightspeed_evaluation.pipeline.evaluation.evaluator.ScriptEvalMetrics",
+            return_value=mock_script,
+        )
+
         mock_nlp = create_mock_handler(mocker, mock_return)
         mocker.patch(
             "lightspeed_evaluation.pipeline.evaluation.evaluator.NLPMetrics",
             return_value=mock_nlp,
         )
@@
         mock_handlers = {
             'ragas': mock_ragas,
             'geval': mock_deepeval,
             'custom': mock_custom,
+            'script': mock_script,
             'nlp': mock_nlp,
         }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 43a308f and bc1bdb0.

📒 Files selected for processing (5)
  • README.md
  • config/evaluation_data_multiple_expected_responses.yaml
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
✅ Files skipped from review due to trivial changes (1)
  • src/lightspeed_evaluation/core/metrics/nlp.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Do not add new features to deprecated `lsc_agent_eval/` directory - add to `src/lightspeed_evaluation/` instead

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
🪛 GitHub Actions: Black
tests/unit/pipeline/evaluation/test_evaluator.py

[error] 1-1: Black formatting check failed. 2 files would be reformatted by 'black --check'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
🔇 Additional comments (3)
tests/unit/pipeline/evaluation/test_evaluator.py (3)

6-12: LGTM: EvaluationScope import for new tests.


604-629: LGTM: expected_response string case covered.


630-673: LGTM: expected_response not needed paths covered.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch from bc1bdb0 to 4bd8152 on January 16, 2026 at 14:39
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/core/metrics/nlp.py`:
- Around line 63-68: Replace the runtime-unsafe assert in the expected_response
validation with a guaranteed exception: in the function where expected_response
is set (the block that currently does "expected_response =
turn_data.expected_response if turn_data else \"\""), change the assert not
isinstance(expected_response, list) ... to raise MetricError(...) when
expected_response is a list; keep the existing error message text ("NLP metrics
require expected_response to be a string, not a list") and reference MetricError
(already imported) so validation always runs in optimized mode.

In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 334-337: The except block in evaluate_metric is wrapping every
exception with EvaluationError(f"Evaluation error: {e}"), causing duplicate
"Evaluation error" prefixes; change the handler in evaluator.py so you always
call token_tracker.stop() but only wrap non-EvaluationError exceptions (i.e., if
isinstance(e, EvaluationError): re-raise e) and for other exceptions raise
EvaluationError(str(e)) from e (or otherwise use the raw message) to avoid
double-prefixing while preserving the original traceback.
- Around line 254-265: The token_tracker is never reset between iterations so
token_tracker.get_counts() returns cumulative totals that get double-counted;
before each per-iteration call to token_tracker.start() in the evaluation loop,
call token_tracker.reset() so start()/stop()/get_counts() report only the
current iteration's tokens (update the loop around token_tracker.start(),
token_tracker.stop(), and token_tracker.get_counts() in evaluator.py to invoke
token_tracker.reset() immediately prior to token_tracker.start()).
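
A small sketch of the per-iteration accounting this prompt asks for, assuming the start()/stop()/get_counts() interface named above plus a reset() method (suggested by the review and possibly not yet present on TokenTracker); the helper passed in is hypothetical.

```python
def evaluate_with_token_accounting(token_tracker, alternatives, evaluate_alternative):
    """Sketch: reset the tracker before each alternative so counts are not cumulative."""
    total_in = total_out = 0

    for alt in alternatives:
        token_tracker.reset()   # clear counts left over from the previous alternative
        token_tracker.start()
        evaluate_alternative(alt)   # hypothetical call that triggers judge LLM usage
        token_tracker.stop()

        judge_in, judge_out = token_tracker.get_counts()
        total_in += judge_in    # only this iteration's tokens are added
        total_out += judge_out

    return total_in, total_out
```
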
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

228-233: Replace asserts with explicit validation + EvaluationError.

Lines 229-232 and 292-295 rely on assert, which is stripped under -O and bypasses validation. Use explicit checks and raise EvaluationError instead.

🔧 Proposed fix
-                assert (
-                    evaluation_scope.turn_data is not None
-                    and evaluation_scope.turn_data.expected_response is not None
-                )
+                if (
+                    evaluation_scope.turn_data is None
+                    or evaluation_scope.turn_data.expected_response is None
+                ):
+                    raise EvaluationError(
+                        "expected_response list requires turn_data.expected_response"
+                    )
-                assert (
-                    evaluation_scope.turn_data is not None
-                    and evaluation_scope.turn_data.expected_response is not None
-                )
+                if (
+                    evaluation_scope.turn_data is None
+                    or evaluation_scope.turn_data.expected_response is None
+                ):
+                    raise EvaluationError(
+                        "expected_response list requires turn_data.expected_response"
+                    )
As per coding guidelines, use custom exceptions for validation.

Also applies to: 292-295

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc1bdb0 and 4bd8152.

📒 Files selected for processing (5)
  • README.md
  • config/evaluation_data_multiple_expected_responses.yaml
  • src/lightspeed_evaluation/core/metrics/nlp.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/evaluation_data_multiple_expected_responses.yaml
  • README.md
🧰 Additional context used
📓 Path-based instructions (4)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead

Files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Include type hints for all public functions and methods in Python
Use Google-style docstrings for all public APIs in Python

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
src/lightspeed_evaluation/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/lightspeed_evaluation/**/*.py: Use custom exceptions from core.system.exceptions module for error handling
Use structured logging with appropriate log levels in Python code

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
🧠 Learnings (13)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-13T13:31:28.016Z
Learnt from: bsatapat-jpg
Repo: lightspeed-core/lightspeed-evaluation PR: 137
File: src/lightspeed_evaluation/core/metrics/custom/tool_eval.py:69-93
Timestamp: 2026-01-13T13:31:28.016Z
Learning: In the lightspeed-evaluation framework's tool_eval.py, explicit validation for invalid `match` parameter values is not required. The code is designed to fall through to full match behavior for non-"partial" values.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T10:13:42.924Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:112-112
Timestamp: 2025-07-16T10:13:42.924Z
Learning: In the lsc_agent_eval judge LLM evaluation system, the _extract_numeric_result method should only accept numeric values 1 or 0 from judge responses. Boolean keywords like "true", "false", "yes", "no" are not acceptable values for judge LLM responses.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions module for error handling

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-15T00:41:27.575Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 140
File: src/lightspeed_evaluation/runner/evaluation.py:41-41
Timestamp: 2026-01-15T00:41:27.575Z
Learning: In Python code reviews for the lightspeed-evaluation project, allow keeping the pylint disable: too-many-locals in functions that use lazy imports to group many local variables in one scope for readability. If a function uses lazy imports to justify this pattern, document why and avoid over-refactoring into multiple helpers just for lint reasons. If you can refactor without increasing complexity or number of globals, prefer refactoring; otherwise retain the pylint directive with a clear rationale.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/nlp.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_evaluator.py (2)
src/lightspeed_evaluation/core/models/data.py (5)
  • EvaluationScope (497-508)
  • TurnData (36-345)
  • EvaluationData (348-416)
  • EvaluationRequest (511-564)
  • for_turn (539-553)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • evaluate (194-339)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py

[notice] 229-230:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.


[notice] 292-293:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

src/lightspeed_evaluation/core/metrics/nlp.py

[notice] 66-67:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
🔇 Additional comments (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)

24-26: Imports align with the new evaluation flow.

Looks good—these imports are used by the new evaluate() path.


100-145: Centralized evaluate() call reads clean.

The delegation and error handling integration look consistent and easy to follow.


194-199: Refactor to remove pylint disable on evaluate().

Line 194 disables too-many-locals and too-many-statements, which the project guidelines prohibit. Please split the logic into focused helpers (e.g., multi-expected-response vs single-path) and remove the directive. As per coding guidelines, avoid disabling lint warnings.

⛔ Skipped due to learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 140
File: src/lightspeed_evaluation/runner/evaluation.py:41-41
Timestamp: 2026-01-15T00:41:27.575Z
Learning: In the lightspeed-core/lightspeed-evaluation repository, user asamal4 prefers to keep `# pylint: disable=too-many-locals` when a function uses lazy imports to keep maximum function usage in one place for code cleanliness, rather than refactoring into multiple helper functions.
tests/unit/pipeline/evaluation/test_evaluator.py (1)

6-10: Evaluate() test coverage looks solid.

The helper reduces duplication, and the new parametrized cases exercise list/string/None expected_response paths well.

Also applies to: 470-681

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch 2 times, most recently from ee39528 to 27893d0 on January 19, 2026 at 09:43
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 470-535: The helper _setup_evaluate_test is missing a mock patch
for ScriptEvalMetrics, so when MetricsEvaluator(...) is instantiated it will
call the real ScriptEvalMetrics; add a mock similar to the others: create a mock
via create_mock_handler(mocker, mock_return) and patch
"lightspeed_evaluation.pipeline.evaluation.evaluator.ScriptEvalMetrics" to
return that mock (and include it in the returned mock_handlers dict under
"script") so ScriptEvalMetrics is fully mocked during tests.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)

194-210: Add Google-style docstring for public method.

The evaluate() method is a public API but lacks a proper Google-style docstring as required by coding guidelines. The current one-liner doesn't document Args, Returns, or Raises sections.

📝 Suggested docstring
     def evaluate(  # pylint: disable=too-many-locals,too-many-statements
         self,
         request: EvaluationRequest,
         evaluation_scope: EvaluationScope,
         threshold: Optional[float],
     ) -> tuple[Optional[float], str, str, int, int]:
-        """Evaluate metric logic, handling expected_response lists."""
+        """Evaluate metric logic, handling expected_response lists.
+
+        Supports multiple expected_response values for metrics that require it.
+        Iterates through alternatives until PASS or returns max score on failure.
+
+        Args:
+            request: Evaluation request containing conversation data and metric.
+            evaluation_scope: Scope with turn data and evaluation context.
+            threshold: Score threshold for PASS/FAIL determination.
+
+        Returns:
+            Tuple of (score, reason, status, judge_input_tokens, judge_output_tokens).
+
+        Raises:
+            EvaluationError: When metric evaluation fails unexpectedly.
+        """

284-288: Verify None score handling in max comparison.

Line 286 uses a ternary to handle None scores, but this occurs after checking if score is not None on line 279 for status determination. If score is None, the loop continues without updating score_max or reason_acc, which may be the intended behavior. However, if score is None, adding f"{score}; {reason}\n" to reason_acc will produce "None; {reason}\n" which may not be desirable.

Consider skipping reason accumulation when score is None:

💡 Suggested improvement
                     # Expected response did not PASS; keep track of highest score
-                    score_max = max(
-                        score_max, score if score is not None else score_max
-                    )
-                    reason_acc += f"{score}; {reason}\n"
+                    if score is not None:
+                        score_max = max(score_max, score)
+                        reason_acc += f"{score}; {reason}\n"
+                    else:
+                        reason_acc += f"None; {reason}\n"

346-349: Consider preserving original exception message without double-wrapping.

Line 349 passes the exception e directly to EvaluationError(e), which may result in the exception object being converted to string. Then at line 191, the outer handler wraps it again with f"Evaluation error: {e}". If the original exception already contains "Evaluation error" or similar text, this could cause redundant prefixing.

💡 Suggested improvement
         except Exception as e:  # pylint: disable=broad-exception-caught
             # Stop token tracking on error
             token_tracker.stop()
-            raise EvaluationError(e) from e
+            raise EvaluationError(str(e)) from e

100-102: Minor: Framework is parsed twice.

The framework is extracted from metric_identifier at line 101 for the script metric check, and again at line 202 inside evaluate(). Consider passing the already-parsed framework to evaluate() to avoid duplicate string splitting.

Also applies to: 201-202

@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch from 27893d0 to d45c8e1 on January 19, 2026 at 09:54
@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch from d45c8e1 to 3a57edf on January 19, 2026 at 10:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 255-309: Detect when evaluation_scope.turn_data.expected_response
is empty before the for-loop and handle it explicitly: if empty, set score=None,
reason="no expected responses" (or similar descriptive string), status="FAIL",
and leave judge_input_tokens/judge_output_tokens as 0, and emit a debug log via
logger.debug indicating the empty expected_response; update the logic around
score_max/status/reason_acc (in the same method that contains score_max,
reason_acc, token_tracker usage) so the early-empty branch returns the same
tuple shape as the normal path with the descriptive reason instead of an empty
string.
- Around line 362-365: Replace the broad "except Exception as e:  # pylint:
disable=broad-exception-caught" with specific exception handlers for the errors
the try block can raise (for example ValueError, TypeError, RuntimeError,
OSError, or any domain-specific exceptions used by the surrounding code), call
token_tracker.stop() in each handler (or move token_tracker.stop() into a
finally: block) and re-raise as EvaluationError(e) from e; remove the pylint
disable comment. Locate the try/except around token_tracker.stop() and
EvaluationError in evaluator.py (the block referencing token_tracker.stop() and
raise EvaluationError(e)) and implement specific except clauses or a finally to
ensure deterministic cleanup and no broad exception catching.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)

641-684: Consider adding edge case test for empty list.

Good parametrized test coverage for metrics that don't require expected_response. Consider adding a test case for the empty list edge case (expected_response=[]) when the metric does require expected_response to verify expected behavior (ERROR result or validation failure).

Example additional test
def test_evaluate_with_empty_expected_response_list(
    self, config_loader, mock_metric_manager, mock_script_manager, mocker
):
    """Test evaluate() with empty expected_response list returns error."""
    evaluator, mock_handlers = self._setup_evaluate_test(
        config_loader,
        mock_metric_manager,
        mock_script_manager,
        mocker,
        (0.85, "Score"),
    )

    turn_data = TurnData(
        turn_id="1",
        query="Q",
        response="R",
        expected_response=[],  # Empty list
        contexts=["C"],
    )
    conv_data = EvaluationData(conversation_group_id="test_conv", turns=[turn_data])
    request = EvaluationRequest.for_turn(
        conv_data, "ragas:context_recall", 0, turn_data
    )
    scope = EvaluationScope(turn_idx=0, turn_data=turn_data, is_conversation=False)

    score, reason, status, _, _ = evaluator.evaluate(request, scope, 0.7)

    # Verify behavior for empty list - adjust assertion based on intended behavior
    assert score is None or status == "ERROR"

Collaborator

@asamal4 asamal4 left a comment


Looking at the code, it seems like we are adding too much logic in pipeline/evaluator.py. It is just supposed to delegate evaluation to the individual framework/flow.

I think more refactoring is required, but that was not planned in the original task. So please try to modularize as much as possible (at least the new code). If required, we can create follow-up tasks.

Comment on lines +65 to +69
```python
        # For NLP metrics, expected_response must be a string (not a list)
        if isinstance(expected_response, list):
            raise MetricError(
                "NLP metrics require expected_response to be a string, not a list"
            )
```
Collaborator


Why this constraint for NLP-based metrics? We are avoiding multiple expected responses only for G-Eval.

Contributor Author


That is for the type checker, because turn_data.expected_response can be None | str | list[str]. At runtime only a string expected_response will be passed to the NLP metrics, but the checker complains anyway.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 420-422: Remove the unnecessary time.sleep(0.1) in evaluator.py:
after calling self.handlers[framework].evaluate(metric_name, request.conv_data,
evaluation_scope) the TokenTracker callback is already invoked synchronously by
LiteLLM, so immediately call token_tracker.get_counts() (keep the existing
input_tokens, output_tokens assignment) and delete the time.sleep line to avoid
the added latency; ensure no other code relies on that delay and run tests
around Evaluator.evaluate and TokenTracker.get_counts to confirm behavior.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

243-244: Coding guideline: Avoid # pylint: disable comments.

The coding guidelines state: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead."

While this is a complex method, consider catching specific exception types that the evaluation handlers can raise (e.g., ValueError, TypeError, RuntimeError, OSError, KeyError, AttributeError) instead of a broad catch with a pylint disable.

♻️ Suggested approach
-        except Exception as e:  # pylint: disable=broad-exception-caught
-            raise EvaluationError(e) from e
+        except (ValueError, TypeError, RuntimeError, OSError, KeyError, AttributeError) as e:
+            raise EvaluationError(str(e)) from e

Based on coding guidelines: "Do not disable lint warnings... - fix the underlying issue instead."

@xmican10 xmican10 force-pushed the LEADS-6-set-expected-responses branch from 4451cba to f88ff40 on January 21, 2026 at 10:01
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 133-142: The error path drops judge LLM token/score fields because
the early return calls _create_error_result without preserving MetricResult
fields from metric_result returned by _evaluate_wrapper; update the error
handling to pass the metric_result's judge_llm_* counts plus any score/threshold
into _create_error_result (or change _create_error_result signature to accept a
MetricResult and copy its fields) so that judge_llm_inference_tokens,
judge_llm_prompt_tokens, judge_llm_total_tokens and score/threshold from
metric_result are preserved in the produced EvaluationResult; modify callers of
_create_error_result accordingly and adjust _create_error_result to populate
those fields when constructing the error EvaluationResult.
- Around line 275-283: The RuntimeError raised when evaluation_scope.turn_data
or evaluation_scope.turn_data.expected_response is missing should be replaced
with the project's custom EvaluationError from core.system.exceptions; update
the raise to use EvaluationError(f"Metric '{request.metric_identifier}' requires
'expected_response' field. Could not proceed with evaluation.") and import
EvaluationError at the top, and make the same replacement for the other similar
branches that check expected_response (the other checks around the same logic in
this file) so all missing-expected-response failures raise EvaluationError
instead of RuntimeError.
♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

186-246: Remove pylint disables by shrinking _evaluate_wrapper and narrowing exceptions.

The new wrapper still relies on # pylint: disable=too-many-* and broad-exception-caught. Please split the decision logic into smaller helpers and catch specific exceptions (re-raising EvaluationError) so the disables can be removed. As per coding guidelines, please avoid pylint disable comments.

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/data.py (1)

419-447: Document MetricResult with a Google‑style docstring (public model).

MetricResult is a new public API surface; please add a Google-style docstring (e.g., an Attributes section for each field) to align with project standards. As per coding guidelines, please use Google-style docstrings for public APIs.
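
A possible shape for the requested docstring, as a sketch only; the field list is inferred from the comments in this review and may not match data.py exactly:

```python
from typing import Optional

from pydantic import BaseModel


class MetricResult(BaseModel):
    """Outcome of evaluating a single metric.

    Attributes:
        score: Numeric score produced by the metric, if any.
        threshold: Threshold the score was compared against, if any.
        result: Final verdict, e.g. "PASS", "FAIL", or "ERROR".
        reason: Human-readable explanation for the verdict.
        judge_llm_prompt_tokens: Prompt tokens consumed by the judge LLM.
        judge_llm_inference_tokens: Completion tokens produced by the judge LLM.
        judge_llm_total_tokens: Total judge LLM tokens for this metric.
    """

    score: Optional[float] = None
    threshold: Optional[float] = None
    result: str = "ERROR"
    reason: Optional[str] = None
    judge_llm_prompt_tokens: int = 0
    judge_llm_inference_tokens: int = 0
    judge_llm_total_tokens: int = 0
```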

Comment on lines +133 to 142
# Evaluate metric
metric_result = self._evaluate_wrapper(request, evaluation_scope, threshold)

execution_time = time.time() - start_time

if metric_result.result == "ERROR":
return self._create_error_result(
request, metric_result.reason, execution_time
)

Contributor


⚠️ Potential issue | 🟠 Major

Preserve judge token counts on ERROR results.

When metric_result.result == "ERROR", returning _create_error_result drops judge_llm_* counts (and score/threshold) captured in MetricResult, so usage stats become 0 even if LLM calls happened. Consider letting ERROR results flow through the standard EvaluationResult construction (or passing token counts into _create_error_result).

🐛 Suggested fix (preserve MetricResult fields)
-            if metric_result.result == "ERROR":
-                return self._create_error_result(
-                    request, metric_result.reason, execution_time
-                )
+            # Keep ERROR results in the standard EvaluationResult path
+            # so judge_llm_* counts and context fields are preserved.
🤖 Prompt for AI Agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 133
- 142, The error path drops judge LLM token/score fields because the early
return calls _create_error_result without preserving MetricResult fields from
metric_result returned by _evaluate_wrapper; update the error handling to pass
the metric_result's judge_llm_* counts plus any score/threshold into
_create_error_result (or change _create_error_result signature to accept a
MetricResult and copy its fields) so that judge_llm_inference_tokens,
judge_llm_prompt_tokens, judge_llm_total_tokens and score/threshold from
metric_result are preserved in the produced EvaluationResult; modify callers of
_create_error_result accordingly and adjust _create_error_result to populate
those fields when constructing the error EvaluationResult.
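
A hedged sketch of the second option from the prompt (passing the MetricResult into _create_error_result). The EvaluationResult field names follow this review, and the real constructor may require additional or different arguments:

```python
def _create_error_result(self, request, metric_result, execution_time):
    """Build an ERROR result that keeps judge LLM usage from the metric run."""
    return EvaluationResult(
        metric_identifier=request.metric_identifier,
        result="ERROR",
        reason=metric_result.reason,
        score=metric_result.score,
        threshold=metric_result.threshold,
        judge_llm_prompt_tokens=metric_result.judge_llm_prompt_tokens,
        judge_llm_inference_tokens=metric_result.judge_llm_inference_tokens,
        judge_llm_total_tokens=metric_result.judge_llm_total_tokens,
        execution_time=execution_time,
    )
```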

Comment on lines +275 to +283
# This check satisfies the linter but is logically redundant
if (
evaluation_scope.turn_data is None
or evaluation_scope.turn_data.expected_response is None
):
raise RuntimeError(
f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
f"Could not proceed with evaluation."
)
Contributor

@coderabbitai coderabbitai bot Jan 21, 2026


🛠️ Refactor suggestion | 🟠 Major

Raise EvaluationError instead of RuntimeError for missing expected_response.

These branches are already handled as evaluation failures; using the project’s custom EvaluationError keeps error handling consistent and avoids leaking generic RuntimeError upstream.

🔧 Suggested fix
-            raise RuntimeError(
+            raise EvaluationError(
                 f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
                 f"Could not proceed with evaluation."
             )
@@
-            raise RuntimeError(
+            raise EvaluationError(
                 f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
                 f"Could not proceed with evaluation."
             )
@@
-            raise RuntimeError(
+            raise EvaluationError(
                 f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
                 f"Could not proceed with evaluation."
             )

As per coding guidelines, use custom exceptions from core.system.exceptions for error handling.

Also applies to: 332-336, 369-375

🤖 Prompt for AI Agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 275
- 283, The RuntimeError raised when evaluation_scope.turn_data or
evaluation_scope.turn_data.expected_response is missing should be replaced with
the project's custom EvaluationError from core.system.exceptions; update the
raise to use EvaluationError(f"Metric '{request.metric_identifier}' requires
'expected_response' field. Could not proceed with evaluation.") and import
EvaluationError at the top, and make the same replacement for the other similar
branches that check expected_response (the other checks around the same logic in
this file) so all missing-expected-response failures raise EvaluationError
instead of RuntimeError.

Contributor Author


Discussed with @VladimirKadlec; we agreed that using RuntimeError is more appropriate in this case...

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
