Leads 6 set expected responses #138
base: main
Conversation
Walkthrough
Turn-level models now accept expected_response as either a single string or a list of strings.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Caller as Caller (pipeline / test)
participant Eval as MetricsEvaluator
participant Scope as EvaluationScope
participant Handler as Metric Handler
Caller->>Eval: evaluate_metric(request, scope, ...)
Eval->>Eval: _evaluate_wrapper(request, scope)
Eval->>Eval: extract TurnData.expected_response
alt expected_response is list and metric supports multiple
loop alt in expected_responses
Eval->>Scope: build alternate scope with alt
Eval->>Handler: _evaluate(request, alternate_scope)
Handler-->>Eval: MetricResult(score, reason, judge_in, judge_out)
alt result == PASS
Eval->>Eval: stop loop (short-circuit)
end
end
else expected_response is string/None or metric doesn't support multiple
Eval->>Handler: _evaluate(request, scope)
Handler-->>Eval: MetricResult(score, reason, judge_in, judge_out)
end
Eval->>Eval: aggregate tokens, determine final status, compute execution_time
Eval-->>Caller: return EvaluationResult (includes MetricResult fields + aggregated tokens)
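For readers who prefer code to diagrams, here is a minimal Python sketch of the loop the diagram describes (iterate alternatives, short-circuit on PASS, aggregate judge tokens). The `AltResult` shape, the `evaluate_one` callable, and the threshold-based status check are illustrative assumptions, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AltResult:
    """Illustrative result shape; the real MetricResult carries more fields."""
    score: Optional[float]
    reason: str
    judge_input_tokens: int = 0
    judge_output_tokens: int = 0


def evaluate_alternatives(
    evaluate_one: Callable[[str], AltResult],  # hypothetical per-alternative scorer
    alternatives: list[str],
    threshold: float,
) -> tuple[Optional[AltResult], str, int, int]:
    """Try each expected_response alternative; stop early on the first PASS."""
    best: Optional[AltResult] = None
    in_tok = out_tok = 0
    for alt in alternatives:
        result = evaluate_one(alt)
        in_tok += result.judge_input_tokens    # aggregate judge tokens across attempts
        out_tok += result.judge_output_tokens
        if best is None or (result.score or 0.0) > (best.score or 0.0):
            best = result                      # keep the highest-scoring attempt
        if result.score is not None and result.score >= threshold:
            break                              # short-circuit on PASS
    passed = best is not None and best.score is not None and best.score >= threshold
    return best, "PASS" if passed else "FAIL", in_tok, out_tok
```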
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/core/metrics/nlp.py`:
- Around line 64-68: The code incorrectly concatenates list alternatives with
"".join(), corrupting text boundaries; change the handling of expected_response
so when it is a list you pick a single alternative (e.g., use
expected_response[0] if list non-empty) instead of joining, or implement the
evaluator.py pattern of scoring each alternative and taking the best—update the
block where expected_response is normalized (the expected_response list
handling) to either select the first element or iterate alternatives to compute
and return the best metric score.
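A hedged sketch of the two options the prompt describes for `nlp.py`; `score_fn` is a stand-in for whatever NLP scorer (ROUGE, BLEU, etc.) is actually used, not the module's real signature.

```python
from typing import Callable, Union


def pick_reference_score(
    expected_response: Union[str, list[str]],
    response: str,
    score_fn: Callable[[str, str], float],  # hypothetical scorer: (response, reference) -> score
) -> float:
    """Score against one reference, or the best of several; never a joined blob."""
    if isinstance(expected_response, list):
        if not expected_response:
            raise ValueError("expected_response list must not be empty")
        # Option A: use only the first alternative
        #   return score_fn(response, expected_response[0])
        # Option B (shown): score each alternative and keep the best
        return max(score_fn(response, ref) for ref in expected_response)
    return score_fn(response, expected_response)
```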
🧹 Nitpick comments (3)
src/lightspeed_evaluation/core/models/data.py (1)
61-64: Type change looks good; consider adding validation for list contents. The type update to `Optional[Union[str, list[str]]]` correctly supports multiple expected responses. However, unlike `expected_keywords`, which has a validator to ensure non-empty strings, `expected_response` lacks similar validation when provided as a list. This could allow edge cases like empty strings within the list or an empty list. If this is intentional (relying on downstream handling), this is acceptable. Otherwise, consider adding a `@field_validator` similar to `validate_expected_keywords`.
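A minimal sketch of such a validator, modeled loosely on the `validate_expected_keywords` pattern the comment mentions; the class name and error messages here are assumptions.

```python
from typing import Optional, Union

from pydantic import BaseModel, field_validator


class TurnDataSketch(BaseModel):
    """Illustrative subset of TurnData, only to show the validator shape."""

    expected_response: Optional[Union[str, list[str]]] = None

    @field_validator("expected_response")
    @classmethod
    def validate_expected_response(
        cls, v: Optional[Union[str, list[str]]]
    ) -> Optional[Union[str, list[str]]]:
        """Reject empty lists and blank alternatives when a list is provided."""
        if isinstance(v, list):
            if not v:
                raise ValueError("expected_response list must not be empty")
            if any(not isinstance(item, str) or not item.strip() for item in v):
                raise ValueError("expected_response alternatives must be non-empty strings")
        return v
```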
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
103-103: Minor: Unused variable from split. The `metric_name` from the split on line 103 is unused here since it's extracted again in `evaluate()` at line 214. Consider using `_` for the unused part, or removing the split entirely and only extracting `framework`.
♻️ Proposed fix
  # Parse framework and metric
- framework, _ = request.metric_identifier.split(":", 1)
+ framework = request.metric_identifier.split(":", 1)[0]
tests/unit/pipeline/evaluation/test_evaluator.py (1)
507-617: Good test coverage for the new `evaluate()` method. The tests effectively cover the key scenarios:
- List expected_response with non-requiring metric (no iteration)
- List expected_response with requiring metric (iteration with early PASS break)
- String expected_response
- None expected_response
Consider adding tests for these edge cases:
- `geval:*` metrics, which also trigger iteration per the implementation
- Empty list `expected_response=[]` to verify handling
- Conversation-level metric path (`turn_data=None`)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md
- src/lightspeed_evaluation/core/metrics/nlp.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead
Files:
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/nlp.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/pipeline/evaluation/test_evaluator.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Include type hints for all public functions and methods in Python
Use Google-style docstrings for all public APIs in Python
Files:
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/nlp.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Use custom exceptions from core.system.exceptions module for error handling
Use structured logging with appropriate log levels in Python code
Files:
src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/nlp.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_*.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture
Files:
tests/unit/pipeline/evaluation/test_evaluator.py
🧠 Learnings (8)
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Do not add new features to deprecated `lsc_agent_eval/` directory - add to `src/lightspeed_evaluation/` instead
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_evaluator.py (2)
src/lightspeed_evaluation/core/models/data.py (3)
EvaluationScope (468-479), TurnData (36-316), EvaluationRequest (482-535)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
evaluate(207-256)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
🔇 Additional comments (3)
src/lightspeed_evaluation/core/models/data.py (1)
444-447: LGTM! The type change in `EvaluationResult.expected_response` mirrors the change in `TurnData`, maintaining consistency across the data models.
README.md (1)
231-231: LGTM! The documentation correctly reflects the updated type for `expected_response`. Consider adding a brief note in the Examples section (around line 243) explaining how multiple expected responses are evaluated (e.g., "evaluation passes if any alternative matches").
tests/unit/pipeline/evaluation/test_evaluator.py (1)
471-505: Good helper method for reducing test boilerplate. The `_setup_evaluate_test` helper cleanly handles both single return values and side_effect lists, reducing duplication across the new tests.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Force-pushed: ce37c95 to 43a308f
VladimirKadlec left a comment
Looks good 💪, I added some comments.
if score is not None and self._determine_status(score, threshold) == "PASS":
    break
...
return score, reason
When all the expected responses fail, we return just the score and reason from the last one. If this is desired, I'd at least describe it in the documentation.
Another solution might be to return a list of scores and reasons, but this would cause issues when computing statistics -- one would need to select a single score (max, min, ...?).
Maybe we could return the maximum score and add a list of all the scores to the reason.
@asamal4 ^^
Agree. Ultimately we probably need separate entries in the final output, but for now taking the max makes sense for the score field. For the reason field we can combine all reasons with their individual scores.
Additionally, we need to add the token usage for LLM-as-Judge metrics.
This is one of the older parts of the story, from when the framework was a lot simpler. With the addition of GEval and other metrics plus token counts, this has become a little bit tricky now.
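A rough sketch of the aggregation being discussed here (maximum score, all per-alternative scores and reasons merged into one reason field); the separator and exact formatting are assumptions, not the agreed implementation.

```python
from typing import Optional


def merge_failed_alternatives(
    results: list[tuple[Optional[float], str]],
) -> tuple[Optional[float], str]:
    """Combine per-alternative (score, reason) pairs when none of them passed."""
    scores = [s for s, _ in results if s is not None]
    max_score = max(scores) if scores else None
    # One CSV-friendly string: "<score>; <reason>" per alternative
    merged_reason = " | ".join(f"{s}; {r}" for s, r in results)
    return max_score, merged_reason


# Example:
# merge_failed_alternatives([(0.62, "missed step 2"), (0.48, "wrong tool")])
# -> (0.62, '0.62; missed step 2 | 0.48; wrong tool')
```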
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will modify the logic to return the max score and a reason field with all reasons and scores.
For the judge token counts, since I didn't touch that, I would assume that the token_tracker in evaluate_metric() will sum it all up. I will definitely verify that.
@VladimirKadlec summary of our Backlog refinement meeting discussion:
- `expected_response` list is not supported for GEval, document it
- make the evaluate() logic more modular to make it usable for Panel of Judges
- if all responses in `expected_response` fail, return the max `score` and a merged `score; reason` ... one line in the output csv file
- verify the judge LLM token counts, if possible sum it up; if not, inspect possible solutions
- add testing `evaluation_data2.yaml` with expected_response being a list
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@config/evaluation_data_multiple_expected_responses.yaml`:
- Around line 15-16: The expected_response array (symbol: expected_response)
contains a typo ("assistent") and a contradictory phrase "I am human assistent"
that conflicts with the other entry and the sample context; update the entries
to consistent, correctly spelled alternatives (e.g., change "I am human
assistent, how can i help you today?" to "I am an AI assistant; how can I help
you today?" or remove the human-asserting entry) so both options correctly
reflect an AI assistant and fix capitalization/punctuation as needed.
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 537-603: The parametrized decorator and long test function
signatures in test_evaluate_with_expected_response_list (and its param list) are
not Black-formatted; run Black (or reflow the lines) to wrap the
`@pytest.mark.parametrize`(...) arguments and the function parameter list to
conform to Black's line-length rules, ensuring the decorator string list
["ragas:context_recall", "custom:answer_correctness", "nlp:rouge"] and the
multi-line def header for test_evaluate_with_expected_response_list are split
across lines as Black would produce; then reformat the nearby
test_evaluate_with_expected_response_list_fail function signature similarly and
commit the formatted file.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
470-533: Mock ScriptEvalMetrics to keep the helper isolated. The helper docstring states it returns a `script` handler, but ScriptEvalMetrics isn't patched and isn't included in `mock_handlers`. This can allow real handler initialization and breaks the helper contract.
♻️ Proposed update
  mock_custom = create_mock_handler(mocker, mock_return)
  mocker.patch(
      "lightspeed_evaluation.pipeline.evaluation.evaluator.CustomMetrics",
      return_value=mock_custom,
  )
+ mock_script = create_mock_handler(mocker, mock_return)
+ mocker.patch(
+     "lightspeed_evaluation.pipeline.evaluation.evaluator.ScriptEvalMetrics",
+     return_value=mock_script,
+ )
+
  mock_nlp = create_mock_handler(mocker, mock_return)
  mocker.patch(
      "lightspeed_evaluation.pipeline.evaluation.evaluator.NLPMetrics",
      return_value=mock_nlp,
  )
@@
  mock_handlers = {
      'ragas': mock_ragas,
      'geval': mock_deepeval,
      'custom': mock_custom,
+     'script': mock_script,
      'nlp': mock_nlp,
  }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md
- config/evaluation_data_multiple_expected_responses.yaml
- src/lightspeed_evaluation/core/metrics/nlp.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/pipeline/evaluation/test_evaluator.py
✅ Files skipped from review due to trivial changes (1)
- src/lightspeed_evaluation/core/metrics/nlp.py
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_*.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture
Files:
tests/unit/pipeline/evaluation/test_evaluator.py
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead
Files:
tests/unit/pipeline/evaluation/test_evaluator.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Do not add new features to deprecated `lsc_agent_eval/` directory - add to `src/lightspeed_evaluation/` instead
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
🪛 GitHub Actions: Black
tests/unit/pipeline/evaluation/test_evaluator.py
[error] 1-1: Black formatting check failed. 2 files would be reformatted by 'black --check'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: tests (3.11)
🔇 Additional comments (3)
tests/unit/pipeline/evaluation/test_evaluator.py (3)
6-12: LGTM: EvaluationScope import for new tests.
604-629: LGTM: expected_response string case covered.
630-673: LGTM: expected_response not needed paths covered.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Force-pushed: bc1bdb0 to 4bd8152
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/core/metrics/nlp.py`:
- Around line 63-68: Replace the runtime-unsafe assert in the expected_response
validation with a guaranteed exception: in the function where expected_response
is set (the block that currently does "expected_response =
turn_data.expected_response if turn_data else \"\""), change the assert not
isinstance(expected_response, list) ... to raise MetricError(...) when
expected_response is a list; keep the existing error message text ("NLP metrics
require expected_response to be a string, not a list") and reference MetricError
(already imported) so validation always runs in optimized mode.
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 334-337: The except block in evaluate_metric is wrapping every
exception with EvaluationError(f"Evaluation error: {e}"), causing duplicate
"Evaluation error" prefixes; change the handler in evaluator.py so you always
call token_tracker.stop() but only wrap non-EvaluationError exceptions (i.e., if
isinstance(e, EvaluationError): re-raise e) and for other exceptions raise
EvaluationError(str(e)) from e (or otherwise use the raw message) to avoid
double-prefixing while preserving the original traceback.
- Around line 254-265: The token_tracker is never reset between iterations so
token_tracker.get_counts() returns cumulative totals that get double-counted;
before each per-iteration call to token_tracker.start() in the evaluation loop,
call token_tracker.reset() so start()/stop()/get_counts() report only the
current iteration's tokens (update the loop around token_tracker.start(),
token_tracker.stop(), and token_tracker.get_counts() in evaluator.py to invoke
token_tracker.reset() immediately prior to token_tracker.start()).
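A sketch combining the two fixes requested above: reset the tracker before each iteration and avoid double-wrapping errors. The `TokenTracker` interface (`reset`/`start`/`stop`/`get_counts`) follows the prompts' wording; everything else (stub class, function names) is assumed for illustration.

```python
class EvaluationError(Exception):
    """Stand-in for the project's custom evaluation exception."""


class StubTokenTracker:
    """Minimal stand-in with the interface the prompts assume."""

    def __init__(self) -> None:
        self._input = self._output = 0

    def reset(self) -> None:
        self._input = self._output = 0

    def start(self) -> None: ...

    def stop(self) -> None: ...

    def get_counts(self) -> tuple[int, int]:
        return self._input, self._output


def run_one_alternative(token_tracker, evaluate_fn, request, scope):
    """Evaluate a single alternative with clean per-iteration token counts."""
    token_tracker.reset()   # get_counts() should report only this iteration
    token_tracker.start()
    try:
        score, reason = evaluate_fn(request, scope)
    except EvaluationError:
        token_tracker.stop()
        raise               # already an EvaluationError; don't add another prefix
    except Exception as e:
        token_tracker.stop()
        raise EvaluationError(str(e)) from e
    token_tracker.stop()
    input_tokens, output_tokens = token_tracker.get_counts()
    return score, reason, input_tokens, output_tokens
```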
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
228-233: Replace asserts with explicit validation + EvaluationError. Lines 229-232 and 292-295 rely on `assert`, which is stripped under `-O` and bypasses validation. Use explicit checks and raise `EvaluationError` instead. As per coding guidelines, use custom exceptions for validation.
🔧 Proposed fix
- assert (
-     evaluation_scope.turn_data is not None
-     and evaluation_scope.turn_data.expected_response is not None
- )
+ if (
+     evaluation_scope.turn_data is None
+     or evaluation_scope.turn_data.expected_response is None
+ ):
+     raise EvaluationError(
+         "expected_response list requires turn_data.expected_response"
+     )
Also applies to: 292-295
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- README.md
- config/evaluation_data_multiple_expected_responses.yaml
- src/lightspeed_evaluation/core/metrics/nlp.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/pipeline/evaluation/test_evaluator.py
🚧 Files skipped from review as they are similar to previous changes (2)
- config/evaluation_data_multiple_expected_responses.yaml
- README.md
🧰 Additional context used
📓 Path-based instructions (4)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest mocking with mocker fixture instead of unittest.mock
Import mocking utilities from pytest (mocker fixture) rather than from unittest.mock
Mirror test directory structure to match source code structure
Use test_*.py naming convention for test files and test_ prefix for test functions
Use Test* prefix for test classes in Python test files
Mock LLM calls in tests using pytest mocker fixture
Files:
tests/unit/pipeline/evaluation/test_evaluator.py
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead
Files:
tests/unit/pipeline/evaluation/test_evaluator.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/metrics/nlp.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Include type hints for all public functions and methods in Python
Use Google-style docstrings for all public APIs in Python
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/metrics/nlp.py
src/lightspeed_evaluation/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/lightspeed_evaluation/**/*.py: Use custom exceptions from core.system.exceptions module for error handling
Use structured logging with appropriate log levels in Python code
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/metrics/nlp.py
🧠 Learnings (13)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
tests/unit/pipeline/evaluation/test_evaluator.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-13T13:31:28.016Z
Learnt from: bsatapat-jpg
Repo: lightspeed-core/lightspeed-evaluation PR: 137
File: src/lightspeed_evaluation/core/metrics/custom/tool_eval.py:69-93
Timestamp: 2026-01-13T13:31:28.016Z
Learning: In the lightspeed-evaluation framework's tool_eval.py, explicit validation for invalid `match` parameter values is not required. The code is designed to fall through to full match behavior for non-"partial" values.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T13:21:53.225Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:
- FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
- ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-07-16T10:13:42.924Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:112-112
Timestamp: 2025-07-16T10:13:42.924Z
Learning: In the lsc_agent_eval judge LLM evaluation system, the _extract_numeric_result method should only accept numeric values 1 or 0 from judge responses. Boolean keywords like "true", "false", "yes", "no" are not acceptable values for judge LLM responses.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-12T11:01:11.398Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-01-12T11:01:11.398Z
Learning: Applies to src/lightspeed_evaluation/**/*.py : Use custom exceptions from core.system.exceptions module for error handling
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2026-01-15T00:41:27.575Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 140
File: src/lightspeed_evaluation/runner/evaluation.py:41-41
Timestamp: 2026-01-15T00:41:27.575Z
Learning: In Python code reviews for the lightspeed-evaluation project, allow keeping the pylint disable: too-many-locals in functions that use lazy imports to group many local variables in one scope for readability. If a function uses lazy imports to justify this pattern, document why and avoid over-refactoring into multiple helpers just for lint reasons. If you can refactor without increasing complexity or number of globals, prefer refactoring; otherwise retain the pylint directive with a clear rationale.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
src/lightspeed_evaluation/core/metrics/nlp.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_evaluator.py (2)
src/lightspeed_evaluation/core/models/data.py (5)
EvaluationScope (497-508), TurnData (36-345), EvaluationData (348-416), EvaluationRequest (511-564), for_turn (539-553)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
evaluate(194-339)
🪛 GitHub Check: Bandit
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
[notice] 229-230:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
[notice] 292-293:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
src/lightspeed_evaluation/core/metrics/nlp.py
[notice] 66-67:
Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
🔇 Additional comments (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
24-26: Imports align with the new evaluation flow. Looks good; these imports are used by the new evaluate() path.
100-145: Centralized evaluate() call reads clean. The delegation and error handling integration look consistent and easy to follow.
194-199: Refactor to remove pylint disable on evaluate(). Line 194 disables `too-many-locals` and `too-many-statements`, which the project guidelines prohibit. Please split the logic into focused helpers (e.g., multi-expected-response vs single-path) and remove the directive. As per coding guidelines, avoid disabling lint warnings.
⛔ Skipped due to learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 140
File: src/lightspeed_evaluation/runner/evaluation.py:41-41
Timestamp: 2026-01-15T00:41:27.575Z
Learning: In the lightspeed-core/lightspeed-evaluation repository, user asamal4 prefers to keep `# pylint: disable=too-many-locals` when a function uses lazy imports to keep maximum function usage in one place for code cleanliness, rather than refactoring into multiple helper functions.
tests/unit/pipeline/evaluation/test_evaluator.py (1)
6-10: Evaluate() test coverage looks solid. The helper reduces duplication, and the new parametrized cases exercise list/string/None expected_response paths well.
Also applies to: 470-681
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Force-pushed: ee39528 to 27893d0
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/unit/pipeline/evaluation/test_evaluator.py`:
- Around line 470-535: The helper _setup_evaluate_test is missing a mock patch
for ScriptEvalMetrics, so when MetricsEvaluator(...) is instantiated it will
call the real ScriptEvalMetrics; add a mock similar to the others: create a mock
via create_mock_handler(mocker, mock_return) and patch
"lightspeed_evaluation.pipeline.evaluation.evaluator.ScriptEvalMetrics" to
return that mock (and include it in the returned mock_handlers dict under
"script") so ScriptEvalMetrics is fully mocked during tests.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4)
194-210: Add Google-style docstring for public method. The `evaluate()` method is a public API but lacks a proper Google-style docstring as required by coding guidelines. The current one-liner doesn't document Args, Returns, or Raises sections.
📝 Suggested docstring
  def evaluate(  # pylint: disable=too-many-locals,too-many-statements
      self,
      request: EvaluationRequest,
      evaluation_scope: EvaluationScope,
      threshold: Optional[float],
  ) -> tuple[Optional[float], str, str, int, int]:
-     """Evaluate metric logic, handling expected_response lists."""
+     """Evaluate metric logic, handling expected_response lists.
+
+     Supports multiple expected_response values for metrics that require it.
+     Iterates through alternatives until PASS or returns max score on failure.
+
+     Args:
+         request: Evaluation request containing conversation data and metric.
+         evaluation_scope: Scope with turn data and evaluation context.
+         threshold: Score threshold for PASS/FAIL determination.
+
+     Returns:
+         Tuple of (score, reason, status, judge_input_tokens, judge_output_tokens).
+
+     Raises:
+         EvaluationError: When metric evaluation fails unexpectedly.
+     """
284-288: Verify None score handling in max comparison. Line 286 uses a ternary to handle `None` scores, but this occurs after checking `if score is not None` on line 279 for status determination. If `score` is `None`, the loop continues without updating `score_max` or `reason_acc`, which may be the intended behavior. However, if `score` is `None`, adding `f"{score}; {reason}\n"` to `reason_acc` will produce `"None; {reason}\n"`, which may not be desirable. Consider skipping reason accumulation when score is `None`:
💡 Suggested improvement
  # Expected response did not PASS; keep track of highest score
- score_max = max(
-     score_max, score if score is not None else score_max
- )
- reason_acc += f"{score}; {reason}\n"
+ if score is not None:
+     score_max = max(score_max, score)
+     reason_acc += f"{score}; {reason}\n"
+ else:
+     reason_acc += f"None; {reason}\n"
346-349: Consider preserving original exception message without double-wrapping. Line 349 passes the exception `e` directly to `EvaluationError(e)`, which may result in the exception object being converted to string. Then at line 191, the outer handler wraps it again with `f"Evaluation error: {e}"`. If the original exception already contains "Evaluation error" or similar text, this could cause redundant prefixing.
💡 Suggested improvement
  except Exception as e:  # pylint: disable=broad-exception-caught
      # Stop token tracking on error
      token_tracker.stop()
-     raise EvaluationError(e) from e
+     raise EvaluationError(str(e)) from e
100-102: Minor: Framework is parsed twice. The `framework` is extracted from `metric_identifier` at line 101 for the script metric check, and again at line 202 inside `evaluate()`. Consider passing the already-parsed `framework` to `evaluate()` to avoid duplicate string splitting.
Also applies to: 201-202
Force-pushed: 27893d0 to d45c8e1
Force-pushed: d45c8e1 to 3a57edf
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 255-309: Detect when evaluation_scope.turn_data.expected_response
is empty before the for-loop and handle it explicitly: if empty, set score=None,
reason="no expected responses" (or similar descriptive string), status="FAIL",
and leave judge_input_tokens/judge_output_tokens as 0, and emit a debug log via
logger.debug indicating the empty expected_response; update the logic around
score_max/status/reason_acc (in the same method that contains score_max,
reason_acc, token_tracker usage) so the early-empty branch returns the same
tuple shape as the normal path with the descriptive reason instead of an empty
string.
- Around line 362-365: Replace the broad "except Exception as e: # pylint:
disable=broad-exception-caught" with specific exception handlers for the errors
the try block can raise (for example ValueError, TypeError, RuntimeError,
OSError, or any domain-specific exceptions used by the surrounding code), call
token_tracker.stop() in each handler (or move token_tracker.stop() into a
finally: block) and re-raise as EvaluationError(e) from e; remove the pylint
disable comment. Locate the try/except around token_tracker.stop() and
EvaluationError in evaluator.py (the block referencing token_tracker.stop() and
raise EvaluationError(e)) and implement specific except clauses or a finally to
ensure deterministic cleanup and no broad exception catching.
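A hedged sketch of the empty-list guard plus `finally`-based cleanup requested above; the return tuple mirrors the `(score, reason, status, judge_input_tokens, judge_output_tokens)` shape discussed earlier in the review, and all names are illustrative.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def evaluate_alternatives_safely(
    alternatives: list[str],
    evaluate_one: Callable[[str], tuple[Optional[float], str]],  # hypothetical scorer
    token_tracker,                                               # assumed start/stop/get_counts
    threshold: float,
) -> tuple[Optional[float], str, str, int, int]:
    """Return (score, reason, status, judge_input_tokens, judge_output_tokens)."""
    if not alternatives:
        logger.debug("expected_response list is empty; nothing to evaluate")
        return None, "no expected responses", "FAIL", 0, 0

    score_max: Optional[float] = None
    reason_acc = ""
    status = "FAIL"
    token_tracker.start()
    try:
        for alt in alternatives:
            score, reason = evaluate_one(alt)
            if score is not None and score >= threshold:
                score_max, reason_acc, status = score, reason, "PASS"
                break
            if score is not None:
                score_max = score if score_max is None else max(score_max, score)
            reason_acc += f"{score}; {reason}\n"
    finally:
        token_tracker.stop()  # deterministic cleanup instead of per-branch stop() calls
    in_tok, out_tok = token_tracker.get_counts()
    return score_max, reason_acc, status, in_tok, out_tok
```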
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_evaluator.py (1)
641-684: Consider adding edge case test for empty list. Good parametrized test coverage for metrics that don't require `expected_response`. Consider adding a test case for the empty list edge case (`expected_response=[]`) when the metric does require `expected_response`, to verify expected behavior (ERROR result or validation failure).
Example additional test
def test_evaluate_with_empty_expected_response_list(
    self, config_loader, mock_metric_manager, mock_script_manager, mocker
):
    """Test evaluate() with empty expected_response list returns error."""
    evaluator, mock_handlers = self._setup_evaluate_test(
        config_loader,
        mock_metric_manager,
        mock_script_manager,
        mocker,
        (0.85, "Score"),
    )
    turn_data = TurnData(
        turn_id="1",
        query="Q",
        response="R",
        expected_response=[],  # Empty list
        contexts=["C"],
    )
    conv_data = EvaluationData(conversation_group_id="test_conv", turns=[turn_data])
    request = EvaluationRequest.for_turn(
        conv_data, "ragas:context_recall", 0, turn_data
    )
    scope = EvaluationScope(turn_idx=0, turn_data=turn_data, is_conversation=False)
    score, reason, status, _, _ = evaluator.evaluate(request, scope, 0.7)
    # Verify behavior for empty list - adjust assertion based on intended behavior
    assert score is None or status == "ERROR"
asamal4 left a comment
Looking at the code, it seems like we are adding too much logic in pipeline/evaluator.py. It is just supposed to delegate evaluation to the individual framework/flow.
I think more refactoring is required, but that was not planned in the original task. So please try to modularize as much as possible (at least the new code). If required, we can create follow-up tasks.
# For NLP metrics, expected_response must be a string (not a list)
if isinstance(expected_response, list):
    raise MetricError(
        "NLP metrics require expected_response to be a string, not a list"
    )
Why this constraint for NLP-based metrics? Only for G-Eval are we avoiding multiple expected responses.
That is for the type checker, because turn_data.expected_response could be None | str | list[str]. At runtime only a string expected_response will be passed to NLP, but the checker complains anyway.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 420-422: Remove the unnecessary time.sleep(0.1) in evaluator.py:
after calling self.handlers[framework].evaluate(metric_name, request.conv_data,
evaluation_scope) the TokenTracker callback is already invoked synchronously by
LiteLLM, so immediately call token_tracker.get_counts() (keep the existing
input_tokens, output_tokens assignment) and delete the time.sleep line to avoid
the added latency; ensure no other code relies on that delay and run tests
around Evaluator.evaluate and TokenTracker.get_counts to confirm behavior.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
243-244: Coding guideline: Avoid `# pylint: disable` comments. The coding guidelines state: "Do not disable lint warnings with # noqa, # type: ignore, or # pylint: disable comments - fix the underlying issue instead."
While this is a complex method, consider catching specific exception types that the evaluation handlers can raise (e.g., `ValueError`, `TypeError`, `RuntimeError`, `OSError`, `KeyError`, `AttributeError`) instead of a broad catch with a pylint disable.
♻️ Suggested approach
- except Exception as e:  # pylint: disable=broad-exception-caught
-     raise EvaluationError(e) from e
+ except (ValueError, TypeError, RuntimeError, OSError, KeyError, AttributeError) as e:
+     raise EvaluationError(str(e)) from e
Based on coding guidelines: "Do not disable lint warnings... - fix the underlying issue instead."
Force-pushed: 4451cba to f88ff40
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py`:
- Around line 133-142: The error path drops judge LLM token/score fields because
the early return calls _create_error_result without preserving MetricResult
fields from metric_result returned by _evaluate_wrapper; update the error
handling to pass the metric_result's judge_llm_* counts plus any score/threshold
into _create_error_result (or change _create_error_result signature to accept a
MetricResult and copy its fields) so that judge_llm_inference_tokens,
judge_llm_prompt_tokens, judge_llm_total_tokens and score/threshold from
metric_result are preserved in the produced EvaluationResult; modify callers of
_create_error_result accordingly and adjust _create_error_result to populate
those fields when constructing the error EvaluationResult.
- Around line 275-283: The RuntimeError raised when evaluation_scope.turn_data
or evaluation_scope.turn_data.expected_response is missing should be replaced
with the project's custom EvaluationError from core.system.exceptions; update
the raise to use EvaluationError(f"Metric '{request.metric_identifier}' requires
'expected_response' field. Could not proceed with evaluation.") and import
EvaluationError at the top, and make the same replacement for the other similar
branches that check expected_response (the other checks around the same logic in
this file) so all missing-expected-response failures raise EvaluationError
instead of RuntimeError.
♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
186-246: Remove pylint disables by shrinking _evaluate_wrapper and narrowing exceptions. The new wrapper still relies on `# pylint: disable=too-many-*` and `broad-exception-caught`. Please split the decision logic into smaller helpers and catch specific exceptions (re-raising `EvaluationError`) so the disables can be removed. As per coding guidelines, please avoid pylint disable comments.
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/data.py (1)
419-447: Document MetricResult with a Google-style docstring (public model). MetricResult is a new public API surface; please add a Google-style docstring (e.g., an Attributes section for each field) to align with project standards. As per coding guidelines, please use Google-style docstrings for public APIs.
# Evaluate metric
metric_result = self._evaluate_wrapper(request, evaluation_scope, threshold)

execution_time = time.time() - start_time

if metric_result.result == "ERROR":
    return self._create_error_result(
        request, metric_result.reason, execution_time
    )
Preserve judge token counts on ERROR results.
When metric_result.result == "ERROR", returning _create_error_result drops judge_llm_* counts (and score/threshold) captured in MetricResult, so usage stats become 0 even if LLM calls happened. Consider letting ERROR results flow through the standard EvaluationResult construction (or passing token counts into _create_error_result).
🐛 Suggested fix (preserve MetricResult fields)
- if metric_result.result == "ERROR":
- return self._create_error_result(
- request, metric_result.reason, execution_time
- )
+ # Keep ERROR results in the standard EvaluationResult path
+ # so judge_llm_* counts and context fields are preserved.
🤖 Prompt for AI Agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 133
- 142, The error path drops judge LLM token/score fields because the early
return calls _create_error_result without preserving MetricResult fields from
metric_result returned by _evaluate_wrapper; update the error handling to pass
the metric_result's judge_llm_* counts plus any score/threshold into
_create_error_result (or change _create_error_result signature to accept a
MetricResult and copy its fields) so that judge_llm_inference_tokens,
judge_llm_prompt_tokens, judge_llm_total_tokens and score/threshold from
metric_result are preserved in the produced EvaluationResult; modify callers of
_create_error_result accordingly and adjust _create_error_result to populate
those fields when constructing the error EvaluationResult.
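A rough sketch of the signature change this comment proposes, using trimmed-down stand-ins for `MetricResult` and `EvaluationResult`; the real models carry more fields and may use different names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResultSketch:
    """Trimmed-down stand-in for the real MetricResult model."""
    score: Optional[float]
    reason: str
    judge_llm_prompt_tokens: int = 0
    judge_llm_inference_tokens: int = 0


@dataclass
class EvaluationResultSketch:
    """Trimmed-down stand-in for the real EvaluationResult model."""
    result: str
    reason: str
    execution_time: float
    score: Optional[float] = None
    judge_llm_prompt_tokens: int = 0
    judge_llm_inference_tokens: int = 0


def create_error_result(
    metric_result: MetricResultSketch, execution_time: float
) -> EvaluationResultSketch:
    """Build an ERROR result while keeping score and judge token usage."""
    return EvaluationResultSketch(
        result="ERROR",
        reason=metric_result.reason,
        execution_time=execution_time,
        score=metric_result.score,
        judge_llm_prompt_tokens=metric_result.judge_llm_prompt_tokens,
        judge_llm_inference_tokens=metric_result.judge_llm_inference_tokens,
    )
```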
# This check satisfies the linter but is logically redundant
if (
    evaluation_scope.turn_data is None
    or evaluation_scope.turn_data.expected_response is None
):
    raise RuntimeError(
        f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
        f"Could not proceed with evaluation."
    )
🛠️ Refactor suggestion | 🟠 Major
Raise EvaluationError instead of RuntimeError for missing expected_response.
These branches are already handled as evaluation failures; using the project’s custom EvaluationError keeps error handling consistent and avoids leaking generic RuntimeError upstream.
🔧 Suggested fix
- raise RuntimeError(
+ raise EvaluationError(
f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
f"Could not proceed with evaluation."
)
@@
- raise RuntimeError(
+ raise EvaluationError(
f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
f"Could not proceed with evaluation."
)
@@
- raise RuntimeError(
+ raise EvaluationError(
f"Metric '{request.metric_identifier}' requires 'expected_response' field. "
f"Could not proceed with evaluation."
)
As per coding guidelines, use custom exceptions from core.system.exceptions for error handling.
Also applies to: 332-336, 369-375
🤖 Prompt for AI Agents
In `@src/lightspeed_evaluation/pipeline/evaluation/evaluator.py` around lines 275
- 283, The RuntimeError raised when evaluation_scope.turn_data or
evaluation_scope.turn_data.expected_response is missing should be replaced with
the project's custom EvaluationError from core.system.exceptions; update the
raise to use EvaluationError(f"Metric '{request.metric_identifier}' requires
'expected_response' field. Could not proceed with evaluation.") and import
EvaluationError at the top, and make the same replacement for the other similar
branches that check expected_response (the other checks around the same logic in
this file) so all missing-expected-response failures raise EvaluationError
instead of RuntimeError.
Discussed with @VladimirKadlec that using RuntimeError is more appropriate in this case...
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
Description
Added the ability to set a list of expected_response values to handle the non-deterministic nature of LLM/LCORE API responses. Added unit tests as well. Updated README.md.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Behavior
Refactor
Tests
Examples
✏️ Tip: You can customize this high-level summary in your review settings.