chore: Summary and statistics refactor #233
Conversation
Walkthrough

Extract statistics models into a new module and replace dict-based aggregations with typed compute_* functions; add agent configuration models and migrate SystemConfig/api usage to agents, wiring agents through pipeline, processor, runner, API client, loader, generators, visualization, and tests.

Changes

- Statistics Models and Computation
- Agent Configuration and Wiring
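For orientation, a minimal sketch of the dict-to-typed-model migration this walkthrough describes; the field names and compute_* signature below are illustrative assumptions, not the PR's actual code:

```python
from pydantic import BaseModel


class OverallStats(BaseModel):
    """Typed aggregate standing in for a former ad-hoc dict (fields assumed)."""

    total: int = 0
    passed: int = 0
    failed: int = 0

    @property
    def pass_rate(self) -> float:
        # Division-safe for the empty case.
        return self.passed / self.total if self.total else 0.0


def compute_overall_stats(outcomes: list[bool]) -> OverallStats:
    """Hypothetical compute_* function: True = pass, False = fail."""
    passed = sum(outcomes)
    return OverallStats(total=len(outcomes), passed=passed, failed=len(outcomes) - passed)


stats = compute_overall_stats([True, False, True])
print(stats.model_dump())  # {'total': 3, 'passed': 2, 'failed': 1}
```

Typed models like this give every consumer one canonical key set via .model_dump(), which is exactly the shape question raised in the visualization comment below.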
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs:
Suggested reviewers:
🚥 Pre-merge checks | ✅ Passed checks (5 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/output/generator.py (1)
20-28: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix import paths: stats models no longer available from core.models.summary.

The file imports ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats from lightspeed_evaluation.core.models.summary (lines 20-27), but these classes were moved to core/models/statistics.py and are no longer available in summary.py. This will cause a runtime import failure.

These stats models are correctly re-exported at the package level (lightspeed_evaluation.core.models). Update the imports to align with the rest of the codebase:

Import fix

```diff
-from lightspeed_evaluation.core.models import EvaluationData, EvaluationResult
-from lightspeed_evaluation.core.models.summary import (
-    ConversationStats,
-    EvaluationSummary,
-    MetricStats,
-    OverallStats,
-    StreamingStats,
-    TagStats,
-)
+from lightspeed_evaluation.core.models import (
+    ConversationStats,
+    EvaluationData,
+    EvaluationResult,
+    MetricStats,
+    OverallStats,
+    StreamingStats,
+    TagStats,
+)
+from lightspeed_evaluation.core.models.summary import EvaluationSummary
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/generator.py` around lines 20 - 28: the imports for ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats are pointing to lightspeed_evaluation.core.models.summary but those classes moved to core/models/statistics; update the import targets to import these symbols from the package-level re-export (lightspeed_evaluation.core.models) or directly from lightspeed_evaluation.core.models.statistics so that ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats are resolved (leave QualityReport import as-is).

src/lightspeed_evaluation/core/output/visualization.py (1)
459-463: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Critical: key-shape mismatch — _generate_pass_rates_graph will KeyError on the fallback path.

_calculate_detailed_summary_stats returns compute_detailed_stats(results).model_dump(). The Pydantic models (OverallStats, MetricStats, ConversationStats, TagStats) have fields named passed, failed, error, skipped, so .model_dump() produces keys with those names.

However, _generate_pass_rates_graph (lines 161, 165, 401-406) reads stats['pass'] and stats['fail']:

```python
f"P:{stats['pass']} F:{stats['fail']} E:{stats['error']} S:{skipped}"
```

When generate_all_graphs is invoked without detailed_stats (the fallback at line 99), these accesses will raise KeyError. The external caller in generator.py::_summary_to_detailed_stats_dict avoids this by translating field names via _group_stats_to_dict, converting passed→pass and failed→fail. This creates two incompatible code paths emitting different dict shapes for the same consumer.

Either:

- Have _calculate_detailed_summary_stats apply the same conversion (reuse or extract _group_stats_to_dict from generator.py), or
- Update _generate_pass_rates_graph and other consumers (lines 401-406) to use Pydantic field names and have generator.py stop translating.

The second option is cleaner and removes the translation layer entirely.
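A rough sketch of what the second option looks like, assuming the field names above; format_label is a hypothetical stand-in for the label-building code in _generate_pass_rates_graph:

```python
from pydantic import BaseModel


class GroupStats(BaseModel):
    passed: int = 0
    failed: int = 0
    error: int = 0
    skipped: int = 0


def format_label(stats: dict) -> str:
    # Read the canonical Pydantic field names, so the direct .model_dump()
    # path and any caller-built dict share a single shape.
    return f"P:{stats['passed']} F:{stats['failed']} E:{stats['error']} S:{stats['skipped']}"


print(format_label(GroupStats(passed=3, failed=1).model_dump()))  # P:3 F:1 E:0 S:0
```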
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/visualization.py` around lines 459 - 463, The dict shape mismatch comes from using Pydantic names (passed/failed) in the fallback returned by _calculate_detailed_summary_stats while _generate_pass_rates_graph (and its string formatting at lines 401-406) expects pass/fail keys; update the consumer side to use the canonical Pydantic field names instead of the translated ones: change all occurrences of stats['pass'] and stats['fail'] to stats['passed'] and stats['failed'] (and ensure skipped/error remain consistent), and update any related string formatting or variable names in _generate_pass_rates_graph and generate_all_graphs so both the direct Pydantic .model_dump() path and generator.py::_summary_to_detailed_stats_dict path accept the same keys (prefer removing or deprecating the _group_stats_to_dict translation in generator.py if present to keep a single canonical shape).
🧹 Nitpick comments (5)
tests/unit/core/models/test_quality.py (1)
6-6: 💤 Low value

Optional: align import with package-level re-export for consistency.

quality.py and tests/unit/core/models/conftest.py both import MetricStats via lightspeed_evaluation.core.models; this file reaches into the submodule directly. Using the same re-export path everywhere keeps a single public API surface.

♻️ Proposed change

```diff
-from lightspeed_evaluation.core.models.statistics import MetricStats
+from lightspeed_evaluation.core.models import MetricStats
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_quality.py` at line 6: the test imports MetricStats directly from the submodule lightspeed_evaluation.core.models.statistics; change the import to use the package-level re-export (from lightspeed_evaluation.core.models import MetricStats) so it matches other files (e.g., tests/unit/core/models/conftest.py and quality.py) and keeps a consistent public API surface.

src/lightspeed_evaluation/core/output/statistics.py (1)
151-175: 💤 Low value

Minor: redundant None check and fragile `or 0.0` coalescing.

Two small polish points in compute_score_statistics:

- After `if not scores: return ScoreStatistics()`, compute_numeric_stats(scores) cannot return None (it only returns None for empty input), so the `if num_stats is None` branch is dead.
- `num_stats.mean or 0.0` treats 0.0 (a valid mean) as "missing". It happens to be harmless because the fallback is also 0.0, but it's a footgun if any of these defaults ever change (e.g., a sentinel for a non-coalescing field). Prefer an explicit None check.

♻️ Proposed cleanup

```diff
 def compute_score_statistics(
     scores: list[float],
     compute_ci: bool = False,
 ) -> ScoreStatistics:
     """Compute score statistics from a list of scores."""
     if not scores:
         return ScoreStatistics()
     num_stats = compute_numeric_stats(scores)
-    if num_stats is None:
-        return ScoreStatistics()
+    assert num_stats is not None  # guaranteed by the `not scores` check above
     confidence_interval = None
     if compute_ci and len(scores) > 1:
         confidence_interval = _try_bootstrap(scores)
     return ScoreStatistics(
         count=num_stats.count,
-        mean=num_stats.mean or 0.0,
-        median=num_stats.median or 0.0,
-        std=num_stats.std or 0.0,
-        min_score=num_stats.min_value or 0.0,
-        max_score=num_stats.max_value or 0.0,
+        mean=num_stats.mean if num_stats.mean is not None else 0.0,
+        median=num_stats.median if num_stats.median is not None else 0.0,
+        std=num_stats.std if num_stats.std is not None else 0.0,
+        min_score=num_stats.min_value if num_stats.min_value is not None else 0.0,
+        max_score=num_stats.max_value if num_stats.max_value is not None else 0.0,
         confidence_interval=confidence_interval,
     )
```
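To see why the `or` coalescing is the footgun described above, a standalone snippet (the -1.0 sentinel is purely illustrative):

```python
mean = 0.0  # a valid computed mean, not a missing value

# `or` discards any falsy value, including a legitimate 0.0:
print(mean or -1.0)                        # -1.0 (0.0 was treated as missing)

# An explicit None check only replaces genuinely absent values:
print(mean if mean is not None else -1.0)  # 0.0
```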
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 151 - 175: the function compute_score_statistics contains a dead check for num_stats being None and uses fragile "or 0.0" coalescing that treats valid zeros as missing; remove the redundant "if num_stats is None" branch (compute_numeric_stats only returns None for empty scores, which is already handled) and replace expressions like num_stats.mean or 0.0 with explicit None checks (e.g., mean = num_stats.mean if num_stats.mean is not None else 0.0) for mean, median, std, min_value, and max_value when constructing the ScoreStatistics; keep the confidence_interval logic with _try_bootstrap unchanged.

tests/unit/core/output/test_statistics.py (1)
894-895: 💤 Low value

Duplicate assertions on consecutive lines.

Three test methods contain the same assert stats.total_embedding_tokens == ... check repeated on adjacent lines (894-895, 914-915, 923-924). Looks like a copy-paste artifact; the duplicate adds no coverage.

♻️ Proposed cleanup

```diff
 assert stats.total_embedding_tokens == 350
-assert stats.total_embedding_tokens == 350
```

```diff
 assert stats.total_embedding_tokens == 0
-assert stats.total_embedding_tokens == 0
```

```diff
 assert stats.total_embedding_tokens == 0
-assert stats.total_embedding_tokens == 0
```

Also applies to: 914-915, 923-924
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/output/test_statistics.py` around lines 894 - 895: remove the duplicated consecutive assertion statements that check stats.total_embedding_tokens; locate the test methods where two identical lines assert stats.total_embedding_tokens == <value> (the duplicate is a copy-paste artifact) and delete the redundant assertion so each test contains a single check.

src/lightspeed_evaluation/runner/evaluation.py (1)
134-137: ⚡ Quick win

compute_overall_stats and compute_basic_stats are not true duplicates — clarify their intended usage.

compute_overall_stats (in summary.py, line 81) is a wrapper around compute_basic_stats that explicitly handles empty results with a default return. However, compute_basic_stats already guards against division by zero (e.g., `if total > 0 else 0.0`), so both functions behave identically on empty input. The naming distinction ("overall" vs "basic") is subtle and potentially confusing since they compute the same statistics.

Consider either:

- Using compute_basic_stats consistently throughout (it is already production-safe), or
- Renaming the wrapper to clarify intent (e.g., compute_overall_stats_with_default) if the explicit empty-result sentinel is semantically important; a sketch of this option follows.
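A minimal sketch of the rename option, with the stats fields and signatures assumed for illustration rather than taken from the repo:

```python
from dataclasses import dataclass


@dataclass
class OverallStats:
    total: int = 0
    pass_rate: float = 0.0


def compute_basic_stats(outcomes: list[bool]) -> OverallStats:
    total = len(outcomes)
    # Division-safe, so it already handles empty input on its own.
    return OverallStats(total, sum(outcomes) / total if total > 0 else 0.0)


def compute_overall_stats_with_default(outcomes: list[bool]) -> OverallStats:
    """The suffix makes the empty-result sentinel explicit at call sites."""
    if not outcomes:
        return OverallStats()
    return compute_basic_stats(outcomes)
```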
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/runner/evaluation.py` around lines 134 - 137: the two functions compute_overall_stats and compute_basic_stats currently compute identical metrics and cause naming confusion; pick one approach: either replace callers of compute_overall_stats with compute_basic_stats to use the production-safe implementation everywhere (locate usages in evaluation.py and summary.py and import/call compute_basic_stats instead), or rename the wrapper compute_overall_stats to a clearer name (e.g., compute_overall_stats_with_default) and update its docstring and all call sites to reflect the semantic difference; ensure only one canonical function remains used across the codebase to avoid duplication.

src/lightspeed_evaluation/core/models/summary.py (1)
8-17: ⚡ Quick win

Import directly from sibling submodules instead of the parent core.models package.

src/lightspeed_evaluation/core/models/summary.py imports model classes from lightspeed_evaluation.core.models (its parent package). While this currently works because __init__.py does not import from summary.py, it creates unnecessary coupling and makes the code fragile if the package structure changes. Import directly from the submodules (data and statistics) to make dependencies explicit and independent of __init__.py ordering.

♻️ Proposed fix

```diff
-from lightspeed_evaluation.core.models import (
-    EvaluationData,
-    EvaluationResult,
-    StreamingStats,
-    ApiTokenUsage,
-    OverallStats,
-    MetricStats,
-    ConversationStats,
-    TagStats,
-)
+from lightspeed_evaluation.core.models.data import (
+    EvaluationData,
+    EvaluationResult,
+)
+from lightspeed_evaluation.core.models.statistics import (
+    ApiTokenUsage,
+    ConversationStats,
+    MetricStats,
+    OverallStats,
+    StreamingStats,
+    TagStats,
+)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/models/summary.py` around lines 8 - 17, Replace the broad import from lightspeed_evaluation.core.models with explicit imports from the two submodules: import EvaluationData, EvaluationResult, and ApiTokenUsage from lightspeed_evaluation.core.models.data and import StreamingStats, OverallStats, MetricStats, ConversationStats, and TagStats from lightspeed_evaluation.core.models.statistics; remove the original parent-package import line so summary.py depends directly on the specific submodules (use the exact class names: EvaluationData, EvaluationResult, ApiTokenUsage, StreamingStats, OverallStats, MetricStats, ConversationStats, TagStats).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/models/statistics.py`:
- Around line 28-30: The confidence_level Field currently claims a 0–1 scale but
_try_bootstrap in output/statistics.py stores values like 95.0; fix the mismatch
by either normalizing in _try_bootstrap (divide by 100 before assigning to
confidence_level) or, simpler, update the Field description on confidence_level
in statistics.py to say it's a percentage (e.g., "Confidence level percentage
(e.g., 95.0 for 95%)") so it matches the emitted 95.0 values; reference the
Field named confidence_level in
src/lightspeed_evaluation/core/models/statistics.py and the _try_bootstrap
producer in output/statistics.py when making the change.
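The simpler of the two fixes sketched against an assumed model shape; only the description string is the substantive change:

```python
from pydantic import BaseModel, Field


class ConfidenceInterval(BaseModel):
    """Hypothetical shape of the model that carries confidence_level."""

    lower: float = 0.0
    upper: float = 0.0
    confidence_level: float = Field(
        default=95.0,
        description="Confidence level percentage (e.g., 95.0 for 95%)",
    )
```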
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5f5edd2-8fe6-4408-a30a-5f2517228a07
📒 Files selected for processing (14)
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/quality.py
- src/lightspeed_evaluation/core/models/statistics.py
- src/lightspeed_evaluation/core/models/summary.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/output/visualization.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/core/models/conftest.py
- tests/unit/core/models/test_quality.py
- tests/unit/core/models/test_summary.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/core/output/test_statistics.py
- tests/unit/runner/test_evaluation.py
7dd9b6e to 3ce56ea (Compare)
asamal4 left a comment:

Thanks. A minor change.
3ce56ea to c752880 (Compare)
c752880 to d2cbffb (Compare)
…gration (lightspeed-core#228)

* feat: add agents configuration layer with backward-compatible api: migration

Introduce a generic `agents:` configuration block in system.yaml that makes the HTTP API one agent type among potentially many (e.g., Kubernetes CRD agents for Openshift Agentic Lightspeed). The key design ensures zero breakage of existing configs and code:

- New Pydantic models: AgentsConfig, AgentDefaultConfig, HttpApiAgentConfig in a new `core/models/agents.py` module
- SystemConfig gains an `agents: Optional[AgentsConfig]` field alongside the existing `api:` field
- A model_validator auto-migrates `api:` to `agents:` when `agents:` is absent, so all existing system.yaml files continue working unchanged
- `api.enabled=True` maps to `default.agent="http_api"`; `api.enabled=False` maps to `default.agent=None`
- The `api` field is preserved — all downstream code reading `config.api` continues to work without modification
- EvaluationData gains `agent` and `agent_config` optional fields for per-conversation-group agent selection and config overrides
- Config resolution follows a 3-level priority chain: eval_data.agent_config > agents.<name> > agents.default
- ConfigLoader passes `agents` YAML data through to SystemConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: address PR lightspeed-core#228 review feedback on agents config layer

- Replace timeout/retry fields with generic agent_config dict on AgentDefaultConfig and HttpApiAgentConfig, implementing 3-level per-key merge (default < agent < eval_data) in resolve_agent_config
- Move enabled from per-agent (HttpApiAgentConfig) to AgentsConfig level, add api_enabled property on SystemConfig for backward-compatible access
- Move whitespace pattern validator from EvaluationData.agent to AgentDefaultConfig.agent where agent names are defined as identifiers
- Update endpoint_type description to include infer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve default agent config for API client instead of legacy api defaults

_create_api_client now calls AgentsConfig.resolve_agent_config() to obtain the correct endpoint_type, api_base, and other settings from the default agent definition. Previously it used config.api directly, which fell back to legacy defaults (endpoint_type: streaming) even when agents were explicitly configured with endpoint_type: query.

Widen APIClient config type to accept HttpApiAgentConfig alongside APIConfig since both share the same HttpApiBaseFields interface.

Per-conversation agent overrides are not yet supported and will be addressed by the agent driver architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
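To make the merge semantics concrete, a dict-based sketch of the 3-level per-key resolution described above; the real resolve_agent_config lives on AgentsConfig and works over the Pydantic models, so treat this as an approximation:

```python
def resolve_agent_config(
    default_cfg: dict,
    agent_cfg: dict,
    eval_data_cfg: dict,
) -> dict:
    """Per-key merge with priority: eval_data > agents.<name> > agents.default."""
    merged = dict(default_cfg)    # lowest priority: agents.default
    merged.update(agent_cfg)      # then the named agent definition
    merged.update(eval_data_cfg)  # highest priority: per-group override
    return merged


print(resolve_agent_config(
    {"timeout": 30, "endpoint_type": "streaming"},  # agents.default (assumed keys)
    {"endpoint_type": "query"},                     # agents.http_api
    {"timeout": 120},                               # eval_data.agent_config
))
# {'timeout': 120, 'endpoint_type': 'query'}
```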
d2cbffb to db52406 (Compare)
Description
Reduced redundancy and code duplication between summary and statistics modules:

- `core/models/statistics.py` for shared statistics data models
- `core/models/summary.py`
- `core/output/statistics.py` to use the consolidated models

Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit