chore: Summary and statistics refactor #233
Conversation
Walkthrough

Extract statistics models into a new module and replace dict-based aggregations with typed compute_* functions; add agent configuration models and migrate SystemConfig/api usage to agents, wiring agents through pipeline, processor, runner, API client, loader, generators, visualization, and tests.

Changes

- Statistics Models and Computation
- Agent Configuration and Wiring
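For orientation, a minimal sketch of the dict-to-typed-model migration this walkthrough describes; the field names and compute_* signature below are illustrative assumptions, not the PR's actual code:

```python
from pydantic import BaseModel


class OverallStats(BaseModel):
    """Typed aggregate standing in for a former ad-hoc dict (fields assumed)."""

    total: int = 0
    passed: int = 0
    failed: int = 0

    @property
    def pass_rate(self) -> float:
        # Division-safe for the empty case.
        return self.passed / self.total if self.total else 0.0


def compute_overall_stats(outcomes: list[bool]) -> OverallStats:
    """Hypothetical compute_* function: True = pass, False = fail."""
    passed = sum(outcomes)
    return OverallStats(total=len(outcomes), passed=passed, failed=len(outcomes) - passed)


stats = compute_overall_stats([True, False, True])
print(stats.model_dump())  # {'total': 3, 'passed': 2, 'failed': 1}
```

Typed models like this give every consumer one canonical key set via .model_dump(), which is exactly the shape question raised in the visualization comment below.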
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs:
Suggested reviewers:
🚥 Pre-merge checks | ✅ Passed checks (5 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/core/output/generator.py (1)
20-28: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix import paths: stats models no longer available from core.models.summary.

The file imports ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats from lightspeed_evaluation.core.models.summary (lines 20-27), but these classes were moved to core/models/statistics.py and are no longer available in summary.py. This will cause a runtime import failure.

These stats models are correctly re-exported at the package level (lightspeed_evaluation.core.models). Update the imports to align with the rest of the codebase:

Import fix

```diff
-from lightspeed_evaluation.core.models import EvaluationData, EvaluationResult
-from lightspeed_evaluation.core.models.summary import (
-    ConversationStats,
-    EvaluationSummary,
-    MetricStats,
-    OverallStats,
-    StreamingStats,
-    TagStats,
-)
+from lightspeed_evaluation.core.models import (
+    ConversationStats,
+    EvaluationData,
+    EvaluationResult,
+    MetricStats,
+    OverallStats,
+    StreamingStats,
+    TagStats,
+)
+from lightspeed_evaluation.core.models.summary import EvaluationSummary
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/generator.py` around lines 20 - 28: the imports for ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats are pointing to lightspeed_evaluation.core.models.summary but those classes moved to core/models/statistics; update the import targets to import these symbols from the package-level re-export (lightspeed_evaluation.core.models) or directly from lightspeed_evaluation.core.models.statistics so that ConversationStats, MetricStats, OverallStats, StreamingStats, and TagStats are resolved (leave QualityReport import as-is).

src/lightspeed_evaluation/core/output/visualization.py (1)
459-463: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Critical: key-shape mismatch — _generate_pass_rates_graph will KeyError on the fallback path.

_calculate_detailed_summary_stats returns compute_detailed_stats(results).model_dump(). The Pydantic models (OverallStats, MetricStats, ConversationStats, TagStats) have fields named passed, failed, error, skipped, so .model_dump() produces keys with those names.

However, _generate_pass_rates_graph (lines 161, 165, 401-406) reads stats['pass'] and stats['fail']:

```python
f"P:{stats['pass']} F:{stats['fail']} E:{stats['error']} S:{skipped}"
```

When generate_all_graphs is invoked without detailed_stats (the fallback at line 99), these accesses will raise KeyError. The external caller in generator.py::_summary_to_detailed_stats_dict avoids this by translating field names via _group_stats_to_dict, converting passed→pass and failed→fail. This creates two incompatible code paths emitting different dict shapes for the same consumer.

Either:

- Have _calculate_detailed_summary_stats apply the same conversion (reuse or extract _group_stats_to_dict from generator.py), or
- Update _generate_pass_rates_graph and other consumers (lines 401-406) to use Pydantic field names and have generator.py stop translating.

The second option is cleaner and removes the translation layer entirely.
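A rough sketch of what the second option looks like, assuming the field names above; format_label is a hypothetical stand-in for the label-building code in _generate_pass_rates_graph:

```python
from pydantic import BaseModel


class GroupStats(BaseModel):
    passed: int = 0
    failed: int = 0
    error: int = 0
    skipped: int = 0


def format_label(stats: dict) -> str:
    # Read the canonical Pydantic field names, so the direct .model_dump()
    # path and any caller-built dict share a single shape.
    return f"P:{stats['passed']} F:{stats['failed']} E:{stats['error']} S:{stats['skipped']}"


print(format_label(GroupStats(passed=3, failed=1).model_dump()))  # P:3 F:1 E:0 S:0
```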
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/visualization.py` around lines 459 - 463, The dict shape mismatch comes from using Pydantic names (passed/failed) in the fallback returned by _calculate_detailed_summary_stats while _generate_pass_rates_graph (and its string formatting at lines 401-406) expects pass/fail keys; update the consumer side to use the canonical Pydantic field names instead of the translated ones: change all occurrences of stats['pass'] and stats['fail'] to stats['passed'] and stats['failed'] (and ensure skipped/error remain consistent), and update any related string formatting or variable names in _generate_pass_rates_graph and generate_all_graphs so both the direct Pydantic .model_dump() path and generator.py::_summary_to_detailed_stats_dict path accept the same keys (prefer removing or deprecating the _group_stats_to_dict translation in generator.py if present to keep a single canonical shape).
🧹 Nitpick comments (5)
tests/unit/core/models/test_quality.py (1)
6-6: 💤 Low value

Optional: align import with package-level re-export for consistency.

quality.py and tests/unit/core/models/conftest.py both import MetricStats via lightspeed_evaluation.core.models; this file reaches into the submodule directly. Using the same re-export path everywhere keeps a single public API surface.

♻️ Proposed change

```diff
-from lightspeed_evaluation.core.models.statistics import MetricStats
+from lightspeed_evaluation.core.models import MetricStats
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_quality.py` at line 6: the test imports MetricStats directly from the submodule lightspeed_evaluation.core.models.statistics; change the import to use the package-level re-export (from lightspeed_evaluation.core.models import MetricStats) so it matches other files (e.g., tests/unit/core/models/conftest.py and quality.py) and keeps a consistent public API surface.

src/lightspeed_evaluation/core/output/statistics.py (1)
151-175: 💤 Low value

Minor: redundant None check and fragile `or 0.0` coalescing.

Two small polish points in compute_score_statistics:

- After `if not scores: return ScoreStatistics()`, compute_numeric_stats(scores) cannot return None (it only returns None for empty input), so the `if num_stats is None` branch is dead.
- `num_stats.mean or 0.0` treats 0.0 (a valid mean) as "missing". It happens to be harmless because the fallback is also 0.0, but it's a footgun if any of these defaults ever change (e.g., a sentinel for a non-coalescing field). Prefer an explicit None check.

♻️ Proposed cleanup

```diff
 def compute_score_statistics(
     scores: list[float],
     compute_ci: bool = False,
 ) -> ScoreStatistics:
     """Compute score statistics from a list of scores."""
     if not scores:
         return ScoreStatistics()
     num_stats = compute_numeric_stats(scores)
-    if num_stats is None:
-        return ScoreStatistics()
+    assert num_stats is not None  # guaranteed by the `not scores` check above
     confidence_interval = None
     if compute_ci and len(scores) > 1:
         confidence_interval = _try_bootstrap(scores)
     return ScoreStatistics(
         count=num_stats.count,
-        mean=num_stats.mean or 0.0,
-        median=num_stats.median or 0.0,
-        std=num_stats.std or 0.0,
-        min_score=num_stats.min_value or 0.0,
-        max_score=num_stats.max_value or 0.0,
+        mean=num_stats.mean if num_stats.mean is not None else 0.0,
+        median=num_stats.median if num_stats.median is not None else 0.0,
+        std=num_stats.std if num_stats.std is not None else 0.0,
+        min_score=num_stats.min_value if num_stats.min_value is not None else 0.0,
+        max_score=num_stats.max_value if num_stats.max_value is not None else 0.0,
         confidence_interval=confidence_interval,
     )
```
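To see why the `or` coalescing is the footgun described above, a standalone snippet (the -1.0 sentinel is purely illustrative):

```python
mean = 0.0  # a valid computed mean, not a missing value

# `or` discards any falsy value, including a legitimate 0.0:
print(mean or -1.0)                        # -1.0 (0.0 was treated as missing)

# An explicit None check only replaces genuinely absent values:
print(mean if mean is not None else -1.0)  # 0.0
```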
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 151 - 175: the function compute_score_statistics contains a dead check for num_stats being None and uses fragile "or 0.0" coalescing that treats valid zeros as missing; remove the redundant "if num_stats is None" branch (compute_numeric_stats only returns None for empty scores, which is already handled) and replace expressions like num_stats.mean or 0.0 with explicit None checks (e.g., mean = num_stats.mean if num_stats.mean is not None else 0.0) for mean, median, std, min_value, and max_value when constructing the ScoreStatistics; keep the confidence_interval logic with _try_bootstrap unchanged.

tests/unit/core/output/test_statistics.py (1)
894-895: 💤 Low value

Duplicate assertions on consecutive lines.

Three test methods contain the same assert stats.total_embedding_tokens == ... check repeated on adjacent lines (894-895, 914-915, 923-924). Looks like a copy-paste artifact; the duplicate adds no coverage.

♻️ Proposed cleanup

```diff
 assert stats.total_embedding_tokens == 350
-assert stats.total_embedding_tokens == 350
```

```diff
 assert stats.total_embedding_tokens == 0
-assert stats.total_embedding_tokens == 0
```

```diff
 assert stats.total_embedding_tokens == 0
-assert stats.total_embedding_tokens == 0
```

Also applies to: 914-915, 923-924
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/output/test_statistics.py` around lines 894 - 895: remove the duplicated consecutive assertion statements that check stats.total_embedding_tokens; locate the test methods where two identical lines assert stats.total_embedding_tokens == <value> (the duplicate is a copy-paste artifact) and delete the redundant assertion so each test contains a single check.

src/lightspeed_evaluation/runner/evaluation.py (1)
134-137: ⚡ Quick win

compute_overall_stats and compute_basic_stats are not true duplicates — clarify their intended usage.

compute_overall_stats (in summary.py, line 81) is a wrapper around compute_basic_stats that explicitly handles empty results with a default return. However, compute_basic_stats already guards against division by zero (e.g., `if total > 0 else 0.0`), so both functions behave identically on empty input. The naming distinction ("overall" vs "basic") is subtle and potentially confusing since they compute the same statistics.

Consider either:

- Using compute_basic_stats consistently throughout (it is already production-safe), or
- Renaming the wrapper to clarify intent (e.g., compute_overall_stats_with_default) if the explicit empty-result sentinel is semantically important; a sketch of this option follows.
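A minimal sketch of the rename option, with the stats fields and signatures assumed for illustration rather than taken from the repo:

```python
from dataclasses import dataclass


@dataclass
class OverallStats:
    total: int = 0
    pass_rate: float = 0.0


def compute_basic_stats(outcomes: list[bool]) -> OverallStats:
    total = len(outcomes)
    # Division-safe, so it already handles empty input on its own.
    return OverallStats(total, sum(outcomes) / total if total > 0 else 0.0)


def compute_overall_stats_with_default(outcomes: list[bool]) -> OverallStats:
    """The suffix makes the empty-result sentinel explicit at call sites."""
    if not outcomes:
        return OverallStats()
    return compute_basic_stats(outcomes)
```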
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/runner/evaluation.py` around lines 134 - 137: the two functions compute_overall_stats and compute_basic_stats currently compute identical metrics and cause naming confusion; pick one approach: either replace callers of compute_overall_stats with compute_basic_stats to use the production-safe implementation everywhere (locate usages in evaluation.py and summary.py and import/call compute_basic_stats instead), or rename the wrapper compute_overall_stats to a clearer name (e.g., compute_overall_stats_with_default) and update its docstring and all call sites to reflect the semantic difference; ensure only one canonical function remains used across the codebase to avoid duplication.

src/lightspeed_evaluation/core/models/summary.py (1)
8-17: ⚡ Quick win

Import directly from sibling submodules instead of the parent core.models package.

src/lightspeed_evaluation/core/models/summary.py imports model classes from lightspeed_evaluation.core.models (its parent package). While this currently works because __init__.py does not import from summary.py, it creates unnecessary coupling and makes the code fragile if the package structure changes. Import directly from the submodules (data and statistics) to make dependencies explicit and independent of __init__.py ordering.

♻️ Proposed fix

```diff
-from lightspeed_evaluation.core.models import (
-    EvaluationData,
-    EvaluationResult,
-    StreamingStats,
-    ApiTokenUsage,
-    OverallStats,
-    MetricStats,
-    ConversationStats,
-    TagStats,
-)
+from lightspeed_evaluation.core.models.data import (
+    EvaluationData,
+    EvaluationResult,
+)
+from lightspeed_evaluation.core.models.statistics import (
+    ApiTokenUsage,
+    ConversationStats,
+    MetricStats,
+    OverallStats,
+    StreamingStats,
+    TagStats,
+)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/models/summary.py` around lines 8 - 17, Replace the broad import from lightspeed_evaluation.core.models with explicit imports from the two submodules: import EvaluationData, EvaluationResult, and ApiTokenUsage from lightspeed_evaluation.core.models.data and import StreamingStats, OverallStats, MetricStats, ConversationStats, and TagStats from lightspeed_evaluation.core.models.statistics; remove the original parent-package import line so summary.py depends directly on the specific submodules (use the exact class names: EvaluationData, EvaluationResult, ApiTokenUsage, StreamingStats, OverallStats, MetricStats, ConversationStats, TagStats).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/models/statistics.py`:
- Around line 28-30: The confidence_level Field currently claims a 0–1 scale but
_try_bootstrap in output/statistics.py stores values like 95.0; fix the mismatch
by either normalizing in _try_bootstrap (divide by 100 before assigning to
confidence_level) or, simpler, update the Field description on confidence_level
in statistics.py to say it's a percentage (e.g., "Confidence level percentage
(e.g., 95.0 for 95%)") so it matches the emitted 95.0 values; reference the
Field named confidence_level in
src/lightspeed_evaluation/core/models/statistics.py and the _try_bootstrap
producer in output/statistics.py when making the change.
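The simpler of the two fixes sketched against an assumed model shape; only the description string is the substantive change:

```python
from pydantic import BaseModel, Field


class ConfidenceInterval(BaseModel):
    """Hypothetical shape of the model that carries confidence_level."""

    lower: float = 0.0
    upper: float = 0.0
    confidence_level: float = Field(
        default=95.0,
        description="Confidence level percentage (e.g., 95.0 for 95%)",
    )
```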
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5f5edd2-8fe6-4408-a30a-5f2517228a07
📒 Files selected for processing (14)
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/models/quality.py
- src/lightspeed_evaluation/core/models/statistics.py
- src/lightspeed_evaluation/core/models/summary.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/output/visualization.py
- src/lightspeed_evaluation/runner/evaluation.py
- tests/unit/core/models/conftest.py
- tests/unit/core/models/test_quality.py
- tests/unit/core/models/test_summary.py
- tests/unit/core/output/test_final_coverage.py
- tests/unit/core/output/test_statistics.py
- tests/unit/runner/test_evaluation.py
7dd9b6e to 3ce56ea (Compare)
asamal4 left a comment:

Thanks. A minor change.
3ce56ea to c752880 (Compare)
c752880 to d2cbffb (Compare)
…gration (lightspeed-core#228)

* feat: add agents configuration layer with backward-compatible api: migration

Introduce a generic `agents:` configuration block in system.yaml that makes the HTTP API one agent type among potentially many (e.g., Kubernetes CRD agents for Openshift Agentic Lightspeed). The key design ensures zero breakage of existing configs and code:

- New Pydantic models: AgentsConfig, AgentDefaultConfig, HttpApiAgentConfig in a new `core/models/agents.py` module
- SystemConfig gains an `agents: Optional[AgentsConfig]` field alongside the existing `api:` field
- A model_validator auto-migrates `api:` to `agents:` when `agents:` is absent, so all existing system.yaml files continue working unchanged
- `api.enabled=True` maps to `default.agent="http_api"`; `api.enabled=False` maps to `default.agent=None`
- The `api` field is preserved — all downstream code reading `config.api` continues to work without modification
- EvaluationData gains `agent` and `agent_config` optional fields for per-conversation-group agent selection and config overrides
- Config resolution follows a 3-level priority chain: eval_data.agent_config > agents.<name> > agents.default
- ConfigLoader passes `agents` YAML data through to SystemConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: address PR lightspeed-core#228 review feedback on agents config layer

- Replace timeout/retry fields with generic agent_config dict on AgentDefaultConfig and HttpApiAgentConfig, implementing 3-level per-key merge (default < agent < eval_data) in resolve_agent_config
- Move enabled from per-agent (HttpApiAgentConfig) to AgentsConfig level, add api_enabled property on SystemConfig for backward-compatible access
- Move whitespace pattern validator from EvaluationData.agent to AgentDefaultConfig.agent where agent names are defined as identifiers
- Update endpoint_type description to include infer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: resolve default agent config for API client instead of legacy api defaults

_create_api_client now calls AgentsConfig.resolve_agent_config() to obtain the correct endpoint_type, api_base, and other settings from the default agent definition. Previously it used config.api directly, which fell back to legacy defaults (endpoint_type: streaming) even when agents were explicitly configured with endpoint_type: query.

Widen APIClient config type to accept HttpApiAgentConfig alongside APIConfig since both share the same HttpApiBaseFields interface.

Per-conversation agent overrides are not yet supported and will be addressed by the agent driver architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
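To make the merge semantics concrete, a dict-based sketch of the 3-level per-key resolution described above; the real resolve_agent_config lives on AgentsConfig and works over the Pydantic models, so treat this as an approximation:

```python
def resolve_agent_config(
    default_cfg: dict,
    agent_cfg: dict,
    eval_data_cfg: dict,
) -> dict:
    """Per-key merge with priority: eval_data > agents.<name> > agents.default."""
    merged = dict(default_cfg)    # lowest priority: agents.default
    merged.update(agent_cfg)      # then the named agent definition
    merged.update(eval_data_cfg)  # highest priority: per-group override
    return merged


print(resolve_agent_config(
    {"timeout": 30, "endpoint_type": "streaming"},  # agents.default (assumed keys)
    {"endpoint_type": "query"},                     # agents.http_api
    {"timeout": 120},                               # eval_data.agent_config
))
# {'timeout': 120, 'endpoint_type': 'query'}
```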
d2cbffb to db52406 (Compare)
Description
Reduced redundancy and code duplication between summary and statistics modules:

- `core/models/statistics.py` for shared statistics data models
- `core/models/summary.py`
- `core/output/statistics.py` to use the consolidated models

Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit