[LEADS-348, LEADS-364] API Latency and Token Calculation #230
Conversation
a00a8f1 to 861d59e (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
119-122: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set `agent_latency` in the error path for accurate timing.

When the API call fails, `turn_data.agent_latency` is never set. This means it will use the default value (likely `0.0`), which misrepresents the actual time spent on the failed API call. Consider setting `agent_latency` based on the elapsed time even when an error occurs, to accurately track API call overhead.

🕒 Proposed fix to track latency in error path

```diff
 except APIError as e:
+    api_latency = time.perf_counter() - api_start_time
+    turn_data.agent_latency = api_latency
     error_msg = f"API Error for turn {turn_data.turn_id}: {e}"
     logger.error(error_msg)
     return error_msg, conversation_id
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 119 - 122, In the APIError except block inside amender.py (the except APIError as e: handler for the call that sets turn_data), record the elapsed time and assign it to turn_data.agent_latency before logging and returning: compute the latency using the same start/end timing mechanism used in the successful path (or capture time at exception entry), set turn_data.agent_latency = elapsed, then include that updated timing prior to calling logger.error and returning error_msg, conversation_id so failed calls reflect actual agent_latency.
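For illustration, a minimal sketch of the timing pattern this comment implies, assuming the amender times the call with `time.perf_counter()` and catches an `APIError` as in the diff above; the surrounding function, the `client` object, and the stand-in `APIError` class here are placeholders rather than the project's actual code. Using `try`/`finally` records the latency on both the success and the error path, with the same effect as assigning it inside the `except` block.

```python
import logging
import time

logger = logging.getLogger(__name__)


class APIError(Exception):
    """Stand-in for the project's APIError."""


def amend_turn_with_timing(turn_data, client, conversation_id):
    """Hypothetical excerpt: latency is recorded whether or not the call fails."""
    api_start_time = time.perf_counter()
    try:
        turn_data.response = client.query(turn_data.query)  # placeholder API call
    except APIError as e:
        error_msg = f"API Error for turn {turn_data.turn_id}: {e}"
        logger.error(error_msg)
        return error_msg, conversation_id
    finally:
        # Runs on both paths, so failed calls also report their real elapsed time.
        turn_data.agent_latency = time.perf_counter() - api_start_time
    return None, conversation_id
```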
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
71-76: 💤 Low value

Consider tracking API call time separately from agent processing time.

The current logic sets `agent_latency` to `0.0` for cached responses (when tokens are zero), which discards network overhead and authentication time. While this appears intentional (tracking only agent processing time), it means you lose visibility into the total API call latency for cached responses.

Consider one of these alternatives (a sketch of the first follows the list):

- Track both `api_call_time` (total) and `agent_compute_time` (processing only)
- Document that `agent_latency` excludes cached response overhead
- Verify that zero tokens reliably indicate a cached response in all scenarios
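A rough sketch of the first alternative, using a plain dataclass as a stand-in for the real `TurnData` model; the `api_call_time` and `agent_compute_time` fields are the names suggested above, not existing fields, and the zero-token heuristic for cached responses is taken from the comment.

```python
from dataclasses import dataclass


@dataclass
class TurnData:  # simplified stand-in for the project's TurnData model
    turn_id: str
    query: str
    api_call_time: float = 0.0       # total wall-clock time of the API call
    agent_compute_time: float = 0.0  # agent processing time only; 0.0 for cached responses


def record_latency(turn_data: TurnData, api_latency: float, total_tokens: int) -> None:
    """Keep total call latency while treating zero tokens as a cached response."""
    turn_data.api_call_time = api_latency
    turn_data.agent_compute_time = api_latency if total_tokens > 0 else 0.0
```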
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 71 - 76, The current assignment to turn_data.agent_latency overwrites total API latency for cached responses (when turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change this by adding separate fields (e.g., turn_data.api_call_time = api_latency and turn_data.agent_compute_time = ...) and set agent_compute_time to api_latency only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in turn_data.api_call_time while keeping agent processing time in turn_data.agent_compute_time; update any downstream usage of turn_data.agent_latency to use the new fields or document the semantics clearly in the turn_data class/definition.

src/lightspeed_evaluation/core/output/statistics.py (1)
311-314: ⚡ Quick win

Use Google-style docstring for the new public helper.

`calculate_field_numeric_stats_from_evaluation_data` is a public API in `src/` and should include explicit `Args` and `Returns` sections for consistency with repository standards.

As per coding guidelines, `src/**/*.py`: "Use Google-style docstrings for all public APIs in Python".
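For illustration, a docstring along these lines would satisfy the guideline; the signature and return description shown here are assumptions about the helper, not its actual definition.

```python
def calculate_field_numeric_stats_from_evaluation_data(evaluation_data, field_name):
    """Calculate numeric statistics for a turn-level field across evaluation data.

    Args:
        evaluation_data: Evaluation records whose turns are inspected.
        field_name: Name of the numeric field to aggregate (e.g. "agent_latency").

    Returns:
        The computed statistics (count, mean, median, std, min/max, percentiles),
        or None when no non-zero values are found.
    """
```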
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 311 - 314, The docstring for the public function calculate_field_numeric_stats_from_evaluation_data lacks the required Google-style format; update its docstring to include an explicit short description followed by Google-style Args and Returns sections (Args: evaluation_data: list[EvaluationData] — list of evaluation records to inspect; field_name: str — name of the numeric field to compute stats for) and Returns: dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median, min, max, std) and any notes about filtering zeros), ensuring the docstring matches repository standards for public APIs.

tests/unit/core/models/test_summary.py (2)
223-241: 💤 Low value

Consider adding a test for mixed zero and non-zero latencies.

The current tests cover "all non-zero" and "all zero" scenarios. Consider adding a test where some turns have `agent_latency=0` (e.g., cached responses) and others have positive values (actual API calls). This would verify the behavior when API calls are only partially measured.

📝 Example test case
```python
def test_with_mixed_api_latency(self) -> None:
    """Test from_results with mixed zero and non-zero latencies."""
    results = [_make_result(turn_id="t1")]
    eval_data = [
        EvaluationData(
            conversation_group_id="conv1",
            turns=[
                TurnData(turn_id="t1", query="Query 1", agent_latency=1.5),
                TurnData(turn_id="t2", query="Query 2", agent_latency=0),  # cached
                TurnData(turn_id="t3", query="Query 3", agent_latency=2.0),
            ],
        )
    ]
    summary = EvaluationSummary.from_results(results, evaluation_data=eval_data)
    # Should compute stats only from non-zero values (1.5, 2.0)
    assert summary.agent_latency_stats is not None
    assert summary.agent_latency_stats.count == 2
    assert summary.agent_latency_stats.min_value == 1.5
    assert summary.agent_latency_stats.max_value == 2.0
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_summary.py` around lines 223 - 241, Add a new unit test that calls EvaluationSummary.from_results with EvaluationData containing a mix of TurnData entries where some have agent_latency == 0 and others have positive values; verify that EvaluationSummary.agent_latency_stats is not None and that its count, min_value, and max_value are computed only from the non-zero latencies (e.g., ensure count==number of positive latencies and min/max reflect those values). Use the existing test pattern and helper _make_result to create results and assert the expected stats for the non-zero latencies.
199-222: 💤 Low value

Consider more precise assertion for mean latency.

Line 219 asserts `mean > 0`, which is very loose. Given the input values (1.5, 2.0, 1.8), the expected mean is approximately 1.7667. A more precise assertion would validate the calculation correctness.

🔍 Suggested improvement
```diff
 assert summary.agent_latency_stats is not None
 assert summary.agent_latency_stats.count == 3
-assert summary.agent_latency_stats.mean is not None
-assert summary.agent_latency_stats.mean > 0
+expected_mean = (1.5 + 2.0 + 1.8) / 3  # ≈ 1.7667
+assert summary.agent_latency_stats.mean is not None
+assert abs(summary.agent_latency_stats.mean - expected_mean) < 0.0001
 assert summary.agent_latency_stats.min_value == 1.5
 assert summary.agent_latency_stats.max_value == 2.0
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/core/models/test_summary.py` around lines 199 - 222, The test test_with_api_latency currently asserts summary.agent_latency_stats.mean > 0 which is too loose; update the assertion after calling EvaluationSummary.from_results to check the mean against the expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison (e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4) or use pytest.approx) referencing the test function name and the agent_latency_stats.mean field to ensure the calculation is correct.
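As the prompt notes, `pytest.approx` expresses the same tolerance more idiomatically. A small sketch, assuming the `summary` object produced by `EvaluationSummary.from_results` in the test above:

```python
import pytest


def assert_mean_is_precise(summary) -> None:
    # Equivalent to the abs(mean - expected) < 0.0001 check in the suggested diff.
    expected_mean = (1.5 + 2.0 + 1.8) / 3  # ≈ 1.7667
    assert summary.agent_latency_stats.mean == pytest.approx(expected_mean, abs=1e-4)
```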
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 523-529: The NumericStats fields on agent_latency (mean, median,
std, min_value, max_value, p95, p99) may be None and the current f"{...:.3f}"
formatting will raise; update the block that writes these stats in generator.py
to guard each field (e.g., check value is not None) and format conditionally—use
f"{value:.3f}s" when value is present and a safe fallback string like "N/A" (or
"0.000s" if preferred) when None so writing never raises; ensure all referenced
symbols (agent_latency.mean, agent_latency.median, agent_latency.std,
agent_latency.min_value, agent_latency.max_value, agent_latency.p95,
agent_latency.p99) are handled consistently.
- Around line 779-783: The JSON output drops newly added percentile fields
because _streaming_stats_to_dict (used when summary.streaming is not None in
generator.py) does not include p95/p99 for TTFT, duration, and throughput;
update _streaming_stats_to_dict to map and include the percentile fields (p95
and p99) from the incoming streaming stats object into the returned dict (e.g.,
add keys like ttft_p95, ttft_p99, duration_p95, duration_p99, throughput_p95,
throughput_p99 or preserve existing naming conventions) so summary.streaming
serializes the new percentile data into result["streaming_performance"].
In `@tests/unit/pipeline/evaluation/test_amender.py`:
- Around line 267-299: The test is flaky because it asserts turn.agent_latency >
0 based on real timing; patch time.perf_counter inside the test so
amend_single_turn's timing is deterministic: use
mocker.patch("time.perf_counter", side_effect=[START, END]) (or return_value for
a single call if amend_single_turn only reads once) before calling
APIDataAmender.amend_single_turn, choose START and END such that END > START
(e.g. 1.0, 1.5) so agent_latency becomes a known value, then assert the exact
expected latency (END-START) and token fields as before; reference
APIDataAmender.amend_single_turn and TurnData.agent_latency when making the
change.
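To make the determinism idea above concrete, here is a self-contained demonstration of patching `time.perf_counter`; it uses `unittest.mock.patch` and a stand-in timing function rather than pytest-mock's `mocker` and the real `amend_single_turn`, which are only referenced above.

```python
import time
from unittest.mock import patch


def measure(fn) -> float:
    """Stand-in for the start/stop timing inside amend_single_turn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def test_latency_is_deterministic_when_perf_counter_is_patched() -> None:
    # Two reads of perf_counter: one at the start, one at the end of the measured call.
    with patch("time.perf_counter", side_effect=[1.0, 1.5]):
        assert measure(lambda: None) == 0.5  # exact value, so the test cannot flake on timing
```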
---
Outside diff comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 119-122: In the APIError except block inside amender.py (the
except APIError as e: handler for the call that sets turn_data), record the
elapsed time and assign it to turn_data.agent_latency before logging and
returning: compute the latency using the same start/end timing mechanism used in
the successful path (or capture time at exception entry), set
turn_data.agent_latency = elapsed, then include that updated timing prior to
calling logger.error and returning error_msg, conversation_id so failed calls
reflect actual agent_latency.
---
Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 311-314: The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 71-76: The current assignment to turn_data.agent_latency
overwrites total API latency for cached responses (when
turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change
this by adding separate fields (e.g., turn_data.api_call_time = api_latency and
turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency
only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in
turn_data.api_call_time while keeping agent processing time in
turn_data.agent_compute_time; update any downstream usage of
turn_data.agent_latency to use the new fields or document the semantics clearly
in the turn_data class/definition.
In `@tests/unit/core/models/test_summary.py`:
- Around line 223-241: Add a new unit test that calls
EvaluationSummary.from_results with EvaluationData containing a mix of TurnData
entries where some have agent_latency == 0 and others have positive values;
verify that EvaluationSummary.agent_latency_stats is not None and that its
count, min_value, and max_value are computed only from the non-zero latencies
(e.g., ensure count==number of positive latencies and min/max reflect those
values). Use the existing test pattern and helper _make_result to create results
and assert the expected stats for the non-zero latencies.
- Around line 199-222: The test test_with_api_latency currently asserts
summary.agent_latency_stats.mean > 0 which is too loose; update the assertion
after calling EvaluationSummary.from_results to check the mean against the
expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison
(e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4)
or use pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.
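Relating to the `generator.py` item under "Inline comments" above, a minimal self-contained sketch of None-safe formatting for optional latency statistics; `LatencyStats` is a stand-in for the project's NumericStats, and the helper name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencyStats:  # stand-in for the project's NumericStats
    mean: Optional[float] = None
    median: Optional[float] = None
    p95: Optional[float] = None
    p99: Optional[float] = None


def fmt_seconds(value: Optional[float]) -> str:
    """Format an optional statistic as seconds, falling back to 'N/A' on None."""
    return f"{value:.3f}s" if value is not None else "N/A"


stats = LatencyStats(mean=1.767, p95=None)
print(f"Mean: {fmt_seconds(stats.mean)}, P95: {fmt_seconds(stats.p95)}")  # Mean: 1.767s, P95: N/A
```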
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e495129c-0580-484c-8dd8-b4f993bdcbae
📒 Files selected for processing (21)
- README.md
- config/system.yaml
- docs/EVALUATION_GUIDE.md
- src/lightspeed_evaluation/core/constants.py
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/models/quality.py
- src/lightspeed_evaluation/core/models/summary.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/storage/sql_storage.py
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- tests/unit/core/models/conftest.py
- tests/unit/core/models/test_quality.py
- tests/unit/core/models/test_summary.py
- tests/unit/core/output/test_generator.py
- tests/unit/core/output/test_statistics.py
- tests/unit/core/output/test_statistics_api.py
- tests/unit/core/output/test_statistics_detailed.py
- tests/unit/core/storage/test_sql_storage.py
- tests/unit/pipeline/evaluation/test_amender.py
861d59e to be3c850 (Compare)
be3c850 to b07a6f5 (Compare)
Description
This PR addresses the following acceptance criteria (LEADS-348):
For LEADS-364, API token statistics were added into `quality_report.json`. The `test_statistics` module was split because it reached the file line limit.

Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation