[LEADS-348, LEADS-364] API Latency and Token Calculation#230

Open
xmican10 wants to merge 2 commits into lightspeed-core:main from xmican10:LEADS-348-calculate-api-latency

Conversation

@xmican10
Collaborator

@xmican10 xmican10 commented May 6, 2026

Description

This PR addresses the following acceptance criteria (LEADS-348):

  • Calculate API execution time and store it in the final result as agent_latency.
  • Add statistics for all three latencies: 50th, 95th, and 99th percentiles.
  • Add this information to all relevant reports, including the new quality report (LEADS-349).

For LEADS-364, API token statistics were added to quality_report.json.

The test_statistics module was split because it exceeded the file line limit.
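The percentile statistics described above can be computed roughly as follows. This is an illustrative sketch only; the actual implementation lives in src/lightspeed_evaluation/core/output/statistics.py and may use different filtering or interpolation.

```python
import statistics


def latency_percentiles(latencies: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 for a list of latency samples in seconds.

    Sketch: the real statistics module may use a different
    interpolation method or filter out zero values first.
    """
    if len(latencies) < 2:
        # statistics.quantiles() requires at least two data points
        raise ValueError("need at least two latency samples")
    # 99 cut points dividing the distribution into 100 groups
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

For example, `latency_percentiles([1.0, 2.0, 3.0, 4.0, 5.0])` gives a p50 of 3.0.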

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • API latency tracking now measures and reports response times with percentile statistics (p50, p95, p99).
    • CSV reports include agent latency data.
    • Performance metrics section added for API-enabled evaluations, including streaming metrics and token usage.
  • Documentation

    • Evaluation guide updated with detailed API latency and streaming performance metrics.
    • README updated with API Latency Tracking feature description.

Review Change Stack

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Warning

Rate limit exceeded

@xmican10 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 51 minutes and 58 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7be71ba0-3cb2-41a0-8f72-abacc089413f

📥 Commits

Reviewing files that changed from the base of the PR and between 861d59e and b07a6f5.

📒 Files selected for processing (21)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py


@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch 4 times, most recently from a00a8f1 to 861d59e Compare May 13, 2026 08:10
@xmican10 xmican10 changed the title [LEADS-348] API Latency Calculation [LEADS-348, LEADS-364] API Latency Calculation May 13, 2026
@xmican10 xmican10 changed the title [LEADS-348, LEADS-364] API Latency Calculation [LEADS-348, LEADS-364] API Latency and Token Calculation May 13, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

119-122: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set agent_latency in the error path for accurate timing.

When the API call fails, turn_data.agent_latency is never set. This means it will use the default value (likely 0.0), which misrepresents the actual time spent on the failed API call. Consider setting agent_latency based on the elapsed time even when an error occurs, to accurately track API call overhead.

🕒 Proposed fix to track latency in error path
         except APIError as e:
+            api_latency = time.perf_counter() - api_start_time
+            turn_data.agent_latency = api_latency
             error_msg = f"API Error for turn {turn_data.turn_id}: {e}"
             logger.error(error_msg)
             return error_msg, conversation_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 119 -
122, In the APIError except block inside amender.py (the except APIError as e:
handler for the call that sets turn_data), record the elapsed time and assign it
to turn_data.agent_latency before logging and returning: compute the latency
using the same start/end timing mechanism used in the successful path (or
capture time at exception entry), set turn_data.agent_latency = elapsed, then
include that updated timing prior to calling logger.error and returning
error_msg, conversation_id so failed calls reflect actual agent_latency.
🧹 Nitpick comments (4)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

71-76: 💤 Low value

Consider tracking API call time separately from agent processing time.

The current logic sets agent_latency to 0.0 for cached responses (when tokens are zero), which discards network overhead and authentication time. While this appears intentional (tracking only agent processing time), it means you lose visibility into the total API call latency for cached responses.

Consider one of these alternatives:

  • Track both api_call_time (total) and agent_compute_time (processing only)
  • Document that agent_latency excludes cached response overhead
  • Verify that zero tokens reliably indicates a cached response in all scenarios
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py` around lines 71 -
76, The current assignment to turn_data.agent_latency overwrites total API
latency for cached responses (when turn_data.api_input_tokens and
turn_data.api_output_tokens are zero); change this by adding separate fields
(e.g., turn_data.api_call_time = api_latency and turn_data.agent_compute_time =
... ) and set agent_compute_time to api_latency only when tokens > 0 (or to 0
otherwise), so you preserve total api_latency in turn_data.api_call_time while
keeping agent processing time in turn_data.agent_compute_time; update any
downstream usage of turn_data.agent_latency to use the new fields or document
the semantics clearly in the turn_data class/definition.
src/lightspeed_evaluation/core/output/statistics.py (1)

311-314: ⚡ Quick win

Use Google-style docstring for the new public helper.

calculate_field_numeric_stats_from_evaluation_data is a public API in src/ and should include explicit Args and Returns sections for consistency with repository standards.

As per coding guidelines src/**/*.py: "Use Google-style docstrings for all public APIs in Python".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/core/output/statistics.py` around lines 311 - 314,
The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.
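As a sketch, a Google-style docstring for this helper might look like the following. The signature, body, and return shape are assumed from the review comment, not copied from the repository:

```python
from typing import Any


def calculate_field_numeric_stats_from_evaluation_data(
    evaluation_data: list[Any], field_name: str
) -> dict[str, Any]:
    """Compute numeric statistics for a turn-level field.

    Zero values are skipped, on the assumption that they mark turns
    where the field was never measured (e.g. cached responses).

    Args:
        evaluation_data: Evaluation records whose turns carry the field.
        field_name: Name of the numeric turn attribute to aggregate.

    Returns:
        A dictionary with count, mean, min, and max of the non-zero
        values, or an empty dictionary when no values were collected.
    """
    values = [
        getattr(turn, field_name, 0.0)
        for data in evaluation_data
        for turn in getattr(data, "turns", [])
        if getattr(turn, field_name, 0.0)
    ]
    if not values:
        return {}
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }
```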
tests/unit/core/models/test_summary.py (2)

223-241: 💤 Low value

Consider adding a test for mixed zero and non-zero latencies.

The current tests cover "all non-zero" and "all zero" scenarios. Consider adding a test where some turns have agent_latency=0 (e.g., cached responses) and others have positive values (actual API calls). This would verify the behavior when API calls are only partially measured.

📝 Example test case
def test_with_mixed_api_latency(self) -> None:
    """Test from_results with mixed zero and non-zero latencies."""
    results = [_make_result(turn_id="t1")]
    
    eval_data = [
        EvaluationData(
            conversation_group_id="conv1",
            turns=[
                TurnData(turn_id="t1", query="Query 1", agent_latency=1.5),
                TurnData(turn_id="t2", query="Query 2", agent_latency=0),  # cached
                TurnData(turn_id="t3", query="Query 3", agent_latency=2.0),
            ],
        )
    ]
    
    summary = EvaluationSummary.from_results(results, evaluation_data=eval_data)
    
    # Should compute stats only from non-zero values (1.5, 2.0)
    assert summary.agent_latency_stats is not None
    assert summary.agent_latency_stats.count == 2
    assert summary.agent_latency_stats.min_value == 1.5
    assert summary.agent_latency_stats.max_value == 2.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 223 - 241, Add a new
unit test that calls EvaluationSummary.from_results with EvaluationData
containing a mix of TurnData entries where some have agent_latency == 0 and
others have positive values; verify that EvaluationSummary.agent_latency_stats
is not None and that its count, min_value, and max_value are computed only from
the non-zero latencies (e.g., ensure count==number of positive latencies and
min/max reflect those values). Use the existing test pattern and helper
_make_result to create results and assert the expected stats for the non-zero
latencies.

199-222: 💤 Low value

Consider more precise assertion for mean latency.

Line 219 asserts mean > 0 which is very loose. Given the input values (1.5, 2.0, 1.8), the expected mean is approximately 1.7667. A more precise assertion would validate the calculation correctness.

🔍 Suggested improvement
         assert summary.agent_latency_stats is not None
         assert summary.agent_latency_stats.count == 3
-        assert summary.agent_latency_stats.mean is not None
-        assert summary.agent_latency_stats.mean > 0
+        expected_mean = (1.5 + 2.0 + 1.8) / 3  # ≈ 1.7667
+        assert summary.agent_latency_stats.mean is not None
+        assert abs(summary.agent_latency_stats.mean - expected_mean) < 0.0001
         assert summary.agent_latency_stats.min_value == 1.5
         assert summary.agent_latency_stats.max_value == 2.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/core/models/test_summary.py` around lines 199 - 222, The test
test_with_api_latency currently asserts summary.agent_latency_stats.mean > 0
which is too loose; update the assertion after calling
EvaluationSummary.from_results to check the mean against the expected value
(mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison (e.g., assert
round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4) or use
pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/output/generator.py`:
- Around line 523-529: The NumericStats fields on agent_latency (mean, median,
std, min_value, max_value, p95, p99) may be None and the current f"{...:.3f}"
formatting will raise; update the block that writes these stats in generator.py
to guard each field (e.g., check value is not None) and format conditionally—use
f"{value:.3f}s" when value is present and a safe fallback string like "N/A" (or
"0.000s" if preferred) when None so writing never raises; ensure all referenced
symbols (agent_latency.mean, agent_latency.median, agent_latency.std,
agent_latency.min_value, agent_latency.max_value, agent_latency.p95,
agent_latency.p99) are handled consistently.
- Around line 779-783: The JSON output drops newly added percentile fields
because _streaming_stats_to_dict (used when summary.streaming is not None in
generator.py) does not include p95/p99 for TTFT, duration, and throughput;
update _streaming_stats_to_dict to map and include the percentile fields (p95
and p99) from the incoming streaming stats object into the returned dict (e.g.,
add keys like ttft_p95, ttft_p99, duration_p95, duration_p99, throughput_p95,
throughput_p99 or preserve existing naming conventions) so summary.streaming
serializes the new percentile data into result["streaming_performance"].

In `@tests/unit/pipeline/evaluation/test_amender.py`:
- Around line 267-299: The test is flaky because it asserts turn.agent_latency >
0 based on real timing; patch time.perf_counter inside the test so
amend_single_turn's timing is deterministic: use
mocker.patch("time.perf_counter", side_effect=[START, END]) (or return_value for
a single call if amend_single_turn only reads once) before calling
APIDataAmender.amend_single_turn, choose START and END such that END > START
(e.g. 1.0, 1.5) so agent_latency becomes a known value, then assert the exact
expected latency (END-START) and token fields as before; reference
APIDataAmender.amend_single_turn and TurnData.agent_latency when making the
change.

---

Outside diff comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 119-122: In the APIError except block inside amender.py (the
except APIError as e: handler for the call that sets turn_data), record the
elapsed time and assign it to turn_data.agent_latency before logging and
returning: compute the latency using the same start/end timing mechanism used in
the successful path (or capture time at exception entry), set
turn_data.agent_latency = elapsed, then include that updated timing prior to
calling logger.error and returning error_msg, conversation_id so failed calls
reflect actual agent_latency.

---

Nitpick comments:
In `@src/lightspeed_evaluation/core/output/statistics.py`:
- Around line 311-314: The docstring for the public function
calculate_field_numeric_stats_from_evaluation_data lacks the required
Google-style format; update its docstring to include an explicit short
description followed by Google-style Args and Returns sections (Args:
evaluation_data: list[EvaluationData] — list of evaluation records to inspect;
field_name: str — name of the numeric field to compute stats for) and Returns:
dict[str, Any] — dictionary of computed statistics (e.g., count, mean, median,
min, max, std) and any notes about filtering zeros), ensuring the docstring
matches repository standards for public APIs.

In `@src/lightspeed_evaluation/pipeline/evaluation/amender.py`:
- Around line 71-76: The current assignment to turn_data.agent_latency
overwrites total API latency for cached responses (when
turn_data.api_input_tokens and turn_data.api_output_tokens are zero); change
this by adding separate fields (e.g., turn_data.api_call_time = api_latency and
turn_data.agent_compute_time = ... ) and set agent_compute_time to api_latency
only when tokens > 0 (or to 0 otherwise), so you preserve total api_latency in
turn_data.api_call_time while keeping agent processing time in
turn_data.agent_compute_time; update any downstream usage of
turn_data.agent_latency to use the new fields or document the semantics clearly
in the turn_data class/definition.

In `@tests/unit/core/models/test_summary.py`:
- Around line 223-241: Add a new unit test that calls
EvaluationSummary.from_results with EvaluationData containing a mix of TurnData
entries where some have agent_latency == 0 and others have positive values;
verify that EvaluationSummary.agent_latency_stats is not None and that its
count, min_value, and max_value are computed only from the non-zero latencies
(e.g., ensure count==number of positive latencies and min/max reflect those
values). Use the existing test pattern and helper _make_result to create results
and assert the expected stats for the non-zero latencies.
- Around line 199-222: The test test_with_api_latency currently asserts
summary.agent_latency_stats.mean > 0 which is too loose; update the assertion
after calling EvaluationSummary.from_results to check the mean against the
expected value (mean of 1.5, 2.0, 1.8 ≈ 1.7666667) by using a precise comparison
(e.g., assert round(summary.agent_latency_stats.mean, 4) == round(1.7666667, 4)
or use pytest.approx) referencing the test function name and the
agent_latency_stats.mean field to ensure the calculation is correct.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e495129c-0580-484c-8dd8-b4f993bdcbae

📥 Commits

Reviewing files that changed from the base of the PR and between 1951eb9 and 861d59e.

📒 Files selected for processing (21)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/constants.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/models/quality.py
  • src/lightspeed_evaluation/core/models/summary.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/storage/sql_storage.py
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • tests/unit/core/models/conftest.py
  • tests/unit/core/models/test_quality.py
  • tests/unit/core/models/test_summary.py
  • tests/unit/core/output/test_generator.py
  • tests/unit/core/output/test_statistics.py
  • tests/unit/core/output/test_statistics_api.py
  • tests/unit/core/output/test_statistics_detailed.py
  • tests/unit/core/storage/test_sql_storage.py
  • tests/unit/pipeline/evaluation/test_amender.py

@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch from 861d59e to be3c850 Compare May 13, 2026 12:05
@xmican10 xmican10 force-pushed the LEADS-348-calculate-api-latency branch 2 times, most recently from be3c850 to b07a6f5 Compare May 13, 2026 12:53
@xmican10 xmican10 marked this pull request as ready for review May 13, 2026 14:51
