UN-3322 [FIX] Optimize dashboard metrics aggregation task#1837
Conversation
Reduce DB queries per org from ~42 to ~19 and add self-healing lock:

- Combine 4 LLM metric queries into 1 using conditional aggregation
- Cache org identifier lookup for PageUsage-based metrics
- Merge daily+monthly queries into single source query (backfill pattern)
- Tighten active org filter from 2 months to 7 days
- Add self-healing Redis lock to prevent stuck-lock incidents
- Remove unused org_identifier param from get_failed_pages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
Walkthrough

Resolves org identifiers once, consolidates LLM metric queries into a combined/split flow, adds a self-healing timestamp-based aggregation lock, and introduces helper primitives to aggregate hourly, daily, and monthly metrics with fewer queries.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Task as AggregationTask
    participant Lock as LockStore
    participant Service as MetricsQueryService
    participant DB as Database
    Task->>Lock: _acquire_aggregation_lock()
    Lock->>DB: read/update lock record (timestamp)
    alt lock acquired
        Task->>Service: _resolve_org_identifier(org_id)
        Service->>DB: fetch Organization.organization_id (if needed)
        Task->>Service: get_llm_metrics_split(org_id, start/end, granularity)
        Service->>DB: execute combined LLM query (single query per granularity)
        DB-->>Service: combined LLM rows
        Service-->>Task: split into per-metric series (llm_calls, challenges, ...)
        Task->>Service: _aggregate_single_metric(...) for each non-LLM metric
        Service->>DB: metric-specific queries (hourly + daily/monthly)
        DB-->>Service: metric rows
        Service-->>Task: aggregated buckets (hourly/daily/monthly)
        Task->>DB: bulk upsert aggregated metrics
        Task->>Lock: release lock
    else lock held or stale reclaimed
        Task-->>Task: skip or reclaim then proceed
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
for more information, see https://pre-commit.ci
Actionable comments posted: 2
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/dashboard_metrics/tasks.py (1)
507-519: ⚠️ Potential issue | 🟠 Major

`WorkflowExecution` is too narrow for the new active-org prefilter. In cloud deployments, `hitl_completions` is aggregated from `HITLQueue.approved_at`, so an organization can have fresh completions inside the 7-day window even when its latest `WorkflowExecution.created_at` is older. This filter skips that org entirely and leaves those daily/monthly buckets stale. Please include the other source tables that can emit fresh metrics in the window, or exempt those metrics from the prefilter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 507 - 519, The active_org_ids prefilter currently gathers orgs only from WorkflowExecution.created_at (see variable active_org_ids and daily_start), which misses orgs with recent HITL completions recorded via HITLQueue.approved_at and therefore causes hitl_completions buckets to be skipped; modify the prefilter to also include organization IDs from HITLQueue (use approved_at >= daily_start) and any other recent-metric source tables that can emit fresh metrics in the 7-day window, or alternatively exclude hitl_completions (and other affected metric calculations) from using active_org_ids so those metrics always scan their source tables without the prefilter. Ensure you update the set construction that builds active_org_ids to union in these additional org IDs (or branch the metric path to bypass the prefilter) so daily/monthly buckets for HITL and similar sources are not left stale.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/services.py`:
- Around line 246-250: The wrapper helpers get_llm_calls, get_challenges,
get_summarization_calls, and get_llm_usage_cost each call
get_llm_metrics_combined separately, causing the heavy combined aggregation to
run multiple times; update the hot callers (notably
backfill_metrics.Command._collect_metrics and views.live_series) to call
MetricsQueryService.get_llm_metrics_combined once per organization/date
range/granularity and then derive the four values from that single result
instead of invoking the single-metric wrappers repeatedly; leave the wrapper
helpers for convenience but change the callers to fetch the combined result and
map to
{"period","llm_calls","challenges","summarization_calls","llm_usage_cost"} (or
extract the specific fields needed) to avoid quadruple aggregation.
In `@backend/dashboard_metrics/tasks.py`:
- Around line 338-340: The daily cutoff comparison uses a timeful daily_start
against midnight-truncated day_ts, which excludes the boundary day; normalize
the cutoff to a day boundary by comparing dates (e.g., day_ts.date() >=
daily_start.date()) or by truncating daily_start to midnight before comparisons
so the bucket exactly 7 days ago is included; apply the same change to the other
occurrence that uses day_ts and daily_start (the block that builds the key with
(org_id, day_ts.date().isoformat(), metric_name, "default", "") and the similar
logic around the second helper).
---
Outside diff comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 507-519: The active_org_ids prefilter currently gathers orgs only
from WorkflowExecution.created_at (see variable active_org_ids and daily_start),
which misses orgs with recent HITL completions recorded via
HITLQueue.approved_at and therefore causes hitl_completions buckets to be
skipped; modify the prefilter to also include organization IDs from HITLQueue
(use approved_at >= daily_start) and any other recent-metric source tables that
can emit fresh metrics in the 7-day window, or alternatively exclude
hitl_completions (and other affected metric calculations) from using
active_org_ids so those metrics always scan their source tables without the
prefilter. Ensure you update the set construction that builds active_org_ids to
union in these additional org IDs (or branch the metric path to bypass the
prefilter) so daily/monthly buckets for HITL and similar sources are not left
stale.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1bb907c1-14fd-4aaa-be13-db97326b34e6
📒 Files selected for processing (3)
backend/dashboard_metrics/management/commands/backfill_metrics.py
backend/dashboard_metrics/services.py
backend/dashboard_metrics/tasks.py
- Replace 4 thin LLM wrapper methods with get_llm_metrics_split() that fetches combined data once and splits into per-metric series
- Update live_series view to use single combined query (4→1 DB queries)
- Update backfill_metrics command to use combined query (8→2 DB queries)
- Fix daily_start boundary day exclusion by truncating to midnight
- Rename resolved → page_usage_org_id for clarity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
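The fetch-once-then-split flow described in this commit can be sketched in plain Python; the column names below are illustrative, taken from the review comments rather than the actual service code.

```python
# Hypothetical mapping from combined-query columns to per-metric series names.
LLM_COMBINED_FIELDS = {
    "llm_calls": "llm_calls",
    "challenges": "challenges",
    "summarization_calls": "summarization_calls",
    "llm_usage_cost": "llm_usage_cost",
}

def split_llm_metrics(combined_rows: list[dict]) -> dict[str, list[dict]]:
    """Fan one combined result set out into per-metric series.

    combined_rows is the output of the single conditional-aggregation
    query: one dict per period carrying all four metric columns.
    """
    series: dict[str, list[dict]] = {
        name: [] for name in LLM_COMBINED_FIELDS.values()
    }
    for row in combined_rows:
        for field, metric_name in LLM_COMBINED_FIELDS.items():
            series[metric_name].append(
                {"period": row["period"], "value": row.get(field) or 0}
            )
    return series
```

The point of the pattern is that the heavy aggregation runs once per organization/date range/granularity; the four wrapper results are derived from the same rows instead of re-querying.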
Actionable comments posted: 1
🧹 Nitpick comments (4)
backend/dashboard_metrics/views.py (1)
840-843: Dead code: `metric_name == "llm_usage"` check is now unreachable.

Since LLM metrics (including `llm_usage`) are handled in the earlier block (lines 782-822) and excluded from `metric_queries` via `LLM_METRIC_KEYS`, this condition at line 842 will never be true for non-LLM metrics processed here.

🧹 Suggested simplification

```diff
 series.append(
     {
         "metric_name": metric_name,
-        "metric_type": MetricType.HISTOGRAM
-        if metric_name == "llm_usage"
-        else MetricType.COUNTER,
+        "metric_type": MetricType.COUNTER,
         "data": [
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 840 - 843, The conditional that sets MetricType based on metric_name == "llm_usage" is dead code because LLM metrics (including "llm_usage") are handled earlier and excluded from metric_queries via LLM_METRIC_KEYS; update the block that constructs the metric dict (the code using "metric_name" and MetricType) to always assign MetricType.COUNTER (remove the if/else branch) so the metric creation is simplified and no unreachable check remains.

backend/dashboard_metrics/management/commands/backfill_metrics.py (1)
61-67: Class attribute type hint for mutable defaults.

Static analysis flags `LLM_METRIC_TYPES` as a mutable class attribute. While this is safe since it's read-only in practice, you could add `ClassVar` for explicit typing if desired.

🧹 Optional type hint improvement

```diff
+from typing import Any, ClassVar
 ...
-    LLM_METRIC_TYPES: dict[str, bool] = {
+    LLM_METRIC_TYPES: ClassVar[dict[str, bool]] = {
         "llm_calls": False,
         "challenges": False,
         "summarization_calls": False,
         "llm_usage": True,
     }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/management/commands/backfill_metrics.py` around lines 61 - 67, LLM_METRIC_TYPES is a mutable class attribute flagged by static analysis; annotate it as a class-level constant by adding a ClassVar type hint (e.g., from typing import ClassVar) so its declaration becomes ClassVar[dict[str, bool]] and makes the intent explicit; locate the LLM_METRIC_TYPES attribute in the class in backfill_metrics.py and update its type annotation accordingly while keeping the same value.

backend/dashboard_metrics/tasks.py (2)
285-359: Cognitive complexity is elevated but the structure is clear.

SonarCloud flags this function for complexity (16 vs 15 allowed). The logic is straightforward (hourly loop + daily/monthly loop with split). Consider extracting the daily/monthly bucket loop into a small helper if you want to address the warning, but it's not critical.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 285 - 359, The function _aggregate_single_metric exceeds allowed cognitive complexity; extract the daily+monthly bucketing logic (the second query loop that builds daily_agg and monthly_buckets and the subsequent monthly_agg population) into a small helper function (e.g., _bucket_daily_and_monthly or similar) that accepts rows/params (org_id, daily_start, metric_name, metric_type, daily_agg, monthly_agg) and returns/updates monthly_buckets, then call that helper from _aggregate_single_metric and keep the hourly loop as-is to reduce complexity.
362-440: Cognitive complexity warning for `_aggregate_llm_combined`.

SonarCloud reports complexity of 26 (vs 15 allowed). The nested loops over `llm_combined_fields` inside both the daily and monthly paths contribute significantly. Extracting the field iteration into a helper could reduce nesting.

♻️ Suggested refactor to reduce complexity

```python
def _ingest_llm_row_to_agg(
    row: dict,
    ts_str: str,
    org_id: str,
    agg_dict: dict,
    llm_combined_fields: dict,
) -> None:
    """Ingest a single LLM combined row into an aggregation dict."""
    for field, (metric_name, metric_type) in llm_combined_fields.items():
        value = row[field] or 0
        key = (org_id, ts_str, metric_name, "default", "")
        if key not in agg_dict:
            agg_dict[key] = {"metric_type": metric_type, "value": 0, "count": 0}
        agg_dict[key]["value"] += value
        agg_dict[key]["count"] += 1
```

Then call this helper in both hourly and daily loops.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 362 - 440, The _aggregate_llm_combined function is over-complex due to duplicated nested loops over llm_combined_fields; extract the inner per-field logic into a helper (e.g., _ingest_llm_row_to_agg) that accepts row, ts_str, org_id, agg_dict, and llm_combined_fields and performs the value extraction, key construction, metric_type init, and accumulation, then call that helper from the hourly loop and from the daily branch to replace the duplicated code; for the monthly bucketing extract a small helper (e.g., _ingest_llm_row_to_monthly_bucket) that builds the (month_key_str, metric_name) bkey, initializes bucket entries and accumulates value/count, and use it in the monthly loop—this reduces nesting and cognitive complexity while preserving existing behavior in _aggregate_llm_combined.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 203-243: _acquire_aggregation_lock has a race between
cache.delete(...) and cache.add(...) when reclaiming stale/corrupted locks;
replace the non-atomic delete+add with atomic Redis operations using the raw
client from django_redis (get_redis_connection) so you perform SET NX EX for the
initial grab and use GET/GETSET (or GETSET + check old value and set TTL) when
reclaiming stale locks to ensure you only replace the lock if it still contains
the stale timestamp you observed; update references to AGGREGATION_LOCK_KEY and
the cache.add/cache.delete usage in _acquire_aggregation_lock to use
redis.set(..., nx=True, ex=...) and redis.getset / expire logic with a
float-check of the previous value to avoid the race.
---
Nitpick comments:
In `@backend/dashboard_metrics/management/commands/backfill_metrics.py`:
- Around line 61-67: LLM_METRIC_TYPES is a mutable class attribute flagged by
static analysis; annotate it as a class-level constant by adding a ClassVar type
hint (e.g., from typing import ClassVar) so its declaration becomes
ClassVar[dict[str, bool]] and makes the intent explicit; locate the
LLM_METRIC_TYPES attribute in the class in backfill_metrics.py and update its
type annotation accordingly while keeping the same value.
In `@backend/dashboard_metrics/tasks.py`:
- Around line 285-359: The function _aggregate_single_metric exceeds allowed
cognitive complexity; extract the daily+monthly bucketing logic (the second
query loop that builds daily_agg and monthly_buckets and the subsequent
monthly_agg population) into a small helper function (e.g.,
_bucket_daily_and_monthly or similar) that accepts rows/params (org_id,
daily_start, metric_name, metric_type, daily_agg, monthly_agg) and
returns/updates monthly_buckets, then call that helper from
_aggregate_single_metric and keep the hourly loop as-is to reduce complexity.
- Around line 362-440: The _aggregate_llm_combined function is over-complex due
to duplicated nested loops over llm_combined_fields; extract the inner per-field
logic into a helper (e.g., _ingest_llm_row_to_agg) that accepts row, ts_str,
org_id, agg_dict, and llm_combined_fields and performs the value extraction, key
construction, metric_type init, and accumulation, then call that helper from the
hourly loop and from the daily branch to replace the duplicated code; for the
monthly bucketing extract a small helper (e.g.,
_ingest_llm_row_to_monthly_bucket) that builds the (month_key_str, metric_name)
bkey, initializes bucket entries and accumulates value/count, and use it in the
monthly loop—this reduces nesting and cognitive complexity while preserving
existing behavior in _aggregate_llm_combined.
In `@backend/dashboard_metrics/views.py`:
- Around line 840-843: The conditional that sets MetricType based on metric_name
== "llm_usage" is dead code because LLM metrics (including "llm_usage") are
handled earlier and excluded from metric_queries via LLM_METRIC_KEYS; update the
block that constructs the metric dict (the code using "metric_name" and
MetricType) to always assign MetricType.COUNTER (remove the if/else branch) so
the metric creation is simplified and no unreachable check remains.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 30788ac3-29a5-450c-9105-71430a26b833
📒 Files selected for processing (4)
backend/dashboard_metrics/management/commands/backfill_metrics.py
backend/dashboard_metrics/services.py
backend/dashboard_metrics/tasks.py
backend/dashboard_metrics/views.py
- Extract _upsert_agg helper in tasks.py to eliminate repeated dict-init-and-increment pattern in both aggregation functions
- Extract _build_series_entry and _build_error_entry helpers in views.py to deduplicate series construction in live_series endpoint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
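The dict-init-and-increment pattern that the `_upsert_agg` helper replaces can be sketched as follows; the helper's exact signature is assumed from the review comments, not taken from the actual implementation.

```python
def upsert_agg(agg: dict, key: tuple, value: float, metric_type: str) -> None:
    """Initialize-or-increment a single aggregation bucket in place.

    key is the (org_id, timestamp, metric_name, tag, subtag) tuple used
    throughout the aggregation dicts.
    """
    entry = agg.setdefault(
        key, {"metric_type": metric_type, "value": 0, "count": 0}
    )
    entry["value"] += value
    entry["count"] += 1
```

Both the hourly and daily/monthly loops can then call this one helper instead of repeating the `if key not in agg:` initialization inline.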
Move LLM and non-LLM metric fetching logic out of the view method into a standalone function, bringing the view's cognitive complexity well under the Sonar threshold of 15. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/dashboard_metrics/tasks.py (1)
450-462: ⚠️ Potential issue | 🔴 Critical

WorkflowExecution prefilter will silently skip orgs with recent PageUsage, Usage, or HITLQueue data.

The prefilter (lines 450-462) uses only `WorkflowExecution` with a 7-day window to select which organizations to process. However, the following metrics query their data directly and independently:

- `pages_processed` → queries `PageUsage` directly
- `llm_calls`, `challenges`, `summarization_calls`, `llm_usage` → query `Usage` directly
- `hitl_reviews`, `hitl_completions` → query `HITLQueue` directly

If an org has recent activity in any of these tables but no `WorkflowExecution` in the past 7 days, the entire org is skipped and all its metrics are silently not aggregated. This is a data integrity issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/tasks.py` around lines 450 - 462, The prefilter active_org_ids currently only pulls org IDs from WorkflowExecution using daily_start; update it to include org IDs with recent activity from PageUsage, Usage, and HITLQueue as well (using the same daily_start window) so that pages_processed, llm_calls/challenges/summarization_calls/llm_usage, and hitl_reviews/hitl_completions aren’t skipped; implement this by computing the union of distinct organization IDs from WorkflowExecution.values_list("workflow__organization_id"), PageUsage.values_list("organization_id"), Usage.values_list("organization_id"), and HITLQueue.values_list("organization_id") into the existing active_org_ids set (or replace active_org_ids with a combined queryset/set) to ensure all orgs with recent activity in any of those tables are processed.
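The union-of-sources fix can be sketched as a small helper. The model and field names mentioned in the docstring are taken from the review text and are not verified against the codebase; the testable part is just the set union.

```python
def union_active_orgs(*org_id_sources) -> set:
    """Union distinct organization IDs from several recent-activity sources.

    In the task this would be fed flat values_list() querysets, e.g.
    WorkflowExecution (workflow__organization_id), plus PageUsage, Usage,
    and HITLQueue (organization_id), all filtered to the same daily_start
    window, so no source of fresh metrics can be skipped.
    """
    active: set = set()
    for ids in org_id_sources:
        active.update(i for i in ids if i is not None)
    return active
```

The alternative the review mentions, branching affected metrics to bypass the prefilter entirely, avoids the extra queries but gives up the fan-out reduction for those metrics.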
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 360-383: The combined-LLM loops are upserting zeroed metrics for
fields that weren’t actually present in the bucket; modify the fan-out in the
loops that iterate over MetricsQueryService.get_llm_metrics_combined so you only
call _upsert_agg for a given (field, metric_name) when the field was truly
present in the row: first prefer a per-field presence flag if the query exposes
one (e.g. row.get(f"{field}_present") is truthy), otherwise require row[field]
is not None (and skip when row[field] is missing/None) before computing value
and calling _upsert_agg for hourly_agg, daily_agg, and monthly_agg; keep the
surrounding timestamp handling (_truncate_to_hour/_day/_month,
llm_combined_fields) unchanged.
In `@backend/dashboard_metrics/views.py`:
- Around line 49-73: Replace the implicit llm_usage-vs-everything-else logic
with an explicit metric type map (e.g., METRIC_TYPE_MAP = {"llm_usage":
MetricType.HISTOGRAM, "pages_processed": MetricType.HISTOGRAM, ...}) and use
that map inside _build_series_entry and _build_error_entry to determine
metric_type (falling back to MetricType.COUNTER if not listed); update
_build_series_entry to read metric_type = METRIC_TYPE_MAP.get(metric_name,
MetricType.COUNTER) and include it in the returned dict, and update
_build_error_entry to also look up and include the same metric_type so partial
failures preserve the correct type.
- Around line 792-829: The metric mapping in metric_queries dropped the
"failed_pages" entry so requests to GET /live-series?metric_name=failed_pages
return empty series; restore the mapping by adding "failed_pages":
MetricsQueryService.get_failed_pages to the metric_queries dict (in the same
block where documents_processed, pages_processed, etc. are defined) so
_build_series_entry and the aggregation in backend/dashboard_metrics/tasks.py
continue to work, or alternatively add explicit validation at the start of the
handler to reject unknown metrics (using MetricsQueryService.LLM_METRIC_KEYS and
the metric_queries keys) so a missing metric is surfaced instead of returning an
empty series.
---
Outside diff comments:
In `@backend/dashboard_metrics/tasks.py`:
- Around line 450-462: The prefilter active_org_ids currently only pulls org IDs
from WorkflowExecution using daily_start; update it to include org IDs with
recent activity from PageUsage, Usage, and HITLQueue as well (using the same
daily_start window) so that pages_processed,
llm_calls/challenges/summarization_calls/llm_usage, and
hitl_reviews/hitl_completions aren’t skipped; implement this by computing the
union of distinct organization IDs from
WorkflowExecution.values_list("workflow__organization_id"),
PageUsage.values_list("organization_id"), Usage.values_list("organization_id"),
and HITLQueue.values_list("organization_id") into the existing active_org_ids
set (or replace active_org_ids with a combined queryset/set) to ensure all orgs
with recent activity in any of those tables are processed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 269487ef-bace-4842-8c97-b0e70bda04d4
📒 Files selected for processing (2)
backend/dashboard_metrics/tasks.py
backend/dashboard_metrics/views.py
Add _HISTOGRAM_METRICS set and _metric_type helper to correctly classify pages_processed and failed_pages as histograms, consistent with tasks.py and backfill_metrics.py. Previously only llm_usage was marked as histogram in the views layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Test Results: Summary
Runner Tests - Full Report
🧹 Nitpick comments (1)
backend/dashboard_metrics/views.py (1)
76-127: Cognitive complexity exceeds threshold (22 vs 15 allowed).

SonarCloud flags this function for high cognitive complexity. Consider extracting the LLM metrics handling (lines 94-110) into a separate helper function to reduce nesting.
Suggested refactor

```diff
+def _fetch_llm_series(
+    org_id: str,
+    start_date: datetime,
+    end_date: datetime,
+    granularity: str,
+    requested_metric: str | None,
+    llm_metric_keys: dict,
+) -> tuple[list[dict], list[str]]:
+    """Fetch LLM metrics series."""
+    series: list[dict] = []
+    errors: list[str] = []
+    try:
+        llm_split = MetricsQueryService.get_llm_metrics_split(
+            org_id, start_date, end_date, granularity
+        )
+        for name, data in llm_split.items():
+            if not requested_metric or name == requested_metric:
+                series.append(_build_series_entry(name, data))
+    except Exception:
+        logger.exception("Failed to fetch LLM metrics")
+        for name in llm_metric_keys:
+            if not requested_metric or name == requested_metric:
+                errors.append(name)
+                series.append(_build_error_entry(name))
+    return series, errors
+
+
 def _fetch_live_series(
     org_id: str,
     start_date: datetime,
     end_date: datetime,
     granularity: str,
     metric_queries: dict,
     requested_metric: str | None,
 ) -> tuple[list[dict], list[str]]:
-    """Fetch all metric series (LLM combined + individual).
-
-    Returns:
-        Tuple of (series list, error names list).
-    """
-    series: list[dict] = []
-    errors: list[str] = []
+    """Fetch all metric series (LLM combined + individual)."""
     llm_metric_keys = MetricsQueryService.LLM_METRIC_KEYS
-    # Fetch all 4 LLM metrics in a single query
+    series: list[dict] = []
+    errors: list[str] = []
+
     if not requested_metric or requested_metric in llm_metric_keys:
-        try:
-            llm_split = MetricsQueryService.get_llm_metrics_split(
-                org_id,
-                start_date,
-                end_date,
-                granularity,
-            )
-            for name, data in llm_split.items():
-                if not requested_metric or name == requested_metric:
-                    series.append(_build_series_entry(name, data))
-        except Exception:
-            logger.exception("Failed to fetch LLM metrics")
-            for name in llm_metric_keys:
-                if not requested_metric or name == requested_metric:
-                    errors.append(name)
-                    series.append(_build_error_entry(name))
+        llm_series, llm_errors = _fetch_llm_series(
+            org_id, start_date, end_date, granularity, requested_metric, llm_metric_keys
+        )
+        series.extend(llm_series)
+        errors.extend(llm_errors)

     # Filter non-LLM metrics if a specific metric was requested
     if requested_metric:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 76 - 127, The function _fetch_live_series is too complex; extract the LLM-specific block (the try/except that calls MetricsQueryService.get_llm_metrics_split, iterates llm_split, appends _build_series_entry or on error logs via logger.exception and appends _build_error_entry for each MetricsQueryService.LLM_METRIC_KEYS item) into a new helper function (e.g., _fetch_llm_series) that accepts org_id, start_date, end_date, granularity, requested_metric and returns (list_of_series, list_of_error_names); then replace the original LLM block in _fetch_live_series with a single call to that helper and merge its returned series/errors with the rest of the function, keeping existing use of _build_series_entry, _build_error_entry and logger.exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 76-127: The function _fetch_live_series is too complex; extract
the LLM-specific block (the try/except that calls
MetricsQueryService.get_llm_metrics_split, iterates llm_split, appends
_build_series_entry or on error logs via logger.exception and appends
_build_error_entry for each MetricsQueryService.LLM_METRIC_KEYS item) into a new
helper function (e.g., _fetch_llm_series) that accepts org_id, start_date,
end_date, granularity, requested_metric and returns (list_of_series,
list_of_error_names); then replace the original LLM block in _fetch_live_series
with a single call to that helper and merge its returned series/errors with the
rest of the function, keeping existing use of _build_series_entry,
_build_error_entry and logger.exception.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0bd16704-6ee8-43fe-a52c-ca5340d7c29e
📒 Files selected for processing (2)
backend/dashboard_metrics/tasks.py
backend/dashboard_metrics/views.py
Actionable comments posted: 1
♻️ Duplicate comments (1)
backend/dashboard_metrics/views.py (1)
72-79: ⚠️ Potential issue | 🟡 Minor

Add `metric_type` to error entries for response consistency.

The error entry is missing the `metric_type` field. When partial failures occur, the frontend receives inconsistent response shapes: successful entries have `metric_type` while error entries don't. This was noted in a previous review.

Proposed fix

```diff
 def _build_error_entry(metric_name: str) -> dict:
     """Build an error series entry for a failed metric."""
     return {
         "metric_name": metric_name,
+        "metric_type": _metric_type(metric_name),
         "error": "unavailable",
         "data": [],
         "total_value": 0,
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 72 - 79, The error entry returned by _build_error_entry is missing the metric_type field, causing inconsistent response shapes; update _build_error_entry (function name: _build_error_entry) to accept an optional metric_type parameter (e.g., metric_type: Optional[str] = None) and include "metric_type": metric_type (or a default like "unavailable"/None) in the returned dict so error entries match the successful entries' shape.
🧹 Nitpick comments (1)
backend/dashboard_metrics/views.py (1)
82-134: Consider extracting helpers to reduce cognitive complexity.

SonarCloud flags this function with cognitive complexity 22 (limit 15). The logic is correct, but the nested conditionals for LLM vs non-LLM paths and per-metric error handling contribute to the complexity.

Suggested refactor: extract LLM fetching

```diff
+def _fetch_llm_series(
+    org_id: str,
+    start_date: datetime,
+    end_date: datetime,
+    granularity: str,
+    requested_metric: str | None,
+) -> tuple[list[dict], list[str]]:
+    """Fetch LLM metrics series."""
+    series: list[dict] = []
+    errors: list[str] = []
+    llm_keys = MetricsQueryService.LLM_METRIC_KEYS
+
+    try:
+        llm_split = MetricsQueryService.get_llm_metrics_split(
+            org_id, start_date, end_date, granularity
+        )
+        for name, data in llm_split.items():
+            if not requested_metric or name == requested_metric:
+                series.append(_build_series_entry(name, data))
+    except Exception:
+        logger.exception("Failed to fetch LLM metrics")
+        for name in llm_keys:
+            if not requested_metric or name == requested_metric:
+                errors.append(name)
+                series.append(_build_error_entry(name))
+
+    return series, errors
+
+
 def _fetch_live_series(
     org_id: str,
     start_date: datetime,
     end_date: datetime,
     granularity: str,
     metric_queries: dict,
     requested_metric: str | None,
 ) -> tuple[list[dict], list[str]]:
-    """Fetch all metric series (LLM combined + individual).
-
-    Returns:
-        Tuple of (series list, error names list).
-    """
+    """Fetch all metric series (LLM combined + individual)."""
     series: list[dict] = []
     errors: list[str] = []
-    llm_metric_keys = MetricsQueryService.LLM_METRIC_KEYS
-    # Fetch all 4 LLM metrics in a single query
-    if not requested_metric or requested_metric in llm_metric_keys:
-        try:
-            llm_split = MetricsQueryService.get_llm_metrics_split(
-                org_id,
-                start_date,
-                end_date,
-                granularity,
-            )
-            for name, data in llm_split.items():
-                if not requested_metric or name == requested_metric:
-                    series.append(_build_series_entry(name, data))
-        except Exception:
-            logger.exception("Failed to fetch LLM metrics")
-            for name in llm_metric_keys:
-                if not requested_metric or name == requested_metric:
-                    errors.append(name)
-                    series.append(_build_error_entry(name))
+    # Fetch LLM metrics
+    if not requested_metric or requested_metric in MetricsQueryService.LLM_METRIC_KEYS:
+        llm_series, llm_errors = _fetch_llm_series(
+            org_id, start_date, end_date, granularity, requested_metric
+        )
+        series.extend(llm_series)
+        errors.extend(llm_errors)

     # Filter non-LLM metrics if a specific metric was requested
     ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/dashboard_metrics/views.py` around lines 82 - 134, The _fetch_live_series function is too complex; extract the LLM-specific fetch and error handling into a helper to reduce nesting. Create a helper (e.g., fetch_llm_series(org_id, start_date, end_date, granularity, requested_metric)) that calls MetricsQueryService.get_llm_metrics_split, builds series entries with _build_series_entry, and on exception logs and returns error names and error entries using _build_error_entry; then simplify _fetch_live_series to call that helper when appropriate and only handle the remaining metric_queries loop (calling each query_fn and using _build_series_entry/_build_error_entry) so the main function becomes a simple orchestration of fetch_llm_series(...) + the per-metric loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 851-860: The metric mapping in metric_queries is missing the
failed_pages entry, causing live-series queries for "failed_pages" to return
empty; add "failed_pages": MetricsQueryService.get_failed_pages to the
metric_queries dict so that the failed_pages histogram is routed to
MetricsQueryService.get_failed_pages (update the metric_queries definition where
it lists documents_processed, pages_processed, etc., and ensure the key matches
"failed_pages").
---
Duplicate comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 72-79: The error entry returned by _build_error_entry is missing
the metric_type field, causing inconsistent response shapes; update
_build_error_entry (function name: _build_error_entry) to accept an optional
metric_type parameter (e.g., metric_type: Optional[str] = None) and include
"metric_type": metric_type (or a default like "unavailable"/None) in the
returned dict so error entries match the successful entries' shape.
---
Nitpick comments:
In `@backend/dashboard_metrics/views.py`:
- Around line 82-134: The _fetch_live_series function is too complex; extract
the LLM-specific fetch and error handling into a helper to reduce nesting.
Create a helper (e.g., fetch_llm_series(org_id, start_date, end_date,
granularity, requested_metric)) that calls
MetricsQueryService.get_llm_metrics_split, builds series entries with
_build_series_entry, and on exception logs and returns error names and error
entries using _build_error_entry; then simplify _fetch_live_series to call that
helper when appropriate and only handle the remaining metric_queries loop
(calling each query_fn and using _build_series_entry/_build_error_entry) so the
main function becomes a simple orchestration of fetch_llm_series(...) + the
per-metric loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 04a9559c-afd2-425d-9cd7-053321776b6d
📒 Files selected for processing (1)
backend/dashboard_metrics/views.py



What
Optimize the `aggregate_metrics_from_sources` Celery task to reduce DB queries by ~55%.

Why
- The stuck aggregation lock required manual cleanup (`redis-cli DEL`) when workers were OOM-killed

How
- Combine the 4 LLM metric queries into one using conditional aggregation (`Count(filter=Q(...))`, `Sum`) — saves 9 queries/org
- Resolve the `Organization.organization_id` string once per org for PageUsage queries instead of a per-metric DB lookup — saves 5 lookups/org
- Merge the daily and monthly aggregation into a single source query (backfill pattern)
- Filter active orgs by `WorkflowExecution.created_at__gte=daily_start` (7 days) instead of `monthly_start` (2 months). Dormant orgs re-enter the window when activity resumes; monthly totals were already persisted by prior runs
- Use `time.time()` as the lock value. On contention, check age > timeout → reclaim the stale lock. Handles migration from the old `"running"` string value via a `TypeError`/`ValueError` branch
- Remove the unused `org_identifier` parameter from `get_failed_pages` (it joins through the workflow FK, not PageUsage)

Trade-off: Performance vs Stability
This PR trades marginal stability for significant performance gains. Specifically:
- If the `usage` table has a schema issue (e.g., a corrupted `llm_usage_reason` column), all 4 LLM metrics fail together instead of independently. Mitigation: the outer `try/except` per org catches this and logs the error — other orgs and non-LLM metrics are unaffected.

Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
- No for callers: the per-metric methods (`get_llm_calls`, `get_challenges`, etc.) preserve backward compatibility for views.py and the backfill command
- No for the lock: the new timestamp value handles the old `"running"` string in Redis
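The combined-then-split flow behind those per-metric methods can be sketched in pure Python. The row shape, key names, and helper name below are illustrative assumptions, not the service's exact schema:

```python
# One combined query returns a single row per time bucket carrying all four
# LLM counters; splitting it back into per-metric series preserves the old
# per-metric response shape. Key names here are assumptions.
LLM_METRIC_KEYS = ("llm_calls", "challenges", "prompt_tokens", "completion_tokens")

def split_llm_rows(rows):
    """rows: [{"bucket": ..., "llm_calls": n, ...}, ...] -> per-metric series."""
    split = {key: [] for key in LLM_METRIC_KEYS}
    for row in rows:
        for key in LLM_METRIC_KEYS:
            # Sums over empty sets come back as None; normalize to 0.
            split[key].append({"bucket": row["bucket"], "value": row.get(key) or 0})
    return split
```

Because all four series come from one query, a single schema problem fails them together — the failure mode described in the trade-off section.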
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
- Watch the log line `Aggregation: X active orgs out of Y total` — X will be lower than before (7-day vs 2-month window), this is expected
- Simulate a stuck lock with `redis-cli SET "dashboard_metrics:aggregation_lock" "running"`, verify the next run reclaims it and logs a warning
- Run `python manage.py backfill_metrics --days=1 --dry-run` to confirm the backfill command still works
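The stuck-lock reclaim exercised by that manual test can be sketched as follows. The key name comes from this PR; the timeout value, helper name, and `store` interface are assumptions for illustration:

```python
import time

LOCK_KEY = "dashboard_metrics:aggregation_lock"
LOCK_TIMEOUT_SECONDS = 3600  # assumed staleness threshold, not from the PR

def acquire_aggregation_lock(store, now=None):
    """Try to take the lock; reclaim it if the current holder looks stale.

    `store` is any Redis-like object exposing set(key, value, nx=...) and
    get(key). The timestamp stored as the lock value lets a later run judge
    the lock's age instead of trusting an opaque "running" flag.
    """
    now = time.time() if now is None else now
    if store.set(LOCK_KEY, str(now), nx=True):
        return True  # lock was free
    existing = store.get(LOCK_KEY)
    try:
        age = now - float(existing)
    except (TypeError, ValueError):
        # Legacy "running" string (or missing value): treat as stale.
        age = LOCK_TIMEOUT_SECONDS + 1
    if age > LOCK_TIMEOUT_SECONDS:
        # Reclaim a lock left behind by an OOM-killed worker.
        store.set(LOCK_KEY, str(now))
        return True
    return False
```

A production version would also need to release the lock on task completion and guard the reclaim against races (e.g. Redis `GETSET`/Lua), which this sketch omits.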
N/A (backend-only change)
Checklist
I have read and understood the Contribution Guidelines.