
Conversation

@nora-shap (Member)

Note: This PR is being merged into parent branch ID-1151. It depends on additional changes that are not yet written; the full cutover will happen once all pieces are ready. Tech spec here.

Converts detect_llm_issues_for_project from synchronous per-trace requests to async batch processing. This is a breaking change that depends on corresponding Seer changes and does not need to be backwards compatible.

Changes

  • Send up to 20 traces per project in a single batch request (2x the selection target, to give Seer a buffer to choose from); see the sketch after this list
  • Expect 202 response indicating Seer accepted the work
  • Mark traces as processed after 202
  • Add _get_unprocessed_traces() using bulk Redis mget instead of per-trace checks
  • Replace mark_trace_as_processed() with mark_traces_as_processed() using Redis pipeline
  • Reduce timeouts: Seer request 180s → 10s, task deadline 10min → 3min
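
Putting the list together, the new flow looks roughly like this. It is a minimal sketch built from the names visible in this PR (detect_llm_issues_for_project, _get_unprocessed_traces, mark_traces_as_processed, TRACES_TO_SEND_TO_SEER, SEER_TIMEOUT_S, make_signed_seer_api_request); get_candidate_traces and SeerBatchRequest are hypothetical stand-ins, not names from the diff:

```python
import json

TRACES_TO_SEND_TO_SEER = 20  # 2x the selection target, giving Seer a buffer
SEER_TIMEOUT_S = 10  # down from 180s: we only wait for an ack, not for analysis

def detect_llm_issues_for_project(project_id: int) -> None:
    # get_candidate_traces is a hypothetical stand-in for however the task
    # gathers candidate traces for the project.
    candidate_traces = get_candidate_traces(project_id)

    # One bulk Redis MGET instead of a round trip per trace.
    traces_to_send = _get_unprocessed_traces(candidate_traces)[:TRACES_TO_SEND_TO_SEER]
    if not traces_to_send:
        return

    # SeerBatchRequest is an assumed pydantic-style model; the diff only
    # shows that the request object has a .dict() method.
    seer_request = SeerBatchRequest(project_id=project_id, traces=traces_to_send)
    response = make_signed_seer_api_request(
        connection_pool=seer_issue_detection_connection_pool,
        path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )

    # 202 Accepted means Seer took ownership of the batch; the analysis
    # itself happens asynchronously on Seer's side.
    if response.status == 202:
        mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```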

@nora-shap nora-shap requested a review from a team as a code owner January 22, 2026 23:00
@linear linear bot commented Jan 22, 2026

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jan 22, 2026
Comment on lines +290 to +296
```python
response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)
```
Bug: The call to make_signed_seer_api_request is not wrapped in a try-except block, which can cause the task to crash on network errors and lead to repeated processing failures.
Severity: MEDIUM

Suggested Fix

Wrap the make_signed_seer_api_request call in a try-except Exception block. In the except block, log the exception for visibility and allow the task to exit gracefully without crashing. This ensures that a transient network failure does not stop the entire task and allows the unprocessed traces to be retried on the next scheduled run.
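
A minimal sketch of that fix, reusing the names from the diff and assuming the module has a standard logger; this is the reviewer's suggestion, not the merged code:

```python
try:
    response = make_signed_seer_api_request(
        connection_pool=seer_issue_detection_connection_pool,
        path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )
except Exception:
    # Log and exit without crashing the task; the traces stay unmarked in
    # Redis, so the next scheduled run picks them up again.
    logger.exception("llm_issue_detection.seer_request_failed")
    return

if response.status == 202:
    mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```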

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/sentry/tasks/llm_issue_detection/detection.py#L290-L296

Potential issue: The call to `make_signed_seer_api_request` is not wrapped in a `try-except` block. While the previous synchronous timeout handling was removed, other network-related exceptions like `ConnectionError` or DNS failures can still occur. An unhandled exception will cause the Celery task to crash. Because the crash happens before `mark_traces_as_processed` is called, the same batch of traces will be picked up and retried on the next scheduled run, potentially leaving them stuck in a retry loop until their Redis TTL expires. Other parts of the codebase making similar Seer API calls include robust exception handling.


@nora-shap (Member, Author) replied:
Now we send all traces in a single batch request. There's no loop to continue, so catching errors serves no purpose.

@cursor cursor bot (Contributor) left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.


```python
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)
```

Removed timeout exception handling for Seer request

Medium Severity

The TimeoutError exception handling was removed when converting to batch processing. If make_signed_seer_api_request times out due to network issues, the entire task crashes instead of handling the error gracefully. The old code caught TimeoutError from urllib3 and logged it while continuing to process other traces. The new code has no exception handling, causing the task to fail completely on timeout.
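
For reference, the removed per-trace handling looked roughly like this. It is reconstructed from the description above, not copied from the old code; build_seer_request and the 200 success check are assumptions:

```python
from urllib3.exceptions import TimeoutError as Urllib3TimeoutError

for trace in traces_to_send:
    try:
        response = make_signed_seer_api_request(
            connection_pool=seer_issue_detection_connection_pool,
            path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
            body=json.dumps(build_seer_request(trace).dict()).encode("utf-8"),
            timeout=SEER_TIMEOUT_S,  # 180s in the old synchronous code
            retries=0,
        )
    except Urllib3TimeoutError:
        # Log and skip this trace so one slow request does not take down
        # the whole task; the remaining traces still get sent.
        logger.warning(
            "llm_issue_detection.seer_timeout",
            extra={"trace_id": trace.trace_id},
        )
        continue

    if response.status == 200:  # assumed success status for the old sync call
        mark_trace_as_processed(trace.trace_id)
```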


```python
][:TRACES_TO_SEND_TO_SEER]

if not traces_to_send:
    return
```

Race condition in trace deduplication logic

Medium Severity

The refactored trace deduplication logic splits the atomic check-and-set operation into separate check and set steps. The old code used nx=True for atomic deduplication, but the new code checks unprocessed traces via _get_unprocessed_traces, then later marks them after receiving a 202 response. If two task instances run concurrently for the same project, both can identify the same traces as unprocessed, send duplicate batches to Seer, and both receive 202 responses, causing duplicate trace analysis.
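
One way to restore atomicity while keeping the batch flow is to claim traces up front with SET NX in a pipeline, so two concurrent runs can never both claim the same trace. This is a sketch, not the PR's code; redis_client and the key scheme are assumptions:

```python
from redis import Redis

def claim_unprocessed_traces(
    redis_client: Redis, trace_ids: list[str], ttl_s: int = 86400
) -> list[str]:
    """Atomically claim trace ids; returns only the ids this run owns."""
    pipe = redis_client.pipeline()
    for trace_id in trace_ids:
        # SET ... NX succeeds only if the key does not exist yet, so exactly
        # one concurrent run wins each trace.
        pipe.set(f"llm-detection:trace:{trace_id}", "1", nx=True, ex=ttl_s)
    results = pipe.execute()
    return [tid for tid, claimed in zip(trace_ids, results) if claimed]
```

Each run then sends only the traces it successfully claimed, so concurrent runs partition the work instead of duplicating it.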


```python
)

if response.status == 202:
    mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```

Task crash after Seer accepts work causes duplicate processing

Medium Severity

Traces are marked as processed AFTER Seer returns 202, rather than before submission. If the task crashes or mark_traces_as_processed fails after Seer accepts the work but before Redis marks complete, those traces won't be marked and will be resubmitted on the next run. This causes Seer to process the same traces multiple times, potentially creating duplicate issues. The old code atomically marked traces before sending to Seer, preventing this failure mode.
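
A sketch of the mark-before-send ordering this comment describes, with unmark_traces as a hypothetical compensating helper that deletes the Redis keys when Seer rejects the batch:

```python
trace_ids = [trace.trace_id for trace in traces_to_send]
mark_traces_as_processed(trace_ids)  # claim before Seer can accept the work

response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)

if response.status != 202:
    # Hypothetical compensating helper: delete the Redis keys so the
    # traces can be retried on the next scheduled run.
    unmark_traces(trace_ids)
```

Note the trade-off: this swaps duplicate analysis for the opposite risk, since a crash between marking and sending leaves those traces unanalyzed until their Redis TTL expires.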


@roggenkemper (Member) left a comment:

looks good!
