
Conversation

@nora-shap (Member)

Note: This PR is being merged into parent branch ID-1151. It depends on additional changes that are not yet written; the full cutover will happen once all pieces are ready. Tech spec here.

Converts detect_llm_issues_for_project from synchronous per-trace requests to async batch processing. This is a breaking change that depends on corresponding Seer changes and does not need to be backwards compatible.

Changes

  • Send up to 20 traces per project in a single batch request (2x the selection target, to give Seer a buffer to choose from); see the sketch after this list
  • Expect 202 response indicating Seer accepted the work
  • Mark traces as processed after 202
  • Add _get_unprocessed_traces() using bulk Redis mget instead of per-trace checks
  • Replace mark_trace_as_processed() with mark_traces_as_processed() using Redis pipeline
  • Reduce timeouts: Seer request 180s → 10s, task deadline 10min → 3min
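
Putting the list together, the new flow looks roughly like this. It is a minimal sketch built from the names visible in this PR (detect_llm_issues_for_project, _get_unprocessed_traces, mark_traces_as_processed, TRACES_TO_SEND_TO_SEER, SEER_TIMEOUT_S, make_signed_seer_api_request); get_candidate_traces and SeerBatchRequest are hypothetical stand-ins, not names from the diff:

```python
import json

TRACES_TO_SEND_TO_SEER = 20  # 2x the selection target, giving Seer a buffer
SEER_TIMEOUT_S = 10  # down from 180s: we only wait for an ack, not for analysis

def detect_llm_issues_for_project(project_id: int) -> None:
    # get_candidate_traces is a hypothetical stand-in for however the task
    # gathers candidate traces for the project.
    candidate_traces = get_candidate_traces(project_id)

    # One bulk Redis MGET instead of a round trip per trace.
    traces_to_send = _get_unprocessed_traces(candidate_traces)[:TRACES_TO_SEND_TO_SEER]
    if not traces_to_send:
        return

    # SeerBatchRequest is an assumed pydantic-style model; the diff only
    # shows that the request object has a .dict() method.
    seer_request = SeerBatchRequest(project_id=project_id, traces=traces_to_send)
    response = make_signed_seer_api_request(
        connection_pool=seer_issue_detection_connection_pool,
        path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )

    # 202 Accepted means Seer took ownership of the batch; the analysis
    # itself happens asynchronously on Seer's side.
    if response.status == 202:
        mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```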

@nora-shap nora-shap requested a review from a team as a code owner January 22, 2026 23:00
@linear linear bot commented Jan 22, 2026

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jan 22, 2026
Comment on lines +290 to +296
```python
response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)
```
Bug: The call to make_signed_seer_api_request is not wrapped in a try-except block, which can cause the task to crash on network errors and lead to repeated processing failures.
Severity: MEDIUM

Suggested Fix

Wrap the make_signed_seer_api_request call in a try-except Exception block. In the except block, log the exception for visibility and allow the task to exit gracefully without crashing. This ensures that a transient network failure does not stop the entire task and allows the unprocessed traces to be retried on the next scheduled run.
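
A minimal sketch of that fix, reusing the names from the diff and assuming the module has a standard logger; this is the reviewer's suggestion, not the merged code:

```python
try:
    response = make_signed_seer_api_request(
        connection_pool=seer_issue_detection_connection_pool,
        path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )
except Exception:
    # Log and exit without crashing the task; the traces stay unmarked in
    # Redis, so the next scheduled run picks them up again.
    logger.exception("llm_issue_detection.seer_request_failed")
    return

if response.status == 202:
    mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```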

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.

Location: src/sentry/tasks/llm_issue_detection/detection.py#L290-L296

Potential issue: The call to `make_signed_seer_api_request` is not wrapped in a `try-except` block. While the previous synchronous timeout handling was removed, other network-related exceptions like `ConnectionError` or DNS failures can still occur. An unhandled exception will cause the Celery task to crash. Because the crash happens before `mark_traces_as_processed` is called, the same batch of traces will be picked up and retried on the next scheduled run, potentially leaving them stuck in a retry loop until their Redis TTL expires. Other parts of the codebase making similar Seer API calls include robust exception handling.


@nora-shap (Member, Author) replied:
Now we send all traces in a single batch request. There's no loop to continue, so catching errors serves no purpose.

@cursor cursor bot (Contributor) left a comment:

Cursor Bugbot has reviewed your changes and found 3 potential issues.


```python
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)
```

Removed timeout exception handling for Seer request

Medium Severity

The TimeoutError exception handling was removed when converting to batch processing. If make_signed_seer_api_request times out due to network issues, the entire task crashes instead of handling the error gracefully. The old code caught TimeoutError from urllib3 and logged it while continuing to process other traces. The new code has no exception handling, causing the task to fail completely on timeout.
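
For reference, the removed per-trace handling looked roughly like this. It is reconstructed from the description above, not copied from the old code; build_seer_request and the 200 success check are assumptions:

```python
from urllib3.exceptions import TimeoutError as Urllib3TimeoutError

for trace in traces_to_send:
    try:
        response = make_signed_seer_api_request(
            connection_pool=seer_issue_detection_connection_pool,
            path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
            body=json.dumps(build_seer_request(trace).dict()).encode("utf-8"),
            timeout=SEER_TIMEOUT_S,  # 180s in the old synchronous code
            retries=0,
        )
    except Urllib3TimeoutError:
        # Log and skip this trace so one slow request does not take down
        # the whole task; the remaining traces still get sent.
        logger.warning(
            "llm_issue_detection.seer_timeout",
            extra={"trace_id": trace.trace_id},
        )
        continue

    if response.status == 200:  # assumed success status for the old sync call
        mark_trace_as_processed(trace.trace_id)
```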


```python
][:TRACES_TO_SEND_TO_SEER]

if not traces_to_send:
    return
```

Race condition in trace deduplication logic

Medium Severity

The refactored trace deduplication logic splits the atomic check-and-set operation into separate check and set steps. The old code used nx=True for atomic deduplication, but the new code checks unprocessed traces via _get_unprocessed_traces, then later marks them after receiving a 202 response. If two task instances run concurrently for the same project, both can identify the same traces as unprocessed, send duplicate batches to Seer, and both receive 202 responses, causing duplicate trace analysis.
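
One way to restore atomicity while keeping the batch flow is to claim traces up front with SET NX in a pipeline, so two concurrent runs can never both claim the same trace. This is a sketch, not the PR's code; redis_client and the key scheme are assumptions:

```python
from redis import Redis

def claim_unprocessed_traces(
    redis_client: Redis, trace_ids: list[str], ttl_s: int = 86400
) -> list[str]:
    """Atomically claim trace ids; returns only the ids this run owns."""
    pipe = redis_client.pipeline()
    for trace_id in trace_ids:
        # SET ... NX succeeds only if the key does not exist yet, so exactly
        # one concurrent run wins each trace.
        pipe.set(f"llm-detection:trace:{trace_id}", "1", nx=True, ex=ttl_s)
    results = pipe.execute()
    return [tid for tid, claimed in zip(trace_ids, results) if claimed]
```

Each run then sends only the traces it successfully claimed, so concurrent runs partition the work instead of duplicating it.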


```python
)

if response.status == 202:
    mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
```

Task crash after Seer accepts work causes duplicate processing

Medium Severity

Traces are marked as processed AFTER Seer returns 202, rather than before submission. If the task crashes or mark_traces_as_processed fails after Seer accepts the work but before Redis marks complete, those traces won't be marked and will be resubmitted on the next run. This causes Seer to process the same traces multiple times, potentially creating duplicate issues. The old code atomically marked traces before sending to Seer, preventing this failure mode.
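
A sketch of the mark-before-send ordering this comment describes, with unmark_traces as a hypothetical compensating helper that deletes the Redis keys when Seer rejects the batch:

```python
trace_ids = [trace.trace_id for trace in traces_to_send]
mark_traces_as_processed(trace_ids)  # claim before Seer can accept the work

response = make_signed_seer_api_request(
    connection_pool=seer_issue_detection_connection_pool,
    path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
    body=json.dumps(seer_request.dict()).encode("utf-8"),
    timeout=SEER_TIMEOUT_S,
    retries=0,
)

if response.status != 202:
    # Hypothetical compensating helper: delete the Redis keys so the
    # traces can be retried on the next scheduled run.
    unmark_traces(trace_ids)
```

Note the trade-off: this swaps duplicate analysis for the opposite risk, since a crash between marking and sending leaves those traces unanalyzed until their Redis TTL expires.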


@roggenkemper (Member) left a comment:

looks good!
