ref(llm-detection): Convert detect_llm_issues_for_project to async batch processing #106839
base: nora/ID-1151
Conversation
    response = make_signed_seer_api_request(
        connection_pool=seer_issue_detection_connection_pool,
        path=SEER_ANALYZE_ISSUE_ENDPOINT_PATH,
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )
Bug: The call to make_signed_seer_api_request is not wrapped in a try-except block, which can cause the task to crash on network errors and lead to repeated processing failures.
Severity: MEDIUM
Suggested Fix
Wrap the make_signed_seer_api_request call in a try-except Exception block. In the except block, log the exception for visibility and allow the task to exit gracefully without crashing. This ensures that a transient network failure does not stop the entire task and allows the unprocessed traces to be retried on the next scheduled run.
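The suggested fix can be sketched as a small wrapper; `make_request` below is a hypothetical stand-in for `make_signed_seer_api_request`, since the real call needs the connection pool and signing setup from the task module:

```python
import logging

logger = logging.getLogger(__name__)

def send_batch_to_seer(make_request, payload):
    """Send one batch request, swallowing transient network errors.

    Any exception is logged for visibility and the task exits
    gracefully; the unprocessed traces are picked up again on the
    next scheduled run instead of crashing the whole task.
    """
    try:
        return make_request(payload)
    except Exception:
        logger.exception("llm_issue_detection.seer_request_failed")
        return None
```

Returning `None` lets the caller skip `mark_traces_as_processed`, so a failed batch stays eligible for retry.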
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/sentry/tasks/llm_issue_detection/detection.py#L290-L296
Potential issue: The call to `make_signed_seer_api_request` is not wrapped in a
`try-except` block. While the previous synchronous timeout handling was removed, other
network-related exceptions like `ConnectionError` or DNS failures can still occur. An
unhandled exception will cause the Celery task to crash. Because the crash happens
before `mark_traces_as_processed` is called, the same batch of traces will be picked up
and retried on the next scheduled run, potentially causing them to be stuck in a retry
loop until their Redis TTL expires. Other parts of the codebase making similar Seer API
calls include robust exception handling.
Now we send all traces in a single batch request. There's no loop to continue, so catching errors serves no purpose.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
        body=json.dumps(seer_request.dict()).encode("utf-8"),
        timeout=SEER_TIMEOUT_S,
        retries=0,
    )
Removed timeout exception handling for Seer request
Medium Severity
The TimeoutError exception handling was removed when converting to batch processing. If make_signed_seer_api_request times out due to network issues, the entire task crashes instead of handling the error gracefully. The old code caught TimeoutError from urllib3 and logged it while continuing to process other traces. The new code has no exception handling, causing the task to fail completely on timeout.
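The removed behavior resembled a per-trace loop that logged timeouts and kept going; a minimal sketch, assuming a hypothetical `send_one` helper and using the built-in `TimeoutError` as a stand-in for the urllib3 timeout exception the old code caught:

```python
import logging

logger = logging.getLogger(__name__)

def process_traces(traces, send_one):
    """Old-style per-trace loop: a timeout on one trace is logged
    and the remaining traces are still processed."""
    sent = []
    for trace in traces:
        try:
            send_one(trace)
        except TimeoutError:
            logger.warning("seer request timed out", extra={"trace": trace})
            continue
        sent.append(trace)
    return sent
```

With a single batch request there is no loop to continue, so the equivalent protection is an outer try-except around the one call rather than per-iteration handling.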
    ][:TRACES_TO_SEND_TO_SEER]

    if not traces_to_send:
        return
Race condition in trace deduplication logic
Medium Severity
The refactored trace deduplication logic splits the atomic check-and-set operation into separate check and set steps. The old code used nx=True for atomic deduplication, but the new code checks unprocessed traces via _get_unprocessed_traces, then later marks them after receiving a 202 response. If two task instances run concurrently for the same project, both can identify the same traces as unprocessed, send duplicate batches to Seer, and both receive 202 responses, causing duplicate trace analysis.
Additional Locations (1)
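The atomic check-and-set the old code relied on can be sketched with Redis `SET NX` semantics; the `FakeRedis` class and the `llm-detection:` key prefix below are illustrative stand-ins, not the task's actual key scheme:

```python
class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands used here."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None  # Redis returns nil when NX blocks the write
        self._data[key] = value
        return True

def claim_traces(client, trace_ids, ttl=3600):
    """Atomically claim each trace; only the claimer processes it.

    With nx=True the SET succeeds exactly once per key, so two
    concurrent task instances cannot both claim the same trace,
    eliminating the check-then-set race.
    """
    return [
        t for t in trace_ids
        if client.set(f"llm-detection:{t}", "1", nx=True, ex=ttl)
    ]
```

A second concurrent caller claiming an overlapping set of trace IDs gets back only the IDs the first caller did not claim.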
    )

    if response.status == 202:
        mark_traces_as_processed([trace.trace_id for trace in traces_to_send])
Task crash after Seer accepts work causes duplicate processing
Medium Severity
Traces are marked as processed AFTER Seer returns 202, rather than before submission. If the task crashes or mark_traces_as_processed fails after Seer accepts the work but before Redis marks complete, those traces won't be marked and will be resubmitted on the next run. This causes Seer to process the same traces multiple times, potentially creating duplicate issues. The old code atomically marked traces before sending to Seer, preventing this failure mode.
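The mark-before-send ordering can be sketched as follows; `mark_processed` and `send` are hypothetical stand-ins for the task's helpers, and the trade-off shown (a failed send drops the batch rather than duplicating it) is the at-most-once behavior the old per-trace code had:

```python
def submit_batch(mark_processed, send, trace_ids):
    """Mark traces as processed BEFORE submitting them to Seer.

    A crash after marking but before (or during) the send loses the
    batch instead of resubmitting it, so Seer never analyzes the same
    traces twice. Marking after a 202 inverts that trade-off toward
    duplicate processing.
    """
    mark_processed(trace_ids)  # crash after this point cannot cause resubmission
    try:
        return send(trace_ids)
    except Exception:
        return None  # traces stay marked: at-most-once delivery
```

Which ordering is right depends on whether duplicate Seer analysis or a silently dropped batch is the cheaper failure.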
roggenkemper left a comment
looks good!
Note: This PR is being merged into parent branch ID-1151. It depends on additional changes that are not yet written. Full cutover will happen when all pieces are ready. Tech spec here.

Converts detect_llm_issues_for_project from synchronous per-trace requests to async batch processing. This is a breaking change that depends on corresponding Seer changes and does not need to be backwards compatible.

Changes
- Added _get_unprocessed_traces() using bulk Redis mget instead of per-trace checks
- Replaced mark_trace_as_processed() with mark_traces_as_processed() using a Redis pipeline
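The bulk-read and pipelined-write pattern described above can be sketched as follows; the fake client classes and the `llm-detection:` key prefix are illustrative assumptions, not the actual Sentry key scheme:

```python
class FakePipeline:
    """In-memory stand-in for a Redis pipeline: queue writes, flush once."""

    def __init__(self, store):
        self._store, self._ops = store, []

    def set(self, key, value, ex=None):
        self._ops.append((key, value))

    def execute(self):
        for k, v in self._ops:
            self._store[k] = v
        return [True] * len(self._ops)

class FakeRedis:
    """In-memory stand-in supporting MGET and pipeline()."""

    def __init__(self):
        self._data = {}

    def mget(self, keys):
        return [self._data.get(k) for k in keys]

    def pipeline(self):
        return FakePipeline(self._data)

def get_unprocessed_traces(client, trace_ids):
    """One MGET round trip instead of N per-trace GETs."""
    flags = client.mget([f"llm-detection:{t}" for t in trace_ids])
    return [t for t, seen in zip(trace_ids, flags) if seen is None]

def mark_traces_as_processed(client, trace_ids, ttl=3600):
    """Batch all the SETs into a single pipeline round trip."""
    pipe = client.pipeline()
    for t in trace_ids:
        pipe.set(f"llm-detection:{t}", "1", ex=ttl)
    pipe.execute()
```

Both helpers cost one network round trip regardless of batch size, which is the point of the mget/pipeline conversion.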