
WIP: feat: client reports#1549

Draft
jpnurmi wants to merge 95 commits into jpnurmi/feat/http-retry from jpnurmi/feat/client-reports

Conversation


@jpnurmi jpnurmi commented Mar 3, 2026

This PR implements client reports for tracking discarded events.

  • Add sentry_client_report module
  • Add send_client_reports option (enabled by default)
  • Instrument discard points: sample_rate, before_send, rate limiting, network errors, queue overflow
  • Piggyback client reports onto outgoing envelopes
  • Record network error discards per envelope item with correct data category, only when retry is not available

Closes: #1216

#skip-changelog (wip)

jpnurmi and others added 30 commits February 23, 2026 14:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The deferred startup retry scan (100ms delay) could pick up files
written by the current session. Filter by startup_time so only
previous-session files are processed. Also ensure the cache directory
exists when cache_keep is enabled, since sentry__process_old_runs
only creates it conditionally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Monotonic time is process-relative and doesn't work across restarts.
Retry envelope timestamps need to persist across sessions, so use
time() (seconds since epoch) for file timestamps, startup_time, and
backoff comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
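A minimal sketch of the epoch-based comparison this change enables; the helper name and signature are illustrative, not SDK API:

```c
#include <stdbool.h>
#include <time.h>

// Epoch timestamps (time(), seconds since 1970) are stable across
// process restarts, unlike monotonic time, which starts over near zero
// in every new process. retry_backoff_elapsed is a hypothetical helper.
static bool
retry_backoff_elapsed(time_t file_ts, time_t now, long backoff_s)
{
    // Both values are epoch-based, so the comparison is meaningful even
    // when they were taken in different sessions.
    return now - file_ts >= backoff_s;
}
```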
Rename SENTRY_RETRY_BACKOFF_BASE_MS to SENTRY_RETRY_BACKOFF_BASE_S
and sentry__retry_backoff_ms to sentry__retry_backoff, since file
timestamps are now in seconds. The bgworker delay sites multiply
by 1000 to convert to the milliseconds it expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move startup_time initialization into sentry__retry_new and remove the
unnecessary sentry__retry_set_startup_time indirection. Tests now use
write_retry_file with timestamps well in the past to match production
behavior where retry files are from previous sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When files exist but aren't eligible yet (backoff not elapsed),
foreach was returning 0, causing the retry polling task to stop.
Return the total number of valid retry files found instead of just the
eligible count so the caller keeps rescheduling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
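The fix can be sketched as follows; the struct and function names are hypothetical stand-ins for the SDK's retry scan:

```c
#include <stddef.h>

// Hypothetical retry file entry; ts is the earliest time (epoch
// seconds) at which it may be retried.
typedef struct {
    long ts;
} retry_file_t;

// Returns the TOTAL number of valid retry files, not just the eligible
// count, so the caller keeps rescheduling the poll while files whose
// backoff has not elapsed remain on disk. Returning only the eligible
// count (possibly 0) is what stopped the polling task.
static size_t
retry_scan(const retry_file_t *files, size_t n, long now, size_t *eligible)
{
    *eligible = 0;
    for (size_t i = 0; i < n; i++) {
        if (now >= files[i].ts) {
            (*eligible)++; // would be sent in this poll cycle
        }
    }
    return n; // total valid files keeps the poll alive
}
```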
Make handle_result return bool (true = file rescheduled for retry,
false = file consumed) and use it in foreach to decrement the total
count. This avoids one extra no-op poll cycle after the last retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Y_THROTTLE

Replace SENTRY_RETRY_BACKOFF_BASE_S and SENTRY_RETRY_STARTUP_DELAY_MS
with ms-based constants so the transport uses them directly without
leaking unit conversion details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Give the retry module a bgworker ref and send callback so it owns all
scheduling. Transport just calls _start and _enqueue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sentry__retry_new only returns NULL on failure, not based on options.
sentry__retry_start and _enqueue require non-NULL retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate prepare/send/free sequence shared by retry_send_cb and
http_send_task.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Already covered by retry_throttle and retry_result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ashpad

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass startup_time directly to _foreach as a `before` filter instead of
a bool. Clear it after the first run so subsequent polls use backoff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deduplicate filename construction across write_envelope, handle_result,
and tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the transport supports retry and http_retries > 0,
sentry__process_old_runs now skips caching .envelope files from old
runs. The retry system handles persistence, so duplicating into
cache/ is unnecessary.

Also simplifies sentry__retry_handle_result: only cache on max
retries exhausted, not on successful send.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the retry-aware check before cache_dir creation so we avoid
mkdir when the retry system handles persistence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry callback now receives a sentry_envelope_t and returns a
status code. The retry system handles deserialization and file
lifecycle internally, keeping path concerns out of the transport.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test case for successful send at max retry count with cache_keep
enabled, confirming envelopes are cached regardless of send outcome.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lopes

The startup poll used `ts >= startup_time` to skip envelopes written
after startup. With second-precision timestamps, this also skipped
cross-session envelopes written in the same second as a fast restart.

Reset `startup_time` in `sentry__retry_enqueue` so the startup poll
falls through to the backoff path for same-session envelopes. The
bgworker processes the send task (immediate) before the startup poll
(delayed), so by the time the poll fires, `startup_time` is already 0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Submit a one-shot retry send task before bgworker shutdown to ensure
pre-existing retry files are sent even if the startup poll hasn't
fired yet. The flush checks startup_time on the worker thread to
avoid re-sending files already handled by enqueue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `time(NULL)` (1-second granularity) with `sentry__usec_time() / 1000`
(millisecond granularity) to avoid timestamp collisions that caused flaky
`>=` vs `>` comparison behavior in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Make sentry__retry_flush block until the flush task completes by adding
a bgworker_flush call, and subtract the elapsed time from the shutdown
timeout. This ensures retries are actually sent before the worker stops.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jpnurmi and others added 29 commits February 24, 2026 15:23
- Change sentry__retry_write_envelope to return bool indicating success
- Return false for nil event IDs (session envelopes) and write failures
- sentry__retry_enqueue now returns early if write fails, preserving
  startup_time for session envelopes so retry_flush_task can flush them
Add null check after cloning cache_path to prevent dereferencing null
later in sentry__retry_send when iterating directory or joining paths.
When the system clock moves backward (now < ts), the condition now >= ts
was false, causing the backoff check to be skipped entirely. This made
items immediately eligible for retry regardless of their retry count.

Now checks if now < ts (clock skew) OR if backoff hasn't elapsed.
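A hedged sketch of the fixed eligibility check (names and the exponential-backoff formula are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

// A file is retried only when the clock has not moved backward
// (now >= ts) AND the backoff for its retry count has elapsed. The old
// code fell through the backoff check entirely when now < ts, making
// items immediately eligible.
static bool
retry_eligible(uint64_t now_ms, uint64_t ts_ms, int count,
    uint64_t backoff_base_ms)
{
    if (now_ms < ts_ms) {
        return false; // clock skew: wait rather than retry immediately
    }
    uint64_t backoff_ms = backoff_base_ms << count; // doubles per retry
    return now_ms - ts_ms >= backoff_ms;
}
```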
Crashpad's kRetryAttempts=5 with `upload_attempts > kRetryAttempts`
(checked before post-increment) allows upload_attempts 0-5, giving
6 retries with backoffs: 15m, 30m, 1h, 2h, 4h, 8h.

sentry-native's `count + 1 < SENTRY_RETRY_ATTEMPTS` with the old
value of 5 only allowed counts 0-3 to be re-enqueued, so the max
backoff reached was 4h. Bumping to 6 gives the same 6 retries and
the full backoff sequence up to 8h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
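The backoff arithmetic above can be sketched directly; the helper is illustrative, only the constant name comes from the commit:

```c
#include <stdint.h>

// With a 15-minute base doubling per attempt, counts 0..5 give
// 15m, 30m, 1h, 2h, 4h, 8h. SENTRY_RETRY_ATTEMPTS = 6 lets counts 0..4
// re-enqueue (count + 1 < 6), so the 6th and final attempt waits the
// full 8 hours, matching Crashpad's schedule.
#define SENTRY_RETRY_ATTEMPTS 6

static uint64_t
retry_backoff_minutes(int count)
{
    return 15ull << count;
}
```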
Rename sentry__retry_flush to sentry__retry_shutdown and remove the
bgworker_flush call so bgworker_shutdown gets the full timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check the sentry__path_rename return value and log a warning on failure
instead of silently ignoring it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move eligible++ after all item fields are populated so a NULL path from
allocation failure does not leave a half-initialized item in the array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace mutable startup_time + sealed with a state enum so that
the startup flag is managed via atomic operations instead of
clearing a uint64_t that may tear on 32-bit systems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
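A minimal sketch of the state-enum approach using C11 atomics; the state names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

// A plain uint64_t startup_time cleared from another thread can tear on
// 32-bit targets; a small atomic state enum cannot.
typedef enum {
    RETRY_STATE_STARTUP = 0,
    RETRY_STATE_RUNNING = 1,
    RETRY_STATE_SEALED = 2,
} retry_state_t;

static atomic_int g_retry_state = RETRY_STATE_STARTUP;

// Transition out of the startup phase exactly once, atomically.
static bool
retry_leave_startup(void)
{
    int expected = RETRY_STATE_STARTUP;
    return atomic_compare_exchange_strong(
        &g_retry_state, &expected, RETRY_STATE_RUNNING);
}
```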
If sentry__retry_new() fails, retry_func was still set on the transport,
causing can_retry to return true and envelopes to be dropped instead of
cached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Envelopes without an event_id (e.g. sessions) were silently dropped
by sentry__retry_write_envelope. This was a workaround (cd57ff4) for
the old retry_write_envelope approach that regenerated a random UUID on
each attempt, orphaning files on disk. The rewrite to rename-based
counter bumps in handle_result (0f37177) made this safe: the UUID is
parsed from the filename and preserved across renames, so a random UUID
assigned at initial write stays stable through all retry cycles.

Generate a random UUIDv4 for nil event_id envelopes instead of skipping
them. Extract unreachable_dsn to module level. Add UUID consistency
assertions to existing retry tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Clear the scheduled flag before scanning so that a concurrent
retry_enqueue always sees 0 and successfully arms a new poll via CAS.
Previously, the flag was cleared after the scan returned, creating a
window where enqueue could see 1, skip scheduling, and then the poll
would clear the flag — stranding the newly enqueued item.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
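The ordering fix can be sketched with a clear-then-scan poll task and a CAS-armed enqueue; names are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int g_scheduled = 1; // 1 = a poll is currently scheduled

// The poll task clears `scheduled` BEFORE scanning, so a concurrent
// enqueue always observes 0 and can arm a new poll. Clearing after the
// scan left a window where an enqueue saw 1, skipped scheduling, and
// its item was stranded.
static void
retry_poll_task(void)
{
    atomic_store(&g_scheduled, 0); // clear first, then scan
    /* ... scan the retry directory here ... */
}

static bool
retry_enqueue_arm_poll(void)
{
    int expected = 0;
    // Succeeds only when no poll is scheduled, so exactly one is armed.
    return atomic_compare_exchange_strong(&g_scheduled, &expected, 1);
}
```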
…n_write_cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…_result

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The retry_count >= 0 branch passed the full source filename to %.36s,
which would grab the timestamp prefix instead of the UUID for retry-
format filenames. Extract the cache name (last 45 chars) before either
branch so both use the correct UUID.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
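A hedged sketch of the extraction step; the retry filename layout shown (timestamp prefix, then UUID, then extension) is assumed for illustration:

```c
#include <string.h>

// With a timestamp prefix in front of the UUID, "%.36s" on the full
// source filename captures the prefix instead of the UUID. Taking the
// last 45 characters first ("<36-char uuid>" + ".envelope") yields the
// correct cache name for both branches.
static const char *
retry_cache_name(const char *filename)
{
    size_t len = strlen(filename);
    return len > 45 ? filename + (len - 45) : filename;
}
```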
Use a mutex (sealed_lock) to serialize the SEALED check in
retry_enqueue with the SEALED set in retry_dump_queue. Store the
envelope address as uintptr_t so retry_dump_cb can skip envelopes
already written by retry_enqueue without risking accidental
dereferencing. The address is safe to compare because the task holds
a ref that keeps the envelope alive during foreach_matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…move_cache

Move sentry__retry_parse_filename and sentry__retry_make_path into
sentry_database as sentry__parse_cache_filename and
sentry__run_make_cache_path. This consolidates cache filename format
knowledge in one module and replaces the fragile `src_len > 45`
heuristic in sentry__run_move_cache with proper parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SENTRY_RETRY_ATTEMPTS constant was bumped from 5 to 6 in 81d0f68
but the public API documentation was not updated to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use a SENTRY_POLL_SHUTDOWN sentinel so that a concurrent
retry_poll_task cannot resubmit the delayed poll that shutdown
just dropped. The CAS(SCHEDULED→IDLE) in retry_poll_task is a
no-op when scheduled is SHUTDOWN, and the subsequent
CAS(IDLE→SCHEDULED) also fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
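The sentinel's effect on both CAS sites can be sketched like this (state names mirror the commit; the functions are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { POLL_IDLE = 0, POLL_SCHEDULED = 1, POLL_SHUTDOWN = 2 };

static atomic_int g_poll = POLL_SCHEDULED;

// Once shutdown stores POLL_SHUTDOWN, neither CAS below can succeed,
// so a concurrent poll task cannot resubmit the delayed poll that
// shutdown just dropped.
static bool
poll_task_begin(void) // CAS(SCHEDULED -> IDLE); no-op during shutdown
{
    int expected = POLL_SCHEDULED;
    return atomic_compare_exchange_strong(&g_poll, &expected, POLL_IDLE);
}

static bool
poll_schedule(void) // CAS(IDLE -> SCHEDULED); fails during shutdown
{
    int expected = POLL_IDLE;
    return atomic_compare_exchange_strong(&g_poll, &expected, POLL_SCHEDULED);
}
```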
On Windows, WinHTTP TCP connect to an unreachable host takes ~2s,
which can exceed the shutdown timeout. Add a cancel_client callback
that closes just the WinHTTP request handle, unblocking the worker
thread so it can process the failure and shut down cleanly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and worker

Use InterlockedExchangePointer to atomically swap client->request to
NULL in cancel, shutdown, and worker exit cleanup. Whichever thread wins
the swap closes the handle; the loser gets NULL and skips.

The worker also snapshots client->request into a local variable right
after WinHttpOpenRequest and uses the local for all subsequent API calls,
so it never reads NULL from the struct if cancel fires mid-function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
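A portable sketch of the "whoever wins the swap closes the handle" pattern; the real code uses `InterlockedExchangePointer` on a WinHTTP `HINTERNET`, replaced here with C11 `atomic_exchange` for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static _Atomic(void *) g_request = NULL; // stands in for client->request

// Called from cancel, shutdown, and worker exit cleanup. Exactly one
// caller receives the non-NULL handle and closes it; the losers get
// NULL and skip, preventing a double-close.
static bool
close_request_once(void)
{
    void *handle = atomic_exchange(&g_request, NULL);
    if (!handle) {
        return false; // another thread already closed it
    }
    /* WinHttpCloseHandle(handle); — in the real code */
    return true;
}
```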
…ncel

Replace the unconditional cancel_client call before bgworker_shutdown
with an on_timeout callback that fires only when the shutdown timeout
expires. This avoids aborting in-flight requests that would have
completed within the timeout, while still unblocking the worker when
it's stuck (e.g. WinHTTP connect to unreachable host).

The callback closes session/connect handles to cancel pending WinHTTP
operations, then the shutdown loop falls through to the !running check
which joins the worker thread. This ensures handle_result runs and the
retry counter is properly bumped on disk.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nd_task

The local `HINTERNET request = client->request` snapshot was only needed
for the cancel_client approach. Since shutdown_client only fires at the
timeout point, mid-function reads of client->request are safe. Keep only
the InterlockedExchangePointer in the exit block to prevent double-close.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move sentry__atomic_store(&bgw->running, 0) from before the on_timeout
callback to the else branch (detach path). This lets the worker's
shutdown_task set running=0 naturally after finishing in-flight work,
making the dump_queue safety-net reachable if the callback fails to
unblock the worker within another 250ms cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iness

Tests that directly call sentry__retry_send were racing with the
transport's background retry worker polling the same cache directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e data loss

After retry_enqueue writes an envelope and stores its address in
sealed_envelope, the envelope is freed when the bgworker task
completes. If a subsequent envelope is allocated at the same address
and is still pending during shutdown, retry_dump_cb would incorrectly
skip it, losing the envelope. Clear sealed_envelope after the first
match so later iterations cannot false-match a reused address.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
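The address-reuse guard can be sketched as follows; the global and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

// Holds the ADDRESS of the envelope already written by retry_enqueue,
// stored as uintptr_t so it is never dereferenced.
static uintptr_t g_sealed_envelope;

// retry_dump_cb-style check: skip the envelope that was already written,
// then clear the stored address so a later envelope allocated at the
// same, reused address cannot false-match and be lost.
static bool
retry_dump_should_skip(const void *envelope)
{
    if (g_sealed_envelope && (uintptr_t)envelope == g_sealed_envelope) {
        g_sealed_envelope = 0; // first match only
        return true;
    }
    return false;
}
```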
Implement SDK telemetry for tracking discarded events per
https://develop.sentry.dev/sdk/telemetry/client-reports/

- Add sentry_client_report module
- Add send_client_reports option (enabled by default)
- Instrument discard points: sample_rate, before_send, rate limiting,
  network errors, queue overflow
- Piggyback client reports onto outgoing envelopes
- Record network error discards per envelope item with correct data
  category, only when retry is not available
- Add unit and integration tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
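Per the client-reports spec linked above, the report travels as an envelope item whose payload aggregates discard counts per reason and data category; a rough example (values illustrative):

```json
{
  "timestamp": 1772712000,
  "discarded_events": [
    { "reason": "before_send", "category": "error", "quantity": 2 },
    { "reason": "ratelimit_backoff", "category": "transaction", "quantity": 1 },
    { "reason": "network_error", "category": "session", "quantity": 3 }
  ]
}
```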
@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from b133fd6 to 93db900 Compare March 17, 2026 16:15