There's no way to know if an event was successfully delivered to all its matched destinations. Discovering delivery gaps requires exporting logstore data and running offline analysis.
Users want to know: "is something broken that I need to act on?"
Background
Events are stateless, immutable facts
Attempts are stateless log entries with a status (success/failed)
Terminal outcomes for an event+destination pair (used as reason values in the options below):
retry_exhausted: max attempts reached, no success
destination_disabled: destination was disabled when delivery was attempted
destination_not_found: destination was deleted/not found
Options
Option A: Logstore aggregation query
Query the logstore for event+destination pairs with no successful attempt.
SELECT event_id, destination_id, count(*) AS attempts, ...
FROM attempts
GROUP BY event_id, destination_id
HAVING countIf(status = 'success') = 0
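To make the grouping semantics concrete, here is a minimal in-memory sketch of the same aggregation in Python. The sample rows and the function name `pairs_without_success` are hypothetical; only the `event_id`, `destination_id`, and `status` fields come from the query above.

```python
from collections import defaultdict

# Hypothetical attempt rows; field names mirror the query above.
attempts = [
    {"event_id": "evt_1", "destination_id": "des_a", "status": "failed"},
    {"event_id": "evt_1", "destination_id": "des_a", "status": "failed"},
    {"event_id": "evt_1", "destination_id": "des_b", "status": "failed"},
    {"event_id": "evt_1", "destination_id": "des_b", "status": "success"},
    {"event_id": "evt_2", "destination_id": "des_a", "status": "success"},
]

def pairs_without_success(rows):
    """Return {(event_id, destination_id): attempt_count} for pairs with zero successes."""
    total = defaultdict(int)
    successes = defaultdict(int)
    for row in rows:
        key = (row["event_id"], row["destination_id"])
        total[key] += 1  # count(*) per group
        if row["status"] == "success":
            successes[key] += 1  # countIf(status = 'success') per group
    # HAVING countIf(status = 'success') = 0
    return {key: count for key, count in total.items() if successes[key] == 0}

result = pairs_without_success(attempts)
print(result)  # {('evt_1', 'des_a'): 2}
```

Note that this result cannot say whether `evt_1`/`des_a` is terminal or still has a retry pending, which is the main limitation listed under Cons.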
Pros:
No new store, no new write path
Stateless, derived from existing data
Cons:
Expensive aggregation query on large datasets (needs materialized view in ClickHouse)
Can't distinguish "retry pending" from "retry dropped" without cross-checking RSMQ
Filtering out in-progress pairs breaks pagination (fetch 100, filter to 20, page is incomplete)
Can't reliably provide reason — destination state is mutable and not historical
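The pagination con can be illustrated with a small simulation. This is a sketch under stated assumptions: the `retry_pending` flag is hypothetical (Option A has no such field without cross-checking RSMQ), and the 1-in-5 terminal ratio is chosen only to reproduce the "fetch 100, filter to 20" case.

```python
def fetch_page(rows, offset, limit):
    """Offset/limit page over the aggregated pairs (stand-in for the logstore query)."""
    return rows[offset:offset + limit]

def make_pairs(n):
    # Assumption for illustration: 1 in 5 pairs is terminal, the rest have a retry pending.
    return [{"pair": i, "retry_pending": i % 5 != 0} for i in range(n)]

def naive_page(rows, offset, limit):
    """Fetch one page, then drop in-progress pairs: the page comes back short."""
    return [r for r in fetch_page(rows, offset, limit) if not r["retry_pending"]]

def filled_page(rows, offset, limit):
    """Keep fetching until the page is full; returns (items, next_offset, fetches)."""
    items, fetches = [], 0
    while len(items) < limit and offset < len(rows):
        batch = fetch_page(rows, offset, limit)
        fetches += 1
        for row in batch:
            offset += 1
            if not row["retry_pending"]:
                items.append(row)
                if len(items) == limit:
                    break
    return items, offset, fetches

rows = make_pairs(1000)
short = naive_page(rows, 0, 100)
items, next_offset, fetches = filled_page(rows, 0, 100)
print(len(short), len(items), next_offset, fetches)  # 20 100 496 5
```

Filling one page of 100 takes five underlying fetches here, and the cursor the client needs next (496) no longer lines up with page boundaries.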
Option B: Logstore query + RSMQ enrichment
Same as Option A, but enrich each result with a live RSMQ ZSCORE lookup to get next_attempt_time.
Retry ID is deterministic: sha256(event_id + ":" + destination_id) → RSMQ message ID
ZSCORE is O(1) per lookup, ~100 per page is cheap
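A minimal sketch of the enrichment step follows. Several details are assumptions rather than the actual implementation: the hex encoding of the digest, the `FakeSortedSet` class (an in-memory stand-in for the Redis sorted set, so the example runs without a server), and the interpretation of the score as the next attempt time.

```python
import hashlib

def retry_message_id(event_id: str, destination_id: str) -> str:
    """sha256(event_id + ":" + destination_id); hex encoding is an assumption."""
    return hashlib.sha256(f"{event_id}:{destination_id}".encode("utf-8")).hexdigest()

class FakeSortedSet:
    """In-memory stand-in for a Redis sorted set (member -> score)."""

    def __init__(self):
        self._scores = {}

    def zadd(self, member, score):
        self._scores[member] = float(score)

    def zscore(self, member):
        # Like Redis ZSCORE: returns the score, or None when the member is absent.
        return self._scores.get(member)

def enrich(pairs, queue):
    """Attach next_attempt_time to each pair; None means no retry is scheduled."""
    enriched = []
    for pair in pairs:
        message_id = retry_message_id(pair["event_id"], pair["destination_id"])
        enriched.append({**pair, "next_attempt_time": queue.zscore(message_id)})
    return enriched

queue = FakeSortedSet()
queue.zadd(retry_message_id("evt_1", "des_a"), 1771491300)  # hypothetical schedule time

pairs = [
    {"event_id": "evt_1", "destination_id": "des_a"},  # retry still queued
    {"event_id": "evt_2", "destination_id": "des_a"},  # nothing queued: terminal
]
enriched = enrich(pairs, queue)
print([p["next_attempt_time"] for p in enriched])  # [1771491300.0, None]
```

Because the lookup happens after the logstore query, a retry can fire between the two reads, which is the atomicity concern listed under Cons.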
Pros:
No new store
next_attempt_time present = still in progress, null = terminal
Live data, not stale
Cons:
Same expensive logstore aggregation as Option A
Pagination problem remains — can't server-side filter "in progress" pairs without over-fetching
Cross-store query (logstore + RSMQ), no atomicity (retry could fire between queries)
reason still hard to derive
Option C: Dedicated undelivered store
Maintain a dedicated store that tracks event+destination pairs needing attention. Could be a Redis sorted set, or a new table/space within the existing logstore (ClickHouse or Postgres).
Write triggers (add to store):
Retry scheduler exhausts max attempts → add with reason retry_exhausted
Delivery worker encounters disabled destination → add with reason destination_disabled
Delivery worker encounters deleted/not-found destination → add with reason destination_not_found
Removal triggers:
Manual retry succeeds → remove
User acknowledges/dismisses → remove
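The write and removal triggers above could be sketched as follows. This is an in-memory illustration only: the class and method names (`UndeliveredStore`, `add`, `remove`, `list_undelivered`), the `recorded_at` field, and the tuple cursor are all hypothetical, and a real store would be Redis or a logstore table as discussed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

VALID_REASONS = {"retry_exhausted", "destination_disabled", "destination_not_found"}

@dataclass(frozen=True)
class UndeliveredEntry:
    event_id: str
    destination_id: str
    reason: str         # captured at write time
    recorded_at: float  # hypothetical timestamp used for ordering

def sort_key(entry: UndeliveredEntry) -> Tuple[float, str, str]:
    return (entry.recorded_at, entry.event_id, entry.destination_id)

class UndeliveredStore:
    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], UndeliveredEntry] = {}

    def add(self, event_id: str, destination_id: str, reason: str, recorded_at: float) -> None:
        """Write trigger: called by the retry scheduler or delivery worker."""
        if reason not in VALID_REASONS:
            raise ValueError(f"unknown reason: {reason}")
        self._entries[(event_id, destination_id)] = UndeliveredEntry(
            event_id, destination_id, reason, recorded_at
        )

    def remove(self, event_id: str, destination_id: str) -> bool:
        """Removal trigger: manual retry succeeded or the user dismissed the pair."""
        return self._entries.pop((event_id, destination_id), None) is not None

    def list_undelivered(self, limit: int, after: Optional[Tuple[float, str, str]] = None) -> List[UndeliveredEntry]:
        """Cursor pagination over a stable ordering; no filtering or over-fetching."""
        ordered = sorted(self._entries.values(), key=sort_key)
        if after is not None:
            ordered = [e for e in ordered if sort_key(e) > after]
        return ordered[:limit]

store = UndeliveredStore()
store.add("evt_1", "des_a", "retry_exhausted", 100.0)
store.add("evt_2", "des_b", "destination_disabled", 200.0)
store.add("evt_3", "des_c", "destination_not_found", 300.0)

first_page = store.list_undelivered(limit=2)
second_page = store.list_undelivered(limit=2, after=sort_key(first_page[-1]))
removed = store.remove("evt_1", "des_a")  # e.g. manual retry succeeded
remaining = store.list_undelivered(limit=10)
```

Every page is served directly from the store, so page sizes are exact. The cost is that nothing repopulates an entry if an `add` call is missed.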
Pros:
Fast reads, naturally filtered to just problematic cases
Clean pagination — no over-fetching or cross-store filtering
reason is captured at write time (point-in-time, accurate)
No expensive aggregation queries
listUndelivered is a simple read from this store
Cons:
New store and write path to maintain
Write triggers must be added to retry scheduler + delivery worker
If a write is missed (bug, crash), the pair won't appear — no self-healing
Store can drift from reality (e.g. destination re-enabled after being marked as destination_disabled)
Out of scope
Replay/resolution — acting on undelivered events (replay, dismiss, bulk retry) is a separate concern. This endpoint is a read-only view. The undelivered store could be one input to a replay mechanism, but replay can also be triggered from other sources (user selection, time range, post-outage bulk replay).
Open questions
API path? GET /events/undelivered conflicts with GET /events/:event_id
Filtering — by destination, time range, failure code, reason?
An event can be undelivered without ever having an attempt (e.g. destination disabled/deleted at publish time). In that case, the event was never written to the logstore. Should the undelivered store capture the full event data (self-contained, can replay) or just IDs (lightweight, but event data may not exist anywhere)?
For Option C: what store? Redis sorted set by timestamp? Separate table in logstore?
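The API path question can be illustrated with a toy router. This sketch assumes a first-match-wins router, which may not match the actual HTTP framework (many routers give static segments priority automatically); the handler names are hypothetical.

```python
import re

# Parameterised route registered first: it swallows "/events/undelivered".
ROUTES_NAIVE = [
    ("GET", r"^/events/(?P<event_id>[^/]+)$", "get_event"),
    ("GET", r"^/events/undelivered$", "list_undelivered"),
]

# Static route registered first: the literal path wins.
ROUTES_ORDERED = [
    ("GET", r"^/events/undelivered$", "list_undelivered"),
    ("GET", r"^/events/(?P<event_id>[^/]+)$", "get_event"),
]

def resolve(routes, method, path):
    """Return (handler_name, path_params) for the first matching route, else None."""
    for route_method, pattern, handler in routes:
        if route_method != method:
            continue
        match = re.match(pattern, path)
        if match:
            return handler, match.groupdict()
    return None

print(resolve(ROUTES_NAIVE, "GET", "/events/undelivered"))
# ('get_event', {'event_id': 'undelivered'})
print(resolve(ROUTES_ORDERED, "GET", "/events/undelivered"))
# ('list_undelivered', {})
```

Even with the static route first, an event whose ID is literally `undelivered` becomes unreachable through `GET /events/:event_id`, which is one reason a separate path may be preferable.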
Proposal
A new query operation, event.listUndelivered(), that surfaces event+destination pairs where the delivery journey ended without success.
What "undelivered" means
An event+destination pair is undelivered when the journey is over and no attempt succeeded. Pairs with a pending retry are not undelivered; they're still in progress.
Response shape
[
  {
    "event_id": "evt_...",
    "destination_id": "des_...",
    "attempts": 3,
    "last_code": "429",
    "last_attempt_time": "2026-02-19T08:53:22Z",
    "reason": "retry_exhausted"
  }
]
reason values:
retry_exhausted: max attempts reached, no success
destination_disabled: destination was disabled when delivery was attempted
destination_not_found: destination was deleted/not found
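As a sketch of how a client might consume the response shape, the following parses a payload into typed records and rejects unknown reason values. The `UndeliveredPair` class, the `parse_undelivered` helper, and the concrete IDs in the sample are hypothetical; the field names and reason values are taken from the proposal.

```python
import json
from dataclasses import dataclass
from typing import List

REASONS = {"retry_exhausted", "destination_disabled", "destination_not_found"}

@dataclass(frozen=True)
class UndeliveredPair:
    event_id: str
    destination_id: str
    attempts: int
    last_code: str
    last_attempt_time: str  # ISO 8601 timestamp, kept as a string here
    reason: str

def parse_undelivered(payload: str) -> List[UndeliveredPair]:
    """Parse a listUndelivered response body; raises on unknown reasons or fields."""
    items = []
    for raw in json.loads(payload):
        item = UndeliveredPair(**raw)  # TypeError if a field is missing or unexpected
        if item.reason not in REASONS:
            raise ValueError(f"unknown reason: {item.reason}")
        items.append(item)
    return items

# Hypothetical payload; the IDs are placeholders.
sample = json.dumps([
    {
        "event_id": "evt_123",
        "destination_id": "des_456",
        "attempts": 3,
        "last_code": "429",
        "last_attempt_time": "2026-02-19T08:53:22Z",
        "reason": "retry_exhausted",
    }
])

parsed = parse_undelivered(sample)
print(parsed[0].reason, parsed[0].attempts)  # retry_exhausted 3
```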
Open questions
Tenant scoping: needs a tenant_id filter (or uses JWT tenant scope), otherwise returns global results