Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
I think we should rename One thing to consider here is what we'll do once we have alerts based on failure rate and the associated event type.
Yes, your proposition makes more sense
No strong opinion, but the current payload does lack |
yes, let's do
that's just the computed value of also do you know |
|
I would represent the threshold itself which may not line up with the current / max exactly. |
|
@alexbouchardd updated, the PR description is up-to-date if you want to review the schema, etc. |
alexbouchardd
left a comment
There was a problem hiding this comment.
Changes to the docs seems to be missing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TDD setup - tests will pass once feature is implemented. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Send alert when destination is auto-disabled after reaching consecutive failure threshold. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…bject Replace flat fields (ConsecutiveFailures, MaxConsecutiveFailures, Progress, WillDisable) with a scoped ConsecutiveFailures struct containing Current, Max, and Threshold. This produces a cleaner JSON payload structure and removes the redundant WillDisable field (threshold == 100 implies disable). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace ConsecutiveFailures struct in DestinationDisabledData with a Reason string field. This decouples the disabled alert from the specific trigger mechanism, allowing future disable reasons (e.g., error rate) without restructuring the payload. Also update e2e types and assertions for the nested ConsecutiveFailures struct introduced in the previous commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d 100 After disabling a destination, update the consecutive failure alert's destination to include DisabledAt, so consumers see the post-disable state. Add test asserting this behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use generic "failed to send alert" / "alert sent" messages with a topic field instead of alert-type-specific message strings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Following the user conversation, I think we need to revisit the alert semantics altogether. Really, these are "Operator Events" and could span different topics, such as alerts. What I would propose is the following changes:
The list of initial events would be:
The new alert configs would look something like
What do you think? |
|
I think the overall idea makes sense. There are some details that I may want to bring up for clarity, but nothing critical. The only high level topic I want to discuss is around publishing these Operator Events
I assume your intention of relying on an MQ like SQS/PubSub is for reliability instead of HTTP? With that said, it's not fully guaranteed because there could be issue from Outpost side? So basically we cannot guarantee at-least-once but instead at-most-once for these events. Would that be the right take? I'm thinking, if we want to provide more guarantee for these events, we can consider reusing the deliverymq mechanism of some sort? Treat it like a system-level destination somehow. I have an approach in mind that's not super huge in scope, just want to make sure we're on the same page first. |
|
Good consideration, while I considered re-using the system level destination, my thinking is that it would leave a bunch of edge cases around having to hardcoded some exceptions for this "internal" destination, while we already have a pattern for operators to bring their own queue. In terms of guarantee, we may be able can get there. Those events can be emitted by the log service when pulling attempts, and the message would get nacked in the log queue if the event can't be published. That would imply re-inserting into storage which should already be idempotent. |
Changes
Alert Payload Schema
consecutive_failuresinto a nested object withcurrent,max,thresholdwill_disablefield (threshold == 100 implies disable)MarshalJSONmethods in favor of standardjson.Marshaltenant_idto top-level of alert dataAlertDestinationwithfilter,metadata,updated_atDeliveryAttemptstruct: passEvent,Destination,AttemptdirectlyNew
alert.destination.disabledCallbackDisabledAtis set on returned destinationreasonfield (e.g.,"consecutive_failure") instead of embedding trigger-specific data, keeping the payload extensible for future disable mechanisms (e.g., error rate)alert.destination.disabledandalert.destination.consecutive_failure(withthreshold: 100) are sent. The disabled alert is sent first, but ordering is not guaranteed.Consecutive Failure Alert at Threshold 100
disabled_at)Error Handling
DestinationDisablerreturns disabled destination for timestamp consistencyNotification Delivery
Alert notifications are sent synchronously via HTTP within
HandleAttempt, which itself is called asynchronously from the delivery pipeline (go h.handleAlertAttempt(...)). This means notifications don't block event delivery, but a slow notification (e.g., the disabled alert) will delay subsequent notifications (e.g., the consecutive failure alert) within the same handler call.Future consideration: If notification latency becomes an issue, we could decouple the two notifications — either by sending them concurrently or by introducing a notification queue. For now, the synchronous approach keeps the implementation simple and the best-effort error handling makes the current behavior acceptable.
Alert Payloads
eventis a fullmodels.Eventandattemptis a fullmodels.Attempt.alert.destination.consecutive_failure{ "topic": "alert.destination.consecutive_failure", "timestamp": "2025-01-15T10:30:00Z", "data": { "tenant_id": "tenant_123", "event": { ... }, "attempt": { ... }, "consecutive_failures": { "current": 10, "max": 20, "threshold": 50 }, "destination": { "id": "dest_xyz", "tenant_id": "tenant_123", "type": "webhook", "topics": ["*"], "filter": {}, "config": {}, "metadata": {}, "created_at": "2025-01-01T00:00:00Z", "updated_at": "2025-01-01T00:00:00Z", "disabled_at": null } } }alert.destination.disabled{ "topic": "alert.destination.disabled", "timestamp": "2025-01-15T10:30:00Z", "data": { "tenant_id": "tenant_123", "disabled_at": "2025-01-15T10:30:00Z", "reason": "consecutive_failure", "event": { ... }, "attempt": { ... }, "destination": { "id": "dest_xyz", "tenant_id": "tenant_123", "type": "webhook", "topics": ["*"], "filter": {}, "config": {}, "metadata": {}, "created_at": "2025-01-01T00:00:00Z", "updated_at": "2025-01-15T10:30:00Z", "disabled_at": "2025-01-15T10:30:00Z" } } }