IX-1782: Que health check surfaces unhealthy state if postgres_error on any worker#118

Merged
jcardozagc merged 1 commit into master from ix-1782-healthcheck-checks-workers-for-pg-error
Mar 30, 2026

Conversation

Contributor

@jcardozagc jcardozagc commented Feb 25, 2026

After a CloudSQL maintenance event, Que worker pods can end up with stale database connections, as happened in a recent incident. Workers handle this internally: on a PG::Error, they return :postgres_error, sleep for the wake interval, and retry. The health check, however, is a hardcoded lambda that always returns 200, so Kubernetes has no signal to restart the affected pods. The worker in the incident ended up perpetually retrying a faulty connection every 5 seconds.

Let's track the result of each work cycle on the worker itself and expose it through a healthy? predicate (which purely checks whether the worker's last work_loop cycle ended in a postgres_error state). The health check endpoint now delegates to a new WorkerHealthCheck that checks every worker in the group and returns 503 if any of them is in an unhealthy state; otherwise it returns the 200 it has always returned.

Yes, this health check can now return a non-200, and this is the only 'breaking' change, but it should mean Kubernetes has a signal to restart pods when the liveness probe threshold is breached.
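For readers skimming the thread, here is a minimal sketch of the shape described above. It is not the code in this PR: only healthy?, WorkerHealthCheck, and :postgres_error come from the description. SketchWorker, SketchWorkerGroup, work_cycle, the :job_worked result, and the response bodies are assumptions made up for illustration.

```ruby
# Hypothetical stand-in for a Que worker: it remembers the result of
# its most recent work cycle and reports unhealthy only when that
# result was :postgres_error.
class SketchWorker
  def initialize(&work)
    @work = work
    @last_result = nil
  end

  # Run one work cycle and record its outcome (assumed method name).
  def work_cycle
    @last_result = @work.call
  end

  def healthy?
    @last_result != :postgres_error
  end
end

# Hypothetical stand-in for the worker group; only #workers is needed.
SketchWorkerGroup = Struct.new(:workers)

# Rack-style callable: 200 when every worker is healthy, 503 otherwise.
class WorkerHealthCheck
  def initialize(worker_group)
    @worker_group = worker_group
  end

  def call(_env)
    if @worker_group.workers.all?(&:healthy?)
      [200, { "content-type" => "text/plain" }, ["healthy"]]
    else
      [503, { "content-type" => "text/plain" }, ["unhealthy"]]
    end
  end
end

ok_worker  = SketchWorker.new { :job_worked }      # assumed success result
bad_worker = SketchWorker.new { :postgres_error }
[ok_worker, bad_worker].each(&:work_cycle)

all_ok_status, = WorkerHealthCheck.new(SketchWorkerGroup.new([ok_worker])).call({})
mixed_status,  = WorkerHealthCheck.new(SketchWorkerGroup.new([ok_worker, bad_worker])).call({})
```

In this sketch a worker that has not yet run a cycle counts as healthy, and a worker recovers as soon as a later cycle returns anything other than :postgres_error. Whether the real implementation behaves the same way is not stated in the description.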

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch 3 times, most recently from 7f5a866 to fcd7731 Compare March 19, 2026 14:55
end

def call(_env)
if @worker_group.workers.all?(&:healthy?)
Contributor

I'm not sure we want to say that the app is unhealthy if just one worker has a problem. That could be transient, and it might mean we end up restarting the pod for a single error, which probably isn't what we need.

Maybe we do something like

if @worker_group.workers.any?(&:healthy?)

So we end up only triggering the else when ALL the workers are unhealthy?
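
To make the difference between the two predicates concrete, here is a small sketch. StubWorker, status_with_all, and status_with_any are hypothetical names used only for this comparison; they are not part of the PR.

```ruby
# Hypothetical stand-in worker whose health is fixed at construction.
StubWorker = Struct.new(:healthy) do
  def healthy?
    healthy
  end
end

# Current behaviour in the diff: unhealthy if ANY worker is unhealthy.
def status_with_all(workers)
  workers.all?(&:healthy?) ? 200 : 503
end

# Suggested alternative: unhealthy only if ALL workers are unhealthy.
def status_with_any(workers)
  workers.any?(&:healthy?) ? 200 : 503
end

mixed    = [StubWorker.new(true), StubWorker.new(false)]
all_down = [StubWorker.new(false), StubWorker.new(false)]

mixed_all    = status_with_all(mixed)     # 503: one bad worker is enough
mixed_any    = status_with_any(mixed)     # 200: one good worker is enough
all_down_any = status_with_any(all_down)  # 503: no healthy worker left
```

One edge case to keep in mind: for an empty worker list, Ruby's all? returns true and any? returns false, so the two variants also disagree when the group has no workers.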

@jcardozagc jcardozagc force-pushed the ix-1782-healthcheck-checks-workers-for-pg-error branch from fcd7731 to 66be132 Compare March 19, 2026 15:34
@jcardozagc jcardozagc marked this pull request as ready for review March 23, 2026 09:54
@jcardozagc
Contributor Author

Leaving this running on payments-service live-staging for a wee bit

Contributor

@stephenbinns stephenbinns left a comment

Given it's been running a while now and the idea is sound, I think we can get this out next week.

@jcardozagc jcardozagc merged commit 2e7e4c1 into master Mar 30, 2026
8 checks passed
@jcardozagc jcardozagc deleted the ix-1782-healthcheck-checks-workers-for-pg-error branch March 30, 2026 08:52