grpc: server-side health check default: 5s->500ms#8681

Open
jsha wants to merge 1 commit into main from faster-healthcheckers

jsha commented Mar 18, 2026

Since the RA always starts unhealthy (it needs to load overrides), this allows it to become healthy sooner.

This speeds up integration test startup by about 20s, because we start four instances of RA, and they were all waiting on their second internal health check before they could become healthy.

Note that our health-checker binary (used from startservers.py) checks health every 100ms, but no requests are actually sent over the wire because the server side has not sent the client a streaming reply to the HealthClient.Watch() RPC. We do have the option of plumbing things up so that RA can call serverBuilder.healthSrv.SetServingStatus() as soon as its overrides are loaded, which would speed things up a bit more, but this seems good enough for now.

jsha marked this pull request as ready for review March 18, 2026 05:38
jsha requested a review from a team as a code owner March 18, 2026 05:38
jsha requested a review from beautifulentropy March 18, 2026 05:38

aarongable left a comment

This means that every SA instance will be doing two SELECT 1s every 500ms, too:

boulder/sa/sa.go

Lines 986 to 997 in c3c1684

func (ssa *SQLStorageAuthority) Health(ctx context.Context) error {
	err := ssa.dbMap.SelectOne(ctx, new(int), "SELECT 1")
	if err != nil {
		return err
	}
	err = ssa.SQLStorageAuthorityRO.Health(ctx)
	if err != nil {
		return err
	}
	return nil
}

That's probably fine, but I'm not fully convinced we want to do that in prod.

jsha commented Mar 19, 2026

We actually have a healthCheckInterval setting in the SA (and only the SA), which sets this period differently. I had assumed it was set in prod, but it looks like it is not.

I do think it's fine to do SELECT 1 every 500ms in prod, but I do want to avoid spurious prod changes to resolve slowness in CI. I'll look at other fixes here.
