grpc: server-side health check default: 5s->500ms by jsha · Pull Request #8681 · letsencrypt/boulder

jsha · 2026-03-18T00:04:48Z

Since the RA always starts unhealthy (it needs to load overrides), this allows it to become healthy sooner.

This speeds up integration test startup by about 20s, because we start four instances of RA, and they were all waiting on their second internal health check before they could become healthy.

Note that our health-checker binary (used from startservers.py) checks health every 100ms, but no requests are actually sent over the wire because the server side has not sent the client a streaming reply to the HealthClient.Watch() RPC. We do have the option of plumbing things up so that RA can call serverBuilder.healthSrv.SetServingStatus() as soon as its overrides are loaded, which would speed things up a bit more, but this seems good enough for now.
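
The alternative mentioned here, setting the status as soon as overrides are loaded instead of waiting for the next periodic check, could look roughly like the sketch below. `healthStatus` is a hypothetical stand-in written with the standard library only; it is not the grpc-go health server, and `loadOverrides` is a placeholder for the RA's real startup work.

```go
package main

import (
	"fmt"
	"sync"
)

// healthStatus is a hypothetical stand-in for a per-service health status.
type healthStatus struct {
	mu      sync.Mutex
	serving bool
	changed chan struct{} // closed the first time the status becomes serving
}

func newHealthStatus() *healthStatus {
	return &healthStatus{changed: make(chan struct{})}
}

// setServing marks the service healthy and wakes anyone waiting on changed.
// Calling it more than once is safe.
func (h *healthStatus) setServing() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.serving {
		h.serving = true
		close(h.changed)
	}
}

// isServing reports the current status.
func (h *healthStatus) isServing() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.serving
}

// loadOverrides is a placeholder for startup work. It reports readiness
// directly rather than waiting for a periodic check to notice.
func loadOverrides(h *healthStatus) {
	// ... load overrides here ...
	h.setServing()
}

func main() {
	h := newHealthStatus()
	fmt.Println("serving before load:", h.isServing())

	go loadOverrides(h)
	<-h.changed // a watcher unblocks as soon as the status flips

	fmt.Println("serving after load:", h.isServing())
}
```

With this shape, time-to-healthy depends only on how long the load takes, not on the check interval.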

aarongable

This means that every SA instance will be doing two SELECT 1s every 500ms, too:

boulder/sa/sa.go

Lines 986 to 997 in c3c1684

    
    func (ssa *SQLStorageAuthority) Health(ctx context.Context) error {
    	err := ssa.dbMap.SelectOne(ctx, new(int), "SELECT 1")
    	if err != nil {
    		return err
    	}

    	err = ssa.SQLStorageAuthorityRO.Health(ctx)
    	if err != nil {
    		return err
    	}
    	return nil
    }

That's probably fine, but I'm not fully convinced we want to do that in prod.

jsha · 2026-03-19T17:49:33Z

We actually have a healthCheckInterval setting in the SA (and only the SA), which sets this period independently of the default. I had assumed it was set in prod, but it looks like it is not.

I do think it's fine to do SELECT 1 every 500ms in prod, but I do want to avoid spurious prod changes to resolve slowness in CI. I'll look at other fixes here.

jsha marked this pull request as ready for review March 18, 2026 05:38

jsha requested a review from a team as a code owner March 18, 2026 05:38

jsha requested a review from beautifulentropy March 18, 2026 05:38

beautifulentropy approved these changes Mar 18, 2026

aarongable reviewed Mar 19, 2026

jsha wants to merge 1 commit into main from faster-healthcheckers

3 participants
