Description
Problem
When using the “pause ingestion based on resource utilization” feature (docs), a controller’s cache of server disk utilization is empty after a restart until the ResourceUtilizationChecker periodic task runs. Setting controller.resource.utilization.checker.initial.delay to zero seconds starts populating the cache immediately, but the controller can still begin serving requests before the checker finishes, because it doesn’t wait for the checker to complete before marking itself as ready.
This is a problem for minion-based offline segment generation (code) and offline segment uploads (new feature proposed in #17557): the disk utilization check returns UNDETERMINED if the controller’s disk utilization cache isn’t yet populated, so segment creation/upload is allowed to proceed even if the disk threshold has already been breached.
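To make the failure mode concrete, here is a minimal standalone sketch of the behavior described above. It is not Pinot's actual code: `UtilizationResult`, `DiskUtilizationCache`, and `SegmentUploadGate` are hypothetical names chosen for illustration, and the only detail taken from the issue is that an unpopulated cache yields UNDETERMINED and the request is then allowed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tri-state result; illustrative only, not Pinot's real class.
enum UtilizationResult { PASS, FAIL, UNDETERMINED }

// Hypothetical stand-in for the controller's disk utilization cache.
final class DiskUtilizationCache {
  // serverId -> fraction of disk used (0.0 to 1.0). Empty until the periodic checker runs.
  private final Map<String, Double> _usedFractionByServer = new ConcurrentHashMap<>();

  void update(String serverId, double usedFraction) {
    _usedFractionByServer.put(serverId, usedFraction);
  }

  UtilizationResult check(String serverId, double threshold) {
    Double used = _usedFractionByServer.get(serverId);
    if (used == null) {
      // No data yet (e.g. right after a controller restart).
      return UtilizationResult.UNDETERMINED;
    }
    return used <= threshold ? UtilizationResult.PASS : UtilizationResult.FAIL;
  }
}

// Hypothetical gate: only an explicit FAIL blocks segment creation/upload.
final class SegmentUploadGate {
  static boolean allowUpload(UtilizationResult result) {
    return result != UtilizationResult.FAIL;
  }
}
```

With an empty cache, `check(...)` returns UNDETERMINED and `allowUpload(...)` lets the request through. Once the cache holds a value above the threshold, the same request is rejected.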
Solution
I propose adding an opt-in config controller.resource.utilization.checker.waitDuringStartup that ensures the disk utilization cache is populated before marking the controller as ready. This way, the controller is immediately ready to correctly reject segment creation/upload requests after starting up.
I was thinking of adding another serviceStatusCallback (like this one) that checks whether the disk utilization cache has been populated, and doesn’t return GOOD until it is.
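Roughly, the callback could look like the sketch below. This is a standalone approximation rather than a patch against Pinot: `ReadinessStatus`, `ReadinessCallback`, and `UtilizationCacheReadyCallback` are hypothetical names loosely modeled on the existing service status callbacks, and `markCachePopulated()` is an assumed hook that the checker would call after its first successful run.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical status enum and callback interface, defined here so the sketch is self-contained.
enum ReadinessStatus { STARTING, GOOD }

interface ReadinessCallback {
  ReadinessStatus getServiceStatus();
  String getStatusDescription();
}

// Reports STARTING until the disk utilization cache has been populated at least once.
final class UtilizationCacheReadyCallback implements ReadinessCallback {
  // Mirrors the proposed opt-in config controller.resource.utilization.checker.waitDuringStartup.
  private final boolean _waitDuringStartup;
  private final AtomicBoolean _cachePopulated = new AtomicBoolean(false);

  UtilizationCacheReadyCallback(boolean waitDuringStartup) {
    _waitDuringStartup = waitDuringStartup;
  }

  // Assumed hook: invoked by the checker after its first run fills the cache.
  void markCachePopulated() {
    _cachePopulated.set(true);
  }

  @Override
  public ReadinessStatus getServiceStatus() {
    // When the feature is not opted into, never hold up startup.
    if (!_waitDuringStartup || _cachePopulated.get()) {
      return ReadinessStatus.GOOD;
    }
    return ReadinessStatus.STARTING;
  }

  @Override
  public String getStatusDescription() {
    return getServiceStatus() == ReadinessStatus.GOOD
        ? "Disk utilization cache ready"
        : "Waiting for disk utilization cache to be populated";
  }
}
```

Keeping the flag check inside the callback means the callback can always be registered, and controllers that don't opt in see no change in startup behavior.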