Add custom delay for instance refresh actions #465
Open
Labels
Priority: Medium, Type: Enhancement, stalebot-ignore
When using instance refresh to update ASGs, the termination events appear to arrive with a start time of now, which causes the node termination handler (NTH) to begin cordoning and draining the node immediately. This works correctly if the ASG healthy percentage is set to 100% and all pods have multiple replicas and PDBs (for NTH itself, #463 is needed to satisfy this), but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots.
To make the whole process work without downtime, NTH could wait a configurable duration before acting on ASG termination events, defaulting to something like 90 seconds. Provided this wait is longer than the time a new node takes to start and join the cluster, no pods would be left unschedulable, and an ASG healthy percentage below 100% could be used. Combined with the ASG lifecycle hook timeout, this would allow a high degree of customisation without much extra complexity.
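To illustrate the idea, here is a minimal sketch of how the drain start time could be derived from the event time, the proposed delay, and the lifecycle hook timeout. It is written in Python for brevity and is not NTH code: `drain_start_time`, `DEFAULT_REFRESH_DELAY`, and the `hook_timeout` and `drain_budget` parameters are all hypothetical names, and capping the delay by the hook timeout is an assumption about how the two settings might interact.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default, taken from the "something like 90 seconds" suggestion.
DEFAULT_REFRESH_DELAY = timedelta(seconds=90)


def drain_start_time(
    event_time: datetime,
    delay: timedelta = DEFAULT_REFRESH_DELAY,
    hook_timeout: Optional[timedelta] = None,
    drain_budget: timedelta = timedelta(0),
) -> datetime:
    """Return when cordon and drain should begin for an ASG termination event.

    event_time   -- start time carried by the termination event
    delay        -- custom wait before acting on the event
    hook_timeout -- lifecycle hook timeout, if known (assumed cap on the wait)
    drain_budget -- time reserved inside the hook timeout for the drain itself
    """
    if delay < timedelta(0):
        raise ValueError("delay must not be negative")
    if hook_timeout is not None:
        # Never wait so long that the hook expires before the drain can finish.
        latest = hook_timeout - drain_budget
        delay = max(timedelta(0), min(delay, latest))
    return event_time + delay


if __name__ == "__main__":
    event = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    start = drain_start_time(
        event,
        hook_timeout=timedelta(seconds=300),
        drain_budget=timedelta(seconds=120),
    )
    print("cordon and drain at", start.isoformat())
```

With a delay of zero this reduces to the current behaviour of draining as soon as the event arrives, so the setting could be introduced without changing existing deployments that opt out of it.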