Add custom delay for instance refresh actions

When an instance refresh is used to update ASGs, the termination events appear to arrive with a start time of now, which causes the node termination handler (NTH) to begin cordoning and draining the node immediately. This works correctly if the ASG healthy percentage is set to 100% and all pods have multiple replicas and PDBs (for NTH itself we need https://github.com/aws/aws-node-termination-handler/pull/463 to satisfy this), but single-replica pods such as Prometheus will often be unschedulable for a short period while the new node boots.

To make this process work without any downtime, NTH could adopt a configurable delay before acting on ASG termination events, defaulting to something like 90 seconds. Provided this delay is longer than the time a new node takes to start and join the cluster, no pods would be left unschedulable, and an ASG healthy percentage below 100% could be used. Combined with the ASG lifecycle hook timeout, this would allow a high degree of customisation without much extra complexity.
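To illustrate the timing involved, here is a minimal sketch in Python of how such a delay could be computed. It is not NTH code: the function names, the constant, and the idea of checking the delay against the lifecycle hook timeout are assumptions made for illustration only, and the 90 second default is simply the value suggested above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default, taken from the "something like 90 seconds" suggestion.
DEFAULT_ASG_TERMINATION_DELAY = timedelta(seconds=90)


def drain_start_time(event_start, delay=DEFAULT_ASG_TERMINATION_DELAY):
    """Return the earliest time at which cordon and drain should begin."""
    if delay < timedelta(0):
        raise ValueError("delay must not be negative")
    return event_start + delay


def remaining_wait(event_start, now, delay=DEFAULT_ASG_TERMINATION_DELAY):
    """Return how long the handler should still wait (never negative)."""
    return max(drain_start_time(event_start, delay) - now, timedelta(0))


def fits_lifecycle_hook(delay, expected_drain, hook_timeout):
    """Check that waiting plus draining finishes within the hook timeout."""
    return delay + expected_drain <= hook_timeout


if __name__ == "__main__":
    # Example: an event that starts "now", as observed with instance refresh.
    event_start = datetime.now(timezone.utc)
    print("drain no earlier than:", drain_start_time(event_start))
    print("fits a 300s hook with a 120s drain:",
          fits_lifecycle_hook(DEFAULT_ASG_TERMINATION_DELAY,
                              timedelta(seconds=120),
                              timedelta(seconds=300)))
```

The last check reflects one assumption worth stating: the delay plus the expected drain time would need to stay within the lifecycle hook timeout, otherwise the ASG could proceed with termination before the drain completes.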

Add custom delay for instance refresh actions #465

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add custom delay for instance refresh actions #465

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions