(3.x.x) Instance Bootstrap Failures and Protected Mode when Launching Large Number of Nodes #7265

@hgreebe

The issue

When launching a large number of instances, bootstrap failures may occur because of EC2 API eventual consistency delays.
When this happens, the private IP addresses might be missing from the DescribeInstances API response, even though the IPs were already assigned and returned in the RunInstances API response.

This can be seen in the clustermgtd logs:

2026-03-04 17:50:20,107 - [slurm_plugin.instance_manager:get_cluster_instances] - WARNING - Ignoring instance i-1234abcd because not all EC2 info are available, exception: KeyError, message: 'PrivateIpAddress'
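
To make the failure mode concrete, here is a minimal Python sketch of how a caller could tolerate a DescribeInstances response that temporarily lacks `PrivateIpAddress`. It is not the ParallelCluster implementation: `split_by_private_ip`, `wait_for_private_ips`, and the `describe` callable are hypothetical names introduced for illustration, and `describe` is assumed to return a dict shaped like a DescribeInstances response (for example, a thin wrapper around boto3's `describe_instances(InstanceIds=...)`).

```python
import time
from typing import Callable, Dict, List, Tuple


def split_by_private_ip(response: Dict) -> Tuple[List[Dict], List[str]]:
    """Separate instances that report a private IP from those that do not yet."""
    ready, missing = [], []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # instance["PrivateIpAddress"] would raise KeyError when the field
            # is absent, which matches the warning in the log above.
            if instance.get("PrivateIpAddress"):
                ready.append(instance)
            else:
                missing.append(instance["InstanceId"])
    return ready, missing


def wait_for_private_ips(
    describe: Callable[[List[str]], Dict],  # hypothetical wrapper around DescribeInstances
    instance_ids: List[str],
    attempts: int = 5,          # assumed retry budget, tune for your cluster size
    delay_seconds: float = 2.0,  # assumed fixed delay between polls
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, Dict], List[str]]:
    """Poll until every instance reports a private IP or attempts run out."""
    ready_by_id: Dict[str, Dict] = {}
    pending = list(instance_ids)
    for attempt in range(attempts):
        ready, _ = split_by_private_ip(describe(pending))
        for instance in ready:
            ready_by_id[instance["InstanceId"]] = instance
        # Instances absent from the response, or present without an IP, stay pending.
        pending = [i for i in pending if i not in ready_by_id]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return ready_by_id, pending
```

Any IDs left in the second return value never reported a private IP within the retry budget. Treating those as "not yet known" rather than as failed is the design choice this sketch illustrates.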

The cluster treats the missing IP address as a bootstrap failure. As a result, the cluster can enter protected mode, and cluster creation can fail. For information on protected mode and how to recover from it, refer to this documentation.

Affected ParallelCluster versions, OSes and schedulers

  • All ParallelCluster versions and OSes
  • Slurm scheduler

Mitigation

You can find a detailed explanation of the problem and its mitigation here.
