
:_mod-docs-content-type: CONCEPT
[id="cnf-numa-resource-scheduling-strategies_{context}"]
= NUMA resource scheduling strategies

[role="_abstract"]
The secondary scheduler optimizes the placement of high-performance workloads by using NUMA-aware scoring strategies to select the most suitable worker nodes. This process assigns workloads to nodes with sufficient resources while allowing local node managers to handle final resource pinning.

When scheduling high-performance workloads, the secondary scheduler determines which worker node is best suited for the task based on its internal NUMA resource distribution. While the scheduler uses NUMA-level data to score and select a worker node, the actual resource pinning within that node is managed by the node's local resource managers, such as the Topology Manager and CPU Manager.

When a high-performance workload is scheduled in a NUMA-aware cluster, the following steps occur:

. *Node filtering*: The scheduler first filters the entire cluster to find a shortlist of _feasible_ nodes. A node is only kept if it meets all requirements, such as matching labels, respecting taints and tolerations, and, importantly, having sufficient available resources within its specific NUMA zones. If a node cannot satisfy the workload's NUMA affinity, it is filtered out at this stage.

. *Node selection*: Once a shortlist of suitable nodes is established, the scheduler evaluates them to find the best fit. It applies a NUMA-aware scoring strategy to rank these candidates based on their resource distribution. The node with the highest score is then selected for the workload.

. *Local allocation*: After the pod is assigned to a worker node, the node-level resource managers (the CPU, memory, device, and topology managers) perform the authoritative allocation of specific CPUs and memory. The scheduler does not influence this final selection.
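The filtering and scoring phases described above can be sketched as a toy filter-then-score loop. This is an illustrative sketch only, not the secondary scheduler's actual implementation: all type and function names here are hypothetical, and the headroom score is a deliberately simplified stand-in for a real NUMA-aware scoring strategy.

```python
from dataclasses import dataclass


@dataclass
class NUMAZone:
    free_cpus: int
    free_memory_gb: int


@dataclass
class Node:
    name: str
    tainted: bool  # stand-in for taints/tolerations checks
    zones: list


def feasible(node, req_cpus, req_mem_gb):
    """Filtering phase: keep a node only if it passes cluster-wide checks
    and at least one NUMA zone can satisfy the request on its own."""
    if node.tainted:
        return False
    return any(z.free_cpus >= req_cpus and z.free_memory_gb >= req_mem_gb
               for z in node.zones)


def headroom_score(node):
    """Scoring phase (simplified LeastAllocated flavor): prefer the node
    whose best zone has the most free resources. The unit mixing here is
    a toy heuristic, not the real scoring arithmetic."""
    return max(z.free_cpus + z.free_memory_gb for z in node.zones)


def schedule(nodes, req_cpus, req_mem_gb):
    """Filter the cluster to a shortlist, then pick the highest-scoring node."""
    shortlist = [n for n in nodes if feasible(n, req_cpus, req_mem_gb)]
    if not shortlist:
        return None  # no node can satisfy the workload's NUMA affinity
    return max(shortlist, key=headroom_score).name
```

Note that the sketch stops at node selection: pinning the workload to specific CPUs and memory on the chosen node is left to the node-level resource managers, mirroring the division of responsibility described above.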

The default scoring strategy is `LeastAllocated`. The following table summarizes the different {product-title} strategies and their outcomes:

[discrete]
[id="cnf-scoringstrategy-summary_{context}"]
== Scoring strategy summary
[cols="2,3,3", options="header"]
|===
|Strategy |Description |Outcome
|`LeastAllocated` |Favors worker nodes that contain NUMA zones with the most available resources. |Distributes workloads across the cluster to nodes with the highest available headroom.
|`MostAllocated` |Favors worker nodes where the requested resources fit into NUMA zones that are already highly utilized.|Consolidates workloads on already utilized nodes, potentially leaving other nodes idle.
|`BalancedAllocation` |Favors worker nodes with the most balanced CPU and memory usage across NUMA zones. |Prevents skewed usage patterns where one resource type, such as CPU, is exhausted while another, such as memory, remains idle.
|===
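As a rough illustration of how the three strategies rank candidates, the following sketch scores a placement request against usage figures for a single zone. The formulas are simplified stand-ins in the style of resource-allocation scoring, not the exact {product-title} implementation; the function names and arithmetic are assumptions for illustration only.

```python
def least_allocated(req, used, cap):
    """Higher score for candidates with more remaining headroom after placement."""
    return sum((cap[r] - used[r] - req[r]) / cap[r] for r in req) / len(req)


def most_allocated(req, used, cap):
    """Higher score for candidates that would end up more heavily utilized,
    consolidating workloads and leaving other candidates idle."""
    return sum((used[r] + req[r]) / cap[r] for r in req) / len(req)


def balanced_allocation(req, used, cap):
    """Higher score when post-placement CPU and memory utilization fractions
    are close to each other, preventing skewed usage patterns."""
    fracs = [(used[r] + req[r]) / cap[r] for r in req]
    mean = sum(fracs) / len(fracs)
    variance = sum((f - mean) ** 2 for f in fracs) / len(fracs)
    return 1 - variance ** 0.5  # 1 minus the standard deviation
```

For example, given a request of 4 vCPUs and 8 GB of memory and two zones with 16 CPUs and 64 GB capacity, one at 12 CPUs and 56 GB used and the other at 6 CPUs and 24 GB used, `least_allocated` ranks the lightly used zone higher, while `most_allocated` prefers the heavily used one.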
