
TELCODOCS-2644 clarify the NUMA aware scheduler scoring behavior #106102

Open

kquinn1204 wants to merge 1 commit into openshift:main from kquinn1204:TELCODOCS-2644

Conversation

@kquinn1204 (Contributor) commented Feb 6, 2026

[TELCODOCS-2644]: Clarify the NUMA aware scheduler scoring behavior

Version(s): 4.18, 4.19, 4.20, 4.21 and main

Issue: https://issues.redhat.com/browse/TELCODOCS-2644

Link to docs preview: https://106102--ocpdocs-pr.netlify.app/openshift-enterprise/latest/scalability_and_performance/cnf-numa-aware-scheduling.html#cnf-numa-resource-scheduling-strategies_numa-aware

QE review:

  • QE has approved this change.

Additional information:

@openshift-ci bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Feb 6, 2026
@ocpdocs-previewbot commented Feb 6, 2026

[id="cnf-balanceallocated-example_{context}"]
== BalancedAllocation strategy example
The `BalancedAllocation` strategy assigns workloads to the NUMA node with the most balanced resource utilization across CPU and memory. The goal is to prevent imbalanced usage, such as high CPU utilization with underutilized memory. Assume a worker node has the following NUMA node states:
The `BalancedAllocation` strategy favors worker nodes that exhibit the most balanced resource utilization (CPU versus Memory) within their NUMA zones. This prevents _skewed_ usage where a node might be out of CPU cycles while having massive amounts of idle memory.
🤖 [error] RedHat.TermsErrors: Use 'compared to' rather than 'versus'. For more information, see RedHat.TermsErrors.
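For the scoring intuition behind the `BalancedAllocation` paragraph above, here is a minimal sketch assuming a standard-deviation penalty over per-resource utilization fractions. The function name, formula, and sample numbers are illustrative assumptions, not the scheduler's actual code.

# Minimal sketch (illustration only, not the scheduler's real code) of a
# BalancedAllocation-style score: rate a NUMA zone by how evenly CPU and
# memory would be utilized after placing the workload.
from statistics import pstdev

MAX_SCORE = 100

def balanced_allocation_score(requested, allocatable):
    """Higher score = more even utilization fractions across resources."""
    fractions = [requested[r] / allocatable[r] for r in allocatable]
    # Penalize skew: the more the fractions diverge, the lower the score.
    return round((1 - pstdev(fractions)) * MAX_SCORE)

# Balanced zone: CPU 60% and memory 58% used after placement -> high score (99)
print(balanced_allocation_score({"cpu": 6, "memory": 5.8}, {"cpu": 10, "memory": 10}))
# Skewed zone: CPU 90% used but memory only 20% used -> lower score (65)
print(balanced_allocation_score({"cpu": 9, "memory": 2}, {"cpu": 10, "memory": 10}))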

. *Node Selection*: The scheduler first selects a suitable worker node based on cluster-wide criteria, for example, taints, labels, or resource availability.

. After a worker node is selected, the scheduler evaluates its NUMA nodes and applies a scoring strategy to decide which NUMA node will handle the workload.
. *NUMA-Aware Scoring*: After a worker node is selected, the scheduler evaluates the available resources within each worker node's NUMA zones. It applies a scoring strategy to select the worker node that best fits the desired resource distribution.

This is confusing, because the Node selection step talks about a singular selected node.

Change it so it says Node filtering, and describe that it keeps only the nodes that are suitable based on the criteria you list plus the NUMA zone available resources.

The second step is then Node selection, which uses the score to pick the "best" node out of the shortlist based on the strategy.
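To make that two-phase flow concrete, here is a hedged sketch of filter-then-score node selection. The node records, field names, and the placeholder scoring function are hypothetical, not the scheduler plugin's real API.

# Hypothetical sketch of the flow suggested above: "Node filtering" keeps only
# nodes with a NUMA zone that can satisfy the request; "Node selection" then
# picks the best survivor by the configured scoring strategy.

def filter_nodes(nodes, request):
    # Keep a node if at least one of its NUMA zones fits the whole request.
    return [
        node for node in nodes
        if any(all(zone[r] >= request[r] for r in request) for zone in node["zones_free"])
    ]

def select_node(shortlist, score):
    # Scoring only ranks the filtered shortlist; it never resurrects a filtered-out node.
    return max(shortlist, key=score)

nodes = [
    {"name": "worker-0", "zones_free": [{"cpu": 2, "memory": 4}, {"cpu": 8, "memory": 16}]},
    {"name": "worker-1", "zones_free": [{"cpu": 1, "memory": 2}, {"cpu": 3, "memory": 6}]},
]
request = {"cpu": 4, "memory": 8}

shortlist = filter_nodes(nodes, request)  # worker-1 drops out here
best = select_node(shortlist, lambda n: max(sum(z.values()) for z in n["zones_free"]))
print(best["name"])                       # -> worker-0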

. After a workload is scheduled, the selected NUMA node’s resources are updated to reflect the allocation.

The default strategy applied is the `LeastAllocated` strategy. This assigns workloads to the NUMA node with the most available resources, that is, the least utilized NUMA node. The goal of this strategy is to spread workloads across NUMA nodes to reduce contention and avoid hotspots.
. *Local Allocation*: Once the pod is assigned to a worker node, the node-level components (Topology Manager) perform the authoritative allocation of specific CPUs and memory. The scheduler does not influence this final selection.

Maybe

Suggested change
. *Local Allocation*: Once the pod is assigned to a worker node, the node-level components (Topology Manager) perform the authoritative allocation of specific CPUs and memory. The scheduler does not influence this final selection.
. *Local Allocation*: Once the pod is assigned to a worker node, the node-level components (CPU and Topology Managers) perform the authoritative allocation of specific CPUs and memory. The scheduler does not influence this final selection.

The CPU Manager picks the CPUs based on input from all the others, like the Topology Manager, the Memory Manager (for hugepages affinity), and other device managers (for example, for SR-IOV affinity).
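Since `LeastAllocated` is the default, a minimal sketch of its spreading behavior may help. The weighted free-fraction formula below mirrors the common Kubernetes LeastAllocated idea, but the weights, names, and numbers here are assumptions, not the operator's actual code.

# Illustrative sketch (assumed formula) of the LeastAllocated idea: favor the
# NUMA zone with the largest free-resource fraction, spreading workloads to
# avoid hotspots.
MAX_SCORE = 100

def least_allocated_score(requested, allocatable, weights=None):
    """Weighted average of free fractions after placement, scaled to 0-100."""
    weights = weights or {r: 1 for r in allocatable}
    total = sum(
        weights[r] * (allocatable[r] - requested[r]) * MAX_SCORE / allocatable[r]
        for r in allocatable
    )
    return round(total / sum(weights.values()))

# A lightly loaded zone scores higher than a busy one for the same request.
print(least_allocated_score({"cpu": 3, "memory": 4}, {"cpu": 16, "memory": 32}))    # 84
print(least_allocated_score({"cpu": 14, "memory": 30}, {"cpu": 16, "memory": 32}))  # 9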


The following table summarizes the different strategies and their outcomes:

Suggested change
The following table summarizes the different strategies and their outcomes:
The following table summarizes the different OpenShift Node selection strategies and their outcomes:
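The summarized table itself is not quoted in this thread. For contrast with the spreading strategies above, here is a hedged sketch of a MostAllocated-style score, assuming the table also covers that strategy; it favors the busiest suitable zone, packing workloads to keep other zones free. Numbers are illustrative.

# Assumed sketch of a MostAllocated-style score (bin-packing behavior).
MAX_SCORE = 100

def most_allocated_score(requested, allocatable):
    """Higher score for zones that would be more utilized after placement."""
    fractions = [requested[r] / allocatable[r] for r in allocatable]
    return round(sum(fractions) / len(fractions) * MAX_SCORE)

print(most_allocated_score({"cpu": 12, "memory": 24}, {"cpu": 16, "memory": 32}))  # busy zone -> 75
print(most_allocated_score({"cpu": 2, "memory": 4}, {"cpu": 16, "memory": 32}))    # idle zone -> 12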

@ffromani commented Feb 9, 2026

LGTM with this caveat: when you write "(CPU and Topology Managers)" I'd either use "Topology and resource managers" to group the active managers, or spell them all out: "(CPU, Memory, Device and Topology managers)". I'm fine with both approaches.

@openshift-ci bot commented Feb 9, 2026

@kquinn1204: all tests passed!


