Skip to content

OCPBUGS-86480: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases.#1509

Open
mrniranjan wants to merge 7 commits into
openshift:mainfrom
mrniranjan:ovs_dynamic_pinning_wp
Open

OCPBUGS-86480: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases.#1509
mrniranjan wants to merge 7 commits into
openshift:mainfrom
mrniranjan:ovs_dynamic_pinning_wp

Conversation

@mrniranjan
Copy link
Copy Markdown
Contributor

@mrniranjan mrniranjan commented May 6, 2026

This PR has 2 commits

  1. First commit refactors some of the existing code in terms of readability etc
  2. Add new test cases related related to workload partitioning

Code was generated using AI-Assisted:
AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking

Summary by CodeRabbit

  • Tests
    • Enhanced performance-profile tests to gate CPU-affinity expectations based on Workload Partitioning (enabled/disabled).
    • Updated OVN/OVS affinity checks to validate cpuset behavior, including ovs-vswitchd thread affinity and multi-pod scenarios.
    • Rewrote dynamic OVS pinning tests to cover GU pod pinning, exclusion from OVS affinity, and persistence across pod deletion and node/control-plane reboots.
  • Chores
    • Added a cluster utility to detect Workload Partitioning status.

mrniranjan added 2 commits May 6, 2026 18:56
Extract reusable helper functions (getOvnContainerAffinity,
getOvsAffinities, verifyOvsAffinity, verifyOvsMatchesExpected,
createGuPod, waitForDeploymentReady, expectedOvsAffinity,
getGuPodCPUs, collectGuCPUsFromPodList) to eliminate repeated
patterns across OVS dynamic CPU affinity test cases.

Add cluster.IsWorkloadPartitioningEnabled() to the shared cluster
utility package, querying Infrastructure.Status.CPUPartitioning.

Refactor test_id:64100, 64101, 64102, 64103, 75257 to use the new
helpers, add WP-aware assertion branching, fix indentation
inconsistencies, and reduce overall test noise.

AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
Add new test cases for OVS dynamic CPU affinity under workload
partitioning and control plane reboot scenarios:

- Skip test suite early when workerRTNode is a non-schedulable
  control plane node

- Add control plane reboot test: verifies OVS affinity is not
  restricted to reserved CPUs before and after rebooting a control
  plane node with dynamic pinning enabled

- Add workload partitioning test context with four tests:
  - OVN pod restricted to reserved CPUs
  - OVS affinity wider than OVN pod
  - Reserved CPUs always included in OVS affinity (with/without GU pods)
  - OVS falls to reserved-only when all isolated CPUs are exhausted

AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The test suite now detects whether Workload Partitioning (WP) is enabled and adjusts CPU affinity/cpuset expectations. It adds a cluster helper to check WP, refactors affinity collection/verifications into shared helpers, and expands tests for dynamic OVS pinning, GU pod scenarios, and reboot persistence (including control-plane reboot).

Changes

Workload Partitioning Detection

Layer / File(s) Summary
WP detection helper
test/e2e/performanceprofile/functests/utils/cluster/cluster.go
Added IsWorkloadPartitioningEnabled(ctx) to read the cluster Infrastructure CPUPartitioning setting and report WP status.

CPU Affinity Test Refactoring & Expansion

Layer / File(s) Summary
Imports, state, and setup
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Updated imports, added reserved/isolated CPU set state and WP gating, added control-plane schedulability guard and runtime parsing of profile CPU sets.
OVN kube-node & ovs.slice baseline checks
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
OVN container cpuset.cpus assertion now conditional on WP (reserved vs online); ovs.slice effective cpuset parsed and compared to node online CPUs.
Affinity collection & verification helpers
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Added helpers: getOvnContainerAffinity, getOvsAffinities, verifyOvsAffinity/verifyOvsMatchesExpected/verifyOvsMatchesExpectedWP, expectedOvsAffinity, and parseProfileCPUSets.
GU pod dynamic pinning tests
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Rewrote tests to create GU pods, compute union of GU-pinned CPUs, assert OVS excludes those CPUs under WP, and delete/sync GU pods.
Reboot persistence & control-plane reboot
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Affirm OVS/OVN affinity before and after worker reboot; added control-plane reboot validation ensuring OVS matches reserved+isolated baseline.
ovs-vswitchd thread affinity
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Updated thread-affinity test to use shared pid/affinity collection and confirm GU-pod pinned CPUs are not subsets of ovs-vswitchd thread affinity, including after pod replacement.
Small helpers & utilities
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Added createGuPod, waitForDeploymentReady, collectGuCPUsFromPodList, getGuPodCPUs, ovsSliceCpuset, and inserted GinkgoHelper() in helper entrypoints.

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Harness
    participant API as Kubernetes API
    participant Node as Node (kubelet)
    participant OVN as OVN container
    participant OVS as OVS processes

    Test->>API: Query Infrastructure (CPUPartitioning)
    API-->>Test: WP enabled? (true/false)

    alt WP disabled
        Test->>Node: Read onlineCPUSet
        Test->>OVN: Read OVN container cpuset -> expect == onlineCPUSet
        Test->>OVS: Read OVS pids & affinities -> expect == OVN cpuset
    else WP enabled
        Test->>API: Read PerformanceProfile reserved & isolated sets
        Test->>OVN: Read OVN container cpuset -> expect == reservedCPUSet
        Test->>Test: Create GU pods -> get guPodCPUs (union)
        Test->>OVS: Read OVS pids & affinities -> expect == (reserved ∪ isolated) − guPodCPUs
    end

    Test->>Node: Delete GU pods / Reboot node
    Node-->>Test: Node back online
    Test->>OVS: Re-read OVS affinities -> verify exclusions persist
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm, jira/valid-reference

Suggested reviewers

  • swatisehgal
  • jmencak
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Test Structure And Quality ⚠️ Warning Tests lack proper cleanup patterns for resource creation and 47% of assertions lack failure messages. Some tests create pods without defer protection, risking resource leaks if assertions fail. Add defer cleanup blocks for all resource creation. Add failure message strings to all Expect() assertions per guidelines: Expect(...).To(..., "descriptive message").
Microshift Test Compatibility ⚠️ Warning New tests use config.openshift.io/v1 APIs (Infrastructure, Scheduler) unavailable on MicroShift. API queries will fail before conditional skip logic executes. Add [apigroup:config.openshift.io] tag to tests or guard IsWorkloadPartitioningEnabled() with MicroShift check to skip before API queries fail.
✅ Passed checks (10 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 90.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in cgroups.go are static, deterministic string literals. No dynamic content including pod names, node names, timestamps, UUIDs, or IP addresses found in test declarations.
Single Node Openshift (Sno) Test Compatibility ✅ Passed All tests are SNO-compatible. CP reboot test checks workerRTNode first. WP tests skip if not enabled. Pods use NodeSelector targeting single node.
Topology-Aware Scheduling Compatibility ✅ Passed E2E test code only. Test pods use nodeSelector by hostname, not topology-sensitive labels. Control-plane tests check IsControlPlaneSchedulable() and skip if not applicable.
Ote Binary Stdout Contract ✅ Passed PR does not violate OTE Binary Stdout Contract. No process-level stdout writes in modified files. Test suite properly routes logging to stderr and uses GinkgoWriter for test logs.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No IPv4 assumptions or external connectivity requirements detected. Tests focus on CPU affinity, cgroups, and workload partitioning with all cluster operations internal.
Title check ✅ Passed The title accurately summarizes the main changes: refactoring OVS affinity tests into shared helpers and adding workload-partitioning and control-plane reboot test cases, which directly align with the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from Tal-or and swatisehgal May 6, 2026 13:33
@mrniranjan mrniranjan changed the title AI-Assisted: E2E: Refactor ovs dynamic pinning test case and add new test cases related to schedulable control plane and workload partitioning AI-Assisted: E2E: Refactor OVS affinity e2e tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. May 6, 2026
@mrniranjan mrniranjan changed the title AI-Assisted: E2E: Refactor OVS affinity e2e tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. AI-Assisted: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. May 6, 2026
@mrniranjan mrniranjan changed the title AI-Assisted: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. May 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

562-567: 💤 Low value

BeforeEach re-checks a value that's already cluster-stable.

isWorkloadPartitioningEnabled is computed once in BeforeAll. Using BeforeEach here re-evaluates the skip on every It, which is harmless but wasteful. Switching to BeforeAll makes the intent — "skip the whole context if WP is disabled" — clearer and avoids per-spec churn in test logs.

♻️ Proposed change
-		Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), func() {
-			BeforeEach(func() {
+		Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), Ordered, func() {
+			BeforeAll(func() {
 				if !isWorkloadPartitioningEnabled {
 					Skip("Workload partitioning is not enabled on this cluster")
 				}
 			})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 562 - 567, The context "Workload Partitioning OVS affinity"
currently uses BeforeEach to skip when the cluster-wide flag
isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check
runs once for the whole context instead of before every spec—replace the
BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is
not enabled on this cluster") when isWorkloadPartitioningEnabled is false to
avoid per-test churn in logs.

937-959: 💤 Low value

Extract the 30-second stabilisation delay into a named constant.

The same time.Sleep(30 * time.Second) is used in getOvnContainerAffinity (Line 944), getOvsAffinities (Line 955), and inline in the WP test (Line 597). A named constant near the top of the file documents intent (waiting for OVS dynamic pinning to converge) and avoids drift between call sites.

♻️ Proposed change
 const minRequiredCPUs = 8
+
+// affinityStabilizationDelay is the time to wait for OVS dynamic pinning
+// to converge after pod creation/deletion before sampling affinities.
+const affinityStabilizationDelay = 30 * time.Second

…then replace the magic literal at each site with affinityStabilizationDelay.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 937 - 959, Introduce a named constant affinityStabilizationDelay
(value 30 * time.Second) near the top of the file and replace the hard-coded
time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and
getOvsAffinities (and the inline wait in the WP test mentioned in the comment)
with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS
dynamic pinning to converge") is documented and the delay is centralized for
consistency.

588-616: ⚡ Quick win

Duplicates the getOvnContainerAffinity / getOvsAffinities helpers.

This block re-implements logic that was just refactored into helpers (Lines 937-959), including a bare time.Sleep(30 * time.Second). Reusing the helpers makes intent clearer and keeps the stabilisation delay in one place.

♻️ Proposed simplification
 		It("Verify OVS affinity is wider than OVN pod under workload partitioning", func() {
 			By("Get OVN container affinity")
-			ovnPod, err := ovnCnfNodePod(ctx, workerRTNode)
-			Expect(err).ToNot(HaveOccurred(), "Unable to get ovnPod")
-			ovnContainerids, err := ovnPodContainers(&ovnPod)
-			Expect(err).ToNot(HaveOccurred())
-			containerPid, err := nodes.ContainerPid(ctx, workerRTNode, ovnContainerids[0])
-			Expect(err).ToNot(HaveOccurred())
-
-			time.Sleep(30 * time.Second)
-			ctnCpuset := taskSet(ctx, containerPid, workerRTNode)
-			testlog.Infof("OVN container affinity: %s", ctnCpuset)
+			ctnCpuset := getOvnContainerAffinity(ctx, workerRTNode)
 			Expect(ctnCpuset).To(Equal(reservedCPUSet),
 				"OVN container should be restricted to reserved CPUs under workload partitioning")
 
 			By("Get OVS process affinity and verify it is wider")
-			pidList, err := ovsPids(ctx, ovsSystemdServices, workerRTNode)
-			Expect(err).ToNot(HaveOccurred())
-			pidToCPUs, err := getCPUMaskForPids(ctx, pidList, workerRTNode)
-			Expect(err).ToNot(HaveOccurred())
+			pidToCPUs := getOvsAffinities(ctx, ovsSystemdServices, workerRTNode)
 
 			for pid, ovsAffinity := range pidToCPUs {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 588 - 616, This test duplicates existing helpers: replace the
manual OVN/OVS affinity logic and the explicit time.Sleep with calls to the
refactored helpers getOvnContainerAffinity and getOvsAffinities; call
getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN
container cpuset and assert it equals reservedCPUSet, then call
getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus
and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that
ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the
duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the
stabilisation logic stays centralized in the helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 654-723: The test currently only checks
reservedCPUSet.IsSubsetOf(mask) which allows mask to still equal onlineCPUSet;
update the assertion in the OVS verification loop (using ovsAffinities from
getOvsAffinities) to require exact equality with reservedCPUSet instead of
subset—e.g. replace the IsSubsetOf Expect with an equality check
(mask.Equals(reservedCPUSet) or equivalent) and update the failure message to
state that OVS affinity must equal reserved CPUs after isolated CPUs are
exhausted.
- Around line 402-477: The test uses reservedCPUSet (parsed in BeforeAll for the
workerRTNode) but then may pick a different control-plane node (cpNode) whose
PerformanceProfile and reserved CPUs differ; update the test to obtain the
cpNode's own reserved CPU set before doing affinity checks or skip if none:
after selecting cpNode (the variable determined from workerRTNode or the
control-plane node loop), lookup the PerformanceProfile for cpNode via its
labels (same approach used in BeforeAll), parse Spec.CPU.Reserved into a
cpReservedCPUSet, and replace uses of reservedCPUSet in the mask.Equals(...)
checks and in the expectedOvsAffinity(cpOnlineCPUs, reservedCPUSet, ...) call
with cpReservedCPUSet; if no profile/reserved CPUs can be resolved for cpNode,
call Skip to avoid unsound assertions. Ensure changes reference reservedCPUSet,
workerRTNode, cpNode, expectedOvsAffinity, getOvsAffinities, and BeforeAll so
reviewers can locate the logic.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 562-567: The context "Workload Partitioning OVS affinity"
currently uses BeforeEach to skip when the cluster-wide flag
isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check
runs once for the whole context instead of before every spec—replace the
BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is
not enabled on this cluster") when isWorkloadPartitioningEnabled is false to
avoid per-test churn in logs.
- Around line 937-959: Introduce a named constant affinityStabilizationDelay
(value 30 * time.Second) near the top of the file and replace the hard-coded
time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and
getOvsAffinities (and the inline wait in the WP test mentioned in the comment)
with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS
dynamic pinning to converge") is documented and the delay is centralized for
consistency.
- Around line 588-616: This test duplicates existing helpers: replace the manual
OVN/OVS affinity logic and the explicit time.Sleep with calls to the refactored
helpers getOvnContainerAffinity and getOvsAffinities; call
getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN
container cpuset and assert it equals reservedCPUSet, then call
getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus
and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that
ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the
duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the
stabilisation logic stays centralized in the helpers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c301f9bd-d453-41ea-a409-7d57d2b1fbfb

📥 Commits

Reviewing files that changed from the base of the PR and between fa5b0e1 and bd99aa1.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  • test/e2e/performanceprofile/functests/utils/cluster/cluster.go

Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Copy link
Copy Markdown
Contributor

@Tal-or Tal-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, couple of comments inside.

On the u/s the cluster are not configured with WP, are we going to test cluster with WP configured d/s?

Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

50-73: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Outer workerRTNodes is shadowed in BeforeAll and is never populated.

Line 68 uses := inside the BeforeAll closure, which creates a new local workerRTNodes (and err) that shadow the outer-scope declarations from line 53/61. The local variable is correctly populated and used on line 73 to set workerRTNode, but the outer-scope workerRTNodes slice remains nil for the lifetime of the suite.

Consequently, the for _, node := range workerRTNodes loops at lines 117–122 and 192–196 never execute their bodies — both test assertions are vacuous.

🐛 Proposed fix — use `=` for the already-declared outer variables
-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		var localErr error
+		workerRTNodes, localErr = nodes.GetByLabels(testutils.NodeSelectorLabels)
+		Expect(localErr).ToNot(HaveOccurred())

-		workerRTNodes, err = nodes.MatchingOptionalSelector(workerRTNodes)
-		Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", err))
+		workerRTNodes, localErr = nodes.MatchingOptionalSelector(workerRTNodes)
+		Expect(localErr).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", localErr))

Or, simpler — because err is already declared in the outer scope, drop the : and use plain assignment:

-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)
		Expect(err).ToNot(HaveOccurred())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 50 - 73, The BeforeAll closure is accidentally shadowing the outer
workerRTNodes and err by using :=; change the assignments in the BeforeAll block
to use plain = so the already-declared outer-scope variables (workerRTNodes,
err) are populated (specifically update the nodes.GetByLabels(...) and
nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the
outer workerRTNodes slice is available for the later for _, node := range
workerRTNodes loops and workerRTNode assignment.
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

942-966: ⚡ Quick win

Hardcoded time.Sleep(30s) in getOvnContainerAffinity and getOvsAffinities is both fragile and slow.

Both helpers unconditionally sleep 30 seconds "for affinity to stabilise." This sleep fires on every call — including the multiple calls inside collectAndVerify and the WP tests — adding 60–120+ seconds of artificial delay per test run. If stabilization takes longer than 30 s the tests remain flaky; if it's faster the sleep is wasted.

Replace with an Eventually poll on the affinity value itself:

♻️ Proposed refactor for `getOvsAffinities`
 func getOvsAffinities(ctx context.Context, services []string, node *corev1.Node) map[string]cpuset.CPUSet {
 	GinkgoHelper()
 	pidList, err := ovsPids(ctx, services, node)
 	Expect(err).ToNot(HaveOccurred())
-	time.Sleep(30 * time.Second)
-	affinities, err := getCPUMaskForPids(ctx, pidList, node)
-	Expect(err).ToNot(HaveOccurred())
-	return affinities
+	var affinities map[string]cpuset.CPUSet
+	Eventually(func() error {
+		var e error
+		affinities, e = getCPUMaskForPids(ctx, pidList, node)
+		return e
+	}, 2*time.Minute, 5*time.Second).Should(Succeed())
+	return affinities
 }

Apply the same pattern to getOvnContainerAffinity and remove the time.Sleep(30 * time.Second) at line 591.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 942 - 966, The fixed code should remove the hardcoded time.Sleep(30
* time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead
poll until the CPU affinity stabilises using Gomega's Eventually: for
getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and
getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed
map) inside an Eventually block with a reasonable timeout/interval until the
returned affinities stop changing; for getOvnContainerAffinity, poll the
sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) ->
nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside
Eventually until the affinity equals a previous value (stable) instead of
sleeping; remove the Sleep lines and ensure Expect checks remain outside or
inside Eventually as appropriate so tests fail fast on errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 1033-1040: The waitForDeploymentReady function currently only
checks container readiness for podList.Items[0], which misses other replicas;
update the readiness check to iterate over all pods in podList.Items and for
each pod iterate over pod.Status.ContainerStatuses ensuring every container is
Ready before returning true (i.e., only return true when all pods' containers
are ready); this will ensure subsequent calls like getGuPodCPUs observe all
replicas ready.
- Around line 932-938: The function expectedOvsAffinity has an unused parameter
reservedCPUs; remove reservedCPUs from the signature (change
expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet) and
update every call site that passes three args to pass only onlineCPUs and
guPodCPUs instead, ensuring the function body remains return
onlineCPUs.Difference(guPodCPUs); also update any references/imports or test
helpers that assert the old signature to match the new two-parameter form.
- Around line 1046-1058: The function collectGuCPUsFromPodList has an unused
parameter namespace (the namespace is provided via listOpts.Namespace); remove
the namespace parameter from its signature and from all call sites (e.g., where
collectGuCPUsFromPodList is invoked around the mentioned call site) so callers
pass only (ctx, listOpts), and update any references to match the new signature;
ensure build and tests pass after the change.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 50-73: The BeforeAll closure is accidentally shadowing the outer
workerRTNodes and err by using :=; change the assignments in the BeforeAll block
to use plain = so the already-declared outer-scope variables (workerRTNodes,
err) are populated (specifically update the nodes.GetByLabels(...) and
nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the
outer workerRTNodes slice is available for the later for _, node := range
workerRTNodes loops and workerRTNode assignment.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 942-966: The fixed code should remove the hardcoded time.Sleep(30
* time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead
poll until the CPU affinity stabilises using Gomega's Eventually: for
getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and
getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed
map) inside an Eventually block with a reasonable timeout/interval until the
returned affinities stop changing; for getOvnContainerAffinity, poll the
sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) ->
nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside
Eventually until the affinity equals a previous value (stable) instead of
sleeping; remove the Sleep lines and ensure Expect checks remain outside or
inside Eventually as appropriate so tests fail fast on errors.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b8924fa3-bc0f-46aa-81b9-12917df83050

📥 Commits

Reviewing files that changed from the base of the PR and between bd99aa1 and df26d1b.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go

@mrniranjan
Copy link
Copy Markdown
Contributor Author

/retest

- Add GinkgoHelper() to all test helper functions that perform
  assertions, so failures report the caller location
- Use cpuset.Equals() for cpuset comparisons instead of struct equality
- Refactor CP reboot test to assert against online CPUs directly,
  avoiding dependency on worker profile's reservedCPUSet which may
  not apply to the CP node on MNO clusters
- Fix exhaustion test: rename to reflect it verifies GU pod exclusion
  (not full exhaustion), compute exact expected affinity using
  expectedOvsAffinity, and assert equality instead of weak subset check
- Simplify expectedOvsAffinity to onlineCPUs.Difference(guPodCPUs)
- Fix indentation to match surrounding test blocks
@mrniranjan mrniranjan force-pushed the ovs_dynamic_pinning_wp branch from df26d1b to cd213c0 Compare May 8, 2026 08:17
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

68-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

workerRTNodes outer variable is never assigned due to := shadowing in the BeforeAll closure.

Line 68 uses short variable declaration (:=), creating a new local workerRTNodes variable scoped to the BeforeAll closure. The outer workerRTNodes (declared at line 53) remains nil, causing the test loops at lines 117 and 192 to iterate zero times, making those assertions vacuous.

Change line 68 to use = instead:

🐛 Proposed fix
-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)

The variable err is already declared in the outer var block (line 22), so no additional declaration is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
at line 68, The outer variable workerRTNodes is left nil because the BeforeAll
closure uses short-declaration (:=) which shadows it; change the declaration in
the BeforeAll where nodes.GetByLabels(testutils.NodeSelectorLabels) is called to
an assignment (use = not :=) so the outer workerRTNodes is set, and reuse the
already-declared err variable instead of redeclaring it; locate the
nodes.GetByLabels call inside the BeforeAll closure and replace := with = so
subsequent loops over workerRTNodes in the test (e.g., later assertions) operate
on the populated slice.
♻️ Duplicate comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

1046-1058: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

namespace parameter in collectGuCPUsFromPodList is still unused — not addressed.

The namespace argument is never referenced in the function body; the namespace is already carried by listOpts.Namespace.

♻️ Proposed fix
-func collectGuCPUsFromPodList(ctx context.Context, namespace string, listOpts *client.ListOptions) cpuset.CPUSet {
+func collectGuCPUsFromPodList(ctx context.Context, listOpts *client.ListOptions) cpuset.CPUSet {

Update the call site at line 378:

-guCPUs := collectGuCPUsFromPodList(ctx, dp.Namespace, dpListOpts)
+guCPUs := collectGuCPUsFromPodList(ctx, dpListOpts)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1046 - 1058, The function collectGuCPUsFromPodList currently
ignores its namespace parameter; fix it by ensuring the list options actually
use that namespace before calling testclient.DataPlaneClient.List: if
listOpts.Namespace is empty or different, create or copy listOpts and set its
Namespace to the namespace argument (so List will query the intended namespace),
then proceed to list pods and union their CPUs as before; update any callers
only if they depend on the previous behavior.

1033-1040: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

waitForDeploymentReady still only checks the first pod's container statuses — not addressed.

The loop at line 1036 iterates podList.Items[0].Status.ContainerStatuses regardless of replicas. For a 2-replica deployment, the second pod's containers are never checked, so the function can return while only one pod is truly Ready.

🐛 Proposed fix
-		for _, s := range podList.Items[0].Status.ContainerStatuses {
-			if !s.Ready {
-				return false
-			}
-		}
+		for i := range podList.Items {
+			for _, s := range podList.Items[i].Status.ContainerStatuses {
+				if !s.Ready {
+					return false
+				}
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1033 - 1040, In waitForDeploymentReady, the readiness check only
inspects podList.Items[0].Status.ContainerStatuses; change the logic to iterate
over every pod in podList.Items and for each pod iterate over
pod.Status.ContainerStatuses to ensure all containers in all pods are Ready
(still return false if podList.Items is empty or any container is not Ready),
keeping the existing early-return behavior and using the existing podList.Items
and ContainerStatuses references so all replicas are validated.

936-938: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

reservedCPUs parameter in expectedOvsAffinity is still unused — not addressed.

The reservedCPUs argument is accepted but never referenced in the function body, making every call site (lines 714, 987) silently pass a value that has no effect.

♻️ Proposed fix
-func expectedOvsAffinity(onlineCPUs, reservedCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
-	return onlineCPUs.Difference(guPodCPUs)
+func expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
+	return onlineCPUs.Difference(guPodCPUs)
 }

And update call sites at lines 714 and 987:

-expected := expectedOvsAffinity(onlineCPUSet, reservedCPUSet, guPodCPUs)
+expected := expectedOvsAffinity(onlineCPUSet, guPodCPUs)
-expected := expectedOvsAffinity(onlineCPUs, reservedCPUs, guCPUs)
+expected := expectedOvsAffinity(onlineCPUs, guCPUs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 936 - 938, The function expectedOvsAffinity currently ignores its
reservedCPUs parameter; update its implementation to exclude reservedCPUs from
the resulting CPU set (e.g., return
onlineCPUs.Difference(guPodCPUs).Difference(reservedCPUs) or return
onlineCPUs.Difference(guPodCPUs.Union(reservedCPUs))) so the reserved CPUs are
not assigned to OVS, and leave the function signature unchanged so existing
callers of expectedOvsAffinity continue to work without changes.
🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

981-991: ⚡ Quick win

Non-WP path in verifyOvsMatchesExpected ignores guCPUs — misleading test semantics.

When isWP=false, the function asserts OVS affinity == ovnAffinity. Under non-WP, ovnAffinity == onlineCPUSet, so the assertion reduces to OVS affinity == onlineCPUSet regardless of whether GU pods are present. This means test_id:64101 ("Creating gu pods modifies affinity of ovs") and test_id:64102 do not actually verify any change in OVS affinity on non-WP clusters — the test passes vacuously in that mode.

If dynamic OVS pinning is not expected to change OVS affinity under non-WP, the test descriptions and the guCPUs parameter to verifyOvsMatchesExpected are misleading and the unused parameter should be removed from the non-WP branch. If OVS affinity is expected to change under non-WP, the assertion formula should account for guCPUs there too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 981 - 991, The non-WP branch of verifyOvsMatchesExpected currently
ignores guCPUs and just compares OVS to ovnAffinity, which makes tests vacuous;
change the else branch to compute expected := expectedOvsAffinity(onlineCPUs,
reservedCPUs, guCPUs) and call verifyOvsAffinity(ovsAffinities, expected)
(instead of verifyOvsAffinity(ovsAffinities, ovnAffinity)) so non-WP behavior
accounts for guCPUs; alternatively, if OVS should not change under non-WP,
remove/stop passing the guCPUs parameter and update the test description
accordingly — but the quick fix is to use expectedOvsAffinity(...) in the else
branch of verifyOvsMatchesExpected.

591-591: ⚡ Quick win

Another hardcoded time.Sleep(30 * time.Second) directly in a test body.

Line 591 sleeps unconditionally inside the It block before reading ctnCpuset. The same concerns apply as for the helper-level sleeps: unreliable on slow clusters, always wasteful on fast ones. This particular call also duplicates the 30-second wait that getOvnContainerAffinity would already impose if that helper were used here instead. Consider either using Eventually to poll taskSet until the affinity stabilises, or unifying the wait with the helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
at line 591, Replace the hardcoded time.Sleep(30 * time.Second) inside the It
block with a deterministic wait: either call the existing helper
getOvnContainerAffinity (which already contains the required wait) instead of
the manual sleep, or use Gomega.Eventually to poll the taskSet/ctnCpuset until
the container CPU set/affinity stabilizes (e.g. assert ctnCpuset equals the
expected value within a timeout). Locate the sleep in the test body where
ctnCpuset is read, and modify the code to poll taskSet/ctnCpuset (or delegate to
getOvnContainerAffinity) so the test is robust on slow clusters and not wasteful
on fast ones.

950-962: ⚡ Quick win

Unconditional 30-second sleeps in shared helpers will compound across every test call.

Both getOvnContainerAffinity (line 950) and getOvsAffinities (line 962) each time.Sleep(30 * time.Second). Tests that call both helpers spend 60 s sleeping per invocation pair; test_id:64102 calls them twice (before and after pod deletion) — 120 s of fixed sleep per run regardless of cluster speed. This pattern is also unreliable: fast clusters waste 30 s, slow clusters may still read stale state.

Replace with Eventually-based polling on the affinity value, or at minimum centralise the single wait in the call site rather than burying it in each helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 950 - 962, Both getOvnContainerAffinity and getOvsAffinities
contain unconditional time.Sleep(30 * time.Second) calls causing compounded
fixed delays; remove those sleeps and replace with polling (Gomega Eventually or
a shared WaitForAffinityStable helper) that repeatedly calls the underlying
probes (taskSet for getOvnContainerAffinity, ovsPids/affinity reader for
getOvsAffinities) until the affinity stabilises or a timeout is reached, or
alternatively centralise a single wait in the test caller instead of sleeping
inside these helpers so callers control timeout and frequency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 443-449: The loops over the maps returned by getOvsAffinities
(e.g., ovsBeforeReboot and ovsAfterReboot) can vacuously succeed when the map is
empty; add an explicit non-empty guard before each loop by asserting the map
length is > 0 (e.g., Expect(len(ovsBeforeReboot)).To(BeNumerically(">", 0), "no
OVS affinities found for %v on node %v; check %v", cpOvsServices, cpNode,
cpOvsServices)) so the test fails fast when no OVS PIDs were discovered; apply
the same check for all other getOvsAffinities consumers referenced (the blocks
using ovsBeforeReboot, ovsAfterReboot and the other call sites around lines
noted) so the subsequent per-PID Expect(mask.Equals(...)) assertions actually
run.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Line 68: The outer variable workerRTNodes is left nil because the BeforeAll
closure uses short-declaration (:=) which shadows it; change the declaration in
the BeforeAll where nodes.GetByLabels(testutils.NodeSelectorLabels) is called to
an assignment (use = not :=) so the outer workerRTNodes is set, and reuse the
already-declared err variable instead of redeclaring it; locate the
nodes.GetByLabels call inside the BeforeAll closure and replace := with = so
subsequent loops over workerRTNodes in the test (e.g., later assertions) operate
on the populated slice.

---

Duplicate comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 1046-1058: The function collectGuCPUsFromPodList currently ignores
its namespace parameter; fix it by ensuring the list options actually use that
namespace before calling testclient.DataPlaneClient.List: if listOpts.Namespace
is empty or different, create or copy listOpts and set its Namespace to the
namespace argument (so List will query the intended namespace), then proceed to
list pods and union their CPUs as before; update any callers only if they depend
on the previous behavior.
- Around line 1033-1040: In waitForDeploymentReady, the readiness check only
inspects podList.Items[0].Status.ContainerStatuses; change the logic to iterate
over every pod in podList.Items and for each pod iterate over
pod.Status.ContainerStatuses to ensure all containers in all pods are Ready
(still return false if podList.Items is empty or any container is not Ready),
keeping the existing early-return behavior and using the existing podList.Items
and ContainerStatuses references so all replicas are validated.
- Around line 936-938: The function expectedOvsAffinity currently ignores its
reservedCPUs parameter; update its implementation to exclude reservedCPUs from
the resulting CPU set (e.g., return
onlineCPUs.Difference(guPodCPUs).Difference(reservedCPUs) or return
onlineCPUs.Difference(guPodCPUs.Union(reservedCPUs))) so the reserved CPUs are
not assigned to OVS, and leave the function signature unchanged so existing
callers of expectedOvsAffinity continue to work without changes.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 981-991: The non-WP branch of verifyOvsMatchesExpected currently
ignores guCPUs and just compares OVS to ovnAffinity, which makes tests vacuous;
change the else branch to compute expected := expectedOvsAffinity(onlineCPUs,
reservedCPUs, guCPUs) and call verifyOvsAffinity(ovsAffinities, expected)
(instead of verifyOvsAffinity(ovsAffinities, ovnAffinity)) so non-WP behavior
accounts for guCPUs; alternatively, if OVS should not change under non-WP,
remove/stop passing the guCPUs parameter and update the test description
accordingly — but the quick fix is to use expectedOvsAffinity(...) in the else
branch of verifyOvsMatchesExpected.
- Line 591: Replace the hardcoded time.Sleep(30 * time.Second) inside the It
block with a deterministic wait: either call the existing helper
getOvnContainerAffinity (which already contains the required wait) instead of
the manual sleep, or use Gomega.Eventually to poll the taskSet/ctnCpuset until
the container CPU set/affinity stabilizes (e.g. assert ctnCpuset equals the
expected value within a timeout). Locate the sleep in the test body where
ctnCpuset is read, and modify the code to poll taskSet/ctnCpuset (or delegate to
getOvnContainerAffinity) so the test is robust on slow clusters and not wasteful
on fast ones.
- Around line 950-962: Both getOvnContainerAffinity and getOvsAffinities contain
unconditional time.Sleep(30 * time.Second) calls causing compounded fixed
delays; remove those sleeps and replace with polling (Gomega Eventually or a
shared WaitForAffinityStable helper) that repeatedly calls the underlying probes
(taskSet for getOvnContainerAffinity, ovsPids/affinity reader for
getOvsAffinities) until the affinity stabilises or a timeout is reached, or
alternatively centralise a single wait in the test caller instead of sleeping
inside these helpers so callers control timeout and frequency.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f9a886a-e21b-435e-a245-31c24902c58d

📥 Commits

Reviewing files that changed from the base of the PR and between df26d1b and cd213c0.

📒 Files selected for processing (1)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go

Comment thread test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go Outdated
Verify ovs affinity map is not empty

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
// where Allocatable = Online - Reserved (i.e., the isolated CPUs).
func expectedOvsAffinity(onlineCPUs, reservedCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
isolated := onlineCPUs.Difference(reservedCPUs)
return isolated.Difference(guPodCPUs).Union(reservedCPUs)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably valid for now, but we will have to change it again once @Tal-or reviews the dedicated cpus PR. I wonder if we could join the systemd cpu affinity + Allocatable - GU_Pinned and make it future proof.

Copy link
Copy Markdown
Contributor Author

@mrniranjan mrniranjan May 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarSik Can i use OvsSliceCpuset - GU_Pinned ?
The OVS slice cpuset (read from cgroup) reflects the actual kernel-enforced CPU constraint. Subtracting GU-pinned CPUs yields the correct expected affinity . I.e we read cpuset.cpus.effective from the ovs.slice and use that to subtract the GU_Pinned ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it does, but it won't in the future. @Tal-or can show you the details, but OVS should be checked to run on PerfProfile.spec.cpus.reserved + PerfProfile.spec.cpus.isolated - GU_pinned. This way it will be future proof.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarSik fixed in the latest commit.

}

// IsWorkloadPartitioningEnabled checks whether CPU partitioning is enabled
// cluster-wide by querying the Infrastructure resource's CPUPartitioning status.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WP might be enabled, but not effective when the PerformanceProfile is not applied. That is probably not important for our CI, but it is a corner case (E2E upgrade might be using paused MCPs without PP applied).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarSik we don't run our tests in CI without Performance profile. In fact without performance profile tests would not even start.

}, 5*time.Minute, 10*time.Second).Should(Succeed(),
"OVN pod did not become ready on control plane node after reboot")

By("Verify OVS affinity after reboot is not restricted to reserved CPUs")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this giving enought time to ovn-k to change the cpu affinity? In other words, will it be racy on say slow clusters?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice any issues when i ran on compact clusters (BM)

"OVN pod did not become ready on control plane node after reboot")

By("Verify OVS affinity after reboot is not restricted to reserved CPUs")
By("Verify OVS affinity after reboot spans all online CPUs")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not be true after dedicated cpus are introduced as I mentioned before.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in the latest commit

@Tal-or
Copy link
Copy Markdown
Contributor

Tal-or commented May 11, 2026

@MarSik I think it would be easier to create a follow up PR once dedicated CPUs are in.
I'll add a note to myself, to revise those tests in the future.
CNF-23698

@mrniranjan mrniranjan force-pushed the ovs_dynamic_pinning_wp branch from 53ff9ce to b8a409d Compare May 20, 2026 00:53
Add polarion test case id for test cases
related to workload partitioning

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
@mrniranjan mrniranjan force-pushed the ovs_dynamic_pinning_wp branch from b8a409d to 7aa6690 Compare May 20, 2026 01:05
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 20, 2026

@mrniranjan: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown
Contributor

@MarSik MarSik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold for Talor to take a look

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 25, 2026
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 25, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik, mrniranjan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2026
@mrniranjan mrniranjan changed the title E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. OCPBUGS-86480: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. May 25, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@mrniranjan: This pull request references Jira Issue OCPBUGS-86480, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR has 2 commits

  1. First commit refactors some of the existing code in terms of readability etc
  2. Add new test cases related related to workload partitioning

Code was generated using AI-Assisted:
AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking

Summary by CodeRabbit

  • Tests
  • Enhanced performance-profile tests to gate CPU-affinity expectations based on Workload Partitioning (enabled/disabled).
  • Updated OVN/OVS affinity checks to validate cpuset behavior, including ovs-vswitchd thread affinity and multi-pod scenarios.
  • Rewrote dynamic OVS pinning tests to cover GU pod pinning, exclusion from OVS affinity, and persistence across pod deletion and node/control-plane reboots.
  • Chores
  • Added a cluster utility to detect Workload Partitioning status.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mrniranjan
Copy link
Copy Markdown
Contributor Author

/verified by @mrniranjan

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@mrniranjan: This PR has been marked as verified by @mrniranjan.

Details

In response to this:

/verified by @mrniranjan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants