
OCPBUGS-75894: use --delete-if-present for karg removal #5914

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from isabella-janssen:ocpbugs-75894 on May 6, 2026

Conversation

@isabella-janssen
Member

@isabella-janssen isabella-janssen commented May 1, 2026

Closes: OCPBUGS-75894

Note: The changes in this PR and description were created with the assistance of Claude.

- What I did

The generateKargs() function used --delete= to remove old kernel arguments during reconciliation. When the expected and actual kargs drift apart, rpm-ostree fails the entire kargs transaction with "No key found", leaving the node degraded. Our configuration drift monitor should typically catch such a mismatch, but it can still occur in cases such as the one highlighted in OCPBUGS-75894. This PR changes --delete= to --delete-if-present=, which skips missing kargs instead of erroring out.
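To make the change concrete, here is a minimal Go sketch of the flag generation. It is not the actual body of generateKargs() in pkg/daemon/update.go: buildKargsArgs is a hypothetical name, and the real function may parse or normalize the arguments differently. The only behavior taken from this PR is that removals use --delete-if-present= while additions keep using --append=.

```go
package main

import "fmt"

// buildKargsArgs is a hypothetical stand-in for generateKargs(): it turns the
// old and new kernel-argument lists into flags for "rpm-ostree kargs".
func buildKargsArgs(oldKargs, newKargs []string) []string {
	args := make([]string, 0, len(oldKargs)+len(newKargs))
	for _, karg := range oldKargs {
		// Before this PR the flag was "--delete=", which fails the whole
		// transaction when the karg is already missing from the node.
		args = append(args, "--delete-if-present="+karg)
	}
	for _, karg := range newKargs {
		// Appending new kargs is unchanged by this PR.
		args = append(args, "--append="+karg)
	}
	return args
}

func main() {
	oldKargs := []string{"rcutree.nohz_full_patience_delay=1000", "nmi_watchdog=0"}
	newKargs := []string{"rcutree.nohz_full_patience_delay=2000", "nmi_watchdog=0"}
	for _, arg := range buildKargsArgs(oldKargs, newKargs) {
		fmt.Println(arg)
	}
}
```

The sketch deletes every old karg and re-appends every new one, so a karg that appears in both lists (nmi_watchdog=0 here) is removed and added back within the same transaction.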

- How to verify it

  1. Apply an MC that adds custom kargs to the nodes and wait for the change to be applied.
$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=1000
    - nmi_watchdog=0
EOF
  2. Confirm the karg is present on the node.
$ oc debug node/<node> -- chroot /host rpm-ostree kargs | grep rcutree
# Expected output includes `rcutree.nohz_full_patience_delay=1000`
  3. Create drift by removing the karg and rebooting the node.
$ oc debug node/<node> -- chroot /host rpm-ostree kargs --delete=rcutree.nohz_full_patience_delay=1000
$ oc debug node/<node> -- chroot /host systemctl reboot
  4. Wait for the node to become Ready. The MCD will detect the missing karg and mark the node degraded. This is expected.
  5. Apply another MC to create a kargs diff.
$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test2
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=2000
    - nmi_watchdog=0
EOF
  6. Point the node at the new rendered MC and touch the force file.
$ oc annotate node <node> machineconfiguration.openshift.io/desiredConfig=<new-mc> --overwrite
$ oc debug node/<node> -- chroot /host touch /run/machine-config-daemon-force
  7. Check the MCD logs. Before the fix, the logs include "error: No key 'rcutree.nohz_full_patience_delay' found" and the node degrades. After the fix, the update should succeed and the node should not degrade.
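The drift scenario in the steps above can be modeled in a few lines of Go. This is a toy model of the two flag semantics, not rpm-ostree code: deleteKarg and deleteKargIfPresent are hypothetical helpers, and matching on exact key=value strings is a simplifying assumption. The error text mirrors the log message quoted in the last step.

```go
package main

import (
	"fmt"
	"strings"
)

// deleteKarg models "--delete=": removing a karg that is absent is an error.
func deleteKarg(kargs []string, target string) ([]string, error) {
	for i, karg := range kargs {
		if karg == target {
			out := append([]string{}, kargs[:i]...)
			return append(out, kargs[i+1:]...), nil
		}
	}
	key := strings.SplitN(target, "=", 2)[0]
	return nil, fmt.Errorf("No key '%s' found", key)
}

// deleteKargIfPresent models "--delete-if-present=": absent kargs are skipped.
func deleteKargIfPresent(kargs []string, target string) []string {
	out, err := deleteKarg(kargs, target)
	if err != nil {
		return kargs
	}
	return out
}

func main() {
	// Node state after the manual deletion and reboot (drifted).
	onNode := []string{"nmi_watchdog=0"}
	oldMC := []string{"rcutree.nohz_full_patience_delay=1000", "nmi_watchdog=0"}
	newMC := []string{"rcutree.nohz_full_patience_delay=2000", "nmi_watchdog=0"}

	// Old behavior: the first missing karg aborts the transaction.
	if _, err := deleteKarg(onNode, oldMC[0]); err != nil {
		fmt.Println("--delete:", err)
	}

	// New behavior: missing kargs are skipped, then the new kargs are appended.
	result := onNode
	for _, karg := range oldMC {
		result = deleteKargIfPresent(result, karg)
	}
	result = append(result, newMC...)
	fmt.Println("--delete-if-present:", result)
}
```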

- Description for the changelog
OCPBUGS-75894: use --delete-if-present for karg removal

Summary by CodeRabbit

  • Bug Fixes
    • Kernel argument removal during system updates is now drift-tolerant when the bootloader state has changed or is missing, reducing update failures in edge cases.
    • Addition of new kernel arguments is unaffected and continues to work as before.
    • Behavior validated by updated tests to ensure reliability.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci Bot commented May 1, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 1, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-75894, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: dd6d5e23-eab2-4230-a1a2-9e9448f876be

📥 Commits

Reviewing files that changed from the base of the PR and between 3c1c3a7 and d54b9b4.

📒 Files selected for processing (2)
  • pkg/daemon/update.go
  • pkg/daemon/update_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/daemon/update_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/daemon/update.go

Walkthrough

generateKargs in the daemon update path now emits rpm-ostree kernel-argument deletions with --delete-if-present= (instead of --delete=) to tolerate missing/changed bootloader state. Tests were updated to expect the new deletion flag.

Changes

Kernel Argument Deletion Update

| Layer / File(s) | Summary |
| --- | --- |
| Core Implementation: pkg/daemon/update.go | Changed kernel-argument removal to emit --delete-if-present=<arg> for each arg from the old MachineConfig; updated inline comments to document drift-tolerant behavior. |
| Behavioral Outputs: pkg/daemon/update.go | Preserved existing --append=<arg> emission for new kernel args and unchanged return behavior. |
| Tests: pkg/daemon/update_test.go | Updated TestKernelAguments expected outputs to use --delete-if-present=<karg> for removal cases; append expectations remain --append=<karg>. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: using --delete-if-present instead of --delete for kernel argument removal operations. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The test file uses Go's standard testing framework rather than Ginkgo, so the custom check for Ginkgo tests is not applicable. |
| Test Structure And Quality | ✅ Passed | The custom check is designed to assess Ginkgo test code quality, but the pull request modifies standard Go unit tests using the testing package and testify assertions, not Ginkgo. |
| Microshift Test Compatibility | ✅ Passed | PR modifies existing unit tests in pkg/daemon/update_test.go, not new Ginkgo e2e tests, so the Ginkgo compatibility check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR modifies only standard Go unit tests in pkg/daemon/, not new Ginkgo e2e tests, so the SNO Test Compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes only modify kernel argument generation in update.go; no deployment manifests, scheduling constraints, or topology-aware configurations are affected. |
| Ote Binary Stdout Contract | ✅ Passed | The PR modifies generateKargs() to use --delete-if-present= flag, a string-building operation with no new stdout writes in process-level code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies unit tests in pkg/daemon/update_test.go, not Ginkgo e2e tests; no IPv4 assumptions or external connectivity requirements detected. |


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 1, 2026
@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 1, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-75894, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @tlbueno


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 1, 2026
@openshift-ci openshift-ci Bot requested a review from tlbueno May 1, 2026 16:36
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-75894, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @tlbueno


@isabella-janssen isabella-janssen marked this pull request as ready for review May 4, 2026 15:16
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@openshift-ci openshift-ci Bot requested review from pablintino and yuqi-zhang May 4, 2026 15:17
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@djoshy
Contributor

djoshy commented May 4, 2026

/lgtm

Seems sane to me

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, isabella-janssen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [djoshy,isabella-janssen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@isabella-janssen
Member Author

/override ci/prow/e2e-gcp-op-ocl-part2

This test is newly required and still a bit flaky. Failures are unrelated to this fix.

@openshift-ci
Contributor

openshift-ci Bot commented May 4, 2026

@isabella-janssen: Overrode contexts on behalf of isabella-janssen: ci/prow/e2e-gcp-op-ocl-part2


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@isabella-janssen
Member Author

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op-part1
/test e2e-gcp-op-part2
/test e2e-gcp-op-single-node
/test e2e-hypershift

@isabella-janssen
Member Author

/cherrypick release-4.22 release-4.21 release-4.20 release-4.19 release-4.18

@openshift-cherrypick-robot

@isabella-janssen: once the present PR merges, I will cherry-pick it on top of release-4.22 in a new PR and assign it to you.


@ptalgulk01
Contributor

ptalgulk01 commented May 6, 2026

Pre-merge tested:

Environment Setup:
OCP version: 4.23.0-0-2026-05-06-044606-test-ci-ln-rkygpq2-latest
Platform: AWS

Steps:

  1. Create a custom MCP.
    Result:
cat custom-mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

oc label node ip-10-0-18-135.ec2.internal node-role.kubernetes.io/infra=
node/ip-10-0-18-135.ec2.internal labeled
  2. Apply an MC that sets kernel arguments.
    Result:
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=1000
    - nmi_watchdog=0
EOF
machineconfig.machineconfiguration.openshift.io/99-karg-test created
  3. Wait for the MCP update to complete and check that the changes are applied.
    Result:
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-80d079354410892d6cc2243091556875    True      False      False      1              1                   1                     0                      4m8s
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      44m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      44m

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host rpm-ostree kargs | grep rcutree
Starting pod/ip-10-0-18-135ec2internal-debug-vd97r ...
To use host binaries, run `chroot /host`
rw $ignition_firstboot  ostree=/ostree/boot.1/rhcos/fb0de0a9527eb6a0e87c6ae933a8ee7354a3a6a6c34c044222df3376b1418eb1/0 ignition.platform.id=aws console=tty0 console=ttyS0,115200n8 root=UUID=4f412f58-02cb-4afa-a82f-7f03e86e5d1b rw rootflags=prjquota boot=UUID=6742737b-f7bc-410d-94c2-618557c90905 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" rcutree.nohz_full_patience_delay=1000 nmi_watchdog=0

Removing debug pod ...
  4. Create config drift by deleting the karg directly on the node, then reboot the node.
    Result:
oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host rpm-ostree kargs --delete=rcutree.nohz_full_patience_delay=1000
Starting pod/ip-10-0-18-135ec2internal-debug-snn4p ...
To use host binaries, run `chroot /host`
Staging deployment...done
Changes queued for next boot. Run "systemctl reboot" to start a reboot

Removing debug pod ...

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host systemctl reboot
  5. Wait for the MCP to become degraded.
    Result:
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-80d079354410892d6cc2243091556875    False     True       True       1              0                   0                     1                      20m
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      60m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      60m

oc get mcp infra -o yaml
  - lastTransitionTime: "2026-05-06T06:03:47Z"
    message: 'Node ip-10-0-18-135.ec2.internal is reporting: "Node ip-10-0-18-135.ec2.internal
      upgrade failure. unexpected on-disk state validating against rendered-infra-80d079354410892d6cc2243091556875:
      missing expected kernel arguments: [rcutree.nohz_full_patience_delay=1000]",
      Node ip-10-0-18-135.ec2.internal is reporting: "unexpected on-disk state validating
      against rendered-infra-80d079354410892d6cc2243091556875: missing expected kernel
      arguments: [rcutree.nohz_full_patience_delay=1000]"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  6. Apply a second MC.
    Result:
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test2
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=2000
    - nmi_watchdog=0
EOF
machineconfig.machineconfiguration.openshift.io/99-karg-test2 created
  7. Point the failed node at the new rendered config and touch /run/machine-config-daemon-force.
    Result:
oc annotate node ip-10-0-18-135.ec2.internal machineconfiguration.openshift.io/desiredConfig=rendered-infra-85af3154972140056ffe4a55a5ee17ab --overwrite
node/ip-10-0-18-135.ec2.internal annotated

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host touch /run/machine-config-daemon-force
  8. Check that the MCP is no longer degraded and that the MCD logs contain no "error: No key 'rcutree.nohz_full_patience_delay' found".
    Result:
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-85af3154972140056ffe4a55a5ee17ab    True      False      False      1              1                   1                     0                      28m
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      68m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      68m

(Note: not to be added to Polarion.) Also tested using these test cases:

  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:42365][OTP] add real time kernel argument [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:42364][OTP] add selinux kernel argument [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67825][OTP] Use duplicated kernel arguments [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:72136][OTP] Reject MCs with ignition containing kernelArguments [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:53668][OTP] when FIPS and realtime kernel are both enabled node should NOT be degraded [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67787][OTP] switch kernel type to 64k-pages for clusters with arm64 nodes [Disruptive]
  [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67788][OTP] kernel type 64k-pages is not supported on non-arm64 nodes [Disruptive]

/label qe-approved
/verified by @ptalgulk01
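
The degraded condition seen after the reboot ("missing expected kernel arguments: [...]") amounts to comparing the expected kargs against the node's kernel command line. A minimal Go sketch of that comparison follows. It is an illustration only, not the MCD's actual validation code: missingKargs is a hypothetical helper, and treating the command line as whitespace-separated tokens is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// missingKargs returns the expected kargs that do not appear as whole
// tokens in the given kernel command line.
func missingKargs(cmdline string, expected []string) []string {
	present := make(map[string]bool)
	for _, tok := range strings.Fields(cmdline) {
		present[tok] = true
	}
	var missing []string
	for _, karg := range expected {
		if !present[karg] {
			missing = append(missing, karg)
		}
	}
	return missing
}

func main() {
	// Shortened command line of the drifted node: the rcutree karg is gone.
	cmdline := "console=tty0 rootflags=prjquota nmi_watchdog=0"
	expected := []string{"rcutree.nohz_full_patience_delay=1000", "nmi_watchdog=0"}

	if missing := missingKargs(cmdline, expected); len(missing) > 0 {
		fmt.Printf("missing expected kernel arguments: %v\n", missing)
	}
}
```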

@openshift-ci openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label May 6, 2026
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 6, 2026
@openshift-ci-robot
Contributor

@ptalgulk01: This PR has been marked as verified by @ptalgulk01.

Details

In response to this:

Pre-merge tested:

Environment Setup:
OCP version: 4.23.0-0-2026-05-06-044606-test-ci-ln-rkygpq2-latest
Platform: AWS

Steps:

  1. Create a custom MCP
cat custom-mcp.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
 name: infra
spec:
 machineConfigSelector:
   matchExpressions:
     - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
 nodeSelector:
   matchLabels:
     node-role.kubernetes.io/infra: ""
     
oc label node ip-10-0-18-135.ec2.internal  node-role.kubernetes.io/infra=
node/ip-10-0-18-135.ec2.internal labeled
  1. Apply kernel args based MC
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1                
kind: MachineConfig                                                                                                                 
metadata:                                                       
 labels:                                                                                                                           
   machineconfiguration.openshift.io/role: infra              
 name: 99-karg-test                                                                                                                
spec:
 kernelArguments:                                                                                                                  
   - rcutree.nohz_full_patience_delay=1000                     
   - nmi_watchdog=0
EOF
machineconfig.machineconfiguration.openshift.io/99-karg-test created
  1. wait for MCP update to complete and check changes are applied
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-80d079354410892d6cc2243091556875    True      False      False      1              1                   1                     0                      4m8s
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      44m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      44m

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host rpm-ostree kargs | grep rcutree
Starting pod/ip-10-0-18-135ec2internal-debug-vd97r ...
To use host binaries, run `chroot /host`
rw $ignition_firstboot  ostree=/ostree/boot.1/rhcos/fb0de0a9527eb6a0e87c6ae933a8ee7354a3a6a6c34c044222df3376b1418eb1/0 ignition.platform.id=aws console=tty0 console=ttyS0,115200n8 root=UUID=4f412f58-02cb-4afa-a82f-7f03e86e5d1b rw rootflags=prjquota boot=UUID=6742737b-f7bc-410d-94c2-618557c90905 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1="all" rcutree.nohz_full_patience_delay=1000 nmi_watchdog=0

Removing debug pod ...
  1. Create config_drift by adding changes into node and reboot the node
oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host rpm-ostree kargs --delete=rcutree.nohz_full_patience_delay=1000
Starting pod/ip-10-0-18-135ec2internal-debug-snn4p ...
To use host binaries, run `chroot /host`
Staging deployment...done
Changes queued for next boot. Run "systemctl reboot" to start a reboot

Removing debug pod ...

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host systemctl reboot
  1. Wait for MCP degrade
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-80d079354410892d6cc2243091556875    False     True       True       1              0                   0                     1                      20m
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      60m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      60m

oc get mcp infra -o yaml
 - lastTransitionTime: "2026-05-06T06:03:47Z"
   message: 'Node ip-10-0-18-135.ec2.internal is reporting: "Node ip-10-0-18-135.ec2.internal
     upgrade failure. unexpected on-disk state validating against rendered-infra-80d079354410892d6cc2243091556875:
     missing expected kernel arguments: [rcutree.nohz_full_patience_delay=1000]",
     Node ip-10-0-18-135.ec2.internal is reporting: "unexpected on-disk state validating
     against rendered-infra-80d079354410892d6cc2243091556875: missing expected kernel
     arguments: [rcutree.nohz_full_patience_delay=1000]"'
   reason: 1 nodes are reporting degraded status on sync
   status: "True"
   type: NodeDegraded
  1. Apply a second MC
oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test2
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=2000
    - nmi_watchdog=0
EOF
machineconfig.machineconfiguration.openshift.io/99-karg-test2 created
  1. Point the failed node at the new rendered config and touch /run/machine-config-daemon-force
oc annotate node ip-10-0-18-135.ec2.internal machineconfiguration.openshift.io/desiredConfig=rendered-infra-85af3154972140056ffe4a55a5ee17ab --overwrite
node/ip-10-0-18-135.ec2.internal annotated

oc debug node/ip-10-0-18-135.ec2.internal -- chroot /host touch /run/machine-config-daemon-force
  1. Check that the MCP is no longer degraded and that the MCD logs do not contain error: No key 'rcutree.nohz_full_patience_delay' found
oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-85af3154972140056ffe4a55a5ee17ab    True      False      False      1              1                   1                     0                      28m
master   rendered-master-f52b00f0c8d6f798e4d42e6d54bae44f   True      False      False      3              3                   3                     0                      68m
worker   rendered-worker-4d5670a93f584960a6b52d9ffb3d6d81   True      False      False      2              2                   2                     0                      68m
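
The log check in the last step could be scripted roughly as follows. This is only a sketch: the helper name `mcd_log_is_clean` is hypothetical, and the pattern assumes the error text matches the message quoted in this PR.

```shell
# Hypothetical helper: read log text on stdin and succeed only if it
# contains no rpm-ostree "No key '<karg>' found" error.
mcd_log_is_clean() {
  ! grep -q "No key '.*' found"
}

# Sample line copied from the error quoted in this PR.
sample="error: No key 'rcutree.nohz_full_patience_delay' found"
if printf '%s\n' "$sample" | mcd_log_is_clean; then
  echo "clean"
else
  echo "karg removal error present"
fi
```

Against a live cluster, the input would be the logs of the machine-config-daemon pod running on the affected node, piped into the helper; the pod name differs per node, so it is left out here.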

Also tested using these test cases:

 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:42365][OTP] add real time kernel argument [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:42364][OTP] add selinux kernel argument [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67825][OTP] Use duplicated kernel arguments [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:72136][OTP] Reject MCs with ignition containing kernelArguments [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:53668][OTP] when FIPS and realtime kernel are both enabled node should NOT be degraded [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67787][OTP] switch kernel type to 64k-pages for clusters with arm64 nodes [Disruptive]
 [sig-mco][Suite:openshift/machine-config-operator/longduration][Serial][Disruptive] MCO kernel [PolarionID:67788][OTP] kernel type 64k-pages is not supported on non-arm64 nodes [Disruptive]

/label qe-approved
/verified by @ptalgulk01

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

@isabella-janssen: all tests passed!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 48d0c9d into openshift:main May 6, 2026
17 checks passed
@openshift-ci-robot
Contributor

@isabella-janssen: Jira Issue Verification Checks: Jira Issue OCPBUGS-75894
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-75894 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

In response to this:

Closes: OCPBUGS-75894

Note: The changes in this PR and description were created with the assistance of Claude.

- What I did

The generateKargs() function used --delete= to remove old kernel arguments during reconciliation. When there is drift or a mismatch in the expected and actual kargs, which should typically be caught by our configuration drift monitor, but can occur in cases such as the one highlighted in OCPBUGS-75894, rpm-ostree fails the entire kargs transaction with "No key found", leaving the node degraded. This changes --delete= to --delete-if-present=, which allows us to skip missing kargs without erroring out.
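
To make the difference concrete, here is a minimal shell sketch of how the removal and addition flags might be assembled. It is not the MCO's Go implementation of generateKargs(): the helper name `build_kargs_flags` is hypothetical, and it assumes kargs contain no whitespace or glob characters. Only the `--delete-if-present=` and `--append=` flags themselves come from `rpm-ostree kargs`.

```shell
# Hypothetical helper: print one rpm-ostree kargs flag per line.
# $1: space-separated kargs to remove, $2: space-separated kargs to add.
build_kargs_flags() {
  for karg in $1; do
    # --delete-if-present skips a karg that is already gone, whereas
    # --delete fails the whole transaction when the key is missing.
    printf '%s=%s\n' "--delete-if-present" "$karg"
  done
  for karg in $2; do
    printf '%s=%s\n' "--append" "$karg"
  done
}

build_kargs_flags "rcutree.nohz_full_patience_delay=1000" \
                  "rcutree.nohz_full_patience_delay=2000"
# prints:
# --delete-if-present=rcutree.nohz_full_patience_delay=1000
# --append=rcutree.nohz_full_patience_delay=2000
```

With this shape, a karg that was already removed out of band no longer aborts the transaction, while the append side is untouched.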

- How to verify it

  1. Apply a MC to add custom kargs to nodes and wait for the change to be applied.
$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=1000
    - nmi_watchdog=0
EOF
  2. Confirm the karg is present on the node.
$ oc debug node/<node> -- chroot /host rpm-ostree kargs | grep rcutree
# Expected output includes `rcutree.nohz_full_patience_delay=1000`
  3. Create drift by removing the karg and rebooting the node.
$ oc debug node/<node> -- chroot /host rpm-ostree kargs --delete=rcutree.nohz_full_patience_delay=1000
$ oc debug node/<node> -- chroot /host systemctl reboot
  4. Wait for the node to become Ready. The MCD will detect the missing karg and mark the node degraded. This is expected.
  5. Apply another MC to create a kargs diff.
$ oc create -f - << EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: infra
  name: 99-karg-test2
spec:
  kernelArguments:
    - rcutree.nohz_full_patience_delay=2000
    - nmi_watchdog=0
EOF
  6. Point the node at the new rendered MC and touch the force file.
$ oc annotate node <node> machineconfiguration.openshift.io/desiredConfig=<new-mc> --overwrite
$ oc debug node/<node> -- chroot /host touch /run/machine-config-daemon-force
  7. Check the MCD logs. Before the fix the logs will include error: No key 'rcutree.nohz_full_patience_delay' found and the node degrades. After the fix, the update should succeed and the node should not degrade.
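
The drift that the MCD reports as "missing expected kernel arguments" in the steps above can be approximated by comparing the expected kargs with the kernel command line. The sketch below is illustrative only: `missing_kargs` is a hypothetical helper, not the MCD's validation code, and it assumes kargs contain no whitespace or glob characters.

```shell
# Hypothetical helper: print each expected karg that is absent from a
# kernel command line.
# $1: space-separated expected kargs
# $2: kernel command line (for example, the contents of /proc/cmdline)
missing_kargs() {
  for want in $1; do
    found=no
    for have in $2; do
      [ "$have" = "$want" ] && found=yes
    done
    [ "$found" = yes ] || printf '%s\n' "$want"
  done
}

cmdline='console=tty0 rootflags=prjquota nmi_watchdog=0'
missing_kargs 'rcutree.nohz_full_patience_delay=1000 nmi_watchdog=0' "$cmdline"
# prints: rcutree.nohz_full_patience_delay=1000
```

An empty result means every expected karg is present. A non-empty result corresponds to the drifted state in which the old `--delete=` behaviour would fail.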

- Description for the changelog
OCPBUGS-75894: use --delete-if-present for karg removal

Summary by CodeRabbit

  • Bug Fixes
    • Kernel argument removal during system updates is now drift-tolerant when the bootloader state has changed or is missing, reducing update failures in edge cases.
    • Addition of new kernel arguments is unaffected and continues to work as before.
    • Behavior validated by updated tests to ensure reliability.

@isabella-janssen isabella-janssen deleted the ocpbugs-75894 branch May 6, 2026 15:33
@openshift-cherrypick-robot

@isabella-janssen: new pull request created: #6007

In response to this:

/cherrypick release-4.22 release-4.21 release-4.20 release-4.19 release-4.18

@openshift-merge-robot
Contributor

Fix included in release 5.0.0-0.nightly-2026-05-07-054132

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR.
  • verified: Signifies that the PR passed pre-merge verification criteria.

6 participants