OCPBUGS-84814: Skip chrony-wait on boot 2 when time was recently synced #5990
sdodson wants to merge 1 commit into
During node scale-up, chrony-wait.service blocks boot 2 for 8-14s (AWS) or up to 24s (Azure) waiting for NTP synchronization. This is unnecessary when chronyd already synced time during boot 1 (MCD firstboot) and wrote the drift file on clean shutdown, which happened in 100% of 250+ tested scale-up runs across AWS and Azure.
Add chrony-drift-check.service that runs after chronyd but before chrony-wait. It checks whether /var/lib/chrony/drift was modified less than 60 minutes ago. If so, it creates /run/chrony-recently-synced (tmpfs, auto-cleaned on every reboot). A drop-in on chrony-wait.service adds ConditionPathExists=!/run/chrony-recently-synced, causing systemd to skip the blocking wait entirely when the flag is present.
This approach uses a separate service rather than modifying chrony-wait.service's ExecStart because chrony-wait runs in a strict systemd sandbox (ProtectSystem=strict, PrivateUsers=yes), which causes SELinux to deny access to /var/lib/chrony/drift from the unconfined_service_t context that a wrapper script would run under. The separate unsandboxed service avoids this entirely while keeping chrony-wait's own security sandbox untouched.
On a fresh scale-up where the machine was powered off for more than an hour (or on initial install where no drift file exists), the check falls through and chrony-wait blocks normally, ensuring correct time before kubelet starts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
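The age check described in the commit message could look roughly like the sketch below. It is not the script from the PR: the function name check_drift and its parameters are hypothetical, it assumes GNU stat, and treating a drift file with a future mtime as "not recent" is an added assumption rather than something the PR states.

```shell
#!/bin/sh
# Sketch only: check_drift and its arguments are hypothetical names.
#
# check_drift DRIFT_FILE FLAG_FILE [MAX_AGE_SECONDS]
# Creates FLAG_FILE when DRIFT_FILE exists and was modified less than
# MAX_AGE_SECONDS ago. Always returns 0 so a failed check never fails boot;
# the absence of the flag simply lets chrony-wait run as usual.
check_drift() {
    local drift_file="$1" flag_file="$2" max_age="${3:-3600}"
    local now mtime age

    # No drift file (for example on initial install): leave the flag absent.
    [ -f "$drift_file" ] || return 0

    now=$(date +%s)
    # GNU stat: %Y prints the modification time in seconds since the epoch.
    mtime=$(stat -c %Y "$drift_file") || return 0
    age=$((now - mtime))

    # Assumption: a negative age (mtime in the future) is treated as
    # untrustworthy, so the blocking wait is kept in that case.
    if [ "$age" -ge 0 ] && [ "$age" -lt "$max_age" ]; then
        touch "$flag_file"
    fi
    return 0
}

# With the paths named in the PR, the call would be:
# check_drift /var/lib/chrony/drift /run/chrony-recently-synced 3600
```

Passing the paths as arguments keeps the function easy to exercise against temporary files without touching /var/lib/chrony or /run.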
Walkthrough
A new chrony synchronization detection mechanism is introduced: a check script inspects the drift file's age and creates a marker when time was recently synced; a systemd service runs the script with the required ordering; a drop-in config skips chrony-wait if that marker exists.
Changes
Chrony Sync Detection
Sequence Diagram
sequenceDiagram
actor System
participant chronyd as chronyd.service
participant check as chrony-drift-check.service
participant script as drift-check.sh
participant chrony_wait as chrony-wait.service
System->>chronyd: Start
activate chronyd
chronyd->>check: (After chronyd starts)
check->>script: Execute /usr/local/bin/chrony-drift-check.sh
activate script
script->>script: Check /var/lib/chrony/drift age
alt Drift file < 3600 seconds old
script->>script: Touch /run/chrony-recently-synced
end
script-->>check: Exit 0
deactivate script
check-->>System: Service complete (RemainAfterExit=yes)
rect rgba(200, 150, 100, 0.5)
Note over chrony_wait: Now ready to start<br/>(After chrony-drift-check)
end
chrony_wait->>chrony_wait: Check ConditionPathExists=!/run/chrony-recently-synced
alt Marker exists
chrony_wait-->>System: Skip execution
else Marker absent
chrony_wait->>chrony_wait: Run normally
end
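The ordering and the negated condition shown in the diagram could be expressed with unit fragments along the lines of this sketch. The helper write_units, the drop-in file name 10-skip-if-synced.conf, Type=oneshot, and the WantedBy target are assumptions made for illustration; the After/Before ordering, RemainAfterExit=yes, the ExecStart path, and the ConditionPathExists line are taken from the PR text.

```shell
#!/bin/sh
# Sketch only: write_units is a hypothetical helper that writes example
# unit fragments into UNIT_DIR so they can be inspected or copied.
write_units() {
    unit_dir="$1"
    mkdir -p "$unit_dir/chrony-wait.service.d"

    # Service that runs the drift check between chronyd and chrony-wait.
    cat > "$unit_dir/chrony-drift-check.service" <<'EOF'
[Unit]
Description=Flag a recently written chrony drift file
After=chronyd.service
Before=chrony-wait.service

[Service]
# Assumption: oneshot is the usual pairing with RemainAfterExit=yes.
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/chrony-drift-check.sh

[Install]
# Assumption: the PR text does not name the install target.
WantedBy=multi-user.target
EOF

    # Drop-in: the leading "!" negates the check, so chrony-wait is
    # skipped when the marker exists and runs when it is absent.
    cat > "$unit_dir/chrony-wait.service.d/10-skip-if-synced.conf" <<'EOF'
[Unit]
ConditionPathExists=!/run/chrony-recently-synced
EOF
}
```

Because a failed Condition* check makes systemd skip the unit rather than mark it failed, units ordered after chrony-wait.service should still start when the marker is present.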
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ✅ 12 passed checks
Skipping CI for Draft Pull Request.
/test e2e-aws-ovn
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: sdodson
@sdodson: This pull request references Jira Issue OCPBUGS-84814, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug.
The bug has been updated to refer to the pull request using the external bug tracker.
@sdodson: all tests passed!
Summary
During node scale-up, chrony-wait.service blocks boot 2 for 8–14s on AWS (up to 24s on Azure) waiting for NTP synchronization. This wait is unnecessary when chronyd already synced time during boot 1 (MCD firstboot) and wrote the drift file on clean shutdown.
This PR adds a lightweight check that skips the blocking chrony-wait when the drift file indicates time was recently synced:
- chrony-drift-check.service: runs after chronyd, before chrony-wait. Checks whether /var/lib/chrony/drift was modified less than 60 minutes ago. If so, creates /run/chrony-recently-synced (tmpfs, auto-cleaned on reboot).
- Drop-in on chrony-wait.service: adds ConditionPathExists=!/run/chrony-recently-synced. When the flag exists, systemd skips the blocking wait entirely.
Why this approach
Why not modify chrony-wait's ExecStart directly?
chrony-wait.service runs in a strict systemd sandbox (ProtectSystem=strict, PrivateUsers=yes). A wrapper script replacing ExecStart runs under the unconfined_service_t SELinux context, which is denied access to /var/lib/chrony/drift (labeled chronyd_var_lib_t). The directory has 0750 chrony:chrony permissions, making it inaccessible even to root under the user namespace remapping.
Using a separate unsandboxed service avoids this entirely while leaving chrony-wait's security sandbox untouched.
Why not use ConditionPathExists on the drift file?
ConditionPathExists can only check existence, not file age. We need to distinguish between a drift file that was written minutes ago (safe to skip) and one that was written hours or days ago (machine was powered off, need fresh NTP sync).
Why is this safe?
- chronyd reliably writes the drift file on clean shutdown during boot 1. Verified across 250+ scale-up runs on AWS and Azure, with a 100% success rate.
- […] chronyd startup on boot 2 (within the same second), confirming it persists across the MCD reboot.
- The flag file lives in /run/ (tmpfs), so it is automatically cleaned on every reboot; no stale state accumulates.
- If the drift file is missing or older than 60 minutes, chrony-wait blocks normally.
Test results
Tested on AWS 4.20 with m6i.xlarge instances:
Before (baseline): chrony-wait blocks for 8–14s on boot 2
After: chrony-wait skipped entirely
Test plan
- chronyc tracking shows clock is synchronized after boot completes
🤖 Generated with Claude Code
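The chronyc tracking item in the test plan could be scripted with something like the sketch below. The helper name is_synchronized is hypothetical; the sketch relies on the "Leap status" field of chronyc tracking output, which reads "Not synchronised" while the clock is unsynchronized.

```shell
#!/bin/sh
# Sketch only: is_synchronized is a hypothetical helper.
# Reads `chronyc tracking` output on stdin and succeeds when a
# "Leap status" line is present and is not "Not synchronised".
is_synchronized() {
    awk -F': *' '
        /^Leap status/ {
            sub(/[ \t]+$/, "", $2)   # drop trailing whitespace
            status = $2
        }
        END {
            if (status != "" && status != "Not synchronised") exit 0
            exit 1
        }
    '
}

# Usage on a booted node:
# chronyc tracking | is_synchronized && echo "clock synchronized"
```

Reading from stdin rather than invoking chronyc inside the function keeps the check usable against captured output from CI artifacts.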