Skip to content

OCPBUGS-84814: Skip chrony-wait on boot 2 when time was recently synced#5990

Draft
sdodson wants to merge 1 commit into
openshift:mainfrom
sdodson:chrony-wait-skip-if-recently-synced
Draft

OCPBUGS-84814: Skip chrony-wait on boot 2 when time was recently synced#5990
sdodson wants to merge 1 commit into
openshift:mainfrom
sdodson:chrony-wait-skip-if-recently-synced

Conversation

@sdodson
Copy link
Copy Markdown
Member

@sdodson sdodson commented May 4, 2026

Summary

During node scale-up, chrony-wait.service blocks boot 2 for 8–14s on AWS (up to 24s on Azure) waiting for NTP synchronization. This wait is unnecessary when chronyd already synced time during boot 1 (MCD firstboot) and wrote the drift file on clean shutdown.

This PR adds a lightweight check that skips the blocking chrony-wait when the drift file proves time was recently synced:

  • chrony-drift-check.service — runs after chronyd, before chrony-wait. Checks if /var/lib/chrony/drift was modified less than 60 minutes ago. If so, creates /run/chrony-recently-synced (tmpfs, auto-cleaned on reboot).
  • Drop-in for chrony-wait.service — adds ConditionPathExists=!/run/chrony-recently-synced. When the flag exists, systemd skips the blocking wait entirely.

Why this approach

Why not modify chrony-wait's ExecStart directly?

chrony-wait.service runs in a strict systemd sandbox (ProtectSystem=strict, PrivateUsers=yes). A wrapper script replacing ExecStart runs under the unconfined_service_t SELinux context, which is denied access to /var/lib/chrony/drift (labeled chronyd_var_lib_t). The directory has 0750 chrony:chrony permissions, making it inaccessible even to root under the user namespace remapping.

Using a separate unsandboxed service avoids this entirely while leaving chrony-wait's security sandbox untouched.

Why not use ConditionPathExists on the drift file?

ConditionPathExists can only check existence, not file age. We need to distinguish between a drift file that was written minutes ago (safe to skip) and one that was written hours or days ago (machine was powered off, need fresh NTP sync).

Why is this safe?

  • chronyd reliably writes the drift file on clean shutdown during boot 1. Verified across 250+ scale-up runs on AWS and Azure — 100% success rate.
  • The drift file is read immediately at chronyd startup on boot 2 (within the same second), confirming it persists across the MCD reboot.
  • The flag is in /run/ (tmpfs), so it is automatically cleaned on every reboot — no stale state accumulates.
  • If the drift file is absent or stale (>1 hour), the check falls through and chrony-wait blocks normally.

Test results

Tested on AWS 4.20 with m6i.xlarge instances:

Before (baseline): chrony-wait blocks for 8–14s on boot 2

Starting Wait for chrony to synchronize system clock...
[8-14 seconds of chronyc waitsync retries]
Finished Wait for chrony to synchronize system clock.

After: chrony-wait skipped entirely

Starting Check if chrony drift file is recent...
chrony-drift-check: drift file age=24s
chrony-drift-check: created /run/chrony-recently-synced, chrony-wait will skip
Finished Check if chrony drift file is recent.
Wait for chrony to synchronize system clock was skipped because of an unmet condition check (ConditionPathExists=!/run/chrony-recently-synced).

Test plan

  • Verify fresh node scale-up skips chrony-wait on boot 2
  • Verify existing worker MCP rollout (reboot) skips chrony-wait
  • Verify initial cluster install still blocks on chrony-wait (no drift file exists on first-ever boot)
  • Verify node that was powered off for >1 hour blocks on chrony-wait (stale drift file)
  • Verify chronyc tracking shows clock is synchronized after boot completes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added automated chrony drift monitoring to detect recent time synchronization events
    • Implemented conditional service behavior that skips dependent services when synchronization is current

During node scale-up, chrony-wait.service blocks boot 2 for 8-14s (AWS)
or up to 24s (Azure) waiting for NTP synchronization. This is unnecessary
when chronyd already synced time during boot 1 (MCD firstboot) and wrote
the drift file on clean shutdown — which happens reliably in 100% of
250+ tested scale-up runs across AWS and Azure.

Add chrony-drift-check.service that runs after chronyd but before
chrony-wait. It checks whether /var/lib/chrony/drift was modified less
than 60 minutes ago. If so, it creates /run/chrony-recently-synced
(tmpfs, auto-cleaned on every reboot). A drop-in on chrony-wait.service
adds ConditionPathExists=!/run/chrony-recently-synced, causing systemd
to skip the blocking wait entirely when the flag is present.

This approach uses a separate service rather than modifying
chrony-wait.service's ExecStart because chrony-wait runs in a strict
systemd sandbox (ProtectSystem=strict, PrivateUsers=yes) which causes
SELinux to deny access to /var/lib/chrony/drift from the
unconfined_service_t context that a wrapper script would run under. The
separate unsandboxed service avoids this entirely while keeping
chrony-wait's own security sandbox untouched.

On a fresh scale-up where the machine was powered off for more than an
hour (or on initial install where no drift file exists), the check falls
through and chrony-wait blocks normally, ensuring correct time before
kubelet starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 4, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 4, 2026

Walkthrough

A new chrony synchronization detection mechanism is introduced: a check script monitors drift file age and creates a marker when recently synced; a systemd service runs the script with proper ordering; a drop-in config skips chrony-wait if that marker exists.

Changes

Chrony Sync Detection

Layer / File(s) Summary
Script Implementation
templates/common/_base/files/chrony-drift-check.yaml
Bash script installed to /usr/local/bin/chrony-drift-check.sh checks if /var/lib/chrony/drift is less than 3600 seconds old; if so, creates /run/chrony-recently-synced marker and exits successfully.
Service Orchestration
templates/common/_base/units/chrony-drift-check.service.yaml
Systemd oneshot service executes the drift-check script, ordered to run after chronyd.service and before chrony-wait.service, installed to multi-user.target.
Conditional Skip Integration
templates/common/_base/units/chrony-wait.service.yaml
Drop-in config 10-skip-if-recently-synced.conf adds ConditionPathExists=!/run/chrony-recently-synced to skip chrony-wait.service when the marker file exists.

Sequence Diagram

sequenceDiagram
    actor System
    participant chronyd as chronyd.service
    participant check as chrony-drift-check.service
    participant script as drift-check.sh
    participant chrony_wait as chrony-wait.service

    System->>chronyd: Start
    activate chronyd
    chronyd->>check: (After chronyd starts)
    check->>script: Execute /usr/local/bin/chrony-drift-check.sh
    activate script
    script->>script: Check /var/lib/chrony/drift age
    alt Drift file < 3600 seconds old
        script->>script: Touch /run/chrony-recently-synced
    end
    script-->>check: Exit 0
    deactivate script
    check-->>System: Service complete (RemainAfterExit=yes)
    
    rect rgba(200, 150, 100, 0.5)
    Note over chrony_wait: Now ready to start<br/>(Before chrony-drift-check)
    end
    
    chrony_wait->>chrony_wait: Check ConditionPathExists=/run/chrony-recently-synced
    alt Marker exists
        chrony_wait-->>System: Skip execution
    else Marker absent
        chrony_wait->>chrony_wait: Run normally
    end
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Pull request contains only YAML template files for systemd services and scripts, with no Ginkgo test files or test definitions added.
Test Structure And Quality ✅ Passed The PR contains only YAML template files and no Ginkgo test code, making the test quality custom check not applicable.
Microshift Test Compatibility ✅ Passed This pull request does not add any Ginkgo e2e tests. The changes consist entirely of three YAML template files for systemd services related to chrony time synchronization, not test definitions.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR adds systemd service unit templates and shell scripts for chrony synchronization, not Ginkgo e2e tests. SNO compatibility check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed Pull request adds systemd service unit templates and bash script for chrony drift checking. These are host-level systemd service definitions with no Kubernetes scheduling constructs, pod affinity rules, node selectors, or topology-dependent logic.
Ote Binary Stdout Contract ✅ Passed The OTE Binary Stdout Contract check is not applicable to this PR. The PR exclusively adds three YAML template files and an embedded Bash script for chrony drift file checks. No Go source code was modified, and the shell script does not write to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds YAML templates and bash script only; no Ginkgo e2e tests with IPv4 assumptions or external connectivity requirements are introduced.
Title check ✅ Passed The title accurately reflects the main changes: introducing a mechanism to skip chrony-wait.service when time was recently synced, which is the core objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 4, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sdodson
Copy link
Copy Markdown
Member Author

sdodson commented May 4, 2026

/test e2e-aws-ovn

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sdodson
Once this PR has been reviewed and has the lgtm label, please assign pablintino for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson sdodson changed the title Skip chrony-wait on boot 2 when time was recently synced OCPBUGS-84814: Skip chrony-wait on boot 2 when time was recently synced May 4, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 4, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@sdodson: This pull request references Jira Issue OCPBUGS-84814, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

During node scale-up, chrony-wait.service blocks boot 2 for 8–14s on AWS (up to 24s on Azure) waiting for NTP synchronization. This wait is unnecessary when chronyd already synced time during boot 1 (MCD firstboot) and wrote the drift file on clean shutdown.

This PR adds a lightweight check that skips the blocking chrony-wait when the drift file proves time was recently synced:

  • chrony-drift-check.service — runs after chronyd, before chrony-wait. Checks if /var/lib/chrony/drift was modified less than 60 minutes ago. If so, creates /run/chrony-recently-synced (tmpfs, auto-cleaned on reboot).
  • Drop-in for chrony-wait.service — adds ConditionPathExists=!/run/chrony-recently-synced. When the flag exists, systemd skips the blocking wait entirely.

Why this approach

Why not modify chrony-wait's ExecStart directly?

chrony-wait.service runs in a strict systemd sandbox (ProtectSystem=strict, PrivateUsers=yes). A wrapper script replacing ExecStart runs under the unconfined_service_t SELinux context, which is denied access to /var/lib/chrony/drift (labeled chronyd_var_lib_t). The directory has 0750 chrony:chrony permissions, making it inaccessible even to root under the user namespace remapping.

Using a separate unsandboxed service avoids this entirely while leaving chrony-wait's security sandbox untouched.

Why not use ConditionPathExists on the drift file?

ConditionPathExists can only check existence, not file age. We need to distinguish between a drift file that was written minutes ago (safe to skip) and one that was written hours or days ago (machine was powered off, need fresh NTP sync).

Why is this safe?

  • chronyd reliably writes the drift file on clean shutdown during boot 1. Verified across 250+ scale-up runs on AWS and Azure — 100% success rate.
  • The drift file is read immediately at chronyd startup on boot 2 (within the same second), confirming it persists across the MCD reboot.
  • The flag is in /run/ (tmpfs), so it is automatically cleaned on every reboot — no stale state accumulates.
  • If the drift file is absent or stale (>1 hour), the check falls through and chrony-wait blocks normally.

Test results

Tested on AWS 4.20 with m6i.xlarge instances:

Before (baseline): chrony-wait blocks for 8–14s on boot 2

Starting Wait for chrony to synchronize system clock...
[8-14 seconds of chronyc waitsync retries]
Finished Wait for chrony to synchronize system clock.

After: chrony-wait skipped entirely

Starting Check if chrony drift file is recent...
chrony-drift-check: drift file age=24s
chrony-drift-check: created /run/chrony-recently-synced, chrony-wait will skip
Finished Check if chrony drift file is recent.
Wait for chrony to synchronize system clock was skipped because of an unmet condition check (ConditionPathExists=!/run/chrony-recently-synced).

Test plan

  • Verify fresh node scale-up skips chrony-wait on boot 2
  • Verify existing worker MCP rollout (reboot) skips chrony-wait
  • Verify initial cluster install still blocks on chrony-wait (no drift file exists on first-ever boot)
  • Verify node that was powered off for >1 hour blocks on chrony-wait (stale drift file)
  • Verify chronyc tracking shows clock is synchronized after boot completes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
  • Added automated chrony drift monitoring to detect recent time synchronization events
  • Implemented conditional service behavior that skips dependent services when synchronization is current

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 4, 2026

@sdodson: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants