
[debian/rules]: atomic symlink replacement for /lib/modules/<KVER>/{build,source}#39

Open
bhouse-nexthop wants to merge 1 commit into sonic-net:sdk-6.5.35-dnx from bhouse-nexthop:bhouse.fix-symlink-replace-race

@bhouse-nexthop bhouse-nexthop commented May 23, 2026

Summary

Replace the racy rm + ln -s pair in debian/rules with ln -sfn to make the /lib/modules/$(KVER_ARCH)/{build,source} symlink replacement atomic.

Why this matters

The current pattern at debian/rules:92-95 (lines from sdk-6.5.35-dnx):

cd /; sudo rm /lib/modules/$(KVER_ARCH)/build
cd /; sudo rm /lib/modules/$(KVER_ARCH)/source
cd /; sudo ln -s /usr/src/linux-headers-$(KVER_COMMON)/ /lib/modules/$(KVER_ARCH)/source
cd /; sudo ln -s /usr/src/linux-headers-$(KVER_ARCH)/ /lib/modules/$(KVER_ARCH)/build

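leaves a window of tens to hundreds of milliseconds, between the rm and the ln -s, in which /lib/modules/$(KVER_ARCH)/build and /lib/modules/$(KVER_ARCH)/source do not exist on disk. With recent GNU coreutils, ln -sfn creates the new symlink under a hidden temporary name and then moves it over the existing one with a single rename(2), so the target name is never absent from a concurrent observer's point of view. The -n flag keeps ln from following the existing symlink into the headers directory and creating the new link inside it.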
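The difference between the two forms can be sketched by hand in a scratch directory. This is a minimal illustration, not the recipe itself: the directory names `old-headers` and `new-headers` and the temporary name `build.tmp` are hypothetical, and `mv -T` is assumed to be the GNU coreutils `mv`, which renames the temporary link over the old one.

```shell
#!/bin/sh
# Sketch with hypothetical scratch paths; not the recipe's real paths.
set -eu
work=$(mktemp -d)
mkdir "$work/old-headers" "$work/new-headers"
ln -s "$work/old-headers" "$work/build"

# Two-step form: after rm and before ln -s, the name does not exist.
rm "$work/build"
if [ -e "$work/build" ] || [ -L "$work/build" ]; then gap=no; else gap=yes; fi
ln -s "$work/old-headers" "$work/build"

# Rename-based form: create the new link under a temporary name, then
# rename it over the old one, so "build" is never missing in between.
ln -s "$work/new-headers" "$work/build.tmp"
mv -T "$work/build.tmp" "$work/build"   # -T is a GNU coreutils option

result=$(readlink "$work/build")
expected="$work/new-headers"
echo "gap_in_two_step=$gap"
echo "build -> $result"
rm -rf "$work"
```

The explicit temp-and-rename pair is only there to make the mechanism visible; in the recipe, `ln -sfn` is the shorter way to ask for the same replacement.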

Why this is observable

This debian/rules runs inside sonic-buildimage as part of the saibcm-modules-dnx submodule. sonic-buildimage drives parallel package builds (SONIC_BUILD_JOBS=N), and many sibling platform-modules-* recipes (Dell, Delta, Ingrasys, Inventec, Mitac, Accton, Centec, Marvell-Teralynx, Nephos, etc.) iterate over their MODULE_DIRS calling

make -C /lib/modules/$(KVER_ARCH)/build M=.../<mod>/modules clean

per platform. If a single iteration lands in the window between this recipe's rm and ln -s, GNU make's chdir(/lib/modules/.../build) returns ENOENT and it aborts with

make[3]: *** /lib/modules/<KVER>/build: No such file or directory.  Stop.

failing the whole sibling recipe. Observed during a sonic-buildimage Broadcom build (SONIC_BUILD_JOBS=12), where the Dell platform-modules recipe's post-build clean iteration succeeded for the first six platforms (s6000 .. s5224f) and failed on the seventh (s5232f):

make[3]: *** /lib/modules/6.12.41+deb13-sonic-amd64/build: No such file or directory.  Stop.
make[2]: *** [debian/rules:101: override_dh_auto_clean] Error 2
make[1]: *** [Makefile:22: /sonic/target/debs/trixie/platform-modules-z9100_1.1_amd64.deb] Error 2
make: *** [slave.mk:915: target/debs/trixie/platform-modules-z9100_1.1_amd64.deb] Error 1

That is exactly the kind of stochastic window-strike a non-atomic symlink replacement produces: the first six iterations pass because the window happened to be quiet, then one iteration lands during a saibcm-modules-dnx (or opennsl-modules-dnx) build-arch-stamp re-link.
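A concurrent observer can be approximated in a scratch directory to check that a rename-based replacement never exposes a missing path. The sketch below is an assumption-laden stand-in for the real build: the reader loop plays the role of a sibling recipe's `make -C`, the names `headers-a`, `headers-b`, `build.tmp`, and the `done` sentinel are hypothetical, and `mv -T` is assumed to be GNU coreutils.

```shell
#!/bin/sh
# Sketch: a reader polls a symlink while a writer keeps replacing it via
# temp-symlink + rename. All paths are hypothetical scratch paths.
set -u
work=$(mktemp -d)
mkdir "$work/headers-a" "$work/headers-b"
ln -s "$work/headers-a" "$work/build"

# Writer: flip the link back and forth, then drop a sentinel file.
{
  i=0
  while [ "$i" -lt 200 ]; do
    for t in headers-b headers-a; do
      ln -s "$work/$t" "$work/build.tmp" &&
        mv -T "$work/build.tmp" "$work/build"
    done
    i=$((i + 1))
  done
  : > "$work/done"
} &
writer=$!

# Reader: count polls where the link does not resolve to a directory.
misses=0
while [ ! -e "$work/done" ]; do
  [ -d "$work/build" ] || misses=$((misses + 1))
done
wait "$writer"

final=$(readlink "$work/build")
expected="$work/headers-a"
echo "misses=$misses"
rm -rf "$work"
```

With the rename-based writer the reader should report no misses. Swapping the writer body for `rm` followed by `ln -s` would be expected to produce occasional misses, with the count depending on timing.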

Cross-reference: non-DNX variant already has this fix

The sibling saibcm-modules/debian/rules (non-DNX) already uses ln -sfn:

# saibcm-modules/debian/rules, lines 92-93:
cd /; sudo ln -sfn /usr/src/linux-headers-$(KVER_COMMON)/ /lib/modules/$(KVER_ARCH)/source
cd /; sudo ln -sfn /usr/src/linux-headers-$(KVER_ARCH)/ /lib/modules/$(KVER_ARCH)/build

So this is a delta the DNX branches missed when copy-pasting the recipe. The fix is behaviorally identical when no concurrent observer is racing, and race-free when one is.

Affected branches

The same rm + ln -s pattern is present in (at least):

  • sdk-6.5.35-dnx (this PR's base)
  • sdk-6.5.34-dnx-gpl
  • sdk-6.5.32-dnx-gpl-trixie
  • sdk-6.5.32-dnx-gpl
  • likely older DNX branches as well

This PR targets sdk-6.5.35-dnx since the buildimage submodule currently floats commits forward through there. The patch is trivial to cherry-pick to the other active DNX branches; happy to file follow-ups if maintainers prefer.
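To check whether another branch still carries the two-step pattern, a small helper along these lines may be enough. It is a hypothetical sketch: the function name `has_racy_relink` is made up for illustration, and the usage comment assumes a remote called `origin` and that the file lives at `debian/rules` on each branch.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given rules file still removes the
# build or source link with a separate rm before re-linking.
has_racy_relink() {
  grep -Eq 'rm[[:space:]]+/lib/modules/\$\(KVER_ARCH\)/(build|source)' "$1"
}

# Example usage (remote name and file location are assumptions):
#   git show origin/sdk-6.5.34-dnx-gpl:debian/rules > /tmp/rules
#   has_racy_relink /tmp/rules && echo "still uses rm + ln -s"
```

The pattern only matches the literal `$(KVER_ARCH)` spelling used in the recipe quoted above, so a branch that spells the path differently would need its own check.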

Test plan

  • Confirmed that the sibling recipe saibcm-modules/debian/rules already uses ln -sfn; behavior is otherwise identical.
  • Reproduced the failure pre-patch under a SONIC_BUILD_JOBS=12 Broadcom build, and confirmed that the non-atomic symlink replacement is the only difference at fault.

…uild,source}

Replace the racy `rm` + `ln -s` pair with `ln -sfn`. The two-step
form leaves a window of tens-to-hundreds of milliseconds where
/lib/modules/$(KVER_ARCH)/build and /lib/modules/$(KVER_ARCH)/source
do not exist on disk. With recent GNU coreutils, `ln -sfn` creates
the new symlink under a hidden temp name and rename(2)s it over the
old one, so the target name is never absent from a concurrent
observer's point of view.

This race fails sibling platform-modules-* recipes in
sonic-buildimage that build concurrently. Each of those recipes
iterates over its MODULE_DIRS calling
  make -C /lib/modules/$(KVER_ARCH)/build M=... clean
per platform. If a single iteration lands in the window between
this recipe's `rm` and `ln -s`, GNU make's chdir fails with
"No such file or directory" on the kernel build dir, aborting the
whole recipe. Observed in a sonic-buildimage Broadcom build:

    make[3]: *** /lib/modules/6.12.41+deb13-sonic-amd64/build: \
        No such file or directory.  Stop.
    [...]
    make: *** [slave.mk:915: target/debs/trixie/\
        platform-modules-z9100_1.1_amd64.deb] Error 1

The dell platform-modules iteration succeeded for the first six
platforms (s6000 .. s5224f) and failed on the seventh (s5232f),
exactly the kind of stochastic window-strike that a non-atomic
symlink replacement produces.

The non-DNX variant `saibcm-modules/debian/rules` already uses
this idiom (lines 92-93 at HEAD of sdk-6.5.32-gpl):

    cd /; sudo ln -sfn /usr/src/linux-headers-$(KVER_COMMON)/ \
        /lib/modules/$(KVER_ARCH)/source
    cd /; sudo ln -sfn /usr/src/linux-headers-$(KVER_ARCH)/ \
        /lib/modules/$(KVER_ARCH)/build

so this is a delta the DNX branches missed. Behaviorally identical
when no concurrent observer is racing; race-free when one is.

Signed-off-by: Brad House <bhouse@nexthop.ai>
@mssonicbld

/azp run

@azure-pipelines

No pipelines are associated with this pull request.
