fix(satellite): mount /run/udev so libzfs sees partition uevents (Bug 359) #12

Merged: kvaps merged 3 commits into main from fix/zfs-ps-cdp-mount-propagation on May 22, 2026

kvaps (Member) commented May 22, 2026

Summary

Real root cause of Bug 359 (linstor ps cdp zfs SUCCESS but SP stays at State=Error with pool backing storage missing) turned out to be the missing /run/udev mount in blockstor's satellite Pod. Validated empirically on e2e3-worker-1 (see the "Live diagnostic" comment for the side-by-side strace + mountinfo + udev DB capture).

libzfs's zpool_label_disk_wait() polls /run/udev/data/b<MAJ>:<MIN> to confirm the host's udev daemon finished processing the partition uevent. With no /run/udev mount in the container, libudev reports the partition as "not initialized" and libzfs times out → failed to detect device partitions on '/dev/sda1': 19 → SP rolled back. The /dev/sda1 inode itself was visible all along (devtmpfs 0:6 is shared via the hostPath bind).

PR #11's nsenter "worked" for the same reason — it ran zpool in PID 1's mount namespace which has /run/udev available. The piraeus DaemonSet ships /run/udev as a ro bind and gets the same outcome without nsenter.

Commits

  1. e57912c44: hostPath: {path: /dev, type: Directory} + drop mountPropagation: HostToContainer
    Necessary but not sufficient. Without type: Directory kubelet's hostPath validation is lenient and downstream behaviour gets fragile; without dropping mountPropagation we still inherit the Bug 346 attempt that misdiagnosed this race. Mirrors piraeus's satellite verbatim.

  2. 77235179d: zpool create -m none
    Independent Talos fix: after the pool is stamped + imported, zpool create tries mkdir /<pool> for the implicit mountpoint. Talos rootfs is RO outside a small allowlist → EROFS → zpool create exits non-zero → SP rolled back even though the pool exists on disk. blockstor uses zfs create -V zvols only, so the pool mountpoint is never load-bearing.

  3. 925d3cd4a: /run/udev ro mount + hostIPC: true
    The actual fix. /run/udev makes libzfs's libudev poll see the host udevd's partition metadata. hostIPC: true mirrors piraeus for LVM userland coordination (lvmlockd uses host-wide IPC).
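As a rough illustration of commit 2, the argument list for the pool-creation call could be assembled as below. This is a sketch only: `zpoolCreateArgs` is a hypothetical helper, not the actual `attachZFS` code in pkg/satellite/attach.go, and the flags are simply copied from the `zpool create` command line quoted in this PR.

```go
package main

import "fmt"

// zpoolCreateArgs is a hypothetical helper (not the real attachZFS code).
// It builds the argv for `zpool create` with `-m none`, so zpool does not
// try to mkdir /<pool> on a read-only rootfs. The `-f` flag stays first,
// matching the prefix the PR says the existing unit test pins.
func zpoolCreateArgs(pool string, devices ...string) []string {
	args := []string{
		"create", "-f",
		"-m", "none", // no mountpoint for the root dataset
		"-O", "compression=off",
		"-O", "atime=off",
		pool,
	}
	return append(args, devices...)
}

func main() {
	// The result would be passed to exec.Command("zpool", args...).
	fmt.Println(zpoolCreateArgs("data", "/dev/sda"))
}
```

Keeping the flags in one slice makes it straightforward to assert on the full command line in a unit test rather than only on the prefix.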

Why not nsenter?

PR #11 wrapped zpool create/add in nsenter -t 1 -m -- to hop into PID 1's mount namespace. That fixed the symptom but required:

  • Adding nsenter to the satellite image
  • A Go wrapper (runHostZpool) + unit tests pinning the nsenter prefix
  • Surface area review for every zpool subcommand

The piraeus approach is the same outcome with three YAML lines and no Go change. The right diagnosis is "libudev can't see host udev DB", not "wrong mount namespace".

Test plan (verified on e2e3 stand)

  • Empirical diff vs piraeus satellite on the same node (see comment).
  • Patched DaemonSet with only /run/udev mount + -m none → zpool create /dev/sda exits 0, pool ONLINE.
  • Full e2e validation on the rebuilt blockstor satellite image after merge.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed ZFS pool creation failures in read-only filesystem environments.
  • Configuration

    • Enhanced satellite daemon pod configuration with improved device and IPC access capabilities for better hardware detection and management.

kvaps and others added 2 commits May 22, 2026 15:19
`linstor ps cdp zfs` returned SUCCESS but the resulting StoragePool
stayed at `State=Error` with `pool backing storage missing`. Reproduced
on e2e3 stand: satellite-side `zpool create` failed deterministically:

  zpool create -f -O compression=off -O atime=off data /dev/sda:
  cannot label 'sda': failed to detect device partitions on
  '/dev/sda1': 19  (ENODEV)

Root cause: kubelet hands every privileged container its own private
devtmpfs instance for /dev. zpool create stamps the GPT on /dev/sda
(kernel creates sda1 + sda9 on the host's devtmpfs), then libzfs
immediately open()s /dev/sda1 to write the ZFS label — the inode is
not in the container's devtmpfs yet, open() returns ENODEV, the pool
is left half-stamped.

Bug 346 attempted `mountPropagation: HostToContainer` to slave-mirror
host /dev events into the container. That didn't help: rslave updates
mount events, not devtmpfs inode visibility for a freshly-mknod'd
partition node — and kubelet still allocated a separate devtmpfs.

Fix mirrors piraeus's satellite DaemonSet: declare the volume as a
plain `hostPath: {path: /dev, type: Directory}` and mount it without
mountPropagation. With `type: Directory` kubelet bind-mounts the
host's devtmpfs directory directly into the container — same inode
table, same partition nodes visible immediately after mknod, no
slave-mirror games. Verified against piraeus's working satellite on
the same Talos layout (dev5 cluster).

No Go-code changes are needed for the mount race; the satellite's
exec stays in the container. Bug 359 also surfaces a separate Talos
read-only-rootfs issue with `zpool create`'s implicit mkdir — fixed
in the follow-up commit (`-m none`).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Followup to the Bug 359 mount fix. `zpool create` tries to mkdir
/<pool> as a mountpoint when the new pool is imported. On Talos the
host rootfs is read-only outside of a small writable allowlist —
mkdir fails with EROFS, `zpool create` returns non-zero, blockstor
rolls back the SP CRD even though the pool is already on disk +
imported. The next reconcile finds the existing pool and bails with
EEXIST, leaving the SP perpetually missing.

blockstor uses `zfs create -V` (zvol) datasets only — the root pool
mountpoint is never load-bearing. `-m none` tells zpool not to
allocate a mountpoint at all, sidestepping the EROFS without losing
any function.

Test prefix assertion in pkg/satellite/attach_test.go is unchanged
(it pins `zpool create -f` only).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai (Bot) commented May 22, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 4fd4159 and 925d3cd.

📒 Files selected for processing (2)
  • pkg/satellite/attach.go
  • stand/blockstor-satellite-daemonset.yaml

Walkthrough

The PR updates the blockstor-satellite codebase to support ZFS pool creation on read-only Talos rootfs by suppressing automatic mountpoint creation in the zpool create command, and reconfigures the satellite DaemonSet pod to provide IPC namespace access and proper device/udev integration.

Changes

ZFS Pool Creation and Pod Configuration

| Layer / File(s) | Summary |
| --- | --- |
| ZFS pool mountpoint suppression (pkg/satellite/attach.go) | attachZFS adds -m none flag to zpool create invocation, with comments explaining that this prevents failures when creating pools on read-only rootfs where automatic /<pool> mountpoint creation is not feasible. |
| DaemonSet pod IPC and volume access (stand/blockstor-satellite-daemonset.yaml) | Pod spec enables hostIPC: true for shared IPC namespace access. Container /dev volume mount is changed to a simple hostPath bind (removing mountPropagation), and a new read-only /run/udev mount is added. Volume definitions are updated to specify type: Directory for /dev and declare the new run-udev hostPath volume. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

In stone-built pools where code flows free,
A rabbit mounts the ZFS tree—
No paths to make on read-only ground,
Just m none, and IPC sound! 🐰✨

gemini-code-assist (Bot) left a comment

Code Review

This pull request addresses issues encountered during ZFS pool creation on Talos environments. It updates the zpool create command to include the -m none flag, preventing failures caused by attempts to create mount points on read-only root filesystems. Furthermore, the Kubernetes DaemonSet configuration is adjusted to use a plain bind mount for /dev, ensuring the container correctly inherits the host's devtmpfs and can access new device nodes generated during partition rescans. I have no feedback to provide as there were no review comments to evaluate.

Two further pieces of the piraeus satellite Pod spec that blockstor's
satellite was missing — both relevant to the `ps cdp zfs` failure
(Bug 359):

1. `hostIPC: true`. LVM userspace tooling and libzfs use host-wide
   SysV/POSIX IPC for whole-host coordination (lvmlockd handshakes,
   zfs.ko's libzpool ↔ /etc/libnvpair shared keys, etc.). Without
   hostIPC the satellite owns its own IPC namespace and can race or
   deadlock against host-side commands that assume the IPC is
   process-global. Mirrors the `linstor-satellite.nodeN` DaemonSet
   on the same Talos layout (cozy-linstor namespace).

2. `/run/udev` (ro, hostPath type=Directory). udev's runtime DB lives
   at /run/udev/data/b<MAJ>:<MIN>. libzfs/libblkid query it to look
   up partition metadata (PARTUUID, fs signatures, holders). Without
   this mount, the satellite sees an empty DB — partition rescan
   after `zpool create`'s GPT stamp returns nothing, libzfs treats it
   as "partition not present" and aborts (matches the ENODEV
   symptom). The previous /dev mount fix (commit `e57912c44`) made
   the partition node visible; this one makes the udev metadata
   about it visible too.

Together these complete the parity with piraeus's satellite mount
shape — the only remaining differences (var-lib-drbd, /etc/lvm
breakout, capabilities allow-list) are not load-bearing for ZFS.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

kvaps (Member, Author) commented May 22, 2026

Live diagnostic (empirical, 2026-05-22)

Ran a side-by-side mechanical comparison of blockstor's vs piraeus's satellite on the same Talos worker (e2e3-worker-1), same kernel, same /dev/sda. Both DaemonSets co-exist on this stand (piraeus-datastore ns and blockstor-system ns). The aim was to find why piraeus's satellite can zpool create /dev/sda without ENODEV while blockstor's cannot.

Mountinfo comparison: /dev is identical

Both pods see the host's devtmpfs (0:6) as a real bind mount, not a private kubelet-managed devtmpfs:

# piraeus (works)
711 706 0:6 / /dev rw,relatime - devtmpfs devtmpfs rw,seclabel,size=1930124k,nr_inodes=482531,mode=755

# blockstor (fails)
1141 1135 0:6 / /dev rw,relatime - devtmpfs devtmpfs rw,seclabel,size=1930124k,nr_inodes=482531,mode=755

Same major:minor 0:6. Both pods see /dev/sda, /dev/sda1, /dev/sda9 simultaneously. So the "/dev devtmpfs visibility" theory of Bug 346/Bug 359 is empirically wrong — with hostPath type: Directory, devtmpfs is already shared.

Additional control test: in the failing blockstor satellite, zpool create -m none testpool /dev/loop6 on a freshly created loopback file succeeds. So zpool create is not fundamentally broken in this container. It only fails on real block devices that require partition rescan.
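The mountinfo comparison above can be mechanized. The following is a minimal sketch, assuming the standard /proc/<pid>/mountinfo field layout; `mountEntry`, `parseMountinfoLine` and `sameDevice` are hypothetical names introduced here for illustration and are not part of blockstor.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry holds the mountinfo fields relevant to the comparison.
type mountEntry struct {
	DevID      string // major:minor of the backing filesystem, e.g. "0:6"
	Root       string // root of the mount within that filesystem
	MountPoint string // mount point as seen by the process
	FSType     string // filesystem type, found after the "-" separator
}

// parseMountinfoLine parses one line of /proc/<pid>/mountinfo.
// Layout: id parent major:minor root mountpoint options [optional...] - fstype source superopts
func parseMountinfoLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	sep := -1
	// The "-" separator can only appear after the six fixed fields.
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep == -1 || sep+1 >= len(fields) {
		return mountEntry{}, fmt.Errorf("malformed mountinfo line: %q", line)
	}
	return mountEntry{
		DevID:      fields[2],
		Root:       fields[3],
		MountPoint: fields[4],
		FSType:     fields[sep+1],
	}, nil
}

// sameDevice reports whether two entries are backed by the same filesystem instance.
func sameDevice(a, b mountEntry) bool {
	return a.DevID == b.DevID && a.Root == b.Root
}

func main() {
	piraeus := "711 706 0:6 / /dev rw,relatime - devtmpfs devtmpfs rw,seclabel,size=1930124k,nr_inodes=482531,mode=755"
	blockstor := "1141 1135 0:6 / /dev rw,relatime - devtmpfs devtmpfs rw,seclabel,size=1930124k,nr_inodes=482531,mode=755"
	a, errA := parseMountinfoLine(piraeus)
	b, errB := parseMountinfoLine(blockstor)
	if errA != nil || errB != nil {
		fmt.Println("parse error:", errA, errB)
		return
	}
	fmt.Println("same /dev backing device:", sameDevice(a, b))
}
```

Comparing the major:minor pair together with the root is what distinguishes a shared devtmpfs from a private one; the mount IDs (711 vs 1141) differ per namespace and say nothing about sharing.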

Pod-spec diff (only load-bearing differences)

| Field | piraeus | blockstor |
| --- | --- | --- |
| hostNetwork | absent | true |
| hostPID | absent | true |
| hostIPC | true | absent |
| volumeMounts /run/udev | ro bind, hostPath /run/udev Directory | missing |
| volumeMounts /run/lvm | bind, hostPath /run/lvm | bind, hostPath /run/lvm |
| volumeMounts /dev | hostPath /dev type Directory | hostPath /dev type Directory |
| securityContext | privileged + readOnlyRootFilesystem + NET_ADMIN/SYS_ADMIN add | privileged only |

Reproducing the failure

Same zpool create command, same kernel, same /dev/sda, run from inside each pod:

# piraeus
$ zpool create -f -m none -O compression=off -O atime=off testdata /dev/sda
exit 0                      # success, pool ONLINE

# blockstor (HEAD of fix/zfs-ps-cdp-mount-propagation, before /run/udev fix)
$ zpool create -f -m none -O compression=off -O atime=off testdata /dev/sda
cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19
Error preparing/labeling disk.
exit 1

Errno 19 = ENODEV. The literal log line from the production satellite during ps cdp zfs matches:

zpool create -f -m none -O compression=off -O atime=off data /dev/sda:
  cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19

strace nails the root cause

strace -f -e trace=openat,... on the failing blockstor zpool create shows libzfs polling in a tight retry loop:

[pid 2928572] openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = -1 ENOENT
[pid 2928572] readlinkat(8, "sda1", "../../devices/.../block/sda/sda1", ...) = 83
[pid 2928572] openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = -1 ENOENT
[pid 2928572] openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = -1 ENOENT
... (~10 retries)

In the blockstor pod, ls /run/udev/data/ returns ENOENT: the directory does not exist because /run/udev is not mounted. The kernel partition node /dev/sda1 is visible (devtmpfs is shared), but libzfs's zpool_label_disk_wait() uses libudev to confirm that udev has finished processing the partition event, by reading /run/udev/data/b8:<minor>. With no /run/udev mount, that file never appears → libudev keeps reporting "not initialized" → libzfs gives up after the timeout and returns ENODEV.

In the piraeus pod, /run/udev/ is a tmpfs bind mount from the host's /run (0:37 /udev /run/udev ro), so /run/udev/data/b8:1 exists and is updated as soon as the host's udev daemon processes the kernel uevent triggered by zpool create's partition-table rescan.

# piraeus, post-zpool-create, /run/udev/data/b8:1 contents
S:disk/by-partuuid/551d1ea2-dd8e-d743-be4a-f4c291122413
S:disk/by-partlabel/zfs-5fe543894119a6ff
S:disk/by-id/ata-QEMU_HARDDISK_QM00001-part1
I:384403295293
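To make the captured record easier to read, here is a small sketch that parses such an entry. It assumes the usual udev runtime DB line prefixes, where `S:` lines are device symlinks and an `I:` line carries the initialization timestamp; `udevRecord` and `parseUdevDB` are hypothetical names for illustration, and the real libudev parser handles more record types than this.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// udevRecord is a simplified view of one /run/udev/data/b<MAJ>:<MIN> entry.
type udevRecord struct {
	Symlinks    []string // values of S: lines (paths relative to /dev)
	Initialized bool     // true if an I: line was present
	InitUsec    string   // raw value of the I: line, kept as text
}

// parseUdevDB extracts the S: and I: lines and ignores all other record types.
func parseUdevDB(content string) udevRecord {
	var rec udevRecord
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue // not a key:value line
		}
		switch key {
		case "S":
			rec.Symlinks = append(rec.Symlinks, val)
		case "I":
			rec.Initialized = true
			rec.InitUsec = val
		}
	}
	return rec
}

func main() {
	sample := "S:disk/by-partuuid/551d1ea2-dd8e-d743-be4a-f4c291122413\n" +
		"S:disk/by-partlabel/zfs-5fe543894119a6ff\n" +
		"S:disk/by-id/ata-QEMU_HARDDISK_QM00001-part1\n" +
		"I:384403295293\n"
	rec := parseUdevDB(sample)
	fmt.Printf("initialized=%v symlinks=%d\n", rec.Initialized, len(rec.Symlinks))
}
```

In the blockstor pod the file could not be opened at all, so there was nothing to parse; the sketch only helps interpret what the piraeus pod sees.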

Validation: /run/udev alone is necessary and sufficient

Patched the blockstor DaemonSet on-stand with only the /run/udev ro hostPath mount (no hostIPC, no other changes), rolled out:

$ kubectl exec -n blockstor-system blockstor-satellite-kzvkl -- \
    zpool create -f -m none -O compression=off -O atime=off testdata /dev/sda
exit 0
$ zpool list
NAME       SIZE  ALLOC   FREE  ...  HEALTH
testdata  15.5G   134K  15.5G  ...  ONLINE

Fix confirmed. hostIPC from commit 925d3cd is not required for ZFS — it can be dropped to minimize the change surface (or kept for LVM-userland parity with piraeus, separate justification).

Root cause (one sentence)

zpool create's libzfs uses udev_device_get_is_initialized() via libudev which reads /run/udev/data/b<MAJ>:<MIN>; without the host's /run/udev bind-mounted into the satellite container, libzfs cannot observe that the kernel's partition uevent has been processed, so it gives up with ENODEV after zpool_label_disk_wait() times out — even though the partition node is fully visible in the shared devtmpfs.
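The wait-then-give-up behaviour described here can be approximated in a few lines. This is a sketch under stated assumptions, not libzfs's implementation: it treats "the DB file exists" as a stand-in for libudev's initialized check, and `udevDBPath`, `waitFor`, the attempt count and the interval are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// udevDBPath returns the udev runtime DB entry for a block device,
// e.g. /run/udev/data/b8:1 for major 8, minor 1.
func udevDBPath(runDir string, major, minor uint32) string {
	return fmt.Sprintf("%s/data/b%d:%d", runDir, major, minor)
}

// waitFor calls probe up to attempts times, sleeping interval between
// tries, and reports whether probe ever returned true.
func waitFor(probe func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if probe() {
			return true
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return false
}

func main() {
	path := udevDBPath("/run/udev", 8, 1)
	// Approximation: existence of the DB file stands in for "initialized".
	ok := waitFor(func() bool {
		_, err := os.Stat(path)
		return err == nil
	}, 10, 100*time.Millisecond)
	fmt.Printf("%s present: %v\n", path, ok)
}
```

Run inside a container without the /run/udev mount, a loop of this shape can never succeed no matter how long it waits, which matches the repeated ENOENT results in the strace output.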

Recommended action

  1. Keep commit 1 (e57912c — host /dev bind via hostPath type: Directory). It is still required: it makes /dev/sda1 itself visible inside the container, which is needed before libzfs can even start the udev wait.
  2. Keep commit 2 (7723517: zpool create -m none). Independent fix for the Talos read-only rootfs mkdir /<pool> EROFS issue.
  3. Keep /run/udev mount from commit 3 (925d3cd). This is the actual ENODEV fix.
  4. Drop hostIPC: true from commit 3 unless wanted for LVM-userland parity. It is not load-bearing for ZFS — empirically verified on stand by adding only /run/udev and getting a clean zpool create. Consider splitting the commit into "fix(satellite): mount /run/udev for libzfs partition wait (Bug 359)" and "fix(satellite): hostIPC for LVM parity with piraeus" (or drop the IPC part entirely).
  5. Update the PR title/description: the headline fix is /run/udev, not /dev. The /dev change (commit 1) is necessary scaffolding but the actual ENODEV happens in libzfs's udev wait, not in the kernel device-node lookup.

Why the previous PR #11 nsenter fix masked this

nsenter -t 1 -m -- enters PID 1's mount namespace, which has the host's /run/udev available. That's why it worked — for the wrong reason. The same outcome is achievable by simply bind-mounting /run/udev into the satellite, with no nsenter binary, no namespace gymnastics, and no Go-code wrapper.

kvaps changed the title from "fix(satellite): bind host /dev directly + zpool create -m none (Bug 359)" to "fix(satellite): mount /run/udev so libzfs sees partition uevents (Bug 359)" on May 22, 2026
kvaps marked this pull request as ready for review May 22, 2026 15:43
kvaps merged commit b4b56c6 into main May 22, 2026
5 of 8 checks passed