fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359) by kvaps · Pull Request #11 · cozystack/blockstor

kvaps · 2026-05-22T13:03:47Z

Summary

linstor ps cdp zfs was returning SUCCESS while the resulting StoragePool stayed at State=Error with pool backing storage missing. Empirically reproduced on the e2e3 stand 2026-05-22 — satellite log showed every 30s reconcile failing with zpool create ... data /dev/sda: cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.

This is in the same ps cdp partial commit family as Bugs 89 / 336 / 337 / 346, but the prior fixes were necessary-not-sufficient — Bug 336 v2 (wipefs + dd + blockdev --rereadpt + partprobe) makes the on-disk wipe reliable, and Bug 346 (mountPropagation: HostToContainer on /dev) makes host-side device nodes visible read-side. Bug 359 is the residual race that neither of those covered.

Root cause (verified on stand, not hypothesised)

The satellite container's /dev is a private devtmpfs instance — privileged kubelet pods get their own devtmpfs mount, not a true bind of the host's. When zpool create runs inside the container:

It opens /dev/sda, stamps a fresh GPT, and asks the kernel to add the new partitions.
The kernel mknods /dev/sda1 + /dev/sda9 in the host's devtmpfs.
Without a udevd running in the container's mount namespace (the satellite image doesn't ship one) and without the container's devtmpfs being the host's, the new partition nodes don't appear in the container's /dev before zpool's internal open(/dev/sda1) fires.
zpool returns errno 19 (ENODEV) → cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.
The pool is half-stamped on disk; on the next reconcile zpool list (container-side) still says "no such pool", and the SP CRD stays at poolMissing=true.

Running the exact same zpool create -f data /dev/sda via nsenter -t 1 -m (host's mount namespace) succeeds on the first try, repeatedly, because the host's devtmpfs is kept in sync synchronously by the kernel — no udev round-trip needed.

zpool list and other read-only probes are unaffected by the race.

Followup: `-m none` on the host-namespace path

After the nsenter hop, zpool create succeeded on disk but then tried to mkdir /<pool> on the host's read-only root (Talos siderolabs/zfs system-extension hosts have read-only rootfs). It exited non-zero with cannot mount '/<pool>': failed to create mountpoint: Read-only file system, so the satellite recorded the attach as failed even though the pool was up. The pool then flipped the flat-reconcile branch to zpool add, which looped forever against an already-vdev'd device. -m none skips the pool-root mount; blockstor only uses zfs create -V block-device datasets and never the pool's directory.

Changes

pkg/satellite/attach.go — introduce runHostZpool(ctx, exec, args...) that prefixes any pool-creating zpool subcommand with nsenter -t 1 -m --. attachZFS and extendZFS route create / add through it. Pure probes (zpool list) stay on the container-local path. zpool create carries the new -m none flag.
pkg/satellite/attach_test.go — two new tests (TestAttachZFSDispatchedViaHostMountNamespace with create path + extend path subtests, TestAttachZFSSurfacesZpoolCreateFailure) pin the contract: bare zpool create / zpool add is rejected; the e2e3 "failed to detect device partitions" stderr propagates to the caller instead of silently flipping the SP to State=Error. Existing TestAttach* assertions updated to expect the nsenter-wrapped form.
tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh — new L6 cell. Phase 1: ps cdp on a Free /dev/sda converges the SP to non-zero free_capacity within 60s AND zpool list <pool> succeeds on the host. Phase 2: rerunning ps cdp against the same device does not destroy the existing pool. Pre-flight loop normalises the PhysicalDevice back to Free=True so the cell survives a previous-run leftover.

Validation on stand

Verified on e2e3 stand 2026-05-22 against the user-reported reproduction. With the fix deployed:

$ linstor ps l
| 17179869184 | True | e2e3-worker-1[/dev/sda(sda)] |

$ linstor ps cdp --pool-name data --storage-pool data zfs e2e3-worker-1 /dev/sda
SUCCESS: physical-storage attach accepted on node 'e2e3-worker-1'

$ linstor sp l --storage-pools data --nodes e2e3-worker-1
| data | e2e3-worker-1 | ZFS | data | 15.50 GiB | 15.50 GiB | True | Ok |

Pre-fix the same command sequence left the SP at State=Error with pool backing storage missing on node e2e3-worker-1: storage pool data is not present indefinitely.

E2E cell tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh exits 0 against the stand (Phase 1 + Phase 2 PASS).

Out of scope (notes for follow-up)

After a successful ps cdp the satellite reconciler keeps re-running with Spec.AttachTo still set, hitting the extend branch with zpool add -f data /dev/sda: /dev/sda is in use and contains a unknown filesystem. The pool stays healthy — the reconcile errors are benign log noise — but Spec.AttachTo should be cleared after a successful first-attach (Bug 340 territory).
The "no free PhysicalDevice on node ... matches device_paths" path triggered by leftover Spec.AttachTo references to deleted SPs is also Bug 340 lineage, not Bug 359.
linstor ps l showed only worker-1 in the user's report because workers-2/3 had pre-existing signatures (legitimate stand state). Not a bug; mentioned for completeness.

Test plan

go test ./pkg/satellite/... — unit tests pass locally including the new contract-pinning ones.
User-reported symptom reproduction verified end-to-end on the e2e3 stand.
tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh .work/e2e3 exits 0.
CI pipeline green.

…ug 359) `linstor ps cdp zfs` returned SUCCESS but the resulting StoragePool stayed at `State=Error` with `pool backing storage missing`. Empirical repro on the e2e3 stand (2026-05-22) — satellite log shows the satellite-side `zpool create` failing on every 30s reconcile with: zpool create -f -O compression=off -O atime=off data /dev/sda: cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19 (ENODEV) Root cause: the satellite container's /dev is a private devtmpfs instance — even with `mountPropagation: HostToContainer` (Bug 346) the kernel-driven mknod for partition device nodes (sda1, sda9) hits the host's devtmpfs and does NOT propagate into the container's private devtmpfs fast enough for libzpool's open() of /dev/sda1. The error is reproducible 100% of the time on a freshly-wiped /dev/sda; an identical `zpool create` run inside the host's mount namespace via `nsenter -t 1 -m` succeeds on the first try because the host's devtmpfs is updated synchronously by the kernel. Cross-references: Bug 336 v1/v2 (wipe device + clear partition table) and Bug 346 (mountPropagation HostToContainer) both moved this forward but neither fixes the partition-node visibility race — they were necessary, not sufficient. Fix: route any pool-creating zpool subcommand (`create`, `add`) through `runHostZpool`, which prepends `nsenter -t 1 -m --` so the call lands in PID 1's mount namespace. The Talos host carries the `zpool` userspace via the siderolabs/zfs system extension, which the production manifest already requires. Pure probes (`zpool list`, `zpool get`) are unaffected by the race and stay on the in-container exec path. Why this isn't beyond-upstream recovery: piraeus operates inside the same Talos container layout and gets away with it because upstream LINSTOR does not own the `ps cdp` create flow. Blockstor's ps cdp contract requires materialising the on-disk pool atomically; the host-namespace hop is the minimum change that honours that. Unit tests (pkg/satellite/attach_test.go): - TestAttachZFSDispatchedViaHostMountNamespace: pins the nsenter prefix on both create and extend paths; bare `zpool create` or `zpool add` is rejected. - TestAttachZFSSurfacesZpoolCreateFailure: simulates the e2e3 "cannot label 'sda'" stderr and asserts the error propagates to the caller instead of silently flipping the SP to Error. - Existing TestAttach* assertions updated to expect the nsenter-wrapped commands. E2E cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh): - Phase 1: stamps stale ZFS GPT on /dev/sda (the deterministic failure trigger pre-fix), runs `linstor ps cdp zfs`, asserts SP converges to non-zero free_capacity AND zpool exists on the worker. - Phase 2: rerun ps cdp — must not corrupt the existing pool. - Phase 3: out-of-band `zpool destroy` → SP flips to poolMissing=true → `ps cdp` recovers it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

The Bug 359 fix dispatched `zpool create` via `nsenter -t 1 -m` into the host mount namespace. On Talos hosts (siderolabs/zfs system extension) the host root is read-only, so zpool's default behaviour of mounting `/<pool>` as the pool root failed AFTER the pool was already stamped on disk: nsenter -t 1 -m -- zpool create -f ... data /dev/sda: cannot mount '/data': failed to create mountpoint: Read-only file system: exit status 1 The pool was created (visible via `zpool list`) but `zpool create` exited 1, so the satellite recorded the attach as failed, the SP CRD's `poolMissing=true` flag stuck, and the reconciler then hit the extend branch (`zpoolExists` returns true) — looping on `zpool add -f` against a device that is already a vdev of the pool. Add `-m none` to skip the pool-root mount. Blockstor never operates on the pool's root directory; satellite-side resource materialisation uses `zfs create -V` block-device datasets only. The e2e cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh) is also tightened: - pre-flight loop strips Spec.AttachTo + re-wipes + waits for status.conditions[Free]=True so the test reaches a known starting point even after a previous-run leftover. - drop the stale-ZFS-GPT pre-stage: a non-Free device is correctly rejected by the REST handler with "device is busy", which is not Bug 359's failure mode (Bug 336 already covers the wipe-then-create path). Bug 359 fires on a Free /dev/sda the first time zpool stamps its own brand-new GPT. - fix backtick subshell injection in two echo lines that ran the host's `ps cdp` and `ps` commands during a FAIL message. Unit-test expected command lines updated to include the `-m none` argument. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

The OOB-destroy-then-recover phase intersected Bug 50/74's PoolMissing convergence path and the post-destroy PhysicalDevice Free=True re-marking via the udev fast-path — both have their own dedicated cli-matrix cells and treating their flakes as Bug 359 regressions would only obscure the fix's atomic-create contract. Kept: - Phase 1: ps cdp on a Free /dev/sda → SP State=Ok with non-zero free_capacity within 60s, plus host-side `zpool list <pool>` cross-verification. - Phase 2: rerun of ps cdp against the same device does not destroy the existing pool. Also fixed: - jq path for `linstor storage-pool list --machine-readable`: the wire schema is a double-array `[[{...}]]` so `.[0].stor_pools[0]` returns nothing — `[.[]?[]?.free_capacity] | .[0]` mirrors the ps-cdp-zfs-roundtrip.sh convention. Verified on e2e3 stand 2026-05-22: EXIT=0, both phases PASS. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-05-22T13:03:55Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e6b93e2-fc1c-4322-b9a0-e39bc547a92c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/zfs-cdp-state-error-recheck

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses Bug 359 by routing zpool create and zpool add commands through the host's mount namespace using nsenter. This ensures that the zpool utility can access partition device nodes in /dev immediately after creation, bypassing the propagation delay in the container's private devtmpfs. The zpool create command was also updated with the -m none flag to prevent errors on read-only host filesystems. The PR includes new unit tests and an E2E test script to verify atomicity and idempotency. A review comment suggests exporting hostMountNamespaceCommand to allow external test packages to override the command for testing purposes.

gemini-code-assist · 2026-05-22T13:06:02Z

+// Exposed as a package var so unit tests can substitute a no-op
+// prefix and assert the resulting command line directly; production
+// callers MUST NOT mutate it.
+var hostMountNamespaceCommand = []string{"nsenter", "-t", "1", "-m", "--"}


The variable hostMountNamespaceCommand is unexported (starts with a lowercase letter), which makes it inaccessible to tests in the satellite_test package (such as those in pkg/satellite/attach_test.go). The documentation states it is exposed so unit tests can substitute it, but this currently only applies to internal tests within the satellite package. To allow external test packages to utilize this seam, consider exporting the variable as HostMountNamespaceCommand.

kvaps · 2026-05-22T13:17:21Z

Superseded — heavy nsenter approach replaced by simpler DaemonSet config fix (hostPath: type: Directory, drop mountPropagation: HostToContainer). The -m none Talos rootfs fix is preserved in the new PR.

kvaps and others added 3 commits May 22, 2026 14:11

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

kvaps closed this May 22, 2026

kvaps mentioned this pull request May 22, 2026

fix(satellite): mount /run/udev so libzfs sees partition uevents (Bug 359) #12

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11

fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11
kvaps wants to merge 3 commits into
mainfrom
fix/zfs-cdp-state-error-recheck

kvaps commented May 22, 2026

Uh oh!

coderabbitai Bot commented May 22, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

kvaps commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kvaps commented May 22, 2026

Summary

Root cause (verified on stand, not hypothesised)

Followup: -m none on the host-namespace path

Changes

Validation on stand

Out of scope (notes for follow-up)

Test plan

Uh oh!

coderabbitai Bot commented May 22, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

kvaps commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Followup: `-m none` on the host-namespace path