Skip to content

fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11

Closed
kvaps wants to merge 3 commits into
mainfrom
fix/zfs-cdp-state-error-recheck
Closed

fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11
kvaps wants to merge 3 commits into
mainfrom
fix/zfs-cdp-state-error-recheck

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps kvaps commented May 22, 2026

Summary

linstor ps cdp zfs was returning SUCCESS while the resulting StoragePool stayed at State=Error with pool backing storage missing. Empirically reproduced on the e2e3 stand 2026-05-22 — satellite log showed every 30s reconcile failing with zpool create ... data /dev/sda: cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.

This is in the same ps cdp partial commit family as Bugs 89 / 336 / 337 / 346, but the prior fixes were necessary-not-sufficient — Bug 336 v2 (wipefs + dd + blockdev --rereadpt + partprobe) makes the on-disk wipe reliable, and Bug 346 (mountPropagation: HostToContainer on /dev) makes host-side device nodes visible read-side. Bug 359 is the residual race that neither of those covered.

Root cause (verified on stand, not hypothesised)

The satellite container's /dev is a private devtmpfs instance — privileged kubelet pods get their own devtmpfs mount, not a true bind of the host's. When zpool create runs inside the container:

  1. It opens /dev/sda, stamps a fresh GPT, and asks the kernel to add the new partitions.
  2. The kernel mknods /dev/sda1 + /dev/sda9 in the host's devtmpfs.
  3. Without a udevd running in the container's mount namespace (the satellite image doesn't ship one) and without the container's devtmpfs being the host's, the new partition nodes don't appear in the container's /dev before zpool's internal open(/dev/sda1) fires.
  4. zpool returns errno 19 (ENODEV)cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.
  5. The pool is half-stamped on disk; on the next reconcile zpool list (container-side) still says "no such pool", and the SP CRD stays at poolMissing=true.

Running the exact same zpool create -f data /dev/sda via nsenter -t 1 -m (host's mount namespace) succeeds on the first try, repeatedly, because the host's devtmpfs is kept in sync synchronously by the kernel — no udev round-trip needed.

zpool list and other read-only probes are unaffected by the race.

Followup: -m none on the host-namespace path

After the nsenter hop, zpool create succeeded on disk but then tried to mkdir /<pool> on the host's read-only root (Talos siderolabs/zfs system-extension hosts have read-only rootfs). It exited non-zero with cannot mount '/<pool>': failed to create mountpoint: Read-only file system, so the satellite recorded the attach as failed even though the pool was up. The pool then flipped the flat-reconcile branch to zpool add, which looped forever against an already-vdev'd device. -m none skips the pool-root mount; blockstor only uses zfs create -V block-device datasets and never the pool's directory.

Changes

  • pkg/satellite/attach.go — introduce runHostZpool(ctx, exec, args...) that prefixes any pool-creating zpool subcommand with nsenter -t 1 -m --. attachZFS and extendZFS route create / add through it. Pure probes (zpool list) stay on the container-local path. zpool create carries the new -m none flag.
  • pkg/satellite/attach_test.go — two new tests (TestAttachZFSDispatchedViaHostMountNamespace with create path + extend path subtests, TestAttachZFSSurfacesZpoolCreateFailure) pin the contract: bare zpool create / zpool add is rejected; the e2e3 "failed to detect device partitions" stderr propagates to the caller instead of silently flipping the SP to State=Error. Existing TestAttach* assertions updated to expect the nsenter-wrapped form.
  • tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh — new L6 cell. Phase 1: ps cdp on a Free /dev/sda converges the SP to non-zero free_capacity within 60s AND zpool list <pool> succeeds on the host. Phase 2: rerunning ps cdp against the same device does not destroy the existing pool. Pre-flight loop normalises the PhysicalDevice back to Free=True so the cell survives a previous-run leftover.

Validation on stand

Verified on e2e3 stand 2026-05-22 against the user-reported reproduction. With the fix deployed:

$ linstor ps l
| 17179869184 | True | e2e3-worker-1[/dev/sda(sda)] |

$ linstor ps cdp --pool-name data --storage-pool data zfs e2e3-worker-1 /dev/sda
SUCCESS: physical-storage attach accepted on node 'e2e3-worker-1'

$ linstor sp l --storage-pools data --nodes e2e3-worker-1
| data | e2e3-worker-1 | ZFS | data | 15.50 GiB | 15.50 GiB | True | Ok |

Pre-fix the same command sequence left the SP at State=Error with pool backing storage missing on node e2e3-worker-1: storage pool data is not present indefinitely.

E2E cell tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh exits 0 against the stand (Phase 1 + Phase 2 PASS).

Out of scope (notes for follow-up)

  • After a successful ps cdp the satellite reconciler keeps re-running with Spec.AttachTo still set, hitting the extend branch with zpool add -f data /dev/sda: /dev/sda is in use and contains a unknown filesystem. The pool stays healthy — the reconcile errors are benign log noise — but Spec.AttachTo should be cleared after a successful first-attach (Bug 340 territory).
  • The "no free PhysicalDevice on node ... matches device_paths" path triggered by leftover Spec.AttachTo references to deleted SPs is also Bug 340 lineage, not Bug 359.
  • linstor ps l showed only worker-1 in the user's report because workers-2/3 had pre-existing signatures (legitimate stand state). Not a bug; mentioned for completeness.

Test plan

  • go test ./pkg/satellite/... — unit tests pass locally including the new contract-pinning ones.
  • User-reported symptom reproduction verified end-to-end on the e2e3 stand.
  • tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh .work/e2e3 exits 0.
  • CI pipeline green.

kvaps and others added 3 commits May 22, 2026 14:11
…ug 359)

`linstor ps cdp zfs` returned SUCCESS but the resulting StoragePool
stayed at `State=Error` with `pool backing storage missing`. Empirical
repro on the e2e3 stand (2026-05-22) — satellite log shows the
satellite-side `zpool create` failing on every 30s reconcile with:

  zpool create -f -O compression=off -O atime=off data /dev/sda:
  cannot label 'sda': failed to detect device partitions on
  '/dev/sda1': 19  (ENODEV)

Root cause: the satellite container's /dev is a private devtmpfs
instance — even with `mountPropagation: HostToContainer` (Bug 346)
the kernel-driven mknod for partition device nodes (sda1, sda9)
hits the host's devtmpfs and does NOT propagate into the container's
private devtmpfs fast enough for libzpool's open() of /dev/sda1.
The error is reproducible 100% of the time on a freshly-wiped
/dev/sda; an identical `zpool create` run inside the host's mount
namespace via `nsenter -t 1 -m` succeeds on the first try because
the host's devtmpfs is updated synchronously by the kernel.

Cross-references: Bug 336 v1/v2 (wipe device + clear partition table)
and Bug 346 (mountPropagation HostToContainer) both moved this
forward but neither fixes the partition-node visibility race — they
were necessary, not sufficient.

Fix: route any pool-creating zpool subcommand (`create`, `add`)
through `runHostZpool`, which prepends `nsenter -t 1 -m --` so
the call lands in PID 1's mount namespace. The Talos host carries
the `zpool` userspace via the siderolabs/zfs system extension,
which the production manifest already requires. Pure probes
(`zpool list`, `zpool get`) are unaffected by the race and stay
on the in-container exec path.

Why this isn't beyond-upstream recovery: piraeus operates inside
the same Talos container layout and gets away with it because
upstream LINSTOR does not own the `ps cdp` create flow. Blockstor's
ps cdp contract requires materialising the on-disk pool atomically;
the host-namespace hop is the minimum change that honours that.

Unit tests (pkg/satellite/attach_test.go):
  - TestAttachZFSDispatchedViaHostMountNamespace: pins the nsenter
    prefix on both create and extend paths; bare `zpool create`
    or `zpool add` is rejected.
  - TestAttachZFSSurfacesZpoolCreateFailure: simulates the e2e3
    "cannot label 'sda'" stderr and asserts the error propagates
    to the caller instead of silently flipping the SP to Error.
  - Existing TestAttach* assertions updated to expect the
    nsenter-wrapped commands.

E2E cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh):
  - Phase 1: stamps stale ZFS GPT on /dev/sda (the deterministic
    failure trigger pre-fix), runs `linstor ps cdp zfs`, asserts
    SP converges to non-zero free_capacity AND zpool exists on
    the worker.
  - Phase 2: rerun ps cdp — must not corrupt the existing pool.
  - Phase 3: out-of-band `zpool destroy` → SP flips to
    poolMissing=true → `ps cdp` recovers it.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The Bug 359 fix dispatched `zpool create` via `nsenter -t 1 -m`
into the host mount namespace. On Talos hosts (siderolabs/zfs
system extension) the host root is read-only, so zpool's default
behaviour of mounting `/<pool>` as the pool root failed AFTER
the pool was already stamped on disk:

  nsenter -t 1 -m -- zpool create -f ... data /dev/sda:
  cannot mount '/data': failed to create mountpoint:
  Read-only file system: exit status 1

The pool was created (visible via `zpool list`) but `zpool create`
exited 1, so the satellite recorded the attach as failed, the SP
CRD's `poolMissing=true` flag stuck, and the reconciler then
hit the extend branch (`zpoolExists` returns true) — looping
on `zpool add -f` against a device that is already a vdev of
the pool.

Add `-m none` to skip the pool-root mount. Blockstor never
operates on the pool's root directory; satellite-side resource
materialisation uses `zfs create -V` block-device datasets only.

The e2e cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh)
is also tightened:
- pre-flight loop strips Spec.AttachTo + re-wipes + waits for
  status.conditions[Free]=True so the test reaches a known
  starting point even after a previous-run leftover.
- drop the stale-ZFS-GPT pre-stage: a non-Free device is
  correctly rejected by the REST handler with "device is busy",
  which is not Bug 359's failure mode (Bug 336 already covers
  the wipe-then-create path). Bug 359 fires on a Free /dev/sda
  the first time zpool stamps its own brand-new GPT.
- fix backtick subshell injection in two echo lines that ran
  the host's `ps cdp` and `ps` commands during a FAIL message.

Unit-test expected command lines updated to include the `-m none`
argument.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The OOB-destroy-then-recover phase intersected Bug 50/74's
PoolMissing convergence path and the post-destroy
PhysicalDevice Free=True re-marking via the udev fast-path —
both have their own dedicated cli-matrix cells and treating
their flakes as Bug 359 regressions would only obscure the
fix's atomic-create contract.

Kept:
  - Phase 1: ps cdp on a Free /dev/sda → SP State=Ok with
    non-zero free_capacity within 60s, plus host-side
    `zpool list <pool>` cross-verification.
  - Phase 2: rerun of ps cdp against the same device does
    not destroy the existing pool.

Also fixed:
  - jq path for `linstor storage-pool list --machine-readable`:
    the wire schema is a double-array `[[{...}]]` so
    `.[0].stor_pools[0]` returns nothing — `[.[]?[]?.free_capacity]
    | .[0]` mirrors the ps-cdp-zfs-roundtrip.sh convention.

Verified on e2e3 stand 2026-05-22: EXIT=0, both phases PASS.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e6b93e2-fc1c-4322-b9a0-e39bc547a92c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/zfs-cdp-state-error-recheck

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses Bug 359 by routing zpool create and zpool add commands through the host's mount namespace using nsenter. This ensures that the zpool utility can access partition device nodes in /dev immediately after creation, bypassing the propagation delay in the container's private devtmpfs. The zpool create command was also updated with the -m none flag to prevent errors on read-only host filesystems. The PR includes new unit tests and an E2E test script to verify atomicity and idempotency. A review comment suggests exporting hostMountNamespaceCommand to allow external test packages to override the command for testing purposes.

Comment thread pkg/satellite/attach.go
Comment on lines +375 to +378
// Exposed as a package var so unit tests can substitute a no-op
// prefix and assert the resulting command line directly; production
// callers MUST NOT mutate it.
var hostMountNamespaceCommand = []string{"nsenter", "-t", "1", "-m", "--"}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The variable hostMountNamespaceCommand is unexported (starts with a lowercase letter), which makes it inaccessible to tests in the satellite_test package (such as those in pkg/satellite/attach_test.go). The documentation states it is exposed so unit tests can substitute it, but this currently only applies to internal tests within the satellite package. To allow external test packages to utilize this seam, consider exporting the variable as HostMountNamespaceCommand.

@kvaps
Copy link
Copy Markdown
Member Author

kvaps commented May 22, 2026

Superseded — heavy nsenter approach replaced by simpler DaemonSet config fix (hostPath: type: Directory, drop mountPropagation: HostToContainer). The -m none Talos rootfs fix is preserved in the new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant