Skip to content

fix: bound zpool export at shutdown with a configurable timeout#1104

Open
bernardgut wants to merge 1 commit into
siderolabs:mainfrom
bernardgut:fix/zfs-export-timeout
Open

fix: bound zpool export at shutdown with a configurable timeout#1104
bernardgut wants to merge 1 commit into
siderolabs:mainfrom
bernardgut:fix/zfs-export-timeout

Conversation

@bernardgut
Copy link
Copy Markdown
Contributor

@bernardgut bernardgut commented Jun 1, 2026

What

Make the shutdown zpool export -a in the zfs-service extension configurable via a ZFS_EXPORT_TIMEOUT env var, without changing the default behavior:

  • unset (default): the export runs unbounded, exactly as today — no change;
  • ZFS_EXPORT_TIMEOUT=0: skip the export entirely (pre-v1.13 behavior);
  • ZFS_EXPORT_TIMEOUT=<duration> (> 0): bound the export with a context timeout; on timeout/error, log and exit 0 (not log.Fatalf) so the containerd shim can unmount /dev cleanly. Un-exported pools are recovered next boot by the existing zpool import -fal.

Refs #1085.

Why

v1.13.0 (#1028, @gjabell) added zpool export -a at shutdown so that, for LUKS-on-ZFS, the backing device is released and cryptsetup close stops looping so an upgrade can finish (talos#9307 / #8800). It is a blocking cmd.Run() with log.Fatalf on error.

Talos gives extension services a fixed 10s SIGTERM→SIGKILL grace (GracefulShutdownTimeout in internal/app/machined/pkg/system/runner/runner.go), with no machine-config knob to raise it. On busy pools the export overruns the grace → zpool is SIGKILLed mid-export → the parent is SIGKILLed → the containerd shim hangs ~20 min cleaning up the extension's /dev rshared-rbind mount (still referencing zvol device nodes). v1.12.x ran zfs unmount -au only (no export) and never wedged.

The two topologies are mirror images:

  • Export zpools on zfs shutdown #1028's case — LUKS under ZFS: the export releases the device cryptsetup needs — load-bearing.
  • This bug's case — DRBD/LUKS over ZFS zvols (e.g. LINSTOR/Piraeus): only raw disk sits under the pool, so the export releases nothing downstream; it is just the shutdown step most likely to block past the grace.

A single fixed policy can't serve both — hence: make it configurable, default unchanged.

How

The three-state logic above, in storage/zfs/zfs-service/main.go. The unset path is byte-for-byte the original unbounded export; the timeout path mirrors the in-repo context + exec.CommandContext + WaitDelay pattern from nvidia-gpu/.../nvidia-fabricmanager-wrapper. Documented in storage/zfs/README.md:

apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: zfs-service
environment:
  - ZFS_EXPORT_TIMEOUT=8s

Discussion / open question

Per @smira, this is a community extension and the default shouldn't regress the original author's case — so this PR leaves the default unchanged (unbounded export) and makes the bound opt-in.

@gjabell — to decide whether a bounded default (rather than unbounded) would be safe for the LUKS-on-ZFS case: does cryptsetup close need zpool export -a to fully complete, and roughly how long does the export take on your pools? If a few seconds is reliably enough, a bounded default (e.g. ~7–8s, under the 10s grace) would fix the wedge out-of-the-box for everyone without regressing you. Happy to switch to that if you and @frezbo prefer.

Scope notes (kept minimal)

  • Only zpool export -a is made configurable. The preceding zfs unmount -au is still an unbounded blocking cmd.Run() + log.Fatalf; if a busy pool hangs unmount it could wedge similarly — can bound it the same way if wanted.
  • An invalid ZFS_EXPORT_TIMEOUT is treated as a fatal config error (fast exit, export skipped); a negative duration expires the context immediately (effectively a skip). Can add explicit guards if preferred.

Backport

Requested for release-1.13 (v1.13.x) — the regression landed in v1.13.0 (f89b232). Understood that contributors don't PR release branches directly and @smira batches these into the next backports: for v1.13.x.

Validation status

Validated all three modes on a disposable single-node QEMU Talos v1.13.3 cluster running this patched extension (image 2.4.2-v1.13.3, zfs module loaded), with a non-trivial pool — 64 zvols → 64 /dev zvol device nodes in the extension's /dev rbind; the pool's real zpool export -a measures ~1.4s:

ZFS_EXPORT_TIMEOUT observed behavior reboot
200ms (bounded) logs zpool export did not finish within 200ms: signal: killed; leaving pools for boot-time import; exits 0; ext-zfs-service Finished 1.52s after SIGTERM clean, 59.8s
unset (default) env confirmed empty; unbounded export runs to completion; exits 0; Finished 1.67s after SIGTERM clean, 59.9s
0 (skip) logs ZFS_EXPORT_TIMEOUT=0, skipping zpool export; pool left imported (not exported) immediate clean exit

In every mode ext-zfs-service reached Finished: Service finished successfully well within the fixed 10s grace, and the shutdown's unmountPodMounts / unmountSystemDiskBindMounts / unmountSystem phases each completed in single-digit milliseconds — i.e. the containerd /dev-rbind cleanup that wedges ~20 min in the bug proceeds cleanly.

Honest limits: single-node KVM lab with a file-vdev pool, not the prod DRBD/Piraeus stack. It exercises the real ext-zfs-service and the real containerd /dev-rbind cleanup, but does not 1:1 reproduce DRBD's kernel-pinning of zvols. The interruptible export here was reaped by SIGKILL within the bound; a zpool stuck in an uninterruptible D-state ioctl would not be reaped faster — the change is then strictly no worse than today.

@bernardgut bernardgut force-pushed the fix/zfs-export-timeout branch from f2a026e to b248e86 Compare June 1, 2026 20:25
Since v1.13.0 (siderolabs#1028) the zfs-service runs `zpool export -a` at shutdown
via a blocking cmd.Run() that log.Fatalf's on error. Talos gives extension
services a fixed 10s SIGTERM->SIGKILL grace with no knob to raise it, so on
busy pools the export overruns it: zpool is SIGKILLed mid-export and the
containerd shim then hangs ~20 minutes cleaning up the /dev rbind.

Make the export configurable via ZFS_EXPORT_TIMEOUT, default unchanged:
unset runs it unbounded as before; 0 skips it (pre-v1.13); a positive Go
duration bounds it and, on timeout or error, logs and exits 0 instead of
log.Fatalf so the shim can unmount /dev cleanly (pools are re-imported next
boot by `zpool import -fal`). Leaving the default unchanged avoids
regressing the original LUKS-on-ZFS case.

Refs siderolabs#1085

Signed-off-by: Bernard Gütermann <bernard.gutermann@sekops.ch>
@bernardgut bernardgut force-pushed the fix/zfs-export-timeout branch from b248e86 to babd63e Compare June 1, 2026 20:54
@bernardgut bernardgut marked this pull request as ready for review June 2, 2026 07:26
@github-project-automation github-project-automation Bot moved this to To Do in Planning Jun 2, 2026
@talos-bot talos-bot moved this from To Do to In Review in Planning Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

2 participants