fix: bound zpool export at shutdown with a configurable timeout by bernardgut · Pull Request #1104 · siderolabs/extensions

bernardgut · 2026-06-01T20:10:37Z

What

Make the shutdown zpool export -a in the zfs-service extension configurable via a ZFS_EXPORT_TIMEOUT env var, without changing the default behavior:

unset (default): the export runs unbounded, exactly as today — no change;
ZFS_EXPORT_TIMEOUT=0: skip the export entirely (pre-v1.13 behavior);
ZFS_EXPORT_TIMEOUT=<duration> (> 0): bound the export with a context timeout; on timeout/error, log and exit 0 (not log.Fatalf) so the containerd shim can unmount /dev cleanly. Un-exported pools are recovered next boot by the existing zpool import -fal.

Refs #1085.

Why

v1.13.0 (#1028, @gjabell) added zpool export -a at shutdown so that, for LUKS-on-ZFS, the backing device is released and cryptsetup close stops looping so an upgrade can finish (talos#9307 / #8800). It is a blocking cmd.Run() with log.Fatalf on error.

Talos gives extension services a fixed 10s SIGTERM→SIGKILL grace (GracefulShutdownTimeout in internal/app/machined/pkg/system/runner/runner.go), with no machine-config knob to raise it. On busy pools the export overruns the grace → zpool is SIGKILLed mid-export → the parent is SIGKILLed → the containerd shim hangs ~20 min cleaning up the extension's /dev rshared-rbind mount (still referencing zvol device nodes). v1.12.x ran zfs unmount -au only (no export) and never wedged.

The two topologies are mirror images:

Export zpools on zfs shutdown #1028's case — LUKS under ZFS: the export releases the device cryptsetup needs — load-bearing.
This bug's case — DRBD/LUKS over ZFS zvols (e.g. LINSTOR/Piraeus): only raw disk sits under the pool, so the export releases nothing downstream; it is just the shutdown step most likely to block past the grace.

A single fixed policy can't serve both — hence: make it configurable, default unchanged.

How

The three-state logic above, in storage/zfs/zfs-service/main.go. The unset path is byte-for-byte the original unbounded export; the timeout path mirrors the in-repo context + exec.CommandContext + WaitDelay pattern from nvidia-gpu/.../nvidia-fabricmanager-wrapper. Documented in storage/zfs/README.md:

apiVersion: v1alpha1
kind: ExtensionServiceConfig
name: zfs-service
environment:
  - ZFS_EXPORT_TIMEOUT=8s

Discussion / open question

Per @smira, this is a community extension and the default shouldn't regress the original author's case — so this PR leaves the default unchanged (unbounded export) and makes the bound opt-in.

@gjabell — to decide whether a bounded default (rather than unbounded) would be safe for the LUKS-on-ZFS case: does cryptsetup close need zpool export -a to fully complete, and roughly how long does the export take on your pools? If a few seconds is reliably enough, a bounded default (e.g. ~7–8s, under the 10s grace) would fix the wedge out-of-the-box for everyone without regressing you. Happy to switch to that if you and @frezbo prefer.

Scope notes (kept minimal)

Only zpool export -a is made configurable. The preceding zfs unmount -au is still an unbounded blocking cmd.Run() + log.Fatalf; if a busy pool hangs unmount it could wedge similarly — can bound it the same way if wanted.
An invalid ZFS_EXPORT_TIMEOUT is treated as a fatal config error (fast exit, export skipped); a negative duration expires the context immediately (effectively a skip). Can add explicit guards if preferred.

Backport

Requested for release-1.13 (v1.13.x) — the regression landed in v1.13.0 (f89b232). Understood that contributors don't PR release branches directly and @smira batches these into the next backports: for v1.13.x.

Validation status

Validated all three modes on a disposable single-node QEMU Talos v1.13.3 cluster running this patched extension (image 2.4.2-v1.13.3, zfs module loaded), with a non-trivial pool — 64 zvols → 64 /dev zvol device nodes in the extension's /dev rbind; the pool's real zpool export -a measures ~1.4s:

`ZFS_EXPORT_TIMEOUT`	observed behavior	reboot
`200ms` (bounded)	logs `zpool export did not finish within 200ms: signal: killed; leaving pools for boot-time import`; exits 0; `ext-zfs-service` Finished 1.52s after SIGTERM	clean, 59.8s
unset (default)	env confirmed empty; unbounded export runs to completion; exits 0; Finished 1.67s after SIGTERM	clean, 59.9s
`0` (skip)	logs `ZFS_EXPORT_TIMEOUT=0, skipping zpool export`; pool left imported (not exported)	immediate clean exit

In every mode ext-zfs-service reached Finished: Service finished successfully well within the fixed 10s grace, and the shutdown's unmountPodMounts / unmountSystemDiskBindMounts / unmountSystem phases each completed in single-digit milliseconds — i.e. the containerd /dev-rbind cleanup that wedges ~20 min in the bug proceeds cleanly.

Honest limits: single-node KVM lab with a file-vdev pool, not the prod DRBD/Piraeus stack. It exercises the real ext-zfs-service and the real containerd /dev-rbind cleanup, but does not 1:1 reproduce DRBD's kernel-pinning of zvols. The interruptible export here was reaped by SIGKILL within the bound; a zpool stuck in an uninterruptible D-state ioctl would not be reaped faster — the change is then strictly no worse than today.

Since v1.13.0 (siderolabs#1028) the zfs-service runs `zpool export -a` at shutdown via a blocking cmd.Run() that log.Fatalf's on error. Talos gives extension services a fixed 10s SIGTERM->SIGKILL grace with no knob to raise it, so on busy pools the export overruns it: zpool is SIGKILLed mid-export and the containerd shim then hangs ~20 minutes cleaning up the /dev rbind. Make the export configurable via ZFS_EXPORT_TIMEOUT, default unchanged: unset runs it unbounded as before; 0 skips it (pre-v1.13); a positive Go duration bounds it and, on timeout or error, logs and exits 0 instead of log.Fatalf so the shim can unmount /dev cleanly (pools are re-imported next boot by `zpool import -fal`). Leaving the default unchanged avoids regressing the original LUKS-on-ZFS case. Refs siderolabs#1085 Signed-off-by: Bernard Gütermann <bernard.gutermann@sekops.ch>

bernardgut force-pushed the fix/zfs-export-timeout branch from f2a026e to b248e86 Compare June 1, 2026 20:25

bernardgut force-pushed the fix/zfs-export-timeout branch from b248e86 to babd63e Compare June 1, 2026 20:54

bernardgut marked this pull request as ready for review June 2, 2026 07:26

talos-bot added this to Planning Jun 2, 2026

github-project-automation Bot moved this to To Do in Planning Jun 2, 2026

talos-bot moved this from To Do to In Review in Planning Jun 2, 2026

bernardgut mentioned this pull request Jun 2, 2026

ZFS extension v1.13.0+ wedges shutdown on clusters with non-trivial pools (regression from v1.12.x) #1085

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: bound zpool export at shutdown with a configurable timeout#1104

fix: bound zpool export at shutdown with a configurable timeout#1104
bernardgut wants to merge 1 commit into
siderolabs:mainfrom
bernardgut:fix/zfs-export-timeout

bernardgut commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bernardgut commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How

Discussion / open question

Scope notes (kept minimal)

Backport

Validation status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bernardgut commented Jun 1, 2026 •

edited

Loading