fix: bound zpool export at shutdown with a configurable timeout#1104
Open
bernardgut wants to merge 1 commit into
Open
fix: bound zpool export at shutdown with a configurable timeout#1104bernardgut wants to merge 1 commit into
bernardgut wants to merge 1 commit into
Conversation
f2a026e to
b248e86
Compare
Since v1.13.0 (siderolabs#1028) the zfs-service runs `zpool export -a` at shutdown via a blocking cmd.Run() that log.Fatalf's on error. Talos gives extension services a fixed 10s SIGTERM->SIGKILL grace with no knob to raise it, so on busy pools the export overruns it: zpool is SIGKILLed mid-export and the containerd shim then hangs ~20 minutes cleaning up the /dev rbind. Make the export configurable via ZFS_EXPORT_TIMEOUT, default unchanged: unset runs it unbounded as before; 0 skips it (pre-v1.13); a positive Go duration bounds it and, on timeout or error, logs and exits 0 instead of log.Fatalf so the shim can unmount /dev cleanly (pools are re-imported next boot by `zpool import -fal`). Leaving the default unchanged avoids regressing the original LUKS-on-ZFS case. Refs siderolabs#1085 Signed-off-by: Bernard Gütermann <bernard.gutermann@sekops.ch>
b248e86 to
babd63e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Make the shutdown
zpool export -ain thezfs-serviceextension configurable via aZFS_EXPORT_TIMEOUTenv var, without changing the default behavior:ZFS_EXPORT_TIMEOUT=0: skip the export entirely (pre-v1.13 behavior);ZFS_EXPORT_TIMEOUT=<duration>(> 0): bound the export with a context timeout; on timeout/error, log and exit 0 (notlog.Fatalf) so the containerd shim can unmount/devcleanly. Un-exported pools are recovered next boot by the existingzpool import -fal.Refs #1085.
Why
v1.13.0 (#1028, @gjabell) added
zpool export -aat shutdown so that, for LUKS-on-ZFS, the backing device is released andcryptsetup closestops looping so an upgrade can finish (talos#9307 / #8800). It is a blockingcmd.Run()withlog.Fatalfon error.Talos gives extension services a fixed 10s SIGTERM→SIGKILL grace (
GracefulShutdownTimeoutininternal/app/machined/pkg/system/runner/runner.go), with no machine-config knob to raise it. On busy pools the export overruns the grace →zpoolis SIGKILLed mid-export → the parent is SIGKILLed → the containerd shim hangs ~20 min cleaning up the extension's/devrshared-rbind mount (still referencing zvol device nodes). v1.12.x ranzfs unmount -auonly (no export) and never wedged.The two topologies are mirror images:
cryptsetupneeds — load-bearing.A single fixed policy can't serve both — hence: make it configurable, default unchanged.
How
The three-state logic above, in
storage/zfs/zfs-service/main.go. Theunsetpath is byte-for-byte the original unbounded export; the timeout path mirrors the in-repocontext+exec.CommandContext+WaitDelaypattern fromnvidia-gpu/.../nvidia-fabricmanager-wrapper. Documented instorage/zfs/README.md:Discussion / open question
Per @smira, this is a community extension and the default shouldn't regress the original author's case — so this PR leaves the default unchanged (unbounded export) and makes the bound opt-in.
@gjabell — to decide whether a bounded default (rather than unbounded) would be safe for the LUKS-on-ZFS case: does
cryptsetup closeneedzpool export -ato fully complete, and roughly how long does the export take on your pools? If a few seconds is reliably enough, a bounded default (e.g. ~7–8s, under the 10s grace) would fix the wedge out-of-the-box for everyone without regressing you. Happy to switch to that if you and @frezbo prefer.Scope notes (kept minimal)
zpool export -ais made configurable. The precedingzfs unmount -auis still an unbounded blockingcmd.Run()+log.Fatalf; if a busy pool hangs unmount it could wedge similarly — can bound it the same way if wanted.ZFS_EXPORT_TIMEOUTis treated as a fatal config error (fast exit, export skipped); a negative duration expires the context immediately (effectively a skip). Can add explicit guards if preferred.Backport
Requested for release-1.13 (v1.13.x) — the regression landed in v1.13.0 (
f89b232). Understood that contributors don't PR release branches directly and @smira batches these into the nextbackports: for v1.13.x.Validation status
Validated all three modes on a disposable single-node QEMU Talos v1.13.3 cluster running this patched extension (image
2.4.2-v1.13.3, zfs module loaded), with a non-trivial pool — 64 zvols → 64/devzvol device nodes in the extension's/devrbind; the pool's realzpool export -ameasures ~1.4s:ZFS_EXPORT_TIMEOUT200ms(bounded)zpool export did not finish within 200ms: signal: killed; leaving pools for boot-time import; exits 0;ext-zfs-serviceFinished 1.52s after SIGTERM0(skip)ZFS_EXPORT_TIMEOUT=0, skipping zpool export; pool left imported (not exported)In every mode
ext-zfs-servicereachedFinished: Service finished successfullywell within the fixed 10s grace, and the shutdown'sunmountPodMounts/unmountSystemDiskBindMounts/unmountSystemphases each completed in single-digit milliseconds — i.e. the containerd/dev-rbind cleanup that wedges ~20 min in the bug proceeds cleanly.Honest limits: single-node KVM lab with a file-vdev pool, not the prod DRBD/Piraeus stack. It exercises the real
ext-zfs-serviceand the real containerd/dev-rbind cleanup, but does not 1:1 reproduce DRBD's kernel-pinning of zvols. The interruptible export here was reaped by SIGKILL within the bound; azpoolstuck in an uninterruptible D-state ioctl would not be reaped faster — the change is then strictly no worse than today.