FOD drv/output bind-mount patching leaks /tmp/<hash>.patched.<rand> mounts on build interruption → GC aborts with EBUSY #178

@schickling-assistant

Summary

Determinate Nix 3.17.3 (Nix 2.33.3) leaks bind mounts from its FOD drv/output patching mechanism when builds are interrupted (SIGTERM / cancellation). The leaked mounts persist in the host mount namespace across daemon restarts and eventually break nix-collect-garbage entirely.

On one CI-runner host I observed 793 leaked mountpoints, all about 3 days old, which silently broke every subsequent GC: nix-collect-garbage aborts on the first unlink() that returns EBUSY and reports 0 store paths deleted, 0.0 KiB freed. As a result, the root filesystem filled to 100% (1.7T).

Observed pattern

For every affected store path, /proc/1/mountinfo shows:

<mountID> <parentID> 254:0 /tmp/<storeHash>-<name>.patched.<rand> /nix/store/<storeHash>-<name> rw,relatime shared:1 - ext4 /dev/mapper/vg0-root0 rw,stripe=32

Example:

/tmp/m7i135y30si5vafq4n3q5489xmcslcfm-pnpm-install.drv.patched.liPdju
  → /nix/store/m7i135y30si5vafq4n3q5489xmcslcfm-pnpm-install.drv
/tmp/94m2xm7qgp1mkmv04rwdrk7qh46rhqnr-pnpm-install.patched.M4wFNa
  → /nix/store/94m2xm7qgp1mkmv04rwdrk7qh46rhqnr-pnpm-install

Both the .drv file itself and the FOD output directory get the bind-mount treatment.
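
To check a host for this pattern, it can help to filter mountinfo on the root and mount-point fields rather than grepping whole lines. The sketch below assumes the layout shown above (root in field 4, mount point in field 5). The helper name list_patched_mounts is hypothetical, and the optional file argument exists only so the function can be tried against a saved copy of mountinfo.

```shell
# List store paths that have a leaked .patched bind mount on top of them.
# Reads /proc/1/mountinfo by default; pass a file to inspect a saved copy.
list_patched_mounts() {
  awk '
    # $4 = root of the mount within its filesystem, $5 = mount point
    $4 ~ "^/tmp/.*\\.patched\\.[^/]+$" && index($5, "/nix/store/") == 1 {
      print $5
    }
  ' "${1:-/proc/1/mountinfo}" | sort -u
}
```

Running `list_patched_mounts | wc -l` then gives a count that can be compared before and after cleanup.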

GC failure

$ nix-collect-garbage -d
finding garbage collector roots...
deleting garbage...
deleting '/nix/store/m7i135y30si5vafq4n3q5489xmcslcfm-pnpm-install.drv'
error: cannot unlink "/nix/store/m7i135y30si5vafq4n3q5489xmcslcfm-pnpm-install.drv": Device or resource busy
0 store paths deleted, 0.0 KiB freed

GC aborts on the first EBUSY rather than skipping-and-continuing, so a single stale mount zeroes out GC progress indefinitely. Combined with nix.settings.min-free / max-free inline GC (which silently no-ops for the same reason), disk usage grows unbounded until builds start failing on ENOSPC.
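
The difference between abort-on-first-error and skip-and-continue can be sketched in a few lines of shell. This is not Nix's GC code (which is C++); it only mirrors the control flow being asked for. The function name delete_skipping_busy is hypothetical, and the sketch is an illustration only: it should not be pointed at /nix/store.

```shell
# Delete each path read from stdin. On failure (for example EBUSY), log the
# path and move on instead of aborting the whole run.
# Prints a summary; returns non-zero if anything was skipped.
delete_skipping_busy() {
  deleted=0
  skipped=0
  while IFS= read -r path; do
    if rm -r -- "$path" 2>/dev/null </dev/null; then
      deleted=$((deleted + 1))
    else
      echo "skipping (busy or undeletable): $path" >&2
      skipped=$((skipped + 1))
    fi
  done
  echo "$deleted deleted, $skipped skipped"
  [ "$skipped" -eq 0 ]
}
```

With this shape, one undeletable path costs one skipped entry rather than the entire run.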

Suspected trigger

CI runner with GitHub Actions concurrency.cancel-in-progress: true. Cancellations send SIGTERM to nix builds mid-FOD-patch, and the cleanup path for the .patched bind mount doesn't run.

Workaround

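# Collect the mount point (field 5) of every leaked .patched mount and lazily unmount each one
grep -E 'patched\.' /proc/1/mountinfo | awk '{print $5}' | sort -u \
  | xargs -n1 -P4 sudo umount -l
# With the stale mounts gone, GC can unlink the store paths again
nix-collect-garbage -d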

After unmounting all 793 stale mounts, GC proceeded normally and freed 715 GiB (227,226 paths). No active builds were disrupted (the two "live"-looking mounts also turned out to be 3 days old and equally stale).

Environment

  • Determinate Nix 3.17.3 (Nix 2.33.3)
  • NixOS x86_64, kernel 6.18.13
  • Workload: GitHub Actions self-hosted runner building pnpm-install-style FODs under heavy concurrency with frequent cancellation

Suggested fixes

  1. Make GC resilient to EBUSY on unlink — log-and-skip instead of abort-the-run. A single leaked mount should not zero GC.
  2. Reap stale .patched.* mounts on daemon startup (clear ones whose source /tmp/...patched.<rand> is older than some threshold and whose target store path isn't owned by a live build).
  3. Install SIGTERM/cleanup handlers around the FOD patching bind-mount so it unmounts on abnormal build termination.
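
As an illustration of fix 3, the shell sketch below ties the lifetime of a bind mount to a trap handler so that the mount is undone on SIGTERM or SIGINT as well as on normal exit. It says nothing about how Nix implements the patching internally; the function name with_bind_mount and the MOUNT_CMD / UMOUNT_CMD overrides are hypothetical and exist only so the flow can be dry-run without root. SIGKILL cannot be trapped, so startup reaping (fix 2) would still be needed as a backstop.

```shell
# Run a command while a bind mount is in place and always undo the mount,
# including when this shell is sent SIGTERM or SIGINT mid-run.
# MOUNT_CMD / UMOUNT_CMD are hypothetical overrides for dry runs.
with_bind_mount() {
  src=$1 dst=$2
  shift 2
  (
    child='' mounted=0
    cleanup() {
      if [ -n "$child" ]; then
        # Stop the command first so it no longer holds the mount busy.
        kill "$child" 2>/dev/null || true
        wait "$child" 2>/dev/null || true
      fi
      if [ "$mounted" -eq 1 ]; then
        ${UMOUNT_CMD:-umount} "$dst"
      fi
    }
    trap cleanup EXIT
    trap 'exit 143' TERM
    trap 'exit 130' INT
    ${MOUNT_CMD:-mount --bind} "$src" "$dst" || exit 1
    mounted=1
    # Run the command in the background so that `wait` can be interrupted
    # by a signal; a foreground child would delay the trap until it exits.
    "$@" &
    child=$!
    wait "$child"
  )
}
```

A dry run such as `MOUNT_CMD="echo mount" UMOUNT_CMD="echo umount" with_bind_mount /src /dst true` prints the mount and unmount steps without touching the mount table.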

Happy to provide more data if useful.

Posted on behalf of @schickling
field               value
agent_name          👁️ cl1-iris
agent_session_id    420ca8a2-8003-42d7-a440-a7cd4d317076
agent_tool          Claude Code
agent_tool_version  2.1.118 (Claude Code)
agent_runtime       Claude Code 2.1.118 (Claude Code)
agent_model         claude-opus-4-7
worktree            dotfiles/main
machine             dev3
tooling_profile     dotfiles@f937ca8-dirty
