Skip to content

posixage/flock: ghost-lock race when stale-recovery unlinks a still-flocked file #530

@Benehiko

Description

@Benehiko

Summary

store/posixage/internal/flock can hand out the same exclusive lock to two callers simultaneously. The bug lives in the stale-recovery path: recoverStaleLock unlinks the lock file when its modtime ages past 30s, but on Unix unlink does not invalidate any open fd's flock(2) on the now-orphaned inode. A second caller can create a fresh inode at the same path, flock that, and walk away believing it owns the lock — while the original holder still legitimately holds a kernel-valid flock on the orphan.

The race has two halves. Either is sufficient on its own.

Half 1 — acquirer-side

A caller in the middle of tryLock can lock the orphan inode that recovery just removed:

  1. Stale lock file exists at the path on inode A. No live flock holder (e.g. previous holder crashed; or holder is sitting on its own flock without refreshing modtime).
  2. Caller X calls openFile → fd₁ → A.
  3. Caller Y calls recoverStaleLockstat, sees stale → Remove(lockFileName) unlinks A. The path is now empty; A still exists in the kernel via fd₁.
  4. X calls flock(fd₁)SUCCESS (orphaned inode, no holders).
  5. Y calls openFile again → creates inode B at the path → flock(fd₂)SUCCESS.
  6. X and Y both believe they hold the exclusive lock. A race-free critical section is no longer race-free.

Half 2 — holder-side

A caller that already legitimately holds the lock can be hijacked if its critical section exceeds the 30s threshold:

  1. Caller X acquires the lock cleanly on inode A. Truncate sets modtime to T₀.
  2. X does work for >30s without refreshing the modtime (the package never re-touched the file during a hold).
  3. Caller Y calls tryLock, fails initial acquireOnce, runs recoverStaleLock, sees time.Since(modtime) > 30s → unlinks A.
  4. Y then calls openFile → creates inode B → flock(fd_B) → SUCCESS.
  5. X is still happily inside its critical section on A, Y is now inside its critical section on B. Both think they have it.

Half 3 — recovery TOCTOU (minor)

Two callers reaching recoverStaleLock concurrently both saw a stale modtime, both called Remove, the loser got ENOENT, and tryLock surfaced that as a hard lock-acquisition failure even though recovery had succeeded. Not a correctness bug on its own, but conflates a benign race with a real failure.

Impact

Anything calling flock.TryLock / flock.TryRLock from store/posixage/store.go. Read/write paths that assume exclusive access (the whole point of the lock) can interleave. Symptoms in production would be intermittent corruption of the on-disk secret store under contention with a slow operation or after a process crashed mid-hold.

Why the 30s threshold was always a kludge

flock is kernel-released on process death. A held flock therefore by definition belongs to a live process — there is no need to "recover" from a stale flock, only from a stale file. Once you accept that, the only thing the modtime check tells you is "the path's modtime is old," which says nothing about whether the inode the path points to is locked. Removing the file does nothing to a live holder's flock; it only creates the inode-swap hazard above.

Why the race is hard to reproduce

  • Acquirer-side needs unlink to land between another caller's open and flock. Microseconds of window per attempt.
  • Holder-side needs a critical section that exceeds 30s and a concurrent acquirer firing recovery while the holder is mid-work. Easier to construct in a stress test than to observe in normal use.

The race tests in the fix PR cover the building blocks (isCurrentLockFile correctness, concurrent recovery returning success) and a heartbeat smoke test (recovery refuses to fire against a heartbeating holder). A deterministic end-to-end ghost-lock reproducer would require test hooks to delay flock after open — not added.

Affected code

  • store/posixage/internal/flock/flock.go
  • store/posixage/internal/flock/recover.go
  • store/posixage/internal/flock/flock_unix.go (release/unlock split)
  • store/posixage/internal/flock/flock_windows.go (release/unlock split)

Fix

PR #529 — closes both halves:

  • Verify the locked inode matches the path inode after every flock (acquirer-side).
  • Run a background goroutine while the lock is held that re-truncates the lock file every 10s and re-verifies the inode each tick (holder-side).
  • Treat ENOENT as success on both Stat and Remove inside recoverStaleLock (TOCTOU).

Heartbeat starvation past 30s (e.g. SIGSTOP, extreme scheduler pressure) is logged via log/slog when detected at the next tick. It cannot fail the caller's outstanding operation in-band — UnlockFunc is the only callback we hold — so the log line is the safety net for that pathological case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions