posixage/flock: ghost-lock race when stale-recovery unlinks a still-flocked file

## Summary

`store/posixage/internal/flock` can hand out the same exclusive lock to two callers simultaneously. The bug lives in the stale-recovery path: `recoverStaleLock` unlinks the lock file when its modtime ages past 30s, but on Unix unlink does not invalidate any open fd's `flock(2)` on the now-orphaned inode. A second caller can create a fresh inode at the same path, flock that, and walk away believing it owns the lock — while the original holder still legitimately holds a kernel-valid flock on the orphan.

The race has two halves. Either is sufficient on its own.

### Half 1 — acquirer-side

A caller in the middle of `tryLock` can lock the orphan inode that recovery just removed:

1. Stale lock file exists at the path on inode A. No live `flock` holder (e.g. previous holder crashed; or holder is sitting on its own flock without refreshing modtime).
2. Caller **X** calls `openFile` → fd₁ → A.
3. Caller **Y** calls `recoverStaleLock` → `stat`, sees stale → `Remove(lockFileName)` unlinks A. The path is now empty; A still exists in the kernel via fd₁.
4. **X** calls `flock(fd₁)` → **SUCCESS** (orphaned inode, no holders).
5. **Y** calls `openFile` again → creates inode B at the path → `flock(fd₂)` → **SUCCESS**.
6. **X** and **Y** both believe they hold the exclusive lock. A race-free critical section is no longer race-free.

### Half 2 — holder-side

A caller that already legitimately holds the lock can be hijacked if its critical section exceeds the 30s threshold:

1. Caller **X** acquires the lock cleanly on inode A. Truncate sets modtime to T₀.
2. **X** does work for >30s without refreshing the modtime (the package never re-touched the file during a hold).
3. Caller **Y** calls `tryLock`, fails initial `acquireOnce`, runs `recoverStaleLock`, sees `time.Since(modtime) > 30s` → unlinks A.
4. **Y** then calls `openFile` → creates inode B → `flock(fd_B)` → SUCCESS.
5. **X** is still happily inside its critical section on A, **Y** is now inside its critical section on B. Both think they have it.

### Half 3 — recovery TOCTOU (minor)

Two callers reaching `recoverStaleLock` concurrently both saw a stale modtime, both called `Remove`, the loser got `ENOENT`, and `tryLock` surfaced that as a hard lock-acquisition failure even though recovery had succeeded. Not a correctness bug on its own, but conflates a benign race with a real failure.

## Impact

Anything calling `flock.TryLock` / `flock.TryRLock` from `store/posixage/store.go`. Read/write paths that assume exclusive access (the whole point of the lock) can interleave. Symptoms in production would be intermittent corruption of the on-disk secret store under contention with a slow operation or after a process crashed mid-hold.

## Why the 30s threshold was always a kludge

`flock` is kernel-released on process death. A held `flock` therefore by definition belongs to a live process — there is no need to "recover" from a stale flock, only from a stale *file*. Once you accept that, the only thing the modtime check tells you is "the path's modtime is old," which says nothing about whether the inode the path points to is locked. Removing the file does nothing to a live holder's flock; it only creates the inode-swap hazard above.

## Why the race is hard to reproduce

- Acquirer-side needs `unlink` to land between another caller's `open` and `flock`. Microseconds of window per attempt.
- Holder-side needs a critical section that exceeds 30s and a concurrent acquirer firing recovery while the holder is mid-work. Easier to construct in a stress test than to observe in normal use.

The race tests in the fix PR cover the building blocks (`isCurrentLockFile` correctness, concurrent recovery returning success) and a heartbeat smoke test (recovery refuses to fire against a heartbeating holder). A deterministic end-to-end ghost-lock reproducer would require test hooks to delay `flock` after `open` — not added.

## Affected code

- `store/posixage/internal/flock/flock.go`
- `store/posixage/internal/flock/recover.go`
- `store/posixage/internal/flock/flock_unix.go` (release/unlock split)
- `store/posixage/internal/flock/flock_windows.go` (release/unlock split)

## Fix

PR #529 — closes both halves:

- Verify the locked inode matches the path inode after every `flock` (acquirer-side).
- Run a background goroutine while the lock is held that re-truncates the lock file every 10s and re-verifies the inode each tick (holder-side).
- Treat `ENOENT` as success on both `Stat` and `Remove` inside `recoverStaleLock` (TOCTOU).

Heartbeat starvation past 30s (e.g. SIGSTOP, extreme scheduler pressure) is logged via `log/slog` when detected at the next tick. It cannot fail the caller's outstanding operation in-band — `UnlockFunc` is the only callback we hold — so the log line is the safety net for that pathological case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

posixage/flock: ghost-lock race when stale-recovery unlinks a still-flocked file #530

Summary

Half 1 — acquirer-side

Half 2 — holder-side

Half 3 — recovery TOCTOU (minor)

Impact

Why the 30s threshold was always a kludge

Why the race is hard to reproduce

Affected code

Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

posixage/flock: ghost-lock race when stale-recovery unlinks a still-flocked file #530

Description

Summary

Half 1 — acquirer-side

Half 2 — holder-side

Half 3 — recovery TOCTOU (minor)

Impact

Why the 30s threshold was always a kludge

Why the race is hard to reproduce

Affected code

Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions