Skip to content

Commit 6fc7b31

Browse files
committed
fix(fs_lock): always-join heartbeat thread in Drop to prevent post-cleanup writes
`LockGuard::drop` previously used `heartbeat_done.recv_timeout(100ms)` to wait for the heartbeat thread's acknowledgement, then fell through to `remove_lock_if_owned` regardless of whether the ack arrived. Under CI load this created a race: 1. Drop signals shutdown; recv_timeout times out before heartbeat acknowledges (heartbeat mid-`atomic_write_lock_metadata` IO). 2. Drop logs warning, calls `remove_lock_if_owned` → file removed. 3. A different caller acquires the lock, writes its own metadata. 4. Our still-alive heartbeat finishes its in-flight write — its `heartbeat_once` validated our ownership BEFORE the on-disk swap, so its rename overwrites the new owner's lockfile with our stale metadata. 5. The new owner's heartbeat sees foreign metadata, exits NotOwner. The new owner's Drop sees foreign metadata, `remove_lock_if_owned` returns Ok(false), the lockfile persists. The Linux unit test `acquire_serializes_concurrent_callers` then panics at `assert!(!path.exists())` — and any production lock with this shape would leak a stale lockfile that the next acquire can't match to remove. Fix: Drop now unconditionally `unpark`s and `join()`s the heartbeat handle. This bounds drop latency to one `park_timeout` iteration (~25ms) plus the current `heartbeat_once` IO — typically <500ms under CI load — but guarantees the heartbeat is dead before `remove_lock_if_owned` runs. The `heartbeat_done` channel field is kept (drained defensively) for backward compatibility but is no longer used for synchronization. Verified locally with 5 consecutive `cargo test -p agent-file-tools --release --lib fs_lock::` runs (10/10 pass each), plus full 850-test lib suite still green.
1 parent 554cc66 commit 6fc7b31

1 file changed

Lines changed: 29 additions & 16 deletions

File tree

crates/aft/src/fs_lock.rs

Lines changed: 29 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -68,25 +68,38 @@ pub struct LockGuard {
6868

6969
impl Drop for LockGuard {
7070
fn drop(&mut self) {
71+
// Signal shutdown then unconditionally join the heartbeat thread
72+
// BEFORE removing the lockfile. The earlier `recv_timeout(100ms)`
73+
// implementation could let `remove_lock_if_owned` race with a
74+
// still-alive heartbeat:
75+
//
76+
// 1. Drop signals shutdown, ack times out under CI load.
77+
// 2. Drop calls `remove_lock_if_owned` → file removed.
78+
// 3. Another caller acquires the lock → writes its metadata.
79+
// 4. Our heartbeat (still alive, mid-`atomic_write_lock_metadata`
80+
// from before shutdown was checked) overwrites the new
81+
// owner's file with our stale metadata. heartbeat_once's
82+
// ownership check happens BEFORE the write, so it can race
83+
// with a concurrent acquire that flips ownership in between.
84+
// 5. The new owner's heartbeat sees foreign metadata, exits
85+
// `NotOwner`. The new owner's drop sees foreign metadata,
86+
// `remove_lock_if_owned` returns `Ok(false)`, file persists.
87+
//
88+
// Always-joining bounds drop latency to one `park_timeout`
89+
// iteration (~25ms) plus the current `heartbeat_once` IO —
90+
// typically <500ms under CI load. The unused `heartbeat_done`
91+
// channel is kept for backward compatibility with any external
92+
// code that may still construct LockGuard manually, but Drop no
93+
// longer relies on it.
7194
self.shutdown.store(true, Ordering::Release);
72-
if let Some(handle) = self.heartbeat.as_ref() {
95+
if let Some(handle) = self.heartbeat.take() {
7396
handle.thread().unpark();
97+
let _ = handle.join();
7498
}
75-
76-
if self
77-
.heartbeat_done
78-
.recv_timeout(Duration::from_millis(100))
79-
.is_ok()
80-
{
81-
if let Some(handle) = self.heartbeat.take() {
82-
let _ = handle.join();
83-
}
84-
} else {
85-
slog_warn!(
86-
"fs lock heartbeat thread for {} did not stop within 100ms",
87-
self.path.display()
88-
);
89-
}
99+
// Drain any pending ack so the receiver doesn't carry stale state
100+
// if this LockGuard is somehow re-used (it isn't today, but be
101+
// defensive).
102+
while self.heartbeat_done.try_recv().is_ok() {}
90103

91104
match remove_lock_if_owned(&self.path, &self.metadata) {
92105
Ok(true) => slog_info!("released filesystem lock at {}", self.path.display()),

0 commit comments

Comments
 (0)