Add optimistic concurrency control and dedup score guardrail for memory writes#2127
Add optimistic concurrency control and dedup score guardrail for memory writes#2127Jay-ju wants to merge 4 commits into
Conversation
…ry writes
Phase 2 of session commit (memory extraction) runs without distributed
locks, which creates read-modify-write races when the same user has
multiple sessions committing concurrently. This can cause:
1. Profile merge lost-updates: Session A merges into profile.md,
then Session B overwrites with a stale base, losing A's changes.
2. Tool/Skill statistics regression: Both sessions read total_calls=10,
A writes 15, B writes 13 — A's increment is lost.
3. Unprotected DELETE: LLM hallucinates a DELETE decision for a
low-similarity memory, causing user data loss.
Fix 1 - OCC (Compare-And-Swap) writes:
Add VikingFS.cas_write_file() which reads the file's modTime before
the expensive LLM call, then verifies it hasn't changed before
writing. If a concurrent modification is detected, the write is
aborted (treated as skipped) and the next commit cycle will re-attempt.
Applied to:
- SessionCompressor._merge_into_existing (preferences/entities/patterns)
- MemoryExtractor._append_to_profile (profile.md)
- MemoryExtractor._merge_tool_memory (tools/{name}.md)
- MemoryExtractor._merge_skill_memory (skills/{name}.md)
Fix 2 - Dedup score guardrail for DELETE:
The deduplicator stores _dedup_score on each similar memory with the
comment "for later destructive-action guardrails" but never checks it.
Now _delete_existing_memory refuses to delete memories with
_dedup_score < 0.5, preventing LLM-hallucinated deletes on
low-similarity matches.
TestDeleteGuardrail: - delete blocked when _dedup_score < 0.5 - delete allowed when _dedup_score >= 0.5 - delete allowed when no _dedup_score (backward compatible) - delete blocked for zero and negative scores TestMergeOCC: - merge succeeds when modTime unchanged (CAS passes) - merge aborted when modTime changed (CAS fails) - merge falls back to write_file when modTime unavailable TestCasWriteFile: - cas_write_file succeeds when modTime matches - cas_write_file fails when modTime differs - cas_write_file fails when stat raises exception
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
R7 (Error Handling): Replace bare 'except Exception: pass' and 'except Exception: return False' with specific exception types and logging, per .pr_agent.toml rules. - cas_write_file: split into FileNotFoundError (debug) + general Exception (warning) with descriptive messages - _merge_into_existing: log warning on vector cleanup failure instead of silently swallowing - _delete_existing_memory: fix docstring to match implementation (dedup score guardrail, not OCC)
Replace 'except Exception: pass' with specific FileNotFoundError (debug-level) and general Exception (warning-level) with descriptive messages, per .pr_agent.toml R7 rule.
Summary
Phase 2 of session commit (memory extraction) runs as a background
asyncio.create_taskwithoutdistributed locks. When the same user has multiple sessions committing concurrently, this creates
read-modify-write races that can cause:
profile.md, then Session B overwriteswith a stale base, losing A's changes.
total_calls=10, A writes 15, B writes13 — A's increment is lost.
_dedup_scoreon similar memories with thecomment "for later destructive-action guardrails" but never checks it. An LLM-hallucinated
DELETE on a low-similarity match can silently erase user memories.
Changes
1. OCC (Compare-And-Swap) writes
Added
VikingFS.cas_write_file(uri, content, expected_mod_time)which reads the file'smodTimebefore the expensive LLM call, then verifies it hasn't changed before writing. If a concurrent
modification is detected, the write is aborted (treated as skipped) and the next commit cycle will
re-attempt.
Applied to:
SessionCompressor._merge_into_existing(preferences / entities / patterns)MemoryExtractor._append_to_profile(profile.md)MemoryExtractor._merge_tool_memory(tools/{name}.md)MemoryExtractor._merge_skill_memory(skills/{name}.md)When
modTimeis unavailable (e.g. stat failed or file doesn't exist yet), the code falls back tothe original
write_file— no behavior change for the non-concurrent path.2. Dedup score guardrail for DELETE
_delete_existing_memorynow refuses to delete memories with_dedup_score < 0.5. This implementsthe guardrail that was documented but never enforced — the
_dedup_scorefield was stored with thecomment "for later destructive-action guardrails" but was never checked before destructive
operations.
Testing
Design Notes
Holding a distributed lock for that duration would block all other sessions for the same user.
OCC only checks at the write instant — zero blocking.
automatically. Retrying inside the same commit would add complexity without meaningful benefit.
SIMILARITY_THRESHOLDis 0.0, meaning any vectorsearch result is sent to the LLM. A score below 0.5 indicates the memories are only loosely
related, making a DELETE decision unreliable. This threshold can be tuned via a constant if
needed.