
feat(store+upgrader): Phase 2 lock serialization + rollback protection (Issue #632) #780

Closed
jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:james/pr739-redis-lock-rebuild

Conversation

@jlin53882
Contributor

PR #639 rebuild — Phase 2 Lock Serialization + Rollback Protection

Follow-up to Issue #632.

This PR is a clean rebuild of PR #639, resolving the lock contention between the upgrade CLI and the Plugin.

Original PR #639: CLOSED (scope drift, excessive complexity)
This PR: clean rebuild containing only the core functionality


Core problem

The old implementation acquired a separate lock for every entry (N locks per batch), and the LLM ran inside the lock, so the Plugin had to wait a long time.

New implementation (Two-Phase)

  • Phase 1: LLM enrichment (no lock, parallelizable)
  • Phase 2: bulkUpdateMetadataWithPatch() (1 lock per batch)

Implementation

src/store.ts

| Function | Description |
| --- | --- |
| bulkUpdateMetadataWithPatch() | 1 lock per batch; re-reads the DB inside the lock; merges fresh DB metadata + LLM patch + upgrade marker |
| safeToNumber() | BigInt/string precision guard; throws on unsafe conversions |
| tryParseNumber() | Gracefully degrades to 0 (for external _distance data) |
| ALLOWED_PATCH_KEYS | Whitelist: tier, access_count, confidence |
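The two numeric guards can be sketched as follows. This is a minimal illustration of the behaviour described in the table (throw on unsafe conversion vs. degrade to 0); the actual bodies in src/store.ts may differ.

```typescript
// Minimal sketch of the precision guards (behaviour per the PR description;
// the real implementations in src/store.ts may differ).

function safeToNumber(value: bigint | string | number): number {
  const n = Number(value);
  // Reject NaN / Infinity outright.
  if (!Number.isFinite(n)) {
    throw new TypeError(`unsafe numeric conversion: ${String(value)}`);
  }
  // Reject BigInt conversions that silently lose integer precision.
  if (typeof value === "bigint" && BigInt(n) !== value) {
    throw new TypeError(`precision loss converting BigInt: ${String(value)}`);
  }
  return n;
}

function tryParseNumber(value: unknown): number {
  // Graceful degrade for external data such as `_distance`: fall back to 0.
  try {
    return safeToNumber(value as bigint | string | number);
  } catch {
    return 0;
  }
}
```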

src/memory-upgrader.ts

| Function | Description |
| --- | --- |
| prepareEntry() | Phase 1: LLM enrichment, no lock |
| writeEnrichedBatch() | Phase 2: calls bulkUpdateMetadataWithPatch() directly; no longer calls store.update() |

F2 fix: writeEnrichedBatch() calls bulkUpdateMetadataWithPatch() directly, so Phase 2 holds only one lock level.

F7 fix: scopeFilter is correctly propagated to the write phase.


Test files (registered in CI)

| File | Lines | Description |
| --- | --- | --- |
| test/upgrader-phase2-lock.test.mjs | 422 | Phase 2 lock behaviour tests |
| test/upgrader-phase2-extreme.test.mjs | 386 | Extreme-scenario tests |
| test/bulk-recovery-rollback.test.mjs | 172 | Rollback recovery tests |
| test/upgrader-whitelist-regression.test.mjs | 376 | Whitelist regression tests |

The CI manifest has been updated (scripts/ci-test-manifest.mjs).


Maintainer review fix status

| Issue | Status |
| --- | --- |
| F2 — Nested lock acquisition | ✅ Fixed |
| F4 — Second re-read could wipe LLM fields | ✅ Fixed (re-read in lock + merge) |
| F6 — cleanMarker not passed through the whitelist gate | ✅ Fixed |
| MR2 — Phase 2a built a full metadata string, wiping Plugin fields | ✅ Fixed |
| MR3 — Option A re-read happened after the batch delete | ✅ Fixed |
| MR5 — O(n²) duplicate detection | ✅ Fixed |

Pending confirmation

  • EF1/EF3 (jiti + Node 22.22.x compatibility) — environmental, not caused by this PR's code

jlin53882 added 3 commits May 10, 2026 00:21
Rebuild of PR 739 (the original PR had a merge conflict):
- Added src/redis-lock.ts (340 lines)
  - Dynamic import so a missing ioredis install does not crash the process
  - RedisUnavailableError + Symbol.for marker (ESM-safe across modules)
  - isRedisConnectionError recursively checks wrapped errors
  - parseRedisUrl() parses URLs correctly (including password and DB selection)
  - RedisLockManager + createRedisLockManager factory
  - dbPath hash namespace to avoid lock interference across stores

- Modified src/store.ts (+50 lines)
  - Imports RedisLockManager/createRedisLockManager/RedisUnavailableError
  - runWithFileLock() now tries the Redis lock first
  - runWithFileLockCore() keeps the existing file-lock implementation
  - getRedisLockManager() module-level cache (Option E)
  - Fixes CortexReach#670: realpath:false prevents ENOENT caused by stale lock cleanup

- Added 3 test files (40 tests, all passing)
  - test/redis-lock-error-types.test.mjs
  - test/redis-lock-url-parse.test.mjs
  - test/redis-lock-concurrent-init.test.mjs

- package.json: added ioredis as a devDependency

Note: the batch accumulator (CortexReach#690) is fully preserved and unaffected.
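The "Redis first, fall back to file lock" routing described in the store.ts bullets above can be sketched as below. The function names come from the commit message; the bodies are illustrative stand-ins, and `probeRedis` is a hypothetical injection point standing in for the real dynamic import of ./redis-lock.ts.

```typescript
// Sketch of the lock routing (names from the commit message; bodies are
// illustrative stand-ins, not the real implementation).

type Lockable = <T>(key: string, fn: () => Promise<T>) => Promise<T>;

// Stand-in for the existing file-lock path (runWithFileLockCore in the PR).
const runWithFileLockCore: Lockable = async (_key, fn) => fn();

// Module-level cache ("Option E"): probe Redis once and remember the outcome.
let cachedRedisLock: Lockable | null | undefined;

async function getRedisLockManager(
  probeRedis: () => Promise<Lockable>,
): Promise<Lockable | null> {
  if (cachedRedisLock !== undefined) return cachedRedisLock;
  try {
    cachedRedisLock = await probeRedis(); // real code: dynamic import of ioredis
  } catch {
    cachedRedisLock = null; // ioredis not installed or Redis unreachable
  }
  return cachedRedisLock;
}

async function runWithFileLock<T>(
  dbPath: string,
  fn: () => Promise<T>,
  probeRedis: () => Promise<Lockable> = () => Promise.reject(new Error("no redis")),
): Promise<T> {
  const redisLock = await getRedisLockManager(probeRedis);
  if (redisLock) return redisLock(dbPath, fn); // Redis lock takes priority
  return runWithFileLockCore(dbPath, fn); // fall back to the existing file lock
}
```

The cache distinguishes "not probed yet" (undefined) from "probed and unavailable" (null), so a missing Redis is only probed once per process.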
@jlin53882 jlin53882 force-pushed the james/pr739-redis-lock-rebuild branch from 4bb1e8c to 2b8de59 on May 9, 2026 16:43
@jlin53882
Contributor Author

PR #780 changes in detail

Background

Problem: the Plugin must take the file lock for store.add(); the old implementation acquired a separate lock for every entry (N locks per batch). LLM enrichment ran inside the lock, so the Plugin waited a long time, causing severe lock contention.

Solution (Two-Phase)

  • Phase 1: prepareEntry() — LLM enrichment, no lock, parallelizable
  • Phase 2: bulkUpdateMetadataWithPatch() — 1 lock per batch; all writes complete in one pass
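The two phases above can be sketched as follows. Function names come from the PR; the enrich/write bodies here are placeholders, not the real implementation.

```typescript
// Minimal sketch of the two-phase shape (names from the PR; bodies are
// placeholders for the real LLM call and store write).

interface EnrichedEntry { id: string; patch: Record<string, unknown> }

// Phase 1: LLM enrichment, no lock held, so entries can run in parallel.
async function prepareEntry(id: string): Promise<EnrichedEntry> {
  const patch = { l0_abstract: `summary of ${id}` }; // placeholder LLM output
  return { id, patch };
}

// Phase 2: one lock for the whole batch; every write happens inside it.
async function writeEnrichedBatch(
  entries: EnrichedEntry[],
  bulkUpdateMetadataWithPatch: (batch: EnrichedEntry[]) => Promise<void>,
): Promise<void> {
  await bulkUpdateMetadataWithPatch(entries); // 1 lock per batch, not per entry
}

async function upgradeBatch(ids: string[]): Promise<number> {
  const enriched = await Promise.all(ids.map(prepareEntry)); // Phase 1, lock-free
  let written = 0;
  await writeEnrichedBatch(enriched, async (batch) => { written = batch.length; });
  return written;
}
```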

File change overview

| File | Change | Description |
| --- | --- | --- |
| src/store.ts | +952 / -458 | Adds bulkUpdateMetadataWithPatch() + utility functions |
| src/memory-upgrader.ts | +386 / -180 | Refactored to Two-Phase; removes pendingBatch |
| scripts/ci-test-manifest.mjs | +20 / -20 | Registers 4 Phase 2 tests |
| test/upgrader-phase2-lock.test.mjs | +422 | Phase 2 lock behaviour tests (new) |
| test/upgrader-phase2-extreme.test.mjs | +386 | Extreme-scenario tests (new) |
| test/bulk-recovery-rollback.test.mjs | +172 | Rollback recovery tests (new) |
| test/upgrader-whitelist-regression.test.mjs | +376 | Whitelist regression tests (new) |

src/store.ts core changes

New utility functions

| Function | Purpose |
| --- | --- |
| safeToNumber(value) | BigInt/string/number → number; throws on unsafe conversion (prevents precision loss) |
| tryParseNumber(value) | Same, but returns 0 on unsafe input (for external data such as _distance) |
| ALLOWED_PATCH_KEYS | Whitelist: l0_abstract, l1_overview, l2_content, memory_category, upgraded_from, upgraded_at, tier, access_count, confidence |
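The whitelist gate can be sketched as below: keys outside ALLOWED_PATCH_KEYS and null/undefined values are dropped before the merge. `filterPatch` is a hypothetical name for illustration; the PR only names the ALLOWED_PATCH_KEYS constant.

```typescript
// Sketch of the whitelist gate described above. `filterPatch` is a
// hypothetical helper name; the key list comes from the PR description.

const ALLOWED_PATCH_KEYS = new Set([
  "l0_abstract", "l1_overview", "l2_content", "memory_category",
  "upgraded_from", "upgraded_at", "tier", "access_count", "confidence",
]);

function filterPatch(patch: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(patch)) {
    if (!ALLOWED_PATCH_KEYS.has(key)) continue;           // not whitelisted
    if (value === null || value === undefined) continue;  // discard empty values
    clean[key] = value;
  }
  return clean;
}
```

Per the F6 fix, the same gate is also applied to cleanMarker, not just the LLM patch.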

bulkUpdateMetadataWithPatch() execution flow (inside 1 lock)

Step 1: Query DB → fetch current metadata (including Plugin-written fields such as injected_count)
Step 2: Merge — base (Plugin fields) + cleanPatch (LLM fields) + cleanMarker
Step 3: Whitelist filter — keep only ALLOWED_PATCH_KEYS; discard null/undefined
Step 4: Batch delete (1 LanceDB op)
Step 5: Batch add (1 LanceDB op)
Rollback: if the add fails, attempt per-entry recovery → restore the original data
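The steps above can be sketched as one function. The store interface and helper shapes are assumptions for illustration; only the step order and the rollback-on-add behaviour come from the PR description (whitelist filtering is omitted here for brevity).

```typescript
// Illustrative sketch of the single-lock flow. `Table` and `withLock` are
// assumed shapes, not the real LanceDB/store API.

interface Row { id: string; metadata: Record<string, unknown> }
interface Table {
  query(ids: string[]): Promise<Row[]>;
  deleteByIds(ids: string[]): Promise<void>;
  add(rows: Row[]): Promise<void>;
}

async function bulkUpdateMetadataWithPatch(
  table: Table,
  patches: Map<string, Record<string, unknown>>,
  marker: Record<string, unknown>,
  withLock: <T>(fn: () => Promise<T>) => Promise<T>,
): Promise<void> {
  await withLock(async () => {
    const ids = [...patches.keys()];
    const current = await table.query(ids); // Step 1: fresh read inside the lock
    const merged = current.map((row) => ({
      id: row.id,
      // Step 2: plugin fields stay as the base; patch and marker layer on top.
      metadata: { ...row.metadata, ...(patches.get(row.id) ?? {}), ...marker },
    }));
    await table.deleteByIds(ids); // Step 4: one batch delete
    try {
      await table.add(merged); // Step 5: one batch add
    } catch (err) {
      // Rollback: best-effort per-entry restore of the original rows.
      for (const row of current) {
        try { await table.add([row]); } catch { /* keep trying the rest */ }
      }
      throw err;
    }
  });
}
```

Because the read happens inside the same lock as the write, fields written concurrently by the Plugin (e.g. injected_count) survive the merge — this is the F4/MR3 behaviour.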

Other changes


src/memory-upgrader.ts core changes

Old flow

for entry:
  lock() → LLM enrichment + store.update() → unlock()  ← N locks

New flow

Phase 1 (no lock):
  for entry: prepareEntry()  ← LLM L0/L1/L2

Phase 2 (1 lock):
  writeEnrichedBatch() → bulkUpdateMetadataWithPatch()  ← 1 lock

scopeFilter is now correctly propagated to the write phase ([F7-fix]).


Maintainer issue fix mapping

| Issue | Fix |
| --- | --- |
| F2 Nested lock | writeEnrichedBatch() calls bulkUpdateMetadataWithPatch() directly; Phase 2 holds only 1 lock level |
| F4 Re-read wipes Plugin fields | The base read in Step 1 includes injected_count, which survives the merge |
| F6 cleanMarker bypasses whitelist | cleanMarker also passes through the ALLOWED_PATCH_KEYS filter |
| MR2 Phase 2a builds full metadata string | Only patch + marker are passed; the merge moved inside the lock |
| MR3 Re-read after delete | Order is now Query → Delete → Add |
| MR5 O(n²) duplicate detection | uniqueEntries deduplicates via new Map() |
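The MR5 fix replaces a quadratic scan with a single pass over a Map keyed by entry id. A minimal sketch (the Entry shape is an assumption; the PR only names uniqueEntries and new Map()):

```typescript
// Sketch of the MR5 fix: O(n) dedup via a Map keyed by entry id.
// Later duplicates overwrite earlier ones (last write wins).

interface Entry { id: string; payload: string }

function uniqueEntries(entries: Entry[]): Entry[] {
  const byId = new Map<string, Entry>();
  for (const entry of entries) byId.set(entry.id, entry); // single pass
  return [...byId.values()];
}
```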

CI manifest changes

Added (Issue #632):

  • test/upgrader-phase2-lock.test.mjs
  • test/upgrader-phase2-extreme.test.mjs
  • test/bulk-recovery-rollback.test.mjs
  • test/upgrader-whitelist-regression.test.mjs

Removed (these test files still exist; they are only removed from the manifest and no longer run with this PR):


Explanation of the -458 deletions

All of them belong to the pendingBatch accumulator (Issue #690) and are unrelated to the Phase 2 lock:

  • pendingBatch array + flushTimer + doFlush()
  • the code routing store() to bulkStore()
  • MAX_PENDING_BATCH_SIZE + accumulator cleanup

Verified: no unrelated content was accidentally deleted.

@jlin53882 jlin53882 closed this May 9, 2026
