
feat(redis-lock): distributed Redis lock implementation, store.ts integration + file-lock fallback #788

Closed
jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:james/pr739-redis-lock-clean

Conversation

@jlin53882
Contributor

PR 739 — Redis Distributed Lock Implementation

Goal

Use a distributed Redis lock to resolve lock contention under highly concurrent writes, while seamlessly degrading to the existing file lock whenever Redis is unavailable.

Implementation

Option E: Redis lock first + file lock fallback

  • When Redis is available: use the Redis lock (distributed, cross-process)
  • When Redis is unavailable (connection error): seamlessly degrade to the file lock
  • Both locks use the same lock key: .memory-write.lock
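The Option E flow above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `LockManager`, `RedisUnavailableError`, and `acquireFileLock` are stand-ins for the real implementations in src/redis-lock.ts and src/store.ts.

```typescript
// Illustrative sketch of "Redis lock first, file lock fallback".
class RedisUnavailableError extends Error {}

interface LockManager {
  // Resolves to a release function once the lock is held.
  acquire(key: string): Promise<() => Promise<void>>;
}

async function runWithLock<T>(
  manager: LockManager | null,
  key: string,
  fn: () => Promise<T>,
  acquireFileLock: (key: string) => Promise<() => Promise<void>>,
): Promise<T> {
  if (manager) {
    try {
      const release = await manager.acquire(key);
      try {
        return await fn();
      } finally {
        await release();
      }
    } catch (err) {
      // Only a Redis connectivity failure falls through to the file lock;
      // any other error (including timeouts) propagates to the caller.
      if (!(err instanceof RedisUnavailableError)) throw err;
    }
  }
  // Redis unavailable (or never configured): degrade to the file lock.
  const release = await acquireFileLock(key);
  try {
    return await fn();
  } finally {
    await release();
  }
}
```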

Fixed issues

Issue     Problem                                           Fix
#670      stale lock cleanup caused ENOENT                  existing realpath: false setting kept
Issue 1   ioredis top-level import crashed the process      switched to a dynamic import
Issue 3   parseRedisUrl() dropped the DB selection          parse correctly with the URL API
Issue 4   stores with different dbPath blocked each other   dbPath hash namespace
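The Issue 3 fix can be sketched with the WHATWG URL API, which keeps the DB index in the path instead of losing it to naive host:port splitting. The real parseRedisUrl in src/redis-lock.ts may differ; this is an illustrative version under that assumption.

```typescript
// Sketch of URL-based Redis URL parsing that preserves the DB selection.
function parseRedisUrl(raw: string): {
  host: string;
  port: number;
  db: number;
  password?: string;
} {
  const url = new URL(raw);
  if (url.protocol !== "redis:" && url.protocol !== "rediss:") {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  // "redis://host:6380/2" → pathname "/2" → DB 2; empty path → default DB 0.
  const dbPart = url.pathname.replace(/^\//, "");
  return {
    host: url.hostname || "127.0.0.1",
    port: url.port ? Number(url.port) : 6379,
    db: dbPart ? Number(dbPart) : 0,
    password: url.password || undefined,
  };
}
```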

Changed files

File                                      Change
src/redis-lock.ts                         new (327 lines) — RedisLockManager implementation
src/store.ts                              +49 lines — runWithFileLock integrates the Redis lock
test/redis-lock-url-parse.test.mjs        new (125 lines, 15 tests)
test/redis-lock-error-types.test.mjs      new (138 lines, 6 tests)
test/redis-lock-concurrent-init.test.mjs  new (84 lines, 6 tests)
scripts/ci-test-manifest.mjs              registers the 3 tests

Tests

  • All 40 redis-lock tests pass
  • Existing lock-related tests (lock-release-on-error.test.mjs) pass

- src/redis-lock.ts: RedisLockManager implementation (dynamic import, parseRedisUrl fix, dbPath namespace)
- src/store.ts: runWithFileLock upgraded — Redis lock first, fallback to the file lock on RedisUnavailableError
- test/redis-lock-*.test.mjs: 3 unit test files (URL parse / error types / concurrent init)
- scripts/ci-test-manifest.mjs: registers the 3 redis-lock tests

Option E strategy: use the Redis lock when Redis is available; seamlessly degrade to the file lock when it is not.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 605747abfb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/store.ts
Comment on lines +281 to +284
} catch (err) {
// M4 fix: RedisUnavailableError → seamless fallback to file lock
if (err instanceof Error && Symbol.for("RedisUnavailableError") in err) {
// Redis connection failed; take the file lock fallback

P1 Badge Do not fall back to file locks after Redis was selected

When Redis was healthy at startup, a later RedisUnavailableError during manager.acquire() falls through to the file-lock path. In a multi-process deployment, other processes that still have Redis connectivity continue guarding the same dbPath only with Redis and never acquire this file lock, so both lock domains can enter fn() concurrently for the same LanceDB store. This defeats the cross-process write serialization and can corrupt or lose writes; runtime Redis failures should not silently switch a single process to a different lock domain unless all participants are coordinated.


Comment thread src/store.ts Outdated
Comment on lines +89 to +92
if (redisLockInitAttempted) {
return redisLockManager;
}
redisLockInitAttempted = true;

P1 Badge Await the in-flight Redis initialization

If two writes start before the first Redis initialization finishes, the first call sets redisLockInitAttempted = true and awaits createRedisLockManager(), while the second call immediately returns the still-null redisLockManager and uses the file lock. With Redis available, the first call then uses a Redis lock, so startup writes can run under different lock domains for the same dbPath. Cache and await the initialization promise instead of marking initialization complete before the manager is assigned.


Comment thread src/store.ts Outdated
Comment on lines +88 to +90
async function getRedisLockManager(dbPath: string) {
if (redisLockInitAttempted) {
return redisLockManager;

P1 Badge Cache Redis lock managers per dbPath

The process-wide singleton ignores dbPath after the first store initializes. If a process opens store A first and later writes store B, it will still acquire Redis keys namespaced with A's hash, while another process that opens B first uses B's hash; those two processes can then write to the same store B without sharing a Redis lock. Since dbPath is part of the lock namespace, the cache needs to be keyed by dbPath rather than returning one global manager.

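Both review findings (await the in-flight initialization; cache managers per dbPath) can be addressed together by caching the initialization promise itself, keyed by dbPath. A minimal sketch, assuming stand-in names for `RedisLockManager` and `createRedisLockManager`:

```typescript
// Sketch: per-dbPath cache of initialization promises. A second caller
// arriving mid-init shares the same promise instead of observing null,
// and each dbPath gets its own manager/namespace.
interface RedisLockManager {
  namespace: string;
}

async function createRedisLockManager(
  dbPath: string,
): Promise<RedisLockManager | null> {
  // The real code would dynamically import ioredis and connect,
  // returning null when Redis is unreachable.
  return { namespace: `lock:${dbPath}` };
}

const redisLockCache = new Map<string, Promise<RedisLockManager | null>>();

function getRedisLockManager(dbPath: string): Promise<RedisLockManager | null> {
  let p = redisLockCache.get(dbPath);
  if (!p) {
    // Store the promise synchronously, before any await, so concurrent
    // callers for the same dbPath always share one initialization.
    p = createRedisLockManager(dbPath).catch(() => null);
    redisLockCache.set(dbPath, p);
  }
  return p;
}
```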

Contributor

Review finding:

P1: runtime Redis fallback can split the lock domain

In src/store.ts, once getRedisLockManager(this.config.dbPath) returns a manager, runWithFileLock() tries Redis first but falls back to the local file lock when manager.acquire() throws RedisUnavailableError:

  • src/store.ts:272-275
  • src/store.ts:282-284

That is unsafe for a distributed lock. If one process has selected Redis locking and another process later sees a transient Redis failure and falls back to file locking, those two processes no longer coordinate with each other and can write concurrently. Runtime Redis lock acquisition failures should probably fail closed, or the process needs a globally consistent mode switch that guarantees all writers are in the same lock domain before entering the critical section.

P2: Redis namespace is process-global but initialized from the first dbPath

redisLockManager and redisLockInitAttempted are singletons (src/store.ts:81-92), while the namespace is derived from the first dbPath. Later MemoryStore instances with a different dbPath reuse the first manager/namespace, so the PR's dbPath isolation guarantee does not hold inside one process. Consider caching managers by normalized dbPath instead of using a single process-wide instance.

R2 (CRITICAL): Per-dbPath cache replaces singleton
- redisLockCache Map<dbPath, {manager, initPromise}> per-dbPath instead of global singleton
- initPromise=null → fully initialized; in-flight Promise shared for concurrent calls
- Fixes: singleton ignored dbPath, all instances shared one manager

R3 (HIGH): Race-safe Promise-based init
- Concurrent getRedisLockManager('/same') share one init Promise
- TOCTOU window eliminated via Map synchronization

R1 (MEDIUM): release() error non-fatal handling
- isRedisConnectionError(releaseErr) in finally block catches and warns
- Connection errors don't propagate; unexpected errors still throw

Test: test/redis-lock-cache-r1-r2-r3.test.mjs (9 subtests, all pass)
CI: scripts/ci-test-manifest.mjs updated
@jlin53882
Contributor Author

R1/R2/R3 fix notes

The three issues the maintainer raised have been fixed in commit 51f3df4 on james/pr739-redis-lock-clean.


R2 (CRITICAL) — Singleton ignored dbPath ✅

The previous code used a global `let redisLockManager` singleton: every dbPath shared one manager instance, and the dbPath argument was ignored.

Fix: switched to a redisLockCache Map<dbPath, {manager, initPromise}>, so each dbPath manages its own lock manager.


R3 (HIGH) — Init race condition ✅

redisLockInitAttempted was a plain boolean, and a TOCTOU race let concurrent callers run initialization twice.

Fix: the Map itself is a synchronous container, and the in-flight Promise is shared by concurrent callers, so the manager is never created twice.


R1 (MEDIUM) — release() errors threw mid-operation ✅

With `finally { await release(); }`, a dropped Redis connection made release() throw, masking the result of an otherwise successful operation.

Fix: a try-catch inside the finally block; connection errors are logged with console.warn (non-fatal), while any other error is still rethrown.
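The R1 fix can be sketched as follows. `isRedisConnectionError` and `withLock` are illustrative stand-ins for the PR's actual helpers, and the error-message patterns are assumptions about how connection failures surface:

```typescript
// Sketch: a release() failure caused by a dropped Redis connection is
// logged but not rethrown, so it cannot mask a successful write;
// unexpected release errors remain fatal.
function isRedisConnectionError(err: unknown): boolean {
  return (
    err instanceof Error &&
    /ECONNREFUSED|ECONNRESET|Connection is closed/.test(err.message)
  );
}

async function withLock<T>(
  acquire: () => Promise<() => Promise<void>>,
  fn: () => Promise<T>,
): Promise<T> {
  const release = await acquire();
  try {
    return await fn();
  } finally {
    try {
      await release();
    } catch (releaseErr) {
      if (!isRedisConnectionError(releaseErr)) throw releaseErr; // unexpected: still fatal
      console.warn("redis-lock: release failed after connection loss", releaseErr);
    }
  }
}
```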


Test verification

9 subtests, all passing:

  • R2: createRedisLockManager returns null / namespace isolation / per-dbPath isolation
  • R3: concurrent calls return consistent results / different dbPaths init separately
  • R1: isRedisConnectionError detection / non-connection errors still throw / release errors caught

Test file: test/redis-lock-cache-r1-r2-r3.test.mjs

@jlin53882
Contributor Author

RED #1 tracking update

Problem: validateStoragePath (store.ts:142) performs 5 synchronous I/O calls, blocking the event loop during plugin init.

Assessment: this does not fit PR #788 or PR #794; each has an independent scope.

PR        Goal
PR #788   Redis lock
PR #794   runWithFileLock internal sync I/O

Tracked in the newly filed Issue #795:
#795

Fix direction: → async + fire-and-forget call site ().

P3: null manager cache entry race - all concurrent calls return identical results (no null manager leak) ✅ VERIFIED PASS

P4: cache has no TTL/refresh - repeated calls return the same cached value (no retry mechanism exists) ✅ VERIFIED CONFIRMED

The uncaughtException in the test run is a cleanup artifact (noisy, but tests pass). Both issues are observable and correctly handled.
@TurboTheTurtle
Contributor

Thanks for the quick fix on the per-dbPath Redis manager cache. That resolves the namespace/singleton part of my comment.

I think the P1 lock-domain split is still unresolved on the current head, though. runWithFileLock() still catches RedisUnavailableError from manager.acquire() and then falls through to the local file lock path (src/store.ts:306-314). Once a process has selected Redis locking, a transient runtime Redis acquire failure should not let that same write enter the local file lock domain while other writers may still be coordinating through Redis.

Could this fail closed for runtime Redis acquisition failures instead, or otherwise document/enforce a globally consistent fallback mode before entering the critical section? Initial Redis unavailability before a manager exists can reasonably choose file locking, but falling back after Redis mode is active is the part that still seems unsafe for distributed writers.

@jlin53882
Contributor Author

👋 @TurboTheTurtle here is my analysis:


The core question: fail closed vs. fallback

Your concern: if a runtime acquire() failure falls back to the file lock, some processes end up in the Redis domain and others in the file-lock domain, a split-brain state that breaks distributed coordination.


The two real-world acquire() failure scenarios

Transient outage (Redis unreachable)

  • set() throws → isRedisConnectionError() = true → RedisUnavailableError → catch block
  • This takes the file lock fallback path

Lock competition (held by another owner)

  • set() returns null → retry loop → timeout → throw Error("Lock acquisition timeout")
  • Not covered by the catch block; it propagates and the write fails
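The two failure modes can be sketched as below. The `set(key, value, "PX", ttl, "NX")` shape mimics the ioredis SET command (resolving "OK" or null); the names, timings, and retry policy are illustrative, not the PR's exact code.

```typescript
// Sketch of acquire() with its two distinct failure modes.
class RedisUnavailableError extends Error {}

interface MinimalRedis {
  set(
    key: string,
    value: string,
    px: "PX",
    ttl: number,
    nx: "NX",
  ): Promise<"OK" | null>;
}

async function acquire(
  client: MinimalRedis,
  key: string,
  timeoutMs = 200,
  retryMs = 20,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    let result: "OK" | null;
    try {
      result = await client.set(key, "owner", "PX", 30_000, "NX");
    } catch (err) {
      // Mode 1: connection error → RedisUnavailableError → caller may fall back.
      throw new RedisUnavailableError(String(err));
    }
    if (result === "OK") return; // lock acquired
    // Mode 2: lock held by another owner → retry until timeout, then throw.
    if (Date.now() >= deadline) throw new Error("Lock acquisition timeout");
    await new Promise((r) => setTimeout(r, retryMs));
  }
}
```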

Why fail closed is not the better choice in this system

Fail closed = give up the write when Redis blips (throw, no fallback).

Cost comparison

Approach             During a transient Redis outage
Fail closed          Write abandoned → data not persisted → upstream retries also fail → total write outage
File lock fallback   Write proceeds → data persisted → coordination within a single machine stays correct

But does split-brain actually happen?

After reading the code, the key is here:

// store.ts runWithFileLock (simplified)
const manager = await getRedisLockManager(this.config.dbPath);
if (manager) {
  // Inside the Redis lock domain
  try {
    const release = await manager.acquire(...);
    // ... run fn() under the Redis lock, then release ...
  } catch (err) {
    if (err instanceof RedisUnavailableError) {
      // Reached only on a Redis blip while this process's manager != null
      // → fall through to the file lock below
    }
  }
}
// The file lock fallback sits outside the if (manager) block

The path to split-brain

  • Process A: getRedisLockManager() returns a manager (Redis reachable at startup)
  • Process B: getRedisLockManager() returns a manager (Redis reachable at startup)
  • Redis has a transient outage
  • Process A: acquire() fails → falls back to the file lock
  • Process B: acquire() is still retrying, or also falls back

But fail closed does not fully eliminate split-brain either

Fail closed only means the process that hit the blip gives up its write, while other healthy processes stay in the Redis domain. At that point:

  • Some processes are in the Redis domain (operating normally)
  • Some processes failed closed (abandoned their writes)
  • That is still a split, only now "some processes gave up" rather than "some processes fell back"

The one real advantage of fail closed: no unpredictable "two lock types at once" behavior.


建議的處理方式

方案 行為 評估
Fail closed(你建議的) 瞬斷時放棄寫入 看起來安全,但代價是「瞬斷 = 全面停擺」,沒有任何額外保護
文件說明 不改 code,只說明「瞬斷時 fallback 是預期行為」 最輕量,但可能不接受
Process-level domain lock 第一次 fallback 後,process 內設定「此後只用 file lock」 複雜,需額外同步

Questions to confirm

I would like to confirm your actual deployment architecture:

  1. Single-machine, multi-process deployment: the file lock fallback is fully correct and coordination works. Fail closed would only cause unnecessary write failures.
  2. Multi-machine distributed deployment: a transient Redis outage already breaks cross-machine coordination; with or without fail closed, cross-machine writes will fail.

Knowing the actual deployment architecture is what lets us judge how high the real split-brain risk is.


Closing this PR for now.

Redis-based distributed locking is a substantial optional architecture change, and the current branch still has failing CI. We should only take this complexity when there is a clear product requirement for multi-process/distributed writes and a green, narrowly scoped implementation. A future PR can revisit this with that requirement and validation in place.

@rwmjhb rwmjhb closed this May 12, 2026