Skip to content

fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784

Open
jlin53882 wants to merge 2 commits into
CortexReach:masterfrom
jlin53882:james/pr783-truly-clean
Open

fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784
jlin53882 wants to merge 2 commits into
CortexReach:masterfrom
jlin53882:james/pr783-truly-clean

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented May 9, 2026

Summary

Rebuild of PRs #668 and #708 with cleaner scope — fixing the "stuck session" issue (embedQuery HTTP call holds lock after timeout) and adding BM25 fallback when embedding fails.

Problem

PR #668: Session lock held after timeout

Auto-recall uses Promise.race(timeout, recallWork()) — when timeout resolves after 5s, the hook returns undefined but the in-flight embedQuery HTTP call keeps running. This holds the per-agent memory-runtime mutex for minutes.

PR #708: Auto-recall always uses BM25 (quality regression)

PR #708 tried to fix the above by forcing auto-recall to BM25 mode (no embedding, no network). But this changes the default behavior — manual/CLI retrieval still uses hybrid, while auto-recall always uses BM25. Quality regression for normal auto-recall scenarios.

Solution (James's design)

Instead of always using BM25 for auto-recall, use BM25 as a fallback when embedding fails:

Normal path: vector/hybrid + signal threading
  → embedQuery(query, signal) — can be cancelled by AbortController

Error path: if embedQuery fails → fallback to BM25
  → no embedding needed, no network, immediate result

Important: When the signal is already aborted (deliberate cancel), we re-throw immediately — no BM25 fallback. The BM25 fallback is reserved for non-abort errors (network failures, API errors, genuine embedding failures).

Changes

File Change
src/retriever.ts Add signal?: AbortSignal to RetrievalContext
src/retriever.ts Add signal param to vectorOnlyRetrieval + hybridRetrieval
src/retriever.ts Pass signal to embedQuery(query, signal)
src/retriever.ts Add try/catch in embedQuery: if signal?.aborted → re-throw; else → BM25 fallback
src/retriever.ts Remove useLightweightAutoRecall (replaced by fallback design)
src/retriever.ts Fix retrieveWithTrace() missing mode/diagnostics declarations
test/retriever-auto-recall-signal.test.mjs Unit tests for signal propagation + abort/fallback behavior
scripts/ci-test-manifest.mjs Register test in core-regression

Key behaviors

  1. Signal threading: AbortController.signal flows from RetrievalContext.signalembedQuery(query, signal) → HTTP request is cancelled on abort
  2. AbortError → re-throw, no fallback: If embedQuery throws because the signal was aborted, re-throw immediately (don't waste resources on BM25)
  3. Non-abort error → BM25 fallback: If embedQuery throws for other reasons (network/API), try BM25 instead. Only re-throws if BM25 also returns no results.
  4. No behavior change for normal cases: If embedQuery completes normally, retrieval works exactly as before

Comparison

PR #668 (original) PR #708 This PR
Cancel embed on timeout
Auto-recall always BM25
Embed fail → BM25 fallback
Abort signal → re-throw (no waste)
Backward compatible
Scope index.ts + retriever.ts retriever.ts only retriever.ts only

Review fix (v2)

After PR #784 review by TurboTheTurtle, added explicit abort check before BM25 fallback:

} catch (embedError) {
  // Only do BM25 fallback for non-abort errors
  if (signal?.aborted) {
    throw embedError;  // ← New: re-throw for abort, no waste
  }
  // BM25 fallback for genuine failures
  ...
}

…68/PR708 clean rebuild)

Rebuild of PRs CortexReach#668 and CortexReach#708 with cleaner scope.

Changes:
- Add signal?: AbortSignal to RetrievalContext interface
- Add signal param to vectorOnlyRetrieval + hybridRetrieval
- Pass signal to embedQuery(query, signal) for true cancellation
- Add try/catch in embedQuery: catch → BM25 fallback (James's design)
- Remove useLightweightAutoRecall (replaced by fallback design)
- Fix retrieveWithTrace() missing mode/diagnostics declarations

Design: normal path uses vector/hybrid; if embed fails (timeout/network/API error),
fallback to BM25 instead of returning nothing. Doesn't change default behavior.

Tests:
- retriever-auto-recall-signal.test.mjs: signal propagation + BM25 fallback

Refs: PR CortexReach#668, PR CortexReach#708, PR CortexReach#746
Copy link
Copy Markdown
Contributor

Review finding:

P2: aborted embedding requests still continue into BM25 fallback work

The new signal is passed into embedQuery(), but the catch blocks treat every embedding failure the same and immediately run BM25 fallback:

  • src/retriever.ts:742-747
  • src/retriever.ts:950-955

If the caller aborts the signal because auto-recall timed out, this path still performs store reads and may return fallback results instead of stopping promptly. That undercuts the cancellation goal and can keep the session lock active after the timeout. Abort errors should probably be rethrown or explicitly short-circuited before attempting BM25 fallback; the fallback should be reserved for non-abort embedding failures.

When embedQuery() throws due to an abort signal, we now re-throw
immediately instead of wasting resources doing BM25 fallback retrieval.

Only BM25 fallback is attempted for non-abort errors (network/API failures).

Fixes PR784 review comment by TurboTheTurtle:
"aborted embedding requests still continue into BM25 fallback work"

Updated tests:
- AbortError re-thrown in vectorOnlyRetrieval (signal already aborted)
- AbortError re-thrown in hybridRetrieval (signal already aborted)
- BM25 fallback used for non-abort errors (network error, signal not aborted)
@jlin53882
Copy link
Copy Markdown
Contributor Author

Review fix applied ✅

已根據你的 comment 修復。

問題

} catch (embedError) {
  // 所有的 embed error 都走這裡 — 包括 AbortError
  const bm25Results = await this.bm25OnlyRetrieval(...);  // ← 不該跑的浪費
  ...
}

當用戶超時取消(controller.abort())時,embedQuery 拋出 AbortError。原本的實作在這個 catch 裡又去做 BM25 fallback——浪費資源又違背取消的原意。

修復

在 catch 區塊最前面加了一個 guard:

} catch (embedError) {
  // Only do BM25 fallback for non-abort errors.
  // If the caller aborted the signal, re-throw immediately.
  if (signal?.aborted) {
    throw embedError;  // ← 新的判斷
  }
  // Fallback to BM25 for genuine failures (network/API errors)
  ...
}

三種情境現在的行為

情境 embedQuery 行為 catch 區塊行為
正常完成 ✅ 返回 vector 不進 catch
超時取消(signal.aborted) throw AbortError ✅ re-throw,不做 BM25
網路 / API 失敗 throw Error ✅ BM25 fallback

commit: e1d80ab

@jlin53882
Copy link
Copy Markdown
Contributor Author

為什麼用 BM25 fallback 而不是 retry?

好問題。這是刻意的設計取捨。

兩個選項

Retry(重試):embedQuery 失敗 → 等一下再試 → 可能成功 → 可能還是失敗 → 使用者等到超時還是沒結果

BM25 Fallback:embedQuery 失敗 → 立刻放棄嵌入 → 用本地 BM25(不需要網路)→ 毫秒級返回結果

為什麼在 auto-recall 語境下選 BM25 fallback

  1. 我們已經在超時情境下:auto-recall 用 Promise.race(timeout, recallWork()) 限制了時間(預設 5 秒)。如果 embedQuery 超時了,retry 大概率也會超時。

  2. 使用者體驗:重試失敗後,使用者拿到的是「超時無結果」。BM25 fallback 起碼有東西可以回。

  3. Session lock 盡快釋放:原本的問題就是 embedQuery HTTP 佔住 session lock 幾分鐘。重試只會讓這段佔用更長。

BM25 fallback 的代價

品質稍微低一點(語意 vs 關鍵字)。但:

  • 服務立刻可用(不是「完全沒回應」)
  • Session lock 立刻釋放(不會卡住其他操作)

什麼時候適合 retry?

  • 沒有超時限制的場景:CLI 查詢、手動檢索(沒有 5 秒限制)
  • 錯誤是明確可復原的:如「連接數过多」這類短暫錯誤

在 auto-recall 的語境下,BM25 fallback 是合理選擇。

@jlin53882
Copy link
Copy Markdown
Contributor Author

Code quality fixes — defensive coding

Added null-check guards in the BM25 fallback path (two locations) to prevent potential runtime crashes if diagnostics is undefined.

Changes

vectorOnlyRetrieval (line 751-761)

// Before
diagnostics.bm25Query = query;
...
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;

// After
if (diagnostics) diagnostics.bm25Query = query;
...
if (diagnostics) {
  diagnostics.dropSummary = buildDropSummary(diagnostics);
  this.lastDiagnostics = diagnostics;
}

hybridRetrieval (line 968-976) — same pattern applied.

Why

diagnostics is optional in RetrievalContext. Currently retrieve() always constructs it before calling vectorOnlyRetrieval / hybridRetrieval, so this never crashes today. But:

  • These are private methods — future refactors or copy-paste could call them without diagnostics
  • Adding the guard makes the code more defensive and self-documenting

Severity

🟡 Low — no functional bug today, purely defensive. Does not affect the core signal-threading or BM25 fallback logic. No test changes needed.


Commit: f649c66 — fix(retriever): add diagnostics null-check guards in BM25 fallback (defensive coding)

@OKmiao88888888
Copy link
Copy Markdown

We're experiencing the exact issue described in this PR — auto-recall timing out after 30s and stale locks persisting across sessions. This is happening in our production environment on OpenClaw 2026.5.x. Would love to see this merged soon! Thanks for the great work.

@TurboTheTurtle
Copy link
Copy Markdown
Contributor

Thanks for confirming the production impact. For maintainers: this report lines up with the timeout/stale-session behavior that #784 and #790 address from complementary angles. #784 makes cancellation propagate through retrieval, while #790 prevents late auto-recall work from being injected after the timeout boundary. I’d treat these as linked fixes for the OpenClaw 2026.5.x symptom.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants