fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784
fix(retriever): AbortSignal threading + embedQuery BM25 fallback (PR668/PR708 clean rebuild)#784jlin53882 wants to merge 2 commits into
Conversation
…68/PR708 clean rebuild) Rebuild of PRs CortexReach#668 and CortexReach#708 with cleaner scope. Changes: - Add signal?: AbortSignal to RetrievalContext interface - Add signal param to vectorOnlyRetrieval + hybridRetrieval - Pass signal to embedQuery(query, signal) for true cancellation - Add try/catch in embedQuery: catch → BM25 fallback (James's design) - Remove useLightweightAutoRecall (replaced by fallback design) - Fix retrieveWithTrace() missing mode/diagnostics declarations Design: normal path uses vector/hybrid; if embed fails (timeout/network/API error), fallback to BM25 instead of returning nothing. Doesn't change default behavior. Tests: - retriever-auto-recall-signal.test.mjs: signal propagation + BM25 fallback Refs: PR CortexReach#668, PR CortexReach#708, PR CortexReach#746
|
Review finding: P2: aborted embedding requests still continue into BM25 fallback work The new
If the caller aborts the signal because auto-recall timed out, this path still performs store reads and may return fallback results instead of stopping promptly. That undercuts the cancellation goal and can keep the session lock active after the timeout. Abort errors should probably be rethrown or explicitly short-circuited before attempting BM25 fallback; the fallback should be reserved for non-abort embedding failures. |
When embedQuery() throws due to an abort signal, we now re-throw immediately instead of wasting resources doing BM25 fallback retrieval. Only BM25 fallback is attempted for non-abort errors (network/API failures). Fixes PR784 review comment by TurboTheTurtle: "aborted embedding requests still continue into BM25 fallback work" Updated tests: - AbortError re-thrown in vectorOnlyRetrieval (signal already aborted) - AbortError re-thrown in hybridRetrieval (signal already aborted) - BM25 fallback used for non-abort errors (network error, signal not aborted)
Review fix applied ✅已根據你的 comment 修復。 問題} catch (embedError) {
// 所有的 embed error 都走這裡 — 包括 AbortError
const bm25Results = await this.bm25OnlyRetrieval(...); // ← 不該跑的浪費
...
}當用戶超時取消( 修復在 catch 區塊最前面加了一個 guard: } catch (embedError) {
// Only do BM25 fallback for non-abort errors.
// If the caller aborted the signal, re-throw immediately.
if (signal?.aborted) {
throw embedError; // ← 新的判斷
}
// Fallback to BM25 for genuine failures (network/API errors)
...
}三種情境現在的行為
commit: |
為什麼用 BM25 fallback 而不是 retry?好問題。這是刻意的設計取捨。 兩個選項Retry(重試):embedQuery 失敗 → 等一下再試 → 可能成功 → 可能還是失敗 → 使用者等到超時還是沒結果 BM25 Fallback:embedQuery 失敗 → 立刻放棄嵌入 → 用本地 BM25(不需要網路)→ 毫秒級返回結果 為什麼在 auto-recall 語境下選 BM25 fallback
BM25 fallback 的代價品質稍微低一點(語意 vs 關鍵字)。但:
什麼時候適合 retry?
在 auto-recall 的語境下,BM25 fallback 是合理選擇。 |
Code quality fixes — defensive codingAdded null-check guards in the BM25 fallback path (two locations) to prevent potential runtime crashes if ChangesvectorOnlyRetrieval (line 751-761) // Before
diagnostics.bm25Query = query;
...
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;
// After
if (diagnostics) diagnostics.bm25Query = query;
...
if (diagnostics) {
diagnostics.dropSummary = buildDropSummary(diagnostics);
this.lastDiagnostics = diagnostics;
}hybridRetrieval (line 968-976) — same pattern applied. Why
Severity🟡 Low — no functional bug today, purely defensive. Does not affect the core signal-threading or BM25 fallback logic. No test changes needed. Commit: |
|
We're experiencing the exact issue described in this PR — auto-recall timing out after 30s and stale locks persisting across sessions. This is happening in our production environment on OpenClaw 2026.5.x. Would love to see this merged soon! Thanks for the great work. |
|
Thanks for confirming the production impact. For maintainers: this report lines up with the timeout/stale-session behavior that #784 and #790 address from complementary angles. #784 makes cancellation propagate through retrieval, while #790 prevents late auto-recall work from being injected after the timeout boundary. I’d treat these as linked fixes for the OpenClaw 2026.5.x symptom. |
Summary
Rebuild of PRs #668 and #708 with cleaner scope — fixing the "stuck session" issue (embedQuery HTTP call holds lock after timeout) and adding BM25 fallback when embedding fails.
Problem
PR #668: Session lock held after timeout
Auto-recall uses
Promise.race(timeout, recallWork())— when timeout resolves after 5s, the hook returnsundefinedbut the in-flight embedQuery HTTP call keeps running. This holds the per-agent memory-runtime mutex for minutes.PR #708: Auto-recall always uses BM25 (quality regression)
PR #708 tried to fix the above by forcing auto-recall to BM25 mode (no embedding, no network). But this changes the default behavior — manual/CLI retrieval still uses hybrid, while auto-recall always uses BM25. Quality regression for normal auto-recall scenarios.
Solution (James's design)
Instead of always using BM25 for auto-recall, use BM25 as a fallback when embedding fails:
Important: When the signal is already aborted (deliberate cancel), we re-throw immediately — no BM25 fallback. The BM25 fallback is reserved for non-abort errors (network failures, API errors, genuine embedding failures).
Changes
src/retriever.tssignal?: AbortSignaltoRetrievalContextsrc/retriever.tssignalparam tovectorOnlyRetrieval+hybridRetrievalsrc/retriever.tssignaltoembedQuery(query, signal)src/retriever.tssignal?.aborted→ re-throw; else → BM25 fallbacksrc/retriever.tsuseLightweightAutoRecall(replaced by fallback design)src/retriever.tsretrieveWithTrace()missing mode/diagnostics declarationstest/retriever-auto-recall-signal.test.mjsscripts/ci-test-manifest.mjscore-regressionKey behaviors
AbortController.signalflows fromRetrievalContext.signal→embedQuery(query, signal)→ HTTP request is cancelled on abortembedQuerythrows because the signal was aborted, re-throw immediately (don't waste resources on BM25)embedQuerythrows for other reasons (network/API), try BM25 instead. Only re-throws if BM25 also returns no results.Comparison
Review fix (v2)
After PR #784 review by TurboTheTurtle, added explicit abort check before BM25 fallback: