Skip to content

Realm-server: batch link resolution in loadLinks (CS-11038)#4668

Merged
habdelra merged 10 commits intomainfrom
cs-11038-batch-load-links
May 6, 2026
Merged

Realm-server: batch link resolution in loadLinks (CS-11038)#4668
habdelra merged 10 commits intomainfrom
cs-11038-batch-load-links

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

@habdelra habdelra commented May 5, 2026

Summary

Closes CS-11038.

RealmIndexQueryEngine.loadLinks was the dominant cost behind _federated-search 90s timeouts on staging — it resolved every linked card with its own getInstance(linkURL) round-trip. For an N-result search with M average links per card, that produced N×M sequential SELECT * FROM boxel_index WHERE i.url = $1 round-trips, all serialized on Node's event loop and contending for the shared PG pool with every other concurrent request.

This PR replaces the recursive per-link traversal with a level-order BFS: each recursion depth issues at most one batched WHERE i.url IN (...) query (plus one for file-meta) regardless of how many siblings reference links at that depth.

Live data anchor (staging, 2026-04-28 → 2026-05-05)

  • 110 of 126 (87%) _federated-search 90s timeouts in 7d traced to N×M sequential link fan-out in loadLinks
  • During one observed flood window (UTC 13:24:25), 30+ concurrent _federated-search requests piled up in one second; per-realm cards-grid latency varied by ~300× under the same load — the variance correlates with result-set size, consistent with N×M sequential awaits dominating response time
  • Total wasted compute on _federated-search 90s timeouts: ~2.75 worker-hours / 7d

Relationship to PR #4667 (CS-11037)

Orthogonal — both ship independently:

  • PR Realm-server: make _federated-search JSON:API-compliant via include #4667 lets callers opt out of loadLinks entirely (include: []). It's the cards-grid fast path: when the consumer doesn't traverse relationships, skip link loading altogether.
  • This PR makes the case where you DO need links N+M → 1+1. Future caller-driven sparse-fields filtering still benefits from the batched pass.

No file overlap (#4667 doesn't touch realm-index-query-engine.ts). Branched off main.

Implementation

Level-order BFS in loadLinks

The original recursive loadLinks(resource) was called once per root resource, and each invocation awaited getInstance(linkURL) per relationship. Restructured the function so each layer batches across siblings:

  1. Pre-pass — walk every relationship across all resources in the current layer; classify each link as in-realm card / in-realm file / cross-realm; accumulate URL Sets (de-duplicated). Skip omit / visited URLs.
  2. Batch fetchawait Promise.all([getInstances(cardURLs), getFiles(fileURLs)]). One round-trip per kind, regardless of how many siblings asked.
  3. Hydration — walk the same entries again; pull linkResource from the resulting Maps for in-realm; cross-realm still goes through the existing per-link fetch path (out of scope to parallelize per the ticket).
  4. Build next layer from each newly-resolved linkResource; repeat.

searchCards now passes the entire doc.data array as the root layer (previously it looped per-resource), which is what enables the first layer to batch across all roots' relationships in one query.

Preserved semantics

  • Cycle preventionvisited: Set<string> shared across all layers; each linkResource is added to visited on first encounter; later encounters are short-circuited.
  • Depth gatedescendStack.length <= maxLinkDepth matches the original stack.length <= maxLinkDepth check; nothing is added to included[] or scheduled for descent past the limit.
  • linkFields filter — applied only at the root layer (matching the original "linkFields only constrains the immediate relationships of the root card"). Carried via a per-layer-item applyLinkFields bool.
  • Stale type fallbackfieldExpectsFileMeta is still called when the relationship's data.type is missing or stale, so file-meta links incorrectly typed as 'card' still resolve correctly.
  • relationship.data rewrite — same three-way condition (just-found / in-omit / in-included → rewrite with resolved type; otherwise fallback type when linkResource is unavailable).
  • URL absolutizationcloneDeep + visitInstanceURLs + visitModuleDeps applied to each newly-pushed included[] entry, identical to the original.

New batch primitives in IndexQueryEngine

  • getInstances(urls, opts) — returns Map<lookupURL, InstanceOrError>. Handles both i.url IN (...) and i.file_alias IN (...) matches in a single query so a row found via either column is keyed by whichever lookup URL the caller passed.
  • getFiles(urls, opts) — analogous batch for file-meta lookups.
  • The single-URL getInstance / getFile were refactored to share extracted row-mapping helpers (#rowToInstanceOrError, #rowToIndexedFile) with their batched siblings.

Acceptance criteria

  • loadLinks for a cards-grid-shape query issues 1 batched DB query per recursion depth, not N×M separate ones
  • Recursion still terminates (visited set + depth gate unchanged in semantics)
  • Cross-realm link resolution unchanged in correctness; can be parallelized as a follow-up
  • Regression test: 50-card / 5-link search → ≤ 2 batched-link queries (and 0 per-link queries)

Out of scope

  • Cross-realm link parallelization (sibling ticket — Promise.all on outbound HTTP fetches)
  • Caching link resolutions across requests (would need invalidation discipline)
  • Changing Realm.search's default loadLinks: true behavior (covered by Realm-server: make _federated-search JSON:API-compliant via include #4667)
  • Batching the populateQueryFields fan-out (separate searchCards queries per query-field) — out of scope here

Test plan

  • New realm-server/tests/load-links-batching-test.ts — wraps dbAdapter.execute to count instance lookups against boxel_index, runs a 50-source × 5-target searchCards with loadLinks: true, asserts:
    • i.url = \$N per-link lookups: 0 (the old N×M path is dead)
    • i.url IN (…) batched lookups: ≥ 1, ≤ 2 (one per recursion depth)
    • result.data.length === 50 and result.included.length === 5
  • Existing loadLinks: true coverage in indexing-test.ts (lines 2829, 2973) still passes under semantic equivalence
  • CI green

Code pointers

  • packages/runtime-common/realm-index-query-engine.tsloadLinks rewritten as a BFS driver; searchCards / cardDocument / loadLinksForResource updated to pass rootResources: [...]
  • packages/runtime-common/index-query-engine.ts — new getInstances and getFiles plus extracted #rowToInstanceOrError / #rowToIndexedFile row-mapping helpers
  • packages/realm-server/tests/load-links-batching-test.ts — regression test

🤖 Generated with Claude Code

`RealmIndexQueryEngine.loadLinks` resolved every linked card with its
own `getInstance(linkURL)` round-trip. For an N-result search with M
average links per card, that produced N×M sequential
`SELECT * FROM boxel_index WHERE i.url = $1` round-trips, which was
the dominant cost behind `_federated-search` 90s timeouts on staging.

Restructured loadLinks as a level-order BFS:
- each layer collects every in-realm link URL into a Set
- one batched `WHERE i.url IN (...)` query (and one for file-meta)
  per recursion depth resolves them all in one round-trip
- per-entry hydration walks the resulting Map, with the same
  relationship.data rewrite + cycle/depth semantics as the original
- cross-realm links keep the existing per-link fetch path
  (parallelizing those is a separate ticket)

Added `IndexQueryEngine.getInstances` and `getFiles` as the batched
counterparts of the existing single-URL methods, and refactored the
single-URL methods to share row-mapping helpers.

Regression test in `realm-server/tests/load-links-batching-test.ts`
asserts that a 50-source/5-target search issues zero per-link
lookups and ≤ 2 batched-link queries (1 per recursion depth).

Closes CS-11038.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f20dc0201a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/runtime-common/realm-index-query-engine.ts Outdated
The original recursive `loadLinks` checked
`stack.length <= maxLinkDepth` BEFORE pushing the current resource
onto the stack for the recursive call, so at the boundary
(stack.length == maxLinkDepth) it still recursed once more —
depth-(maxLinkDepth+1) resources ended up in `included[]` with
their `relationship.data` rewritten.

The BFS rewrite was checking `descendStack.length` (which already
includes the current parent), cutting traversal off one level
early. Use `entry.item.stack.length` (ancestors only) for the
gate; keep `descendStack` for the nextLayer push so children
carry the right ancestor chain.

Caught by Codex review on PR #4668.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets the _federated-search/loadLinks hot path in runtime-common by replacing per-link instance lookups with batched, breadth-first resolution. It fits into the realm/indexing layer where search results are hydrated into JSON:API documents with included[] relationships.

Changes:

  • Refactors RealmIndexQueryEngine.loadLinks from recursive per-resource traversal to level-order BFS across all root resources.
  • Adds batched IndexQueryEngine.getInstances() and getFiles() helpers to fetch many in-realm links in a single query per depth.
  • Adds a realm-server regression test intended to assert batched lookup behavior for a large searchCards(..., { loadLinks: true }) case.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
packages/runtime-common/realm-index-query-engine.ts Reworks link hydration to BFS/batched traversal and updates callers to pass root resource arrays.
packages/runtime-common/index-query-engine.ts Introduces batched instance/file lookup helpers and shared row-mapping helpers.
packages/realm-server/tests/load-links-batching-test.ts Adds a regression test that counts DB query shapes during loadLinks resolution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/runtime-common/realm-index-query-engine.ts
Comment thread packages/runtime-common/index-query-engine.ts Outdated
Comment thread packages/runtime-common/index-query-engine.ts Outdated
Comment thread packages/realm-server/tests/load-links-batching-test.ts Outdated
Three fixes from Copilot review on PR #4668:

* `getInstances` and `getFiles` now chunk URL lookups (450 sqlite /
  2500 pg) so a search with many unique links cannot blow past the
  sqlite parameter limit. Each chunk emits 2*N+1 placeholders (url
  IN list + file_alias IN list + i.type), matching the existing
  index-writer.ts chunking discipline.

* The regression test's SQL matcher was filtering on the literal
  string "'instance'" appearing in the rendered SQL, but
  `param('instance')` is parameterized as `$N`. Switched to
  inspecting `opts.bind` for the `'instance'` value — the matcher
  now correctly attributes link-resolution queries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
habdelra and others added 2 commits May 5, 2026 19:42
- Use a local `dbExecute` alias instead of the multi-line IIFE-style
  cast; prettier wanted the closing paren on the same line as the
  assignment.
- Split compound `assert.ok(A && B, ...)` into separate asserts;
  qunit/no-assert-logical-expression flagged the && operator inside
  assertion arguments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qunit/no-assert-logical-expression also rejects `??` inside an
assertion argument, not just `&&`. Compute the count first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra requested a review from a team May 6, 2026 00:05
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 6, 2026

Host Test Results

    1 files  ±    0      1 suites  ±0   2h 1m 45s ⏱️ + 1h 8m 35s
2 565 tests +1 073  2 550 ✅ +1 068  15 💤 +8  0 ❌  - 1 
2 584 runs  +1 079  2 569 ✅ +1 076  15 💤 +8  0 ❌  - 3 

Results for commit faba169. ± Comparison against earlier commit 914a5f0.

Realm Server Test Results

    1 files  ±  0      1 suites  ±0   16m 4s ⏱️ + 8m 38s
1 256 tests +208  1 256 ✅ +208  0 💤 ±0  0 ❌ ±0 
1 334 runs  +218  1 334 ✅ +218  0 💤 ±0  0 ❌ ±0 

Results for commit faba169. ± Comparison against earlier commit 914a5f0.

habdelra and others added 3 commits May 6, 2026 00:10
The original recursive `loadLinks` cloned a child resource only AFTER
the recursive call had mutated its `relationship.data` fields — the
caller's `for (let includedResource of subInc) { ... cloneDeep ... }`
loop ran post-recursion. The BFS rewrite was cloning at discovery
time (in the parent's layer), so the snapshot in `included[]` saw
the pre-rewrite pristine_doc data. For a `Hassan ↔ Mango` cycle the
included Mango ended up with `friend.data` pointing back to whatever
the index had stored, not the rewritten `{ type, id }` from this
loadLinks invocation.

Defer the clone+absolutize+push to a Step 5 that runs at the END of
each layer's processing — once Step 4 has finished mutating
`item.resource.relationships.<rel>.data` for every entry in this
layer. The clone now captures those mutations.

Also: when a sibling references an already-visited target, the
original code's `included.find(target)` succeeded because the first
recursion had already added it; with deferred clones that's no
longer mid-layer-true. Mark `foundLinks = true` whenever
`linkResource` is `alreadyVisited` (it's either a root in `omit` or
queued for inclusion at its own layer's end — either way it's part
of our doc).

Caught by the `Integration | realm indexing: can index a field with
a cycle in the linksTo[Many] field` host tests on PR #4668.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Non-root layer items always originate from the batched index lookup
or a cross-realm fetch, both of which produce CardResource<Saved>
shapes — but the LayerItem.resource type is the wider
LooseCardResource union (to also accept root resources). Cast at the
push site to satisfy the included[] type signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

habdelra commented May 6, 2026

Heads-up from CI investigation on the stack (#4668#4670): the loadLinks BFS rewrite in #4668 regresses two host integration tests that pass on main:

Test File Notes
Integration | realm indexing: can index a field with a cycle in the linksTo field packages/host/tests/integration/realm-indexing-test.gts:3388 Hassan ↔ Mango cycle, both in same realm
Integration | realm indexing: can index a field with a cycle in the linksToMany field :3859 Same shape, linksToMany
Acceptance | code submode tests > single realm: clicking a linksToMany field in playground panel opens the linked card JSON code-submode-tests UI-side deepEqual of a loadLinks result
All three fail with Actual: [object Object] Expected: [object Object] (no message), which means a deepEqual is comparing two near-identical structures whose diff isn't being printed by qunit. Failure shape is consistent — the BFS produces a subtly-different included[] or relationship.data from the recursive original in cycle-back-to-root cases.

Confirmed pre-existing: same exact tests fail on cs-11038's 6ddece7 (Host Tests 16/20 + 20/20). They pass on main (47e6884). #4670 inherits the failure since it's stacked on #4668; my change only touches the cross-realm branch and can't reach these in-realm cycle paths.

This blocks #4668 → main, which in turn blocks #4670. The Codex maxLinkDepth boundary fix (e218d21) didn't catch this case. Likely candidates for the BFS bug:

visited is added at root-resource enqueue and again at layer expansion; the else if (omitSet.has(linkResource.id) || included.find(...)) branch only sets foundLinks — it doesn't push to included[] when the cycle-target is the root (which is in omitSet, not included). The recursive original may have produced a different rewrite for that case.
The cloneDeep of linkResource happens before any relationship.data rewrite; nested relationships discovered later don't propagate into the cloned included[] entry.
Posting here per Hassan's request so the context isn't lost when #4667 lands first. CC @habdelra.

@habdelra
Member
Author
habdelra
commented
4 minutes ago
Got it — those two realm-indexing-test.gts tests + the code-submode deepEqual are all in #4668 BFS rewrite path, not in anything #4667 touches.

Confirmed scope-isolation for #4667:

Realm.search plumbing only added an opts.include arm; the existing loadLinks: true call path is byte-identical when include is absent.
The linkFields argument forwarded for partial-include reuses the existing seam (realm-index-query-engine.ts:914) — no traversal-order changes.
None of the failing tests exercise _federated-search; they're pure indexing tests calling searchCards directly with the recursive loadLinks.
So the cycle-back-to-root regression you're tracking lives entirely in #4668 BFS, and #4667 landing first doesn't affect that investigation. I'll watch the monitor for any in-PR feedback on this code path here.

@habdelra
Copy link
Copy Markdown
Contributor Author

habdelra commented May 6, 2026

Thanks for the clean diagnosis — both candidate causes turned out to be real:

Candidate #2 (clone happens before nested rewrites) — fixed in 427892c. The BFS now defers the cloneDeep + absolutize + push to included[] to a Step 5 that runs at the end of each layer, after Step 4 has finished mutating item.resource.relationships.<rel>.data for every entry in this layer. The clone snapshot now reflects those rewrites — matching the original recursive implementation's order (the parent's for (let includedResource of subInc) { ... cloneDeep ... } ran post-recursion, so the child's relationships had already been mutated by the recursive call).

Candidate #1 (cycle-target-is-root in the alreadyVisited branch) — same commit. Simplified the gate to foundLinks = true whenever linkResource.id != null and is alreadyVisited (i.e., either a root in omit, or queued for inclusion at its own layer's end — either way it's part of our doc, so relationship.data should be rewritten). The previous gating on omitSet.has(...) || included.find(...) was correct for the root case but missed mid-layer-sibling-pointing-to-pending-include because included entries don't materialize until end-of-layer with deferred clones.

Lint type-check fix piggybacked in 914a5f0 (the LayerItem.resource union covers both root LooseCardResource and non-root CardResource; cast at the included.push site).

Will let CI cycle and check the cycle tests + the playground deepEqual on the new SHA. If anything subtler is still off, the per-SHA monitor will re-fire and I'll iterate.

PR #4668 has been seeing intermittent host-test failures with
generic "A network error occurred" or "Failed to fetch dynamically
imported module" messages — both are runner-level / teardown-time
flake signatures, and the same tests pass on PR #4670 (which
includes all the BFS code from #4668), so they look like
infrastructure flakes rather than code regressions. Adding
diagnostics anyway so the next CI run captures concrete data:

- An invocation id at loadLinks entry, repeated on every layer log
  line — lets us correlate logs from one loadLinks call across
  layers when investigating CI failures.
- Layer summaries (size, entry count, in-realm/cross-realm breakdown).
- Batch index lookup result counts + ms timing.
- try/catch around `Promise.all(populateQueryFields)` and the batched
  index lookup that logs the failing context before re-throwing.
- try/catch around the cross-realm fetch logging the URL and HTTP
  status (or thrown error message) before re-throwing.
- Completion log with layers visited / included size.

All under `this.#log.debug` (or `.warn` on the catch arms), so it
follows the existing logger semantics and stays out of production
noise unless the channel is enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra merged commit 69055a5 into main May 6, 2026
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants