Realm-server: batch link resolution in loadLinks (CS-11038) by habdelra · Pull Request #4668 · cardstack/boxel

habdelra · 2026-05-05T23:20:54Z

Summary

RealmIndexQueryEngine.loadLinks was the dominant cost behind _federated-search 90s timeouts on staging — it resolved every linked card with its own getInstance(linkURL) round-trip. For an N-result search with M average links per card, that produced N×M sequential SELECT * FROM boxel_index WHERE i.url = $1 round-trips, all serialized on Node's event loop and contending for the shared PG pool with every other concurrent request.

This PR replaces the recursive per-link traversal with a level-order BFS: each recursion depth issues at most one batched WHERE i.url IN (...) query (plus one for file-meta) regardless of how many siblings reference links at that depth.

Live data anchor (staging, 2026-04-28 → 2026-05-05)

110 of 126 (87%) _federated-search 90s timeouts in 7d traced to N×M sequential link fan-out in loadLinks
During one observed flood window (UTC 13:24:25), 30+ concurrent _federated-search requests piled up in one second; per-realm cards-grid latency varied by ~300× under the same load — the variance correlates with result-set size, consistent with N×M sequential awaits dominating response time
Total wasted compute on _federated-search 90s timeouts: ~2.75 worker-hours / 7d

Relationship to PR #4667 (CS-11037)

Orthogonal — both ship independently:

PR Realm-server: make _federated-search JSON:API-compliant via include #4667 lets callers opt out of loadLinks entirely (include: []). It's the cards-grid fast path: when the consumer doesn't traverse relationships, skip link loading altogether.
This PR makes the case where you DO need links N+M → 1+1. Future caller-driven sparse-fields filtering still benefits from the batched pass.

No file overlap (#4667 doesn't touch realm-index-query-engine.ts). Branched off main.

Implementation

Level-order BFS in `loadLinks`

The original recursive loadLinks(resource) was called once per root resource, and each invocation awaited getInstance(linkURL) per relationship. Restructured the function so each layer batches across siblings:

Pre-pass — walk every relationship across all resources in the current layer; classify each link as in-realm card / in-realm file / cross-realm; accumulate URL Sets (de-duplicated). Skip omit / visited URLs.
Batch fetch — await Promise.all([getInstances(cardURLs), getFiles(fileURLs)]). One round-trip per kind, regardless of how many siblings asked.
Hydration — walk the same entries again; pull linkResource from the resulting Maps for in-realm; cross-realm still goes through the existing per-link fetch path (out of scope to parallelize per the ticket).
Build next layer from each newly-resolved linkResource; repeat.

searchCards now passes the entire doc.data array as the root layer (previously it looped per-resource), which is what enables the first layer to batch across all roots' relationships in one query.

Preserved semantics

Cycle prevention — visited: Set<string> shared across all layers; each linkResource is added to visited on first encounter; later encounters are short-circuited.
Depth gate — descendStack.length <= maxLinkDepth matches the original stack.length <= maxLinkDepth check; nothing is added to included[] or scheduled for descent past the limit.
linkFields filter — applied only at the root layer (matching the original "linkFields only constrains the immediate relationships of the root card"). Carried via a per-layer-item applyLinkFields bool.
Stale type fallback — fieldExpectsFileMeta is still called when the relationship's data.type is missing or stale, so file-meta links incorrectly typed as 'card' still resolve correctly.
relationship.data rewrite — same three-way condition (just-found / in-omit / in-included → rewrite with resolved type; otherwise fallback type when linkResource is unavailable).
URL absolutization — cloneDeep + visitInstanceURLs + visitModuleDeps applied to each newly-pushed included[] entry, identical to the original.

New batch primitives in `IndexQueryEngine`

getInstances(urls, opts) — returns Map<lookupURL, InstanceOrError>. Handles both i.url IN (...) and i.file_alias IN (...) matches in a single query so a row found via either column is keyed by whichever lookup URL the caller passed.
getFiles(urls, opts) — analogous batch for file-meta lookups.
The single-URL getInstance / getFile were refactored to share extracted row-mapping helpers (#rowToInstanceOrError, #rowToIndexedFile) with their batched siblings.

Acceptance criteria

loadLinks for a cards-grid-shape query issues 1 batched DB query per recursion depth, not N×M separate ones
Recursion still terminates (visited set + depth gate unchanged in semantics)
Cross-realm link resolution unchanged in correctness; can be parallelized as a follow-up
Regression test: 50-card / 5-link search → ≤ 2 batched-link queries (and 0 per-link queries)

Out of scope

Cross-realm link parallelization (sibling ticket — Promise.all on outbound HTTP fetches)
Caching link resolutions across requests (would need invalidation discipline)
Changing Realm.search's default loadLinks: true behavior (covered by Realm-server: make _federated-search JSON:API-compliant via include #4667)
Batching the populateQueryFields fan-out (separate searchCards queries per query-field) — out of scope here

Test plan

New realm-server/tests/load-links-batching-test.ts — wraps dbAdapter.execute to count instance lookups against boxel_index, runs a 50-source × 5-target searchCards with loadLinks: true, asserts:
- i.url = \$N per-link lookups: 0 (the old N×M path is dead)
- i.url IN (…) batched lookups: ≥ 1, ≤ 2 (one per recursion depth)
- result.data.length === 50 and result.included.length === 5
Existing loadLinks: true coverage in indexing-test.ts (lines 2829, 2973) still passes under semantic equivalence
CI green

Code pointers

packages/runtime-common/realm-index-query-engine.ts — loadLinks rewritten as a BFS driver; searchCards / cardDocument / loadLinksForResource updated to pass rootResources: [...]
packages/runtime-common/index-query-engine.ts — new getInstances and getFiles plus extracted #rowToInstanceOrError / #rowToIndexedFile row-mapping helpers
packages/realm-server/tests/load-links-batching-test.ts — regression test

🤖 Generated with Claude Code

`RealmIndexQueryEngine.loadLinks` resolved every linked card with its own `getInstance(linkURL)` round-trip. For an N-result search with M average links per card, that produced N×M sequential `SELECT * FROM boxel_index WHERE i.url = $1` round-trips, which was the dominant cost behind `_federated-search` 90s timeouts on staging. Restructured loadLinks as a level-order BFS: - each layer collects every in-realm link URL into a Set - one batched `WHERE i.url IN (...)` query (and one for file-meta) per recursion depth resolves them all in one round-trip - per-entry hydration walks the resulting Map, with the same relationship.data rewrite + cycle/depth semantics as the original - cross-realm links keep the existing per-link fetch path (parallelizing those is a separate ticket) Added `IndexQueryEngine.getInstances` and `getFiles` as the batched counterparts of the existing single-URL methods, and refactored the single-URL methods to share row-mapping helpers. Regression test in `realm-server/tests/load-links-batching-test.ts` asserts that a 50-source/5-target search issues zero per-link lookups and ≤ 2 batched-link queries (1 per recursion depth). Closes CS-11038. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f20dc0201a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The original recursive `loadLinks` checked `stack.length <= maxLinkDepth` BEFORE pushing the current resource onto the stack for the recursive call, so at the boundary (stack.length == maxLinkDepth) it still recursed once more — depth-(maxLinkDepth+1) resources ended up in `included[]` with their `relationship.data` rewritten. The BFS rewrite was checking `descendStack.length` (which already includes the current parent), cutting traversal off one level early. Use `entry.item.stack.length` (ancestors only) for the gate; keep `descendStack` for the nextLayer push so children carry the right ancestor chain. Caught by Codex review on PR #4668. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR targets the _federated-search/loadLinks hot path in runtime-common by replacing per-link instance lookups with batched, breadth-first resolution. It fits into the realm/indexing layer where search results are hydrated into JSON:API documents with included[] relationships.

Changes:

Refactors RealmIndexQueryEngine.loadLinks from recursive per-resource traversal to level-order BFS across all root resources.
Adds batched IndexQueryEngine.getInstances() and getFiles() helpers to fetch many in-realm links in a single query per depth.
Adds a realm-server regression test intended to assert batched lookup behavior for a large searchCards(..., { loadLinks: true }) case.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`packages/runtime-common/realm-index-query-engine.ts`	Reworks link hydration to BFS/batched traversal and updates callers to pass root resource arrays.
`packages/runtime-common/index-query-engine.ts`	Introduces batched instance/file lookup helpers and shared row-mapping helpers.
`packages/realm-server/tests/load-links-batching-test.ts`	Adds a regression test that counts DB query shapes during `loadLinks` resolution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Three fixes from Copilot review on PR #4668: * `getInstances` and `getFiles` now chunk URL lookups (450 sqlite / 2500 pg) so a search with many unique links cannot blow past the sqlite parameter limit. Each chunk emits 2*N+1 placeholders (url IN list + file_alias IN list + i.type), matching the existing index-writer.ts chunking discipline. * The regression test's SQL matcher was filtering on the literal string "'instance'" appearing in the rendered SQL, but `param('instance')` is parameterized as `$N`. Switched to inspecting `opts.bind` for the `'instance'` value — the matcher now correctly attributes link-resolution queries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Use a local `dbExecute` alias instead of the multi-line IIFE-style cast; prettier wanted the closing paren on the same line as the assignment. - Split compound `assert.ok(A && B, ...)` into separate asserts; qunit/no-assert-logical-expression flagged the && operator inside assertion arguments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qunit/no-assert-logical-expression also rejects `??` inside an assertion argument, not just `&&`. Compute the count first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inks

github-actions · 2026-05-06T02:06:23Z

Host Test Results

1 files ± 0 1 suites ±0 2h 1m 45s ⏱️ + 1h 8m 35s
2 565 tests +1 073 2 550 ✅ +1 068 15 💤 +8 0 ❌ - 1
2 584 runs +1 079 2 569 ✅ +1 076 15 💤 +8 0 ❌ - 3

Results for commit faba169. ± Comparison against earlier commit 914a5f0.

Realm Server Test Results

1 files ± 0 1 suites ±0 16m 4s ⏱️ + 8m 38s
1 256 tests +208 1 256 ✅ +208 0 💤 ±0 0 ❌ ±0
1 334 runs +218 1 334 ✅ +218 0 💤 ±0 0 ❌ ±0

Results for commit faba169. ± Comparison against earlier commit 914a5f0.

…inks

The original recursive `loadLinks` cloned a child resource only AFTER the recursive call had mutated its `relationship.data` fields — the caller's `for (let includedResource of subInc) { ... cloneDeep ... }` loop ran post-recursion. The BFS rewrite was cloning at discovery time (in the parent's layer), so the snapshot in `included[]` saw the pre-rewrite pristine_doc data. For a `Hassan ↔ Mango` cycle the included Mango ended up with `friend.data` pointing back to whatever the index had stored, not the rewritten `{ type, id }` from this loadLinks invocation. Defer the clone+absolutize+push to a Step 5 that runs at the END of each layer's processing — once Step 4 has finished mutating `item.resource.relationships.<rel>.data` for every entry in this layer. The clone now captures those mutations. Also: when a sibling references an already-visited target, the original code's `included.find(target)` succeeded because the first recursion had already added it; with deferred clones that's no longer mid-layer-true. Mark `foundLinks = true` whenever `linkResource` is `alreadyVisited` (it's either a root in `omit` or queued for inclusion at its own layer's end — either way it's part of our doc). Caught by the `Integration | realm indexing: can index a field with a cycle in the linksTo[Many] field` host tests on PR #4668. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Non-root layer items always originate from the batched index lookup or a cross-realm fetch, both of which produce CardResource<Saved> shapes — but the LayerItem.resource type is the wider LooseCardResource union (to also accept root resources). Cast at the push site to satisfy the included[] type signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra · 2026-05-06T04:49:40Z

Heads-up from CI investigation on the stack (#4668 → #4670): the loadLinks BFS rewrite in #4668 regresses two host integration tests that pass on main:

Test File Notes
Integration | realm indexing: can index a field with a cycle in the linksTo field packages/host/tests/integration/realm-indexing-test.gts:3388 Hassan ↔ Mango cycle, both in same realm
Integration | realm indexing: can index a field with a cycle in the linksToMany field :3859 Same shape, linksToMany
Acceptance | code submode tests > single realm: clicking a linksToMany field in playground panel opens the linked card JSON code-submode-tests UI-side deepEqual of a loadLinks result
All three fail with Actual: [object Object] Expected: [object Object] (no message), which means a deepEqual is comparing two near-identical structures whose diff isn't being printed by qunit. Failure shape is consistent — the BFS produces a subtly-different included[] or relationship.data from the recursive original in cycle-back-to-root cases.

Confirmed pre-existing: same exact tests fail on cs-11038's 6ddece7 (Host Tests 16/20 + 20/20). They pass on main (47e6884). #4670 inherits the failure since it's stacked on #4668; my change only touches the cross-realm branch and can't reach these in-realm cycle paths.

This blocks #4668 → main, which in turn blocks #4670. The Codex maxLinkDepth boundary fix (e218d21) didn't catch this case. Likely candidates for the BFS bug:

visited is added at root-resource enqueue and again at layer expansion; the else if (omitSet.has(linkResource.id) || included.find(...)) branch only sets foundLinks — it doesn't push to included[] when the cycle-target is the root (which is in omitSet, not included). The recursive original may have produced a different rewrite for that case.
The cloneDeep of linkResource happens before any relationship.data rewrite; nested relationships discovered later don't propagate into the cloned included[] entry.
Posting here per Hassan's request so the context isn't lost when #4667 lands first. CC @habdelra.

@habdelra
Member
Author
habdelra
commented
4 minutes ago
Got it — those two realm-indexing-test.gts tests + the code-submode deepEqual are all in #4668 BFS rewrite path, not in anything #4667 touches.

Confirmed scope-isolation for #4667:

Realm.search plumbing only added an opts.include arm; the existing loadLinks: true call path is byte-identical when include is absent.
The linkFields argument forwarded for partial-include reuses the existing seam (realm-index-query-engine.ts:914) — no traversal-order changes.
None of the failing tests exercise _federated-search; they're pure indexing tests calling searchCards directly with the recursive loadLinks.
So the cycle-back-to-root regression you're tracking lives entirely in #4668 BFS, and #4667 landing first doesn't affect that investigation. I'll watch the monitor for any in-PR feedback on this code path here.

habdelra · 2026-05-06T04:50:43Z

Thanks for the clean diagnosis — both candidate causes turned out to be real:

Candidate #2 (clone happens before nested rewrites) — fixed in 427892c. The BFS now defers the cloneDeep + absolutize + push to included[] to a Step 5 that runs at the end of each layer, after Step 4 has finished mutating item.resource.relationships.<rel>.data for every entry in this layer. The clone snapshot now reflects those rewrites — matching the original recursive implementation's order (the parent's for (let includedResource of subInc) { ... cloneDeep ... } ran post-recursion, so the child's relationships had already been mutated by the recursive call).

Candidate #1 (cycle-target-is-root in the alreadyVisited branch) — same commit. Simplified the gate to foundLinks = true whenever linkResource.id != null and is alreadyVisited (i.e., either a root in omit, or queued for inclusion at its own layer's end — either way it's part of our doc, so relationship.data should be rewritten). The previous gating on omitSet.has(...) || included.find(...) was correct for the root case but missed mid-layer-sibling-pointing-to-pending-include because included entries don't materialize until end-of-layer with deferred clones.

Lint type-check fix piggybacked in 914a5f0 (the LayerItem.resource union covers both root LooseCardResource and non-root CardResource; cast at the included.push site).

Will let CI cycle and check the cycle tests + the playground deepEqual on the new SHA. If anything subtler is still off, the per-SHA monitor will re-fire and I'll iterate.

PR #4668 has been seeing intermittent host-test failures with generic "A network error occurred" or "Failed to fetch dynamically imported module" messages — both are runner-level / teardown-time flake signatures, and the same tests pass on PR #4670 (which includes all the BFS code from #4668), so they look like infrastructure flakes rather than code regressions. Adding diagnostics anyway so the next CI run captures concrete data: - An invocation id at loadLinks entry, repeated on every layer log line — lets us correlate logs from one loadLinks call across layers when investigating CI failures. - Layer summaries (size, entry count, in-realm/cross-realm breakdown). - Batch index lookup result counts + ms timing. - try/catch around `Promise.all(populateQueryFields)` and the batched index lookup that logs the failing context before re-throwing. - try/catch around the cross-realm fetch logging the URL and HTTP status (or thrown error message) before re-throwing. - Completion log with layers visited / included size. All under `this.#log.debug` (or `.warn` on the catch arms), so it follows the existing logger semantics and stays out of production noise unless the channel is enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra requested a review from Copilot May 5, 2026 23:22

Copilot started reviewing on behalf of habdelra May 5, 2026 23:24 View session

chatgpt-codex-connector Bot reviewed May 5, 2026

View reviewed changes

Comment thread packages/runtime-common/realm-index-query-engine.ts Outdated

Copilot AI reviewed May 5, 2026

View reviewed changes

Comment thread packages/runtime-common/realm-index-query-engine.ts

Comment thread packages/runtime-common/index-query-engine.ts Outdated

Comment thread packages/runtime-common/index-query-engine.ts Outdated

Comment thread packages/realm-server/tests/load-links-batching-test.ts Outdated

habdelra mentioned this pull request May 5, 2026

Realm-server: parallelize cross-realm fetches in loadLinks (CS-11039) #4670

Merged

7 tasks

habdelra and others added 2 commits May 5, 2026 19:42

realm-server: hoist ?? out of assertion arg

abfd445

qunit/no-assert-logical-expression also rejects `??` inside an assertion argument, not just `&&`. Compute the count first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra requested a review from a team May 6, 2026 00:05

Merge remote-tracking branch 'origin/main' into cs-11038-batch-load-l…

b3318ef

…inks

habdelra and others added 3 commits May 6, 2026 00:10

Merge remote-tracking branch 'origin/main' into cs-11038-batch-load-l…

8c423c1

…inks

habdelra mentioned this pull request May 6, 2026

Realm-server: make _federated-search JSON:API-compliant via include #4667

Closed

3 tasks

lukemelia approved these changes May 6, 2026

View reviewed changes

habdelra merged commit 69055a5 into main May 6, 2026
67 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Realm-server: batch link resolution in loadLinks (CS-11038)#4668

Realm-server: batch link resolution in loadLinks (CS-11038)#4668
habdelra merged 10 commits intomainfrom
cs-11038-batch-load-links

habdelra commented May 5, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 6, 2026 •

edited

Loading

Uh oh!

habdelra commented May 6, 2026

Uh oh!

habdelra commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

habdelra commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Live data anchor (staging, 2026-04-28 → 2026-05-05)

Relationship to PR #4667 (CS-11037)

Implementation

Level-order BFS in loadLinks

Preserved semantics

New batch primitives in IndexQueryEngine

Acceptance criteria

Out of scope

Test plan

Code pointers

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Host Test Results

Realm Server Test Results

Uh oh!

habdelra commented May 6, 2026

Uh oh!

habdelra commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

habdelra commented May 5, 2026 •

edited

Loading

Level-order BFS in `loadLinks`

New batch primitives in `IndexQueryEngine`

github-actions Bot commented May 6, 2026 •

edited

Loading