
CS-11077: in-memory parsed module-definition cache#4690

Closed
habdelra wants to merge 3 commits into cs-10952-cross-process-invalidation-broadcast from cs-11077-parsed-module-cache

Conversation


habdelra (Contributor) commented May 6, 2026

Summary

Today, every warm-cache loadLinks GET against a card whose link tree adopts from a large software-factory module re-fetches and re-parses the same modules.definitions jsonb row 5–11 times per request. This PR adds a per-process, parsed-form working set above readFromDatabaseCache so each realm-server fetches+parses each row at most once between invalidations. The shared DB row stays the authoritative source of truth — nothing is being duplicated, and no per-process state is being introduced that peers need to know about.

Where state lives — and what this PR does and does not change

It's worth being careful about the term "in-memory cache" given the direction of the DB-Authoritative Realm Registry project. So, concretely:

Authoritative state, shared across all realm-server processes (DB):

  • modules.definitions jsonb — the canonical persisted output of a module prerender. Written once, read by every realm-server process. This PR does not change how it's written, what's stored in it, or who reads it.
  • boxel_index, realm_registry, queue tables — unchanged.

Per-process state, already in CachingDefinitionLookup today:

  • #moduleGenerations / #realmGenerations / #globalGeneration — three counters bumped synchronously by invalidate / clearRealmCache / clearAllModules before the DB DELETE. Used today only to discard mid-flight prerender results that started against a now-stale view.
  • #inFlight — coalesces concurrent same-key callers within a process so a single SQL round-trip is shared by all waiters; evicts on completion.

Per-process state added by this PR:

  • #parsedCache — a Map of parsed ModuleCacheEntry keyed by (canonicalModuleURL, cacheScope, cacheUserId, resolvedRealmURL). Each entry holds the result of one readFromDatabaseCache call along with the generation-counter snapshot that was current at insert time. Bounded by total parsed-bytes (PARSED_CACHE_BUDGET_CHARS = 256 MB JSON-text proxy for ~0.5–0.75 GB heap). LRU via Map insertion order.

The cache is purely a parsed-form working set of the DB row. It is reconcilable from the DB at any time: drop the entire Map and the next lookup re-reads + re-parses the DB row, exactly like today. No process needs to know what's in any other process's #parsedCache. There is no new shared state.
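
To make the shape concrete, here is a minimal TypeScript sketch of the structures described above. `ModuleCacheEntry` is a stand-in for the real type, and any field name not quoted in this PR is illustrative; the key layout is inferred from the eviction prefixes discussed in the review threads below.

```ts
// Stand-in for the real parsed type; only the shape matters for the sketch.
type ModuleCacheEntry = { definitions: Record<string, unknown> };

// Snapshot of the three pre-existing per-process counters.
interface GenerationSnapshot {
  module: number; // #moduleGenerations value for this module URL
  realm: number;  // #realmGenerations value for this realm URL
  global: number; // #globalGeneration
}

interface ParsedCacheEntry {
  entry: ModuleCacheEntry;      // parsed form of one modules.definitions row
  snapshot: GenerationSnapshot; // counter values captured at insert time
  sizeChars: number;            // JSON-text length; the unit of the LRU budget
}

// Assumed key layout (realm first, then module, scope, user):
// `${resolvedRealmURL}|${canonicalModuleURL}|${cacheScope}|${cacheUserId}`
const parsedCache = new Map<string, ParsedCacheEntry>();
let parsedCacheChars = 0;
const PARSED_CACHE_BUDGET_CHARS = 256 * 1024 * 1024; // 256 MB of JSON text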

Findings — what we measured

Instrumented warm-cache GET against e2e-1/Validations/eval_sticky-note-1. Every relevant software-factory module was already populated in the modules table; no prerender fired during the measured GET. The realm-server process held a fresh CachingDefinitionLookup (in-memory generation counters all zero, #inFlight empty between layers).

total=28.1s  idxWait=0ms  cardDoc=28.1s
  layer 0 (Validations card)         layerMs=2.9s
  layer 1 (Issues/sticky-note)       layerMs=10.1s
  layer 2 (Project + 3 KAs)          layerMs=15.1s

darkfactory.gts                      ~11 reads, ~22s of read+parse total
eval-result.gts                       1 read,    1.3s
loadModuleCacheEntry events          MISS=14   COALESCE=7   (≈ 33% coalesce rate)
readFromDatabaseCache (per call)     sql ~1.5s   parse ~0.4s   blob ~60 MB

#inFlight does its job for concurrent same-key callers (layer 2's 4-card Promise.all collapses to 1 read), but evicts on completion. Sequential calls within a single GET — across layer boundaries, after each layer's Promise.all settles, and inside executeQueryForField follow-ups — each re-fetch and re-parse the same 60 MB DB row from scratch. Two-thirds of loadModuleCacheEntry invocations did not coalesce.
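
For context, the pre-existing `#inFlight` coalescing works roughly like this (a sketch under assumed names, reusing the `ModuleCacheEntry` stand-in from the sketch above; the real signature may differ):

```ts
// Concurrent same-key callers share one SQL round-trip; the entry evicts on
// completion, which is why sequential calls within one GET don't benefit.
const inFlight = new Map<string, Promise<ModuleCacheEntry | undefined>>();

function coalesce(
  key: string,
  work: () => Promise<ModuleCacheEntry | undefined>,
): Promise<ModuleCacheEntry | undefined> {
  let pending = inFlight.get(key);
  if (pending) {
    return pending; // join the in-flight read
  }
  let p = work().finally(() => inFlight.delete(key)); // evict on completion
  inFlight.set(key, p);
  return p;
}
```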

The DB row sizes are large by design (the dominant module re-exports the entire factory schema, and one-time-cost trimming is tracked separately):

eval-result.gts        ~41 MB    instantiate-result.gts ~41 MB
lint-result.gts        ~41 MB    parse-result.gts       ~41 MB
test-results.gts       ~41 MB    darkfactory.gts        ~62 MB

A direct DB-only bench (no realm-server, single PG connection) put the floor at 8.7s of read+parse work per such GET; observed end-to-end was 20–28s in production-shape conditions. Five user/cs-11003-e2e-{1..5} realms exhibited the same behavior:

e2e-1  eval/instantiate         20–22s   (subtype's module already cached)
e2e-1  lint/parse/test (first)  40–43s   (subtype's module is the only cold one)
e2e-1  lint/parse/test (next)   19–23s   (warm)
e2e-2..3                        20–30s
e2e-4..5                        11–22s   (PG + ts-node fully warm)

Eliminating the redundancy collapses 28s → ~5s. The residual is executeQueryForField issues (4.7s in this GET) + cloneDeep + per-resource search — separate concerns, tracked elsewhere.

Fix

Front-end readFromDatabaseCache with #parsedCache. Both lookupDefinition and lookupCachedDefinition paths benefit because both flow through readFromDatabaseCache. A sketch of the combined flow follows the list.

  • On lookup: snapshot the three generation counters as the first synchronous step. Look up the key in #parsedCache. If a hit and the cached snapshot equals the current snapshot, LRU-touch it (delete + re-insert at end of Map iteration order) and return the parsed entry. If snapshots disagree (a local invalidation or peer NOTIFY has bumped a counter since insert), drop the entry, decrement byte-count, fall through to the SQL+parse path.
  • On insert: re-snapshot before writing. If a counter bumped between the lookup snapshot and the post-SQL snapshot — i.e. an invalidation arrived during the SQL round-trip — skip the cache write rather than insert an entry that's stale at birth. Subtract any existing entry's sizeChars before overwriting (see thread for the concurrent-replace race).
  • Errors are not cached: the existing isExpiredErrorEntry TTL path already handles them in DB; in-memory caching errors would either return a stale-but-still-cached error or require a TTL-aware in-memory eviction path. Not worth either.
  • Bounded by total parsed-bytes. Map insertion order = iteration order in JS, so re-inserting on hit gives LRU. Eviction loops while bytes exceed budget.
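
Putting the bullets together, the lookup/insert flow is roughly the following. This is a sketch continuing the stand-ins from "Where state lives" above; `currentSnapshot` and `readAndParseRow` are assumed helpers standing in for the real counter reads and the SQL+parse path, and the zero-size guard reflects the later 3eb6b3a fix discussed below.

```ts
declare function currentSnapshot(key: string): GenerationSnapshot;
declare function snapshotsEqual(a: GenerationSnapshot, b: GenerationSnapshot): boolean;
declare function readAndParseRow(
  key: string,
): Promise<{ entry: ModuleCacheEntry; sizeChars: number }>;

async function readThroughParsedCache(key: string): Promise<ModuleCacheEntry> {
  let before = currentSnapshot(key); // first synchronous step
  let hit = parsedCache.get(key);
  if (hit) {
    if (snapshotsEqual(hit.snapshot, before)) {
      parsedCache.delete(key); // LRU touch: delete + re-insert moves the key
      parsedCache.set(key, hit); // to the end of the Map's iteration order
      return hit.entry;
    }
    // A local invalidation or a peer NOTIFY bumped a counter since insert.
    parsedCache.delete(key);
    parsedCacheChars -= hit.sizeChars;
  }
  let { entry, sizeChars } = await readAndParseRow(key);
  let after = currentSnapshot(key);
  // If a bump landed during the SQL round-trip, the entry is stale at birth:
  // return it, but skip the cache write. sizeChars === 0 means "size
  // unknown", which also skips the write (the 3eb6b3a guard).
  if (snapshotsEqual(before, after) && sizeChars > 0) {
    let replaced = parsedCache.get(key);
    if (replaced) {
      // Subtract the replaced entry first (the double-count race from the
      // review thread below).
      parsedCacheChars -= replaced.sizeChars;
    }
    parsedCache.set(key, { entry, snapshot: after, sizeChars });
    parsedCacheChars += sizeChars;
    // Evict oldest-inserted (least recently used) entries until under budget.
    for (let [k, v] of parsedCache) {
      if (parsedCacheChars <= PARSED_CACHE_BUDGET_CHARS) break;
      parsedCache.delete(k);
      parsedCacheChars -= v.sizeChars;
    }
  }
  return entry;
}
```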

Why this stacks on cs-10952-cross-process-invalidation-broadcast (#4644)

CS-10952 is the piece that makes a per-process parsed-form cache safe to add at all in a multi-process realm-server topology. Two reasons, both load-bearing:

1. The bump helpers it makes public are exactly what my snapshot mechanism keys on. CS-10952 promotes bumpModuleGeneration / bumpRealmGeneration / bumpGlobalGeneration from private to public on CachingDefinitionLookup so a sibling ModuleCacheInvalidationListener can call them. My fix uses those same three counters as its invalidation contract. Landing this PR on top of CS-10952 means I get the public API for free; landing it independently would mean either duplicating the promotion or merging out of order.

2. The cross-process invalidation channel it adds is what closes the multi-process correctness gap. At N=1 (today's prod), local invalidate / clearRealmCache / clearAllModules already bump counters synchronously before the DB DELETE — so my snapshot check is sufficient: the next lookup sees the bump and self-evicts. At N>1, that's no longer true. Process A's local bump doesn't reach process B's counters. Without CS-10952, B's #parsedCache would happily serve a parsed entry that A has already deleted from the DB.

CS-10952 closes that gap end-to-end:

  • Each invalidation path (invalidate / clearRealmCache / clearAllModules) emits NOTIFY module_cache_invalidated with a structured payload (URL fan-out / realm scope / global) inside the same transaction as the DB DELETE, so peers only see the signal on commit.
  • Each realm-server runs a ModuleCacheInvalidationListener (subscribed via the multiplexed PgAdapter.subscribe from PgAdapter.subscribe: multiplex LISTEN over a shared client #4660) that parses incoming payloads and replays them by calling the now-public bump helpers on the local CachingDefinitionLookup.

So with both PRs in place, the chain on a peer invalidation is: A invalidates → A bumps locally → A's DELETE commits → NOTIFY fires → B's listener wakes → B's listener calls bumpModuleGeneration (etc.) on B's CachingDefinitionLookup → B's #parsedCache snapshot for the affected entry is now stale → B's next readFromDatabaseCache lookup observes the snapshot mismatch on its in-memory entry → B drops the entry and falls through to the (now-empty) DB → returns undefined (or the freshly-prerendered row if a peer raced to re-populate). Same persist-after-invalidate protection that already exists for in-flight prerenders today, just extended to longer-lived parsed entries.
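
Concretely, the peer-side replay looks something like this. A hedged sketch only: CS-10952's actual payload shape and helper signatures live in that PR and may differ from what is assumed here.

```ts
// Assumed payload shape for the module_cache_invalidated NOTIFY channel.
type InvalidationPayload =
  | { kind: 'module'; realmURL: string; moduleURLs: string[] }
  | { kind: 'realm'; realmURL: string }
  | { kind: 'global' };

// Minimal stand-in for the now-public bump helpers on CachingDefinitionLookup.
interface BumpHelpers {
  bumpModuleGeneration(realmURL: string, moduleURL: string): void;
  bumpRealmGeneration(realmURL: string): void;
  bumpGlobalGeneration(): void;
}

// What ModuleCacheInvalidationListener does when a peer NOTIFY arrives.
function replayInvalidation(rawPayload: string, lookup: BumpHelpers): void {
  let payload: InvalidationPayload = JSON.parse(rawPayload);
  switch (payload.kind) {
    case 'module':
      for (let url of payload.moduleURLs) {
        lookup.bumpModuleGeneration(payload.realmURL, url);
      }
      break;
    case 'realm':
      lookup.bumpRealmGeneration(payload.realmURL);
      break;
    case 'global':
      lookup.bumpGlobalGeneration();
      break;
  }
}
```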

End result: this cache ships safely under multi-process from day one, layered cleanly above the shared DB row without introducing any new cross-process state. Landing CS-11077 directly on main (without CS-10952 first) would be the failure mode worth being concerned about — that would be a per-process cache without a cross-process invalidation contract. With the stack, the contract is intact.

Out of scope, tracked separately

  • The definitions jsonb is read whole even though populateQueryFields only needs the fields slice for one specific card type. A jsonb path read or a sibling slim column would shrink the read from ~60 MB to a few KB and would help every realm-server, not just ones with a warm #parsedCache (see the sketch after this list).
  • darkfactory.gts's definitions['…/Issue' | '…/Project' | '…/DarkFactory'] are 20+ MB each because of transitive schema embedding in definition.fields. Trimming would help every lookup of that module.
  • executeQueryForField for Project.issues linksToMany was 4.7s on its own in the measured GET — separate index-search question, not module-cache-related.
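
For the first bullet, the slim read could look something like this. A sketch only: the table and column come from this PR, but the query shape, the `url` predicate, and the adapter signature are assumptions.

```ts
// Read just the fields slice for one card type instead of the whole blob.
// jsonb `#>` extracts a nested path server-side, so only the slice crosses
// the wire. `sql` stands in for whatever query adapter the realm-server uses.
async function readFieldsSlice(
  sql: (query: string, params: unknown[]) => Promise<Record<string, unknown>[]>,
  canonicalModuleURL: string,
  cardTypeKey: string,
): Promise<unknown> {
  let rows = await sql(
    `SELECT definitions #> ARRAY[$2::text, 'fields'] AS fields_slice
       FROM modules
      WHERE url = $1`,
    [canonicalModuleURL, cardTypeKey],
  );
  return rows[0]?.fields_slice;
}
```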

Test plan

  • pnpm test-module for definition-lookup-test.ts — all 28 tests green, including 3 new ones:
    • parsed-entry cache short-circuits the DB read on repeat lookups
    • bumpModuleGeneration invalidates the parsed-entry cache for that module (the cross-process invalidation arrival from CS-10952's listener is structurally identical to a direct call here)
    • bumpGlobalGeneration invalidates the parsed-entry cache for every entry
  • pnpm lint clean for both runtime-common and realm-server
  • End-to-end check: re-run the warm-cache Validations GET measurement against a local stack with this branch — expect 28s → ~5s

Front-end readFromDatabaseCache with a parsed-entry cache so warm-cache
loadLinks GETs stop re-fetching and re-parsing the same modules.definitions
jsonb on every populateQueryFields call. Each cached entry snapshots the
three generation counters at insertion; lookups validate against current
counters and self-evict on mismatch — so a local invalidate / clearRealmCache
/ clearAllModules, or a remote bump arriving via the cross-process NOTIFY
listener, drops the entry on next access.

Bounded by total parsed-bytes (PARSED_CACHE_BUDGET_CHARS, 256 MB JSON-text
proxy for ~0.5–0.75 GB of resident heap). Map insertion order = iteration
order, so re-inserting on hit gives LRU.

Error entries are not cached: they have a TTL and the SQL path already
handles them correctly. The insert path re-snapshots before writing, so a
generation bump during the SQL round-trip skips the cache write.

Tests:
- parsed-entry cache short-circuits the DB read on repeat lookups
- bumpModuleGeneration invalidates the parsed-entry cache for that module
- bumpGlobalGeneration invalidates the parsed-entry cache for every entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 557b7e4e0d


Comment on lines +1103 to +1108
```ts
this.#parsedCache.set(cacheKey, {
  entry,
  snapshot: snapshotAtPersist,
  sizeChars,
});
this.#parsedCacheBytes += sizeChars;
```

P2: Subtract replaced parsed-cache entries before adding new ones

When concurrent cache-only lookups for the same module both miss #parsedCache, they can each finish the SQL read and execute this set for the same cacheKey; Map#set replaces the first entry, but #parsedCacheBytes is incremented again without subtracting the replaced entry. I checked lookupCachedDefinition, and it calls readFromDatabaseCache directly rather than going through #inFlight, so this is reachable for parallel cache-only definition reads. Once the inflated counter crosses the budget, eviction subtracts only entries still present in the map and can leave #parsedCacheBytes positive even with an empty map, causing unnecessary eviction/cache thrash thereafter.


habdelra (Contributor, Author) replied:

Real bug — confirmed reachable via lookupCachedDefinition, which calls readFromDatabaseCache directly and bypasses #inFlight, so two concurrent same-key cache-only readers can both finish the SQL round-trip and both reach the set. Map#set overwrites the value, but the bytes counter is incremented a second time without subtracting the replaced entry's size; over time it drifts above the actual map weight and forces spurious evictions.

Fixed in af9405c: subtract the replaced entry's sizeChars before the overwrite. Comment in the code points back to this thread's diagnosis.

When two concurrent cache-only callers both miss `#parsedCache` for the
same key and both finish the SQL round-trip, the second `Map#set`
overwrites the first but `#parsedCacheBytes` was incremented twice. Over
time the counter drifted above the actual map weight and forced
spurious LRU evictions.

Reachable via `lookupCachedDefinition`, which calls `readFromDatabaseCache`
directly — bypassing `#inFlight`, so two concurrent same-key cache-only
reads don't coalesce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Adds an in-memory, LRU-bounded cache of parsed ModuleCacheEntry objects in CachingDefinitionLookup to avoid repeated DB fetch + JSON parsing of large modules.definitions rows during warm-cache requests, with generation-snapshot validation to self-evict on invalidations (local or cross-process via generation bumps).

Changes:

  • Add #parsedCache in CachingDefinitionLookup that short-circuits readFromDatabaseCache when generation snapshots still match, plus LRU eviction by approximate entry size.
  • Track per-entry generation snapshots (module/realm/global) and drop cache entries on mismatch before falling back to SQL.
  • Add tests covering cache short-circuiting and invalidation via bumpModuleGeneration / bumpGlobalGeneration.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • packages/runtime-common/definition-lookup.ts: introduces the in-memory parsed-entry cache in front of DB reads with snapshot validation and LRU eviction.
  • packages/realm-server/tests/definition-lookup-test.ts: adds regression tests for parsed-cache hits and generation-bump invalidation behavior.


Comment on lines +150 to +164
```ts
// amortized over every subsequent hit. JSON.stringify can throw on
// unexpected shapes (cycles, BigInts); fall back to 0 so a single
// pathological insert can't poison the cache.
function approximateEntrySizeChars(
  entry: ModuleCacheEntry,
  rawDefinitions: unknown,
): number {
  if (typeof rawDefinitions === 'string') {
    return rawDefinitions.length;
  }
  try {
    return JSON.stringify(entry.definitions).length;
  } catch (_err) {
    return 0;
  }
}
```
habdelra (Contributor, Author) replied:

Fair — the safety hatch was leaky. Fixed in 3eb6b3a: approximateEntrySizeChars still returns 0 as the "size unknown" sentinel, but the caller now treats 0 as "skip caching" so a pathological insert can't bypass the budget cap. Realistic prerender output won't trip this (module definitions are plain JSON-shaped), but the safety hatch needed to stay closed.
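
In sketch form, the caller-side guard described here looks roughly like this, with `approximateEntrySizeChars` from the quoted snippet and the `ModuleCacheEntry` stand-in from earlier; the wrapper name is illustrative:

```ts
// After 3eb6b3a: 0 is the "size unknown" sentinel. Serve the entry but skip
// the cache write so an unweighable insert can't slip past the budget.
function maybeCacheParsedEntry(
  key: string,
  entry: ModuleCacheEntry,
  rawDefinitions: unknown,
): void {
  let sizeChars = approximateEntrySizeChars(entry, rawDefinitions);
  if (sizeChars === 0) {
    return; // don't cache; the caller still returns the parsed entry
  }
  // ...normal snapshot-checked insert, as in the Fix-section sketch...
}
```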

Comment on lines +283 to +292
```ts
// Parsed-entry cache. Front-ends `readFromDatabaseCache` so successful
// cache reads avoid re-running the SQL fetch + JSON.parse of the
// `modules.definitions` jsonb column on every lookup. Each entry
// captures the three-counter generation snapshot at insertion; lookups
// validate against the current snapshot and drop stale entries — so a
// local invalidate/clearRealmCache/clearAllModules (or a remote bump
// arriving via the cross-process NOTIFY listener) self-evicts on next
// access. Map insertion order is iteration order in JS, so re-inserting
// on hit gives LRU eviction. Bounded by `PARSED_CACHE_BUDGET_CHARS`
// (JSON-text-size proxy for heap).
```
habdelra (Contributor, Author) replied:

Agreed — snapshot validation alone leaves stale entries resident until they're touched (or LRU-pressured out under unrelated work). Fixed in 3eb6b3a by adding eager-evict inside the bump helpers themselves:

  • bumpGlobalGeneration clears the entire #parsedCache and resets #parsedCacheChars.
  • bumpRealmGeneration evicts every entry whose key starts with ${resolvedRealmURL}|.
  • bumpModuleGeneration evicts every entry whose key starts with ${resolvedRealmURL}|${moduleURL}| (narrows to the affected module across all (scope, user) pairs).

Doing it inside the bump helpers covers both directions: local invalidate / clearRealmCache / clearAllModules already funnel through these, and so does CS-10952's ModuleCacheInvalidationListener when it replays a peer NOTIFY. So a peer's wholesale invalidation now frees this process's parsed-cache memory promptly without waiting for access pressure.
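
In sketch form (key layout from this thread; the counter bumps themselves and the realm-wide variant are elided, and the class is trimmed to the relevant slice):

```ts
class CachingDefinitionLookupSketch {
  #parsedCache = new Map<string, { sizeChars: number }>();
  #parsedCacheChars = 0;

  // Evict every parsed entry for one module across all (scope, user) pairs.
  bumpModuleGeneration(resolvedRealmURL: string, moduleURL: string): void {
    // ...bump the per-module counter as before, then eager-evict:
    let prefix = `${resolvedRealmURL}|${moduleURL}|`;
    for (let [key, value] of this.#parsedCache) {
      if (key.startsWith(prefix)) {
        this.#parsedCache.delete(key); // safe while iterating a JS Map
        this.#parsedCacheChars -= value.sizeChars;
      }
    }
  }

  // bumpRealmGeneration is the same loop with the shorter
  // `${resolvedRealmURL}|` prefix; global eviction clears everything:
  bumpGlobalGeneration(): void {
    // ...bump the global counter as before...
    this.#parsedCache.clear();
    this.#parsedCacheChars = 0;
  }
}
```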

```ts
    sizeChars: number;
  }
>();
#parsedCacheBytes = 0;
```
habdelra (Contributor, Author) replied:

Renamed in 3eb6b3a: #parsedCacheBytes → #parsedCacheChars. Matches the units of sizeChars and PARSED_CACHE_BUDGET_CHARS and avoids the implication that we're tracking a JS-heap byte count (we're not — it's the JSON-text character count, used as a rough proxy for heap weight).

…t on bump; rename chars counter

Three Copilot review threads on the parsed-entry cache:

1. `approximateEntrySizeChars` returns 0 when JSON.stringify throws on a
   pathological shape (cycles, BigInts). The previous code still cached
   the entry, so a zero-weight insert wouldn't count against the
   eviction budget — silently disabling the cap for any pathological
   row that landed first. Skip the insert when sizeChars === 0 instead.
2. `bumpGlobalGeneration` / `bumpRealmGeneration` / `bumpModuleGeneration`
   leave snapshot-stale parsed entries resident until they're touched
   (or LRU-pressured out). Eager-evict matching entries inside the bump
   methods so both local invalidations and CS-10952's cross-process
   NOTIFY-replay free memory immediately.
3. `#parsedCacheBytes` actually accumulates `sizeChars` (character
   counts) and is compared against `PARSED_CACHE_BUDGET_CHARS`. Renamed
   to `#parsedCacheChars` so the unit is obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

habdelra commented May 7, 2026

Closing in favor of CS-11079 + a request-scoped memoization pass.

The work in this PR is correct, but post-instrumentation we found the underlying modules.definitions blob is ~99.96% redundant — Issue's definition.fields has 74,977 entries that resolve to only 28 distinct values, with the most common (a StringField contains field-def) literally byte-identical at 19,290 different keys. Fixing that at the source (CS-11079) shrinks darkfactory.gts from ~62 MB → ~12 MB and the per-blob read+parse from ~2s → ~0.4s.

Once that lands, the per-GET cost of the original problem (28s) is dominated not by the per-blob parse but by re-reading the same module 11× within one GET. That's better solved with a request-scoped Map<canonicalModuleURL, Promise<ModuleCacheEntry>> passed through loadLinks / populateQueryFields — no long-lived in-process state, no cross-process invalidation contract, and steady-state perf equivalent to this PR.
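
A sketch of that alternative (the Map type is taken from this comment; everything else, including the function name, is illustrative):

```ts
type ModuleCacheEntry = { definitions: Record<string, unknown> }; // stand-in

// One Map per GET, threaded through loadLinks / populateQueryFields. Storing
// the Promise rather than the resolved value also coalesces concurrent
// callers within the request.
type RequestModuleCache = Map<
  string /* canonicalModuleURL */,
  Promise<ModuleCacheEntry>
>;

function loadModuleOnce(
  cache: RequestModuleCache,
  canonicalModuleURL: string,
  load: (url: string) => Promise<ModuleCacheEntry>,
): Promise<ModuleCacheEntry> {
  let pending = cache.get(canonicalModuleURL);
  if (!pending) {
    pending = load(canonicalModuleURL); // first caller does the read+parse
    cache.set(canonicalModuleURL, pending);
  }
  return pending; // later callers in any layer share the same result
}
```

Because the Map dies with the request, no generation snapshots, NOTIFY replay, or byte budget are needed.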

Net result: the long-lived parsed-entry cache adds complexity (cross-process bump-via-NOTIFY contract, byte-bounded LRU) for a benefit that disappears once CS-11079 makes parsing cheap. Strictly aligned with the DB-Authoritative Realm Registry direction to keep per-process state minimal.

Tracking the new approach at CS-11079.

habdelra closed this May 7, 2026