CS-11077: in-memory parsed module-definition cache #4690
habdelra wants to merge 3 commits into cs-10952-cross-process-invalidation-broadcast from
Conversation
Front-end readFromDatabaseCache with a parsed-entry cache so warm-cache loadLinks GETs stop re-fetching and re-parsing the same modules.definitions jsonb on every populateQueryFields call.

Each cached entry snapshots the three generation counters at insertion; lookups validate against current counters and self-evict on mismatch — so a local invalidate / clearRealmCache / clearAllModules, or a remote bump arriving via the cross-process NOTIFY listener, drops the entry on next access. Bounded by total parsed-bytes (PARSED_CACHE_BUDGET_CHARS, a 256 MB JSON-text proxy for ~0.5–0.75 GB of resident heap). Map insertion order = iteration order, so re-inserting on hit gives LRU. Error entries are not cached: they have a TTL and the SQL path already handles them correctly. Re-snapshots before insert so a generation bump during the SQL round-trip skips the cache write.

Tests:
- parsed-entry cache short-circuits the DB read on repeat lookups
- bumpModuleGeneration invalidates the parsed-entry cache for that module
- bumpGlobalGeneration invalidates the parsed-entry cache for every entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
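For illustration, a minimal sketch of the mechanism the commit message describes — snapshot on insert, validate on lookup, LRU-touch on hit, char-budgeted eviction. Names (`GenerationSnapshot`, `ParsedCacheSketch`, etc.) are stand-ins, not the PR's actual identifiers:

```ts
// Illustrative sketch only — field and method names are assumptions, not the
// PR's real code. Shows: snapshot the generation counters at insert, validate
// on lookup, re-insert on hit for LRU, evict by a JSON-text-size budget.
interface GenerationSnapshot {
  module: number;
  realm: number;
  global: number;
}

interface ParsedCacheEntry<T> {
  entry: T; // the parsed ModuleCacheEntry
  snapshot: GenerationSnapshot;
  sizeChars: number; // JSON-text size, used as a heap-weight proxy
}

class ParsedCacheSketch<T> {
  #cache = new Map<string, ParsedCacheEntry<T>>();
  #chars = 0;

  constructor(
    private budgetChars: number,
    private currentSnapshot: (key: string) => GenerationSnapshot,
  ) {}

  get(key: string): T | undefined {
    let cached = this.#cache.get(key);
    if (!cached) return undefined;
    let now = this.currentSnapshot(key);
    let stale =
      cached.snapshot.module !== now.module ||
      cached.snapshot.realm !== now.realm ||
      cached.snapshot.global !== now.global;
    if (stale) {
      // a local invalidation or a replayed peer NOTIFY bumped a counter
      this.#cache.delete(key);
      this.#chars -= cached.sizeChars;
      return undefined;
    }
    // LRU touch: Map iteration order is insertion order, so delete +
    // re-insert moves the entry to the most-recently-used end
    this.#cache.delete(key);
    this.#cache.set(key, cached);
    return cached.entry;
  }

  set(key: string, entry: T, sizeChars: number): void {
    let replaced = this.#cache.get(key);
    if (replaced) this.#chars -= replaced.sizeChars;
    this.#cache.set(key, { entry, snapshot: this.currentSnapshot(key), sizeChars });
    this.#chars += sizeChars;
    // evict from the least-recently-used end until back under budget
    for (let [k, v] of this.#cache) {
      if (this.#chars <= this.budgetChars) break;
      this.#cache.delete(k);
      this.#chars -= v.sizeChars;
    }
  }
}
```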
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 557b7e4e0d
```ts
this.#parsedCache.set(cacheKey, {
  entry,
  snapshot: snapshotAtPersist,
  sizeChars,
});
this.#parsedCacheBytes += sizeChars;
```
Subtract replaced parsed-cache entries before adding new ones
When concurrent cache-only lookups for the same module both miss #parsedCache, they can each finish the SQL read and execute this set for the same cacheKey; Map#set replaces the first entry, but #parsedCacheBytes is incremented again without subtracting the replaced entry. I checked lookupCachedDefinition, and it calls readFromDatabaseCache directly rather than going through #inFlight, so this is reachable for parallel cache-only definition reads. Once the inflated counter crosses the budget, eviction subtracts only entries still present in the map and can leave #parsedCacheBytes positive even with an empty map, causing unnecessary eviction/cache thrash thereafter.
Real bug — confirmed reachable via lookupCachedDefinition, which calls readFromDatabaseCache directly and bypasses #inFlight, so two concurrent same-key cache-only readers can both finish the SQL round-trip and both reach the set. Map#set overwrites the value but the bytes counter was previously double-counted; over time it drifts above actual map weight and forces spurious evictions.
Fixed in af9405c: subtract the replaced entry's sizeChars before the overwrite. Comment in the code points back to this thread's diagnosis.
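For illustration, a sketch of the shape of that fix; the helper and its types are assumptions — only the accounting order (release the replaced entry's weight before the overwrite) mirrors the real change:

```ts
// Sketch of the af9405c fix described above, as a standalone helper.
interface CachedParsedEntry {
  sizeChars: number;
}

function insertWithAccounting(
  parsedCache: Map<string, CachedParsedEntry>,
  totalChars: { value: number },
  cacheKey: string,
  next: CachedParsedEntry,
): void {
  let replaced = parsedCache.get(cacheKey);
  if (replaced) {
    // concurrent cache-only readers for the same key can both reach this
    // insert; subtract the loser's weight so the counter can't drift upward
    totalChars.value -= replaced.sizeChars;
  }
  parsedCache.set(cacheKey, next);
  totalChars.value += next.sizeChars;
}
```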
When two concurrent cache-only callers both miss `#parsedCache` for the same key and both finish the SQL round-trip, the second `Map#set` overwrites the first but `#parsedCacheBytes` was previously incremented twice. Over time the counter drifted above actual map weight and forced spurious LRU evictions.

Reachable via `lookupCachedDefinition`, which calls `readFromDatabaseCache` directly — bypassing `#inFlight`, so two concurrent same-key cache-only reads don't coalesce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds an in-memory, LRU-bounded cache of parsed ModuleCacheEntry objects in CachingDefinitionLookup to avoid repeated DB fetch + JSON parsing of large modules.definitions rows during warm-cache requests, with generation-snapshot validation to self-evict on invalidations (local or cross-process via generation bumps).
Changes:
- Add `#parsedCache` in `CachingDefinitionLookup` that short-circuits `readFromDatabaseCache` when generation snapshots still match, plus LRU eviction by approximate entry size.
- Track per-entry generation snapshots (module/realm/global) and drop cache entries on mismatch before falling back to SQL.
- Add tests covering cache short-circuiting and invalidation via `bumpModuleGeneration` / `bumpGlobalGeneration`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/runtime-common/definition-lookup.ts | Introduces in-memory parsed-entry cache in front of DB reads with snapshot validation and LRU eviction. |
| packages/realm-server/tests/definition-lookup-test.ts | Adds regression tests for parsed-cache hits and generation-bump invalidation behavior. |
```ts
// amortized over every subsequent hit. JSON.stringify can throw on
// unexpected shapes (cycles, BigInts); fall back to 0 so a single
// pathological insert can't poison the cache.
function approximateEntrySizeChars(
  entry: ModuleCacheEntry,
  rawDefinitions: unknown,
): number {
  if (typeof rawDefinitions === 'string') {
    return rawDefinitions.length;
  }
  try {
    return JSON.stringify(entry.definitions).length;
  } catch (_err) {
    return 0;
  }
```
Fair — the safety hatch was leaky. Fixed in 3eb6b3a: approximateEntrySizeChars still returns 0 as the "size unknown" sentinel, but the caller now treats 0 as "skip caching" so a pathological insert can't bypass the budget cap. Realistic prerender output won't trip this (module definitions are plain JSON-shaped), but the safety hatch needed to stay closed.
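A sketch of the caller-side guard described above; `persistParsedEntry` and its arguments are assumptions used only to show where the skip happens, while `approximateEntrySizeChars` is the function quoted in this thread:

```ts
// Sketch of the 3eb6b3a guard — not the actual diff.
function persistParsedEntry(
  cache: Map<string, { entry: unknown; sizeChars: number }>,
  cacheKey: string,
  entry: ModuleCacheEntry,
  rawDefinitions: unknown,
): void {
  let sizeChars = approximateEntrySizeChars(entry, rawDefinitions);
  if (sizeChars === 0) {
    // "size unknown" sentinel: skip caching so a pathological row can't
    // occupy the cache without counting against PARSED_CACHE_BUDGET_CHARS
    return;
  }
  cache.set(cacheKey, { entry, sizeChars });
}
```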
```ts
// Parsed-entry cache. Front-ends `readFromDatabaseCache` so successful
// cache reads avoid re-running the SQL fetch + JSON.parse of the
// `modules.definitions` jsonb column on every lookup. Each entry
// captures the three-counter generation snapshot at insertion; lookups
// validate against the current snapshot and drop stale entries — so a
// local invalidate/clearRealmCache/clearAllModules (or a remote bump
// arriving via the cross-process NOTIFY listener) self-evicts on next
// access. Map insertion order is iteration order in JS, so re-inserting
// on hit gives LRU eviction. Bounded by `PARSED_CACHE_BUDGET_CHARS`
// (JSON-text-size proxy for heap).
```
Agreed — snapshot validation alone leaves stale entries resident until they're touched (or LRU-pressured out under unrelated work). Fixed in 3eb6b3a by adding eager-evict inside the bump helpers themselves:
- `bumpGlobalGeneration` clears the entire `#parsedCache` and resets `#parsedCacheChars`.
- `bumpRealmGeneration` evicts every entry whose key starts with `${resolvedRealmURL}|`.
- `bumpModuleGeneration` evicts every entry whose key starts with `${resolvedRealmURL}|${moduleURL}|` (narrows to the affected module across all (scope, user) pairs).
Doing it inside the bump helpers covers both directions: local invalidate / clearRealmCache / clearAllModules already funnel through these, and so does CS-10952's ModuleCacheInvalidationListener when it replays a peer NOTIFY. So a peer's wholesale invalidation now frees this process's parsed-cache memory promptly without waiting for access pressure.
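A sketch of the prefix-based eager eviction described above; names are assumptions mirroring this thread, and the real change lives inside `CachingDefinitionLookup`'s bump helpers:

```ts
// Illustrative helper, not the actual diff.
interface CachedParsedEntry {
  sizeChars: number;
}

function evictByKeyPrefix(
  parsedCache: Map<string, CachedParsedEntry>,
  prefix: string,
): number {
  let freedChars = 0;
  for (let [key, cached] of parsedCache) {
    if (key.startsWith(prefix)) {
      parsedCache.delete(key); // safe: Map iteration tolerates deletes
      freedChars += cached.sizeChars;
    }
  }
  return freedChars;
}

// e.g. inside bumpModuleGeneration(resolvedRealmURL, moduleURL):
//   this.#parsedCacheChars -= evictByKeyPrefix(
//     this.#parsedCache,
//     `${resolvedRealmURL}|${moduleURL}|`,
//   );
// and bumpGlobalGeneration clears the Map and resets the counter outright.
```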
```ts
    sizeChars: number;
  }
>();
#parsedCacheBytes = 0;
```
Renamed in 3eb6b3a: #parsedCacheBytes → #parsedCacheChars. Matches the units of sizeChars and PARSED_CACHE_BUDGET_CHARS and avoids the implication that we're tracking a JS-heap byte count (we're not — it's the JSON-text character count, used as a rough proxy for heap weight).
…t on bump; rename chars counter

Three Copilot review threads on the parsed-entry cache:

1. `approximateEntrySizeChars` returns 0 when JSON.stringify throws on a pathological shape (cycles, BigInts). The previous code still cached the entry, so a zero-weight insert wouldn't count against the eviction budget — silently disabling the cap for any pathological row that landed first. Skip the insert when sizeChars === 0 instead.
2. `bumpGlobalGeneration` / `bumpRealmGeneration` / `bumpModuleGeneration` leave snapshot-stale parsed entries resident until they're touched (or LRU-pressured out). Eager-evict matching entries inside the bump methods so both local invalidations and CS-10952's cross-process NOTIFY-replay free memory immediately.
3. `#parsedCacheBytes` actually accumulates `sizeChars` (character counts) and is compared against `PARSED_CACHE_BUDGET_CHARS`. Renamed to `#parsedCacheChars` so the unit is obvious.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing in favor of CS-11079 + a request-scoped memoization pass. The work in this PR is correct, but post-instrumentation we found the underlying

Once that lands, the per-GET cost of the original problem (28s) is dominated not by the per-blob parse but by re-reading the same module 11× within one GET. That's better solved with a request-scoped

Net result: the long-lived parsed-entry cache adds complexity (cross-process bump-via-NOTIFY contract, byte-bounded LRU) for a benefit that disappears once CS-11079 makes parsing cheap. Strictly aligned with the DB-Authoritative Realm Registry direction to keep per-process state minimal.

Tracking the new approach at CS-11079.
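The request-scoped alternative isn't spelled out in this thread; a minimal, hypothetical sketch of per-request memoization (all names are illustrative, none taken from CS-11079):

```ts
// Hypothetical sketch of request-scoped memoization — not code from CS-11079.
// The idea: one Map lives for the duration of a single loadLinks GET, so the
// same module is fetched + parsed at most once per request, then the whole
// Map is discarded with the request; no long-lived, cross-request state.
type Loader<T> = (key: string) => Promise<T>;

function withRequestMemo<T>(load: Loader<T>): Loader<T> {
  let perRequest = new Map<string, Promise<T>>();
  return (key: string) => {
    let hit = perRequest.get(key);
    if (hit) return hit; // second read of the same module within this request
    let pending = load(key);
    perRequest.set(key, pending);
    return pending;
  };
}

// Usage sketch: build one memoized loader per incoming GET and thread it
// through that request's layers so sequential same-module reads coalesce.
```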
Summary
Today, every warm-cache `loadLinks` GET against a card whose link tree adopts from a large `software-factory` module re-fetches and re-parses the same `modules.definitions` jsonb row 5–11 times per request. This PR adds a per-process, parsed-form working set above `readFromDatabaseCache` so each realm-server fetches+parses each row at most once between invalidations. The shared DB row stays the authoritative source of truth — nothing is being duplicated, and no per-process state is being introduced that peers need to know about.

Where state lives — and what this PR does and does not change
It's worth being careful about the term "in-memory cache" given the direction of the DB-Authoritative Realm Registry project. So, concretely:
Authoritative state, shared across all realm-server processes (DB):
- `modules.definitions` jsonb — the canonical persisted output of a module prerender. Written once, read by every realm-server process. This PR does not change how it's written, what's stored in it, or who reads it.
- `boxel_index`, `realm_registry`, queue tables — unchanged.

Per-process state, already in `CachingDefinitionLookup` today:

- `#moduleGenerations` / `#realmGenerations` / `#globalGeneration` — three counters bumped synchronously by `invalidate` / `clearRealmCache` / `clearAllModules` before the DB DELETE. Used today only to discard mid-flight prerender results that started against a now-stale view.
- `#inFlight` — coalesces concurrent same-key callers within a process so a single SQL round-trip is shared by all waiters; evicts on completion.

Per-process state added by this PR:

- `#parsedCache` — a Map of parsed `ModuleCacheEntry` keyed by `(canonicalModuleURL, cacheScope, cacheUserId, resolvedRealmURL)`. Each entry holds the result of one `readFromDatabaseCache` call along with the generation-counter snapshot that was current at insert time. Bounded by total parsed-bytes (`PARSED_CACHE_BUDGET_CHARS` = 256 MB JSON-text proxy for ~0.5–0.75 GB heap). LRU via Map insertion order.

The cache is purely a parsed-form working set of the DB row. It is reconcilable from the DB at any time: drop the entire Map and the next lookup re-reads + re-parses the DB row, exactly like today. No process needs to know what's in any other process's `#parsedCache`. There is no new shared state.

Findings — what we measured
Instrumented warm-cache GET against `e2e-1/Validations/eval_sticky-note-1`. Every relevant `software-factory` module was already populated in the `modules` table; no prerender fired during the measured GET. The realm-server process held a fresh `CachingDefinitionLookup` (in-memory generation counters all zero, `#inFlight` empty between layers).

`#inFlight` does its job for concurrent same-key callers (layer 2's 4-card `Promise.all` collapses to 1 read), but evicts on completion. Sequential calls within a single GET — across layer boundaries, after each layer's `Promise.all` settles, and inside `executeQueryForField` follow-ups — each re-fetch and re-parse the same 60 MB DB row from scratch. Two-thirds of `loadModuleCacheEntry` invocations did not coalesce.

The DB row sizes are large by design (the dominant module re-exports the entire factory schema, and one-time-cost trimming is tracked separately):

A direct DB-only bench (no realm-server, single PG connection) put the floor at 8.7s of read+parse work per such GET; observed end-to-end was 20–28s in production-shape conditions. Five `user/cs-11003-e2e-{1..5}` realms exhibited the same behavior:

Eliminating the redundancy collapses 28s → ~5s. The residual is `executeQueryForField issues` (4.7s in this GET) + `cloneDeep` + per-resource search — separate concerns, tracked elsewhere.

Fix

Front-end `readFromDatabaseCache` with `#parsedCache`. Both `lookupDefinition` and `lookupCachedDefinition` paths benefit because both flow through `readFromDatabaseCache`.

- `#parsedCache`. If a hit and the cached snapshot equals the current snapshot, LRU-touch it (delete + re-insert at end of Map iteration order) and return the parsed entry. If snapshots disagree (a local invalidation or peer NOTIFY has bumped a counter since insert), drop the entry, decrement byte-count, fall through to the SQL+parse path.
- `sizeChars` before overwriting (see thread for the concurrent-replace race).
- `isExpiredErrorEntry` TTL path already handles them in DB; in-memory caching errors would either return a stale-but-still-cached error or require a TTL-aware in-memory eviction path. Not worth either.

Why this stacks on `cs-10952-cross-process-invalidation-broadcast` (#4644)

CS-10952 is the piece that makes a per-process parsed-form cache safe to add at all in a multi-process realm-server topology. Two reasons, both load-bearing:

1. The bump helpers it makes public are exactly what my snapshot mechanism keys on. CS-10952 promotes `bumpModuleGeneration` / `bumpRealmGeneration` / `bumpGlobalGeneration` from `private` to public on `CachingDefinitionLookup` so a sibling `ModuleCacheInvalidationListener` can call them. My fix uses those same three counters as its invalidation contract. Landing this PR on top of CS-10952 means I get the public API for free; landing it independently would mean either duplicating the promotion or merging out of order.
2. The cross-process invalidation channel it adds is what closes the multi-process correctness gap. At N=1 (today's prod), local `invalidate` / `clearRealmCache` / `clearAllModules` already bump counters synchronously before the DB DELETE — so my snapshot check is sufficient: the next lookup sees the bump and self-evicts. At N>1, that's no longer true. Process A's local bump doesn't reach process B's counters. Without CS-10952, B's `#parsedCache` would happily serve a parsed entry that A has already deleted from the DB.

CS-10952 closes that gap end-to-end:

- `invalidate` / `clearRealmCache` / `clearAllModules`) emits `NOTIFY module_cache_invalidated` with a structured payload (URL fan-out / realm scope / global) inside the same transaction as the DB DELETE, so peers only see the signal on commit.
- `ModuleCacheInvalidationListener` (subscribed via the multiplexed `PgAdapter.subscribe` from PgAdapter.subscribe: multiplex LISTEN over a shared client #4660) that parses incoming payloads and replays them by calling the now-public bump helpers on the local `CachingDefinitionLookup`.

So with both PRs in place, the chain on a peer invalidation is: A invalidates → A bumps locally → A's DELETE commits → NOTIFY fires → B's listener wakes → B's listener calls `bumpModuleGeneration` (etc.) on B's `CachingDefinitionLookup` → B's `#parsedCache` snapshot for the affected entry is now stale → B's next `readFromDatabaseCache` lookup observes the snapshot mismatch on its in-memory entry → B drops the entry and falls through to the (now-empty) DB → returns `undefined` (or the freshly-prerendered row if a peer raced to re-populate). Same persist-after-invalidate protection that already exists for in-flight prerenders today, just extended to longer-lived parsed entries.

End result: this cache ships safely under multi-process from day one, layered cleanly above the shared DB row without introducing any new cross-process state. Landing CS-11077 directly on `main` (without CS-10952 first) would be the failure mode worth being concerned about — that would be a per-process cache without a cross-process invalidation contract. With the stack, the contract is intact.

Out of scope, tracked separately

- `definitions` jsonb is read whole even though `populateQueryFields` only needs the `fields` slice for one specific card type. A jsonb path read or a sibling slim column would shrink the read from ~60 MB to a few KB and would help every realm-server, not just ones with a warm `#parsedCache`.
- `darkfactory.gts`'s `definitions['…/Issue' | '…/Project' | '…/DarkFactory']` are 20+ MB each because of transitive schema embedding in `definition.fields`. Trimming would help every lookup of that module.
- `executeQueryForField` for `Project.issues` linksToMany was 4.7s on its own in the measured GET — separate index-search question, not module-cache-related.

Test plan

- `pnpm test-module` for `definition-lookup-test.ts` — all 28 tests green, including 3 new ones:
  - parsed-entry cache short-circuits the DB read on repeat lookups
  - `bumpModuleGeneration` invalidates the parsed-entry cache for that module (the cross-process invalidation arrival from CS-10952's listener is structurally identical to a direct call here)
  - `bumpGlobalGeneration` invalidates the parsed-entry cache for every entry
- `pnpm lint` clean for both `runtime-common` and `realm-server`
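For illustration only, a hypothetical shape of the bump-invalidation assertion — this is not the actual test file; the `setupLookup` helper, the lookup/adapter shapes, and the call signatures are assumptions, and only the miss → hit → bump → miss sequence is the point:

```ts
import QUnit from 'qunit';

// assumed suite-provided fixture; declared here only so the sketch type-checks
declare function setupLookup(): Promise<{
  lookup: any;
  adapter: { readCount: number };
  realmURL: string;
  moduleURL: string;
  cardRef: any;
}>;

QUnit.test('bumpModuleGeneration invalidates the parsed-entry cache', async (assert) => {
  let { lookup, adapter, realmURL, moduleURL, cardRef } = await setupLookup();

  await lookup.lookupDefinition(moduleURL, cardRef); // cold: SQL read + parse, entry cached
  await lookup.lookupDefinition(moduleURL, cardRef); // warm: served from the parsed-entry cache
  assert.strictEqual(adapter.readCount, 1, 'repeat lookup short-circuited the DB read');

  lookup.bumpModuleGeneration(realmURL, moduleURL); // same call a peer NOTIFY replay makes
  await lookup.lookupDefinition(moduleURL, cardRef); // snapshot now stale: falls back to SQL
  assert.strictEqual(adapter.readCount, 2, 'generation bump forced a fresh DB read');
});
```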