db/kv/memstoredb, node: experimental pure-Go in-memory kv.RwDB#21499
db/kv/memstoredb, node: experimental pure-Go in-memory kv.RwDB#21499yperbasis wants to merge 17 commits into
Conversation
Add a feature-flagged in-memory implementation of kv.RwDB selected via USE_IN_MEMORY_KV=1 / --experimental.inmem-kv. With the flag off (default) MDBX remains the backend unchanged. The store is a per-table tidwall/btree, snapshot-isolated via btree.Copy() on BeginRo/BeginRw; Commit atomically swaps master. Single-writer / many-readers, no OS-thread affinity (no cgo / no LockOSThread). DB instances are deduplicated per path so reopening the same path returns the same data, mirroring MDBX semantics. Wired into node.OpenDatabase and memdb.New/NewChainDB so every MDBX user (chaindata, txpool, downloader, ...) picks up the in-memory backend when the flag is on. Volatile by design (no WAL, no on-disk persistence) — intended for measurement / benchmarking newPayload latency without the Go↔C cgo boundary, not for durable production use. make test-short passes in both modes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profiling a fresh in-memory engineapi test showed ~50% of CPU in
btree.Copy / IsoCopy / newIsoID, called per-table at every BeginRo and
BeginRw. Both transaction kinds were eagerly cloning every master table
even when most are never touched.
Switch to clone-on-first-write:
- BeginRoTx clones the master tables MAP (~ns/op for a few dozen
entries), not the btrees. Reads go through the shared master btrees;
they stay valid because RwTx commits replace master.tables with a
new map, leaving the pre-commit btrees this RoTx still references
untouched (writers mutate their own COW clones).
- BeginRwTx does the same. getOrCreateTable now distinguishes the
write path: on the first mutation of a master-shared table it
btree.Copy()s the table into a tx-private clone (recorded in
privateTables) and replaces the entry in tx.tables. Subsequent
writes hit the clone directly.
Effect:
- BenchmarkBeginRwCommit_Mem_1M: 181ns/op -> 138ns/op (24% faster)
- newIsoID disappears from the engineapi CPU profile
- Real engineapi tests stay 21-55% faster than MDBX-mode
setPos drops one common.Copy pair: stored entries are immutable in
tidwall's COW btree (mutations create new nodes), so the cursor can
hold the entry by reference; the caller still gets its own copies.
make test-short passes in both MDBX (baseline) and in-mem mode;
make lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Profiling a tip-tracking erigon revealed maps.Clone of the master tables
map as ~10% of CPU on the engineapi hot path: the commitment Warmuper
fans out NumCPU*8 (= 256 on this box) goroutines per block, and each one
calls db.BeginTemporalRo() → memstoredb.BeginRo() → maps.Clone — so the
small per-tx cost gets multiplied by ~256.
RoTx never mutates the tables map, so it doesn't need a copy. Switch to
shared-by-reference:
- beginRoTx() captures master's tables/sequences map references.
- getOrCreateTable() splits read/write paths: RoTx + missing table now
falls back to a per-tx roLocal map (lazily allocated), so RoTx
cursors over non-existent tables get an empty stub instead of
polluting the shared master map.
- emptyTables / emptySequences are nil-fallback singletons.
RwTx is unchanged: it still maps.Clone the master at begin (cheap),
keeps a privateTables set, and COW-clones each table on first write.
Wallclock A/B (count=10 sequential test runs, in-mem mode):
TestEngineApiEmptyBlockProduction 233ms/run → 181ms (-22%)
TestEngineApiSequentialNonceAdvancement 200ms/run → 162ms (-19%)
TestEngineApiHighGasContractsFillBlock 227ms/run → 180ms (-21%)
make test-short passes; make lint clean. RoTx semantics unchanged for
external callers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the experimental in-mem KV ON by default so CI runs through it. Reverting requires setting USE_IN_MEMORY_KV=false in env. This is a temporary stress-test commit — the goal is to surface any correctness issues in the in-mem backend by exposing it to the full CI matrix (unit tests, race tests, hive, hive-eest, kurtosis, eest spec). Will revert after the in-mem KV is validated as 100% correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tx.Delete and tx.ClearTable were using lookupTable to obtain the table pointer and then calling tab.tree.Delete / tab.tree.Clear directly. For a RwTx that hadn't previously COW-cloned that table (no prior Put/ getOrCreateTable on it), lookupTable returns master's shared pointer — so these mutations corrupt master directly, and concurrent RoTx readers holding the same map see the mid-tx deletions immediately. That breaks snapshot isolation and almost certainly caused the wrong-trie-root errors I saw on the live rig at blocks 25196561 / 25196872: prune / ClearTable runs concurrently with the Warmuper RoTxs that pre-fetch trie state, the warmer sees the partial mutations, and the commitment computation ends up using an inconsistent view of state. Introduce cowIfShared(name): returns the tx-private clone of an existing table (COW from master if needed), or nil if the table doesn't exist. Delete and ClearTable now route through it. Cursor mutations (DeleteCurrent / DeleteExact / PutCurrent / PutNoDup- Data) were already safe because newCursor calls getOrCreateTable, which COW-clones at cursor open time and stores the clone in cursor.table. Put / Append / AppendDup are also safe (they go through tx.Put → getOrCreateTable). make test-short passes; engineapi tests pass; lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TestRwTxDeleteIsolatedFromRoTx and TestRwTxClearTableIsolatedFromRoTx catch the snapshot-isolation bug fixed in 1a35354: a RwTx that only calls Delete (or ClearTable) on an existing table — without a prior Put / getOrCreateTable on that table — must NOT corrupt master. A concurrent RoTx that opened before the RwTx began must continue to see the pre-Delete state until the RwTx commits, and a Rollback must undo the change. Both tests would have failed against the pre-fix code (lookupTable + direct tab.tree.Delete bypassed COW). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The volatile in-mem backend was killing erigon's CLI tooling: hive's
client wrapper runs `erigon init genesis.json` → `erigon import` →
`erigon` (daemon) as separate processes and expects chaindata to
survive between them. With a purely volatile backend the second
process opens an empty memstoredb, eth.New finds no genesis to
populate cfg.Genesis, and eth.Init then nil-derefs at
ethCfg.Genesis.Config (import_cmd.go:114).
Add a minimal binary dump file alongside the registered path
(`<path>.mem`):
- OpenForPath() loads it if present (corrupt file → fresh DB, no
panic — caller can't tell, matching graceful-restore semantics).
- Close() best-effort writes it (failure is non-fatal: the backend
is documented as volatile, durability is a convenience for CLI
tooling, not a guarantee).
Format: 4-byte magic + version, then sequences and tables in turn
(name + dupSort flag + entry count + (k_len,k,v_len,v) per entry).
Atomic write via tmp+rename; tmp is cleaned up on failure via
common/dir.RemoveFile (ruleguard-friendly).
Hot-path runtime stays zero-cgo — disk is only touched on the cold
Open / Close edges. Long-lived daemons see no change.
TestSaveLoadRoundTrip covers the dump/restore path. Verified
`erigon init genesis.json` + `erigon import` now succeeds with
USE_IN_MEMORY_KV=true (previously nil-derefed).
make test-short clean (220 ok, 0 fail); make lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NextNoDup and LastDup were positioning via Seek(NextSubtree(curKey)) +
Prev, which is only correct when keys are fixed-length. With variable-
length keys (e.g. CommitmentVals nibble paths, history-table keys with
varying widths) an entry can sit lexicographically between curKey and
NextSubtree(curKey) — "aa" < "aab" < "ab" — so the Seek-to-prefix-end
trick skips it.
In TestAggregatorV3_PruneSmallBatches this surfaced as LastDup returning
nil because Prev landed outside curKey; the prune path then did
binary.BigEndian.Uint64(nil) and crashed. TestCheckStateVerify saw the
same symptom as a state-vs-commitment integrity mismatch (the commitment
trie computed the wrong set of keys when NextNoDup walked over the wrong
neighbour). Both pass after the fix.
Replace both with linear-forward scans:
- NextNoDup steps iter.Next() from the current position until the key
changes; that's the first dup of the next distinct key by MDBX
semantics (MDBX_NEXT_NODUP).
- LastDup seeks to curKey's first dup and walks iter.Next() while the
key still matches, tracking the last entry seen.
PrevNoDup is untouched: Seek(curKey, nil) + Prev lands on the entry
immediately preceding (curKey, smallest-v), which is always the last
dup of the largest key < curKey — correct regardless of key length.
make test-short under in-mem mode previously passed; the bug surfaced
only in tests like prune-small-batches that drive variable-length
domain-history keys. db/kv/memstoredb unit tests still pass; lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 5a692ab switched NextNoDup to a linear forward walk for DupSort tables, the loop unconditionally called iter.Next() at the top of every iteration — including the very first iteration, before inspecting what iter.Seek() had already positioned on. When Seek({k:curKey, v:c.current.v}) lands on an entry whose key is already past curKey (because no dup matches the requested v — e.g. the v from c.current is stale relative to the live tree), the iter.Next() that followed silently skipped THAT entry, advancing one too far. Symptom: TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight read blockReader.Header → nil because some upstream NextNoDup-driven iteration over AccountHistoryKeys / ReceiptCacheHistoryKeys lost an entry. Restructure the loop to inspect the Seek-positioned entry FIRST and only call iter.Next() if the current entry is still under curKey. Also: take a non-DupSort fast path so single-entry-per-key tables (Headers, BlockBody, AccountVals, etc.) skip the variable-length-key defensive walk and do a single Seek + Next. LastDup gets the same correction — already walks forward from Seek(curKey) tracking the last matching entry — but its loop already inspects iter.Item() before advancing, so no behaviour change was needed there; the test against it just re-validated. make test-short clean (220 ok, 0 fail) under USE_IN_MEMORY_KV=true; make lint clean. Both regression cases pass: - TestAggregatorV3_PruneSmallBatches (DupSort variable-length keys) - TestValidateChainAndUpdateForkChoiceWithSideForksThatGoBackAndForwardInHeight Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ViewID() was hard-coded to 0, so distinct transactions collided when kvcache.Coherent uses tx.ViewID() to key its root map. The result was that a cache populated under one tx's view leaked into the next reader, e.g. FuzzOnNewBlocks panicking in accounts.DeserialiseV3 on cached sender bytes that the new reader had no business seeing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two related fixes for cross-process / cross-test isolation: 1. Move the persistence file inside the data directory (`<path>/inmem.dat` instead of `<path>.mem` sibling). Callers that rm -rf the data directory between tests now also wipe the dump file, matching MDBX where the data lives inside the dir. 2. Drop the registry entry on Close. The runner-style pattern of "open path → use → close → rm -rf the dir → MkdirTemp returns same path → reopen" was resurrecting the prior DB from the registry; with the entry gone, the second open rehydrates from whatever (if anything) sits at the new path's dump file. Matches MDBX, which holds no in-memory state for a closed handle. Surfaced by eest enginextests-stable-sequential failing with "database contains genesis" when MkdirTemp recycled a tmp path across testers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Manually dispatched Hive workflows against sha
Goal is to verify the pure-Go memstoredb passes all hive simulators end-to-end before merging. |
stack.Close() only closes registered databases when the node has been Start()ed (state == runningState in node.doClose). import_cmd never calls stack.Start(), so the deferred stack.Close() drops out via the initializingState branch and never reaches the chainDB. For MDBX this is harmless — commits sync to disk during execution and Close is a cleanup. For the pure-Go memstoredb backend, persistence is hooked off Close (see persist.go), so skipping it meant the imported chain was discarded when the process exited, and the next stage of the hive flow (the daemon) booted to head=0. Hive's rpc-compat suite then fails its first 141/211 tests because every block-relative query lands on the empty genesis. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
All CI green with USE_IN_MEMORY_KV defaulted to true on sha
Bugs fixed during this CI iteration:
|
…ment parity test The wrong-trie-root behaviour we hit on the live mainnet rig with --experimental.inmem-kv (three "Wrong trie root" hits in the first ~2 minutes past the last frozen step) is not reproduced by any of the existing memstoredb / domain / aggregator tests: all of them keep state inside a single tier, never the cross-tier mix of frozen .kv snapshot files + live chaindata writes that mainnet sync is actually doing past the last frozen step. This test is a starting point that *does* exercise the cross-tier shape: it builds N frozen steps via agg.BuildFiles, then writes additional steps in per-batch RwTx commits with intervening RoTx reads (matching the per-batch shape of the stage loop), computes commitment, and compares the root against an MDBX baseline driven through the exact same write sequence. It currently PASSES at the small scale it tests — i.e. the bug on the rig isn't in the cross-tier paths the synthetic workload exercises. Future bisection should extend this test along the dimensions that distinguish mainnet sync from the current shape (account deletes, storage SET-to-zero, real frozen file contents, pruning, larger key distributions, etc.) until it reproduces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds two mainnet-style write patterns the original scaffold was missing: - periodic SELFDESTRUCT-style account DomainDel (cascades into storage and code tombstones), every ~37th (account, txNum) combo - storage SET-to-zero via DomainDel at 1 in 5 storage writes Test still PASSES at this scale — memstoredb still tracks MDBX exactly. The rig's wrong-trie-root must require a dimension this synthetic workload doesn't yet hit (real frozen-file contents, scale, prune, or something specific to the rig's existing snapshot files). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Live-rig A/B verdict: the state-root bug is memstoredb-specific. Same datadir / snapshot directory / binary at
The cross-tier synthetic test scaffold (
Not blocking the PR's CI green, but blocking any "safe to default" promotion of |
|
Corrected verdict: the 3 wrong-trie-root hits were carried-over .kv-snapshot corruption from an earlier (older, buggier) memstoredb run, NOT the current memstoredb code. The earlier A/B comparison was confounded by a difference I missed: the in-mem run that failed was reading May-28-era step 9072-9080 .kv files written by a prior (older) in-mem erigon. The MDBX A/B run rebuilt those .kv files cleanly from execution. So "MDBX clean, in-mem broken" really meant "MDBX has clean .kv files, in-mem had corrupt ones." Re-test: I wiped chaindata + step 9080+ files, kept the MDBX-rebuilt clean 9072-9080 files, and restarted in
So the live memstoredb code on sha Practical implication for cleanup: any datadir that has .kv files produced by a memstoredb run older than today is suspect and should have those step ranges wiped + rebuilt via MDBX before being trusted again. We surgically removed step 9072-9080 + 9080-9082 + 9082-9083 + 9083-9084 + 9080-9084 + 9084-9085 from this rig. Rig stopped cleanly. Tasks 25 and 26 complete. |
profiling at tip (MDBX, 60s window) pinned bytes.Clone as the #1 single allocation source at 64.3 GB / 60s (~16.85% of total alloc traffic). 37.86 GB / 60s (~631 MB/sec) of that came from the defensive common.Copy at TrieContext.Branch. The clone was redundant: every downstream consumer either reads the slice inline (cell.fillFromFields copies into pre-allocated arrays; merger.Merge consumes and produces a fresh buffer; trie_reader parses bytes into cells; unfoldBranchNode similar) or clones it itself at the queue boundary (getDeferredUpdate clones both prefix and prev when storing in the deferred-update pool). Branch's clone was a third copy of bytes that nothing needed to retain. Document the new contract (borrowed slice valid for the current ComputeCommitment scope) and update the test that exercised the old "returns owned bytes" guarantee to verify the new aliasing guarantee instead. After-measurement on the rig is blocked by an unrelated stage-loop persistence inconsistency (chaindata head pointer ahead of state writes on restart) that's reproducing on every restart cycle today; the change is mathematically minimal (single Clone removal + test contract update) and unit tests + make lint pass, so shipping on the math. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds a feature-flagged pure-Go in-memory implementation of
kv.RwDB. With the flag off (default) MDBX remains the backend unchanged. With the flag on (USE_IN_MEMORY_KV=1or--experimental.inmem-kv), every MDBX user (chaindata, txpool, downloader, …) picks up the in-memory backend instead.Why
Native
perfon erigon (metarepo-internal-issues#17) shows ~28–30% of CPU on the hot newPayload→FCU path is the Go↔C cgo boundary (exitsyscall/reentersyscall/cgocall/cgoCheckPointer/_Cfuncstubs) and another ~11% is MDBX's own B-tree work. The fast-cgo trampoline experiment confirmed the boundary tax is not reclaimable via shortcuts. The architectural way to erase it is to take MDBX out of the picture entirely behind a feature flag.Approach
New package
db/kv/memstoredb:*DBimplementskv.RwDB. Per-tabletidwall/btree.BTreeG[entry].BeginRoclones only the tables MAP, not the btrees themselves (writers always work on their own COW clones, so master's pre-commit btrees stay untouched).BeginRwtakes the write lock, captures the map, and uses clone-on-first-write: a table isbtree.Copy()-ed into a tx-private clone the first time it is mutated. Subsequent writes mutate the clone directly.BeginRwTrymirrors MDBX'sMDBX_TXN_TRY(returnsEBUSYif a writer is active).Commitatomically swaps the master map; concurrent RoTx that captured the old map see pre-commit data via the still-valid old btree pointers.OpenForPath(path, …)calls return the same DB — matches MDBX's "opening the same file returns the same data" semantic.Closeis a handle release, not a wipe.Wired in two places — every other caller works through these:
node.OpenDatabase(production) — early-returnmemstoredb.OpenForPath(...)when the flag is on.db/kv/memdbNew/NewChainDB(tests) — same conditional swap, so test runs read the env var.Feature flag follows the existing
UseStateCachepattern:dbg.UseInMemoryKV+SetUseInMemoryKV(bool), plus a--experimental.inmem-kvCLI flag registered innode/cli/default_flags.go.Verification
go build ./...clean.make lintclean (multiple runs).make test-shortpasses withERIGON_USE_IN_MEMORY_KV=false(MDBX baseline, no regression).make test-shortpasses withERIGON_USE_IN_MEMORY_KV=true(in-memory mode).Measurements
End-to-end wallclock on real engineapi tests, 3-run sequential
go test:Synthetic micro-bench (
/tmp/elbench-bench):The last row is the only one where MDBX wins: hot writes to an already-populated table cost more in tidwall/btree (per-Set node allocations + GC) than in MDBX (in-place mmap pages). At the system level the cgo savings + cheap tx-overhead dominate, hence Mem still wins all the wall-clock tests. The first-call cost of cloning a populated btree into a tx-private copy on write is the structural limit, and addressing it would require a non-allocating btree — out of scope here.
Out of scope
.kvsnapshot files survive on disk.🤖 Generated with Claude Code