Context
LMDBWriter and LMDBDataset currently store every event as a single LMDB key/value pair, with all extractors and all precomputed DataRepresentations bundled into one serialized blob:
- Key: str(event_no).encode("utf-8")
- Value: serialized dict
{
"<pulsemap_extractor_name>": {col: [values...], ...},
"<truth_extractor_name>": {col: [value], ...},
...other extractors...
"data_representations": {
"GraphDefinition": <torch_geometric.Data>,
...one entry per stored rep...
},
}
Worth opening a discussion on whether this is the right long-term layout.
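To make the layout concrete, here is a minimal sketch of one event as it would sit in the monolithic layout. It assumes pickle as the serializer (the actual format is whatever __meta_serialization__ records), and the extractor names, column names, and values are made up for illustration. Plain lists stand in for the torch_geometric Data object so the snippet runs without torch.

```python
import pickle

# Hypothetical event: names and values are illustrative only.
event_no = 12345
value = {
    "pulsemap": {"dom_x": [1.0, 2.0, 3.0], "charge": [0.5, 1.2, 0.9]},
    "truth": {"energy": [104.2]},
    "data_representations": {
        # Stand-in for a torch_geometric Data object.
        "GraphDefinition": {"x": [[1.0, 0.5], [2.0, 1.2], [3.0, 0.9]]},
    },
}

key = str(event_no).encode("utf-8")
blob = pickle.dumps(value)  # assumption: pickle is the serializer

# Fetching a single representation still decodes the whole value,
# raw tables included.
decoded = pickle.loads(blob)
graph = decoded["data_representations"]["GraphDefinition"]

# For comparison: the bytes that would be needed for the rep alone.
rep_only = pickle.dumps(value["data_representations"]["GraphDefinition"])
print(len(blob), len(rep_only))
```

The gap between the two printed sizes is the part of every read that is wasted when only the representation is requested.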
Where the monolithic layout hurts
- Reading when you only want one rep. LMDBDataset._update_cache deserializes the entire value even when pre_computed_representation="GraphDefinition". The raw tables (~3.5 KB) are decoded for nothing.
- Adding or removing a representation after the fact requires deserializing every event value, mutating its data_representations dict, and reserializing the whole thing, even though only the new rep is actually new on disk. For ~10M events at ~500 µs round-trip that is about 1.4 hours single-threaded.
- Pickling small torch tensors is the dominant cost. Each tensor carries its own torch metadata overhead, so a Data with ~10 small tensors is much slower to (de)serialize than the equivalent number of plain bytes.
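The rewrite-cost estimate in the second bullet is simple arithmetic. The helper below (a hypothetical name, not part of the codebase) reproduces it so the figure can be re-derived for other dataset sizes or measured round-trip times.

```python
def rewrite_time_hours(n_events: int, round_trip_us: float) -> float:
    """Single-threaded wall time, in hours, to read-modify-write every event once."""
    return n_events * round_trip_us * 1e-6 / 3600.0


# Figures quoted above: ~10M events at ~500 µs per deserialize/mutate/reserialize.
hours = rewrite_time_hours(10_000_000, 500.0)
print(f"{hours:.2f} h")  # about 1.39 h
```

The estimate scales linearly in both inputs, so a measured round-trip of 2 ms would push the same pass to roughly 5.6 hours.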
What a sub-database layout would look like
LMDB natively supports multiple named sub-databases per env via env.open_db(b"<name>"), provided the env is opened with max_dbs at least as large as the number of named sub-DBs. The natural split is:
- Default sub-DB: {event_no → {extractor_name: dict-of-lists, ...}} (raw tables, no data_representations key)
- One sub-DB per representation field name (KNNGraph, GraphDefinition, ...): {event_no → Data}
- Metadata keys (__meta_serialization__, __meta_data_representations__) stay where they are.
Wins
- Adding a representation becomes a pure insert into the new rep's sub-DB. No reads of existing data, no rewriting of raw tables.
- Reads only deserialize what was asked for. LMDBDataset(pre_computed_representation="GraphDefinition") would do a single B+tree lookup in the GraphDefinition sub-DB and skip the raw tables entirely.
- Removing a representation is cheap — drop the sub-DB or delete its keys.
- Smaller values in the hot path are friendlier to LMDB's page cache.
Costs
- Two B+tree lookups per event if a caller wants raw tables AND a rep. Still O(log N) and amortized cheap, but real.
- Breaking change for existing on-disk LMDBs. A migration helper would be needed (or a one-time "flatten" pass).
- More moving parts in LMDBWriter, LMDBDataset, merge_files, and the metadata schema (need to know which sub-DB names to open).
Other tradeoffs worth noting
- Lookup time is not a concern: LMDB is a B+tree, lookup is O(log N) — essentially constant for any realistic dataset size. The cost discussion above is about value size and (de)serialization, not key lookup.
- Parallelizing writers is bounded by LMDB's single-writer-per-env lock. A compute pool + serial writer helps the current monolithic layout; a sub-DB layout would help more, because workers would only ship the new Data rather than a full re-serialized event value.
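The compute pool + serial writer pattern can be sketched without LMDB at all. In the snippet below the names compute_rep and build_reps are hypothetical, the "representation" is a toy transformation, and put is any callable that stores one key/value pair (for example a bound txn.put on the target sub-DB). A thread pool keeps the example self-contained; CPU-bound representation building would more likely use a process pool.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def compute_rep(item):
    """Worker: build and serialize one representation (toy stand-in)."""
    event_no, raw = item
    rep = {"x": [2.0 * q for q in raw["charge"]]}
    return str(event_no).encode("utf-8"), pickle.dumps(rep)


def build_reps(events, put, max_workers=4):
    """Compute reps in parallel, but call put() from this thread only."""
    n_written = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Workers ship back only (key, serialized rep); the single writer
        # below is the only code that touches the store.
        for key, blob in pool.map(compute_rep, events.items()):
            put(key, blob)
            n_written += 1
    return n_written


# Demo with a dict standing in for the representation's sub-DB.
events = {1: {"charge": [0.5, 1.0]}, 2: {"charge": [2.0]}}
store = {}
written = build_reps(events, store.__setitem__)
```

With the monolithic layout each worker would instead have to return the full re-serialized event value, so the serial writer would be moving the raw tables again for every event.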
Decision needed
Whether the sub-DB refactor is worth the breaking change. Numbers are needed before deciding — see the linked profiling task.
Tasks
num_workers settings
If profiling shows the costs are negligible at realistic dataset sizes, the sub-DB refactor isn't worth doing.