LMDB storage layout: monolithic event values vs. per-rep sub-databases #893

@sevmag

Context

LMDBWriter and LMDBDataset currently store every event as a single LMDB key/value pair, with all extractors and all precomputed DataRepresentations bundled into one serialized blob:

  • Key: str(event_no).encode("utf-8")
  • Value: serialized dict
    {
        "<pulsemap_extractor_name>": {col: [values...], ...},
        "<truth_extractor_name>":    {col: [value], ...},
        ...other extractors...
        "data_representations": {
            "GraphDefinition": <torch_geometric.Data>,
            ...one entry per stored rep...
        },
    }
    

Worth opening a discussion on whether this is the right long-term layout.

Where the monolithic layout hurts

  1. Reading when you only want one rep. LMDBDataset._update_cache deserializes the entire value even when pre_computed_representation="GraphDefinition". The raw tables (~3.5 KB) are decoded for nothing.
  2. Adding or removing a representation after the fact requires deserializing every event value, mutating its data_representations dict, and reserializing the whole value, even though only the new rep is actually new on disk. For ~10M events at ~500 µs per round-trip that is about 5,000 s (roughly 1.4 hours) single-threaded.
  3. Pickling small torch tensors is the dominant cost. Each tensor carries its own torch metadata overhead, so a Data object with ~10 small tensors is much slower to (de)serialize than the same payload stored as plain bytes.
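
To make item 2 concrete, here is a minimal sketch of the rewrite cycle and the back-of-envelope estimate. It assumes pickle as the serializer (the real one is whatever __meta_serialization__ records) and uses plain Python objects in place of torch_geometric Data; `add_rep_monolithic` and `rewrite_hours` are hypothetical helper names, and the 10M / 500 µs inputs are the rough figures quoted above, not measurements.

```python
import pickle


def add_rep_monolithic(blob: bytes, rep_name: str, rep) -> bytes:
    """Decode the whole event value, add one rep, re-encode everything."""
    value = pickle.loads(blob)  # raw tables are decoded too
    value.setdefault("data_representations", {})[rep_name] = rep
    return pickle.dumps(value)  # raw tables are re-encoded too


def rewrite_hours(n_events: int, round_trip_us: float) -> float:
    """Single-threaded wall time, in hours, to rewrite every event value."""
    return n_events * round_trip_us * 1e-6 / 3600.0


print(f"{rewrite_hours(10_000_000, 500):.2f} h")  # prints "1.39 h"
```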

What a sub-database layout would look like

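LMDB natively supports multiple named sub-databases per env; in py-lmdb they are opened via env.open_db(key=b"<name>"), and the env must be opened with max_dbs at least equal to the number of named sub-DBs. The natural split is: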

  • default sub-DB: {event_no → {extractor_name: dict-of-lists, ...}} (raw tables, no data_representations key)
  • One sub-DB per representation field name (KNNGraph, GraphDefinition, ...): {event_no → Data}
  • Metadata keys (__meta_serialization__, __meta_data_representations__) stay where they are.
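
A minimal sketch of this split with py-lmdb, assuming pickle as the serializer and plain dicts standing in for the Data objects. The function name and the max_dbs and map_size values are illustrative only; this is not the current LMDBWriter or LMDBDataset API. One caveat: py-lmdb stores named sub-DB descriptors as keys in the default database, so iterating the default DB would also yield the rep names alongside the event_no keys.

```python
import pickle


def write_and_read_split(path: str, event_no: int, raw: dict, rep: dict) -> dict:
    """Store raw tables and one rep in separate sub-DBs, then read only the rep."""
    import lmdb  # third-party package (pip install lmdb)

    # max_dbs must cover every named sub-DB that will be opened.
    env = lmdb.open(path, max_dbs=8, map_size=2**26)
    try:
        rep_db = env.open_db(b"GraphDefinition")
        key = str(event_no).encode("utf-8")

        with env.begin(write=True) as txn:
            txn.put(key, pickle.dumps(raw))             # default DB: raw tables
            txn.put(key, pickle.dumps(rep), db=rep_db)  # named sub-DB: the rep

        # Read path when only the rep is requested: the raw tables are never touched.
        with env.begin(db=rep_db) as txn:
            return pickle.loads(txn.get(key))
    finally:
        env.close()
```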

Wins

  • Adding a representation becomes a pure insert into the new rep's sub-DB. No reads of existing data, no rewriting of raw tables.
  • Reads only deserialize what was asked for. LMDBDataset(pre_computed_representation="GraphDefinition") would do a single B+tree lookup in the GraphDefinition sub-DB and skip the raw tables entirely.
  • Removing a representation is cheap — drop the sub-DB or delete its keys.
  • Smaller values in the hot path are friendlier to LMDB's page cache.
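
The first and third points could look roughly like this with py-lmdb. This is a sketch under assumptions: `env` is an lmdb.Environment already opened with a large enough max_dbs and map_size, pickle is the serializer, and `add_rep` / `remove_rep` are hypothetical helpers rather than existing functions.

```python
import pickle


def add_rep(env, rep_name: str, reps_by_event: dict) -> None:
    """Insert a new rep without reading or rewriting any existing value."""
    rep_db = env.open_db(rep_name.encode("utf-8"))
    with env.begin(write=True, db=rep_db) as txn:
        for event_no, rep in reps_by_event.items():
            txn.put(str(event_no).encode("utf-8"), pickle.dumps(rep))


def remove_rep(env, rep_name: str) -> None:
    """Drop the rep's sub-DB; raw tables and other reps are left as they are."""
    rep_db = env.open_db(rep_name.encode("utf-8"))
    with env.begin(write=True) as txn:
        txn.drop(rep_db, delete=True)
```

Dropping a sub-DB returns its pages to LMDB's free list for reuse; it does not by itself shrink the data file.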

Costs

  • Two B+tree lookups per event if a caller wants raw tables AND a rep. Still O(log N) and amortized cheap, but real.
  • Breaking change for existing on-disk LMDBs. A migration helper would be needed (or a one-time "flatten" pass).
  • More moving parts in LMDBWriter, LMDBDataset, merge_files, and the metadata schema (need to know which sub-DB names to open).
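
The core of a migration helper would be splitting each existing value into its raw tables and its reps. A minimal sketch, assuming the value has already been deserialized into the dict shape described under Context; `split_event_value` is a hypothetical name and the extractor keys in the usage lines are placeholders.

```python
def split_event_value(value: dict) -> tuple[dict, dict]:
    """Split a monolithic event value into (raw_tables, reps_by_name)."""
    raw = {k: v for k, v in value.items() if k != "data_representations"}
    reps = dict(value.get("data_representations", {}))
    return raw, reps


old_value = {
    "pulses": {"charge": [1.0, 2.0]},
    "truth": {"energy": [10.0]},
    "data_representations": {"GraphDefinition": {"x": [[1.0], [2.0]]}},
}
raw, reps = split_event_value(old_value)
print(sorted(raw), sorted(reps))  # ['pulses', 'truth'] ['GraphDefinition']
```

Each entry in `reps` would then go to the sub-DB of the same name, and `raw` to the default DB.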

Other tradeoffs worth noting

  • Lookup time is not a concern: LMDB is a B+tree, lookup is O(log N) — essentially constant for any realistic dataset size. The cost discussion above is about value size and (de)serialization, not key lookup.
  • Parallelizing writers is bounded by LMDB's single-writer-per-env lock. A compute pool + serial writer helps the current monolithic layout; a sub-DB layout would help more, because workers would only ship the new Data rather than a full re-serialized event value.
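
The compute pool + serial writer pattern can be sketched as follows. Everything here is illustrative: `build_rep` is a hypothetical worker that stands in for the real representation code, and `put` is any callable that stores one key/value pair (for example txn.put inside a single LMDB write transaction). Threads keep the sketch self-contained; for CPU-bound rep building a ProcessPoolExecutor, which offers the same map interface, would be the likelier choice.

```python
from concurrent.futures import ThreadPoolExecutor


def build_rep(event_no: int) -> tuple[bytes, bytes]:
    """Hypothetical worker: compute one rep, return (key, serialized value)."""
    payload = repr({"event_no": event_no, "x": [event_no * 0.5]}).encode("utf-8")
    return str(event_no).encode("utf-8"), payload


def add_rep_with_pool(event_nos, put, max_workers: int = 4) -> int:
    """Fan the compute out to a pool; call `put` from this thread only."""
    written = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for key, payload in pool.map(build_rep, event_nos):
            put(key, payload)  # the single writer
            written += 1
    return written


store = {}  # in-memory stand-in for the rep's sub-DB
print(add_rep_with_pool(range(5), store.__setitem__))  # prints 5
```

With sub-DBs the payload shipped back to the writer is only the new rep; with the monolithic layout it would be the full re-serialized event value.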

Decision needed

Whether the sub-DB refactor is worth the breaking change. Numbers are needed before deciding — see the linked profiling task.

Tasks

  • Profile the current layout to determine whether the costs above are actually felt in practice, before committing to any refactor:
    • Per-event deserialization time (full value vs. just the rep)
    • DataLoader throughput at typical batch sizes / num_workers settings
    • End-to-end wall time of an "add one rep to N events" operation
    • On-disk size before/after adding a rep
      If profiling shows the costs are negligible at realistic dataset sizes, the sub-DB refactor isn't worth doing.
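
A starting point for the first measurement (full value vs. just the rep) might look like the sketch below. It assumes pickle as the serializer and uses a synthetic, list-based event, so the absolute numbers will not match real values containing torch tensors; the real profile should read blobs from an existing LMDB. `time_loads` is a hypothetical helper.

```python
import pickle
import time


def time_loads(blob: bytes, repeats: int = 1000) -> float:
    """Mean seconds per pickle.loads call on `blob`."""
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(blob)
    return (time.perf_counter() - start) / repeats


# Synthetic stand-in event: list-based raw tables and a list-based rep.
raw = {"pulses": {"charge": [0.1] * 200, "time": [1.0] * 200}}
rep = {"x": [[0.1, 1.0] for _ in range(200)]}
full_blob = pickle.dumps({**raw, "data_representations": {"GraphDefinition": rep}})
rep_blob = pickle.dumps(rep)

print(f"full value: {len(full_blob)} B, {time_loads(full_blob) * 1e6:.1f} us/event")
print(f"rep only:   {len(rep_blob)} B, {time_loads(rep_blob) * 1e6:.1f} us/event")
```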
