LMDB storage layout: monolithic event values vs. per-rep sub-databases #893

## Context

`LMDBWriter` and `LMDBDataset` currently store every event as a single LMDB key/value pair, with all extractors and all precomputed `DataRepresentation`s bundled into one serialized blob:

- **Key**: `str(event_no).encode("utf-8")`
- **Value**: serialized dict
  ```
  {
      "<pulsemap_extractor_name>": {col: [values...], ...},
      "<truth_extractor_name>":    {col: [value], ...},
      ...other extractors...
      "data_representations": {
          "GraphDefinition": <torch_geometric.Data>,
          ...one entry per stored rep...
      },
  }
  ```

This issue opens a discussion on whether that is the right long-term layout.

## Where the monolithic layout hurts

1. **Reading when you only want one rep.** `LMDBDataset._update_cache` deserializes the entire value even when `pre_computed_representation="GraphDefinition"`. The raw tables (~3.5 KB) are decoded for nothing.
2. **Adding or removing a representation after the fact** requires deserializing every event value, mutating its `data_representations` dict, and reserializing the whole value, even though only the new rep is actually new on disk. For ~10M events at ~500 µs per round trip, that is about 5,000 s (roughly 1.4 hours) single-threaded.
3. **Pickle of small torch tensors is the dominant cost.** Each tensor carries its own torch metadata overhead, so a `Data` with ~10 small tensors is much slower to (de)serialize than the equivalent number of plain bytes.
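The estimate in point 2 is plain arithmetic, sketched below so it can be rerun with measured numbers. The function name is hypothetical, and the ~500 µs round-trip figure is the rough value quoted above, not a measurement.

```python
def estimate_rewrite_seconds(n_events: int, round_trip_us: float) -> float:
    """Single-threaded wall time, in seconds, to read-modify-write every event."""
    return n_events * round_trip_us / 1_000_000


# Figures quoted in the issue: ~10M events, ~500 us per round trip (assumed).
seconds = estimate_rewrite_seconds(10_000_000, 500.0)
print(f"{seconds:.0f} s = {seconds / 3600:.2f} h")  # prints "5000 s = 1.39 h"
```

The result scales linearly in both inputs, so a measured round-trip time from the profiling task can be substituted directly.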

## What a sub-database layout would look like

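LMDB natively supports multiple named sub-databases per environment. In py-lmdb they are opened with `env.open_db(key=b"...")`, provided the environment was opened with `max_dbs` at least as large as the number of named sub-databases. The natural split is: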

- `default` sub-DB: `{event_no → {extractor_name: dict-of-lists, ...}}` (raw tables, no `data_representations` key)
- One sub-DB per representation field name (`KNNGraph`, `GraphDefinition`, ...): `{event_no → Data}`
- Metadata keys (`__meta_serialization__`, `__meta_data_representations__`) stay where they are.

### Wins

- **Adding a representation becomes a pure insert** into the new rep's sub-DB. No reads of existing data, no rewriting of raw tables.
- **Reads only deserialize what was asked for.** `LMDBDataset(pre_computed_representation="GraphDefinition")` would do a single B+tree lookup in the `GraphDefinition` sub-DB and skip the raw tables entirely.
- **Removing a representation is cheap** — drop the sub-DB or delete its keys.
- **Smaller values in the hot path** are friendlier to the OS page cache that LMDB's memory map relies on.

### Costs

- **Two B+tree lookups per event** if a caller wants raw tables AND a rep. Still O(log N) and amortized cheap, but real.
- **Breaking change for existing on-disk LMDBs.** A migration helper would be needed (or a one-time "flatten" pass).
- **More moving parts** in `LMDBWriter`, `LMDBDataset`, `merge_files`, and the metadata schema (need to know which sub-DB names to open).
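The per-event core of a migration helper might be as small as the sketch below. It assumes values are pickled dicts shaped as in the Context section; the real serializer is whatever `__meta_serialization__` records, and `split_event_value` is a hypothetical name.

```python
import pickle
from typing import Dict, Tuple

REPS_KEY = "data_representations"


def split_event_value(blob: bytes) -> Tuple[bytes, Dict[str, bytes]]:
    """Split one monolithic value into a raw-table blob and per-rep blobs."""
    event = pickle.loads(blob)
    # Events without stored representations yield an empty rep mapping.
    reps = event.pop(REPS_KEY, {})
    raw_blob = pickle.dumps(event)
    rep_blobs = {name: pickle.dumps(rep) for name, rep in reps.items()}
    return raw_blob, rep_blobs


# Toy event: plain dicts stand in for extractor tables and the Data object.
monolithic = pickle.dumps(
    {
        "truth": {"energy": [1.5]},
        REPS_KEY: {"GraphDefinition": {"x": [0.0]}},
    }
)
raw_blob, rep_blobs = split_event_value(monolithic)
```

A full migration would call this once per key and write `raw_blob` to the default sub-DB and each entry of `rep_blobs` to the sub-DB named after its key.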

## Other tradeoffs worth noting

- **Lookup time is not a concern**: LMDB is a B+tree, lookup is O(log N) — essentially constant for any realistic dataset size. The cost discussion above is about value size and (de)serialization, not key lookup.
- **Parallelizing writers** is bounded by LMDB's single-writer-per-env lock. A compute pool + serial writer helps the current monolithic layout; a sub-DB layout would help more, because workers would only ship the new `Data` rather than a full re-serialized event value.
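The compute pool plus serial writer pattern can be sketched as follows. Everything here is a stand-in: a dict plays the role of the LMDB write transaction, `build_rep` is a hypothetical placeholder for building a `DataRepresentation`, and a thread pool is used only to keep the example self-contained (CPU-bound rep construction would more plausibly use a process pool).

```python
import pickle
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def build_rep(item: Tuple[int, dict]) -> Tuple[bytes, bytes]:
    """Worker: compute one rep and ship back only its key and serialized bytes."""
    event_no, raw = item
    rep = {"n_pulses": len(raw["pulses"]["charge"])}  # placeholder computation
    return str(event_no).encode("utf-8"), pickle.dumps(rep)


def add_rep(raw_events: Dict[int, dict], sink: Dict[bytes, bytes], max_workers: int = 4) -> None:
    """Compute reps in parallel; perform every write from the calling thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for key, blob in pool.map(build_rep, raw_events.items()):
            sink[key] = blob  # single writer, matching LMDB's one-writer lock


raw_events = {
    1: {"pulses": {"charge": [0.5, 1.2]}},
    2: {"pulses": {"charge": [0.3]}},
}
sink: Dict[bytes, bytes] = {}
add_rep(raw_events, sink)
```

In the sub-DB layout the blob returned by a worker is the complete value to store. In the monolithic layout the writer would still have to merge it into the existing event value and reserialize the whole thing.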

## Decision needed

The open question is whether the sub-DB refactor is worth the breaking change. Numbers are needed before deciding; see the linked profiling task.

## Tasks

- [ ] **Profile the current layout** to determine whether the costs above are actually felt in practice, before committing to any refactor:
  - Per-event deserialization time (full value vs. just the rep)
  - DataLoader throughput at typical batch sizes / `num_workers` settings
  - End-to-end wall time of an "add one rep to N events" operation
  - On-disk size before/after adding a rep
  If profiling shows the costs are negligible at realistic dataset sizes, the sub-DB refactor isn't worth doing.
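A starting point for the first measurement could look like the sketch below. It times `pickle.loads` on a synthetic event, with the full value compared against the representation alone. The event shape, the sizes, and the use of pickle are assumptions; real numbers should come from actual database values and the configured serializer.

```python
import pickle
import time


def time_loads(blob: bytes, repeats: int = 1000) -> float:
    """Mean seconds per pickle.loads call over `repeats` iterations."""
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(blob)
    return (time.perf_counter() - start) / repeats


# Synthetic event: plain lists stand in for extractor tables and the Data object.
rep = {"x": [[float(i), float(i + 1)] for i in range(50)]}
event = {
    "pulses": {"charge": [0.1] * 200, "time": [1.0] * 200},
    "truth": {"energy": [1.5]},
    "data_representations": {"GraphDefinition": rep},
}
full_blob = pickle.dumps(event)
rep_blob = pickle.dumps(rep)

print(f"full value: {len(full_blob)} B, {time_loads(full_blob) * 1e6:.1f} us/load")
print(f"rep only:   {len(rep_blob)} B, {time_loads(rep_blob) * 1e6:.1f} us/load")
```

The same helper can be pointed at blobs read from an existing LMDB to compare against the ~500 µs round-trip figure used earlier.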
