Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
583 changes: 583 additions & 0 deletions docs/superpowers/plans/2026-05-30-dense2sparse-with-length.md

Large diffs are not rendered by default.

215 changes: 215 additions & 0 deletions docs/superpowers/specs/2026-05-29-dataloader-bench-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,215 @@
# DataLoader Bench — Design

**Status:** draft, awaiting user review
**Date:** 2026-05-29

## Motivation

The prefetching dataloader design (`2026-05-28-prefetching-dataloader-design.md`)
introduces two new `to_dataloader` modes — `"buffered"` and `"double_buffered"` —
alongside the existing `mode=None` baseline. We need a benchmark that:

1. **Compares the three modes head-to-head** under realistic-but-bounded
conditions, so we can claim the new modes are wins and identify *where*
each wins.
2. **Maps how throughput scales** with the four user-facing knobs (threads,
region length, batch size, buffer bytes) per output mode, providing the
data needed to inform a future auto-tuning heuristic.

The bench targets variant-data-only workloads (no BigWig/Table tracks). Total
wall time budget: **20 minutes** on a developer workstation.

## Goals & non-goals

In scope:

- Output modes: `haplotypes`, `annotated`, `variants` (drops `reference`)
- Modes: `None`, `"buffered"`, `"double_buffered"`
- Axes: threads, region length, batch size, buffer bytes
- 1KG chr21+chr22 sparse-var (.svar) dataset
- CSV results + single matplotlib summary plot

Out of scope:

- Tracks (BigWig, Table) — variant-only
- Spliced datasets
- VCF and PGEN backends (svar only, the fastest path)
- Multi-GPU / DDP
- Zero-copy (`copy=False`) variants — keep default `copy=True`
- Real GPU step / `.to(device)` — measure loader throughput only
- `pin_memory=True` — would mask `copy` semantics

## Directory layout

```
experiments/dataloader/
├── _common.py # shared configs, dataset prep, replication helper
├── bench.py # main entrypoint: configs → measurements → CSV
├── plot_results.py # CSV → results_plot.png
├── results.csv # output (committed only on intentional refresh)
└── results_plot.png # output
```

Mirrors the layout of `experiments/bench_table_overlap/`.

## Data

**Source:** `tests/data/1kg/filtered.svar` (5 samples, chr21+chr22, ~287k
variants, ploidy 2). The committed `phased_1kg.svar.gvl` fixture is *not*
reused directly because its regions are baked at write time and the bench
varies region length.

**Region BED generation:** for each unique `region_length` value the bench
generates a BED by taking the 100 regions in `tests/data/1kg/regions.bed`,
recentering them on their midpoint, and resizing to the target length. The
resulting BED is written to a temp directory, then `gvl.write` produces a
fresh `.gvl` dataset at `tmp/dataset_rL{length}.gvl`.

**Dataset writes are amortized once at bench startup**, keyed by region
length. With 5 unique lengths (`{1_000, 2_500, 5_000, 10_000, 25_000}`),
this is 5 `gvl.write` calls; expected total ≈ tens of seconds.

**Per-cell epoch sizing:** the 1KG fixture has 5 samples × 100 regions =
500 (region, sample) instances per epoch. To amortize buffer fills the
bench iterates **≥3 epochs or ≥1.5 s wall time, whichever first**, after
one discarded warmup epoch.

## Parameter axes

| Axis | Factorial values | Fan values (added around midpoint) |
|---|---|---|
| `mode` | `None`, `"buffered"`, `"double_buffered"` | — |
| `with_seqs` | `"haplotypes"`, `"annotated"`, `"variants"` | — |
| `threads` | `{1, 8}` | `{2, 4, 16}` |
| `region_length` | `{1_000, 10_000}` | `{2_500, 5_000, 25_000}` |
| `batch_size` | `{16, 128}` | `{32, 64, 256}` |
| `buffer_bytes` | `{256 MiB, 2 GiB}` | `{512 MiB, 1 GiB, 4 GiB}` |

**Midpoints used during 1-axis fans:** `threads=4`, `region_length=5_000`,
`batch_size=64`, `buffer_bytes=1 GiB`. When fanning along axis A, the other
three axes are held at their midpoint.

`mode=None` does not accept `buffer_bytes`; baseline cells use `{1, 8}`
threads × `{16, 128}` batch × `{1_000, 10_000}` region only.

## Cell count

- New modes: 2 modes × 3 outputs × (2⁴ factorial + 4 axes × 3 fan values)
= 2 × 3 × 28 = **168 cells**
- Baseline: 3 outputs × 2 threads × 2 batch × 2 region = **24 cells**
- **Total: 192 cells.**

At ~4 s/cell (warmup + ≥1.5 s measurement + loader teardown) this is
~13 min of measurement on top of ~1–3 min of dataset writes, fitting the
20 min budget.

## Measurement protocol

For each cell:

1. Construct `Dataset.open(rL_dataset_path).with_seqs(output_mode)`,
apply `.with_settings(deterministic=True)` when needed by the spec
(`haplotypes` / `annotated`).
2. Build the loader: `dataset.to_dataloader(batch_size=..., mode=...,
buffer_bytes=..., shuffle=False, num_workers=0)`.
3. **Warmup:** iterate one full epoch, discard timing.
4. **Measure:** until `elapsed ≥ 1.5 s` *and* `epochs ≥ 3`:
- `t0 = time.perf_counter()`
- iterate one epoch, summing `instances` and `bytes`
- `bytes` = `sum(_output_bytes_per_instance[r, s])` over the epoch's
(r, s) pairs (cheap; the table is computed once per cell)
- record per-epoch wall time
5. Record a single row aggregating across measured epochs.
6. Tear down loader; `gc.collect()` to clip RSS between cells.

**Hard cap:** 10 s per cell to prevent a pathological config from blowing
the budget; if hit, mark `timed_out=True` and continue.

### CSV columns

```
mode, with_seqs, threads, region_length, batch_size, buffer_bytes,
n_epochs, instances, bytes, wall_s, instances_per_s, MiB_per_s,
peak_rss_MiB, timed_out, git_sha, host, started_at
```

`peak_rss_MiB`: `resource.getrusage(RUSAGE_SELF).ru_maxrss` delta over the
measurement window (monotonic on Linux; we snapshot before and after).
Used mostly to spot `buffer_bytes` overshoot.

## Thread isolation via subprocess re-exec

`RAYON_NUM_THREADS` and the BLAS thread pools are pinned at process start;
they cannot be safely mutated after the rayon pool initializes. The bench
therefore dispatches one child process per `threads` value:

```python
# bench.py
if "--child" in sys.argv:
run_cells_for_threads(int(os.environ["BENCH_THREADS"]))
else:
write_datasets_per_region_length()
init_csv()
for n_threads in [1, 2, 4, 8, 16]:
env = {
**os.environ,
"BENCH_THREADS": str(n_threads),
"RAYON_NUM_THREADS": str(n_threads),
"POLARS_MAX_THREADS": str(n_threads),
"OMP_NUM_THREADS": str(n_threads),
"MKL_NUM_THREADS": str(n_threads),
"OPENBLAS_NUM_THREADS": str(n_threads),
}
subprocess.run(
[sys.executable, __file__, "--child"], env=env, check=True
)
```

The parent writes the CSV header once and child runs append. Children only
execute cells whose `threads` matches their pinned value.

## Plotting

`plot_results.py` reads `results.csv` and emits a single `results_plot.png`
with a 3×4 grid of small multiples:

- **Rows:** output mode (`haplotypes`, `annotated`, `variants`).
- **Cols:** axis (`threads`, `region_length`, `batch_size`, `buffer_bytes`).
- **X:** the column's axis values; **Y:** `instances_per_s` (linear).
- **Series:** 3 lines per panel, one per `mode` (None / buffered /
double_buffered).
- Each panel pulls fan-axis rows where the other three axes sit at their
midpoint. The `mode=None` series is absent from the `buffer_bytes`
column.

A second optional plot (commented out by default) does `MiB_per_s` instead
of `instances_per_s` to highlight that larger output modes saturate
bandwidth at smaller batch counts.

## File-level change summary

New:

- `experiments/dataloader/_common.py`
- `experiments/dataloader/bench.py`
- `experiments/dataloader/plot_results.py`

Generated (gitignored where appropriate):

- `experiments/dataloader/results.csv`
- `experiments/dataloader/results_plot.png`
- temp `.gvl` datasets under a bench-managed `tmp/` directory (cleaned on
exit; `tmp/` is added to `.gitignore` if not already covered)

No changes to `python/genvarloader/`. The bench is a pure consumer of the
public API delivered by the prefetching dataloader plan.

## Open questions deferred to implementation

- Exact memory cost of the dataset-write step at `region_length=25_000`
(the largest variant). If it pushes setup past ~3 min, reduce the
region_length fan to 4 values.
- Whether `_output_bytes_per_instance` is cheap enough to call once per
loader, or whether to cache it in `_common.py` keyed by `(dataset_path,
with_seqs)`. Default: compute once per cell; revisit if profiling shows
it dominates.
153 changes: 153 additions & 0 deletions docs/superpowers/specs/2026-05-30-dense2sparse-with-length-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
# Use `_dense2sparse_with_length` for VCF/PGEN writes

**Date:** 2026-05-30
**Branch:** `worktree-feat-dense2sparse-with-length`
**Status:** Approved design

## Problem

genoray 2.7.0 added `genoray._svar._dense2sparse_with_length`, which converts a
dense, over-extended region window into **per-haplotype-minimal** sparse output —
identical to what `SparseVar.read_ranges_with_length` produces. GVL's VCF/PGEN
write path does not yet use it, and as a result has two issues:

1. **Over-extension.** The current VCF/PGEN path calls plain `dense2sparse` on the
over-extended window, so *every* haplotype keeps all ALT variants in the
extension tail — including haplotypes with few/no deletions, which then extend
past the region length. The SVAR path already trims per-haplotype to the
minimal set via `read_ranges_with_length`. VCF/PGEN and SVAR therefore produce
different sparse output for the same input.

2. **`extend_to_length=False` is a silent no-op for VCF/PGEN.** `_write_from_vcf`
and `_write_from_pgen` accept `extend_to_length` but never pass it to their
chunk generators (`_vcf_region_chunks` / `_pgen_region_chunks`), which always
call `_chunk_ranges_with_length` (which always extends). So `False` still
extends today.

## Goal

Make VCF/PGEN writes produce per-haplotype-minimal sparse output identical to
SVAR when `extend_to_length=True`, and make `extend_to_length=False` actually
take effect (no extension). Fold in a `max_ends` consistency fix so all three
input types (VCF, PGEN, SVAR) agree on the per-region `chromEnd`.

## The new genoray function

```python
# genoray._svar
def _dense2sparse_with_length(
genos: NDArray[np.integer], # (samples, ploidy, variants) — full window
var_idxs: NDArray[V_IDX_TYPE], # (variants,) — global, window-aligned
q_start: int, # 0-based, original (unextended) query start
q_end: int, # 0-based exclusive, original query end
v_starts: NDArray[np.int32], # POS - 1, window-aligned with var_idxs
ilens: NDArray[np.int32], # ILEN (ALT - REF length), window-aligned
dosages: NDArray | None = None,
) -> Ragged[V_IDX_TYPE] | tuple[Ragged, Ragged]:
"""Convert a dense with_length window into per-haplotype-minimal sparse output,
identical to SparseVar.read_ranges_with_length for the same query."""
```

**Key constraint:** the function does a per-haplotype length walk over the variant
axis, so it requires the **entire region window** (all memory-chunks concatenated)
in one dense `(samples, ploidy, variants)` array. It cannot be applied
chunk-by-chunk. GVL does not use the `dosages` parameter (genotypes only).

The VCF/PGEN variant index (genoray `_index`) exposes the `POS` (i32) and `ILEN`
(list[i32], first element for bi-allelic variants) columns needed to build
`v_starts` and `ilens`.

## Architecture (Approach 2: shared window helper)

`_vcf_region_chunks` and `_pgen_region_chunks` are ~95% duplicates. Refactor both
to: (a) accept `extend_to_length`, (b) assemble each region's full dense window,
and (c) delegate conversion to one shared helper. `_write_phased_chunked` keeps
its current contract — it consumes `(list[Ragged], region_end, desc)` per region.

### New shared helper

```python
def _window_to_sparse(
genos: NDArray[np.integer], # (samples, ploidy, variants) — full region window
var_idxs: NDArray[V_IDX_TYPE], # (variants,) — global, window-aligned
q_start: int,
q_end: int,
v_starts: NDArray[np.int32], # POS - 1, window-aligned
ilens: NDArray[np.int32], # ILEN first-elem, window-aligned
extend_to_length: bool,
) -> Ragged:
if extend_to_length:
return _dense2sparse_with_length(
genos, var_idxs, q_start, q_end, v_starts, ilens
)
return dense2sparse(genos, var_idxs)
```

`v_starts` / `ilens` are built by indexing the loaded variant index
(`POS` / `ILEN`) with the window's global `var_idxs`, cast to contiguous int32.

### Data flow per generator

**`extend_to_length=True`:**
1. Read the region's chunks via `_chunk_ranges_with_length` (existing API).
2. Concatenate dense genos along the variant axis into one full window.
3. Build the full-window global `var_idxs`:
- VCF: unextended idxs (from `_var_idxs`) + `arange` extension tail derived
from the last chunk's `n_ext`.
- PGEN: concatenate the per-chunk `chunk_idxs` genoray already returns.
4. Call `_window_to_sparse(...)` once → yield `([single_ragged], region_end, desc)`.

**`extend_to_length=False`:**
1. Read via the non-length API: `VCF.chunk` (per region) / `PGEN.chunk_ranges`.
These return exactly the in-range variants, aligned with `_var_idxs`.
2. `dense2sparse` per chunk (current structure minus the extension logic).
3. Yield `([chunk_ragged, ...], region_end, desc)`.

Keeping the yielded type as `list[Ragged]` for both paths leaves
`_write_phased_chunked`'s aggregation (concatenate + flatten + lengths) unchanged.

### max_ends fix (folded in)

Each generator computes, once from its index:

```python
v_ends = POS - ILEN_first.clip(upper_bound=0) # = POS + deletion length
```

For each region, `region_end = v_ends[max retained global v_idx]`, falling back to
the input `chromEnd` when the region has no retained variants. This matches
`_write_from_svar` exactly and is applied to **both** the True and False paths,
replacing the current `region_end = chunk_end` logic. `_write_phased_chunked`
continues to write `region_end` into the output bed's `chromEnd` column.

## Edge cases

- **Empty region** (no variants for any sample): existing "no variant" warning in
`_write_phased_chunked` still fires; `region_end` falls back to input `chromEnd`.
- **Missing / unnormalized contig:** genoray yields empty chunks (with a warning),
handled by the empty-region fallback.
- **Memory:** full-window assembly holds one region's dense window in memory at
once; `max_mem` still bounds the per-chunk *read*. Acceptable — genoray's own
`read_ranges_with_length` materializes similarly. A pathologically large single
region is the only risk; noted, not guarded.

## Testing

- **Dependency bump** (applied): `pixi.toml` `genoray = "==2.7.0"`,
`pyproject.toml` `genoray>=2.7.0,<3`.
- **Regenerate test ground-truth** (`pixi run -e dev gen`): VCF/PGEN stored
genotypes change (now per-haplotype-minimal), so committed ground-truth must be
regenerated.
- **Parity test (acceptance criterion):** VCF and PGEN datasets must produce
**identical** sparse output to the SVAR dataset for the same regions/samples.
This is the strongest correctness check and the explicit acceptance criterion.
- Existing reconstruction/dataset tests pass against regenerated ground-truth.

## Out of scope

- SVAR writing (`_write_from_svar`) — already correct, untouched.
- Public API signatures — `gvl.write`'s `extend_to_length` keeps its meaning; it
simply now actually takes effect for VCF/PGEN when `False`.
- Dosage support in the new conversion path (genotypes only).
- Full unification of the SVAR write path with VCF/PGEN (different storage
mechanism; out of scope).
Loading
Loading