
libsql-server: cap snapshot files at SQLD_MAX_SNAPSHOT_SIZE / _COUNT #26

Draft
RaVbaker wants to merge 1 commit into our-fork from ae-task-259-implement-the-chunk-of-snapshots-with-a


@RaVbaker RaVbaker commented May 6, 2026

Summary

Adds two new env vars to bound the size and number of .snap files produced by the libsql snapshot merger:

  • SQLD_MAX_SNAPSHOT_SIZE (MB, mirrors SQLD_MAX_LOG_SIZE) — caps any single merger output file.
  • SQLD_MAX_SNAPSHOT_COUNT (count, overrides legacy MAX_SNAPSHOT_NUMBER = 32) — count-based merge trigger.

When both are unset, behavior is identical to upstream: legacy 2 * db_page_count amplification trigger, >32 count trigger, single combined merge output.

Why

Catalog-init for large shops downloads the whole .snap in a single Snapshot RPC because the compactor merges accumulated .snap files into one giant blob whenever their cumulative size exceeds 2 * db_page_count (SNAPHOT_SPACE_AMPLIFICATION_FACTOR). Combined with a 200 MB max_log_size, the result is a single 200–500 MB streaming RPC that routinely runs over the 100 s SFE cap on degraded mobile links.

The libsql client's replicator state machine (NeedFrames → NeedSnapshot) already loops over multiple Snapshot RPCs — it requests one .snap per call. So if the server simply keeps .snap files small, fresh clients naturally download them in chunks, and an interruption only costs the in-flight chunk instead of the whole catalog. No client/protocol changes needed.

Behavior when SQLD_MAX_SNAPSHOT_SIZE is set

  • should_compact triggers a merge when the cumulative size of accumulated snapshots reaches the cap, instead of using the legacy 2 * db_page_count amplification rule. Crucially, it also requires that the merge would actually combine 2+ files — otherwise it returns false to avoid spinning a hot no-op merge loop on every snapshot registration once the working set crosses the cap.
  • merge_snapshots groups input snapshots greedily into contiguous batches whose summed frame count fits under the cap, producing one output file per batch instead of a single combined blob.
  • A pre-existing snapshot whose own size already exceeds the cap is left in place as a singleton batch, so the merger never produces a file larger than the configured cap.
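
The grouping and progress rules above can be sketched as follows. This is an illustrative sketch, not the fork's actual code: `SnapshotEntry`, `group_snapshots`, and the field names are assumptions standing in for the real `group_snapshots_for_merge` / `merge_makes_progress` helpers.

```rust
// Illustrative sketch of the batching rules described above.
#[derive(Debug, Clone)]
struct SnapshotEntry {
    frame_count: u64,
}

/// Greedily split an ordered run of snapshots into contiguous batches whose
/// summed frame count stays at or under `max_frames`. An entry that alone
/// exceeds the cap passes through as a singleton batch.
fn group_snapshots(entries: &[SnapshotEntry], max_frames: u64) -> Vec<Vec<SnapshotEntry>> {
    let mut batches: Vec<Vec<SnapshotEntry>> = Vec::new();
    let mut current: Vec<SnapshotEntry> = Vec::new();
    let mut acc: u64 = 0;
    for e in entries {
        // Close the open batch if adding this entry would cross the cap.
        if !current.is_empty() && acc.saturating_add(e.frame_count) > max_frames {
            batches.push(std::mem::take(&mut current));
            acc = 0;
        }
        acc = acc.saturating_add(e.frame_count);
        current.push(e.clone());
        // An oversized entry becomes its own batch immediately, so no merge
        // output ever exceeds the cap.
        if e.frame_count > max_frames {
            batches.push(std::mem::take(&mut current));
            acc = 0;
        }
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

/// The hot-loop guard: a merge only makes progress if some batch combines
/// two or more files.
fn merge_makes_progress(batches: &[Vec<SnapshotEntry>]) -> bool {
    batches.iter().any(|b| b.len() >= 2)
}

fn main() {
    let entries: Vec<SnapshotEntry> =
        [10u64, 10, 10, 50, 10].iter().map(|&f| SnapshotEntry { frame_count: f }).collect();
    let batches = group_snapshots(&entries, 25);
    // [10,10] | [10] | [50] (oversized singleton) | [10]
    assert_eq!(batches.len(), 4);
    assert!(merge_makes_progress(&batches));
}
```

Note how the oversized 50-frame entry never pulls neighbors into its batch, which is exactly why a run of all-singleton batches must make `should_compact` return false: merging would rewrite the same files without combining anything.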

Why also SQLD_MAX_SNAPSHOT_COUNT

The hardcoded count trigger (> 32) fires after just 33 small files, ignoring size, and would force a merge that (with the size cap) just rebuilds many small files. Operators pairing a low size cap with a low SQLD_MAX_LOG_SIZE should raise this count in lockstep:

SQLD_MAX_SNAPSHOT_COUNT >= expected_total_snapshot_bytes / SQLD_MAX_LOG_SIZE
SQLD_MAX_SNAPSHOT_COUNT >= expected_total_snapshot_bytes / SQLD_MAX_SNAPSHOT_SIZE

The second bound matters because the count trigger is evaluated post-merge too: a successful chunked merge produces about total / cap files, and the count must accommodate that.
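
A worked example of the two bounds, using assumed numbers (a ~5.12 GB worst-case shop; the env values match the suggested deploy config in this PR). Both bounds are just ceiling divisions of the expected total by a per-file cap:

```rust
// Illustrative check of the rule-of-thumb bounds above; the 5.12 GB
// worst-case total is an assumption, not measured data.
fn min_snapshot_count(expected_total_bytes: u64, per_file_cap_bytes: u64) -> u64 {
    // Ceiling division: the number of files needed to hold the total.
    expected_total_bytes.div_ceil(per_file_cap_bytes)
}

fn main() {
    let total = 5_120 * 1_000_000u64;             // ~5.12 GB worst-case shop
    let max_log_size = 20 * 1_000_000u64;         // SQLD_MAX_LOG_SIZE = 20 MB
    let max_snapshot_size = 1_280 * 1_000_000u64; // SQLD_MAX_SNAPSHOT_SIZE = 1280 MB
    // Pre-merge bound: the compactor can accumulate total / log_size files.
    assert_eq!(min_snapshot_count(total, max_log_size), 256);
    // Post-merge bound: a chunked merge still leaves about total / cap files.
    assert_eq!(min_snapshot_count(total, max_snapshot_size), 4);
    // So SQLD_MAX_SNAPSHOT_COUNT = 256 satisfies both bounds.
}
```

Under these assumptions the pre-merge bound dominates, which is the usual case whenever SQLD_MAX_LOG_SIZE is much smaller than SQLD_MAX_SNAPSHOT_SIZE.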

Operational deployment

Suggested env on a fresh deploy:

SQLD_MAX_LOG_SIZE: "20"          # 20 MB per .snap from compactor
SQLD_MAX_SNAPSHOT_SIZE: "1280"   # 1.28 GB cap on any single merged file
SQLD_MAX_SNAPSHOT_COUNT: "256"   # bound directory size; tune per worst-case shop

A 500 MB catalog then streams as ~25 small Snapshot RPCs instead of one giant one. Resume cost on interruption: ~20 MB instead of 500 MB.

Implementation notes

Localised to libsql-server/src/replication/snapshot.rs — no DbConfig / PrimaryConfig / CLI plumbing — to keep the fork patch small. Configuration policy is encapsulated in a new SnapshotPolicy struct that:

  • holds max_frames: Option<u64> and max_count: usize;
  • reads env vars exactly once via OnceLock in SnapshotPolicy::from_env, called from SnapshotMerger::new;
  • carries should_compact as a method, so unit tests can construct policies with arbitrary values without touching process env.

Full plumbing through DbConfig / CLI flags is deferred to a follow-up; the parameter-injection middle ground here restores testability without enlarging the fork patch.
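
The shape of that encapsulation can be sketched like this. Field and function names follow the PR text; the bodies are illustrative assumptions, not the fork's code (in particular, the real helper converts MB to a frame count, a step elided here):

```rust
// Sketch of the SnapshotPolicy shape described above.
use std::sync::OnceLock;

struct SnapshotPolicy {
    /// Per-file cap; `None` means legacy behavior. (The real field holds
    /// frames; the MB→frames conversion is elided in this sketch.)
    max_frames: Option<u64>,
    /// Count-based merge trigger (legacy default: 32).
    max_count: usize,
}

static POLICY: OnceLock<SnapshotPolicy> = OnceLock::new();

impl SnapshotPolicy {
    /// Read the env vars exactly once for the process lifetime.
    fn from_env() -> &'static SnapshotPolicy {
        POLICY.get_or_init(|| SnapshotPolicy {
            max_frames: std::env::var("SQLD_MAX_SNAPSHOT_SIZE")
                .ok()
                .and_then(|s| s.trim().parse::<u64>().ok())
                .filter(|&mb| mb > 0)
                .and_then(|mb| mb.checked_mul(1_000_000)),
            max_count: std::env::var("SQLD_MAX_SNAPSHOT_COUNT")
                .ok()
                .and_then(|s| s.trim().parse().ok())
                .unwrap_or(32),
        })
    }

    /// Simplified size trigger taking `&self`; the real `should_compact`
    /// also requires that the proposed grouping would combine 2+ files.
    fn size_trigger(&self, cumulative_frames: u64) -> bool {
        self.max_frames.map_or(false, |cap| cumulative_frames >= cap)
    }
}

fn main() {
    // With neither env var set, the policy stays legacy.
    let p = SnapshotPolicy::from_env();
    assert_eq!(p.max_count, 32);
    assert!(p.max_frames.is_none());
    assert!(!p.size_trigger(1_000));
}
```

Because the env read happens only inside `from_env`, unit tests can build a `SnapshotPolicy { max_frames: Some(..), max_count: .. }` literal directly and exercise the methods without touching process env, which is the testability point the PR makes.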

Tests

The pure helpers and SnapshotPolicy::should_compact are now covered by 20 unit tests:

  • group_snapshots_for_merge: legacy mode, empty, greedy packing, boundary acc+count == max, boundary count == max (not flagged oversized), strictly oversized passthrough singleton, runs of consecutive oversized files, "each fits but pairs don't", property check that no batch exceeds cap.
  • merge_makes_progress: all-singletons → false, any multi → true, empty → false.
  • SnapshotPolicy::should_compact: legacy amplification trigger, legacy count trigger, cap size trigger with progress, below-threshold returns false, the hot-loop guard (cap exceeded but every batch is a singleton → returns false), count override behavior.
  • parse_max_snapshot_frames: unset / empty / non-numeric / zero / valid 20 MB / overflow on mb * 1_000_000 (uses checked_mul, falls back to legacy on overflow).
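
A sketch of the pure parser those cases describe. The MB→bytes factor (1_000_000) matches the PR text; the final bytes→frames step of the real helper is elided, so the returned value is shown in bytes here:

```rust
// Illustrative version of the pure parsing helper; not the fork's exact code.
fn parse_max_snapshot_frames(raw: Option<&str>) -> Option<u64> {
    // Unset, empty, and non-numeric input all fall through to None (legacy).
    let mb: u64 = raw?.trim().parse().ok()?;
    if mb == 0 {
        return None; // "0" disables the cap: legacy behavior
    }
    // checked_mul: overflow also falls back to legacy instead of wrapping.
    mb.checked_mul(1_000_000)
}

fn main() {
    assert_eq!(parse_max_snapshot_frames(Some("20")), Some(20_000_000));
    assert_eq!(parse_max_snapshot_frames(Some("abc")), None);
}
```

Every failure mode collapses to `None`, so a malformed env var can never crash startup; it silently restores upstream behavior.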

cargo check -p libsql-server --tests clean. (Local lib-test linking is broken by an unrelated stale libsql-ffi PCRE2 archive — CI will exercise the test binary.)

Review feedback addressed

From the automated review:

  • (crit) hot merge loop when all batches are singletons → should_compact now computes the proposed grouping and returns false unless at least one batch would combine 2+ files (merge_makes_progress).
  • (crit) dangling rustdoc paragraph → max_snapshot_frames doc moved onto its own function.
  • (crit) should_compact and the new branching had zero tests → 6 new tests on SnapshotPolicy::should_compact, plus 3 on merge_makes_progress, plus 6 new boundary/property tests on group_snapshots_for_merge.
  • (warn) DI violation: env-read inside domain logic → SnapshotPolicy struct, env read at construction, methods take &self. Pure parsing extracted to parse_max_snapshot_frames(Option<&str>), which is unit-tested.
  • (warn) for insert(i, _) is O(n·k) and unidiomatic → replaced with snapshots.splice(0..0, ret). debug_assert! on monotonic start_frame_no documents and enforces the ordering contract.
  • (warn) >= vs > asymmetry in group_snapshots_for_merge → both branches use >. Comment explains the choice.
  • (warn) overflow on mb * 1_000_000 → checked_mul, returns None on overflow. saturating_add on the batch accumulator.
  • (warn) ordering contract on merge_snapshots undocumented → explicit doc block on the function plus debug_assert! at the prepend site.
  • (warn) test ordering / use super::* mid-module → moved to top of mod test, qualifications dropped, s helper renamed to entry.

Explicitly not done in this PR (deferred / out of scope, called out in the review):

  • Full plumbing through DbConfig / CLI — explicitly deferred per task. Mitigated by the parameter-injection in SnapshotPolicy.
  • Splitting into a merge_policy.rs submodule — separate refactor PR.
  • README/operator-facing docs — separate doc PR; the rule-of-thumb formulae are at least documented in read_max_snapshot_count_from_env's rustdoc.
  • Failure-mode regression in merge_snapshots (per-batch error leaves orphans on disk for >K batches) — pre-existing class of bug at smaller scope, recovered on next process restart via init_snapshot_info_list. Documented in the function's rustdoc with a "tracked separately" note.

Refs

  • ae-task 259
  • GSD 48518
  • Plan: ~/claude/2026-05-05-libsql-snapshot-chunking-deferred-plan.md

RaVbaker force-pushed the ae-task-259-implement-the-chunk-of-snapshots-with-a branch from d6fbbd9 to fd6d82c on May 6, 2026 12:12.