libsql-server: cap snapshot files at SQLD_MAX_SNAPSHOT_SIZE / _COUNT (#26)
Draft
Catalog-init for large shops downloads the whole `.snap` in a single Snapshot
RPC because the compactor merges accumulated `.snap` files into one giant
blob whenever their cumulative size exceeds 2x the live database's size
(`SNAPHOT_SPACE_AMPLIFICATION_FACTOR`). Combined with a 200 MB `max_log_size`,
the result is a single 200-500 MB streaming RPC that routinely runs over
the 100 s SFE cap on degraded mobile links.
Add a new `SQLD_MAX_SNAPSHOT_SIZE` env var (in MB, mirroring
`SQLD_MAX_LOG_SIZE`). When set:
- `should_compact` triggers a merge when the cumulative size of accumulated
  snapshots reaches the cap, instead of using the 2x amplification rule.
- `merge_snapshots` groups input snapshots greedily into contiguous batches
  whose summed frame count fits under the cap, producing one output file
  per batch instead of a single combined blob.
- A pre-existing snapshot whose own size already exceeds the cap is left
  in place as a singleton batch, so the merger never produces a file
  larger than the configured cap.
When unset, behavior is unchanged: legacy 2x-db-page-count amplification,
single combined merge output. The env var is read once via `OnceLock` to keep
the change local to `snapshot.rs` (no `DbConfig`/`PrimaryConfig` plumbing) and
the patch small for a fork.
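The once-only env read could be sketched like this; the function names, the frame size, and the fallback rules are illustrative assumptions, not the fork's exact code:

```rust
use std::sync::OnceLock;

// Frame size used to convert a MB cap into a frame budget; the exact
// value here (WAL page + frame header) is an assumption for illustration.
const FRAME_SIZE: u64 = 4096 + 24;

/// Parse a cap given in MB. Unset, empty, non-numeric, zero, or
/// overflowing values all fall back to legacy behavior (None).
fn parse_max_snapshot_frames(raw: Option<&str>) -> Option<u64> {
    let mb: u64 = raw?.trim().parse().ok().filter(|&mb| mb > 0)?;
    let bytes = mb.checked_mul(1_000_000)?; // overflow -> legacy
    Some(bytes / FRAME_SIZE)
}

/// Read SQLD_MAX_SNAPSHOT_SIZE once per process, keeping the change
/// local to snapshot.rs instead of plumbing it through config structs.
fn max_snapshot_frames() -> Option<u64> {
    static CACHE: OnceLock<Option<u64>> = OnceLock::new();
    *CACHE.get_or_init(|| {
        parse_max_snapshot_frames(std::env::var("SQLD_MAX_SNAPSHOT_SIZE").ok().as_deref())
    })
}

fn main() {
    assert_eq!(parse_max_snapshot_frames(None), None);
    assert_eq!(parse_max_snapshot_frames(Some("0")), None);
    assert_eq!(parse_max_snapshot_frames(Some("20")), Some(20_000_000 / FRAME_SIZE));
    println!("configured cap: {:?}", max_snapshot_frames());
}
```

Keeping the parse pure (taking `Option<&str>` rather than reading the env itself) is what makes the fallback rules unit-testable without touching process env.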
Pairs with lowering `SQLD_MAX_LOG_SIZE` in the production env so each `.snap`
written by the compactor is small (e.g., 20 MB), making the existing
client-side replicator loop stream a 500 MB catalog as ~25 small Snapshot
RPCs instead of one giant one.
Refs: ae-task 259, GSD 48518
Summary
Adds two new env vars to bound the size and number of `.snap` files produced by the libsql snapshot merger:

- `SQLD_MAX_SNAPSHOT_SIZE` (MB, mirrors `SQLD_MAX_LOG_SIZE`): caps any single merger output file.
- `SQLD_MAX_SNAPSHOT_COUNT` (count, overrides the legacy `MAX_SNAPSHOT_NUMBER = 32`): count-based merge trigger.

When both are unset, behavior is identical to upstream: legacy `2 * db_page_count` amplification trigger, `> 32` count trigger, single combined merge output.

Why
Catalog-init for large shops downloads the whole `.snap` in a single Snapshot RPC because the compactor merges accumulated `.snap` files into one giant blob whenever their cumulative size exceeds `2 * db_page_count` (`SNAPHOT_SPACE_AMPLIFICATION_FACTOR`). Combined with a 200 MB `max_log_size`, the result is a single 200–500 MB streaming RPC that routinely runs over the 100 s SFE cap on degraded mobile links.

The libsql client's replicator state machine (`NeedFrames` ↔ `NeedSnapshot`) already loops over multiple Snapshot RPCs: it requests one `.snap` per call. So if the server simply keeps `.snap` files small, fresh clients naturally download them in chunks, and an interruption only costs the in-flight chunk instead of the whole catalog. No client/protocol changes needed.

Behavior when `SQLD_MAX_SNAPSHOT_SIZE` is set

- `should_compact` triggers a merge when the cumulative size of accumulated snapshots reaches the cap, instead of using the legacy `2 * db_page_count` amplification rule. Crucially, it also requires that the merge would actually combine 2+ files; otherwise it returns false to avoid spinning a hot no-op merge loop on every snapshot registration once the working set crosses the cap.
- `merge_snapshots` groups input snapshots greedily into contiguous batches whose summed frame count fits under the cap, producing one output file per batch instead of a single combined blob.

Why also `SQLD_MAX_SNAPSHOT_COUNT`

The hardcoded count trigger (`> 32`) fires after just 33 small files, ignoring size, and would force a merge that (with the size cap) just rebuilds many small files. Operators pairing a low size cap with a low `SQLD_MAX_LOG_SIZE` should raise this count in lockstep. The second bound matters because the count trigger is evaluated post-merge too: a successful chunked merge produces about `total / cap` files, and the count must accommodate that.

Operational deployment
Suggested env on a fresh deploy:
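A hedged example, using the 20 MB chunk size and 500 MB catalog figures from elsewhere in this PR; the exact values are deployment-specific assumptions, not a universal recommendation:

```shell
# Illustrative values only; tune per deployment.
export SQLD_MAX_LOG_SIZE=20        # MB: each compactor-written .snap stays small
export SQLD_MAX_SNAPSHOT_SIZE=20   # MB: merger never emits a file over the cap
export SQLD_MAX_SNAPSHOT_COUNT=64  # headroom over the ~25 (500/20) post-merge files
```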
A 500 MB catalog then streams as ~25 small Snapshot RPCs instead of one giant one. Resume cost on interruption: ~20 MB instead of 500 MB.
Implementation notes
Localised to `libsql-server/src/replication/snapshot.rs` (no `DbConfig`/`PrimaryConfig`/CLI plumbing) to keep the fork patch small. Configuration policy is encapsulated in a new `SnapshotPolicy` struct that:

- holds `max_frames: Option<u64>` and `max_count: usize`;
- reads the env once via `OnceLock` in `SnapshotPolicy::from_env`, called from `SnapshotMerger::new`;
- exposes `should_compact` as a method, so unit tests can construct policies with arbitrary values without touching process env.

Full plumbing through `DbConfig`/CLI flags is deferred to a follow-up; the parameter-injection middle ground here restores testability without enlarging the fork patch.

Tests
The pure helpers and `SnapshotPolicy::should_compact` are now covered by 20 unit tests:

- `group_snapshots_for_merge`: legacy mode, empty, greedy packing, boundary `acc + count == max`, boundary `count == max` (not flagged oversized), strictly oversized passthrough singleton, runs of consecutive oversized files, "each fits but pairs don't", property check that no batch exceeds the cap.
- `merge_makes_progress`: all-singletons → false, any multi → true, empty → false.
- `SnapshotPolicy::should_compact`: legacy amplification trigger, legacy count trigger, cap size trigger with progress, below-threshold returns false, the hot-loop guard (cap exceeded but every batch is a singleton → returns false), count override behavior.
- `parse_max_snapshot_frames`: unset / empty / non-numeric / zero / valid 20 MB / overflow on `mb * 1_000_000` (uses `checked_mul`, falls back to legacy on overflow).

`cargo check -p libsql-server --tests` is clean. (Local lib-test linking is broken by an unrelated stale `libsql-ffi` PCRE2 archive; CI will exercise the test binary.)

Review feedback addressed
From the automated review:
- `should_compact` now computes the proposed grouping and returns false unless at least one batch would combine 2+ files (`merge_makes_progress`).
- `max_snapshot_frames` doc moved onto its own function.
- `should_compact` and the new branching had zero tests → 6 new tests on `SnapshotPolicy::should_compact`, plus 3 on `merge_makes_progress`, plus 6 new boundary/property tests on `group_snapshots_for_merge`.
- `SnapshotPolicy` struct, env read at construction, methods take `&self`. Pure parsing extracted to `parse_max_snapshot_frames(Option<&str>)`, which is unit-tested.
- `for` + `insert(i, _)` is O(n·k) and unidiomatic → replaced with `snapshots.splice(0..0, ret)`.
- `debug_assert!` on monotonic `start_frame_no` documents and enforces the ordering contract.
- `>=` vs `>` asymmetry in `group_snapshots_for_merge` → both branches use `>`. Comment explains the choice.
- `mb * 1_000_000` → `checked_mul`, returns `None` on overflow. `saturating_add` on the batch accumulator.
- `merge_snapshots` undocumented → explicit doc block on the function plus `debug_assert!` at the prepend site.
- `use super::*` mid-module → moved to top of `mod test`, qualifications dropped, `s` helper renamed to `entry`.

Explicitly not done in this PR (deferred / out of scope, called out in the review):
- `DbConfig`/CLI: explicitly deferred per task. Mitigated by the parameter-injection in `SnapshotPolicy`.
- `merge_policy.rs` submodule: separate refactor PR.
- `read_max_snapshot_count_from_env`'s rustdoc.
- `merge_snapshots` (per-batch error leaves orphans on disk for `>K` batches): pre-existing class of bug at smaller scope, recovered on next process restart via `init_snapshot_info_list`. Documented in the function's rustdoc with a "tracked separately" note.

Refs
~/claude/2026-05-05-libsql-snapshot-chunking-deferred-plan.md