perf: parallelize blob fetch in dump command #501
ChecksumDev wants to merge 2 commits into rustic-rs:main
Conversation
Extends rustic-rs#12. PR rustic-rs#106 added get_blob_cached to amortize repeat fetches, but commands::dump::dump still fetched each blob serially. On high-latency object stores the throughput was capped by per-blob round-trip latency. This routes the inner loop through pariter::parallel_map_scoped_custom (the same primitive the archiver and packer already use) so fetches overlap while the writer stays single-threaded and ordered. Output is byte-identical to the sequential loop.

A new DumpOptions { num_threads: u32 } and Repository::dump_with_opts are added. Repository::dump(node, w) keeps its existing signature and now defaults to available_parallelism. Single-blob files short-circuit to the sequential path so very small files don't pay the worker setup cost.

The rustic CLI's dump --archive tar / targz, mount, and webdav paths go through vfs::OpenFile::read_at rather than commands::dump::dump, so they aren't affected here. A follow-up can apply the same idea to the VFS reader.
Hi @ChecksumDev Thanks a lot for the PR! I had a short look:
@aawsome I removed the customization. I was sure parallelization was already exposed in the CLI, but agreed this fits better under a broader user-customizable constants story.
Extends #12. PR #106 added `get_blob_cached` to amortize repeat fetches, but `commands::dump::dump` still walks its blob list serially:

https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/commands/dump.rs#L39-L48
so on high-latency backends the throughput is bounded by per-blob round-trip time regardless of how many concurrent connections the backend itself permits.
This routes the loop through `pariter::IteratorExt::parallel_map_scoped_custom`, from the same family of ordered parallel-map primitives the archiver and packer already rely on, so fetches overlap N-wide while the writer stays single-threaded. Output ordering is preserved bit-for-bit; an integration test backs that up by chunking a 64 KiB payload into 16 fixed-size blobs and asserting that the parallel and sequential dumps both equal the source bytes exactly.
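The shape of the change, as a minimal sketch (not the actual patch: `BlobId`, `fetch_blob`, and the surrounding function are placeholders, and the plain `parallel_map_scoped` stands in for the `_custom` variant the PR uses to pin the worker count):

```rust
use std::io::Write;

use pariter::IteratorExt as _; // ordered parallel map over scoped worker threads

/// Hypothetical sketch of the dump loop: fetch blobs on worker threads,
/// write them in input order on the calling thread.
fn dump_blobs<BlobId, W>(
    blob_ids: Vec<BlobId>,
    fetch_blob: impl Fn(BlobId) -> Vec<u8> + Sync,
    writer: &mut W,
) -> std::io::Result<()>
where
    BlobId: Send,
    W: Write,
{
    pariter::scope(|scope| {
        blob_ids
            .into_iter()
            // fetches run on worker threads and overlap with one another...
            .parallel_map_scoped(scope, |id| fetch_blob(id))
            // ...but results come back in input order, so the single writer
            // produces bytes identical to the sequential loop
            .try_for_each(|data| writer.write_all(&data))
    })
    .expect("blob fetch worker panicked")
}
```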
The new `DumpOptions` carries a single `num_threads: u32` field (`Setters`, `non_exhaustive`, gated behind the existing `clap` feature), where `0` selects `available_parallelism` and `1` forces the sequential implementation. `Repository::dump_with_opts(node, w, &DumpOptions)` joins the existing `Repository::dump(node, w)`, which keeps its signature and now defaults to `available_parallelism`. Both methods pick up an `S: Sync` bound; the in-tree `IndexedFullStatus` already satisfies it, so all in-tree call sites compile unchanged. Files that decompose into a single blob take a sequential short-circuit and pay none of the worker setup cost.
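How the `0`/`1` values resolve to a worker count, as an illustrative sketch (the `resolved_threads` helper and the derives shown are mine, not the PR's; only the field and the `available_parallelism` fallback are from the description above):

```rust
use std::num::NonZeroUsize;
use std::thread::available_parallelism;

/// Illustrative mirror of the described options type: 0 = auto, 1 = sequential.
#[derive(Debug, Clone, Copy, Default)]
pub struct DumpOptions {
    pub num_threads: u32,
}

impl DumpOptions {
    /// Hypothetical helper: map the configured value to an actual worker count.
    fn resolved_threads(&self) -> usize {
        match self.num_threads {
            // 0: use every core the OS reports; fall back to 1 if the query fails
            0 => available_parallelism().map(NonZeroUsize::get).unwrap_or(1),
            n => n as usize,
        }
    }
}

fn main() {
    let auto = DumpOptions { num_threads: 0 };
    let fixed = DumpOptions { num_threads: 4 };
    // num_threads == 1 (or a single-blob file) takes the sequential path
    println!("auto -> {}, fixed -> {}", auto.resolved_threads(), fixed.resolved_threads());
}
```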
Out of scope: `dump --archive tar`, `--archive targz`, `mount`, and `webdav` go through `vfs::OpenFile::read_at` rather than `commands::dump::dump`. The serial blob-fetch shape is the same, but the consumer there is pull-based (the tar writer calls `Read::read` repeatedly with small buffers), so the right fix is read-ahead prefetching rather than the push-style overlap this PR uses. That wants its own change.

Results
Same caveat as #487: not a proper benchmark suite. An `InMemoryBackend` wrapper sleeps a configurable duration on every `read_full`/`read_partial` to simulate object-store round-trip latency. 4 MiB file, 64 KiB fixed-size chunks (64 blobs), fresh `Repository` per measurement so the in-process blob cache is cold for every run.

For any non-trivial backend latency the speedup converges to the worker count. A thread sweep at 20 ms latency confirms near-linear scaling: 1, 2, 4, 8, 16, 32 threads give 1.0x, 2.0x, 4.0x, 7.9x, 15.8x, 30.8x.
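The latency shim is a plain delegating wrapper; a simplified sketch of the idea (the `BlobRead` trait here is a stand-in, not rustic_core's actual backend trait, whose read methods carry more parameters and a real error type):

```rust
use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

/// Stand-in for the backend read interface used in the benchmark harness.
trait BlobRead {
    fn read_full(&self, id: &str) -> Vec<u8>;
}

/// Plain in-memory store standing in for the in-memory backend.
struct MemBackend(HashMap<String, Vec<u8>>);

impl BlobRead for MemBackend {
    fn read_full(&self, id: &str) -> Vec<u8> {
        self.0[id].clone()
    }
}

/// Delegating wrapper that sleeps before every read to emulate
/// object-store round-trip latency.
struct SlowBackend<B> {
    inner: B,
    latency: Duration,
}

impl<B: BlobRead> BlobRead for SlowBackend<B> {
    fn read_full(&self, id: &str) -> Vec<u8> {
        sleep(self.latency); // simulated network round trip
        self.inner.read_full(id)
    }
}

fn main() {
    let mem = MemBackend(HashMap::from([("blob-0".into(), vec![0u8; 64 * 1024])]));
    let slow = SlowBackend { inner: mem, latency: Duration::from_millis(20) };
    assert_eq!(slow.read_full("blob-0").len(), 64 * 1024);
}
```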
The 0 ms row shows the cost: against a free backend (warm page cache, in-memory backend, repeat dumps that already hit `get_blob_cached`), the parallel path runs at roughly 60-70% of sequential speed, because thread setup and channel passing dominate when the per-blob fetch is itself near-instant. The single-blob short-circuit covers the smallest files, so the regression is bounded to multi-blob files on near-zero-latency backends, which is the case where dump throughput wasn't the bottleneck to begin with.