perf: parallelize blob fetch in dump command #501
ChecksumDev wants to merge 2 commits into rustic-rs:main
Conversation
Extends rustic-rs#12. PR rustic-rs#106 added get_blob_cached to amortize repeat fetches, but commands::dump::dump still fetched each blob serially. On high-latency object stores the throughput was capped by per-blob round-trip latency. This routes the inner loop through pariter::parallel_map_scoped_custom (the same primitive the archiver and packer already use) so fetches overlap while the writer stays single-threaded and ordered. Output is byte-identical to the sequential loop.

A new DumpOptions { num_threads: u32 } and Repository::dump_with_opts are added. Repository::dump(node, w) keeps its existing signature and now defaults to available_parallelism. Single-blob files short-circuit to the sequential path so very small files don't pay the worker setup cost.

The rustic CLI's dump --archive tar / targz, mount, and webdav paths go through vfs::OpenFile::read_at rather than commands::dump::dump, so they aren't affected here. A follow-up can apply the same idea to the VFS reader.
Hi @ChecksumDev Thanks a lot for the PR! I had a short look:
@aawsome I removed the customization. I was sure parallelization was already exposed in the CLI, but agreed this fits better under a broader user-customizable constants story.
Extends #12. PR #106 added `get_blob_cached` to amortize repeat fetches, but `commands::dump::dump` still walks its blob list serially:

https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/commands/dump.rs#L39-L48
so on high-latency backends the throughput is bounded by per-blob round-trip time regardless of how many concurrent connections the backend itself permits.
This routes the loop through `pariter::IteratorExt::parallel_map_scoped_custom`, from the same family of ordered parallel-map primitives the archiver and packer already rely on, so fetches overlap N-wide while the writer stays single-threaded. Output ordering is preserved bit-for-bit; an integration test backs that up by chunking a 64 KiB payload into 16 fixed-size blobs and asserting that the parallel and sequential dumps both equal the source bytes exactly.
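The shape of the change, as a minimal sketch (not the actual patch: `BlobId`, `fetch_blob`, and the surrounding function are placeholders, and the plain `parallel_map_scoped` stands in for the `_custom` variant the PR uses to pin the worker count):

```rust
use std::io::Write;

use pariter::IteratorExt as _; // ordered parallel map over scoped worker threads

/// Hypothetical sketch of the dump loop: fetch blobs on worker threads,
/// write them in input order on the calling thread.
fn dump_blobs<BlobId, W>(
    blob_ids: Vec<BlobId>,
    fetch_blob: impl Fn(BlobId) -> Vec<u8> + Sync,
    writer: &mut W,
) -> std::io::Result<()>
where
    BlobId: Send,
    W: Write,
{
    pariter::scope(|scope| {
        blob_ids
            .into_iter()
            // fetches run on worker threads and overlap with one another...
            .parallel_map_scoped(scope, |id| fetch_blob(id))
            // ...but results come back in input order, so the single writer
            // produces bytes identical to the sequential loop
            .try_for_each(|data| writer.write_all(&data))
    })
    .expect("blob fetch worker panicked")
}
```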
The new `DumpOptions` carries a single `num_threads: u32` field (`Setters`, `non_exhaustive`, gated behind the existing `clap` feature), where `0` selects `available_parallelism` and `1` forces the sequential implementation. `Repository::dump_with_opts(node, w, &DumpOptions)` joins the existing `Repository::dump(node, w)`, which keeps its signature and now defaults to `available_parallelism`. Both methods pick up an `S: Sync` bound; the in-tree `IndexedFullStatus` already satisfies it, so all in-tree call sites compile unchanged. Files that decompose into a single blob take a sequential short-circuit and pay none of the worker setup cost.
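How the `0`/`1` values resolve to a worker count, as an illustrative sketch (the `resolved_threads` helper and the derives shown are mine, not the PR's; only the field and the `available_parallelism` fallback are from the description above):

```rust
use std::num::NonZeroUsize;
use std::thread::available_parallelism;

/// Illustrative mirror of the described options type: 0 = auto, 1 = sequential.
#[derive(Debug, Clone, Copy, Default)]
pub struct DumpOptions {
    pub num_threads: u32,
}

impl DumpOptions {
    /// Hypothetical helper: map the configured value to an actual worker count.
    fn resolved_threads(&self) -> usize {
        match self.num_threads {
            // 0: use every core the OS reports; fall back to 1 if the query fails
            0 => available_parallelism().map(NonZeroUsize::get).unwrap_or(1),
            n => n as usize,
        }
    }
}

fn main() {
    let auto = DumpOptions { num_threads: 0 };
    let fixed = DumpOptions { num_threads: 4 };
    // num_threads == 1 (or a single-blob file) takes the sequential path
    println!("auto -> {}, fixed -> {}", auto.resolved_threads(), fixed.resolved_threads());
}
```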
Out of scope: `dump --archive tar`, `--archive targz`, `mount`, and `webdav` go through `vfs::OpenFile::read_at` rather than `commands::dump::dump`. The serial blob-fetch shape is the same, but the consumer there is pull-based (the tar writer calls `Read::read` repeatedly with small buffers), so the right fix is read-ahead prefetching rather than the push-style overlap this PR uses. That wants its own change.

Results
Same caveat as #487: not a proper benchmark suite. An `InMemoryBackend` wrapper sleeps a configurable duration on every `read_full`/`read_partial` to simulate object-store round-trip latency. 4 MiB file, 64 KiB fixed-size chunks (64 blobs), fresh `Repository` per measurement so the in-process blob cache is cold for every run.

For any non-trivial backend latency the speedup converges to the worker count. A thread sweep at 20 ms latency confirms near-linear scaling: 1, 2, 4, 8, 16, 32 threads give 1.0x, 2.0x, 4.0x, 7.9x, 15.8x, 30.8x.
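The latency shim is a plain delegating wrapper; a simplified sketch of the idea (the `BlobRead` trait here is a stand-in, not rustic_core's actual backend trait, whose read methods carry more parameters and a real error type):

```rust
use std::collections::HashMap;
use std::thread::sleep;
use std::time::Duration;

/// Stand-in for the backend read interface used in the benchmark harness.
trait BlobRead {
    fn read_full(&self, id: &str) -> Vec<u8>;
}

/// Plain in-memory store standing in for the in-memory backend.
struct MemBackend(HashMap<String, Vec<u8>>);

impl BlobRead for MemBackend {
    fn read_full(&self, id: &str) -> Vec<u8> {
        self.0[id].clone()
    }
}

/// Delegating wrapper that sleeps before every read to emulate
/// object-store round-trip latency.
struct SlowBackend<B> {
    inner: B,
    latency: Duration,
}

impl<B: BlobRead> BlobRead for SlowBackend<B> {
    fn read_full(&self, id: &str) -> Vec<u8> {
        sleep(self.latency); // simulated network round trip
        self.inner.read_full(id)
    }
}

fn main() {
    let mem = MemBackend(HashMap::from([("blob-0".into(), vec![0u8; 64 * 1024])]));
    let slow = SlowBackend { inner: mem, latency: Duration::from_millis(20) };
    assert_eq!(slow.read_full("blob-0").len(), 64 * 1024);
}
```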
The 0 ms row shows the cost: against a free backend (warm page cache, in-memory backend, repeat dumps that already hit `get_blob_cached`), the parallel path runs at roughly 60-70% of sequential speed, because thread setup and channel passing dominate when the per-blob fetch is itself near-instant. The single-blob short-circuit covers the smallest files, so the regression is bounded to multi-blob files on near-zero-latency backends, which is the case where dump throughput wasn't the bottleneck to begin with.