perf: parallelize blob fetch in dump command#501

Open
ChecksumDev wants to merge 2 commits into rustic-rs:main from ChecksumDev:feat/parallel-dump

Conversation

@ChecksumDev

Extends #12. PR #106 added get_blob_cached to amortize repeat fetches, but commands::dump::dump still walks its blob list serially:

https://github.com/rustic-rs/rustic_core/blob/main/crates/core/src/commands/dump.rs#L39-L48

so on high-latency backends the throughput is bounded by per-blob round-trip time regardless of how many concurrent connections the backend itself permits.

This routes the loop through pariter::IteratorExt::parallel_map_scoped_custom, part of the same family of ordered parallel-map primitives that the archiver and packer already rely on, so fetches overlap N-wide while the writer stays single-threaded. Output ordering is preserved bit-for-bit; an integration test backs that up by chunking a 64 KiB payload into 16 fixed-size blobs and asserting that the parallel and sequential dumps both equal the source bytes exactly.
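
The ordered-overlap shape can be illustrated with a stdlib-only sketch. This is not pariter's implementation (pariter provides it as the `parallel_map_scoped*` family); the function name, the channel plumbing, and the doubling closure below are all illustrative stand-ins for the real blob-fetch loop:

```rust
use std::sync::{mpsc, Mutex};
use std::thread;

/// Ordered parallel map: workers pull jobs out of order, the consumer
/// reassembles results by index so output order matches input order.
fn parallel_map_ordered<T, R, F>(items: Vec<T>, workers: usize, f: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let n = items.len();
    let (work_tx, work_rx) = mpsc::channel::<(usize, T)>();
    let (done_tx, done_rx) = mpsc::channel::<(usize, R)>();

    // Pre-fill the work queue; dropping the sender lets workers drain and exit.
    for job in items.into_iter().enumerate() {
        work_tx.send(job).unwrap();
    }
    drop(work_tx);

    let work_rx = Mutex::new(work_rx);
    thread::scope(|s| {
        for _ in 0..workers {
            let done_tx = done_tx.clone();
            let work_rx = &work_rx;
            let f = &f;
            s.spawn(move || loop {
                // Take one job; the lock is released before the (slow) fetch runs.
                let job = { work_rx.lock().unwrap().try_recv() };
                let Ok((i, item)) = job else { break };
                done_tx.send((i, f(item))).unwrap();
            });
        }
    });
    drop(done_tx);

    // Reassemble results by index: this is where ordering is restored.
    let mut slots: Vec<Option<R>> = (0..n).map(|_| None).collect();
    for (i, r) in done_rx {
        slots[i] = Some(r);
    }
    slots.into_iter().map(|slot| slot.unwrap()).collect()
}

fn main() {
    let blobs: Vec<u32> = (0..16).collect();
    let doubled = parallel_map_ordered(blobs, 4, |b| b * 2);
    assert_eq!(doubled, (0..16).map(|b| b * 2).collect::<Vec<u32>>());
}
```

The single-threaded reassembly step is what lets the writer stay sequential while fetches run N-wide.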

The new DumpOptions carries a single num_threads: u32 field (Setters, non_exhaustive, gated behind the existing clap feature), where 0 selects available_parallelism and 1 forces the sequential implementation. Repository::dump_with_opts(node, w, &DumpOptions) joins the existing Repository::dump(node, w), which keeps its signature and now defaults to available_parallelism. Both methods pick up an S: Sync bound; the in-tree IndexedFullStatus already satisfies it, so all in-tree call sites compile unchanged. Files that decompose into a single blob take a sequential short-circuit and pay none of the worker setup cost.
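
A minimal reconstruction of the option struct and its 0/1/N encoding, from the description above (the real code's Setters derive and clap gating are omitted, and `resolved_threads` is a hypothetical helper name, not the PR's):

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Sketch of the DumpOptions described above; attribute derives omitted.
#[derive(Debug, Clone, Copy)]
#[non_exhaustive]
pub struct DumpOptions {
    /// 0 = use available_parallelism, 1 = sequential path, N = N workers.
    pub num_threads: u32,
}

impl DumpOptions {
    /// Resolve the user-facing 0/1/N encoding into a concrete worker count.
    fn resolved_threads(&self) -> usize {
        match self.num_threads {
            0 => thread::available_parallelism()
                .map(NonZeroUsize::get)
                .unwrap_or(1),
            n => n as usize,
        }
    }
}

fn main() {
    assert_eq!(DumpOptions { num_threads: 4 }.resolved_threads(), 4);
    assert_eq!(DumpOptions { num_threads: 1 }.resolved_threads(), 1);
    // 0 resolves to whatever the host reports, always at least 1.
    assert!(DumpOptions { num_threads: 0 }.resolved_threads() >= 1);
}
```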

Out of scope: dump --archive tar, --archive targz, mount, and webdav go through vfs::OpenFile::read_at rather than commands::dump::dump. The serial blob-fetch shape is the same, but the consumer there is pull-based (the tar writer calls Read::read repeatedly with small buffers), so the right fix is read-ahead prefetching rather than the push-style overlap this PR uses. That wants its own change.

Results

Same caveat as #487: not a proper benchmark suite. An InMemoryBackend wrapper sleeps a configurable duration on every read_full/read_partial to simulate object-store round-trip latency. 4 MiB file, 64 KiB fixed-size chunks (64 blobs), fresh Repository per measurement so the in-process blob cache is cold for every run.
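
The latency-injecting wrapper can be sketched like this; the `Backend` trait and both types here are hypothetical stand-ins, not rustic_core's real backend API:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical minimal backend trait standing in for the real read API.
trait Backend {
    fn read_full(&self, id: u32) -> Vec<u8>;
}

/// In-memory backend: returns a deterministic payload instantly.
struct InMemory;
impl Backend for InMemory {
    fn read_full(&self, id: u32) -> Vec<u8> {
        vec![id as u8; 4]
    }
}

/// Wrapper that sleeps on every read to simulate object-store round trips.
struct Latency<B> {
    inner: B,
    delay: Duration,
}
impl<B: Backend> Backend for Latency<B> {
    fn read_full(&self, id: u32) -> Vec<u8> {
        thread::sleep(self.delay);
        self.inner.read_full(id)
    }
}

fn main() {
    let backend = Latency { inner: InMemory, delay: Duration::from_millis(5) };
    let start = Instant::now();
    let blob = backend.read_full(7);
    assert_eq!(blob, vec![7u8; 4]);
    // Every read pays at least the injected delay.
    assert!(start.elapsed() >= Duration::from_millis(5));
}
```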

    latency      sequential        parallel  speedup
        0ms     2124 MB/s        1497 MB/s    0.70x
        1ms       54 MB/s         699 MB/s   12.89x
        5ms       12 MB/s         180 MB/s   15.15x
       20ms        3 MB/s          48 MB/s   15.65x
       50ms        1 MB/s          19 MB/s   15.72x
      100ms      0.6 MB/s           9 MB/s   15.81x

For any non-trivial backend latency the speedup converges to the worker count. A thread sweep at 20 ms latency confirms linear scaling: 1, 2, 4, 8, 16, 32 threads gives 1.0x, 2.0x, 4.0x, 7.9x, 15.8x, 30.8x.
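
A back-of-envelope model explains why the measured numbers land where they do: when per-blob latency L dominates, the sequential time is roughly B·L and the parallel time is roughly ceil(B/N)·L, so the speedup is B / ceil(B/N). The function name is illustrative:

```rust
/// Latency-bound model: B blobs fetched N-wide complete in ceil(B/N)
/// "waves" of one round trip each, versus B round trips sequentially.
fn model_speedup(blobs: u64, workers: u64) -> f64 {
    let waves = (blobs + workers - 1) / workers;
    blobs as f64 / waves as f64
}

fn main() {
    // 64 blobs (4 MiB / 64 KiB) with 16 workers: 4 waves -> 16x,
    // close to the ~15.7-15.8x measured at 20-100 ms latency.
    assert_eq!(model_speedup(64, 16), 16.0);
    // And 32 workers: 2 waves -> 32x, near the measured 30.8x.
    assert_eq!(model_speedup(64, 32), 32.0);
    assert_eq!(model_speedup(64, 1), 1.0);
}
```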

The 0 ms row is the cost. Against a free backend (warm page cache, in-memory backend, repeat dumps that already hit get_blob_cached) the parallel path runs at roughly 60-70% of sequential because thread setup and channel passing dominate when the per-blob fetch is itself near-instant. The single-blob short-circuit covers the smallest files, so the regression is bounded to multi-blob files on near-zero-latency backends, which is the case where dump throughput wasn't the bottleneck to begin with.

@aawsome
Member

aawsome commented May 6, 2026

Hi @ChecksumDev Thanks a lot for the PR!

I had a short look:

  1. Most of the complexity is due to the additional parameter which is exposed to externally set parallelization. In other commands this is hard-coded and we need a general user story about how to make those constants user-customizable. I'd like to start here similarly. Can you remove the part where this is made customizable?

  2. I think that this is only a first step and fetching multiple blobs at-once in combination with a better caching (as we know the access pattern exactly in advance) will bring even more benefit. But this we can do in another PR.

@ChecksumDev
Author

@aawsome I removed the customization. I was sure parallelization was already exposed in the CLI, but agreed this fits better under a broader user-customizable constants story.
