Problem
The default zstd configuration uses multi-threaded encoding per stream: encoder concurrency defaults to `GOMAXPROCS`. When `upload_concurrency=N` tables are being compressed simultaneously, this creates up to N × GOMAXPROCS goroutines competing for CPU, a severe over-subscription that degrades throughput.
For decompression, the default single-threaded decoder leaves CPU idle during restore while I/O is the bottleneck.
Benchmarks (16-core host, 8 tables in parallel)
| Configuration | Throughput | CPU usage |
|---|---|---|
| Encoder: GOMAXPROCS(16), 8 parallel | 2201 MB/s | 128 threads, heavy contention |
| Encoder: concurrency=1, 8 parallel | 2856 MB/s | 8 threads, no contention |
| Improvement | +30% | |
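The figures in the benchmark table can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names `encoderGoroutines` and `improvementPercent` are hypothetical and not part of clickhouse-backup.

```go
package main

import "fmt"

// encoderGoroutines returns the upper bound on encoder workers when
// `streams` tables are compressed at once and each stream may use up to
// `perStream` workers. (Hypothetical helper for illustration.)
func encoderGoroutines(streams, perStream int) int {
	return streams * perStream
}

// improvementPercent returns the relative gain of `after` over `before`.
func improvementPercent(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	fmt.Println(encoderGoroutines(8, 16)) // 8 streams x 16 workers
	fmt.Println(encoderGoroutines(8, 1))  // 8 streams x 1 worker
	// 2201 MB/s -> 2856 MB/s, the two throughput rows of the table.
	fmt.Printf("%.1f%%\n", improvementPercent(2201, 2856))
}
```

With the table's numbers this gives 128 and 8 workers, and a gain of roughly 29.8%, which rounds to the reported +30%.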
Proposed Changes
1. Single-threaded zstd encoder (per stream)
    // pkg/storage/utils.go: getArchiveWriter()
    case "zstd":
        return &archiver.CompressedArchive{
            Compression: archiver.Zstd{EncoderOptions: []zstd.EOption{
                zstd.WithEncoderLevel(zstd.EncoderLevelFromZstd(level)),
                zstd.WithEncoderConcurrency(1),       // single-threaded per stream
                zstd.WithWindowSize(4 * 1024 * 1024), // 4MB window
            }},
            Archival: archiver.Tar{},
        }, nil
Rationale: clickhouse-backup already parallelizes at the table level (upload_concurrency). Each table gets its own compression stream. When 8 tables compress simultaneously, 8 single-threaded encoders achieve higher aggregate throughput than 8 × 16-threaded encoders because they avoid lock contention in the zstd encoder's internal state.
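To make the table-level parallelism concrete, here is a minimal stdlib-only sketch of a bounded pool in which each job stands in for one single-threaded compression stream. It is not the clickhouse-backup implementation: `runBounded` and its parameters are hypothetical names, and the `limit` argument plays the role of `upload_concurrency`.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded runs `tasks` jobs with at most `limit` running at once and
// returns the highest number of jobs observed in flight.
// (Hypothetical helper; `limit` stands in for upload_concurrency.)
func runBounded(tasks, limit int, work func(id int)) int {
	sem := make(chan struct{}, limit)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		inFlight int
		peak     int
	)
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` jobs are already running
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot after the job finishes

			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			work(id) // one single-threaded stream per job

			mu.Lock()
			inFlight--
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runBounded(32, 8, func(id int) { /* compress table `id` here */ })
	fmt.Println(peak <= 8) // the pool never exceeds the limit
}
```

With one worker per stream, the number of busy goroutines is bounded by the pool limit alone, which is the property the rationale relies on.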
2. Multi-threaded zstd decoder
    // pkg/storage/utils.go: getArchiveReader()
    case "zstd":
        zstdDecoderConcurrency := min(runtime.GOMAXPROCS(0), 32)
        return &archiver.CompressedArchive{
            Compression: archiver.Zstd{DecoderOptions: []zstd.DOption{
                zstd.WithDecoderConcurrency(zstdDecoderConcurrency),
                zstd.WithDecoderLowmem(false),
            }},
            Archival: archiver.Tar{},
        }, nil
Rationale: During restore, each table's compressed archive is downloaded sequentially and decompressed. Multi-threaded decoding parallelizes block decompression within a single stream, utilizing CPU cores that would otherwise be idle waiting for I/O.
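The `min(runtime.GOMAXPROCS(0), 32)` expression relies on the `min` builtin, which was added in Go 1.21. If the module has to build with an older toolchain, the cap could be written out by hand. The sketch below assumes a hypothetical helper named `decoderConcurrency`; it also clamps to at least 1, which is an addition not present in the proposed change.

```go
package main

import (
	"fmt"
	"runtime"
)

// decoderConcurrency caps the decoder worker count at `limit` and never
// returns less than 1. (Hypothetical helper; the lower clamp is an
// assumption added for safety.)
func decoderConcurrency(procs, limit int) int {
	if procs < 1 {
		return 1
	}
	if procs > limit {
		return limit
	}
	return procs
}

func main() {
	// On a 16-core host with default settings this yields 16.
	fmt.Println(decoderConcurrency(runtime.GOMAXPROCS(0), 32))
}
```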
3. 4MB window size
The default 128KB window limits zstd's ability to find long-range matches. ClickHouse data files often have repeating patterns at 1-4MB intervals (column chunks). A 4MB window improves compression ratio by 2-5% at negligible speed cost (the window only affects memory usage, not CPU time).
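One practical constraint on the chosen value: as far as I know, `zstd.WithWindowSize` in klauspost/compress only accepts a power of two within the library's minimum and maximum window sizes. The sketch below reproduces that check with the standard library only; the constants are assumed to mirror `zstd.MinWindowSize` and `zstd.MaxWindowSize` and should be verified against the library version in use. `validWindowSize` is a hypothetical name.

```go
package main

import "fmt"

// Bounds assumed to mirror zstd.MinWindowSize and zstd.MaxWindowSize in
// klauspost/compress; verify against the library version in use.
const (
	minWindowSize = 1 << 10 // 1 KiB
	maxWindowSize = 1 << 29 // 512 MiB
)

// validWindowSize reports whether n is a power of two within the bounds.
func validWindowSize(n int) bool {
	if n < minWindowSize || n > maxWindowSize {
		return false
	}
	return n&(n-1) == 0 // exactly one bit set
}

func main() {
	fmt.Println(validWindowSize(4 * 1024 * 1024)) // 4 MiB is 2^22
	fmt.Println(validWindowSize(3 * 1024 * 1024)) // 3 MiB is not a power of two
}
```

The proposed `4 * 1024 * 1024` passes this check, while a value such as 3 MiB would not.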
Summary
| Change | Effect |
|---|---|
| `WithEncoderConcurrency(1)` | +30% throughput on parallel backup |
| `WithDecoderConcurrency(GOMAXPROCS)` | Faster restore decompression |
| `WithWindowSize(4MB)` | Better compression ratio |
| `WithDecoderLowmem(false)` | Speed over memory during restore |