Optimize zstd: single-threaded encoder, multi-threaded decoder, 4MB window #1378

@minguyen9988

Problem

The default zstd configuration uses multi-threaded encoding per stream: when WithEncoderConcurrency is not set, the encoder concurrency defaults to GOMAXPROCS. When upload_concurrency=N tables are compressed simultaneously, this creates N × GOMAXPROCS encoder goroutines competing for CPU. That over-subscription degrades throughput.

For decompression, the default decoder uses little parallelism (klauspost/compress caps the default decoder concurrency at 4, to my knowledge), so on a many-core host most cores sit idle during restore.

Benchmarks (16-core host, 8 tables in parallel)

| Configuration | Throughput | CPU usage |
|---|---|---|
| Encoder: GOMAXPROCS(16), 8 parallel | 2201 MB/s | 128 threads, heavy contention |
| Encoder: concurrency=1, 8 parallel | 2856 MB/s | 8 threads, no contention |
| Improvement | +30% | |
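
The figures in the table can be re-derived with a few lines of Go. This is only a sketch of the arithmetic; the helper names `encoderGoroutines` and `improvementPercent` are made up for illustration and are not part of clickhouse-backup.

```go
package main

import "fmt"

// encoderGoroutines returns how many encoder goroutines are active when
// `streams` tables are compressed in parallel with `perStream` workers each.
func encoderGoroutines(streams, perStream int) int {
	return streams * perStream
}

// improvementPercent returns the relative throughput gain of `after` over `before`.
func improvementPercent(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	fmt.Println(encoderGoroutines(8, 16))                  // 128: GOMAXPROCS workers per stream
	fmt.Println(encoderGoroutines(8, 1))                   // 8: one worker per stream
	fmt.Printf("%.1f%%\n", improvementPercent(2201, 2856)) // 29.8%
}
```

The exact gain is about 29.8%, which the table rounds to +30%.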

Proposed Changes

1. Single-threaded zstd encoder (per stream)

// pkg/storage/utils.go — getArchiveWriter()
case "zstd":
    return &archiver.CompressedArchive{Compression: archiver.Zstd{EncoderOptions: []zstd.EOption{
        zstd.WithEncoderLevel(zstd.EncoderLevelFromZstd(level)),
        zstd.WithEncoderConcurrency(1),               // single-threaded per stream
        zstd.WithWindowSize(4 * 1024 * 1024),          // 4MB window
    }}, Archival: archiver.Tar{}}, nil

Rationale: clickhouse-backup already parallelizes at the table level (upload_concurrency), and each table gets its own compression stream. When 8 tables compress simultaneously on a 16-core host, 8 single-threaded encoders achieve higher aggregate throughput than 8 encoders with 16 workers each, because the total number of runnable encoder goroutines stays at or below the core count and the scheduling and synchronization overhead of over-subscription is avoided.

2. Multi-threaded zstd decoder

// pkg/storage/utils.go — getArchiveReader()
case "zstd":
    // Use up to 32 decoder goroutines per stream; the built-in min requires Go 1.21+.
    zstdDecoderConcurrency := min(runtime.GOMAXPROCS(0), 32)
    return &archiver.CompressedArchive{Compression: archiver.Zstd{DecoderOptions: []zstd.DOption{
        zstd.WithDecoderConcurrency(zstdDecoderConcurrency),
        zstd.WithDecoderLowmem(false), // favor speed over memory use
    }}, Archival: archiver.Tar{}}, nil

Rationale: During restore, each table's compressed archive is downloaded sequentially and decompressed. Multi-threaded decoding parallelizes block decompression within a single stream, utilizing CPU cores that would otherwise be idle waiting for I/O.
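
The concurrency cap used in the snippet above can be isolated into a small helper that is easy to test. This is a sketch, not the project's code: `decoderConcurrency` and `maxDecoderConcurrency` are illustrative names, and an explicit comparison is used in place of the built-in `min` so the example also compiles on Go versions older than 1.21.

```go
package main

import (
	"fmt"
	"runtime"
)

// maxDecoderConcurrency mirrors the cap of 32 used in the proposal.
const maxDecoderConcurrency = 32

// decoderConcurrency clamps the number of available procs to [1, 32].
func decoderConcurrency(procs int) int {
	if procs < 1 {
		return 1
	}
	if procs > maxDecoderConcurrency {
		return maxDecoderConcurrency
	}
	return procs
}

func main() {
	// The printed value depends on the host running the example.
	fmt.Println(decoderConcurrency(runtime.GOMAXPROCS(0)))
}
```

On the 16-core benchmark host this yields 16; the cap only takes effect above 32 procs.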

3. 4MB window size

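The window size caps how far back zstd can reference earlier data. ClickHouse data files often have repeating patterns at 1-4MB intervals (column chunks), so a window smaller than 4MB cannot match them. Compared with a 128KB window, a 4MB window improves compression ratio by 2-5% at negligible speed cost; the main cost is memory. Note that klauspost/compress does not use a fixed 128KB default: its default window is chosen by the compression level (at most 8MB), so WithWindowSize(4 * 1024 * 1024) pins the value rather than strictly raising it.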
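
To make the window argument concrete, the sketch below checks two things: whether a candidate window is acceptable, and whether a repeat at a given distance can be matched at all. The bounds are an assumption based on my reading of the klauspost/compress documentation (a power of two between 1KB and 512MB); `validWindowSize` and `reachable` are illustrative helpers, not library functions.

```go
package main

import "fmt"

const (
	minWindow = 1 << 10 // assumed lower bound: 1KB
	maxWindow = 1 << 29 // assumed upper bound: 512MB
)

// validWindowSize reports whether n is a power of two within the assumed bounds.
func validWindowSize(n int) bool {
	return n >= minWindow && n <= maxWindow && n&(n-1) == 0
}

// reachable reports whether a repeat `distance` bytes back lies inside the window.
func reachable(distance, window int) bool {
	return distance <= window
}

func main() {
	const kb, mb = 1 << 10, 1 << 20
	fmt.Println(validWindowSize(4 * mb)) // true: 4MB is 1<<22
	fmt.Println(validWindowSize(3 * mb)) // false: not a power of two
	fmt.Println(reachable(1*mb, 128*kb)) // false: repeat is outside a 128KB window
	fmt.Println(reachable(4*mb, 4*mb))   // true: repeat is inside a 4MB window
}
```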

Summary

| Change | Effect |
|---|---|
| WithEncoderConcurrency(1) | +30% throughput on parallel backup |
| WithDecoderConcurrency(min(GOMAXPROCS, 32)) | Faster restore decompression |
| WithWindowSize(4MB) | Better compression ratio |
| WithDecoderLowmem(false) | Speed over memory during restore |
