Proposal: eliminate S3 Walk (ListObjectsV2) during restore via upload-time file manifest #1375

@minguyen9988

Problem

When restoring (downloading) a backup from S3-compatible object storage, clickhouse-backup calls Walk() (which maps to ListObjectsV2 with pagination) for every single part in the backup. For a production backup with tens of thousands of parts, this creates an enormous number of S3 API calls that dominate restore time.

How Walk is used during download

The DownloadPath() and DownloadPathParallel() functions in pkg/storage/general.go both call bd.Walk() before downloading any files:

// general.go line 506
walkErr := bd.Walk(ctx, remotePath, true, func(ctx context.Context, f RemoteFile) error {
    // ... download each file
})

Under the hood, Walk() calls ListObjectsV2 with pagination (1000 keys per page) via remotePager():

// s3.go line 851
params := &s3.ListObjectsV2Input{
    Bucket:  aws.String(s.Config.Bucket),
    MaxKeys: aws.Int32(1000),
    Prefix:  aws.String(prefix),
}
pager := s3.NewListObjectsV2Paginator(s.client, params, ...)

Impact at scale

For a typical production ClickHouse backup:

  • 50,000 parts across all tables
  • Each part has 5-20 files (data columns, mark files, primary index, etc.)
  • Each Walk() call requires at least one ListObjectsV2 request (more if the part has >1000 files, which is rare)
  • That's 50,000+ ListObjectsV2 API calls just to discover which files to download

At typical S3 latency of 50-100ms per ListObjects request:

  • ~40-80 minutes spent purely on Walk/ListObjects, before any actual data transfer begins
  • The estimate assumes the per-part Walk calls run one after another; each part issues its own independent listing, so the latencies add up
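
The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not code from clickhouse-backup: listRequests and walkOverhead are hypothetical helpers, and the model assumes a fixed per-request latency with no parallelism.

```go
package main

import (
	"fmt"
	"time"
)

// listRequests returns how many ListObjectsV2 requests one Walk needs for a
// part holding fileCount objects, given the page size (1000 in the s3.go excerpt).
func listRequests(fileCount, pageSize int) int {
	if fileCount <= 0 {
		return 1 // listing an empty prefix still costs one request
	}
	return (fileCount + pageSize - 1) / pageSize // ceiling division
}

// walkOverhead estimates the total listing time when every part is walked
// one after another at a fixed per-request latency.
func walkOverhead(parts, filesPerPart, pageSize int, latency time.Duration) time.Duration {
	return time.Duration(parts*listRequests(filesPerPart, pageSize)) * latency
}

func main() {
	for _, latency := range []time.Duration{50 * time.Millisecond, 100 * time.Millisecond} {
		fmt.Printf("%v per request: %v\n", latency, walkOverhead(50_000, 20, 1000, latency))
	}
	// Prints:
	// 50ms per request: 41m40s
	// 100ms per request: 1h23m20s
}
```

With 50,000 parts and one request per part, the model lands on roughly 42 to 83 minutes, which matches the 40-80 minute range quoted above.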

The Walk overhead is especially painful for:

  1. Resume operations: even when all files are already downloaded, Walk is still called to check what exists remotely
  2. Partial restores (--table flag): Walk runs only for the parts that pass the filter, but each of those parts still pays the full per-part cost
  3. Incremental backups: recursive downloads multiply the Walk overhead

The data is already known at upload time

The key insight is that the uploader already knows every file it uploaded. During UploadPath()/UploadPathParallel(), each file's path, size, and modification time are available immediately after upload. This information could be recorded during upload and stored alongside the backup, completely eliminating the need for Walk during download.

Proposed Solution: Upload-time file manifest

1. Add manifest.json to each backup

During upload, record every file in a manifest:

{
  "version": 1,
  "backup_name": "daily_backup_20260515",
  "created_at": "2026-05-15T23:01:40Z",
  "total_size": 1234567890,
  "total_files": 150000,
  "files": [
    {"path": "shadow/default/my_table/default/part1/data.bin", "size": 104857600, "last_modified": "2026-05-15T23:01:40Z"},
    {"path": "shadow/default/my_table/default/part1/data.cmrk3", "size": 8192, "last_modified": "2026-05-15T23:01:40Z"},
    ...
  ]
}

The manifest is built incrementally during upload — each successful PutFile() call appends an entry. No post-upload Walk needed.
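
One way the JSON above could map onto Go types is sketched below. The type names, field names and the newManifest helper are assumptions made for illustration; only the JSON keys come from the example. Deriving total_size and total_files from the entries keeps the totals from drifting out of sync with the file list.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ManifestEntry mirrors one element of "files" in the JSON example.
type ManifestEntry struct {
	Path         string    `json:"path"`
	Size         int64     `json:"size"`
	LastModified time.Time `json:"last_modified"`
}

// Manifest mirrors the top-level object in the JSON example.
type Manifest struct {
	Version    int             `json:"version"`
	BackupName string          `json:"backup_name"`
	CreatedAt  time.Time       `json:"created_at"`
	TotalSize  int64           `json:"total_size"`
	TotalFiles int             `json:"total_files"`
	Files      []ManifestEntry `json:"files"`
}

// newManifest (hypothetical helper) computes the totals from the entries.
func newManifest(name string, createdAt time.Time, files []ManifestEntry) Manifest {
	m := Manifest{Version: 1, BackupName: name, CreatedAt: createdAt, TotalFiles: len(files), Files: files}
	for _, f := range files {
		m.TotalSize += f.Size
	}
	return m
}

func main() {
	ts := time.Date(2026, 5, 15, 23, 1, 40, 0, time.UTC)
	m := newManifest("daily_backup_20260515", ts, []ManifestEntry{
		{Path: "shadow/default/my_table/default/part1/data.bin", Size: 104857600, LastModified: ts},
		{Path: "shadow/default/my_table/default/part1/data.cmrk3", Size: 8192, LastModified: ts},
	})
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```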

2. New download path that uses the manifest

Add DownloadPathWithManifest() and DownloadPathParallelWithManifest() that iterate over manifest entries instead of calling Walk:

func (bd *BackupDestination) DownloadPathWithManifest(ctx context.Context, remotePath string, localPath string,
    manifestFiles []ManifestEntry, prefixInManifest string, ...) (int64, error) {
    for _, entry := range manifestFiles {
        // Download the file directly; no Walk needed.
        // entry.Path is assumed to hold the manifest "path" value for this file.
        r, err := bd.GetFileReader(ctx, path.Join(remotePath, entry.Path))
        // ...
    }
}
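
The prefixInManifest parameter suggests that each call only needs the entries below one remote path. A minimal sketch of that selection follows; entriesUnder is a hypothetical helper, and the trimmed-down ManifestEntry exists only to keep the example self-contained. The trailing slash matters: without it, a prefix ending in part1 would also match part10.

```go
package main

import (
	"fmt"
	"strings"
)

// ManifestEntry is reduced to the fields this sketch needs.
type ManifestEntry struct {
	Path string
	Size int64
}

// entriesUnder returns the entries stored below prefix, with each path made
// relative to it. The prefix is normalised to end in "/" so that it only
// matches whole path segments.
func entriesUnder(entries []ManifestEntry, prefix string) []ManifestEntry {
	prefix = strings.TrimSuffix(prefix, "/") + "/"
	var out []ManifestEntry
	for _, e := range entries {
		if strings.HasPrefix(e.Path, prefix) && len(e.Path) > len(prefix) {
			out = append(out, ManifestEntry{Path: e.Path[len(prefix):], Size: e.Size})
		}
	}
	return out
}

func main() {
	entries := []ManifestEntry{
		{Path: "shadow/default/my_table/default/part1/data.bin", Size: 104857600},
		{Path: "shadow/default/my_table/default/part1/data.cmrk3", Size: 8192},
		{Path: "shadow/default/my_table/default/part10/data.bin", Size: 4096},
	}
	for _, e := range entriesUnder(entries, "shadow/default/my_table/default/part1") {
		fmt.Printf("%s (%d bytes)\n", e.Path, e.Size)
	}
}
```

A linear scan per part would be quadratic for large backups, so a real implementation would more likely sort the manifest once or index it by part directory.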

3. Graceful fallback for older backups

If manifest.json doesn't exist (backups created before this feature), fall back to the existing Walk-based download. This ensures full backward compatibility.
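
The fallback could look roughly like the sketch below. The remoteStore interface, errManifestNotFound and listBackupFiles are all hypothetical stand-ins, not the real BackupDestination API. One design choice shown here is to fall back only when the manifest is missing, so that a transient storage error surfaces instead of silently triggering a full Walk.

```go
package main

import (
	"errors"
	"fmt"
)

// errManifestNotFound is a hypothetical sentinel for "no manifest.json in this backup".
var errManifestNotFound = errors.New("manifest.json not found")

// remoteStore is a stand-in for the storage layer, reduced to two operations.
type remoteStore interface {
	LoadManifest(backup string) ([]string, error) // file paths from manifest.json
	Walk(backup string) ([]string, error)         // file paths from listing
}

// listBackupFiles prefers the manifest and falls back to Walk for older backups.
func listBackupFiles(s remoteStore, backup string) (files []string, usedManifest bool, err error) {
	files, err = s.LoadManifest(backup)
	if err == nil {
		return files, true, nil
	}
	if !errors.Is(err, errManifestNotFound) {
		return nil, false, err // real failure: do not hide it behind a Walk
	}
	files, err = s.Walk(backup)
	return files, false, err
}

// fakeStore is an in-memory store used only to exercise the sketch.
type fakeStore struct {
	manifest []string // nil means the backup predates the manifest
	listed   []string
	walks    int
}

func (f *fakeStore) LoadManifest(string) ([]string, error) {
	if f.manifest == nil {
		return nil, errManifestNotFound
	}
	return f.manifest, nil
}

func (f *fakeStore) Walk(string) ([]string, error) {
	f.walks++
	return f.listed, nil
}

func main() {
	old := &fakeStore{listed: []string{"part1/data.bin"}}
	files, usedManifest, err := listBackupFiles(old, "old_backup")
	fmt.Println(files, usedManifest, err, old.walks)
}
```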

Performance improvement

| Metric | Walk-based (current) | Manifest-based (proposed) |
|---|---|---|
| ListObjectsV2 calls (50k parts) | ~50,000 | 1 (manifest download) |
| API overhead latency | 40-80 minutes | <1 second |
| Resume validation | Requires Walk per part | Local file stat only |
| Upload overhead | None | ~1 small JSON upload |

Additional benefits

  1. Resume validation without remote calls: The manifest contains file sizes, enabling local-only validation of downloaded parts (check that all expected files exist with correct sizes) without any remote API calls.

  2. Accurate progress tracking: Total file count and size are known upfront from the manifest, enabling accurate progress bars and ETAs.

  3. Reduced S3 costs: ListObjectsV2 is billed per request. Eliminating 50k+ calls per restore saves money at scale.

  4. Works for all storage backends: The manifest is storage-agnostic — it benefits S3, GCS, Azure Blob, and any backend where listing is expensive.
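
The local-only resume check from point 1 could be sketched as follows. missingOrMismatched is a hypothetical helper, and ManifestEntry is trimmed to the two fields it needs. The check compares existence and size only, as described above, so it would not catch a file with the right length but corrupted contents.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ManifestEntry is reduced to the fields this sketch needs.
type ManifestEntry struct {
	Path string
	Size int64
}

// missingOrMismatched returns the manifest paths that are absent under
// localRoot, are directories, or whose on-disk size differs from the manifest.
func missingOrMismatched(localRoot string, entries []ManifestEntry) []string {
	var bad []string
	for _, e := range entries {
		info, err := os.Stat(filepath.Join(localRoot, filepath.FromSlash(e.Path)))
		if err != nil || info.IsDir() || info.Size() != e.Size {
			bad = append(bad, e.Path)
		}
	}
	return bad
}

func main() {
	dir, err := os.MkdirTemp("", "manifest-check")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Simulate a partially completed download: only data.bin is present.
	if err := os.WriteFile(filepath.Join(dir, "data.bin"), []byte("12345"), 0o644); err != nil {
		panic(err)
	}
	entries := []ManifestEntry{
		{Path: "data.bin", Size: 5},
		{Path: "data.cmrk3", Size: 8},
	}
	fmt.Println(missingOrMismatched(dir, entries)) // prints [data.cmrk3]
}
```

Only the paths returned by the check would need to be fetched again on resume.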

Implementation notes

  • The manifest should be uploaded as the last step after all data files, ensuring it only lists actually-uploaded files.
  • For incremental manifest building during upload, use a mutex-protected slice that each upload goroutine appends to.
  • Pre-allocate the manifest slice capacity based on the estimated file count (sum of parts × average files per part) to reduce GC pressure.
  • Consider pipelining the manifest download — start downloading it concurrently with metadata.json downloads since they're independent.
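
The mutex-protected, pre-allocated slice from the notes above could be sketched like this. manifestCollector and collectConcurrently are hypothetical names, and the goroutines only simulate uploads; in a real uploader, Add would run after a successful PutFile().

```go
package main

import (
	"fmt"
	"sync"
)

// ManifestEntry is reduced to the fields this sketch needs.
type ManifestEntry struct {
	Path string
	Size int64
}

// manifestCollector gathers entries from concurrent upload goroutines.
type manifestCollector struct {
	mu    sync.Mutex
	files []ManifestEntry
}

// newManifestCollector pre-allocates capacity for the estimated file count
// so that appends rarely need to grow the backing array.
func newManifestCollector(estimatedFiles int) *manifestCollector {
	return &manifestCollector{files: make([]ManifestEntry, 0, estimatedFiles)}
}

// Add records one uploaded file; it is safe for concurrent use.
func (c *manifestCollector) Add(e ManifestEntry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.files = append(c.files, e)
}

// Snapshot returns a copy of the entries collected so far.
func (c *manifestCollector) Snapshot() []ManifestEntry {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]ManifestEntry(nil), c.files...)
}

// collectConcurrently simulates n upload goroutines, each recording one file.
func collectConcurrently(n int) []ManifestEntry {
	c := newManifestCollector(n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Add(ManifestEntry{Path: fmt.Sprintf("part%d/data.bin", i), Size: int64(i)})
		}(i)
	}
	wg.Wait()
	return c.Snapshot()
}

func main() {
	fmt.Println(len(collectConcurrently(1000))) // prints 1000
}
```

Entries arrive in completion order, so the manifest would need sorting before upload if a stable file order is wanted.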

We've implemented and tested this approach in a fork and confirmed it eliminates Walk overhead entirely for backups with manifests, while gracefully falling back to Walk for older backups without manifests.
