Problem
When restoring (downloading) a backup from S3-compatible object storage, clickhouse-backup calls Walk() (which maps to ListObjectsV2 with pagination) for every single part in the backup. For a production backup with tens of thousands of parts, this creates an enormous number of S3 API calls that dominate restore time.
How Walk is used during download
The DownloadPath() and DownloadPathParallel() functions in pkg/storage/general.go both call bd.Walk() before downloading any files:
// general.go line 506
walkErr := bd.Walk(ctx, remotePath, true, func(ctx context.Context, f RemoteFile) error {
	// ... download each file
})
Under the hood, Walk() calls ListObjectsV2 with pagination (1000 keys per page) via remotePager():
// s3.go line 851
params := &s3.ListObjectsV2Input{
	Bucket:  aws.String(s.Config.Bucket),
	MaxKeys: aws.Int32(1000),
	Prefix:  aws.String(prefix),
}
pager := s3.NewListObjectsV2Paginator(s.client, params, ...)
Impact at scale
For a typical production ClickHouse backup:
- 50,000 parts across all tables
- Each part has 5-20 files (data columns, mark files, primary index, etc.)
- Each Walk() call requires at least 1 ListObjectsV2 request (more if the part has >1000 files, though rare)
- That's 50,000+ ListObjectsV2 API calls just to discover which files to download
At typical S3 latency of 50-100ms per ListObjects request:
- ~40-80 minutes spent purely on Walk/ListObjects, before any actual data transfer begins
- The cost is paid once per part, because each part triggers its own independent Walk call
The Walk overhead is especially painful for:
- Resume operations: even when all files are already downloaded, Walk is still called to check what exists remotely
- Partial restores (--table flag): Walk is called only for the filtered parts that will actually be downloaded, but the per-part overhead is the same
- Incremental backups: recursive downloads multiply the Walk overhead
The data is already known at upload time
The key insight is that the uploader already knows every file it uploaded. During UploadPath()/UploadPathParallel(), each file's path, size, and modification time are available immediately after upload. This information could be recorded during upload and stored alongside the backup, completely eliminating the need for Walk during download.
Proposed Solution: Upload-time file manifest
1. Add manifest.json to each backup
During upload, record every file in a manifest:
{
"version": 1,
"backup_name": "daily_backup_20260515",
"created_at": "2026-05-15T23:01:40Z",
"total_size": 1234567890,
"total_files": 150000,
"files": [
{"path": "shadow/default/my_table/default/part1/data.bin", "size": 104857600, "last_modified": "2026-05-15T23:01:40Z"},
{"path": "shadow/default/my_table/default/part1/data.cmrk3", "size": 8192, "last_modified": "2026-05-15T23:01:40Z"},
...
]
}
The manifest is built incrementally during upload — each successful PutFile() call appends an entry. No post-upload Walk needed.
2. New download path that uses the manifest
Add DownloadPathWithManifest() and DownloadPathParallelWithManifest() that iterate over manifest entries instead of calling Walk:
func (bd *BackupDestination) DownloadPathWithManifest(ctx context.Context, remotePath string, localPath string,
	manifestFiles []ManifestEntry, prefixInManifest string, ...) (int64, error) {
	for _, entry := range manifestFiles {
		// Download the file directly; no Walk needed.
		// entry.Path mirrors the manifest's "path" key (field name assumed).
		r, err := bd.GetFileReader(ctx, path.Join(remotePath, entry.Path))
		// ...
	}
}
3. Graceful fallback for older backups
If manifest.json doesn't exist (backups created before this feature), fall back to the existing Walk-based download. This ensures full backward compatibility.
Performance improvement
| Metric | Walk-based (current) | Manifest-based (proposed) |
| --- | --- | --- |
| ListObjectsV2 calls (50k parts) | ~50,000 | 1 (manifest download) |
| API overhead latency | 40-80 minutes | <1 second |
| Resume validation | Requires Walk per part | Local file stat only |
| Upload overhead | None | ~1 small JSON upload |
Additional benefits
- Resume validation without remote calls: The manifest contains file sizes, enabling local-only validation of downloaded parts (check that all expected files exist with correct sizes) without any remote API calls.
- Accurate progress tracking: Total file count and size are known upfront from the manifest, enabling accurate progress bars and ETAs.
- Reduced S3 costs: ListObjectsV2 is billed per request. Eliminating 50k+ calls per restore saves money at scale.
- Works for all storage backends: The manifest is storage-agnostic, so it benefits S3, GCS, Azure Blob, and any backend where listing is expensive.
Implementation notes
- The manifest should be uploaded as the last step after all data files, ensuring it only lists actually-uploaded files.
- For incremental manifest building during upload, use a mutex-protected slice that each upload goroutine appends to.
- Pre-allocate the manifest slice capacity based on the estimated file count (sum of parts × average files per part) to reduce GC pressure.
- Consider pipelining the manifest download — start downloading it concurrently with metadata.json downloads since they're independent.
We've implemented and tested this approach in a fork and confirmed it eliminates Walk overhead entirely for backups with manifests, while gracefully falling back to Walk for older backups without manifests.