Reconcile metadata.json with actually-uploaded files after upload #1380

@minguyen9988

Problem

After uploading a backup, metadata.json lists all tables from the original local backup — including tables whose data may have failed to upload. If a single table's upload fails (network error, permission issue, disk full on remote), the backup's metadata.json still references it. On download, clickhouse-backup tries to download that table's data and fails.

Current behavior

  1. Backup create writes local metadata.json listing all tables
  2. Upload uploads metadata.json first, then data files
  3. If table X's data upload fails but the overall upload is retried/resumed, metadata.json still lists table X
  4. Download reads metadata.json, finds table X, tries to download its data — fails

Proposed Fix

After completing all data uploads, reconcile metadata.json with what was actually uploaded. Remove entries for tables whose data files don't exist on remote.
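
The code in this proposal builds a file-level manifest; the table-level pruning step itself is not shown. A minimal sketch of what that step could look like follows, assuming a hypothetical `TableEntry` type and a hypothetical remote key layout of `<backup>/shadow/<db>/<table>/...` (neither is taken from the clickhouse-backup source). Note that under this rule a table that legitimately has no data parts would also be dropped, so a real implementation would need to tell "nothing to upload" apart from "upload failed".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TableEntry is a hypothetical stand-in for one table listed in metadata.json.
type TableEntry struct {
	Database string
	Table    string
}

// reconcileTables splits tables into those with at least one uploaded object
// under "<backup>/shadow/<db>/<table>/" (an assumed layout) and those without.
func reconcileTables(backupName string, tables []TableEntry, uploaded []string) (kept, dropped []TableEntry) {
	// Sort a copy so the caller's slice is left untouched.
	keys := append([]string(nil), uploaded...)
	sort.Strings(keys)
	for _, t := range tables {
		prefix := backupName + "/shadow/" + t.Database + "/" + t.Table + "/"
		// The first key >= prefix carries the prefix iff any key does.
		i := sort.SearchStrings(keys, prefix)
		if i < len(keys) && strings.HasPrefix(keys[i], prefix) {
			kept = append(kept, t)
		} else {
			dropped = append(dropped, t)
		}
	}
	return kept, dropped
}

func main() {
	tables := []TableEntry{
		{Database: "default", Table: "events"},
		{Database: "default", Table: "users"}, // no objects uploaded for this one
	}
	uploaded := []string{
		"b1/metadata.json",
		"b1/shadow/default/events/default/all_1_1_0/data.bin",
	}
	kept, dropped := reconcileTables("b1", tables, uploaded)
	fmt.Println("kept:", kept)
	fmt.Println("dropped:", dropped)
}
```

The `uploaded` slice here would come from the manifest described below, so reconciliation would need no extra listing calls on a fresh upload.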

func (b *Backuper) uploadManifest(ctx context.Context, backupName string) error {
    var manifest *storage.BackupManifest
    if b.fileManifest != nil && b.fileManifest.TotalFiles > 0 && !b.resume {
        // Use the incrementally-built manifest (zero Walk / zero ListObjects)
        manifest = b.fileManifest
    } else {
        // Fallback: Walk the backup directory for resumed uploads
        manifest = storage.NewBackupManifest(backupName)
        err := b.dst.Walk(ctx, backupName+"/", true, func(ctx context.Context, f storage.RemoteFile) error {
            name := f.Name()
            if name == storage.ManifestFileName || f.Size() == 0 {
                return nil
            }
            manifest.AddFile(name, f.Size(), f.LastModified())
            return nil
        })
        if err != nil {
            return errors.WithMessage(err, "manifest Walk")
        }
    }
    return b.dst.UploadManifest(ctx, backupName, manifest)
}

The incremental path is preferable to the Walk fallback: build the manifest during upload by recording each file as soon as it has been uploaded successfully:

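// recordUploadedFiles adds successfully uploaded files to the in-memory manifest.
// Imports needed: os, path, path/filepath, time.
func (b *Backuper) recordUploadedFiles(basePath string, files []string, localBasePath string) {
    if b.fileManifest == nil {
        return
    }
    b.fileManifestMu.Lock()
    defer b.fileManifestMu.Unlock()
    now := time.Now().UTC()
    for _, f := range files {
        // Size stays 0 when localBasePath is empty or the local file cannot be stat'ed.
        size := int64(0)
        if localBasePath != "" {
            // filepath.Join for the local OS path; path.Join is kept for the remote key.
            if fInfo, err := os.Stat(filepath.Join(localBasePath, f)); err == nil {
                size = fInfo.Size()
            }
        }
        b.fileManifest.AddFile(path.Join(basePath, f), size, now)
    }
}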

This ensures the manifest (and, by extension, the metadata) only references files that actually exist on remote. The mutex makes recording thread-safe, since multiple upload goroutines call this function concurrently.
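
To illustrate the record-on-success pattern under concurrency, here is a small self-contained sketch. `Manifest`, `FileEntry`, `uploadAll`, and `errSimulated` are hypothetical stand-ins, not the real `storage.BackupManifest` API; the `upload` callback simulates a remote write so that one file can be made to fail.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errSimulated marks a simulated upload failure in this sketch.
var errSimulated = errors.New("simulated upload failure")

// FileEntry and Manifest are hypothetical stand-ins for the manifest type.
type FileEntry struct {
	Size         int64
	LastModified time.Time
}

type Manifest struct {
	mu    sync.Mutex
	files map[string]FileEntry
}

func NewManifest() *Manifest { return &Manifest{files: make(map[string]FileEntry)} }

// AddFile records one uploaded object; safe for concurrent use.
func (m *Manifest) AddFile(name string, size int64, modified time.Time) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.files[name] = FileEntry{Size: size, LastModified: modified}
}

// Has reports whether name was recorded.
func (m *Manifest) Has(name string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	_, ok := m.files[name]
	return ok
}

// uploadAll runs one goroutine per file and records a file only if upload returns nil.
func uploadAll(m *Manifest, files map[string]int64, upload func(name string) error) []error {
	var (
		wg    sync.WaitGroup
		errMu sync.Mutex
		errs  []error
	)
	for name, size := range files {
		wg.Add(1)
		go func(name string, size int64) {
			defer wg.Done()
			if err := upload(name); err != nil {
				errMu.Lock()
				errs = append(errs, fmt.Errorf("%s: %w", name, err))
				errMu.Unlock()
				return // a failed upload never reaches the manifest
			}
			m.AddFile(name, size, time.Now().UTC())
		}(name, size)
	}
	wg.Wait()
	return errs
}

func main() {
	m := NewManifest()
	files := map[string]int64{
		"b1/shadow/db/t1/part.bin": 100,
		"b1/shadow/db/t2/part.bin": 200,
	}
	errs := uploadAll(m, files, func(name string) error {
		if name == "b1/shadow/db/t2/part.bin" {
			return errSimulated
		}
		return nil
	})
	fmt.Println("errors:", len(errs), "t1 recorded:", m.Has("b1/shadow/db/t1/part.bin"),
		"t2 recorded:", m.Has("b1/shadow/db/t2/part.bin"))
}
```

Keeping the lock inside `AddFile` rather than in every caller is one possible design; the proposal above instead holds `fileManifestMu` around a whole batch, which takes the lock once per call rather than once per file.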

Benefits

  1. Downloads never fail due to phantom table references in metadata
  2. For fresh (non-resume) uploads, the manifest is built with zero ListObjects calls
  3. For resumed uploads, the Walk fallback accurately captures what exists on remote
  4. The manifest doubles as the file listing used to skip Walk during restore (see Proposal: eliminate S3 Walk (ListObjectsV2) during restore via upload-time file manifest #1375)
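
As a rough illustration of benefit 4, the sketch below answers "which files belong to this table?" from a parsed manifest instead of a remote listing. The JSON shape (`backup_name`, `files` mapping object key to size) and the names `manifestDoc`, `parseManifest`, and `filesUnder` are assumptions made for this example; the actual manifest format is not specified in this issue.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// manifestDoc is a hypothetical serialized form of the manifest.
type manifestDoc struct {
	BackupName string           `json:"backup_name"`
	Files      map[string]int64 `json:"files"` // object key -> size in bytes
}

// parseManifest decodes the manifest downloaded from remote storage.
func parseManifest(raw []byte) (manifestDoc, error) {
	var doc manifestDoc
	err := json.Unmarshal(raw, &doc)
	return doc, err
}

// filesUnder returns the sorted manifest keys below prefix, standing in for
// a Walk / ListObjectsV2 call against the remote.
func filesUnder(doc manifestDoc, prefix string) []string {
	var out []string
	for key := range doc.Files {
		if strings.HasPrefix(key, prefix) {
			out = append(out, key)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	raw := []byte(`{
	  "backup_name": "b1",
	  "files": {
	    "b1/metadata.json": 512,
	    "b1/shadow/default/events/all_1_1_0/data.bin": 4096,
	    "b1/shadow/default/events/all_1_1_0/primary.idx": 64
	  }
	}`)
	doc, err := parseManifest(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, key := range filesUnder(doc, "b1/shadow/default/events/") {
		fmt.Println(key, doc.Files[key])
	}
}
```

An empty result for a table's prefix is also a cheap signal that the table was never uploaded, which a download could report up front instead of failing midway.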
