Skip to content

Fix --table with --resume on incremental backups (b.dst corruption + state wipe) #1384

@minguyen9988

Description

@minguyen9988

Problem

Using --table with --resume on incremental backups crashes or produces incorrect results due to two interacting bugs.

Bug 1: Recursive Download() corrupts parent's BackupDestination

When downloading an incremental backup, Download() recursively calls itself to download the RequiredBackup (base backup):

// pkg/backup/download.go
if !schemaOnly && !b.cfg.General.DownloadByPart && remoteBackup.RequiredBackup != "" {
    err := b.Download(remoteBackup.RequiredBackup, tablePattern, ...)
    // After this returns, b.dst has been overwritten and closed!
}

The recursive call runs on the same Backuper instance. Inside, it calls initDisksPathsAndBackupDestination() which overwrites b.dst with a new BackupDestination. On return, the child's defer b.dst.Close(ctx) closes this connection. The parent then tries to use b.dst for its own data downloads — but it's been closed.

Without --table, this is masked because the resume state says "all parts already processed" and b.dst is never actually called for data transfer. With --table, the state wipe (Bug 2) forces actual downloads, exposing the closed connection.

Bug 2: tablePattern in resume state params triggers unnecessary state wipe

The resume state includes tablePattern in its stored params:

b.resumableState = resumable.NewState(b.GetStateDir(), backupName, "download", map[string]interface{}{
    "tablePattern": tablePattern,  // <-- causes state wipe
    "partitions":   partitions,
    "schemaOnly":   schemaOnly,
})

NewState() calls cleanupStateIfParamsChange(), which compares stored params with current params. When the user changes from no --table (full download) to --table pnlcrypto_prod.lotted_position (partial resume), the params don't match, so ALL resume progress is wiped.

This forces actual data downloads (instead of validated skips), which exposes Bug 1.

Proposed Fix

Fix 1: Save/restore b.dst around recursive call

if !schemaOnly && !b.cfg.General.DownloadByPart && remoteBackup.RequiredBackup != "" {
    savedDst := b.dst
    err := b.Download(remoteBackup.RequiredBackup, tablePattern, ...)
    b.dst = savedDst  // restore parent's connection
    if err != nil && !errors.Is(err, ErrBackupIsAlreadyExists) {
        return errors.WithMessage(err, "download RequiredBackup")
    }
}

Fix 2: Remove tablePattern from resume state params

b.resumableState = resumable.NewState(b.GetStateDir(), backupName, "download", map[string]interface{}{
    // tablePattern intentionally excluded — resume state tracks individual
    // part paths which are independent of the table filter. Including
    // tablePattern would wipe ALL resume progress whenever the user
    // changes --table.
    "partitions": partitions,
    "schemaOnly": schemaOnly,
})

Resume state is part-level (keys are individual part remote paths). The table filter only controls which parts are queued for download — it doesn't affect the validity of previously-downloaded parts. A part downloaded as part of a full backup is equally valid when resuming with --table.

Expected behavior after fix

# Full download (interrupted)
./clickhouse-backup download my_backup --resume
# Downloads 1000/1200 tables before crash

# Resume just one table
./clickhouse-backup download my_backup --resume --table pnlcrypto_prod.lotted_position
# Only processes pnlcrypto_prod.lotted_position
# Does NOT wipe state for the other 1000 tables
# Uses the same b.dst connection throughout

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions