
backup: add architecture reference docs #170871

Draft
dt wants to merge 2 commits into cockroachdb:master from dt:backup-arch-docs


@dt dt commented May 24, 2026

Adds three reference documents to pkg/backup describing how BACKUP and RESTORE work end-to-end:

  • README.md indexes the package, provides a glossary, and points at debugging entry points and the test layout.
  • backup.md covers what backup produces (units, captured state, exclude_data_from_backup, revision-history layers, encryption, PTS chaining, multi-region) and the mechanics (planning, job phases, DistSQL flow, ExportRequest knobs, on-disk layout, compaction).
  • restore.md covers what restore is (units, the three modes with their compatibility constraints, partial-graph reconciliation, what is and isn't carried over, tenant restore specifics) and the mechanics in chronological order (planning, schema creation, split & scatter, above-raft ingest with the KeyRewriter elision asymmetry, below-raft link with the SyntheticPrefix/SyntheticSuffix mechanism, background download, revision-history fallback, post-restore).

Goal: future readers (humans and AI agents) can build a working mental model of the subsystem without re-deriving the architecture from scratch every time a bug demands deep context.

Epic: none
Release note: None

Add three reference documents to pkg/backup that describe how
BACKUP and RESTORE work end-to-end:

- README.md indexes the package, provides a glossary, and points
  at debugging entry points and the test layout.
- backup.md describes what backup produces (units, captured state,
  exclude-data-from-backup, revision-history layers, encryption,
  PTS chaining, multi-region) and the mechanics (planning, job
  phases, DistSQL flow, ExportRequest knobs, on-disk layout,
  compaction).
- restore.md describes what restore is (units, the three modes
  with their compatibility constraints, partial-graph
  reconciliation, what is and isn't carried over, tenant restore
  specifics) and the mechanics in chronological order (planning,
  schema creation, split & scatter, above-raft ingest with the
  KeyRewriter elision asymmetry, below-raft link with the
  SyntheticPrefix/SyntheticSuffix mechanism, background download,
  revision-history fallback, post-restore).

The goal is that future readers (humans and AI agents) can build
a working mental model of how the subsystem works without having
to re-derive the architecture from scratch every time a bug
demands deep context.

Epic: none
Release note: None

trunk-io Bot commented May 24, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@cockroach-teamcity

This change is Reviewable

Contributor Author

@dt dt left a comment

haven't done a first pass of restore side yet but flushing some comments on backup side

Comment thread pkg/backup/README.md Outdated
(S3, GCS, Azure, nodelocal, etc.).
- **Above-raft** — writing data through the normal SQL → KV → raft
path, so the SST bytes traverse the raft log (e.g. `AddSSTable`).
- **Below-raft** — registering a file in the leaseholder's local

not "leaseholder": all replicas register the file in their LSM; there is nothing special about the leaseholder here.

(claude) Fixed: 'each replica's' now.

Comment thread pkg/backup/README.md Outdated
- **Below-raft** — registering a file in the leaseholder's local
storage engine through a replicated apply-time hook
(e.g. `LinkExternalSSTable`). Still raft-replicated, but only the
file metadata traverses the raft log; the SST bytes stay external.

I'd say the backup SST's URL is replicated instead of its (restore-processed) content.

This level of detail is starting to step on the scope of, or duplicate, the restore doc.

(claude) Adopted: 'the backup SST's URL is replicated, not the SST's (restore-processed) contents.'

(claude) Trimmed — the Below-raft entry is now one sentence about what's replicated; the read-from-external-storage detail lives only in restore.md.

Comment thread pkg/backup/README.md Outdated
(e.g. `LinkExternalSSTable`). Still raft-replicated, but only the
file metadata traverses the raft log; the SST bytes stay external.
- **Link** — record an external SST in Pebble's LSM without copying
its contents. Depending on restore mode, reads against the linked

I'd just say: see "below-raft" above.

(claude) Done — Link now just defers to Below-raft.

Comment thread pkg/backup/README.md Outdated
`kv.bulk_io_write.*` govern most knobs that affect throughput and
retry behavior.

## Tests

cut this; it'll rot, and it is easy enough to derive the current correct answer by looking at the files.

(claude) Cut.

Comment thread pkg/backup/README.md Outdated

## Package map

| File group | Subsystem |

I'd shy away from enumerating every file; this is likely to rot. Perhaps mention that x_planning.go tends to do planning, i.e. take the user-provided instruction (SQL stmt) and translate it into a machine-readable, persisted, resumable job; x_job.go tends to define the entry point and coordination of said job; x_processor.go is usually the definition(s) of the distributed DistSQL processors used during job execution; and x_proc_planning is where the logic specific to placing processors and distributing work is defined. Then we can name some specific files that deserve attention, like system_schema.go and show.go.

(claude) Replaced the table with a naming-convention explanation (x_planning / x_job / x_processor / x_processor_planning) plus a short list of files worth knowing about (system_schema.go, show.go, key_rewriter.go, generative_split_and_scatter_processor.go) and a one-liner pointing at the subpackages.

Comment thread pkg/backup/backup.md
honors it in two layers, belt-and-suspenders:

- **Planning** (`backup_job.go`): excluded tables' spans are
pre-marked completed so they never appear in the export plan.

recall tenant backups being opaque? that means we don't know the excluded tables for tenants. that's why this is a span config, so the export requests to those tables elide themselves.

(claude) Added a sentence: the flag flows through SpanConfig (rather than just a planning-time filter) because tenant backups are opaque and don't enumerate per-table flags; each KV server checks its local SpanConfig and elides exports for excluded ranges itself.
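
As a rough illustration of the server-side elision described in this thread, here is a minimal Go sketch. Every name in it (`Span`, `SpanConfig`, `Server`, `Export`) is a hypothetical stand-in invented for the example, not the actual CockroachDB types or the real `ExportRequest` path. It only shows the shape of the idea: the caller asks for every range without knowing per-table flags, and the server consults its local config and returns nothing for excluded ranges.

```go
package main

import "fmt"

// Span is a simplified [Start, End) key range (hypothetical type).
type Span struct{ Start, End string }

// SpanConfig is a pared-down stand-in for the per-range configuration a
// KV server holds locally (hypothetical type).
type SpanConfig struct{ ExcludeDataFromBackup bool }

// Server models one KV server: its local span configs and its data.
type Server struct {
	Configs map[Span]SpanConfig
	Data    map[Span][]string
}

// Export returns the keys in sp unless the server's local config marks the
// range as excluded from backup, in which case it returns nothing. The
// caller never needs to know which tables carry the flag.
func (s *Server) Export(sp Span) []string {
	if s.Configs[sp].ExcludeDataFromBackup {
		return nil
	}
	return s.Data[sp]
}

func main() {
	orders := Span{Start: "a", End: "b"}
	sessions := Span{Start: "b", End: "c"}
	srv := &Server{
		Configs: map[Span]SpanConfig{sessions: {ExcludeDataFromBackup: true}},
		Data: map[Span][]string{
			orders:   {"order-1", "order-2"},
			sessions: {"session-1"},
		},
	}
	// An opaque (tenant-style) backup requests every range blindly.
	for _, sp := range []Span{orders, sessions} {
		fmt.Printf("[%s, %s): %d keys exported\n", sp.Start, sp.End, len(srv.Export(sp)))
	}
}
```

The design point the sketch tries to capture is that the filter lives with the data rather than with the planner, so it still applies when the planner cannot see table descriptors.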

Comment thread pkg/backup/backup.md
fast/online restore falls back to above-raft ingest for these layers.

Revision-history chains also support `RESTORE … AS OF SYSTEM TIME` to
reconstruct state at any timestamp within the chain's coverage.

NB: revision-history backups have historically been more prone to bugs, both because they are less tested and because the widely used "span" for describing key ranges is typically expressed in terms of only the user key, not its timestamp-suffixed engine-key encoding; e.g. mid-row resume spans were at one point misrepresented in metadata. Their SSTs are also incompatible with fast/online restore. Future work in this area may choose to deprecate these and replace them with normal point-in-time inc backups accompanied by some sort of "revision log" that restore could replay to serve the same point-in-time-recovery use case, with lower RPO as well (since revision-history backups are still periodic backups that can't reliably be run more often than every X minutes at most).

(claude) Added a paragraph covering all of this: less testing, the Span type historically being just user-key without an MVCC suffix (leading to mid-row resume-span bugs), fast/online incompatibility, and possible future deprecation in favor of point-in-time backups plus a revision log that restore replays — same use case, lower RPO.

Comment thread pkg/backup/backup.md Outdated
the distributed export, writes the final manifest, and releases
PTS.

The deferred design exists so a SQL session isn't tied up for the

no, the deferred design exists so a job that runs for some non-trivial time can reliably complete even if a client disconnects, a node restarts, etc., without losing minutes or hours of work.

(claude) Rewrote: 'The deferred design exists so that a long-running backup can reliably complete across client disconnects and node restarts: state lives on the job, not in a SQL session, and a restart can resume from the last checkpoint without losing minutes or hours of work.'

Comment thread pkg/backup/backup.md
for how restore consumes the elided prefix to produce a synthetic
prefix on the linked file.

### Compaction

@kev-cao I dunno if you want to add more here, either as a follow-up to this PR after it merges or as comments/suggestions here that I can have claude pull in before this one merges.

Comment thread pkg/backup/backup.md
backup). Encryption is *not* requested via the KV API; the returned
SSTs are plaintext from KV's perspective, and the sink encrypts them.

### Sink and on-disk layout

I think we're missing some info here, and perhaps above as well, about backup collections. We've talked about how a backup is laid out, but not that it lives in a collection of backups, how full vs inc are kept, how compactions (below) discover inputs, etc. Maybe above in the intro we should mention something about a backup being kept in a collection of backups maintained by this cluster (usually via a schedule), with the collection organizing full and incremental backups; then maybe here we should include how schedule layout is used to do things like find the latest chain or something. This is somewhat important to know when reading through backup since e.g. finding the prior backup is how planning picks the inc start time.

(claude) Added a 'Backup collections' section: collections hold one or more chains; the LATEST pointer drives discovery; planning picks an incremental's StartTime from the prior layer's EndTime; compaction reads its inputs the same way; schedules produce the structure in practice. Also added a one-sentence mention in the intro so 'collection' is introduced before it's referenced.
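
To make the chaining rule in this thread concrete, here is a small Go sketch. It assumes a heavily simplified model: `Layer`, `Chain`, `Collection`, and `NextIncremental` are hypothetical names, timestamps are plain integers rather than HLC timestamps, and the `Latest` field merely plays the role of the LATEST pointer. It is not how the package represents any of this.

```go
package main

import (
	"errors"
	"fmt"
)

// Layer is one backup in a chain (hypothetical type). A full backup has
// StartTime 0; an incremental starts where the prior layer ended.
type Layer struct {
	StartTime int64
	EndTime   int64
}

// Chain is a full backup followed by zero or more incrementals.
type Chain []Layer

// Collection holds one or more chains; Latest names the newest one,
// standing in for the LATEST pointer.
type Collection struct {
	Chains map[string]Chain
	Latest string
}

// ErrNoFull is returned when there is no chain to append to.
var ErrNoFull = errors.New("no full backup to build on")

// NextIncremental plans the next layer of the latest chain: its StartTime
// is the prior layer's EndTime, so the chain covers time without gaps.
func (c *Collection) NextIncremental(now int64) (Layer, error) {
	chain := c.Chains[c.Latest]
	if len(chain) == 0 {
		return Layer{}, ErrNoFull
	}
	prev := chain[len(chain)-1]
	if now <= prev.EndTime {
		return Layer{}, fmt.Errorf("end time %d is not after prior end time %d", now, prev.EndTime)
	}
	return Layer{StartTime: prev.EndTime, EndTime: now}, nil
}

func main() {
	c := &Collection{
		Chains: map[string]Chain{"chain-a": {{StartTime: 0, EndTime: 100}}},
		Latest: "chain-a",
	}
	inc, err := c.NextIncremental(150)
	if err != nil {
		panic(err)
	}
	c.Chains[c.Latest] = append(c.Chains[c.Latest], inc)
	fmt.Printf("incremental covers (%d, %d]\n", inc.StartTime, inc.EndTime)
}
```

Under the same model, a compaction would discover its inputs the same way: resolve the latest chain, then read a contiguous run of its layers.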

- README: replace file-by-file package map with a higher-level
  naming-convention explanation plus a short list of notable files.
- README: fix below-raft glossary (file lands on every replica, not
  just the leaseholder; the SST's URL is replicated, not its
  contents) and simplify the "link" entry to cross-reference
  below-raft.
- README: drop the test layout list; it'd rot faster than it'd help.
- backup.md / restore.md: drop "Part I/II" framing in favor of direct
  headings ("Content of a backup" / "Execution").
- backup.md: rename "Units of backup" to "Backup scope" and define
  "restorable unit" first so the term lands.
- backup.md: quote "cluster" and spell out that cluster backups are
  user-created state plus user-expressed configuration — not a
  physical cluster snapshot.
- backup.md: rewrite the tenant bullet to lead with opaqueness (every
  KV in, no semantic processing) and note that tenant backups
  contain more than cluster backups.
- backup.md: reframe exclude_data_from_backup's motivation around
  letting high-churn data GC promptly, not around saving backup
  work; explain why the flag flows through SpanConfig (so opaque
  tenant backups can elide on the server side without enumerating
  per-table flags).
- backup.md: drop the "belt-and-suspenders" phrase.
- backup.md: add caveats about revision-history backups being more
  bug-prone, mid-row resume issues, fast/online incompatibility,
  and possible future deprecation in favor of point-in-time +
  revision log.
- backup.md: rewrite the deferred-design framing — the point is
  that long-running jobs can ride out client disconnects and node
  restarts without losing work, not just that a SQL session isn't
  tied up.
- backup.md: add a "Backup collections" section explaining that
  backups live in collections (one or more chains), how the LATEST
  pointer drives chain discovery, how planning picks an
  incremental's start time from the prior layer, and how schedules
  produce the structure.
- restore.md: parallel rename / wording fixes.

Epic: none
Release note: None