backup: add architecture reference docs by dt · Pull Request #170871 · cockroachdb/cockroach

dt · 2026-05-24T19:33:14Z

Adds three reference documents to pkg/backup describing how BACKUP and RESTORE work end-to-end:

- README.md indexes the package, provides a glossary, and points at debugging entry points and the test layout.
- backup.md covers what backup produces (units, captured state, exclude_data_from_backup, revision-history layers, encryption, PTS chaining, multi-region) and the mechanics (planning, job phases, DistSQL flow, ExportRequest knobs, on-disk layout, compaction).
- restore.md covers what restore is (units, the three modes with their compatibility constraints, partial-graph reconciliation, what is and isn't carried over, tenant restore specifics) and the mechanics in chronological order (planning, schema creation, split & scatter, above-raft ingest with the KeyRewriter elision asymmetry, below-raft link with the SyntheticPrefix/SyntheticSuffix mechanism, background download, revision-history fallback, post-restore).

Goal: future readers (humans and AI agents) can build a working mental model of the subsystem without re-deriving the architecture from scratch every time a bug demands deep context.

Epic: none
Release note: None

trunk-io · 2026-05-24T19:33:17Z

Merging to master in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

dt

haven't done a first pass of restore side yet but flushing some comments on backup side

dt · 2026-05-24T21:18:30Z

+  (S3, GCS, Azure, nodelocal, etc.).
+- **Above-raft** — writing data through the normal SQL → KV → raft
+  path, so the SST bytes traverse the raft log (e.g. `AddSSTable`).
+- **Below-raft** — registering a file in the leaseholder's local


not "leaseholder": all replicas register the file in their LSM; nothing special about the leaseholder here.

(claude) Fixed: 'each replica's' now.

dt · 2026-05-24T21:19:30Z

+- **Below-raft** — registering a file in the leaseholder's local
+  storage engine through a replicated apply-time hook
+  (e.g. `LinkExternalSSTable`). Still raft-replicated, but only the
+  file metadata traverses the raft log; the SST bytes stay external.


I'd say the backup SST's URL is replicated instead of its (restore-processed) content.

This level of detail is starting to step on the scope of (and duplicate) the restore doc.

(claude) Adopted: 'the backup SST's URL is replicated, not the SST's (restore-processed) contents.'

(claude) Trimmed — the Below-raft entry is now one sentence about what's replicated; the read-from-external-storage detail lives only in restore.md.
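
To make the glossary distinction in this thread concrete, here is a minimal Go sketch. Every type and function in it (`addSSTCmd`, `linkCmd`, `replica`, `applyAdd`, `applyLink`) and the example URL are hypothetical toys, not CockroachDB APIs. It only illustrates the two points from the review: the command is applied on every replica, and for a link the replicated payload is the backup SST's URL rather than its contents.

```go
package main

import "fmt"

// Hypothetical toy types for illustration; these are not CockroachDB APIs.

// addSSTCmd models an above-raft write: the SST bytes ride in the raft log.
type addSSTCmd struct {
	SST []byte
}

// linkCmd models a below-raft link: only the backup SST's URL is replicated.
type linkCmd struct {
	URL string
}

// replica is a toy stand-in for one replica's local storage engine.
type replica struct {
	ingestedBytes int      // bytes copied in via the replicated command
	linkedURLs    []string // external files registered without copying
}

// applyAdd applies an above-raft command on every replica and returns the
// number of payload bytes that had to be replicated.
func applyAdd(replicas []*replica, c addSSTCmd) int {
	for _, r := range replicas {
		r.ingestedBytes += len(c.SST)
	}
	return len(c.SST)
}

// applyLink applies a below-raft command on every replica (not just the
// leaseholder) and returns the number of payload bytes that were replicated.
func applyLink(replicas []*replica, c linkCmd) int {
	for _, r := range replicas {
		r.linkedURLs = append(r.linkedURLs, c.URL)
	}
	return len(c.URL)
}

func main() {
	replicas := []*replica{{}, {}, {}}
	sst := make([]byte, 1<<20) // pretend 1 MiB of restore-processed SST data

	fmt.Println("above-raft payload bytes:", applyAdd(replicas, addSSTCmd{SST: sst}))
	// The URL below is a made-up example path.
	fmt.Println("below-raft payload bytes:", applyLink(replicas, linkCmd{URL: "nodelocal://1/backup/data/1.sst"}))
	for i, r := range replicas {
		fmt.Println("replica", i, "linked files:", len(r.linkedURLs))
	}
}
```

The toy ignores everything about how the linked file is later read; per the thread, that detail belongs in restore.md.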

dt · 2026-05-24T21:20:30Z

+  (e.g. `LinkExternalSSTable`). Still raft-replicated, but only the
+  file metadata traverses the raft log; the SST bytes stay external.
+- **Link** — record an external SST in Pebble's LSM without copying
+  its contents. Depending on restore mode, reads against the linked


I'd just say: see "Below-raft" above.

(claude) Done — Link now just defers to Below-raft.

dt · 2026-05-24T21:21:41Z

+  `kv.bulk_io_write.*` govern most knobs that affect throughput and
+  retry behavior.
+
+## Tests


cut this; it'll rot and it is easy enough to derive the current correct answer via looking at files.

(claude) Cut.

dt · 2026-05-24T21:25:25Z

+
+## Package map
+
+| File group | Subsystem |


I'd shy away from enumerating every file; this is likely to rot. Perhaps mention that x_planning.go tends to do planning, i.e. take the user-provided instruction (SQL stmt) and translate it into a machine-readable, persisted, resumable job; then x_job.go tends to define the entry point/coordination of said job, with x_processor.go usually being the def(s) of the distributed DistSQL procs used during job execution, and x_proc_planning being where the logic specific to how to place processors/distribute work is defined. Then we can name some specific files that deserve attention, like system_schema.go and show.go.

(claude) Replaced the table with a naming-convention explanation (x_planning / x_job / x_processor / x_processor_planning) plus a short list of files worth knowing about (system_schema.go, show.go, key_rewriter.go, generative_split_and_scatter_processor.go) and a one-liner pointing at the subpackages.

dt · 2026-05-24T21:38:20Z

+honors it in two layers, belt-and-suspenders:
+
+- **Planning** (`backup_job.go`): excluded tables' spans are
+  pre-marked completed so they never appear in the export plan.


recall tenant backups being opaque? that means we don't know the excluded tables for tenants. that's why this is a span config, so the export requests to those tables elide themselves.

(claude) Added a sentence: the flag flows through SpanConfig (rather than just a planning-time filter) because tenant backups are opaque and don't enumerate per-table flags; each KV server checks its local SpanConfig and elides exports for excluded ranges itself.
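
The server-side elision described in this thread can be sketched roughly as follows. This is a toy model under stated assumptions: `spanConfig`, `kvRange`, and `serveExport` are hypothetical names, and the real export path is far more involved. The point it shows is that the check happens where the range is served, so a planner that cannot enumerate tables (as with opaque tenant backups) still gets excluded data elided.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical toy types for illustration; these are not CockroachDB APIs.

// spanConfig carries the per-range flag derived from exclude_data_from_backup.
type spanConfig struct {
	ExcludeDataFromBackup bool
}

// kvRange is a toy range: a key span [Start, End), its config, and its data.
type kvRange struct {
	Start, End string
	Config     spanConfig
	Data       map[string]string
}

// serveExport models the server handling an export request for one range.
// It consults its own local config and returns nothing for excluded ranges,
// without the caller having to know which tables are excluded.
func serveExport(r kvRange) []string {
	if r.Config.ExcludeDataFromBackup {
		return nil
	}
	keys := make([]string, 0, len(r.Data))
	for k := range r.Data {
		if k >= r.Start && k < r.End {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ranges := []kvRange{
		{Start: "a", End: "m", Data: map[string]string{"b": "1", "c": "2"}},
		{Start: "m", End: "z", Config: spanConfig{ExcludeDataFromBackup: true},
			Data: map[string]string{"n": "3"}},
	}
	// The planner sends the same request to every range; exclusion is
	// decided on the serving side.
	for _, r := range ranges {
		fmt.Printf("[%s,%s) exported keys: %v\n", r.Start, r.End, serveExport(r))
	}
}
```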

dt · 2026-05-24T21:45:44Z

+fast/online restore falls back to above-raft ingest for these layers.
+
+Revision-history chains also support `RESTORE … AS OF SYSTEM TIME` to
+reconstruct state at any timestamp within the chain's coverage.


NB: revision-history backups have historically been more prone to bugs. They are less tested, and the widely used "span" for describing key ranges is typically expressed in terms of the user key only, not its timestamp-suffixed engine-key encoding, so e.g. mid-row resume spans were at one point misrepresented in metadata. Their SSTs are also incompatible with fast/online restore. Future work in this area may choose to deprecate these and replace them with point-in-time normal incremental backups accompanied by some sort of "revision log" that restore could replay to serve the same point-in-time-recovery use case, with lower RPO as well (since revision-history backups are still periodic backups that can't reliably be run more often than every X minutes at most).

(claude) Added a paragraph covering all of this: less testing, the Span type historically being just user-key without an MVCC suffix (leading to mid-row resume-span bugs), fast/online incompatibility, and possible future deprecation in favor of point-in-time backups plus a revision log that restore replays — same use case, lower RPO.
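
As a rough illustration of why a user-key-only span cannot describe a mid-row position, consider the toy pagination below. All names (`version`, `resumePoint`, `page`) are hypothetical, and this is not a reconstruction of the historical bug; it only shows that when several revisions of one key exist, a resume point needs the timestamp to say which revisions were already exported.

```go
package main

import "fmt"

// Hypothetical toy types for illustration; these are not CockroachDB APIs.

// version is a toy MVCC entry: a user key plus a timestamp.
type version struct {
	Key string
	TS  int64
}

// resumePoint is where a paginated export should continue. TS == 0 means
// "start of the row", which is all a user-key-only span can express.
type resumePoint struct {
	Key string
	TS  int64
}

// page returns up to limit versions at or after from, assuming vs is sorted
// by Key ascending and then TS descending. next is nil once vs is exhausted.
func page(vs []version, from resumePoint, limit int) (out []version, next *resumePoint) {
	for _, v := range vs {
		if v.Key < from.Key {
			continue // row entirely before the resume point
		}
		if v.Key == from.Key && from.TS != 0 && v.TS > from.TS {
			continue // newer revision of the same row, already exported
		}
		if len(out) == limit {
			return out, &resumePoint{Key: v.Key, TS: v.TS}
		}
		out = append(out, v)
	}
	return out, nil
}

func main() {
	vs := []version{{"a", 3}, {"a", 2}, {"a", 1}, {"b", 1}}

	first, next := page(vs, resumePoint{}, 2)
	fmt.Println("first page:", first, "resume at:", *next)

	// Resuming with the full (key, timestamp) position continues mid-row.
	exact, _ := page(vs, *next, 2)
	fmt.Println("resume with timestamp:", exact)

	// Dropping the timestamp restarts the row and repeats exported revisions.
	lossy, _ := page(vs, resumePoint{Key: next.Key}, 2)
	fmt.Println("resume with user key only:", lossy)
}
```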

dt · 2026-05-24T21:47:49Z

+   the distributed export, writes the final manifest, and releases
+   PTS.
+
+The deferred design exists so a SQL session isn't tied up for the


no, the deferred design exists so a job that runs for some non-trivial time can reliably complete even if a client disconnects, a node restarts, etc., without losing minutes or hours of work.

(claude) Rewrote: 'The deferred design exists so that a long-running backup can reliably complete across client disconnects and node restarts: state lives on the job, not in a SQL session, and a restart can resume from the last checkpoint without losing minutes or hours of work.'
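
The resume-from-checkpoint idea in this thread can be sketched in a few lines. This is a toy under stated assumptions: `jobState`, `resume`, and `errNodeRestart` are hypothetical names, and the real jobs system persists progress quite differently. What it shows is that progress lives on the job record, so a later run skips work that was already checkpointed.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical toy types for illustration; these are not CockroachDB APIs.

// jobState is the persisted part of a job: which spans are already exported.
// It belongs to the job, not to any SQL session.
type jobState struct {
	Done map[string]bool
}

var errNodeRestart = errors.New("simulated node restart")

// resume exports every span not yet checkpointed. failAfter simulates an
// interruption after that many spans; a negative value means never fail.
func resume(state *jobState, spans []string, failAfter int) ([]string, error) {
	var exported []string
	for _, sp := range spans {
		if state.Done[sp] {
			continue // finished before the interruption; do not redo
		}
		if failAfter == 0 {
			return exported, errNodeRestart
		}
		failAfter--
		exported = append(exported, sp)
		state.Done[sp] = true // checkpoint progress on the job record
	}
	return exported, nil
}

func main() {
	state := &jobState{Done: map[string]bool{}}
	spans := []string{"a", "b", "c", "d"}

	first, err := resume(state, spans, 2)
	fmt.Println("first run exported:", first, "err:", err)

	// Any node can pick the job back up from the persisted state.
	second, err := resume(state, spans, -1)
	fmt.Println("second run exported:", second, "err:", err)
}
```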

dt · 2026-05-24T22:25:15Z

+for how restore consumes the elided prefix to produce a synthetic
+prefix on the linked file.
+
+### Compaction


@kev-cao I dunno if you want to add more here, either as a follow-up to this PR after it merges or as comments/suggestions here that I can have claude pull in before this one merges.

dt · 2026-05-24T22:29:03Z

+backup). Encryption is *not* requested via the KV API; the returned
+SSTs are plaintext from KV's perspective, and the sink encrypts them.
+
+### Sink and on-disk layout


I think we're missing some info here, and perhaps above as well, about backup collections. We've talked about how a backup is laid out, but not that it lives in a collection of backups, how full vs. inc backups are kept, how compactions (below) discover inputs, etc. Maybe above in the intro we should mention that a backup is kept in a collection of backups maintained by this cluster (usually via a schedule), with the collection organizing full and incremental backups; then maybe here we should include how the schedule layout is used to do things like find the latest chain. This is somewhat important to know when reading through backup since, e.g., finding the prior backup is how planning picks the inc start time.

(claude) Added a 'Backup collections' section: collections hold one or more chains; the LATEST pointer drives discovery; planning picks an incremental's StartTime from the prior layer's EndTime; compaction reads its inputs the same way; schedules produce the structure in practice. Also added a one-sentence mention in the intro so 'collection' is introduced before it's referenced.
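
A minimal sketch of the start-time rule summarized above, assuming a drastically simplified collection model: `layer`, `collection`, `planIncremental`, and the chain name `chain-A` are hypothetical, and timestamps are plain integers. It shows only the two steps named in the thread: follow the LATEST pointer to a chain, then take the prior layer's EndTime as the new incremental's StartTime.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical toy types for illustration; these are not CockroachDB APIs.

// layer is one backup in a chain. A full backup has StartTime == 0.
type layer struct {
	StartTime, EndTime int64
}

// collection holds one or more chains plus a LATEST pointer naming the
// chain that new incrementals should extend.
type collection struct {
	Latest string
	Chains map[string][]layer
}

// planIncremental finds the latest chain and returns the layer a new
// incremental would cover: it starts where the prior layer ended.
func planIncremental(c collection, now int64) (layer, error) {
	chain := c.Chains[c.Latest]
	if len(chain) == 0 {
		return layer{}, errors.New("no full backup to build on")
	}
	prev := chain[len(chain)-1]
	return layer{StartTime: prev.EndTime, EndTime: now}, nil
}

func main() {
	c := collection{
		Latest: "chain-A",
		Chains: map[string][]layer{
			"chain-A": {{StartTime: 0, EndTime: 100}, {StartTime: 100, EndTime: 200}},
		},
	}
	inc, err := planIncremental(c, 300)
	fmt.Println("next incremental:", inc, "err:", err)
}
```

Per the thread, compaction reads its inputs through the same discovery path, so the same lookup would serve both in this toy model.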

- README: replace file-by-file package map with a higher-level naming-convention explanation plus a short list of notable files.
- README: fix below-raft glossary (file lands on every replica, not just the leaseholder; the SST's URL is replicated, not its contents) and simplify the "link" entry to cross-reference below-raft.
- README: drop the test layout list; it'd rot faster than it'd help.
- backup.md / restore.md: drop "Part I/II" framing in favor of direct headings ("Content of a backup" / "Execution").
- backup.md: rename "Units of backup" to "Backup scope" and define "restorable unit" first so the term lands.
- backup.md: quote "cluster" and spell out that cluster backups are user-created state plus user-expressed configuration, not a physical cluster snapshot.
- backup.md: rewrite the tenant bullet to lead with opaqueness (every KV in, no semantic processing) and note that tenant backups contain more than cluster backups.
- backup.md: reframe exclude_data_from_backup's motivation around letting high-churn data GC promptly, not around saving backup work; explain why the flag flows through SpanConfig (so opaque tenant backups can elide on the server side without enumerating per-table flags).
- backup.md: drop the "belt-and-suspenders" phrase.
- backup.md: add caveats about revision-history backups being more bug-prone, mid-row resume issues, fast/online incompatibility, and possible future deprecation in favor of point-in-time + revision log.
- backup.md: rewrite the deferred-design framing: the point is that long-running jobs can ride out client disconnects and node restarts without losing work, not just that a SQL session isn't tied up.
- backup.md: add a "Backup collections" section explaining that backups live in collections (one or more chains), how the LATEST pointer drives chain discovery, how planning picks an incremental's start time from the prior layer, and how schedules produce the structure.
- restore.md: parallel rename / wording fixes.

Epic: none
Release note: None
