backup: add architecture reference docs #170871
Add three reference documents to pkg/backup that describe how BACKUP and RESTORE work end-to-end:
- README.md indexes the package, provides a glossary, and points at debugging entry points and the test layout.
- backup.md describes what backup produces (units, captured state, exclude_data_from_backup, revision-history layers, encryption, PTS chaining, multi-region) and the mechanics (planning, job phases, DistSQL flow, ExportRequest knobs, on-disk layout, compaction).
- restore.md describes what restore is (units, the three modes with their compatibility constraints, partial-graph reconciliation, what is and isn't carried over, tenant restore specifics) and the mechanics in chronological order (planning, schema creation, split & scatter, above-raft ingest with the KeyRewriter elision asymmetry, below-raft link with the SyntheticPrefix/SyntheticSuffix mechanism, background download, revision-history fallback, post-restore).
The goal is that future readers (humans and AI agents) can build a working mental model of how the subsystem works without having to re-derive the architecture from scratch every time a bug demands deep context.
Epic: none
Release note: None
dt
left a comment
haven't done a first pass of restore side yet but flushing some comments on backup side
| (S3, GCS, Azure, nodelocal, etc.). | ||
| - **Above-raft** — writing data through the normal SQL → KV → raft | ||
| path, so the SST bytes traverse the raft log (e.g. `AddSSTable`). | ||
| - **Below-raft** — registering a file in the leaseholder's local |
not "leaseholder" : all replicas register the file in their lsm, nothing special about the leaseholder here.
(claude) Fixed: 'each replica's' now.
| - **Below-raft** — registering a file in the leaseholder's local | ||
| storage engine through a replicated apply-time hook | ||
| (e.g. `LinkExternalSSTable`). Still raft-replicated, but only the | ||
| file metadata traverses the raft log; the SST bytes stay external. |
I'd say the backup sst's URL is replicated instead of its (restore-processed) content.
This level of detail is starting to step on scope of/duplicate the restore doc.
(claude) Adopted: 'the backup SST's URL is replicated, not the SST's (restore-processed) contents.'
(claude) Trimmed — the Below-raft entry is now one sentence about what's replicated; the read-from-external-storage detail lives only in restore.md.
| (e.g. `LinkExternalSSTable`). Still raft-replicated, but only the | ||
| file metadata traverses the raft log; the SST bytes stay external. | ||
| - **Link** — record an external SST in Pebble's LSM without copying | ||
| its contents. Depending on restore mode, reads against the linked |
I'd just say "see "below raft" above".
(claude) Done — Link now just defers to Below-raft.
| `kv.bulk_io_write.*` govern most knobs that affect throughput and | ||
| retry behavior. | ||
| ## Tests |
cut this; it'll rot and it is easy enough to derive the current correct answer via looking at files.
| ## Package map | ||
| | File group | Subsystem | |
I'd shy away from enumerating every file -- this is likely to rot. Perhaps mention that x_planning.go tends to do planning, i.e. take the user-provided instruction (SQL stmt) and translate that to a machine-readable, persisted, resumable job; x_job.go tends to define the entry point/coordination of said job; x_processor.go is usually the def(s) of the distributed DistSQL procs used during job execution; and x_proc_planning is where the logic specific to how to place processors/distribute work is defined. Then we can name some specific files that deserve attention, like system_schema.go and show.go.
(claude) Replaced the table with a naming-convention explanation (x_planning / x_job / x_processor / x_processor_planning) plus a short list of files worth knowing about (system_schema.go, show.go, key_rewriter.go, generative_split_and_scatter_processor.go) and a one-liner pointing at the subpackages.
| honors it in two layers, belt-and-suspenders: | ||
| - **Planning** (`backup_job.go`): excluded tables' spans are | ||
| pre-marked completed so they never appear in the export plan. |
recall tenant backups being opaque? that means we don't know the excluded tables for tenants. that's why this is a span config, so the export requests to those tables elide themselves.
(claude) Added a sentence: the flag flows through SpanConfig (rather than just a planning-time filter) because tenant backups are opaque and don't enumerate per-table flags; each KV server checks its local SpanConfig and elides exports for excluded ranges itself.
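To make the server-side elision concrete, here is a minimal Go sketch of the idea discussed in this thread. It is not the real code: `spanConfig`, `rangeDesc`, and `export` are hypothetical stand-ins invented for illustration, and the actual SpanConfig and export-request types in the repository look different. The only point it shows is that the caller never needs to know which tables are excluded.

```go
package main

import "fmt"

// spanConfig is a hypothetical stand-in for the per-range configuration
// a KV server consults. Only the flag relevant to this thread is modeled.
type spanConfig struct {
	excludeDataFromBackup bool
}

// rangeDesc is a hypothetical range: a key span, its config, and its rows.
type rangeDesc struct {
	startKey, endKey string
	cfg              spanConfig
	rows             []string
}

// export mimics a server handling an export request for one range.
// The backup job does not enumerate per-table flags; the server checks
// its local config and returns nothing for excluded ranges.
func export(r rangeDesc) []string {
	if r.cfg.excludeDataFromBackup {
		return nil
	}
	return r.rows
}

func main() {
	ranges := []rangeDesc{
		{"a", "m", spanConfig{false}, []string{"a1", "b2"}},
		{"m", "z", spanConfig{true}, []string{"m1", "q7"}},
	}
	for _, r := range ranges {
		fmt.Printf("[%s,%s): exported %d rows\n", r.startKey, r.endKey, len(export(r)))
	}
}
```

Because the decision lives next to the data, an opaque tenant backup that simply requests every key span would still skip the excluded ranges under this model.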
| fast/online restore falls back to above-raft ingest for these layers. | ||
| Revision-history chains also support `RESTORE … AS OF SYSTEM TIME` to | ||
| reconstruct state at any timestamp within the chain's coverage. |
NB: revision history backups have historically been more prone to bugs as they are less tested and also since the widely used "span" for describing key ranges is typically in terms of only the user key, not its time-stamp suffixed engine key encoding so e.g. mid-row resume spans were at one point mis-represented in metadata. Their SSTs are also incompatible with fast/online restore. Future work in this area may choose to deprecate these and replace them with point-in-time normal inc backups accompanied by some sort of "revision log" that could be replayed by restore to serve the same point-in-time-recovery use-case, with lower RPO as well (since revision history backups are still periodic backups that can't reliably be run more often than every X minutes at most).
(claude) Added a paragraph covering all of this: less testing, the Span type historically being just user-key without an MVCC suffix (leading to mid-row resume-span bugs), fast/online incompatibility, and possible future deprecation in favor of point-in-time backups plus a revision log that restore replays — same use case, lower RPO.
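The user-key-only span problem can be illustrated with a small Go sketch. This is a simplified model, not the historical bug itself: `version`, `resumeByUserKey`, and `resumeAfter` are hypothetical names, timestamps are plain integers, and versions of one key are assumed to sort newest first. It shows why a resume position expressed only as a user key cannot describe "stopped partway through the versions of one row".

```go
package main

import "fmt"

// version is a hypothetical MVCC entry: a user key plus a timestamp.
// Higher ts is newer; versions of one key are listed newest first.
type version struct {
	key string
	ts  int
}

// resumeByUserKey resumes from a position that carries only a user key,
// so it returns every version whose key is >= resumeKey.
func resumeByUserKey(all []version, resumeKey string) []version {
	var out []version
	for _, v := range all {
		if v.key >= resumeKey {
			out = append(out, v)
		}
	}
	return out
}

// resumeAfter resumes strictly after the last exported (key, ts) pair,
// which is what a timestamp-suffixed position can express.
func resumeAfter(all []version, last version) []version {
	var out []version
	for _, v := range all {
		if v.key > last.key || (v.key == last.key && v.ts < last.ts) {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	all := []version{{"a", 3}, {"a", 2}, {"a", 1}, {"b", 5}}
	last := version{"a", 2} // the export stopped mid-row, after a@3 and a@2

	fmt.Println("resume at user key a:", resumeByUserKey(all, "a")) // repeats a@3, a@2
	fmt.Println("resume at user key b:", resumeByUserKey(all, "b")) // skips a@1
	fmt.Println("resume after a@2:   ", resumeAfter(all, last))
}
```

In this model neither user-key choice is right: resuming at `a` re-reads versions already exported, and resuming at `b` drops `a@1`. Only the position that includes the timestamp returns exactly the remaining versions.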
| the distributed export, writes the final manifest, and releases | ||
| PTS. | ||
| The deferred design exists so a SQL session isn't tied up for the |
no, the deferred design exists so a job that runs for some non-trivial time can reliably complete even if a client disconnects, a node restarts, etc without losing minutes or hours of work.
(claude) Rewrote: 'The deferred design exists so that a long-running backup can reliably complete across client disconnects and node restarts: state lives on the job, not in a SQL session, and a restart can resume from the last checkpoint without losing minutes or hours of work.'
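The checkpoint-and-resume behavior described here can be sketched in a few lines of Go. Everything in this block is hypothetical (`jobRecord`, `run`, and the `failAfter` knob that simulates a node restart); the real job system persists its progress differently. The sketch only shows the property the comment cares about: state lives on the job record, so a second run picks up where the first one stopped.

```go
package main

import "fmt"

// jobRecord is a hypothetical persisted job. It outlives any SQL session
// and records which spans have already been exported.
type jobRecord struct {
	spans     []string
	completed map[string]bool
}

// run exports the remaining spans, checkpointing after each one.
// failAfter simulates an interruption after that many spans have been
// processed in this run; a negative value means "never interrupt".
func run(j *jobRecord, failAfter int) (processed []string, interrupted bool) {
	for _, s := range j.spans {
		if j.completed[s] {
			continue // finished in an earlier run; skip on resume
		}
		if failAfter == 0 {
			return processed, true
		}
		// The real export work for span s would happen here.
		j.completed[s] = true // checkpoint
		processed = append(processed, s)
		failAfter--
	}
	return processed, false
}

func main() {
	j := &jobRecord{
		spans:     []string{"a", "b", "c", "d"},
		completed: map[string]bool{},
	}
	first, interrupted := run(j, 2)
	fmt.Println("first run:", first, "interrupted:", interrupted)

	second, interrupted := run(j, -1)
	fmt.Println("resumed run:", second, "interrupted:", interrupted)
}
```

The resumed run only touches the spans the first run did not checkpoint, which is the "without losing minutes or hours of work" part of the comment.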
| for how restore consumes the elided prefix to produce a synthetic | ||
| prefix on the linked file. | ||
| ### Compaction |
@kev-cao I dunno if you want to add more here -- either following up this PR after it merges, or as comments/suggestions here that I can have claude pull in before this one merges
| backup). Encryption is *not* requested via the KV API; the returned | ||
| SSTs are plaintext from KV's perspective, and the sink encrypts them. | ||
| ### Sink and on-disk layout |
I think we're missing some info here -- and perhaps above as well -- about backup collections. We've talked about how a backup is laid out but not that it is in a collection of backups and how full vs inc are kept, how compactions (below) discover inputs, etc. Maybe above in the intro we should mention something about a backup being kept in a collection of backups maintained by this cluster (usually via a schedule), with the collection organizing full and incremental backups, then maybe here we should include how schedule layout is used to do things like find latest chain or something. This is somewhat important to know when reading through backup since e.g. finding the prior backup is how planning picks the inc start time.
(claude) Added a 'Backup collections' section: collections hold one or more chains; the LATEST pointer drives discovery; planning picks an incremental's StartTime from the prior layer's EndTime; compaction reads its inputs the same way; schedules produce the structure in practice. Also added a one-sentence mention in the intro so 'collection' is introduced before it's referenced.
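As a rough illustration of the chain-discovery rule summarized above, the following Go sketch models a collection with a LATEST pointer and derives the next incremental's start time from the prior layer's end time. The types (`layer`, `collection`), the function `nextIncrementalStart`, the chain path string, and the integer timestamps are all assumptions made for the example; the real manifests and storage layout are richer than this.

```go
package main

import (
	"errors"
	"fmt"
)

// layer is a hypothetical backup layer covering (startTime, endTime].
// A full backup is modeled with startTime == 0.
type layer struct {
	startTime, endTime int64
}

// collection is a hypothetical backup collection: one or more chains,
// keyed by the path of their full backup, plus a LATEST pointer.
type collection struct {
	latest string
	chains map[string][]layer
}

// nextIncrementalStart follows LATEST to the newest chain and returns the
// end time of its last layer, which becomes the new incremental's start.
func nextIncrementalStart(c collection) (int64, error) {
	chain, ok := c.chains[c.latest]
	if !ok || len(chain) == 0 {
		return 0, errors.New("no prior layer found; an incremental needs a full backup first")
	}
	return chain[len(chain)-1].endTime, nil
}

func main() {
	c := collection{
		latest: "2024/01/02-000000.00", // illustrative path, not a real layout guarantee
		chains: map[string][]layer{
			"2024/01/01-000000.00": {{0, 100}},
			"2024/01/02-000000.00": {{0, 200}, {200, 300}},
		},
	}
	start, err := nextIncrementalStart(c)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("next incremental starts at:", start)
}
```

Under the same assumptions, compaction would find its inputs by walking the layers of the chain that LATEST points at.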
- README: replace file-by-file package map with a higher-level
naming-convention explanation plus a short list of notable files.
- README: fix below-raft glossary (file lands on every replica, not
just the leaseholder; the SST's URL is replicated, not its
contents) and simplify the "link" entry to cross-reference
below-raft.
- README: drop the test layout list; it'd rot faster than it'd help.
- backup.md / restore.md: drop "Part I/II" framing in favor of direct
headings ("Content of a backup" / "Execution").
- backup.md: rename "Units of backup" to "Backup scope" and define
"restorable unit" first so the term lands.
- backup.md: quote "cluster" and spell out that cluster backups are
user-created state plus user-expressed configuration — not a
physical cluster snapshot.
- backup.md: rewrite the tenant bullet to lead with opaqueness (every
KV in, no semantic processing) and note that tenant backups
contain more than cluster backups.
- backup.md: reframe exclude_data_from_backup's motivation around
letting high-churn data GC promptly, not around saving backup
work; explain why the flag flows through SpanConfig (so opaque
tenant backups can elide on the server side without enumerating
per-table flags).
- backup.md: drop the "belt-and-suspenders" phrase.
- backup.md: add caveats about revision-history backups being more
bug-prone, mid-row resume issues, fast/online incompatibility,
and possible future deprecation in favor of point-in-time +
revision log.
- backup.md: rewrite the deferred-design framing — the point is
that long-running jobs can ride out client disconnects and node
restarts without losing work, not just that a SQL session isn't
tied up.
- backup.md: add a "Backup collections" section explaining that
backups live in collections (one or more chains), how the LATEST
pointer drives chain discovery, how planning picks an
incremental's start time from the prior layer, and how schedules
produce the structure.
- restore.md: parallel rename / wording fixes.
Epic: none
Release note: None
Adds three reference documents to pkg/backup describing how BACKUP and RESTORE work end-to-end:
- README.md indexes the package, provides a glossary, and points at debugging entry points and the test layout.
- backup.md covers what backup produces (units, captured state, exclude_data_from_backup, revision-history layers, encryption, PTS chaining, multi-region) and the mechanics (planning, job phases, DistSQL flow, ExportRequest knobs, on-disk layout, compaction).
- restore.md covers what restore is (units, the three modes with their compatibility constraints, partial-graph reconciliation, what is and isn't carried over, tenant restore specifics) and the mechanics in chronological order (planning, schema creation, split & scatter, above-raft ingest with the KeyRewriter elision asymmetry, below-raft link with the SyntheticPrefix/SyntheticSuffix mechanism, background download, revision-history fallback, post-restore).
Goal: future readers (humans and AI agents) can build a working mental model of the subsystem without re-deriving the architecture from scratch every time a bug demands deep context.
Epic: none
Release note: None