Commit b377c7e

docs: add replication production rollout plan
1 parent 931a631 commit b377c7e

1 file changed

Lines changed: 350 additions & 0 deletions

@@ -0,0 +1,350 @@
1+
# Replication Production Rollout Plan
2+
3+
Status: draft rollout plan for `feat/replication-e2e-bootstrap`
4+
5+
This document is the production rollout plan for the replication branch in `tinycloud-node`. It is an operations plan for merge, canary, registration, rollout, and rollback. It is not the protocol spec.
6+
7+
## 1. Current State and Rollout Scope
8+
9+
### Branch state
10+
11+
The branch already contains the first real replication surface in `tinycloud-node`:
12+
13+
- `/info` and `/replication/info` expose node capability advertisement.
14+
- `/replication/session/open` issues short-lived replication session tokens after sync auth.
15+
- `/replication/auth/export` and `/replication/auth/reconcile` support auth sync.
16+
- `/replication/export`, `/replication/reconcile`, and `/replication/reconcile/split` support KV replay-style repair.
17+
- `/replication/recon/export`, `/replication/recon/compare`, `/replication/recon/split`, and `/replication/recon/split/compare` provide the first Recon anti-entropy surface.
18+
- `/replication/kv/state`, `/replication/kv/state/compare`, `/replication/peer-missing/plan`, `/replication/peer-missing/apply`, and `/replication/peer-missing/quarantine` provide evidence-based KV repair and quarantine handling.
19+
- `/replication/sql/export` and `/replication/sql/reconcile` provide the current SQL replication path.
20+
- Host and replica roles are configurable with `TINYCLOUD_REPLICATION_ROLE`.
21+
- Peer serving is configurable with `TINYCLOUD_REPLICATION_PEER_SERVING`.
22+
- In TEE builds, `/attestation` and `/info.inTEE` expose attestation-related runtime signals.
23+
24+
### What the first production cut is actually rolling out
25+
26+
The first production cut should roll out:
27+
28+
- Authenticated replication sessions.
29+
- Full auth sync between known peers.
30+
- Conservative KV replication between known hosts and selected replicas.
31+
- Static bootstrap and per-host fan-out registration.
32+
- `/info` capability advertisement for routing and diagnostics.
33+
- Optional TEE attestation validation for Phala-hosted canary instances.
34+
35+
The first production cut should not rely on:
36+
37+
- `/info` as a trust root.
38+
- DHT or ambient peer discovery.
39+
- Blind prune on absence.
40+
- Automatic authority election.
41+
- Broad replica peer-serving by default.
42+
43+
### Recommended first-cut scope
44+
45+
Roll out in this order:
46+
47+
- Auth sync and KV replication first.
48+
- Host-to-host first.
49+
- Host-to-replica second.
50+
- Replica peer-serving only after the canary passes.
51+
- SQL replication only behind an explicit canary gate, even though branch support exists.
52+
53+
## 2. Merge Order and Readiness Gates
54+
55+
### Merge order
56+
57+
1. Merge `tinycloud-node` replication branch to `main` with conservative defaults.
58+
2. Merge the companion SDK and rollout automation changes after the node merge is accepted.
59+
3. Publish a dedicated Phala-ready image for the replication rollout.
60+
4. Apply canary deployment manifests and registration records.
61+
62+
### Runtime defaults at merge
63+
64+
Use safe defaults at merge time:
65+
66+
- `TINYCLOUD_REPLICATION_ROLE=host` unless the instance is explicitly a replica.
67+
- `TINYCLOUD_REPLICATION_PEER_SERVING=false` unless the instance is explicitly approved to serve peers.
68+
- `TINYCLOUD_REPLICATION_SESSION_TTL_SECS=600` unless operational tuning is justified.
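
A pre-deploy check could confirm that an instance's environment resolves to these defaults. The sketch below is a hypothetical helper, not the node's own config loader: the three variable names and the `host`/`replica` roles come from this plan, while the accepted `true`/`false` spellings and the function and class names are assumptions.

```python
import os
from dataclasses import dataclass

VALID_ROLES = {"host", "replica"}  # the two roles named in this plan


@dataclass(frozen=True)
class ReplicationSettings:
    role: str
    peer_serving: bool
    session_ttl_secs: int


def load_replication_settings(env=None) -> ReplicationSettings:
    """Resolve replication settings, falling back to the conservative defaults."""
    env = os.environ if env is None else env

    role = env.get("TINYCLOUD_REPLICATION_ROLE", "host").strip().lower()
    if role not in VALID_ROLES:
        raise ValueError(f"unexpected replication role: {role!r}")

    # Assumption: the flag is spelled "true" or "false"; anything else is rejected.
    raw_serving = env.get("TINYCLOUD_REPLICATION_PEER_SERVING", "false").strip().lower()
    if raw_serving not in {"true", "false"}:
        raise ValueError(f"unexpected peer-serving value: {raw_serving!r}")

    ttl = int(env.get("TINYCLOUD_REPLICATION_SESSION_TTL_SECS", "600"))
    if ttl <= 0:
        raise ValueError("session TTL must be a positive number of seconds")

    return ReplicationSettings(role, raw_serving == "true", ttl)
```

Calling `load_replication_settings({})` yields the merge-time defaults: role `host`, peer serving off, and a 600-second session TTL.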
69+
70+
### Readiness gates before merge
71+
72+
- Branch builds cleanly in CI for the production target image.
73+
- The replication routes are covered by real end-to-end tests, not mocks.
74+
- `/info`, `/replication/info`, `/replication/session/open`, auth sync, KV replay reconcile, and Recon compare/split flows are green in pre-merge test runs.
75+
- The Phala image is built with the intended confidential-compute feature set.
76+
- Monitoring, alerting, and log shipping are in place before the first canary instance is exposed.
77+
78+
### Readiness gates before canary
79+
80+
- A fixed bootstrap inventory exists for the canary spaces and peers.
81+
- Host delegations and sync delegations are created for every canary participant.
82+
- Attestation verification procedure is written down and tested.
83+
- Each instance has isolated storage and a unique registration record.
84+
- Rollback and drain commands are tested before customer traffic uses replication.
85+
86+
## 3. Phala Canary Topology
87+
88+
Use a small multi-instance canary, not a single-node smoke deploy.
89+
90+
### Recommended topology
91+
92+
- `tc-canary-a`: authority host for the canary spaces.
93+
- `tc-canary-b`: secondary host for the same spaces.
94+
- `tc-canary-r1`: replica with peer serving disabled.
95+
- `tc-canary-r2`: replica reserved for peer-serving canary after the first gate passes.
96+
97+
### Placement guidance
98+
99+
- Put each instance on its own Phala CVM.
100+
- Keep each instance on its own storage namespace.
101+
- Keep the authority host on the most stable CVM and do not rotate authority during the first cut.
102+
- Use the archived Phala deployment guidance as the baseline for image build, env encryption, and attestation handling.
103+
104+
### Canary phases
105+
106+
Phase 1:
107+
108+
- `tc-canary-a` and `tc-canary-b` only.
109+
- Validate host-to-host auth sync and KV replication.
110+
111+
Phase 2:
112+
113+
- Add `tc-canary-r1`.
114+
- Validate host-to-replica auth sync, replay reconcile, Recon compare, and quarantine behavior.
115+
116+
Phase 3:
117+
118+
- Add `tc-canary-r2` with peer serving enabled.
119+
- Validate replica export only after the earlier phases are stable.
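
The topology and phases above can be written down as data so that rollout tooling knows which instances belong to each phase. This is an illustrative sketch only: the instance names and phase assignments come from this plan, while the `CanaryInstance` type and its field names are hypothetical. Peer serving is left as `None` for hosts because this plan only specifies it for replicas.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanaryInstance:
    name: str
    role: str                      # "host" or "replica"
    authority: bool                # True only for the authority host
    peer_serving: Optional[bool]   # None where the plan does not specify it
    phase: int                     # canary phase in which the instance joins


CANARY = [
    CanaryInstance("tc-canary-a", "host", True, None, 1),
    CanaryInstance("tc-canary-b", "host", False, None, 1),
    CanaryInstance("tc-canary-r1", "replica", False, False, 2),
    CanaryInstance("tc-canary-r2", "replica", False, True, 3),
]


def active_instances(phase: int) -> list:
    """Names of the instances that should be running during the given phase."""
    return [i.name for i in CANARY if i.phase <= phase]


def authority_hosts() -> list:
    """The first cut expects exactly one authority host for the canary spaces."""
    return [i.name for i in CANARY if i.authority]
```

For example, `active_instances(2)` returns the two hosts plus `tc-canary-r1`, and `authority_hosts()` should always contain a single entry during the first cut.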
120+
121+
## 4. Registration Model
122+
123+
Registration must be treated as three separate layers.
124+
125+
### Infra registration
126+
127+
Infra registration records that an instance exists as deployed infrastructure.
128+
129+
It should include:
130+
131+
- Phala project or CVM identifier.
132+
- deployment channel such as `canary` or `prod`.
133+
- base URL and DNS record.
134+
- image digest.
135+
- attestation verification status.
136+
- owner and on-call metadata.
137+
138+
Infra registration does not prove a space relationship. It only proves that an operator recognizes a concrete deployed instance.
139+
140+
### Instance registration
141+
142+
Instance registration records what a particular running instance claims it can do.
143+
144+
It should include:
145+
146+
- instance ID.
147+
- base URL.
148+
- replication role configured on the instance.
149+
- whether peer serving is enabled.
150+
- whether TEE mode is expected and verified.
151+
- health and last-seen timestamps.
152+
- storage backend identifiers.
153+
154+
`/info` belongs here. It is capability advertisement only. It is useful for:
155+
156+
- confirming the node supports replication.
157+
- confirming the enabled role and peer-serving mode.
158+
- confirming the service surface and software version.
159+
160+
`/info` is not a trust root and is not enough to authorize replication for a space.
161+
162+
### Space registration
163+
164+
Space registration is the trust-bearing layer.
165+
166+
It should include:
167+
168+
- the authority host for the space.
169+
- all approved hosts for the space.
170+
- all approved replicas for the space.
171+
- the exact `tinycloud.space/host` and `tinycloud.space/sync` delegations used for those roles.
172+
- peer-serving allowance for replicas, if any.
173+
- the bootstrap host list for that space.
174+
175+
Space registration is where the rollout should rely on proof:
176+
177+
- `tinycloud.space/host` proves host authority for the space.
178+
- `tinycloud.space/sync` proves replication scope for replicas.
179+
- `/peer/generate/<space>` binds the serving node to its per-space server DID.
180+
- `/attestation` proves the runtime identity of the instance if TEE validation is required.
181+
182+
### Registration rule for first production cut
183+
184+
- Infra registration is maintained by operations.
185+
- Instance registration is maintained by deployment automation.
186+
- Space registration is maintained by explicit host and sync delegations.
187+
- Registration with hosts is per-host fan-out, not implicit cluster membership.
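
One way to keep the three layers from blurring together is to model them as separate records and let authorization decisions read only the space layer. The sketch below is hypothetical: the record types, field names, and `may_replicate` helper are illustrative, and the approved sets stand in for verified `tinycloud.space/host` and `tinycloud.space/sync` delegations, which are the real proof.

```python
from dataclasses import dataclass, field


@dataclass
class InfraRegistration:
    cvm_id: str
    channel: str            # e.g. "canary" or "prod"
    base_url: str
    image_digest: str
    attestation_verified: bool


@dataclass
class InstanceRegistration:
    instance_id: str
    base_url: str
    role: str               # self-reported, as advertised via /info
    peer_serving: bool


@dataclass
class SpaceRegistration:
    space: str
    authority_host: str
    approved_hosts: set     # instance IDs backed by a host delegation
    approved_replicas: set  # instance IDs backed by a sync delegation
    peer_serving_replicas: set = field(default_factory=set)


def may_replicate(space_reg: SpaceRegistration, instance_id: str) -> bool:
    """Decide from the space layer only; advertised capabilities are ignored."""
    return (instance_id in space_reg.approved_hosts
            or instance_id in space_reg.approved_replicas)


def replica_may_serve_peers(space_reg: SpaceRegistration, instance_id: str) -> bool:
    """A replica serves peers only if approved and explicitly allowed to."""
    return (instance_id in space_reg.approved_replicas
            and instance_id in space_reg.peer_serving_replicas)
```

Note that `may_replicate` never looks at an `InstanceRegistration`: an instance that advertises a replica role but has no entry in the space record is still refused.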
188+
189+
## 5. Bootstrap and Discovery Model
190+
191+
The first production cut should use static bootstrap and conservative discovery.
192+
193+
### First-contact bootstrap
194+
195+
For each canary space, maintain an explicit bootstrap record with:
196+
197+
- the authority host URL.
198+
- the secondary host URL, if any.
199+
- the approved replicas, if any.
200+
- the expected server DID per host if already staged.
201+
202+
### Discovery order
203+
204+
1. Start from the explicit bootstrap host list for the space.
205+
2. Call `/info` only to confirm capability advertisement and basic compatibility.
206+
3. If the rollout requires TEE assurance, validate `/attestation`.
207+
4. Resolve or stage the per-space server DID with `/peer/generate/<space>`.
208+
5. Open `/replication/session/open` using the caller’s `tinycloud.space/sync` delegation.
209+
6. Run auth sync before relying on data export from a first-contact peer.
210+
7. Fan out registration to additional hosts explicitly, per host.
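
Steps 2 through 6 of the order above can be sketched as a per-peer plan that stops at the first failure. This is a minimal illustration under stated assumptions: the route paths come from this plan, but the function names and the `execute` callback are hypothetical, and no HTTP methods, headers, or payloads are implied. Steps 1 and 7 (choosing peers from the bootstrap list and fanning out per host) happen outside this per-peer function.

```python
def first_contact_plan(base_url: str, space: str, require_tee: bool) -> list:
    """Ordered (step name, URL) pairs for first contact with one bootstrap peer."""
    steps = [("info", f"{base_url}/info")]
    if require_tee:
        steps.append(("attestation", f"{base_url}/attestation"))
    steps.append(("server-did", f"{base_url}/peer/generate/{space}"))
    steps.append(("session-open", f"{base_url}/replication/session/open"))
    # Auth sync uses the /replication/auth/* routes; the exact calls are omitted here.
    steps.append(("auth-sync", None))
    return steps


def run_first_contact(plan: list, execute) -> tuple:
    """Run steps in order; return (completed step names, failed step or None).

    `execute(name, url)` is a caller-supplied callback that performs the step
    and returns True on success.
    """
    completed = []
    for name, url in plan:
        if not execute(name, url):
            return completed, name
        completed.append(name)
    return completed, None
```

Stopping at the first failed step keeps the ordering guarantee intact: data export is never attempted against a peer whose session open or auth sync did not succeed.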
211+
212+
### Discovery rules
213+
214+
- Do not auto-discover new peers from `/info`.
215+
- Do not use replica peer-serving as a bootstrap source in the first cut.
216+
- Do not trust a host or replica for a space unless the matching host or sync delegation is already known or has just been synchronized through auth sync.
217+
218+
## 6. Storage Isolation Requirements
219+
220+
Every canary instance must have isolated mutable storage.
221+
222+
### Hard requirements
223+
224+
- No two instances may share the same SQLite database path.
225+
- No two instances may share the same local block store directory.
226+
- If Postgres is used, each instance must use a distinct database or schema.
227+
- If object storage is used, each instance must use a distinct prefix or bucket namespace.
228+
- Temporary directories, logs, and cache directories must be instance-scoped.
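
These requirements lend themselves to an automated check against the deployment inventory before any instance starts. The sketch below assumes a hypothetical inventory shape (instance name mapped to storage kind and location); the key names are illustrative and not taken from `tinycloud-node`.

```python
from collections import defaultdict


def find_storage_collisions(instances: dict) -> dict:
    """Return every (storage kind, location) pair claimed by more than one instance.

    `instances` maps an instance name to a mapping of storage kind -> location,
    for example {"sqlite_path": "/data/a/db.sqlite"}.
    """
    claimed_by = defaultdict(list)
    for name, locations in instances.items():
        for kind, location in locations.items():
            claimed_by[(kind, location)].append(name)
    return {key: sorted(names) for key, names in claimed_by.items() if len(names) > 1}
```

An empty result means no two instances declare the same location. The check only compares exact strings, so nested prefixes (one object-storage prefix inside another) would need a separate rule.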
229+
230+
### Why this matters
231+
232+
The replication rollout is testing protocol behavior between nodes. Shared storage would hide real divergence, break forensic analysis, and turn replication bugs into storage corruption bugs.
233+
234+
### Recommended storage layout for canary
235+
236+
- one Postgres database or schema per instance.
237+
- one block-storage prefix per instance.
238+
- one Phala env file per instance.
239+
- one monitoring identity per instance.
240+
241+
## 7. Canary Test Plan and Success Criteria
242+
243+
### Canary test plan
244+
245+
Run these tests in order:
246+
247+
- authenticated replication session open against every canary instance.
248+
- auth sync from authority host to secondary host.
249+
- auth sync from authority host to replica.
250+
- KV host-to-host write, reconcile, and read validation.
251+
- KV host-to-replica write, reconcile, and read validation.
252+
- KV Recon compare and split on a diverged prefix.
253+
- KV peer-missing quarantine path on a replica.
254+
- restart one non-authority instance and confirm catch-up after restart.
255+
- if SQL is in scope for the canary, run the SQL replication tests only on a dedicated canary space, and only after KV is stable.
256+
257+
### Success criteria
258+
259+
- Zero unauthorized replication exports accepted.
260+
- Zero cross-instance storage collisions.
261+
- Auth sync converges on every canary instance.
262+
- KV writes converge across the canary hosts and replicas within the expected polling window.
263+
- Recon compare returns clean match after repair for the test prefixes.
264+
- Quarantined keys remain hidden from canonical reads and visible only through the intended provisional path.
265+
- No repeated crash loop, session leak, or auth-session invalidation bug appears during a 24-hour soak.
266+
- Attestation verification succeeds for every instance that is expected to run in TEE mode.
267+
268+
### Exit criteria for widening rollout
269+
270+
- At least 24 hours of stable canary behavior.
271+
- At least one controlled restart of a host and a replica with successful recovery.
272+
- No unresolved auth, session, or storage-separation incidents.
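
The exit criteria can be expressed as an explicit gate that lists what still blocks widening. This is a sketch under assumptions: the `CanaryReport` type and its fields are hypothetical and would be filled in from monitoring and the incident log; only the thresholds come from this plan.

```python
from dataclasses import dataclass


@dataclass
class CanaryReport:
    stable_hours: float              # continuous hours of stable canary behavior
    host_restart_recovered: bool     # a controlled host restart recovered
    replica_restart_recovered: bool  # a controlled replica restart recovered
    open_incidents: int              # unresolved auth, session, or storage incidents


def widening_blockers(report: CanaryReport) -> list:
    """Return the unmet exit criteria; an empty list means widening may proceed."""
    blockers = []
    if report.stable_hours < 24:
        blockers.append("canary soak shorter than 24 hours")
    if not report.host_restart_recovered:
        blockers.append("no successful controlled host restart")
    if not report.replica_restart_recovered:
        blockers.append("no successful controlled replica restart")
    if report.open_incidents > 0:
        blockers.append("unresolved auth, session, or storage-separation incidents")
    return blockers
```

Returning the list of blockers rather than a single boolean makes the widening decision reviewable alongside the incident log.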
273+
274+
## 8. Rollback and Drain Procedure
275+
276+
Rollback must preserve trust correctness first and traffic continuity second.
277+
278+
### Immediate drain
279+
280+
1. Remove the instance from the bootstrap inventory.
281+
2. Stop issuing new space registrations to that instance.
282+
3. Set `TINYCLOUD_REPLICATION_PEER_SERVING=false` on the draining instance.
283+
4. Revoke the instance’s `tinycloud.space/sync` delegations if it should no longer replicate.
284+
5. Wait for the replication session TTL window to expire, or restart the instance, so that existing sessions are cleared.
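
The wait in the last step is simple arithmetic: sessions already issued can outlive the drain by up to one TTL. The helper below is a hypothetical sketch that assumes a session token cannot be extended past its original TTL; the 600-second default mirrors `TINYCLOUD_REPLICATION_SESSION_TTL_SECS=600`.

```python
from datetime import datetime, timedelta, timezone


def sessions_cleared_at(last_session_issued_at: datetime, ttl_secs: int = 600) -> datetime:
    """Earliest time at which every previously issued session has expired."""
    return last_session_issued_at + timedelta(seconds=ttl_secs)


def drain_complete(now: datetime, last_session_issued_at: datetime,
                   ttl_secs: int = 600) -> bool:
    """True once the TTL window after the last issued session has fully passed."""
    return now >= sessions_cleared_at(last_session_issued_at, ttl_secs)


# Example: the last session was issued at 12:00 UTC, so with the default TTL
# the drain can be considered complete from 12:10 UTC onward.
issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
cleared = sessions_cleared_at(issued)
```

If waiting out the window is not acceptable during an incident, restarting the instance is the faster alternative described in the step above.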
285+
286+
### Host rollback
287+
288+
If a non-authority host is unhealthy:
289+
290+
- drain it from bootstrap and space registration.
291+
- revoke its host delegation if it should no longer serve the space.
292+
- leave the authority host unchanged.
293+
294+
If the authority host is unhealthy:
295+
296+
- do not promote a new authority automatically in the first cut.
297+
- freeze new replication expansion.
298+
- either roll back the authority host in place or perform a controlled manual authority reassignment with new host delegations.
299+
300+
### Replica rollback
301+
302+
- revoke the replica’s `tinycloud.space/sync` delegation.
303+
- remove it from space registration and bootstrap lists.
304+
- keep its storage for forensics until the incident is closed.
305+
306+
### Data handling rule
307+
308+
- Do not delete canary instance storage during initial rollback.
309+
- Snapshot or preserve the instance state first so divergence can be inspected.
310+
311+
## 9. Open Questions and Deferred Items
312+
313+
### Open questions
314+
315+
- Whether SQL replication should be part of the first canary or delayed until the KV and auth planes soak cleanly.
316+
- Whether replica peer-serving should require extra rollout policy beyond the existing sync delegation facts.
317+
- How strict the attestation gate should be for non-TEE fallback environments.
318+
- Whether a separate canonical-host registration record is needed operationally even though `tinycloud.space/host` is the proof.
319+
320+
### Deferred items
321+
322+
- Dynamic peer discovery.
323+
- Any `/info`-driven automatic enrollment.
324+
- Automatic authority election or failover.
325+
- Merkle-proof-based auth sync.
326+
- Blind prune-on-absence semantics.
327+
- Broad production use of replica peer-serving before the host-host and host-replica lanes are stable.
328+
329+
## 10. Operational Checklist
330+
331+
Before merge:
332+
333+
- replication routes reviewed.
334+
- production image built.
335+
- dashboards and alerts ready.
336+
- bootstrap inventory format finalized.
337+
338+
Before canary:
339+
340+
- host and sync delegations created.
341+
- Phala instances deployed and attested.
342+
- storage namespaces isolated.
343+
- rollback rehearsal completed.
344+
345+
Before widening:
346+
347+
- canary soak passed.
348+
- restart recovery passed.
349+
- incident log reviewed.
350+
- authority host remained stable throughout the canary window.
