| 1 | +# Replication Production Rollout Plan |
| 2 | + |
| 3 | +Status: draft rollout plan for `feat/replication-e2e-bootstrap` |
| 4 | + |
| 5 | +This document is the production rollout plan for the replication branch in `tinycloud-node`. It is an operations plan for merge, canary, registration, rollout, and rollback. It is not the protocol spec. |
| 6 | + |
| 7 | +## 1. Current State and Rollout Scope |
| 8 | + |
| 9 | +### Branch state |
| 10 | + |
| 11 | +The branch already contains the first real replication surface in `tinycloud-node`: |
| 12 | + |
| 13 | +- `/info` and `/replication/info` expose node capability advertisement. |
| 14 | +- `/replication/session/open` issues short-lived replication session tokens after sync auth. |
| 15 | +- `/replication/auth/export` and `/replication/auth/reconcile` support auth sync. |
| 16 | +- `/replication/export`, `/replication/reconcile`, and `/replication/reconcile/split` support KV replay-style repair. |
| 17 | +- `/replication/recon/export`, `/replication/recon/compare`, `/replication/recon/split`, and `/replication/recon/split/compare` provide the first Recon anti-entropy surface. |
| 18 | +- `/replication/kv/state`, `/replication/kv/state/compare`, `/replication/peer-missing/plan`, `/replication/peer-missing/apply`, and `/replication/peer-missing/quarantine` provide evidence-based KV repair and quarantine handling. |
| 19 | +- `/replication/sql/export` and `/replication/sql/reconcile` provide the current SQL replication path. |
| 20 | +- Host and replica roles are configurable with `TINYCLOUD_REPLICATION_ROLE`. |
| 21 | +- Peer serving is configurable with `TINYCLOUD_REPLICATION_PEER_SERVING`. |
| 22 | +- In TEE builds, `/attestation` and `/info.inTEE` expose attestation-related runtime signals. |
| 23 | + |
| 24 | +### What the first production cut is actually rolling out |
| 25 | + |
| 26 | +The first production cut should roll out: |
| 27 | + |
| 28 | +- Authenticated replication sessions. |
| 29 | +- Full auth sync between known peers. |
| 30 | +- Conservative KV replication between known hosts and selected replicas. |
| 31 | +- Static bootstrap and per-host fan-out registration. |
| 32 | +- `/info` capability advertisement for routing and diagnostics. |
| 33 | +- Optional TEE attestation validation for Phala-hosted canary instances. |
| 34 | + |
| 35 | +The first production cut should not rely on: |
| 36 | + |
| 37 | +- `/info` as a trust root. |
| 38 | +- DHT or ambient peer discovery. |
| 39 | +- Blind prune on absence. |
| 40 | +- Automatic authority election. |
| 41 | +- Broad replica peer-serving by default. |
| 42 | + |
| 43 | +### Recommended first-cut scope |
| 44 | + |
| 45 | +Roll out in this order: |
| 46 | + |
| 47 | +- Auth sync and KV replication first. |
| 48 | +- Host-to-host first. |
| 49 | +- Host-to-replica second. |
| 50 | +- Replica peer-serving only after the canary passes. |
| 51 | +- SQL replication only behind an explicit canary gate, even though branch support exists. |
| 52 | + |
| 53 | +## 2. Merge Order and Readiness Gates |
| 54 | + |
| 55 | +### Merge order |
| 56 | + |
| 57 | +1. Merge `tinycloud-node` replication branch to `main` with conservative defaults. |
| 58 | +2. Merge the companion SDK and rollout automation changes after the node merge is accepted. |
| 59 | +3. Publish a dedicated Phala-ready image for the replication rollout. |
| 60 | +4. Apply canary deployment manifests and registration records. |
| 61 | + |
| 62 | +### Runtime defaults at merge |
| 63 | + |
| 64 | +Use safe defaults at merge time: |
| 65 | + |
| 66 | +- `TINYCLOUD_REPLICATION_ROLE=host` unless the instance is explicitly a replica. |
| 67 | +- `TINYCLOUD_REPLICATION_PEER_SERVING=false` unless the instance is explicitly approved to serve peers. |
| 68 | +- `TINYCLOUD_REPLICATION_SESSION_TTL_SECS=600` unless operational tuning is justified. |
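
The defaults above can be sketched as a small loader. This is a minimal illustration, not the node's actual configuration code: the three environment variable names and the values `host`, `false`, and `600` come from this plan, while `ReplicationConfig`, `load_replication_config`, and the assumption that the only accepted values are `host`/`replica` and `true`/`false` are hypothetical.

```python
import os
from dataclasses import dataclass

# Conservative defaults from the plan; the loader itself is a hypothetical sketch.
DEFAULT_ROLE = "host"
DEFAULT_PEER_SERVING = "false"
DEFAULT_SESSION_TTL_SECS = "600"


@dataclass(frozen=True)
class ReplicationConfig:
    role: str
    peer_serving: bool
    session_ttl_secs: int


def load_replication_config(env=None):
    """Resolve replication settings, falling back to the conservative defaults."""
    env = os.environ if env is None else env

    # Assumption: only "host" and "replica" are valid roles.
    role = env.get("TINYCLOUD_REPLICATION_ROLE", DEFAULT_ROLE).strip().lower()
    if role not in ("host", "replica"):
        raise ValueError(f"unsupported replication role: {role!r}")

    # Assumption: the flag is spelled "true" or "false".
    serving = env.get("TINYCLOUD_REPLICATION_PEER_SERVING", DEFAULT_PEER_SERVING)
    serving = serving.strip().lower()
    if serving not in ("true", "false"):
        raise ValueError(f"peer serving must be true or false, got {serving!r}")

    ttl = int(env.get("TINYCLOUD_REPLICATION_SESSION_TTL_SECS", DEFAULT_SESSION_TTL_SECS))
    if ttl <= 0:
        raise ValueError("session TTL must be a positive number of seconds")

    return ReplicationConfig(role=role, peer_serving=(serving == "true"), session_ttl_secs=ttl)
```

Rejecting unknown values instead of silently falling back keeps a typo in a deployment manifest from turning into an unintended role or an open peer-serving flag.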
| 69 | + |
| 70 | +### Readiness gates before merge |
| 71 | + |
| 72 | +- Branch builds cleanly in CI for the production target image. |
| 73 | +- The replication routes are covered by real end-to-end tests, not mocks. |
| 74 | +- `/info`, `/replication/info`, `/replication/session/open`, auth sync, KV replay reconcile, and Recon compare/split flows are green in pre-merge test runs. |
| 75 | +- The Phala image is built with the intended confidential-compute feature set. |
| 76 | +- Monitoring, alerting, and log shipping are in place before the first canary instance is exposed. |
| 77 | + |
| 78 | +### Readiness gates before canary |
| 79 | + |
| 80 | +- A fixed bootstrap inventory exists for the canary spaces and peers. |
| 81 | +- Host delegations and sync delegations are created for every canary participant. |
| 82 | +- The attestation verification procedure is written down and tested. |
| 83 | +- Each instance has isolated storage and a unique registration record. |
| 84 | +- Rollback and drain commands are tested before customer traffic uses replication. |
| 85 | + |
| 86 | +## 3. Phala Canary Topology |
| 87 | + |
| 88 | +Use a small multi-instance canary, not a single-node smoke deploy. |
| 89 | + |
| 90 | +### Recommended topology |
| 91 | + |
| 92 | +- `tc-canary-a`: authority host for the canary spaces. |
| 93 | +- `tc-canary-b`: secondary host for the same spaces. |
| 94 | +- `tc-canary-r1`: replica with peer serving disabled. |
| 95 | +- `tc-canary-r2`: replica reserved for peer-serving canary after the first gate passes. |
| 96 | + |
| 97 | +### Placement guidance |
| 98 | + |
| 99 | +- Put each instance on its own Phala CVM. |
| 100 | +- Keep each instance on its own storage namespace. |
| 101 | +- Keep the authority host on the most stable CVM and do not rotate authority during the first cut. |
| 102 | +- Use the archived Phala deployment guidance as the baseline for image build, env encryption, and attestation handling. |
| 103 | + |
| 104 | +### Canary phases |
| 105 | + |
| 106 | +Phase 1: |
| 107 | + |
| 108 | +- `tc-canary-a` and `tc-canary-b` only. |
| 109 | +- Validate host-to-host auth sync and KV replication. |
| 110 | + |
| 111 | +Phase 2: |
| 112 | + |
| 113 | +- Add `tc-canary-r1`. |
| 114 | +- Validate host-to-replica auth sync, replay reconcile, Recon compare, and quarantine behavior. |
| 115 | + |
| 116 | +Phase 3: |
| 117 | + |
| 118 | +- Add `tc-canary-r2` with peer serving enabled. |
| 119 | +- Validate replica export only after the earlier phases are stable. |
| 120 | + |
| 121 | +## 4. Registration Model |
| 122 | + |
| 123 | +Registration must be treated as three separate layers. |
| 124 | + |
| 125 | +### Infra registration |
| 126 | + |
| 127 | +Infra registration says an instance exists as infrastructure. |
| 128 | + |
| 129 | +It should include: |
| 130 | + |
| 131 | +- Phala project or CVM identifier. |
| 132 | +- deployment channel such as `canary` or `prod`. |
| 133 | +- base URL and DNS record. |
| 134 | +- image digest. |
| 135 | +- attestation verification status. |
| 136 | +- owner and on-call metadata. |
| 137 | + |
| 138 | +Infra registration does not prove a space relationship. It only proves that an operator recognizes a concrete deployed instance. |
| 139 | + |
| 140 | +### Instance registration |
| 141 | + |
| 142 | +Instance registration says what a particular running instance claims it can do. |
| 143 | + |
| 144 | +It should include: |
| 145 | + |
| 146 | +- instance ID. |
| 147 | +- base URL. |
| 148 | +- replication role configured on the instance. |
| 149 | +- whether peer serving is enabled. |
| 150 | +- whether TEE mode is expected and verified. |
| 151 | +- health and last-seen timestamps. |
| 152 | +- storage backend identifiers. |
| 153 | + |
| 154 | +`/info` belongs here. It is capability advertisement only. It is useful for: |
| 155 | + |
| 156 | +- confirming the node supports replication. |
| 157 | +- confirming the enabled role and peer-serving mode. |
| 158 | +- confirming the service surface and software version. |
| 159 | + |
| 160 | +`/info` is not a trust root and is not enough to authorize replication for a space. |
| 161 | + |
| 162 | +### Space registration |
| 163 | + |
| 164 | +Space registration is the trust-bearing layer. |
| 165 | + |
| 166 | +It should include: |
| 167 | + |
| 168 | +- the authority host for the space. |
| 169 | +- all approved hosts for the space. |
| 170 | +- all approved replicas for the space. |
| 171 | +- the exact `tinycloud.space/host` and `tinycloud.space/sync` delegations used for those roles. |
| 172 | +- peer-serving allowance for replicas, if any. |
| 173 | +- the bootstrap host list for that space. |
| 174 | + |
| 175 | +Space registration is where the rollout should rely on proof: |
| 176 | + |
| 177 | +- `tinycloud.space/host` proves host authority for the space. |
| 178 | +- `tinycloud.space/sync` proves replication scope for replicas. |
| 179 | +- `/peer/generate/<space>` binds the serving node to its per-space server DID. |
| 180 | +- `/attestation` proves the runtime identity of the instance if TEE validation is required. |
| 181 | + |
| 182 | +### Registration rule for first production cut |
| 183 | + |
| 184 | +- Infra registration is maintained by operations. |
| 185 | +- Instance registration is maintained by deployment automation. |
| 186 | +- Space registration is maintained by explicit host and sync delegations. |
| 187 | +- Registration with hosts is per-host fan-out, not implicit cluster membership. |
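
The separation between the instance layer and the space layer can be sketched as two record types and two checks. All names here (`InstanceRegistration`, `SpaceRegistration`, `may_replicate`, `replica_may_serve_peers`) are hypothetical, and plain membership sets stand in for the real `tinycloud.space/host` and `tinycloud.space/sync` delegation checks, which this sketch does not attempt to model.

```python
from dataclasses import dataclass, field


@dataclass
class InstanceRegistration:
    """What a running instance claims about itself (for example via /info)."""
    instance_id: str
    base_url: str
    role: str
    peer_serving: bool


@dataclass
class SpaceRegistration:
    """The trust-bearing record for one space."""
    space_id: str
    authority_host: str
    hosts: set = field(default_factory=set)
    replicas: set = field(default_factory=set)
    peer_serving_replicas: set = field(default_factory=set)


def may_replicate(space, instance):
    """Only the space record grants trust; the advertised role is never consulted."""
    return instance.instance_id in space.hosts or instance.instance_id in space.replicas


def replica_may_serve_peers(space, instance):
    """A replica serves peers only if the space allows it AND the instance enables it."""
    return (
        instance.instance_id in space.replicas
        and instance.instance_id in space.peer_serving_replicas
        and instance.peer_serving
    )
```

An instance that advertises `role="host"` but is absent from the space record is rejected by `may_replicate`, which mirrors the rule that capability advertisement is not a trust root.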
| 188 | + |
| 189 | +## 5. Bootstrap and Discovery Model |
| 190 | + |
| 191 | +The first production cut should use static bootstrap and conservative discovery. |
| 192 | + |
| 193 | +### First-contact bootstrap |
| 194 | + |
| 195 | +For each canary space, maintain an explicit bootstrap record with: |
| 196 | + |
| 197 | +- the authority host URL. |
| 198 | +- the secondary host URL, if any. |
| 199 | +- the approved replicas, if any. |
| 200 | +- the expected server DID per host if already staged. |
| 201 | + |
| 202 | +### Discovery order |
| 203 | + |
| 204 | +1. Start from the explicit bootstrap host list for the space. |
| 205 | +2. Call `/info` only to confirm capability advertisement and basic compatibility. |
| 206 | +3. If the rollout requires TEE assurance, validate `/attestation`. |
| 207 | +4. Resolve or stage the per-space server DID with `/peer/generate/<space>`. |
| 208 | +5. Open `/replication/session/open` using the caller’s `tinycloud.space/sync` delegation. |
| 209 | +6. Run auth sync before relying on data export from a first-contact peer. |
| 210 | +7. Fan out registration to additional hosts explicitly, per host. |
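
The discovery order above can be sketched as a single loop over the explicit bootstrap list. The `client` object and its methods (`supports_replication`, `verify_attestation`, `peer_generate`, `open_session`, `auth_sync`) are hypothetical wrappers around the endpoints named in the steps; their names, arguments, and return values are assumptions made for illustration only.

```python
class BootstrapError(Exception):
    """Raised when a bootstrap host fails one of the first-contact checks."""


def bootstrap_space(client, space, bootstrap_hosts, sync_delegation, require_tee=False):
    """Walk the first-contact steps for each explicitly listed host.

    Returns a mapping of host URL to (server DID, session) for hosts that passed.
    """
    sessions = {}
    # Steps 1 and 7: only hosts from the explicit list, handled one at a time.
    for host_url in bootstrap_hosts:
        # Step 2: /info confirms capability only; it grants no trust.
        if not client.supports_replication(host_url):
            raise BootstrapError(f"{host_url} does not advertise replication")
        # Step 3: attestation is checked only when the rollout requires TEE assurance.
        if require_tee and not client.verify_attestation(host_url):
            raise BootstrapError(f"{host_url} failed attestation validation")
        # Step 4: resolve the per-space server DID.
        server_did = client.peer_generate(host_url, space)
        # Step 5: open a session with the caller's sync delegation.
        session = client.open_session(host_url, sync_delegation)
        # Step 6: auth sync runs before any data export is relied on.
        client.auth_sync(host_url, session)
        sessions[host_url] = (server_did, session)
    return sessions
```

Because the loop never adds hosts that a peer reports, new peers can only enter through the bootstrap list, which matches the static-bootstrap intent of the first cut.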
| 211 | + |
| 212 | +### Discovery rules |
| 213 | + |
| 214 | +- Do not auto-discover new peers from `/info`. |
| 215 | +- Do not use replica peer-serving as a bootstrap source in the first cut. |
| 216 | +- Do not trust a host or replica for a space unless the matching host or sync delegation is already known or has just been synchronized through auth sync. |
| 217 | + |
| 218 | +## 6. Storage Isolation Requirements |
| 219 | + |
| 220 | +Every canary instance must have isolated mutable storage. |
| 221 | + |
| 222 | +### Hard requirements |
| 223 | + |
| 224 | +- No two instances may share the same SQLite database path. |
| 225 | +- No two instances may share the same local block store directory. |
| 226 | +- If Postgres is used, each instance must use a distinct database or schema. |
| 227 | +- If object storage is used, each instance must use a distinct prefix or bucket namespace. |
| 228 | +- Temporary directories, logs, and cache directories must be instance-scoped. |
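
A pre-deploy check for these requirements could look like the sketch below. The inventory shape (instance ID mapped to a dictionary of resource kind and locator string) and the function name `find_storage_collisions` are assumptions for illustration; the plan does not define an inventory format.

```python
from collections import defaultdict


def find_storage_collisions(instances):
    """Return every (resource kind, locator) claimed by more than one instance.

    `instances` maps an instance ID to a dict such as
    {"sqlite_path": "/data/a.db", "block_dir": "/data/a/blocks"}.
    """
    owners = defaultdict(list)
    for instance_id, resources in instances.items():
        for kind, locator in resources.items():
            owners[(kind, locator)].append(instance_id)
    # Keep only locators with more than one owner.
    return {key: sorted(ids) for key, ids in owners.items() if len(ids) > 1}
```

This compares locators as exact strings, so it will not catch aliasing such as symlinked directories or one object-storage prefix nested inside another; those cases still need manual review.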
| 229 | + |
| 230 | +### Why this matters |
| 231 | + |
| 232 | +The replication rollout is testing protocol behavior between nodes. Shared storage would hide real divergence, break forensic analysis, and turn replication bugs into storage corruption bugs. |
| 233 | + |
| 234 | +### Recommended storage layout for canary |
| 235 | + |
| 236 | +- one Postgres database or schema per instance. |
| 237 | +- one block-storage prefix per instance. |
| 238 | +- one Phala env file per instance. |
| 239 | +- one monitoring identity per instance. |
| 240 | + |
| 241 | +## 7. Canary Test Plan and Success Criteria |
| 242 | + |
| 243 | +### Canary test plan |
| 244 | + |
| 245 | +Run these tests in order: |
| 246 | + |
| 247 | +- auth session open against every canary instance. |
| 248 | +- auth sync from authority host to secondary host. |
| 249 | +- auth sync from authority host to replica. |
| 250 | +- KV host-to-host write, reconcile, and read validation. |
| 251 | +- KV host-to-replica write, reconcile, and read validation. |
| 252 | +- KV Recon compare and split on a diverged prefix. |
| 253 | +- KV peer-missing quarantine path on a replica. |
| 254 | +- restart one non-authority instance and confirm catch-up after restart. |
| 255 | +- if SQL is in scope for the canary, run the SQL replication tests only on a dedicated canary space, and only after KV is stable. |
| 256 | + |
| 257 | +### Success criteria |
| 258 | + |
| 259 | +- Zero unauthorized replication exports accepted. |
| 260 | +- Zero cross-instance storage collisions. |
| 261 | +- Auth sync converges on every canary instance. |
| 262 | +- KV writes converge across the canary hosts and replicas within the expected polling window. |
| 263 | +- Recon compare returns a clean match after repair for the test prefixes. |
| 264 | +- Quarantined keys remain hidden from canonical reads and visible only through the intended provisional path. |
| 265 | +- No repeated crash loop, session leak, or auth-session invalidation bug appears during a 24-hour soak. |
| 266 | +- Attestation verification succeeds for every instance that is expected to run in TEE mode. |
| 267 | + |
| 268 | +### Exit criteria for widening rollout |
| 269 | + |
| 270 | +- At least 24 hours of stable canary behavior. |
| 271 | +- At least one controlled restart of a host and a replica with successful recovery. |
| 272 | +- No unresolved auth, session, or storage-separation incidents. |
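
The exit criteria can be expressed as a gate that returns the list of blockers, so a widening decision is never a bare yes or no. `CanaryReport` and `widening_blockers` are hypothetical names; the 24-hour threshold and the three conditions come from the list above.

```python
from dataclasses import dataclass


@dataclass
class CanaryReport:
    stable_hours: float
    host_restart_recovered: bool
    replica_restart_recovered: bool
    open_incidents: int  # unresolved auth, session, or storage-separation incidents


def widening_blockers(report):
    """Return the reasons the rollout may not widen yet; an empty list means go."""
    blockers = []
    if report.stable_hours < 24:
        blockers.append("canary has been stable for less than 24 hours")
    if not (report.host_restart_recovered and report.replica_restart_recovered):
        blockers.append("controlled restart of a host and a replica not yet recovered")
    if report.open_incidents > 0:
        blockers.append("unresolved auth, session, or storage-separation incidents")
    return blockers
```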
| 273 | + |
| 274 | +## 8. Rollback and Drain Procedure |
| 275 | + |
| 276 | +Rollback must preserve trust correctness first and traffic continuity second. |
| 277 | + |
| 278 | +### Immediate drain |
| 279 | + |
| 280 | +1. Remove the instance from the bootstrap inventory. |
| 281 | +2. Stop issuing new space registrations to that instance. |
| 282 | +3. Set `TINYCLOUD_REPLICATION_PEER_SERVING=false` on the draining instance. |
| 283 | +4. Revoke the instance’s `tinycloud.space/sync` delegations if it should no longer replicate. |
| 284 | +5. Wait for the replication session TTL window (`TINYCLOUD_REPLICATION_SESSION_TTL_SECS`) to expire, or restart the instance to clear existing sessions. |
| 285 | + |
| 286 | +### Host rollback |
| 287 | + |
| 288 | +If a non-authority host is unhealthy: |
| 289 | + |
| 290 | +- drain it from bootstrap and space registration. |
| 291 | +- revoke its host delegation if it should no longer serve the space. |
| 292 | +- leave the authority host unchanged. |
| 293 | + |
| 294 | +If the authority host is unhealthy: |
| 295 | + |
| 296 | +- do not promote a new authority automatically in the first cut. |
| 297 | +- freeze new replication expansion. |
| 298 | +- either roll back the authority host in place or perform a controlled manual authority reassignment with new host delegations. |
| 299 | + |
| 300 | +### Replica rollback |
| 301 | + |
| 302 | +- revoke the replica’s `tinycloud.space/sync` delegation. |
| 303 | +- remove it from space registration and bootstrap lists. |
| 304 | +- keep its storage for forensics until the incident is closed. |
| 305 | + |
| 306 | +### Data handling rule |
| 307 | + |
| 308 | +- Do not delete canary instance storage during initial rollback. |
| 309 | +- Snapshot or preserve the instance state first so divergence can be inspected. |
| 310 | + |
| 311 | +## 9. Open Questions and Deferred Items |
| 312 | + |
| 313 | +### Open questions |
| 314 | + |
| 315 | +- Whether SQL replication should be part of the first canary or delayed until the KV and auth planes soak cleanly. |
| 316 | +- Whether replica peer-serving should require extra rollout policy beyond what the existing sync delegations already prove. |
| 317 | +- How strict the attestation gate should be for non-TEE fallback environments. |
| 318 | +- Whether a separate canonical-host registration record is needed operationally even though `tinycloud.space/host` is the proof. |
| 319 | + |
| 320 | +### Deferred items |
| 321 | + |
| 322 | +- Dynamic peer discovery. |
| 323 | +- Any `/info`-driven automatic enrollment. |
| 324 | +- Automatic authority election or failover. |
| 325 | +- Merkle-proof-based auth sync. |
| 326 | +- Blind prune-on-absence semantics. |
| 327 | +- Broad production use of replica peer-serving before the host-host and host-replica lanes are stable. |
| 328 | + |
| 329 | +## 10. Operational Checklist |
| 330 | + |
| 331 | +Before merge: |
| 332 | + |
| 333 | +- replication routes reviewed. |
| 334 | +- production image built. |
| 335 | +- dashboards and alerts ready. |
| 336 | +- bootstrap inventory format finalized. |
| 337 | + |
| 338 | +Before canary: |
| 339 | + |
| 340 | +- host and sync delegations created. |
| 341 | +- Phala instances deployed and attested. |
| 342 | +- storage namespaces isolated. |
| 343 | +- rollback rehearsal completed. |
| 344 | + |
| 345 | +Before widening: |
| 346 | + |
| 347 | +- canary soak passed. |
| 348 | +- restart recovery passed. |
| 349 | +- incident log reviewed. |
| 350 | +- authority host remained stable throughout the canary window. |