[Feature] Real-world DR scenario validating #61464 — FE catalog recovery is the missing piece for disaggregated B&R

  ---
  Summary

  We ran a full disaster recovery test on a production Apache Doris disaggregated cluster: we backed up FDB, restored it, and brought up a restore cluster against the
  restored copy. Everything at the storage layer recovered correctly. The only blocker was the FE catalog (BDBJE), which is exactly what issue #61464 and PR #61465 address.
  We are sharing our findings as real-world validation of that work and to ask about the timeline.

  ---
  What we did

  1. Took a continuous FDB backup (fdbbackup with 2-hour snapshot intervals) from a running production disaggregated cluster (apache/doris:fe-4.0.3, ms-4.0.3, be-4.0.3)
  2. Restored that FDB backup into a separate namespace with a fresh DorisDisaggregatedCluster pointing at the restored FDB
  3. Worked through node registration: dropped the production FE entries, registered the restore FE via the MetaService API (drop_cluster → add_cluster → add_node), then
  patched in the correct cluster_id and cloud_unique_id
  4. The restore FE started cleanly as FE_MASTER with the correct cluster_id
  5. Confirmed tablet metadata intact via fdbcli key scan — thousands of \x01\x10job/{instance_id}/tablet/... keys present
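
  The registration sequence in step 3 can be sketched as an ordered list of MetaService calls. This is a minimal sketch, not the exact requests we sent: the endpoint names (drop_cluster, add_cluster, add_node) come from the steps above, while the payload field names, the "SQL" cluster type, the default edit log port, and the `registration_plan` helper itself are assumptions for illustration and should be checked against the MetaService HTTP API documentation for your Doris version. The sketch only builds the payloads; it does not send anything.

```python
import json


def registration_plan(instance_id, cluster_name, cluster_id,
                      cloud_unique_id, fe_ip, edit_log_port=9010):
    """Return ordered (endpoint, payload) pairs for re-registering a restore FE.

    All payload field names are assumptions, not a confirmed API contract.
    """
    cluster_ref = {"cluster_name": cluster_name, "cluster_id": cluster_id}
    fe_node = {
        "cloud_unique_id": cloud_unique_id,
        "ip": fe_ip,
        "edit_log_port": edit_log_port,
        "node_type": "FE_MASTER",
    }
    return [
        # 1. Remove the FE cluster entry inherited from production.
        ("drop_cluster", {"instance_id": instance_id,
                          "cluster": dict(cluster_ref)}),
        # 2. Re-create the FE cluster entry, initially without nodes.
        ("add_cluster", {"instance_id": instance_id,
                         "cluster": {**cluster_ref, "type": "SQL", "nodes": []}}),
        # 3. Register the restore FE as the master node.
        ("add_node", {"instance_id": instance_id,
                      "cluster": {**cluster_ref, "type": "SQL",
                                  "nodes": [fe_node]}}),
    ]


if __name__ == "__main__":
    # Placeholder values; substitute the identifiers of the restored instance.
    plan = registration_plan("my_instance", "fe_cluster_name", "fe_cluster_id",
                             "restore_fe_unique_id", "10.0.0.10")
    for endpoint, payload in plan:
        print(endpoint, json.dumps(payload))
```

  Keeping the plan as data makes it easy to review the order and the identifiers before any request touches the restored MetaService.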

  ---
  What works

  - FDB restore: ✅ tablet metadata and the instance registry both recovered correctly
  - MetaService: ✅ starts cleanly against restored FDB, serves correct get_instance
  - BE: ✅ connects to restore cluster, reports healthy; tablet metadata present in restored FDB
  - FE startup: ✅ no crashes, correct cluster_id, correct cloud_unique_id

  ---
  What doesn't work

  SHOW DATABASES returns only system databases. The production catalog is gone.

  Root cause confirmed: We scanned the full FDB keyspace — only \x01\x10instance and \x01\x10job (tablet/recycle) keys exist. No FE journal, no catalog entries. The FE catalog
  lives entirely in BDBJE on the local PVC. A fresh FE deployment has no way to recover it from the restored FDB.
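
  The keyspace check can be reproduced with a small tally over the scanned keys. This is a minimal sketch under stated assumptions: the keys are already available as raw bytes (for example, decoded from an fdbcli range scan), each MetaService key starts with the \x01\x10 prefix shown above, and the key type is the run of letters that follows the prefix. The `tally_key_types` helper is hypothetical and not part of any Doris or FDB tool.

```python
import re
from collections import Counter

# Prefix observed on the MetaService keys in our scan.
KEYSPACE_PREFIX = b"\x01\x10"
# Key type = leading run of letters/underscores after the prefix (assumption).
KEY_TYPE_RE = re.compile(rb"[A-Za-z_]+")


def tally_key_types(keys):
    """Count keys by top-level type; keys that do not match go to '<other>'."""
    counts = Counter()
    for key in keys:
        if not key.startswith(KEYSPACE_PREFIX):
            counts["<other>"] += 1
            continue
        match = KEY_TYPE_RE.match(key[len(KEYSPACE_PREFIX):])
        counts[match.group().decode("ascii") if match else "<other>"] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample keys for illustration only.
    sample = [
        b"\x01\x10instance/abc",
        b"\x01\x10job/abc/tablet/1",
        b"\x01\x10job/abc/recycle/2",
    ]
    for key_type, count in sorted(tally_key_types(sample).items()):
        print(key_type, count)
```

  In our scan, a tally like this would show only the instance and job types, with no type that could hold an FE journal or catalog.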

  ---
  The PITR consistency problem

  FDB backup is continuous and supports point-in-time recovery to any version within the backup's restorable range. BDBJE has no equivalent: it can only be captured as a filesystem (PVC) snapshot. This means:

  - You cannot align a PVC snapshot of BDBJE with an arbitrary FDB restore point
  - Any FDB PITR restore between two FE snapshots produces an inconsistent state — FDB has tablets for tables the FE doesn't know about, or vice versa
  - FDB's PITR capability is effectively wasted until FE catalog is part of the same snapshot
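
  The mismatch can be made concrete by measuring how far an FDB restore point sits from the nearest earlier FE snapshot. This is an illustrative sketch only: the `nearest_fe_snapshot` helper and the timestamps are hypothetical, and wall-clock time stands in for FDB versions. Any catalog change made inside the returned gap exists in the restored FDB but not in the FE snapshot.

```python
from bisect import bisect_right
from datetime import datetime


def nearest_fe_snapshot(fe_snapshots, fdb_restore_point):
    """Return (snapshot_time, gap) for the latest FE snapshot taken at or
    before the FDB restore point, or None if no such snapshot exists."""
    ordered = sorted(fe_snapshots)
    idx = bisect_right(ordered, fdb_restore_point)
    if idx == 0:
        return None  # every FE snapshot is newer than the restore point
    snapshot = ordered[idx - 1]
    return snapshot, fdb_restore_point - snapshot


if __name__ == "__main__":
    # Hypothetical PVC snapshot times and FDB restore target.
    snapshots = [datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 6, 0)]
    target = datetime(2024, 1, 1, 8, 30)
    snapshot, gap = nearest_fe_snapshot(snapshots, target)
    print(f"closest FE snapshot: {snapshot}, uncovered window: {gap}")
```

  Only a gap of zero, or a gap with no catalog changes in it, gives a consistent pair. A continuous FDB backup cannot guarantee either on its own.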

  ---
  Connection to #61464 / PR #61465

  This is exactly the problem the DorisCloudSnapshotHandler + DorisSnapshotManager work is designed to solve — specifically the "point-in-time FE metadata restoration" and "large
  FE image handling" requirements called out in #61464.

  Our DR test is a concrete real-world scenario that validates why that feature matters. Once the FE image is part of the cluster snapshot, all three layers (FE catalog + FDB
  metadata + object storage) can be captured consistently, making true PITR DR possible.

  ---
  Ask

  1. Is there a target release for #61464 / PR #61465?
  2. Is there any supported workaround for FE catalog recovery in disaggregated mode today while this feature is in development?
  3. Any guidance on what a production DR runbook should look like until this lands?

  ---
  Environment
  - Doris: 4.0.3 (fe, ms, be)
  - FDB: 7.1.38
  - Operator: DorisDisaggregatedCluster CRD
  - Kubernetes: Azure AKS

