Summary
We ran a full disaster recovery test on a production Apache Doris disaggregated cluster: we backed up FDB and brought up a restore cluster against it. Everything at the storage layer recovered correctly. The only blocker was the FE catalog (BDBJE), which is exactly what issue #61464 and PR #61465 are implementing. We are sharing our findings as a real-world validation of that work and to ask about the timeline.
What we did
Took a continuous FDB backup (fdbbackup with 2-hour snapshot intervals) from a running production disaggregated cluster (apache/doris:fe-4.0.3, ms-4.0.3, be-4.0.3)
Restored that FDB backup into a separate namespace with a fresh DorisDisaggregatedCluster pointing at the restored FDB
Worked through node registration: dropped the production FE entries, registered the restore FE via the MetaService API (drop_cluster → add_cluster → add_node), and patched in the correct cluster_id and cloud_unique_id
The restore FE started cleanly as FE_MASTER with the correct cluster_id
Confirmed tablet metadata was intact via an fdbcli key scan: thousands of \x01\x10job/{instance_id}/tablet/... keys present
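For readers who want to reproduce the registration step, here is a minimal sketch of the drop_cluster → add_cluster → add_node ordering. It only builds the requests and sends nothing. The `/MetaService/http/<op>?token=...` path and the `instance_id` / `cluster` / `nodes` body fields are our assumptions about the MetaService HTTP API and should be checked against the docs for your version; `Step` and `registration_plan` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class Step:
    op: str     # MetaService operation name
    url: str    # full request URL (path layout is an assumption)
    body: dict  # JSON body to send (field names are assumptions)


def registration_plan(base_url, token, instance_id, old_cluster, new_cluster, new_nodes):
    """Return the ordered requests: drop the old cluster, add the new one, add its nodes."""

    def step(op, cluster):
        url = f"{base_url}/MetaService/http/{op}?{urlencode({'token': token})}"
        return Step(op, url, {"instance_id": instance_id, "cluster": cluster})

    return [
        step("drop_cluster", dict(old_cluster)),
        step("add_cluster", dict(new_cluster)),
        # Nodes are attached in a separate call, mirroring the order we used.
        step("add_node", {**new_cluster, "nodes": list(new_nodes)}),
    ]
```

Keeping the plan as data makes it easy to review the exact cluster_id and cloud_unique_id values before anything is sent to the restored MetaService.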
What works
FDB restore: ✅ tablet metadata, instance registry all recovered correctly
MetaService: ✅ starts cleanly against restored FDB, serves correct get_instance
BE: ✅ connects to restore cluster, reports healthy; tablet metadata present in restored FDB
FE startup: ✅ no crashes, correct cluster_id, correct cloud_unique_id
What doesn't work
SHOW DATABASES returns only system databases. The production catalog is gone.
Root cause confirmed: we scanned the full FDB keyspace, and only \x01\x10instance and \x01\x10job (tablet/recycle) keys exist. There is no FE journal and there are no catalog entries. The FE catalog lives entirely in BDBJE on the local PVC, so a fresh FE deployment has no way to recover it from the restored FDB.
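The keyspace check above can be sketched as a small tally over scanned keys. This assumes the keys have already been dumped (for example from an fdbcli range scan) into a list of byte strings, and that the key-space name is the run of lowercase letters, digits, and underscores following the `\x01\x10` prefix, as in the keys we observed. `count_key_spaces` is a hypothetical helper, not part of Doris or FoundationDB.

```python
from collections import Counter

PREFIX = b"\x01\x10"
NAME_BYTES = set(b"abcdefghijklmnopqrstuvwxyz0123456789_")


def count_key_spaces(keys):
    """Count keys per key-space name (e.g. 'instance', 'job')."""
    counts = Counter()
    for key in keys:
        if not key.startswith(PREFIX):
            counts["<other>"] += 1
            continue
        name = bytearray()
        for b in key[len(PREFIX):]:
            if b not in NAME_BYTES:
                break  # the name ends at the first separator byte
            name.append(b)
        counts[name.decode("ascii")] += 1
    return counts
```

In our scan, a tally like this would show only `instance` and `job` entries, which is how we concluded that no FE journal or catalog data lives in FDB.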
The PITR consistency problem
FDB backup is continuous and supports point-in-time recovery to any arbitrary version. BDBJE has no equivalent: it can only be captured as a filesystem snapshot. This means:
You cannot align a PVC snapshot of BDBJE with an arbitrary FDB restore point
Any FDB PITR restore between two FE snapshots produces an inconsistent state: FDB has tablets for tables the FE doesn't know about, or vice versa
FDB's PITR capability is effectively wasted until the FE catalog is part of the same snapshot
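To make the gap concrete, the sketch below picks the most recent FE snapshot at or before a chosen FDB restore point and reports how far apart the two are. It is a simplification: it uses plain timestamps in seconds rather than FDB versions, and `nearest_fe_snapshot` is a hypothetical helper. Any non-zero gap is a window in which the FE catalog and FDB metadata can disagree.

```python
from bisect import bisect_right


def nearest_fe_snapshot(fe_snapshot_times, fdb_restore_time):
    """Return (snapshot_time, gap_seconds) for the latest FE snapshot
    taken at or before the FDB restore point, or None if there is none."""
    times = sorted(fe_snapshot_times)
    i = bisect_right(times, fdb_restore_time)
    if i == 0:
        return None  # no FE snapshot precedes the restore point
    snapshot = times[i - 1]
    return snapshot, fdb_restore_time - snapshot
```

For example, with FE snapshots every two hours (0, 7200, 14400) and an FDB restore point at 9000, the closest usable FE snapshot is 7200, leaving 1800 seconds of catalog changes that the restored FE would not know about.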
Connection to #61464 / PR #61465
This is exactly the problem the DorisCloudSnapshotHandler + DorisSnapshotManager work is designed to solve, specifically the "point-in-time FE metadata restoration" and "large FE image handling" requirements called out in #61464.
Our DR test is a concrete real-world scenario that validates why that feature matters. Once the FE image is part of the cluster snapshot, all three layers (FE catalog + FDB metadata + object storage) can be captured consistently, making true PITR DR possible.
Ask
Environment