
[Feature] Real-world DR scenario validating #61464 — FE catalog recovery is the missing piece for disaggregated B&R #63060

@sudarshan-shapur-aera

Summary

We ran a full disaster recovery test on a production Apache Doris disaggregated cluster: we backed up FDB and brought up a restore cluster against the restored backup.
Everything at the storage layer recovered correctly. The only blocker was the FE catalog (BDBJE), which is exactly what issue #61464 and PR #61465 are implementing. We are
sharing our findings as a real-world validation of that work and to ask about the timeline.


What we did

  1. Took a continuous FDB backup (fdbbackup with 2-hour snapshot intervals) from a running production disaggregated cluster (apache/doris:fe-4.0.3, ms-4.0.3, be-4.0.3)
  2. Restored that FDB backup into a separate namespace with a fresh DorisDisaggregatedCluster pointing at the restored FDB
  3. Worked through node registration: dropped the production FE entries, registered the restore FE via the MetaService API (drop_cluster → add_cluster → add_node), and
    patched in the correct cluster_id and cloud_unique_id
  4. The restore FE started cleanly as FE_MASTER with the correct cluster_id
  5. Confirmed tablet metadata intact via fdbcli key scan — thousands of \x01\x10job/{instance_id}/tablet/... keys present
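
For anyone reproducing step 3, here is a minimal sketch of how the registration calls could be sequenced so the order is never mixed up. It only builds the requests and does not send them. The `/MetaService/http/<action>?token=...` URL shape is our assumption based on the MetaService HTTP API and should be checked against the docs for your version. `build_registration_plan`, `ms_endpoint` and the payload dictionaries are hypothetical names, and the payload contents are left to the caller on purpose.

```python
import json
from urllib.parse import urlencode

# Order taken from the issue text: drop_cluster -> add_cluster -> add_node.
STEP_ORDER = ("drop_cluster", "add_cluster", "add_node")


def build_registration_plan(ms_endpoint, token, payloads):
    """Return ordered (step, url, json_body) tuples for re-registering an FE.

    ms_endpoint: "host:port" of the MetaService (hypothetical parameter).
    payloads:    dict mapping each step name to its request body; the body
                 fields are NOT defined here and must come from the Doris docs.
    """
    missing = [step for step in STEP_ORDER if step not in payloads]
    if missing:
        raise ValueError(f"missing payloads for: {', '.join(missing)}")

    query = urlencode({"token": token})
    plan = []
    for step in STEP_ORDER:
        # Assumed URL layout; verify against your MetaService version.
        url = f"http://{ms_endpoint}/MetaService/http/{step}?{query}"
        plan.append((step, url, json.dumps(payloads[step], sort_keys=True)))
    return plan
```

Printing the plan before executing it gives a reviewable record of exactly what will be dropped and re-added on the restore cluster.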

What works

  • FDB restore: ✅ tablet metadata and instance registry all recovered correctly
  • MetaService: ✅ starts cleanly against restored FDB, returns the correct get_instance response
  • BE: ✅ connects to restore cluster, reports healthy; tablet metadata present in restored FDB
  • FE startup: ✅ no crashes, correct cluster_id, correct cloud_unique_id

What doesn't work

SHOW DATABASES returns only system databases. The production catalog is gone.

Root cause confirmed: We scanned the full FDB keyspace — only \x01\x10instance and \x01\x10job (tablet/recycle) keys exist. No FE journal, no catalog entries. The FE catalog
lives entirely in BDBJE on the local PVC. A fresh FE deployment has no way to recover it from the restored FDB.
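
The keyspace check above can be repeated with a small tally over a dumped key list. This is a sketch that assumes the keys have been exported as escaped text, one per line, in the same form quoted in this issue (for example `\x01\x10job/...`). `tally_keyspaces` is a hypothetical helper, not part of Doris or FDB tooling.

```python
import re
from collections import Counter

# Matches the leading keyspace name in an escaped key string,
# e.g. r"\x01\x10job/..." -> "job". The escaped form is an assumption
# about how the key scan output was saved.
_PREFIX = re.compile(r"^\\x01\\x10([A-Za-z_]+)")


def tally_keyspaces(escaped_keys):
    """Count keys per top-level keyspace name."""
    counts = Counter()
    for key in escaped_keys:
        match = _PREFIX.match(key.strip())
        counts[match.group(1) if match else "<unrecognized>"] += 1
    return counts
```

In our case a tally like this would show only `instance` and `job`, which is how we concluded that no FE journal or catalog data is stored in FDB.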


The PITR consistency problem

FDB backup is continuous and supports point-in-time recovery to any arbitrary version. But BDBJE has no equivalent: it can only be captured as a filesystem (PVC) snapshot. This means:

  • You cannot align a PVC snapshot of BDBJE with an arbitrary FDB restore point
  • Any FDB PITR restore between two FE snapshots produces an inconsistent state — FDB has tablets for tables the FE doesn't know about, or vice versa
  • FDB's PITR capability is effectively wasted until FE catalog is part of the same snapshot
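
To make the gap concrete, here is a small sketch that, for a chosen FDB restore time, finds the most recent FE snapshot at or before it and reports how far the FE catalog would lag behind. The timestamps are made up for illustration, and `nearest_fe_snapshot` is a hypothetical helper. It assumes both the FDB restore point and the PVC snapshots can be expressed as comparable wall-clock times.

```python
from bisect import bisect_right
from datetime import datetime


def nearest_fe_snapshot(fe_snapshot_times, fdb_restore_time):
    """Return (snapshot_time, gap) for the latest FE snapshot that is not
    newer than the FDB restore point, or (None, None) if none exists."""
    times = sorted(fe_snapshot_times)
    idx = bisect_right(times, fdb_restore_time)
    if idx == 0:
        return None, None
    snapshot = times[idx - 1]
    return snapshot, fdb_restore_time - snapshot


# Illustrative values only: two PVC snapshots and a restore point between them.
snapshots = [datetime(2025, 1, 1, 0, 0), datetime(2025, 1, 1, 6, 0)]
snapshot, gap = nearest_fe_snapshot(snapshots, datetime(2025, 1, 1, 4, 30))
print(snapshot, gap)
```

Any DDL executed during that gap exists in FDB but not in the restored FE catalog, which is the inconsistency described above.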

Connection to #61464 / PR #61465

This is exactly the problem the DorisCloudSnapshotHandler + DorisSnapshotManager work is designed to solve — specifically the "point-in-time FE metadata restoration" and "large
FE image handling" requirements called out in #61464.

Our DR test is a concrete real-world scenario that validates why that feature matters. Once the FE image is part of the cluster snapshot, all three layers (FE catalog + FDB
metadata + object storage) can be captured consistently, making true PITR DR possible.


Ask

  1. Is there a target release for #61464 (Doris Cluster Snapshot Backup) / PR #61465?
  2. Is there any supported workaround for FE catalog recovery in disaggregated mode today while this feature is in development?
  3. Any guidance on what a production DR runbook should look like until this lands?

Environment

  • Doris: 4.0.3 (fe, ms, be)
  • FDB: 7.1.38
  • Operator: DorisDisaggregatedCluster CRD
  • Kubernetes: Azure AKS

Labels: kind/feature
