113 changes: 113 additions & 0 deletions docs/manuals/spaces/concepts/resilience-by-deployment-mode.md
@@ -0,0 +1,113 @@
---
title: Resilience Responsibilities by Deployment Mode
sidebar_position: 20
description: Understand what Upbound manages and what you are responsible for across Cloud, Dedicated, Managed, and Self-Hosted Spaces.
---

Upbound offers four deployment modes for Spaces, each with a different distribution of operational responsibilities between Upbound and the customer. Understanding these responsibilities is the first step toward designing a resilience architecture that matches your requirements.

This page compares the four deployment modes across the dimensions that matter most for resilience planning: infrastructure management, high availability configuration, disaster recovery capabilities, data residency, and support boundaries.

## Deployment modes at a glance

| Deployment mode | Hosted by | Managed by |
|---|---|---|
| **Cloud Spaces** | Upbound | Upbound |
| **Dedicated Spaces** | Upbound | Upbound |
| **Managed Spaces** | Customer | Upbound |
| **Self-Hosted Spaces** | Customer | Customer |

For a full description of each mode, see [Deployment Modes][deployment-modes].

## Infrastructure management

| Responsibility | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| Kubernetes cluster provisioning | Upbound | Upbound | Upbound | Customer |
| Node pool sizing and scaling | Upbound | Upbound | Upbound | Customer |
| Kubernetes upgrades | Upbound | Upbound | Upbound | Customer |
| etcd management | Upbound | Upbound | Upbound | Customer |
| Spaces software installation | Upbound | Upbound | Upbound | Customer |
| Spaces software upgrades | Upbound | Upbound | Upbound | Customer |
| TLS certificate rotation | Upbound | Upbound | Upbound | Customer |
| Ingress and load balancer configuration | Upbound | Upbound | Upbound | Customer |

## High availability

| Capability | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| Multi-zone control plane scheduling | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured |
| Spaces router (Envoy) HA | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [configure-ha][configure-ha] |
| Spaces controller HA | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [configure-ha][configure-ha] |
| etcd quorum (3-node) | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [scaling-resources][scaling-resources] |
| Horizontal Pod Autoscaler for router | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [configure-ha][configure-ha] |
| PostgreSQL for Query API | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [configure-ha][configure-ha] |
| Node anti-affinity for critical pods | Upbound-managed | Upbound-managed | Upbound-managed | Customer-configured via [configure-ha][configure-ha] |

## Disaster recovery capabilities

| Capability | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| **Space Backups** (`SpaceBackupConfig`, `SpaceBackupSchedule`, `SpaceBackup`) | Not accessible to users | Not accessible to users | Available — Space admin manages | Available — Space admin manages |
| **Shared Backups** (`SharedBackupConfig`, `SharedBackupSchedule`, `SharedBackup`) | Available | Available | Available | Available |
| Self-service restore from Space Backup | Not available | Not available | Available | Available |
| Self-service restore from Shared Backup | Available | Available | Available | Available |
| Restore to a different cluster or region | Not applicable | Not applicable | Customer-managed (new cluster required) | Customer-managed |
| Warm standby control planes (ObserveOnly pattern) | Customer-configured | Customer-configured | Customer-configured | Customer-configured |

:::info
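Space Backups cover the entire Space, including all groups and control planes. Shared Backups cover individual control planes within a group. For most multi-tenant or production workloads on Managed or Self-Hosted Spaces, configure both. On Cloud and Dedicated Spaces, where Space Backups are not accessible to users, configure Shared Backups.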
:::

## Plan requirements

Some disaster recovery capabilities require specific Upbound plan tiers.

| Capability | Required plan |
|---|---|
| Shared Backups | Enterprise |
| Space Backups (Managed and Self-Hosted) | Enterprise |
| Dedicated Spaces | Enterprise |

Every page in this resilience guide that covers a plan-restricted feature states the requirement at the top of the page.

## Data residency

| Dimension | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| Control plane data location | Upbound-chosen region | Upbound-chosen region | Customer's cloud account, customer-chosen region | Customer's cluster, customer-chosen location |
| Backup storage location | Customer-configured object storage | Customer-configured object storage | Customer-configured object storage | Customer-configured object storage |
| etcd data location | Upbound-managed | Upbound-managed | Customer's cloud account | Customer's cluster |
| Network traffic path | Through Upbound infrastructure | Through Upbound infrastructure | Customer's network, to Upbound Console | Customer's network, to Upbound Console |

:::tip
For workloads with strict data residency requirements (GDPR, FedRAMP, financial services regulations), Managed Spaces or Self-Hosted Spaces give you direct control over where compute and storage resources reside.
:::

## Resilience responsibilities summary

The table below shows the overall split of resilience responsibility per deployment mode:

| Area | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| Infrastructure HA | Upbound | Upbound | Upbound | **Customer** |
| Space-level DR | Upbound (not exposed) | Upbound (not exposed) | **Shared** | **Customer** |
| Control plane DR | **Customer** | **Customer** | **Customer** | **Customer** |
| Observability setup | **Customer** | **Customer** | **Customer** | **Customer** |
| Alert response | **Customer** | **Customer** | **Customer** | **Customer** |

The key takeaway is that **every deployment mode** requires customers to take ownership of control plane-level disaster recovery, observability setup, and alert response. Only responsibility for the underlying Space infrastructure and for Space-level DR varies by mode.

## Next steps

- Understand recovery objectives for your workloads: [Designing for RTO and RPO in Upbound][rto-rpo]
- Configure Shared Backups for control plane-level DR: [Backup and Restore][backup-and-restore]
- Configure Space Backups for full Space DR (Self-Hosted and Managed): [Disaster Recovery][dr]
- Configure HA for a Self-Hosted Space: [Production Scaling and High Availability][configure-ha]

[deployment-modes]: /manuals/spaces/concepts/deployment-modes
[configure-ha]: /manuals/spaces/howtos/self-hosted/configure-ha
[scaling-resources]: /manuals/spaces/howtos/self-hosted/scaling-resources
[backup-and-restore]: /manuals/spaces/howtos/backup-and-restore
[dr]: /manuals/spaces/howtos/self-hosted/dr
[rto-rpo]: /manuals/spaces/concepts/rto-rpo
143 changes: 143 additions & 0 deletions docs/manuals/spaces/concepts/rto-rpo.md
@@ -0,0 +1,143 @@
---
title: Designing for RTO and RPO in Upbound
sidebar_position: 30
description: Understand recovery time and recovery point objectives in the context of Upbound control planes and choose the right resilience architecture for your requirements.
---

Before choosing a disaster recovery architecture for Upbound, you need to understand what you're recovering from, how quickly you need to recover, and how much data loss is acceptable. This page explains the concepts of RTO and RPO in the context of Upbound control planes and maps them to the architectural patterns available.

## RTO and RPO defined

**Recovery Time Objective (RTO)** is the maximum acceptable time from a failure event to the moment your control plane is operational again. An RTO of 30 minutes means your organization can tolerate up to 30 minutes of downtime before it becomes a business problem.

**Recovery Point Objective (RPO)** is the maximum acceptable amount of data loss, measured in time. An RPO of 4 hours means you accept that up to 4 hours of changes to your control plane state could be lost in a recovery scenario.

In the context of Upbound:

- **RTO** governs how quickly a control plane returns to a state where it can reconcile resources — both reading current cloud state and making changes.
- **RPO** governs how much Crossplane resource state (XRs, claims, managed resources, ProviderConfigs, compositions, packages) could be lost or rolled back to an older version.
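
To make the two definitions concrete, the following minimal Python sketch measures the achieved RPO and RTO of a single incident and compares them with the example targets used above (4 hours and 30 minutes). The timestamps and the helper names `achieved_rpo` and `achieved_rto` are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Window of changes at risk: everything after the last restorable backup."""
    return failure - last_good_backup


def achieved_rto(failure: datetime, operational_again: datetime) -> timedelta:
    """Downtime: from the failure until the control plane reconciles again."""
    return operational_again - failure


# Hypothetical incident timeline.
last_backup = datetime(2025, 1, 10, 8, 0)
failure = datetime(2025, 1, 10, 11, 30)
recovered = datetime(2025, 1, 10, 11, 55)

rpo = achieved_rpo(last_backup, failure)   # 3 h 30 min of changes at risk
rto = achieved_rto(failure, recovered)     # 25 min of downtime

print("RPO met:", rpo <= timedelta(hours=4))     # RPO met: True
print("RTO met:", rto <= timedelta(minutes=30))  # RTO met: True
```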

## What gets lost in a disaster?

When a control plane fails and you need to recover, the state at risk is:

- **Crossplane resource state in etcd** — the desired-state objects that Crossplane is reconciling. If the backup is from 6 hours ago, any resources created or modified in the last 6 hours may not be in the restore.
- **Provider configuration** — ProviderConfigs, credentials, package revisions installed after the last backup.
- **Composition and function revisions** — if you update a composition between backups, the older version is what restores.

What is generally **not** lost regardless of your Upbound DR posture:

- **Actual cloud resources** (S3 buckets, VMs, databases) — these exist independently of the control plane. After restore, Crossplane will reconcile the restored desired state against the actual cloud state. Resources that exist in the cloud but not in the backup can be imported or cleaned up and replaced.
- **Git-managed configuration** — if you use GitOps to drive your control plane, re-applying your Git repository to a restored control plane brings it back to the desired state quickly, reducing the effective impact of RPO.

## Recovery architectures and their RTO/RPO

Upbound supports three primary DR and HA patterns. Each has a different RTO and RPO profile.

### Backup and restore (Shared Backups and Space Backups)

In this pattern, you configure scheduled backups of control plane state using `SharedBackupSchedule` (control plane level) or `SpaceBackupSchedule` (full Space). When a failure occurs, you restore from the most recent backup to either the existing Space or a new one.

| Metric | Typical range | Notes |
|---|---|---|
| **RPO** | Equal to backup interval | Hourly backups → up to 1 hour of data loss. Daily backups → up to 24 hours. |
| **RTO** | 30–90 minutes | Includes time to provision a new cluster (if needed), install Spaces, run the restore command, and validate control planes reach Ready. |

**Best for:** Workloads where occasional longer downtime is acceptable and the cost of a warm standby is not justified.

**Available in:** Shared Backups are available in all deployment modes (Enterprise plan required). Space Backups require Self-Hosted or Managed Spaces.
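
The RTO for this pattern is the sum of the recovery steps listed in the table. As a rough sketch, you can budget it in Python as shown below. The step durations and the 60-minute target are placeholders, not measurements; replace them with timings from your own recovery drills.

```python
from datetime import timedelta

# Placeholder step durations; measure your own in a recovery drill.
recovery_steps = {
    "provision new cluster": timedelta(minutes=25),
    "install Spaces": timedelta(minutes=15),
    "run restore": timedelta(minutes=20),
    "validate control planes are Ready": timedelta(minutes=10),
}

rto_target = timedelta(minutes=60)  # hypothetical target
estimated_rto = sum(recovery_steps.values(), timedelta())

print(f"Estimated RTO: {estimated_rto}")                # Estimated RTO: 1:10:00
print(f"Within target: {estimated_rto <= rto_target}")  # Within target: False
```

With these placeholder numbers, cluster provisioning alone uses most of the budget, which is why restoring into an existing Space is usually faster than restoring into a new one.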

### Warm standby with ObserveOnly management policies

In this pattern, you maintain a secondary control plane that mirrors the primary's desired state but operates in `ObserveOnly` mode — it reads cloud resource state but makes no changes. When the primary fails, you promote the standby to active by removing the `ObserveOnly` policy, and it begins reconciling immediately.

| Metric | Typical range | Notes |
|---|---|---|
| **RPO** | Near-zero | The standby mirrors state continuously. Any change applied to the primary is reflected in the standby's observed state. Data loss is bounded by reconciliation latency (seconds to minutes), not backup schedule. |
| **RTO** | 5–15 minutes | Includes time to detect the failure, promote the standby (change `managementPolicies`), and verify providers are reconciling. Automated promotion can bring this closer to 2–5 minutes. |

**Best for:** Workloads with low RTO requirements where backup/restore recovery is too slow.

**Available in:** All deployment modes. Requires Composition Functions. For details, see [Warm Standby Control Planes with ObserveOnly Management Policies][observeonly-standby].
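
Promotion comes down to changing the management policy on the standby's resources. The Python sketch below illustrates only that field change on a manifest held as a dictionary. It assumes the Crossplane `spec.managementPolicies` field, where `["Observe"]` means observe-only and `["*"]` is the default full management. The `promote` helper and the example resource are hypothetical, and in practice you would apply the change through your composition or GitOps workflow rather than a script like this.

```python
import copy

OBSERVE_ONLY = ["Observe"]  # observe the external resource, never change it
FULL_CONTROL = ["*"]        # Crossplane default: full lifecycle management


def promote(resource: dict) -> dict:
    """Return a copy of a managed-resource manifest switched to full management."""
    promoted = copy.deepcopy(resource)
    promoted.setdefault("spec", {})["managementPolicies"] = list(FULL_CONTROL)
    return promoted


# Hypothetical standby resource; name and fields are illustrative only.
standby_bucket = {
    "apiVersion": "s3.aws.upbound.io/v1beta1",
    "kind": "Bucket",
    "metadata": {"name": "example-bucket"},
    "spec": {
        "managementPolicies": list(OBSERVE_ONLY),
        "forProvider": {"region": "us-east-1"},
    },
}

active_bucket = promote(standby_bucket)
print(standby_bucket["spec"]["managementPolicies"])  # ['Observe']
print(active_bucket["spec"]["managementPolicies"])   # ['*']
```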

### Upbound-managed HA (Cloud and Dedicated Spaces)

For Cloud and Dedicated Spaces, Upbound manages infrastructure-level HA. The Space infrastructure runs with multi-zone redundancy, and control planes are scheduled for resilience. This is transparent to users.

| Metric | Typical range | Notes |
|---|---|---|
| **RPO** | Near-zero | Upbound manages state redundancy. Individual control plane failures do not result in data loss. |
| **RTO** | Near-zero to minutes | Infrastructure-level failures are handled by Upbound. Application-level failures (a broken composition, a misconfigured provider) are still the customer's responsibility to detect and remediate. |

**Best for:** Teams that prefer to delegate infrastructure resilience to Upbound and focus on control plane configuration.

**Available in:** Cloud Spaces and Dedicated Spaces only.

## Backup schedule and RPO relationship

For backup/restore DR, your RPO is bounded by how often you take backups. The table below maps common backup schedules to their maximum RPO:

| Backup schedule | Maximum RPO | Example use case |
|---|---|---|
| Every 15 minutes | 15 minutes | High-frequency configuration changes, compliance-sensitive workloads |
| Hourly | 1 hour | Typical production workloads with active development |
| Every 4 hours | 4 hours | Stable production workloads with infrequent configuration changes |
| Daily (`@daily`) | 24 hours | Non-critical or staging environments |
| Weekly | 7 days | Archive or DR-testing purposes only |

:::tip
Even if your team's configuration changes are infrequent, short backup intervals reduce the blast radius when a change introduces corruption. An hourly backup schedule is a reasonable default for production control planes.
:::
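
The schedule table can be read in reverse: given an RPO target, pick the least frequent schedule that still fits. The sketch below does this in Python. It assumes that every backup succeeds and completes quickly, so the maximum RPO equals the backup interval. The `least_frequent_schedule` helper is hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Backup intervals from the table above, shortest first.
SCHEDULES = [
    ("Every 15 minutes", timedelta(minutes=15)),
    ("Hourly", timedelta(hours=1)),
    ("Every 4 hours", timedelta(hours=4)),
    ("Daily", timedelta(days=1)),
    ("Weekly", timedelta(days=7)),
]


def least_frequent_schedule(rpo_target: timedelta) -> Optional[str]:
    """Return the least frequent schedule whose maximum RPO fits the target."""
    fitting = [name for name, interval in SCHEDULES if interval <= rpo_target]
    return fitting[-1] if fitting else None


print(least_frequent_schedule(timedelta(hours=2)))    # Hourly
print(least_frequent_schedule(timedelta(minutes=5)))  # None
```

A result of `None` means no schedule in the table meets the target, which points toward the warm standby pattern instead of backup and restore.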

## Decision matrix

Use this matrix to select the DR architecture that matches your requirements:

| Your RTO target | Your RPO target | Recommended architecture |
|---|---|---|
| < 15 minutes | < 15 minutes | Warm standby (ObserveOnly) with automated promotion |
| < 15 minutes | 15–60 minutes | Warm standby (ObserveOnly) with manual promotion |
| 15–60 minutes | 1–4 hours | Shared Backups with hourly schedule + GitOps for fast re-application |
| 1–4 hours | 4–24 hours | Shared Backups with multi-hour schedule |
| > 4 hours | > 24 hours | Daily backups minimum; consider whether this tier of service is acceptable |
| Upbound-managed | Upbound-managed | Cloud Spaces or Dedicated Spaces |

For Self-Hosted Spaces, the warm standby and backup/restore approaches can be combined: use scheduled backups as a safety net, and use a warm standby for fast failover.
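
If you track targets for many environments, the decision matrix can be encoded as a lookup. The following Python sketch is one way to do that. Treating each range as half-open (lower bound included, upper bound excluded) is an assumption, and the `recommend` helper is hypothetical. Target pairs that fall between rows return `None` rather than a guess.

```python
import math
from typing import Optional

INF = math.inf

# (rto_min, rto_max, rpo_min, rpo_max, recommendation); all values in minutes.
# Ranges are half-open [min, max); that boundary handling is an assumption.
MATRIX = [
    (0, 15, 0, 15, "Warm standby (ObserveOnly) with automated promotion"),
    (0, 15, 15, 60, "Warm standby (ObserveOnly) with manual promotion"),
    (15, 60, 60, 240, "Shared Backups with hourly schedule + GitOps"),
    (60, 240, 240, 1440, "Shared Backups with multi-hour schedule"),
    (240, INF, 1440, INF, "Daily backups minimum"),
]


def recommend(rto_minutes: float, rpo_minutes: float) -> Optional[str]:
    """Return the matching matrix row, or None if the pair falls between rows."""
    for rto_lo, rto_hi, rpo_lo, rpo_hi, name in MATRIX:
        if rto_lo <= rto_minutes < rto_hi and rpo_lo <= rpo_minutes < rpo_hi:
            return name
    return None


print(recommend(10, 5))    # Warm standby (ObserveOnly) with automated promotion
print(recommend(30, 120))  # Shared Backups with hourly schedule + GitOps
print(recommend(30, 10))   # None: this combination is not covered by the matrix
```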

## Effect of deployment mode on achievable RTO/RPO

Your deployment mode affects which DR patterns are available:

| Pattern | Cloud Spaces | Dedicated Spaces | Managed Spaces | Self-Hosted Spaces |
|---|---|---|---|---|
| Upbound-managed infrastructure HA | Yes | Yes | Partial | No |
| Shared Backups (control plane DR) | Yes | Yes | Yes | Yes |
| Space Backups (full Space DR) | No | No | Yes | Yes |
| Warm standby (ObserveOnly) | Yes | Yes | Yes | Yes |

Self-Hosted Spaces offer the most control, because you can configure the full stack of HA and DR patterns, but they require the most operational investment. Cloud Spaces carry the lowest configuration burden but also offer the least flexibility.
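
The availability table can also be kept as data, for example to validate a planned architecture in a script. This Python sketch copies the table values as they appear above. The short mode keys and the `available_patterns` helper are hypothetical.

```python
# Availability of each pattern per deployment mode, copied from the table above.
AVAILABILITY = {
    "Upbound-managed infrastructure HA": {
        "Cloud": "Yes", "Dedicated": "Yes", "Managed": "Partial", "Self-Hosted": "No",
    },
    "Shared Backups": {
        "Cloud": "Yes", "Dedicated": "Yes", "Managed": "Yes", "Self-Hosted": "Yes",
    },
    "Space Backups": {
        "Cloud": "No", "Dedicated": "No", "Managed": "Yes", "Self-Hosted": "Yes",
    },
    "Warm standby (ObserveOnly)": {
        "Cloud": "Yes", "Dedicated": "Yes", "Managed": "Yes", "Self-Hosted": "Yes",
    },
}


def available_patterns(mode: str) -> list:
    """List the patterns that are at least partially available in a mode."""
    return [name for name, modes in AVAILABILITY.items() if modes[mode] != "No"]


print(available_patterns("Self-Hosted"))
# ['Shared Backups', 'Space Backups', 'Warm standby (ObserveOnly)']
print("Space Backups" in available_patterns("Cloud"))  # False
```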

## Planning checklist

Before finalizing your resilience architecture:

- [ ] Document the RTO and RPO target for each environment (production, staging, dev)
- [ ] Map each target to a backup schedule or standby pattern using the decision matrix above
- [ ] Verify the pattern is available in your deployment mode
- [ ] Verify your Upbound plan tier includes the features you need (Enterprise required for Shared and Space Backups)
- [ ] Test your recovery procedure in staging before relying on it in production (see [Validating Your Resilience Configuration][dr-testing])
- [ ] Document the runbook steps so any operator can execute recovery without prior knowledge (see [Disaster Recovery Runbook][dr-runbook])

## Next steps

- Compare resilience responsibilities by deployment mode: [Resilience Responsibilities by Deployment Mode][resilience-by-mode]
- Configure Shared Backups: [Backup and Restore][backup-and-restore]
- Configure Space Backups (Self-Hosted and Managed): [Disaster Recovery][dr]
- Implement a warm standby control plane: [Warm Standby Control Planes with ObserveOnly Management Policies][observeonly-standby]

[resilience-by-mode]: /manuals/spaces/concepts/resilience-by-deployment-mode
[backup-and-restore]: /manuals/spaces/howtos/backup-and-restore
[dr]: /manuals/spaces/howtos/self-hosted/dr
[observeonly-standby]: /manuals/spaces/howtos/observeonly-standby
[dr-testing]: /manuals/spaces/howtos/dr-testing
[dr-runbook]: /manuals/spaces/howtos/dr-runbook