
Etcd Backup on Pollux#1197

Open
JamesDoingStuff wants to merge 1 commit into main from jg/etcd-backup

Conversation

@JamesDoingStuff
Contributor

The dev_resources/build CI is currently failing due, I think, to check_k8s_resources.py not handling CronJobs well - specifically, a CronJob having a resources field but no replicas field. I'll look into fixing this

Adds:

  • CronJob that executes daily to take a snapshot of one of the etcd PVs and upload it to Echo. Backups are timestamped and stored under the path dls-workflows-prod/<staging/prod>/etcd-snapshot-<timestamp>.db. The contents are encrypted. The job deletes backups older than 2 days.
  • CronJob to download the snapshot and perform an etcdctl snapshot restore on the provided etcd volume. This job won't run automatically.
  • Script (scripts/restore-etcd.sh) that scales down etcd and the vcluster, performs the above job for each etcd volume, then returns the cluster to its initial replica counts.
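The backup step described in the first bullet might look roughly like the following. This is a sketch only: the PREFIX value, the "echo:" rclone remote name, and the /backup mount path are assumptions; only the dls-workflows-prod bucket, the timestamped filename, and the 2-day retention come from the PR description. The real commands are echoed rather than executed here.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical values - only the bucket/filename pattern comes from the PR.
PREFIX="staging"
TIMESTAMP="$(date +%Y-%m-%dT%H-%M-%S)"
SNAP="etcd-snapshot-${TIMESTAMP}.db"
DEST="dls-workflows-prod/${PREFIX}/${SNAP}"

# Sketch of the real job: snapshot the mounted etcd PV, upload, prune.
echo "would run: etcdctl snapshot save /backup/${SNAP}"
echo "would run: rclone copy /backup/${SNAP} echo:dls-workflows-prod/${PREFIX}/"
echo "would run: rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}"
```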

Contributor

why do we have staging being backed up and not prod?

Contributor Author

Probably best to roll it out on staging first and just make sure all is well - I just needed to add something to the Values.yaml for prod
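Presumably enabling this on prod comes down to a single values flag, something like the following sketch (the backup.enabled key appears later in the diff; the per-environment file layout is an assumption):

```yaml
# Hypothetical prod values file - flip this to roll backups out to prod.
backup:
  enabled: true
```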

namespace: workflows
type: Opaque
{{ else }}
{{- end }}
Contributor

should there be an empty line after all of these?

template:
spec:
initContainers:
- name: backup
Contributor

extra space after backup

# Check for today's backup
if [ ! -s "$SNAP" ]; then
echo "Backup does not exist"
exit 1
Contributor

extra space after exit 1

metadata:
name: "restore-etcd-{{ $i }}"
spec:
schedule: "@yearly"
Contributor

extra space after yearly

echo "Waiting for Jobs to complete..."

for ((i=0;i<ETCD_REPLICAS;i++)); do
kubectl -n workflows wait --for=condition=complete job/restore-etcd-$i --timeout=300s
Contributor

is the timeout enough for prod?

Contributor Author

Not sure... This seems sufficient for Pollux, so maybe leave it as is for now, then we can boost it when we switch on backups for Argus?
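One low-effort way to prepare for boosting it later would be to make the timeout overridable from the environment, keeping 300s as the default. A sketch (RESTORE_TIMEOUT is a hypothetical variable name; the kubectl command is echoed rather than run):

```shell
# Default to the current 300s unless RESTORE_TIMEOUT is set (e.g. for Argus).
TIMEOUT="${RESTORE_TIMEOUT:-300s}"
echo "would run: kubectl -n workflows wait --for=condition=complete job/restore-etcd-0 --timeout=${TIMEOUT}"
```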

@@ -0,0 +1,46 @@
#!/bin/bash

Collaborator

is it better to add set -euo pipefail so that the script stops on errors?
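For reference, -e exits on any failing command, -u errors on unset variables, and -o pipefail makes a pipeline fail if any stage fails rather than only the last. A quick self-contained demo of the -e behaviour:

```shell
# Run a subshell with -e: the `false` aborts it before the echo is reached,
# and the subshell's exit status (1, from `false`) is captured.
bash -c 'set -euo pipefail; false; echo "never printed"' || status=$?
echo "exit status: ${status}"   # prints "exit status: 1"
```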

@@ -0,0 +1,46 @@
#!/bin/bash
Collaborator

is it more robust to do:

#!/usr/bin/env bash

{{- if .Values.backup.enabled }}
schedule: "@daily"
{{ else }}
schedule: "@yearly"
Collaborator

if .Values.backup.enabled is false the backups still happen but once per year?

Is there a reason you don't just wrap the whole CronJob with {{- if .Values.backup.enabled }} so that CronJob simply doesn't get applied if backup is disabled?
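Wrapping the whole resource would look something like this sketch of the suggestion (the metadata name is a placeholder, not taken from the diff):

```yaml
{{- if .Values.backup.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-etcd   # placeholder name
spec:
  schedule: "@daily"
  # ... rest of the CronJob spec ...
{{- end }}
```

With this form the CronJob object simply doesn't exist when backups are disabled, so there is no yearly fallback schedule to reason about.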

Contributor Author

I borrowed this pattern from the LIMS postgres backup, but you're right, I could just wrap the whole thing. I think the suspend: true means it wouldn't run, but I'll change it just in case

Collaborator

see my comment below. Having read more about CronJobs, what you are doing may be OK.


# Delete old backed up objects, with age >= 2 days.
echo "deleting old backups from echo s3"
rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}
Collaborator

this is fine for this PR but we should decide our strategy for how many and how long we want to keep backups for.

name: "restore-etcd-{{ $i }}"
spec:
schedule: "@yearly"
suspend: true # Never runs automatically
Collaborator

Is there a reason you went with a CronJob for this? Naively I would have thought that a Job triggered by restore-etcd.sh would be the way to go.

I'm slightly nervous about this... we don't want the production database accidentally restoring to a backup at some random point in the year!

If you stick with this solution, please be absolutely sure that this is how this works.

Contributor Author

Initially, when creating the restore job, I wanted to avoid using any local files and have everything required present on the cluster - hence the CronJob. Once that proved difficult, it was simpler to leave it as an unscheduled cron than to switch. I guess since I'm using a locally stored script now, it's not as big a deal to require the job file too, so I'll probably switch this over as well

Collaborator
@davehadley Mar 19, 2026

My earlier comment was perhaps based on my ignorance. I haven't done much with CronJobs yet. Apparently it is intended that you can create Jobs from suspended CronJobs (e.g. https://kubernetes.io/docs/reference/kubectl/generated/kubectl_create/kubectl_create_job/). Your solution may be the "idiomatic" Kubernetes way.
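Triggering a restore manually from the suspended CronJob would then follow the shape documented for kubectl create job (the -manual job name suffix is arbitrary; the command is built and echoed here rather than run against a cluster):

```shell
# suspend: true only blocks *scheduled* runs; a Job can still be created
# on demand from the CronJob's template.
i=0
CMD="kubectl -n workflows create job --from=cronjob/restore-etcd-${i} restore-etcd-${i}-manual"
echo "${CMD}"
```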
