
Etcd Backup on Pollux#1197

Open
JamesDoingStuff wants to merge 1 commit into main from jg/etcd-backup

Conversation

@JamesDoingStuff
Contributor

The dev_resources/build CI is currently failing due, I think, to check_k8s_resources.py not handling CronJobs well - specifically, a CronJob having a resources field but no replicas field. I'll look into fixing this

Adds:

  • CronJob that executes daily to take a snapshot of one of the etcd PVs and upload it to Echo. Backups are timestamped and stored under the path dls-workflows-prod/<staging/prod>/etcd-snapshot-<timestamp>.db. The contents are encrypted. The job deletes backups older than 2 days.
  • CronJob to download the snapshot and perform an etcdctl snapshot restore on the provided etcd volume. This job won't run automatically.
  • Script (scripts/restore-etcd.sh) that scales down etcd and the vcluster, performs the above job for each etcd volume, then returns the cluster to its initial replica counts.
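The backup step described in the first bullet might look roughly like the following. This is a sketch only: the PREFIX value, the "echo:" rclone remote name, and the /backup mount path are assumptions; only the dls-workflows-prod bucket, the timestamped filename, and the 2-day retention come from the PR description. The real commands are echoed rather than executed here.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical values - only the bucket/filename pattern comes from the PR.
PREFIX="staging"
TIMESTAMP="$(date +%Y-%m-%dT%H-%M-%S)"
SNAP="etcd-snapshot-${TIMESTAMP}.db"
DEST="dls-workflows-prod/${PREFIX}/${SNAP}"

# Sketch of the real job: snapshot the mounted etcd PV, upload, prune.
echo "would run: etcdctl snapshot save /backup/${SNAP}"
echo "would run: rclone copy /backup/${SNAP} echo:dls-workflows-prod/${PREFIX}/"
echo "would run: rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}"
```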

Contributor

why do we have staging being backed up and not prod?

Contributor Author

Probably best to roll it out on staging first and just make sure all is well - I just needed to add something to the Values.yaml for prod
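Presumably enabling this on prod comes down to a single values flag, something like the following sketch (the backup.enabled key appears later in the diff; the per-environment file layout is an assumption):

```yaml
# Hypothetical prod values file - flip this to roll backups out to prod.
backup:
  enabled: true
```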

namespace: workflows
type: Opaque
{{ else }}
{{- end }}
Contributor

should there be an empty line after all of these?

template:
spec:
initContainers:
- name: backup
Contributor

extra space after backup

# Check for today's backup
if [ ! -s "$SNAP" ]; then
echo "Backup does not exist"
exit 1
Contributor

extra space after exit 1

metadata:
name: "restore-etcd-{{ $i }}"
spec:
schedule: "@yearly"
Contributor

extra space after yearly

echo "Waiting for Jobs to complete..."

for ((i=0;i<ETCD_REPLICAS;i++)); do
kubectl -n workflows wait --for=condition=complete job/restore-etcd-$i --timeout=300s
Contributor

is the timeout enough for prod?

Contributor Author

Not sure... This seems sufficient for Pollux, so maybe leave it as is for now, then we can boost it when we switch on backups for Argus?
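One low-effort way to prepare for boosting it later would be to make the timeout overridable from the environment, keeping 300s as the default. A sketch (RESTORE_TIMEOUT is a hypothetical variable name; the kubectl command is echoed rather than run):

```shell
# Default to the current 300s unless RESTORE_TIMEOUT is set (e.g. for Argus).
TIMEOUT="${RESTORE_TIMEOUT:-300s}"
echo "would run: kubectl -n workflows wait --for=condition=complete job/restore-etcd-0 --timeout=${TIMEOUT}"
```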

@@ -0,0 +1,46 @@
#!/bin/bash

Collaborator

is it better to add set -euo pipefail so that the script stops on errors?
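For reference, -e exits on any failing command, -u errors on unset variables, and -o pipefail makes a pipeline fail if any stage fails rather than only the last. A quick self-contained demo of the -e behaviour:

```shell
# Run a subshell with -e: the `false` aborts it before the echo is reached,
# and the subshell's exit status (1, from `false`) is captured.
bash -c 'set -euo pipefail; false; echo "never printed"' || status=$?
echo "exit status: ${status}"   # prints "exit status: 1"
```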

@@ -0,0 +1,46 @@
#!/bin/bash
Collaborator

is it more robust to do:

#!/usr/bin/env bash

{{- if .Values.backup.enabled }}
schedule: "@daily"
{{ else }}
schedule: "@yearly"
Collaborator

if .Values.backup.enabled is false the backups still happen but once per year?

Is there a reason you don't just wrap the whole CronJob with {{- if .Values.backup.enabled }} so that CronJob simply doesn't get applied if backup is disabled?
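Wrapping the whole resource would look something like this sketch of the suggestion (the metadata name is a placeholder, not taken from the diff):

```yaml
{{- if .Values.backup.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-etcd   # placeholder name
spec:
  schedule: "@daily"
  # ... rest of the CronJob spec ...
{{- end }}
```

With this form the CronJob object simply doesn't exist when backups are disabled, so there is no yearly fallback schedule to reason about.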

Contributor Author

I borrowed this pattern from the LIMS postgres backup, but you're right, I could just wrap the whole thing. I think the suspend: true means it wouldn't run, but I'll change it just in case

Collaborator

see my comment below. Having read more about CronJobs, what you are doing may be OK.


# Delete old backed up objects, with age >= 2 days.
echo "deleting old backups from echo s3"
rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}
Collaborator

this is fine for this PR but we should decide our strategy for how many and how long we want to keep backups for.

name: "restore-etcd-{{ $i }}"
spec:
schedule: "@yearly"
suspend: true # Never runs automatically
Collaborator

Is there a reason you went with a CronJob for this? Naively I would have thought that a Job triggered by restore-etcd.sh would be the way to go.

I'm slightly nervous about this... we don't want the production database accidentally restoring to a backup at some random point in the year!

If you stick with this solution, please be absolutely sure that this is how this works.

Contributor Author

Initially, when creating the restore job, I wanted to avoid using any local files and have everything required present on the cluster - hence the CronJob. Once that proved difficult, it was simpler to leave it as an unscheduled cron than to switch. I guess since I'm using a locally stored script now, it's not as big a deal to require the job file too, so I'll probably switch this over as well

Collaborator
@davehadley Mar 19, 2026

My earlier comment was perhaps based on my ignorance. I haven't done much with CronJobs yet. Apparently it is intended that you can create Jobs from suspended CronJobs (e.g. https://kubernetes.io/docs/reference/kubectl/generated/kubectl_create/kubectl_create_job/). Your solution may be the "idiomatic" Kubernetes way.
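Triggering a restore manually from the suspended CronJob would then follow the shape documented for kubectl create job (the -manual job name suffix is arbitrary; the command is built and echoed here rather than run against a cluster):

```shell
# suspend: true only blocks *scheduled* runs; a Job can still be created
# on demand from the CronJob's template.
i=0
CMD="kubectl -n workflows create job --from=cronjob/restore-etcd-${i} restore-etcd-${i}-manual"
echo "${CMD}"
```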
