Conversation
Why do we have staging being backed up and not prod?
Probably best to roll it out on staging first and just make sure all is well - I just needed to add something to the Values.yaml for prod
```yaml
namespace: workflows
type: Opaque
{{ else }}
{{- end }}
```
should there be an empty line after all of these?
```yaml
template:
  spec:
    initContainers:
      - name: backup
```
extra space after backup
```bash
# Check for today's backup
if [ ! -s "$SNAP" ]; then
  echo "Backup does not exist"
  exit 1
```
extra space after exit 1
```yaml
metadata:
  name: "restore-etcd-{{ $i }}"
spec:
  schedule: "@yearly"
```
extra space after yearly
```bash
echo "Waiting for Jobs to complete..."

for ((i=0;i<ETCD_REPLICAS;i++)); do
  kubectl -n workflows wait --for=condition=complete job/restore-etcd-$i --timeout=300s
```
is the timeout enough for prod?
Not sure... This seems sufficient for Pollux, so maybe leave it as is for now, then we can boost it when we switch on backups for Argus?
8abd3b1 to ceddb28
```
@@ -0,0 +1,46 @@
#!/bin/bash
```
Is it better to add `set -euo pipefail` to have the script stop on errors?
```
@@ -0,0 +1,46 @@
#!/bin/bash
```
Is it more robust to do:

```bash
#!/usr/bin/env bash
```
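For illustration, a minimal sketch combining both suggestions - the portable shebang plus strict mode (the demo body is made up, not the actual backup script):

```shell
#!/usr/bin/env bash
# "env" locates bash on PATH, so this also works where bash isn't /bin/bash.
set -euo pipefail
# -e          : abort on the first failing command
# -u          : treat unset variables as errors
# -o pipefail : a pipeline fails if any stage fails, not just the last one

# pipefail demo: "false | true" exits 0 without pipefail, non-zero with it.
if (false | true); then
  RESULT="pipeline failure missed"
else
  RESULT="pipeline failure detected"
fi
echo "$RESULT"   # prints "pipeline failure detected"
```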
```yaml
{{- if .Values.backup.enabled }}
schedule: "@daily"
{{ else }}
schedule: "@yearly"
```
If `.Values.backup.enabled` is false, the backups still happen, just once per year?
Is there a reason you don't just wrap the whole CronJob with {{- if .Values.backup.enabled }} so that CronJob simply doesn't get applied if backup is disabled?
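For reference, the suggested alternative might look roughly like this (the resource and image names are illustrative, not the actual chart's):

```yaml
{{- if .Values.backup.enabled }}
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup               # illustrative name
spec:
  schedule: "@daily"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example/backup:latest   # placeholder image
{{- end }}
```

With the whole manifest inside the conditional, disabling backups renders nothing at all, so no CronJob object exists to run.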
I borrowed this pattern from the LIMS postgres backup, but you're right, I could just wrap the whole thing. I think `suspend: true` means it wouldn't run, but I'll change it just in case.
see my comment below. Having read more about CronJobs, what you are doing may be OK.
```bash
# Delete old backed up objects, with age >= 2 days.
echo "deleting old backups from echo s3"
rclone delete --min-age=2d echo:dls-workflows-prod/${PREFIX}
```
this is fine for this PR but we should decide our strategy for how many and how long we want to keep backups for.
```yaml
name: "restore-etcd-{{ $i }}"
spec:
  schedule: "@yearly"
  suspend: true # Never runs automatically
```
Is there a reason you went with a CronJob for this? Naively I would have thought that a Job triggered by restore-etcd.sh would be the way to go.
I'm slightly nervous about this... we don't want the production database accidentally restoring to a backup at some random point in the year!
If you stick with this solution, please be absolutely sure that this is how this works.
Initially, when creating the restore job, I wanted to avoid using any local files and have everything required present on the cluster - hence the CronJob. Once that proved difficult, it was simpler to leave it as an unscheduled cron than to switch. I guess since I'm using a locally stored script now, it's not as big a deal to require the job file too, so I'll probably switch this over as well.
My earlier comment was perhaps based on my ignorance. I haven't done much with CronJobs yet. Apparently it is intended that you can create Jobs from suspended CronJobs (e.g. https://kubernetes.io/docs/reference/kubectl/generated/kubectl_create/kubectl_create_job/). Your solution may be the "idiomatic" kubernetes way.
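For the record, triggering a one-off run from the suspended CronJob would look something like this (the job name is arbitrary, and this obviously needs a cluster to run against):

```
# Creates a Job from the (suspended) CronJob's jobTemplate; suspension
# only stops the schedule, not manually created Jobs.
kubectl -n workflows create job restore-etcd-now-0 --from=cronjob/restore-etcd-0
```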
The `dev_resources/build` CI is currently failing due, I think, to `check_k8s_resources.py` not handling CronJobs well - specifically, having a `resources` field but no `replicas`. I'll look into fixing this.

Adds:

- A backup CronJob that snapshots etcd to `dls-workflows-prod/<staging/prod>/etcd-snapshot-<timestamp>.db`. The contents are encrypted. The job deletes backups older than 2 days.
- A suspended CronJob that runs `etcdctl snapshot restore` on the provided etcd volume. This job won't run automatically.
- A script (`scripts/restore-etcd.sh`) that scales down the etcd and the vcluster, performs the above job for each etcd volume, then returns the cluster to initial levels.
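The restore script's flow, as described, might be sketched roughly as below; all resource names, replica counts, and job names here are assumptions for illustration, not the real values from the script, and it needs a live cluster to run:

```
#!/usr/bin/env bash
# Hypothetical sketch of the described restore flow.
set -euo pipefail

ETCD_REPLICAS=3   # assumed replica count

# Quiesce the cluster before touching etcd data.
kubectl -n workflows scale statefulset/vcluster --replicas=0
kubectl -n workflows scale statefulset/etcd --replicas=0

# Trigger one restore Job per etcd volume from the suspended CronJobs.
for ((i=0; i<ETCD_REPLICAS; i++)); do
  kubectl -n workflows create job "restore-etcd-run-$i" --from=cronjob/restore-etcd-$i
  kubectl -n workflows wait --for=condition=complete "job/restore-etcd-run-$i" --timeout=300s
done

# Return the cluster to its initial scale.
kubectl -n workflows scale statefulset/etcd --replicas="$ETCD_REPLICAS"
kubectl -n workflows scale statefulset/vcluster --replicas=1
```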