Commit 8da04a3

STAC-24078: Handle parallel execution (#13)
1 parent f111d31 commit 8da04a3

24 files changed: 2179 additions & 58 deletions

ARCHITECTURE.md

Lines changed: 38 additions & 7 deletions
@@ -39,7 +39,8 @@ stackstate-backup-cli/
 │   ├── orchestration/      # Layer 2: Workflows
 │   │   ├── portforward/    # Port-forwarding orchestration
 │   │   ├── scale/          # Deployment/StatefulSet scaling workflows
-│   │   └── restore/        # Restore job orchestration
+│   │   ├── restore/        # Restore job orchestration
+│   │   └── restorelock/    # Restore lock mechanism (prevents parallel restores)
 │   │
 │   ├── app/                # Layer 3: Dependency Container
 │   │   └── app.go          # Application context and dependency injection
@@ -126,6 +127,7 @@ appCtx.Formatter
 - `portforward/`: Manages Kubernetes port-forwarding lifecycle
 - `scale/`: Deployment and StatefulSet scaling workflows with detailed logging
 - `restore/`: Restore job orchestration (confirmation, job lifecycle, finalization, resource management)
+- `restorelock/`: Prevents parallel restore operations using Kubernetes annotations

 **Dependency Rules**:
 - ✅ Can import: `internal/foundation/*`, `internal/clients/*`
@@ -317,16 +319,45 @@ Log: log,

 ### 7. Structured Logging

-All operations use structured logging with consistent levels:
+All operations use structured logging with consistent levels and emoji prefixes for visual clarity:

 ```go
-log.Infof("Starting operation...")
-log.Debugf("Detail: %v", detail)
-log.Warningf("Non-fatal issue: %v", warning)
-log.Errorf("Operation failed: %v", err)
-log.Successf("Operation completed successfully")
+log.Infof("Starting operation...")           // No prefix
+log.Debugf("Detail: %v", detail)             // 🛠️ DEBUG:
+log.Warningf("Non-fatal issue: %v", warning) // ⚠️ Warning:
+log.Errorf("Operation failed: %v", err)      // ❌ Error:
+log.Successf("Operation completed")          //
 ```

+### 8. Restore Lock Pattern
+
+The `restorelock` package prevents parallel restore operations that could corrupt data:
+
+```go
+// Scale down with automatic lock acquisition
+scaledApps, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
+    K8sClient:     k8sClient,
+    Namespace:     namespace,
+    LabelSelector: selector,
+    Datastore:     config.DatastoreStackgraph,
+    AllSelectors:  config.GetAllScaleDownSelectors(),
+    Log:           log,
+})
+
+// Scale up and release lock
+defer scale.ScaleUpAndReleaseLock(k8sClient, namespace, selector, log)
+```
+
+**How it works**:
+1. Before scaling down, checks for existing restore locks on Deployments/StatefulSets
+2. Detects conflicts for the same datastore or mutually exclusive datastores (e.g., Stackgraph and Settings)
+3. Sets annotations (`stackstate.com/restore-in-progress`, `stackstate.com/restore-started-at`) on resources
+4. Releases locks when scaling up or on failure
+
+**Mutual Exclusion Groups**:
+- Stackgraph and Settings restores are mutually exclusive (both modify HBase data)
+- Other datastores (Elasticsearch, ClickHouse, VictoriaMetrics) are independent
+
 ## Testing Strategy

 ### Unit Tests
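
Note: the conflict check described in the "How it works" list above can be pictured with a small, self-contained Go sketch. The helper names (`conflictsWith`, `checkLocks`) and the workload representation are hypothetical, chosen for illustration only; the actual implementation lives in `internal/orchestration/restorelock/`.

```go
// Illustrative sketch of the restore-lock conflict check; hypothetical
// helper names, not the actual restorelock implementation.
package main

import "fmt"

const (
	annoInProgress = "stackstate.com/restore-in-progress" // holds the datastore name
	annoStartedAt  = "stackstate.com/restore-started-at"  // restore start timestamp
)

// Datastores that may not restore concurrently: Stackgraph and Settings
// both modify HBase data, so each blocks the other.
var mutuallyExclusive = map[string][]string{
	"stackgraph": {"settings"},
	"settings":   {"stackgraph"},
}

// conflictsWith reports whether a restore of requested must be blocked
// by an in-progress restore of held.
func conflictsWith(requested, held string) bool {
	if requested == held {
		return true // never run two restores of the same datastore
	}
	for _, other := range mutuallyExclusive[requested] {
		if other == held {
			return true
		}
	}
	return false
}

// checkLocks scans the annotations of the candidate workloads and returns
// an error if any existing lock blocks the requested restore.
func checkLocks(requested string, workloads []map[string]string) error {
	for _, annotations := range workloads {
		if held, locked := annotations[annoInProgress]; locked && conflictsWith(requested, held) {
			return fmt.Errorf("restore of %q blocked: %q restore in progress since %s",
				requested, held, annotations[annoStartedAt])
		}
	}
	return nil
}

func main() {
	// One workload already carries a Settings restore lock.
	workloads := []map[string]string{
		{annoInProgress: "settings", annoStartedAt: "2024-01-01T10:00:00Z"},
	}
	fmt.Println(checkLocks("stackgraph", workloads))    // blocked (shares HBase)
	fmt.Println(checkLocks("elasticsearch", workloads)) // <nil>: independent
}
```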

README.md

Lines changed: 24 additions & 5 deletions
@@ -391,11 +391,12 @@ See [internal/foundation/config/testdata/validConfigMapConfig.yaml](internal/fou
 │   ├── orchestration/      # Layer 2: Workflows
 │   │   ├── portforward/    # Port-forwarding lifecycle
 │   │   ├── scale/          # Deployment/StatefulSet scaling
-│   │   └── restore/        # Restore job orchestration
-│   │       ├── confirmation.go  # User confirmation prompts
-│   │       ├── finalize.go      # Job status check and cleanup
-│   │       ├── job.go           # Job lifecycle management
-│   │       └── resources.go     # Restore resource management
+│   │   ├── restore/        # Restore job orchestration
+│   │   │   ├── confirmation.go  # User confirmation prompts
+│   │   │   ├── finalize.go      # Job status check and cleanup
+│   │   │   ├── job.go           # Job lifecycle management
+│   │   │   └── resources.go     # Restore resource management
+│   │   └── restorelock/    # Parallel restore prevention
 │   ├── app/                # Layer 3: Dependency container
 │   │   └── app.go          # Application context and DI
 │   └── scripts/            # Embedded bash scripts
@@ -409,6 +410,24 @@ See [internal/foundation/config/testdata/validConfigMapConfig.yaml](internal/fou
 - **Dependency Injection**: Centralized dependency creation via `internal/app/` eliminates boilerplate from commands
 - **Testability**: All layers use interfaces for external dependencies, enabling comprehensive unit testing
 - **Clean Commands**: Commands are thin (50-100 lines) and focused on business logic
+- **Restore Lock Protection**: Prevents parallel restore operations that could corrupt data
+
+### Restore Lock Protection
+
+The CLI prevents parallel restore operations that could corrupt data by using Kubernetes annotations on Deployments and StatefulSets. When a restore starts:
+
+1. The CLI checks for existing restore locks before proceeding
+2. If another restore is in progress for the same datastore, the operation is blocked
+3. Mutually exclusive datastores are also protected (e.g., Stackgraph and Settings cannot restore simultaneously because they share HBase data)
+
+If a restore operation is interrupted or fails, the lock annotations may remain. To manually remove a stuck lock:
+
+```bash
+kubectl annotate deployment,statefulset -l <label-selector> \
+  stackstate.com/restore-in-progress- \
+  stackstate.com/restore-started-at- \
+  -n <namespace>
+```

 See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed information about the layered architecture and design patterns.
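
Note: for an API-level view of the annotation mechanics, here is a minimal client-go sketch of what acquiring and releasing such a lock amounts to. The namespace and Deployment name are placeholders and this is not the CLI's actual code path; setting a JSON value creates an annotation, while a `null` value deletes it, mirroring the trailing-dash `kubectl annotate` form above.

```go
// Hypothetical sketch: setting and clearing the lock annotations with
// client-go. Namespace and Deployment name are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const ns = "suse-observability"   // placeholder namespace
	const name = "settings-server"    // placeholder Deployment name

	// Acquire: stamp both lock annotations on the Deployment.
	lock := []byte(`{"metadata":{"annotations":{` +
		`"stackstate.com/restore-in-progress":"settings",` +
		`"stackstate.com/restore-started-at":"2024-01-01T10:00:00Z"}}}`)
	if _, err := client.AppsV1().Deployments(ns).Patch(
		ctx, name, types.StrategicMergePatchType, lock, metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}

	// Release: null values delete the annotations, the programmatic
	// equivalent of the kubectl command shown above.
	unlock := []byte(`{"metadata":{"annotations":{` +
		`"stackstate.com/restore-in-progress":null,` +
		`"stackstate.com/restore-started-at":null}}}`)
	if _, err := client.AppsV1().Deployments(ns).Patch(
		ctx, name, types.StrategicMergePatchType, unlock, metav1.PatchOptions{},
	); err != nil {
		panic(err)
	}

	fmt.Println("lock set and released")
}
```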

cmd/clickhouse/check_and_finalize.go

Lines changed: 3 additions & 3 deletions
@@ -127,21 +127,21 @@ func waitAndFinalize(appCtx *app.Context, chClient clickhouse.Interface, operati
     return finalizeRestore(appCtx)
 }

-// finalizeRestore finalizes the restore by executing SQL and scaling up
+// finalizeRestore finalizes the restore by executing SQL, scaling up, and releasing lock
 func finalizeRestore(appCtx *app.Context) error {
     if err := executePostRestoreSQL(appCtx); err != nil {
         appCtx.Logger.Warningf("Post-restore SQL failed: %v", err)
     }

     appCtx.Logger.Println()
     scaleSelector := appCtx.Config.Clickhouse.Restore.ScaleDownLabelSelector
-    if err := scale.ScaleUpFromAnnotations(
+    if err := scale.ScaleUpAndReleaseLock(
         appCtx.K8sClient,
         appCtx.Namespace,
         scaleSelector,
         appCtx.Logger,
     ); err != nil {
-        return fmt.Errorf("failed to scale up: %w", err)
+        return fmt.Errorf("failed to scale up and release lock: %w", err)
     }

     appCtx.Logger.Println()

cmd/clickhouse/restore.go

Lines changed: 9 additions & 2 deletions
@@ -69,10 +69,17 @@ func runRestore(appCtx *app.Context) error {
         }
     }

-    // Scale down deployments/statefulsets before restore
+    // Scale down deployments/statefulsets before restore (with lock protection)
     appCtx.Logger.Println()
     scaleDownLabelSelector := appCtx.Config.Clickhouse.Restore.ScaleDownLabelSelector
-    _, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
+    _, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
+        K8sClient:     appCtx.K8sClient,
+        Namespace:     appCtx.Namespace,
+        LabelSelector: scaleDownLabelSelector,
+        Datastore:     config.DatastoreClickhouse,
+        AllSelectors:  appCtx.Config.GetAllScaleDownSelectors(),
+        Log:           appCtx.Logger,
+    })
     if err != nil {
         return err
     }

cmd/cmdutils/common.go

Lines changed: 6 additions & 12 deletions
@@ -1,9 +1,7 @@
 package cmdutils

 import (
-    "errors"
     "fmt"
-    "io"
     "os"

     "github.com/stackvista/stackstate-backup-cli/internal/app"
@@ -18,19 +16,15 @@ const (
 func Run(globalFlags *config.CLIGlobalFlags, runFunc func(ctx *app.Context) error, minioRequired bool) {
     appCtx, err := app.NewContext(globalFlags)
     if err != nil {
-        exitWithError(err, os.Stderr)
+        _, _ = fmt.Fprintf(os.Stderr, "❌ Error: %v\n", err)
+        os.Exit(1)
     }
     if minioRequired && !appCtx.Config.Minio.Enabled {
-        exitWithError(errors.New("commands that interact with Minio require SUSE Observability to be deployed with .Values.global.backup.enabled=true"), os.Stderr)
+        appCtx.Logger.Errorf("commands that interact with Minio require SUSE Observability to be deployed with .Values.global.backup.enabled=true")
+        os.Exit(1)
     }
     if err := runFunc(appCtx); err != nil {
-        exitWithError(err, os.Stderr)
+        appCtx.Logger.Errorf(err.Error())
+        os.Exit(1)
     }
 }
-
-// ExitWithError prints an error message to the writer and exits with status code 1.
-// This is a helper function to avoid repeating error handling code in commands.
-func exitWithError(err error, w io.Writer) {
-    _, _ = fmt.Fprintf(w, "error: %v\n", err)
-    os.Exit(1)
-}
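
Note: a hypothetical caller of the reworked `cmdutils.Run`, showing where the new error paths land. The import paths are inferred from the diffs above and the command package is invented for illustration; errors returned from the run function are now reported through the structured logger instead of a bare `error:` write to stderr.

```go
// Illustrative command package, not part of the repo; import paths are
// inferred from the diffs in this commit.
package mycmd

import (
	"fmt"

	"github.com/stackvista/stackstate-backup-cli/cmd/cmdutils"
	"github.com/stackvista/stackstate-backup-cli/internal/app"
	"github.com/stackvista/stackstate-backup-cli/internal/foundation/config"
)

// Execute wires a command body into cmdutils.Run. With minioRequired=true,
// Run logs an error and exits unless Minio-backed backups are enabled.
func Execute(globalFlags *config.CLIGlobalFlags) {
	cmdutils.Run(globalFlags, func(appCtx *app.Context) error {
		appCtx.Logger.Infof("Running in namespace %s...", appCtx.Namespace)
		// A non-nil error is now reported via Logger.Errorf followed by
		// os.Exit(1), replacing the removed exitWithError helper.
		return fmt.Errorf("not implemented")
	}, true)
}
```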

cmd/elasticsearch/check_and_finalize.go

Lines changed: 6 additions & 4 deletions
@@ -113,20 +113,22 @@ func waitAndFinalize(appCtx *app.Context, repository, snapshotName string) error
     return finalizeRestore(appCtx)
 }

-// finalizeRestore performs post-restore finalization (scale up deployments)
+// finalizeRestore performs post-restore finalization (scale up deployments and release lock)
 func finalizeRestore(appCtx *app.Context) error {
     appCtx.Logger.Println()
+    labelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
     scaleUpFn := func() error {
-        return scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
+        return scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, labelSelector, appCtx.Logger)
     }

     return restore.FinalizeRestore(scaleUpFn, appCtx.Logger)
 }

-// attemptScaleUp tries to scale up deployments (used when restore is not found/already complete)
+// attemptScaleUp tries to scale up deployments and release lock (used when restore is not found/already complete)
 func attemptScaleUp(appCtx *app.Context) error {
+    labelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
     scaleUpFn := func() error {
-        return scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
+        return scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, labelSelector, appCtx.Logger)
     }

     if err := scaleUpFn(); err != nil {

cmd/elasticsearch/restore.go

Lines changed: 10 additions & 2 deletions
@@ -91,9 +91,17 @@ func runRestore(appCtx *app.Context) error {
         }
     }

-    // Scale down deployments before restore
+    // Scale down deployments before restore (with lock protection)
     appCtx.Logger.Println()
-    _, err = scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector, appCtx.Logger)
+    scaleDownLabelSelector := appCtx.Config.Elasticsearch.Restore.ScaleDownLabelSelector
+    _, err = scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
+        K8sClient:     appCtx.K8sClient,
+        Namespace:     appCtx.Namespace,
+        LabelSelector: scaleDownLabelSelector,
+        Datastore:     config.DatastoreElasticsearch,
+        AllSelectors:  appCtx.Config.GetAllScaleDownSelectors(),
+        Log:           appCtx.Logger,
+    })
     if err != nil {
         return err
     }

cmd/settings/check_and_finalize.go

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ func runCheckAndFinalize(appCtx *app.Context) error {
         Namespace:     appCtx.Namespace,
         JobName:       checkJobName,
         ServiceName:   "settings",
-        ScaleUpFn:     scale.ScaleUpFromAnnotations,
+        ScaleUpFn:     scale.ScaleUpAndReleaseLock,
         ScaleDownFn:   scale.ScaleDown,
         ScaleSelector: appCtx.Config.Settings.Restore.ScaleDownLabelSelector,
         CleanupPVC:    false,

cmd/settings/restore.go

Lines changed: 11 additions & 4 deletions
@@ -78,19 +78,26 @@ func runRestore(appCtx *app.Context) error {
         }
     }

-    // Scale down deployments before restore
+    // Scale down deployments before restore (with lock protection)
     appCtx.Logger.Println()
     scaleDownLabelSelector := appCtx.Config.Settings.Restore.ScaleDownLabelSelector
-    scaledDeployments, err := scale.ScaleDown(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger)
+    scaledDeployments, err := scale.ScaleDownWithLock(scale.ScaleDownWithLockParams{
+        K8sClient:     appCtx.K8sClient,
+        Namespace:     appCtx.Namespace,
+        LabelSelector: scaleDownLabelSelector,
+        Datastore:     config.DatastoreSettings,
+        AllSelectors:  appCtx.Config.GetAllScaleDownSelectors(),
+        Log:           appCtx.Logger,
+    })
     if err != nil {
         return err
     }

-    // Ensure deployments are scaled back up on exit (even if restore fails)
+    // Ensure deployments are scaled back up and lock released on exit (even if restore fails)
     defer func() {
         if len(scaledDeployments) > 0 && !background {
             appCtx.Logger.Println()
-            if err := scale.ScaleUpFromAnnotations(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
+            if err := scale.ScaleUpAndReleaseLock(appCtx.K8sClient, appCtx.Namespace, scaleDownLabelSelector, appCtx.Logger); err != nil {
                 appCtx.Logger.Warningf("Failed to scale up deployments: %v", err)
             }
         }

cmd/stackgraph/check_and_finalize.go

Lines changed: 1 addition & 1 deletion
@@ -48,7 +48,7 @@ func runCheckAndFinalize(appCtx *app.Context) error {
         Namespace:     appCtx.Namespace,
         JobName:       checkJobName,
         ServiceName:   "stackgraph",
-        ScaleUpFn:     scale.ScaleUpFromAnnotations,
+        ScaleUpFn:     scale.ScaleUpAndReleaseLock,
         ScaleDownFn:   scale.ScaleDown,
         ScaleSelector: appCtx.Config.Stackgraph.Restore.ScaleDownLabelSelector,
         CleanupPVC:    true,
