Merged
71 commits
3632535
feat: add node recovery state machine with health checkers, executor,…
hlts2 Apr 13, 2026
c64cd02
refactor: extract operation options to options.go and add input valid…
hlts2 Apr 13, 2026
abc4545
fix: update godoc comment
hlts2 Apr 13, 2026
6e66585
feat: add check function to detect disk pressure issue
hlts2 Apr 13, 2026
9773247
fix: update code comment
hlts2 Apr 13, 2026
117931d
feat: add per-checker thresholds and DiskPressure checker
hlts2 Apr 13, 2026
b1bd9a0
fix: replace environment variable names with parameter-level error me…
hlts2 Apr 13, 2026
0a456c8
refactor: remove unused nodeDesiredGPUCount from watcher and add kube…
hlts2 Apr 13, 2026
43a495a
feat: support multiple node pool IDs and empty (all nodes) selector
hlts2 Apr 13, 2026
a0017b1
feat: detect expected GPU count from node label instead of static env…
hlts2 Apr 13, 2026
7e7e316
refactor: move nodePoolIDs from NewWatcher parameter to functional op…
hlts2 Apr 13, 2026
9c86454
refactor: remove unused clusterID from watcher
hlts2 Apr 13, 2026
baee4e0
fix: use nvidia.com/gpu.count label for hasGPU instead of allocatable
hlts2 Apr 13, 2026
3204673
fix: rename env vars to CIVO_NODE_AGENT_ prefix for agent-specific se…
hlts2 Apr 13, 2026
1cc2c87
feat: differentiate reboot wait times for standard and GPU nodes
hlts2 Apr 13, 2026
d364bb7
refactor: move node pool ID parsing into WithNodePoolIDs option
hlts2 Apr 13, 2026
6167e5a
docs: add TODO for standard node Drain → Replace recovery flow
hlts2 Apr 13, 2026
2bd4898
refactor: move monitor-only flag parsing into WithMonitorOnly option
hlts2 Apr 13, 2026
b061f02
fix: add WithMonitorOnly to defaultOptions
hlts2 Apr 13, 2026
8cd0e35
refactor: remove unused FailedCheckers getter from NodeState
hlts2 Apr 14, 2026
9f8a754
feat: add CiliumAgent health checker
hlts2 Apr 14, 2026
6dd8218
fix: use NetworkUnavailable condition with CiliumIsUp reason for Cili…
hlts2 Apr 14, 2026
d6e27ab
feat: return reason from HealthChecker.Check for metrics observability
hlts2 Apr 14, 2026
953abd3
fix: capitalize GPU checker reason messages for consistency
hlts2 Apr 14, 2026
5041abd
refactor: move hasGPU to health.HasGPU
hlts2 Apr 14, 2026
96e6866
test: add missing tests for Threshold, HasGPU, validation, and buildN…
hlts2 Apr 14, 2026
c583f7c
refactor: extract threshold values into local constants
hlts2 Apr 14, 2026
de77915
feat: add NopExecutor as default to prevent nil pointer dereference
hlts2 Apr 14, 2026
edefca8
refactor: rename nodeSelector to nodeLabelSelector and inline FormatL…
hlts2 Apr 14, 2026
53a4799
refactor: unexport test-only options WithNowFunc and WithNodeLister
hlts2 Apr 14, 2026
cad145f
fix: use UTC for all internal timestamps
hlts2 Apr 14, 2026
98f80c5
refactor: apply Civo Go testing conventions to all test files
hlts2 Apr 14, 2026
df5b428
refactor: remove redundant state.Phase() call in recovery check
hlts2 Apr 14, 2026
4943091
docs: add state transition comments to reconcile loop
hlts2 Apr 14, 2026
4025e68
fix: add graceful shutdown for metrics server and fix godoc typo
hlts2 Apr 14, 2026
d05744b
fix: use clientCfgPath instead of cfg in kubeconfig error message
hlts2 Apr 14, 2026
45a4bad
fix: use fixed reason strings in GPU checker to avoid high-cardinalit…
hlts2 Apr 14, 2026
cad1c3e
fix: clean up Prometheus gauge metrics when nodes are removed
hlts2 Apr 14, 2026
6bb50ca
fix: skip label selector when no node pool IDs are configured
hlts2 Apr 15, 2026
9a900cd
fix: log node label selector on informer setup
hlts2 Apr 15, 2026
fb830cc
fix: add skip logs for unhealthy threshold wait and reboot wait
hlts2 Apr 15, 2026
73d943b
feat: update Helm chart env vars and bump version to 0.2.0
hlts2 Apr 20, 2026
bc11770
docs: update README for new config and civo-api-access secret
hlts2 Apr 20, 2026
ebec5b4
fix: always render CIVO_NODE_POOL_ID env var even when empty
hlts2 Apr 20, 2026
7ee2fa5
fix: rename CIVO_NODE_POOL_ID to CIVO_NODE_POOL_IDS
hlts2 Apr 20, 2026
89d88ed
fix: correct civo-api-access secret key for CIVO_API_KEY
hlts2 Apr 20, 2026
bbfd189
docs: correct civo-api-access secret key name
hlts2 Apr 20, 2026
4e295c4
fix: add civo_ prefix to all Prometheus metric names
hlts2 Apr 20, 2026
86a245c
docs: link to Civo GPU operator docs instead of inline NVIDIA device …
hlts2 Apr 20, 2026
4a66d5d
feat: add civo_node_agent_recovery_failures_total metric
hlts2 Apr 20, 2026
b4203b9
feat: add civo_node_agent_info metric with version and cluster_id labels
hlts2 Apr 20, 2026
1a90b31
feat: delete info metric on graceful shutdown
hlts2 Apr 20, 2026
8ff99e2
feat: add civo_node_agent_reconcile_errors_total metric
hlts2 Apr 20, 2026
38a0657
refactor: move watcher start log out of ticker loop
hlts2 Apr 20, 2026
4c0d7f3
docs: update GPU operator link to Civo docs site
hlts2 Apr 21, 2026
2dc6d69
fix: suppress "Waiting for reboot effect" log in monitor-only mode
hlts2 Apr 21, 2026
9dfb942
docs: sync AGENTS.md with current implementation
hlts2 Apr 21, 2026
e64cb2a
fix: log invalid MonitorOnly value instead of silently ignoring
hlts2 Apr 21, 2026
3752158
docs: simplify AGENTS.md and clarify deployment target
hlts2 Apr 21, 2026
1207ff5
docs: trim AGENTS.md to essentials
hlts2 Apr 21, 2026
640596c
feat: add PhaseFailed and reboot retry limit
hlts2 Apr 21, 2026
ffe8d7a
docs: update AGENTS.md for PhaseFailed and trim interface details
hlts2 Apr 21, 2026
3ae7bf9
test: cover PhaseFailed, retry limit, and monitor-only count semantics
hlts2 Apr 21, 2026
a4e9658
fix: address review feedback (RBAC, time.Duration, error handling)
hlts2 Apr 21, 2026
51c5d38
fix: NodeState concurrency, Healthy metric, log fixes
hlts2 Apr 21, 2026
8debc95
refactor: monitor mode fully simulates the recovery lifecycle
hlts2 Apr 21, 2026
47239f7
fix: align HealthCheckTotal result label with godoc and drop no-op In…
hlts2 Apr 21, 2026
6620955
refactor: log checker reasons per-check and drop unused PhaseReboot
hlts2 Apr 21, 2026
0a5c02c
refactor: drop per-check "Health check failed" log
hlts2 Apr 21, 2026
94fc280
docs: rename build output to node-agent
hlts2 Apr 21, 2026
3826bd2
feat: cap reboot API failures to bound Civo API load
hlts2 Apr 23, 2026
55 changes: 20 additions & 35 deletions AGENTS.md
@@ -4,57 +4,42 @@ This file provides guidance to AI coding agents when working with code in this r

## Project Overview

Kubernetes node agent for Civo cloud that monitors cluster nodes and triggers automatic hard reboots via the Civo API when nodes become NotReady or lose expected GPU capacity. Deployed as a single-replica Deployment in kube-system via Helm.
`civo-node-agent` monitors Kubernetes nodes in a Civo cluster and triggers automatic recovery actions (currently hard reboot via the Civo API) when nodes fail health checks.

### Deployment

`civo-node-agent` is designed to run as a daemon process on the control plane VM, which is the preferred deployment. A Helm chart (`charts/`) is also provided so it can run as a single-replica Deployment in `kube-system` if needed.

By default the agent runs in **monitor-only mode** (logs recovery actions without executing them). Set `CIVO_NODE_AGENT_MONITOR_ONLY=false` to enable actual reboots.
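
The monitor-only default described above could be parsed along the following lines. This is a minimal sketch, not the repository's code: the function name `parseMonitorOnly` is hypothetical, and falling back to `true` on an empty or invalid value is an assumption based on the stated safe default and the commit that logs invalid values.

```go
package main

import (
	"fmt"
	"log/slog"
	"strconv"
)

// parseMonitorOnly is a hypothetical helper. It interprets the raw value of
// CIVO_NODE_AGENT_MONITOR_ONLY and returns true (monitor-only) unless the
// value parses to false. Assumption: invalid input keeps the safe default.
func parseMonitorOnly(raw string) bool {
	if raw == "" {
		return true // unset: stay in monitor-only mode
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		// Log instead of silently ignoring the bad value.
		slog.Warn("invalid monitor-only value, keeping default", "value", raw, "default", true)
		return true
	}
	return v
}

func main() {
	for _, raw := range []string{"", "false", "true", "nope"} {
		fmt.Printf("%q -> monitorOnly=%v\n", raw, parseMonitorOnly(raw))
	}
}
```

Defaulting to `true` on bad input means a typo in the environment can never enable reboots by accident.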

## Build & Test Commands

```bash
# Build
go build -o node-agent ./
# Build (CGO disabled — no C dependencies; required for static binary on the CP VM)
CGO_ENABLED=0 go build -o node-agent ./

# Run all tests
go test ./...

# Run a single test
go test ./pkg/watcher/ -run TestName

# Build Docker image (dry-run)
goreleaser release --snapshot --skip=publish --clean
# Before completing any task, always run:
go fmt ./...
go vet ./...
go test ./...
```

No linter is configured in CI.

## Architecture

**Entrypoint** (`main.go`): Reads env vars, sets up JSON structured logging (slog), creates a Watcher, and runs it with graceful SIGTERM/SIGINT shutdown.

**Core package** (`pkg/watcher/`):
- `watcher.go` — Main loop polls every 10 seconds. For each node matching the node pool label (`kubernetes.civo.com/civo-node-pool={nodePoolID}`), checks if the node is NotReady or has fewer GPUs than desired. If a reboot is warranted (and cooldown window hasn't elapsed), calls `HardRebootInstance` via the Civo API.
- `options.go` — Functional options pattern (`WithKubernetesClient`, `WithCivoClient`, etc.) for dependency injection and configuration.
- `fake.go` — `FakeClient` implementing `civogo.Clienter` for testing.
- `watcher_test.go` — Tests use fake Kubernetes client (`k8s.io/client-go/kubernetes/fake`) and `FakeClient` for Civo API.

**Reboot safeguards**: Tracks last reboot time per node in a `sync.Map`. Skips reboot if the node's Ready/NotReady condition transitioned recently or a reboot command was sent within the configurable time window (default 40 minutes).

## Required Environment Variables

`CIVO_API_KEY`, `CIVO_REGION`, `CIVO_CLUSTER_ID`, `CIVO_NODE_POOL_ID` — see `.env.example`.

Optional: `CIVO_API_URL`, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`.

## Deployment

Helm chart in `charts/`. Secrets are expected in `civo-node-agent` and `civo-api-access` Kubernetes secrets.

```bash
helm upgrade -n kube-system --install node-agent ./charts
```
**Entrypoint** (`main.go`): Reads env vars + `--kubeconfig` flag, sets up JSON structured logging (slog), registers Prometheus metrics, starts the metrics HTTP server, constructs an Executor + Checkers + Watcher, and runs the watcher with graceful SIGTERM/SIGINT shutdown.

## Key Dependencies
### Packages

- `github.com/civo/civogo` — Civo cloud API client
- `k8s.io/client-go` — Kubernetes client (in-cluster config by default)
- **`pkg/watcher/`** — Orchestrator. Sets up a Node Informer (filtered by optional node pool label selector) and runs a 10s ticker reconcile loop driving the state machine (`Unknown → Healthy → Unhealthy → WaitingReboot → Failed`).
- **`pkg/health/`** — Health checkers.
- **`pkg/operation/`** — Recovery executors (Civo API reboot; nop executor used as safe default).
- **`pkg/metrics/`** — Prometheus metrics (all `civo_` prefixed). Defined once in `metrics.go`.
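
The state machine named in the `pkg/watcher/` entry can be pictured as a pure transition function evaluated once per reconcile tick. The sketch below is illustrative only: the `Phase` constants mirror the phase names above, but the `tick` struct, the `next` function, and the exact transition rules (for example, that a recovered node returns to `Healthy` from any phase, and that the retry limit is checked when leaving `Unhealthy`) are assumptions, not the repository's implementation.

```go
package main

import "fmt"

// Phase names follow the state machine listed above.
type Phase string

const (
	PhaseUnknown       Phase = "Unknown"
	PhaseHealthy       Phase = "Healthy"
	PhaseUnhealthy     Phase = "Unhealthy"
	PhaseWaitingReboot Phase = "WaitingReboot"
	PhaseFailed        Phase = "Failed"
)

// tick is a hypothetical summary of what one reconcile pass observed.
type tick struct {
	healthy           bool // all checkers passed
	thresholdExceeded bool // a checker has failed longer than its threshold
	waitElapsed       bool // the reboot wait time has passed
	retries           int  // reboots already attempted
	maxRetries        int  // reboot retry limit
}

// next returns the phase after one tick (assumed rules, see lead-in).
func next(cur Phase, in tick) Phase {
	if in.healthy {
		return PhaseHealthy
	}
	switch cur {
	case PhaseUnknown, PhaseHealthy:
		return PhaseUnhealthy
	case PhaseUnhealthy:
		if !in.thresholdExceeded {
			return PhaseUnhealthy // keep waiting for the threshold
		}
		if in.retries >= in.maxRetries {
			return PhaseFailed // retry limit reached: no further reboots
		}
		return PhaseWaitingReboot // reboot issued (or simulated in monitor-only mode)
	case PhaseWaitingReboot:
		if in.waitElapsed {
			return PhaseUnhealthy // still unhealthy after the wait: reconsider
		}
		return PhaseWaitingReboot
	default:
		return cur // Failed stays Failed while the node is unhealthy
	}
}

func main() {
	phase := PhaseUnknown
	ticks := []tick{
		{healthy: true},
		{},
		{thresholdExceeded: true, maxRetries: 5},
		{thresholdExceeded: true, waitElapsed: true, retries: 1, maxRetries: 5},
	}
	for _, t := range ticks {
		phase = next(phase, t)
		fmt.Println(phase)
	}
}
```

Keeping the transition logic free of side effects makes it straightforward to cover with table-driven tests, independent of the Informer and the Civo API.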

## Release

Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub.
Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub. The same binary is also uploaded to Civo object storage for CP VM installations (handled outside this repository).
78 changes: 37 additions & 41 deletions README.md
@@ -1,64 +1,60 @@
# Node Agent

`node-agent` monitors the health of Kubernetes nodes and can automatically restart VM instances when necessary. It triggers a restart under the following conditions:
`node-agent` monitors the health of Kubernetes nodes and can automatically reboot VM instances when necessary. A reboot is triggered when a node fails one or more health checks (e.g. `NodeReady`, GPU count, Cilium, DiskPressure) for a configured threshold.

- A node enters the **NotReady** state.
- The number of available GPUs per node falls below a configured threshold.
By default it runs in **monitor-only** mode, logging recovery actions without executing them. Set `monitorOnly=false` to enable actual reboots.

## Prerequisites: `civo-api-access` Secret

## Set Your `civo-node-agent` Secret
The `civo-api-access` secret is automatically provisioned by Civo in the `kube-system` namespace of every Civo Kubernetes cluster. It contains the API credentials and cluster identity used by `node-agent`:

```
export CIVO_DESIRED_GPU_COUNT="8"
export CIVO_NODE_POOL_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxx"
export CIVO_API_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES="xxxx"
kubectl -n kube-system delete secret civo-node-agent --ignore-not-found
kubectl -n kube-system create secret generic civo-node-agent
kubectl -n kube-system patch secret civo-node-agent -n kube-system --type='merge' \
-p='{"stringData": {"civo-api-key": "'"$CIVO_API_KEY"'", "node-pool-id": "'"$CIVO_NODE_POOL_ID"'", "desired-gpu-count": "'"$CIVO_DESIRED_GPU_COUNT"'", "time-window": "'"$CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES"'" }}'
```
| Key | Description |
|-----|-------------|
| `api-key` | Civo API key used for reboot operations. |
| `api-url` | Civo API URL. |
| `cluster-id` | The ID of this Civo Kubernetes cluster. |
| `region` | The Civo region this cluster runs in. |

## Nvidia Device Plugin Install
No manual setup is required — `node-agent` reads these values directly from the existing secret.

```bash
kubectl create ns gpu-operator
kubectl label namespace gpu-operator pod-security.kubernetes.io/enforce=privileged
kubectl label namespace gpu-operator pod-security.kubernetes.io/warn=privileged
kubectl label namespace gpu-operator pod-security.kubernetes.io/audit=privileged
```
## NVIDIA GPU Operator (GPU clusters only)

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
&& helm repo update
```
The GPU health check relies on the `nvidia.com/gpu.count` label added by the NVIDIA GPU Feature Discovery component. Follow the Civo documentation to install the NVIDIA GPU Operator on your cluster:

```bash
helm install --namespace gpu-operator nvidia-device-plugin nvdp/nvidia-device-plugin --create-namespace \
--version=0.17.0 \
--set gfd.enabled=true \
--set devicePlugin.enabled=true \
--set dcgm.enabled=true \
--set nfd.enableNodeFeatureApi=true \
--wait
```
[Installing the NVIDIA GPU Operator](https://www.civo.com/docs/kubernetes/advanced/gpu-config#installing-the-nvidia-gpu-operator)

## Install `node-agent` chart

You will need to clone this repository in order to have access to the charts directory that is used for installation. In your terminal, please change directory to your cloned `node-agent` repo directory, and then run:
You will need to clone this repository in order to have access to the charts directory. In your terminal, change directory to your cloned `node-agent` repo directory, then run:

```bash
helm upgrade -n kube-system --install node-agent ./charts
```

## Configuration Details
To enable active recovery (actually reboot nodes):

```bash
helm upgrade -n kube-system --install node-agent ./charts --set monitorOnly=false
```

The following configurations are stored in the `node-agent` secret in the `kube-system` namespace.
## Configuration

`node-pool-id`: The ID of your Kubernetes node pool which you want monitored. To collect this value, go to the [civo kubernetes dashboard](https://dashboard.civo.com/kubernetes), select your cluster, and click copy next to your pool id.
### Helm values (`values.yaml`)

`desired-gpu-count`: This value is intended to match the number of GPUs per node. If you had a 2-node cluster with 8 GPU total, you would set this value to 4 to represent the number of GPUs per node.
| Value | Default | Description |
|-------|---------|-------------|
| `nodePoolIDs` | `""` | Comma-separated node pool IDs to watch. Empty means all nodes. |
| `rebootWaitMinutes` | `10` | Minutes to wait after rebooting a standard node before retrying. |
| `gpuRebootWaitMinutes` | `40` | Minutes to wait after rebooting a GPU node before retrying. |
| `maxRebootRetries` | `5` | Maximum reboot attempts before the node transitions to `Failed` (no further reboots). |
| `monitorOnly` | `true` | If `true`, log recovery actions without executing them. Set `false` to enable reboots. |
| `metricsPort` | `9625` | Port for the Prometheus metrics endpoint. |
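
As an illustration of how a comma-separated `nodePoolIDs` value might become a node label selector, here is a minimal sketch. The helper names `parseNodePoolIDs` and `labelSelector` are hypothetical, and the label key `kubernetes.civo.com/civo-node-pool` is taken from the earlier single-pool documentation; whether the current code builds the selector exactly this way is an assumption. The `key in (a,b)` form is standard Kubernetes set-based selector syntax.

```go
package main

import (
	"fmt"
	"strings"
)

// nodePoolLabel is assumed to be the label Civo sets on pool nodes.
const nodePoolLabel = "kubernetes.civo.com/civo-node-pool"

// parseNodePoolIDs splits a comma-separated list, trimming whitespace
// and dropping empty entries.
func parseNodePoolIDs(raw string) []string {
	var ids []string
	for _, part := range strings.Split(raw, ",") {
		if id := strings.TrimSpace(part); id != "" {
			ids = append(ids, id)
		}
	}
	return ids
}

// labelSelector returns an empty selector (match all nodes) when no IDs
// are configured, otherwise a set-based selector over the pool label.
func labelSelector(ids []string) string {
	if len(ids) == 0 {
		return ""
	}
	return fmt.Sprintf("%s in (%s)", nodePoolLabel, strings.Join(ids, ","))
}

func main() {
	fmt.Printf("%q\n", labelSelector(parseNodePoolIDs("")))
	fmt.Printf("%q\n", labelSelector(parseNodePoolIDs("pool-a, pool-b")))
}
```

Returning an empty selector for an empty list matches the documented "empty means all nodes" behaviour.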

`civo-api-key`: The civo api key to use when automatically rebooting nodes. To collect this value, go to toue [civo settings security tab](https://dashboard.civo.com/security).
### Health checkers

`time-window`: The time-window is the time we need to give a node after a reboot happens
| Checker | Condition | Threshold |
|---------|-----------|-----------|
| `NodeReady` | `NodeReady == True` | 5 min |
| `DiskPressure` | `DiskPressure != True` | 30 min |
| `CiliumAgent` | `NetworkUnavailable == False` with reason `CiliumIsUp` (skipped for non-Cilium CNI) | 10 min |
| `GPU` | `allocatable["nvidia.com/gpu"]` equals `nvidia.com/gpu.count` label (skipped for non-GPU nodes) | 10 min |
37 changes: 20 additions & 17 deletions charts/templates/deployment.yaml
@@ -39,8 +39,8 @@ spec:
- name: CIVO_API_KEY
valueFrom:
secretKeyRef:
name: civo-node-agent
key: civo-api-key
name: civo-api-access
key: api-key
- name: CIVO_API_URL
valueFrom:
secretKeyRef:
@@ -56,27 +56,30 @@ spec:
secretKeyRef:
name: civo-api-access
key: region
- name: CIVO_NODE_POOL_ID
valueFrom:
secretKeyRef:
name: civo-node-agent
key: node-pool-id
- name: CIVO_NODE_DESIRED_GPU_COUNT
valueFrom:
secretKeyRef:
name: civo-node-agent
key: desired-gpu-count
- name: CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES
valueFrom:
secretKeyRef:
name: civo-node-agent
key: time-window
- name: CIVO_NODE_POOL_IDS
value: {{ .Values.nodePoolIDs | quote }}
- name: CIVO_NODE_REBOOT_WAIT_MINUTES
value: {{ .Values.rebootWaitMinutes | quote }}
- name: CIVO_GPU_NODE_REBOOT_WAIT_MINUTES
value: {{ .Values.gpuRebootWaitMinutes | quote }}
- name: CIVO_NODE_MAX_REBOOT_RETRIES
value: {{ .Values.maxRebootRetries | quote }}
- name: CIVO_NODE_AGENT_MONITOR_ONLY
value: {{ .Values.monitorOnly | quote }}
- name: CIVO_NODE_AGENT_METRICS_PORT
value: {{ .Values.metricsPort | quote }}
{{- with .Values.securityContext }}
securityContext:
{{- toYaml . | nindent 12 }}
{{- end }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
args:
- "--kubeconfig="
ports:
- name: metrics
containerPort: {{ .Values.metricsPort | default 9625 }}
protocol: TCP
{{- with .Values.resources }}
resources:
{{- toYaml . | nindent 12 }}
3 changes: 2 additions & 1 deletion charts/templates/rbac.yaml
@@ -16,5 +16,6 @@ subjects:
name: {{ .Chart.Name }}
namespace: kube-system
roleRef:
kind: ClusterRole
kind: ClusterRole
name: {{ .Chart.Name }}
apiGroup: rbac.authorization.k8s.io
18 changes: 18 additions & 0 deletions charts/values.yaml
@@ -6,6 +6,24 @@ image:
pullPolicy: IfNotPresent
tag: "6b8426a"

# Comma-separated node pool IDs to watch (empty = all nodes).
nodePoolIDs: ""

# Reboot wait time for standard nodes (minutes).
rebootWaitMinutes: 10

# Reboot wait time for GPU nodes (minutes).
gpuRebootWaitMinutes: 40

# Maximum number of reboot attempts before a node transitions to PhaseFailed.
maxRebootRetries: 5

# Monitor-only mode: log recovery actions without executing them.
monitorOnly: true

# Port for Prometheus metrics endpoint.
metricsPort: 9625

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
23 changes: 15 additions & 8 deletions go.mod
@@ -4,12 +4,15 @@ go 1.24.0

require (
github.com/civo/civogo v0.3.94
github.com/prometheus/client_golang v1.23.2
k8s.io/api v0.32.2
k8s.io/apimachinery v0.32.2
k8s.io/client-go v0.32.2
)

require (
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/emicklei/go-restful/v3 v3.11.0 // indirect
github.com/fxamacker/cbor/v2 v2.7.0 // indirect
@@ -20,7 +23,7 @@ require (
github.com/gogo/protobuf v1.3.2 // indirect
github.com/golang/protobuf v1.5.4 // indirect
github.com/google/gnostic-models v0.6.8 // indirect
github.com/google/go-cmp v0.6.0 // indirect
github.com/google/go-cmp v0.7.0 // indirect
github.com/google/go-querystring v1.1.0 // indirect
github.com/google/gofuzz v1.2.0 // indirect
github.com/google/uuid v1.6.0 // indirect
@@ -31,16 +34,20 @@ require (
github.com/modern-go/reflect2 v1.0.2 // indirect
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
github.com/pkg/errors v0.9.1 // indirect
github.com/prometheus/client_model v0.6.2 // indirect
github.com/prometheus/common v0.66.1 // indirect
github.com/prometheus/procfs v0.16.1 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/x448/float16 v0.8.4 // indirect
golang.org/x/mod v0.20.0 // indirect
golang.org/x/net v0.38.0 // indirect
golang.org/x/oauth2 v0.27.0 // indirect
golang.org/x/sys v0.31.0 // indirect
golang.org/x/term v0.30.0 // indirect
golang.org/x/text v0.23.0 // indirect
go.yaml.in/yaml/v2 v2.4.2 // indirect
golang.org/x/mod v0.26.0 // indirect
golang.org/x/net v0.43.0 // indirect
golang.org/x/oauth2 v0.30.0 // indirect
golang.org/x/sys v0.35.0 // indirect
golang.org/x/term v0.34.0 // indirect
golang.org/x/text v0.28.0 // indirect
golang.org/x/time v0.7.0 // indirect
google.golang.org/protobuf v1.35.1 // indirect
google.golang.org/protobuf v1.36.8 // indirect
gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect
gopkg.in/inf.v0 v0.9.1 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect