civo · hlts2 · Apr 23, 2026 · Apr 13, 2026 · Apr 13, 2026 · Apr 13, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -4,57 +4,42 @@ This file provides guidance to AI coding agents when working with code in this r
 
 ## Project Overview
 
-Kubernetes node agent for Civo cloud that monitors cluster nodes and triggers automatic hard reboots via the Civo API when nodes become NotReady or lose expected GPU capacity. Deployed as a single-replica Deployment in kube-system via Helm.
+`civo-node-agent` monitors Kubernetes nodes in a Civo cluster and triggers automatic recovery actions (currently hard reboot via the Civo API) when nodes fail health checks.
+
+### Deployment
+
+`civo-node-agent` is designed to run as a daemon process on the control plane VM, which is the preferred deployment. A Helm chart (`charts/`) is also provided so it can run as a single-replica Deployment in `kube-system` if needed.
+
+By default the agent runs in **monitor-only mode** (logs recovery actions without executing them). Set `CIVO_NODE_AGENT_MONITOR_ONLY=false` to enable actual reboots.
 
 ## Build & Test Commands
 
 ```bash
-# Build
-go build -o node-agent ./
+# Build (CGO disabled — no C dependencies; required for static binary on the CP VM)
+CGO_ENABLED=0 go build -o node-agent ./
 
 # Run all tests
 go test ./...
 
-# Run a single test
-go test ./pkg/watcher/ -run TestName
-
-# Build Docker image (dry-run)
-goreleaser release --snapshot --skip=publish --clean
+# Before completing any task, always run:
+go fmt ./...
+go vet ./...
+go test ./...
 ```
 
 No linter is configured in CI.
 
 ## Architecture
 
-**Entrypoint** (`main.go`): Reads env vars, sets up JSON structured logging (slog), creates a Watcher, and runs it with graceful SIGTERM/SIGINT shutdown.
-
-**Core package** (`pkg/watcher/`):
-- `watcher.go` — Main loop polls every 10 seconds. For each node matching the node pool label (`kubernetes.civo.com/civo-node-pool={nodePoolID}`), checks if the node is NotReady or has fewer GPUs than desired. If a reboot is warranted (and cooldown window hasn't elapsed), calls `HardRebootInstance` via the Civo API.
-- `options.go` — Functional options pattern (`WithKubernetesClient`, `WithCivoClient`, etc.) for dependency injection and configuration.
-- `fake.go` — `FakeClient` implementing `civogo.Clienter` for testing.
-- `watcher_test.go` — Tests use fake Kubernetes client (`k8s.io/client-go/kubernetes/fake`) and `FakeClient` for Civo API.
-
-**Reboot safeguards**: Tracks last reboot time per node in a `sync.Map`. Skips reboot if the node's Ready/NotReady condition transitioned recently or a reboot command was sent within the configurable time window (default 40 minutes).
-
-## Required Environment Variables
-
-`CIVO_API_KEY`, `CIVO_REGION`, `CIVO_CLUSTER_ID`, `CIVO_NODE_POOL_ID` — see `.env.example`.
-
-Optional: `CIVO_API_URL`, `CIVO_NODE_DESIRED_GPU_COUNT`, `CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES`.
-
-## Deployment
-
-Helm chart in `charts/`. Secrets are expected in `civo-node-agent` and `civo-api-access` Kubernetes secrets.
-
-```bash
-helm upgrade -n kube-system --install node-agent ./charts
-```
+**Entrypoint** (`main.go`): Reads env vars + `--kubeconfig` flag, sets up JSON structured logging (slog), registers Prometheus metrics, starts the metrics HTTP server, constructs an Executor + Checkers + Watcher, and runs the watcher with graceful SIGTERM/SIGINT shutdown.
 
-## Key Dependencies
+### Packages
 
-- `github.com/civo/civogo` — Civo cloud API client
-- `k8s.io/client-go` — Kubernetes client (in-cluster config by default)
+- **`pkg/watcher/`** — Orchestrator. Sets up a Node Informer (filtered by optional node pool label selector) and runs a 10s ticker reconcile loop driving the state machine (`Unknown → Healthy → Unhealthy → WaitingReboot → Failed`).
+- **`pkg/health/`** — Health checkers.
+- **`pkg/operation/`** — Recovery executors (Civo API reboot; nop executor used as safe default).
+- **`pkg/metrics/`** — Prometheus metrics (all `civo_` prefixed). Defined once in `metrics.go`.
 
 ## Release
 
-Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub.
+Tags matching `v*.*.*` trigger `.github/workflows/release-image.yaml`, which builds multi-arch Docker images via goreleaser and publishes to Docker Hub. The same binary is also uploaded to Civo object storage for CP VM installations (handled outside this repository).
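
Review note on the `pkg/watcher/` state machine (`Unknown → Healthy → Unhealthy → WaitingReboot → Failed`): the diff names the phases but does not show the transition rules, so the following is only a minimal sketch of how such a loop could advance once per tick. The `Phase` constants mirror the names in AGENTS.md; the `nodeState` struct, the `next` function, its parameters, and the exact rules (for example, returning to `Unhealthy` once the reboot wait has elapsed) are assumptions made for illustration, not the repository's implementation.

```go
package main

import "fmt"

// Phase mirrors the state names listed in AGENTS.md; the type is hypothetical.
type Phase string

const (
	Unknown       Phase = "Unknown"
	Healthy       Phase = "Healthy"
	Unhealthy     Phase = "Unhealthy"
	WaitingReboot Phase = "WaitingReboot"
	Failed        Phase = "Failed"
)

// nodeState is an illustrative per-node record, not the agent's real struct.
type nodeState struct {
	Phase   Phase
	Reboots int // reboot attempts made so far
}

// next advances a node by one reconcile tick.
// healthy is the combined health-check result; waitElapsed says whether the
// post-reboot wait has passed; maxRetries corresponds to maxRebootRetries.
func next(s nodeState, healthy, waitElapsed bool, maxRetries int) nodeState {
	switch s.Phase {
	case Failed:
		return s // terminal: no further reboots
	case WaitingReboot:
		if healthy {
			return nodeState{Phase: Healthy}
		}
		if waitElapsed {
			s.Phase = Unhealthy // still broken after the wait: eligible for retry
		}
		return s
	case Unhealthy:
		if healthy {
			return nodeState{Phase: Healthy}
		}
		if s.Reboots >= maxRetries {
			s.Phase = Failed
			return s
		}
		s.Reboots++ // a reboot would be issued here (or only logged in monitor-only mode)
		s.Phase = WaitingReboot
		return s
	default: // Unknown or Healthy
		if healthy {
			return nodeState{Phase: Healthy}
		}
		s.Phase = Unhealthy
		return s
	}
}

func main() {
	s := nodeState{Phase: Unknown}
	for tick := 1; tick <= 6; tick++ {
		s = next(s, false, true, 2)
		fmt.Printf("tick %d: phase=%s reboots=%d\n", tick, s.Phase, s.Reboots)
	}
}
```

Keeping the transition logic in a pure function like this makes the retry cap easy to unit-test without an informer or a Civo client.
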
diff --git a/README.md b/README.md
@@ -1,64 +1,60 @@
 # Node Agent
 
-`node-agent` monitors the health of Kubernetes nodes and can automatically restart VM instances when necessary. It triggers a restart under the following conditions:  
+`node-agent` monitors the health of Kubernetes nodes and can automatically reboot VM instances when necessary. A reboot is triggered when a node fails one or more health checks (e.g. `NodeReady`, GPU count, Cilium, DiskPressure) for longer than that check's threshold.
 
-- A node enters the **NotReady** state.  
-- The number of available GPUs per node falls below a configured threshold.  
+By default it runs in **monitor-only** mode, logging recovery actions without executing them. Set `monitorOnly=false` to enable actual reboots.
 
+## Prerequisites: `civo-api-access` Secret
 
-## Set Your `civo-node-agent` Secret
+The `civo-api-access` secret is automatically provisioned by Civo in the `kube-system` namespace of every Civo Kubernetes cluster. It contains the API credentials and cluster identity used by `node-agent`:
 
-```
-export CIVO_DESIRED_GPU_COUNT="8"
-export CIVO_NODE_POOL_ID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxxxx"
-export CIVO_API_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
-export CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES="xxxx"
-kubectl -n kube-system delete secret civo-node-agent --ignore-not-found
-kubectl -n kube-system create secret generic civo-node-agent
-kubectl -n kube-system patch secret civo-node-agent -n kube-system --type='merge' \
-    -p='{"stringData": {"civo-api-key": "'"$CIVO_API_KEY"'", "node-pool-id": "'"$CIVO_NODE_POOL_ID"'", "desired-gpu-count": "'"$CIVO_DESIRED_GPU_COUNT"'", "time-window": "'"$CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES"'" }}'
-```
+| Key | Description |
+|-----|-------------|
+| `api-key`  | Civo API key used for reboot operations. |
+| `api-url` | Civo API URL. |
+| `cluster-id` | The ID of this Civo Kubernetes cluster. |
+| `region` | The Civo region this cluster runs in. |
 
-## Nvidia Device Plugin Install 
+No manual setup is required — `node-agent` reads these values directly from the existing secret.
 
-```bash
-kubectl create ns gpu-operator
-kubectl label namespace gpu-operator pod-security.kubernetes.io/enforce=privileged                                              
-kubectl label namespace gpu-operator pod-security.kubernetes.io/warn=privileged
-kubectl label namespace gpu-operator pod-security.kubernetes.io/audit=privileged
-```
+## NVIDIA GPU Operator (GPU clusters only)
 
-```bash
-helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
-&& helm repo update
-```
+The GPU health check relies on the `nvidia.com/gpu.count` label added by the NVIDIA GPU Feature Discovery component. Follow the Civo documentation to install the NVIDIA GPU Operator on your cluster:
 
-```bash
-helm install --namespace gpu-operator nvidia-device-plugin nvdp/nvidia-device-plugin --create-namespace \
-        --version=0.17.0 \
-        --set gfd.enabled=true \
-        --set devicePlugin.enabled=true \
-        --set dcgm.enabled=true \
-        --set nfd.enableNodeFeatureApi=true \
-        --wait
-```
+[Installing the NVIDIA GPU Operator](https://www.civo.com/docs/kubernetes/advanced/gpu-config#installing-the-nvidia-gpu-operator)
 
 ## Install `node-agent` chart
 
-You will need to clone this repository in order to have access to the charts directory that is used for installation. In your terminal, please change directory to your cloned `node-agent` repo directory, and then run:
+Clone this repository to get the `charts/` directory, then run the following from the root of your `node-agent` checkout:
 
 ```bash
 helm upgrade -n kube-system --install node-agent ./charts
 ```
 
-## Configuration Details
+To enable active recovery (actually reboot nodes):
+
+```bash
+helm upgrade -n kube-system --install node-agent ./charts --set monitorOnly=false
+```
 
-The following configurations are stored in the `node-agent` secret in the `kube-system` namespace.
+## Configuration
 
-`node-pool-id`: The ID of your Kubernetes node pool which you want monitored. To collect this value, go to the [civo kubernetes dashboard](https://dashboard.civo.com/kubernetes), select your cluster, and click copy next to your pool id.
+### Helm values (`values.yaml`)
 
-`desired-gpu-count`: This value is intended to match the number of GPUs per node. If you had a 2-node cluster with 8 GPU total, you would set this value to 4 to represent the number of GPUs per node.
+| Value | Default | Description |
+|-------|---------|-------------|
+| `nodePoolIDs` | `""` | Comma-separated node pool IDs to watch. Empty means all nodes. |
+| `rebootWaitMinutes` | `10` | Minutes to wait after rebooting a standard node before retrying. |
+| `gpuRebootWaitMinutes` | `40` | Minutes to wait after rebooting a GPU node before retrying. |
+| `maxRebootRetries` | `5` | Maximum reboot attempts before the node transitions to `Failed` (no further reboots). |
+| `monitorOnly` | `true` | If `true`, log recovery actions without executing them. Set `false` to enable reboots. |
+| `metricsPort` | `9625` | Port for the Prometheus metrics endpoint. |
 
-`civo-api-key`: The civo api key to use when automatically rebooting nodes. To collect this value, go to toue [civo settings security tab](https://dashboard.civo.com/security).
+### Health checkers
 
-`time-window`: The time-window is the time we need to give a node after a reboot happens
+| Checker | Healthy when | Failure threshold |
+|---------|-----------|-----------|
+| `NodeReady` | `NodeReady == True` | 5 min |
+| `DiskPressure` | `DiskPressure != True` | 30 min |
+| `CiliumAgent` | `NetworkUnavailable == False` with reason `CiliumIsUp` (skipped for non-Cilium CNI) | 10 min |
+| `GPU` | `allocatable["nvidia.com/gpu"]` equals `nvidia.com/gpu.count` label (skipped for non-GPU nodes) | 10 min |
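
Review note on the health checker table: one way to read the threshold column is that a checker only counts as failed once its condition has been violated continuously for at least that long. The sketch below encodes that reading with plain Go types so it runs without `client-go`. The `checker` struct, `exceeded`, and `gpuOK` are hypothetical names, the `>=` comparison is an assumption, and treating an unparsable label as a failure is also an assumption; only the checker names, thresholds, and the `nvidia.com/gpu.count` label come from the README.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const gpuCountLabel = "nvidia.com/gpu.count"

// checker pairs a checker name with its threshold from the README table.
type checker struct {
	Name      string
	Threshold time.Duration
}

// checkers copies the README table; the slice itself is illustrative.
var checkers = []checker{
	{"NodeReady", 5 * time.Minute},
	{"DiskPressure", 30 * time.Minute},
	{"CiliumAgent", 10 * time.Minute},
	{"GPU", 10 * time.Minute},
}

// exceeded reports whether a condition that has been failing for
// failingFor has reached the checker's threshold.
func exceeded(c checker, failingFor time.Duration) bool {
	return failingFor >= c.Threshold
}

// gpuOK compares allocatable GPUs with the nvidia.com/gpu.count label.
// Nodes without the label are treated as non-GPU nodes and skipped.
func gpuOK(allocatable int64, labels map[string]string) bool {
	raw, found := labels[gpuCountLabel]
	if !found {
		return true
	}
	want, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false // assumption: an unparsable label counts as failing
	}
	return allocatable == want
}

func main() {
	now := time.Now()
	failingSince := now.Add(-7 * time.Minute) // example: failing for 7 minutes
	for _, c := range checkers {
		fmt.Printf("%-12s exceeded=%v\n", c.Name, exceeded(c, now.Sub(failingSince)))
	}
	fmt.Println("gpu ok:", gpuOK(7, map[string]string{gpuCountLabel: "8"}))
}
```
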
diff --git a/charts/templates/deployment.yaml b/charts/templates/deployment.yaml
@@ -39,8 +39,8 @@ spec:
             - name: CIVO_API_KEY
               valueFrom:
                 secretKeyRef:
-                  name: civo-node-agent
-                  key: civo-api-key
+                  name: civo-api-access
+                  key: api-key
             - name: CIVO_API_URL
               valueFrom:
                 secretKeyRef:
@@ -56,27 +56,30 @@ spec:
                 secretKeyRef:
                   name: civo-api-access
                   key: region
-            - name: CIVO_NODE_POOL_ID
-              valueFrom:
-                secretKeyRef:
-                  name: civo-node-agent
-                  key: node-pool-id
-            - name: CIVO_NODE_DESIRED_GPU_COUNT
-              valueFrom:
-                secretKeyRef:
-                  name: civo-node-agent
-                  key: desired-gpu-count
-            - name: CIVO_NODE_REBOOT_TIME_WINDOW_MINUTES
-              valueFrom:
-                secretKeyRef:
-                  name: civo-node-agent 
-                  key: time-window
+            - name: CIVO_NODE_POOL_IDS
+              value: {{ .Values.nodePoolIDs | quote }}
+            - name: CIVO_NODE_REBOOT_WAIT_MINUTES
+              value: {{ .Values.rebootWaitMinutes | quote }}
+            - name: CIVO_GPU_NODE_REBOOT_WAIT_MINUTES
+              value: {{ .Values.gpuRebootWaitMinutes | quote }}
+            - name: CIVO_NODE_MAX_REBOOT_RETRIES
+              value: {{ .Values.maxRebootRetries | quote }}
+            - name: CIVO_NODE_AGENT_MONITOR_ONLY
+              value: {{ .Values.monitorOnly | quote }}
+            - name: CIVO_NODE_AGENT_METRICS_PORT
+              value: {{ .Values.metricsPort | quote }}
           {{- with .Values.securityContext }}
           securityContext:
             {{- toYaml . | nindent 12 }}
           {{- end }}
           image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
           imagePullPolicy: {{ .Values.image.pullPolicy }}
+          args:
+            - "--kubeconfig="
+          ports:
+            - name: metrics
+              containerPort: {{ .Values.metricsPort | default 9625 }}
+              protocol: TCP
           {{- with .Values.resources }}
           resources:
             {{- toYaml . | nindent 12 }}
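
Review note on the new environment variables: because every value is rendered through `quote`, the agent receives strings and has to parse them itself. The diff does not include that parsing code, so this is a minimal sketch under stated assumptions: the `config` struct, `load`, and `intOr` are hypothetical, and falling back to the `values.yaml` defaults (including `monitorOnly: true`) on missing or malformed input is a choice made here, not confirmed behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// config is an illustrative holder for the variables set in deployment.yaml.
type config struct {
	NodePoolIDs   []string // empty means all nodes
	RebootWait    time.Duration
	GPURebootWait time.Duration
	MaxRetries    int
	MonitorOnly   bool
	MetricsPort   int
}

// intOr parses raw as an int and returns def when raw is empty or invalid.
func intOr(raw string, def int) int {
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil {
		return def
	}
	return n
}

// load reads the agent's settings through getenv so it can be tested
// without touching the process environment.
func load(getenv func(string) string) config {
	cfg := config{
		RebootWait:    time.Duration(intOr(getenv("CIVO_NODE_REBOOT_WAIT_MINUTES"), 10)) * time.Minute,
		GPURebootWait: time.Duration(intOr(getenv("CIVO_GPU_NODE_REBOOT_WAIT_MINUTES"), 40)) * time.Minute,
		MaxRetries:    intOr(getenv("CIVO_NODE_MAX_REBOOT_RETRIES"), 5),
		MetricsPort:   intOr(getenv("CIVO_NODE_AGENT_METRICS_PORT"), 9625),
		MonitorOnly:   true, // safe default: log only
	}
	// Comma-separated list; blanks are dropped so "" yields no filter.
	for _, id := range strings.Split(getenv("CIVO_NODE_POOL_IDS"), ",") {
		if id = strings.TrimSpace(id); id != "" {
			cfg.NodePoolIDs = append(cfg.NodePoolIDs, id)
		}
	}
	// Only an explicit, parsable value can switch monitor-only mode off.
	if v, err := strconv.ParseBool(getenv("CIVO_NODE_AGENT_MONITOR_ONLY")); err == nil {
		cfg.MonitorOnly = v
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", load(os.Getenv))
}
```

Defaulting `MonitorOnly` to `true` on a parse error keeps a typo in the Helm value from silently enabling reboots.
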

diff --git a/charts/templates/rbac.yaml b/charts/templates/rbac.yaml
@@ -16,5 +16,6 @@ subjects:
   name: {{ .Chart.Name }}
   namespace: kube-system
 roleRef:
-  kind: ClusterRole 
+  kind: ClusterRole
   name: {{ .Chart.Name }}
+  apiGroup: rbac.authorization.k8s.io
diff --git a/charts/values.yaml b/charts/values.yaml
@@ -6,6 +6,24 @@ image:
   pullPolicy: IfNotPresent
   tag: "6b8426a"
 
+# Comma-separated node pool IDs to watch (empty = all nodes).
+nodePoolIDs: ""
+
+# Reboot wait time for standard nodes (minutes).
+rebootWaitMinutes: 10
+
+# Reboot wait time for GPU nodes (minutes).
+gpuRebootWaitMinutes: 40
+
+# Maximum number of reboot attempts before a node transitions to PhaseFailed.
+maxRebootRetries: 5
+
+# Monitor-only mode: log recovery actions without executing them.
+monitorOnly: true
+
+# Port for Prometheus metrics endpoint.
+metricsPort: 9625
+
 imagePullSecrets: []
 nameOverride: ""
 fullnameOverride: ""
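
Review note on `rebootWaitMinutes` versus `gpuRebootWaitMinutes`: the two values imply that the agent picks a wait per node type before retrying a reboot. How it decides that a node is a GPU node is not shown in this diff; the sketch below assumes the presence of the `nvidia.com/gpu.count` label, which the README already relies on for the GPU health check. `rebootWait` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

const gpuCountLabel = "nvidia.com/gpu.count"

// rebootWait returns how long to wait after a reboot before retrying.
// Assumption: a node carrying the GPU count label is a GPU node.
func rebootWait(labels map[string]string, stdMinutes, gpuMinutes int) time.Duration {
	if _, isGPU := labels[gpuCountLabel]; isGPU {
		return time.Duration(gpuMinutes) * time.Minute
	}
	return time.Duration(stdMinutes) * time.Minute
}

func main() {
	rebootedAt := time.Now()
	nodes := []map[string]string{{}, {gpuCountLabel: "8"}}
	for _, labels := range nodes {
		wait := rebootWait(labels, 10, 40) // defaults from values.yaml
		fmt.Println(wait, "-> next attempt no earlier than", rebootedAt.Add(wait).Format(time.RFC3339))
	}
}
```
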

diff --git a/go.mod b/go.mod
@@ -4,12 +4,15 @@ go 1.24.0
 
 require (
 	github.com/civo/civogo v0.3.94
+	github.com/prometheus/client_golang v1.23.2
 	k8s.io/api v0.32.2
 	k8s.io/apimachinery v0.32.2
 	k8s.io/client-go v0.32.2
 )
 
 require (
+	github.com/beorn7/perks v1.0.1 // indirect
+	github.com/cespare/xxhash/v2 v2.3.0 // indirect
 	github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
 	github.com/emicklei/go-restful/v3 v3.11.0 // indirect
 	github.com/fxamacker/cbor/v2 v2.7.0 // indirect
@@ -20,7 +23,7 @@ require (
 	github.com/gogo/protobuf v1.3.2 // indirect
 	github.com/golang/protobuf v1.5.4 // indirect
 	github.com/google/gnostic-models v0.6.8 // indirect
-	github.com/google/go-cmp v0.6.0 // indirect
+	github.com/google/go-cmp v0.7.0 // indirect
 	github.com/google/go-querystring v1.1.0 // indirect
 	github.com/google/gofuzz v1.2.0 // indirect
 	github.com/google/uuid v1.6.0 // indirect
@@ -31,16 +34,20 @@ require (
 	github.com/modern-go/reflect2 v1.0.2 // indirect
 	github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
 	github.com/pkg/errors v0.9.1 // indirect
+	github.com/prometheus/client_model v0.6.2 // indirect
+	github.com/prometheus/common v0.66.1 // indirect
+	github.com/prometheus/procfs v0.16.1 // indirect
 	github.com/spf13/pflag v1.0.5 // indirect
 	github.com/x448/float16 v0.8.4 // indirect
-	golang.org/x/mod v0.20.0 // indirect
-	golang.org/x/net v0.38.0 // indirect
-	golang.org/x/oauth2 v0.27.0 // indirect
-	golang.org/x/sys v0.31.0 // indirect
-	golang.org/x/term v0.30.0 // indirect
-	golang.org/x/text v0.23.0 // indirect
+	go.yaml.in/yaml/v2 v2.4.2 // indirect
+	golang.org/x/mod v0.26.0 // indirect
+	golang.org/x/net v0.43.0 // indirect
+	golang.org/x/oauth2 v0.30.0 // indirect
+	golang.org/x/sys v0.35.0 // indirect
+	golang.org/x/term v0.34.0 // indirect
+	golang.org/x/text v0.28.0 // indirect
 	golang.org/x/time v0.7.0 // indirect
-	google.golang.org/protobuf v1.35.1 // indirect
+	google.golang.org/protobuf v1.36.8 // indirect
 	gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect
 	gopkg.in/inf.v0 v0.9.1 // indirect
 	gopkg.in/yaml.v3 v3.0.1 // indirect