sethmalone/weka-operator-node-uid-patch
weka-operator patch: K8s node UID as machine_identifier

Background

By default, Weka backend containers (drive, compute) use the SMBIOS/DMI UUID reported by the hypervisor as machine_identifier. On OCI (and other cloud hypervisors), this UUID is not guaranteed to be stable across stop/start cycles — it can change whenever the VM is migrated to different physical hardware. When machine_identifier changes, Weka treats the node as a new host, which can trigger unnecessary protection rebuilds.

The weka-operator supports a weka.io/machine-identifier-ref node annotation that allows you to substitute a stable Kubernetes node UID in place of the SMBIOS UUID. Prior to this patch, the annotation was only honored for client containers; drive and compute containers were silently excluded due to a structural issue in the allocation path.

This patch corrects that. With it applied, all container types — drive, compute, client, S3, envoy — resolve machine_identifier from the node annotation before the existing allocations are consulted, so the K8s UID is written into resources.json from the very first container startup.
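The annotation mechanism can be illustrated with a small sketch (Python, not the operator's Go code; the `resolve_field` helper and the dict-based node object are illustrative stand-ins for the operator's `util.GetKubeObjectFieldValue` and the Kubernetes Node API object):

```python
# Illustrative sketch: resolving a field path such as ".metadata.uid" (the
# value of the weka.io/machine-identifier-ref annotation) against a node
# object, yielding the stable K8s node UID.

def resolve_field(obj, path):
    """Walk a dotted field path (e.g. '.metadata.uid') through a nested dict."""
    value = obj
    for part in path.strip(".").split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

# Minimal stand-in for a Node object as returned by the Kubernetes API.
node = {
    "metadata": {
        "uid": "1f2e3d4c-aaaa-bbbb-cccc-0123456789ab",
        "annotations": {"weka.io/machine-identifier-ref": ".metadata.uid"},
    }
}

path = node["metadata"]["annotations"]["weka.io/machine-identifier-ref"]
print(resolve_field(node, path))  # the stable K8s node UID
```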


Code change

File: internal/controllers/wekacontainer/funcs_resources_allocation.go
Function: getExpectedAllocations

-   var allocations *weka.ContainerAllocations
-   if r.container.Status.Allocations != nil {
-       allocations = r.container.Status.Allocations
-   } else {
-       // client flow
-       allocations = &weka.ContainerAllocations{}
-
-       machineIdentifierPath := r.container.Spec.GetOverrides().MachineIdentifierNodeRef
-       if machineIdentifierPath == "" {
-           if r.node != nil {
-               if val, ok := r.node.Annotations["weka.io/machine-identifier-ref"]; ok && val != "" {
-                   machineIdentifierPath = r.node.Annotations["weka.io/machine-identifier-ref"]
-               }
-           }
-       }
-
-       if machineIdentifierPath != "" {
-           uid, err := util.GetKubeObjectFieldValue[string](r.node, machineIdentifierPath)
-           ...
-       }
-   }
+   // Resolve machine identifier path for all container types
+   // (spec takes precedence over node annotation).
+   machineIdentifierPath := r.container.Spec.GetOverrides().MachineIdentifierNodeRef
+   if machineIdentifierPath == "" && r.node != nil {
+       if val, ok := r.node.Annotations["weka.io/machine-identifier-ref"]; ok && val != "" {
+           machineIdentifierPath = val
+       }
+   }
+
+   var allocations *weka.ContainerAllocations
+   if r.container.Status.Allocations != nil {
+       allocations = r.container.Status.Allocations
+   } else {
+       allocations = &weka.ContainerAllocations{}
+
+       if machineIdentifierPath != "" {
+           uid, err := util.GetKubeObjectFieldValue[string](r.node, machineIdentifierPath)
+           ...
+       }
+   }
+
+   // For all container types: if annotation is set but machineIdentifier
+   // was not resolved above (e.g. existing allocations didn't carry it),
+   // fall back to the K8s node UID directly.
+   if machineIdentifierPath != "" && allocations.MachineIdentifier == "" && r.node != nil {
+       allocations.MachineIdentifier = string(r.node.UID)
+   }

The key structural change is:

  1. Annotation resolution moves before the if Status.Allocations != nil branch so it runs unconditionally for every container type.
  2. A fallback clause fills in MachineIdentifier on the existing allocations object when the annotation is set but the identifier hasn't been populated yet (covers the drive/compute first-run case).
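The resulting precedence order can be sketched as pseudologic (Python for readability; function and parameter names here are illustrative, not the operator's Go identifiers, and the field-path resolution step is collapsed into the fallback):

```python
# Sketch of the patched resolution order: spec override > node annotation;
# if a path is configured but no identifier was resolved earlier, fall back
# to the K8s node UID (the drive/compute first-run case).

def resolve_machine_identifier(spec_override, node_annotations, node_uid,
                               existing_identifier):
    # 1. Spec override takes precedence over the node annotation.
    path = spec_override or node_annotations.get(
        "weka.io/machine-identifier-ref", "")
    if not path:
        return existing_identifier   # annotation unset: leave as-is
    if existing_identifier:
        return existing_identifier   # already resolved on an earlier pass
    # 2. Fallback: annotation set but identifier not yet populated.
    return node_uid

# Drive container, first run: annotation set, nothing resolved yet.
print(resolve_machine_identifier(
    "", {"weka.io/machine-identifier-ref": ".metadata.uid"},
    "node-uid-123", ""))  # -> node-uid-123
```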

Test results

Tested on a 6-node OCI DenseIO cluster (sethm0504), Weka 4.4.10.183, operator 1.12.0.

Procedure:

  1. Annotated all K8s nodes with .metadata.uid path:
    kubectl annotate node <node-name> \
      weka.io/machine-identifier-ref='.metadata.uid'
  2. Deployed a fresh WekaCluster (6 drive + 6 compute + 6 S3 + 6 client containers).
  3. After all containers reached Running, queried machine_identifier for every container.

Result:

{'OK': 24, 'MISSING': 0, 'MISMATCH': 0}

All 24 containers (drive, compute, S3, client) reported machine_identifier equal to the corresponding K8s node UID. hw_machine_identifier retained the original SMBIOS UUID as expected.
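A summary of that shape can be produced by a check along these lines (a sketch, not the exact script used: the `hostname` and `machine_identifier` field names follow `weka cluster container -J` output, and the node-UID map is assumed to be built separately from kubectl):

```python
# Compare each container's machine_identifier against the K8s UID of the
# node it runs on, and tally OK / MISSING / MISMATCH.

def summarize(containers, node_uid_by_hostname):
    summary = {"OK": 0, "MISSING": 0, "MISMATCH": 0}
    for c in containers:
        ident = c.get("machine_identifier", "")
        expected = node_uid_by_hostname.get(c.get("hostname", ""))
        if not ident or expected is None:
            summary["MISSING"] += 1
        elif ident == expected:
            summary["OK"] += 1
        else:
            summary["MISMATCH"] += 1
    return summary

containers = [{"hostname": "node-a", "machine_identifier": "uid-a"},
              {"hostname": "node-b", "machine_identifier": "uid-b"}]
print(summarize(containers, {"node-a": "uid-a", "node-b": "uid-b"}))
# {'OK': 2, 'MISSING': 0, 'MISMATCH': 0}
```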


Applying the patched image

The file weka-operator-node-uid-fallback.tar is an OCI image archive of operator image weka-operator:node-uid-fallback. Import it into containerd on every node in the cluster, then update the operator deployment to use it.

Step 1 — Import the image on each node

Copy the tar to each node and import it:

NODE_IP=<node-ip>
scp weka-operator-node-uid-fallback.tar ubuntu@${NODE_IP}:/tmp/

ssh ubuntu@${NODE_IP} \
  'sudo ctr -n k8s.io images import /tmp/weka-operator-node-uid-fallback.tar'

Verify the import:

ssh ubuntu@${NODE_IP} \
  'sudo ctr -n k8s.io images ls | grep node-uid-fallback'
# Expected: docker.io/library/weka-operator:node-uid-fallback  ...  application/vnd.docker.distribution.manifest.v2+json

Repeat for all nodes.

Step 2 — Annotate all K8s nodes

for node in $(kubectl get nodes -o name); do
  kubectl annotate $node weka.io/machine-identifier-ref='.metadata.uid' --overwrite
done

Step 3 — Patch the operator deployment

Update the operator controller-manager to use the local image and pin imagePullPolicy: Never so Kubernetes doesn't attempt to pull it from a registry. Note that the JSON-patch path below assumes the manager container sits at index 1 in the pod spec; check your deployment and adjust the index if the container order differs:

kubectl set image deployment/weka-operator-controller-manager \
  -n weka-operator-system \
  manager=docker.io/library/weka-operator:node-uid-fallback

kubectl patch deployment weka-operator-controller-manager \
  -n weka-operator-system \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/1/imagePullPolicy","value":"Never"}]'

Wait for the rollout:

kubectl rollout status deployment/weka-operator-controller-manager \
  -n weka-operator-system

Step 4 — Deploy (or redeploy) WekaCluster

If this is a fresh cluster deployment, deploy normally — all containers will use the K8s UID from startup.

If the cluster is already running and you want to apply the fix in-place (non-destructive, one container at a time):

CONTAINER=<wekacontainer-name>
NS=weka-operator-system   # adjust if different

# Force WriteResources to re-run for this container:
IDX=$(kubectl get wekacontainer $CONTAINER -n $NS \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\n"}{end}' \
  | grep -n ContainerResourcesWritten | cut -d: -f1)
IDX=$(( IDX - 1 ))
kubectl patch wekacontainer $CONTAINER -n $NS \
  --subresource=status --type=json \
  -p "[{\"op\":\"remove\",\"path\":\"/status/conditions/$IDX\"}]"

# Clear existing machineIdentifier so the operator re-resolves it
kubectl patch wekacontainer $CONTAINER -n $NS \
  --subresource=status --type=merge \
  -p '{"status":{"allocations":{"machineIdentifier":""}}}'

The operator will re-run WriteResources, write the K8s UID into resources.json, and the container will reconnect with the updated machine_identifier on next restart.

Note for drive containers: Do not run weka cluster container deactivate/remove before the pod restart. Drive containers reconnect to the cluster via their signed NVMe drives; removing the container registration breaks that reconnection path. A plain pod restart (after resources.json is updated) is sufficient.
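The index arithmetic in the condition-removal step above deserves care: grep -n reports 1-based line numbers, while JSON-patch array paths are 0-based, hence the IDX-1 adjustment. The same lookup can be sketched directly (hypothetical condition list; real WekaContainer conditions carry more fields):

```python
# Find the 0-based index of a status condition by type, as required for a
# JSON-patch path like /status/conditions/<idx>.

def condition_patch_index(conditions, wanted_type):
    for idx, cond in enumerate(conditions):
        if cond.get("type") == wanted_type:
            return idx
    return None  # condition absent: do not emit a remove op

conds = [{"type": "Ready"},
         {"type": "ContainerResourcesWritten"},
         {"type": "Joined"}]
idx = condition_patch_index(conds, "ContainerResourcesWritten")
print(idx)  # 1 -> patch path /status/conditions/1
```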

Step 5 — Verify

kubectl exec -n weka-operator-system <any-weka-backend-pod> -- \
  weka cluster container -J 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
print(f'Total containers: {len(data)}')
for c in data:
    print(c.get('mode','?'), c.get('machine_identifier',''), c.get('hw_machine_identifier',''))
"

All backend containers should show machine_identifier as a UUID matching the corresponding K8s node UID. hw_machine_identifier will still show the SMBIOS value — that is expected and correct.


Image details

Base image:       quay.io/weka.io/weka-operator:v1.12.0
Patched tag:      docker.io/library/weka-operator:node-uid-fallback
Architecture:     linux/amd64
Binary patched:   /weka-operator (replaced as a new OCI layer)
Manifest format:  Docker manifest v2 (application/vnd.docker.distribution.manifest.v2+json)

About

Patched weka-operator image: honors weka.io/machine-identifier-ref annotation for all container types (drive/compute/client/S3)
