By default, Weka backend containers (drive, compute) use the SMBIOS/DMI UUID reported by the hypervisor as `machine_identifier`. On OCI (and other cloud hypervisors), this UUID is not guaranteed to be stable across stop/start cycles — it can change whenever the VM is migrated to different physical hardware. When `machine_identifier` changes, Weka treats the node as a new host, which can trigger unnecessary protection rebuilds.

The weka-operator supports a `weka.io/machine-identifier-ref` node annotation that allows you to substitute a stable Kubernetes node UID in place of the SMBIOS UUID. Prior to this patch, the annotation was only honored for client containers; drive and compute containers were silently excluded due to a structural issue in the allocation path.

This patch corrects that. With it applied, all container types — drive, compute, client, S3, envoy — resolve `machine_identifier` from the node annotation before the existing allocations are consulted, so the K8s UID is written into `resources.json` from the very first container startup.
File: `internal/controllers/wekacontainer/funcs_resources_allocation.go`
Function: `getExpectedAllocations`
```diff
-	var allocations *weka.ContainerAllocations
-	if r.container.Status.Allocations != nil {
-		allocations = r.container.Status.Allocations
-	} else {
-		// client flow
-		allocations = &weka.ContainerAllocations{}
-
-		machineIdentifierPath := r.container.Spec.GetOverrides().MachineIdentifierNodeRef
-		if machineIdentifierPath == "" {
-			if r.node != nil {
-				if val, ok := r.node.Annotations["weka.io/machine-identifier-ref"]; ok && val != "" {
-					machineIdentifierPath = r.node.Annotations["weka.io/machine-identifier-ref"]
-				}
-			}
-		}
-
-		if machineIdentifierPath != "" {
-			uid, err := util.GetKubeObjectFieldValue[string](r.node, machineIdentifierPath)
-			...
-		}
-	}
+	// Resolve machine identifier path for all container types
+	// (spec takes precedence over node annotation).
+	machineIdentifierPath := r.container.Spec.GetOverrides().MachineIdentifierNodeRef
+	if machineIdentifierPath == "" && r.node != nil {
+		if val, ok := r.node.Annotations["weka.io/machine-identifier-ref"]; ok && val != "" {
+			machineIdentifierPath = val
+		}
+	}
+
+	var allocations *weka.ContainerAllocations
+	if r.container.Status.Allocations != nil {
+		allocations = r.container.Status.Allocations
+	} else {
+		allocations = &weka.ContainerAllocations{}
+
+		if machineIdentifierPath != "" {
+			uid, err := util.GetKubeObjectFieldValue[string](r.node, machineIdentifierPath)
+			...
+		}
+	}
+
+	// For all container types: if annotation is set but machineIdentifier
+	// was not resolved above (e.g. existing allocations didn't carry it),
+	// fall back to the K8s node UID directly.
+	if machineIdentifierPath != "" && allocations.MachineIdentifier == "" && r.node != nil {
+		allocations.MachineIdentifier = string(r.node.UID)
+	}
```

The key structural changes are:

- Annotation resolution moves before the `if r.container.Status.Allocations != nil` branch, so it runs unconditionally for every container type.
- A fallback clause fills in `MachineIdentifier` on the existing `allocations` object when the annotation is set but the identifier has not yet been populated (this covers the drive/compute first-run case).
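For illustration, `util.GetKubeObjectFieldValue` resolves a dotted field path such as `.metadata.uid` against the node object. A minimal Python analogue of that lookup — a sketch only, not the operator's actual implementation — looks like:

```python
def get_field_value(obj: dict, path: str):
    """Resolve a dotted field path (e.g. '.metadata.uid') against a nested dict."""
    value = obj
    for part in path.strip(".").split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

# A node object as returned by `kubectl get node -o json` (heavily trimmed):
node = {"metadata": {"uid": "3f2c9a1e-0000-0000-0000-000000000000", "name": "node-1"}}
print(get_field_value(node, ".metadata.uid"))  # 3f2c9a1e-0000-0000-0000-000000000000
```

With `.metadata.uid` as the path, the resolved value is exactly the K8s node UID that ends up in `resources.json`.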
Tested on a 6-node OCI DenseIO cluster (sethm0504), Weka 4.4.10.183, operator 1.12.0.
Procedure:
- Annotated all K8s nodes with the `.metadata.uid` path:

  ```shell
  kubectl annotate node <node-name> \
    weka.io/machine-identifier-ref='.metadata.uid'
  ```

- Deployed a fresh WekaCluster (6 drive + 6 compute + 6 S3 + 6 client containers).
- After all containers reached `Running`, queried `machine_identifier` for every container.
Result:
`{'OK': 24, 'MISSING': 0, 'MISMATCH': 0}`
All 24 containers (drive, compute, S3, client) reported machine_identifier equal to the corresponding K8s node UID. hw_machine_identifier retained the original SMBIOS UUID as expected.
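The tally above comes from comparing each container's machine_identifier against the UID of its node. A sketch of that cross-check, with sample data inlined and illustrative field names (in practice the container list comes from `weka cluster container -J` and the UID map from `kubectl get nodes -o json`):

```python
from collections import Counter

# Hypothetical sample data; real data comes from the weka CLI and kubectl.
node_uids = {"node-1": "uid-1", "node-2": "uid-2"}
containers = [
    {"hostname": "node-1", "mode": "drive",   "machine_identifier": "uid-1"},
    {"hostname": "node-1", "mode": "compute", "machine_identifier": "uid-1"},
    {"hostname": "node-2", "mode": "client",  "machine_identifier": "uid-X"},
]

def classify(c):
    mid = c.get("machine_identifier", "")
    if not mid:
        return "MISSING"
    return "OK" if mid == node_uids.get(c.get("hostname")) else "MISMATCH"

tally = Counter(classify(c) for c in containers)
print(dict(tally))  # {'OK': 2, 'MISMATCH': 1}
```

On the test cluster, all 24 containers fell into the `OK` bucket.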
The file weka-operator-node-uid-fallback.tar is an OCI image archive of operator image weka-operator:node-uid-fallback. Import it into containerd on every node in the cluster, then update the operator deployment to use it.
Copy the tar to each node and import it:

```shell
NODE_IP=<node-ip>
scp weka-operator-node-uid-fallback.tar ubuntu@${NODE_IP}:/tmp/
ssh ubuntu@${NODE_IP} \
  'sudo ctr -n k8s.io images import /tmp/weka-operator-node-uid-fallback.tar'
```

Verify the import:

```shell
ssh ubuntu@${NODE_IP} \
  'sudo ctr -n k8s.io images ls | grep node-uid-fallback'
# Expected: docker.io/library/weka-operator:node-uid-fallback ... application/vnd.docker.distribution.manifest.v2+json
```

Repeat for all nodes.
Annotate every node so the operator can resolve the identifier path against it:

```shell
for node in $(kubectl get nodes -o name); do
  kubectl annotate $node weka.io/machine-identifier-ref='.metadata.uid' --overwrite
done
```

Update the operator controller-manager to use the local image, and pin `imagePullPolicy: Never` so Kubernetes doesn't attempt to pull from a registry:

```shell
kubectl set image deployment/weka-operator-controller-manager \
  -n weka-operator-system \
  manager=docker.io/library/weka-operator:node-uid-fallback

kubectl patch deployment weka-operator-controller-manager \
  -n weka-operator-system \
  --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/1/imagePullPolicy","value":"Never"}]'
```

Wait for the rollout:

```shell
kubectl rollout status deployment/weka-operator-controller-manager \
  -n weka-operator-system
```

If this is a fresh cluster deployment, deploy normally — all containers will use the K8s UID from startup.
If the cluster is already running and you want to apply the fix in-place (non-destructive, one container at a time):
```shell
CONTAINER=<wekacontainer-name>
NS=weka-operator-system   # adjust if different

# Force WriteResources to re-run for this container.
# grep -n emits a 1-based line number; the conditions list is a
# 0-based JSON array, hence the decrement below.
IDX=$(kubectl get wekacontainer $CONTAINER -n $NS \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\n"}{end}' \
  | grep -n ContainerResourcesWritten | cut -d: -f1)
IDX=$(( IDX - 1 ))

kubectl patch wekacontainer $CONTAINER -n $NS \
  --subresource=status --type=json \
  -p "[{\"op\":\"remove\",\"path\":\"/status/conditions/$IDX\"}]"

# Clear the existing machineIdentifier so the operator re-resolves it:
kubectl patch wekacontainer $CONTAINER -n $NS \
  --subresource=status --type=merge \
  -p '{"status":{"allocations":{"machineIdentifier":""}}}'
```

The operator will re-run WriteResources, write the K8s UID into resources.json, and the container will reconnect with the updated machine_identifier on its next restart.

Note for drive containers: Do not run `weka cluster container deactivate`/`remove` before the pod restart. Drive containers reconnect to the cluster via their signed NVMe drives; removing the container registration breaks that reconnection path. A plain pod restart (after resources.json is updated) is sufficient.
```shell
kubectl exec -n weka-operator-system <any-weka-backend-pod> -- \
  weka cluster container -J 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
ok = sum(1 for c in data
         if c.get('machine_identifier') == c.get('hw_machine_identifier')
         or c.get('machine_identifier','').count('-') == 4)
print(f'Total containers: {len(data)} (OK: {ok})')
for c in data:
    print(c.get('mode','?'), c.get('machine_identifier',''), c.get('hw_machine_identifier',''))
"
```

All backend containers should show machine_identifier as a UUID matching the corresponding K8s node UID. hw_machine_identifier will still show the SMBIOS value — that is expected and correct.
| Field | Value |
|---|---|
| Base image | quay.io/weka.io/weka-operator:v1.12.0 |
| Patched tag | docker.io/library/weka-operator:node-uid-fallback |
| Architecture | linux/amd64 |
| Binary patched | /weka-operator (replaced as a new OCI layer) |
| Manifest format | Docker manifest v2 (application/vnd.docker.distribution.manifest.v2+json) |