CKS: NPE when trying to remove an external node from a CKS cluster #11581

@kiranchavala

Description

problem

CKS: NPE when trying to remove an external node from a CKS cluster

versions

ACS 4.20.x

The steps to reproduce the bug

  1. Register an external CKS template:

https://download.cloudstack.org/testing/custom_templates/ubuntu/22.04/22.04/cks-ubuntu-2204-kvm.qcow2.bz2

  2. Launch a CKS cluster

  3. Launch an Ubuntu VM with the template mentioned above

  4. Once the Ubuntu VM boots up, add the management server public key to it

  5. Add the Ubuntu VM as an external node to the CKS cluster

  1. The CKS cluster will be in the importing state

  2. The external node goes into the not-ready state because its disk runs out of space

Log in to the external node and check cloud-init-output.log:

unpacking registry.k8s.io/etcd:3.5.21-0 (sha256:d58c035df557080a27387d687092e3fc2b64c6d0e3162dc51453a115f847d121)...time="2025-09-04T09:32:05Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:edcdf51bd97dae2c7c6a75ab21cf445d5997888402357f5cc36e7582543431ac: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/45/fs/usr/local/bin/etcd-3.4.18: no space left on device: unknown" key="extract-939483347-ZepJ sha256:c6230a0bcc0db1264e316a45b18c0a8dfab3c4818a4245035770b4e58967e035"
time="2025-09-04T09:32:05Z" level=warning msg="extraction snapshot removal failed" error="write /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db: no space left on device: unknown" key="extract-939483347-ZepJ sha256:c6230a0bcc0db1264e316a45b18c0a8dfab3c4818a4245035770b4e58967e035"
ctr: failed to extract layer sha256:edcdf51bd97dae2c7c6a75ab21cf445d5997888402357f5cc36e7582543431ac: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/45/fs/usr/local/bin/etcd-3.4.18: no space left on device: unknown
ctr: failed to ingest "blobs/sha256/0038afa1c30b6e7c6ed64ebbb3593756f0a5328da72cf3304be62b27cb40139a": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/f2e067fbba1183a1b7465f8eb36511a50660708c165d166d396d63d79880c233: no space left on device: unknown
ctr: failed to ingest "blobs/sha256/0038afa1c30b6e7c6ed64ebbb3593756f0a5328da72cf3304be62b27cb40139a": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/f2e067fbba1183a1b7465f8eb36511a50660708c165d166d396d63d79880c233: no space left on device: unknown
Loading docker image /mnt/k8sdisk//docker/etcd:3.5.21-0.tar failed!
ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown
ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown
ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown

  1. Stop the CKS cluster in order to remove the external node

  2. The CKS cluster goes into the stopped state, but the job that adds the external node keeps running

[root@ref-trl-9383-k-Mol8-kiran-chavala-mgmt1 ~]# tail -f     /var/log/cloudstack/management/management-server.log |grep job-295
2025-09-04 10:07:11,732 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Checking ready nodes for the Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} with total 3 provisioned nodes
2025-09-04 10:07:12,529 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} has total 3 provisioned nodes while 2 ready now
2025-09-04 10:07:16,880 WARN  [o.a.c.f.j.i.AsyncJobMonitor] (Timer-0:[ctx-2caef2bd]) (logid:299149e7) Task (job-295) has been pending for 2208 seconds
2025-09-04 10:07:27,530 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Checking ready nodes for the Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} with total 3 provisioned nodes
2025-09-04 10:07:28,209 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} has total 3 provisioned nodes while 2 ready now
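
The log above shows the add-node job still polling for ready nodes long after the cluster was stopped (pending for 2208 seconds). As a rough illustration only, a wait loop of this kind could also check whether the cluster is still active and give up early if it is not. Everything in this sketch (`ReadyNodeWaitSketch`, `waitForReadyNodes`, the two suppliers) is a hypothetical name chosen for the example, not the existing `KubernetesClusterUtil` code:

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch; these names do not exist in CloudStack.
final class ReadyNodeWaitSketch {

    enum Result { READY, CANCELLED, TIMED_OUT }

    /**
     * Polls until all nodes are ready, the cluster stops being active,
     * or the timeout elapses, whichever happens first.
     */
    static Result waitForReadyNodes(BooleanSupplier allNodesReady,
                                    BooleanSupplier clusterStillActive,
                                    long timeoutMillis,
                                    long pollIntervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            // Stop waiting as soon as the cluster was stopped or destroyed.
            if (!clusterStillActive.getAsBoolean()) {
                return Result.CANCELLED;
            }
            if (allNodesReady.getAsBoolean()) {
                return Result.READY;
            }
            if (System.nanoTime() - deadline >= 0) {
                return Result.TIMED_OUT;
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and treat it as a cancellation.
                Thread.currentThread().interrupt();
                return Result.CANCELLED;
            }
        }
    }
}
```

Whether the real job should be cancelled on stop, or only time out sooner, is a design decision for the maintainers; the sketch just shows where such a check would sit.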

  1. Destroy the external node and start the CKS cluster

Exception observed

  1. The CKS cluster remains in the alert state

  2. Destroy the CKS cluster

Exception

2025-09-04 11:01:47,260 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-64:[ctx-bc12e86e, job-333]) (logid:2e067630) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.DeleteKubernetesClusterCmd java.lang.NullPointerException: Cannot invoke "com.cloud.vm.VMInstanceVO.getBackupOfferingId()" because "vm" is null
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.checkIfVmsAssociatedWithBackupOffering(KubernetesClusterManagerImpl.java:2010)
	at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterDestroyWorker.destroy(KubernetesClusterDestroyWorker.java:267)
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.destroyKubernetesCluster(KubernetesClusterManagerImpl.java:2384)
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.destroyKubernetesCluster(KubernetesClusterManagerImpl.java:2392)
	at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.deleteKubernetesCluster(KubernetesClusterManagerImpl.java:1969)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:109)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
	at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
	at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
	at jdk.proxy3/jdk.proxy3.$Proxy534.deleteKubernetesCluster(Unknown Source)
	at org.apache.cloudstack.api.command.user.kubernetes.cluster.DeleteKubernetesClusterCmd.execute(DeleteKubernetesClusterCmd.java:95)
	at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:173)
	at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:110)

...

What to do about it?

Workaround

Deploy the external Ubuntu node with a root disk larger than 20 GB.

The NPE still needs to be fixed, because it leaves the CKS cluster stuck in the alert state.
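
The stack trace only tells us that `vm` is null when `getBackupOfferingId()` is called at `KubernetesClusterManagerImpl.java:2010`; presumably the cluster still references the external node VM that was destroyed earlier. One possible shape for a fix is to skip VM IDs that no longer resolve to a VM. The sketch below uses hypothetical stand-in types (`BackupOfferingCheckSketch`, `Vm`, `anyVmHasBackupOffering`) and is not the actual CloudStack code:

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the CloudStack classes.
final class BackupOfferingCheckSketch {

    /** Minimal stand-in for a VM record; backupOfferingId is null when no offering is assigned. */
    static final class Vm {
        final long id;
        final Long backupOfferingId;

        Vm(long id, Long backupOfferingId) {
            this.id = id;
            this.backupOfferingId = backupOfferingId;
        }
    }

    /**
     * Returns true if any VM that still exists has a backup offering.
     * IDs that no longer resolve to a VM (for example, a node destroyed
     * outside the cluster workflow) are skipped instead of dereferenced.
     */
    static boolean anyVmHasBackupOffering(List<Long> clusterVmIds, Map<Long, Vm> vmsById) {
        for (Long vmId : clusterVmIds) {
            Vm vm = vmsById.get(vmId);
            if (vm == null) {
                // Stale mapping: the VM is gone, so it cannot block the destroy.
                continue;
            }
            if (vm.backupOfferingId != null) {
                return true;
            }
        }
        return false;
    }
}
```

With a guard like this, a stale entry for the destroyed node would no longer abort `DeleteKubernetesClusterCmd`. Cleaning up the stale cluster-to-VM mapping itself would be a separate question.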

Status: ready for Review