Summary
In staging, the connector and iroh-dns controllers in network-services-operator fail to engage for every project control plane whose apiserver does not advertise coordination.k8s.io/v1 in API discovery. Both controllers register a Watches(&coordinationv1.Lease{}, …), and mcController.Engage rejects the whole controller/cluster pair when any watch can't be wired. The result is that for affected clusters:
- Connector.status.conditions[Ready] is frozen at the value set the first time the controller successfully engaged (often the creation time, when the Lease hadn't been renewed yet, so Ready=False(ConnectorNotReady)).
- IrohDNSPublished similarly never updates.
- metadata.generation advances but status.conditions[*].observedGeneration stays behind.
- HTTP/Gateway/etc. controllers in the same operator binary continue to reconcile that same cluster normally (because they don't watch Lease).
User-visible effect: Datum Connect desktop agents heartbeat correctly every ~15s (Lease spec.renewTime is fresh) but the Datum Cloud UI reports the Connector as offline indefinitely.
There is also a secondary OOMKilled loop on controller-manager (every ~30 min per replica, see "Operator stability" below) which makes the problem worse — every restart re-attempts the failing Engage and re-grows whatever caches are leaking.
Reproduction
- In staging, install Datum Connect desktop and create a Connector in a project whose PCP apiserver only advertises networking.datumapis.com in /apis (i.e. doesn't advertise coordination.k8s.io).
- Verify the agent is patching spec.renewTime on the connector's Lease.
- Observe Connector.status.conditions[Ready] = False(ConnectorNotReady) indefinitely. observedGeneration lags metadata.generation.
Evidence — affected staging cluster matt-jenkinson-yz0y92
Connector datum-connect-ff98k (uid e5b7d11f-945d-4432-8b21-6f66d93f3e5a, project namespace default).
Connector status (via kubectl get connector ... -o jsonpath=...)
generation=2 resourceVersion=1138668215
Accepted=True(Accepted) obs=1 @ 1970-01-01T00:00:00Z
Ready=False(ConnectorNotReady) obs=1 @ 2026-03-07T15:20:14Z ← creation time
IrohDNSPublished=False(DeferredToOwner) obs=1 @ 2026-05-01T20:11:43Z ← last operator status write
Note observedGeneration=1 everywhere despite generation=2. The controller has never observed the current generation.
metadata.managedFields confirms no manager: manager (operator) status write has happened since 2026-05-01T20:11:43Z. Datum Desktop (the agent) continues to patch status.connectionDetails and renew the Lease.
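The generation lag noted above can be checked mechanically. The following is a minimal sketch, not the operator's code: the `Condition` struct and the `staleConditions` helper are hypothetical names introduced for illustration, and the values in `main` are copied from the status output in this report.

```go
package main

import "fmt"

// Condition is a hypothetical, trimmed-down view of a status condition.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	ObservedGeneration int64
}

// staleConditions returns the types of all conditions whose
// observedGeneration is behind the object's metadata.generation.
func staleConditions(generation int64, conds []Condition) []string {
	var stale []string
	for _, c := range conds {
		if c.ObservedGeneration < generation {
			stale = append(stale, c.Type)
		}
	}
	return stale
}

func main() {
	// Values taken from the Connector status shown in this report.
	conds := []Condition{
		{Type: "Accepted", Status: "True", Reason: "Accepted", ObservedGeneration: 1},
		{Type: "Ready", Status: "False", Reason: "ConnectorNotReady", ObservedGeneration: 1},
		{Type: "IrohDNSPublished", Status: "False", Reason: "DeferredToOwner", ObservedGeneration: 1},
	}
	// With generation=2, all three conditions are reported as stale.
	fmt.Println(staleConditions(2, conds))
}
```

A check like this could be run across all Connectors in a cluster to find objects that the controller has stopped observing.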
Lease (coordination.k8s.io/v1 default/datum-connect-ff98k)
Fetched via kubectl get --raw … (kubectl's normal discovery returns error: the server doesn't have a resource type "leases" — see "PCP discovery omission" below — but the resource is reachable directly):
metadata:
  ownerReferences:
  - apiVersion: networking.datumapis.com/v1alpha1
    kind: Connector
    name: datum-connect-ff98k
    uid: e5b7d11f-945d-4432-8b21-6f66d93f3e5a
    controller: true
    blockOwnerDeletion: true
spec:
  leaseDurationSeconds: 30
  renewTime: 2026-05-14T13:19:58.587566Z # less than 15s old at fetch
Lease is healthy: correct ownerRef, fresh renewTime, valid duration.
Operator-side logs
The connector and iroh-dns controllers have not reconciled matt-jenkinson-yz0y92 on either of the two most recent boots of the leader pod, even though the httpproxy and gateway controllers did reconcile that exact cluster during the same boots:
$ kubectl -n datum-system logs <leader-pod> --tail=200000 \
    | grep -E '"controller":"(connector|iroh-dns)"' | grep matt-jenkinson
$ kubectl -n datum-system logs <leader-pod> --previous --tail=200000 \
    | grep -E '"controller":"(connector|iroh-dns)"' | grep matt-jenkinson
(empty)
For the same boots, the httpproxy and gateway controllers reconcile the same cluster normally: they don't register a Watches(&coordinationv1.Lease{}, …), so their Engage isn't rejected.
Root-cause log line (the bug)
For matt-jenkinson-yz0y92 specifically, the error repeats every ~5 seconds in a retry loop on the leader pod:
2026-05-14T13:18:40Z ERROR get informer failed
{"cluster": "/matt-jenkinson-yz0y92", "source": "kind",
"error": "no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:40Z ERROR cluster-sharding-coordinator failed to engage
{"cluster": "/matt-jenkinson-yz0y92",
"error": "failed to watch for cluster \"/matt-jenkinson-yz0y92\":
no matches for kind \"Lease\" in version \"coordination.k8s.io/v1\""}
2026-05-14T13:18:45Z ERROR get informer failed { … same … }
2026-05-14T13:18:45Z ERROR cluster-sharding-coordinator failed to engage { … same … }
2026-05-14T13:18:50Z ERROR get informer failed { … same … }
2026-05-14T13:18:50Z ERROR cluster-sharding-coordinator failed to engage { … same … }
…
221 unique project clusters are in this state on staging (counted via grep "failed to engage" | grep -oE '"cluster": "/[^"]+"' | sort -u | wc -l across both boots of all three replicas). At one retry every ~5 seconds, each affected cluster produces ~720 error pairs (about 1,440 log lines) per hour, which likely also contributes to the OOMKill loop below.
A small sample of the 221 affected clusters (alphabetical prefix only):
/aaaaaa-d4qxk8
/asdf-6283wa
/asdasd
/e2e-shared-project-1776-{0bjzou,74u9w6,gp25oj,kaghqp,knraaw}
/e2e-shared-project-1777-x69inr
/e2e-test-dns-project-17-jawcmp
/hiyahya-4vrcph
/jacob-test-project-ybdzjo
/jbjjjhji-jm8yi3
/jose-{project-pt1wpv,sirugu}
/matt-jenkinson-yz0y92
/molla-{9rnjfm,otoke-4baody}
/new-project-6x6sz1
/osca-slo-test-r5h1r7
/personal-project-{2119b055,6527428a,759543f8,aeef86da,be933431, …many more}
/tdaly-v20250703-yxt7b6
/test-{delete-n4ccjo,elzw4o,fathom-project-twjb2l,project-{1-fscxij,6z9bj6,quota-8mmaoj,w4t25q},queue-{2-7v9vy1,hupwwe,project-56chve},quota-yu4nc4}
/test{1-clhv7m,2-gbod84,123-xvposa}
/testing-rkuax5
PCP discovery omission
The Datum project-control-plane apiserver only advertises networking.datumapis.com in discovery:
$ kubectl api-resources --api-group=coordination.k8s.io
(empty)
$ kubectl api-resources --api-group=networking.datumapis.com
connectoradvertisements   networking.datumapis.com/v1alpha1   ConnectorAdvertisement
connectorclasses          networking.datumapis.com/v1alpha1   ConnectorClass
connectors                networking.datumapis.com/v1alpha1   Connector
…
But the underlying Lease resource is reachable via direct path:
$ kubectl get --raw "/apis/coordination.k8s.io/v1/namespaces/default/leases/datum-connect-ff98k"
{ "kind": "Lease", "apiVersion": "coordination.k8s.io/v1", … }
So this is a discovery omission, not a real "Lease isn't there" condition. The Datum Connect desktop agent uses kube-rs, which builds the request URL for typed resources directly rather than going through discovery, and is therefore able to renew leases without issue.
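To illustrate why a client that skips discovery is unaffected, the sketch below builds the same path used in the kubectl get --raw call from the group, version, namespace, resource, and name alone. `namespacedResourcePath` is a hypothetical helper, not part of any client library; it only follows the standard URL layout for namespaced resources in a named API group.

```go
package main

import (
	"fmt"
	"path"
)

// namespacedResourcePath builds the REST path for a namespaced object in a
// named API group: /apis/<group>/<version>/namespaces/<ns>/<resource>/<name>.
func namespacedResourcePath(group, version, namespace, resource, name string) string {
	return path.Join("/apis", group, version, "namespaces", namespace, resource, name)
}

func main() {
	// Prints the same path that was fetched with kubectl get --raw above.
	fmt.Println(namespacedResourcePath(
		"coordination.k8s.io", "v1", "default", "leases", "datum-connect-ff98k"))
}
```

A client that resolves kinds through /apis discovery never gets as far as this path when the group is not advertised, which matches the "no matches for kind" error in the operator logs.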
Operator stability
Concurrent issue compounding the above: all three replicas of network-services-operator-controller-manager are OOMKilled every ~30 minutes (Exit 137, memory: 4Gi limit). Image ghcr.io/datum-cloud/network-services-operator:v0.0.0-main-20260512-182158. Restart counts on 2026-05-14T13:30Z:
network-services-operator-controller-manager-67ff7d4f66-cj8kd   36 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-j29gb   32 restarts in 17h
network-services-operator-controller-manager-67ff7d4f66-vj4bf   33 restarts in 17h
Probably a separate bug (memory growth scaling with number of project clusters / retried engagements). Even if the engage failure above were fixed, the OOM loop would still cause stalls.
Suggested directions
For the engage failure (the primary blocker):
- Make the Lease watch optional: catch the discovery error and either continue without it, or schedule a periodic re-attempt without failing the whole Engage for the controller. Today, every controller that watches Lease is "all or nothing" per cluster.
- Alternatively, on the PCP apiserver side, advertise coordination.k8s.io/v1 in /apis discovery, since the resource is already reachable and only hidden from clients that do API discovery (kubectl, controller-runtime).
- Either fix would also restore correct reconciles for the iroh-dns controller in these clusters.
For the OOMKill loop — needs a separate investigation; probably worth a pprof heap dump on a healthy-but-near-OOM pod.
Workaround
None on the agent side. Restarting the operator briefly resurrects reconciles for clusters that do engage, but the affected clusters never recover within a pod's lifetime.