feat(multi-node): engine DaemonSet bundling + Caddy sticky LB + GPU/CPU pool placement #80
Open
aucahuasi wants to merge 18 commits into
…emove Dask Operator

Replace dask-cuda-worker Deployment (with K8s-level replicas scaling) with a DaemonSet matching the same pattern used by forge-etl-python and streamgl-gpu. This aligns the Helm chart with the docker-compose GPU architecture where all GPU services run one pod per node and scale workers internally via env vars (DASK_NUM_WORKERS, DCW_CUDA_VISIBLE_DEVICES) rather than K8s replicas.

The previous Deployment model (dask.workers replicas) contradicted the app-level multi-GPU configuration: multiple replicas on the same node would see the same CUDA_VISIBLE_DEVICES and compete for GPU memory. The DaemonSet model ensures one pod per GPU node with app-controlled worker processes and round-robin GPU assignment, consistent with forge-etl-python and streamgl-gpu.

Remove the Dask Kubernetes Operator integration (dask.operator toggle, DaskCluster CRD template, operator install docs, Argo CD app, ACR import, chart-bundler gathering, dev-compose setup). The operator's pod-level scaling model conflicts with Graphistry's app-level GPU management where services control their own GPU assignment and worker counts via environment variables.

Remove forgeWorkers Helm value: FORGE_NUM_WORKERS is now controlled exclusively via env vars (default: 4), matching how DASK_NUM_WORKERS and STREAMGL_NUM_WORKERS already work. This gives operators a single, consistent interface for GPU worker configuration across all services.

Templates changed:
- dask-cuda-worker-daemonset.yaml: rewritten as DaemonSet (was Deployment)
- dask-cluster.yml: deleted (DaskCluster operator mode)
- dask-scheduler-deployment.yaml: removed dask.operator guard
- forge-etl-python-daemonset.yaml: removed forgeWorkers Helm value reference

Values cleaned:
- Removed dask.workers, dask.operator, forgeWorkers from values.yaml
- Removed forgeWorkers from k3s example values

Docs/infra cleaned (23 files):
- Removed Dask Operator install/troubleshoot sections from all READMEs (k3s, gke, tanzu, cluster, troubleshooting.md)
- Removed dask-kubernetes-operator-docs.rst and index.rst reference
- Removed dask.workers and forgeWorkers from graphistry-helm-docs.rst
- Rewrote troubleshooting.md Dask architecture section to document DaemonSet model and app-level GPU/worker configuration
- Removed dask-operator-cd.yaml Argo CD app
- Removed dask operator from cd/repo/Chart.yaml, bundler.sh, helm-dev-setup-deploy.sh, ACR import script, docs Makefile

Tested: deployed on jorge7 GKE cluster, dask-cuda-worker DaemonSet pod starts correctly, detects 2 GPUs (CUDA_VISIBLE_DEVICES=0,1), forge-etl-python connects to dask-scheduler and initializes 4 workers with round-robin GPU assignment (0,1,0,1).
…streamgl-gpu router split

Make the chart correctly deploy Graphistry across multiple GPU nodes from a single Helm release. Three interlocking chart changes land the core of the multi-node story; everything else in this commit is complementary refactor cleanup.

1. streamgl-gpu split (router Deployment + spawner DaemonSet)
-------------------------------------------------------------

New `streamgl-gpu-deployment.yaml` deploys a cluster-wide router (replicas=1) that holds the session-to-worker registry in memory and proxies `/streamgl/*` WebSocket traffic to whichever worker owns a session. New `streamgl-gpu-networkpolicy.yaml` restricts the router's `/internal/*` endpoints (register, deregister, worker-event) to in-cluster callers as defense-in-depth on top of the nginx-level denial of `/streamgl/internal/`.

Workers stay per-node as a DaemonSet, but the in-pod spawner now registers each worker with the remote router over HTTP rather than relying on PM2's localhost event bus. That removes the PM2 IPC limitation that blocked multi-node in every prior design. Workers on bob are now discoverable and schedulable from jorge7 (and vice versa) via `/internal/register`, and sessions fan out across all GPU nodes.

New Helm value `StreamglGpuWorkerResources` separates the DaemonSet (GPU) resource block from `StreamglGpuResources` (router, no GPU), since the two workloads have very different resource profiles. Worker count and GPU visibility stay on env vars (`STREAMGL_NUM_WORKERS`, `CUDA_VISIBLE_DEVICES`) consistent with the `DASK_NUM_WORKERS` and `FORGE_NUM_WORKERS` pattern.

Corresponds to the app-layer PR graphistry/graphistry#3087.

2. nginx: Deployment -> DaemonSet
---------------------------------

nginx was a single-replica Deployment, which on a multi-node cluster funnelled all external traffic through one node and every upload through one `forge-etl-python` pod via the load-balanced Service. That pinning broke the upload path on NFS because nginx on node A wrote the request body to the RWX PVC and `forge-etl-python` on node B read it as 0 bytes until node B's NFS client attribute cache refreshed, well past `FORGE_MAX_FILE_WAIT_MS`. Uploads 500'd with "Waited longer than 10000 ms for from_path ... to be populated".

Running nginx as a DaemonSet gives each GPU node its own ingress pod. caddy still reverse-proxies to the `nginx` Service with the default Cluster policy, so kube-proxy distributes external traffic across every node's nginx replica. Each node's nginx then hands off to its local `forge-etl-python` via the Local routing below, so writer and reader share a single kubelet's NFS client. Load distribution is automatic, matches the DaemonSet count, and requires no `replicas:` tuning or HPA. The `rollingUpdate`/`maxSurge`/`Recreate` branch is gone (DaemonSets only support `RollingUpdate`/`OnDelete`); rolling updates now proceed one node at a time with `maxUnavailable: 1`.

3. forge-etl-python Service: internalTrafficPolicy: Local
---------------------------------------------------------

Pairs with the nginx DaemonSet above. Every Service call to `forge-etl-python` routes to the DaemonSet pod on the caller's node, so nginx's write to `uploads-files` PVC is read back through the same kubelet's NFS client. The NFS cross-node coherence race goes away.

Trade-off: nginx on node A cannot fall back to `forge-etl-python` on node B if node A's pod is unhealthy; DaemonSet per-node liveness probes cover that failure mode. Single-node deployments are unaffected (there is only one endpoint).

This is the same pattern already applied to `otel-collector` for the same data-locality reason (upstream OpenTelemetry Operator guidance).
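
For illustration, a minimal sketch of a Service carrying the node-local routing described in item 3. Only the Service name and `internalTrafficPolicy: Local` come from the description above; the selector label and port number are assumptions and may not match the chart.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: forge-etl-python
spec:
  type: ClusterIP
  # In-cluster callers are routed only to endpoints on their own node.
  internalTrafficPolicy: Local
  selector:
    # Assumed label; the chart's real selector may differ.
    io.kompose.service: forge-etl-python
  ports:
    - name: http
      port: 8080        # assumed port, for illustration only
      targetPort: 8080
```

With `Local`, a caller on a node that has no ready local endpoint gets no backend at all, which is the trade-off noted above.
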
---

Validated end-to-end on a 2-node k3s cluster (jorge7 + bob) with NFS RWX: 1.77 GB of concurrent uploads across both nginx replicas with zero errors; 6 active WebSocket sessions distributed 3/3 across GPU nodes; cross-tab node-move sync working across the streamgl-gpu router; no retry amplification under sustained load.

Complementary changes in this commit (details in CHANGELOG.md):

- Storage-agnostic chart: PVC templates unified on `global.storage.accessMode` + `global.storageClassNameOverride`. Removed `ENABLE_CLUSTER_MODE`, `IS_FOLLOWER`, `multiNode`, `clusterVolume`, `provisioner`, `REDIS_URL_NEXUS_FEP`, `longhornDashboard`, and the hardcoded `datamount-longhorn` / `postgres-longhorn` / `retain-sc-cluster` StorageClass names. Longhorn becomes just another backend operators can point a SC at.
- Dedicated `retain-sc-postgres` StorageClass for the postgres-cluster chart, per Crunchy PGO's documented pattern.
- `LEADER_OTEL_EXPORTER_OTLP_ENDPOINT` removed; `otel-collector` Service flipped to `ClusterIP` + `internalTrafficPolicy: Local`; Redis Service flipped to `ClusterIP` (was LoadBalancer).
- `charts/values-overrides/examples/cluster/` replaced: legacy leader/follower multi-namespace example deleted, new single-namespace multi-node guide added with NFS as the documented default plus a `retain-sc-nfs.yaml` manifest for operators who prefer to apply the Retain StorageClass separately.
- k3s README: new "Postgres StorageClass" and "Graphistry StorageClass" subsections, two-SC install flow, same-namespace invariant note, Cleanup filter extended to cover both charts.
- Fixes: `dcgm-exporter` DaemonSet and `netshoot`/`whoami` http-tools gained the missing `nodeSelector` blocks (were the only templates that did not honour `global.nodeSelector`).
… docs

Follow-up to 511876d, picking up cleanups and operator-facing docs surfaced during multi-node validation on the k3s cluster.

- templates/dask: remove vestigial dask-cuda-worker Service (dead since 66f1c63 "DKO HOTFIX" 2023-02-14); Dask workers register by pod IP + ephemeral port with the scheduler's in-memory registry, never through a Service. Inline comment documents why Dask bypasses Services by design, and why the dashboard stays bound to localhost.
- templates/forge-etl: init container wait switched from "service dask-cuda-worker" to "pod -lio.kompose.service=..." since the Service is gone; pod-label readiness is strictly stronger than Service-existence as an init gate (a Service can render with zero ready backends).
- templates/streamgl NetworkPolicy: inline DEV-MODE GAP and CNI REQUIREMENT caveats. Policy is gated on global.devMode == false and is inert at runtime on non-enforcing CNIs (vanilla flannel, stock AWS VPC CNI without the add-on); nginx L7 deny is the only remaining defense on those clusters.
- examples/cluster/README: operator-facing NetworkPolicy CNI section listing enforcers (kube-router in k3s >=1.25, Calico, Cilium, Antrea, Weave, GKE with --enable-network-policy) vs non-enforcers, with a pointer to test_1_networkpolicy.md for a 5-minute verification.
- examples/troubleshooting: document HTTP/1.1 6-connections-per-origin browser limit that hangs the 7th viz tab on plain-HTTP localhost; fix is HTTP/2 multiplex via Caddy "tls internal" (dev), mkcert for a named host, or automatic ACME (prod). Diagnosed by Manfred on hub.
- docs/configure-storageclass: new "StorageClass defaults per backend" table (9 backends x default volumeBindingMode x default reclaimPolicy x override-needed column) so operators know up front which knobs must be set at SC creation rather than discovering it from a pending PVC.
…vices into single bundled pod per node
Replaces the per-service Deployments/DaemonSets (nginx, forge-etl-python,
dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) with a
single `engine` DaemonSet pod per GPU node that colocates all of them as
sibling containers. Referred to internally as "fatpod" during development;
final name is `engine` for both single-node and multi-node deployments at
analytics+ tier.
Why this design over the previous per-service / streamgl-gpu router split
- Latency: every intra-stack HTTP hop (viz <-> gpu, fep <-> dask,
nginx <-> any) is now localhost. The streaming hot path no longer
crosses the CNI overlay on viz frames or ETL submissions.
- Data locality: dask-cuda-worker, fep, and streamgl-gpu run on the
same physical host as their consumers; partitioned cuDF/dask_cudf
data lives next to compute. Pairs with the local-worker affinity
hint added to fep ETL submissions in graphistry_master.
- Resilience: equivalent to previous designs. Browser sessions are
pinned across engine pods via Caddy's HMAC-signed `graphistry_sticky`
cookie; node loss has the same blast radius as before.
- Simplicity: streamgl-gpu's PM2 localhost IPC continues to work
unchanged because gpu-router and PM2-forked gpu-worker children
are intra-pod. No router/spawner HTTP split, no app-layer changes.
New templates
-------------
templates/engine/engine-daemonset.yaml
  pod definition, tier-aware: 3 containers at `analytics` (nginx + fep + dask), 6 at `viz`/`full` (adds streamgl-{viz,sessions,gpu})
templates/engine/engine-nginx-cfg.yml
  supplementary :8080 Host-header dispatcher inside the pod, loaded as conf.d alongside the production default.conf
templates/engine/engine-service.yaml
  engine-headless for Caddy upstreams (gated tier.analytics) plus 4 shim Services (streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-python, gated tier.viz) so the production nginx FQDN-suffixed hostnames still resolve
Removed templates (subsumed by engine)
--------------------------------------
templates/nginx/nginx-deployment.yaml
templates/nginx/nginx-log-exporter-configmap.yaml (sidecar dropped)
templates/forge-etl/forge-etl-python-daemonset.yaml
templates/dask/dask-cuda-worker-daemonset.yaml
templates/streamgl/streamgl-gpu-daemonset.yaml
templates/streamgl/streamgl-gpu-deployment.yaml (router-split design)
templates/streamgl/streamgl-gpu-networkpolicy.yaml (router-split design)
templates/streamgl/streamgl-sessions-deployment.yaml
templates/streamgl/streamgl-viz-deployment.yaml
Caddy as the L7 layer
---------------------
Caddy now load-balances browser sessions across engine-headless with
cookie stickiness (`graphistry_sticky`, HMAC-signed on
`engine.cookieSecret`) using Caddy's `dynamic a` resolver against the
headless Service for live pod-IP refresh.
New operator knobs in values.yaml:
caddy.enabled
  toggle Caddy + caddy-ingress entirely. `false` lets operators front engine-headless with their own ingress controller (Pattern B). The render gate on caddy-cfg.yml, caddy-deployment.yaml, and caddy-ingress.yml is now `tier.analytics AND caddy.enabled`.
caddy.tls.mode
  `external` | `self` | `off`. external trusts XFP=https from private_ranges so the sticky cookie keeps Secure+SameSite=None. self terminates from existingSecret or ACME. off is plain HTTP only.
caddy.lb.fallback
  first-time-assignment policy when no cookie is set (default round_robin).
caddy.service
  type / loadBalancerIP / nodePort / nodePortHttps / annotations / externalTrafficPolicy. Cloud, Tanzu, MetalLB, NodePort all selectable from values.
caddy.upstreamImage
  escape hatch to use upstream `caddy:2.10-alpine` directly while the bundled wrapper image lags upstream PR caddyserver/caddy#6115. Liveness switches to httpGet when set (no curl in the official image).
Caddy Pod template gains a `checksum/config` annotation that hashes the
rendered Caddyfile ConfigMap into the Pod spec, so any change to TLS
mode / lb.fallback / cookieSecret triggers an automatic rollout on
`helm upgrade`. Without this, ConfigMap-only changes left Caddy with
stale parsed config in memory until something forced a pod bounce.
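
A minimal sketch of the `checksum/config` idiom described above, using the standard Helm pattern of hashing a rendered template into a pod annotation. The template path `/caddy/caddy-cfg.yml` is an assumption inferred from the file names mentioned in this commit; the chart's actual path may differ.

```yaml
# Fragment of a Deployment's pod template (Helm template syntax).
spec:
  template:
    metadata:
      annotations:
        # Hash of the rendered ConfigMap. Any change to the Caddyfile alters
        # the pod spec, so `helm upgrade` triggers a rollout.
        checksum/config: {{ include (print $.Template.BasePath "/caddy/caddy-cfg.yml") . | sha256sum }}
```
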
PVC tier-gating dropped (correctness fix)
-----------------------------------------
`gak-private`, `gak-public`, and `uploads-files` PersistentVolumeClaims
are no longer gated on tier.full / tier.analytics. With Retain reclaim
policy, gating made tier downgrades leave PVs Released with stale
claimRef; later upgrades created new PVCs with fresh UIDs that no
longer matched, and the new PVCs sat Pending indefinitely. Consuming
Deployments stay tier-gated; only the storage object is unconditional.
Other cleanup
-------------
- Dask Kubernetes Operator removed: dask-cluster.yml CRD,
`dask.operator` toggle, operator sections from all platform READMEs
and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries,
ACR import script, dev-compose setup script.
- Cluster mode (ENABLE_CLUSTER_MODE / IS_FOLLOWER) wiring stripped
from all Deployment/DaemonSet templates. Legacy
cluster/{cluster-storage,follower,global-common,leader}.yaml
replaced by a single-namespace multi-node guide and
cluster/retain-sc-nfs.yaml.
- Redis Service: LoadBalancer -> ClusterIP (latent security default).
- otel-collector Service: LoadBalancer -> ClusterIP +
internalTrafficPolicy: Local (matches DaemonSet collector model;
opentelemetry-operator#1401).
- dcgm-exporter, netshoot, whoami: added missing `nodeSelector`
blocks honouring global.nodeSelector.
values.yaml additions
---------------------
- `engine` block: cookieSecret, uploadsScratchSizeLimit.
- `caddy` sub-blocks: enabled, tls{mode, existingSecret, acmeEmail,
domains}, lb{fallback}, service{type, loadBalancerIP, nodePort,
nodePortHttps, annotations, externalTrafficPolicy}, upstreamImage.
- `global.storage.accessMode` (default ReadWriteOnce).
- TOPOLOGY note covering Pattern A (Caddy as L7) vs Pattern B
(operator's ingress controller as L7) with explicit operator
responsibilities under each pattern.
k3s example values
------------------
- Exercises the new knobs end-to-end: caddy.enabled: true,
caddy.tls.mode: "off", caddy.lb.fallback: round_robin,
caddy.service.type: ClusterIP, caddy.upstreamImage:
caddy:2.10-alpine, engine.cookieSecret,
engine.uploadsScratchSizeLimit. Tier set to `viz`.
- PV-cleanup runbook in the README reorganized into three explicit
options (Graphistry-only / postgres-only / both) to prevent
operators accidentally wiping the wrong chart's data.
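
As a rough illustration, the knobs listed above could appear in a values file as follows. The key names come from this commit message; the `cookieSecret` and `uploadsScratchSizeLimit` values are placeholders, not the values used in the k3s example.

```yaml
global:
  tier: viz
caddy:
  enabled: true
  tls:
    mode: "off"              # quoted so YAML does not parse it as a boolean
  lb:
    fallback: round_robin
  service:
    type: ClusterIP
  upstreamImage: caddy:2.10-alpine
engine:
  cookieSecret: "replace-with-a-random-secret"   # placeholder
  uploadsScratchSizeLimit: 10Gi                  # placeholder size
```
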
Verification
------------
Two-node test bed: node1 (k3s server, NFS server colocated) and
node2 (k3s agent, NFS client). RWX storage backed by NFS, tier
`viz`, two engine pods (one per node) load-balanced by Caddy
(ClusterIP) fronted by k3s Traefik on node1.
ETL 5-call sequence -- analytics tier
/readarrow, /upload, /preshape, /properties, /download all returned
200 on both nodes. Local-worker affinity hint exercised: each fep
submission's persist/compute kwargs named the dask-cuda-worker on
its own node (verified by HOSTNAME-prefix match against scheduler
worker info). The hint degraded to a no-op when a fep landed on a node
without a registered local worker.
Session test -- viz tier, multi-node
Browsers opened sessions against the public Caddy endpoint on
node1. Each browser pinned to one engine pod via the
`graphistry_sticky` cookie across page reloads and WebSocket
reconnects; sessions on node1 were undisturbed by traffic
landing on node2, and vice versa. Verified sticky distribution by
tab count vs cookie value. Long-idle sessions (60s+ inactivity)
survived without disconnect.
Tier transitions
Upgraded `viz -> analytics -> viz` against a release with
non-empty PVCs. gak/uploads PVCs survived the round trip and
re-bound to their existing PVs (PV/PVC UID stable, claimRef
intact). Pre-fix this would have left the new PVCs Pending.
Caddyfile config rollout
Edited only `caddy.lb.fallback` in values and re-ran
`helm upgrade`. Caddy Pod rolled automatically because the
rendered Caddyfile checksum changed; ConfigMap-only edits no
longer require manual pod bounce.
…erations) + dask local-affinity wiring + cluster README rewrite
Three coherent changes shipping together to support heterogeneous K8s
clusters with mixed GPU/CPU node pools and dedicated-tenant taints.
1. Multi-node placement infrastructure
--------------------------------------
* charts/graphistry-helm/values.yaml:
- New `engine.nodeSelector` (default `{}`). When set, the engine DaemonSet
uses this selector instead of `global.nodeSelector` (falls back to
global when empty). Lets operators target the engine pod at GPU-labelled
nodes while keeping the chart's CPU-side workloads (caddy, nexus, redis,
dask-scheduler, notebook, pivot, gak-{private,public}) on a separate
pool.
- New `global.tolerations` (default `[]`). Applied to every chart-
rendered workload. Lets operators opt the chart's pods into nodes
carrying operator-defined taints: NVIDIA GPU Operator's
`nvidia.com/gpu=true:NoSchedule`, GKE/EKS managed GPU node-pool taints,
dedicated-tenant taints (`dedicated=foo:NoSchedule`), etc. Empty `[]`
keeps current behaviour byte-identical.
- Inline docstrings explain the recommended `graphistry.io/role=gpu/cpu`
labelling convention, the GPU-Operator and managed-pool taint patterns,
and why tolerations are permissive (not directive) so adding a GPU-pool
toleration to global.tolerations is harmless for non-GPU workloads.
* templates/{caddy,dask,engine,graph-app-kit,http-tools,nexus,notebook,
pivot,redis}/*-deployment.yaml + engine-daemonset.yaml:
Eleven templates now emit a `tolerations:` block when
`.Values.global.tolerations` is non-empty (standard `{{- with ... }}`
guard pattern). engine-daemonset.yaml additionally rewires nodeSelector
to `(engine.nodeSelector | default global.nodeSelector)` so engine can
override placement independently.
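
A values sketch of the mixed-pool placement described in section 1, assuming nodes are already labelled with the `graphistry.io/role` convention and GPU nodes carry the `nvidia.com/gpu=true:NoSchedule` taint. Label values and the choice of taint are illustrative.

```yaml
global:
  # Default placement for chart workloads that do not override it.
  nodeSelector:
    graphistry.io/role: cpu
  # Permissive: lets pods schedule onto tainted GPU nodes, but does not
  # force them there. Placement is still decided by nodeSelector.
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
engine:
  # Overrides global.nodeSelector for the engine DaemonSet only.
  nodeSelector:
    graphistry.io/role: gpu
```
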
2. Cluster deployment guide rewrite
-----------------------------------
* charts/values-overrides/examples/cluster/README.md (+943 / -130):
Near-rewrite of the multi-node deployment guide. New sections:
- "Cluster architecture: engine DaemonSet and Caddy" -- bird's-eye view
of the bundled engine pod, Caddy sticky LB across pods, why both
pieces are necessary together.
- `engine.cookieSecret` production hardening + rotation behaviour
(existing browser sessions are dropped on secret change).
- "Node placement: GPU vs CPU pools, dedicated tenants" -- the two
scheduling axes (nodeSelector for where-allowed, tolerations for
what-taints-accepted), the chart's scheduling values, recommended
labelling convention.
- Worked walkthrough: 5-node A100 cluster with a dedicated LLM tenant
on node 5 (label + taint + toleration), step-by-step from cluster-
admin labelling through values file authoring through the LLM team
deploying their own workload separately.
- Cost-optimised variant (separate GPU/CPU pools).
- Three approaches for pinning CPU singletons to one specific node
(label-one-node / pin-by-hostname / dedicated-pin-label), with
tradeoffs and gotchas, plus a list of approaches not recommended.
- Defensive matrix: what each safety net catches (selector-only vs
selector+taint vs taint-without-toleration).
- Common per-distro pre-baked taints reference.
- Verification commands matrix to confirm placement after install.
The existing NFS / `global.storage.accessMode: ReadWriteMany` Step 4
content is preserved.
3. Dask local-affinity wiring
-----------------------------
* templates/engine/engine-daemonset.yaml (forge-etl-python container env
block): set `GRAPHISTRY_DASK_LOCAL_AFFINITY=1`. Wires the chart-side
half of the dask local-worker affinity feature shipping in
graphistry_master PR #3097. fep's `server/util/dask_affinity.py:
persist_kwargs()` reads this env var to gate per-submission
`scheduler_info()` round-trips and pin `client.persist`/`compute` calls
to the co-located dask-cuda-worker on the same node (~2x ETL speedup
empirically validated at 4 MiB / 92 MiB / 512 MiB upload sizes). Hint
is soft -- `allow_other_workers=True` is always set fep-side, so the
scheduler still falls back to a remote worker when the local one is
busy or down. Hardcoded "1" matches the precedent of other engine-
DaemonSet-required constants like `OTEL_SERVICE_NAME` and `PORT`. If
the env var is missing, fep's helper short-circuits and the deployment
runs byte-identically to pre-change.
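
The chart-side half of this wiring amounts to a single container env entry. A sketch of that fragment, with the surrounding container spec omitted:

```yaml
# Fragment of the forge-etl-python container in the engine DaemonSet.
env:
  - name: GRAPHISTRY_DASK_LOCAL_AFFINITY
    value: "1"   # env values must be strings, hence the quotes
```
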
…loss on mixed pools
Bug
---
The graph-app-kit-{public,private} Deployments and the Jupyter notebook
Deployment hardcoded `nodeSelector: global.nodeSelector` with no override
hook. On a mixed GPU/CPU node-pool deployment -- the recommended pattern
where global.nodeSelector pins the platform-tier majority (caddy, nexus,
redis, dask-scheduler) to a CPU pool -- this routed the GPU-bound gak
Streamlit views and the notebook RAPIDS kernel to CPU nodes. Pods came up
green but lost CUDA capability:
- graph-app-kit (both public + private) runs the same RAPIDS-on-CUDA
runtime as forge-etl-python (cudf, pygraphistry); the Streamlit
dashboards drive graph layout against the GPU stack. compose pins
both `runtime: nvidia` for this reason.
- notebook ships a `Python 3.8 (RAPIDS)` ipykernel (cudf/cuml/cugraph)
that fails to load CUDA when the kernel container has no GPU
access. compose pins it `runtime: nvidia`.
The hardcoded selector blocked any per-workload override, so operators
running mixed pools could not pin gak/notebook to GPU nodes without
forking the chart.
Fix
---
Three templates now use the override-with-fallback Helm idiom:
templates/graph-app-kit/graph-app-kit-public-deployment.yaml
templates/graph-app-kit/graph-app-kit-private-deployment.yaml
templates/notebook/notebook-deployment.yaml
nodeSelector: {{- (.Values.X.nodeSelector | default .Values.global.nodeSelector)
| toYaml | nindent 8 }}
New Helm values (default `{}`):
- graphAppKit.nodeSelector -- applies to both gak Deployments
- notebook.nodeSelector -- applies to the notebook Deployment
Empty default falls back to global.nodeSelector, so single-pool
deployments are byte-identical to pre-fix. Mixed-pool operators set
these explicitly to a GPU-labelled pool; the values.yaml docstrings
spell out the pattern with a `graphistry.io/role: gpu` example.
Pivot is intentionally not changed: it is a Node.js HTTP-only service
that embeds the streamgl-viz iframe in browser-side pages. No GPU
bindings server-side, so it stays correctly tied to global.nodeSelector.
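
A values sketch of the mixed-pool override this fix enables. The `graphistry.io/role` label follows the convention referenced in the values.yaml docstrings; the `cpu` value on the global selector is an assumption for illustration.

```yaml
global:
  nodeSelector:
    graphistry.io/role: cpu   # platform-tier majority stays on the CPU pool
graphAppKit:
  nodeSelector:
    graphistry.io/role: gpu   # applies to both gak Deployments
notebook:
  nodeSelector:
    graphistry.io/role: gpu   # RAPIDS kernel needs GPU access
```

Leaving either override at `{}` falls back to `global.nodeSelector`.
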
Docs
----
- cluster/README.md: corrected the GPU-vs-CPU classification (notebook
and gak were previously listed as CPU-only, which was wrong -- both
are GPU-bound). Split the topology diagram into GPU-bound vs
CPU-only boxes, added a per-workload breakdown table, and added a
defensive matrix row describing the silent-runtime-regression
failure mode this fix addresses.
- troubleshooting.md: same misclassification fix in the multi-node
placement guidance.
- k3s/k3s_example_values.yaml: illustrative usage of the two new
nodeSelector overrides for the mixed-pool example.
- CHANGELOG.md: bug-fix entries for both new values.
Telemetry now ships as a properly-structured Helm subchart consumed via
dependencies + condition: global.ENABLE_OPEN_TELEMETRY, with a top-level
`telemetry:` block in the parent's values.yaml for subchart-specific knobs
and `global.*` reserved for cross-cutting values (OTLP endpoint, instance,
image-pull, scheduling, storage class). The shared `_helpers.tpl` is
extracted into a new `graphistry-common` library subchart so the telemetry
subchart can reuse `graphistry.tier.*` gating via a dependency entry.
Multi-node Prometheus scraping is fixed. The dcgm-exporter and
node-exporter jobs switched from Service-VIP `static_configs` to
`kubernetes_sd_configs` with `role: pod` and relabel rules that set the
`node` label from `pod_node_name` and `__address__` from `pod_ip`. With
the old static-target config, kube-proxy load-balanced to one DaemonSet
pod per scrape and silently dropped the other nodes' GPU/host metrics.
Validated on a 2-node 3-GPU k3s cluster: pre-fix 1 GPU visible, post-fix
all 3 with stable per-`node` labels. New `templates/prometheus-rbac.yaml`
(ServiceAccount + namespace-scoped Role + RoleBinding for read-only
`pods`/`services`/`endpoints`) covers apiserver discovery; the prometheus
pod sets `automountServiceAccountToken: true` explicitly so locked-down
clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) that
flip the default to false don't silently break discovery.
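
A sketch of a pod-discovery scrape job of the kind described above. The `__meta_kubernetes_pod_*` labels are standard Prometheus Kubernetes service-discovery metadata; the pod label used in the `keep` rule and the metrics port 9400 are assumptions, not taken from the chart.

```yaml
scrape_configs:
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only the exporter pods (label name and value are assumed).
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter
        action: keep
      # Stable per-node label taken from the node the pod runs on.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
      # Scrape each pod by IP instead of going through the Service VIP.
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: "${1}:9400"   # assumed metrics port
```

Because every DaemonSet pod becomes its own target, each node's metrics are scraped on every interval rather than whichever pod kube-proxy happens to pick.
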
Telemetry-stack persistence is normalized. Prometheus is now a
single-replica Deployment with `Recreate` strategy + RWO PVC (was
per-node DaemonSet, no global view, no retention) with
`telemetry.prometheus.retention=15d` and a new
`telemetry.prometheus.enableAdminAPI` knob (default false; mirrors
kube-prometheus-stack's default; flip in dev to drop stale series after
scrape-config refactors). Grafana and Jaeger gain
`persistence.{enabled,size,storageClassName}` blocks matching prometheus.
The otel-collector cloud and self-hosted ConfigMaps are merged into one
templated ConfigMap with cloud-mode credentials sourced via
`secretKeyRef`. New `useExternal` + `externalEndpoint` knobs on
`telemetry.dcgmExporter` and `telemetry.nodeExporter` let operators
defer to NVIDIA GPU Operator's exporter (mandatory on GKE
Container-Optimized OS, where the bundled image fails) or
kube-prometheus-stack's node-exporter, avoiding double DaemonSets.
`checksum/config` annotations on otel-collector / prometheus / grafana /
jaeger pod templates make `helm upgrade` actually roll workloads when
their ConfigMaps change.
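
A values sketch pulling together the telemetry knobs named above. The nesting of `grafana` under `telemetry`, the persistence size, and the `externalEndpoint` address are assumptions for illustration; only the key names mentioned in this commit message are taken from the source.

```yaml
global:
  ENABLE_OPEN_TELEMETRY: true
telemetry:
  prometheus:
    retention: 15d
    enableAdminAPI: false
  grafana:
    persistence:
      enabled: true
      size: 5Gi               # placeholder
      storageClassName: ""    # placeholder
  dcgmExporter:
    # Defer to an exporter that already runs in the cluster.
    useExternal: true
    externalEndpoint: "dcgm-exporter.example.svc:9400"   # hypothetical address
```
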
Caddy stability: two crash modes fixed. The single-line
`handle @grafana { reverse_proxy ... }` blocks tripped Caddy v2's parser
("Unexpected next token after '{' on same line") -- expanded to
multi-line form. The `/caddy/health/` endpoint was being routed by the
catch-all `handle` to `engine-headless` (no endpoints) after `respond`
wrote 200, because `respond` is non-terminal in Caddy v2; wrapped in a
terminal `handle /caddy/health/ { respond 200 ... }` to break the 30s
SIGTERM-then-restart liveness loop. Two new named templates in
`templates/caddy/_helpers.tpl` (`graphistry.caddy.healthHandle`,
`graphistry.caddy.telemetryHandles`) remove the 3-4x duplication of
identical handle blocks across the `tls.mode = external | self | off`
branches in `caddy-cfg.yml`.
Per-component telemetry Ingresses (grafana/jaeger/prometheus) are removed
in favour of Caddy path routes on the parent chart's main Ingress;
NOTES.txt updates "Ingress paths" wording to "Caddy paths". The gke
README's "Fix DCGM GPU Metrics on GKE" section is rewritten to use
`telemetry.dcgmExporter.useExternal: true` instead of an out-of-band
`kubectl patch`. The k3s example values exercise the new caddy/engine
knobs and enable telemetry self-hosted by default.
Adds a platform-tier-only nginx Deployment + Service (`nexus-proxy`) that
fronts nexus and provides the v1-to-v2 endpoint rewrites and deprecated-
endpoint 410 shims that live in the `graphistry/nginx` image. Renders only
when `global.tier == "platform"` (exact match, not `>=`); at analytics+
the engine DaemonSet's nginx container plays the same role with intra-pod
localhost dispatch to streamgl/forge backends, so the standalone
Deployment would be redundant at higher tiers.
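
A sketch of an exact-match tier gate in Helm template syntax. This is an assumption about how such a guard could be written; the chart may use its `graphistry.tier.*` helpers instead.

```yaml
# Render only when the tier is exactly "platform", not "platform or higher".
{{- if eq .Values.global.tier "platform" }}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nexus-proxy
# spec omitted in this sketch
{{- end }}
```
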
The platform tier now ships a deployable slice of postgres + nexus +
nexus-proxy with no transitive dependencies on analytics-tier services.
The Deployment's only init container waits for the nexus Service alone
(not redis / dask-scheduler / streamgl) so the slice is genuinely
self-contained. Reuses existing values and PVCs: `NginxResources`,
`global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}`,
postgres secret refs, and the `local-media-mount` + `data-mount` PVCs.
The image's content-directory expectations for `/streamgl`, `/pivot`,
and `/upload` paths are satisfied by emptyDir mounts -- those routes
return 404 at platform tier, which is the correct behaviour when the
backends don't exist.
End-to-end validation at `tier=platform`: the deprecation shims (`/etl`,
`/api/check`, `/api/encrypt`, `/api/decrypt`, `/api/v1/etl/vgraph/*`)
return 410 Gone with the documented upgrade message; live `/api/v1/*`
routes (`/datasets/`, `/files/`, `/organization/`, `/team/`,
`/named-endpoint/`) return 200 with real data after a Django session
login at `/accounts/login/`; v2 ETL routes (`/api/v2/etl/vgraph/`,
`/api/v2/etl/datasets/<id>/{gfql,kepler}/...`) route through nginx as
expected (upstream errors at platform tier because GPU backends are
absent by design). Switching back to `tier=analytics` correctly removes
the standalone nexus-proxy in favour of the engine DaemonSet's nginx
container.
NOTES.txt is now tier-aware: at platform it lists `nexus-proxy` in
DEPLOYED SERVICES, prints both in-cluster URLs and a
`port-forward --address 0.0.0.0` recipe for browser access, and skips
the ACCESS and TELEMETRY blocks (Caddy and the telemetry stack don't
render at this tier). At analytics+ both blocks render as before. The
tier descriptions are also updated to reflect the engine-DaemonSet
collapse from 0.4.4 -- the single DaemonSet with tier-conditional
sibling containers replaces the per-service Deployments listed in the
old text.
Documentation updated to match: the k3s README's Deployment Tiers
section now documents `nexus-proxy` in the platform-tier row and
rewrites analytics+ rows around the engine DaemonSet shape; the
"Services per tier" table replaces standalone
`nginx`/`forge-etl-python`/`dask-cuda-worker`/`streamgl-*` rows with
`engine` DaemonSet container rows showing the tier-conditional sibling
set. troubleshooting.md Section 9 "Accessing Graphistry" gains a new
"Tier matters for the access path" subsection covering the platform-
tier path (no Caddy/Ingress; `nexus-proxy` is the entry point;
port-forward + curl smoke tests for v1/v2 routing); the existing
Caddy/Ingress flow is scoped as "Verification (analytics+ tier)".
The k3s example values file's tier comment records the platform-tier
validation run and the port-forward recipe. The active value is
`tier: "analytics"` for normal usage.
Caddy now renders at every tier (was tier >= analytics) so platform-tier
deployments share the same TLS / Ingress / caddy.service.{type,...}
machinery as analytics+. Tier promotions no longer churn external-facing
config: the Service name, Ingress, port-forward target, and cert-manager
wiring are byte-identical from platform through full. The catch-all
reverse_proxy upstream switches by tier via a new
graphistry.caddy.upstreamHandle helper -- nexus-proxy at platform tier
(single replica, no sticky-cookie ceremony; mirrors main-branch's
caddy -> nginx:80 production shape, just retargeted at nexus-proxy which
carries the v1-to-v2 endpoint shims), dynamic engine-headless with HMAC
sticky-cookie LB at analytics+. The 3 inline catch-all blocks across the
external/self/off tlsMode branches collapse to one include each.
Adds ingress.enabled (default true) for operators whose external L7 is
at the Service layer rather than the K8s Ingress layer (Brad/Dell on
Tanzu NSX-T pointing an external LB at caddy.service.type=NodePort,
BBAI's Pulumi-managed LB, service mesh, or dev port-forward). Default
true preserves today's behavior so existing values overlays render an
unchanged Ingress. The Bitnami-style ingress.{className, hosts, tls,
annotations} keys are intentionally not added; equivalents already exist
under global.ingressClassName, global.domain, caddy.tls.{existingSecret,
acmeEmail, domains}, caddy.service.annotations, and
ingress.management.annotations. values.yaml now carries an explicit
Bitnami-shape -> chart-shape mapping table so operators coming from
other charts can find the right keys.
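For operators whose external L7 sits in front of the Caddy Service, the toggle described above might be used roughly as follows. This is a minimal values-overlay sketch: the key names are the ones named in this commit, and the NodePort choice only mirrors the Tanzu NSX-T case mentioned above.

```yaml
# Sketch only: skip the chart-rendered Ingress and expose Caddy at the
# Service layer so an external load balancer can target it directly.
ingress:
  enabled: false      # default is true, which renders the Ingress as before
caddy:
  service:
    type: NodePort    # assumption: taken from the external-LB example above
```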
Drive-by fix to a pre-existing bug: ingress.management.annotations was
documented and exemplified with service.beta.kubernetes.io/* cloud-LB
annotations, but those are Service-resource annotations -- they're inert
on an Ingress. Replaced the comment + commented-out examples with
annotations that actually take effect on an Ingress (nginx-ingress
internal class, ALB scheme, GKE gce-internal, cert-manager cluster
issuer, body-size). Cloud-LB annotations belong on
caddy.service.annotations, where they're already correctly documented.
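The split between the two annotation maps can be sketched as below. The issuer name and body size are hypothetical placeholders, and the Azure internal-LB annotation is only one example of a Service-level cloud-LB annotation; the actual examples shipped in values.yaml may differ.

```yaml
# Sketch only: which annotation belongs on which resource.
ingress:
  management:
    annotations:
      # Ingress-resource annotations: consumed by the ingress controller
      # or by cert-manager.
      cert-manager.io/cluster-issuer: letsencrypt-production   # hypothetical issuer name
      nginx.ingress.kubernetes.io/proxy-body-size: "100m"      # illustrative size
caddy:
  service:
    annotations:
      # Service-resource annotations: consumed by the cloud load-balancer
      # controller; these have no effect when placed on an Ingress.
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
```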
Verified on jorge7 k3s at platform tier: all 5 v1 deprecation shim paths
return byte-identical 410s through Caddy vs. direct nexus-proxy, /
returns 200, /accounts/login/ returns 200, Caddy /caddy/health/ returns
{"success": true}, Ingress is picked up by Traefik. helm template lints
clean across 4 tiers x 3 tls.mode combinations.
`make html` now copies the 10 canonical chart READMEs into docs/source/ with relative-link rewriting, runs frigate via a Python wrapper that bypasses its broken --no-deps CLI (frigate archived 2024-12), and renders the result. 7 stale hand-written .rst pages deleted; 3 frigate-generated .rst files gitignored as derived artifacts. index.rst restructured into 5 captioned sections; 10mins-to-k8s.rst rewritten as platform-agnostic 8-step skeleton with placeholders. Other docs cleanup: TROUBLESHOOTING.md moved to repo root, CLUSTER.md into the chart dir; new postgres-cluster, cd, aks READMEs; top-level README repository-structure HTML -> nested markdown list. Complete improvement in organization!
…nt guide
Telemetry storage fixes
- Prometheus, Jaeger, and Grafana pod templates now set securityContext.fsGroup
per workload (65534 / 10001 / 472, matching each upstream image's hardcoded
UID); kubelet chowns the PVC mount on first attach. Pre-fix the three pods
crashlooped with permission-denied on Longhorn (and any other block CSI
driver) because the data directory inherited root:root 0755 ownership and
the non-root container processes could not write to it. Configurable per
workload via telemetry.<workload>.securityContext.fsGroup; set to {} to
disable on NFS deployments that prefer chmod 0777 + no_root_squash.
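A values overlay that pins the three fsGroup values explicitly might look like the sketch below. The `prometheus` and `jaeger` workload key names are assumptions standing in for the `<workload>` placeholder above; the numeric values are the ones quoted in this commit.

```yaml
# Sketch only: per-workload fsGroup so kubelet chowns the PVC mount to the
# GID each non-root container process can write to.
telemetry:
  prometheus:          # assumption: workload key name
    securityContext:
      fsGroup: 65534
  jaeger:              # assumption: workload key name
    securityContext:
      fsGroup: 10001
  grafana:
    securityContext:
      fsGroup: 472
```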
Longhorn deployment documentation (charts/graphistry-helm/CLUSTER.md)
- New end-to-end Longhorn 1.11.x section: architecture (control plane vs data
plane, CSI vs iSCSI vs iscsid), when to choose Longhorn over NFS, per-node
prerequisites (open-iscsi / iscsid running, ext4/XFS data path, shared mount
propagation), Helm install with defaultReplicaCount=2, defaultDataPath, and
persistence.defaultClass=false, Node CR multi-disk patches, retain-sc-longhorn
StorageClass creation (RWO replicated block) plus optional RWX share-manager
class, smoke test, and cleanup runbook.
- Top-level README volumeBindingMode table grew a Replicated-block row covering
Longhorn RWO, Rook-Ceph RBD, OpenEBS Mayastor / cStor, and Portworx; the
existing Longhorn entry now reads "Longhorn RWX (share-manager NFSv4.1
re-export)" so the two modes are no longer conflated.
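The Longhorn install values and the StorageClass named above could be sketched as follows. The `defaultSettings.` prefix and the `numberOfReplicas` parameter are assumptions based on the upstream Longhorn chart layout, and the data path shown is Longhorn's stock default rather than a value taken from CLUSTER.md.

```yaml
# Sketch only: values passed to the upstream Longhorn chart.
defaultSettings:
  defaultReplicaCount: 2
  defaultDataPath: /var/lib/longhorn   # assumption: adjust to the per-node data disk
persistence:
  defaultClass: false                  # do not make Longhorn the cluster default class
```

```yaml
# Sketch only: a Retain-policy StorageClass for RWO replicated block volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retain-sc-longhorn
provisioner: driver.longhorn.io
reclaimPolicy: Retain
parameters:
  numberOfReplicas: "2"                # assumption: matches defaultReplicaCount above
```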
Docs style sweep across all six READMEs (top-level, postgres-cluster,
graphistry-helm, CLUSTER, telemetry, TROUBLESHOOTING): semantic chart-name +
section-name link text replaces file-path-as-link-text in prose; em-dashes,
double-dashes, and prose arrows replaced with semicolons, colons, parentheses,
and English connectives in narrative prose; tables, ASCII diagrams, and code
blocks keep their structural symbols. Five pre-existing broken inbound links
(../troubleshooting.md, ../cluster/README.md, ../k3s/, ../gke/, ../tanzu/)
repaired in-flight.
…ipeline
Values key restructuring
- rollingUpdate boolean + sibling maxSurge value collapsed into the
rollingUpdate.{enabled,maxSurge} block; applied across nexus, dask-
scheduler, caddy, redis, notebook, pivot, gak-public, gak-private,
and the engine DaemonSet rollout strategy.
- graphAppKitPublic / graphAppKitPrivate top-level booleans replaced by
graphAppKit.public.enabled / graphAppKit.private.enabled. PVC and
Deployment templates for gak read the new keys.
- nginx.service.httpPort replaces hardcoded port 80 across engine-
service, nexus-proxy-service, caddy-ingress, and nginx-ingress-dev;
nginxPorts.portOne removed.
- ingress.proxyBodySize replaces the camelCase root ProxyBodySize key.
- pivot.devRepository replaces the pivotDev.repository nested key.
- Nexus chart-wide env keys collapsed into nexus.env map:
graphistryCPUMode, nodeEnv, appEnvironment, djangoSettingsModule,
djangoDebug, sessionCookieAge, jwtExpirationDelta, enableDjangoSilk.
Operators that overrode these in overlays need to move them under
nexus.env.
- dask-scheduler REMOTE_DASK env removed (unused since the scheduler-as-
coordinator pattern in v0.4.x).
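An overlay written against the restructured keys might look like the sketch below. The key paths come from the list above; every value is illustrative, and the registry path is a hypothetical placeholder.

```yaml
# Sketch only: post-restructuring key shapes.
rollingUpdate:
  enabled: true
  maxSurge: 1                  # illustrative; formerly a sibling of the boolean
graphAppKit:
  public:
    enabled: true              # was graphAppKitPublic
  private:
    enabled: false             # was graphAppKitPrivate
nginx:
  service:
    httpPort: 80               # replaces nginxPorts.portOne
ingress:
  proxyBodySize: "100m"        # illustrative; replaces the root ProxyBodySize key
pivot:
  devRepository: registry.example.com/pivot-dev   # hypothetical; was pivotDev.repository
```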
Operator-explicit TLS
- global.tls, global.tlsStaging, tlsEmail Helm values removed. Their
seven prior jobs (cert-manager Ingress annotations, spec.tls block,
ClusterIssuer resources, Nexus COOKIE_SECURE, Caddyfile HSTS, NOTES
URL scheme, Grafana GF_SERVER_ROOT_URL auto-derivation) are now
operator-explicit or auto-derived from caddy.tls.mode + ingress.tls.
- ingress.tls Helm value (default []): passthrough into the Caddy
Ingress spec.tls. Each entry {hosts: [...], secretName: ...} is
rendered verbatim. Operators wiring TLS at the upstream cluster
ingress (nginx-ingress, Traefik, Tanzu LB, cert-manager) set this
directly.
- graphistry.externalScheme library helper centralises the "is this
cluster reachable over HTTPS?" decision. caddy-cfg.yml,
nexus-deployment.yaml, and NOTES.txt now read the helper; single
source of truth replaces three separate inline tls/tlsStaging checks.
- Grafana GF_SERVER_ROOT_URL auto-derivation from global.domain + tls
removed. Default reverts to "/grafana" (relative). Operators set
telemetry.grafana.GF_SERVER_ROOT_URL explicitly when absolute URLs are
needed (OAuth callbacks, share-dashboard links, alert webhook image
URLs). Auto-derivation also coupled the telemetry subchart to
parent-chart values that subcharts shouldn't see.
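With the global TLS toggles gone, an operator terminating TLS at the cluster ingress would state the pieces explicitly. A minimal sketch, assuming a hypothetical hostname and Secret name:

```yaml
# Sketch only: operator-explicit TLS wiring.
ingress:
  tls:
    - hosts:
        - graphistry.example.com     # hypothetical hostname
      secretName: graphistry-tls     # hypothetical Secret; rendered verbatim into spec.tls
telemetry:
  grafana:
    # Only needed when absolute URLs matter (OAuth callbacks, share links);
    # the default is the relative "/grafana".
    GF_SERVER_ROOT_URL: https://graphistry.example.com/grafana
```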
Crypto Secret management
- crypto.jwtSecret (default graphistry-jwt) and crypto.nexusSecret
(default graphistry-nexus) Helm values: names of two operator-managed
K8s Secrets the chart references via envFrom: secretRef:. graphistry-
jwt holds DJANGO_SECRET_KEY (loaded into nexus + the three streamgl
containers for shared JWT signing); graphistry-nexus holds the three
GRAPHISTRY_NEXUS_* keys (nexus only; bulk envFrom lets operators layer
additional nexus-only sensitive vars into the same Secret).
- graphistry.assertCrypto library helper: lookup-based pre-flight that
runs on every install + upgrade render. Fails at template time when
either Secret is missing in .Release.Namespace, pointing operators at
the bootstrap script.
- charts/graphistry-helm/crypto-bootstrap/create-secrets.sh: operator
bootstrap tool. Takes <graphistry-namespace> as required arg;
generates CSPRNG values via a one-off busybox Pod (image already in
the airgap bundle; needs only kubectl on the host) and creates both
Secrets via stdin-piped manifests so values never appear in kubectl
argv. --help covers custom Secret names, external secret-management
tooling (External Secrets Operator, helmfile + sops, vault, Sealed
Secrets), manual kubectl create -f - equivalents, and /dev/urandom
fallbacks.
- README "Create Crypto Secrets (Required)" section + NOTES.txt step 4
document the contract and rotation rules. Rotation table scopes the
GRAPHISTRY_NEXUS_ENCRYPTION_KEYS data loss correctly: connector
credentials in the nexus Postgres Connector.keyjson JSONB column
become unrecoverable on single-value rotation; graph data, datasets,
files, user accounts are unaffected.
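The two values only name the Secrets; the Secrets themselves are created out of band. A sketch of the values side, shown with the documented defaults:

```yaml
# Sketch only: the chart references these Secrets via envFrom and fails at
# template time if either is missing from the release namespace.
crypto:
  jwtSecret: graphistry-jwt       # holds DJANGO_SECRET_KEY
  nexusSecret: graphistry-nexus   # holds the GRAPHISTRY_NEXUS_* keys
```

Operators who keep Secrets under a different naming scheme would change the two names here and create Secrets with matching names before running helm install.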
Per-component env / extraEnv / extraEnvFrom extensibility
- Added across every chart workload: nexus, the engine-pod siblings
(nginx, streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-
python), graph-app-kit-public, graph-app-kit-private, notebook,
pivot, dask-scheduler, nexus-proxy, redis. Chart defaults stay in
<container>.env (deep-mergeable map); operators layer additional
entries via <container>.extraEnv (supports valueFrom: secretKeyRef)
and <container>.extraEnvFrom without forking templates. PORT is
filtered on engine-pod siblings since per-container PORT differs
from the chart-wide default.
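Layering an extra variable onto one workload without touching the chart defaults might look like the sketch below. It assumes extraEnv and extraEnvFrom take the standard Kubernetes EnvVar and EnvFromSource list shapes; the variable and Secret names are hypothetical.

```yaml
# Sketch only: operator-supplied env entries for the nexus workload.
nexus:
  extraEnv:
    - name: EXAMPLE_API_TOKEN            # hypothetical variable
      valueFrom:
        secretKeyRef:
          name: example-nexus-extra      # hypothetical Secret
          key: token
  extraEnvFrom:
    - secretRef:
        name: example-nexus-extra        # hypothetical Secret, loaded in bulk
```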
Removed templates
- templates/tls/letsencrypt-{production,staging}.yaml: chart-owned
ClusterIssuer resources. Cert issuance is no longer the chart's
responsibility; operators install cert-manager (cluster-scoped)
and define ClusterIssuers separately.
- templates/http-tools/{netshoot,whoami}.yaml: dev / network-debug pods.
Run on demand instead via kubectl run -it --rm --image=nicolaka/
netshoot ... -- bash; see TROUBLESHOOTING.md.
- templates/network/grph-networkpolicy.yaml: an incomplete attempt at
pod-to-pod policy (missing egress allowlist for DNS / Postgres /
external HTTPS, wrong PGO label selector). The permissive variant
could also undo a cluster-wide default-deny baseline. NetworkPolicy
is a cluster-security concern; operators write their own tailored to
their CNI and label scheme.
- templates/service-monitors/nginx-service-monitor.yaml: stale
ServiceMonitor for a Service that no longer exists at its pre-engine
location; the engine DaemonSet's nginx sibling is scraped via the
telemetry subchart's existing config.
- templates/ingress/forward-headers.yaml: a one-off Middleware resource
for an ingress controller variant the chart no longer assumes;
superseded by ingress.management.annotations passthrough.
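Since the chart no longer ships ClusterIssuers, an operator using cert-manager would define one separately. The manifest below is a generic cert-manager sketch, not a copy of the removed template: the email, account-key Secret name, and ingress class are placeholders.

```yaml
# Sketch only: an operator-owned ClusterIssuer replacing the removed
# templates/tls/letsencrypt-production.yaml.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                          # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-production-account-key      # placeholder Secret name
    solvers:
      - http01:
          ingress:
            class: nginx                            # placeholder: match the cluster's ingress class
```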
Docs build pipeline
- docs/Makefile gains a helm-docs-gen target rendering per-chart
parameter tables via the helm-docs binary, replacing the deprecated
frigate path (frigate has been archived since 2024-12).
- docs/_tools/helm-docs-table.gotmpl: chart-specific helm-docs template.
- docs/_tools/frigate_nodeps.py removed.
- docs/requirements.txt + docs/source/conf.py minor adjustments tracking
the helm-docs path; docs/source/.gitignore swaps frigate-generated
.rst entries for helm-docs-generated .md entries.
- docs/source/_static/custom.css: Sphinx site styling.
- .readthedocs.yaml: build.jobs.pre_build runs `make -C docs helm-docs-
install + readmes + helm-docs-gen` so the published RTD site matches
what `make html` produces locally. Without the pre_build hook, the
toctree-referenced chart-README + Values pages would silently be
missing on read-the-docs.
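The pre_build hook described above could be sketched as follows. The OS and Python versions are assumptions, and the three make targets are shown as a single invocation, which may not match the actual file.

```yaml
# Sketch only: Read the Docs config that regenerates derived pages before
# Sphinx runs.
version: 2
build:
  os: ubuntu-22.04          # assumption
  tools:
    python: "3.11"          # assumption
  jobs:
    pre_build:
      - make -C docs helm-docs-install readmes helm-docs-gen
sphinx:
  configuration: docs/source/conf.py
python:
  install:
    - requirements: docs/requirements.txt
```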
values.yaml documentation pass
- helm-docs `# --` annotation markers moved to the comment block
immediately preceding the value (helm-docs requires contiguous comment
blocks).
- Em-dashes and double-dashes in prose replaced with ;, :, () per
project style and helm-docs annotation-parsing constraints.
Examples + adjacent cleanup
- Stale top-level `tls: false` removed from gke / tanzu / microk8s
example values files; the templates only ever read `.Values.global.
tls`, so these overrides were silently ignored from day one.
- charts/values-overrides/examples/k3s/k3s_example_values.yaml
simplified to track current chart defaults.
- charts/postgres-cluster/values.yaml adjustments tracking the parent
chart's storage and TLS value renames.
- charts/graphistry-helm/CLUSTER.md wording polish.
CHANGELOG.md
- 0.4.4 `### Added` gains entries for crypto.{jwt,nexus}Secret,
graphistry.assertCrypto, create-secrets.sh, chart-injected envFrom
wiring, the README Crypto Secrets section, and per-component env
extensibility. `### Upgrade notes` gains a Crypto Secrets migration
entry pointing operators at the bootstrap script.
Hardening
- Add graphistry.assertCrypto library helper: values-shape pre-flight for crypto.jwtSecret and crypto.nexusSecret, no lookup-based check so it works on Argo CD / helm template / helm-diff offline.
- Quote ingress.proxyBodySize in the nginx-ingress annotation so operators setting a bare integer per the values comment do not crash K8s admission with the metadata.annotations type-mismatch error.
- Route graphistry.tier.platform through tierLevel for the same bogus-tier validation as tier.analytics, tier.viz, tier.full; pre-fix the helper bypassed validation and bogus tiers were caught incidentally by sibling templates.
- Caddy ACME email directive in tls.mode=self now renders only when no static cert is present; previously emitted as dead config when both existingSecret and acmeEmail were set.
- nginx-ingress-specific annotations on the chart Ingress now gate on global.ingressClassName == "nginx"; operators on Traefik / ALB / GKE / NSX get a clean Ingress without dead annotations.
- streamgl-gpu liveness probe computes the list-server port from PORT + STREAMGL_NUM_WORKERS + 1 at exec time instead of hardcoding 8095. Default behavior (N=4) is byte-equivalent; non-default N values now work without CrashLoopBackOff.
- dask-cuda-worker container no longer inherits .Values.forgeetlpython env. New daskcudaworker.env / extraEnv / extraEnvFrom keys mirror the forge-etl-python 3-layer pattern; cross-cutting vars (GRAPHISTRY_CPU_MODE, REDIS_DB) mirrored as defaults in both blocks with cross-reference comments.
- crypto-bootstrap/create-secrets.sh partial-parse failure no longer dumps raw kubectl stdout (which holds any CSPRNG values the Pod produced) to stderr; opt-in BOOTSTRAP_DEBUG=1 for the raw dump.
- caddy.tls.mode enum consolidates legacy ingress.enabled + three-value mode enum + externalTerminatesUpstream into one knob; previously illegal combinations (mode=self redirect loop, cloud-LB silently downgrading to http://) are unrepresentable.
- Caddy /grafana, /prometheus, /jaeger reverse_proxy handles now gate on tier.analytics+ alongside the existing ENABLE_OPEN_TELEMETRY and OTEL_CLOUD_MODE checks; platform-tier deployments no longer emit Caddy routes to backends that did not render.
Dead-code purge
- charts/k8s-dashboard/ chart + cd/argo-apps/templates/k8s-dash-cd.yaml. Verbatim copy-paste of upstream kubernetes/dashboard v2.6.1 (June 2022), default disabled, untouched since 2022-09, 4 years behind upstream. Operators wanting it install upstream's own chart.
- charts/values-overrides/internal/skinny-values.yaml, charts/values-overrides/internal/eks-dev-values.yaml, charts/values-overrides/examples/microk8s_example_values.yaml, charts/values-overrides/examples/azure_example_values.yaml. All four set keys the 0.4.x chart API no longer reads. skinny-values.yaml in particular set graphistryCPUMode: "1" which the chart silently ignored, so an operator using it for a CPU-only EKS dev cluster got GPU containers crashing instead of CPU mode. Maintained per-distro examples (k3s/, gke/, aks/, tanzu/) cover the same surface for those distros; microk8s and azure operators copy from k3s/ or aks/.
- DEVELOP.md, dev-compose/, acr-bootstrap/, and .github/workflows/dev-cluster-deployment.yaml. Webcoderz-era 2022 Azure DevOps Pipelines + manual workflow_dispatch GH Action that hard-coded the legacy MULTINODE / TLS / APP_TAG env-var contract against the unreleased multiNode + datamount-longhorn chart path. The Azure DevOps Pipeline was never wired up in Graphistry's organization; the GH Action was manual-only; dev-compose/ was its sole consumer.
- templates/ingress/nginx-ingress-dev.yml: a duplicate devMode-only Ingress routing directly to the nginx Service and bypassing Caddy.
- templates/pvc/uploads-files-persistentvolumeclaim.yaml + the volumeName.uploadsFiles Helm value + NOTES.txt / README / CLUSTER.md / TROUBLESHOOTING.md references. The PVC has been dead code since the engine DaemonSet refactor switched nginx body-temp scratch to a per-pod emptyDir.
- forgeetlpython.env / extraEnv iteration on the dask-cuda-worker container (the F1 leak fix above).
Operator docs
- New retain-sc-telemetry StorageClass requirement for the telemetry subchart. Parallel to retain-sc-postgres for Postgres. Decoupled from the chart-wide retain-sc so multi-node operators whose retain-sc is NFS/RWX do not silently land Prometheus TSDB / Jaeger Badger / Grafana SQLite on a networked filesystem (POSIX fsync + flock semantics violated). Telemetry subchart README gains a TL;DR mirroring the postgres-cluster pattern (heredoc manifest + skip-path bullets) and a Storage section with per-backend provisioner table. Top-level README "Three StorageClasses to create" entry. Chart README pointer next to the existing Postgres Prereq line.
- New "Diagnose engine internals" subsection in TROUBLESHOOTING.md documenting the kubectl port-forward path for operators hitting the internalTrafficPolicy: Local silent connection drop on off-pod callers (sidecars on CPU pool nodes, manual kubectl exec curl).
- "Access Graphistry" table in the chart README and the ACCESS block in NOTES.txt are now tier-aware. Nexus landing + dashboard works at every tier; ETL REST API requires analytics+; /graph/* browser viz requires viz+. Previously labeled every URL "Graphistry:" with no qualifier.
- New STORAGE TOPOLOGY block in NOTES.txt printed at install time when global.storage.accessMode is ReadWriteOnce (the default). Reframes accessMode as the single declaration for single-node vs multi-node intent. Industry survey confirmed no mainstream Helm chart auto-detects topology at render time (Helm lookup is unreliable on Argo CD / helm template / --dry-run=client); the contract is operator-declared. CLUSTER.md sidebar names the "Multi-Attach error for volume" symptom for operators searching by failure mode.
- README Migration section gains a "Plan a maintenance window" entry for the 0.4.3 to 0.4.4 upgrade. Lists the affected Services, the selector flip mechanism, expected outage window (60-300s on single-node k3s with images cached; 5-15 min per node on multi-node with fresh image pulls), and verification commands.
- Top-level README "PVCs per tier" table corrected to match chart reality: gak-public and gak-private PVCs render at every tier (the consuming Deployments stay tier-gated to full). Tier-gated PVCs on Retain SC would orphan PVs with stale claimRefs on downgrade-then-upgrade cycles.
- README "Optional" pointer for the telemetry subchart TL;DR next to the existing Postgres Prereq pointer; mirrors abstraction layers with the rest of the doc tree.
- nginx.env block in values.yaml documents two image-supported optional env vars (LONGHORN_NAMESPACE, EXTRA_NGINX_LOGS) that the chart does not set as defaults but operators can opt into via env or extraEnv.
- OTEL_RESOURCE_ATTRIBUTES added to the otelEnv map as a commented-out default with a doc comment naming the OTel semantic-conventions spec.
Verification
- helm lint passes on every change.
- ASCII only; no em-dashes in prose; no commit hashes or PR refs in chart-shipped docs (CHANGELOG body entries restate rationale inline for airgapped operators).
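The ingress.proxyBodySize quoting fix in the Hardening list above comes down to Kubernetes requiring every annotation value to be a string. A minimal rendered-manifest sketch, assuming the value feeds the standard nginx-ingress proxy-body-size annotation (the commit does not name the annotation key):

```yaml
# Sketch only: fragment of a rendered Ingress.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress                                  # hypothetical name
  annotations:
    # Quoted, so YAML parses a string and admission accepts it.
    nginx.ingress.kubernetes.io/proxy-body-size: "100"
    # Unquoted, the same value parses as an integer and the API server
    # rejects the object:
    # nginx.ingress.kubernetes.io/proxy-body-size: 100
```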
Continuation of a0d5cba covering the remaining blocker / critical / docs items in the 0.4.4 review surface. CHANGELOG.md carries one entry per item with full failure-mode and fix detail; this summary groups by severity.
BLOCKERS
- Rolling update strategy default: documented per-workload race window rather than reverting or restructuring. The flip from Recreate to RollingUpdate does not deadlock helm upgrade on correctly-configured clusters; the K8s-level deadlock only fires in the multi-node + RWO misconfig the chart already warns about via NOTES.txt. The actual residual risk is a per-app race during the ~30-90s pod-overlap window, affecting redis snapshots, nexus uploads, pivot/notebook saves, and caddy ACME (only at tls.mode=self). New TROUBLESHOOTING section 12 documents the per-workload verdict and the rollingUpdate.enabled=false opt-out.
- Engine env priority: documented rather than reordered. The chart-literals-after-extraEnv emission order is deliberate defense (chart-owned wiring stays authoritative). New TROUBLESHOOTING section 13 documents the full emission order, SSA-vs-default-helm collision behavior, and the dedicated .Values.global.* channels operators should use for legitimate overrides of chart-owned keys. values.yaml gains a one-line pointer at the first engine-container extraEnv block. engine-daemonset.yaml "K8s API rejects duplicate env keys" comments rewritten to accurately describe SSA vs non-SSA.
- Multi-node otel-collector scrape: replaced the static_configs block in prometheus-configmap.yaml with kubernetes_sd_configs role: pod selecting label io.kompose.service: otel-collector, relabeling __address__ to pod IP. Single-replica Prometheus now scrapes every per-node collector instead of routing through the Service VIP, which was blocked by internalTrafficPolicy: Local. Split into two scrape jobs: otel-collector on :8889 for engine OTLP-derived metrics, and otel-collector-self on :8888 for the collector's own telemetry (queue size, drop counters, export failures). Service unchanged; only the Prometheus scrape path bypasses it.
- Notebook gak mount gating: the notebook Deployment unconditionally mounted gak-public and gak-private PVCs even when the operator disabled either gak variant via graphAppKit.public.enabled or graphAppKit.private.enabled. The PVC templates were correctly gated but the consumer Deployment was not, leaving the notebook pod stuck FailedMount on a PVC the chart never rendered. Matching .enabled gates added around the two volumeMount entries and the two volume entries; default rendering byte-identical when both variants are enabled.
CRITICAL
- Crypto bootstrap shape validation: four regex checks added after CSPRNG extraction, before any kubectl create. Catches generators whose output is non-empty but wrong-format (line-wrapped base64, broken od column width, rate-limited urandom). Mismatches accumulate into a bash array so all four surface in one error naming the generator image.
- Crypto bootstrap precondition Secret-presence check: ahead of the CSPRNG generator, kubectl get secret is run for both graphistry-jwt and graphistry-nexus. If either exists, the script lists which, prints a two-bullet resolution guide (full re-init: delete BOTH and rerun; rotation: follow the MultiFernet README recipe), and exits 1 before any CSPRNG values are generated. The prior delete-one-and-rerun pattern silently created partial key rotation, invalidating either browser sessions or connector credentials with no error signal.
- engine.cookieSecret Caddyfile injection: the value rendered into lb_policy cookie graphistry_sticky <SECRET> { without quoting. Whitespace split the directive into multiple positional args (parse error); literal newlines broke the directive entirely (EOF). Wrapped the helper's emission with | quote (whitespace tolerated; embedded " escaped; # / { / } carried through literally) and added a template-time fail if the value contains a literal newline. Gated on $isAnalyticsPlus so platform-tier is not affected. Default rendering byte-shape unchanged.
- Caddy wrapper cookie-LB Secure/SameSite fix delivered via global.tag bump v2.50.4 to v2.50.6 across base values, chart README, and the k3s / gke / tanzu example overlays. v2.50.6 rebases the bundled wrapper on Caddy 2.11.2 (caddyserver/caddy PR #6115 merged in v2.8.0). Pre-v2.50.6 wrappers shipped Caddy v2.7.6 and silently dropped graphistry_sticky in cross-site embed contexts, breaking session pinning. caddy.upstreamImage remains as an escape hatch for operators needing a specific Caddy version. Stale "currently pinned to v2.7.6" comments refreshed (history retained for operators debugging older deployments).
- Caddy phantom :443 port: gated both the Pod's containerPort: 443 and the Service's :443 port block (including the nested nodePortHttps clause) on tls.mode=self. The Caddyfile only emits a :443 site block in tls.mode=self; in ingress / external / off Caddy listens on :80 only. Pre-fix, cloud LB TCP health checks on :443 saw connection-refused; with externalTrafficPolicy: Local, kube-proxy blackholed traffic. Same edit unified the caddy-deployment.yaml pre-existing default "external" inline fallbacks to default "ingress", matching values.yaml and the three other Caddy templates.
- Telemetry N-fold timeseries cardinality: added internalTrafficPolicy: Local to dcgm-exporter and node-exporter Services. Pre-fix, each per-node otel-collector's internal prometheus receiver scraped via Service hostname and kube-proxy load-balanced cluster-wide; resulting metrics carried the collector's own host.name resource attribute, fanning a single exporter pod's metrics to N distinct timeseries upstream (Grafana Cloud / Mimir bill by active timeseries; Datadog by submissions). Local policy makes each collector scrape its node-local exporter only. Self-hosted mode unaffected; Prometheus already scrapes pod IP direct via kubernetes_sd_configs.
- Floating :latest pins: groundnuty/k8s-wait-for pinned to v2.0 across engine DaemonSet (wait-for-redis, wait-for-dask-scheduler), nexus-proxy Deployment (wait-for-nexus), and otel-collector DaemonSet (init-prometheus, init-jaeger): 10 references across 3 templates, both default and custom-registry render branches. crypto-bootstrap busybox:latest pinned to busybox:1.38.0-musl. charts/graphistry-helm/README.md airgap section rewritten in the same pass: dropped a stale pre-engine-consolidation pod list, bumped the v2.50.1 example tag to v2.50.6, and mirrored the existing "Create Docker Hub Secret" section's "contact Graphistry Support" tone (operators do not enumerate images themselves).
- NOTES.txt TROUBLESHOOTING URL fix: updated the deleted /charts/values-overrides/examples/troubleshooting.md to the repo-root /TROUBLESHOOTING.md the docs-pipeline-rewire moved earlier in this version.
- NOTES.txt legacy-key migration warning: six-key hasKey-gated warning added (graphAppKitPublic, graphAppKitPrivate, pivotDev.repository, volumeName.uploadsFiles, multiNode, clusterVolume). Slim shape: lists detected keys by name, points at the chart README Migration section for recipes. Inline comment documents why rollingUpdate-bool is intentionally NOT detected (Deployment templates access .Values.rollingUpdate.enabled directly and Helm cannot index a bool; helm install errors at template-render time with "can't evaluate field enabled in type interface {}" before NOTES is reached, so the README Migration paragraph is the recovery path). Chart README Migration section extended with six new "**You had X.**" paragraphs matching the existing TLS migration tone.
DOCS / MINOR
Many smaller cleanups; one CHANGELOG entry per item. Highlights:
- CHANGELOG StreamglGpuWorkerResources typo corrected to StreamglGpuResources (matches the actual chart values key).
- nexusDev.repository added to the legacy-key migration warning alongside its pivotDev sibling.
- OTEL_EXPORTER_JAEGER_ENDPOINT gated on self-hosted mode (cloud pods no longer carry the dangling jaeger:4317 reference).
- otel-collector liveness probe comment corrected to reflect the actual emitted path ("/", not "/healthz").
- New NOTES.txt advisory: COOKIE SECRET WARNING when engine.cookieSecret equals the placeholder default and tier is analytics+, pointing operators at openssl rand -hex 32.
- TROUBLESHOOTING.md "Expected output (k3s with Traefik)" updated to reflect the 0.4.4 single-Caddy-Ingress topology (telemetry UIs now path-routed through Caddy reverse_proxy, no standalone grafana / prometheus / jaeger Ingresses).
- nginx env block in engine-daemonset.yaml gains the PORT filter for manifest symmetry with its five sibling containers.
- engine-wait-for-postgres port driven by new global.POSTGRES_PORT value (default empty, falls back to 5432). Operators on RDS Proxy / PgBouncer on 6432 no longer hit pg_isready timeouts.
- New TROUBLESHOOTING section documenting the dask-cuda-worker liveness -> in-pod nginx dispatcher -> forge-etl-python:8100/workerhealth chain and cascading-restart behavior under nginx instability.
- engine DaemonSet terminationGracePeriodSeconds bumped 30 to 60 for dask-cuda-worker drain headroom.
- gak-{public,private} PVC comments clarify that the new .enabled gate is a per-variant opt-out (entire variant suppressed), distinct from the intentionally-absent tier gate.
- TROUBLESHOOTING section 13 extended with a subsection naming notebook, pivot, graph-app-kit-public, and graph-app-kit-private as sharing the engine pod's env emission-order pattern.
- NOTES.txt Crypto Secrets reminder lifted out of the devMode gate; PREREQUISITES list inside the gate renumbered 1-3.
- NOTES.txt Available tiers line no longer implies the chart bundles postgres (postgres-cluster is a separate Helm release prereq).
- New NOTES.txt advisory: EXTERNAL TLS TOPOLOGY when caddy.tls.mode=external, warning about HSTS pinning if the upstream LB serves plain HTTP.
- Grafana defaults: GF_AUTH_ANONYMOUS_ENABLED true to false, GF_AUTH_ANONYMOUS_ORG_ROLE Admin to Viewer. Pre-fix, anyone reaching the chart's domain via Caddy /grafana got unauthenticated Grafana Admin (modify dashboards, run arbitrary queries against Prometheus, add datasources). admin/admin seed credentials unchanged for dev convenience.
- --stdin=true removed from the crypto-bootstrap kubectl run invocation (unused attach stream; some kubectl versions drop initial stdout bytes via the attach side effect).
OUT OF SCOPE
- Defensive trap around the bootstrap-script kubectl run pod (orphan-pod cleanup on Ctrl-C): a defense that does not constrain any attacker who has the access required to exploit the window (RBAC rarely partitions pods/log from secrets; any actor at the operator's terminal trivially bypasses the trap by editing the script). Memory rule defense-must-constrain-attacker captures the reasoning.
- assertCrypto include in engine-daemonset.yaml and nexus-proxy-deployment.yaml: deferred. The assertion is caught incidentally today because nexus-deployment.yaml always renders and includes it; defense-in-depth across templates would close fragility under future template-render-order changes.
- readinessProbe + startupProbe coverage across the six engine sibling containers plus notebook / pivot / gak / nexus: real architectural work; deferred to a separate workstream.
Verified: helm lint clean; helm template renders cleanly across default values plus the scenario-specific overlays exercised per item (rollingUpdate bool, caddy.tls.mode variants, telemetry cloud + self-hosted, graphAppKit variant disables, legacy-key overlays, cookieSecret placeholder)
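The multi-node otel-collector scrape change listed under BLOCKERS above can be sketched as a Prometheus config fragment. This is an illustration of the described approach, not the contents of prometheus-configmap.yaml: the job names, pod label, and ports come from the commit text, while the exact relabel rules are assumptions.

```yaml
# Sketch only: discover every otel-collector pod and scrape it by pod IP,
# bypassing the Service VIP that internalTrafficPolicy: Local blocks.
scrape_configs:
  - job_name: otel-collector            # engine OTLP-derived metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled io.kompose.service=otel-collector
      # (Prometheus rewrites the dots in the label name to underscores).
      - source_labels: [__meta_kubernetes_pod_label_io_kompose_service]
        regex: otel-collector
        action: keep
      # Point the scrape address at the pod IP on the metrics port.
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: "${1}:8889"
  - job_name: otel-collector-self       # the collector's own telemetry
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_io_kompose_service]
        regex: otel-collector
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: "${1}:8888"
```

Pod-role discovery needs list/watch permission on pods for the Prometheus ServiceAccount; whether the chart already grants that is not stated in the commit text.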
Four narrow additions catching foot-guns that were not in the prior
review pass but are cheap to close. None are install-time fatal; all
four are documentation or comment-only and do not change any chart
template behavior.
- graphistry.tier.* helpers (charts/graphistry-common/templates/
_helpers.tpl) emit the literal string "true" when the deployment
tier is at or above the named level, and the EMPTY STRING when not.
There is no "false" branch. All 30 existing call sites correctly
use eq ... "true" or ne ... "true", but a future contributor
reaching for eq ... "false" silently always evaluates false, and
one reaching for {{ if (include ...) }} truthiness is subject to
Go template's non-empty-string-truthy rule (the literal string
"false" would be truthy if the helper were ever symmetrised).
Added a file-level docstring documenting the contract with explicit
positive and negative example forms; per-helper one-liners now
read "empty string otherwise" instead of stopping at the truthy
case. No behavior change.
- charts/graphistry-helm/CLUSTER.md Storage section gains a new
"Adding a node to an existing single-node deployment" subsection.
The chart's STORAGE TOPOLOGY warning in NOTES.txt fires at every
helm install and helm upgrade with global.storage.accessMode set
to ReadWriteOnce, but does NOT re-fire when cluster node count
changes out-of-band (operator joins a 2nd GPU node months after
install). The new subsection walks the prevention path (provision
RWX backend; swap StorageClass; migrate data explicitly; flip
accessMode to ReadWriteMany; helm upgrade; then join the new node)
and the recovery path (cordon the new node first if already joined
and the engine pod is stuck Multi-Attach). Lands after the existing
"Symptom you are debugging" subsection so operators reading either
the prevention or recovery path encounter the right adjacent
content. Doc-only addition.
- otelEnv block in values.yaml gains a comment block documenting
Helm's map-merge vs atomic-replace semantics. otelEnv is a
top-level map with seven chart defaults (OTLP endpoint, four
per-signal timeouts, metric push interval, metric export timeout).
Overlay files (-f overlay.yaml) and per-key --set otelEnv.KEY=VAL
deep-merge per key; whole-map --set otelEnv="{KEY: VAL}" atomically
REPLACES every default. The atomic-replace form silently drops the
OTLP endpoint and all timeouts, producing missing-telemetry symptoms
with no render-time signal. The new comment block tells operators
which override forms merge and which replace, and recommends
overlay files or per-key --set for additions. Same comment shape
as the existing nexus.env merge-semantics guidance.
- templates/NOTES.txt analytics-tier ACCESS block gains a single
line under the existing "Browser visualization (/graph/*) NOT
available at analytics tier" warning, naming /streamgl/* asset-path
502 behavior at analytics tier as expected (not a chart bug).
Engine-pod nginx unconditionally has a proxy_pass to
http://streamgl-viz:8080 from the bundled default.conf (chart
cannot path-gate it; that lives in the upstream nginx image),
but the streamgl-viz shim Service is gated on tier.viz and does
not render at analytics. /streamgl/* requests (from old session
bookmarks, third-party iframe embeds, dashboard widgets pointing
at viz URLs) return 502 from the engine nginx. The new NOTES line
names the audience and the expected behavior so analytics-tier
operators recognize the failure mode without filing a chart bug.
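The otelEnv merge-versus-replace distinction in the list above can be sketched as an overlay file plus the invocation forms it describes. The key and value below are illustrative only (not one of the seven chart defaults), the release name my-release is a placeholder, and the helm commands are left as comments because their behaviour is taken from the description above rather than re-verified here.

```shell
#!/usr/bin/env bash
# Write an overlay that ADDS one key under otelEnv. Per the comment block
# described above, overlay files merge per key, so chart defaults survive.
OVERLAY="$(mktemp)"
cat > "$OVERLAY" <<'EOF'
otelEnv:
  # Illustrative addition; the key name and value are assumptions.
  OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=staging"
EOF
cat "$OVERLAY"

# Forms described as merging per key:
#   helm upgrade my-release ./charts/graphistry-helm -f "$OVERLAY"
#   helm upgrade my-release ./charts/graphistry-helm --set otelEnv.SOME_KEY=some-value
# Form described as replacing the whole map (drops the OTLP endpoint and timeouts):
#   helm upgrade my-release ./charts/graphistry-helm --set otelEnv="{KEY: VAL}"
```

Keeping additions in an overlay file also leaves a reviewable record of what was changed relative to the chart defaults.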
Storage in-flight upload window during helm upgrade rolls (also
floated as a doc candidate) is deliberately left undocumented at
the chart level: the maintenance-window pattern in the existing
Migration guidance already covers it, and an explicit per-Deployment
drain procedure would model operator state machinery the chart
deliberately leaves unmodeled (consistent with the deferred
readiness/startup-probe and preStop-hook work).
Verified: helm lint clean; helm template renders cleanly; no
em-dashes in any touched file; no internal review-process language
in any chart-shipped file.
…atomicity
Continuation of the 0.4.4 hardening surface. CHANGELOG.md carries one
entry per item with full failure-mode and verification detail; this
summary groups by concern.
ENGINE POD PROBES
- nginx sibling container gains a readinessProbe mirroring the existing
livenessProbe (curl -f http://localhost/healthz) with
initialDelaySeconds: 8, periodSeconds: 10, failureThreshold: 3,
timeoutSeconds: 5. Caddy's `dynamic a engine-headless ...` resolver
publishes pod IPs to its upstream pool on a 10s refresh as soon as
the pod enters Endpoints, which without a readinessProbe happens as
soon as the pod is Ready (containers Running), before nginx is bound
on :80. During DaemonSet rollover (updateStrategy: RollingUpdate
with maxUnavailable: 1, one engine pod per node rolling sequentially)
Caddy would route browser traffic to a not-yet-serving pod, producing
transient 502s on every rollover. Pre-0.4.4 the same gap existed on
the standalone nginx Deployment but was masked by the Recreate-default
strategy (no overlap window) and by Caddy resolving through kube-proxy
with TCP retry semantics; 0.4.4's DaemonSet RollingUpdate + headless
Service + dynamic resolver makes the gap operator-visible. The wrapper
image's /healthz returns HTTP 201, and the container becomes ready in
roughly one second from start (verified at runtime against
graphistry/streamgl-nginx:v2.50.6-universal); the 8s initialDelay is
conservative buffer for K8s scheduling overhead. Pod IPs no longer
enter Endpoints until nginx is bound; rolling upgrades stop dropping
browser sessions on the new-pod side.
- dask-cuda-worker stage-2 liveness target changed from
http://forge-etl-python:8080/workerhealth (which routed through the
in-pod nginx dispatcher via hostAliases: forge-etl-python -> 127.0.0.1
-> nginx :8080 -> fep :8100) to http://localhost:8100/workerhealth
(direct to fep in the shared engine-pod network namespace at fep's
bind port). The earlier indirect form coupled dask's liveness to nginx
availability: any nginx container restart (image pull, OOM,
ConfigMap-triggered rollout) made the dispatcher unavailable, dask's
liveness failed failureThreshold: 3 consecutive times, kubelet killed
and restarted the dask container, and in-flight Arrow batches were
lost (no preStop hook; terminationGracePeriodSeconds: 60 only helps
if the worker can respond to SIGTERM, which it cannot if liveness
has already failed). The direct-localhost form decouples the cascade.
Probe semantic preserved: fep's /workerhealth handler validates
dask-cluster reachability internally before responding, so the probe
still asserts "the dask cluster is operational through fep"; it just
no longer requires nginx routing to make the assertion. Verified
end-to-end against the running graphistry/etl-server-python:v2.50.6-13
container: the image binds 0.0.0.0:$PORT, the chart's engine pod sets
PORT=8100 for the forge-etl-python sibling, and
curl http://localhost:$PORT/workerhealth from the same network
namespace returns {"success": true}.
- TROUBLESHOOTING.md section 14 ("Engine Container Probe Dependencies")
rewritten: the dask-cuda-worker subsection now documents the
direct-localhost variant as current and the routed-through-nginx
variant as historical/superseded; new subsection on the nginx
readinessProbe documents its role in headless-Service rollover
semantics and the rationale for the 8s initialDelay.
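For orientation, the nginx readinessProbe described above would render to roughly the fragment printed below. The field values come from this commit message; the exec/curl wiring is an assumption about how the chart expresses "mirroring the existing livenessProbe", and render_nginx_readiness_probe is a hypothetical helper used only to print the sketch.

```shell
#!/usr/bin/env bash
# Print the probe fragment. Values are from the commit message; the exec
# form is an assumption about the rendered manifest.
render_nginx_readiness_probe() {
  cat <<'EOF'
readinessProbe:
  exec:
    command: ["curl", "-f", "http://localhost/healthz"]
  initialDelaySeconds: 8
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
EOF
}

PROBE_YAML="$(render_nginx_readiness_probe)"
echo "$PROBE_YAML"

# A pod that stops serving leaves Endpoints after failureThreshold consecutive
# failed checks: at most about PERIOD * FAILURES seconds, plus probe timeouts.
PERIOD=10
FAILURES=3
echo "removed from Endpoints within ~$((PERIOD * FAILURES))s of going dark"
```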
MIGRATION WARNING CORRECTNESS
- NOTES.txt legacy-TLS warning gate was checking the wrong nesting
level. 0.4.3 declared tls and tlsStaging at the TOP level
(values.yaml:268,271 on origin/main) and the shipped 0.4.3 overlays
(k3s, gke, tanzu, microk8s) all set them top-level. The warning was
checking hasKey .Values.global "tls" / "tlsStaging", which never
fires for the actual operator-overlay shape. Result: an operator
upgrading from 0.4.3 with tls: true got no warning that TLS handling
had moved to operator-explicit caddy.tls.mode + ingress.tls (could
end up serving plain HTTP believing the chart still terminates TLS).
Fixed by switching the three hasKey paths to top-level and dropping
the `global.` prefix from the body labels. Verified the warning fires
on --set tls=true and does NOT false-positive on the implausible
--set global.tls=true shape.
- NOTES.txt legacy-v0.4.3 migration block dropped its multiNode /
clusterVolume detection entirely. The original C14 addition checked
these at the wrong nesting level (top-level hasKey .Values "multiNode"
while 0.4.3 templates uniformly read .Values.global.multiNode /
.Values.global.clusterVolume). On audit of the actual operator
population, no customer ever shipped an overlay using the chart's
pre-0.4.4 leader/follower cluster-mode prerequisites that these two
keys configured. Correcting the hasKey path would defend an empty
customer population. Removed the two hasKey entries from the outer
or-chain and the two if blocks in the body; chart README Migration
section parallel paragraphs (`**You had multiNode: true.**` and
`**You had clusterVolume: ....**`) also dropped on the same logic.
The five remaining keys in the warning block (graphAppKitPublic,
graphAppKitPrivate, nexusDev.repository, pivotDev.repository,
volumeName.uploadsFiles) are genuinely top-level in 0.4.3 and shipped
in operator overlays, so those detections remain and continue to fire.
- TROUBLESHOOTING.md "Multi-Attach error on engine PVCs (multi-node)"
step 2 rephrased from "Deploy in cluster mode with NFS, EFS, Longhorn
RWX, or CephFS" to "Switch the chart's global.storage.accessMode to
ReadWriteMany and back retain-sc with an RWX storage backend (NFS,
EFS, Longhorn RWX, CephFS, Azure Files)". Removes ambiguous "cluster
mode" phrasing that could be misread as the never-shipped legacy
leader/follower cluster feature.
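The nesting-level distinction behind the legacy-TLS warning fix can be illustrated with an operator-side check run before helm upgrade. This is a hypothetical helper, not part of the chart: it only greps for the two 0.4.3-era keys at column 0, which is a rough stand-in for what hasKey .Values sees.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upgrade check: does an overlay set tls / tlsStaging at the
# TOP level (the shape 0.4.3 overlays shipped), as opposed to under global?
has_legacy_top_level_tls() {
  grep -Eq '^(tls|tlsStaging):' "$1"
}

TOP="$(mktemp)"
printf 'tls: true\n' > "$TOP"
NESTED="$(mktemp)"
printf 'global:\n  tls: true\n' > "$NESTED"

if has_legacy_top_level_tls "$TOP"; then TOP_RESULT=legacy; else TOP_RESULT=clean; fi
if has_legacy_top_level_tls "$NESTED"; then NESTED_RESULT=legacy; else NESTED_RESULT=clean; fi

echo "top-level overlay: $TOP_RESULT"
echo "nested overlay:    $NESTED_RESULT"
```

The top-level overlay is reported as legacy and the global-nested one as clean, matching the two cases the corrected warning gate was verified against.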
BOOTSTRAP SCRIPT ATOMICITY
- crypto-bootstrap/create-secrets.sh Secret creation step refactored
from two sequential kubectl create -f - heredocs into one multi-doc
stdin submission separated by `---`. Two motivations stack:
(1) values continue not to appear in kubectl argv (`ps` /
/proc/<pid>/cmdline cannot see them, same as before), and (2) the
failure window where graphistry-jwt could end up orphaned without
graphistry-nexus shrinks from "any moment between two separate
kubectl processes" to "any moment within a single kubectl process
reading one stdin stream". The shape is honestly NOT transactionally
atomic at the apiserver (kubectl still processes docs sequentially;
an admission-webhook failure on the second doc still leaves the
first committed), but operator-side interruption between docs
(Ctrl-C, terminal disconnect, OOM of the shell) cannot leave the
script in the half-created state since both docs ship in one syscall
stream. The existing precondition Secret-presence check ahead of
the CSPRNG generator continues to catch any orphan-jwt state on
rerun. Help text reference updated from "the two kubectl create -f -
invocations" to "the multi-doc kubectl create -f - invocation".
Verified via kubectl create --dry-run=client -f - with multi-doc
stdin: both Secrets processed and reported in one invocation.
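The multi-doc submission shape can be sketched as follows. This is not the contents of create-secrets.sh: the Secret names come from this commit message, but the data key example-key, the gen_hex generator, and the emit_secrets function are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# 64 hex characters from the kernel CSPRNG (stand-in for the script's generator).
gen_hex() { od -An -N32 -tx1 /dev/urandom | tr -d ' \n'; }

# Emit both Secrets as ONE multi-document stream. Values are expanded inside
# the heredoc, so they travel on stdin and never appear in kubectl's argv.
emit_secrets() {
  local jwt nexus
  jwt="$(gen_hex)"
  nexus="$(gen_hex)"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: graphistry-jwt
type: Opaque
stringData:
  example-key: "${jwt}"
---
apiVersion: v1
kind: Secret
metadata:
  name: graphistry-nexus
type: Opaque
stringData:
  example-key: "${nexus}"
EOF
}

MANIFEST="$(emit_secrets)"
echo "documents: $(printf '%s\n' "$MANIFEST" | grep -c '^kind: Secret$')"

# Single submission (client-side dry run shown; drop the flag to create):
#   emit_secrets | kubectl create --dry-run=client -f -
```

Because both documents are generated before anything is piped to kubectl, an interruption during generation submits nothing at all.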
OUT OF SCOPE (no change)
- streamgl-gpu liveness probe port: verified against the streamgl-gpu
container. Both the current chart probe target
(localhost:$((PORT+STREAMGL_NUM_WORKERS+1))/check-workers, =:8095 at
K8s PORT=8090, N=4) and the pre-0.4.4 standalone probe target
(localhost:$PORT/check-workers, =:8090 in K8s) return identical
{"status":200,"workers":[...]} payloads at runtime. Source confirms
/check-workers is canonically registered on the PORT+N+1 endpoint and
exposed on PORT through a proxy special case; both shapes work, the
current chart shape is more direct. No fix needed.
- $RANDOM-based POD_NAME in the bootstrap script: theoretical name
collision under concurrent same-namespace runs only; concurrent
bootstrap of the same namespace is not a real workflow and the pod
is --rm short-lived.
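The streamgl-gpu probe-port arithmetic quoted above works out as in this small sketch, using the K8s values stated in the bullet (PORT=8090, four workers). The variable names PROBE_PORT and LEGACY_PORT are introduced here for illustration.

```shell
#!/usr/bin/env bash
PORT=8090
STREAMGL_NUM_WORKERS=4

# Current chart probe target: the PORT+N+1 endpoint.
PROBE_PORT=$((PORT + STREAMGL_NUM_WORKERS + 1))
# Pre-0.4.4 standalone probe target: PORT itself.
LEGACY_PORT=$PORT

echo "current: http://localhost:${PROBE_PORT}/check-workers"
echo "legacy:  http://localhost:${LEGACY_PORT}/check-workers"
```

This prints :8095 for the current target and :8090 for the legacy one; changing STREAMGL_NUM_WORKERS moves only the current target.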
Verified: helm lint clean; helm template renders cleanly; bash -n
clean on the bootstrap script; no em-dashes added to any touched file.
Summary
Enables Graphistry to run correctly across multiple GPU nodes from a single Helm release. All GPU-bound services (nginx, forge-etl-python, dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) are consolidated into a single engine DaemonSet pod per GPU node, fronted by Caddy as a sticky-LB L7. The prior multi-node model (leader + follower helm releases per node) is retired in favour of a single-namespace deployment.
The architectural change
engine DaemonSet (one bundled pod per GPU node)
New templates/engine/engine-daemonset.yaml colocates nginx, fep, dask, and the streamgl-* services as sibling containers in one pod per GPU node. Tier-aware: 3 containers (nginx + fep + dask) at analytics, 6 containers (adds streamgl-{viz,sessions,gpu}) at viz/full. Intra-pod hostAliases pin the app-layer hostnames (streamgl-viz, streamgl-gpu, forge-etl-python, dask-cuda-worker) to 127.0.0.1, so every intra-stack HTTP hop is localhost. streamgl-gpu's PM2 localhost IPC continues to work unchanged because all gpu-router and PM2-forked gpu-worker children are intra-pod.
Companion templates:
- templates/engine/engine-nginx-cfg.yml -- supplementary nginx conf.d ConfigMap that adds an intra-pod :8080 Host-header dispatcher, loaded alongside the production default.conf.template.
- templates/engine/engine-service.yaml -- engine-headless Service for the Caddy upstream pool (gated on tier.analytics) plus four shim Services (streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-python, gated on tier.viz) so the production nginx FQDN-suffixed hostnames keep resolving through CoreDNS for any path that does not go through hostAliases.
This bundle supersedes the earlier streamgl-gpu router/spawner split (added in 511876d, retired in 699bc4f). Bundling keeps PM2 IPC working unchanged, makes every intra-stack hop localhost (lower latency, better data locality), and has the same blast radius as a router-split design because browser sessions are cookie-pinned across engine pods anyway. Per-node redundancy comes from running multiple engine pods (one per GPU node) under the headless Service.
The engine DaemonSet's forge-etl-python container now sets GRAPHISTRY_DASK_LOCAL_AFFINITY=1 so each fep submission's persist/compute carries a soft worker-affinity hint pinning it to the dask-cuda-worker on the same pod (matched by HOSTNAME prefix). This is a cross-repo dependency on graphistry/graphistry#3097, which adds the dask_affinity helper that reads this env var and produces the kwargs. With multiple GPU nodes registered to the scheduler, the hint eliminates the cross-node shuffle path that would otherwise ship ~256 MiB partitions over the cluster overlay (~2x ETL speedup measured at 4/92/512 MiB). The hint is soft (allow_other_workers=True) and decays to a no-op on miss, so single-node deployments are byte-identical to pre-change.
Caddy as sticky-LB L7
Caddy now load-balances browser sessions across engine-headless using Caddy's dynamic a resolver (live pod-IP refresh, refresh 10s) and lb_policy cookie graphistry_sticky for session stickiness. The cookie is HMAC-signed on engine.cookieSecret; WebSocket upgrades to /graph/socket.io and /streamgl/* ride the same pin.
New operator-tunable knobs in values.yaml:
- caddy.enabled -- toggle Caddy + caddy-ingress entirely. false lets operators front engine-headless with their own ingress controller (Pattern B); the operator owns TLS termination and cookie affinity in that case. The render gate on caddy-cfg.yml, caddy-deployment.yaml, and caddy-ingress.yml is now tier.analytics AND caddy.enabled.
- caddy.tls.mode -- external | self | off. external trusts X-Forwarded-Proto: https from private_ranges so the sticky cookie keeps Secure + SameSite=None. self terminates from existingSecret or ACME. off is plain HTTP only.
- caddy.lb.fallback -- first-time-assignment policy when no cookie is set yet (default round_robin).
- caddy.service -- type / loadBalancerIP / nodePort / nodePortHttps / annotations / externalTrafficPolicy. Cloud, Tanzu, MetalLB, NodePort all selectable from values; per-platform annotation hints inline.
- caddy.upstreamImage -- escape hatch to use caddy:2.10-alpine directly while the bundled wrapper image lags upstream PR caddyserver/caddy#6115, "reverseproxy: cookie should be Secure and SameSite=None when TLS" (the cookie-LB Secure + SameSite=None fix). Liveness probe switches to httpGet when set (no curl in the official image).
Caddy Pod template gains a checksum/config annotation that hashes the rendered Caddyfile into the Pod spec, so any change to TLS mode / lb.fallback / cookieSecret triggers an automatic rollout on helm upgrade. Without this, ConfigMap-only changes left Caddy with stale parsed config in memory until something forced a pod bounce.
Heterogeneous cluster placement: GPU/CPU pools + dedicated-tenant taints
Two new operator-tunable values let the chart land correctly on clusters that mix GPU and CPU node pools, or that use NodePool/managed taints to keep workloads off shared infra:
- engine.nodeSelector (default {}) -- when non-empty, the engine DaemonSet uses this selector instead of global.nodeSelector (falls back to global when empty). Lets operators target the engine pod at GPU-labelled nodes (graphistry.io/role=gpu) while keeping the chart's CPU-side workloads (caddy, nexus, redis, dask-scheduler, notebook, pivot, gak-{private,public}) on a separate pool via global.nodeSelector (graphistry.io/role=cpu).
- global.tolerations (default []) -- applied to every chart-rendered workload (11 templates: caddy, dask-scheduler, gak-private/public, http-tools netshoot+whoami, nexus, notebook, pivot, redis, engine DaemonSet). Lets operators opt the chart's pods into nodes carrying operator-defined taints: NVIDIA GPU Operator's nvidia.com/gpu=true:NoSchedule, GKE/EKS managed GPU node-pool taints (nvidia.com/gpu=present:NoSchedule), dedicated-tenant taints (dedicated=graphistry:NoSchedule), etc. Empty [] keeps current behaviour byte-identical. Tolerations are permissive (not directive), so adding a GPU-pool toleration to global.tolerations is harmless for non-GPU workloads -- they still land on whichever node global.nodeSelector admits them to.
Bug fix: dcgm-exporter and http-tools (netshoot, whoami) were the only chart workloads that did not honour global.nodeSelector; now they do. Pre-fix these would leak onto non-GPU nodes in mixed-pool clusters even when the operator constrained the rest of the chart to GPU nodes.
PVC tier-gating dropped (correctness fix)
gak-private, gak-public, and uploads-files PersistentVolumeClaims are no longer gated on tier.full / tier.analytics. With Retain reclaim policy, gating made tier downgrades leave PVs Released with stale claimRef; later upgrades created new PVCs with fresh UIDs that no longer matched, and the new PVCs sat Pending indefinitely. Consuming Deployments stay tier-gated; only the storage object is unconditional.
Validation
End-to-end on a 2-node k3s cluster (node1 = k3s server with NFS server colocated, node2 = k3s agent / NFS client) with NFS RWX storage, tier viz, two engine pods (one per node), Caddy as ClusterIP fronted by k3s Traefik on node1.
- /readarrow, /upload, /preshape, /properties, /download all returned 200 on both nodes. Local-worker affinity hint (added in graphistry_master) was exercised: each fep submission's persist/compute kwargs named the dask-cuda-worker on its own node, verified by HOSTNAME-prefix match against scheduler worker info. Behavior decayed to a no-op when a fep call landed on a node without a registered local worker, so single-node deployments are byte-identical to pre-change.
- graphistry_sticky cookie across page reloads and WebSocket reconnects; sessions on node1 were undisturbed by traffic landing on node2, and vice versa. Verified sticky distribution by tab count vs cookie value. Long-idle sessions (60s+ inactivity) survived without disconnect.
- viz -> analytics -> viz against a release with non-empty PVCs. gak/uploads PVCs survived the round trip and re-bound to their existing PVs (PV/PVC UID stable, claimRef intact). Pre-fix this would have left the new PVCs Pending.
- caddy.lb.fallback in values and re-ran helm upgrade. Caddy Pod rolled automatically because the rendered Caddyfile checksum changed; ConfigMap-only edits no longer require manual pod bounce.
- kubernetes_sd_configs role: pod + relabel rules), all 3 GPUs visible with stable per-node labels. node-exporter validated equivalently. The new prometheus-rbac.yaml ServiceAccount + Role + RoleBinding satisfies prometheus's apiserver pod-discovery; automountServiceAccountToken: true is explicit on the prometheus pod for clusters that flip the namespace default.
- nexus-proxy end-to-end (tier=platform, postgres + nexus + nexus-proxy slice): All five v1-to-v2 deprecation shims (/etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*) returned 410 Gone with the documented upgrade message. Live /api/v1/* routes (/datasets/, /files/, /organization/, /team/, /named-endpoint/, /my/user/entitlements/) returned 200 with real data after a Django session login at /accounts/login/. v2 ETL routes routed correctly through nginx (returned upstream errors at platform tier as expected -- the GPU backends streamgl-gpu and forge-etl-python intentionally don't render at this tier). Tier transition platform -> analytics removed nexus-proxy and brought up the engine DaemonSet's nginx container; analytics -> platform reversed it cleanly.
- handle @grafana { reverse_proxy ... } block syntax tripped Caddy v2's parser ("Unexpected next token after '{' on same line") -- expanded to multi-line form. The /caddy/health/ endpoint was being intercepted by the catch-all handle and proxied to engine-headless (no endpoints) after respond wrote 200, because respond is non-terminal in Caddy v2; wrapping in a terminal handle /caddy/health/ { respond 200 ... } broke the resulting 30s SIGTERM-then-restart liveness loop. New templates/caddy/_helpers.tpl with graphistry.caddy.healthHandle and graphistry.caddy.telemetryHandles defines removes 3-4× duplication of identical handle blocks across the tls.mode = external | self | off branches.
Telemetry as a properly-structured subchart
telemetry is now a Helm subchart consumed via dependencies: with condition: global.ENABLE_OPEN_TELEMETRY. Helm fully prunes the subchart's templates when the flag is off (was previously per-template {{- if }} gates inside the parent).
Telemetry-specific values move to a top-level telemetry: block in the parent's values.yaml (subchart-canonical layout); only cross-cutting values (OTLP endpoint, instance name, image-pull config, default scheduling, storage class) stay under global.*.
The parent's shared _helpers.tpl (graphistry.tier.* helpers) is extracted into a new graphistry-common library subchart so the telemetry subchart can reuse the same tier gating via a dependency entry.
New operator-tunable knobs:
- telemetry.dcgmExporter.useExternal + telemetry.dcgmExporter.externalEndpoint (default false / "") -- when true the chart skips its own dcgm-exporter DaemonSet+Service and points prometheus + otel-collector at an externally-managed endpoint. Use cases: GKE with NVIDIA GPU Operator (the bundled exporter image fails on Container-Optimized OS), or any cluster already running the GPU Operator's DCGM module (avoids two DaemonSets scraping the same GPUs). Format host:port, no scheme.
- telemetry.nodeExporter.useExternal + telemetry.nodeExporter.externalEndpoint -- same pattern for clusters already running kube-prometheus-stack's node-exporter; avoids two DaemonSets per node.
- telemetry.prometheus.enableAdminAPI (default false) -- when true passes --web.enable-admin-api to prometheus, enabling tsdb/delete_series, tsdb/clean_tombstones, snapshot, and shutdown. Mirrors kube-prometheus-stack's prometheusSpec.enableAdminAPI default. Off in production; flip in dev/test to drop stale series after scrape-config refactors.
- telemetry.prometheus.retention (default 15d) -- local TSDB retention.
- telemetry.{prometheus,jaeger,grafana}.persistence.{enabled,size,storageClassName} -- per-component PVC config for the otel-collector backend stack. Storage class falls back to global.storageClassNameOverride then retain-sc. Without persistence, Grafana dashboards/sessions and Jaeger traces are lost on every pod restart.
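The telemetry knobs above could be combined in an overlay along these lines. This is a sketch under assumptions: the exporter address is a placeholder for whatever externally managed DCGM endpoint the cluster exposes, the PVC size is illustrative, and the release name in the commented command is hypothetical.

```shell
#!/usr/bin/env bash
# Overlay: reuse an external dcgm-exporter and persist prometheus data.
OVERLAY="$(mktemp)"
cat > "$OVERLAY" <<'EOF'
global:
  ENABLE_OPEN_TELEMETRY: true
telemetry:
  dcgmExporter:
    useExternal: true
    # host:port, no scheme. Placeholder address; substitute the real endpoint.
    externalEndpoint: "dcgm-exporter.gpu-operator.svc.cluster.local:9400"
  prometheus:
    retention: 15d
    persistence:
      enabled: true
      size: 20Gi   # illustrative size
EOF
cat "$OVERLAY"

# Apply with (release name is a placeholder):
#   helm upgrade my-release ./charts/graphistry-helm -f "$OVERLAY"
```

Leaving storageClassName out of the overlay keeps the documented fallback (global.storageClassNameOverride, then retain-sc) in play.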
Topology + correctness changes:
- Recreate strategy + RWO PVC (was per-node DaemonSet, no global view, no retention). Multi-replica HA is out of scope; users who need it should add Thanos or remote_write to a managed backend in their overrides.
- static_configs to kubernetes_sd_configs with role: pod + relabel rules that set the node label from pod_node_name and __address__ from pod_ip. Pre-fix prometheus scraped the Service VIP and kube-proxy load-balanced to one DaemonSet pod per scrape, silently dropping the other nodes' GPU/host metrics from dashboards. Post-fix every DaemonSet pod is a distinct scrape target with a stable per-node label.
- templates/prometheus-rbac.yaml: ServiceAccount + namespace-scoped Role + RoleBinding granting read-only pods/services/endpoints get/list/watch. Required by prometheus's kubernetes_sd_configs apiserver discovery. The prometheus pod sets automountServiceAccountToken: true explicitly -- the K8s default is true, but locked-down clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) flip it to false at namespace level, which would silently break pod discovery.
- otel-collector-cloud-configmap.yaml and otel-collector-configmap.yaml rendered separately based on OTEL_CLOUD_MODE; now one templated ConfigMap with cloud-mode/self-hosted branches inline. Cloud-mode credentials read from a pre-created Secret via secretKeyRef (never inlined into values.yaml).
- checksum/config annotation on otel-collector / prometheus / grafana / jaeger pod templates -- ConfigMap content changes now trigger a rollout on helm upgrade (was previously a no-op until manual pod delete).
- grafana-ingress.yaml, jaeger-ingress.yaml, prometheus-ingress.yaml) removed; replaced by Caddy path routes (/grafana, /jaeger, /prometheus) on the parent chart's main Ingress.
Platform-tier
nexus-proxy (nginx fronting nexus, v1-to-v2 endpoint rewrites)
New templates/nexus-proxy/nexus-proxy-deployment.yaml + nexus-proxy-service.yaml: platform-tier-only nginx Deployment + Service that fronts nexus and provides the v1-to-v2 endpoint rewrites + deprecated-endpoint 410 shims baked into the graphistry/nginx image. Renders only when global.tier == "platform" (exact match, not >=); at analytics+ the engine DaemonSet's nginx container plays the same role with intra-pod localhost dispatch to the streamgl/forge backends, so the standalone Deployment would be redundant.
Why it exists: the v1-to-v2 rewrites Graphistry clients depend on (deprecation 410s on /etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*; live forwarding for /api/v1/{datasets,files,organization,team,named-endpoint,...}; parallel /api/v2/etl/... routes) live in the graphistry/nginx image's default.conf.template rendered by render_templates.sh. Pre-fix, tier=platform rendered postgres + nexus only, so a nexus-only deployment had no externally-reachable HTTP surface honouring the v1 paths.
The platform tier now ships a deployable slice of postgres + nexus + nexus-proxy with no transitive dependencies on analytics-tier services. The Deployment's only init container waits for the nexus Service alone (not redis / dask-scheduler / streamgl), so the slice is genuinely self-contained -- useful for downstream products that need only the auth foundation (e.g. as the Nexus-only auth backend behind another product's chart).
In-cluster reachability (the actual ask for service-to-service integrations like Louie):
- http://nexus-proxy.<ns>.svc.cluster.local:80 -- both v1 and v2 paths
- http://nexus.<ns>.svc.cluster.local:8000 -- raw nexus, v2 only
External reachability is deliberately not chart-managed at platform tier (no Ingress, no Caddy). Operators bring their own L7 (Pulumi, Ansible, kustomize, manual Ingress), or use port-forward for dev/test:
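A port-forward for that dev/test path might look like the sketch below. The namespace and local port are placeholders, and the command is only printed rather than executed; the Service name and port 80 are taken from the reachability list above.

```shell
#!/usr/bin/env bash
# Placeholder values; adjust to the release's namespace and a free local port.
NAMESPACE="graphistry"
LOCAL_PORT=8080

# nexus-proxy listens on Service port 80 (both v1 and v2 paths).
PF_CMD="kubectl port-forward -n ${NAMESPACE} svc/nexus-proxy ${LOCAL_PORT}:80"
echo "$PF_CMD"

# Run the printed command in one terminal, then from another, for example:
#   curl -i "http://localhost:8080/api/check"
# /api/check is one of the deprecated v1 paths reported above as returning 410 Gone.
```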
Reuses existing values and PVCs: NginxResources, global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}, postgres secret refs, and the local-media-mount + data-mount PVCs. The image's content-directory expectations for /streamgl, /pivot, and /upload paths are satisfied by emptyDir mounts -- those routes return 404 at platform tier, which is the correct behaviour when the backends don't exist.
Complementary changes (details in CHANGELOG.md)
- global.storage.accessMode + global.storageClassNameOverride. Removed ENABLE_CLUSTER_MODE, IS_FOLLOWER, multiNode, clusterVolume, provisioner, REDIS_URL_NEXUS_FEP, longhornDashboard, and the hardcoded datamount-longhorn / postgres-longhorn / retain-sc-cluster StorageClass names. Longhorn becomes just another backend operators can point a SC at.
- retain-sc-postgres StorageClass for the postgres-cluster chart, per Crunchy PGO's documented pattern.
- dask-cluster.yml CRD, dask.operator toggle, operator sections from all platform READMEs and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries, ACR import script, dev-compose setup script.
- otel-collector Service flipped to ClusterIP + internalTrafficPolicy: Local (DaemonSet collector model, per opentelemetry-operator#1401); Redis Service flipped to ClusterIP (was LoadBalancer, a latent security default).
- charts/values-overrides/examples/cluster/ rewritten end-to-end (~+800 lines): legacy leader/follower multi-namespace example deleted. New single-namespace multi-node guide adds:
  - cookieSecret rotation procedure (HMAC key for graphistry_sticky)
  - nodeSelector for where-allowed vs tolerations for what-taints-accepted) with a worked 5-node A100 walkthrough
  - engine.nodeSelector + global.tolerations)
  - cluster/retain-sc-nfs.yaml for operators who prefer to apply the Retain StorageClass separately
- dcgm-exporter DaemonSet and netshoot/whoami http-tools gained the missing nodeSelector blocks (they were the only workload templates that did not honour global.nodeSelector).
Chart version: graphistry-helm 0.4.3 -> 0.4.4.
Test plan
- helm install succeeds end-to-end
- graphistry_sticky cookie; reconnects stay pinned
- viz -> analytics -> viz) preserve PVC binding
- checksum/config on helm upgrade
- dcgm-exporter respects global.nodeSelector (prior bug: leaked onto non-GPU nodes)
- helm uninstall + reinstall rebinds PVCs via volumeName workflow, preserves data
- engine.nodeSelector=graphistry.io/role=gpu + CPU singletons pinned via global.nodeSelector=graphistry.io/role=cpu; verify engine pods land only on GPU nodes and caddy/nexus/redis/dask-scheduler land only on CPU nodes
- global.tolerations validation against NVIDIA GPU Operator taint (nvidia.com/gpu=true:NoSchedule) on a managed cluster (GKE/EKS GPU node-pool)
- dedicated=graphistry:NoSchedule) -- chart pods schedule onto tainted nodes only when the toleration is set
- node labels (post kubernetes_sd_configs switch)
- tier=platform): postgres + nexus + nexus-proxy renders standalone; v1-to-v2 deprecation 410s + authenticated /api/v1/* round-trips return real data; tier transitions to/from analytics are clean
- tls.mode = external | self | off paths; /caddy/health/ no longer trips the 30s SIGTERM liveness loop
Related