infernode-os · pdfinn · May 15, 2026 · May 15, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Container builds, deployment recipes, and routing configuration for the LLM serving stack behind InferNode (`llmsrv` / `lucibridge` / `serve-llm.sh`). It does **not** contain the OS runtime (that lives in `infernode-os/infernode`) or the training pipeline (that lives in `pdfinn/infernode-os-llm`). Lifecycle here tracks JetPack releases / SGLang versions / hardware generations, not model training cycles.
 
-The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets or credential-shaped files.
+The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets, credential-shaped files, internal hostnames, or names of confidential workloads that may colocate with this stack on a dev box. Operational policy that's specific to a particular deployment lives in private docs, not here.
 
 Work is tracked under the **INFR Jira project's "Productize SGLang serving" epic** (parent: INFR-68). When a file references `INFR-NN`, that's a Jira ticket — see ticket commentary for context Claude can't recover from the tree.
 
@@ -29,16 +29,16 @@ CI is `.github/workflows/build-sglang.yml`, runs on `ubuntu-24.04-arm` (Graviton
 
 `workflow_dispatch` exposes `orin_base_image` as an override input — use when exploring an alternate dustynv tag (e.g. for INFR-92's upgrade work).
 
-### Manual Orin build (on Hephaestus dev daemon, or any aarch64 host)
+### Manual Orin build (any aarch64 host)
 
 ```sh
 cd sglang/orin
-docker --host unix:///run/docker-dev.sock build \
+docker build \
   --build-arg BASE_IMAGE=dustynv/sglang:r36.4.0 \
   -t serving-sglang:orin-local .
 ```
 
-(Drop `--host` on a field-deployment Orin AGX. The dev-daemon socket is Hephaestus-specific — see `runbooks/hephaestus-deploy.md` §0.1.)
+If your build host runs a dedicated experimental Docker daemon on a non-default socket, prepend `--host unix:///run/<daemon>.sock` — that's a per-environment concern, not part of the build itself.
 
 ### Manual Thor build
 
@@ -51,45 +51,20 @@ docker build -t serving-sglang:thor-local sglang/thor/
 The on-hardware validator launches `sglang.launch_server` with TinyLlama, asserts `/health`, asserts the `KV Cache is allocated` startup line, and exercises `/v1/chat/completions`. **Run it after every published-image pull** — CI's build-time guards can only check metadata (no GPU on the GitHub runner):
 
 ```sh
-docker --host unix:///run/docker-dev.sock run --rm \
+HF_CACHE="${HF_CACHE:-/var/lib/huggingface}"
+docker run --rm \
   --runtime nvidia --gpus all \
-  -v /mnt/orin-ssd/huggingface:/root/.cache/huggingface \
+  -v "$HF_CACHE":/root/.cache/huggingface \
   -e HF_HOME=/root/.cache/huggingface \
   serving-sglang:orin-local \
   /opt/sglang/validate-on-hardware.sh
 ```
 
-Expected last line: `All on-hardware checks passed.`. Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/hephaestus-deploy.md`.
-
-## Hephaestus dual-purpose model (do NOT violate — INFR-90)
-
-Hephaestus serves two purposes that pull in opposite directions:
-
-1. **Field-parity reference** — TAK / NERVA / Ollama must run under a Docker daemon whose storage lives on the device's only disk (root partition), exactly as they would on a no-SSD field unit.
-2. **Development environment** — experimental work (SGLang, future Jetson experiments) needs the 916 GiB `/mnt/orin-ssd` and would crush the root partition if forced through the production daemon.
-
-A single Docker daemon has one storage root, so the only honest answer is **two daemons**:
-
-| Daemon | Socket | Storage | Used by |
-|---|---|---|---|
-| `docker.service` (production) | `/run/docker.sock` | root partition (with the Docker 29 quirk that its image content lives at `/var/lib/containerd` on root, regardless of where `/var/lib/docker` symlinks to) | TAK · NERVA · Ollama · field-parity workloads |
-| `docker-dev.service` (experimental) | `/run/docker-dev.sock` | `/mnt/orin-ssd/docker-dev` (own legacy snapshotter, `containerd-snapshotter: false`) | SGLang · any other Jetson experiment |
-
-**Rules for Claude when working on this host:**
-
-- For SGLang or any experimental work, use the dev daemon: `docker --host unix:///run/docker-dev.sock` (or set up `alias dev-docker=...`).
-- **Never** pull SGLang or other experimental images on the production daemon — its 12-14 GiB image plus working set would crush root.
-- **Never** migrate TAK / NERVA / Ollama to the dev daemon — they belong on production for field parity.
-- The runbook §0.1 has the full one-time setup if the dev daemon ever needs to be re-created.
-- Pre-flight any plan that involves `docker pull`/`docker run` by checking *which* daemon you're hitting. The default `docker` CLI uses the production daemon — easy to forget on Hephaestus.
-
-### Watch out: `/var/lib/docker` symlink is misleading
-
-`/var/lib/docker` is symlinked to `/mnt/orin-ssd/docker/docker` on this host, but `/var/lib/containerd` (the actual image content store under Docker 29's containerd integration) lives on root. So the symlink only redirects the daemon's *metadata*, not its *image content*. A pull on the production daemon will land on root regardless of what the symlink suggests. This was discovered the hard way (root went from 12 GiB → 877 MB free during a base-image pull); the dev daemon avoids the trap by disabling the containerd snapshotter entirely.
+Expected last line: `All on-hardware checks passed.`. Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/deploy.md`.
 
 ## Operational coexistence with Ollama
 
-SGLang **coexists** with Ollama on Hephaestus, it does not replace it. `lucibridge` routes per-request based on tool-category:
+SGLang **coexists** with Ollama on the deploy host, it does not replace it. `lucibridge` routes per-request based on tool-category:
 
 - `limbo_authoring` → Ollama (Devstral, single-user fluency)
 - `dispatch` / `tool_call` / `memory` / `task` → SGLang (gpt-oss, concurrent fan-out, xgrammar)
@@ -109,14 +84,14 @@ A pre-INFR-79 single-`LLM_BACKEND_URL` config must keep working unchanged. The b
 | `sglang/orin/{Dockerfile.upstream,install.sh,build.sh,config.py}` | Leftovers from the deprecated framework-vendoring path; not driven by the build. Slated for removal after one release cycle |
 | `sglang/LICENSE-UPSTREAM.md` | Records the dusty-nv commit SHA at vendoring; consult before re-syncing |
 | `docs/SGLANG-ADOPTION-NOTES.md` | Spike findings, measured bake-off (SGLang ~78 tok/s @ N=8 vs Ollama's ~23 tok/s plateau), the canonical working recipe |
-| `runbooks/hephaestus-deploy.md` | End-to-end Hephaestus deploy; acceptance gate is "new contributor brings SGLang up in <15 min" |
+| `runbooks/deploy.md` | End-to-end deploy guide for a clean Orin AGX; acceptance gate is "new contributor brings SGLang up in <15 min" |
 | `runbooks/lucibridge-routing.md` | Routing config schema + per-tool / per-category mapping |
 
 ## Conventions worth knowing
 
-- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/hephaestus-deploy.md` §3 and §5 (systemd unit) together.
+- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/deploy.md` §3 and §5 (systemd unit) together.
 - **Tokenizer path is mandatory for Llama-3 family, not for the TinyLlama smoke.** SGLang 0.4's GGUF tokenizer doesn't register Llama-3 special tokens; `--tokenizer-path /opt/tokenizers/llama-3.1` (baked by `bake-tokenizers.sh`) is the fix. TinyLlama uses Llama-2's tokenizer and doesn't need the override. See INFR-78.
 - **`TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` and `--disable-cuda-graph` are required on Jetson** — `torch.compile` is broken at-import in this Triton 3.2 + Torch 2.5 combo (it pulls in `torch._inductor.runtime.hints` which calls `fields(AttrsDescriptor)` on a non-dataclass), and CUDA-graph capture is unreliable on Tegra. The two env vars are baked into the image; don't remove without re-running the validator.
 - **The dataclasses backport must stay removed.** dustynv ships a Python-2-era `dataclasses.py` in `/usr/local/lib/python3.10/dist-packages/` that shadows the stdlib. The Dockerfile moves it to `_disabled_backports/` and asserts the stdlib is now what resolves. If a future base-image bump re-introduces it, the same pattern works.
-- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag on Hephaestus.
+- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag onto the deploy hardware.
 - **NGC pulls are anonymous by default.** The `NGC_API_KEY` step in CI is forward-compatible scaffolding for any future gated tag — don't make it a hard requirement.
diff --git a/README.md b/README.md
@@ -48,7 +48,7 @@ serving/
 │   ├── orin/                  pinned config for Orin (sm_87)
 │   └── thor/                  pinned config for Thor (sm_103) — when we have a Thor box
 ├── runbooks/
-│   ├── hephaestus-deploy.md
+│   ├── deploy.md
 │   └── ...
 └── .github/workflows/
     └── build-sglang.yml   GitHub-hosted ubuntu-24.04-arm runner
@@ -66,11 +66,11 @@ SBSA). Native aarch64 — no QEMU — and `nvcc` cross-compiles for
 sm_87 (Orin) and sm_103 (Thor) via `TORCH_CUDA_ARCH_LIST` at build
 time. The output is a Jetson-Tegra-targeted image; GitHub's hosted
 runners have no Jetson hardware, so end-to-end smoke testing (CUDA
-paths actually executing) happens manually on Hephaestus after each
-successful CI build. **Hephaestus must never be configured as a
-GitHub self-hosted runner**, and the remote CI must not hold any
-secrets that link back to the device — the public CI surface stays
-strictly isolated from the dev box.
+paths actually executing) happens manually on Jetson hardware after
+each successful CI build. **The on-hardware validation host must
+never be configured as a GitHub self-hosted runner**, and the remote
+CI must not hold any secrets that link back to it — the public CI
+surface stays strictly isolated from on-hardware validation.
 
 The `sglang/` subtree is intended to vendor the canonical
 [`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/sglang)
@@ -99,32 +99,16 @@ the model registry).
 See `docs/SGLANG-ADOPTION-NOTES.md` for:
 - The spike attempts that didn't work and why (PyTorch wheel
   `USE_DISTRIBUTED` vs `sm_87` constraint)
-- The working recipe (extract dustynv container via crane onto
-  orin-ssd, host Python 3.10 with patched `LD_LIBRARY_PATH`)
+- The working recipe (extract dustynv container via crane onto a
+  fast local disk, host Python 3.10 with patched `LD_LIBRARY_PATH`)
 - Measured bake-off results — SGLang vs Ollama on Llama-3.1-8B
   (TL;DR: SGLang scales to ~78 tok/s at N=8 concurrent vs Ollama's
   ~23 tok/s plateau; tied at single-user)
-- Operations: where things live on Hephaestus, start/stop, verify
+- Operations: on-host layout, start/stop, verify
 - Known gaps: SGLang 0.4.1's GGUF tokenizer doesn't recognize Llama
   3 special tokens (needs HF tokenizer dir); no `gpt_oss.py` in
   0.4.1's model registry (needs SGLang 0.5.x bump)
 
-## Hephaestus disk policy (important for the build path)
-
-Hephaestus is the Jetson Orin AGX dev box. Its root partition is
-**deliberately constrained** to emulate a production single-disk
-node (OS + TAK + NERVA via Docker + Ollama binary). The 916 GB
-`/mnt/orin-ssd` is the dev indulgence; serving-spike artifacts live
-there.
-
-**Do not migrate Docker / containerd state from root to orin-ssd.**
-That would move TAK/NERVA images onto the dev-only disk and break
-the production emulation. The Jetson-container build needs to either
-fit in root partition's residual space, or use a daemonless build
-path (we did extraction-only via
-[`crane`](https://github.com/google/go-containerregistry) on the
-spike; expect to reuse).
-
 ## Tracking
 
 Work in this repo is tracked under the **INFR Jira project's