Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 12 additions & 37 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Container builds, deployment recipes, and routing configuration for the LLM serving stack behind InferNode (`llmsrv` / `lucibridge` / `serve-llm.sh`). It does **not** contain the OS runtime (that lives in `infernode-os/infernode`) or the training pipeline (that lives in `pdfinn/infernode-os-llm`). Lifecycle here tracks JetPack releases / SGLang versions / hardware generations, not model training cycles.

The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets or credential-shaped files.
The repo is intentionally **public** (changed 2026-05-14) so CI can use the free `ubuntu-24.04-arm` GitHub-hosted runner minutes. Do not commit secrets, credential-shaped files, internal hostnames, or names of confidential workloads that may colocate with this stack on a dev box. Operational policy that's specific to a particular deployment lives in private docs, not here.

Work is tracked under the **INFR Jira project's "Productize SGLang serving" epic** (parent: INFR-68). When a file references `INFR-NN`, that's a Jira ticket — see ticket commentary for context Claude can't recover from the tree.

Expand All @@ -29,16 +29,16 @@ CI is `.github/workflows/build-sglang.yml`, runs on `ubuntu-24.04-arm` (Graviton

`workflow_dispatch` exposes `orin_base_image` as an override input — use when exploring an alternate dustynv tag (e.g. for INFR-92's upgrade work).

### Manual Orin build (on Hephaestus dev daemon, or any aarch64 host)
### Manual Orin build (any aarch64 host)

```sh
cd sglang/orin
docker --host unix:///run/docker-dev.sock build \
docker build \
--build-arg BASE_IMAGE=dustynv/sglang:r36.4.0 \
-t serving-sglang:orin-local .
```

(Drop `--host` on a field-deployment Orin AGX. The dev-daemon socket is Hephaestus-specific — see `runbooks/hephaestus-deploy.md` §0.1.)
If your build host runs a dedicated experimental Docker daemon on a non-default socket, prepend `--host unix:///run/<daemon>.sock` — that's a per-environment concern, not part of the build itself.

### Manual Thor build

Expand All @@ -51,45 +51,20 @@ docker build -t serving-sglang:thor-local sglang/thor/
The on-hardware validator launches `sglang.launch_server` with TinyLlama, asserts `/health`, asserts the `KV Cache is allocated` startup line, and exercises `/v1/chat/completions`. **Run it after every published-image pull** — CI's build-time guards can only check metadata (no GPU on the GitHub runner):

```sh
docker --host unix:///run/docker-dev.sock run --rm \
HF_CACHE="${HF_CACHE:-/var/lib/huggingface}"
docker run --rm \
--runtime nvidia --gpus all \
-v /mnt/orin-ssd/huggingface:/root/.cache/huggingface \
-v "$HF_CACHE":/root/.cache/huggingface \
-e HF_HOME=/root/.cache/huggingface \
serving-sglang:orin-local \
/opt/sglang/validate-on-hardware.sh
```

Expected last line: `All on-hardware checks passed.`. Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/hephaestus-deploy.md`.

## Hephaestus dual-purpose model (do NOT violate — INFR-90)

Hephaestus serves two purposes that pull in opposite directions:

1. **Field-parity reference** — TAK / NERVA / Ollama must run under a Docker daemon whose storage lives on the device's only disk (root partition), exactly as they would on a no-SSD field unit.
2. **Development environment** — experimental work (SGLang, future Jetson experiments) needs the 916 GiB `/mnt/orin-ssd` and would crush the root partition if forced through the production daemon.

A single Docker daemon has one storage root, so the only honest answer is **two daemons**:

| Daemon | Socket | Storage | Used by |
|---|---|---|---|
| `docker.service` (production) | `/run/docker.sock` | root partition (with the Docker 29 quirk that its image content lives at `/var/lib/containerd` on root, regardless of where `/var/lib/docker` symlinks to) | TAK · NERVA · Ollama · field-parity workloads |
| `docker-dev.service` (experimental) | `/run/docker-dev.sock` | `/mnt/orin-ssd/docker-dev` (own legacy snapshotter, `containerd-snapshotter: false`) | SGLang · any other Jetson experiment |

**Rules for Claude when working on this host:**

- For SGLang or any experimental work, use the dev daemon: `docker --host unix:///run/docker-dev.sock` (or set up `alias dev-docker=...`).
- **Never** pull SGLang or other experimental images on the production daemon — its 12-14 GiB image plus working set would crush root.
- **Never** migrate TAK / NERVA / Ollama to the dev daemon — they belong on production for field parity.
- The runbook §0.1 has the full one-time setup if the dev daemon ever needs to be re-created.
- Pre-flight any plan that involves `docker pull`/`docker run` by checking *which* daemon you're hitting. The default `docker` CLI uses the production daemon — easy to forget on Hephaestus.

### Watch out: `/var/lib/docker` symlink is misleading

`/var/lib/docker` is symlinked to `/mnt/orin-ssd/docker/docker` on this host, but `/var/lib/containerd` (the actual image content store under Docker 29's containerd integration) lives on root. So the symlink only redirects the daemon's *metadata*, not its *image content*. A pull on the production daemon will land on root regardless of what the symlink suggests. This was discovered the hard way (root went from 12 GiB → 877 MB free during a base-image pull); the dev daemon avoids the trap by disabling the containerd snapshotter entirely.
Expected last line: `All on-hardware checks passed.`. Serving smoke takes ~60s cold, ~15s warm. Deploy + healthcheck sequence is in `runbooks/deploy.md`.

## Operational coexistence with Ollama

SGLang **coexists** with Ollama on Hephaestus, it does not replace it. `lucibridge` routes per-request based on tool-category:
SGLang **coexists** with Ollama on the deploy host, it does not replace it. `lucibridge` routes per-request based on tool-category:

- `limbo_authoring` → Ollama (Devstral, single-user fluency)
- `dispatch` / `tool_call` / `memory` / `task` → SGLang (gpt-oss, concurrent fan-out, xgrammar)
Expand All @@ -109,14 +84,14 @@ A pre-INFR-79 single-`LLM_BACKEND_URL` config must keep working unchanged. The b
| `sglang/orin/{Dockerfile.upstream,install.sh,build.sh,config.py}` | Leftovers from the deprecated framework-vendoring path; not driven by the build. Slated for removal after one release cycle |
| `sglang/LICENSE-UPSTREAM.md` | Records the dusty-nv commit SHA at vendoring; consult before re-syncing |
| `docs/SGLANG-ADOPTION-NOTES.md` | Spike findings, measured bake-off (SGLang ~78 tok/s @ N=8 vs Ollama's ~23 tok/s plateau), the canonical working recipe |
| `runbooks/hephaestus-deploy.md` | End-to-end Hephaestus deploy; acceptance gate is "new contributor brings SGLang up in <15 min" |
| `runbooks/deploy.md` | End-to-end deploy guide for a clean Orin AGX; acceptance gate is "new contributor brings SGLang up in <15 min" |
| `runbooks/lucibridge-routing.md` | Routing config schema + per-tool / per-category mapping |

## Conventions worth knowing

- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/hephaestus-deploy.md` §3 and §5 (systemd unit) together.
- **Launch arguments live in the runbook, not the image.** No `CMD` / `ENTRYPOINT` is baked into the production Dockerfiles so the same image can serve different model paths without rebuild. When changing launch flags, update `runbooks/deploy.md` §3 and §5 (systemd unit) together.
- **Tokenizer path is mandatory for Llama-3 family, not for the TinyLlama smoke.** SGLang 0.4's GGUF tokenizer doesn't register Llama-3 special tokens; `--tokenizer-path /opt/tokenizers/llama-3.1` (baked by `bake-tokenizers.sh`) is the fix. TinyLlama uses Llama-2's tokenizer and doesn't need the override. See INFR-78.
- **`TORCHDYNAMO_DISABLE=1 TORCH_COMPILE_DISABLE=1` and `--disable-cuda-graph` are required on Jetson** — `torch.compile` is broken at-import in this Triton 3.2 + Torch 2.5 combo (it pulls in `torch._inductor.runtime.hints` which calls `fields(AttrsDescriptor)` on a non-dataclass), and CUDA-graph capture is unreliable on Tegra. The two env vars are baked into the image; don't remove without re-running the validator.
- **The dataclasses backport must stay removed.** dustynv ships a Python-2-era `dataclasses.py` in `/usr/local/lib/python3.10/dist-packages/` that shadows the stdlib. The Dockerfile moves it to `_disabled_backports/` and asserts the stdlib is now what resolves. If a future base-image bump re-introduces it, the same pattern works.
- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag on Hephaestus.
- **CI guards are metadata-only; `validate-on-hardware.sh` is the real gate.** GitHub-hosted runners have no GPU, so the in-Dockerfile guards can only check torch+CUDA metadata, triton import, and the sglang version pin. Real correctness lives in the on-hardware validator's serving smoke. Always run the validator after pulling a new tag onto the deploy hardware.
- **NGC pulls are anonymous by default.** The `NGC_API_KEY` step in CI is forward-compatible scaffolding for any future gated tag — don't make it a hard requirement.
34 changes: 9 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ serving/
│ ├── orin/ pinned config for Orin (sm_87)
│ └── thor/ pinned config for Thor (sm_103) — when we have a Thor box
├── runbooks/
│ ├── hephaestus-deploy.md
│ ├── deploy.md
│ └── ...
└── .github/workflows/
└── build-sglang.yml GitHub-hosted ubuntu-24.04-arm runner
Expand All @@ -66,11 +66,11 @@ SBSA). Native aarch64 — no QEMU — and `nvcc` cross-compiles for
sm_87 (Orin) and sm_103 (Thor) via `TORCH_CUDA_ARCH_LIST` at build
time. The output is a Jetson-Tegra-targeted image; GitHub's hosted
runners have no Jetson hardware, so end-to-end smoke testing (CUDA
paths actually executing) happens manually on Hephaestus after each
successful CI build. **Hephaestus must never be configured as a
GitHub self-hosted runner**, and the remote CI must not hold any
secrets that link back to the device — the public CI surface stays
strictly isolated from the dev box.
paths actually executing) happens manually on Jetson hardware after
each successful CI build. **The on-hardware validation host must
never be configured as a GitHub self-hosted runner**, and the remote
CI must not hold any secrets that link back to it — the public CI
surface stays strictly isolated from on-hardware validation.

The `sglang/` subtree is intended to vendor the canonical
[`dusty-nv/jetson-containers`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/sglang)
Expand Down Expand Up @@ -99,32 +99,16 @@ the model registry).
See `docs/SGLANG-ADOPTION-NOTES.md` for:
- The spike attempts that didn't work and why (PyTorch wheel
`USE_DISTRIBUTED` vs `sm_87` constraint)
- The working recipe (extract dustynv container via crane onto
orin-ssd, host Python 3.10 with patched `LD_LIBRARY_PATH`)
- The working recipe (extract dustynv container via crane onto a
fast local disk, host Python 3.10 with patched `LD_LIBRARY_PATH`)
- Measured bake-off results — SGLang vs Ollama on Llama-3.1-8B
(TL;DR: SGLang scales to ~78 tok/s at N=8 concurrent vs Ollama's
~23 tok/s plateau; tied at single-user)
- Operations: where things live on Hephaestus, start/stop, verify
- Operations: on-host layout, start/stop, verify
- Known gaps: SGLang 0.4.1's GGUF tokenizer doesn't recognize Llama
3 special tokens (needs HF tokenizer dir); no `gpt_oss.py` in
0.4.1's model registry (needs SGLang 0.5.x bump)

## Hephaestus disk policy (important for the build path)

Hephaestus is the Jetson Orin AGX dev box. Its root partition is
**deliberately constrained** to emulate a production single-disk
node (OS + TAK + NERVA via Docker + Ollama binary). The 916 GB
`/mnt/orin-ssd` is the dev indulgence; serving-spike artifacts live
there.

**Do not migrate Docker / containerd state from root to orin-ssd.**
That would move TAK/NERVA images onto the dev-only disk and break
the production emulation. The Jetson-container build needs to either
fit in root partition's residual space, or use a daemonless build
path (we did extraction-only via
[`crane`](https://github.com/google/go-containerregistry) on the
spike; expect to reuse).

## Tracking

Work in this repo is tracked under the **INFR Jira project's
Expand Down
Loading