Add SGLang container recipes, CI, and deployment runbooks by pdfinn · Pull Request #1 · infernode-os/serving

pdfinn · 2026-05-14T09:31:07Z

Summary

This PR adds complete infrastructure for deploying SGLang on NVIDIA Jetson hardware (Orin and Thor), including container recipes, GitHub Actions CI, and operational runbooks. This unblocks INFR-77 (gpt-oss model support on Jetson) and INFR-79 (multi-backend routing via lucibridge).

Key Changes

Container Recipes

sglang/orin/: Full vendored recipe for Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6)
- Pinned to SGLang 0.5.3 (first 0.5.x with gpt-oss support, compatible with CUDA 12.6)
- Includes Llama-3 tokenizer bake-in to fix special-token handling (INFR-78)
- Vendored from dusty-nv/jetson-containers with divergent version pin for JP6 compatibility
sglang/thor/: Thin overlay on NVIDIA's official NGC SGLang image (sm_103, JetPack 7, CUDA 13)
- Forward-looking recipe for future Thor hardware
- Leverages NGC's official base rather than community fork

CI/CD

.github/workflows/build-sglang.yml: Matrix build for both Orin and Thor variants
- Runs on ubuntu-24.04-arm (native aarch64, no QEMU)
- Pushes to GHCR with short-SHA tagging for production reproducibility
- Supports NGC auth for gated content (forward-compatible)
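A minimal sketch of how a variant-plus-short-SHA image reference could be derived. The registry path comes from this PR; the `<variant>-<shortsha>` tag layout and the `image_ref` helper are assumptions for illustration, not necessarily what `build-sglang.yml` emits.

```python
import re

# Registry path as named in this PR.
REGISTRY_IMAGE = "ghcr.io/infernode-os/serving-sglang"


def image_ref(variant: str, commit_sha: str, length: int = 7) -> str:
    """Return an image reference tagged with variant and short SHA.

    The "<variant>-<shortsha>" layout is an assumed convention.
    """
    if variant not in {"orin", "thor"}:
        raise ValueError(f"unknown variant: {variant!r}")
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase hex commit SHA")
    return f"{REGISTRY_IMAGE}:{variant}-{commit_sha[:length]}"
```

Tagging by commit rather than by a moving tag such as `latest` means a deployed image can always be traced back to the exact source revision that built it.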

Operational Documentation

runbooks/hephaestus-deploy.md: End-to-end deployment guide for Hephaestus (Orin AGX)
- Disk policy enforcement (root partition for OS/TAK/NERVA, /mnt/orin-ssd for SGLang)
- One-shot launch command with all required flags (Triton backend, shared memory tuning, CUDA graph disable)
- systemd unit template for production deployment
- Memory budgeting across 64 GiB unified memory (OS + Ollama + SGLang coexistence)
- Comprehensive troubleshooting table
- Integration with serve-llm.sh and dual-backend mode
runbooks/lucibridge-routing.md: Multi-backend routing schema for per-tool dispatch
- Routing table mapping tool categories to backends (Ollama for Limbo authoring, SGLang for dispatch/tool-call/memory/task)
- JSON config schema for /etc/lucibridge/routing.json
- Backwards-compatible fallback to single LLM_BACKEND_URL env var
- Observability guidance (structured logging per routing decision)
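The per-tool dispatch with its single-variable fallback might look roughly like the sketch below. The JSON shape, the category names, and the `resolve_backend` helper are hypothetical; the authoritative schema is the one in runbooks/lucibridge-routing.md. The URLs use the stock Ollama (11434) and SGLang (30000) ports purely as placeholders.

```python
import json
import os

# Hypothetical routing document; the real schema lives in the runbook.
EXAMPLE_ROUTING = json.loads("""
{
  "backends": {
    "ollama": {"url": "http://127.0.0.1:11434"},
    "sglang": {"url": "http://127.0.0.1:30000"}
  },
  "routes": {
    "limbo-authoring": "ollama",
    "dispatch": "sglang",
    "tool-call": "sglang",
    "memory": "sglang",
    "task": "sglang"
  }
}
""")


def resolve_backend(category, routing=None, env=None):
    """Return the backend URL for a tool category.

    Falls back to the LLM_BACKEND_URL environment variable when no routing
    table is supplied or the category has no usable route.
    """
    env = os.environ if env is None else env
    if routing:
        name = routing.get("routes", {}).get(category)
        backend = routing.get("backends", {}).get(name) if name else None
        if backend and "url" in backend:
            return backend["url"]
    fallback = env.get("LLM_BACKEND_URL")
    if fallback is None:
        raise LookupError(f"no route for {category!r} and LLM_BACKEND_URL is unset")
    return fallback
```

In a deployment the table would be read from /etc/lucibridge/routing.json; a host without that file keeps working as long as `LLM_BACKEND_URL` is set.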

Supporting Files

Upstream attribution and license documentation (sglang/LICENSE-UPSTREAM.md)
README files explaining vendoring decisions and build procedures
Smoke test scripts for both variants
Build scripts with fallback paths (pip install → source build)
Tokenizer bake script for offline Llama-3 support

Notable Implementation Details

Version pinning rationale: SGLang 0.5.3 chosen for Orin because upstream's current default (0.5.11) is CUDA-13-only; 0.5.3 is the first 0.5.x with gpt-oss support and JP6/CUDA-12.6 compatibility
Disk policy enforcement: SGLang container uses bind-mounts to /mnt/orin-ssd rather than migrating Docker daemon storage, preserving production emulation on root partition
Shared memory tuning: --shm-size 8g required for SGLang's worker pool (default 64 MB causes stalls under concurrency)
CUDA graph disable: Conservative stability choice for Jetson; can be re-enabled later if performance requires it
Tokenizer bake-in: Llama-3
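The launch-time choices listed above (8 GiB shared memory, CUDA graphs disabled, working data bind-mounted from /mnt/orin-ssd) could be assembled along these lines. This is a sketch, not the runbook's actual command: the image tag, the `/mnt/orin-ssd/sglang` subdirectory, the model path, and the reading of "Triton backend" as SGLang's `--attention-backend triton` are all assumptions.

```python
import shlex


def sglang_docker_cmd(image, model_path,
                      tokenizer_path="/opt/tokenizers/llama-3.1",
                      data_dir="/mnt/orin-ssd/sglang",  # assumed subdirectory
                      port=30000):
    """Assemble a `docker run` argv for SGLang on Orin (illustrative only)."""
    return [
        "docker", "run", "--rm", "--runtime", "nvidia",
        "--shm-size", "8g",              # Docker's 64 MB default stalls the worker pool
        "-v", f"{data_dir}:/data",       # bind mount; Docker storage stays on root
        "-p", f"{port}:{port}",
        image,
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tokenizer-path", tokenizer_path,
        "--attention-backend", "triton",  # assumed meaning of "Triton backend"
        "--disable-cuda-graph",           # conservative stability choice on Jetson
        "--host", "0.0.0.0", "--port", str(port),
    ]


cmd = sglang_docker_cmd(
    "ghcr.io/infernode-os/serving-sglang:orin",  # hypothetical tag
    "/data/models/example-model",                # hypothetical model path
)
print(shlex.join(cmd))
```

Building the command as an argv list keeps quoting out of the picture; `shlex.join` is used only to print a copy-pasteable form.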

https://claude.ai/code/session_01Dx8Vba9MmR3aMaRMXFhYyD

Lands the in-repo work for the "Productize SGLang serving" epic (INFR-73), covering child tickets INFR-74 through INFR-81. Cross-repo work (lucibridge code in infernode-os/infernode, eval harness in IOL) stays out of this commit; their entry points and contracts are documented in runbooks/.

Per-ticket summary:

INFR-74 (Investigate NGC for Orin sm_87): no code. Findings posted to the Jira ticket: NGC's SGLang line is CUDA-13 / JP7-only (datacenter + Thor). Fork-and-vendor remains the right path for Orin; NGC is the recommended base for Thor.

INFR-76 (Vendor dusty-nv recipe): verbatim copy of dusty-nv/jetson-containers/packages/llm/sglang into sglang/orin/ (Dockerfile.upstream, build.sh, install.sh, test.py) with attribution in sglang/LICENSE-UPSTREAM.md. Standalone build path lives in sglang/orin/Dockerfile (diverged: drops chained transformers install, adds tokenizer bake step).

INFR-77 (Pin SGLang >=0.5.x for gpt-oss): sglang/orin/config.py pinned to 0.5.3 (first 0.5.x line with srt/models/gpt_oss.py, predates upstream's CUDA-13 transition at 0.5.11). Fallback ladder documented in the config.py docstring; on-target smoke build on Hephaestus is the verification gate.

INFR-75 (GitHub-hosted ubuntu-24.04-arm CI): .github/workflows/build-sglang.yml. Native aarch64 build on Graviton SBSA, push to ghcr.io/infernode-os/serving-sglang with variant-tagged images. Pins all third-party actions by commit SHA. Note: the self-hosted-Hephaestus plan in the original ticket description has been superseded; the Jira description has been updated via API.

INFR-78 (Llama-3 tokenizer + chat-template fix): sglang/orin/bake-tokenizers.sh pulls non-gated mirrors of the Llama-3.1 and Llama-3 tokenizer dirs into /opt/tokenizers/ at image build time (~60 MB total). Documented launch flag --tokenizer-path /opt/tokenizers/llama-3.1 in the runbook.

INFR-79 (lucibridge per-tool routing): code change lives in infernode-os/infernode (out of scope here). What's in this repo: runbooks/lucibridge-routing.md, containing the routing config schema, the per-category default table, env-var bridging, observability spec, and test plan. The infernode-side PR will consume this as the contract.

INFR-80 (Hephaestus deploy runbook): runbooks/hephaestus-deploy.md. Pull + pre-flight + launch + healthcheck + systemd unit + memory budget + troubleshooting + serve-llm.sh integration + clean shutdown. Respects the Hephaestus disk policy (Docker on root, working data on /mnt/orin-ssd via bind mounts).

INFR-81 (Thor sm_103 matrix build): sglang/thor/ (Dockerfile + README) wraps NGC nvcr.io/nvidia/sglang:25.10-py3. The build workflow matrix-builds Thor alongside Orin; Thor variant is skipped on PRs (needs NGC_API_KEY secret which forks don't have).

Not in this commit (genuinely out of scope or blocked):
* IOL-26 (virgil-agent eval against SGLang): lives in IOL repo; runs after a working SGLang endpoint exists on Hephaestus.
* The on-target smoke build of the pinned 0.5.3 image on Hephaestus (acceptance gate for INFR-77, requires Jetson hardware).
* The actual lucibridge code change in infernode-os/infernode (consumes the runbook schema; tracked under INFR-79).

NGC SGLang containers are anonymously pullable from nvcr.io for the default tags we care about. Make the NGC login step conditional on the secret being set (forward-compat with any future gated variant) and remove the PR-skip that was only there because of the bogus auth assumption. Thor variant now builds on every event, same as Orin.

First CI run failed with:

dustynv/pytorch:2.6-r36.4.0-cu126-22.04: not found

dustynv moved the JP6 publishing line to cu128 / Ubuntu 24.04 a while back; the cu126-22.04 / Python 3.10 variant the spike used is no longer maintained. Switch the workflow default and the orin README's manual-build example to 2.6-r36.4.0-cu128-24.04. In-container Python 3.12 is fine: the spike's host-Python-alignment constraint only mattered for its hand-extracted-onto-host setup, not for Docker. CUDA 12.8 runtime is forward-compatible with JP6.x's CUDA 12.6 driver per NVIDIA's same-major compat policy.
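The same-major rule invoked here can be written down as a one-line check. The helper below is a hypothetical illustration that encodes only that rule; it does not model minimum driver versions or any other limit of NVIDIA's compatibility guarantees.

```python
def same_major(runtime: str, driver: str) -> bool:
    """True when the CUDA runtime and driver versions share a major version."""
    return runtime.split(".")[0] == driver.split(".")[0]
```

Under this rule a CUDA 12.8 base image passes against a 12.6 driver, while a CUDA-13 build does not, which matches the reason the Orin recipe stays on a pre-CUDA-13 SGLang pin.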

claude added 3 commits May 14, 2026 09:21

pdfinn merged commit d87e136 into main May 14, 2026
1 of 2 checks passed

pdfinn deleted the claude/investigate-ngc-orin-fgiaV branch May 14, 2026 09:55
