Add SGLang container recipes, CI, and deployment runbooks #1
Merged
Lands the in-repo work for the "Productize SGLang serving" epic (INFR-73), covering child tickets INFR-74 through INFR-81. Cross-repo work (lucibridge code in infernode-os/infernode, eval harness in IOL) stays out of this commit; their entry points and contracts are documented in runbooks/.

Per-ticket summary:

* INFR-74 (Investigate NGC for Orin sm_87): no code. Findings posted to the Jira ticket: NGC's SGLang line is CUDA-13 / JP7-only (datacenter + Thor). Fork-and-vendor remains the right path for Orin; NGC is the recommended base for Thor.
* INFR-76 (Vendor dusty-nv recipe): verbatim copy of dusty-nv/jetson-containers/packages/llm/sglang into sglang/orin/ (Dockerfile.upstream, build.sh, install.sh, test.py) with attribution in sglang/LICENSE-UPSTREAM.md. Standalone build path lives in sglang/orin/Dockerfile (diverged: drops chained transformers install, adds tokenizer bake step).
* INFR-77 (Pin SGLang >=0.5.x for gpt-oss): sglang/orin/config.py pinned to 0.5.3 (first 0.5.x line with srt/models/gpt_oss.py, predates upstream's CUDA-13 transition at 0.5.11). Fallback ladder documented in the config.py docstring; on-target smoke build on Hephaestus is the verification gate.
* INFR-75 (GitHub-hosted ubuntu-24.04-arm CI): .github/workflows/build-sglang.yml. Native aarch64 build on Graviton SBSA, push to ghcr.io/infernode-os/serving-sglang with variant-tagged images. Pins all third-party actions by commit SHA. Note: the self-hosted-Hephaestus plan in the original ticket description has been superseded; the Jira description has been updated via API.
* INFR-78 (Llama-3 tokenizer + chat-template fix): sglang/orin/bake-tokenizers.sh pulls non-gated mirrors of the Llama-3.1 and Llama-3 tokenizer dirs into /opt/tokenizers/ at image build time (~60 MB total). Documented launch flag --tokenizer-path /opt/tokenizers/llama-3.1 in the runbook.
* INFR-79 (lucibridge per-tool routing): code change lives in infernode-os/infernode (out of scope here). What's in this repo: runbooks/lucibridge-routing.md, which holds the routing config schema, the per-category default table, env-var bridging, observability spec, and test plan. The infernode-side PR will consume this as the contract.
* INFR-80 (Hephaestus deploy runbook): runbooks/hephaestus-deploy.md. Pull + pre-flight + launch + healthcheck + systemd unit + memory budget + troubleshooting + serve-llm.sh integration + clean shutdown. Respects the Hephaestus disk policy (Docker on root, working data on /mnt/orin-ssd via bind mounts).
* INFR-81 (Thor sm_103 matrix build): sglang/thor/ (Dockerfile + README) wraps NGC nvcr.io/nvidia/sglang:25.10-py3. The build workflow matrix-builds Thor alongside Orin; Thor variant is skipped on PRs (needs NGC_API_KEY secret which forks don't have).

Not in this commit (genuinely out of scope or blocked):

* IOL-26 (virgil-agent eval against SGLang): lives in IOL repo; runs after a working SGLang endpoint exists on Hephaestus.
* The on-target smoke build of the pinned 0.5.3 image on Hephaestus (acceptance gate for INFR-77, requires Jetson hardware).
* The actual lucibridge code change in infernode-os/infernode (consumes the runbook schema; tracked under INFR-79).
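The INFR-77 pin amounts to a version window: at or after 0.5.3 (first release with gpt-oss support) and before 0.5.11 (upstream's CUDA-13 transition). A minimal sketch of that check follows. The helper names are hypothetical and are not taken from sglang/orin/config.py, whose fallback ladder is not reproduced here.

```python
# Sketch only: names below are illustrative, not the contents of config.py.
FIRST_GPT_OSS = (0, 5, 3)       # first 0.5.x with srt/models/gpt_oss.py (per INFR-77)
CUDA13_TRANSITION = (0, 5, 11)  # upstream's CUDA-13 transition (per INFR-77)


def parse_version(text: str) -> tuple[int, ...]:
    """Convert a plain 'X.Y.Z' string to a tuple so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def in_orin_window(version: str) -> bool:
    """True if the release has gpt-oss support and predates the CUDA-13 transition."""
    return FIRST_GPT_OSS <= parse_version(version) < CUDA13_TRANSITION
```

Comparing tuples rather than strings matters here because "0.5.9" sorts after "0.5.11" lexically, which would wrongly exclude it from the window.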
NGC SGLang containers are anonymously pullable from nvcr.io for the default tags we care about. Make the NGC login step conditional on the secret being set (forward-compat with any future gated variant) and remove the PR-skip that was only there because of the bogus auth assumption. Thor variant now builds on every event, same as Orin.
First CI run failed with:

    dustynv/pytorch:2.6-r36.4.0-cu126-22.04: not found

dustynv moved the JP6 publishing line to cu128 / Ubuntu 24.04 a while back; the cu126-22.04 / Python 3.10 variant the spike used is no longer maintained. Switch the workflow default and the orin README's manual-build example to 2.6-r36.4.0-cu128-24.04. In-container Python 3.12 is fine: the spike's host-Python-alignment constraint only mattered for its hand-extracted-onto-host setup, not for Docker. CUDA 12.8 runtime is forward-compatible with JP6.x's CUDA 12.6 driver per NVIDIA's same-major compat policy.
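The base-tag switch and the same-major argument can be sketched as a small pre-flight check. The tag layout below is an assumption inferred only from the two tags quoted in this commit message, and the function names are hypothetical. The check encodes nothing beyond the same-major rule stated above, so it is not a full CUDA compatibility test.

```python
import re

# Assumed layout, inferred from the tags in this commit message:
#   <torch>-r<l4t>-cu<three CUDA digits>-<ubuntu>
TAG_RE = re.compile(
    r"^(?P<torch>[\d.]+)-r(?P<l4t>[\d.]+)-cu(?P<cuda>\d{3})-(?P<ubuntu>[\d.]+)$"
)


def parse_base_tag(tag: str) -> dict:
    """Split a dustynv/pytorch-style tag into its components."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognised base tag: {tag}")
    fields = match.groupdict()
    digits = fields.pop("cuda")
    # "128" -> (12, 8): two digits of major version, one of minor.
    fields["cuda"] = (int(digits[:2]), int(digits[2:]))
    return fields


def same_cuda_major(runtime: tuple, driver: tuple) -> bool:
    """Same-major rule only: compares the major component of two (major, minor) pairs."""
    return runtime[0] == driver[0]
```

For the new default, `parse_base_tag("2.6-r36.4.0-cu128-24.04")["cuda"]` gives `(12, 8)`, and `same_cuda_major((12, 8), (12, 6))` holds for the JP6.x driver.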
Summary
This PR adds complete infrastructure for deploying SGLang on NVIDIA Jetson hardware (Orin and Thor), including container recipes, GitHub Actions CI, and operational runbooks. This unblocks INFR-77 (gpt-oss model support on Jetson) and INFR-79 (multi-backend routing via lucibridge).
Key Changes
Container Recipes
* sglang/orin/: Full vendored recipe for Jetson Orin AGX (sm_87, JetPack 6.x, CUDA 12.6)
* sglang/thor/: Thin overlay on NVIDIA's official NGC SGLang image (sm_103, JetPack 7, CUDA 13)
CI/CD
* .github/workflows/build-sglang.yml: Matrix build for both Orin and Thor variants
  * Runs on ubuntu-24.04-arm (native aarch64, no QEMU)
Operational Documentation
* runbooks/hephaestus-deploy.md: End-to-end deployment guide for Hephaestus (Orin AGX)
  * Disk policy (Docker on root, working data on /mnt/orin-ssd for SGLang)
  * Integration with serve-llm.sh and dual-backend mode
* runbooks/lucibridge-routing.md: Multi-backend routing schema for per-tool dispatch
  * /etc/lucibridge/routing.json
  * LLM_BACKEND_URL env var
Supporting Files
* Upstream attribution (sglang/LICENSE-UPSTREAM.md)
Notable Implementation Details
* Bind mounts to /mnt/orin-ssd rather than migrating Docker daemon storage, preserving production emulation on root partition
* --shm-size 8g required for SGLang's worker pool (default 64 MB causes stalls under concurrency)

https://claude.ai/code/session_01Dx8Vba9MmR3aMaRMXFhYyD
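The launch constraints noted in this PR (--shm-size 8g, bind mounts onto /mnt/orin-ssd, and the baked tokenizer path) could be assembled into a docker run command roughly as follows. This is a sketch, not the runbook's command: the image tag `orin`, the host directory `/mnt/orin-ssd/models`, the container mount point `/models`, and the assumption that the image has no conflicting entrypoint are all hypothetical.

```python
import shlex

IMAGE = "ghcr.io/infernode-os/serving-sglang:orin"  # tag name is an assumption
MODEL_DIR = "/mnt/orin-ssd/models"                   # hypothetical host path on the SSD


def docker_run_argv(model: str, port: int = 30000) -> list[str]:
    """Build the argv for launching the SGLang server in a container."""
    return [
        "docker", "run", "--rm",
        "--runtime", "nvidia",
        "--shm-size", "8g",              # Docker's 64 MB default stalls under concurrency
        "-v", f"{MODEL_DIR}:/models",    # working data stays on the SSD via bind mount
        "-p", f"{port}:{port}",
        IMAGE,
        "python3", "-m", "sglang.launch_server",
        "--model-path", f"/models/{model}",
        "--tokenizer-path", "/opt/tokenizers/llama-3.1",  # baked in at image build time
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line; "example-model" is a placeholder.
    print(shlex.join(docker_run_argv("example-model")))
```

Building the command as a list keeps each flag and its value paired, and `shlex.join` only quotes where the shell needs it when the command is printed for a runbook or a systemd unit.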