Add SGLang container recipes, CI, and deployment runbooks#1

Merged
pdfinn merged 3 commits into main from claude/investigate-ngc-orin-fgiaV
May 14, 2026
@pdfinn pdfinn commented May 14, 2026

Summary

This PR adds complete infrastructure for deploying SGLang on NVIDIA Jetson hardware (Orin and Thor), including container recipes, GitHub Actions CI, and operational runbooks. This unblocks INFR-77 (gpt-oss model support on Jetson) and INFR-79 (multi-backend routing via lucibridge).

Key Changes

Container Recipes

  • sglang/orin/: Full vendored recipe for Jetson AGX Orin (sm_87, JetPack 6.x, CUDA 12.6)

    • Pinned to SGLang 0.5.3 (first 0.5.x with gpt-oss support, compatible with CUDA 12.6)
    • Includes Llama-3 tokenizer bake-in to fix special-token handling (INFR-78)
    • Vendored from dusty-nv/jetson-containers with divergent version pin for JP6 compatibility
  • sglang/thor/: Thin overlay on NVIDIA's official NGC SGLang image (sm_103, JetPack 7, CUDA 13)

    • Forward-looking recipe for future Thor hardware
    • Leverages NGC's official base rather than community fork

CI/CD

  • .github/workflows/build-sglang.yml: Matrix build for both Orin and Thor variants
    • Runs on ubuntu-24.04-arm (native aarch64, no QEMU)
    • Pushes to GHCR with short-SHA tagging for production reproducibility
    • Supports NGC auth for gated content (forward-compatible)
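
As a rough illustration of the short-SHA tagging mentioned above, the sketch below derives an image reference from a variant name and a full commit SHA. The registry path comes from the commit message in this PR; the `<variant>-<short-sha>` tag layout, the function name `image_ref`, and the 7-character default are assumptions made for illustration, not the workflow's confirmed behaviour.

```python
import re

# Registry path as named in the commit message for INFR-75.
IMAGE = "ghcr.io/infernode-os/serving-sglang"

# Variants built by the matrix in this PR.
VARIANTS = {"orin", "thor"}


def image_ref(variant: str, commit_sha: str, length: int = 7) -> str:
    """Return an image reference pinned to a variant and a short commit SHA.

    Assumption: the "<variant>-<short-sha>" tag layout is hypothetical; the
    real workflow may order or separate the parts differently.
    """
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError("expected a full 40-character lowercase hex SHA")
    return f"{IMAGE}:{variant}-{commit_sha[:length]}"
```

Pinning a deployment to a SHA-derived tag rather than a moving tag means the exact source commit behind a running container can always be recovered from its image reference.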

Operational Documentation

  • runbooks/hephaestus-deploy.md: End-to-end deployment guide for Hephaestus (Jetson AGX Orin)

    • Disk policy enforcement (root partition for OS/TAK/NERVA, /mnt/orin-ssd for SGLang)
    • One-shot launch command with all required flags (Triton backend, shared memory tuning, CUDA graph disable)
    • systemd unit template for production deployment
    • Memory budgeting across 64 GiB unified memory (OS + Ollama + SGLang coexistence)
    • Comprehensive troubleshooting table
    • Integration with serve-llm.sh and dual-backend mode
  • runbooks/lucibridge-routing.md: Multi-backend routing schema for per-tool dispatch

    • Routing table mapping tool categories to backends (Ollama for Limbo authoring, SGLang for dispatch/tool-call/memory/task)
    • JSON config schema for /etc/lucibridge/routing.json
    • Backwards-compatible fallback to single LLM_BACKEND_URL env var
    • Observability guidance (structured logging per routing decision)
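
The routing behaviour described above (per-category lookup, fallback to a single `LLM_BACKEND_URL`, one structured log line per decision) could look roughly like the following. This is a minimal sketch only: the `default` and `routes` keys, the function names, and the logger name are hypothetical, since the actual schema lives in runbooks/lucibridge-routing.md and is not reproduced in this PR description.

```python
import json
import logging
import os
from pathlib import Path

# Config path as named in the runbook summary.
CONFIG_PATH = Path("/etc/lucibridge/routing.json")

log = logging.getLogger("lucibridge.routing")  # hypothetical logger name


def load_routing(path=CONFIG_PATH, env=None):
    """Load the routing table, or fall back to the single-backend env var.

    Assumption: the JSON has a "default" URL and a "routes" mapping from
    tool category to backend URL. Both key names are illustrative.
    """
    env = os.environ if env is None else env
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        # Backwards-compatible mode: every category goes to one backend.
        return {"default": env["LLM_BACKEND_URL"], "routes": {}}


def resolve(routing, category):
    """Pick the backend URL for a tool category and log the decision."""
    backend = routing["routes"].get(category, routing["default"])
    log.info(json.dumps({"event": "route", "category": category, "backend": backend}))
    return backend
```

With a table such as `{"default": "<ollama-url>", "routes": {"dispatch": "<sglang-url>"}}`, `resolve(routing, "dispatch")` returns the SGLang URL while any unlisted category falls through to the default.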

Supporting Files

  • Upstream attribution and license documentation (sglang/LICENSE-UPSTREAM.md)
  • README files explaining vendoring decisions and build procedures
  • Smoke test scripts for both variants
  • Build scripts with fallback paths (pip install → source build)
  • Tokenizer bake script for offline Llama-3 support

Notable Implementation Details

  • Version pinning rationale: SGLang 0.5.3 chosen for Orin because upstream's current default (0.5.11) is CUDA-13-only; 0.5.3 is the first 0.5.x with gpt-oss support and JP6/CUDA-12.6 compatibility
  • Disk policy enforcement: SGLang container uses bind-mounts to /mnt/orin-ssd rather than migrating Docker daemon storage, preserving production emulation on root partition
  • Shared memory tuning: --shm-size 8g required for SGLang's worker pool (default 64 MB causes stalls under concurrency)
  • CUDA graph disable: Conservative stability choice for Jetson; can be re-enabled later if performance requires it
  • Tokenizer bake-in: Llama-3
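
To make the launch-time choices above concrete, here is a hedged sketch that assembles a `docker run` argument list with the shared-memory size, a bind mount under /mnt/orin-ssd, and CUDA graphs disabled. It is not the runbook's actual command. The image name, the `models` subdirectory, the port, and the function name are hypothetical, and mapping "Triton backend" to `--attention-backend triton` is an assumption. The tokenizer path is the one quoted in the INFR-78 commit note and only applies to Llama-3.1 models.

```python
import shlex


def launch_argv(image, model_dir, host_root="/mnt/orin-ssd/sglang", port=30000):
    """Build a docker-run argv for an SGLang server (illustrative only)."""
    return [
        "docker", "run", "--rm",
        "--runtime", "nvidia",
        # Docker's default /dev/shm is 64 MB; the PR calls for 8g.
        "--shm-size", "8g",
        # Working data stays on the SSD via a bind mount (hypothetical layout).
        "-v", f"{host_root}/models:/models:ro",
        "-p", f"{port}:{port}",
        image,
        "python3", "-m", "sglang.launch_server",
        "--model-path", f"/models/{model_dir}",
        # Baked-in tokenizer directory from the INFR-78 note (Llama-3.1 only).
        "--tokenizer-path", "/opt/tokenizers/llama-3.1",
        # Assumption: "Triton backend" refers to the attention backend.
        "--attention-backend", "triton",
        # Conservative stability choice described in this PR.
        "--disable-cuda-graph",
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


def launch_command(image, model_dir):
    """Render the argv as a shell-safe string for a runbook or unit file."""
    return shlex.join(launch_argv(image, model_dir))
```

Building the command as a list and quoting it with `shlex.join` keeps the flags reviewable in one place and avoids quoting mistakes when the same command is copied into a systemd unit.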

https://claude.ai/code/session_01Dx8Vba9MmR3aMaRMXFhYyD

claude added 3 commits May 14, 2026 09:21
Lands the in-repo work for the "Productize SGLang serving" epic
(INFR-73), covering child tickets INFR-74 through INFR-81. Cross-repo
work (lucibridge code in infernode-os/infernode, eval harness in IOL)
stays out of this commit; their entry-points and contracts are
documented in runbooks/.

Per-ticket summary:

INFR-74 (Investigate NGC for Orin sm_87): no code. Findings posted to
the Jira ticket — NGC's SGLang line is CUDA-13 / JP7-only (datacenter
+ Thor). Fork-and-vendor remains the right path for Orin; NGC is the
recommended base for Thor.

INFR-76 (Vendor dusty-nv recipe): copy of
dusty-nv/jetson-containers/packages/llm/sglang verbatim into
sglang/orin/ (Dockerfile.upstream, build.sh, install.sh, test.py)
with attribution in sglang/LICENSE-UPSTREAM.md. Standalone build path
lives in sglang/orin/Dockerfile (diverged: drops chained
transformers install, adds tokenizer bake step).

INFR-77 (Pin SGLang >=0.5.x for gpt-oss): sglang/orin/config.py pinned
to 0.5.3 (first 0.5.x line with srt/models/gpt_oss.py, predates
upstream's CUDA-13 transition at 0.5.11). Fallback ladder documented
in the config.py docstring; on-target smoke build on Hephaestus is the
verification gate.

INFR-75 (GitHub-hosted ubuntu-24.04-arm CI):
.github/workflows/build-sglang.yml. Native aarch64 build on Graviton
SBSA, push to ghcr.io/infernode-os/serving-sglang with variant-tagged
images. Pins all third-party actions by commit SHA. Note: the
self-hosted-Hephaestus plan in the original ticket description has
been superseded; the Jira description has been updated via API.

INFR-78 (Llama-3 tokenizer + chat-template fix):
sglang/orin/bake-tokenizers.sh pulls non-gated mirrors of the
Llama-3.1 and Llama-3 tokenizer dirs into /opt/tokenizers/ at image
build time (~60 MB total). Documented launch flag
--tokenizer-path /opt/tokenizers/llama-3.1 in the runbook.

INFR-79 (lucibridge per-tool routing): code change lives in
infernode-os/infernode (out of scope here). What's in this repo:
runbooks/lucibridge-routing.md — the routing config schema, the
per-category default table, env-var bridging, observability spec, and
test plan. The infernode-side PR will consume this as the contract.

INFR-80 (Hephaestus deploy runbook): runbooks/hephaestus-deploy.md.
Pull + pre-flight + launch + healthcheck + systemd unit + memory
budget + troubleshooting + serve-llm.sh integration + clean shutdown.
Respects the Hephaestus disk policy (Docker on root, working data on
/mnt/orin-ssd via bind mounts).

INFR-81 (Thor sm_103 matrix build): sglang/thor/ (Dockerfile +
README) wraps NGC nvcr.io/nvidia/sglang:25.10-py3. The build workflow
matrix-builds Thor alongside Orin; Thor variant is skipped on PRs
(needs NGC_API_KEY secret which forks don't have).

Not in this commit (genuinely out of scope or blocked):

* IOL-26 (virgil-agent eval against SGLang) — lives in IOL repo; runs
  after a working SGLang endpoint exists on Hephaestus.
* The on-target smoke build of the pinned 0.5.3 image on Hephaestus
  (acceptance gate for INFR-77, requires Jetson hardware).
* The actual lucibridge code change in infernode-os/infernode
  (consumes the runbook schema; tracked under INFR-79).

NGC SGLang containers are anonymously pullable from nvcr.io for the
default tags we care about. Make the NGC login step conditional on the
secret being set (forward-compat with any future gated variant) and
remove the PR-skip that was only there because of the bogus auth
assumption. Thor variant now builds on every event, same as Orin.

First CI run failed with:
  dustynv/pytorch:2.6-r36.4.0-cu126-22.04: not found

dustynv moved the JP6 publishing line to cu128 / Ubuntu 24.04 a while
back; the cu126-22.04 / Python 3.10 variant the spike used is no
longer maintained. Switch the workflow default and the orin README's
manual-build example to 2.6-r36.4.0-cu128-24.04.

In-container Python 3.12 is fine — the spike's host-Python-alignment
constraint only mattered for its hand-extracted-onto-host setup, not
for Docker. CUDA 12.8 runtime is forward-compatible with JP6.x's
CUDA 12.6 driver per NVIDIA's same-major compat policy.

@pdfinn pdfinn merged commit d87e136 into main May 14, 2026
1 of 2 checks passed
@pdfinn pdfinn deleted the claude/investigate-ngc-orin-fgiaV branch May 14, 2026 09:55