feat(ltx2): M1 - LTX-2.3 video model conversion, scaffolding & load-on-CPU #8
Draft
Vib-UX wants to merge 2 commits into
…ool, model registration

Begins LTX-2.3 (video-only) support for the Tether LTX-2 bounty (M1).

- docs/ltx2_feasibility.md: research findings (architecture, scope, risks).
- script/convert_ltx2_to_gguf.py (+requirements-ltx2.txt): safetensors -> GGUF converter that keeps only the video stream (drops audio DiT, AV cross-attn, audio VAE, vocoder). Filtering/naming is pure-stdlib so --dry-run needs no heavy deps; F16 plus Q4_0/Q5_1/Q8_0 supported. Validated against the real ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
- model.h / stable-diffusion.cpp / model.cpp: register VERSION_LTX2, sd_version_is_ltx2(), include in sd_version_is_dit(), "LTX-2" version string, and weight detection via video_embeddings_connector / patchify_proj.
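The weight detection mentioned in this commit can be illustrated with a small Python sketch. It is not the C++ code in model.cpp: it assumes that the presence of either marker substring in any tensor name is enough to classify a checkpoint, and the function name and the example tensor names are hypothetical.

```python
# Hypothetical sketch of name-based architecture detection.
# Assumption: a checkpoint counts as LTX-2 if ANY tensor name contains
# either marker; the real rule in src/model.cpp may be stricter.
LTX2_MARKERS = ("video_embeddings_connector", "patchify_proj")


def looks_like_ltx2(tensor_names):
    """Return True if any tensor name contains an LTX-2 marker substring."""
    return any(
        marker in name
        for name in tensor_names
        for marker in LTX2_MARKERS
    )
```

A check like this only needs the tensor names from the file header, so it can run before any weights are read.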
Add the config-driven LTX-2 video-DiT block tree (src/ltx2.hpp) and an Ltx2Model diffusion-model adapter, then wire VERSION_LTX2 into init(): construct the runner, allocate params on CPU, and bind every tensor. Geometry is inferred from checkpoint shapes so reduced-size synthetic checkpoints load through the same path as the real weights.

- src/ltx2.hpp: DiT (patchify/proj_out/adaln/connector + 48 blocks), gated attention, FFN, modulation tables; Ltx2Runner with shape inference.
- diffusion_model.hpp: Ltx2Model adapter (M1 is load-only).
- stable-diffusion.cpp: LTX-2 branch, null-conditioner guards (Gemma is M2), FakeVAE placeholder, FLOW_PRED denoiser, graceful generate_video stop.
- script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator.
- script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test (no large download).
- script/convert_ltx2_to_gguf.py: add --self-test filter validation.
- .github/workflows/ltx2.yml: Linux x86-64 build + load smoke.
- docs/ltx2.md + README links.

Verified locally: synthetic GGUF detected as LTX-2, geometry inferred (num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.
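To make the shape-inference idea concrete, here is a minimal Python sketch of how geometry might be recovered from a checkpoint's tensor names and shapes. It is not the Ltx2Runner implementation: the transformer_blocks.N. naming pattern, the PyTorch-style (out_features, in_features) shape order, and the infer_geometry helper are all assumptions made for illustration, and only num_layers and dim are covered.

```python
import re

# Assumed naming pattern for per-block tensors (hypothetical).
_BLOCK_RE = re.compile(r"transformer_blocks\.(\d+)\.")


def infer_geometry(shapes):
    """Infer num_layers and dim from a {tensor_name: shape} mapping.

    Assumes shapes are in PyTorch order (out_features, in_features).
    """
    # The layer count is one more than the highest block index seen.
    block_ids = {
        int(m.group(1))
        for name in shapes
        if (m := _BLOCK_RE.search(name))
    }
    num_layers = max(block_ids) + 1 if block_ids else 0

    # The hidden size is taken from the output dim of patchify_proj.
    dim = None
    for name, shape in shapes.items():
        if name.endswith("patchify_proj.weight"):
            dim = shape[0]
            break

    return {"num_layers": num_layers, "dim": dim}
```

Because nothing is hard-coded to 48 blocks or to the real hidden size, a two-block synthetic checkpoint and the full model would go through the same function.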
Summary
First milestone (M1) of LTX-2.3 video-generation support for the Tether LTX-2 bounty:
model conversion + scaffolding + "model loads on CPU", scoped to the video stream only
(audio DiT / Audio-VAE / vocoder are explicitly out of scope and dropped during conversion).
This PR is intentionally load-only. End-to-end T2V/I2V inference, the Gemma-3 text encoder,
and the CausalVideoAutoencoder land in later milestones (M2+).
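The video-only scoping can be pictured as a pure-stdlib name filter. The sketch below is not taken from script/convert_ltx2_to_gguf.py: the marker substrings, the function names, and the example tensor names are hypothetical and only show the shape of the approach.

```python
# Hypothetical sketch of a video-only tensor filter (stdlib only).
# Assumption: every audio-side tensor has "audio" or "vocoder" somewhere
# in its name; the real converter's rules may differ.
AUDIO_MARKERS = ("audio", "vocoder")


def is_video_tensor(name):
    """Return True if the tensor name carries no audio-side marker."""
    lowered = name.lower()
    return not any(marker in lowered for marker in AUDIO_MARKERS)


def split_tensors(names):
    """Partition tensor names into (kept, dropped) lists, preserving order."""
    kept = [n for n in names if is_video_tensor(n)]
    dropped = [n for n in names if not is_video_tensor(n)]
    return kept, dropped
```

Since the filter works on names alone, a dry run can report kept and dropped counts from the safetensors header without loading numpy or any tensor data.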
What's included
GGUF conversion tooling
- script/convert_ltx2_to_gguf.py: safetensors → GGUF converter that keeps only the video stream (drops audio DiT, AV cross-attention, Audio-VAE, vocoder). Supports f16, q8_0, q5_1, q4_0. Filtering/naming is pure-stdlib so --dry-run and --self-test need no heavy deps. Validated against the real ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
- script/requirements-ltx2.txt: deps for the full conversion path (numpy, safetensors, gguf).

Model registration & detection
- src/model.h: VERSION_LTX2, sd_version_is_ltx2(), included in sd_version_is_dit().
- src/stable-diffusion.cpp: "LTX-2" version string.
- src/model.cpp: architecture auto-detection via video_embeddings_connector / patchify_proj.

DiT scaffolding (load-only)
- src/ltx2.hpp: config-driven video DiT block tree: patchify_proj, proj_out, adaln_single / prompt_adaln_single, the 8-layer video_embeddings_connector (learnable registers), and the 48 transformer blocks (gated self/cross attention with RMS qk-norm, gelu FFN, modulation tables). Ltx2Runner infers geometry from checkpoint shapes, so reduced-size synthetic checkpoints load through the exact same path as the real weights.
- src/diffusion_model.hpp: Ltx2Model adapter.
- src/stable-diffusion.cpp: LTX-2 branch in init(), null-conditioner guards (Gemma is M2), FakeVAE placeholder, FLOW_PRED denoiser, and a graceful generate_video stop instead of a crash.

CI & verification (no large download)
- script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator (real tensor names, reduced dims).
- script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test.
- .github/workflows/ltx2.yml: Linux x86-64: converter filter self-test → build sd-cli → load smoke.

Docs
- docs/ltx2.md: M1 build/conversion guide + verification steps; linked from README.md.
- docs/ltx2_feasibility.md: architecture research, scope and risks.

Scope
In scope (project): video DiT, Video-VAE encoder+decoder, Gemma-3 text encoder, scheduler + CFG,
T2V and I2V, GGUF conversion, CLI, C API.
Out of scope: audio stream (audio DiT, Audio-VAE, vocoder), training/fine-tuning, spatial upscaler, V2V.
Test plan
- python script/convert_ltx2_to_gguf.py --self-test passes (audio dropped, video kept).
- python script/convert_ltx2_to_gguf.py --src ... --dry-run validated against the real ltx-2.3-22b-dev header (0 audio leaks).
- cmake --build build --target sd-cli compiles clean (no new -Wall/-Wextra warnings).
- bash script/ci_ltx2_load_smoke.sh passes: synthetic GGUF detected as Version: LTX-2, geometry inferred (num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.
- … (.github/workflows/ltx2.yml).

Milestone roadmap
Notes for reviewers
… the full ~46 GB safetensors on a larger machine; the tooling here is complete and dry-run-validated.