feat(ltx2): M1 — LTX-2.3 video model conversion, scaffolding & load-on-CPU by Vib-UX · Pull Request #8 · tetherto/qvac-ext-stable-diffusion.cpp

Vib-UX · 2026-06-01T05:17:40Z

Summary

First milestone (M1) of LTX-2.3 video-generation support for the Tether LTX-2 bounty:
model conversion + scaffolding + "model loads on CPU", scoped to the video stream only
(audio DiT / Audio-VAE / vocoder are explicitly out of scope and dropped during conversion).

This PR is intentionally load-only. End-to-end T2V/I2V inference, the Gemma-3 text encoder,
and the CausalVideoAutoencoder land in later milestones (M2+).

What's included

GGUF conversion tooling

script/convert_ltx2_to_gguf.py: safetensors → GGUF converter that keeps only the video
stream (drops audio DiT, AV cross-attention, Audio-VAE, vocoder). Supports f16, q8_0,
q5_1, q4_0. Filtering/naming is pure-stdlib so --dry-run and --self-test need no
heavy deps. Validated against the real ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
script/requirements-ltx2.txt: deps for the full conversion path (numpy, safetensors, gguf).
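
A rough sketch of how a stdlib-only dry run could work, offered as an illustration rather than the converter's actual code. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so tensor names can be listed with only `struct` and `json`. The `AUDIO_MARKERS` substrings and both function names are hypothetical; the real filter rules in `convert_ltx2_to_gguf.py` may differ.

```python
import json
import struct
from typing import BinaryIO

# Hypothetical substrings marking audio-stream tensors; the real
# converter's filter rules are not shown in this PR description.
AUDIO_MARKERS = ("audio", "vocoder")


def read_safetensors_header(f: BinaryIO) -> dict:
    """Read only the JSON header: 8-byte little-endian length, then JSON."""
    (header_len,) = struct.unpack("<Q", f.read(8))
    return json.loads(f.read(header_len))


def split_video_audio(header: dict) -> tuple[list[str], list[str]]:
    """Partition tensor names into (kept video, dropped audio)."""
    # "__metadata__" is the optional non-tensor entry in a safetensors header.
    names = [n for n in header if n != "__metadata__"]
    dropped = [n for n in names if any(m in n.lower() for m in AUDIO_MARKERS)]
    kept = [n for n in names if n not in dropped]
    return kept, dropped
```

Because only the header is read, a check like "0 audio leaks" could in principle be run against a multi-gigabyte checkpoint without loading any tensor data.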

Model registration & detection

src/model.h: VERSION_LTX2, sd_version_is_ltx2(), included in sd_version_is_dit().
src/stable-diffusion.cpp: "LTX-2" version string.
src/model.cpp: architecture auto-detection via video_embeddings_connector / patchify_proj.
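
The detection rule can be pictured with a short Python sketch (the real logic lives in C++ in `src/model.cpp`). The two marker names come from this description; the substring match, the "either marker suffices" rule, and the function name `detect_ltx2` are assumptions made for illustration.

```python
from typing import Iterable

# Marker names are taken from the PR description; the matching rule
# (substring test, either marker is enough) is an assumption.
LTX2_MARKERS = ("video_embeddings_connector", "patchify_proj")


def detect_ltx2(tensor_names: Iterable[str]) -> bool:
    """Return True if any tensor name contains one of the LTX-2 markers."""
    return any(
        marker in name
        for name in tensor_names
        for marker in LTX2_MARKERS
    )
```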

DiT scaffolding (load-only)

src/ltx2.hpp: config-driven video DiT block tree — patchify_proj, proj_out,
adaln_single / prompt_adaln_single, the 8-layer video_embeddings_connector (learnable
registers), and the 48 transformer blocks (gated self/cross attention with RMS QK-norm, GELU
FFN, modulation tables). Ltx2Runner infers geometry from checkpoint shapes, so reduced-size
synthetic checkpoints load through the exact same path as the real weights.
src/diffusion_model.hpp: Ltx2Model adapter.
src/stable-diffusion.cpp: LTX-2 branch in init(), null-conditioner guards (Gemma is M2),
FakeVAE placeholder, FLOW_PRED denoiser, and a graceful generate_video stop instead of a crash.
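
To illustrate how geometry can be inferred from checkpoint shapes, here is a minimal Python sketch (Ltx2Runner itself is C++). It assumes blocks are named `transformer_blocks.<i>.` and that `patchify_proj.weight` has a PyTorch-style `(dim, in_channels)` layout; both are assumptions, as is the function name `infer_geometry`.

```python
import re
from typing import Mapping, Optional, Sequence

# Assumed naming scheme: blocks live under "transformer_blocks.<index>.".
BLOCK_RE = re.compile(r"transformer_blocks\.(\d+)\.")


def infer_geometry(shapes: Mapping[str, Sequence[int]]) -> dict:
    """Derive num_layers and dim from tensor names and shapes alone."""
    block_ids = set()
    dim: Optional[int] = None
    for name, shape in shapes.items():
        match = BLOCK_RE.search(name)
        if match:
            block_ids.add(int(match.group(1)))
        elif name.endswith("patchify_proj.weight"):
            # Assumes a (dim, in_channels) layout, as in PyTorch Linear weights.
            dim = shape[0]
    return {"num_layers": len(block_ids), "dim": dim}
```

Since nothing is hard-coded, a two-block synthetic checkpoint and the full-size one would take the same path and differ only in the values returned.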

CI & verification (no large download)

script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator (real tensor names, reduced dims).
script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test.
.github/workflows/ltx2.yml: Linux x86-64 — converter filter self-test → build sd-cli → load smoke.

Docs

docs/ltx2.md: M1 build/conversion guide + verification steps; linked from README.md.
docs/ltx2_feasibility.md: architecture research, scope and risks.

Scope

In scope (project): video DiT, Video-VAE encoder+decoder, Gemma-3 text encoder, scheduler + CFG,
T2V and I2V, GGUF conversion, CLI, C API.
Out of scope: audio stream (audio DiT, Audio-VAE, vocoder), training/fine-tuning, spatial upscaler, V2V.

Test plan

python script/convert_ltx2_to_gguf.py --self-test passes (audio dropped, video kept).
python script/convert_ltx2_to_gguf.py --src ... --dry-run validated against the real
ltx-2.3-22b-dev header (0 audio leaks).
cmake --build build --target sd-cli compiles clean (no new -Wall/-Wextra warnings).
bash script/ci_ltx2_load_smoke.sh passes: synthetic GGUF detected as Version: LTX-2,
geometry inferred (num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.
CI green on Linux x86-64 (.github/workflows/ltx2.yml).

Milestone roadmap

M1 (this PR): conversion + scaffolding + loads on CPU. ✅
M2: Gemma-3 text encoder, CausalVideoAutoencoder, DiT forward → end-to-end T2V/I2V on CPU; Q4/Q8 checkpoints.
M3: Vulkan + Metal backends, benchmarks.
M4: C API, Bare addon, full test suite, docs polish.

Notes for reviewers

M1 acceptance is "model loads on CPU"; generation is deliberately a graceful no-op until M2.
The F16 GGUF checkpoint artifact (to be published on Hugging Face) requires a one-time conversion of
the full ~46 GB safetensors checkpoint on a larger machine; the tooling here is complete and dry-run-validated.
Kept as a Draft until CI is green and the F16 checkpoint is published.

…ool, model registration

Begins LTX-2.3 (video-only) support for the Tether LTX-2 bounty (M1).

- docs/ltx2_feasibility.md: research findings (architecture, scope, risks).
- script/convert_ltx2_to_gguf.py (+requirements-ltx2.txt): safetensors -> GGUF converter that keeps only the video stream (drops audio DiT, AV cross-attn, audio VAE, vocoder). Filtering/naming is pure-stdlib so --dry-run needs no heavy deps; F16 plus Q4_0/Q5_1/Q8_0 supported. Validated against the real ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
- model.h / stable-diffusion.cpp / model.cpp: register VERSION_LTX2, sd_version_is_ltx2(), include in sd_version_is_dit(), "LTX-2" version string, and weight detection via video_embeddings_connector / patchify_proj.

Add the config-driven LTX-2 video-DiT block tree (src/ltx2.hpp) and an Ltx2Model diffusion-model adapter, then wire VERSION_LTX2 into init(): construct the runner, allocate params on CPU, and bind every tensor. Geometry is inferred from checkpoint shapes so reduced-size synthetic checkpoints load through the same path as the real weights.

- src/ltx2.hpp: DiT (patchify/proj_out/adaln/connector + 48 blocks), gated attention, FFN, modulation tables; Ltx2Runner with shape inference.
- diffusion_model.hpp: Ltx2Model adapter (M1 is load-only).
- stable-diffusion.cpp: LTX-2 branch, null-conditioner guards (Gemma is M2), FakeVAE placeholder, FLOW_PRED denoiser, graceful generate_video stop.
- script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator.
- script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test (no large download).
- script/convert_ltx2_to_gguf.py: add --self-test filter validation.
- .github/workflows/ltx2.yml: Linux x86-64 build + load smoke.
- docs/ltx2.md + README links.

Verified locally: synthetic GGUF detected as LTX-2, geometry inferred (num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.

Vib-UX added 2 commits June 1, 2026 10:10

Vib-UX marked this pull request as draft June 1, 2026 05:18

Vib-UX changed the title ~~Feat/ltx2 video generation~~ feat(ltx2): M1 — LTX-2.3 video model conversion, scaffolding & load-on-CPU Jun 1, 2026

Vib-UX wants to merge 2 commits into tetherto:master from Vib-UX:feat/ltx2-video-generation
