feat(ltx2): M1 — LTX-2.3 video model conversion, scaffolding & load-on-CPU #8

Draft
Vib-UX wants to merge 2 commits into tetherto:master from Vib-UX:feat/ltx2-video-generation

Vib-UX commented Jun 1, 2026

Summary

First milestone (M1) of LTX-2.3 video-generation support for the Tether LTX-2 bounty:
model conversion + scaffolding + "model loads on CPU", scoped to the video stream only
(audio DiT / Audio-VAE / vocoder are explicitly out of scope and dropped during conversion).

This PR is intentionally load-only. End-to-end T2V/I2V inference, the Gemma-3 text encoder,
and the CausalVideoAutoencoder land in later milestones (M2+).

What's included

GGUF conversion tooling

  • script/convert_ltx2_to_gguf.py: safetensors → GGUF converter that keeps only the video
    stream (drops audio DiT, AV cross-attention, Audio-VAE, vocoder). Supports f16, q8_0,
    q5_1, q4_0. Filtering/naming is pure-stdlib so --dry-run and --self-test need no
    heavy deps. Validated against the real ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
  • script/requirements-ltx2.txt: deps for the full conversion path (numpy, safetensors, gguf).
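The video-only filtering described above can be pictured as a plain string filter over tensor names. The following is a minimal sketch, not the code in script/convert_ltx2_to_gguf.py: the marker substrings in `AUDIO_MARKERS` and every tensor name in `SAMPLE` are hypothetical, chosen only to show how a pure-stdlib filter and its self-test might fit together.

```python
# Hypothetical substrings marking audio-only tensors; the real converter's
# rules are not shown in this PR description and may differ.
AUDIO_MARKERS = ("audio", "vocoder")

# Made-up tensor names with the stream each one is assumed to belong to.
SAMPLE = {
    "patchify_proj.weight": "video",
    "transformer_blocks.0.attn1.to_q.weight": "video",
    "audio_patchify_proj.weight": "audio",
    "vocoder.conv_pre.weight": "audio",
}


def is_video_tensor(name: str) -> bool:
    """Return True when the name carries none of the audio markers."""
    lowered = name.lower()
    return not any(marker in lowered for marker in AUDIO_MARKERS)


def split_tensors(names):
    """Partition tensor names into (kept_video, dropped_audio) lists."""
    kept, dropped = [], []
    for name in names:
        (kept if is_video_tensor(name) else dropped).append(name)
    return kept, dropped


def self_test() -> int:
    """Count mistakes against the labelled sample; 0 means the filter agrees."""
    kept, dropped = split_tensors(SAMPLE)
    leaks = [n for n in kept if SAMPLE[n] == "audio"]    # audio that slipped through
    lost = [n for n in dropped if SAMPLE[n] == "video"]  # video that was dropped
    return len(leaks) + len(lost)
```

Because the filter only touches strings, a check of this kind can run before numpy, safetensors, or gguf are installed, which is the property the `--dry-run` and `--self-test` flags rely on.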

Model registration & detection

  • src/model.h: VERSION_LTX2, sd_version_is_ltx2(), included in sd_version_is_dit().
  • src/stable-diffusion.cpp: "LTX-2" version string.
  • src/model.cpp: architecture auto-detection via video_embeddings_connector / patchify_proj.
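As an illustration of name-based architecture detection, the check can be thought of as scanning tensor names for the two markers listed above. The sketch below is Python rather than the C++ in src/model.cpp, and requiring both markers is an assumption: the PR text does not say whether either marker alone is sufficient.

```python
# Marker substrings taken from the PR description.
LTX2_MARKERS = ("video_embeddings_connector", "patchify_proj")


def looks_like_ltx2(tensor_names) -> bool:
    """Return True when every marker occurs in at least one tensor name.

    Assumption: both markers must be present. The real detection logic
    may accept either one, or apply additional checks.
    """
    names = list(tensor_names)
    return all(any(marker in name for name in names) for marker in LTX2_MARKERS)
```

Requiring both markers is the stricter choice, which would lower the chance of misclassifying another DiT checkpoint that happens to contain a `patchify_proj` tensor.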

DiT scaffolding (load-only)

  • src/ltx2.hpp: config-driven video DiT block tree — patchify_proj, proj_out,
    adaln_single / prompt_adaln_single, the 8-layer video_embeddings_connector (learnable
    registers), and the 48 transformer blocks (gated self/cross attention with RMS qk-norm, gelu
    FFN, modulation tables). Ltx2Runner infers geometry from checkpoint shapes, so reduced-size
    synthetic checkpoints load through the exact same path as the real weights.
  • src/diffusion_model.hpp: Ltx2Model adapter.
  • src/stable-diffusion.cpp: LTX-2 branch in init(), null-conditioner guards (Gemma is M2),
    FakeVAE placeholder, FLOW_PRED denoiser, and a graceful generate_video stop instead of a crash.
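Inferring geometry from checkpoint shapes, as Ltx2Runner is described as doing, can be sketched in a few lines. This is an illustration under stated assumptions, not the logic in src/ltx2.hpp: the tensor name patterns, the (out_features, in_features) shape order, and the `head_dim` parameter are all hypothetical.

```python
import re

# Hypothetical name patterns for DiT blocks and connector layers.
DIT_BLOCK = re.compile(r"^transformer_blocks\.(\d+)\.")
CONNECTOR_BLOCK = re.compile(r"^video_embeddings_connector\.blocks\.(\d+)\.")


def count_blocks(names, pattern):
    """Return highest block index + 1 among matching names, or 0 if none match."""
    indices = {int(m.group(1)) for m in map(pattern.match, names) if m}
    return max(indices) + 1 if indices else 0


def infer_geometry(shapes, head_dim):
    """Derive model geometry from a {tensor_name: shape} mapping.

    Assumes patchify_proj.weight is stored as (dim, in_channels) and that
    the head count equals dim // head_dim; both are illustrative guesses.
    """
    dim = shapes["patchify_proj.weight"][0]
    return {
        "num_layers": count_blocks(shapes, DIT_BLOCK),
        "connector_layers": count_blocks(shapes, CONNECTOR_BLOCK),
        "dim": dim,
        "heads": dim // head_dim,
    }


# A reduced-size synthetic checkpoint goes through the same function as a
# full-size one; only the numbers in the shapes differ.
synthetic = {
    "patchify_proj.weight": (64, 16),
    "transformer_blocks.0.attn1.to_q.weight": (64, 64),
    "transformer_blocks.1.attn1.to_q.weight": (64, 64),
    "video_embeddings_connector.blocks.0.attn1.to_q.weight": (64, 64),
}
geometry = infer_geometry(synthetic, head_dim=16)
```

Since nothing is hard-coded to 48 blocks or an 8-layer connector, the same path would handle both the tiny CI checkpoint and the real weights.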

CI & verification (no large download)

  • script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator (real tensor names, reduced dims).
  • script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test.
  • .github/workflows/ltx2.yml: Linux x86-64 — converter filter self-test → build sd-cli → load smoke.

Docs

  • docs/ltx2.md: M1 build/conversion guide + verification steps; linked from README.md.
  • docs/ltx2_feasibility.md: architecture research, scope and risks.

Scope

In scope (project-wide, across all milestones): video DiT, Video-VAE encoder+decoder, Gemma-3 text encoder,
scheduler + CFG, T2V and I2V, GGUF conversion, CLI, C API.
Out of scope: audio stream (audio DiT, Audio-VAE, vocoder), training/fine-tuning, spatial upscaler, V2V.

Test plan

  • python script/convert_ltx2_to_gguf.py --self-test passes (audio dropped, video kept).
  • python script/convert_ltx2_to_gguf.py --src ... --dry-run validated against the real
    ltx-2.3-22b-dev header (0 audio leaks).
  • cmake --build build --target sd-cli compiles clean (no new -Wall/-Wextra warnings).
  • bash script/ci_ltx2_load_smoke.sh passes: synthetic GGUF detected as Version: LTX-2,
    geometry inferred (num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.
  • CI green on Linux x86-64 (.github/workflows/ltx2.yml).
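The dry-run validation against a checkpoint header works without loading any weights because a safetensors file starts with an 8-byte little-endian length followed by a JSON header listing every tensor. The sketch below shows one way to read that header with the standard library only. The function names, the `is_audio` predicate, and the tensor names in `demo()` are hypothetical and are not taken from script/convert_ltx2_to_gguf.py.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Parse only the JSON header: 8-byte little-endian length, then UTF-8 JSON."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def dry_run_report(path, is_audio):
    """Count tensors that would be kept or dropped, without reading tensor data."""
    header = read_safetensors_header(path)
    dropped = [name for name in header if is_audio(name)]
    return {
        "total": len(header),
        "kept": len(header) - len(dropped),
        "dropped": len(dropped),
    }


def demo():
    """Write a tiny stand-in file and report on it (tensor names are made up)."""
    entries = {
        "__metadata__": {"format": "pt"},
        "patchify_proj.weight": {"dtype": "F16", "shape": [4, 2], "data_offsets": [0, 16]},
        "vocoder.conv_pre.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [16, 24]},
    }
    blob = json.dumps(entries).encode("utf-8")
    fd, path = tempfile.mkstemp(suffix=".safetensors")
    try:
        with os.fdopen(fd, "wb") as f:
            # Header length, header JSON, then 24 zero bytes standing in for data.
            f.write(struct.pack("<Q", len(blob)) + blob + bytes(24))
        return dry_run_report(path, is_audio=lambda n: "vocoder" in n or "audio" in n)
    finally:
        os.remove(path)
```

Reading only the header keeps memory use small even for a multi-gigabyte checkpoint, since the tensor data after the header is never touched.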

Milestone roadmap

  • M1 (this PR): conversion + scaffolding + loads on CPU. ✅
  • M2: Gemma-3 text encoder, CausalVideoAutoencoder, DiT forward → end-to-end T2V/I2V on CPU; Q4/Q8 checkpoints.
  • M3: Vulkan + Metal backends, benchmarks.
  • M4: C API, Bare addon, full test suite, docs polish.

Notes for reviewers

  • M1 acceptance is "model loads on CPU"; generation is deliberately a graceful no-op until M2.
  • The F16 GGUF checkpoint artifact (to be published on HuggingFace) requires a one-time conversion run over
    the full ~46 GB safetensors file on a larger machine. The tooling here is complete and dry-run-validated.
  • Kept as a Draft until CI is green and the F16 checkpoint is published.
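When planning the conversion run and the later quantized artifacts, on-disk tensor size can be estimated from the ggml block layouts of the four supported types. The sketch below is a rough planning aid under stated assumptions: it treats every weight as stored in the chosen type (real converters often keep some tensors, such as norms and biases, at higher precision), it ignores GGUF metadata and alignment padding, and the function name is hypothetical.

```python
# (weights per block, bytes per block) for each ggml type:
#   f16  : 2 bytes per weight
#   q8_0 : fp16 scale + 32 int8 values                              = 34 bytes
#   q5_1 : fp16 scale + fp16 min + 4 bytes high bits + 16 bytes low = 24 bytes
#   q4_0 : fp16 scale + 16 bytes of packed 4-bit values             = 18 bytes
BLOCK_LAYOUT = {
    "f16": (1, 2),
    "q8_0": (32, 34),
    "q5_1": (32, 24),
    "q4_0": (32, 18),
}


def estimate_tensor_bytes(n_weights: int, qtype: str) -> int:
    """Return the payload size in bytes of n_weights stored as qtype."""
    block_weights, block_bytes = BLOCK_LAYOUT[qtype]
    if n_weights % block_weights:
        raise ValueError(f"{qtype} needs a multiple of {block_weights} weights")
    return n_weights // block_weights * block_bytes


def bits_per_weight(qtype: str) -> float:
    """Return the effective storage cost per weight, including block scales."""
    block_weights, block_bytes = BLOCK_LAYOUT[qtype]
    return block_bytes * 8 / block_weights
```

For example, `bits_per_weight("q4_0")` works out to 4.5 and `bits_per_weight("q8_0")` to 8.5, so a q4_0 file would be a little over a quarter of the size of its f16 counterpart.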

Vib-UX added 2 commits June 1, 2026 10:10
…ool, model registration

Begins LTX-2.3 (video-only) support for the Tether LTX-2 bounty (M1).

- docs/ltx2_feasibility.md: research findings (architecture, scope, risks).
- script/convert_ltx2_to_gguf.py (+requirements-ltx2.txt): safetensors -> GGUF
  converter that keeps only the video stream (drops audio DiT, AV cross-attn,
  audio VAE, vocoder). Filtering/naming is pure-stdlib so --dry-run needs no
  heavy deps; F16 plus Q4_0/Q5_1/Q8_0 supported. Validated against the real
  ltx-2.3-22b-dev header (1758 video tensors, 0 audio leaks).
- model.h / stable-diffusion.cpp / model.cpp: register VERSION_LTX2,
  sd_version_is_ltx2(), include in sd_version_is_dit(), "LTX-2" version string,
  and weight detection via video_embeddings_connector / patchify_proj.

Add the config-driven LTX-2 video-DiT block tree (src/ltx2.hpp) and an
Ltx2Model diffusion-model adapter, then wire VERSION_LTX2 into init():
construct the runner, allocate params on CPU, and bind every tensor.
Geometry is inferred from checkpoint shapes so reduced-size synthetic
checkpoints load through the same path as the real weights.

- src/ltx2.hpp: DiT (patchify/proj_out/adaln/connector + 48 blocks),
  gated attention, FFN, modulation tables; Ltx2Runner with shape inference.
- diffusion_model.hpp: Ltx2Model adapter (M1 is load-only).
- stable-diffusion.cpp: LTX-2 branch, null-conditioner guards (Gemma is M2),
  FakeVAE placeholder, FLOW_PRED denoiser, graceful generate_video stop.
- script/make_synthetic_ltx2_gguf.py: tiny synthetic DiT GGUF generator.
- script/ci_ltx2_load_smoke.sh: load-on-CPU smoke test (no large download).
- script/convert_ltx2_to_gguf.py: add --self-test filter validation.
- .github/workflows/ltx2.yml: Linux x86-64 build + load smoke.
- docs/ltx2.md + README links.

Verified locally: synthetic GGUF detected as LTX-2, geometry inferred
(num_layers/dim/heads/connector), all tensors bound on CPU, clean exit.
Vib-UX marked this pull request as draft June 1, 2026 05:18
Vib-UX changed the title from "Feat/ltx2 video generation" to "feat(ltx2): M1 — LTX-2.3 video model conversion, scaffolding & load-on-CPU" Jun 1, 2026