tetherto · Vib-UX · Jun 1, 2026 · Jun 1, 2026
diff --git a/.github/workflows/ltx2.yml b/.github/workflows/ltx2.yml
@@ -0,0 +1,68 @@
+name: LTX-2 CI
+
+# M1 deliverable: buildable project + "model loads on CPU" verified on
+# Linux x86-64 via a tiny synthetic GGUF (no large weight download).
+
+on:
+  workflow_dispatch:
+  push:
+    branches:
+      - master
+      - ci
+    paths:
+      - ".github/workflows/ltx2.yml"
+      - "src/ltx2.hpp"
+      - "src/diffusion_model.hpp"
+      - "src/model.*"
+      - "src/stable-diffusion.cpp"
+      - "script/convert_ltx2_to_gguf.py"
+      - "script/make_synthetic_ltx2_gguf.py"
+      - "script/ci_ltx2_load_smoke.sh"
+  pull_request:
+    types: [opened, synchronize, reopened]
+    paths:
+      - ".github/workflows/ltx2.yml"
+      - "src/ltx2.hpp"
+      - "src/diffusion_model.hpp"
+      - "src/model.*"
+      - "src/stable-diffusion.cpp"
+      - "script/convert_ltx2_to_gguf.py"
+      - "script/make_synthetic_ltx2_gguf.py"
+      - "script/ci_ltx2_load_smoke.sh"
+
+concurrency:
+  group: ltx2-${{ github.head_ref && github.ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  linux-x64-load-smoke:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v4
+        with:
+          submodules: recursive
+
+      - name: Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential cmake
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install conversion deps
+        run: pip install numpy gguf
+
+      - name: Validate conversion tooling (filter self-test)
+        run: python script/convert_ltx2_to_gguf.py --self-test
+
+      - name: Build sd-cli
+        run: |
+          cmake -B build -DCMAKE_BUILD_TYPE=Release
+          cmake --build build -j"$(nproc)" --target sd-cli
+
+      - name: LTX-2 load-on-CPU smoke test
+        run: bash script/ci_ltx2_load_smoke.sh
diff --git a/.gitignore b/.gitignore
@@ -13,3 +13,5 @@ output*.png
 models*
 *.log
 preview.png
+__pycache__/
+*.pyc
diff --git a/README.md b/README.md
@@ -59,6 +59,7 @@ API and command-line option may change frequently.***
     - [Qwen Image Edit series](./docs/qwen_image_edit.md)
   - Video Models
     - [Wan2.1/Wan2.2](./docs/wan.md)
+    - [LTX-2 (T2V/I2V)](./docs/ltx2.md) (work in progress)
   - [PhotoMaker](https://github.com/TencentARC/PhotoMaker) support.
   - Control Net support with SD 1.5
   - LoRA support, same as [stable-diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#lora)
@@ -138,6 +139,7 @@ If you want to improve performance or reduce VRAM/RAM usage, please refer to [pe
 - [🔥Qwen Image](./docs/qwen_image.md)
 - [🔥Qwen Image Edit series](./docs/qwen_image_edit.md)
 - [🔥Wan2.1/Wan2.2](./docs/wan.md)
+- [LTX-2 (T2V/I2V)](./docs/ltx2.md) (work in progress)
 - [🔥Z-Image](./docs/z_image.md)
 - [Ovis-Image](./docs/ovis_image.md)
 - [Anima](./docs/anima.md)

diff --git a/docs/ltx2.md b/docs/ltx2.md
@@ -0,0 +1,88 @@
+# LTX-2 (LTX-2.3) video generation
+
+Support for [Lightricks LTX-2](https://huggingface.co/Lightricks) text-to-video
+(T2V) and image-to-video (I2V) generation. Only the **video** stream is in
+scope; the audio stream (audio DiT, audio VAE, vocoder) is intentionally not
+converted or loaded.
+
+> Status: **work in progress.** Milestone M1 (model conversion + scaffolding +
+> "model loads on CPU") is implemented. End-to-end T2V/I2V inference, the
+> Gemma-3 text encoder and the CausalVideoAutoencoder land in later milestones.
+
+## What works today (M1)
+
+- Safetensors -> GGUF conversion tooling for the video-only DiT (+ VAE), at
+  `f16`, `q8_0`, `q5_1`, `q4_0`.
+- LTX-2 architecture auto-detection in the model loader.
+- The 14B video DiT (`AVTransformer3DModel`, video half) loads and binds all of
+  its parameters on CPU, with geometry inferred from the checkpoint shapes.
+- A CI smoke test that verifies the load path on Linux x86-64 without any large
+  download.
+
+## Model conversion
+
+The conversion script reads an LTX-2.3 `.safetensors` checkpoint, drops every
+audio/vocoder tensor, and writes a video-only GGUF.
+
+```bash
+pip install -r script/requirements-ltx2.txt
+
+# Inspect the keep/drop plan without writing anything (stdlib only):
+python script/convert_ltx2_to_gguf.py --src ltx-2.3-22b-dev.safetensors --dry-run
+
+# Convert to F16 (M1 deliverable):
+python script/convert_ltx2_to_gguf.py \
+    --src ltx-2.3-22b-dev.safetensors \
+    --dst ltx-2.3-22b-video-f16.gguf --type f16
+
+# Quantized variants:
+python script/convert_ltx2_to_gguf.py --src ... --dst ...-q8_0.gguf --type q8_0
+python script/convert_ltx2_to_gguf.py --src ... --dst ...-q4_0.gguf --type q4_0
+```
+
+Quantization notes:
+
+- `f16` keeps every tensor in half precision (highest quality, largest file).
+- `q8_0` / `q5_1` / `q4_0` quantize only the large 2D DiT linear weights; norms,
+  biases, modulation tables and the VAE stay in higher precision.
+
+## Building
+
+Follow the standard [build guide](./build.md). The CLI binary is `sd-cli`:
+
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j --target sd-cli
+```
+
+## Verifying "loads on CPU" without the full weights
+
+M1 ships a tiny synthetic checkpoint generator so the load path can be exercised
+in seconds. It emits the exact DiT tensor names at drastically reduced
+dimensions; the C++ side infers the geometry from the shapes, so this is the
+same code path used for the real weights.
+
+```bash
+# requires numpy + gguf (see script/requirements-ltx2.txt)
+bash script/ci_ltx2_load_smoke.sh            # uses build/bin/sd-cli
+```
+
+Expected: the log reports `Version: LTX-2`, the inferred DiT geometry, and
+`loading tensors completed`, then exits early (generation is not available yet
+in M1).
+
+## CI
+
+`.github/workflows/ltx2.yml` runs on Linux x86-64:
+
+1. validates the conversion filter (`--self-test`),
+2. builds `sd-cli`,
+3. runs the synthetic load-on-CPU smoke test.
+
+## Scope
+
+In scope: video DiT, Video-VAE encoder+decoder, Gemma-3 text encoder, a noise
+scheduler + CFG, T2V and I2V, GGUF conversion, CLI, C API.
+
+Out of scope: the audio stream (audio DiT, Audio-VAE, vocoder), training /
+fine-tuning, the spatial upscaler, and video-to-video.
diff --git a/docs/ltx2_feasibility.md b/docs/ltx2_feasibility.md
@@ -0,0 +1,199 @@
+# LTX-2 Support — Feasibility & Research Findings
+
+Status: research pass (Step 1). No code plan yet. This document captures what LTX-2
+actually is, what this repo already provides, and the open
+decisions/risks that must be resolved before a concrete implementation plan.
+
+---
+
+## 1. What LTX-2 actually is
+
+LTX-2 is Lightricks' open-weights **audio-video** DiT foundation model (released
+2026-01-06, arXiv:2601.03233). It is **not** the same as the older `LTX-Video` model
+(arXiv:2501.00103) for which this repo has a 73-line stub at
+[src/ltxv.hpp](src/ltxv.hpp) (only a `CausalConv3d` block, wired into nothing).
+
+Two HuggingFace repos exist:
+
+- `Lightricks/LTX-2` — the **19B** model = **14B video stream + 5B audio stream**.
+  - `ltx-2-19b-dev.safetensors` (~43 GB bf16), `...-fp8` (~27 GB), `...-fp4` (~20 GB),
+    `ltx-2-19b-distilled.safetensors` (8 steps, CFG=1), distilled LoRA, spatial/temporal upscalers.
+- `Lightricks/LTX-2.3` — a newer **22B** variant: `ltx-2.3-22b-dev.safetensors`,
+  `ltx-2.3-22b-distilled-1.1.safetensors`, `ltx-2.3-22b-distilled-lora-384-1.1.safetensors`.
+
+Reference code:
+
+- Official monorepo: `github.com/Lightricks/LTX-2` (`packages/ltx-core` has the model defs;
+  `packages/ltx-pipelines` has T2V/I2V pipelines).
+- Cleaner third-party PyTorch port: `deepbeepmeep/Wan2GP` `models/ltx2/ltx_core/...`
+  (useful, more digestible reference for VAE/transformer porting).
+
+### Scope per the grant
+
+In scope: **video stream only** — T2V + I2V. Out of scope (explicitly): the 5B audio
+stream, Audio-VAE, vocoder, spatial/temporal upscalers, video-to-video, training, GUI.
+
+This matters: LTX-2 has a **video-only inference path** (`VideoGemmaTextEncoderModel` +
+the transformer run without the audio stream / without A↔V cross-attention), so dropping
+audio is supported by design — we do not need to implement the audio half.
+
+---
+
+## 2. In-scope components and architecture
+
+```mermaid
+flowchart TD
+    P["Text prompt (+ optional image for I2V)"] --> TE
+    subgraph TE [Text Encoder - NET NEW]
+        G["Gemma 3-12B backbone (decoder-only LLM, frozen)"] --> FE["Multi-Layer Feature Extractor"]
+        FE --> TC["Text Connector (bidirectional, register/thinking tokens)"]
+    end
+    TC -->|"video context [B, seq, 4096]"| DiT
+    IMG["Input image / frames"] --> VAEenc["Video VAE encoder (I2V cond)"]
+    VAEenc --> DiT
+    NOISE["Init noise latent [B,128,F',H/32,W/32]"] --> DiT
+    subgraph DiT [Video DiT - 14B, ~48 blocks]
+        SA["Self-Attn + 3D RoPE"] --> TX["Text Cross-Attn"] --> FF["FFN, RMSNorm, AdaLN"]
+    end
+    DiT -->|denoise loop, flow-matching + CFG| LAT["Final video latents"]
+    LAT --> VAEdec["Video VAE decoder"]
+    VAEdec --> OUT["Frames -> MP4/AVI"]
+```
+
+### 2a. Video VAE (spatiotemporal causal)
+
+- Compression **32×32×8** (spatial 32, temporal 8), **128 latent channels**, no patchifier
+  in the transformer (1×1×1).
+- Encoder: `[B,3,F,H,W] -> [B,128, 1+(F-1)/8, H/32, W/32]`; requires `(F-1) % 8 == 0`.
+- Decoder: `[B,128,F,H,W] -> [B,3, 1+(F-1)*8, H*32, W*32]`.
+- Uses causal 3D convs, `PixelNorm`/`GroupNorm`, SiLU, and per-channel latent
+  mean/std statistics baked into the encoder (`per_channel_statistics.normalize`).
+- Reuse signal: the repo already implements causal 3D VAE machinery for Wan
+  (`WAN::CausalConv3d`, `WanVAE`, `WanVAERunner` in [src/wan.hpp](src/wan.hpp)) and the
+  stub `LTXV::CausalConv3d` in [src/ltxv.hpp](src/ltxv.hpp). The op set (Conv3d, causal
+  padding) is already present.
+
+### 2b. Video DiT (14B)
+
+- ~48 transformer blocks (shared depth with audio; video stream width is larger).
+- Each block (video-only path): RMSNorm + **AdaLN** (timestep-conditioned) -> **Self-Attn
+  with 3D RoPE** -> RMSNorm -> **Text Cross-Attn** -> RMSNorm + AdaLN -> **FFN**.
+  (The A↔V cross-attention sublayer is the audio coupling and is skipped for video-only.)
+- Reuse signal: this is structurally close to `WAN::Wan` /`WanAttentionBlock`
+  (AdaLN modulation, self-attn + text cross-attn, 3D RoPE via `Rope::gen_wan_pe` in
+  [src/rope.hpp](src/rope.hpp)). The DiffusionModel adapter pattern is `WanModel` in
+  [src/diffusion_model.hpp](src/diffusion_model.hpp).
+
+### 2c. Text encoder — Gemma 3-12B + Feature Extractor + Text Connector (BIGGEST NET-NEW PIECE)
+
+- Backbone: **Gemma 3-12B** decoder-only LLM (frozen), multilingual.
+  - The repo's LLM support (`src/llm.hpp`, `enum LLMArch { QWEN2_5_VL, QWEN3,
+MISTRAL_SMALL_3_2 }`) does **not** include Gemma 3. This is net-new: Gemma 3 attention
+    (sliding-window + global layers), RMSNorm, GeGLU MLP, and a SentencePiece/Gemma tokenizer.
+- Multi-Layer Feature Extractor: aggregates **all** decoder layers `[B,T,D,L]`, mean-centered
+  scaling, flatten to `[B,T,D×L]`, learnable projection `W` (trained with LTX-2).
+- Text Connector: bidirectional transformer with learnable **register / "thinking" tokens**
+  replacing padded positions; the video connector outputs **`[B, seq, 4096]`**.
+- The Feature Extractor + Connector weights live in the LTX-2 checkpoint (not in Gemma).
+
+### 2d. Scheduler / guidance
+
+- LTX-2 uses flow-matching with `LTX2Scheduler` / `LinearQuadratic` timestep schedule and
+  Euler updates; CFG (and optionally STG/APG, out of scope).
+- Reuse signal: repo has `DiscreteFlowDenoiser` (`prediction_t::FLOW_PRED`) and ~14 sigma
+  schedulers in [src/denoiser.hpp](src/denoiser.hpp), but **no LinearQuadratic schedule**.
+  A new scheduler + the LTX timestep shift is net-new but small.
+
+---
+
+## 3. Mapping to existing repo infrastructure
+
+Reusable as-is or with light adaptation (Wan video pipeline is the template):
+
+- Video generation entrypoint `generate_video()`, `vid_gen` CLI mode, and the public C API
+  `sd_vid_gen_params_t` / `generate_video()` in
+  [include/stable-diffusion.h](include/stable-diffusion.h).
+- Latent geometry helpers (`generate_init_latent`, `process_latent_in/out`,
+  `get_vae_scale_factor`, `get_latent_channel`, frame alignment) in
+  [src/stable-diffusion.cpp](src/stable-diffusion.cpp) — need LTX values (scale 32, 128 ch,
+  `(F-1)%8` alignment).
+- `GGMLBlock`/`GGMLRunner` framework, 3D RoPE, Conv3d, RMSNorm, AdaLN, attention ops in
+  [src/ggml_extend.hpp](src/ggml_extend.hpp).
+- GGUF conversion path: C++ `-M convert` -> `convert()` -> `save_to_gguf_file()`
+  ([src/model.cpp](src/model.cpp)); tensor-name remap in
+  [src/name_conversion.cpp](src/name_conversion.cpp).
+- Model registration pattern: `enum SDVersion` + `sd_version_is_*` + detection in
+  `ModelLoader::get_sd_version()` ([src/model.h](src/model.h),
+  [src/model.cpp](src/model.cpp)).
+
+Net-new (no existing equivalent):
+
+1. **Gemma 3-12B encoder** + tokenizer (largest single piece).
+2. **LTX Feature Extractor + Text Connector** (LTX-specific trained modules).
+3. **LTX Video VAE** (new class; can borrow Wan/`LTXV` conv blocks).
+4. **LTX DiT** (new class; structurally similar to `WAN::Wan`).
+5. **LinearQuadratic / LTX2 scheduler**.
+6. **Video-stream extraction + name conversion** from the combined 19B/22B checkpoint.
+7. **CI workflows** (`.github/workflows/` does not exist today) and a **Python/C++ GGUF
+   conversion** path for LTX (no model-conversion Python currently in the repo).
+8. **MP4 output** — CLI currently writes **MJPEG-in-AVI** (`avi_writer.h`), not MP4; the
+   grant asks for MP4 or raw frames (raw frames is the low-risk option).
+
+---
+
+## 4. Key risks / feasibility concerns
+
+1. **Gemma 3-12B is a full 12B LLM used as the text encoder.** This is effectively a
+   second model to implement from scratch (new arch in `llm.hpp` + tokenizer). It is the
+   single largest risk to the timeline and is not optional — LTX-2 conditioning depends on
+   the multi-layer Gemma features.
+2. **Memory budget vs grant target.** Target is Q4 ≤ 12 GB RAM (CPU) / ≤ 10 GB VRAM (GPU),
+   but 14B DiT (Q4 ≈ 7–8 GB) + Gemma 3-12B (Q4 ≈ 7 GB) + VAE cannot co-reside in 10 GB.
+   Feasible only via **sequential component staging** (encode text -> free encoder ->
+   load DiT -> free -> load VAE). The repo's `offload_params_to_cpu` + per-component
+   loading supports this, but the target is tight and must be validated early.
+3. **Checkpoint ambiguity (14B vs 19B vs 22B).** Need the exact target. The bf16 `dev`
+   checkpoint (~43 GB) is the conversion source; `fp8`/`fp4` are NVIDIA-specific and not a
+   usable GGUF source. Disk/bandwidth is significant.
+4. **Quality metric methodology.** PSNR ≥ 25 dB / SSIM ≥ 0.85 vs PyTorch must compare the
+   **same single-stage pipeline** at F16. The recommended LTX pipeline is **two-stage with a
+   spatial upscaler** (out of scope), so the reference must be the single-stage
+   base/distilled path, fixed seed, fixed scheduler/steps.
+5. **dev vs distilled.** `distilled` (8 steps, CFG=1) best hits "2s clip in <5 min" and
+   halves compute (no CFG). `dev` needs CFG (2× forward passes) + more steps. Recommend
+   targeting distilled first for the success metrics, dev for the quality baseline.
+6. **LTX-2.3 distilled needs a distilled-LoRA** for the standard pipeline -> may require
+   LoRA fusion at convert time. Repo has LoRA support, but fusing into a video DiT is new.
+7. **PR 2 (Bare addon) target repo is "TBD"** in the grant — blocked until the repo link
+   exists; pattern is `bare-llama-cpp`.
+
+---
+
+## 5. Open decisions (need your input before the code plan)
+
+1. **Target checkpoint:** `Lightricks/LTX-2` 19B (video stream = the "14B" in the grant)
+   or `Lightricks/LTX-2.3` 22B? Recommendation: start with `LTX-2` 19B (matches the 14B
+   figure, slightly smaller, has the cleaner Wan2GP reference port).
+2. **dev vs distilled first:** Recommendation: bring up **distilled** first (fewer steps,
+   CFG=1) to reach end-to-end video fastest, then add `dev` + CFG for the quality baseline.
+3. **Output format:** raw frames + optional MJPEG-AVI (reuse existing writer) for M1–M2,
+   and add real **MP4 (H.264)** later? Or is MJPEG-AVI acceptable for "MP4 or raw frames"?
+4. **Gemma encoder strategy:** implement Gemma 3 natively in `llm.hpp`, or is reusing an
+   external GGUF Gemma 3 encoder acceptable (still needs the LTX feature-extractor +
+   connector on top)? This is the biggest scoping lever.
+5. **Conversion tooling:** Python script (HF safetensors -> GGUF, like llama.cpp's
+   `convert_hf_to_gguf.py`) vs extending the in-repo C++ `-M convert` path. Recommendation:
+   a Python converter for the DiT/VAE/connector + reuse llama.cpp tooling for Gemma.
+
+---
+
+## 6. Conclusion
+
+The bounty is feasible but large. The video DiT and Video VAE map well onto the existing
+**Wan** video template, and the core ggml ops (Conv3d, 3D RoPE, AdaLN, flow-matching) already
+exist. The dominant risk and effort is the **Gemma 3-12B text encoder + LTX feature
+extractor/connector**, followed by the **memory budget** for running a 14B DiT + 12B encoder
+on consumer hardware. Recommended first concrete step (M1): pin the checkpoint, build the
+GGUF conversion + name map, add the `SDVersion`/detection scaffolding, and get the **Video
+VAE decode** running on CPU (smallest self-contained, visually verifiable component) before
+tackling the DiT and Gemma encoder.
-Original file line number
+Diff line change
@@ Expand Up / @@ -13,3 +13,5 @@ output*.png @@
     models*
     *.log
     preview.png
+    __pycache__/
+    *.pyc