LTX-2 transformer profiling support#1

Open: izorinLightricks wants to merge 2 commits into main from ltx-integration


@izorinLightricks izorinLightricks commented May 6, 2026

Summary

Two-commit PR that adds support for profiling and optimising the Lightricks LTX-2 20B diffusion transformer with AutoKernel, plus three pipeline-integration fixes that benefit any user on Hopper-class GPUs.

  • ltx integration + transformer wrapper (commit 1): ltx-core git+https dep, LTXModelWrapper with no-arg __init__ that loads real 20B weights and feeds production-shape AV inputs at 1080p × 121f × 24fps.
  • profiler + extractor enhancement (commit 2): fixes three silent format gaps in the profile.py -> extract.py -> bench.py handoff and adds opt-in source attribution + ATen/kernel dedup to profile.py.

Why these fixes matter beyond LTX

The two pipeline-integration bugs (extract shape parser, bench TEST_SIZES) plus the classifier extension (ATen op names + nvjet_*) affect every user running AutoKernel against a real PyTorch model on Hopper. Symptoms in upstream:

  • extract.py always falls back to hard-coded defaults (M=N=K=2048 for matmul) because its parser expects M=4096, N=4096 strings while torch.profiler emits [[bias], [M, K], [K, N], …].
  • bench.py ignores TEST_SIZES from the loaded kernel.py, so kernels extracted from real model profiles get benched against upstream's default sizes and any wins don't transfer back to the model.
  • On H100, ~24% of GPU time is silently bucketed as "other" because cuBLASLt's JIT kernels are named nvjet_* (not cublas*/gemm*/cutlass*), and another ~24% of the reported GPU time is the Command Buffer Full profiler pseudo-event rather than real kernel work.

After this PR an end-to-end run produces honest numbers tied to the model's actual shapes.
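As a rough illustration of the classifier gap described above, the sketch below maps a profiler event name to a category. The constant names, the `classify` function, and the example kernel symbol are hypothetical and are not taken from profile.py; only the ATen op names, the `nvjet_` prefix, and the pseudo-event names come from this PR's description.

```python
# Hypothetical sketch; names here are illustrative, not the real profile.py API.
PSEUDO_EVENTS = (
    "Command Buffer Full",
    "Activity Buffer Request",
    "Lazy Function Loading",
)
MATMUL_ATEN_OPS = {"aten::addmm", "aten::linear", "aten::mm", "aten::bmm"}
MATMUL_KERNEL_MARKERS = ("nvjet_", "gemm", "cublas", "cutlass")


def classify(name):
    """Return a category for an event name, or None for profiler pseudo-events."""
    # Pseudo-events carry device time but are not GPU kernels, so drop them.
    if any(marker in name for marker in PSEUDO_EVENTS):
        return None
    # ATen-level op names are matched exactly.
    if name in MATMUL_ATEN_OPS:
        return "matmul"
    # Kernel symbols are matched by substring, including Hopper's nvjet_* kernels.
    lowered = name.lower()
    if any(marker in lowered for marker in MATMUL_KERNEL_MARKERS):
        return "matmul"
    return "other"
```

Returning `None` for pseudo-events lets the caller exclude them from both the rankings and the GPU-time total, rather than hiding them in "other".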

Test plan

  • `set -a && . ./.env && set +a && uv sync` succeeds
  • `uv run profile.py --model models/ltx_model.py --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16` runs end-to-end and writes `workspace/profile_report.json`
  • Default profile output: top-1 row is flash_attention (~19.6% × 2 due to ATen/kernel double-record); `Command Buffer Full` does NOT appear in the rankings; matmul rows show `aten::addmm` / `nvjet_*` correctly classified
  • `uv run profile.py … --with-source`: total GPU time drops by ~24% (dedup applied); rows show Module + Source columns; ATen partners are removed
  • `uv run extract.py --top 5` produces no "Could not parse shape" warnings; generated `workspace/kernel_*.py` files have real LTX shapes baked into `MODEL_SHAPES` (e.g. `M=32640, K=4096, N=4096`)
  • `uv run bench.py` against a generated `kernel_matmul_*.py` runs at the model's `model_primary` shape (32640×4096×4096), not the upstream default `large` (2048³)
  • Default-mode profile and bench output is structurally unchanged for kernels that do not declare `TEST_SIZES`

Add support for profiling and optimising the Lightricks LTX-2 20B
diffusion transformer (https://github.com/LightricksResearch/ltx-2-internal,
pinned to main).

  - pyproject.toml: ltx-core as a base dependency via git+https
    (subdirectory packages/ltx-core, pinned commit). The repo is
    LTX-only so it's not behind an extras flag.
  - .env: GIT_LFS_SKIP_SMUDGE=1. The ltx-core repo references LFS
    objects in tests/assets/ that aren't accessible to anonymous
    cloners; skipping the smudge filter is harmless because the
    package source doesn't import any of those test fixtures.
  - README.md: Quick Start now sources .env before `uv sync`.
  - models/ltx_model.py: LTXModelWrapper with a no-arg `__init__`
    that loads real 20B weights via SingleGPUModelBuilder + the
    LTXV_MODEL_COMFY_RENAMING_MAP from a Comfy-format checkpoint
    on disk. forward(x) ignores its input and feeds
    production-shape AV Modality tensors built from ltx_core's patchifiers
    at the 1080p / 121f / 24fps configuration that matches
    PIPELINE_SIZE_1080P_121F in ltx-bench. Scope is the transformer
    only -- VAE, text encoder, audio VAE, vocoder, and the denoising
    loop are all replaced with random tensors at the dimensions
    LTXModel expects.

After this commit:
  set -a && . ./.env && set +a && uv sync
  uv run profile.py --model models/ltx_model.py \
    --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16
Three independent fixes that bridge format gaps in the profile -> extract
-> bench pipeline, plus a richer profile.py output for the human reader.
None of these are LTX-specific; they apply to any PyTorch model on any
modern NVIDIA GPU. The classifier additions in particular fix silent
miscategorisation that affected all H100 / cuBLASLt JIT users.

profile.py
  - Classifier now recognises ATen op names (aten::addmm, aten::linear,
    aten::mm, aten::bmm) and Hopper cuBLASLt JIT kernels (nvjet_*) as
    matmul. Previously these landed in "other" because the upstream
    substring whitelist only matched gemm/cublas/cudnn-style kernel
    symbols -- so on H100 ~24% of real GPU time silently bucketed as
    "other" instead of matmul.
  - Filter CUPTI / driver pseudo-events ("Command Buffer Full",
    "Activity Buffer Request", "Lazy Function Loading", etc.) that
    torch.profiler reports with non-zero device time but which aren't
    real GPU kernels. ~24% of the headline GPU time on the LTX trace
    was Command Buffer Full alone.
  - Reclassify aten::addcmul_ as rotary_embedding (RoPE's fused
    multiply-add in transformer models) and add an "elementwise" category for
    aten::add/sub/mul/div/pow + generic at::native::elementwise_kernel
    so the report no longer hides scale-shift / residual / gating math
    in "other".
  - Optional --with-source flag enables torch.profiler with_modules=True
    + with_stack=True (paired with experimental_config(verbose=True)
    which PyTorch 2.11 silently requires for stack frames to populate).
    Adds Module + Source columns to the kernel-ranking table (e.g.
    "Attention_0 / attention.py(29): __call__"). Run cost is comparable
    to default mode on big models.
  - Two-pass aggregation that runs only when --with-source is on:
      1. dedup ATen-vs-kernel double-recordings at the same
         (module_path, source_loc) with timing within 10% (the same
         physical kernel reported by torch.profiler at both the ATen
         layer and the kernel-symbol layer)
      2. aggregate per-call-site rows back to one row per
         (op_type, name, source_loc) so 48 transformer blocks'
         Q-projections at attention.py(181) collapse into a single row
    Keeps the kernel-symbol record (canonical GPU work) and transfers
    the ATen partner's input_shapes + classification + attribution.
  - Drop the redundant Cumul column; surface a Shape column showing
    a compact view of evt.input_shapes (e.g. "32640x4096-4096x4096")
    for at-a-glance sanity-checking of what each row is.
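
The two-pass aggregation described above can be sketched roughly as follows. This is a simplified illustration, not the code in profile.py: the event dictionaries and their keys (`is_aten`, `time_us`, `module_path`, and so on) and both function names are assumptions made for the example.

```python
from collections import defaultdict


def dedup_aten_partners(events, tolerance=0.10):
    """Pass 1: drop ATen records that duplicate a kernel record at the same call site."""
    by_site = defaultdict(list)
    for ev in events:
        by_site[(ev["module_path"], ev["source_loc"])].append(ev)

    kept = []
    for site_events in by_site.values():
        aten = [e for e in site_events if e["is_aten"]]
        kernels = [e for e in site_events if not e["is_aten"]]
        for kernel in kernels:
            # A partner is an ATen record whose timing is within the tolerance.
            partner = next(
                (a for a in aten
                 if abs(a["time_us"] - kernel["time_us"])
                 <= tolerance * max(a["time_us"], kernel["time_us"])),
                None,
            )
            merged = dict(kernel)  # keep the kernel-symbol record
            if partner is not None:
                aten.remove(partner)
                # Transfer the ATen partner's shapes and classification.
                merged["input_shapes"] = partner["input_shapes"]
                merged["op_type"] = partner["op_type"]
            kept.append(merged)
        kept.extend(aten)  # ATen records without a kernel partner are kept
    return kept


def aggregate_by_call_site(events):
    """Pass 2: collapse per-call-site rows to one row per (op_type, name, source_loc)."""
    rows = {}
    for ev in events:
        key = (ev["op_type"], ev["name"], ev["source_loc"])
        row = rows.setdefault(key, {
            "op_type": ev["op_type"], "name": ev["name"],
            "source_loc": ev["source_loc"],
            "input_shapes": ev["input_shapes"],
            "time_us": 0.0, "calls": 0,
        })
        row["time_us"] += ev["time_us"]
        row["calls"] += 1
    return sorted(rows.values(), key=lambda r: r["time_us"], reverse=True)
```

Grouping by module path in the first pass and by source location in the second is what lets the same line of code, called from many transformer blocks, end up as a single row.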

extract.py
  - Parse PyTorch's bracket-format shape strings ("[[bias], [M, K],
    [K, N], ...]") with op-type-aware extraction for matmul (addmm/
    linear/bmm/mm + nvjet/cudnn fallback), flash_attention, layernorm,
    rmsnorm, softmax, rotary_embedding, fused_mlp, cross_entropy,
    reduce. Previously parse_shape_info() only matched key=value
    strings ("M=4096, N=4096, K=4096") which torch.profiler never
    actually emits, so the fallback path used hard-coded defaults
    (M=N=K=2048 for matmul) for every real model. The optimisation
    loop was therefore tuning kernels for a fictional shape
    irrespective of the input model. Backwards-compatible: the
    legacy key=value parser still runs as a fallback.
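
A minimal sketch of the bracket-format parsing for the matmul case is shown below, with the key=value parser as a fallback. The function name `parse_matmul_shape` and its return format are hypothetical, and only addmm/mm-style and linear-style argument layouts are handled here; the real parse_shape_info() covers more op types.

```python
import ast
import re


def parse_matmul_shape(shape_str, op_name="aten::addmm"):
    """Return {"M", "K", "N"} parsed from a shape string, or None if unparseable."""
    try:
        shapes = ast.literal_eval(shape_str)  # e.g. "[[4096], [32640, 4096], ...]"
    except (ValueError, SyntaxError):
        shapes = None

    if isinstance(shapes, list):
        # Keep only non-empty integer shapes; scalar arguments appear as [].
        dims = [s for s in shapes
                if isinstance(s, list) and s and all(isinstance(d, int) for d in s)]
        if op_name == "aten::linear" and len(dims) >= 2:
            x, w = dims[0], dims[1]  # input [..., K], weight [N, K]
            m = 1
            for d in x[:-1]:
                m *= d  # flatten leading dimensions into M
            return {"M": m, "K": x[-1], "N": w[0]}
        mats = [s for s in dims if len(s) == 2]
        if len(mats) >= 2:
            (m, k), (k2, n) = mats[0], mats[1]  # [M, K] @ [K, N]
            if k == k2:
                return {"M": m, "K": k, "N": n}

    # Legacy fallback: "M=4096, N=4096, K=4096"
    pairs = dict(re.findall(r"\b([MNK])\s*=\s*(\d+)", shape_str))
    if {"M", "N", "K"} <= pairs.keys():
        return {key: int(value) for key, value in pairs.items()}
    return None
```

Returning `None` instead of a default shape leaves the decision to fall back to hard-coded sizes visible to the caller, which can then emit a warning.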

bench.py
  - Honor TEST_SIZES from the loaded kernel.py when present (extract.py
    emits these for every kernel it generates from a real model
    profile). Previously bench.py used its own KERNEL_CONFIGS defaults
    regardless of what kernel.py declared, so a kernel extracted at
    LTX's flash-attention shape was being benched against upstream's
    default (B=2, H=32, T=1024, D=64) and any wins wouldn't transfer
    to the model. Recognise "model_primary" as the primary-size label
    alongside "large" and route --quick to whichever is present.
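
The size-selection behaviour described above might look roughly like the sketch below. The helper names, the default size list, and the assumption that TEST_SIZES is a list of (label, params) pairs are all hypothetical; the actual layout of TEST_SIZES and KERNEL_CONFIGS in bench.py may differ.

```python
from types import SimpleNamespace

# Hypothetical stand-in for bench.py's built-in defaults.
DEFAULT_SIZES = [("large", {"M": 2048, "N": 2048, "K": 2048})]
PRIMARY_LABELS = ("model_primary", "large")  # checked in this order


def resolve_test_sizes(kernel_module, default_sizes=DEFAULT_SIZES):
    """Prefer TEST_SIZES declared by the loaded kernel module, else the defaults."""
    sizes = getattr(kernel_module, "TEST_SIZES", None)
    return list(sizes) if sizes else list(default_sizes)


def pick_primary(sizes):
    """Return the (label, params) pair to treat as the primary benchmark size."""
    by_label = dict(sizes)
    for label in PRIMARY_LABELS:
        if label in by_label:
            return label, by_label[label]
    return sizes[0]  # no recognised label: fall back to the first entry


# Usage: a module object standing in for a generated kernel.py.
kernel = SimpleNamespace(TEST_SIZES=[
    ("small", {"M": 1024, "N": 1024, "K": 1024}),
    ("model_primary", {"M": 32640, "K": 4096, "N": 4096}),
])
primary_label, primary_params = pick_primary(resolve_test_sizes(kernel))
```

Modules that do not declare TEST_SIZES take the default path, which is consistent with the test-plan item that default-mode output stays structurally unchanged for such kernels.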