LTX-2 transformer profiling support by izorinLightricks · Pull Request #1 · izorinLightricks/autokernel

izorinLightricks · 2026-05-06T08:29:05Z

Summary

Two-commit PR that adds support for profiling and optimising the Lightricks LTX-2 20B diffusion transformer with AutoKernel, plus three pipeline-integration fixes that benefit any user on Hopper-class GPUs.

ltx integration + transformer wrapper (commit 1) — ltx-core git+https dep, LTXModelWrapper with no-arg __init__ that loads real 20B weights and feeds production-shape AV inputs at 1080p × 121f × 24fps.
profiler + extractor enhancement (commit 2) — fixes three silent format gaps in the profile.py → extract.py → bench.py handoff and adds opt-in source attribution + ATen/kernel dedup to profile.py.

Why these fixes matter beyond LTX

The two pipeline-integration bugs (extract shape parser, bench TEST_SIZES) plus the classifier extension (ATen op names + nvjet_*) affect every user running AutoKernel against a real PyTorch model on Hopper. Symptoms in upstream:

extract.py always falls back to hard-coded defaults (M=N=K=2048 for matmul) because its parser expects M=4096, N=4096 strings while torch.profiler emits [[bias], [M, K], [K, N], …].
bench.py ignores TEST_SIZES from the loaded kernel.py, so kernels extracted from real model profiles get benched against upstream's default sizes and any wins don't transfer back to the model.
On H100, ~24% of GPU time was silently bucketed as "other" because cuBLASLt's JIT kernels are named nvjet_* (not cublas*/gemm*/cutlass*), and another ~24% was the Command Buffer Full profiler pseudo-event rather than real kernel work.
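The classification symptom can be illustrated with a minimal sketch. The names `PSEUDO_EVENTS`, `MATMUL_ATEN_OPS`, `MATMUL_KERNEL_HINTS`, and `classify` are hypothetical and are not taken from profile.py; the kernel name in the usage lines is made up. Only the op names, the `nvjet_` prefix, and the pseudo-event labels come from this PR's description.

```python
from typing import Optional

# Assumption: illustrative constants, not the actual lists in profile.py.
PSEUDO_EVENTS = ("Command Buffer Full", "Activity Buffer Request", "Lazy Function Loading")
MATMUL_ATEN_OPS = {"aten::addmm", "aten::linear", "aten::mm", "aten::bmm"}
MATMUL_KERNEL_HINTS = ("nvjet_", "gemm", "cublas", "cutlass")


def classify(name: str) -> Optional[str]:
    """Return an op category, or None for profiler pseudo-events to be dropped."""
    # Pseudo-events carry device time but are not real GPU kernels.
    if any(label in name for label in PSEUDO_EVENTS):
        return None
    # ATen-level records are matched by exact op name.
    if name in MATMUL_ATEN_OPS:
        return "matmul"
    # Kernel-symbol records are matched by substring, now including nvjet_.
    lowered = name.lower()
    if any(hint in lowered for hint in MATMUL_KERNEL_HINTS):
        return "matmul"
    return "other"


print(classify("nvjet_tst_128x256"))    # hypothetical kernel symbol
print(classify("Command Buffer Full"))
```

Returning `None` for pseudo-events, instead of a category, lets the caller exclude them from the GPU-time total rather than merely relabel them.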

After this PR an end-to-end run produces honest numbers tied to the model's actual shapes.
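As a rough illustration of the shape-parsing gap, the sketch below reads both the legacy key=value form and the bracket form for two matmul ops. The function name `parse_matmul_shape` is hypothetical, and the sketch assumes the shapes arrive as a string that `ast.literal_eval` can read; the real extract.py covers more op types and may differ in every detail.

```python
import ast
import re
from typing import Dict, Optional


def parse_matmul_shape(op_name: str, shape_str: str) -> Optional[Dict[str, int]]:
    """Extract M, K, N from a shape string; return None if it cannot be parsed."""
    # Legacy form, e.g. "M=4096, N=4096, K=4096".
    legacy = dict(re.findall(r"\b([MNK])=(\d+)", shape_str))
    if set(legacy) == {"M", "N", "K"}:
        return {key: int(value) for key, value in legacy.items()}

    # Bracket form, e.g. "[[bias], [M, K], [K, N], ...]".
    try:
        shapes = ast.literal_eval(shape_str)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(shapes, list):
        return None

    # aten::addmm takes the bias first, so its operands start at index 1.
    offset = 1 if op_name == "aten::addmm" else 0
    if len(shapes) < offset + 2:
        return None
    a, b = shapes[offset], shapes[offset + 1]
    if not (isinstance(a, list) and isinstance(b, list)):
        return None
    if len(a) != 2 or len(b) != 2 or a[1] != b[0]:
        return None
    return {"M": a[0], "K": a[1], "N": b[1]}


print(parse_matmul_shape("aten::addmm", "[[4096], [32640, 4096], [4096, 4096], [], []]"))
```

Returning `None` on failure, rather than a default shape, keeps the fallback to hard-coded sizes an explicit decision by the caller.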

Test plan

`set -a && . ./.env && set +a && uv sync` succeeds
`uv run profile.py --model models/ltx_model.py --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16` runs end-to-end and writes `workspace/profile_report.json`
Default profile output: top-1 row is flash_attention (~19.6% × 2 due to ATen/kernel double-record); `Command Buffer Full` does NOT appear in the rankings; matmul rows show `aten::addmm` / `nvjet_*` correctly classified
`uv run profile.py … --with-source`: total GPU time drops by ~24% (dedup applied); rows show Module + Source columns; ATen partners are removed
`uv run extract.py --top 5` produces no "Could not parse shape" warnings; generated `workspace/kernel_*.py` files have real LTX shapes baked into `MODEL_SHAPES` (e.g. `M=32640, K=4096, N=4096`)
`uv run bench.py` against a generated `kernel_matmul_*.py` runs at the model's `model_primary` shape (32640×4096×4096), not the upstream default `large` (2048³)
Default-mode profile and bench output is structurally unchanged for kernels that do not declare `TEST_SIZES`
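The last two checks depend on how bench.py chooses its sizes. A minimal sketch of that selection follows, assuming `TEST_SIZES` is a list of `(label, shape_dict)` pairs; that layout, the helper name `pick_sizes`, the `DEFAULT_SIZES` stand-in, and the preference for `model_primary` over `large` when both exist are all assumptions made for illustration.

```python
from types import SimpleNamespace

# Assumption: stand-in for upstream's built-in defaults, not the real KERNEL_CONFIGS.
DEFAULT_SIZES = [("large", {"M": 2048, "N": 2048, "K": 2048})]


def pick_sizes(kernel_module, default_sizes=DEFAULT_SIZES, quick=False):
    """Prefer the kernel file's TEST_SIZES; fall back to defaults when absent."""
    sizes = getattr(kernel_module, "TEST_SIZES", None) or default_sizes
    if not quick:
        return list(sizes)
    # Quick mode: run only the primary size, whichever label is present.
    for primary in ("model_primary", "large"):
        hits = [(label, shape) for label, shape in sizes if label == primary]
        if hits:
            return hits
    return list(sizes[:1])


# A kernel module generated from a real profile (shape taken from the test plan).
extracted = SimpleNamespace(
    TEST_SIZES=[
        ("small", {"M": 128, "N": 128, "K": 128}),
        ("model_primary", {"M": 32640, "N": 4096, "K": 4096}),
    ]
)
print(pick_sizes(extracted, quick=True))
print(pick_sizes(SimpleNamespace(), quick=True))  # no TEST_SIZES declared
```

Kernels that do not declare `TEST_SIZES` take the same path as before, which is what keeps default-mode output structurally unchanged.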

Add support for profiling and optimising the Lightricks LTX-2 20B diffusion transformer (https://github.com/LightricksResearch/ltx-2-internal, pinned to main).

- pyproject.toml: ltx-core as a base dependency via git+https (subdirectory packages/ltx-core, pinned commit). The repo is LTX-only so it's not behind an extras flag.
- .env: GIT_LFS_SKIP_SMUDGE=1. The ltx-core repo references LFS objects in tests/assets/ that aren't accessible to anonymous cloners; skipping the smudge filter is harmless because the package source doesn't import any of those test fixtures.
- README.md: Quick Start now sources .env before `uv sync`.
- models/ltx_model.py: LTXModelWrapper with a no-arg `__init__` that loads real 20B weights via SingleGPUModelBuilder + the LTXV_MODEL_COMFY_RENAMING_MAP from a Comfy-format checkpoint on disk. forward(x) ignores its input and feeds production-shape AV Modality tensors built from ltx_core's patchifiers at the 1080p / 121f / 24fps configuration that matches PIPELINE_SIZE_1080P_121F in ltx-bench. Scope is the transformer only -- VAE, text encoder, audio VAE, vocoder, and the denoising loop are all replaced with random tensors at the dimensions LTXModel expects.

After this commit:

    set -a && . ./.env && set +a && uv sync
    uv run profile.py --model models/ltx_model.py \
        --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16

Three independent fixes that bridge format gaps in the profile -> extract -> bench pipeline, plus a richer profile.py output for the human reader. None of these are LTX-specific; they apply to any PyTorch model on any modern NVIDIA GPU. The classifier additions in particular fix silent miscategorisation that affected all H100 / cuBLASLt JIT users.

profile.py

- Classifier now recognises ATen op names (aten::addmm, aten::linear, aten::mm, aten::bmm) and Hopper cuBLASLt JIT kernels (nvjet_*) as matmul. Previously these landed in "other" because the upstream substring whitelist only matched gemm/cublas/cudnn-style kernel symbols -- so on H100 ~24% of real GPU time was silently bucketed as "other" instead of matmul.
- Filter CUPTI / driver pseudo-events ("Command Buffer Full", "Activity Buffer Request", "Lazy Function Loading", etc.) that torch.profiler reports with non-zero device time but which aren't real GPU kernels. ~24% of the headline GPU time on the LTX trace was Command Buffer Full alone.
- Reclassify aten::addcmul_ as rotary_embedding (RoPE's fused multiply-add in transformer models) and add an "elementwise" category for aten::add/sub/mul/div/pow + generic at::native::elementwise_kernel so the report no longer hides scale-shift / residual / gating math in "other".
- Optional --with-source flag enables torch.profiler with_modules=True + with_stack=True (paired with experimental_config(verbose=True) which PyTorch 2.11 silently requires for stack frames to populate). Adds Module + Source columns to the kernel-ranking table (e.g. "Attention_0 / attention.py(29): __call__"). Run cost is comparable to default mode on big models.
- Two-pass aggregation that runs only when --with-source is on:
  1. dedup ATen-vs-kernel double-recordings at the same (module_path, source_loc) with timing within 10% (the same physical kernel reported by torch.profiler at both the ATen layer and the kernel-symbol layer)
  2. aggregate per-call-site rows back to one row per (op_type, name, source_loc) so 48 transformer blocks' Q-projections at attention.py(181) collapse into a single row
  Keeps the kernel-symbol record (canonical GPU work) and transfers the ATen partner's input_shapes + classification + attribution.
- Drop the redundant Cumul column; surface a Shape column showing a compact view of evt.input_shapes (e.g. "32640x4096-4096x4096") for at-a-glance sanity-checking of what each row is.

extract.py

- Parse PyTorch's bracket-format shape strings ("[[bias], [M, K], [K, N], ...]") with op-type-aware extraction for matmul (addmm/linear/bmm/mm + nvjet/cudnn fallback), flash_attention, layernorm, rmsnorm, softmax, rotary_embedding, fused_mlp, cross_entropy, reduce. Previously parse_shape_info() only matched key=value strings ("M=4096, N=4096, K=4096") which torch.profiler never actually emits, so the fallback path used hard-coded defaults (M=N=K=2048 for matmul) for every real model. The optimisation loop was therefore tuning kernels for a fictional shape irrespective of the input model. Backwards-compatible: the legacy key=value parser still runs as a fallback.

bench.py

- Honor TEST_SIZES from the loaded kernel.py when present (extract.py emits these for every kernel it generates from a real model profile). Previously bench.py used its own KERNEL_CONFIGS defaults regardless of what kernel.py declared, so a kernel extracted at LTX's flash-attention shape was being benched against upstream's default (B=2, H=32, T=1024, D=64) and any wins wouldn't transfer to the model. Recognise "model_primary" as the primary-size label alongside "large" and route --quick to whichever is present.
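The two-pass aggregation described for profile.py could look roughly like the sketch below. The `Row` dataclass, the `dedup` and `aggregate` names, the module paths, and the kernel name are hypothetical. "Within 10%" is interpreted here as relative to the larger of the two timings, which is an assumption rather than something the commit message specifies.

```python
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Row:
    name: str
    op_type: str
    module_path: str
    source_loc: str
    time_us: float
    input_shapes: str = ""


def dedup(rows: List[Row], tol: float = 0.10) -> List[Row]:
    """Pass 1: drop the ATen twin of a kernel record at the same call site."""
    by_site = defaultdict(list)
    for row in rows:
        by_site[(row.module_path, row.source_loc)].append(row)
    out = []
    for site_rows in by_site.values():
        aten = [r for r in site_rows if r.name.startswith("aten::")]
        kernels = [r for r in site_rows if not r.name.startswith("aten::")]
        used = set()
        for kernel in kernels:
            for i, partner in enumerate(aten):
                close = abs(partner.time_us - kernel.time_us) <= tol * max(
                    partner.time_us, kernel.time_us
                )
                if i not in used and close:
                    used.add(i)
                    # Keep the kernel record, inherit the ATen metadata.
                    kernel = replace(
                        kernel,
                        op_type=partner.op_type,
                        input_shapes=partner.input_shapes,
                    )
                    break
            out.append(kernel)
        # ATen records with no matching kernel are kept as-is.
        out.extend(r for i, r in enumerate(aten) if i not in used)
    return out


def aggregate(rows: List[Row]) -> List[Row]:
    """Pass 2: collapse per-call-site rows to one row per (op_type, name, source_loc)."""
    totals = {}
    for row in rows:
        key = (row.op_type, row.name, row.source_loc)
        if key in totals:
            totals[key] = replace(totals[key], time_us=totals[key].time_us + row.time_us)
        else:
            totals[key] = row
    return sorted(totals.values(), key=lambda r: r.time_us, reverse=True)


rows = [
    Row("aten::addmm", "matmul", "blocks.0.attn", "attention.py(181)", 100.0, "32640x4096-4096x4096"),
    Row("nvjet_example", "matmul", "blocks.0.attn", "attention.py(181)", 98.0),
    Row("aten::addmm", "matmul", "blocks.1.attn", "attention.py(181)", 100.0, "32640x4096-4096x4096"),
    Row("nvjet_example", "matmul", "blocks.1.attn", "attention.py(181)", 102.0),
]
for row in aggregate(dedup(rows)):
    print(row.name, row.source_loc, row.time_us, row.input_shapes)
```

Grouping by call site before matching keeps the pairing local, so two unrelated ops with similar durations elsewhere in the model are not merged by accident.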

izorinLightricks added 2 commits May 6, 2026 08:26

izorinLightricks wants to merge 2 commits into main from ltx-integration
