LTX-2 transformer profiling support #1
Open
izorinLightricks wants to merge 2 commits into
Add support for profiling and optimising the Lightricks LTX-2 20B
diffusion transformer
(https://github.com/LightricksResearch/ltx-2-internal, pinned to main).
- pyproject.toml: ltx-core as a base dependency via git+https
(subdirectory packages/ltx-core, pinned commit). The repo is LTX-only
so it's not behind an extras flag.
- .env: GIT_LFS_SKIP_SMUDGE=1. The ltx-core repo references LFS objects
in tests/assets/ that aren't accessible to anonymous cloners; skipping
the smudge filter is harmless because the package source doesn't
import any of those test fixtures.
- README.md: Quick Start now sources .env before `uv sync`.
- models/ltx_model.py: LTXModelWrapper with a no-arg `__init__` that
loads real 20B weights via SingleGPUModelBuilder + the
LTXV_MODEL_COMFY_RENAMING_MAP from a Comfy-format checkpoint on disk.
forward(x) ignores its input and feeds production-shape AV Modality
tensors built from ltx_core's patchifiers at the 1080p / 121f / 24fps
configuration that matches PIPELINE_SIZE_1080P_121F in ltx-bench.
Scope is the transformer only -- VAE, text encoder, audio VAE, vocoder,
and the denoising loop are all replaced with random tensors at the
dimensions LTXModel expects.
After this commit:
    set -a && . ./.env && set +a && uv sync
    uv run profile.py --model models/ltx_model.py \
        --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16
Three independent fixes that bridge format gaps in the profile -> extract
-> bench pipeline, plus a richer profile.py output for the human reader.
None of these are LTX-specific; they apply to any PyTorch model on any
modern NVIDIA GPU. The classifier additions in particular fix silent
miscategorisation that affected all H100 / cuBLASLt JIT users.
profile.py
- Classifier now recognises ATen op names (aten::addmm, aten::linear,
aten::mm, aten::bmm) and Hopper cuBLASLt JIT kernels (nvjet_*) as
matmul. Previously these landed in "other" because the upstream
substring whitelist only matched gemm/cublas/cudnn-style kernel
symbols -- so on H100 ~24% of real GPU time silently bucketed as
"other" instead of matmul.
- Filter CUPTI / driver pseudo-events ("Command Buffer Full",
"Activity Buffer Request", "Lazy Function Loading", etc.) that
torch.profiler reports with non-zero device time but which aren't
real GPU kernels. ~24% of the headline GPU time on the LTX trace
was Command Buffer Full alone.
- Reclassify aten::addcmul_ as rotary_embedding (RoPE's fused
multiply-add in transformer models) and add an "elementwise" category for
aten::add/sub/mul/div/pow + generic at::native::elementwise_kernel
so the report no longer hides scale-shift / residual / gating math
in "other".
- Optional --with-source flag enables torch.profiler with_modules=True
+ with_stack=True (paired with experimental_config(verbose=True)
which PyTorch 2.11 silently requires for stack frames to populate).
Adds Module + Source columns to the kernel-ranking table (e.g.
"Attention_0 / attention.py(29): __call__"). Run cost is comparable
to default mode on big models.
- Two-pass aggregation that runs only when --with-source is on:
1. dedup ATen-vs-kernel double-recordings at the same
(module_path, source_loc) with timing within 10% (the same
physical kernel reported by torch.profiler at both the ATen
layer and the kernel-symbol layer)
2. aggregate per-call-site rows back to one row per
(op_type, name, source_loc) so 48 transformer blocks'
Q-projections at attention.py(181) collapse into a single row
Keeps the kernel-symbol record (canonical GPU work) and transfers
the ATen partner's input_shapes + classification + attribution.
- Drop the redundant Cumul column; surface a Shape column showing
a compact view of evt.input_shapes (e.g. "32640x4096-4096x4096")
for at-a-glance sanity-checking of what each row is.
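
The two-pass aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code in profile.py: the `Row` record, its field names, and the functions `dedup_aten_kernel` and `aggregate` are hypothetical, and the kernel name `nvjet_example` is made up. Only the 10% tolerance, the `(module_path, source_loc)` pairing key, and the `(op_type, name, source_loc)` grouping key come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass
class Row:
    # Hypothetical per-call-site record; field names are assumptions.
    name: str
    op_type: str
    module_path: str
    source_loc: str
    device_us: float
    is_aten: bool
    input_shapes: str = ""


def dedup_aten_kernel(rows, tol=0.10):
    """Pass 1: pair each kernel row with an ATen row from the same call site
    whose device time is within `tol`, keep the kernel row, and copy the ATen
    partner's classification and shapes onto it."""
    by_site = defaultdict(list)
    for row in rows:
        by_site[(row.module_path, row.source_loc)].append(row)

    kept = []
    for site_rows in by_site.values():
        atens = [r for r in site_rows if r.is_aten]
        for kernel in (r for r in site_rows if not r.is_aten):
            partner = next(
                (a for a in atens
                 if abs(a.device_us - kernel.device_us)
                 <= tol * max(a.device_us, kernel.device_us)),
                None,
            )
            if partner is not None:
                atens.remove(partner)  # each ATen row pairs at most once
                kernel = replace(kernel, op_type=partner.op_type,
                                 input_shapes=partner.input_shapes)
            kept.append(kernel)
        kept.extend(atens)  # unmatched ATen rows are left as-is
    return kept


def aggregate(rows):
    """Pass 2: sum device time into one row per (op_type, name, source_loc)."""
    merged = {}
    for row in rows:
        key = (row.op_type, row.name, row.source_loc)
        if key in merged:
            total = merged[key].device_us + row.device_us
            merged[key] = replace(merged[key], device_us=total)
        else:
            merged[key] = row
    return sorted(merged.values(), key=lambda r: r.device_us, reverse=True)
```

Pairing is greedy (first ATen row within tolerance wins), which is a simplification chosen for the sketch; the merged row keeps the module path of the first call site it saw.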
extract.py
- Parse PyTorch's bracket-format shape strings ("[[bias], [M, K],
[K, N], ...]") with op-type-aware extraction for matmul (addmm/
linear/bmm/mm + nvjet/cudnn fallback), flash_attention, layernorm,
rmsnorm, softmax, rotary_embedding, fused_mlp, cross_entropy,
reduce. Previously parse_shape_info() only matched key=value
strings ("M=4096, N=4096, K=4096") which torch.profiler never
actually emits, so the fallback path used hard-coded defaults
(M=N=K=2048 for matmul) for every real model. The optimisation
loop was therefore tuning kernels for a fictional shape
irrespective of the input model. Backwards-compatible: the
legacy key=value parser still runs as a fallback.
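
As an illustration of the bracket-format parsing for the matmul case, here is a small sketch. The function name `parse_matmul_shape` and its return format are hypothetical and do not mirror extract.py; the operand layouts assumed are the usual ATen ones (`addmm(bias, mat1[M, K], mat2[K, N])`, `linear(input[..., K], weight[N, K], bias)`, `mm`/`bmm` with `[..., M, K] x [..., K, N]`).

```python
import ast
import math


def _is_matrix(shape):
    # A tensor argument with at least two integer dimensions.
    return (isinstance(shape, list) and len(shape) >= 2
            and all(isinstance(d, int) for d in shape))


def parse_matmul_shape(shape_str, op_name):
    """Return {"M", "N", "K"} from a bracket-format shape string, or None
    when the string is not in that format (e.g. legacy key=value text)."""
    try:
        shapes = ast.literal_eval(shape_str)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(shapes, list):
        return None

    # addmm carries the bias first; the other ops lead with the two operands.
    operands = shapes[1:3] if op_name == "aten::addmm" else shapes[0:2]
    if len(operands) < 2 or not all(_is_matrix(s) for s in operands):
        return None
    a, b = operands

    if op_name == "aten::linear":
        # input [..., K] flattened to [M, K]; weight is stored as [N, K].
        m, k, n = math.prod(a[:-1]), a[-1], b[0]
    else:
        # addmm / mm / bmm: [..., M, K] x [..., K, N]
        m, k, n = a[-2], a[-1], b[-1]
    return {"M": m, "N": n, "K": k}
```

A caller would fall back to the legacy key=value parser, and only then to defaults, when this returns `None`.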
bench.py
- Honor TEST_SIZES from the loaded kernel.py when present (extract.py
emits these for every kernel it generates from a real model
profile). Previously bench.py used its own KERNEL_CONFIGS defaults
regardless of what kernel.py declared, so a kernel extracted at
LTX's flash-attention shape was being benched against upstream's
default (B=2, H=32, T=1024, D=64) and any wins wouldn't transfer
to the model. Recognise "model_primary" as the primary-size label
alongside "large" and route --quick to whichever is present.
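
The size-selection rule can be sketched as below. This is an assumption-laden illustration rather than the bench.py code: `resolve_test_sizes` and `DEFAULT_SIZES` are hypothetical names, `TEST_SIZES` is assumed to be a list of `(label, params)` pairs, and preferring `"model_primary"` over `"large"` when both exist is a choice made for the sketch.

```python
from types import SimpleNamespace

# Stand-in for bench.py's built-in defaults (hypothetical structure).
DEFAULT_SIZES = [("large", {"B": 2, "H": 32, "T": 1024, "D": 64})]


def resolve_test_sizes(kernel_module, defaults, quick=False):
    """Prefer TEST_SIZES declared by the loaded kernel module; with quick=True
    return only the primary size ("model_primary", else "large")."""
    sizes = getattr(kernel_module, "TEST_SIZES", None) or defaults
    if not quick:
        return list(sizes)
    for label in ("model_primary", "large"):
        for name, params in sizes:
            if name == label:
                return [(name, params)]
    # No recognised primary label: fall back to the first declared size.
    return list(sizes[:1])
```

For example, a kernel module that declares a `"model_primary"` entry would be benched at that size under `--quick`, while a module without `TEST_SIZES` keeps the previous default behaviour.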
Summary
Two-commit PR that adds support for profiling and optimising the Lightricks LTX-2 20B diffusion transformer with AutoKernel, plus three pipeline-integration fixes that benefit any user on Hopper-class GPUs.
- Adds the `ltx-core` git+https dep and `LTXModelWrapper` with a no-arg `__init__` that loads real 20B weights and feeds production-shape AV inputs at 1080p × 121f × 24fps.
- Fixes the `profile.py` → `extract.py` → `bench.py` handoff and adds opt-in source attribution + ATen/kernel dedup to `profile.py`.
Why these fixes matter beyond LTX
The two pipeline-integration bugs (extract shape parser, bench `TEST_SIZES`) plus the classifier extension (ATen op names + `nvjet_*`) affect every user running AutoKernel against a real PyTorch model on Hopper. Symptoms in upstream:
- `extract.py` always falls back to hard-coded defaults (`M=N=K=2048` for matmul) because its parser expects `M=4096, N=4096` strings while `torch.profiler` emits `[[bias], [M, K], [K, N], …]`.
- `bench.py` ignores `TEST_SIZES` from the loaded `kernel.py`, so kernels extracted from real model profiles get benched against upstream's default sizes and any wins don't transfer back to the model.
- On H100, ~24% of real GPU time was bucketed as "other" because cuBLASLt's JIT kernels are named `nvjet_*` (not `cublas*`/`gemm*`/`cutlass*`), and another ~24% of the headline GPU time was the `Command Buffer Full` profiler pseudo-event.
After this PR an end-to-end run produces honest numbers tied to the model's actual shapes.
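
To make the classifier symptom concrete, here is a minimal sketch of the name-based bucketing the PR describes. The function `classify` and the constant names are hypothetical, the rule order is an assumption, and only the categories and name patterns listed in this PR are covered; the real classifier in `profile.py` handles more op types.

```python
# Profiler pseudo-events that report device time but are not GPU kernels.
PSEUDO_EVENTS = (
    "Command Buffer Full",
    "Activity Buffer Request",
    "Lazy Function Loading",
)
ATEN_MATMUL = {"aten::addmm", "aten::linear", "aten::mm", "aten::bmm"}
ATEN_ELEMENTWISE = {"aten::add", "aten::sub", "aten::mul", "aten::div", "aten::pow"}
KERNEL_MATMUL_SUBSTRINGS = ("gemm", "cublas", "cutlass")


def classify(name):
    """Return a category for an event name, or None if it should be dropped."""
    if any(pseudo in name for pseudo in PSEUDO_EVENTS):
        return None  # filtered out of the report entirely
    lowered = name.lower()
    if (name in ATEN_MATMUL
            or lowered.startswith("nvjet_")  # Hopper cuBLASLt JIT kernels
            or any(s in lowered for s in KERNEL_MATMUL_SUBSTRINGS)):
        return "matmul"
    if name == "aten::addcmul_":
        return "rotary_embedding"
    if name in ATEN_ELEMENTWISE or "elementwise_kernel" in lowered:
        return "elementwise"
    return "other"
```

With only the substring whitelist, an `nvjet_*` kernel name falls through to `"other"`; the `startswith("nvjet_")` check is what moves it into `"matmul"`.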
Test plan
- `set -a && . ./.env && set +a && uv sync` succeeds
- `uv run profile.py --model models/ltx_model.py --class-name LTXModelWrapper --input-shape 1,1 --dtype bfloat16` runs end-to-end and writes `workspace/profile_report.json`