Add cutile jsd#1228
Conversation
|
@Mecoli1219 @vaibhavjindal Could you please look at it and give some comments? |
Tcc0403
left a comment
There was a problem hiding this comment.
Thanks for your contribuiton, it is exciting to see new dsls integrated to Liger Kernel. However, since this is the first PR with non-existing dsl, I haven't had an idea to have different DSLs co-existing in our repo. The current backend is designed for hardware device, not for DSLs. Happy to discuss if you have any ideas!
| def select_backend_for_device(device: str) -> Optional[VendorInfo]: | ||
| """ | ||
| Select the backend implementation for a given device. | ||
|
|
||
| LIGER_KERNEL_BACKEND is an explicit override for optional backends that are | ||
| not the default vendor implementation for a device. | ||
| """ | ||
| backend = os.environ.get("LIGER_KERNEL_BACKEND") | ||
| if backend is None: | ||
| return get_vendor_for_device(device) | ||
|
|
||
| backend = backend.strip().lower() | ||
| if backend == "": | ||
| return get_vendor_for_device(device) | ||
|
|
||
| if backend == "cutile": | ||
| if device != "cuda": | ||
| raise RuntimeError("LIGER_KERNEL_BACKEND=cutile requires a CUDA/GPU device.") | ||
| return VendorInfo(vendor="cutile", device="cuda") | ||
|
|
||
| raise RuntimeError(f"Unsupported LIGER_KERNEL_BACKEND: {backend}. Only 'cutile' is currently supported.") |
There was a problem hiding this comment.
The vendor and backend semantics here refer to the hardware device, not dsl backend. We might need a new structure design for adding DSLs support
There was a problem hiding this comment.
How about we move away from the VendorInfo object? Vendor kind of denotes the chip-maker, but we might have different backends even for the same chip-maker. By backend we should ideally mean "a new kernel implementation".
How about we add something like:
@dataclass
class BackendInfo:
name: str # "ascend", "cutile", later "ck", "cudnn"
device: str # "npu", "cuda"
is_default_for_device: bool # ascend=True (auto on npu); cutile=False (opt-in)
Wdyt @Tcc0403 ?
| @@ -117,13 +117,12 @@ def _replace_with_vendor_ops(): | |||
|
|
|||
| Note: Vendor can both override existing ops AND add new vendor-specific ops. | |||
| """ | |||
| from liger_kernel.ops.backends import get_vendor_for_device | |||
| from liger_kernel.ops.backends import select_backend_for_device | |||
| from liger_kernel.utils import infer_device | |||
|
|
|||
| device = infer_device() | |||
|
|
|||
| # Look up vendor info for this device | |||
| vendor_info = get_vendor_for_device(device) | |||
| vendor_info = select_backend_for_device(device) | |||
| if vendor_info is None: | |||
| return | |||
|
|
|||
| @@ -144,7 +143,12 @@ def _replace_with_vendor_ops(): | |||
| globals()[name] = getattr(vendor_ops, name) | |||
|
|
|||
| except ImportError: | |||
| # Vendor module not available, use default implementations | |||
| import os | |||
|
|
|||
| backend = os.environ.get("LIGER_KERNEL_BACKEND", "").strip().lower() | |||
| if backend == "cutile": | |||
| raise | |||
| # Vendor module not available, use default implementations. | |||
There was a problem hiding this comment.
Let's come up with another dispatch mechanism different from the existing device backend
There was a problem hiding this comment.
Yes, we need to come up with a better dispatch mechanism for the new backend since we are breaking the one vendor one gpu model.
There was a problem hiding this comment.
Thanks for the feedback! Are you planning to push a fix for this? If so feel free to push a new commit directly — otherwise I can do it.
| - **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques. | ||
| - **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy. | ||
| - **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches! | ||
| - **Lightweight:** Liger Kernel has minimal default dependencies, requiring only Torch and Triton. Optional backends such as cuTile can be installed explicitly when needed. |
There was a problem hiding this comment.
Let's keep it as how it was and make additional dsl backend support as a single bullet point
I agree, we need to think more deeply about how to integrate the cutile backend, especially since it works on already supported hardware (nvidia). |
|
Hi @xjmxyt , thanks for opening this PR. Excited to get cuTile kernels working in the liger-kernel. The kernel looks good, however, we need to decide on how to integrate the backend in a better way. Do you think it will be okay if we make some modifications on top of this PR as maintainers to make it easy to land the first PR? |
Sure, feel free to push changes on top. |
…vice
The existing VendorInfo / VENDOR_REGISTRY model assumed one vendor per
device (e.g., Ascend on NPU). It doesn't model alternative DSLs on the
same device (e.g., cuTile on CUDA, where Triton is already the default),
which was being shoehorned in via a hardcoded `if backend == "cutile":`
branch and a _cutile/__init__.py that deliberately skipped registration.
Replace VendorInfo(vendor, device) with BackendInfo(name, devices,
default_devices):
- devices: tuple of supported devices (multi-device backends OK)
- default_devices: subset auto-applied on import; empty = opt-in
only via LIGER_KERNEL_BACKEND
Both Ascend and cuTile now register through the same register_backend()
mechanism, no special cases. Ascend uses devices=("npu",),
default_devices=("npu",); cuTile uses devices=("cuda",) with empty
default_devices, making it opt-in. The dispatcher
(_replace_with_backend_ops) reads the env var once and threads explicit
through, dropping the duplicate parse in the import fallback.
No behavior change for end users: LIGER_KERNEL_BACKEND=cutile still
selects cuTile on CUDA, and Ascend on NPU still auto-applies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi @xjmxyt, thanks again for yesterday's discussion and for the contribution. I've pushed a follow-up commit Can you test things on your end and confirm numerical parity between the cuTile path and the existing Triton path on the full JSD test suite? LIGER_KERNEL_BACKEND=cutile pytest test/transformers/test_jsd.py -vI am not able to test things on blackwells right now because of some ongoing issue. |
|
@Tcc0403 I have done some refactoring here to solve the vendor/backend confusion. Could you please check it on your end to see if it looks good? Can you also please check if these changes will break anything on the ascend backend (hopefully it should be fine)? Happy to discuss if you have any ideas about we should do the integration for different backends! cc @Mecoli1219 |
|
| jsd,liger-cutile,forward,speed,ms,BT,B * T,1024,0.7814080119132996,0.7793023943901062,0.7831360101699829,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0 | ||
| jsd,liger-cutile,forward,speed,ms,BT,B * T,2048,1.4285119771957397,1.4254208087921143,1.4325439929962158,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0 | ||
| jsd,liger-cutile,forward,speed,ms,BT,B * T,4096,2.7792000770568848,2.773011255264282,2.783692789077759,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0 | ||
| jsd,liger-cutile,forward,speed,ms,BT,B * T,8192,5.50931191444397,5.502655982971191,5.513644886016845,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0 | ||
| jsd,liger-cutile,forward,speed,ms,BT,B * T,16384,10.931103706359863,10.921529960632324,10.938668823242187,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0 |
There was a problem hiding this comment.
I can modify the default csv file path (e.g. all_benchmark_data_cutile.csv) when specifying the cutile backend.
There was a problem hiding this comment.
I suggest moving _cutile to the same hierarchy level as backends. The modified structure would look like:
src/
└── liger_kernel/
└── ops/
├── backends/
│ └── _ascend/
│ ├── __init__.py
│ └── tilelang/ (replace tilelang with potential future dsl support)
├── cutile/
│ └── ops/
│ ├── __init__.py
│ ├── jsd.py
│ └── utils.py
├── tilelang/ (replace tilelang with potential future dsl support)
└── __init__.py
I think this change makes it easier for hardware backends to support multiple DSLs coexisting. cc @vaibhavjindal
We are also planning to bring in Cutlass/CUDA and CuteDSL in future for specific use cases. Cutlass in particular will be somewhat significant departure from the existing framework and requires compiling. I felt having the different "languages" as backends would be more natural as single kernel may contain multiple different implementations - backends. The language choice is also heavily tied to hardware and Cutlass/CuteDSL/CuTile are working only for Nvidia GPUs. Hence, I prefer to have them under backend and have something along the lines of the BackendInfo that @vaibhavjindal drafted above. |

Summary
Add optional cuTile backend support for Liger Kernel. This change introduces
LIGER_KERNEL_BACKEND=cutilebackend selection, CUDA-only validation, cuTile JSD operator replacement, optionalcutilepackage extras, benchmark labeling asliger-cutile, and documentation for enabling cuTile.Related to #1205
Details
cuTile is only enabled explicitly through
LIGER_KERNEL_BACKEND=cutile. Non-CUDA devices raise an error, and missing cuTile dependencies raise an import error with installation instructions.Testing Done
LIGER_KERNEL_BACKEND=cutile pytest -s test/transformers/test_jsd.py -vtest/transformers/test_cutile_backend.pyligervsliger-cutileon B300, will add B200 in the later commitmake testto ensure correctnessmake checkstyleto ensure code stylemake test-convergenceto ensure convergence