Add cutile jsd by xjmxyt · Pull Request #1228 · linkedin/Liger-Kernel

xjmxyt · 2026-05-18T11:10:45Z

Summary

Add optional cuTile backend support for Liger Kernel. This change introduces LIGER_KERNEL_BACKEND=cutile backend selection, CUDA-only validation, cuTile JSD operator replacement, optional cutile package extras, benchmark labeling as liger-cutile, and documentation for enabling cuTile.
Related to #1205

Details

cuTile is only enabled explicitly through LIGER_KERNEL_BACKEND=cutile. Non-CUDA devices raise an error, and missing cuTile dependencies raise an import error with installation instructions.

Testing Done

Ran JSD correctness tests with cuTile enabled:
LIGER_KERNEL_BACKEND=cutile pytest -s test/transformers/test_jsd.py -v
Added backend selection test:
test/transformers/test_cutile_backend.py
Ran targeted ruff checks on modified files.
Ran JSD benchmarks comparing liger vs liger-cutile on B300, will add B200 in the later commit
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

xjmxyt · 2026-05-18T11:32:13Z

@Mecoli1219 @vaibhavjindal Could you please look at it and give some comments?

Tcc0403

Thanks for your contribuiton, it is exciting to see new dsls integrated to Liger Kernel. However, since this is the first PR with non-existing dsl, I haven't had an idea to have different DSLs co-existing in our repo. The current backend is designed for hardware device, not for DSLs. Happy to discuss if you have any ideas!

Tcc0403 · 2026-05-18T14:14:42Z

+def select_backend_for_device(device: str) -> Optional[VendorInfo]:
+    """
+    Select the backend implementation for a given device.
+
+    LIGER_KERNEL_BACKEND is an explicit override for optional backends that are
+    not the default vendor implementation for a device.
+    """
+    backend = os.environ.get("LIGER_KERNEL_BACKEND")
+    if backend is None:
+        return get_vendor_for_device(device)
+
+    backend = backend.strip().lower()
+    if backend == "":
+        return get_vendor_for_device(device)
+
+    if backend == "cutile":
+        if device != "cuda":
+            raise RuntimeError("LIGER_KERNEL_BACKEND=cutile requires a CUDA/GPU device.")
+        return VendorInfo(vendor="cutile", device="cuda")
+
+    raise RuntimeError(f"Unsupported LIGER_KERNEL_BACKEND: {backend}. Only 'cutile' is currently supported.")


The vendor and backend semantics here refer to the hardware device, not dsl backend. We might need a new structure design for adding DSLs support

How about we move away from the VendorInfo object? Vendor kind of denotes the chip-maker, but we might have different backends even for the same chip-maker. By backend we should ideally mean "a new kernel implementation".

How about we add something like:

@dataclass class BackendInfo: name: str # "ascend", "cutile", later "ck", "cudnn" device: str # "npu", "cuda" is_default_for_device: bool # ascend=True (auto on npu); cutile=False (opt-in)

Wdyt @Tcc0403 ?

Tcc0403 · 2026-05-18T14:17:32Z

@@ -117,13 +117,12 @@ def _replace_with_vendor_ops():

    Note: Vendor can both override existing ops AND add new vendor-specific ops.
    """
-    from liger_kernel.ops.backends import get_vendor_for_device
+    from liger_kernel.ops.backends import select_backend_for_device
    from liger_kernel.utils import infer_device

    device = infer_device()

-    # Look up vendor info for this device
-    vendor_info = get_vendor_for_device(device)
+    vendor_info = select_backend_for_device(device)
    if vendor_info is None:
        return

@@ -144,7 +143,12 @@ def _replace_with_vendor_ops():
            globals()[name] = getattr(vendor_ops, name)

    except ImportError:
-        # Vendor module not available, use default implementations
+        import os
+
+        backend = os.environ.get("LIGER_KERNEL_BACKEND", "").strip().lower()
+        if backend == "cutile":
+            raise
+        # Vendor module not available, use default implementations.


Let's come up with another dispatch mechanism different from the existing device backend

Yes, we need to come up with a better dispatch mechanism for the new backend since we are breaking the one vendor one gpu model.

Thanks for the feedback! Are you planning to push a fix for this? If so feel free to push a new commit directly — otherwise I can do it.

Tcc0403 · 2026-05-18T14:20:09Z

 - **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques.
 - **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
+- **Lightweight:** Liger Kernel has minimal default dependencies, requiring only Torch and Triton. Optional backends such as cuTile can be installed explicitly when needed.


Let's keep it as how it was and make additional dsl backend support as a single bullet point

vaibhavjindal · 2026-05-18T23:53:44Z

Thanks for your contribuiton, it is exciting to see new dsls integrated to Liger Kernel. However, since this is the first PR with non-existing dsl, I haven't had an idea to have different DSLs co-existing in our repo. The current backend is designed for hardware device, not for DSLs. Happy to discuss if you have any ideas!

I agree, we need to think more deeply about how to integrate the cutile backend, especially since it works on already supported hardware (nvidia).

vaibhavjindal · 2026-05-19T00:00:20Z

Hi @xjmxyt , thanks for opening this PR. Excited to get cuTile kernels working in the liger-kernel.

The kernel looks good, however, we need to decide on how to integrate the backend in a better way. Do you think it will be okay if we make some modifications on top of this PR as maintainers to make it easy to land the first PR?

xjmxyt · 2026-05-19T02:34:47Z

Hi @xjmxyt , thanks for opening this PR. Excited to get cuTile kernels working in the liger-kernel.

The kernel looks good, however, we need to decide on how to integrate the backend in a better way. Do you think it will be okay if we make some modifications on top of this PR as maintainers to make it easy to land the first PR?

Sure, feel free to push changes on top.

…vice The existing VendorInfo / VENDOR_REGISTRY model assumed one vendor per device (e.g., Ascend on NPU). It doesn't model alternative DSLs on the same device (e.g., cuTile on CUDA, where Triton is already the default), which was being shoehorned in via a hardcoded `if backend == "cutile":` branch and a _cutile/__init__.py that deliberately skipped registration. Replace VendorInfo(vendor, device) with BackendInfo(name, devices, default_devices): - devices: tuple of supported devices (multi-device backends OK) - default_devices: subset auto-applied on import; empty = opt-in only via LIGER_KERNEL_BACKEND Both Ascend and cuTile now register through the same register_backend() mechanism, no special cases. Ascend uses devices=("npu",), default_devices=("npu",); cuTile uses devices=("cuda",) with empty default_devices, making it opt-in. The dispatcher (_replace_with_backend_ops) reads the env var once and threads explicit through, dropping the duplicate parse in the import fallback. No behavior change for end users: LIGER_KERNEL_BACKEND=cutile still selects cuTile on CUDA, and Ascend on NPU still auto-applies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vaibhavjindal · 2026-05-19T21:40:54Z

Hi @xjmxyt, thanks again for yesterday's discussion and for the contribution.

I've pushed a follow-up commit
(8ac80fd) that refactors the
backend registry to address @Tcc0403's concern about device-vs-DSL.

Can you test things on your end and confirm numerical parity between the cuTile path and the existing Triton path on the full JSD test suite?

  LIGER_KERNEL_BACKEND=cutile pytest test/transformers/test_jsd.py -v

I am not able to test things on blackwells right now because of some ongoing issue.

vaibhavjindal · 2026-05-19T21:43:41Z

@Tcc0403 I have done some refactoring here to solve the vendor/backend confusion. Could you please check it on your end to see if it looks good? Can you also please check if these changes will break anything on the ascend backend (hopefully it should be fine)?

Happy to discuss if you have any ideas about we should do the integration for different backends!

cc @Mecoli1219

xjmxyt · 2026-05-20T04:50:43Z

Hi @xjmxyt, thanks again for yesterday's discussion and for the contribution.

I've pushed a follow-up commit (8ac80fd) that refactors the backend registry to address @Tcc0403's concern about device-vs-DSL.

Can you test things on your end and confirm numerical parity between the cuTile path and the existing Triton path on the full JSD test suite?
  LIGER_KERNEL_BACKEND=cutile pytest test/transformers/test_jsd.py -v
I am not able to test things on blackwells right now because of some ongoing issue.

I already check it, it works well.

zheliuyu · 2026-05-21T08:09:14Z

+jsd,liger-cutile,forward,speed,ms,BT,B * T,1024,0.7814080119132996,0.7793023943901062,0.7831360101699829,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
+jsd,liger-cutile,forward,speed,ms,BT,B * T,2048,1.4285119771957397,1.4254208087921143,1.4325439929962158,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
+jsd,liger-cutile,forward,speed,ms,BT,B * T,4096,2.7792000770568848,2.773011255264282,2.783692789077759,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
+jsd,liger-cutile,forward,speed,ms,BT,B * T,8192,5.50931191444397,5.502655982971191,5.513644886016845,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
+jsd,liger-cutile,forward,speed,ms,BT,B * T,16384,10.931103706359863,10.921529960632324,10.938668823242187,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0


@Tcc0403 @xjmxyt Should we consider creating a separate benchmark_csv file for the new backend to keep the existing benchmark baseline intact?

I can modify the default csv file path (e.g. all_benchmark_data_cutile.csv) when specifying the cutile backend.

Tcc0403

I suggest moving _cutile to the same hierarchy level as backends. The modified structure would look like:

src/
└── liger_kernel/
    └── ops/
        ├── backends/
        │   └── _ascend/
        │       ├── __init__.py
        │       └── tilelang/  (replace tilelang with potential future dsl support)
        ├── cutile/
        │   └── ops/
        │       ├── __init__.py
        │       ├── jsd.py
        │       └── utils.py
        ├── tilelang/  (replace tilelang with potential future dsl support)
        └── __init__.py

I think this change makes it easier for hardware backends to support multiple DSLs coexisting. cc @vaibhavjindal

kolehma8 · 2026-05-21T21:46:22Z

I suggest moving _cutile to the same hierarchy level as backends. The modified structure would look like:

src/
└── liger_kernel/
    └── ops/
        ├── backends/
        │   └── _ascend/
        │       ├── __init__.py
        │       └── tilelang/  (replace tilelang with potential future dsl support)
        ├── cutile/
        │   └── ops/
        │       ├── __init__.py
        │       ├── jsd.py
        │       └── utils.py
        ├── tilelang/  (replace tilelang with potential future dsl support)
        └── __init__.py

I think this change makes it easier for hardware backends to support multiple DSLs coexisting. cc @vaibhavjindal

We are also planning to bring in Cutlass/CUDA and CuteDSL in future for specific use cases. Cutlass in particular will be somewhat significant departure from the existing framework and requires compiling. I felt having the different "languages" as backends would be more natural as single kernel may contain multiple different implementations - backends. The language choice is also heavily tied to hardware and Cutlass/CuteDSL/CuTile are working only for Nvidia GPUs.

Hence, I prefer to have them under backend and have something along the lines of the BackendInfo that @vaibhavjindal drafted above.

xjmxyt added 2 commits May 18, 2026 00:57

add cutile jsd example

8da96be

refine

f2593b2

xjmxyt changed the title ~~[Draft] Add cutile jsd~~ Add cutile jsd May 18, 2026

Tcc0403 reviewed May 18, 2026

View reviewed changes

Merge branch 'main' into feature/add_cutile

255137a

xjmxyt and others added 2 commits May 18, 2026 20:13

refine readme

7733ee4

Merge branch 'main' into feature/add_cutile

00e00d7

zheliuyu reviewed May 21, 2026

View reviewed changes

Tcc0403 reviewed May 21, 2026

View reviewed changes

Conversation

xjmxyt commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Testing Done

Uh oh!

xjmxyt commented May 18, 2026

Uh oh!

Tcc0403 left a comment

Choose a reason for hiding this comment

Uh oh!

Tcc0403 May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vaibhavjindal commented May 18, 2026

Uh oh!

vaibhavjindal commented May 19, 2026

Uh oh!

xjmxyt commented May 19, 2026

Uh oh!

vaibhavjindal commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vaibhavjindal commented May 19, 2026

Uh oh!

xjmxyt commented May 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Tcc0403 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kolehma8 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

xjmxyt commented May 18, 2026 •

edited

Loading

Tcc0403 May 18, 2026 •

edited

Loading

vaibhavjindal commented May 19, 2026 •

edited

Loading

Tcc0403 left a comment •

edited

Loading