Skip to content

Add cutile jsd#1228

Open
xjmxyt wants to merge 6 commits into
linkedin:mainfrom
xjmxyt:feature/add_cutile
Open

Add cutile jsd#1228
xjmxyt wants to merge 6 commits into
linkedin:mainfrom
xjmxyt:feature/add_cutile

Conversation

@xjmxyt
Copy link
Copy Markdown

@xjmxyt xjmxyt commented May 18, 2026

Summary

Add optional cuTile backend support for Liger Kernel. This change introduces LIGER_KERNEL_BACKEND=cutile backend selection, CUDA-only validation, cuTile JSD operator replacement, optional cutile package extras, benchmark labeling as liger-cutile, and documentation for enabling cuTile.
Related to #1205

Details

cuTile is only enabled explicitly through LIGER_KERNEL_BACKEND=cutile. Non-CUDA devices raise an error, and missing cuTile dependencies raise an import error with installation instructions.

Testing Done

  • Ran JSD correctness tests with cuTile enabled:
    LIGER_KERNEL_BACKEND=cutile pytest -s test/transformers/test_jsd.py -v
  • Added backend selection test:
    test/transformers/test_cutile_backend.py
  • Ran targeted ruff checks on modified files.
  • Ran JSD benchmarks comparing liger vs liger-cutile on B300, will add B200 in the later commit
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence
image

@xjmxyt xjmxyt changed the title [Draft] Add cutile jsd Add cutile jsd May 18, 2026
@xjmxyt
Copy link
Copy Markdown
Author

xjmxyt commented May 18, 2026

@Mecoli1219 @vaibhavjindal Could you please look at it and give some comments?

Copy link
Copy Markdown
Collaborator

@Tcc0403 Tcc0403 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribuiton, it is exciting to see new dsls integrated to Liger Kernel. However, since this is the first PR with non-existing dsl, I haven't had an idea to have different DSLs co-existing in our repo. The current backend is designed for hardware device, not for DSLs. Happy to discuss if you have any ideas!

Comment on lines +66 to +86
def select_backend_for_device(device: str) -> Optional[VendorInfo]:
"""
Select the backend implementation for a given device.

LIGER_KERNEL_BACKEND is an explicit override for optional backends that are
not the default vendor implementation for a device.
"""
backend = os.environ.get("LIGER_KERNEL_BACKEND")
if backend is None:
return get_vendor_for_device(device)

backend = backend.strip().lower()
if backend == "":
return get_vendor_for_device(device)

if backend == "cutile":
if device != "cuda":
raise RuntimeError("LIGER_KERNEL_BACKEND=cutile requires a CUDA/GPU device.")
return VendorInfo(vendor="cutile", device="cuda")

raise RuntimeError(f"Unsupported LIGER_KERNEL_BACKEND: {backend}. Only 'cutile' is currently supported.")
Copy link
Copy Markdown
Collaborator

@Tcc0403 Tcc0403 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vendor and backend semantics here refer to the hardware device, not dsl backend. We might need a new structure design for adding DSLs support

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we move away from the VendorInfo object? Vendor kind of denotes the chip-maker, but we might have different backends even for the same chip-maker. By backend we should ideally mean "a new kernel implementation".

How about we add something like:

@dataclass
  class BackendInfo:
      name: str                          # "ascend", "cutile", later "ck", "cudnn"
      device: str                        # "npu", "cuda"
      is_default_for_device: bool        # ascend=True (auto on npu); cutile=False (opt-in)

Wdyt @Tcc0403 ?

Comment thread src/liger_kernel/ops/__init__.py Outdated
Comment on lines +109 to +151
@@ -117,13 +117,12 @@ def _replace_with_vendor_ops():

Note: Vendor can both override existing ops AND add new vendor-specific ops.
"""
from liger_kernel.ops.backends import get_vendor_for_device
from liger_kernel.ops.backends import select_backend_for_device
from liger_kernel.utils import infer_device

device = infer_device()

# Look up vendor info for this device
vendor_info = get_vendor_for_device(device)
vendor_info = select_backend_for_device(device)
if vendor_info is None:
return

@@ -144,7 +143,12 @@ def _replace_with_vendor_ops():
globals()[name] = getattr(vendor_ops, name)

except ImportError:
# Vendor module not available, use default implementations
import os

backend = os.environ.get("LIGER_KERNEL_BACKEND", "").strip().lower()
if backend == "cutile":
raise
# Vendor module not available, use default implementations.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's come up with another dispatch mechanism different from the existing device backend

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to come up with a better dispatch mechanism for the new backend since we are breaking the one vendor one gpu model.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! Are you planning to push a fix for this? If so feel free to push a new commit directly — otherwise I can do it.

Comment thread README.md Outdated
- **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques.
- **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
- **Lightweight:** Liger Kernel has minimal default dependencies, requiring only Torch and Triton. Optional backends such as cuTile can be installed explicitly when needed.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it as how it was and make additional dsl backend support as a single bullet point

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@vaibhavjindal
Copy link
Copy Markdown
Collaborator

Thanks for your contribuiton, it is exciting to see new dsls integrated to Liger Kernel. However, since this is the first PR with non-existing dsl, I haven't had an idea to have different DSLs co-existing in our repo. The current backend is designed for hardware device, not for DSLs. Happy to discuss if you have any ideas!

I agree, we need to think more deeply about how to integrate the cutile backend, especially since it works on already supported hardware (nvidia).

@vaibhavjindal
Copy link
Copy Markdown
Collaborator

Hi @xjmxyt , thanks for opening this PR. Excited to get cuTile kernels working in the liger-kernel.

The kernel looks good, however, we need to decide on how to integrate the backend in a better way. Do you think it will be okay if we make some modifications on top of this PR as maintainers to make it easy to land the first PR?

@xjmxyt
Copy link
Copy Markdown
Author

xjmxyt commented May 19, 2026

Hi @xjmxyt , thanks for opening this PR. Excited to get cuTile kernels working in the liger-kernel.

The kernel looks good, however, we need to decide on how to integrate the backend in a better way. Do you think it will be okay if we make some modifications on top of this PR as maintainers to make it easy to land the first PR?

Sure, feel free to push changes on top.

xjmxyt and others added 2 commits May 18, 2026 20:13
…vice

The existing VendorInfo / VENDOR_REGISTRY model assumed one vendor per
device (e.g., Ascend on NPU). It doesn't model alternative DSLs on the
same device (e.g., cuTile on CUDA, where Triton is already the default),
which was being shoehorned in via a hardcoded `if backend == "cutile":`
branch and a _cutile/__init__.py that deliberately skipped registration.

Replace VendorInfo(vendor, device) with BackendInfo(name, devices,
default_devices):

  - devices: tuple of supported devices (multi-device backends OK)
  - default_devices: subset auto-applied on import; empty = opt-in
    only via LIGER_KERNEL_BACKEND

Both Ascend and cuTile now register through the same register_backend()
mechanism, no special cases. Ascend uses devices=("npu",),
default_devices=("npu",); cuTile uses devices=("cuda",) with empty
default_devices, making it opt-in. The dispatcher
(_replace_with_backend_ops) reads the env var once and threads explicit
through, dropping the duplicate parse in the import fallback.

No behavior change for end users: LIGER_KERNEL_BACKEND=cutile still
selects cuTile on CUDA, and Ascend on NPU still auto-applies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vaibhavjindal
Copy link
Copy Markdown
Collaborator

vaibhavjindal commented May 19, 2026

Hi @xjmxyt, thanks again for yesterday's discussion and for the contribution.

I've pushed a follow-up commit
(8ac80fd) that refactors the
backend registry to address @Tcc0403's concern about device-vs-DSL.

Can you test things on your end and confirm numerical parity between the cuTile path and the existing Triton path on the full JSD test suite?

  LIGER_KERNEL_BACKEND=cutile pytest test/transformers/test_jsd.py -v

I am not able to test things on blackwells right now because of some ongoing issue.

@vaibhavjindal
Copy link
Copy Markdown
Collaborator

@Tcc0403 I have done some refactoring here to solve the vendor/backend confusion. Could you please check it on your end to see if it looks good? Can you also please check if these changes will break anything on the ascend backend (hopefully it should be fine)?

Happy to discuss if you have any ideas about we should do the integration for different backends!

cc @Mecoli1219

@xjmxyt
Copy link
Copy Markdown
Author

xjmxyt commented May 20, 2026

Hi @xjmxyt, thanks again for yesterday's discussion and for the contribution.

I've pushed a follow-up commit (8ac80fd) that refactors the backend registry to address @Tcc0403's concern about device-vs-DSL.

Can you test things on your end and confirm numerical parity between the cuTile path and the existing Triton path on the full JSD test suite?

  LIGER_KERNEL_BACKEND=cutile pytest test/transformers/test_jsd.py -v

I am not able to test things on blackwells right now because of some ongoing issue.

I already check it, it works well.
image

Comment on lines +2230 to +2234
jsd,liger-cutile,forward,speed,ms,BT,B * T,1024,0.7814080119132996,0.7793023943901062,0.7831360101699829,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
jsd,liger-cutile,forward,speed,ms,BT,B * T,2048,1.4285119771957397,1.4254208087921143,1.4325439929962158,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
jsd,liger-cutile,forward,speed,ms,BT,B * T,4096,2.7792000770568848,2.773011255264282,2.783692789077759,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
jsd,liger-cutile,forward,speed,ms,BT,B * T,8192,5.50931191444397,5.502655982971191,5.513644886016845,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
jsd,liger-cutile,forward,speed,ms,BT,B * T,16384,10.931103706359863,10.921529960632324,10.938668823242187,"{""vocab_size"": 128256}",NVIDIA B300 SXM6 AC,2026-05-18 09:59:22,0.8.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tcc0403 @xjmxyt Should we consider creating a separate benchmark_csv file for the new backend to keep the existing benchmark baseline intact?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can modify the default csv file path (e.g. all_benchmark_data_cutile.csv) when specifying the cutile backend.

Copy link
Copy Markdown
Collaborator

@Tcc0403 Tcc0403 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest moving _cutile to the same hierarchy level as backends. The modified structure would look like:

src/
└── liger_kernel/
    └── ops/
        ├── backends/
        │   └── _ascend/
        │       ├── __init__.py
        │       └── tilelang/  (replace tilelang with potential future dsl support)
        ├── cutile/
        │   └── ops/
        │       ├── __init__.py
        │       ├── jsd.py
        │       └── utils.py
        ├── tilelang/  (replace tilelang with potential future dsl support)
        └── __init__.py

I think this change makes it easier for hardware backends to support multiple DSLs coexisting. cc @vaibhavjindal

@kolehma8
Copy link
Copy Markdown
Collaborator

I suggest moving _cutile to the same hierarchy level as backends. The modified structure would look like:

src/
└── liger_kernel/
    └── ops/
        ├── backends/
        │   └── _ascend/
        │       ├── __init__.py
        │       └── tilelang/  (replace tilelang with potential future dsl support)
        ├── cutile/
        │   └── ops/
        │       ├── __init__.py
        │       ├── jsd.py
        │       └── utils.py
        ├── tilelang/  (replace tilelang with potential future dsl support)
        └── __init__.py

I think this change makes it easier for hardware backends to support multiple DSLs coexisting. cc @vaibhavjindal

We are also planning to bring in Cutlass/CUDA and CuteDSL in future for specific use cases. Cutlass in particular will be somewhat significant departure from the existing framework and requires compiling. I felt having the different "languages" as backends would be more natural as single kernel may contain multiple different implementations - backends. The language choice is also heavily tied to hardware and Cutlass/CuteDSL/CuTile are working only for Nvidia GPUs.

Hence, I prefer to have them under backend and have something along the lines of the BackendInfo that @vaibhavjindal drafted above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants