KernelAgent-Oink: Add SM100 CuTeDSL RMSNorm custom ops plugin for vLLM #69
Conversation
oink::fused_add_rms_norm backed by an SM100 CuTeDSL RMSNorm kernel. The ops are torch.compile-friendly (stride-preserving for padded-row inputs) and the fused op matches vLLM's in-place residual-add RMSNorm semantics.
Jack-Khuu
left a comment
Initial comments
Need to go through rmsnorm.py
```python
import math
import operator
from typing import Callable, Optional
```
nit: Prefer <type> | None instead of Optional keyword
```python
numerical behaviour and performance close to the original reference
implementations.
```
original reference implementations.
Commit hash would be nice if you have it handy
```python
if sm >= 100:
    # Use the tuned CuTeDSL SM100 kernel. The public API already
    # contains all necessary gating and layout checks internally.
    _rms = _get_rmsnorm_mod()
```
nit: pull _rms out of conditional

```python
sm = _get_sm(x.device)
_rms = _get_rmsnorm_mod()
if sm >= 100:
    return <>
return _rms.rmsnorm_ref(...)
```

```python
assert weight.dim() == 1, "weight must be 1D [N]"

sm = _get_sm(x.device)
if sm >= 100:
```
nit: Check inverse to reduce nesting

```python
if sm < 100:
    # Non-SM100: keep semantics in-place (correctness-first).
```

```python
local_rank = os.environ.get("LOCAL_RANK")
if local_rank is not None:
    try:
        return int(local_rank)
    except ValueError:
        pass
return 0
```
Ignore suggestion if we want to guard/enable on "off/on" "yes/no"
Suggested change:

```python
rank = os.environ.get("LOCAL_RANK", "0")
return int(rank)
```
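If we do want to guard on "off/on"/"yes/no"-style values (per the caveat above), a sketch of a tolerant parser — `env_flag` and the accepted value sets are illustrative assumptions, not the PR's helper:

```python
import os

_TRUTHY = {"1", "true", "on", "yes", "y"}
_FALSY = {"0", "false", "off", "no", "n", ""}


def env_flag(name: str, default: bool = False) -> bool:
    # Tolerant boolean env parsing; unrecognised values fall back to default.
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUTHY:
        return True
    if value in _FALSY:
        return False
    return default


os.environ["DEMO_FLAG"] = "YES"
print(env_flag("DEMO_FLAG"))  # True
```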
```python
import subprocess
import sys
import threading
from typing import Optional, Tuple
```
If we want to keep it for <= Python 3.9 support that's fine. If not, let's use `| None` and `tuple` for 3.10+.
```python
    f"falling back to staged SMEM path (returncode={rc}).",
    file=sys.stderr,
)
failing_proc = proc_128 if proc_128 is not None else proc_256
```
Do we want to spit out both error traces, since both exist and fail?
Since 128 is the fallback, fixing the 256 probe makes more sense, right?
```python
_CLUSTER_DIRECT_GMEM_PROBE_WARNED = False


def _probe_cluster_direct_gmem_max_copy_bits() -> int:
```
This isn't called until ~line 2560; do we want to move it lower?
Specifically somewhere after lines 263-299, which are still configuring the env variables (and called on import).
```python
"""Resolve copy width (in bits) from the (import-time) policy string."""
if _COPY_BITS_POLICY in {"128"}:
    return 128
if _COPY_BITS_POLICY in {"256"} and can_use_256:
```
Why `in` instead of `==`?
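For reference, the same gating with plain equality. This is a sketch: the `resolve_copy_bits` name is made up, and it assumes the cut-off "256" branch returns 256 with everything else falling back to 128:

```python
def resolve_copy_bits(policy: str, can_use_256: bool) -> int:
    # Equality-based version of the gating above; behaviour is identical
    # as long as each check is against a single string literal.
    if policy == "128":
        return 128
    if policy == "256" and can_use_256:
        return 256
    return 128  # conservative default when 256-bit copies are unavailable


print(resolve_copy_bits("256", True))  # 256
```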
```python
# This relies on internal CuTeDSL runtime pointer fields (`_desc`, `_pointer`,
# etc.). If these internals change in a future CuTeDSL upgrade, callers
# should catch AttributeError and fall back to the regular launch path.
device_ptr = int(device_ptr)
```
why cast if the function expects an int?
```python
sm_count = (
    sm_count * sm_count_multiple
    if N <= 8192
    else sm_count // 2
```
this seems strange, why would we ever want to run fewer than sm_count?
For clustered launches, sm_count is effectively a cluster-count heuristic (matching Quack's naming/launch shape). Launch uses grid=[sm_count, cluster_n, 1], so total CTAs is sm_count * cluster_n.
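A tiny arithmetic sketch of that launch-shape reading. The function names and the 148-SM / 2-CTA-cluster figures are illustrative assumptions:

```python
def effective_sm_count(sm_count: int, sm_count_multiple: int, n: int) -> int:
    # Mirrors the heuristic in the diff: multiply for N <= 8192, halve otherwise.
    return sm_count * sm_count_multiple if n <= 8192 else sm_count // 2


def total_ctas(sm_count: int, cluster_n: int) -> int:
    # grid = [sm_count, cluster_n, 1] -> sm_count * cluster_n CTAs in total.
    return sm_count * cluster_n


# e.g. a hypothetical 148-SM device, 2-CTA clusters, N = 7168:
print(total_ctas(effective_sm_count(148, 2, 7168), 2))  # 592
```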
```python
_PTR_FAST_LAUNCH_TLS = threading.local()


def _env_flag(name: str, default: bool) -> bool:
```
this already exists in init, remove dupes like that
```python
elif N <= 8192:
    # Allow an override (used by 2-rows/CTA path for N≈6k/8k)
    try:
        return self._tpr_override  # type: ignore[attr-defined]
```
seems redundant wrt the first override check?
```
@@ -0,0 +1,2927 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
```
Sorry for the late review. It's nice that we have an extension system now to try things in vLLM, so I mostly want to spend time reviewing the kernel itself and what would make it easier for vendors like vLLM to actually merge this in. Mostly reiterating points I made here https://x.com/marksaroufim/status/2009096176789016600?s=20
A lot of it stems from this file being too long, but I think it shouldn't be too hard to clean it up:
- We don't need the cache to work over multiple CuTeDSL versions; presumably they're making breaking changes fairly frequently, so let's just pick the latest version and update as needed.
- The code almost looks like a splatted autotune run because it's trying to handle many cases and choose between different optimizations. I think we should just try and ship the one specific config that is fast on some specific shapes on a specific model that the vLLM team cares about on B200. Otherwise they'll have trouble reviewing this code even if it's faster, and I'd rather we generalize the code progressively as the need arises.
- A lot of the pointer-marshalling code can be deleted in favor of using tvm-ffi; a good chunk of the file is doing this, and it will be error prone.
- Point 2 also has unexpected side effects: tons of fallbacks make it unpredictable for an end user precisely which kernel configuration will run, which is something all of our numerics-sensitive customers will really care about. A user would often like to explicitly state whether they want an op to be in place or not. I'd argue that instead of environment variables gating specific optimizations we should have arguments to a function or separate functions. Even further, PyTorch now has an intra-kernel dispatcher where we can make guarantees on which specific kernel will be called for a specific shape.
- Finally, while I think an e2e test in vLLM works great, we probably also want some smaller unit tests comparing numerics vs. vanilla PyTorch code and Quack right here.
- Switch correctness gate to PyTorch ref + record err stats
- Tighten Softmax/LayerNorm tolerances (Quack-like)
- Quack-style benchmark suite layout + SVG plots
- Packaging/README polish for publishability
Add the kernelagent-oink vLLM plugin that registers Blackwell (SM100) RMSNorm
custom ops via torch.library.custom_op under the oink:: namespace.
The SM100 CuTeDSL implementation is layout-aware and preserves padded-row
strides (stride(1)==1, stride(0)>=N) so torch.compile/CUDA-graph capture sees a
stable stride contract. Includes small-M latency tuning for DSv3-like N=7168
and maintains high-M bandwidth, with correctness-first fallbacks on non-SM100.
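The stride contract described above can be sketched as a standalone predicate (the helper name and the example pitch values are illustrative, not from the PR):

```python
def satisfies_padded_row_contract(strides, n: int) -> bool:
    # stride(1) == 1: elements within a row are contiguous.
    # stride(0) >= N: row pitch may include padding beyond the N elements.
    s0, s1 = strides
    return s1 == 1 and s0 >= n


print(satisfies_padded_row_contract((8192, 1), 7168))  # True  (padded pitch)
print(satisfies_padded_row_contract((7168, 1), 7168))  # True  (dense rows)
print(satisfies_padded_row_contract((1, 7168), 7168))  # False (column-major)
```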