[NPU]: Added support for the poly_norm operator #1114

Open

TianHao324 wants to merge 2 commits into linkedin:main from TianHao324:poly_npu
Conversation

TianHao324 (Contributor) commented Mar 2, 2026

Summary

  1. Dual-Kernel Strategy

    • Small kernel (n_cols <= 2048): 2D tensor loads with grid-stride loops
    • Large kernel (n_cols > 2048): column blocking to keep the Unified Buffer (UB) from overflowing
  2. NPU Core-Aware Grid Sizing

    • Dynamic grid: min(num_cores, num_row_blocks)
    • Avoids idle programs and wasted scheduling overhead
  3. Optimized Backward Pass

    • Small kernel: per-program scratch buffers for dW/dB (avoids atomic contention)
    • Large kernel: atomic operations with column blocking
  4. Memory-Aware Block Size

    • Uses compute_default_tiling_strategy for block sizing
    • Keeps UB usage within hardware limits
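The dispatch and grid-sizing logic described in points 1 and 2 can be sketched on the host side as plain Python. This is a minimal illustration only; the names (`choose_kernel`, `grid_size`, `SMALL_KERNEL_MAX_COLS`) are hypothetical and not the identifiers used in this PR.

```python
# Hypothetical host-side sketch of the dual-kernel dispatch and
# core-aware grid sizing summarized above. Names are illustrative,
# not taken from the actual PR code.

SMALL_KERNEL_MAX_COLS = 2048  # threshold from the summary


def choose_kernel(n_cols: int) -> str:
    """Pick the small (2D grid-stride) or large (column-blocked) kernel."""
    return "small" if n_cols <= SMALL_KERNEL_MAX_COLS else "large"


def grid_size(num_cores: int, n_rows: int, block_size_m: int) -> int:
    """Core-aware 1D grid: never launch more programs than row blocks,
    so no program sits idle and scheduling overhead stays minimal."""
    num_row_blocks = (n_rows + block_size_m - 1) // block_size_m  # ceil div
    return min(num_cores, num_row_blocks)


if __name__ == "__main__":
    print(choose_kernel(1024))       # small kernel path
    print(choose_kernel(4096))       # large kernel path
    print(grid_size(40, 2048, 32))   # 64 row blocks, capped at 40 cores
    print(grid_size(40, 64, 32))     # only 2 row blocks -> grid of 2
```

With this shape of grid, each program iterates over multiple row blocks in a grid-stride loop whenever there are more row blocks than NPU cores.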

Testing Done

  • Hardware Type: Atlas 800I A2
  • Ran make test to ensure correctness
  • Ran make checkstyle to ensure code style
  • Ran make test-convergence to ensure convergence

TianHao324 (Contributor, Author) commented:
Benchmark results:

**************************************
     BENCHMARKING SPEED for POLY_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.23637999594211578,
      0.2348800003528595,
      0.23905999958515167,
      0.28970998525619507,
      0.3385399878025055
    ],
    "y_values_20": [
      0.2273000031709671,
      0.22830000519752502,
      0.2342199981212616,
      0.2720000147819519,
      0.3341679871082306
    ],
    "y_values_80": [
      0.24833999574184418,
      0.24624399840831757,
      0.25320398807525635,
      0.3228999972343445,
      0.35314399003982544
    ],
    "timestamp": "2026-03-02 02:51:38",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.7463200092315674,
      0.7454400062561035,
      0.7426400184631348,
      0.7364400029182434,
      1.6384600400924683
    ],
    "y_values_20": [
      0.7362599968910217,
      0.7371799945831299,
      0.7338839769363403,
      0.7293999791145325,
      1.635275959968567
    ],
    "y_values_80": [
      0.7615360021591187,
      0.7582200169563293,
      0.754800021648407,
      0.7476999759674072,
      1.642408013343811
    ],
    "timestamp": "2026-03-02 02:51:41",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.8856799602508545,
      0.8898400068283081,
      0.8944000005722046,
      0.8992000222206116,
      7.412799835205078
    ],
    "y_values_20": [
      0.8791880011558533,
      0.8816080093383789,
      0.8853240013122559,
      0.8918319940567017,
      7.408636093139648
    ],
    "y_values_80": [
      0.900763988494873,
      0.9023399949073792,
      0.9069600105285645,
      0.9096239805221558,
      7.417980194091797
    ],
    "timestamp": "2026-03-02 02:51:44",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      2.296020030975342,
      2.2891600131988525,
      2.327739953994751,
      2.768620014190674,
      6.419480323791504
    ],
    "y_values_20": [
      2.2793400287628174,
      2.2726879119873047,
      2.31085205078125,
      2.7621400356292725,
      6.414495944976807
    ],
    "y_values_80": [
      2.31717586517334,
      2.3065640926361084,
      2.377823829650879,
      2.7743000984191895,
      6.428679943084717
    ],
    "timestamp": "2026-03-02 02:51:47",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.4584600031375885,
      0.4900600016117096,
      0.49260997772216797,
      0.5017399787902832,
      7.218299865722656
    ],
    "y_values_20": [
      0.4523720145225525,
      0.4808399975299835,
      0.48702800273895264,
      0.4955199956893921,
      7.215479850769043
    ],
    "y_values_80": [
      0.46878400444984436,
      0.5002319812774658,
      0.5033479928970337,
      0.5113000273704529,
      7.2234601974487305
    ],
    "timestamp": "2026-03-02 02:51:50",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      1.4046099185943604,
      1.3975800275802612,
      1.4344799518585205,
      2.244270086288452,
      4.8792901039123535
    ],
    "y_values_20": [
      1.360640048980713,
      1.379304051399231,
      1.4222240447998047,
      2.240027904510498,
      4.875060081481934
    ],
    "y_values_80": [
      1.427780032157898,
      1.4378840923309326,
      1.4473799467086792,
      2.249243974685669,
      4.886499881744385
    ],
    "timestamp": "2026-03-02 02:51:52",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
**************************************
     BENCHMARKING MEMORY for POLY_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "y_values_20": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "y_values_80": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "timestamp": "2026-03-02 02:51:52",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "y_values_20": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "y_values_80": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "timestamp": "2026-03-02 02:51:53",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]

TianHao324 (Contributor, Author) commented:

@Tcc0403 would you mind taking a look?

b = tl.load(B_ptr)

# Grid-stride loop over row blocks
for i in tl.range(num_iterations, num_stages=NUM_STAGES):
TianHao324 (Contributor, Author) replied:
OK, I've completed the revision.

eps,
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr,
NUM_STAGES: tl.constexpr,
A contributor commented:

Triton-Ascend does not support num_warps or num_stages due to hardware architecture differences; its autotuner only supports block sizes and multibuffer. Please remove these parameters from all Ascend NPU kernels and kernel launches to avoid no-op or misleading configuration.
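The suggested cleanup can be sketched as a small host-side helper that strips the unsupported meta-parameters before launch. This is a hypothetical illustration, not code from the PR; `ascend_launch_config` and `UNSUPPORTED_ON_ASCEND` are invented names.

```python
# Hypothetical helper: sanitize a kernel launch-config dict for the
# Triton-Ascend backend, dropping keys it ignores (num_warps, num_stages)
# so the config only carries knobs that actually take effect.

UNSUPPORTED_ON_ASCEND = {"num_warps", "num_stages"}


def ascend_launch_config(config: dict) -> dict:
    """Return a copy of the config with Ascend-unsupported keys removed."""
    return {k: v for k, v in config.items() if k not in UNSUPPORTED_ON_ASCEND}


if __name__ == "__main__":
    cfg = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 256,
           "num_stages": 3, "num_warps": 4}
    print(ascend_launch_config(cfg))
    # -> {'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 256}
```

In practice the same effect is achieved by simply not declaring `NUM_STAGES: tl.constexpr` (and not passing `num_stages`/`num_warps`) in the Ascend kernel signatures and launch sites.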
