[NPU]: Added support for the poly_norm operator #1114

Open

TianHao324 wants to merge 2 commits into linkedin:main from TianHao324:poly_npu
Conversation

TianHao324 (Contributor) commented Mar 2, 2026

Summary

  1. Dual-Kernel Strategy

    • Small kernel (n_cols <= 2048): 2D tensor loads with grid-stride loops
    • Large kernel (n_cols > 2048): column blocking to keep the Unified Buffer (UB) from overflowing
  2. NPU Core-Aware Grid Sizing

    • Dynamic grid: min(num_cores, num_row_blocks)
    • Avoids idle programs and wasted scheduling overhead
  3. Optimized Backward Pass

    • Small kernel: per-program scratch buffers for dW/dB (avoids atomic contention)
    • Large kernel: atomic operations with column blocking
  4. Memory-Aware Block Size

    • Uses compute_default_tiling_strategy for block sizing
    • Keeps UB usage within hardware limits
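The dispatch and grid-sizing logic described in points 1 and 2 can be sketched on the host side as plain Python. This is a minimal illustration only; the names (`choose_kernel`, `grid_size`, `SMALL_KERNEL_MAX_COLS`) are hypothetical and not the identifiers used in this PR.

```python
# Hypothetical host-side sketch of the dual-kernel dispatch and
# core-aware grid sizing summarized above. Names are illustrative,
# not taken from the actual PR code.

SMALL_KERNEL_MAX_COLS = 2048  # threshold from the summary


def choose_kernel(n_cols: int) -> str:
    """Pick the small (2D grid-stride) or large (column-blocked) kernel."""
    return "small" if n_cols <= SMALL_KERNEL_MAX_COLS else "large"


def grid_size(num_cores: int, n_rows: int, block_size_m: int) -> int:
    """Core-aware 1D grid: never launch more programs than row blocks,
    so no program sits idle and scheduling overhead stays minimal."""
    num_row_blocks = (n_rows + block_size_m - 1) // block_size_m  # ceil div
    return min(num_cores, num_row_blocks)


if __name__ == "__main__":
    print(choose_kernel(1024))       # small kernel path
    print(choose_kernel(4096))       # large kernel path
    print(grid_size(40, 2048, 32))   # 64 row blocks, capped at 40 cores
    print(grid_size(40, 64, 32))     # only 2 row blocks -> grid of 2
```

With this shape of grid, each program iterates over multiple row blocks in a grid-stride loop whenever there are more row blocks than NPU cores.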

Testing Done

  • Hardware Type: Atlas 800I A2
  • Ran make test to ensure correctness
  • Ran make checkstyle to ensure code style
  • Ran make test-convergence to ensure convergence

TianHao324 (Contributor, Author) commented:
Benchmark results:

**************************************
     BENCHMARKING SPEED for POLY_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.23637999594211578,
      0.2348800003528595,
      0.23905999958515167,
      0.28970998525619507,
      0.3385399878025055
    ],
    "y_values_20": [
      0.2273000031709671,
      0.22830000519752502,
      0.2342199981212616,
      0.2720000147819519,
      0.3341679871082306
    ],
    "y_values_80": [
      0.24833999574184418,
      0.24624399840831757,
      0.25320398807525635,
      0.3228999972343445,
      0.35314399003982544
    ],
    "timestamp": "2026-03-02 02:51:38",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.7463200092315674,
      0.7454400062561035,
      0.7426400184631348,
      0.7364400029182434,
      1.6384600400924683
    ],
    "y_values_20": [
      0.7362599968910217,
      0.7371799945831299,
      0.7338839769363403,
      0.7293999791145325,
      1.635275959968567
    ],
    "y_values_80": [
      0.7615360021591187,
      0.7582200169563293,
      0.754800021648407,
      0.7476999759674072,
      1.642408013343811
    ],
    "timestamp": "2026-03-02 02:51:41",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.8856799602508545,
      0.8898400068283081,
      0.8944000005722046,
      0.8992000222206116,
      7.412799835205078
    ],
    "y_values_20": [
      0.8791880011558533,
      0.8816080093383789,
      0.8853240013122559,
      0.8918319940567017,
      7.408636093139648
    ],
    "y_values_80": [
      0.900763988494873,
      0.9023399949073792,
      0.9069600105285645,
      0.9096239805221558,
      7.417980194091797
    ],
    "timestamp": "2026-03-02 02:51:44",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      2.296020030975342,
      2.2891600131988525,
      2.327739953994751,
      2.768620014190674,
      6.419480323791504
    ],
    "y_values_20": [
      2.2793400287628174,
      2.2726879119873047,
      2.31085205078125,
      2.7621400356292725,
      6.414495944976807
    ],
    "y_values_80": [
      2.31717586517334,
      2.3065640926361084,
      2.377823829650879,
      2.7743000984191895,
      6.428679943084717
    ],
    "timestamp": "2026-03-02 02:51:47",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      0.4584600031375885,
      0.4900600016117096,
      0.49260997772216797,
      0.5017399787902832,
      7.218299865722656
    ],
    "y_values_20": [
      0.4523720145225525,
      0.4808399975299835,
      0.48702800273895264,
      0.4955199956893921,
      7.215479850769043
    ],
    "y_values_80": [
      0.46878400444984436,
      0.5002319812774658,
      0.5033479928970337,
      0.5113000273704529,
      7.2234601974487305
    ],
    "timestamp": "2026-03-02 02:51:50",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      1.4046099185943604,
      1.3975800275802612,
      1.4344799518585205,
      2.244270086288452,
      4.8792901039123535
    ],
    "y_values_20": [
      1.360640048980713,
      1.379304051399231,
      1.4222240447998047,
      2.240027904510498,
      4.875060081481934
    ],
    "y_values_80": [
      1.427780032157898,
      1.4378840923309326,
      1.4473799467086792,
      2.249243974685669,
      4.886499881744385
    ],
    "timestamp": "2026-03-02 02:51:52",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
**************************************
     BENCHMARKING MEMORY for POLY_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "y_values_20": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "y_values_80": [
      3.04296875,
      6.04296875,
      12.04296875,
      24.04296875,
      48.029296875
    ],
    "timestamp": "2026-03-02 02:51:52",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "poly_norm",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      256,
      512,
      1024,
      2048,
      4096
    ],
    "y_values_50": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "y_values_20": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "y_values_80": [
      34.0322265625,
      64.03662109375,
      128.03662109375,
      256.03662109375,
      512.03662109375
    ],
    "timestamp": "2026-03-02 02:51:53",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]

TianHao324 (Contributor, Author) commented:

@Tcc0403 would you mind taking a look?

b = tl.load(B_ptr)

# Grid-stride loop over row blocks
for i in tl.range(num_iterations, num_stages=NUM_STAGES):
TianHao324 (Contributor, Author) replied:
OK, I've completed the revision.

eps,
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr,
NUM_STAGES: tl.constexpr,
A contributor commented:

Triton-Ascend does not support num_warps or num_stages due to hardware architecture differences; its autotuner only supports block sizes and multibuffer. Please remove these parameters from all Ascend NPU kernels and kernel launches to avoid no-op or misleading configuration.
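The suggested cleanup can be sketched as a small host-side helper that strips the unsupported meta-parameters before launch. This is a hypothetical illustration, not code from the PR; `ascend_launch_config` and `UNSUPPORTED_ON_ASCEND` are invented names.

```python
# Hypothetical helper: sanitize a kernel launch-config dict for the
# Triton-Ascend backend, dropping keys it ignores (num_warps, num_stages)
# so the config only carries knobs that actually take effect.

UNSUPPORTED_ON_ASCEND = {"num_warps", "num_stages"}


def ascend_launch_config(config: dict) -> dict:
    """Return a copy of the config with Ascend-unsupported keys removed."""
    return {k: v for k, v in config.items() if k not in UNSUPPORTED_ON_ASCEND}


if __name__ == "__main__":
    cfg = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 256,
           "num_stages": 3, "num_warps": 4}
    print(ascend_launch_config(cfg))
    # -> {'BLOCK_SIZE_M': 32, 'BLOCK_SIZE_N': 256}
```

In practice the same effect is achieved by simply not declaring `NUM_STAGES: tl.constexpr` (and not passing `num_stages`/`num_warps`) in the Ascend kernel signatures and launch sites.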
