
WIP: add nkigen-lite as a standalone IR-based kernel generation backend #59

Draft: ymwangg wants to merge 1 commit into main from nkigen-lite

@ymwangg ymwangg commented Jun 2, 2026

Summary

Adds nkigen-lite, a standalone IR-based kernel generation backend that lowers numpy-style tensor programs to NKI (Neuron Kernel Interface) code for NeuronCore targets.

Architecture

The system is structured as a three-layer IR stack with a multi-phase lowering pipeline:

Core (core.py)

Shared SSA-based IR infrastructure used by both IRs:

  • Value, Op, Graph — SSA primitives with use-lists and mutation helpers
  • DType enum covering f32/f16/bf16/tf32/fp8/int types
  • Common graph utilities: DCE, verification, toposort
  • Shared numpy interpreter dispatch tables
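To make the SSA primitives above concrete, here is a minimal sketch of values with use-lists and a dead-code-elimination pass. It is not the code in core.py: the class fields, `Graph.add`, and `Graph.dce` are hypothetical names chosen for illustration, and each op is assumed to define a single result.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Value:
    """An SSA value: defined once, with a list of the ops that use it."""
    name: str
    producer: "Op | None" = None
    uses: list = field(default_factory=list)


@dataclass(eq=False)
class Op:
    """An operation that reads `inputs` and defines one `result` (assumption)."""
    kind: str
    inputs: list
    result: Value


class Graph:
    def __init__(self):
        self.ops = []      # kept in definition order, so already topologically sorted
        self.outputs = []  # values that must stay live

    def add(self, kind, inputs, name):
        result = Value(name)
        op = Op(kind, list(inputs), result)
        result.producer = op
        for v in inputs:
            v.uses.append(op)  # maintain the use-list
        self.ops.append(op)
        return result

    def dce(self):
        """Remove ops whose result has no uses and is not a graph output."""
        removed = 0
        # Walk in reverse so removing a user can expose its producers as dead.
        for op in reversed(list(self.ops)):
            if op.result.uses or any(op.result is out for out in self.outputs):
                continue
            for v in op.inputs:
                v.uses.remove(op)
            self.ops.remove(op)
            removed += 1
        return removed


g = Graph()
x = Value("x")                 # graph input, no producer
a = g.add("exp", [x], "a")
b = g.add("neg", [x], "b")     # dead: only feeds c
c = g.add("sqrt", [b], "c")    # dead: never used
g.outputs = [a]
removed = g.dce()
```

Because `eq=False` keeps identity-based comparison, the use-list updates remove exactly the op instance being deleted rather than any structurally equal op.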

Tensor IR (tensor_ir/)

High-level, hardware-agnostic IR operating on whole tensors:

  • SSA-based — every op produces new Value(s), enabling clean analysis and transformation
  • Numpy-like builder API — familiar interface for constructing kernel graphs
  • Numpy interpreter — executes the IR with real data for correctness checking
  • Ops: elementwise (unary/binary), reduce, matmul, transpose, reshape, slice, concat, broadcast
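As a rough illustration of how a builder and a numpy interpreter can fit together, the sketch below records ops in SSA form and then replays them through a dispatch table. The `Builder` class, its `emit` method, and the op names are assumptions made for this example and do not reflect the actual tensor_ir API.

```python
import numpy as np

# Dispatch table: op name -> numpy implementation (names are illustrative).
NUMPY_IMPL = {
    "add": np.add,
    "mul": np.multiply,
    "exp": np.exp,
    "neg": np.negative,
    "matmul": np.matmul,
    "reduce_sum": lambda x, axis: np.sum(x, axis=axis, keepdims=True),
}


class Builder:
    """Records ops as (result_id, op_name, input_ids, attrs) tuples."""

    def __init__(self):
        self.ops = []
        self.num_values = 0
        self.input_ids = []

    def _new_value(self):
        vid = self.num_values
        self.num_values += 1
        return vid

    def input(self):
        vid = self._new_value()
        self.input_ids.append(vid)
        return vid

    def emit(self, op_name, *inputs, **attrs):
        # Every op defines a fresh value id, which keeps the program in SSA form.
        vid = self._new_value()
        self.ops.append((vid, op_name, inputs, attrs))
        return vid


def interpret(builder, *arrays):
    """Execute the recorded ops in order on real numpy arrays."""
    env = dict(zip(builder.input_ids, arrays))
    for vid, op_name, inputs, attrs in builder.ops:
        env[vid] = NUMPY_IMPL[op_name](*(env[i] for i in inputs), **attrs)
    return env


b = Builder()
x = b.input()
w = b.input()
y = b.emit("matmul", x, w)
z = b.emit("exp", b.emit("neg", y))
s = b.emit("reduce_sum", z, axis=-1)

rng = np.random.default_rng(0)
xv = rng.standard_normal((4, 8)).astype(np.float32)
wv = rng.standard_normal((8, 3)).astype(np.float32)
env = interpret(b, xv, wv)
expected = np.sum(np.exp(-(xv @ wv)), axis=-1, keepdims=True)
```

Comparing `env[s]` with `expected` is the kind of correctness check the interpreter bullet describes: the same computation is evaluated once through the IR and once directly in numpy.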

NKI IR (nki_ir/)

Low-level IR that makes hardware concerns explicit:

  • Memory spaces — every value carries HBM/SBUF/PSUM placement
  • Partition dimension — dim 0 of on-chip tiles is the partition dim (max 128)
  • Explicit memory management — alloc/dealloc + DMA copies for data movement
  • Pre-allocated destinations — all compute ops take a dst parameter
  • Tile indexing — DimSlice-based indexing (ts/ds) mirroring Kernel Builder
  • Loop constructs — fori_loop for explicit tile iteration
  • Hardware verifier — checks tile constraints against target specs
  • Numpy interpreter — reference execution without hardware
  • Emit to Kernel Builder — walks the graph and invokes KB API calls to produce NISA MLIR
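The following numpy-only sketch mimics the style described above: explicit on-chip allocations, DMA copies between memory spaces, a compute op that writes into a pre-allocated destination, and a loop over tiles whose partition dimension is capped at 128. `fori_loop`, `dma_copy`, and `tiled_exp` are plain-Python stand-ins written for this example, not the nki_ir API, and handling the partial last tile with a smaller allocation is an assumption.

```python
import numpy as np

PARTITION_MAX = 128  # partition-dim limit stated in the list above


def fori_loop(lower, upper, body):
    """Plain-Python stand-in for an explicit loop construct."""
    for i in range(lower, upper):
        body(i)


def dma_copy(dst, src):
    """Stand-in for a DMA transfer between memory spaces."""
    dst[...] = src


def tiled_exp(hbm_in, hbm_out):
    rows, cols = hbm_in.shape
    num_tiles = (rows + PARTITION_MAX - 1) // PARTITION_MAX  # ceil division

    def body(i):
        start = i * PARTITION_MAX
        size = min(PARTITION_MAX, rows - start)  # last tile may be partial
        # "alloc": on-chip tiles, dim 0 is the partition dim
        sbuf_in = np.empty((size, cols), dtype=hbm_in.dtype)
        sbuf_out = np.empty((size, cols), dtype=hbm_in.dtype)
        dma_copy(sbuf_in, hbm_in[start:start + size])    # HBM -> SBUF
        np.exp(sbuf_in, out=sbuf_out)                    # compute into dst
        dma_copy(hbm_out[start:start + size], sbuf_out)  # SBUF -> HBM
        # "dealloc" would go here; numpy frees the tiles when body returns

    fori_loop(0, num_tiles, body)


x = np.linspace(-2.0, 2.0, 300 * 16, dtype=np.float32).reshape(300, 16)
out = np.empty_like(x)
tiled_exp(x, out)
```

With 300 rows the loop runs three times, on tiles of 128, 128, and 44 rows, and `out` ends up equal to `np.exp(x)`.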

Lowering Pipeline (tensor_ir/passes/)

The full pipeline: tensor_ir → canonicalize → decompose → layout_solver → direct_lower → nki_ir

  1. Canonicalize — recomposes primitive-op chains into high-level ops (e.g., div(1, sqrt(x)) → rsqrt(x), mul(x, div(1, add(1, exp(neg(x))))) → silu(x))

  2. Decompose — lowers ops without direct NISA equivalents into supported primitives (e.g., div(a, b) → mul(a, reciprocal(b)), reduce(mean) → reduce(sum) * 1/N)

  3. Layout Solver — assigns each tensor dimension to one of three roles:

    • I (iteration) — loop indices, not in SBUF tile
    • P (partition) — SBUF dim-0, product ≤ 128, parallel compute
    • F (free) — SBUF dim-1, contiguous per partition

    Propagates constraints across the graph to find a globally consistent assignment.

  4. Direct Lower — converts tensor IR ops to tiled NKI IR:

    • Segments ops into elementwise groups (fused on-chip) vs individual non-elementwise ops (HBM boundaries)
    • Generates tiled load→compute→store sequences
    • Per-op lowering modules: memory, elementwise, reduce, matmul, transpose, broadcast
    • Inserts deallocs via liveness analysis after lowering
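Two parts of the pipeline above lend themselves to a quick numpy sketch: the rewrite examples in the canonicalize and decompose steps can be checked numerically, and the segmentation step in Direct Lower can be approximated on a linear op sequence. The `segment` helper and the `ELEMENTWISE` set are hypothetical; the real pass presumably works on the SSA graph rather than a flat list of names.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 4.0, size=(8, 16))   # positive, safe for sqrt
a = rng.standard_normal((8, 16))
b = rng.uniform(0.5, 4.0, size=(8, 16))   # nonzero, safe for division

# Canonicalize: primitive chain vs. the high-level op it is recomposed into.
rsqrt_chain = 1.0 / np.sqrt(x)                       # div(1, sqrt(x))
rsqrt_ref = x ** -0.5                                # rsqrt(x)
silu_chain = a * (1.0 / (1.0 + np.exp(-a)))          # mul(x, div(1, add(1, exp(neg(x)))))
silu_ref = a / (1.0 + np.exp(-a))                    # silu(x) = x * sigmoid(x)

# Decompose: op without a direct equivalent vs. its supported primitives.
div_direct = a / b
div_decomposed = a * (1.0 / b)                       # mul(a, reciprocal(b))
mean_direct = a.mean(axis=1)
mean_decomposed = a.sum(axis=1) * (1.0 / a.shape[1])  # reduce(sum) * 1/N

# Segmentation: consecutive elementwise ops form one fused group,
# every other op becomes its own segment.
ELEMENTWISE = {"add", "mul", "exp", "neg", "reciprocal", "silu", "rsqrt"}


def segment(op_names):
    groups = []
    for name in op_names:
        if name in ELEMENTWISE and groups and groups[-1][0] == "fused":
            groups[-1][1].append(name)
        elif name in ELEMENTWISE:
            groups.append(("fused", [name]))
        else:
            groups.append(("single", [name]))
    return groups


segments = segment(["neg", "exp", "matmul", "silu", "mul", "reduce", "transpose"])
```

The decomposed forms agree with the direct ones only up to floating-point rounding, which is why a tolerance-based comparison such as `np.allclose` is the appropriate check.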

Hardware Target (passes/hardware.py)

Parameterized hardware profiles (TRN2 defaults) defining partition limits, SBUF/PSUM sizes, and matmul constraints.
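A parameterized profile of this kind could be sketched as a frozen dataclass plus a tile check. Everything below is hypothetical: the field names, the `check_sbuf_tile` helper, and in particular the SBUF size, which is a made-up number for the demo and not a TRN2 value. Only the 128-partition limit comes from the description above.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HardwareTarget:
    name: str
    max_partitions: int            # partition-dim limit
    sbuf_bytes_per_partition: int  # free-dim capacity per partition


# Illustrative profile; the SBUF size is a placeholder, not a real spec value.
DEMO = HardwareTarget(name="demo", max_partitions=128,
                      sbuf_bytes_per_partition=64 * 1024)


def check_sbuf_tile(target, shape, itemsize):
    """Return a list of constraint violations for an on-chip tile."""
    errors = []
    partitions = shape[0]                 # dim 0 is the partition dim
    free_elems = math.prod(shape[1:])     # remaining dims live in the free dim
    if partitions > target.max_partitions:
        errors.append(f"partition dim {partitions} > {target.max_partitions}")
    if free_elems * itemsize > target.sbuf_bytes_per_partition:
        errors.append(f"free dim needs {free_elems * itemsize} bytes per partition")
    return errors


ok = check_sbuf_tile(DEMO, (128, 512), itemsize=4)
too_many_partitions = check_sbuf_tile(DEMO, (256, 512), itemsize=4)
too_wide = check_sbuf_tile(DEMO, (128, 32768), itemsize=4)

# Deriving a variant profile only overrides the fields that differ.
small = replace(DEMO, name="demo-small", max_partitions=64)
small_check = check_sbuf_tile(small, (128, 512), itemsize=4)
```

Keeping the limits in a profile object rather than as constants lets the same verifier run against different targets by passing a different profile.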

Status

🚧 Work in progress — not ready for review.

Test plan

  • Full test suite passes (uv run pytest nkigen-lite/tests/ -n auto)
  • Integration with main nkipy package verified
  • End-to-end lowering produces correct NKI IR for representative patterns

Migrates tensor_ir, nki_ir, and the direct lowering passes from
nano-tensorizer/ir_lab into the nkipy workspace as a new package.
The pipeline (canonicalize → decompose → layout_solver → direct_lower)
produces legal NKI IR directly without intermediate passes.