WIP: add nkigen-lite as a standalone IR-based kernel generation backend by ymwangg · Pull Request #59 · aws-neuron/nkipy

ymwangg · 2026-06-02T05:27:29Z

Summary

Adds nkigen-lite, a standalone IR-based kernel generation backend that lowers numpy-style tensor programs to NKI (Neuron Kernel Interface) code for NeuronCore targets.

Architecture

The system is structured as a three-layer IR stack with a multi-phase lowering pipeline:

Core (`core.py`)

Shared SSA-based IR infrastructure used by both IRs:

- Value, Op, Graph: SSA primitives with use-lists and mutation helpers
- DType enum covering f32/f16/bf16/tf32/fp8/int types
- Common graph utilities: DCE, verification, toposort
- Shared numpy interpreter dispatch tables
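
As a rough illustration of what SSA primitives with use-lists buy you, here is a minimal sketch of dead-code elimination over a toy graph. The class layout, the `add` and `dce` method names, and the `%N` value naming are assumptions made for this example; they are not taken from `core.py`.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Value:
    name: str
    producer: "Op | None" = None               # None for graph inputs
    uses: list = field(default_factory=list)   # ops that consume this value


@dataclass(eq=False)
class Op:
    kind: str
    inputs: list
    output: Value = None


class Graph:
    """Toy SSA graph: ops are stored in definition (topological) order."""

    def __init__(self):
        self.ops = []
        self.outputs = []

    def input(self, name):
        return Value(name)

    def add(self, kind, *inputs):
        op = Op(kind, list(inputs))
        op.output = Value(f"%{len(self.ops)}", producer=op)
        for v in inputs:
            v.uses.append(op)                  # maintain the use-list
        self.ops.append(op)
        return op.output

    def dce(self):
        """Drop ops whose result is neither used nor a graph output."""
        kept = []
        for op in reversed(self.ops):          # consumers are visited first
            is_output = any(op.output is o for o in self.outputs)
            if op.output.uses or is_output:
                kept.append(op)
            else:
                for v in op.inputs:
                    v.uses.remove(op)          # may make the producer dead too
        self.ops = kept[::-1]


g = Graph()
x = g.input("x")
a = g.add("exp", x)
b = g.add("neg", a)      # only feeds c
c = g.add("sqrt", b)     # never used
g.outputs = [a]
g.dce()
remaining = [op.kind for op in g.ops]
```

Because the sweep runs in reverse definition order, removing `sqrt` empties the use-list of `neg`, which is then removed in the same pass.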

Tensor IR (`tensor_ir/`)

High-level, hardware-agnostic IR operating on whole tensors:

- SSA-based: every op produces new Value(s), enabling clean analysis and transformation
- Numpy-like builder API: familiar interface for constructing kernel graphs
- Numpy interpreter: executes the IR with real data for correctness checking
- Ops: elementwise (unary/binary), reduce, matmul, transpose, reshape, slice, concat, broadcast
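
To make the builder-plus-interpreter idea concrete, the sketch below records a small elementwise graph through operator overloading and then evaluates it. Everything here is hypothetical: the `Builder`, `Tensor`, and `interpret` names are invented for illustration, and plain Python lists stand in for numpy arrays so the example has no dependencies.

```python
import math


class Tensor:
    """Symbolic handle returned by the builder; it holds no data."""

    def __init__(self, builder, name):
        self.builder, self.name = builder, name

    def __add__(self, other):
        return self.builder.emit("add", self, other)

    def __mul__(self, other):
        return self.builder.emit("mul", self, other)


class Builder:
    def __init__(self):
        self.ops = []                      # (result name, op kind, operand names)

    def input(self, name):
        return Tensor(self, name)

    def emit(self, kind, *args):
        out = Tensor(self, f"%{len(self.ops)}")
        self.ops.append((out.name, kind, [a.name for a in args]))
        return out

    def exp(self, x):
        return self.emit("exp", x)


# Dispatch table from op kind to a scalar reference implementation.
ELEMENTWISE = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "exp": math.exp,
}


def interpret(builder, feeds):
    """Evaluate every recorded op in order, elementwise over equal-length lists."""
    env = dict(feeds)
    for out, kind, args in builder.ops:
        fn = ELEMENTWISE[kind]
        env[out] = [fn(*vals) for vals in zip(*(env[a] for a in args))]
    return env


b = Builder()
x, y = b.input("x"), b.input("y")
z = b.exp(x) * y + y                       # records exp, mul, add
result = interpret(b, {"x": [0.0, 0.0], "y": [2.0, 3.0]})[z.name]
```

Running the recorded graph on real inputs gives a reference result that a lowered kernel can later be compared against.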

NKI IR (`nki_ir/`)

Low-level IR that makes hardware concerns explicit:

- Memory spaces: every value carries HBM/SBUF/PSUM placement
- Partition dimension: dim 0 of on-chip tiles is the partition dim (max 128)
- Explicit memory management: alloc/dealloc + DMA copies for data movement
- Pre-allocated destinations: all compute ops take a dst parameter
- Tile indexing: DimSlice-based indexing (ts/ds) mirroring Kernel Builder
- Loop constructs: fori_loop for explicit tile iteration
- Hardware verifier: checks tile constraints against target specs
- Numpy interpreter: reference execution without hardware
- Emit to Kernel Builder: walks the graph and invokes KB API calls to produce NISA MLIR
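
The partition-dimension rule and the verifier can be sketched in a few lines. This is a simplified illustration under stated assumptions: `MemSpace`, `Tile`, `tile_ranges`, and `verify_tile` are hypothetical names, and the only constraint checked is the 128-partition limit mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class MemSpace(Enum):
    HBM = "hbm"
    SBUF = "sbuf"
    PSUM = "psum"


@dataclass(frozen=True)
class Tile:
    name: str
    shape: tuple
    space: MemSpace


def tile_ranges(rows, max_partitions=128):
    """Yield (start, size) pairs that cover `rows` in chunks of at most max_partitions."""
    for start in range(0, rows, max_partitions):
        yield start, min(max_partitions, rows - start)


def verify_tile(tile, max_partitions=128):
    """Return a list of constraint violations; an empty list means the tile is legal."""
    errors = []
    if tile.space is MemSpace.HBM:
        return errors                      # HBM tensors have no partition limit here
    if len(tile.shape) < 2:
        errors.append(f"{tile.name}: on-chip tile needs a partition and a free dim")
    elif tile.shape[0] > max_partitions:
        errors.append(f"{tile.name}: partition dim {tile.shape[0]} > {max_partitions}")
    return errors


# A 300-row tensor is iterated as three tiles, each within the partition limit.
ranges = list(tile_ranges(300))
tiles = [Tile(f"t{i}", (size, 512), MemSpace.SBUF) for i, (_, size) in enumerate(ranges)]
all_errors = [e for t in tiles for e in verify_tile(t)]
too_big = verify_tile(Tile("bad", (256, 512), MemSpace.SBUF))
```

Returning a list of messages rather than raising on the first failure lets a verifier report every illegal tile in one run.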

Lowering Pipeline (`tensor_ir/passes/`)

The full pipeline: tensor_ir → canonicalize → decompose → layout_solver → direct_lower → nki_ir

- Canonicalize: recomposes primitive-op chains into high-level ops (e.g., `div(1, sqrt(x))` → `rsqrt(x)`, `mul(x, div(1, add(1, exp(neg(x)))))` → `silu(x)`)
- Decompose: lowers ops without direct NISA equivalents into supported primitives (e.g., `div(a, b)` → `mul(a, reciprocal(b))`, `reduce(mean)` → `reduce(sum) * 1/N`)
- Layout Solver: assigns each tensor dimension to one of three roles:
  - I (iteration): loop indices, not in SBUF tile
  - P (partition): SBUF dim-0, product ≤ 128, parallel compute
  - F (free): SBUF dim-1, contiguous per partition

  Propagates constraints across the graph to find a globally consistent assignment.
- Direct Lower: converts tensor IR ops to tiled NKI IR:
  - Segments ops into elementwise groups (fused on-chip) vs individual non-elementwise ops (HBM boundaries)
  - Generates tiled load→compute→store sequences
  - Per-op lowering modules: memory, elementwise, reduce, matmul, transpose, broadcast
  - Inserts deallocs via liveness analysis after lowering
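
The first two passes can be illustrated with a tiny rewriter over nested tuples. This is only a sketch: the tuple encoding and the `rewrite`, `rsqrt_rule`, and `div_rule` helpers are assumptions for the example and do not reflect how the passes are actually implemented. It does show why canonicalize plausibly has to run before decompose: once `div` is decomposed, the `div(1, sqrt(x))` pattern no longer exists to be matched.

```python
def rewrite(expr, rules):
    """Rewrite children first, then apply the first matching rule at this node."""
    if not isinstance(expr, tuple):
        return expr                                    # leaf: variable or constant
    expr = (expr[0],) + tuple(rewrite(a, rules) for a in expr[1:])
    for rule in rules:
        new = rule(expr)
        if new is not None:
            return rewrite(new, rules)                 # the result may match again
    return expr


def rsqrt_rule(e):
    # Canonicalize: div(1, sqrt(x)) -> rsqrt(x)
    if e[0] == "div" and e[1] == 1 and isinstance(e[2], tuple) and e[2][0] == "sqrt":
        return ("rsqrt", e[2][1])
    return None


def div_rule(e):
    # Decompose: div(a, b) -> mul(a, reciprocal(b))
    if e[0] == "div":
        return ("mul", e[1], ("reciprocal", e[2]))
    return None


# (1 / sqrt(x)) / y
expr = ("div", ("div", 1, ("sqrt", "x")), "y")

canonical = rewrite(expr, [rsqrt_rule])
lowered = rewrite(canonical, [div_rule])

# Running decompose first destroys the pattern that canonicalize looks for.
wrong_order = rewrite(rewrite(expr, [div_rule]), [rsqrt_rule])
```

Here `lowered` keeps the fused `rsqrt`, while `wrong_order` ends up with a separate `sqrt` and `reciprocal`.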

Hardware Target (`passes/hardware.py`)

Parameterized hardware profiles (TRN2 defaults) defining partition limits, SBUF/PSUM sizes, and matmul constraints.
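
A parameterized profile might look roughly like the frozen dataclass below. The field names and the `fits_in_sbuf` helper are hypothetical, and every number except the 128-partition limit stated above is an illustrative placeholder rather than a value read from `passes/hardware.py` or from a hardware specification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HardwareProfile:
    name: str
    max_partitions: int              # partition-dim limit for on-chip tiles
    sbuf_bytes_per_partition: int    # placeholder value, not a verified spec
    psum_bytes_per_partition: int    # placeholder value, not a verified spec
    matmul_max_free: int             # placeholder matmul free-dim limit


# Defaults loosely modelled on a TRN2-style target (sizes are assumptions).
TRN2_LIKE = HardwareProfile(
    name="trn2-like",
    max_partitions=128,
    sbuf_bytes_per_partition=192 * 1024,
    psum_bytes_per_partition=16 * 1024,
    matmul_max_free=512,
)


def fits_in_sbuf(profile, partitions, free_elems, itemsize):
    """Check a [partitions, free_elems] tile against the profile's SBUF limits."""
    return (
        partitions <= profile.max_partitions
        and free_elems * itemsize <= profile.sbuf_bytes_per_partition
    )


# Derive a smaller variant without mutating the default profile.
small = replace(TRN2_LIKE, name="small", max_partitions=64)

ok = fits_in_sbuf(TRN2_LIKE, 128, 1024, 4)
too_many_partitions = fits_in_sbuf(small, 128, 1024, 4)
```

Keeping the profile immutable and deriving variants with `dataclasses.replace` means passes can take a profile as an argument without worrying that another pass has changed it.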

Status

🚧 Work in progress — not ready for review.

Test plan

- Full test suite passes (`uv run pytest nkigen-lite/tests/ -n auto`)
- Integration with main nkipy package verified
- End-to-end lowering produces correct NKI IR for representative patterns

Migrates tensor_ir, nki_ir, and the direct lowering passes from nano-tensorizer/ir_lab into the nkipy workspace as a new package. The pipeline (canonicalize → decompose → layout_solver → direct_lower) produces legal NKI IR directly without intermediate passes.

ymwangg wants to merge 1 commit into main from nkigen-lite
