Draft
Changes from all commits
Commits
50 commits
1a6308e
Add FP32 operators for MicroLlama on Snitch (untiled)
lee2716 Jan 31, 2026
1bdf9c9
Add tiling support for MicroLlama on Snitch
lee2716 Jan 31, 2026
35d51ef
Fix Snitch tiled platform by unifying mapping to use TilingReadyBindings
lee2716 Feb 3, 2026
76a4678
add full microllama model in ci test
lee2716 Feb 3, 2026
c222810
add comments with information about operations
lee2716 Feb 3, 2026
497e9c1
generalize RMSNorm to support full ONNX spec
lee2716 Feb 3, 2026
bf5ddb7
Fix broadcast stride calculation for inputs with different ranks in A…
lee2716 Feb 3, 2026
7e3659d
delete unused function
lee2716 Feb 3, 2026
28280fb
delete the comment
lee2716 Feb 3, 2026
ac5d541
update year to 2026
lee2716 Feb 3, 2026
c90b35c
Fix: Revert batch_size type to uint32_t based on review
lee2716 Feb 3, 2026
5669c28
update year to 2026
lee2716 Feb 3, 2026
cf4d9bd
update year to 2026
lee2716 Feb 3, 2026
c04bd6a
remove code duplication
lee2716 Feb 3, 2026
bdc550e
remove code duplication
lee2716 Feb 3, 2026
b53ff75
update year to 2026
lee2716 Feb 3, 2026
b355624
update year to 2026
lee2716 Feb 3, 2026
5306134
recover the Gemm_fp32
lee2716 Feb 3, 2026
32d88c0
improve multicore transpose
lee2716 Feb 4, 2026
7092e35
format: run make format on Snitch platform code
lee2716 Feb 4, 2026
b2199cb
pytest: add microLlama model to Snitch test configurations
lee2716 Feb 5, 2026
7ad03a3
style: consolidate imports in Snitch platform
lee2716 Feb 5, 2026
d76f6f1
refactor: restore Snitch framework code to origin/devel
lee2716 Feb 5, 2026
1c62b68
fix: Reshape operator for Snitch platform
lee2716 Feb 5, 2026
32d4bfa
fix: Add broadcasting support and compatible type inference
lee2716 Feb 5, 2026
66c4b4f
make format update
lee2716 Feb 5, 2026
be96413
update test paths for reorganized RMSNorm and microLlama directories,…
lee2716 Feb 6, 2026
89d382a
refactor: general ONNX broadcasting for Div/Mul/Add
lee2716 Feb 6, 2026
e55c7bc
fix: enable tiled deployment for NOP operations and L2 memory management
lee2716 Feb 6, 2026
13a4e64
fix: restore NOPTileConstraint compatibility with Siracusa/Neureka ti…
lee2716 Feb 6, 2026
55b6750
fix: correct integer type inference for all-zero input arrays
lee2716 Feb 6, 2026
06010e4
fix: preserve original dtype for all-zero input type inference
lee2716 Feb 6, 2026
85a68fd
make format update
lee2716 Feb 6, 2026
182a2c3
update rmsnorm test
lee2716 Feb 9, 2026
03125c0
feat: replace trivial all-1.0 weights with true FP32 random initializ…
lee2716 Feb 17, 2026
027ccab
merge HardSwishChecker, rename parser, fix Softmax types, yapf fix
lee2716 Feb 17, 2026
064981a
fix: multi-core safe memory allocation for Snitch platform
lee2716 Feb 18, 2026
9be8768
feat: multi-core MatMul, Softmax kernels and fix Mul template
lee2716 Feb 18, 2026
fdc0c82
refactor: slim Snitch parsers, add MatMul_fp32.c, remove unsupported …
lee2716 Feb 18, 2026
7813684
remove if (snrt_is_compute_core())
lee2716 Feb 18, 2026
4865516
fix:correct RMSNorm op count from 5*inputSize to 6*inputSize
lee2716 Feb 18, 2026
fc8ea3f
refactor: use SkipTransformer with pointer assignment for Reshape, av…
lee2716 Feb 18, 2026
b6b6eb5
simplify: remove unused broadcasting logic from FloatDiv/Mul TileCons…
lee2716 Feb 18, 2026
4e8448b
cleanup: remove unused BasicTransformer and Basic*Bindings dead code
lee2716 Feb 18, 2026
7003801
fix CI test of snitch
lee2716 Feb 19, 2026
e693be7
fix: add int8→int32 MatMul binding to fix Snitch Integer MatMul CI test
lee2716 Feb 19, 2026
1633a71
fix CI test of snitch
lee2716 Feb 19, 2026
0a66cf4
refactor: reuse Generic GatherTemplate and revert NOPTileConstraint
lee2716 Feb 19, 2026
6b357cf
refactor: simplify Snitch parsers, templates, and bindings
lee2716 Feb 19, 2026
c7b9771
refactor: address reviewer comments and reduce code duplication
lee2716 Feb 19, 2026
2 changes: 1 addition & 1 deletion .github/workflows/ci-platform-snitch-tiled.yml
Contributor


You are reverting the changes introduced by PR #144! The tests now reside in DeeployTest/test_snitch_tiled_config.py

@@ -35,4 +35,4 @@ jobs:
with:
runner: ${{ needs.select-env.outputs.runner }}
docker-image: ${{ needs.select-env.outputs.image }}
-pytest-marker: "kernels and singlebuffer and l2"
+pytest-marker: "(kernels or models) and singlebuffer and l2"
10 changes: 9 additions & 1 deletion .github/workflows/ci-platform-snitch.yml
Contributor


You are reverting the changes from PR #144 as well! New tests should go in DeeployTest/test_snitch_config.py

@@ -35,4 +35,12 @@ jobs:
with:
runner: ${{ needs.select-env.outputs.runner }}
docker-image: ${{ needs.select-env.outputs.image }}
-pytest-marker: "kernels"
+pytest-marker: kernels

snitch-models:
needs: select-env
uses: ./.github/workflows/_runner-snitch.yml
with:
runner: ${{ needs.select-env.outputs.runner }}
docker-image: ${{ needs.select-env.outputs.image }}
pytest-marker: models
2 changes: 2 additions & 0 deletions .yamllint
@@ -31,3 +31,5 @@ ignore:
- "**/toolchain/"
# Ignore all files in .git
- "**/.git/**"
# Ignore all files in .venv
- "**/.venv/"
2 changes: 1 addition & 1 deletion Deeploy/DeeployTypes.py
@@ -3107,7 +3107,7 @@ def _exportGraph(self, folderPath, fileName):
# VJUNG: ONNX-Graphsurgeon needs tensors to be in their export types
constTensors = [tensor for tensor in self.graph.tensors().values() if isinstance(tensor, gs.Constant)]
for tensor in constTensors:
-if tensor.dtype != tensor.export_dtype:
+if hasattr(tensor, 'export_dtype') and tensor.dtype != tensor.export_dtype:
tensor.values = tensor.values.astype(tensor.export_dtype)

model = gs.export_onnx(self.graph)
3 changes: 3 additions & 0 deletions Deeploy/Targets/Generic/Bindings.py
@@ -286,6 +286,9 @@
BasicConcatBindings = [
NodeBinding(ConcatChecker([PointerClass(type), PointerClass(type)], [PointerClass(type)]),
ConcatTemplate.referenceTemplate, BasicTransformer) for type in IntegerDataTypes
] + [
NodeBinding(ConcatChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
ConcatTemplate.referenceTemplate, BasicTransformer)
]

BasicQuantBindings = [
49 changes: 49 additions & 0 deletions Deeploy/Targets/Generic/Layers.py
@@ -709,3 +709,52 @@ def computeOps(self):
numPx = opRep['dim_im_out_x']

return numPx * opsPerPx


class RMSNormLayer(ONNXLayer):
"""Layer support for the ONNX RMSNormalization operator.

Supported opset: 23

It is computed as follows:
- XSquared = Mul(X, X)
- XSquaredMean = ReduceMean<axes=normalized_axes>(XSquared)
- MeanSquareEpsilon = Add(XSquaredMean, epsilon)
- RMS = Sqrt(MeanSquareEpsilon)
- Normalized = Div(X, RMS)
- Y = Mul(Normalized, Scale)

For more details, see the official ONNX documentation:
https://onnx.ai/onnx/operators/onnx__RMSNormalization.html#rmsnormalization-23
"""

def __init__(self, maps: List[NodeMapper]):
super().__init__(maps)

def computeOps(self):
inputSize = self.mapper.parser.operatorRepresentation['inputSize']
NormalizedAxesSize = self.mapper.parser.operatorRepresentation['NormalizedAxesSize']
scale = self.mapper.parser.operatorRepresentation['scale']

# a. XSquared = Mul(X, X) => inputSize ops
# b. XSquaredMean = ReduceMean<axes=normalized_axes>(XSquared)
# => inputSize ops (additions) + (inputSize - NormalizedAxesSize) ops (divisions)
# c. MeanSquareEpsilon = Add(XSquaredMean, epsilon) => (inputSize - NormalizedAxesSize) ops
# d. RMS = Sqrt(MeanSquareEpsilon) => (inputSize - NormalizedAxesSize) ops
# e. Normalized = Div(X, RMS) => inputSize ops
# f. Y = Mul(Normalized, Scale) => 0 if all(Scale == 1.0), else inputSize ops
scale_ops = 0 if (scale == 1.0).all() else inputSize
ops = 6 * inputSize - 3 * NormalizedAxesSize + scale_ops
return ops
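As a sanity check, the six steps listed in the docstring can be sketched as a NumPy reference implementation (`rmsnorm_reference` is an illustrative helper, not part of Deeploy; `axis` marks the first normalized axis, with all trailing axes normalized, as in ONNX opset 23):

```python
import numpy as np


def rmsnorm_reference(X, scale, axis = -1, epsilon = 1e-5):
    """Reference RMSNormalization following the ONNX opset-23 decomposition."""
    normalized_axes = tuple(range(axis % X.ndim, X.ndim))
    x_squared = X * X  # a. XSquared = Mul(X, X)
    x_squared_mean = x_squared.mean(axis = normalized_axes, keepdims = True)  # b. ReduceMean
    rms = np.sqrt(x_squared_mean + epsilon)  # c. Add epsilon, d. Sqrt
    return (X / rms) * scale  # e. Div, f. Mul by Scale


X = np.random.rand(2, 4, 8).astype(np.float32)
Y = rmsnorm_reference(X, scale = np.ones(8, dtype = np.float32), axis = -1)
assert Y.shape == X.shape
```

For an all-ones input with `scale == 1.0`, the output is approximately all ones (up to `epsilon`), which is a quick way to validate a kernel against this reference.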


class HardSwishLayer(ONNXLayer):

def __init__(self, maps: List[NodeMapper]):
super().__init__(maps)

def computeOps(self):
# HardSwish(x) = x * clip(x/6 + 0.5, 0, 1)
# Operations: div + add + clip + mul
size = self.mapper.parser.operatorRepresentation['size']
return size * 4
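The formula in the comment can be checked with a short NumPy sketch (`hardswish_reference` is illustrative, not part of Deeploy); it performs exactly the four elementwise operations counted above: div, add, clip, mul.

```python
import numpy as np


def hardswish_reference(x: np.ndarray) -> np.ndarray:
    # HardSwish(x) = x * clip(x/6 + 0.5, 0, 1): div, add, clip, mul per element.
    return x * np.clip(x / 6.0 + 0.5, 0.0, 1.0)


# Saturates to 0 below -3 ... -6/6+0.5 clips to 0; passes x through above +3.
out = hardswish_reference(np.array([-6.0, 0.0, 3.0, 6.0]))
```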
78 changes: 69 additions & 9 deletions Deeploy/Targets/Generic/Parsers.py
@@ -11,6 +11,37 @@
from Deeploy.DeeployTypes import ConstantBuffer, NetworkContext, NodeParser, VariableBuffer


def compute_broadcast_strides(shape1, shape2, out_shape):
"""Compute strides for ONNX/NumPy-style broadcasting.

Pads both input shapes from the left to match the output ndim,
then computes strides where broadcast dimensions (size 1) get stride 0.

Example:
shape1=[8,8,8], shape2=[8]
-> strides1=[64,8,1], strides2=[0,0,1]
"""
ndim = len(out_shape)

pad1 = [1] * (ndim - len(shape1)) + shape1
pad2 = [1] * (ndim - len(shape2)) + shape2

def _calc_strides(padded_shape, out_shape):
strides = []
stride = 1
for i in range(ndim - 1, -1, -1):
if padded_shape[i] == 1 and out_shape[i] > 1:
strides.insert(0, 0)
else:
strides.insert(0, stride)
stride *= padded_shape[i] if padded_shape[i] > 1 else 1
return strides

strides1 = _calc_strides(pad1, out_shape)
strides2 = _calc_strides(pad2, out_shape)
return strides1, strides2
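The docstring example can be cross-checked against NumPy's own broadcasting machinery: `np.broadcast_to` produces a view whose byte strides are 0 on broadcast (size-1) dimensions, matching the element strides computed here. `numpy_broadcast_strides` below is an illustrative helper, not Deeploy code:

```python
import numpy as np


def numpy_broadcast_strides(shape, out_shape):
    # int8 items make byte strides equal element strides (itemsize == 1).
    a = np.broadcast_to(np.empty(shape, dtype = np.int8), out_shape)
    return [s // a.itemsize for s in a.strides]


print(numpy_broadcast_strides([8, 8, 8], [8, 8, 8]))  # [64, 8, 1]
print(numpy_broadcast_strides([8], [8, 8, 8]))  # [0, 0, 1]
```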


class ConcatParser(NodeParser):

def __init__(self):
Expand Down Expand Up @@ -55,6 +86,10 @@ def parseNode(self, node: gs.Node) -> (bool):
self.operatorRepresentation['n_levels'] = int(node.attrs['n_levels'])
self.operatorRepresentation['log2D'] = int(math.log2(node.attrs['D']))

stash_type = node.attrs.get('stash_type', 1)
if stash_type != 1:
raise ValueError(f"iRMSNorm: only stash_type=1 (FP32) is supported, got {stash_type}")

return ret

def parseNodeCtxt(self,
@@ -70,8 +105,19 @@ def parseNodeCtxt(self,
for idx, outputNode in enumerate(node.outputs):
self.operatorRepresentation[outputs[idx]] = ctxt.lookup(outputNode.name).name

self.operatorRepresentation['size'] = np.prod(ctxt.lookup(node.inputs[0].name).shape)
self.operatorRepresentation['lastDimLength'] = ctxt.lookup(node.inputs[0].name).shape[-1]
input_shape = list(ctxt.lookup(node.inputs[0].name).shape)

axis = node.attrs.get('axis', -1)
if axis < 0:
axis = len(input_shape) + axis

self.operatorRepresentation['inputSize'] = int(np.prod(input_shape))
self.operatorRepresentation['NormalizedAxesSize'] = int(np.prod(input_shape[axis:]))
self.operatorRepresentation['scale'] = node.inputs[1].values

# Keep old keys for C template compatibility
self.operatorRepresentation['size'] = int(np.prod(input_shape))
self.operatorRepresentation['lastDimLength'] = int(input_shape[-1])

return ctxt, True

@@ -471,23 +517,37 @@ def __init__(self):
super().__init__()

def parseNode(self, node: gs.Node) -> bool:

ret = all([len(node.inputs) == 2, len(node.outputs) == 1])

return ret

def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:

data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)

self.operatorRepresentation['data_in_1'] = data_in_1.name
self.operatorRepresentation['data_in_2'] = data_in_2.name
self.operatorRepresentation['data_out'] = data_out.name
-self.operatorRepresentation['size'] = np.prod(data_in_1.shape)
+self.operatorRepresentation['size'] = np.prod(data_out.shape)

# Check if broadcasting is needed
shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)

need_broadcast = (shape1 != out_shape) or (shape2 != out_shape)
self.operatorRepresentation['need_broadcast'] = need_broadcast

if need_broadcast:
strides1, strides2 = compute_broadcast_strides(shape1, shape2, out_shape)

self.operatorRepresentation['ndim'] = len(out_shape)
self.operatorRepresentation['strides1'] = strides1
self.operatorRepresentation['strides2'] = strides2
self.operatorRepresentation['out_shape'] = out_shape

return ctxt, True
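To illustrate how a kernel might consume this stride metadata, here is a sketch of a strided elementwise Mul (`broadcast_mul` is hypothetical, not one of the Deeploy Snitch kernels): it walks the output index space and maps each multi-index back to a flat offset in each (possibly broadcast) input, exactly as a C loop nest over `ndim` dimensions would.

```python
import itertools

import numpy as np


def broadcast_mul(a_flat, b_flat, out_shape, strides1, strides2):
    """Elementwise Mul over flat buffers using precomputed broadcast strides."""
    out = np.empty(int(np.prod(out_shape)), dtype = a_flat.dtype)
    for flat, idx in enumerate(itertools.product(*(range(d) for d in out_shape))):
        off1 = sum(i * s for i, s in zip(idx, strides1))  # stride 0 pins broadcast dims
        off2 = sum(i * s for i, s in zip(idx, strides2))
        out[flat] = a_flat[off1] * b_flat[off2]
    return out.reshape(out_shape)


a = np.arange(6.0).reshape(2, 3)  # shape [2, 3] -> strides [3, 1]
b = np.array([10.0, 20.0, 30.0])  # shape [3]    -> strides [0, 1]
res = broadcast_mul(a.ravel(), b.ravel(), [2, 3], [3, 1], [0, 1])
assert np.array_equal(res, a * b)
```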

@@ -2096,15 +2156,15 @@ def parseNodeCtxt(self,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:

-inputs = ["input1", "input2"]
-outputs = ["output"]
+inputs = ["A", "B"]
+outputs = ["C"]
for idx, inputNode in enumerate(node.inputs):
if idx < len(inputs):
self.operatorRepresentation[inputs[idx]] = ctxt.lookup(inputNode.name).name
for idx, outputNode in enumerate(node.outputs):
self.operatorRepresentation[outputs[idx]] = ctxt.lookup(outputNode.name).name

-self.operatorRepresentation['size'] = np.prod(ctxt.lookup(self.operatorRepresentation['input1']).shape)
+self.operatorRepresentation['size'] = np.prod(ctxt.lookup(self.operatorRepresentation['A']).shape)

return ctxt, True

2 changes: 1 addition & 1 deletion Deeploy/Targets/Generic/Templates/FloatDivTemplate.py
@@ -6,5 +6,5 @@

referenceTemplate = NodeTemplate("""
// Division (Name: ${nodeName}, Op: ${nodeOp})
-SINGLE_CORE Div_fp${input1_type.referencedType.typeWidth}_fp${input2_type.referencedType.typeWidth}_fp${output_type.referencedType.typeWidth}(${input1}, ${input2}, ${output}, ${size});
+SINGLE_CORE Div_fp${A_type.referencedType.typeWidth}_fp${B_type.referencedType.typeWidth}_fp${C_type.referencedType.typeWidth}(${A}, ${B}, ${C}, ${size});
""")
29 changes: 27 additions & 2 deletions Deeploy/Targets/Generic/TypeCheckers.py
@@ -6,7 +6,7 @@

import numpy as np

-from Deeploy.AbstractDataTypes import Pointer
+from Deeploy.AbstractDataTypes import FloatImmediate, Pointer
from Deeploy.CommonExtensions.TypeCheckers.SignPropTypeChecker import SignPropTypeChecker
from Deeploy.DeeployTypes import ConstantBuffer, OperatorRepresentation, VariableBuffer

@@ -409,7 +409,10 @@ def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[

def _inferNumLevels(self, inputs: List[VariableBuffer],
operatorRepresentation: OperatorRepresentation) -> List[int]:
-return [2**(4 * self.input_types[0].referencedType.typeWidth)]
+input_type = self.input_types[0].referencedType
+if issubclass(input_type, FloatImmediate):
+return [2**(input_type.typeWidth)]
+return [2**(4 * input_type.typeWidth)]

def _inferSignedness(self, inputs: List[VariableBuffer],
operatorRepresentation: OperatorRepresentation) -> List[bool]:
@@ -610,3 +613,25 @@ def _inferNumLevels(self, inputs: List[VariableBuffer],
def _inferSignedness(self, inputs: List[VariableBuffer],
operatorRepresentation: OperatorRepresentation) -> List[bool]:
return [True]


class RMSNormChecker(SignPropTypeChecker):

def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]):
super().__init__(input_types, output_types)

def _inferNumLevels(self, inputs: List[VariableBuffer],
operatorRepresentation: OperatorRepresentation) -> List[int]:
# RMSNorm: square, mean, sqrt, reciprocal, multiply
# Output precision similar to input
return [2**(self.input_types[0].referencedType.typeWidth)]

def _inferSignedness(self, inputs: List[VariableBuffer],
operatorRepresentation: OperatorRepresentation) -> List[bool]:
# RMSNorm output can be signed (depending on input signedness)
if inputs[0]._signed:
return [True]
else:
return [False]

