CCT Attention Training on Siracusa #69
Conversation
📝 Walkthrough
Adds gradient/backprop support for GELU and LayerNorm (parsers, layers, bindings, tiling constraints, templates, and kernels), plus SoftmaxGrad and ReduceSum tiling; extends GEMM/SGD templates and the GEMM kernel for optional bias and unrolled paths; adjusts CI L1 settings and generated-transpose naming.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant Frontend as Deeploy Frontend
    participant Parser as Node Parser
    participant Layer as ONNX Layer
    participant Mapper as NodeMapper/Tiler
    participant Constraint as TileConstraint
    participant Template as NodeTemplate
    participant Kernel as Target Kernel
    Frontend->>Parser: emit gradient node (GeluGrad / LayerNormalizationGrad / SoftmaxGrad / ReduceSum)
    Parser->>Layer: build OperatorRepresentation (inputs, shapes, size, lastDim)
    Layer->>Mapper: request mapping (NodeMapper with tiling-ready bindings)
    Mapper->>Constraint: add geometrical & policy constraints
    Constraint->>Mapper: serialize tiling solution (VariableReplacementScheme, TilingSchedule)
    Mapper->>Template: alignToContext / instantiate gradient template
    Template->>Kernel: emit parallel chunk invocation (start_idx, end_idx)
    Kernel->>Kernel: compute chunked gradients and write outputs
    Kernel-->>Frontend: grad_out available for next passes
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
26-42: Unrolling factor of 6 is functional but unconventional. The loop unrolling works correctly. However, an unroll factor of 4 or 8 typically aligns better with SIMD widths and cache lines on most architectures. If performance is critical, consider benchmarking with power-of-2 unroll factors.
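For reference, a minimal sketch of the power-of-2 variant being suggested (names here are illustrative, not the template's actual identifiers):

```c
// Sketch only: 4-wide unrolled SGD update with a scalar tail loop.
void sgd_update_unroll4(float *weights, const float *grads, float lr,
                        int start, int stop) {
  int i = start;
  // Main loop: four elements per iteration.
  for (; i + 3 < stop; i += 4) {
    weights[i + 0] -= lr * grads[i + 0];
    weights[i + 1] -= lr * grads[i + 1];
    weights[i + 2] -= lr * grads[i + 2];
    weights[i + 3] -= lr * grads[i + 3];
  }
  // Tail loop: remaining 0-3 elements.
  for (; i < stop; i++) {
    weights[i] -= lr * grads[i];
  }
}
```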
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
355-360: Consider adding `strict=True` to `zip()` calls for robustness. While the lists are constructed together and should always match in length, adding `strict=True` (Python 3.10+) would catch any future bugs where list lengths diverge. This is optional given the current structure guarantees equal lengths.

```diff
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b})
```

Deeploy/Targets/Generic/Layers.py (1)
453-457: Consider adding a `computeOps()` method for consistency. The `LayerNormGradLayer` class doesn't implement `computeOps()`, while other gradient layers like `GELUGradLayer` (Line 66) and `SoftmaxGradLayer` (Line 134) do. For consistency and accurate operation counting in analysis workflows, consider adding this method. Would you like me to help estimate the operation count for the LayerNorm gradient based on the kernel implementation in TargetLibraries/Generic/src/Layernorm_fp32.c?

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
113-121: Rename unused loop variables. The loop variable `dim` is not used in the loop body on lines 113-117 and 118-121. Following Python conventions, rename it to `_dim` or simply remove it if only the index is needed. Apply this diff:

```diff
-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_in_buffer_name, dimIdx = idx))

-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_out_buffer_name, dimIdx = idx))
```

TargetLibraries/Generic/src/Layernorm_fp32.c (1)
40-43: Unused `bias` parameter. The `bias` parameter is declared but never used in the function body. The LayerNorm gradient with respect to the input typically doesn't require the bias term, but the signature inconsistency with the forward pass may cause confusion. Consider either:
- documenting why `bias` is intentionally unused (for API consistency with the forward pass), or
- removing it if not needed for the gradient computation:

```diff
-void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
-                             float32_t *grad_out, float32_t *scale,
-                             float32_t *bias, float32_t epsilon, int32_t size,
-                             int32_t lastDimLength) {
+void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
+                             float32_t *grad_out, float32_t *scale,
+                             float32_t epsilon, int32_t size,
+                             int32_t lastDimLength) {
```

Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)
16-17: Return type annotation inconsistency. The method signature declares `Tuple[NetworkContext, Dict, List[str]]` but should return `Tuple[NetworkContext, OperatorRepresentation, List[str]]` to match the base class `NodeTemplate.alignToContext` signature shown in the relevant snippets.

```diff
     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
```

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (2)
49-59: Consider extracting axis normalization into a helper method. The negative axis normalization logic is duplicated across `addGeometricalConstraint` (lines 54-59 and 80-86), `constructSymbolicNodeRep` (lines 154-158), and `serializeTilingSolution` (lines 210-216). Extracting it into a shared static helper would reduce duplication and ensure consistent behavior.

```python
@staticmethod
def _normalize_axes(axis, input_shape_len: int) -> List[int]:
    """Normalize axis indices, handling negative values."""
    if isinstance(axis, int):
        axis = [axis]
    normalized = []
    for ax in axis:
        if ax < 0:
            ax = input_shape_len + ax
        normalized.append(ax)
    return normalized
```

Also applies to: 75-86, 146-158, 210-216
237-238: Move import to top of file. The `HyperRectangle` import should be placed at the top of the file with other imports from the same module, unless there's a specific reason to avoid it (e.g., a circular dependency).

```diff
 from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import HyperRectangle
```

Or combine into a single import:

```diff
-from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, VariableReplacementScheme
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- .github/workflows/ci-platform-siracusa-tiled.yml (3 hunks)
- .github/workflows/ci-platform-siracusa.yml (0 hunks)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/Generic/Parsers.py (2 hunks)
- Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Platform.py (5 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/Tiler.py (3 hunks)
- Deeploy/TilingExtension/TilingCodegen.py (1 hunks)
- TargetLibraries/Generic/inc/kernel/GELU.h (1 hunks)
- TargetLibraries/Generic/inc/kernel/Layernorm.h (1 hunks)
- TargetLibraries/Generic/src/GELU_fp32.c (1 hunks)
- TargetLibraries/Generic/src/Layernorm_fp32.c (1 hunks)
- TargetLibraries/PULPOpen/src/Gemm.c (1 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/ci-platform-siracusa.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (10)
TargetLibraries/Generic/inc/kernel/GELU.h (1)
TargetLibraries/Generic/src/GELU_fp32.c (1)
GELU_fp32_fp32_sigmoid_grad_chunk(34-53)
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
LayernormGrad_fp32_fp32(40-93)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (4)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle (24-35)
Deeploy/TilingExtension/TileConstraint.py (1)
extractBaseAddr(56-74)
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/Bindings.py (4)
Deeploy/DeeployTypes.py (1)
NodeBinding (1512-1657)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/PULPOpen/Tiler.py (4)
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
GeluGradTileConstraint (8-12)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (2)
LayernormTileConstraint (19-80)
LayernormGradTileConstraint (83-159)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1)
ReduceSumTileConstraint (18-250)
Deeploy/TilingExtension/TilerExtension.py (1)
TilingReadyNodeBindings(1027-1035)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3)
Deeploy/DeeployTypes.py (3)
NetworkContext(508-1020)NodeTemplate(87-229)alignToContext(119-139)Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py (1)
alignToContext(37-53)Deeploy/Targets/PULPOpen/Templates/DMASliceTemplate.py (1)
alignToContext(18-138)
Deeploy/Targets/PULPOpen/Platform.py (3)
Deeploy/Targets/Generic/Layers.py (2)
GELUGradLayer(61-70)LayerNormGradLayer(453-456)Deeploy/Targets/Generic/Parsers.py (3)
GELUGradParser(773-797)LayerNormGradParser(1677-1704)LayerNormParser(1647-1674)Deeploy/DeeployTypes.py (1)
NodeMapper(1660-1816)
🪛 Ruff (0.14.5)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
113-113: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
118-118: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
128-128: Unused class method argument: ctxt
(ARG003)
Deeploy/Targets/Generic/Parsers.py
786-786: Unused method argument: channels_first
(ARG002)
1691-1691: Unused method argument: channels_first
(ARG002)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
356-356: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py
71-71: Avoid specifying long messages outside the exception class
(TRY003)
92-93: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Unused static method argument: parseDict
(ARG004)
127-127: Unused static method argument: ctxt
(ARG004)
133-133: Unused static method argument: tilerModel
(ARG004)
172-172: Unused class method argument: absoluteOutputCubes
(ARG003)
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py
159-159: Unused class method argument: ctxt
(ARG003)
🔇 Additional comments (39)
.github/workflows/ci-platform-siracusa-tiled.yml (3)
137-138: Verify intentionality of L1 configuration simplification for CCT tests. The change reduces L1 memory configurations from multiple test values to a single value, which limits test coverage to one memory allocation strategy. According to the AI summary, this was `[4000, 64000]` and is now `[64000]`. Confirm this reduction is intentional and doesn't mask hardware-specific issues that only surface with smaller memory allocations.
168-171: Verify consistency of new test addition across all job configurations. A new test entry `testTrainCCT/CCT2_FT2` is added to the L3 jobs with `L1: [128000]`. Per the AI summary, this test should also be added to the singlebuffer-L2 configuration (line 138 area). Confirm whether this is intentional or if the test entry is missing from singlebuffer-L2 and other jobs.
208-211: Verify consistency of L1 simplification across doublebuffer-L3. Changes here mirror those in singlebuffer-L3 (lines 169-171), which is good for consistency. However, apply the same verification from the previous comment to ensure `testTrainCCT/CCT2_FT2` is correctly scoped to only L3 jobs and not missing from other configurations.
Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1)
679-680: LGTM! Naming collision fix improves correctness. The updated naming scheme correctly prevents collisions when multiple transpose nodes feed into the same (or identically-named) downstream nodes. By prefixing with `t1.name`, each split transpose gets a unique identifier derived from both the source transpose and its consumer.

Deeploy/TilingExtension/TilingCodegen.py (1)
34-35: LGTM! Defensive tuple coercion improves robustness. The normalization ensures `offset` and `dims` are always tuples internally, which is useful when callers pass lists or other sequences. Consider updating the type hints from `Tuple[int, ...]` to `Sequence[int]` to reflect the accepted input types more accurately.

TargetLibraries/Generic/inc/kernel/GELU.h (1)
28-30: LGTM! Function declaration is consistent with implementation. The new gradient chunk function declaration aligns with the implementation in `GELU_fp32.c` and follows the existing naming conventions in this header.

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
200-210: LGTM! Optional bias handling is correctly implemented. The conditional bias detection and buffer handling ensures both biased and unbiased GEMM configurations are properly supported. The pattern is consistently applied in both `addGeometricalConstraint` and `serializeTilingSolution`.

Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
40-51: LGTM! Kernel call with proper guard condition. The check for `elem_count > 0` correctly prevents calling the kernel with zero elements, which could occur for cores that have no work assigned when the sequence count is less than the core count.

Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
8-12: LGTM! The `GeluGradTileConstraint` class correctly extends `BOPTileConstraint` with appropriate gradient-specific tensor names (grad_in, data_in, grad_out), following the established pattern for binary operation tile constraints.

TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
28-31: LGTM! The function declaration for `LayernormGrad_fp32_fp32` is correctly added, with a signature that mirrors the forward pass and properly includes gradient-specific parameters (grad_in, grad_out). The implementation exists in the corresponding .c file.

TargetLibraries/Generic/src/GELU_fp32.c (1)
34-53: LGTM! The GELU gradient computation is mathematically correct. The function properly implements the derivative of the GELU approximation (x * sigmoid(1.702*x)) using the chain rule and applies the upstream gradient correctly.
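For reference, the derivative being implemented follows from the product and chain rules; with σ(·) denoting the logistic sigmoid:

```math
f(x) = x\,\sigma(1.702\,x), \qquad
f'(x) = \sigma(1.702\,x) + 1.702\,x\,\sigma(1.702\,x)\bigl(1 - \sigma(1.702\,x)\bigr), \qquad
\text{grad\_out} = \text{grad\_in}\cdot f'(x)
```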
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
12-20: LGTM! The gradient template correctly implements parallel chunking across cores, with proper bounds checking and invocation of the GELU gradient kernel. The chunk size calculation and boundary handling prevent out-of-bounds access.
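As a rough illustration of the chunking pattern described here (a sketch with hypothetical names; the kernel invocation itself is elided to avoid guessing its signature):

```c
// Sketch: split `size` elements across `num_cores` cores using ceil-division
// chunks; clamping keeps the last cores in bounds, and cores past the end
// receive an empty range (start == stop) so they simply skip the kernel call.
static void compute_chunk(int size, int core_id, int num_cores,
                          int *start, int *stop) {
  int chunk = (size + num_cores - 1) / num_cores;  // ceil(size / num_cores)
  *start = core_id * chunk;
  *stop = *start + chunk;
  if (*start > size) *start = size;  // cores with no work get an empty range
  if (*stop > size) *stop = size;    // clamp to the end of the tensor
}
```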
Deeploy/Targets/Generic/Layers.py (2)
61-71: LGTM! The `GELUGradLayer` is properly implemented with a reasonable operation count estimate of 9 operations per element for the GELU gradient computation.
482-486: LGTM! The `computeOps()` method correctly estimates 2 operations per element (one multiply and one add/subtract) for the SGD weight update step.

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
83-159: Verify tiling schedule correctness for gradient flow. The `LayernormGradTileConstraint` implementation looks correct overall. However, ensure that the tiling schedule properly handles the gradient data flow:
- Line 151: both `grad_in` and `data_in` are loaded with the same cube dimensions
- Line 153: `grad_out` uses the same cube dimensions as the inputs

This is consistent with the forward-pass pattern and the gradient computation requirements. Consider verifying with a test that exercises tiling for the LayerNorm gradient to ensure the tile boundaries and data dependencies are correctly handled.
Deeploy/Targets/Generic/Parsers.py (2)
773-798: LGTM! The `GELUGradParser` correctly parses GELU gradient nodes with proper input/output mapping. The parser validates 2 inputs (upstream gradient and forward input) and 1 output, which aligns with the gradient computation requirements.
1677-1705: LGTM! The `LayerNormGradParser` correctly extends `iLayerNormParser` and validates 4 inputs (grad_in, data_in, weight, bias) and 1 output for the LayerNorm gradient operation. The size and lastDimLength computations are correct.

Deeploy/Targets/PULPOpen/Tiler.py (4)
18-25: LGTM! The import updates correctly add gradient-specific bindings (GELUGrad, LayerNormGrad, SoftmaxGrad, SGD) needed for the new tiling-ready bindings defined later in the file.
30-35: LGTM! The tile constraint imports are properly added to support gradient operations (GeluGradTileConstraint, LayernormGradTileConstraint, SoftmaxGradTileConstraint, ReduceSumTileConstraint), enabling gradient-aware tiling throughout the pipeline.
124-125: LGTM! The new gradient tiling-ready bindings are correctly configured:
- `PULPLayernormGradTilingReadyBindings` uses `LayernormGradTileConstraint`
- `PULPFPGELUGradTilingReadyBindings` uses `GeluGradTileConstraint`

Both follow the established pattern and properly wire gradient constraints into the tiling framework.
Also applies to: 130-131
143-146: LGTM! The updates to `PULPSoftmaxGradTilingReadyBindings` and `PULPReduceSumTilingReadyBindings` replace the implicit `UntiledTileConstraint` with explicit gradient-specific constraints (`SoftmaxGradTileConstraint` and `ReduceSumTileConstraint`), improving tiling optimization for these operations.

TargetLibraries/PULPOpen/src/Gemm.c (5)
29-31: LGTM: Bias handling and unroll factor initialization. The `has_bias` flag via NULL check and the unroll factor computation are correctly implemented. The modulo-based tail calculation ensures correct handling of non-divisible dimensions.
33-83: LGTM: Non-transposed GEMM path with O-dimension unrolling. The index calculations are correct for row-major layout. The 6-wide O-dimension unrolling with proper tail handling is well-implemented.
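For readers unfamiliar with the pattern, a simplified sketch of the unroll-by-6 plus tail structure referred to above (illustrative names only, bias and transposition handling elided; not the kernel's actual code):

```c
// Sketch: row-major C[m][o] = sum_k A[m][k] * B[k][o], with A of shape MxN and
// B of shape NxO. The O dimension is unrolled by 6; a scalar loop handles the
// O % 6 leftover columns.
void gemm_rowmajor_unroll6(const float *A, const float *B, float *C,
                           int M, int N, int O) {
  const int tail = O % 6;
  for (int m = 0; m < M; m++) {
    int o = 0;
    for (; o < O - tail; o += 6) {  // main path: six output columns at once
      float acc[6] = {0.0f};
      for (int k = 0; k < N; k++) {
        const float a = A[m * N + k];
        for (int u = 0; u < 6; u++)
          acc[u] += a * B[k * O + o + u];
      }
      for (int u = 0; u < 6; u++)
        C[m * O + o + u] = acc[u];
    }
    for (; o < O; o++) {  // tail: remaining O % 6 columns
      float acc = 0.0f;
      for (int k = 0; k < N; k++)
        acc += A[m * N + k] * B[k * O + o];
      C[m * O + o] = acc;
    }
  }
}
```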
84-178: LGTM: Transposed-A GEMM path with dual-axis unrolling. The transpose indexing `pSrcA[k*M + i]` correctly accesses column `i` of the original matrix. Both N and O dimension unrolling with tail handling are correctly implemented.
179-264: LGTM: Transposed-B GEMM path. The `b_row` pointer setup for transposed B access is correct. The dual-axis unrolling efficiently processes 6 output columns while iterating over the reduction dimension.
265-353: LGTM: Both-transposed GEMM path. The combination of transpose indexing for both matrices is correctly implemented, with a consistent unrolling strategy across all code paths.
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
72-90: Verify LayerNorm gradient formula correctness. The implementation computes `sum_dy` as the unscaled sum of gradients (line 73), but the standard LayerNorm backward-pass formula typically requires `sum(dy * scale)` for the mean-subtraction term. The current formula:

`grad_in[j] - (sum_dy / lastDimLength)`

should potentially be:

`grad_in[j] * scale[j] - (sum_dy_scaled / lastDimLength)`

where `sum_dy_scaled = sum(dy * scale)`. The declared but unused `sum_dy_scaled` variable (line 45) suggests this may have been the original intent. Please verify against the PyTorch or reference LayerNorm backward implementation to confirm the gradient computation is correct.
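For comparison, one common formulation of the per-row LayerNorm input gradient (row length N, normalized input x̂, and g_j = dy_j · γ_j the weight-scaled upstream gradient) is:

```math
\frac{\partial L}{\partial x_j}
  = \frac{1}{\sqrt{\sigma^2 + \epsilon}}
    \left( g_j - \frac{1}{N}\sum_{k} g_k - \frac{\hat{x}_j}{N}\sum_{k} g_k \hat{x}_k \right),
\qquad
\hat{x}_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}
```

Note that the upstream gradient enters scaled by the learnable weight γ before the mean-subtraction terms, which is the discrepancy flagged above.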
Deeploy/Targets/PULPOpen/Bindings.py (3)
423-429: LGTM: LayerNorm gradient binding. The binding correctly specifies 4 float32 inputs (grad_in, data_in, weight, bias) and 1 output, matching the `LayerNormGradParser` signature. The use of `ForkTransformer` is consistent with other PULP bindings.
435-437: LGTM: GELU gradient binding. The gradient binding mirrors the forward `PULPFloatGELUBinding` structure with 2 inputs and 1 output, appropriately using `referenceGradTemplate`.
191-196: I'll help you verify the 3-input Add template compatibility concern. Let me start by locating the FloatAddTemplate definition and examining its implementation.
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (2)
19-25: LGTM: Optional bias handling in alignToContext. The method correctly handles the no-bias case by setting sensible defaults. The `C_type` default to `float32_tPtr` ensures the generated function signature remains valid even when C is NULL.

28-79: LGTM: GEMM template with conditional bias handling. The template correctly handles both biased and bias-free GEMM cases with appropriate conditional code generation. The batch pointer arithmetic is correctly guarded by both `C is not None` and `C_batched` conditions.

Deeploy/Targets/PULPOpen/Platform.py (3)
59-59: LGTM: GELUGradMapper definition. The mapper correctly wires `GELUGradParser` to `PULPFPGELUGradTilingReadyBindings`, following the established pattern of other mappers in this file.
89-89: LGTM: LayerNormGradMapper definition. The mapper correctly connects `LayerNormGradParser` to `PULPLayernormGradTilingReadyBindings`.
118-120: LGTM: Gradient operation mapping entries. The `GeluGrad` and `LayerNormalizationGrad` entries are correctly added to `PULPMapping`, following the same pattern as existing gradient operations like `SoftmaxGrad`.

Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1)
106-188: LGTM! The `SoftmaxGradTileConstraint` implementation follows the same well-established pattern as `iSoftmaxTileConstraint`. The three-tensor constraint logic in `addGeometricalConstraint` correctly ensures dimensional consistency across gradient inputs and outputs, and the serialization properly maps the input/output load schedules. Regarding the static analysis hint about `ctxt` being unused on line 159: this parameter is part of the `TileConstraint` interface signature and must be maintained for API consistency across all constraint implementations.

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (3)
126-130: LGTM! The empty implementation with an explanatory comment is acceptable. The parent class interface requires this method signature, and deferring policy decisions to `serializeTilingSolution` is a valid design choice for ReduceSum operations.
132-168: LGTM! The method correctly constructs the symbolic representation with normalized axes and keepdims handling. The unused `tilerModel` parameter is an interface requirement from the parent `TileConstraint` class.
170-175: I'll help you verify this review comment. Let me start by examining how `absoluteOutputCubes` is used across different tile constraint classes and then look at the specific implementation in ReduceSumTileConstraint.
Xeratec
left a comment
Thanks for your work!
Looks already very good. I still have some small comments, but nothing major.
Actionable comments posted: 1
🧹 Nitpick comments (6)
Deeploy/Targets/Generic/Layers.py (2)
61-70: Verify the ops-per-element count for the GELU gradient. The `computeOps` method uses a hardcoded `ops_per_element = 9`. Consider adding a brief comment documenting which operations are counted (similar to how `GELULayer.computeOps` documents each operation individually). This improves maintainability and validates the count.

```diff
 def computeOps(self):
     size = self.mapper.parser.operatorRepresentation['size']
-    ops_per_element = 9
+    # GELU gradient: d_sigmoid (3), d_x*sigmoid (1), x*d_sigmoid (1),
+    # chain rule combinations (4 ops) = 9 ops per element
+    ops_per_element = 9
     gelu_grad_ops = size * ops_per_element
     return gelu_grad_ops
```
453-457: Consider adding `computeOps` for consistency with `GELUGradLayer`. `LayerNormGradLayer` lacks a `computeOps` method, so it will return 0 from the base class. This is inconsistent with `GELUGradLayer` (lines 61-70), which does implement `computeOps`. If operation counting is desired for gradient layers, consider adding the implementation here as well.

Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (2)
20-23: Use templated types instead of hardcoded `float32_t`. Lines 20 and 23 use hardcoded `float32_t` for `learning_rate` and `temp_mul`, which is inconsistent with the type-safe approach used elsewhere in templates. Based on learnings, explicit bitwidth types derived from template parameters should be used for consistency.

```diff
-float32_t learning_rate = ${lr};
+float${weight_type.referencedType.typeWidth}_t learning_rate = ${lr};
 // Temporary buffer for multiplication results
-float32_t temp_mul[6];
+float${grad_type.referencedType.typeWidth}_t temp_mul[6];
```
46-47: Minor: `temp_grad` should also use a templated type. For consistency with the suggested refactor above, the local variable in the remainder loop should also use the templated type.

```diff
 for (; i < ${nodeName}_chunk_stop; i++) {
-    float32_t temp_grad = learning_rate * ref_${grad}[i];
+    float${grad_type.referencedType.typeWidth}_t temp_grad = learning_rate * ref_${grad}[i];
     ref_${weight_updated}[i] = ref_${weight}[i] - temp_grad;
 }
```

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (2)
329-345: Shape construction for A/B/C cubes and rank assumptions. The new ACube/BCube construction:

```python
if transA == 0:
    ACube = HyperRectangle((BatchOffset, BOffset, MOffset, NOffset), (BatchSize, BSize, MSize, NSize))
else:
    ACube = HyperRectangle((BatchOffset, BOffset, NOffset, MOffset), (BatchSize, BSize, NSize, MSize))
# similarly for BCube
```

assumes a 4-D logical layout `(batch, B, M, N|O)` for A/B. For 2-D or 3-D output cubes you synthesize `BatchOffset`, `BOffset`, `BatchSize`, `BSize` defaults, so the rectangles stay 4-D even when the output cube is lower-rank. If the underlying Float GEMM buffers are always 4-D, this is fine; otherwise, this could desync the hyperrectangle dimensionality from the actual buffer shape. It would be safer to either:
- enforce `len(cube.offset) == 4` in the multi-dimensional case (as in the previous comment's diff), and/or
- document that Float GEMM A/B/C/data_out are always treated as 4-D in this tiler path.

The CCube construction for `has_bias`, `CCube = HyperRectangle(tuple(cube.offset), tuple(cube.dims))`, is consistent with a per-output-tile bias tile, assuming C is laid out like `data_out`; this ties back to the earlier question about C's rank and broadcasting semantics.
358-363: Add `strict=` to `zip()` to satisfy Ruff and catch mismatched lists. Static analysis warns about `zip()` without `strict=`. Given `inputACubes`, `inputBCubes`, and `inputAddCubes` are intended to be parallel lists, using `strict=True` will both satisfy Ruff and surface any accidental length mismatches:

```diff
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict = True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict = True):
         inputLoadSchedule.append({"A": a, "B": b})
```

Please confirm the project targets a Python version that supports `zip(strict=...)` (PEP 618) before applying this.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T13:54:22.700Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 69
File: Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py:36-38
Timestamp: 2025-12-02T13:54:22.700Z
Learning: In Deeploy templates (Python files in Deeploy/Targets/PULPOpen/Templates/), always use explicit bitwidth types (e.g., `float${...type.referencedType.typeWidth}_t*`) instead of hardcoded types (e.g., `float*`) to ensure type consistency with templated kernel calls.
Applied to files:
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (4)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (3)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle(24-35)
Deeploy/Targets/PULPOpen/Bindings.py (3)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/Generic/Layers.py (2)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
DeeployTest/testUtils/dmaUtils.py (1)
size(72-73)
🪛 Ruff (0.14.7)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
362-362: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
🔇 Additional comments (9)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
20-52: LGTM! Gradient template properly implements parallel computation with type-safe pointers. The template correctly uses explicit bitwidth types (`float${...typeWidth}_t`) for pointer declarations (lines 36-38), addressing the previous review feedback. The parallel chunking logic with 32-bit integers for chunk calculations is appropriate for handling larger tensors.

Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
12-14: Overflow fix correctly applied with `int32_t`. The chunk variables now use `int32_t` instead of `int16_t`, properly addressing the previous review concern about overflow for tensors larger than 32K elements.

Deeploy/Targets/PULPOpen/Bindings.py (2)
434-436: LGTM! GELU gradient binding correctly configured. The `PULPFloatGELUGradBinding` uses `GELUChecker` with 2 float32_t inputs (`grad_in`, `data_in`) and 1 output, matching the `referenceGradTemplate` expectations in `FloatGELUTemplate`.
422-428: Verify `LayerNormChecker` supports the 4-input gradient signature. The gradient binding at lines 422-428 uses `LayerNormChecker` with 4 inputs (`grad_in`, `data_in`, `weight`, `bias`) and 1 output, while the forward binding at lines 416-420 uses only 3 inputs. Confirm that `LayerNormChecker` correctly validates this extended signature for gradient operations and that the additional input is properly handled.

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5)
202-211: Bias-aware buffer registration for C looks good. Making `C` optional via `has_bias` and only looking up / registering its dimensions when present avoids key errors and unnecessary constraints in the no-bias case. This is a clean way to support both biased and unbiased GEMM variants.
234-240: Confirm bias tensor C rank and alignment with output dims. The bias constraints assume `bufferC` has at least two trailing dimensions that align with the output's last two dims (`M` and `O`):
- `dimOffsetC = len(bufferC.shape) - 2`
- both `dimOffsetC` and `dimOffsetC + 1` are tied to the output's first/second GEMM dims.

If `C` is ever stored as a 1D `[O]` vector (or `[batch, O]` with broadcasting on `M`), these constraints would be too strong and could make the system unsatisfiable. Please confirm that, for all Float GEMM uses, `C` is always shaped such that its last two dims truly match `(M, O)` of `data_out`, or adjust the constraints to allow typical bias broadcasting.
277-286: Optional C handling in address extraction is correct. Rebuilding `outputCubes` as `HyperRectangle` instances and making `addrNames` conditional on `has_bias` (`['A', 'B', 'data_out']` vs inserting `'C'`) is consistent with `extractBaseAddr`'s expectations and keeps the biased/unbiased flows aligned.
293-299: Good: N size now depends on transA (matches learned pattern). Using `transA` to select which axis of `A` is the logical `N`:

```python
if transA == 0:
    NSize = ctxt.lookup(varA).shape[-1]
else:
    NSize = ctxt.lookup(varA).shape[-2]
```

fixes mis-sizing when `A` is stored transposed and aligns Float GEMM with the recommended pattern for handling transpose flags in tiling serialization. Based on learnings, this is exactly what we want.
310-327: `replacements["batch"]` appears to use the wrong variable. Inside the loop over `outputCubes`, `BSize` is initialized to `1` and only overwritten when `len(cube.offset) == 4` (set to `cube.dims[1]`). However, `BatchSize` captures the actual batch dimension across all 2D, 3D, and 4D cases (from `cube.dims[0]`, or `cube.dims[0]` in the 3D case). Using `replacements["batch"].append(BSize)` will record incorrect batch sizes for 2D and 3D cubes (always `1`) and misalign the semantics in 4D cubes. Change `replacements["batch"].append(BSize)` to `replacements["batch"].append(BatchSize)` to correctly track the batch extent. Verify this change against how `replacements["batch"]` is consumed in the Float GEMM template to ensure the semantics remain consistent with the tiling strategy.
Actionable comments posted: 0
♻️ Duplicate comments (1)
Deeploy/Targets/PULPOpen/Bindings.py (1)
183-194: Duplicate float32 Add binding in `PULPAddBindings`. Lines 191-193 introduce a second `NodeBinding` with the exact same `AddChecker` signature and template as the one at Lines 189-190. This duplication is unnecessary and can be confusing for binding resolution. You can safely drop the extra list concatenation:

```diff
 PULPAddBindings = [
     @@
 ] + [
     NodeBinding(AddChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
                 FloatAddTemplate.referenceTemplate, ForkTransformer)
-] + [
-    NodeBinding(AddChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
-                FloatAddTemplate.referenceTemplate, ForkTransformer)
 ]
```
🧹 Nitpick comments (2)
Deeploy/Targets/Generic/Layers.py (2)
61-71: GELUGradLayer op count logic looks consistent; consider documenting the 9 ops/element.
`computeOps` reuses the established pattern (`size` from `operatorRepresentation`) and returns an integer op count; this will integrate cleanly with `ONNXLayer.bind`. To aid future maintainers, consider adding a short comment breaking down where the 9 per-element ops come from (e.g., multiplies/adds/exps) so the factor is not a magic constant.
453-457: LayerNormGradLayer currently reports 0 ops; decide if this is intentional. Without a `computeOps` override, `LayerNormGradLayer` will inherit the base implementation and contribute `0` to `nodeOps`, unlike `LayerNormLayer` which has an explicit breakdown. If op accounting for backward passes matters to you (e.g., training FLOP reports), consider adding a `computeOps` similar in spirit to `LayerNormLayer`; otherwise, this minimalist definition is fine but should be a conscious choice.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`formatting_report.log` is excluded by `!**/*.log`
📒 Files selected for processing (2)
- Deeploy/Targets/Generic/Layers.py (2 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
🔇 Additional comments (1)
Deeploy/Targets/PULPOpen/Bindings.py (1)
421-427: Verify argument ordering in LayerNorm and GELU gradient templates against binding declarations.
`PULPLayernormGradBinding` and `PULPFloatGELUGradBinding` reference `referenceGradTemplate` from their respective float templates. The type signatures (all `float32_t` pointers) appear consistent, but the argument order must be confirmed: ensure the pointer sequence in the checker (grad-out, activations, parameters) matches what `FloatLayernormTemplate.referenceGradTemplate` and `FloatGELUTemplate.referenceGradTemplate` actually expect in the generated kernel calls.
All comments are resolved.
Xeratec
left a comment
Looks good! Only one small change requested: Seems like a log file sneaked its way into the PR.
This PR introduces gradient operator support, improved GEMM performance, and updates to the CCT training workflow. It also includes fixes to tile constraints and naming consistency in the transpose pass.
Added
Changed
Fixed
PR Merge Checklist
- … `devel` commit and pointing to `devel`.
- The `CHANGELOG.md` file has been updated.