CCT Attention Training on Siracusa #69
Conversation
📝 Walkthrough
Adds gradient/backprop support for GELU and LayerNorm (parsers, layers, bindings, tiling constraints, templates, and kernels), plus SoftmaxGrad and ReduceSum tiling; extends GEMM/SGD templates and the GEMM kernel for optional bias and unrolled paths; adjusts CI L1 settings and generated-transpose naming.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    autonumber
    participant Frontend as Deeploy Frontend
    participant Parser as Node Parser
    participant Layer as ONNX Layer
    participant Mapper as NodeMapper/Tiler
    participant Constraint as TileConstraint
    participant Template as NodeTemplate
    participant Kernel as Target Kernel
    Frontend->>Parser: emit gradient node (GeluGrad / LayerNormalizationGrad / SoftmaxGrad / ReduceSum)
    Parser->>Layer: build OperatorRepresentation (inputs, shapes, size, lastDim)
    Layer->>Mapper: request mapping (NodeMapper with tiling-ready bindings)
    Mapper->>Constraint: add geometrical & policy constraints
    Constraint->>Mapper: serialize tiling solution (VariableReplacementScheme, TilingSchedule)
    Mapper->>Template: alignToContext / instantiate gradient template
    Template->>Kernel: emit parallel chunk invocation (start_idx, end_idx)
    Kernel->>Kernel: compute chunked gradients and write outputs
    Kernel-->>Frontend: grad_out available for next passes
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
26-42: Unrolling factor of 6 is functional but unconventional. The loop unrolling works correctly. However, an unroll factor of 4 or 8 typically aligns better with SIMD widths and cache lines on most architectures. If performance is critical, consider benchmarking with power-of-2 unroll factors.
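For reference, a minimal sketch of the power-of-2 variant being suggested (names here are illustrative, not the template's actual identifiers):

```c
// Sketch only: 4-wide unrolled SGD update with a scalar tail loop.
void sgd_update_unroll4(float *weights, const float *grads, float lr,
                        int start, int stop) {
  int i = start;
  // Main loop: four elements per iteration.
  for (; i + 3 < stop; i += 4) {
    weights[i + 0] -= lr * grads[i + 0];
    weights[i + 1] -= lr * grads[i + 1];
    weights[i + 2] -= lr * grads[i + 2];
    weights[i + 3] -= lr * grads[i + 3];
  }
  // Tail loop: remaining 0-3 elements.
  for (; i < stop; i++) {
    weights[i] -= lr * grads[i];
  }
}
```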
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
355-360: Consider adding `strict=True` to `zip()` calls for robustness. While the lists are constructed together and should always match in length, adding `strict=True` (Python 3.10+) would catch any future bugs where list lengths diverge. This is optional given the current structure guarantees equal lengths.

```diff
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict=True):
         inputLoadSchedule.append({"A": a, "B": b})
```

Deeploy/Targets/Generic/Layers.py (1)
453-457: Consider adding a `computeOps()` method for consistency. The `LayerNormGradLayer` class doesn't implement `computeOps()`, while other gradient layers like `GELUGradLayer` (Line 66) and `SoftmaxGradLayer` (Line 134) do. For consistency and accurate operation counting in analysis workflows, consider adding this method. Would you like me to help estimate the operation count for the LayerNorm gradient based on the kernel implementation in TargetLibraries/Generic/src/Layernorm_fp32.c?

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
113-121: Rename unused loop variables. The loop variable `dim` is not used in the loop body on lines 113-117 and 118-121. Following Python conventions, rename it to `_dim` or simply remove it if only the index is needed. Apply this diff:

```diff
-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_in_buffer_name, dimIdx = idx))

-        for idx, dim in enumerate(input_shape):
+        for idx, _ in enumerate(input_shape):
             tilerModel.addConstraint(
                 tilerModel.getTensorDimVar(tensorName = data_in_buffer_name, dimIdx = idx) ==
                 tilerModel.getTensorDimVar(tensorName = grad_out_buffer_name, dimIdx = idx))
```

TargetLibraries/Generic/src/Layernorm_fp32.c (1)
40-43: Unused `bias` parameter. The `bias` parameter is declared but never used in the function body. The LayerNorm gradient with respect to the input typically doesn't require the bias term, but the signature inconsistency with the forward pass may cause confusion. Consider either:
- documenting why `bias` is intentionally unused (for API consistency with the forward pass), or
- removing it if not needed for the gradient computation:

```diff
-void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
-                             float32_t *grad_out, float32_t *scale,
-                             float32_t *bias, float32_t epsilon, int32_t size,
-                             int32_t lastDimLength) {
+void LayernormGrad_fp32_fp32(float32_t *grad_in, float32_t *data_in,
+                             float32_t *grad_out, float32_t *scale,
+                             float32_t epsilon, int32_t size,
+                             int32_t lastDimLength) {
```

Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)
16-17: Return type annotation inconsistency. The method signature declares `Tuple[NetworkContext, Dict, List[str]]` but should return `Tuple[NetworkContext, OperatorRepresentation, List[str]]` to match the base class `NodeTemplate.alignToContext` signature shown in the relevant snippets.

```diff
     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
```

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (2)
49-59: Consider extracting axis normalization into a helper method. The negative axis normalization logic is duplicated across `addGeometricalConstraint` (lines 54-59 and 80-86), `constructSymbolicNodeRep` (lines 154-158), and `serializeTilingSolution` (lines 210-216). Extracting it into a shared static helper would reduce duplication and ensure consistent behavior.

```python
@staticmethod
def _normalize_axes(axis, input_shape_len: int) -> List[int]:
    """Normalize axis indices, handling negative values."""
    if isinstance(axis, int):
        axis = [axis]
    normalized = []
    for ax in axis:
        if ax < 0:
            ax = input_shape_len + ax
        normalized.append(ax)
    return normalized
```

Also applies to: 75-86, 146-158, 210-216
237-238: Move import to top of file. The `HyperRectangle` import should be placed at the top of the file with other imports from the same module, unless there's a specific reason to avoid it (e.g., a circular dependency).

```diff
 from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import HyperRectangle
```

Or combine into a single import:

```diff
-from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, TilingSchedule, VariableReplacementScheme
+from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, VariableReplacementScheme
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
- .github/workflows/ci-platform-siracusa-tiled.yml (3 hunks)
- .github/workflows/ci-platform-siracusa.yml (0 hunks)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/Generic/Parsers.py (2 hunks)
- Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Platform.py (5 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1 hunks)
- Deeploy/Targets/PULPOpen/Tiler.py (3 hunks)
- Deeploy/TilingExtension/TilingCodegen.py (1 hunks)
- TargetLibraries/Generic/inc/kernel/GELU.h (1 hunks)
- TargetLibraries/Generic/inc/kernel/Layernorm.h (1 hunks)
- TargetLibraries/Generic/src/GELU_fp32.c (1 hunks)
- TargetLibraries/Generic/src/Layernorm_fp32.c (1 hunks)
- TargetLibraries/PULPOpen/src/Gemm.c (1 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/ci-platform-siracusa.yml
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (10)
TargetLibraries/Generic/inc/kernel/GELU.h (1)
TargetLibraries/Generic/src/GELU_fp32.c (1)
GELU_fp32_fp32_sigmoid_grad_chunk(34-53)
TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
LayernormGrad_fp32_fp32(40-93)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (4)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle (24-35)
Deeploy/TilingExtension/TileConstraint.py (1)
extractBaseAddr(56-74)
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/Bindings.py (4)
Deeploy/DeeployTypes.py (1)
NodeBinding (1512-1657)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/PULPOpen/Tiler.py (4)
Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
GeluGradTileConstraint (8-12)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (2)
LayernormTileConstraint (19-80)
LayernormGradTileConstraint (83-159)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (1)
ReduceSumTileConstraint (18-250)
Deeploy/TilingExtension/TilerExtension.py (1)
TilingReadyNodeBindings(1027-1035)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (3)
Deeploy/DeeployTypes.py (3)
NetworkContext(508-1020)NodeTemplate(87-229)alignToContext(119-139)Deeploy/Targets/PULPOpen/Templates/ReshapeTemplate.py (1)
alignToContext(37-53)Deeploy/Targets/PULPOpen/Templates/DMASliceTemplate.py (1)
alignToContext(18-138)
Deeploy/Targets/PULPOpen/Platform.py (3)
Deeploy/Targets/Generic/Layers.py (2)
GELUGradLayer(61-70)LayerNormGradLayer(453-456)Deeploy/Targets/Generic/Parsers.py (3)
GELUGradParser(773-797)LayerNormGradParser(1677-1704)LayerNormParser(1647-1674)Deeploy/DeeployTypes.py (1)
NodeMapper(1660-1816)
🪛 Ruff (0.14.5)
Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py
113-113: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
118-118: Loop control variable dim not used within loop body
Rename unused dim to _dim
(B007)
128-128: Unused class method argument: ctxt
(ARG003)
Deeploy/Targets/Generic/Parsers.py
786-786: Unused method argument: channels_first
(ARG002)
1691-1691: Unused method argument: channels_first
(ARG002)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
290-290: Local variable varB is assigned to but never used
Remove assignment to unused variable varB
(F841)
356-356: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py
71-71: Avoid specifying long messages outside the exception class
(TRY003)
92-93: Avoid specifying long messages outside the exception class
(TRY003)
127-127: Unused static method argument: parseDict
(ARG004)
127-127: Unused static method argument: ctxt
(ARG004)
133-133: Unused static method argument: tilerModel
(ARG004)
172-172: Unused class method argument: absoluteOutputCubes
(ARG003)
Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py
159-159: Unused class method argument: ctxt
(ARG003)
🔇 Additional comments (39)
.github/workflows/ci-platform-siracusa-tiled.yml (3)
137-138: Verify intentionality of L1 configuration simplification for CCT tests. The change reduces L1 memory configurations from multiple test values to a single value, which limits test coverage to one memory allocation strategy. According to the AI summary, this was `[4000, 64000]` and is now `[64000]`. Confirm this reduction is intentional and doesn't mask hardware-specific issues that only surface with smaller memory allocations.
168-171: Verify consistency of new test addition across all job configurations. A new test entry `testTrainCCT/CCT2_FT2` is added to the L3 jobs with `L1: [128000]`. Per the AI summary, this test should also be added to the singlebuffer-L2 configuration (line 138 area). Confirm whether this is intentional or if the test entry is missing from singlebuffer-L2 and other jobs.
208-211: Verify consistency of L1 simplification across doublebuffer-L3. Changes here mirror those in singlebuffer-L3 (lines 169-171), which is good for consistency. However, apply the same verification from the previous comment to ensure `testTrainCCT/CCT2_FT2` is correctly scoped to only L3 jobs and not missing from other configurations.
Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py (1)
679-680: LGTM! Naming collision fix improves correctness. The updated naming scheme correctly prevents collisions when multiple transpose nodes feed into the same (or identically-named) downstream nodes. By prefixing with `t1.name`, each split transpose gets a unique identifier derived from both the source transpose and its consumer.

Deeploy/TilingExtension/TilingCodegen.py (1)
34-35: LGTM! Defensive tuple coercion improves robustness. The normalization ensures `offset` and `dims` are always tuples internally, which is useful when callers pass lists or other sequences. Consider updating the type hints from `Tuple[int, ...]` to `Sequence[int]` to reflect the accepted input types more accurately.

TargetLibraries/Generic/inc/kernel/GELU.h (1)
28-30: LGTM! Function declaration is consistent with implementation. The new gradient chunk function declaration aligns with the implementation in `GELU_fp32.c` and follows the existing naming conventions in this header.

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (1)
200-210: LGTM! Optional bias handling is correctly implemented. The conditional bias detection and buffer handling ensures both biased and unbiased GEMM configurations are properly supported. The pattern is consistently applied in both `addGeometricalConstraint` and `serializeTilingSolution`.

Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
40-51: LGTM! Kernel call with proper guard condition. The check for `elem_count > 0` correctly prevents calling the kernel with zero elements, which could occur for cores that have no work assigned when the sequence count is less than the core count.

Deeploy/Targets/PULPOpen/TileConstraints/GeluTileConstraint.py (1)
8-12: LGTM! The `GeluGradTileConstraint` class correctly extends `BOPTileConstraint` with appropriate gradient-specific tensor names (grad_in, data_in, grad_out), following the established pattern for binary operation tile constraints.

TargetLibraries/Generic/inc/kernel/Layernorm.h (1)
28-31: LGTM! The function declaration for `LayernormGrad_fp32_fp32` is correctly added, with a signature that mirrors the forward pass and properly includes gradient-specific parameters (grad_in, grad_out). The implementation exists in the corresponding .c file.

TargetLibraries/Generic/src/GELU_fp32.c (1)
34-53: LGTM! The GELU gradient computation is mathematically correct. The function properly implements the derivative of the GELU approximation (x * sigmoid(1.702*x)) using the chain rule and applies the upstream gradient correctly.
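For reference, the derivative being implemented follows from the product and chain rules; with σ(·) denoting the logistic sigmoid:

```math
f(x) = x\,\sigma(1.702\,x), \qquad
f'(x) = \sigma(1.702\,x) + 1.702\,x\,\sigma(1.702\,x)\bigl(1 - \sigma(1.702\,x)\bigr), \qquad
\text{grad\_out} = \text{grad\_in}\cdot f'(x)
```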
Deeploy/Targets/PULPOpen/Templates/FloatGELUTemplate.py (1)
12-20: LGTM! The gradient template correctly implements parallel chunking across cores, with proper bounds checking and invocation of the GELU gradient kernel. The chunk size calculation and boundary handling prevent out-of-bounds access.
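As a rough illustration of the chunking pattern described here (a sketch with hypothetical names; the kernel invocation itself is elided to avoid guessing its signature):

```c
// Sketch: split `size` elements across `num_cores` cores using ceil-division
// chunks; clamping keeps the last cores in bounds, and cores past the end
// receive an empty range (start == stop) so they simply skip the kernel call.
static void compute_chunk(int size, int core_id, int num_cores,
                          int *start, int *stop) {
  int chunk = (size + num_cores - 1) / num_cores;  // ceil(size / num_cores)
  *start = core_id * chunk;
  *stop = *start + chunk;
  if (*start > size) *start = size;  // cores with no work get an empty range
  if (*stop > size) *stop = size;    // clamp to the end of the tensor
}
```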
Deeploy/Targets/Generic/Layers.py (2)
61-71: LGTM! The `GELUGradLayer` is properly implemented with a reasonable operation count estimate of 9 operations per element for the GELU gradient computation.
482-486: LGTM! The `computeOps()` method correctly estimates 2 operations per element (one multiply and one add/subtract) for the SGD weight update step.

Deeploy/Targets/PULPOpen/TileConstraints/LayernormTileConstraint.py (1)
83-159: Verify tiling schedule correctness for gradient flow. The `LayernormGradTileConstraint` implementation looks correct overall. However, ensure that the tiling schedule properly handles the gradient data flow:
- Line 151: both `grad_in` and `data_in` are loaded with the same cube dimensions
- Line 153: `grad_out` uses the same cube dimensions as the inputs

This is consistent with the forward-pass pattern and the gradient computation requirements. Consider verifying with a test that exercises tiling for the LayerNorm gradient to ensure the tile boundaries and data dependencies are correctly handled.
Deeploy/Targets/Generic/Parsers.py (2)
773-798: LGTM! The `GELUGradParser` correctly parses GELU gradient nodes with proper input/output mapping. The parser validates 2 inputs (upstream gradient and forward input) and 1 output, which aligns with the gradient computation requirements.
1677-1705: LGTM! The `LayerNormGradParser` correctly extends `iLayerNormParser` and validates 4 inputs (grad_in, data_in, weight, bias) and 1 output for the LayerNorm gradient operation. The size and lastDimLength computations are correct.

Deeploy/Targets/PULPOpen/Tiler.py (4)
18-25: LGTM! The import updates correctly add gradient-specific bindings (GELUGrad, LayerNormGrad, SoftmaxGrad, SGD) needed for the new tiling-ready bindings defined later in the file.
30-35: LGTM! The tile constraint imports are properly added to support gradient operations (GeluGradTileConstraint, LayernormGradTileConstraint, SoftmaxGradTileConstraint, ReduceSumTileConstraint), enabling gradient-aware tiling throughout the pipeline.
124-125: LGTM! The new gradient tiling-ready bindings are correctly configured:
- `PULPLayernormGradTilingReadyBindings` uses `LayernormGradTileConstraint`
- `PULPFPGELUGradTilingReadyBindings` uses `GeluGradTileConstraint`

Both follow the established pattern and properly wire gradient constraints into the tiling framework.
Also applies to: 130-131
143-146: LGTM! The updates to `PULPSoftmaxGradTilingReadyBindings` and `PULPReduceSumTilingReadyBindings` replace the implicit `UntiledTileConstraint` with explicit gradient-specific constraints (`SoftmaxGradTileConstraint` and `ReduceSumTileConstraint`), improving tiling optimization for these operations.

TargetLibraries/PULPOpen/src/Gemm.c (5)
29-31: LGTM: Bias handling and unroll factor initialization. The `has_bias` flag via NULL check and the unroll factor computation are correctly implemented. The modulo-based tail calculation ensures correct handling of non-divisible dimensions.
33-83: LGTM: Non-transposed GEMM path with O-dimension unrolling. The index calculations are correct for row-major layout. The 6-wide O-dimension unrolling with proper tail handling is well-implemented.
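For readers unfamiliar with the pattern, a simplified sketch of the unroll-by-6 plus tail structure referred to above (illustrative names only, bias and transposition handling elided; not the kernel's actual code):

```c
// Sketch: row-major C[m][o] = sum_k A[m][k] * B[k][o], with A of shape MxN and
// B of shape NxO. The O dimension is unrolled by 6; a scalar loop handles the
// O % 6 leftover columns.
void gemm_rowmajor_unroll6(const float *A, const float *B, float *C,
                           int M, int N, int O) {
  const int tail = O % 6;
  for (int m = 0; m < M; m++) {
    int o = 0;
    for (; o < O - tail; o += 6) {  // main path: six output columns at once
      float acc[6] = {0.0f};
      for (int k = 0; k < N; k++) {
        const float a = A[m * N + k];
        for (int u = 0; u < 6; u++)
          acc[u] += a * B[k * O + o + u];
      }
      for (int u = 0; u < 6; u++)
        C[m * O + o + u] = acc[u];
    }
    for (; o < O; o++) {  // tail: remaining O % 6 columns
      float acc = 0.0f;
      for (int k = 0; k < N; k++)
        acc += A[m * N + k] * B[k * O + o];
      C[m * O + o] = acc;
    }
  }
}
```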
84-178: LGTM: Transposed-A GEMM path with dual-axis unrolling. The transpose indexing `pSrcA[k*M + i]` correctly accesses column `i` of the original matrix. Both N and O dimension unrolling with tail handling are correctly implemented.
179-264: LGTM: Transposed-B GEMM path. The `b_row` pointer setup for transposed B access is correct. The dual-axis unrolling efficiently processes 6 output columns while iterating over the reduction dimension.
265-353: LGTM: Both-transposed GEMM path. The combination of transpose indexing for both matrices is correctly implemented, with a consistent unrolling strategy across all code paths.
TargetLibraries/Generic/src/Layernorm_fp32.c (1)
72-90: Verify LayerNorm gradient formula correctness. The implementation computes `sum_dy` as the unscaled sum of gradients (line 73), but the standard LayerNorm backward-pass formula typically requires `sum(dy * scale)` for the mean-subtraction term. The current formula:

`grad_in[j] - (sum_dy / lastDimLength)`

should potentially be:

`grad_in[j] * scale[j] - (sum_dy_scaled / lastDimLength)`

where `sum_dy_scaled = sum(dy * scale)`. The declared but unused `sum_dy_scaled` variable (line 45) suggests this may have been the original intent. Please verify against the PyTorch or reference LayerNorm backward implementation to confirm the gradient computation is correct.
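For comparison, one common formulation of the per-row LayerNorm input gradient (row length N, normalized input x̂, and g_j = dy_j · γ_j the weight-scaled upstream gradient) is:

```math
\frac{\partial L}{\partial x_j}
  = \frac{1}{\sqrt{\sigma^2 + \epsilon}}
    \left( g_j - \frac{1}{N}\sum_{k} g_k - \frac{\hat{x}_j}{N}\sum_{k} g_k \hat{x}_k \right),
\qquad
\hat{x}_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}
```

Note that the upstream gradient enters scaled by the learnable weight γ before the mean-subtraction terms, which is the discrepancy flagged above.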
Deeploy/Targets/PULPOpen/Bindings.py (3)
423-429: LGTM: LayerNorm gradient binding. The binding correctly specifies 4 float32 inputs (grad_in, data_in, weight, bias) and 1 output, matching the `LayerNormGradParser` signature. The use of `ForkTransformer` is consistent with other PULP bindings.
435-437: LGTM: GELU gradient binding. The gradient binding mirrors the forward `PULPFloatGELUBinding` structure with 2 inputs and 1 output, appropriately using `referenceGradTemplate`.
191-196: I'll help you verify the 3-input Add template compatibility concern. Let me start by locating the FloatAddTemplate definition and examining its implementation.
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (2)
19-25: LGTM: Optional bias handling in alignToContext. The method correctly handles the no-bias case by setting sensible defaults. The `C_type` default to `float32_tPtr` ensures the generated function signature remains valid even when C is NULL.

28-79: LGTM: GEMM template with conditional bias handling. The template correctly handles both biased and bias-free GEMM cases with appropriate conditional code generation. The batch pointer arithmetic is correctly guarded by both `C is not None` and `C_batched` conditions.

Deeploy/Targets/PULPOpen/Platform.py (3)
59-59: LGTM: GELUGradMapper definition. The mapper correctly wires `GELUGradParser` to `PULPFPGELUGradTilingReadyBindings`, following the established pattern of other mappers in this file.
89-89: LGTM: LayerNormGradMapper definition. The mapper correctly connects `LayerNormGradParser` to `PULPLayernormGradTilingReadyBindings`.
118-120: LGTM: Gradient operation mapping entries. The `GeluGrad` and `LayerNormalizationGrad` entries are correctly added to `PULPMapping`, following the same pattern as existing gradient operations like `SoftmaxGrad`.

Deeploy/Targets/PULPOpen/TileConstraints/iSoftmaxTileConstraint.py (1)
106-188: LGTM! The `SoftmaxGradTileConstraint` implementation follows the same well-established pattern as `iSoftmaxTileConstraint`. The three-tensor constraint logic in `addGeometricalConstraint` correctly ensures dimensional consistency across gradient inputs and outputs, and the serialization properly maps the input/output load schedules. Regarding the static analysis hint about `ctxt` being unused on line 159: this parameter is part of the `TileConstraint` interface signature and must be maintained for API consistency across all constraint implementations.

Deeploy/Targets/PULPOpen/TileConstraints/ReduceSumTileConstraint.py (3)
126-130: LGTM! The empty implementation with an explanatory comment is acceptable. The parent class interface requires this method signature, and deferring policy decisions to `serializeTilingSolution` is a valid design choice for ReduceSum operations.
132-168: LGTM! The method correctly constructs the symbolic representation with normalized axes and keepdims handling. The unused `tilerModel` parameter is an interface requirement from the parent `TileConstraint` class.
170-175: I'll help you verify this review comment. Let me start by examining how `absoluteOutputCubes` is used across different tile constraint classes and then look at the specific implementation in ReduceSumTileConstraint.
Xeratec
left a comment
Thanks for your work!
Looks already very good. I still have some small comments, but nothing major.
Actionable comments posted: 1
🧹 Nitpick comments (6)
Deeploy/Targets/Generic/Layers.py (2)
61-70: Verify the ops-per-element count for the GELU gradient. The `computeOps` method uses a hardcoded `ops_per_element = 9`. Consider adding a brief comment documenting which operations are counted (similar to how `GELULayer.computeOps` documents each operation individually). This improves maintainability and validates the count.

```diff
 def computeOps(self):
     size = self.mapper.parser.operatorRepresentation['size']
-    ops_per_element = 9
+    # GELU gradient: d_sigmoid (3), d_x*sigmoid (1), x*d_sigmoid (1),
+    # chain rule combinations (4 ops) = 9 ops per element
+    ops_per_element = 9
     gelu_grad_ops = size * ops_per_element
     return gelu_grad_ops
```
453-457: Consider adding `computeOps` for consistency with `GELUGradLayer`. `LayerNormGradLayer` lacks a `computeOps` method, so it will return 0 from the base class. This is inconsistent with `GELUGradLayer` (lines 61-70), which does implement `computeOps`. If operation counting is desired for gradient layers, consider adding the implementation here as well.

Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (2)
20-23: Use templated types instead of hardcoded `float32_t`. Lines 20 and 23 use hardcoded `float32_t` for `learning_rate` and `temp_mul`, which is inconsistent with the type-safe approach used elsewhere in templates. Based on learnings, explicit bitwidth types derived from template parameters should be used for consistency.

```diff
-float32_t learning_rate = ${lr};
+float${weight_type.referencedType.typeWidth}_t learning_rate = ${lr};
 // Temporary buffer for multiplication results
-float32_t temp_mul[6];
+float${grad_type.referencedType.typeWidth}_t temp_mul[6];
```
46-47: Minor: `temp_grad` should also use a templated type. For consistency with the suggested refactor above, the local variable in the remainder loop should also use the templated type.

```diff
 for (; i < ${nodeName}_chunk_stop; i++) {
-    float32_t temp_grad = learning_rate * ref_${grad}[i];
+    float${grad_type.referencedType.typeWidth}_t temp_grad = learning_rate * ref_${grad}[i];
     ref_${weight_updated}[i] = ref_${weight}[i] - temp_grad;
 }
```

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (2)
329-345: Shape construction for A/B/C cubes and rank assumptions. The new ACube/BCube construction:

```python
if transA == 0:
    ACube = HyperRectangle((BatchOffset, BOffset, MOffset, NOffset), (BatchSize, BSize, MSize, NSize))
else:
    ACube = HyperRectangle((BatchOffset, BOffset, NOffset, MOffset), (BatchSize, BSize, NSize, MSize))
# similarly for BCube
```

assumes a 4-D logical layout `(batch, B, M, N|O)` for A/B. For 2-D or 3-D output cubes you synthesize `BatchOffset`, `BOffset`, `BatchSize`, `BSize` defaults, so the rectangles stay 4-D even when the output cube is lower-rank. If the underlying Float GEMM buffers are always 4-D, this is fine; otherwise, this could desync the hyperrectangle dimensionality from the actual buffer shape. It would be safer to either:
- enforce `len(cube.offset) == 4` in the multi-dimensional case (as in the previous comment's diff), and/or
- document that Float GEMM A/B/C/data_out are always treated as 4-D in this tiler path.

The CCube construction for `has_bias`, `CCube = HyperRectangle(tuple(cube.offset), tuple(cube.dims))`, is consistent with a per-output-tile bias tile, assuming C is laid out like `data_out`; this ties back to the earlier question about C's rank and broadcasting semantics.
358-363: Add `strict=` to `zip()` to satisfy Ruff and catch mismatched lists. Static analysis warns about `zip()` without `strict=`. Given `inputACubes`, `inputBCubes`, and `inputAddCubes` are intended to be parallel lists, using `strict=True` will both satisfy Ruff and surface any accidental length mismatches:

```diff
 if has_bias:
-    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes):
+    for a, b, c in zip(inputACubes, inputBCubes, inputAddCubes, strict = True):
         inputLoadSchedule.append({"A": a, "B": b, "C": c})
 else:
-    for a, b in zip(inputACubes, inputBCubes):
+    for a, b in zip(inputACubes, inputBCubes, strict = True):
         inputLoadSchedule.append({"A": a, "B": b})
```

Please confirm the project targets a Python version that supports `zip(strict=...)` (PEP 618) before applying this.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- Deeploy/Targets/Generic/Layers.py (3 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
- Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1 hunks)
- Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-02T13:54:22.700Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 69
File: Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py:36-38
Timestamp: 2025-12-02T13:54:22.700Z
Learning: In Deeploy templates (Python files in Deeploy/Targets/PULPOpen/Templates/), always use explicit bitwidth types (e.g., `float${...type.referencedType.typeWidth}_t*`) instead of hardcoded types (e.g., `float*`) to ensure type consistency with templated kernel calls.
Applied to files:
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
📚 Learning: 2025-09-09T15:43:20.195Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py:120-124
Timestamp: 2025-09-09T15:43:20.195Z
Learning: In GEMMTileConstraint.serializeTilingSolution, transpose flags (transA, transB) must be read from operatorRepresentation and used to adjust NSize calculation and matrix offset/shape calculations, following the pattern in FloatGEMMTileConstraint.
Applied to files:
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
🧬 Code graph analysis (4)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
Deeploy/DeeployTypes.py (1)
NodeTemplate(87-229)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (3)
Deeploy/DeeployTypes.py (1)
lookup (720-752)
Deeploy/TilingExtension/TilerModel.py (1)
getTensorDimVar (131-135)
Deeploy/TilingExtension/TilingCodegen.py (1)
HyperRectangle(24-35)
Deeploy/Targets/PULPOpen/Bindings.py (3)
Deeploy/Targets/Generic/TypeCheckers.py (3)
AddChecker (88-102)
LayerNormChecker (198-209)
GELUChecker (388-402)
Deeploy/AbstractDataTypes.py (1)
PointerClass (536-559)
Deeploy/CommonExtensions/DataTypes.py (1)
float32_t(74-78)
Deeploy/Targets/Generic/Layers.py (2)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
DeeployTest/testUtils/dmaUtils.py (1)
size(72-73)
🪛 Ruff (0.14.7)
Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py
359-359: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
362-362: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
- GitHub Check: snitch-kernels / test-runner-snitch
🔇 Additional comments (9)
Deeploy/Targets/PULPOpen/Templates/FloatLayernormTemplate.py (1)
20-52: LGTM! Gradient template properly implements parallel computation with type-safe pointers. The template correctly uses explicit bitwidth types (`float${...typeWidth}_t`) for pointer declarations (lines 36-38), addressing the previous review feedback. The parallel chunking logic with 32-bit integers for chunk calculations is appropriate for handling larger tensors.

Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
12-14: Overflow fix correctly applied with `int32_t`. The chunk variables now use `int32_t` instead of `int16_t`, properly addressing the previous review concern about overflow for tensors larger than 32K elements.

Deeploy/Targets/PULPOpen/Bindings.py (2)
434-436: LGTM! GELU gradient binding correctly configured. The `PULPFloatGELUGradBinding` uses `GELUChecker` with 2 float32_t inputs (`grad_in`, `data_in`) and 1 output, matching the `referenceGradTemplate` expectations in `FloatGELUTemplate`.
422-428: Verify `LayerNormChecker` supports the 4-input gradient signature. The gradient binding at lines 422-428 uses `LayerNormChecker` with 4 inputs (`grad_in`, `data_in`, `weight`, `bias`) and 1 output, while the forward binding at lines 416-420 uses only 3 inputs. Confirm that `LayerNormChecker` correctly validates this extended signature for gradient operations and that the additional input is properly handled.

Deeploy/Targets/PULPOpen/TileConstraints/GEMMTileConstraint.py (5)
202-211: Bias-aware buffer registration for C looks good. Making `C` optional via `has_bias` and only looking up / registering its dimensions when present avoids key errors and unnecessary constraints in the no-bias case. This is a clean way to support both biased and unbiased GEMM variants.
234-240: Confirm bias tensor C rank and alignment with output dims. The bias constraints assume `bufferC` has at least two trailing dimensions that align with the output's last two dims (`M` and `O`):
- `dimOffsetC = len(bufferC.shape) - 2`
- both `dimOffsetC` and `dimOffsetC + 1` are tied to the output's first/second GEMM dims.

If `C` is ever stored as a 1D `[O]` vector (or `[batch, O]` with broadcasting on `M`), these constraints would be too strong and could make the system unsatisfiable. Please confirm that, for all Float GEMM uses, `C` is always shaped such that its last two dims truly match `(M, O)` of `data_out`, or adjust the constraints to allow typical bias broadcasting.
277-286: Optional C handling in address extraction is correct. Rebuilding `outputCubes` as `HyperRectangle` instances and making `addrNames` conditional on `has_bias` (`['A', 'B', 'data_out']` vs inserting `'C'`) is consistent with `extractBaseAddr`'s expectations and keeps the biased/unbiased flows aligned.
293-299: Good: N size now depends on transA (matches learned pattern). Using `transA` to select which axis of `A` is the logical `N`:

```python
if transA == 0:
    NSize = ctxt.lookup(varA).shape[-1]
else:
    NSize = ctxt.lookup(varA).shape[-2]
```

fixes mis-sizing when `A` is stored transposed and aligns Float GEMM with the recommended pattern for handling transpose flags in tiling serialization. Based on learnings, this is exactly what we want.
310-327: `replacements["batch"]` appears to use the wrong variable. Inside the loop over `outputCubes`, `BSize` is initialized to `1` and only overwritten when `len(cube.offset) == 4` (set to `cube.dims[1]`). However, `BatchSize` captures the actual batch dimension across all 2D, 3D, and 4D cases (from `cube.dims[0]`, or `cube.dims[0]` in the 3D case). Using `replacements["batch"].append(BSize)` will record incorrect batch sizes for 2D and 3D cubes (always `1`) and misalign the semantics in 4D cubes. Change `replacements["batch"].append(BSize)` to `replacements["batch"].append(BatchSize)` to correctly track the batch extent. Verify this change against how `replacements["batch"]` is consumed in the Float GEMM template to ensure the semantics remain consistent with the tiling strategy.
Actionable comments posted: 0
♻️ Duplicate comments (1)
Deeploy/Targets/PULPOpen/Bindings.py (1)
183-194: Duplicate float32 Add binding in `PULPAddBindings`. Lines 191-193 introduce a second `NodeBinding` with the exact same `AddChecker` signature and template as the one at Lines 189-190. This duplication is unnecessary and can be confusing for binding resolution. You can safely drop the extra list concatenation:

```diff
 PULPAddBindings = [
     @@
 ] + [
     NodeBinding(AddChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
                 FloatAddTemplate.referenceTemplate, ForkTransformer)
-] + [
-    NodeBinding(AddChecker([PointerClass(float32_t), PointerClass(float32_t)], [PointerClass(float32_t)]),
-                FloatAddTemplate.referenceTemplate, ForkTransformer)
 ]
```
🧹 Nitpick comments (2)
Deeploy/Targets/Generic/Layers.py (2)
61-71: GELUGradLayer op count logic looks consistent; consider documenting the 9 ops/element.
`computeOps` reuses the established pattern (`size` from `operatorRepresentation`) and returns an integer op count; this will integrate cleanly with `ONNXLayer.bind`. To aid future maintainers, consider adding a short comment breaking down where the 9 per-element ops come from (e.g., multiplies/adds/exps) so the factor is not a magic constant.
453-457: LayerNormGradLayer currently reports 0 ops; decide if this is intentional. Without a `computeOps` override, `LayerNormGradLayer` will inherit the base implementation and contribute `0` to `nodeOps`, unlike `LayerNormLayer` which has an explicit breakdown. If op accounting for backward passes matters to you (e.g., training FLOP reports), consider adding a `computeOps` similar in spirit to `LayerNormLayer`; otherwise, this minimalist definition is fine but should be a conscious choice.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`formatting_report.log` is excluded by `!**/*.log`
📒 Files selected for processing (2)
- Deeploy/Targets/Generic/Layers.py (2 hunks)
- Deeploy/Targets/PULPOpen/Bindings.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
Deeploy/Targets/Generic/Layers.py (1)
Deeploy/DeeployTypes.py (2)
ONNXLayer (1819-2147)
computeOps (1835-1840)
🔇 Additional comments (1)
Deeploy/Targets/PULPOpen/Bindings.py (1)
421-427: Verify argument ordering in LayerNorm and GELU gradient templates against binding declarations.
`PULPLayernormGradBinding` and `PULPFloatGELUGradBinding` reference `referenceGradTemplate` from their respective float templates. The type signatures (all `float32_t` pointers) appear consistent, but the argument order must be confirmed: ensure the pointer sequence in the checker (grad-out, activations, parameters) matches what `FloatLayernormTemplate.referenceGradTemplate` and `FloatGELUTemplate.referenceGradTemplate` actually expect in the generated kernel calls.
All comments are resolved.
Xeratec
left a comment
Looks good! Only one small change requested: Seems like a log file sneaked its way into the PR.
This PR introduces gradient operator support, improved GEMM performance, and updates to the CCT training workflow. It also includes fixes to tile constraints and naming consistency in the transpose pass.
Added
Changed
Fixed
PR Merge Checklist
- … `devel` commit and pointing to `devel`.
- The `CHANGELOG.md` file has been updated.