-
Notifications
You must be signed in to change notification settings - Fork 32
Improve Profiling #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Profiling #138
Conversation
📝 WalkthroughSummary by CodeRabbitRelease Notes
✏️ Tip: You can customize this high-level summary in your review settings. WalkthroughFor profiling, this change always enables kernel-level tiling in code generation and adds per-phase measurement templates plus total/time-overhead printing; CHANGLEOG updated to reflect profiling improvements. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)
120-126: HardcodingkernelLevelTiling=True: please confirm this flag isn’t used for non-profiling semantics elsewhere.
If the intent is “always emit kernel/total profiling breakdown for every memory level”, this is fine; but the field name now reads misleadingly—consider a follow-up rename/comment to reflect “enable kernel profiling output”.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(3 hunks)Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py(1 hunks)Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py(1 hunks)Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.pyDeeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.pyDeeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
🧬 Code graph analysis (3)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
TilingMetaInfo(13-20)Deeploy/DeeployTypes.py (1)
nodeName(1542-1545)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
Deeploy/DeeployTypes.py (4)
NodeTemplate(87-229)executionBlock(1536-1539)addRight(1421-1434)nodeName(1542-1545)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
TilingMetaInfo(13-20)Deeploy/DeeployTypes.py (1)
nodeName(1542-1545)
🔇 Additional comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
118-150: Watch generated-code footprint and C dialect assumptions (profiling arrays + local decls).
This adds multipleuint32_t [totalNumTiles]arrays (and loop-local decls); please sanity-check stack/placement for large tile counts and confirm the target compiler/libc supports C99-style declarations + floatprintf. If footprint is tight, consider storing durations directly to eliminate the_end_measurementsarrays.Also applies to: 152-234
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
279-285: Same note as SB: confirm unconditionalkernelLevelTiling=Truewon’t affect non-profiling behavior.
Good that SB/DB are now consistent, but please verify no downstream logic depends on the old(self.localMemory == "L1")meaning.
diaconuccalin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very useful feature I wouldn't even have thought about :) Minor typo, otherwise LGTM.
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
Outdated
Show resolved
Hide resolved
Victor-Jung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGMT, cool stuff!
fcb433d to
d73f810
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
CHANGELOG.md (1)
7-7: Changelog entries accurately reflect the profiling improvements.The entries correctly describe:
- Non-kernel overhead calculation and total time display (Added)
- Kernel profiling for all memory levels (Changed)
These align with the code changes in
TilingPrototypes.pyand the hardcodedkernelLevelTiling = Truein the code generation files.A previous review noted punctuation/wording consistency—consider adding trailing periods and aligning metric names (Pre/Kernel/Post/Total + overhead) for consistency.
Also applies to: 85-85, 128-128
🧹 Nitpick comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)
97-103: Minor formatting issue: extra whitespace in printf format.Line 102 has extra trailing spaces before
, ${measurementKernel}which will not affect functionality but is inconsistent with code style.🔎 Suggested fix
- printf("%s%u] Total :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage , ${measurementKernel}, dma); + printf("%s%u] Total :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage, ${measurementKernel}, dma);
196-216: Inconsistent spacing in flavor strings affects output alignment.The flavor strings have inconsistent formatting:
"Pre-Kernel :"(space before colon)"Kernel :"(space before colon)"Post-Kernel:"(no space before colon)This causes misaligned output in profiling logs.
🔎 Suggested fix for consistent alignment
- "flavorStr": "Post-Kernel:", + "flavorStr": "Post-Kernel :",
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.mdDeeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.pyDeeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.pyDeeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
🚧 Files skipped from review as they are similar to previous changes (1)
- Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.
Applied to files:
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
🧬 Code graph analysis (1)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
Deeploy/DeeployTypes.py (4)
NodeTemplate(87-229)executionBlock(1536-1539)addRight(1421-1434)nodeName(1542-1545)
🔇 Additional comments (3)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)
167-190: Measurement declarations correctly compute elapsed cycles.The injected measurement variables properly calculate timing differences (
end - start) for each phase, with kernel measurement correctly gated bykernelLevelTiling.
220-230: Total time and overhead calculation logic is correct.The implementation properly:
- Computes total time as sum of all phases
- Calculates overhead as the non-kernel (DMA) portion
- Gates the output on
kernelLevelTilingPrevious review feedback (typo fix and variable rename) has been addressed.
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)
279-285: Kernel-level tiling now unconditionally enabled for profiling.This change enables kernel-level measurement instrumentation for all memory levels, supporting the new overhead calculation and total time reporting in
TilingPrototypes.py. Previously, this was gated byself.localMemory == "L1". The samekernelLevelTiling = Truechange is consistently applied inSingleBufferingTilingCodeGeneration.py, ensuring uniform profiling behavior across both buffering strategies.
This PR improves the log output for tiled executions to split it into kernel execution and pre- and post-kernel time. This is useful to directly assess the control overhead of an execution.
As you can see in the new "Siracusa (Tiled, L3) FloatGEMM" example below, we can conclude that the L2-L1 overhead is minimal while the L3-L2 overhead is rather large. This makes sense as the DMA is implemented in a blocking fashion.
Added
Changed
Examples
New Implementation
Siracusa (Tiled, L2) FloatGEMM
Command
Output
Siracusa (Tiled, L3) FloatGEMM
Command
Output
Previous Implementation
Siracusa (Tiled, L2) FloatGEMM
Command
python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L2`Output
Siracusa (Tiled, L3) FloatGEMM
Command
$> python testRunner_tiled_siracusa.py -t Tests/testFloatGEMM --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L3Output
PR Merge Checklist
develcommit and pointing todevel.CHANGELOG.mdfile has been updated.