@Xeratec Xeratec commented Dec 12, 2025

This PR improves the log output for tiled executions by splitting the reported time into pre-kernel, kernel, and post-kernel phases. This makes it possible to directly assess the control overhead of an execution.

As the new "Siracusa (Tiled, L3) FloatGEMM" example below shows, the L2-L1 overhead is minimal (3.1% to 4.4% of each tile's cycles), while the L3-L2 overhead is rather large (26.4% to 65.2%). This makes sense, as the DMA is implemented in a blocking fashion.
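The percentages in each `Total` line follow directly from the three per-tile measurements. A minimal Python sketch of that arithmetic, using Tile 0 of the `_L3` level from the L3 example in this description; the function name `breakdown` is illustrative and not part of Deeploy:

```python
def breakdown(pre_kernel: int, kernel: int, post_kernel: int):
    """Return (total, overhead, kernel %, overhead %) for one tile."""
    overhead = pre_kernel + post_kernel  # everything that is not kernel time
    total = kernel + overhead
    kernel_pct = 100.0 * kernel / total
    overhead_pct = 100.0 * overhead / total
    return total, overhead, kernel_pct, overhead_pct


# [_L3][DB] Tile 0: Pre-Kernel 14095, Kernel 46907, Post-Kernel 7793
total, overhead, kernel_pct, overhead_pct = breakdown(14095, 46907, 7793)
print(f"{total} cycles ({kernel_pct:.1f}% Kernel + {overhead_pct:.1f}% Overhead, "
      f"46907 + {overhead})")
```

With these inputs the sketch reproduces the 68795-cycle total and the 68.2% / 31.8% split reported for that tile.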

Added

  • Calculate non-kernel overhead and show total time spent during profiling

Changed

  • Profile all memory levels

Examples

New Implementation

Siracusa (Tiled, L2) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L2

Output

===== Profiling _L2 =====
[_L2][DB][68608 ops][Tile 0] Pre-Kernel :   829 cycles
[_L2][DB][68608 ops][Tile 0] Kernel     : 22768 cycles
[_L2][DB][68608 ops][Tile 0] Post-Kernel:   171 cycles
[_L2][DB][68608 ops][Tile 0] Total      : 23768 cycles (95.8% Kernel + 4.2% Overhad, 22768 + 1000)
[_L2][DB][68608 ops][Tile 1] Pre-Kernel :   200 cycles
[_L2][DB][68608 ops][Tile 1] Kernel     : 22294 cycles
[_L2][DB][68608 ops][Tile 1] Post-Kernel:    75 cycles
[_L2][DB][68608 ops][Tile 1] Total      : 22569 cycles (98.8% Kernel + 1.2% Overhad, 22294 + 275)
[_L2][DB][68608 ops][Tile 2] Pre-Kernel :   143 cycles
[_L2][DB][68608 ops][Tile 2] Kernel     : 22277 cycles
[_L2][DB][68608 ops][Tile 2] Post-Kernel:    56 cycles
[_L2][DB][68608 ops][Tile 2] Total      : 22476 cycles (99.1% Kernel + 0.9% Overhad, 22277 + 199)
[_L2][DB][68608 ops][Tile 3] Pre-Kernel :   149 cycles
[_L2][DB][68608 ops][Tile 3] Kernel     : 22274 cycles
[_L2][DB][68608 ops][Tile 3] Post-Kernel:    71 cycles
[_L2][DB][68608 ops][Tile 3] Total      : 22494 cycles (99.0% Kernel + 1.0% Overhad, 22274 + 220)
[_L2][DB][68608 ops][Tile 4] Pre-Kernel :   138 cycles
[_L2][DB][68608 ops][Tile 4] Kernel     : 22294 cycles
[_L2][DB][68608 ops][Tile 4] Post-Kernel:    58 cycles
[_L2][DB][68608 ops][Tile 4] Total      : 22490 cycles (99.1% Kernel + 0.9% Overhad, 22294 + 196)
[_L2][DB][68608 ops][Tile 5] Pre-Kernel :    57 cycles
[_L2][DB][68608 ops][Tile 5] Kernel     : 22250 cycles
[_L2][DB][68608 ops][Tile 5] Post-Kernel:   216 cycles
[_L2][DB][68608 ops][Tile 5] Total      : 22523 cycles (98.8% Kernel + 1.2% Overhad, 22250 + 273)
Siracusa (Tiled, L3) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L3

Output

===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 0] Pre-Kernel :  1134 cycles
[_L2][SB][68608 ops][Tile 0] Kernel     : 44983 cycles
[_L2][SB][68608 ops][Tile 0] Post-Kernel:   375 cycles
[_L2][SB][68608 ops][Tile 0] Total      : 46492 cycles (96.8% Kernel + 3.2% Overhad, 44983 + 1509)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 1] Pre-Kernel :   428 cycles
[_L2][SB][68608 ops][Tile 1] Kernel     : 11776 cycles
[_L2][SB][68608 ops][Tile 1] Post-Kernel:   116 cycles
[_L2][SB][68608 ops][Tile 1] Total      : 12320 cycles (95.6% Kernel + 4.4% Overhad, 11776 + 544)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 2] Pre-Kernel :  1119 cycles
[_L2][SB][68608 ops][Tile 2] Kernel     : 44953 cycles
[_L2][SB][68608 ops][Tile 2] Post-Kernel:   341 cycles
[_L2][SB][68608 ops][Tile 2] Total      : 46413 cycles (96.9% Kernel + 3.1% Overhad, 44953 + 1460)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 3] Pre-Kernel :   428 cycles
[_L2][SB][68608 ops][Tile 3] Kernel     : 11705 cycles
[_L2][SB][68608 ops][Tile 3] Post-Kernel:   116 cycles
[_L2][SB][68608 ops][Tile 3] Total      : 12249 cycles (95.6% Kernel + 4.4% Overhad, 11705 + 544)
===== Profiling _L3 =====
[_L3][DB][68608 ops][Tile 0] Pre-Kernel : 14095 cycles
[_L3][DB][68608 ops][Tile 0] Kernel     : 46907 cycles
[_L3][DB][68608 ops][Tile 0] Post-Kernel:  7793 cycles
[_L3][DB][68608 ops][Tile 0] Total      : 68795 cycles (68.2% Kernel + 31.8% Overhad, 46907 + 21888)
[_L3][DB][68608 ops][Tile 1] Pre-Kernel : 19797 cycles
[_L3][DB][68608 ops][Tile 1] Kernel     : 12751 cycles
[_L3][DB][68608 ops][Tile 1] Post-Kernel:  4139 cycles
[_L3][DB][68608 ops][Tile 1] Total      : 36687 cycles (34.8% Kernel + 65.2% Overhad, 12751 + 23936)
[_L3][DB][68608 ops][Tile 2] Pre-Kernel : 13543 cycles
[_L3][DB][68608 ops][Tile 2] Kernel     : 46790 cycles
[_L3][DB][68608 ops][Tile 2] Post-Kernel:  7797 cycles
[_L3][DB][68608 ops][Tile 2] Total      : 68130 cycles (68.7% Kernel + 31.3% Overhad, 46790 + 21340)
[_L3][DB][68608 ops][Tile 3] Pre-Kernel :   309 cycles
[_L3][DB][68608 ops][Tile 3] Kernel     : 12660 cycles
[_L3][DB][68608 ops][Tile 3] Post-Kernel:  4240 cycles
[_L3][DB][68608 ops][Tile 3] Total      : 17209 cycles (73.6% Kernel + 26.4% Overhad, 12660 + 4549)
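
Because every phase is now printed on its own line, the log can be post-processed to get the overhead share per memory level. The following is only a sketch: the regular expression is inferred from the example output above rather than from a documented format, and `summarize` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches lines such as "[_L3][DB][68608 ops][Tile 0] Kernel     : 46907 cycles"
LINE_RE = re.compile(
    r"\[(?P<level>_L\d)\]\[(?:SB|DB)\]\[\d+ ops\]\[Tile \d+\] "
    r"(?P<phase>Pre-Kernel|Kernel|Post-Kernel)\s*:\s*(?P<cycles>\d+) cycles"
)


def summarize(log: str):
    """Sum kernel and non-kernel cycles per memory level."""
    sums = defaultdict(lambda: {"kernel": 0, "overhead": 0})
    for line in log.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # section headers and "Total" lines are skipped
        key = "kernel" if m["phase"] == "Kernel" else "overhead"
        sums[m["level"]][key] += int(m["cycles"])
    return dict(sums)


sample = """\
===== Profiling _L3 =====
[_L3][DB][68608 ops][Tile 0] Pre-Kernel : 14095 cycles
[_L3][DB][68608 ops][Tile 0] Kernel     : 46907 cycles
[_L3][DB][68608 ops][Tile 0] Post-Kernel:  7793 cycles
[_L3][DB][68608 ops][Tile 0] Total      : 68795 cycles (68.2% Kernel + 31.8% Overhad, 46907 + 21888)
"""

for level, s in summarize(sample).items():
    share = 100.0 * s["overhead"] / (s["kernel"] + s["overhead"])
    print(f"{level}: {share:.1f}% overhead")
```

Feeding the complete log instead of the single-tile sample would aggregate all tiles of a level into one figure.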

Previous Implementation

Siracusa (Tiled, L2) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L2

Output

[_L2][DB][68608 ops][Tile 0] Input DMA took 842 cycles
[_L2][DB][68608 ops][Tile 0] Kernel took 22861 cycles
[_L2][DB][68608 ops][Tile 0] Output DMA took 169 cycles
[_L2][DB][68608 ops][Tile 1] Input DMA took 218 cycles
[_L2][DB][68608 ops][Tile 1] Kernel took 22261 cycles
[_L2][DB][68608 ops][Tile 1] Output DMA took 103 cycles
[_L2][DB][68608 ops][Tile 2] Input DMA took 175 cycles
[_L2][DB][68608 ops][Tile 2] Kernel took 22250 cycles
[_L2][DB][68608 ops][Tile 2] Output DMA took 61 cycles
[_L2][DB][68608 ops][Tile 3] Input DMA took 140 cycles
[_L2][DB][68608 ops][Tile 3] Kernel took 22277 cycles
[_L2][DB][68608 ops][Tile 3] Output DMA took 61 cycles
[_L2][DB][68608 ops][Tile 4] Input DMA took 126 cycles
[_L2][DB][68608 ops][Tile 4] Kernel took 22258 cycles
[_L2][DB][68608 ops][Tile 4] Output DMA took 63 cycles
[_L2][DB][68608 ops][Tile 5] Input DMA took 57 cycles
[_L2][DB][68608 ops][Tile 5] Kernel took 22247 cycles
[_L2][DB][68608 ops][Tile 5] Output DMA took 225 cycles
Siracusa (Tiled, L3) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L3

Output

[_L2][SB][68608 ops][Tile 0] Input DMA took 1165 cycles
[_L2][SB][68608 ops][Tile 0] Kernel took 45001 cycles
[_L2][SB][68608 ops][Tile 0] Output DMA took 393 cycles
[_L2][SB][68608 ops][Tile 1] Input DMA took 409 cycles
[_L2][SB][68608 ops][Tile 1] Kernel took 11642 cycles
[_L2][SB][68608 ops][Tile 1] Output DMA took 184 cycles
[_L2][SB][68608 ops][Tile 2] Input DMA took 1176 cycles
[_L2][SB][68608 ops][Tile 2] Kernel took 44855 cycles
[_L2][SB][68608 ops][Tile 2] Output DMA took 393 cycles
[_L2][SB][68608 ops][Tile 3] Input DMA took 409 cycles
[_L2][SB][68608 ops][Tile 3] Kernel took 11649 cycles
[_L2][SB][68608 ops][Tile 3] Output DMA took 150 cycles
[_L3][DB][68608 ops][Tile 0] Input DMA took 14121 cycles
[_L3][DB][68608 ops][Tile 0] Output DMA took 7736 cycles
[_L3][DB][68608 ops][Tile 1] Input DMA took 19764 cycles
[_L3][DB][68608 ops][Tile 1] Output DMA took 3921 cycles
[_L3][DB][68608 ops][Tile 2] Input DMA took 13280 cycles
[_L3][DB][68608 ops][Tile 2] Output DMA took 7650 cycles
[_L3][DB][68608 ops][Tile 3] Input DMA took 274 cycles
[_L3][DB][68608 ops][Tile 3] Output DMA took 4011 cycles

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

@Xeratec Xeratec added this to the Release 0.2.1 milestone Dec 12, 2025
@Xeratec Xeratec self-assigned this Dec 12, 2025
@Xeratec Xeratec requested a review from Victor-Jung as a code owner December 12, 2025 15:40
@Xeratec Xeratec added the Feature Addition of new features label Dec 12, 2025
@Xeratec Xeratec added this to Deeploy Dec 12, 2025
@Xeratec Xeratec moved this to Need Reviewer in Deeploy Dec 12, 2025
@Xeratec Xeratec moved this from Need Reviewer to Ready for Merge in Deeploy Dec 12, 2025
@Xeratec Xeratec moved this from Ready for Merge to In review in Deeploy Dec 12, 2025

coderabbitai bot commented Dec 12, 2025

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Changed
    • Profiling now calculates and reports non-kernel overhead alongside total execution time.
    • Kernel profiling information extended across all memory levels.
    • Enhanced profiling output includes Total, Kernel, and DMA overhead percentages with component measurements.

Walkthrough

For profiling, this change always enables kernel-level tiling in code generation and adds per-phase measurement templates plus total/time-overhead printing; CHANGELOG updated to reflect profiling improvements.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Changelog: CHANGELOG.md | Added Unreleased entry "Improve Profiling (#138)" and listed changes: calculate non-kernel overhead, show total time, print kernel profiling for all memory levels |
| Buffering codegen (kernel-level flag): Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py, Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py | TilingMetaInfo construction changed: kernelLevelTiling hardcoded to True (removed self.localMemory == "L1" check), affecting profiling gating in both buffering strategies |
| Profiling templates & injection: Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py | Added _measurementDeclaration and _printCycleContribution templates; replaced inlined cycle-diff prints with measurement variables; injectPrintCycleDiff now injects per-phase measurements (ingress, optional kernel, egress) and, when kernel-level tiling is enabled, prints total/time and kernel vs DMA overhead breakdown |
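
To make the per-phase injection more concrete, here is a rough sketch of how a measurement declaration and a print statement could be rendered from templates. It is not the Deeploy implementation: the two templates, the variable names, and the use of Python's `string.Template` (chosen because it understands the same `${...}` placeholder syntax that appears in the diff excerpts in this thread) are all assumptions made for illustration.

```python
from string import Template

# Hypothetical stand-ins for the measurement and print templates; the real
# Deeploy templates are not reproduced here.
measurement_decl = Template(
    "uint32_t ${name} = ${end}[${idx}] - ${start}[${idx}];"
)
print_contribution = Template(
    'printf("%s%u] ${flavor}%6u cycles\\n", ${prefix}, ${idx}, ${name});'
)

# One entry per phase: (label printed in the log, illustrative variable stem)
phases = [
    ("Pre-Kernel :", "ingress"),
    ("Kernel     :", "kernel"),
    ("Post-Kernel:", "egress"),
]

lines = []
for flavor, phase in phases:
    name = f"{phase}_cycles"
    lines.append(measurement_decl.substitute(
        name=name, start=f"{phase}_start", end=f"{phase}_end", idx="i"))
    lines.append(print_contribution.substitute(
        flavor=flavor, prefix="prefix", idx="i", name=name))

print("\n".join(lines))
```

The sketch emits one declaration and one `printf` per phase, which mirrors the "measurement variables instead of inlined cycle-diff prints" structure described in the summary.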

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Victor-Jung
  • lukamac

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Improve Profiling' directly matches the main objective of the PR, which is to improve profiling output for tiled executions by splitting timing data. |
| Description check | ✅ Passed | The description clearly explains the PR's purpose: improving log output by splitting execution time into Pre-Kernel, Kernel, and Post-Kernel measurements, with specific examples and comparisons to the previous implementation. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)

120-126: Hardcoding kernelLevelTiling=True: please confirm this flag isn’t used for non-profiling semantics elsewhere.
If the intent is “always emit kernel/total profiling breakdown for every memory level”, this is fine; but the field name now reads misleadingly—consider a follow-up rename/comment to reflect “enable kernel profiling output”.
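
The behavioral change under discussion can be illustrated with a reduced sketch. `TilingMetaInfoSketch` is a hypothetical two-field stand-in (the real `TilingMetaInfo` has more fields that are not shown in this thread), and the helper functions are invented for the comparison:

```python
from dataclasses import dataclass


@dataclass
class TilingMetaInfoSketch:
    """Hypothetical, reduced stand-in for TilingMetaInfo (two fields only)."""
    nodeName: str
    kernelLevelTiling: bool


def meta_before(node_name: str, local_memory: str) -> TilingMetaInfoSketch:
    # Old behavior per the review: the flag was set only at the L1 level.
    return TilingMetaInfoSketch(node_name, local_memory == "L1")


def meta_after(node_name: str, local_memory: str) -> TilingMetaInfoSketch:
    # New behavior: the flag is always True, regardless of the memory level.
    return TilingMetaInfoSketch(node_name, True)


for level in ("L1", "L2"):
    print(level,
          meta_before("gemm", level).kernelLevelTiling,
          meta_after("gemm", level).kernelLevelTiling)
```

Any consumer that read the flag as "this is the innermost tiling level" would see a different value at L2 after the change, which is the case the comment asks to rule out.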

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 363b7d5 and fcb433d.

📒 Files selected for processing (4)
  • CHANGELOG.md (3 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1 hunks)
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.

Applied to files:

  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.

Applied to files:

  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
📚 Learning: 2025-09-09T15:58:06.454Z
Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.

Applied to files:

  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
🧬 Code graph analysis (3)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
  • TilingMetaInfo (13-20)
Deeploy/DeeployTypes.py (1)
  • nodeName (1542-1545)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
Deeploy/DeeployTypes.py (4)
  • NodeTemplate (87-229)
  • executionBlock (1536-1539)
  • addRight (1421-1434)
  • nodeName (1542-1545)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
  • TilingMetaInfo (13-20)
Deeploy/DeeployTypes.py (1)
  • nodeName (1542-1545)
🔇 Additional comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

118-150: Watch generated-code footprint and C dialect assumptions (profiling arrays + local decls).
This adds multiple uint32_t [totalNumTiles] arrays (and loop-local decls); please sanity-check stack/placement for large tile counts and confirm the target compiler/libc supports C99-style declarations + float printf. If footprint is tight, consider storing durations directly to eliminate the _end_measurements arrays.
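
As a back-of-the-envelope check of the footprint concern, the sketch below estimates the memory used by the profiling arrays. The array counts are assumptions (one start and one end array per phase versus a single duration array per phase), not numbers taken from the generated code:

```python
UINT32_BYTES = 4


def profiling_bytes(total_num_tiles: int, arrays: int) -> int:
    """Bytes used by `arrays` uint32_t arrays of length total_num_tiles."""
    return arrays * total_num_tiles * UINT32_BYTES


# Assumption: start + end array for each of three phases (6 arrays) versus
# one duration array per phase (3 arrays).
for tiles in (6, 64, 1024):
    start_end = profiling_bytes(tiles, arrays=6)
    durations = profiling_bytes(tiles, arrays=3)
    print(f"{tiles:5d} tiles: {start_end:6d} B (start+end) "
          f"vs {durations:6d} B (durations)")
```

Under these assumptions, storing durations directly halves the footprint, which only matters once the tile count grows large.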

Also applies to: 152-234

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)

279-285: Same note as SB: confirm unconditional kernelLevelTiling=True won’t affect non-profiling behavior.
Good that SB/DB are now consistent, but please verify no downstream logic depends on the old (self.localMemory == "L1") meaning.

@diaconuccalin diaconuccalin left a comment

Very useful feature I wouldn't even have thought about :) Minor typo, otherwise LGTM.

@Victor-Jung Victor-Jung left a comment

LGTM, cool stuff!

@Xeratec Xeratec force-pushed the pr/detailed_profile branch from fcb433d to d73f810 Compare December 24, 2025 00:41

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
CHANGELOG.md (1)

7-7: Changelog entries accurately reflect the profiling improvements.

The entries correctly describe:

  • Non-kernel overhead calculation and total time display (Added)
  • Kernel profiling for all memory levels (Changed)

These align with the code changes in TilingPrototypes.py and the hardcoded kernelLevelTiling = True in the code generation files.

A previous review noted punctuation/wording consistency—consider adding trailing periods and aligning metric names (Pre/Kernel/Post/Total + overhead) for consistency.

Also applies to: 85-85, 128-128

🧹 Nitpick comments (2)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)

97-103: Minor formatting issue: extra whitespace in the printf argument list.

Line 102 has extra spaces between overhead_percentage and , ${measurementKernel}. This does not affect functionality but is inconsistent with the code style.

🔎 Suggested fix
-    printf("%s%u] Total      :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage    , ${measurementKernel}, dma);
+    printf("%s%u] Total      :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage, ${measurementKernel}, dma);
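
The format string itself can be previewed outside the target, since Python's `%` operator accepts the conversions used here (`%s`, `%u`, `%6u`, `%2.1f`, `%%`; Python treats `%u` as an alias for `%d`). A small sketch, with the template placeholders replaced by concrete values taken from Tile 0 of the L2 example in the PR description:

```python
# Same format string as in the suggested fix, minus the trailing newline.
FMT = "%s%u] Total      :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)"

prefix = "[_L2][DB][68608 ops][Tile "
kernel, dma = 22768, 1000
total = kernel + dma

line = FMT % (prefix, 0, total,
              100.0 * kernel / total, 100.0 * dma / total,
              kernel, dma)
print(line)
```

Whitespace between the arguments has no influence on the rendered line, which is consistent with the note that the fix is purely stylistic.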

196-216: Inconsistent spacing in flavor strings affects output alignment.

The flavor strings have inconsistent formatting:

  • "Pre-Kernel :" (space before colon)
  • "Kernel :" (space before colon)
  • "Post-Kernel:" (no space before colon)

This causes misaligned output in profiling logs.

🔎 Suggested fix for consistent alignment
-                "flavorStr": "Post-Kernel:",
+                "flavorStr": "Post-Kernel :",
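
One way to keep the colons aligned regardless of label length is to pad every label to a common width before appending the colon. This is a sketch of that idea rather than a change proposed for the PR; the label list is taken from the example output in the PR description:

```python
labels = ["Pre-Kernel", "Kernel", "Post-Kernel", "Total"]
width = max(len(label) for label in labels)  # length of the longest label

# Left-justify each label to the common width, then append the colon.
flavor_strs = [f"{label:<{width}}:" for label in labels]
for s in flavor_strs:
    print(s)
```

With this padding, all four strings have the same length, so the cycle counts that follow start in the same column.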
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcb433d and d73f810.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
  • Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
🧬 Code graph analysis (1)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)
Deeploy/DeeployTypes.py (4)
  • NodeTemplate (87-229)
  • executionBlock (1536-1539)
  • addRight (1421-1434)
  • nodeName (1542-1545)
🔇 Additional comments (3)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)

167-190: Measurement declarations correctly compute elapsed cycles.

The injected measurement variables properly calculate timing differences (end - start) for each phase, with kernel measurement correctly gated by kernelLevelTiling.
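
A short sketch of why `end - start` stays correct even if the cycle counter wraps between the two samples. It assumes the counters are `uint32_t` values (the review above mentions `uint32_t` arrays) and that a phase lasts fewer than 2**32 cycles; the function name is illustrative:

```python
MASK32 = 0xFFFFFFFF


def elapsed_cycles(start: int, end: int) -> int:
    """Emulate C's uint32_t `end - start`, which wraps modulo 2**32."""
    return (end - start) & MASK32


print(elapsed_cycles(1_000, 23_768))            # plain case, no wrap
print(elapsed_cycles(0xFFFFFF00, 0x00000100))   # counter wrapped in between
```

In C the masking is implicit in unsigned arithmetic, so the generated subtraction needs no special handling for a single wrap.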


220-230: Total time and overhead calculation logic is correct.

The implementation properly:

  • Computes total time as sum of all phases
  • Calculates overhead as the non-kernel (DMA) portion
  • Gates the output on kernelLevelTiling

Previous review feedback (typo fix and variable rename) has been addressed.

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)

279-285: Kernel-level tiling now unconditionally enabled for profiling.

This change enables kernel-level measurement instrumentation for all memory levels, supporting the new overhead calculation and total time reporting in TilingPrototypes.py. Previously, this was gated by self.localMemory == "L1". The same kernelLevelTiling = True change is consistently applied in SingleBufferingTilingCodeGeneration.py, ensuring uniform profiling behavior across both buffering strategies.

@Xeratec Xeratec merged commit f792722 into pulp-platform:devel Dec 24, 2025
142 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in Deeploy Dec 24, 2025
@Xeratec Xeratec deleted the pr/detailed_profile branch December 24, 2025 11:50