Improve Profiling #138

Xeratec · 2025-12-12T15:40:39Z

This PR improves the profiling log output for tiled executions by splitting each tile's time into pre-kernel, kernel, and post-kernel phases. This makes it possible to directly assess the control overhead of an execution.

As the new "Siracusa (Tiled, L3) FloatGEMM" example below shows, the L2-L1 overhead is minimal, while the L3-L2 overhead is rather large. This makes sense, as the DMA is implemented in a blocking fashion.

Added

Calculate non-kernel overhead and show total time spent during profiling

Changed

Profile all memory levels

Examples

New Implementation

Siracusa (Tiled, L2) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L2

Output

===== Profiling _L2 =====
[_L2][DB][68608 ops][Tile 0] Pre-Kernel :   829 cycles
[_L2][DB][68608 ops][Tile 0] Kernel     : 22768 cycles
[_L2][DB][68608 ops][Tile 0] Post-Kernel:   171 cycles
[_L2][DB][68608 ops][Tile 0] Total      : 23768 cycles (95.8% Kernel + 4.2% Overhad, 22768 + 1000)
[_L2][DB][68608 ops][Tile 1] Pre-Kernel :   200 cycles
[_L2][DB][68608 ops][Tile 1] Kernel     : 22294 cycles
[_L2][DB][68608 ops][Tile 1] Post-Kernel:    75 cycles
[_L2][DB][68608 ops][Tile 1] Total      : 22569 cycles (98.8% Kernel + 1.2% Overhad, 22294 + 275)
[_L2][DB][68608 ops][Tile 2] Pre-Kernel :   143 cycles
[_L2][DB][68608 ops][Tile 2] Kernel     : 22277 cycles
[_L2][DB][68608 ops][Tile 2] Post-Kernel:    56 cycles
[_L2][DB][68608 ops][Tile 2] Total      : 22476 cycles (99.1% Kernel + 0.9% Overhad, 22277 + 199)
[_L2][DB][68608 ops][Tile 3] Pre-Kernel :   149 cycles
[_L2][DB][68608 ops][Tile 3] Kernel     : 22274 cycles
[_L2][DB][68608 ops][Tile 3] Post-Kernel:    71 cycles
[_L2][DB][68608 ops][Tile 3] Total      : 22494 cycles (99.0% Kernel + 1.0% Overhad, 22274 + 220)
[_L2][DB][68608 ops][Tile 4] Pre-Kernel :   138 cycles
[_L2][DB][68608 ops][Tile 4] Kernel     : 22294 cycles
[_L2][DB][68608 ops][Tile 4] Post-Kernel:    58 cycles
[_L2][DB][68608 ops][Tile 4] Total      : 22490 cycles (99.1% Kernel + 0.9% Overhad, 22294 + 196)
[_L2][DB][68608 ops][Tile 5] Pre-Kernel :    57 cycles
[_L2][DB][68608 ops][Tile 5] Kernel     : 22250 cycles
[_L2][DB][68608 ops][Tile 5] Post-Kernel:   216 cycles
[_L2][DB][68608 ops][Tile 5] Total      : 22523 cycles (98.8% Kernel + 1.2% Overhad, 22250 + 273)

Siracusa (Tiled, L3) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L3

Output

===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 0] Pre-Kernel :  1134 cycles
[_L2][SB][68608 ops][Tile 0] Kernel     : 44983 cycles
[_L2][SB][68608 ops][Tile 0] Post-Kernel:   375 cycles
[_L2][SB][68608 ops][Tile 0] Total      : 46492 cycles (96.8% Kernel + 3.2% Overhad, 44983 + 1509)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 1] Pre-Kernel :   428 cycles
[_L2][SB][68608 ops][Tile 1] Kernel     : 11776 cycles
[_L2][SB][68608 ops][Tile 1] Post-Kernel:   116 cycles
[_L2][SB][68608 ops][Tile 1] Total      : 12320 cycles (95.6% Kernel + 4.4% Overhad, 11776 + 544)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 2] Pre-Kernel :  1119 cycles
[_L2][SB][68608 ops][Tile 2] Kernel     : 44953 cycles
[_L2][SB][68608 ops][Tile 2] Post-Kernel:   341 cycles
[_L2][SB][68608 ops][Tile 2] Total      : 46413 cycles (96.9% Kernel + 3.1% Overhad, 44953 + 1460)
===== Profiling _L2 =====
[_L2][SB][68608 ops][Tile 3] Pre-Kernel :   428 cycles
[_L2][SB][68608 ops][Tile 3] Kernel     : 11705 cycles
[_L2][SB][68608 ops][Tile 3] Post-Kernel:   116 cycles
[_L2][SB][68608 ops][Tile 3] Total      : 12249 cycles (95.6% Kernel + 4.4% Overhad, 11705 + 544)
===== Profiling _L3 =====
[_L3][DB][68608 ops][Tile 0] Pre-Kernel : 14095 cycles
[_L3][DB][68608 ops][Tile 0] Kernel     : 46907 cycles
[_L3][DB][68608 ops][Tile 0] Post-Kernel:  7793 cycles
[_L3][DB][68608 ops][Tile 0] Total      : 68795 cycles (68.2% Kernel + 31.8% Overhad, 46907 + 21888)
[_L3][DB][68608 ops][Tile 1] Pre-Kernel : 19797 cycles
[_L3][DB][68608 ops][Tile 1] Kernel     : 12751 cycles
[_L3][DB][68608 ops][Tile 1] Post-Kernel:  4139 cycles
[_L3][DB][68608 ops][Tile 1] Total      : 36687 cycles (34.8% Kernel + 65.2% Overhad, 12751 + 23936)
[_L3][DB][68608 ops][Tile 2] Pre-Kernel : 13543 cycles
[_L3][DB][68608 ops][Tile 2] Kernel     : 46790 cycles
[_L3][DB][68608 ops][Tile 2] Post-Kernel:  7797 cycles
[_L3][DB][68608 ops][Tile 2] Total      : 68130 cycles (68.7% Kernel + 31.3% Overhad, 46790 + 21340)
[_L3][DB][68608 ops][Tile 3] Pre-Kernel :   309 cycles
[_L3][DB][68608 ops][Tile 3] Kernel     : 12660 cycles
[_L3][DB][68608 ops][Tile 3] Post-Kernel:  4240 cycles
[_L3][DB][68608 ops][Tile 3] Total      : 17209 cycles (73.6% Kernel + 26.4% Overhad, 12660 + 4549)
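
To compare overhead across memory levels without reading every line, the per-phase lines above can be aggregated with a small script. This is a minimal sketch that assumes the log format shown in these examples; `summarize` and `overhead_share` are illustrative names, not part of Deeploy.

```python
import re
from collections import defaultdict

# Matches the per-phase lines, e.g.
# "[_L3][DB][68608 ops][Tile 0] Pre-Kernel : 14095 cycles".
# "Total" lines and "===== Profiling ... =====" headers are skipped.
LINE_RE = re.compile(
    r"\[(?P<level>_L\d)\]\[(?:SB|DB)\]\[\d+ ops\]\[Tile \d+\] "
    r"(?P<phase>Pre-Kernel|Kernel|Post-Kernel)\s*:\s*(?P<cycles>\d+) cycles"
)


def summarize(log_text):
    """Sum kernel and non-kernel (pre + post) cycles per memory level."""
    totals = defaultdict(lambda: {"kernel": 0, "overhead": 0})
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        key = "kernel" if match.group("phase") == "Kernel" else "overhead"
        totals[match.group("level")][key] += int(match.group("cycles"))
    return dict(totals)


def overhead_share(level_totals):
    """Fraction of cycles at one level that were not spent in the kernel."""
    total = level_totals["kernel"] + level_totals["overhead"]
    return level_totals["overhead"] / total
```

Feeding the L3 output above into `summarize` and calling `overhead_share` on the `_L2` and `_L3` entries gives one number per level, which makes the difference between the L2-L1 and L3-L2 overhead easy to see.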

Previous Implementation

Siracusa (Tiled, L2) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L2

Output

[_L2][DB][68608 ops][Tile 0] Input DMA took 842 cycles
[_L2][DB][68608 ops][Tile 0] Kernel took 22861 cycles
[_L2][DB][68608 ops][Tile 0] Output DMA took 169 cycles
[_L2][DB][68608 ops][Tile 1] Input DMA took 218 cycles
[_L2][DB][68608 ops][Tile 1] Kernel took 22261 cycles
[_L2][DB][68608 ops][Tile 1] Output DMA took 103 cycles
[_L2][DB][68608 ops][Tile 2] Input DMA took 175 cycles
[_L2][DB][68608 ops][Tile 2] Kernel took 22250 cycles
[_L2][DB][68608 ops][Tile 2] Output DMA took 61 cycles
[_L2][DB][68608 ops][Tile 3] Input DMA took 140 cycles
[_L2][DB][68608 ops][Tile 3] Kernel took 22277 cycles
[_L2][DB][68608 ops][Tile 3] Output DMA took 61 cycles
[_L2][DB][68608 ops][Tile 4] Input DMA took 126 cycles
[_L2][DB][68608 ops][Tile 4] Kernel took 22258 cycles
[_L2][DB][68608 ops][Tile 4] Output DMA took 63 cycles
[_L2][DB][68608 ops][Tile 5] Input DMA took 57 cycles
[_L2][DB][68608 ops][Tile 5] Kernel took 22247 cycles
[_L2][DB][68608 ops][Tile 5] Output DMA took 225 cycles

Siracusa (Tiled, L3) FloatGEMM

Command

python testRunner_tiled_siracusa.py  -t Tests/testFloatGEMM  --l1 10000 --doublebuffer --profileTiling --defaultMemLevel=L3

Output

[_L2][SB][68608 ops][Tile 0] Input DMA took 1165 cycles
[_L2][SB][68608 ops][Tile 0] Kernel took 45001 cycles
[_L2][SB][68608 ops][Tile 0] Output DMA took 393 cycles
[_L2][SB][68608 ops][Tile 1] Input DMA took 409 cycles
[_L2][SB][68608 ops][Tile 1] Kernel took 11642 cycles
[_L2][SB][68608 ops][Tile 1] Output DMA took 184 cycles
[_L2][SB][68608 ops][Tile 2] Input DMA took 1176 cycles
[_L2][SB][68608 ops][Tile 2] Kernel took 44855 cycles
[_L2][SB][68608 ops][Tile 2] Output DMA took 393 cycles
[_L2][SB][68608 ops][Tile 3] Input DMA took 409 cycles
[_L2][SB][68608 ops][Tile 3] Kernel took 11649 cycles
[_L2][SB][68608 ops][Tile 3] Output DMA took 150 cycles
[_L3][DB][68608 ops][Tile 0] Input DMA took 14121 cycles
[_L3][DB][68608 ops][Tile 0] Output DMA took 7736 cycles
[_L3][DB][68608 ops][Tile 1] Input DMA took 19764 cycles
[_L3][DB][68608 ops][Tile 1] Output DMA took 3921 cycles
[_L3][DB][68608 ops][Tile 2] Input DMA took 13280 cycles
[_L3][DB][68608 ops][Tile 2] Output DMA took 7650 cycles
[_L3][DB][68608 ops][Tile 3] Input DMA took 274 cycles
[_L3][DB][68608 ops][Tile 3] Output DMA took 4011 cycles

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR is reviewed and approved.
All checks are passing.
The CHANGELOG.md file has been updated.
If the docker was modified, change back its link after review.

coderabbitai · 2025-12-12T15:43:45Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

Changed
- Profiling now calculates and reports non-kernel overhead alongside total execution time.
- Kernel profiling information extended across all memory levels.
- Enhanced profiling output includes Total, Kernel, and DMA overhead percentages with component measurements.


Walkthrough

For profiling, this change always enables kernel-level tiling in code generation, adds per-phase measurement templates, and prints total time and overhead; CHANGELOG.md is updated to reflect the profiling improvements.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Changelog `CHANGELOG.md` | Added Unreleased entry "Improve Profiling (`#138`)" and listed changes: calculate non-kernel overhead, show total time, print kernel profiling for all memory levels |
| Buffering codegen (kernel-level flag) `Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py`, `Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py` | `TilingMetaInfo` construction changed: `kernelLevelTiling` hardcoded to `True` (removed `self.localMemory == "L1"` check), affecting profiling gating in both buffering strategies |
| Profiling templates & injection `Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py` | Added `_measurementDeclaration` and `_printCycleContribution` templates; replaced inlined cycle-diff prints with measurement variables; `injectPrintCycleDiff` now injects per-phase measurements (ingress, optional kernel, egress) and, when kernel-level tiling is enabled, prints total/time and kernel vs DMA overhead breakdown |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Support Fully Asynchronous DMAs #114: Adds profiling mixins/templates and profiling-aware tiling classes that this PR extends/uses.

Suggested reviewers

Victor-Jung
lukamac

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Improve Profiling' directly matches the main objective of the PR, which is to improve profiling output for tiled executions by splitting timing data. |
| Description check | ✅ Passed | The description clearly explains the PR's purpose: improving log output by splitting execution time into Pre-Kernel, Kernel, and Post-Kernel measurements, with specific examples and comparisons to the previous implementation. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1)

120-126: Hardcoding kernelLevelTiling=True: please confirm this flag isn’t used for non-profiling semantics elsewhere.
If the intent is “always emit kernel/total profiling breakdown for every memory level”, this is fine; but the field name now reads misleadingly—consider a follow-up rename/comment to reflect “enable kernel profiling output”.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 363b7d5 and fcb433d.

📒 Files selected for processing (4)

CHANGELOG.md (3 hunks)
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1 hunks)
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (1 hunks)
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py

🧬 Code graph analysis (3)

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py (2)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

TilingMetaInfo (13-20)

Deeploy/DeeployTypes.py (1)

nodeName (1542-1545)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

Deeploy/DeeployTypes.py (4)

NodeTemplate (87-229)

executionBlock (1536-1539)

addRight (1421-1434)

nodeName (1542-1545)

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (2)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

TilingMetaInfo (13-20)

Deeploy/DeeployTypes.py (1)

nodeName (1542-1545)

🔇 Additional comments (2)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

118-150: Watch generated-code footprint and C dialect assumptions (profiling arrays + local decls).
This adds multiple uint32_t [totalNumTiles] arrays (and loop-local decls); please sanity-check stack/placement for large tile counts and confirm the target compiler/libc supports C99-style declarations + float printf. If footprint is tight, consider storing durations directly to eliminate the _end_measurements arrays.

Also applies to: 152-234
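
The footprint concern can be estimated with simple arithmetic. The sketch below assumes one `uint32_t` (4 bytes) per tile per array; the array counts used in the example (six when start and end samples are kept for three phases, three when only durations are stored) are assumptions for illustration and are not taken from the generated code.

```python
UINT32_BYTES = 4  # size of one uint32_t measurement slot


def profiling_footprint_bytes(total_num_tiles, arrays_per_level, num_levels=1):
    """Bytes needed for per-tile profiling arrays (hypothetical layout)."""
    return total_num_tiles * arrays_per_level * num_levels * UINT32_BYTES


# Assumed layouts for a 6-tile node at a single memory level:
start_and_end = profiling_footprint_bytes(6, arrays_per_level=6)
durations_only = profiling_footprint_bytes(6, arrays_per_level=3)
print(start_and_end, durations_only)
```

Under these assumptions, storing durations directly halves the buffer size, which is the saving the comment above hints at.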

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)

279-285: Same note as SB: confirm unconditional kernelLevelTiling=True won’t affect non-profiling behavior.
Good that SB/DB are now consistent, but please verify no downstream logic depends on the old (self.localMemory == "L1") meaning.

CHANGELOG.md

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py

diaconuccalin

Very useful feature I wouldn't even have thought about :) Minor typo, otherwise LGTM.

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py

Victor-Jung

LGTM, cool stuff!

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

CHANGELOG.md (1)

7-7: Changelog entries accurately reflect the profiling improvements.

The entries correctly describe:

Non-kernel overhead calculation and total time display (Added)

Kernel profiling for all memory levels (Changed)

These align with the code changes in TilingPrototypes.py and the hardcoded kernelLevelTiling = True in the code generation files.

A previous review noted punctuation/wording consistency—consider adding trailing periods and aligning metric names (Pre/Kernel/Post/Total + overhead) for consistency.

Also applies to: 85-85, 128-128

🧹 Nitpick comments (2)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)

97-103: Minor formatting issue: extra whitespace in printf format.

Line 102 has extra spaces between overhead_percentage and , ${measurementKernel}. This does not affect functionality but is inconsistent with the code style.

🔎 Suggested fix
-    printf("%s%u] Total      :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage    , ${measurementKernel}, dma);
+    printf("%s%u] Total      :%6u cycles (%2.1f%% Kernel + %2.1f%% Overhead, %u + %u)\\n", ${prefixStr}, ${profileIdxVar}, total, kernel_percentage, overhead_percentage, ${measurementKernel}, dma);

196-216: Inconsistent spacing in flavor strings affects output alignment.

The flavor strings have inconsistent formatting:

"Pre-Kernel :" (space before colon)

"Kernel :" (space before colon)

"Post-Kernel:" (no space before colon)

This causes misaligned output in profiling logs.
🔎 Suggested fix for consistent alignment
-                "flavorStr": "Post-Kernel:",
+                "flavorStr": "Post-Kernel :",

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcb433d and d73f810.

📒 Files selected for processing (4)

CHANGELOG.md
Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py
Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py
Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py

🚧 Files skipped from review as they are similar to previous changes (1)

Deeploy/TilingExtension/CodeTransformationPasses/SingleBufferingTilingCodeGeneration.py

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion manually.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The _legalizeTransfers function in TilingCodeGeneration.py handles conversion from elements to bytes for DMA operations when isFinalMemoryLevel is true, eliminating the need for individual DMA implementations like MchanDma to perform this conversion.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

📚 Learning: 2025-09-09T15:58:06.454Z

Learnt from: Xeratec
Repo: pulp-platform/Deeploy PR: 105
File: Deeploy/Targets/PULPOpen/DMA/MchanDma.py:61-64
Timestamp: 2025-09-09T15:58:06.454Z
Learning: The tiling pipeline in Deeploy handles unit conversion and normalization through functions like _legalizeTransfers, ensuring that DMA implementations receive properly formatted transfer parameters without needing to perform manual element-to-byte conversions.

Applied to files:

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py

🧬 Code graph analysis (1)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (1)

Deeploy/DeeployTypes.py (4)

NodeTemplate (87-229)

executionBlock (1536-1539)

addRight (1421-1434)

nodeName (1542-1545)

🔇 Additional comments (3)

Deeploy/TilingExtension/CodeTransformationPasses/TilingPrototypes.py (2)

167-190: Measurement declarations correctly compute elapsed cycles.

The injected measurement variables properly calculate timing differences (end - start) for each phase, with kernel measurement correctly gated by kernelLevelTiling.
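
As a sketch of the `end - start` computation: assuming the cycle counters are 32-bit unsigned values (the earlier review mentions `uint32_t` arrays), C's unsigned subtraction wraps modulo 2^32, so the difference stays correct across a single counter wraparound. The Python model below mimics that with a mask; `elapsed_cycles` is an illustrative name.

```python
MASK32 = 0xFFFFFFFF  # models uint32_t arithmetic


def elapsed_cycles(start, end):
    """Elapsed cycles between two 32-bit counter samples.

    Correct as long as the counter wrapped at most once between samples.
    """
    return (end - start) & MASK32


print(elapsed_cycles(100, 929))          # no wraparound
print(elapsed_cycles(0xFFFFFF00, 0x10))  # counter wrapped between samples
```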

220-230: Total time and overhead calculation logic is correct.

The implementation properly:

Computes total time as sum of all phases

Calculates overhead as the non-kernel (DMA) portion

Gates the output on kernelLevelTiling

Previous review feedback (typo fix and variable rename) has been addressed.
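
The logic reviewed above can be mirrored in Python to check a log line by hand. This is a sketch under stated assumptions: the format follows the `printf` string quoted in the suggested fix earlier in this review, the prefix string is assumed to end with `[Tile `, and `format_total_line` is a hypothetical helper rather than anything in Deeploy.

```python
def format_total_line(prefix, tile_idx, pre, kernel, post):
    """Build a 'Total' line from the three per-phase cycle counts."""
    dma = pre + post              # non-kernel overhead
    total = kernel + dma          # sum of all phases
    kernel_pct = 100.0 * kernel / total
    overhead_pct = 100.0 * dma / total
    return (
        f"{prefix}{tile_idx}] Total      :{total:6d} cycles "
        f"({kernel_pct:2.1f}% Kernel + {overhead_pct:2.1f}% Overhead, "
        f"{kernel} + {dma})"
    )


# Tile 0 of the L2 example in the PR description.
print(format_total_line("[_L2][DB][68608 ops][Tile ", 0, 829, 22768, 171))
```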

Deeploy/TilingExtension/CodeTransformationPasses/DoubleBufferingTilingCodeGeneration.py (1)

279-285: Kernel-level tiling now unconditionally enabled for profiling.

This change enables kernel-level measurement instrumentation for all memory levels, supporting the new overhead calculation and total time reporting in TilingPrototypes.py. Previously, this was gated by self.localMemory == "L1". The same kernelLevelTiling = True change is consistently applied in SingleBufferingTilingCodeGeneration.py, ensuring uniform profiling behavior across both buffering strategies.
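
The behavioral difference described above can be illustrated with a small stand-in. `MetaInfoSketch` below is hypothetical and carries only two fields; the real `TilingMetaInfo` has a different, larger constructor that is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MetaInfoSketch:
    """Hypothetical stand-in for TilingMetaInfo (illustration only)."""
    nodeName: str
    kernelLevelTiling: bool


def meta_before(node_name, local_memory):
    # Old behavior: kernel measurements only when tiling into L1.
    return MetaInfoSketch(node_name, kernelLevelTiling=(local_memory == "L1"))


def meta_after(node_name, local_memory):
    # New behavior: the flag is always set, whatever the memory level.
    return MetaInfoSketch(node_name, kernelLevelTiling=True)


for level in ("L1", "L2"):
    print(level, meta_before("gemm", level).kernelLevelTiling,
          meta_after("gemm", level).kernelLevelTiling)
```

With the old gating, a node tiled from L3 into L2 would get no kernel line at that level, which matches the missing `[_L3]` kernel entries in the "Previous Implementation" output.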

Xeratec added this to the Release 0.2.1 milestone Dec 12, 2025

Xeratec self-assigned this Dec 12, 2025

Xeratec requested a review from Victor-Jung as a code owner December 12, 2025 15:40

Xeratec added the Feature Addition of new features label Dec 12, 2025

Xeratec added this to Deeploy Dec 12, 2025

Xeratec moved this to Need Reviewer in Deeploy Dec 12, 2025

Xeratec moved this from Need Reviewer to Ready for Merge in Deeploy Dec 12, 2025

Xeratec moved this from Ready for Merge to In review in Deeploy Dec 12, 2025

coderabbitai bot reviewed Dec 12, 2025


Xeratec requested a review from diaconuccalin December 15, 2025 12:34

diaconuccalin approved these changes Dec 15, 2025


Victor-Jung approved these changes Dec 16, 2025


Xeratec added 4 commits December 24, 2025 01:39

Improve profiling information

5cff251

Profile all memory levels

20c47a8

Update Changelog

e0e7abf

Implement PR feedback

d73f810

Xeratec force-pushed the pr/detailed_profile branch from fcb433d to d73f810 Compare December 24, 2025 00:41

coderabbitai bot reviewed Dec 24, 2025


Xeratec merged commit f792722 into pulp-platform:devel Dec 24, 2025
142 checks passed

github-project-automation bot moved this from In review to Done in Deeploy Dec 24, 2025

Xeratec deleted the pr/detailed_profile branch December 24, 2025 11:50
