Add ck-build-analysis skill for compilation profiling #3561

tenpercent · 2026-01-14T03:14:31Z

Add Build Time Analysis Skill

Summary

Adds ck-build-analysis skill to automate compilation profiling using Clang's -ftime-trace feature. This skill helps identify template instantiation bottlenecks and optimize build times for Composable Kernel targets.

Motivation

CK's heavy use of template metaprogramming leads to long compilation times (20-30+ seconds per file). Understanding where compilation time is spent is critical for:

- Identifying expensive template instantiations
- Optimizing template hierarchies
- Reducing build times through explicit instantiation or extern templates
- Making informed architectural decisions

Changes

Added two new files to .claude/skills/:

- ck-build-analysis - Executable bash script that:
  - Configures CMake with -ftime-trace and custom granularity
  - Builds the specified target in Docker
  - Analyzes the generated trace JSON file
  - Generates a comprehensive markdown report
- ck-build-analysis.md - Documentation with:
  - Usage examples and command-line options
  - Granularity trade-offs and recommendations
  - Natural language interface for Claude
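
As a rough illustration of the first step, the sketch below assembles the Clang flags that enable time tracing at a chosen granularity. `-ftime-trace` and `-ftime-trace-granularity=<microseconds>` are real Clang options; the helper names and the idea of passing them through `CMAKE_CXX_FLAGS` are assumptions here, since the script's actual CMake invocation is not shown in this PR description.

```python
def time_trace_flags(granularity_us: int = 500) -> list[str]:
    """Return Clang flags that enable -ftime-trace at the given granularity (µs)."""
    if granularity_us < 0:
        raise ValueError("granularity must be non-negative")
    return ["-ftime-trace", f"-ftime-trace-granularity={granularity_us}"]


def cmake_cxx_flags_arg(granularity_us: int) -> str:
    """Hypothetical helper: build a -DCMAKE_CXX_FLAGS=... argument for cmake."""
    return "-DCMAKE_CXX_FLAGS=" + " ".join(time_trace_flags(granularity_us))


print(cmake_cxx_flags_arg(100))
```

A real configure step would likely need to merge these flags with any the project already sets rather than overwrite them.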

Usage

# Quick analysis (default 500µs granularity)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8

# Balanced analysis (recommended - 100µs granularity)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=100

# High-resolution analysis (1µs granularity - captures all events)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=1 --output=detailed_report.md

# Analyze any CK target
.claude/skills/ck-build-analysis test_amdgcn_mma --granularity=100

Example Output

The generated report includes:

Executive Summary

- Wall Clock Time: 22.2 seconds
- Trace Time: 81.6 seconds
- Template Instantiation Time: 21.7 seconds (26.6% of trace)
- Total Events Captured: 26,912
- Total Template Instantiations: 13,931
- Unique Template Families: 424

Key Sections

- Compilation Phase Breakdown - Time spent in InstantiateFunction, Frontend, Backend, Optimizer
- Top 30 Individual Instantiations - Most expensive single template instantiations
- Template Families by Total Time - Which template families consume the most time
- Template Families by Count - Which templates are instantiated most frequently
- Optimization Recommendations - Actionable short/medium/long-term strategies
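
To make the "template family" sections concrete, here is a minimal sketch of how instantiation events in a `-ftime-trace` JSON file could be grouped. It relies on the trace layout Clang emits (a `traceEvents` array of complete events with `name`, `dur` in microseconds, and `args.detail`). The family rule (everything before the first `<`) and the sample events are assumptions made for illustration, not the skill's actual logic or data.

```python
import json
from collections import defaultdict

INSTANTIATION_EVENTS = {"InstantiateFunction", "InstantiateClass"}


def family_of(detail: str) -> str:
    """Assumed rule: the family is the qualified name before the first '<'."""
    return detail.split("<", 1)[0].strip()


def summarize(trace: dict) -> dict[str, tuple[int, int]]:
    """Map family -> (instantiation count, total duration in microseconds)."""
    totals = defaultdict(lambda: [0, 0])
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X" or event.get("name") not in INSTANTIATION_EVENTS:
            continue
        family = family_of(event.get("args", {}).get("detail", ""))
        totals[family][0] += 1
        totals[family][1] += int(event["dur"])
    return {family: (count, total) for family, (count, total) in totals.items()}


# Toy trace with made-up durations, shaped like Clang's output.
sample = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "InstantiateClass", "dur": 1500,
   "args": {"detail": "ck::TensorDescriptor<int>"}},
  {"ph": "X", "name": "InstantiateClass", "dur": 500,
   "args": {"detail": "ck::TensorDescriptor<long>"}},
  {"ph": "X", "name": "Frontend", "dur": 9000}
]}
""")
print(summarize(sample))
```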

Analysis Results

Testing on example_convnd_fwd_xdl_fp8 revealed:

Granularity Comparison

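| Granularity | Trace Size | Instantiations | Unique Families |
|-------------|------------|----------------|-----------------|
| 500µs (default) | 3.7 MB | 5,104 | 221 |
| 100µs (balanced) | 11 MB | 13,931 | 424 |
| 1µs (high-res) | 81 MB | 36,506 | 766 |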

Finding: Default 500µs threshold filters out 86% of template instantiations. Using 100µs captures 2.7x more data while keeping trace files manageable.
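
The two figures in this finding follow directly from the instantiation counts in the table. A short check, assuming the 1µs run is the baseline for "all" instantiations:

```python
# Instantiation counts per granularity (µs), copied from the comparison table.
counts = {500: 5_104, 100: 13_931, 1: 36_506}

# Share of instantiations the 500µs threshold drops, relative to the 1µs run.
filtered_at_default = 1 - counts[500] / counts[1]

# How much more the 100µs run captures than the 500µs run.
gain_at_100 = counts[100] / counts[500]

print(f"{filtered_at_default:.0%} filtered at 500µs")
print(f"{gain_at_100:.1f}x more instantiations at 100µs")
```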

Top Template Bottlenecks (100µs granularity)

| Template Family | Count | Total Time | % of Total |
|-----------------|-------|------------|------------|
| TensorDescriptor | 2,297 | 4.0s | 18.5% |
| DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle | 87 | 2.6s | 11.9% |
| run_grouped_conv_fwd | 3 | 1.7s | 8.1% |
| transform_tensor_descriptor | 281 | 1.7s | 7.7% |

Key Insights:

- Template instantiation accounts for 26.6% of total trace time (21.7 s of 81.6 s)
- run_grouped_conv_fwd has only 3 instantiations but averages 583ms each
- Top 10 template families account for 64% of instantiation time
- 13,931 total instantiations suggest significant over-templating

Integration

- Works with ck-docker: Automatically uses existing Docker container
- Granularity options: 500µs (fast), 100µs (balanced), 1µs (complete)
- Custom output: Specify report filename with --output
- No code changes: Pure analysis tool, doesn't modify source

Testing

Tested with multiple targets and granularity levels:

✅ example_convnd_fwd_xdl_fp8 (500µs, 100µs, 1µs)
✅ Trace file parsing and analysis
✅ Report generation with all statistics
✅ Container auto-start integration

Future Improvements

Potential enhancements:

- Compare multiple builds to track improvements
- Generate flamegraphs from trace data
- Integration with CI to track compilation time regressions
- Automated suggestions for template optimization

Example Report Preview

## Template Families by Total Time (Top 10)

| Rank | Template Family | Count | Total (ms) | Avg (ms) | % of Total |
|------|-----------------|-------|------------|----------|------------|
|    1 | TensorDescriptor                            |  2297 |    4028.86 |     1.75 |      18.5% |
|    2 | tensor_operation::device::DeviceGroupedC... |    87 |    2585.59 |    29.72 |      11.9% |
|    3 | run_grouped_conv_fwd                        |     3 |    1749.93 |   583.31 |       8.1% |
|    4 | transform_tensor_descriptor                 |   281 |    1689.94 |     6.01 |       7.7% |
...

Documentation

The .md file provides:

- Clear usage examples
- Granularity trade-off table
- Natural language interface for Claude
- Environment variable configuration
- Tips for choosing the right granularity

This skill enables data-driven optimization of CK build times by making -ftime-trace analysis easy and automated.


Add automated build time analysis using Clang's -ftime-trace feature to identify template instantiation bottlenecks.

Features:
- Configurable granularity (500µs, 100µs, 1µs)
- Comprehensive markdown reports with statistics
- Template family analysis and optimization recommendations
- Integration with ck-docker for containerized builds

Testing shows default 500µs granularity filters out 86% of template instantiations. Using 100µs captures 2.7x more data while keeping trace files manageable at ~11MB.

Key findings on example_convnd_fwd_xdl_fp8:
- Template instantiation: 26.6% of compilation time
- TensorDescriptor: 2,297 instantiations (18.5% of time)
- run_grouped_conv_fwd: Only 3 instantiations but 583ms average

Co-Authored-By: Claude <noreply@anthropic.com>

- Add Jinja2 template for report generation (.claude/skills/templates/build_analysis_report.md.jinja)
- Refactor analysis script to use template rendering instead of string concatenation
- Add custom Jinja2 filters for formatting (format_number, truncate, pad)
- Separate presentation from logic for better maintainability
- Template makes report format easier to modify and extend

Requirements:
- python3-jinja2 must be installed in Docker container (apt-get install python3-jinja2)

Benefits:
- Cleaner code with separation of concerns
- Easier to customize report format
- Better readability and maintainability

Co-Authored-By: Claude <noreply@anthropic.com>
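
The commit names three custom filters but does not describe their behavior. The sketch below is one plausible reading, written as plain functions so it runs without Jinja2 installed. The signatures and the 43-character width are assumptions inferred from the example report preview, not the PR's code.

```python
def format_number(value: int) -> str:
    """Render an integer with thousands separators, e.g. 13931 -> '13,931'."""
    return f"{value:,}"


def truncate(text: str, width: int) -> str:
    """Shorten text to at most `width` characters, ending in '...' when cut."""
    if len(text) <= width:
        return text
    if width <= 3:
        return text[:width]
    return text[: width - 3] + "..."


def pad(text: str, width: int) -> str:
    """Left-align text in a fixed-width column for aligned table rows."""
    return text.ljust(width)


# With Jinja2 available, each function could be registered on an Environment:
#     env.filters["format_number"] = format_number
name = "tensor_operation::device::DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle"
print(f"| {pad(truncate(name, 43), 43)} | {format_number(13931)} |")
```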

- Extract analysis script from bash heredoc into standalone Python file
- Add PEP 723 inline script metadata for dependency management
- Make script compatible with pipx and uv for automatic dependency installation
- Improve code organization with proper functions and docstrings
- Update documentation with PEP 723 usage examples

Changes:
- New file: analyze_build_trace.py (PEP 723 compliant)
- Modified: ck-build-analysis (now uses external Python script)
- Modified: ck-build-analysis.md (added implementation details section)

Benefits:
- Script can be run standalone with pipx/uv
- Better code organization and maintainability
- Clear dependency declaration
- Easier to test and develop independently

Example standalone usage: pipx run .claude/skills/analyze_build_trace.py trace.json report.md target 100 22 templates/

Co-Authored-By: Claude <noreply@anthropic.com>
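
For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block looks like and how a tool can read it. The regular expression is the one given in the PEP's reference implementation. The embedded script text and its dependency list are illustrative, not copied from analyze_build_trace.py.

```python
import re
from typing import Optional

# Regular expression from the PEP 723 reference implementation.
METADATA_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Toy script text; the header format is PEP 723, the contents are assumed.
SCRIPT = '''\
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "jinja2",
# ]
# ///
print("analysis would run here")
'''


def read_script_metadata(text: str) -> Optional[str]:
    """Return the TOML text of the 'script' metadata block, or None if absent."""
    for match in METADATA_BLOCK.finditer(text):
        if match.group("type") == "script":
            # Strip the leading "# " (or bare "#") from every metadata line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return None


print(read_script_metadata(SCRIPT))
```

Runners such as pipx and uv read this block and install the listed dependencies before executing the script, which is what makes the standalone invocation possible.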

- Automatically detect and use uv if available in container
- Fall back to python3 if uv not found (backward compatible)
- Leverage PEP 723 metadata for zero-config dependency installation
- Update documentation with uv installation instructions

Benefits:
- Zero manual dependency installation with uv
- Isolated dependency environment (no system pollution)
- Fast dependency caching for subsequent runs
- Automatic dependency resolution from PEP 723 metadata

Tested with:
- uv 0.9.25: Auto-installs jinja2 from PEP 723 metadata
- python3: Falls back when uv unavailable (requires python3-jinja2)

Installation: docker exec <container> bash -c "curl -LsSf https://astral.sh/uv/install.sh | sh"

Co-Authored-By: Claude <noreply@anthropic.com>
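
The detect-and-fall-back behavior described in this commit can be sketched as follows. The real logic lives in the bash skill and runs inside the container; this Python version only illustrates the decision, and the function name is hypothetical.

```python
import shutil
from typing import Callable, Optional, Sequence


def analysis_command(
    script: str,
    args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Prefer `uv run` (which honors PEP 723 metadata); otherwise use python3."""
    if which("uv") is not None:
        return ["uv", "run", script, *args]
    # Fallback path: dependencies such as jinja2 must already be installed.
    return ["python3", script, *args]


print(analysis_command("analyze_build_trace.py", ["trace.json", "report.md"]))
```

Injecting `which` keeps the function testable without depending on what is installed on the host.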

- Extract shared configuration logic to .claude/skills/common.sh
  - Container naming and detection functions
  - Git branch sanitization
  - Docker image configuration
  - GPU target detection
  - Reduces ~50 lines of duplicate code between skills
- Refactor ck-docker to use common.sh utilities
  - Replace manual docker ps checks with helper functions
  - Use shared container_exists() and container_is_running()
  - Use shared detect_gpu_target() and get_docker_image()
- Refactor ck-build-analysis to use common.sh utilities
  - Use shared get_project_root() and get_container_name()
  - Use shared ensure_container_running()
  - Use shared detect_gpu_target()
- Change default granularity from 500µs to 100µs
  - Provides better balance between detail and performance
  - Captures ~15k instantiations vs ~5k at 500µs
  - Still manageable 15-20 MB trace files
  - Update all documentation and help text

Co-Authored-By: Claude <noreply@anthropic.com>

- Automatically install uv if not found in container
- Eliminates manual dependency setup
- No fallback to python3 + manual jinja2 installation needed
- First run installs uv (~5 seconds), subsequent runs use cached version
- Update documentation to reflect automatic installation

Co-Authored-By: Claude <noreply@anthropic.com>

- Install uv via Ubuntu package manager (pipx) for security
- Avoids piping curl to bash which is a security concern
- More reliable and verifiable installation method
- Auto-installs pipx via apt if not already present
- Update documentation to reflect package-based installation

Co-Authored-By: Claude <noreply@anthropic.com>

Security fixes:

1. Command Injection Prevention
- Use docker exec -e flag to pass variables as environment variables
- Change bash -c to use single quotes to prevent shell expansion
- Properly quote all variables within the single-quoted commands
- Affects: CMAKE configuration, ninja build, trace file search, Python analysis

2. Path Traversal Protection for OUTPUT_FILE
- Validate OUTPUT_FILE contains no path separators (/)
- Validate OUTPUT_FILE contains no parent directory references (..)
- Allows file extensions (.md) but blocks directory traversal
- Prevents writing files outside project directory

Tested:
- ✅ Path traversal blocked: --output="../../../tmp/evil.md"
- ✅ Double-dot blocked: --output="..evil.md"
- ✅ Normal operation: --output="security_test.md"
- ✅ Build process works with quoted variables

Co-Authored-By: Claude <noreply@anthropic.com>
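
Both fixes can be illustrated with a small sketch. The validation rules mirror the ones listed in this commit (no `/`, no `..`), and `docker exec -e` is the real Docker flag for setting environment variables. The function names, the container name, and the `ninja "$TARGET"` command are hypothetical stand-ins for the bash implementation.

```python
def is_safe_output_name(name: str) -> bool:
    """Accept a bare file name; reject path separators and any '..' sequence."""
    if not name:
        return False
    if "/" in name:  # no path separators
        return False
    if ".." in name:  # no parent-directory references (also rejects "..evil.md")
        return False
    return True


def docker_exec_args(container: str, env: dict[str, str], script: str) -> list[str]:
    """Build a `docker exec` argv that passes values via -e, not interpolation."""
    args = ["docker", "exec"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    # The script text stays constant; it reads "$TARGET" etc. from the environment.
    return args + [container, "bash", "-c", script]


for candidate in ["../../../tmp/evil.md", "..evil.md", "security_test.md"]:
    print(candidate, is_safe_output_name(candidate))
print(docker_exec_args("ck_container", {"TARGET": "test_amdgcn_mma"}, 'ninja "$TARGET"'))
```

Because the user-supplied value never becomes part of the command string, shell metacharacters in it are not interpreted by `bash -c`.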

Performance and precision improvements:
- Parse durations as integers (microseconds) instead of floats (milliseconds)
- Accumulate all durations in microseconds for better precision
- Use integer division for average calculations
- Avoid floating point arithmetic throughout data processing

Template updates:
- Add us_to_ms and us_to_s Jinja2 filters for display formatting
- Convert microseconds to milliseconds/seconds only for display
- Update all template fields to use conversion filters
- Maintain precision in calculations, format only for output

Benefits:
- Better precision (no floating point rounding errors)
- Faster processing (integer arithmetic)
- Matches native trace file format (microseconds)
- Cleaner separation of storage vs display formatting

Co-Authored-By: Claude <noreply@anthropic.com>
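
A minimal sketch of the store-as-integers, format-on-display split described here. The filter names `us_to_ms` and `us_to_s` come from the commit; their exact output format is an assumption, and the microsecond totals are back-derived from the millisecond figures in the report preview rather than taken from a trace.

```python
def average_us(total_us: int, count: int) -> int:
    """Integer average in microseconds; 0 when there are no samples."""
    return total_us // count if count else 0


def us_to_ms(value_us: int) -> str:
    """Display-only conversion: microseconds to milliseconds, two decimals."""
    return f"{value_us / 1000:.2f}"


def us_to_s(value_us: int) -> str:
    """Display-only conversion: microseconds to seconds, one decimal."""
    return f"{value_us / 1_000_000:.1f}"


# run_grouped_conv_fwd: 1749.93 ms over 3 instantiations in the report preview.
total_us, count = 1_749_930, 3
print(us_to_ms(total_us), us_to_ms(average_us(total_us, count)))
# TensorDescriptor: 4028.86 ms total in the report preview.
print(us_to_s(4_028_860))
```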

Instead of generic boilerplate advice, generate specific actionable recommendations based on the actual analysis data:

High-Impact Targets (by total time):
- Show top 5 templates with actual times and percentages
- Recommend strategy based on patterns:
  - High count (>100) → Extern templates
  - High individual cost (>50ms) → Template specialization
  - Otherwise → Explicit instantiation

Frequently Instantiated (>100 times):
- Identify templates compiled repeatedly
- Recommend PCH or extern templates

Most Expensive Individual Instantiations:
- Show top 3 specific instantiations to profile
- Point to exact templates consuming most time

Example before (useless): "Focus on High-Impact Templates: Address top 10 families first"

Example after (actionable): "TensorDescriptor - 4.2s total (18.1%) - 2,546 instantiations, 1.65ms average - Strategy: Extern templates - High instantiation count"

Co-Authored-By: Claude <noreply@anthropic.com>
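
The strategy rules listed in this commit amount to a small decision function. The thresholds come from the commit message. The order in which the rules are checked, and the reading of "individual cost" as the average per-instantiation time, are assumptions.

```python
def recommend_strategy(count: int, avg_us: int) -> str:
    """Pick a strategy from instantiation count and average cost (µs)."""
    if count > 100:  # high instantiation count
        return "Extern templates"
    if avg_us > 50_000:  # more than 50 ms per instantiation
        return "Template specialization"
    return "Explicit instantiation"


# Figures taken from the examples and tables elsewhere on this page.
print(recommend_strategy(count=2_546, avg_us=1_650))  # TensorDescriptor
print(recommend_strategy(count=3, avg_us=583_310))  # run_grouped_conv_fwd
```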

- Add AMD copyright header and MIT license identifier
- Format code with ruff for consistent style
- Remove unused pathlib.Path import
- Convert single quotes to double quotes
- Fix line wrapping and indentation per ruff style

All ruff checks now pass without errors.

Co-Authored-By: Claude <noreply@anthropic.com>

Add AMD copyright and MIT license identifier to:
- common.sh
- ck-build-analysis
- ck-docker

Matches the copyright header format used throughout the codebase.

Co-Authored-By: Claude <noreply@anthropic.com>

tenpercent and others added 5 commits January 13, 2026 18:38

add the skill for running a docker container with correct options; build and run tests in the container

0148810

try to handle corner cases

5ba3926

Merge branch 'develop' into tenpercent/cc-skill-build

7d18bd4

combine build and rebuild

ba65875

tenpercent requested review from a team, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, ddembeckAMD, geyyer, illsilin, poyenc, qianfengz, shumway and vidyasagar-amd as code owners January 14, 2026 03:14

tenpercent and others added 8 commits January 13, 2026 22:41

tenpercent and others added 3 commits January 14, 2026 00:18

Add copyright headers to all shell scripts

23ea6ed

