@tenpercent
Contributor

Add Build Time Analysis Skill

Summary

Adds the ck-build-analysis skill, which automates compilation profiling with Clang's -ftime-trace feature. The skill helps identify template instantiation bottlenecks so that build times for Composable Kernel (CK) targets can be reduced.

Motivation

CK's heavy use of template metaprogramming leads to long compilation times (20-30+ seconds per file). Understanding where compilation time is spent is critical for:

  • Identifying expensive template instantiations
  • Optimizing template hierarchies
  • Reducing build times through explicit instantiation or extern templates
  • Making informed architectural decisions

Changes

Added two new files to .claude/skills/:

  1. ck-build-analysis - Executable bash script that:

    • Configures CMake with -ftime-trace and custom granularity
    • Builds the specified target in Docker
    • Analyzes the generated trace JSON file
    • Generates a comprehensive markdown report
  2. ck-build-analysis.md - Documentation with:

    • Usage examples and command-line options
    • Granularity trade-offs and recommendations
    • Natural language interface for Claude
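As a rough illustration of the configure step, the sketch below assembles a CMake command line that enables Clang's time tracing. `-ftime-trace` and `-ftime-trace-granularity=<µs>` are real Clang flags; the function name, the `build` directory, the Ninja generator choice, and passing the flags through `CMAKE_CXX_FLAGS` are assumptions for illustration and may differ from what the script actually does.

```python
def time_trace_cmake_args(granularity_us: int, build_dir: str = "build") -> list:
    """Return a CMake configure command that enables Clang time tracing.

    Hypothetical helper: the real skill is a bash script and may pass
    additional CK-specific options that are not shown here.
    """
    if granularity_us < 0:
        raise ValueError("granularity must be a non-negative number of microseconds")
    # Events shorter than the granularity are dropped from the trace.
    flags = f"-ftime-trace -ftime-trace-granularity={granularity_us}"
    return [
        "cmake", "-S", ".", "-B", build_dir, "-G", "Ninja",
        f"-DCMAKE_CXX_FLAGS={flags}",
    ]


if __name__ == "__main__":
    print(" ".join(time_trace_cmake_args(100)))
```

Returning an argument list rather than a single string keeps the flags intact when the command is later handed to a process launcher.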

Usage

# Quick analysis (default 500µs granularity)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8

# Balanced analysis (recommended - 100µs granularity)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=100

# High-resolution analysis (1µs granularity - captures all events)
.claude/skills/ck-build-analysis example_convnd_fwd_xdl_fp8 --granularity=1 --output=detailed_report.md

# Analyze any CK target
.claude/skills/ck-build-analysis test_amdgcn_mma --granularity=100

Example Output

The generated report includes:

Executive Summary

- Wall Clock Time: 22.2 seconds
- Trace Time: 81.6 seconds
- Template Instantiation Time: 21.7 seconds (26.6% of trace)
- Total Events Captured: 26,912
- Total Template Instantiations: 13,931
- Unique Template Families: 424

Key Sections

  • Compilation Phase Breakdown - Time spent in InstantiateFunction, Frontend, Backend, Optimizer
  • Top 30 Individual Instantiations - Most expensive single template instantiations
  • Template Families by Total Time - Which template families consume the most time
  • Template Families by Count - Which templates are instantiated most frequently
  • Optimization Recommendations - Actionable short/medium/long-term strategies
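A minimal sketch of how a phase breakdown could be computed from a trace file, assuming the usual `-ftime-trace` layout: a JSON object with a `traceEvents` array whose complete events carry `ph: "X"`, a `name`, and a duration `dur` in microseconds. The function name and the tiny sample trace are hypothetical, and the phase names are simply the ones listed above.

```python
import json
from collections import Counter

# Phase names taken from the report's breakdown; adjust to the events of interest.
PHASES = ("InstantiateFunction", "Frontend", "Backend")


def phase_totals_us(trace, phases=PHASES):
    """Sum the durations (in microseconds) of complete ("X") events per phase name."""
    totals = Counter()
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X" and event.get("name") in phases:
            totals[event["name"]] += int(event.get("dur", 0))
    return totals


# Tiny hand-written trace for illustration only (not real CK data).
sample = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "Frontend", "ts": 0, "dur": 900},
  {"ph": "X", "name": "InstantiateFunction", "ts": 100, "dur": 300,
   "args": {"detail": "foo<int>"}},
  {"ph": "X", "name": "InstantiateFunction", "ts": 450, "dur": 200,
   "args": {"detail": "foo<float>"}},
  {"ph": "X", "name": "Backend", "ts": 900, "dur": 400}
]}
""")
totals = phase_totals_us(sample)
for name in PHASES:
    print(f"{name}: {totals[name]} us")
```

Events in a trace nest (an instantiation happens inside the frontend), so per-phase sums overlap and should not be added together. This may be one reason a summed trace time can exceed wall-clock time.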

Analysis Results

Testing on example_convnd_fwd_xdl_fp8 revealed:

Granularity Comparison

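| Granularity | Trace Size | Instantiations | Unique Families |
|-------------|------------|----------------|-----------------|
| 500µs (default) | 3.7 MB | 5,104 | 221 |
| 100µs (balanced) | 11 MB | 13,931 | 424 |
| 1µs (high-res) | 81 MB | 36,506 | 766 |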

Finding: Relative to the 1µs trace, the default 500µs threshold filters out 86% of template instantiations (5,104 of 36,506 remain). Using 100µs captures 2.7x more instantiations than 500µs while keeping the trace file manageable (11 MB).
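The percentages in this finding follow directly from the instantiation counts in the comparison table. The short calculation below reproduces them, under the assumption that the 1µs trace is treated as the complete baseline.

```python
# Instantiation counts per granularity (µs), copied from the comparison table.
counts = {500: 5_104, 100: 13_931, 1: 36_506}

# Assumption: the 1µs trace is the baseline that captures every instantiation.
baseline = counts[1]

# Fraction of instantiations that each granularity leaves out of the trace.
filtered = {g: 1 - n / baseline for g, n in counts.items()}

# How many times more instantiations 100µs captures compared with 500µs.
ratio_100_vs_500 = counts[100] / counts[500]

for granularity in (500, 100, 1):
    print(f"{granularity}us filters out {filtered[granularity]:.1%} of instantiations")
print(f"100us captures {ratio_100_vs_500:.1f}x as many instantiations as 500us")
```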

Top Template Bottlenecks (100µs granularity)

| Template Family | Count | Total Time | % of Total |
|-----------------|-------|------------|------------|
| TensorDescriptor | 2,297 | 4.0s | 18.5% |
| DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle | 87 | 2.6s | 11.9% |
| run_grouped_conv_fwd | 3 | 1.7s | 8.1% |
| transform_tensor_descriptor | 281 | 1.7s | 7.7% |

Key Insights:

  • Template instantiation accounts for 26.6% of trace time (21.7 s of 81.6 s)
  • run_grouped_conv_fwd has only 3 instantiations but averages 583ms each
  • Top 10 template families account for 64% of instantiation time
  • 13,931 total instantiations in a single target may indicate over-templating
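One way such per-family statistics could be derived is sketched below. It assumes that instantiation events carry the instantiated name in `args.detail` and that a "family" is the name before the first `<`. Both helper names are hypothetical, and the skill's own grouping rule may differ (the report, for example, appears to shorten namespaces).

```python
from collections import defaultdict

INSTANTIATION_EVENTS = ("InstantiateFunction", "InstantiateClass")


def template_family(detail):
    """Assumed grouping rule: everything before the first template argument list."""
    return detail.split("<", 1)[0].strip()


def aggregate_families(events):
    """Return (family, stats) pairs sorted by total time, most expensive first."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0})
    for event in events:
        if event.get("name") not in INSTANTIATION_EVENTS:
            continue
        family = template_family(event.get("args", {}).get("detail", ""))
        stats[family]["count"] += 1
        stats[family]["total_us"] += int(event.get("dur", 0))
    return sorted(stats.items(), key=lambda item: item[1]["total_us"], reverse=True)


# Hand-written events for illustration only (not real CK measurements).
events = [
    {"name": "InstantiateClass", "dur": 1200, "args": {"detail": "TensorDescriptor<A>"}},
    {"name": "InstantiateClass", "dur": 800, "args": {"detail": "TensorDescriptor<B>"}},
    {"name": "InstantiateFunction", "dur": 5000, "args": {"detail": "run_conv<3>"}},
    {"name": "Frontend", "dur": 9000},
]
ranked = aggregate_families(events)
for family, s in ranked:
    print(family, s["count"], s["total_us"], s["total_us"] // s["count"])
```

Sorting by total time and by count gives the two family rankings in the report; a family with few instantiations but a high average (as with run_grouped_conv_fwd) stands out in the last column.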

Integration

  • Works with ck-docker: Automatically uses existing Docker container
  • Granularity options: 500µs (fast), 100µs (balanced), 1µs (complete)
  • Custom output: Specify report filename with --output
  • No code changes: Pure analysis tool, doesn't modify source

Testing

Tested with multiple targets and granularity levels:

  • example_convnd_fwd_xdl_fp8 (500µs, 100µs, 1µs)
  • ✅ Trace file parsing and analysis
  • ✅ Report generation with all statistics
  • ✅ Container auto-start integration

Future Improvements

Potential enhancements:

  • Compare multiple builds to track improvements
  • Generate flamegraphs from trace data
  • Integration with CI to track compilation time regressions
  • Automated suggestions for template optimization

Example Report Preview

## Template Families by Total Time (Top 10)

| Rank | Template Family | Count | Total (ms) | Avg (ms) | % of Total |
|------|-----------------|-------|------------|----------|------------|
|    1 | TensorDescriptor                            |  2297 |    4028.86 |     1.75 |      18.5% |
|    2 | tensor_operation::device::DeviceGroupedC... |    87 |    2585.59 |    29.72 |      11.9% |
|    3 | run_grouped_conv_fwd                        |     3 |    1749.93 |   583.31 |       8.1% |
|    4 | transform_tensor_descriptor                 |   281 |    1689.94 |     6.01 |       7.7% |
...

Documentation

The .md file provides:

  • Clear usage examples
  • Granularity trade-off table
  • Natural language interface for Claude
  • Environment variable configuration
  • Tips for choosing the right granularity

This skill enables data-driven optimization of CK build times by making -ftime-trace analysis easy and automated.

tenpercent and others added 5 commits January 13, 2026 18:38
Add automated build time analysis using Clang's -ftime-trace feature
to identify template instantiation bottlenecks.

Features:
- Configurable granularity (500µs, 100µs, 1µs)
- Comprehensive markdown reports with statistics
- Template family analysis and optimization recommendations
- Integration with ck-docker for containerized builds

Testing shows default 500µs granularity filters out 86% of
template instantiations. Using 100µs captures 2.7x more data
while keeping trace files manageable at ~11MB.

Key findings on example_convnd_fwd_xdl_fp8:
- Template instantiation: 26.6% of compilation time
- TensorDescriptor: 2,297 instantiations (18.5% of time)
- run_grouped_conv_fwd: Only 3 instantiations but 583ms average

Co-Authored-By: Claude <noreply@anthropic.com>
tenpercent and others added 8 commits January 13, 2026 22:41
- Add Jinja2 template for report generation (.claude/skills/templates/build_analysis_report.md.jinja)
- Refactor analysis script to use template rendering instead of string concatenation
- Add custom Jinja2 filters for formatting (format_number, truncate, pad)
- Separate presentation from logic for better maintainability
- Template makes report format easier to modify and extend

Requirements:
- python3-jinja2 must be installed in Docker container (apt-get install python3-jinja2)

Benefits:
- Cleaner code with separation of concerns
- Easier to customize report format
- Better readability and maintainability

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract analysis script from bash heredoc into standalone Python file
- Add PEP 723 inline script metadata for dependency management
- Make script compatible with pipx and uv for automatic dependency installation
- Improve code organization with proper functions and docstrings
- Update documentation with PEP 723 usage examples

Changes:
- New file: analyze_build_trace.py (PEP 723 compliant)
- Modified: ck-build-analysis (now uses external Python script)
- Modified: ck-build-analysis.md (added implementation details section)

Benefits:
- Script can be run standalone with pipx/uv
- Better code organization and maintainability
- Clear dependency declaration
- Easier to test and develop independently

Example standalone usage:
  pipx run .claude/skills/analyze_build_trace.py trace.json report.md target 100 22 templates/
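
For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata header looks like at the top of a script, followed by an argument parser matching the six positional values in the example command. The argument names are guesses made for illustration (in particular, reading `22` as the wall-clock build time in seconds is an assumption), and the real analyze_build_trace.py may parse its arguments differently.

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["jinja2"]
# ///
"""Illustrative PEP 723 script skeleton (not the actual analyze_build_trace.py)."""
import argparse


def parse_args(argv=None):
    """Parse the six positional arguments used in the standalone example."""
    parser = argparse.ArgumentParser(description="Analyze a -ftime-trace JSON file")
    parser.add_argument("trace_file")                # e.g. trace.json
    parser.add_argument("report_file")               # e.g. report.md
    parser.add_argument("target")                    # build target name
    parser.add_argument("granularity_us", type=int)  # e.g. 100
    parser.add_argument("wall_time_s", type=float)   # assumed meaning of "22"
    parser.add_argument("template_dir")              # e.g. templates/
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["trace.json", "report.md", "target", "100", "22", "templates/"])
    print(args.target, args.granularity_us, args.wall_time_s)
```

Tools such as pipx and uv read the commented `# /// script` block and install the listed dependencies into an isolated environment before running the file.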

Co-Authored-By: Claude <noreply@anthropic.com>
- Automatically detect and use uv if available in container
- Fall back to python3 if uv not found (backward compatible)
- Leverage PEP 723 metadata for zero-config dependency installation
- Update documentation with uv installation instructions
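
The detect-and-fall-back behaviour can be sketched as follows. This is a Python rendering for illustration only: the skill itself is a bash script that runs inside the container, and the function name and its `which` parameter are hypothetical. `uv run <script>` is the uv invocation that honours PEP 723 metadata.

```python
import shutil


def analysis_command(script, args, which=shutil.which):
    """Prefer `uv run` (resolves PEP 723 dependencies); otherwise use python3.

    `which` is injectable so that the lookup can be replaced in tests.
    """
    if which("uv"):
        return ["uv", "run", script, *args]
    # Fallback path: jinja2 must already be installed for this interpreter.
    return ["python3", script, *args]


if __name__ == "__main__":
    print(analysis_command("analyze_build_trace.py", ["trace.json", "report.md"]))
```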

Benefits:
- Zero manual dependency installation with uv
- Isolated dependency environment (no system pollution)
- Fast dependency caching for subsequent runs
- Automatic dependency resolution from PEP 723 metadata

Tested with:
- uv 0.9.25: Auto-installs jinja2 from PEP 723 metadata
- python3: Falls back when uv unavailable (requires python3-jinja2)

Installation:
  docker exec <container> bash -c "curl -LsSf https://astral.sh/uv/install.sh | sh"

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract shared configuration logic to .claude/skills/common.sh
  - Container naming and detection functions
  - Git branch sanitization
  - Docker image configuration
  - GPU target detection
  - Reduces ~50 lines of duplicate code between skills

- Refactor ck-docker to use common.sh utilities
  - Replace manual docker ps checks with helper functions
  - Use shared container_exists() and container_is_running()
  - Use shared detect_gpu_target() and get_docker_image()

- Refactor ck-build-analysis to use common.sh utilities
  - Use shared get_project_root() and get_container_name()
  - Use shared ensure_container_running()
  - Use shared detect_gpu_target()

- Change default granularity from 500µs to 100µs
  - Provides better balance between detail and performance
  - Captures ~15k instantiations vs ~5k at 500µs
  - Still manageable 15-20 MB trace files
  - Update all documentation and help text

Co-Authored-By: Claude <noreply@anthropic.com>
- Automatically install uv if not found in container
- Eliminates manual dependency setup
- No fallback to python3 + manual jinja2 installation needed
- First run installs uv (~5 seconds), subsequent runs use cached version
- Update documentation to reflect automatic installation

Co-Authored-By: Claude <noreply@anthropic.com>
- Install uv via pipx (itself available from the Ubuntu package manager) for security
- Avoids piping curl to bash, which is a security concern
- More reliable and verifiable installation method
- Auto-installs pipx via apt if not already present
- Update documentation to reflect package-based installation

Co-Authored-By: Claude <noreply@anthropic.com>
Security fixes:

1. Command Injection Prevention
   - Use docker exec -e flag to pass variables as environment variables
   - Change bash -c to use single quotes to prevent shell expansion
   - Properly quote all variables within the single-quoted commands
   - Affects: CMAKE configuration, ninja build, trace file search, Python analysis
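
The idea behind this fix can be sketched in Python as follows: user-controlled values travel only through `docker exec -e NAME=value`, while the text given to `bash -c` is a fixed string that refers to them as quoted variables. The function name, variable names, and the exact build command are assumptions for illustration; the skill itself implements this in bash.

```python
def docker_build_command(container, target, build_dir="build"):
    """Build a `docker exec` argv where user input never becomes shell syntax."""
    # Fixed script text: variables are expanded inside the container, and the
    # double quotes keep each value a single word even if it contains spaces
    # or shell metacharacters.
    script = 'cd "$BUILD_DIR" && ninja "$TARGET"'
    return [
        "docker", "exec",
        "-e", f"TARGET={target}",
        "-e", f"BUILD_DIR={build_dir}",
        container, "bash", "-c", script,
    ]


if __name__ == "__main__":
    # A hostile target name stays inert data inside the TARGET variable.
    for part in docker_build_command("ck_container", "x; rm -rf /"):
        print(repr(part))
```

Passing the result as an argument list (for example to `subprocess.run` without `shell=True`) means no host-side shell ever parses the values either.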

2. Path Traversal Protection for OUTPUT_FILE
   - Validate OUTPUT_FILE contains no path separators (/)
   - Validate OUTPUT_FILE contains no parent directory references (..)
   - Allows file extensions (.md) but blocks directory traversal
   - Prevents writing files outside project directory

Tested:
- ✅ Path traversal blocked: --output="../../../tmp/evil.md"
- ✅ Double-dot blocked: --output="..evil.md"
- ✅ Normal operation: --output="security_test.md"
- ✅ Build process works with quoted variables
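
The two OUTPUT_FILE checks described above amount to the following rule, sketched here in Python with a hypothetical function name (the skill performs the equivalent validation in bash).

```python
def validate_output_filename(name):
    """Accept a bare file name; reject path separators and '..' sequences."""
    if not name:
        raise ValueError("output file name must not be empty")
    if "/" in name:
        raise ValueError("output file name must not contain '/'")
    if ".." in name:
        raise ValueError("output file name must not contain '..'")
    return name


if __name__ == "__main__":
    for candidate in ("security_test.md", "../../../tmp/evil.md", "..evil.md"):
        try:
            print("accepted:", validate_output_filename(candidate))
        except ValueError as error:
            print("rejected:", candidate, "-", error)
```

Rejecting every `..` sequence is deliberately strict: a name such as `..evil.md` is harmless on its own, but blocking it keeps the rule simple to reason about.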

Co-Authored-By: Claude <noreply@anthropic.com>
Performance and precision improvements:

- Parse durations as integers (microseconds) instead of floats (milliseconds)
- Accumulate all durations in microseconds for better precision
- Use integer division for average calculations
- Avoid floating point arithmetic throughout data processing

Template updates:
- Add us_to_ms and us_to_s Jinja2 filters for display formatting
- Convert microseconds to milliseconds/seconds only for display
- Update all template fields to use conversion filters
- Maintain precision in calculations, format only for output

Benefits:
- Better precision (no floating point rounding errors)
- Faster processing (integer arithmetic)
- Matches native trace file format (microseconds)
- Cleaner separation of storage vs display formatting
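
A small sketch of the storage-versus-display split described above: durations stay as integer microseconds through accumulation and averaging, and are converted only when formatted. The filter names `us_to_ms` and `us_to_s` come from this commit message; their bodies and the two-decimal format are assumptions. The sample total is the 1749.93 ms figure from the report preview, expressed in microseconds.

```python
def us_to_ms(us):
    """Format integer microseconds as milliseconds (display only)."""
    return f"{us / 1000:.2f}"


def us_to_s(us):
    """Format integer microseconds as seconds (display only)."""
    return f"{us / 1_000_000:.2f}"


# run_grouped_conv_fwd in the report preview: 3 instantiations, 1749.93 ms total.
total_us = 1_749_930
count = 3

# Integer division keeps the average an exact integer number of microseconds.
avg_us = total_us // count

print("total:", us_to_s(total_us), "s")
print("average:", us_to_ms(avg_us), "ms")
```

The filters could be registered on a Jinja2 environment through its `filters` mapping, so that templates apply the conversion at render time.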

Co-Authored-By: Claude <noreply@anthropic.com>
tenpercent and others added 3 commits January 14, 2026 00:18
Instead of generic boilerplate advice, generate specific actionable
recommendations based on the actual analysis data:

High-Impact Targets (by total time):
- Show top 5 templates with actual times and percentages
- Recommend strategy based on patterns:
  - High count (>100) → Extern templates
  - High individual cost (>50ms) → Template specialization
  - Otherwise → Explicit instantiation

Frequently Instantiated (>100 times):
- Identify templates compiled repeatedly
- Recommend PCH or extern templates

Most Expensive Individual Instantiations:
- Show top 3 specific instantiations to profile
- Point to exact templates consuming most time

Example before (useless):
  "Focus on High-Impact Templates: Address top 10 families first"

Example after (actionable):
  "TensorDescriptor - 4.2s total (18.1%)
   - 2,546 instantiations, 1.65ms average
   - Strategy: Extern templates - High instantiation count"
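
The strategy selection described in this commit can be read as a small decision rule, sketched below. The thresholds come from the list above; the function name, the use of the average cost for the ">50ms" test, and checking the count first are assumptions.

```python
def recommend_strategy(count, avg_us):
    """Pick a strategy from instantiation count and average cost (microseconds)."""
    if count > 100:
        return "Extern templates"
    if avg_us > 50_000:  # more than 50 ms per instantiation
        return "Template specialization"
    return "Explicit instantiation"


if __name__ == "__main__":
    # Rows loosely based on figures quoted in this PR (count, average in µs).
    examples = {
        "TensorDescriptor": (2_546, 1_650),
        "run_grouped_conv_fwd": (3, 583_310),
        "DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle": (87, 29_720),
    }
    for family, (count, avg_us) in examples.items():
        print(f"{family}: {recommend_strategy(count, avg_us)}")
```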

Co-Authored-By: Claude <noreply@anthropic.com>
- Add AMD copyright header and MIT license identifier
- Format code with ruff for consistent style
- Remove unused pathlib.Path import
- Convert single quotes to double quotes
- Fix line wrapping and indentation per ruff style

All ruff checks now pass without errors.

Co-Authored-By: Claude <noreply@anthropic.com>
Add AMD copyright and MIT license identifier to:
- common.sh
- ck-build-analysis
- ck-docker

Matches the copyright header format used throughout the codebase.

Co-Authored-By: Claude <noreply@anthropic.com>

2 participants