Add ck-build-analysis skill for compilation profiling #3561
Open: tenpercent wants to merge 16 commits into develop from tenpercent/ck-build-analysis-skill
…ild and run tests in the container
Add automated build time analysis using Clang's -ftime-trace feature to identify template instantiation bottlenecks.

Features:
- Configurable granularity (500µs, 100µs, 1µs)
- Comprehensive markdown reports with statistics
- Template family analysis and optimization recommendations
- Integration with ck-docker for containerized builds

Testing shows default 500µs granularity filters out 86% of template instantiations. Using 100µs captures 2.7x more data while keeping trace files manageable at ~11MB.

Key findings on example_convnd_fwd_xdl_fp8:
- Template instantiation: 26.6% of compilation time
- TensorDescriptor: 2,297 instantiations (18.5% of time)
- run_grouped_conv_fwd: Only 3 instantiations but 583ms average

Co-Authored-By: Claude <noreply@anthropic.com>
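As a rough illustration of the kind of aggregation this commit describes, the sketch below groups template instantiation events from a Clang time trace by template family. It assumes the trace follows the Chrome trace-event layout that -ftime-trace writes (a top-level traceEvents list whose complete events carry name, dur in microseconds, and args.detail). The function names and the "name before the first <" family heuristic are hypothetical and are not taken from the PR's script.

```python
import json
from collections import defaultdict

# Event names Clang uses for template instantiation in -ftime-trace output.
INSTANTIATION_EVENTS = {"InstantiateClass", "InstantiateFunction"}


def template_family(detail: str) -> str:
    """Group instantiations by the text before the first '<' (assumed heuristic)."""
    return detail.split("<", 1)[0]


def summarize_trace(trace: dict) -> dict:
    """Return {family: (count, total_us)} for complete ('X') instantiation events."""
    totals = defaultdict(lambda: [0, 0])
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X" or event.get("name") not in INSTANTIATION_EVENTS:
            continue
        detail = event.get("args", {}).get("detail", "unknown")
        family = template_family(detail)
        totals[family][0] += 1                   # instantiation count
        totals[family][1] += int(event["dur"])   # duration in microseconds
    return {name: (count, total) for name, (count, total) in totals.items()}


# Tiny hand-written trace for illustration only; not real measurement data.
sample = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "InstantiateClass", "ts": 0, "dur": 1200,
   "args": {"detail": "ck::TensorDescriptor<int>"}},
  {"ph": "X", "name": "InstantiateClass", "ts": 5, "dur": 800,
   "args": {"detail": "ck::TensorDescriptor<float>"}},
  {"ph": "X", "name": "Frontend", "ts": 0, "dur": 9000}
]}
""")
summary = summarize_trace(sample)
```

Events shorter than the granularity threshold never reach the trace file, so lowering the threshold (Clang's -ftime-trace-granularity option, in microseconds) increases how many instantiations a summary like this can see.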
- Add Jinja2 template for report generation (.claude/skills/templates/build_analysis_report.md.jinja)
- Refactor analysis script to use template rendering instead of string concatenation
- Add custom Jinja2 filters for formatting (format_number, truncate, pad)
- Separate presentation from logic for better maintainability
- Template makes report format easier to modify and extend

Requirements:
- python3-jinja2 must be installed in Docker container (apt-get install python3-jinja2)

Benefits:
- Cleaner code with separation of concerns
- Easier to customize report format
- Better readability and maintainability

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract analysis script from bash heredoc into standalone Python file
- Add PEP 723 inline script metadata for dependency management
- Make script compatible with pipx and uv for automatic dependency installation
- Improve code organization with proper functions and docstrings
- Update documentation with PEP 723 usage examples

Changes:
- New file: analyze_build_trace.py (PEP 723 compliant)
- Modified: ck-build-analysis (now uses external Python script)
- Modified: ck-build-analysis.md (added implementation details section)

Benefits:
- Script can be run standalone with pipx/uv
- Better code organization and maintainability
- Clear dependency declaration
- Easier to test and develop independently

Example standalone usage:
pipx run .claude/skills/analyze_build_trace.py trace.json report.md target 100 22 templates/

Co-Authored-By: Claude <noreply@anthropic.com>
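For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block looks like and how a runner could locate it. The example header is an assumption about what such a script might declare (only the jinja2 dependency is mentioned in this PR; the requires-python value is invented for illustration), and the pattern is adapted from the reference pattern in the PEP text. read_script_metadata is a hypothetical helper, not part of the PR.

```python
import re

# Pattern for PEP 723 metadata blocks, adapted from the PEP's reference pattern.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Hypothetical script header; the requires-python value is an assumption.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "jinja2",
# ]
# ///
print("analysis would run here")
'''


def read_script_metadata(source: str) -> str:
    """Return the TOML text of the 'script' metadata block, or '' if absent."""
    match = BLOCK_RE.search(source)
    if match is None or match.group("type") != "script":
        return ""
    lines = match.group("content").splitlines()
    # Drop the leading "# " (or a bare "#") from every comment line.
    return "\n".join(
        line[2:] if line.startswith("# ") else line[1:] for line in lines
    )


metadata = read_script_metadata(EXAMPLE_SCRIPT)
```

Tools such as uv and pipx read this block themselves when running a script, so a script author only needs to write the comment header.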
- Automatically detect and use uv if available in container
- Fall back to python3 if uv not found (backward compatible)
- Leverage PEP 723 metadata for zero-config dependency installation
- Update documentation with uv installation instructions

Benefits:
- Zero manual dependency installation with uv
- Isolated dependency environment (no system pollution)
- Fast dependency caching for subsequent runs
- Automatic dependency resolution from PEP 723 metadata

Tested with:
- uv 0.9.25: Auto-installs jinja2 from PEP 723 metadata
- python3: Falls back when uv unavailable (requires python3-jinja2)

Installation:
docker exec <container> bash -c "curl -LsSf https://astral.sh/uv/install.sh | sh"

Co-Authored-By: Claude <noreply@anthropic.com>
- Extract shared configuration logic to .claude/skills/common.sh
  - Container naming and detection functions
  - Git branch sanitization
  - Docker image configuration
  - GPU target detection
  - Reduces ~50 lines of duplicate code between skills
- Refactor ck-docker to use common.sh utilities
  - Replace manual docker ps checks with helper functions
  - Use shared container_exists() and container_is_running()
  - Use shared detect_gpu_target() and get_docker_image()
- Refactor ck-build-analysis to use common.sh utilities
  - Use shared get_project_root() and get_container_name()
  - Use shared ensure_container_running()
  - Use shared detect_gpu_target()
- Change default granularity from 500µs to 100µs
  - Provides better balance between detail and performance
  - Captures ~15k instantiations vs ~5k at 500µs
  - Still manageable 15-20 MB trace files
- Update all documentation and help text

Co-Authored-By: Claude <noreply@anthropic.com>
- Automatically install uv if not found in container
- Eliminates manual dependency setup
- No fallback to python3 + manual jinja2 installation needed
- First run installs uv (~5 seconds), subsequent runs use cached version
- Update documentation to reflect automatic installation

Co-Authored-By: Claude <noreply@anthropic.com>
- Install uv via Ubuntu package manager (pipx) for security
- Avoids piping curl to bash which is a security concern
- More reliable and verifiable installation method
- Auto-installs pipx via apt if not already present
- Update documentation to reflect package-based installation

Co-Authored-By: Claude <noreply@anthropic.com>
Security fixes:

1. Command Injection Prevention
   - Use docker exec -e flag to pass variables as environment variables
   - Change bash -c to use single quotes to prevent shell expansion
   - Properly quote all variables within the single-quoted commands
   - Affects: CMAKE configuration, ninja build, trace file search, Python analysis

2. Path Traversal Protection for OUTPUT_FILE
   - Validate OUTPUT_FILE contains no path separators (/)
   - Validate OUTPUT_FILE contains no parent directory references (..)
   - Allows file extensions (.md) but blocks directory traversal
   - Prevents writing files outside project directory

Tested:
- ✅ Path traversal blocked: --output="../../../tmp/evil.md"
- ✅ Double-dot blocked: --output="..evil.md"
- ✅ Normal operation: --output="security_test.md"
- ✅ Build process works with quoted variables

Co-Authored-By: Claude <noreply@anthropic.com>
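The two fixes in this commit can be sketched in Python as follows. The PR implements them in bash; this is only an illustration of the same rules. validate_output_name and docker_exec_args are hypothetical names, and the container name and variable values in the usage lines are made up. The only real CLI detail relied on is that docker exec accepts -e KEY=VALUE to set environment variables.

```python
def validate_output_name(name: str) -> str:
    """Accept a bare file name; reject path separators and '..' anywhere."""
    if not name:
        raise ValueError("output file name must not be empty")
    if "/" in name:
        raise ValueError("output file name must not contain '/'")
    if ".." in name:
        raise ValueError("output file name must not contain '..'")
    return name


def docker_exec_args(container: str, env: dict, script: str) -> list:
    """Build an argv list that passes values via -e instead of splicing them
    into the shell script text, so values are never parsed as shell code."""
    args = ["docker", "exec"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args += [container, "bash", "-c", script]
    return args


# The script refers to "$TARGET" by name; the untrusted value travels separately.
argv = docker_exec_args(
    "ck_container",                       # hypothetical container name
    {"TARGET": "example; rm -rf /"},      # hostile value stays inert data
    'ninja "$TARGET"',
)
```

Passing the resulting list to subprocess.run without shell=True keeps the host shell out of the picture as well, which mirrors the single-quoting described in the commit.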
Performance and precision improvements:
- Parse durations as integers (microseconds) instead of floats (milliseconds)
- Accumulate all durations in microseconds for better precision
- Use integer division for average calculations
- Avoid floating point arithmetic throughout data processing

Template updates:
- Add us_to_ms and us_to_s Jinja2 filters for display formatting
- Convert microseconds to milliseconds/seconds only for display
- Update all template fields to use conversion filters
- Maintain precision in calculations, format only for output

Benefits:
- Better precision (no floating point rounding errors)
- Faster processing (integer arithmetic)
- Matches native trace file format (microseconds)
- Cleaner separation of storage vs display formatting

Co-Authored-By: Claude <noreply@anthropic.com>
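A minimal sketch of the storage-versus-display split described above: durations stay as integer microseconds through all arithmetic and are converted only when formatted. The filter names us_to_ms and us_to_s come from the commit message; their bodies and the two-decimal output format are assumptions, and the sample durations are invented.

```python
def us_to_ms(us: int) -> str:
    """Format integer microseconds as milliseconds (display only)."""
    return f"{us / 1000:.2f}"


def us_to_s(us: int) -> str:
    """Format integer microseconds as seconds (display only)."""
    return f"{us / 1_000_000:.2f}"


# Illustrative durations in microseconds, as they would appear in a trace file.
durations_us = [583_000, 1_650, 120]

total_us = sum(durations_us)                  # exact integer sum
average_us = total_us // len(durations_us)    # integer division, no float drift

total_display = us_to_s(total_us)
average_display = us_to_ms(average_us)
```

With Jinja2, functions like these would typically be registered on the environment's filters mapping so templates can write something like `{{ total_us | us_to_s }}`.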
Instead of generic boilerplate advice, generate specific actionable recommendations based on the actual analysis data:

High-Impact Targets (by total time):
- Show top 5 templates with actual times and percentages
- Recommend strategy based on patterns:
  - High count (>100) → Extern templates
  - High individual cost (>50ms) → Template specialization
  - Otherwise → Explicit instantiation

Frequently Instantiated (>100 times):
- Identify templates compiled repeatedly
- Recommend PCH or extern templates

Most Expensive Individual Instantiations:
- Show top 3 specific instantiations to profile
- Point to exact templates consuming most time

Example before (useless):
"Focus on High-Impact Templates: Address top 10 families first"

Example after (actionable):
"TensorDescriptor - 4.2s total (18.1%) - 2,546 instantiations, 1.65ms average - Strategy: Extern templates - High instantiation count"

Co-Authored-By: Claude <noreply@anthropic.com>
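The strategy rules listed in this commit can be written as a small decision function. This is a sketch under assumptions: recommend_strategy is a hypothetical name, "individual cost" is taken to mean the average per instantiation, and the count check is assumed to take precedence because it is listed first. The thresholds (more than 100 instantiations, more than 50ms) are the ones stated above.

```python
def recommend_strategy(count: int, total_us: int) -> str:
    """Pick a strategy from instantiation count and total time in microseconds."""
    average_us = total_us // count if count else 0
    if count > 100:
        # Many instantiations of the same family.
        return "Extern templates"
    if average_us > 50_000:
        # Few instantiations, but each one is expensive (over 50ms).
        return "Template specialization"
    return "Explicit instantiation"


# Figures quoted in this PR: TensorDescriptor with 2,546 instantiations and
# 4.2s total; run_grouped_conv_fwd with 3 instantiations averaging 583ms.
tensor_descriptor = recommend_strategy(2_546, 4_200_000)
grouped_conv = recommend_strategy(3, 3 * 583_000)
```

Keeping the thresholds in one function makes it easy to adjust them once more targets have been profiled.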
- Add AMD copyright header and MIT license identifier
- Format code with ruff for consistent style
- Remove unused pathlib.Path import
- Convert single quotes to double quotes
- Fix line wrapping and indentation per ruff style

All ruff checks now pass without errors.

Co-Authored-By: Claude <noreply@anthropic.com>
Add AMD copyright and MIT license identifier to:
- common.sh
- ck-build-analysis
- ck-docker

Matches the copyright header format used throughout the codebase.

Co-Authored-By: Claude <noreply@anthropic.com>
Add Build Time Analysis Skill
Summary
Adds the ck-build-analysis skill to automate compilation profiling using Clang's -ftime-trace feature. This skill helps identify template instantiation bottlenecks and optimize build times for Composable Kernel targets.
Motivation
CK's heavy use of template metaprogramming leads to long compilation times (20-30+ seconds per file). Understanding where compilation time is spent is critical for:
Changes
Added two new files to .claude/skills/:
ck-build-analysis - Executable bash script that:
-ftime-trace and custom granularity
ck-build-analysis.md - Documentation with:
Usage
Example Output
The generated report includes:
Executive Summary
Key Sections
Analysis Results
Testing on example_convnd_fwd_xdl_fp8 revealed:
Granularity Comparison
Finding: Default 500µs threshold filters out 86% of template instantiations. Using 100µs captures 2.7x more data while keeping trace files manageable.
Top Template Bottlenecks (100µs granularity)
Key Insights:
run_grouped_conv_fwd has only 3 instantiations but averages 583ms each
Integration
--output
Testing
Tested with multiple targets and granularity levels:
example_convnd_fwd_xdl_fp8 (500µs, 100µs, 1µs)
Future Improvements
Potential enhancements:
Example Report Preview
Documentation
The .md file provides:
This skill enables data-driven optimization of CK build times by making -ftime-trace analysis easy and automated.