[Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) #73
Conversation
| # Return default if detection failed
| if gpu_name is None:
|     print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
|     return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
Should we fall back to A100? Or does returning an empty dict make more sense?
I agree that returning an empty dict is cleaner, but it would also lead to a KeyError in the optimization flow. Should we decide to disable the optimization if no gpu_name is found?
I think that makes sense; if there are setup/detection issues, then we shouldn't optimize.
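A minimal sketch of that direction (the `get_gpu_specs` helper name and the stub table are assumptions, not the PR's code): return `None` on detection failure and let the caller skip optimization.

```python
from typing import Optional

# Stub of the real table, for illustration only.
GPU_SPECS_DATABASE = {"NVIDIA A100": {"architecture": "Ampere"}}

def get_gpu_specs(gpu_name: Optional[str]) -> Optional[dict]:
    """Return specs for the detected GPU, or None when detection failed."""
    if gpu_name is None:
        print("⚠️ GPU auto-detection failed; optimization will be skipped")
        return None
    return GPU_SPECS_DATABASE.get(gpu_name)

# The caller then decides whether to run the optimization flow at all.
specs = get_gpu_specs(None)
if specs is None:
    print("Setup/detection issue; skipping hardware-aware optimization")
```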
| # GPU specifications database
| # Sources: NVIDIA official specifications, manufacturer datasheets
| GPU_SPECS_DATABASE = {
Can we move this const to its own file? It makes module overriding easier.
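A sketch of that split (file names, spec fields, and values are illustrative assumptions): keep the table in a data-only module, so tests or callers can override it without touching the lookup logic.

```python
# gpu_specs_db.py (hypothetical file name): data only, no logic.
GPU_SPECS_DATABASE = {
    "NVIDIA A100": {"architecture": "Ampere", "memory_gb": 80},
    "NVIDIA H100": {"architecture": "Hopper", "memory_gb": 80},
}

# gpu_specs.py: import the table from the data module, so a test or caller
# can monkeypatch gpu_specs_db.GPU_SPECS_DATABASE and the logic picks it up.
import gpu_specs_db

def get_specs(name: str) -> dict:
    return gpu_specs_db.GPU_SPECS_DATABASE[name].copy()
```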
| gpu_name_lower = gpu_name.lower()
| for key, specs in GPU_SPECS_DATABASE.items():
|     key_lower = key.lower()
|     # Check if either name contains the other
|     if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
|         print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
|         return specs.copy()
Curious if you've encountered this case before?
Sometimes I'll just put "a100" or "h100" in my optimization workflow. What do you think, should we force the GPU name input to match exactly?
Enum it
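A sketch of the enum approach (class and member names are assumptions): constrain the accepted GPU names while still letting short aliases like "a100" resolve to the canonical key.

```python
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"

    @classmethod
    def parse(cls, name: str) -> "SupportedGPU":
        """Accept 'a100', 'NVIDIA A100', etc.; raise on anything else."""
        normalized = name.strip().lower()
        for member in cls:
            if normalized in (member.name.lower(), member.value.lower()):
                return member
        raise ValueError(f"Unsupported GPU name: {name!r}")

print(SupportedGPU.parse("a100").value)  # NVIDIA A100
```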
| if detected_name:
|     print(f"\nDetected GPU: {detected_name}")
| else:
|     print("\nNo GPU detected (nvidia-smi not available)")
| print("\nNo GPU detected (nvidia-smi not available)") | |
| print("\nNo GPU detected (nvidia-smi not available)") | |
| exit() |
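If the intent is to stop the run when no GPU is found, a variant of the suggestion using `sys.exit(1)` is more conventional in scripts than bare `exit()`; a self-contained sketch (the detection helper here is a stand-in, not the PR's code):

```python
import shutil
import sys

# Stand-in for the PR's detection helper, just to make the example runnable.
def detect_gpu_name():
    if shutil.which("nvidia-smi") is None:
        return None
    return "NVIDIA H100"  # real code would parse nvidia-smi output

detected_name = detect_gpu_name()
if detected_name:
    print(f"\nDetected GPU: {detected_name}")
else:
    print("\nNo GPU detected (nvidia-smi not available)")
    sys.exit(1)  # non-zero status plays nicer with scripts/CI than bare exit()
```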
| Metric definitions are in metric_schema.py.
| """
|
| from typing import Any, Callable, Dict, List, Optional, Tuple
😦
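If the frown is about the old-style typing imports (an assumption; the comment doesn't say), the same annotations can mostly use builtin generics on Python 3.9+; a sketch with a hypothetical helper:

```python
# Python 3.9+: builtin list/dict/tuple work in annotations, and 3.10+ allows
# "X | None" instead of Optional[X]; only Callable and Any still need imports.
from collections.abc import Callable
from typing import Any

def render_section(
    title: str,
    fields: list[tuple[str, str]],
    getter: Callable[[str], Any],
) -> str:
    # Hypothetical helper, only to show the annotation style.
    lines = [f"## {title}"] + [f"- {label}: {getter(key)}" for label, key in fields]
    return "\n".join(lines)
```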
| for label, key, unit in GPU_SPEC_FIELDS:
|     value = gpu_specs.get(key, "N/A")
|     lines.append(f"- **{label}:** {value}{unit}")
Do we want the unit if the value is N/A?
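A small tweak in that direction (a sketch; `GPU_SPEC_FIELDS` is shown as a stub with illustrative fields): only append the unit when the value is actually present.

```python
# Stub of the field table, for illustration only.
GPU_SPEC_FIELDS = [
    ("Memory Bandwidth", "memory_bandwidth_gb_s", " GB/s"),
    ("SM Count", "sm_count", ""),
]

def render_specs(gpu_specs: dict) -> list[str]:
    lines = []
    for label, key, unit in GPU_SPEC_FIELDS:
        value = gpu_specs.get(key)
        # Drop the unit when the value is missing, so we print "N/A", not "N/A GB/s".
        if value is None:
            lines.append(f"- **{label}:** N/A")
        else:
            lines.append(f"- **{label}:** {value}{unit}")
    return lines

print("\n".join(render_specs({"sm_count": 108})))
```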
| raise ValueError("NCU metrics are empty - cannot build judge prompt") | ||
|
|
||
| # Extract first kernel's metrics for the metric getter | ||
| first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {} |
We already check for empty above, so the `if ncu_metrics else {}` fallback here is dead code.
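A sketch of the simplification (the metrics dict here is a placeholder, not real NCU output):

```python
# Illustrative metrics dict; real keys come from the Nsight Compute report.
ncu_metrics = {"kernel_0": {"dram_throughput_pct": 42.0}}

if not ncu_metrics:
    raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Emptiness was just checked above, so no trailing `else {}` guard is needed.
first_kernel = next(iter(ncu_metrics.values()))
print(first_kernel)
```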
| except json.JSONDecodeError:
|     pass
|
| # Strategy 2: Find first { ... } block with "bottleneck_1" field
Are strategies 2/3 typically encountered?
Not required for this PR, but we could look into forcing the LLM providers to return structured output.
Not really, actually. All my experiments return the dual-bottleneck analysis.
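For reference, a sketch of what a strategy-2 style fallback could look like (brace scanning via the stdlib JSON decoder; the function name is an assumption, not the PR's code):

```python
import json

def extract_bottleneck_json(text: str):
    """Return the first decodable {...} object containing a "bottleneck_1" key, else None."""
    decoder = json.JSONDecoder()
    for start in (i for i, ch in enumerate(text) if ch == "{"):
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "bottleneck_1" in obj:
            return obj
    return None

reply = 'Analysis: {"bottleneck_1": {"category": "memory"}, "bottleneck_2": {"category": "compute"}}'
print(extract_bottleneck_json(reply))
```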
| ) and _validate_bottleneck_entry(analysis["bottleneck_2"])
|
| VALID_CATEGORIES = frozenset(
frozenset isn't wrong, but we're just using it as a lookup, so a normal set is fine.
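A minimal comparison (the category names are placeholders): membership checks behave identically, so the choice only matters if the constant must be immutable or hashable.

```python
# Placeholder category names, purely for illustration.
VALID_CATEGORIES_SET = {"memory_bound", "compute_bound", "latency_bound"}
VALID_CATEGORIES_FROZEN = frozenset(VALID_CATEGORIES_SET)

# Both give O(1) membership tests; frozenset only adds immutability
# (and hashability, e.g. if the constant ever needs to be a dict key).
print("memory_bound" in VALID_CATEGORIES_SET)     # True
print("memory_bound" in VALID_CATEGORIES_FROZEN)  # True
```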
Force-pushed from ac11151 to bfa2fd0
Consolidates the previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns.

Architecture:
- benchmark.py (299 lines): Main Benchmark class with a simplified API
  - benchmark_kernel(): always uses a subprocess for crash protection
  - benchmark_pytorch(): always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios
- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)
- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed the confusing use_subprocess parameter (behavior is now deterministic)
- Fixed a dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
bench = Benchmark(logger, temp_dir, lock, worker_id)
pytorch_result = bench.benchmark_pytorch(problem_file)
kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
Force-pushed from af9b7af to e2c599e
This PR introduces a `diagnose` module for building GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt section rendering for bottleneck analysis.

Core Components

1. **MetricSchema** (metric_schema.py)
2. **GPU Specs** (gpu_specs.py)
3. **Judger Prompts** (judger_prompts.py)

Example Usage
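A hedged sketch of the kind of call that produces the output below; the module path, function names, and keyword arguments here are assumptions inferred from the files and error messages discussed in the review, not the module's confirmed API.

```python
# All names below (detect_gpu_name, get_gpu_specs, build_judge_prompt) are
# assumptions; consult the diagnose module for the real entry points.
from diagnose.gpu_specs import detect_gpu_name, get_gpu_specs
from diagnose.judger_prompts import build_judge_prompt

gpu_name = detect_gpu_name()                    # e.g. "NVIDIA H100" via nvidia-smi
gpu_specs = get_gpu_specs(gpu_name)             # lookup in GPU_SPECS_DATABASE

ncu_metrics = {"my_kernel": {"dram_throughput_pct": 85.0}}  # placeholder NCU metrics
kernel_source = open("kernel.py").read()

prompt = build_judge_prompt(
    gpu_specs=gpu_specs,
    kernel_source=kernel_source,
    ncu_metrics=ncu_metrics,
)
print(prompt)
```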
Detected GPU: NVIDIA H100
Using specs for: NVIDIA H100 (Hopper)
You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
......