[Optimization 2/n] Add Benchmarking Module (PyTorch & Kernel Performance Eval) #71
base: main
Conversation
Jack-Khuu left a comment:
2 high-level comments:

- Can we group up chunks of kernel_subprocess.main? Parsing 300+ lines was a bit daunting for what is effectively benchmark(ref); benchmark(kernel) (see the sketch after this list).

- I think we're all guilty of this one; we should start pruning some of the codegen comments where they aren't needed 😅 For example:

      # Move to device
      inp = inp.to(device=device)

      # Load problem interface
      Model, get_inputs, get_init_inputs = load_problem_interface(problem_file)
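On the first point, a rough sketch of how kernel_subprocess.main could read once grouped; prepare_inputs, benchmark_one, and write_result (and the args fields used here) are hypothetical names for illustration, not functions from this PR:

    def main(args):
        # Load the problem definition and build inputs once.
        Model, get_inputs, get_init_inputs = load_problem_interface(args.problem_file)
        inputs, init_inputs = prepare_inputs(get_inputs, get_init_inputs, args.device)

        # Benchmark the PyTorch reference, then the candidate kernel.
        ref_stats = benchmark_one(Model, inputs, init_inputs, args)
        kernel_stats = benchmark_one(load_kernel_function(args.kernel_file), inputs, init_inputs, args)

        write_result(args.output_file, ref_stats, kernel_stats)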
class BenchmarkLockManager:
    """Manages GPU benchmarking locks to prevent resource contention."""

    def __init__(self, lock: Optional[Any], worker_id: int, logger: logging.Logger):
When would lock be None?
The lock was introduced with the beam search approach. If we only have one worker, technically we don't need the lock. That said, we can simplify the design by always requiring a lock: even if it's just one worker, the cost of lock acquire/release is negligible.
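For reference, a minimal sketch of the "always require a lock" variant; the gpu_lock context-manager name and the assumption that the lock is a shared multiprocessing.Lock are illustrative, not the PR's actual code:

    import logging
    from contextlib import contextmanager

    class BenchmarkLockManager:
        """Sketch: the lock is always required, never Optional."""

        def __init__(self, lock, worker_id: int, logger: logging.Logger):
            self.lock = lock              # e.g., a multiprocessing.Lock shared by all workers
            self.worker_id = worker_id
            self.logger = logger

        @contextmanager
        def gpu_lock(self):
            # Acquire/release cost is negligible even with a single worker.
            self.logger.debug("worker %d waiting for GPU benchmark lock", self.worker_id)
            with self.lock:
                yield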
def __init__(
    self,
    logger: logging.Logger,
    temp_dir: Path,
ditto 1/n PR
        - time_ms: Mean time (for backward compatibility)
        - speedup: Speedup vs baseline
        """
        return self._benchmark_kernel_subprocess(
Do we need the indirection if it's a passthrough?
Good point - we can just merge these two functions.
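A minimal sketch of the merged version, assuming the former _benchmark_kernel_subprocess body simply moves into benchmark_kernel (the body is elided; parameter names are illustrative):

    def benchmark_kernel(self, kernel_file, problem_file):
        """Benchmark a kernel in an isolated subprocess.

        Returns a dict including time_ms (mean time) and speedup vs baseline.
        """
        ...  # former _benchmark_kernel_subprocess body inlined here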
        self.logger.error(traceback.format_exc())
        return {"time_ms": float("inf")}


def benchmark_function(
How is this being used?
This is old legacy code - let's remove it since it's not currently being used.
    out = fn(*inputs, *init_inputs)
    return out
nit
Suggested change:
-     out = fn(*inputs, *init_inputs)
-     return out
+     return fn(*inputs, *init_inputs)
try:
    # Initialize model to extract weight and bias
    if init_inputs:
        extract_model = Model(*init_inputs).to(device=device, dtype=dtype)
Don't we already create a copy of this above?
# Extract weight and bias from model layer
# Check various possible attribute names
Let's move this to a helper function
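For illustration, one possible shape for such a helper; the attribute names checked here are examples only and not taken from the PR:

    def extract_weight_and_bias(model):
        """Hypothetical helper: probe common layer attribute names for weight/bias."""
        for attr in ("linear", "conv", "fc", "layer"):   # example attribute names
            layer = getattr(model, attr, None)
            if layer is not None and hasattr(layer, "weight"):
                return layer.weight, getattr(layer, "bias", None)
        # Fall back to parameters registered directly on the model
        return getattr(model, "weight", None), getattr(model, "bias", None)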
init_inputs = []
if get_init_inputs is not None:
    init_inputs = get_init_inputs()
    if not isinstance(init_inputs, (tuple, list)):
        init_inputs = [init_inputs]
Suggested change:
- init_inputs = []
- if get_init_inputs is not None:
-     init_inputs = get_init_inputs()
-     if not isinstance(init_inputs, (tuple, list)):
-         init_inputs = [init_inputs]
+ init_inputs = get_init_inputs() if get_init_inputs is not None else []
+ if not isinstance(init_inputs, (tuple, list)):
+     init_inputs = [init_inputs]
if init_inputs:
    model = Model(*init_inputs)
else:
    model = Model()
Inline?
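i.e., something like this sketch:

    model = Model(*init_inputs) if init_inputs else Model()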
try:
    from triton import testing as triton_testing
except ImportError:
    raise ImportError("Triton is required for time_with_triton_do_bench")
drop check?
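If the check were dropped, the quoted block would reduce to a plain import, with the ImportError surfacing on its own (sketch):

    from triton import testing as triton_testing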
79e2e52 to dd8fe36
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios
- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)
- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
    bench = Benchmark(logger, temp_dir, lock, worker_id)
    pytorch_result = bench.benchmark_pytorch(problem_file)
    kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
    speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
dd8fe36 to 1378fc3
Summary
This PR introduces a new benchmarking/ module within opt_worker_component for unified kernel performance measurement. The module provides subprocess-isolated benchmarking, CUDA event timing, and performance statistics collection.

Core Components
1. Benchmark (benchmark.py)
- BenchmarkLockManager for GPU resource contention prevention in multi-worker scenarios

2. KernelSubprocess (kernel_subprocess.py)
- kernel_function(*inputs)
- init_inputs (features, eps)

3. Timing Utilities (timing.py)
- do_bench wrapper with adaptive trial count

Test Results:
The benchmarking module successfully evaluates matmul performance for both the optimized kernel and the PyTorch baseline:
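For context on the CUDA event timing mentioned above, a generic sketch of the pattern (the standard torch.cuda.Event recipe, not the PR's time_with_cuda_events implementation):

    import torch

    def time_once_with_cuda_events(fn, *args):
        """Time a single call with CUDA events; returns (result, elapsed milliseconds)."""
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()          # start from a quiesced device
        start.record()
        out = fn(*args)
        end.record()
        torch.cuda.synchronize()          # wait for the recorded events to complete
        return out, start.elapsed_time(end)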