Conversation

@kaiming-cheng (Contributor)

Summary

This PR introduces a new benchmarking/ module within opt_worker_component for unified kernel performance measurement. The module provides subprocess-isolated benchmarking, CUDA event timing, and performance statistics collection.

Core Components

1. Benchmark (benchmark.py)

  • High-level unified benchmark class for Triton kernels and PyTorch baselines
  • BenchmarkLockManager for GPU resource contention prevention in multi-worker scenarios
  • Subprocess isolation for kernel benchmarking (crash protection)
  • Direct mode for PyTorch baselines
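As a minimal usage sketch (constructor parameter names are taken from the commit message further down; the import path and the shared multiprocessing lock are assumptions):

import logging
import multiprocessing as mp
from pathlib import Path

from benchmarking.benchmark import Benchmark  # import path assumed

# Assumed setup: one lock shared across workers serializes GPU access.
logger = logging.getLogger("benchmark")
lock = mp.Lock()

bench = Benchmark(logger, temp_dir=Path("/tmp/bench"), lock=lock, worker_id=0)
baseline = bench.benchmark_pytorch(Path("problem.py"))                   # direct mode
kernel = bench.benchmark_kernel(Path("kernel.py"), Path("problem.py"))   # subprocess-isolated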

2. KernelSubprocess (kernel_subprocess.py)

  • Standalone profiling script for isolated kernel benchmarks
  • Task-agnostic design
  • Handles multiple kernel types:
    • Standard kernels: kernel_function(*inputs)
    • Conv/Linear kernels: Extracts weights from Model instances
    • RMSNorm kernels: Passes init_inputs (features, eps)
  • JSON output for programmatic result consumption
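To illustrate the isolation pattern (the CLI flags below are hypothetical; the real script's interface may differ), the parent process can invoke the runner and consume its JSON output:

import json
import subprocess
import sys

# Hypothetical flags; shown only to illustrate the subprocess + JSON pattern.
proc = subprocess.run(
    [sys.executable, "kernel_subprocess.py",
     "--kernel", "kernel_optimized.py", "--problem", "problem.py"],
    capture_output=True, text=True, timeout=300,
)
# A crashed kernel only kills the child; the parent falls back to a sentinel.
result = json.loads(proc.stdout) if proc.returncode == 0 else {"time_ms": float("inf")}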

3. Timing Utilities (timing.py)

  • CUDA event-based timing with L2 cache clearing
  • Triton do_bench wrapper with adaptive trial count
  • Dynamic module import for kernel/problem files
  • Comprehensive timing statistics (mean, std, min, max)
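A rough sketch of the CUDA-event pattern (not the actual timing.py code): zero a large buffer between trials to evict the L2 cache, then time each trial with paired events:

import torch

def time_with_cuda_events_sketch(fn, n_trials=100):
    # Large buffer: writing to it between trials evicts the L2 cache.
    cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(n_trials):
        cache.zero_()
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    t = torch.tensor(times)
    return {"mean": t.mean().item(), "std": t.std().item(),
            "min": t.min().item(), "max": t.max().item()}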

Test Results:

The benchmarking module successfully evaluates matmul performance for both the optimized kernel and the PyTorch baseline:

pytorch = bench.benchmark_pytorch(Path("problem.py"))
kernel = bench.benchmark_kernel(Path("kernel_optimized.py"), Path("problem.py"))

# Results
speedup = pytorch['stats']['mean'] / kernel['time_ms']
print(f"PyTorch: {pytorch['stats']['mean']:.3f} ms")
print(f"Kernel:  {kernel['time_ms']:.3f} ms")
print(f"Speedup: {speedup:.2f}x")

Output:
PyTorch: 0.226 ms
Kernel:  0.253 ms
Speedup: 0.89x

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 8, 2026
@kaiming-cheng changed the title from "Add Benchmarking Module (PyTorch & Kernel Performance Eval)" to "[Optimization 2/n] Add Benchmarking Module (PyTorch & Kernel Performance Eval)" on Jan 13, 2026
@Jack-Khuu (Contributor) left a comment:

Two high-level comments:

  • Can we group up chunks of kernel_subprocess.main? Parsing 300+ lines was a bit daunting for what is effectively (see the sketch after this comment):

    benchmark(ref)
    benchmark(kernel)
  • I think we're all guilty of this one; we should start pruning some of the codegen comments where they aren't needed 😅

    # Move to device
    inp = inp.to(device=device)
    
    # Load problem interface
    Model, get_inputs, get_init_inputs = load_problem_interface(problem_file)
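For the first point, a sketch of the grouping being asked for (helper names here are hypothetical, not the actual module's functions):

import json

def main():
    # Hypothetical refactor of kernel_subprocess.main into two benchmark calls.
    args = parse_args()
    ref_result = benchmark(load_reference(args.problem_file))
    kernel_result = benchmark(load_kernel(args.kernel_file, args.problem_file))
    print(json.dumps({"ref": ref_result, "kernel": kernel_result}))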

class BenchmarkLockManager:
    """Manages GPU benchmarking locks to prevent resource contention."""

    def __init__(self, lock: Optional[Any], worker_id: int, logger: logging.Logger):
Contributor:

When would lock be None?

Contributor Author:

The lock was introduced with the beam-search approach. If we only have one worker, technically we don't need the lock. That said, we can simplify the design by always requiring a lock: even with just one worker, the cost of lock acquire/release is negligible.
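
A sketch of the always-require-a-lock simplification being discussed (the context-manager form is an assumption, not the PR's code):

from contextlib import contextmanager

@contextmanager
def gpu_benchmark_lock(lock, worker_id, logger):
    # Always acquire: with a single worker the acquire is uncontended and cheap.
    logger.debug("worker %d waiting for GPU lock", worker_id)
    lock.acquire()
    try:
        yield
    finally:
        lock.release()
        logger.debug("worker %d released GPU lock", worker_id)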

def __init__(
    self,
    logger: logging.Logger,
    temp_dir: Path,
Contributor:

ditto 1/n PR

    - time_ms: Mean time (for backward compatibility)
    - speedup: Speedup vs baseline
    """
    return self._benchmark_kernel_subprocess(
Contributor:

Do we need the indirection if it's a passthrough?

Contributor Author:

Good point; we can just merge these two functions.

        self.logger.error(traceback.format_exc())
        return {"time_ms": float("inf")}

    def benchmark_function(
Contributor:

How is this being used?

Contributor Author:

This is old legacy code; let's remove it since it's not currently being used.

Comment on lines 66 to 67
    out = fn(*inputs, *init_inputs)
    return out
Contributor:

nit

Suggested change:
-    out = fn(*inputs, *init_inputs)
-    return out
+    return fn(*inputs, *init_inputs)

try:
    # Initialize model to extract weight and bias
    if init_inputs:
        extract_model = Model(*init_inputs).to(device=device, dtype=dtype)
Contributor:

Don't we already create a copy of this above?

Comment on lines 322 to 323
# Extract weight and bias from model layer
# Check various possible attribute names
Contributor:

Let's move this to a helper function
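
For example, a hypothetical helper along these lines (the attribute names probed here are illustrative, not the PR's actual list):

def extract_weight_and_bias(model):
    # Probe common layer attribute names on the Model instance.
    for name in ("linear", "conv", "layer", "norm"):
        layer = getattr(model, name, None)
        if layer is not None and hasattr(layer, "weight"):
            return layer.weight, getattr(layer, "bias", None)
    raise AttributeError("No weight-bearing layer found on model")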

Comment on lines 169 to 173
init_inputs = []
if get_init_inputs is not None:
    init_inputs = get_init_inputs()
if not isinstance(init_inputs, (tuple, list)):
    init_inputs = [init_inputs]
Contributor:

Suggested change:
-init_inputs = []
-if get_init_inputs is not None:
-    init_inputs = get_init_inputs()
-if not isinstance(init_inputs, (tuple, list)):
-    init_inputs = [init_inputs]
+init_inputs = get_init_inputs() if get_init_inputs is not None else []
+if not isinstance(init_inputs, (tuple, list)):
+    init_inputs = [init_inputs]

Comment on lines 176 to 179
if init_inputs:
    model = Model(*init_inputs)
else:
    model = Model()
Contributor:

Inline?

Comment on lines 419 to 422
try:
    from triton import testing as triton_testing
except ImportError:
    raise ImportError("Triton is required for time_with_triton_do_bench")
Contributor:

Drop the check?

@kaiming-cheng force-pushed the kaiming/opt_component_2_clean branch from 79e2e52 to dd8fe36 on January 15, 2026 19:20
Kaiming Cheng added 16 commits on January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng force-pushed the kaiming/opt_component_2_clean branch from dd8fe36 to 1378fc3 on January 15, 2026 19:48