Conversation

@kaiming-cheng (Contributor)

Summary

This PR introduces a new benchmarking/ module within opt_worker_component for unified kernel performance measurement. The module provides subprocess-isolated benchmarking, CUDA event timing, and performance statistics collection.

Core Components

1. Benchmark (benchmark.py)

  • High-level unified benchmark class for Triton kernels and PyTorch baselines
  • BenchmarkLockManager for GPU resource contention prevention in multi-worker scenarios
  • Subprocess isolation for kernel benchmarking (crash protection)
  • Direct mode for PyTorch baselines
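As a minimal usage sketch (constructor parameter names are taken from the commit message further down; the import path and the shared multiprocessing lock are assumptions):

import logging
import multiprocessing as mp
from pathlib import Path

from benchmarking.benchmark import Benchmark  # import path assumed

# Assumed setup: one lock shared across workers serializes GPU access.
logger = logging.getLogger("benchmark")
lock = mp.Lock()

bench = Benchmark(logger, temp_dir=Path("/tmp/bench"), lock=lock, worker_id=0)
baseline = bench.benchmark_pytorch(Path("problem.py"))                   # direct mode
kernel = bench.benchmark_kernel(Path("kernel.py"), Path("problem.py"))   # subprocess-isolated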

2. KernelSubprocess (kernel_subprocess.py)

  • Standalone profiling script for isolated kernel benchmarks
  • Task-agnostic design
  • Handles multiple kernel types:
    • Standard kernels: kernel_function(*inputs)
    • Conv/Linear kernels: Extracts weights from Model instances
    • RMSNorm kernels: Passes init_inputs (features, eps)
  • JSON output for programmatic result consumption
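To illustrate the isolation pattern (the CLI flags below are hypothetical; the real script's interface may differ), the parent process can invoke the runner and consume its JSON output:

import json
import subprocess
import sys

# Hypothetical flags; shown only to illustrate the subprocess + JSON pattern.
proc = subprocess.run(
    [sys.executable, "kernel_subprocess.py",
     "--kernel", "kernel_optimized.py", "--problem", "problem.py"],
    capture_output=True, text=True, timeout=300,
)
# A crashed kernel only kills the child; the parent falls back to a sentinel.
result = json.loads(proc.stdout) if proc.returncode == 0 else {"time_ms": float("inf")}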

3. Timing Utilities (timing.py)

  • CUDA event-based timing with L2 cache clearing
  • Triton do_bench wrapper with adaptive trial count
  • Dynamic module import for kernel/problem files
  • Comprehensive timing statistics (mean, std, min, max)
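A rough sketch of the CUDA-event pattern (not the actual timing.py code): zero a large buffer between trials to evict the L2 cache, then time each trial with paired events:

import torch

def time_with_cuda_events_sketch(fn, n_trials=100):
    # Large buffer: writing to it between trials evicts the L2 cache.
    cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(n_trials):
        cache.zero_()
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    t = torch.tensor(times)
    return {"mean": t.mean().item(), "std": t.std().item(),
            "min": t.min().item(), "max": t.max().item()}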

Test Results:

The benchmarking module successfully evaluates matmul performance for both the optimized kernel and the PyTorch baseline:

pytorch = bench.benchmark_pytorch(Path("problem.py"))
kernel = bench.benchmark_kernel(Path("kernel_optimized.py"), Path("problem.py"))

# Results
speedup = pytorch['stats']['mean'] / kernel['time_ms']
print(f"PyTorch: {pytorch['stats']['mean']:.3f} ms")
print(f"Kernel:  {kernel['time_ms']:.3f} ms")
print(f"Speedup: {speedup:.2f}x")

Output:
PyTorch: 0.226 ms
Kernel:  0.253 ms
Speedup: 0.89x

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 8, 2026
@kaiming-cheng changed the title from "Add Benchmarking Module (PyTorch & Kernel Performance Eval)" to "[Optimization 2/n] Add Benchmarking Module (PyTorch & Kernel Performance Eval)" on Jan 13, 2026
@Jack-Khuu (Contributor) left a comment:

Two high-level comments:

  • Can we group up chunks of kernel_subprocess.main? Parsing 300+ lines was a bit daunting for what is effectively (see the sketch after this comment):

    benchmark(ref)
    benchmark(kernel)
  • I think we're all guilty of this one; we should start pruning some of the codegen comments where they aren't needed 😅

    # Move to device
    inp = inp.to(device=device)
    
    # Load problem interface
    Model, get_inputs, get_init_inputs = load_problem_interface(problem_file)
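For the first point, a sketch of the grouping being asked for (helper names here are hypothetical, not the actual module's functions):

import json

def main():
    # Hypothetical refactor of kernel_subprocess.main into two benchmark calls.
    args = parse_args()
    ref_result = benchmark(load_reference(args.problem_file))
    kernel_result = benchmark(load_kernel(args.kernel_file, args.problem_file))
    print(json.dumps({"ref": ref_result, "kernel": kernel_result}))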

class BenchmarkLockManager:
    """Manages GPU benchmarking locks to prevent resource contention."""

    def __init__(self, lock: Optional[Any], worker_id: int, logger: logging.Logger):
Contributor:

When would lock be None?

Contributor Author:

The lock was introduced with the beam-search approach. If we only have one worker, technically we don't need the lock. That said, we can simplify the design by always requiring a lock: even with just one worker, the cost of lock acquire/release is negligible.
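
A sketch of the always-require-a-lock simplification being discussed (the context-manager form is an assumption, not the PR's code):

from contextlib import contextmanager

@contextmanager
def gpu_benchmark_lock(lock, worker_id, logger):
    # Always acquire: with a single worker the acquire is uncontended and cheap.
    logger.debug("worker %d waiting for GPU lock", worker_id)
    lock.acquire()
    try:
        yield
    finally:
        lock.release()
        logger.debug("worker %d released GPU lock", worker_id)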

def __init__(
    self,
    logger: logging.Logger,
    temp_dir: Path,
Contributor:

ditto 1/n PR

    - time_ms: Mean time (for backward compatibility)
    - speedup: Speedup vs baseline
    """
    return self._benchmark_kernel_subprocess(
Contributor:

Do we need the indirection if it's a passthrough?

Contributor Author:

Good point; we can just merge these two functions.

        self.logger.error(traceback.format_exc())
        return {"time_ms": float("inf")}

    def benchmark_function(
Contributor:

How is this being used?

Contributor Author:

This is old legacy code; let's remove it since it's not currently being used.

Comment on lines 66 to 67
    out = fn(*inputs, *init_inputs)
    return out
Contributor:

nit

Suggested change:
-    out = fn(*inputs, *init_inputs)
-    return out
+    return fn(*inputs, *init_inputs)

try:
    # Initialize model to extract weight and bias
    if init_inputs:
        extract_model = Model(*init_inputs).to(device=device, dtype=dtype)
Contributor:

Don't we already create a copy of this above?

Comment on lines 322 to 323
# Extract weight and bias from model layer
# Check various possible attribute names
Contributor:

Let's move this to a helper function
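
For example, a hypothetical helper along these lines (the attribute names probed here are illustrative, not the PR's actual list):

def extract_weight_and_bias(model):
    # Probe common layer attribute names on the Model instance.
    for name in ("linear", "conv", "layer", "norm"):
        layer = getattr(model, name, None)
        if layer is not None and hasattr(layer, "weight"):
            return layer.weight, getattr(layer, "bias", None)
    raise AttributeError("No weight-bearing layer found on model")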

Comment on lines 169 to 173
init_inputs = []
if get_init_inputs is not None:
    init_inputs = get_init_inputs()
if not isinstance(init_inputs, (tuple, list)):
    init_inputs = [init_inputs]
Contributor:

Suggested change:
-init_inputs = []
-if get_init_inputs is not None:
-    init_inputs = get_init_inputs()
-if not isinstance(init_inputs, (tuple, list)):
-    init_inputs = [init_inputs]
+init_inputs = get_init_inputs() if get_init_inputs is not None else []
+if not isinstance(init_inputs, (tuple, list)):
+    init_inputs = [init_inputs]

Comment on lines 176 to 179
if init_inputs:
    model = Model(*init_inputs)
else:
    model = Model()
Contributor:

Inline?

Comment on lines 419 to 422
try:
    from triton import testing as triton_testing
except ImportError:
    raise ImportError("Triton is required for time_with_triton_do_bench")
Contributor:

Drop the check?

@kaiming-cheng force-pushed the kaiming/opt_component_2_clean branch from 79e2e52 to dd8fe36 on January 15, 2026 19:20
Kaiming Cheng added 16 commits on January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng force-pushed the kaiming/opt_component_2_clean branch from dd8fe36 to 1378fc3 on January 15, 2026 19:48