
Conversation

@kaiming-cheng
Contributor

Summary

Introduces an OptimizationOrchestrator that iteratively optimizes Triton kernels using NCU profiling and bottleneck analysis.
Adds a verify_with_refinement() method to worker.py for single-pass verification with a refinement loop.

How OptimizationOrchestrator Works

  1. Profile current kernel with NCU to get hardware metrics
  2. Analyze bottleneck (memory-bound vs compute-bound) via LLM
  3. Generate optimized kernel based on bottleneck-specific strategy
  4. Verify correctness with refinement loop (reuses VerificationWorker)
  5. Benchmark and track best-performing kernel
  6. Repeat for N rounds, always optimizing from the current kernel state (sketched below)
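
Concretely, one round of the loop looks roughly like the sketch below. This is an illustrative, self-contained version; the function name, parameters, and the idea of passing the profiler/analyzer/generator/verifier as callables are assumptions for the example, not the PR's actual OptimizationOrchestrator API.

from typing import Callable

def optimize_kernel(
    kernel_code: str,
    num_rounds: int,
    profile: Callable[[str], dict],       # step 1: NCU hardware metrics
    analyze: Callable[[dict], str],       # step 2: memory- vs compute-bound
    generate: Callable[[str, str], str],  # step 3: bottleneck-specific rewrite
    verify: Callable[[str], bool],        # step 4: correctness with refinement
    benchmark: Callable[[str], float],    # step 5: kernel time in ms
) -> str:
    """Illustrative round loop; not the PR's actual implementation."""
    best_code, best_time = kernel_code, benchmark(kernel_code)
    current_code = kernel_code
    for _ in range(num_rounds):
        metrics = profile(current_code)
        bottleneck = analyze(metrics)
        candidate = generate(current_code, bottleneck)
        if not verify(candidate):
            continue  # skip candidates that never pass verification
        new_time = benchmark(candidate)
        if new_time < best_time:
            best_code, best_time = candidate, new_time  # track best-performing kernel
        current_code = candidate  # step 6: next round optimizes from this kernel
    return best_code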

@meta-cla bot added the CLA Signed label on Jan 14, 2026

if divergence > self.divergence_threshold:
    self.logger.warning(
        f"[{round_num}] ⚠️ EXCESSIVE DIVERGENCE: {new_time:.4f} ms is {divergence:.1f}% worse"
Contributor

nit: Excessive => Egregious?

Comment on lines 416 to 430
if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
    pytorch_speedup = pytorch_baseline_time / best_time
    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
    self.logger.info(
        f" Baseline time: {baseline_results['time_ms']:.4f} ms"
    )
    self.logger.info(f" Best time: {best_time:.4f} ms")
    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
    self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
else:
    self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
    self.logger.info(f" Best time: {best_time:.4f} ms")
    self.logger.info(f" Speedup: {baseline_speedup:.2f}x")
    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
Contributor

nit: Not necessarily this order, but just some light refactoring. It'd be nice if the conditional for when there's a PyTorch baseline (or any arbitrary additional baseline) just wrapped its unique lines.

Suggested change
-if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
-    pytorch_speedup = pytorch_baseline_time / best_time
-    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
-    self.logger.info(
-        f" Baseline time: {baseline_results['time_ms']:.4f} ms"
-    )
-    self.logger.info(f" Best time: {best_time:.4f} ms")
-    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
-    self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
-    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
-else:
-    self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
-    self.logger.info(f" Best time: {best_time:.4f} ms")
-    self.logger.info(f" Speedup: {baseline_speedup:.2f}x")
-    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
+self.logger.info(f" Best time: {best_time:.4f} ms")
+self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
+self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
+if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
+    pytorch_speedup = pytorch_baseline_time / best_time
+    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
+    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
+self.logger.info(f" Improvement: {improvement_percent:.1f}%")

"history": list(self.history),
}

def verify_with_refinement(
Contributor

I'm not ecstatic about having two functions (run/verify_with_refinement) that are this similar.
Is there an easy way to pull out/tweak the single-pass logic from run and reuse that?
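
For what it's worth, one possible shape for that refactor is sketched below: pull the per-attempt logic into a private helper that both run() and verify_with_refinement() call. Everything here other than the run/verify_with_refinement names is made up for illustration and is not the actual worker code.

class VerificationWorker:
    # Sketch only: _compile_and_test and _refine stand in for the worker's
    # real compile/verify/refine machinery and are not defined here.

    def _single_pass(self, kernel_code, problem):
        # One compile-and-test attempt, plus one refinement step on failure.
        result = self._compile_and_test(kernel_code, problem)
        if not result.passed:
            kernel_code = self._refine(kernel_code, result.errors)
        return kernel_code, result

    def verify_with_refinement(self, kernel_code, problem, max_attempts=3):
        # Retry the shared single-pass helper until it passes or attempts run out.
        for _ in range(max_attempts):
            kernel_code, result = self._single_pass(kernel_code, problem)
            if result.passed:
                break
        return kernel_code, result

    def run(self, kernel_code, problem):
        # run() would keep its broader orchestration but reuse _single_pass()
        # rather than duplicating the per-attempt logic.
        ...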

Kaiming Cheng added 27 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
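
For context, the CUDA-event timing that timing.py's time_with_cuda_events() centralizes generally follows the standard pattern below. This is a generic sketch under assumed warmup/iteration defaults, not the file's actual implementation.

import torch

def time_with_cuda_events(fn, warmup=10, iters=100):
    # Warm up to keep compilation/caching effects out of the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait for the kernel before reading the events
        times_ms.append(start.elapsed_time(end))  # elapsed time in milliseconds
    return sum(times_ms) / len(times_ms)  # mean time, in ms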
@kaiming-cheng force-pushed the kaiming/opt_component_5 branch from be2accf to dd55d1d on January 15, 2026 at 19:49