
Conversation

@kaiming-cheng
Contributor

Summary

Introduces an OptimizationOrchestrator that iteratively optimizes Triton kernels using NCU profiling and bottleneck analysis.
Adds a verify_with_refinement() method to worker.py for single-pass verification with a refinement loop.

How OptimizationOrchestrator Works

  1. Profile current kernel with NCU to get hardware metrics
  2. Analyze bottleneck (memory-bound vs compute-bound) via LLM
  3. Generate optimized kernel based on bottleneck-specific strategy
  4. Verify correctness with refinement loop (reuses VerificationWorker)
  5. Benchmark and track best-performing kernel
  6. Repeat for N rounds, always optimizing from the current kernel state (sketched below)
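
Concretely, one round of the loop looks roughly like the sketch below. This is an illustrative, self-contained version; the function name, parameters, and the idea of passing the profiler/analyzer/generator/verifier as callables are assumptions for the example, not the PR's actual OptimizationOrchestrator API.

from typing import Callable

def optimize_kernel(
    kernel_code: str,
    num_rounds: int,
    profile: Callable[[str], dict],       # step 1: NCU hardware metrics
    analyze: Callable[[dict], str],       # step 2: memory- vs compute-bound
    generate: Callable[[str, str], str],  # step 3: bottleneck-specific rewrite
    verify: Callable[[str], bool],        # step 4: correctness with refinement
    benchmark: Callable[[str], float],    # step 5: kernel time in ms
) -> str:
    """Illustrative round loop; not the PR's actual implementation."""
    best_code, best_time = kernel_code, benchmark(kernel_code)
    current_code = kernel_code
    for _ in range(num_rounds):
        metrics = profile(current_code)
        bottleneck = analyze(metrics)
        candidate = generate(current_code, bottleneck)
        if not verify(candidate):
            continue  # skip candidates that never pass verification
        new_time = benchmark(candidate)
        if new_time < best_time:
            best_code, best_time = candidate, new_time  # track best-performing kernel
        current_code = candidate  # step 6: next round optimizes from this kernel
    return best_code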

@meta-cla bot added the CLA Signed label on Jan 14, 2026

if divergence > self.divergence_threshold:
    self.logger.warning(
        f"[{round_num}] ⚠️ EXCESSIVE DIVERGENCE: {new_time:.4f} ms is {divergence:.1f}% worse"
Contributor

nit: Excessive => Egregious?

Comment on lines 416 to 430
if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
    pytorch_speedup = pytorch_baseline_time / best_time
    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
    self.logger.info(
        f" Baseline time: {baseline_results['time_ms']:.4f} ms"
    )
    self.logger.info(f" Best time: {best_time:.4f} ms")
    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
    self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
else:
    self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
    self.logger.info(f" Best time: {best_time:.4f} ms")
    self.logger.info(f" Speedup: {baseline_speedup:.2f}x")
    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
Contributor

nit: Not necessarily this order, but just some light refactoring. It'd be nice if the conditional for when there's a PyTorch baseline (or any arbitrary additional baseline) just wrapped its unique lines.

Suggested change
-if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
-    pytorch_speedup = pytorch_baseline_time / best_time
-    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
-    self.logger.info(
-        f" Baseline time: {baseline_results['time_ms']:.4f} ms"
-    )
-    self.logger.info(f" Best time: {best_time:.4f} ms")
-    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
-    self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
-    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
-else:
-    self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
-    self.logger.info(f" Best time: {best_time:.4f} ms")
-    self.logger.info(f" Speedup: {baseline_speedup:.2f}x")
-    self.logger.info(f" Improvement: {improvement_percent:.1f}%")
+self.logger.info(f" Best time: {best_time:.4f} ms")
+self.logger.info(f" Baseline time: {baseline_results['time_ms']:.4f} ms")
+self.logger.info(f" Speedup vs baseline: {baseline_speedup:.2f}x")
+if pytorch_baseline_time and pytorch_baseline_time != float("inf"):
+    pytorch_speedup = pytorch_baseline_time / best_time
+    self.logger.info(f" PyTorch baseline: {pytorch_baseline_time:.4f} ms")
+    self.logger.info(f" Speedup vs PyTorch: {pytorch_speedup:.2f}x")
+self.logger.info(f" Improvement: {improvement_percent:.1f}%")

"history": list(self.history),
}

def verify_with_refinement(
Contributor

I'm not ecstatic about having two functions (run/verify_with_refinement) that are this similar.
Is there an easy way to pull out/tweak the single-pass logic from run and reuse that?
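
For what it's worth, one possible shape for that refactor is sketched below: pull the per-attempt logic into a private helper that both run() and verify_with_refinement() call. Everything here other than the run/verify_with_refinement names is made up for illustration and is not the actual worker code.

class VerificationWorker:
    # Sketch only: _compile_and_test and _refine stand in for the worker's
    # real compile/verify/refine machinery and are not defined here.

    def _single_pass(self, kernel_code, problem):
        # One compile-and-test attempt, plus one refinement step on failure.
        result = self._compile_and_test(kernel_code, problem)
        if not result.passed:
            kernel_code = self._refine(kernel_code, result.errors)
        return kernel_code, result

    def verify_with_refinement(self, kernel_code, problem, max_attempts=3):
        # Retry the shared single-pass helper until it passes or attempts run out.
        for _ in range(max_attempts):
            kernel_code, result = self._single_pass(kernel_code, problem)
            if result.passed:
                break
        return kernel_code, result

    def run(self, kernel_code, problem):
        # run() would keep its broader orchestration but reuse _single_pass()
        # rather than duplicating the per-attempt logic.
        ...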

Kaiming Cheng added 27 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
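
For context, the CUDA-event timing that timing.py's time_with_cuda_events() centralizes generally follows the standard pattern below. This is a generic sketch under assumed warmup/iteration defaults, not the file's actual implementation.

import torch

def time_with_cuda_events(fn, warmup=10, iters=100):
    # Warm up to keep compilation/caching effects out of the measurement.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # wait for the kernel before reading the events
        times_ms.append(start.elapsed_time(end))  # elapsed time in milliseconds
    return sum(times_ms) / len(times_ms)  # mean time, in ms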
@kaiming-cheng force-pushed the kaiming/opt_component_5 branch from be2accf to dd55d1d on January 15, 2026 at 19:49