@kaiming-cheng (Contributor)
This PR introduces OptimizationWorker from opt_worker.py. The OptimizationWorker class integrates the modular components from opt_worker_components, demonstrating end-to-end usage of the optimization pipeline.

Changes

opt_worker.py introduces OptimizationWorker, a hardware-aware optimization worker that orchestrates the full optimization pipeline.

bottleneck_analyzer.py adds a new class that interfaces with the modular components in opt_worker_component/diagnose_prompt, wrapping the Judge LLM workflow for dual-bottleneck analysis.
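A rough sketch of what such a wrapper could look like (the class shape, method names, and response format below are illustrative placeholders, not the actual interface in this PR):

```python
from dataclasses import dataclass

@dataclass
class BottleneckReport:
    """Illustrative result shape for a dual-bottleneck analysis."""
    primary: str      # e.g. "memory-bound"
    secondary: str    # e.g. "latency-bound"
    evidence: str     # the judge's reasoning, kept for logging

class BottleneckAnalyzer:
    """Hypothetical wrapper around a Judge LLM for bottleneck diagnosis."""

    def __init__(self, judge):
        # `judge` is any callable mapping a prompt string to a response string,
        # so the LLM backend can be swapped or faked in tests.
        self.judge = judge

    def analyze(self, ncu_metrics: dict) -> BottleneckReport:
        # Build a diagnosis prompt from the NCU profile and query the judge.
        prompt = (
            "Classify the primary and secondary performance bottleneck "
            "from these NCU metrics:\n" + str(ncu_metrics)
        )
        response = self.judge(prompt)
        # A real implementation would parse a structured (e.g. JSON) reply;
        # here we assume a simple "primary|secondary|evidence" string.
        primary, secondary, evidence = response.split("|", 2)
        return BottleneckReport(primary, secondary, evidence)
```

With a fake judge this runs standalone, which is also how the wrapper could be unit-tested without an LLM call.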

worker_util.py extracts shared utility functions used by both VerificationWorker and OptimizationWorker.

Test

worker = OptimizationWorker(
    worker_id=0,
    workdir=workdir,
    log_dir=log_dir,
    max_rounds=5,
    openai_model="gpt-5",
    high_reasoning_effort=True,
    # Hardware-aware parameters
    gpu_name=None,  # Auto-detect GPU
    enable_ncu_profiling=True,
    bottleneck_id=1,  # Focus on primary bottleneck
    # Benchmarking parameters
    benchmark_warmup=25,
    benchmark_repeat=100,
    # Performance safeguards
    divergence_threshold=50.0,  # Revert if 50% worse
    target_platform="cuda",
)

success, best_kernel, metrics = worker.optimize_kernel(
    kernel_code=kernel_code,
    problem_file=problem_file,
    test_code=test_code,
)
2026-01-18 13:00:06,083 - opt_worker_0 - INFO - [1] Profiling current kernel with NCU...
2026-01-18 13:00:27,331 - opt_worker_0 - INFO - ✅ NCU profiling completed for round 1
2026-01-18 13:00:27,331 - opt_worker_0 - INFO - [1] Analyzing bottleneck...
2026-01-18 13:04:27,073 - opt_worker_0 - INFO - [1] Bottleneck analysis complete: primary=memory-bound
2026-01-18 13:04:27,073 - opt_worker_0 - INFO - [1] Generating optimized kernel...
2026-01-18 13:06:47,056 - opt_worker_0 - INFO - [1] Verifying correctness...
2026-01-18 13:08:34,780 - opt_worker_0 - INFO - [1] ✅ Correctness check passed
2026-01-18 13:08:38,346 - opt_worker_0 - INFO - [1] 🎉 NEW BEST! 0.2761 ms (speedup: 1.09x, improvement: 7.9%)
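The round structure visible in the log above — profile, analyze, generate, verify, benchmark, then keep or discard — can be sketched roughly as follows. All helper names here are placeholders passed in via a `steps` dict, not the real API; the point is the keep-or-revert logic around `divergence_threshold`:

```python
def optimize_loop(kernel_code, max_rounds, divergence_threshold, steps):
    """Sketch of the per-round optimization loop (illustrative only).

    `steps` bundles placeholder callables: profile, analyze, generate,
    verify (bool), and benchmark (returns time in ms).
    """
    best_kernel = kernel_code
    best_time = steps["benchmark"](kernel_code)  # baseline timing
    for round_idx in range(1, max_rounds + 1):
        profile = steps["profile"](best_kernel)            # NCU profiling
        bottleneck = steps["analyze"](profile)             # Judge LLM diagnosis
        candidate = steps["generate"](best_kernel, bottleneck)
        if not steps["verify"](candidate):                 # correctness gate
            continue
        t = steps["benchmark"](candidate)
        # Performance safeguard: discard candidates that regress more than
        # divergence_threshold percent past the current best.
        if t > best_time * (1 + divergence_threshold / 100):
            continue
        if t < best_time:                                  # new best kernel
            best_kernel, best_time = candidate, t
    return best_kernel, best_time
```

With `divergence_threshold=50.0`, a candidate 50% slower than the current best is dropped rather than adopted, matching the "Revert if 50% worse" comment in the configuration above.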

Kaiming Cheng added 30 commits January 15, 2026 11:44
Consolidates the previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling
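The crash-protection idea behind benchmark_kernel — run the candidate kernel in a child process so a segfault or hang cannot take down the worker — can be sketched with the standard library. This is a simplified placeholder (the child script here does no real timing); the actual kernel_subprocess.py additionally manages CUDA state and richer result reporting:

```python
import json
import subprocess
import sys

def run_isolated(kernel_file: str, timeout_s: float = 120.0):
    """Run a (possibly buggy) kernel benchmark in a child process.

    Returns a dict with either {'time_ms': ...} or {'error': ...}; a crash,
    OOM, or hang in the kernel never kills the parent worker process.
    """
    # Placeholder child script: a real runner would load `kernel_file`,
    # time it with CUDA events, and print its results as JSON on stdout.
    child_code = (
        "import json\n"
        "result = {'time_ms': 0.0}  # placeholder timing\n"
        "print(json.dumps(result))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, kernel_file],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"error": "timeout"}            # hung kernel: child is killed
    if proc.returncode != 0:                   # segfault, import error, ...
        return {"error": proc.stderr.strip() or f"exit {proc.returncode}"}
    return json.loads(proc.stdout)
```

Each run gets a fresh interpreter (and hence a fresh CUDA context), which is what gives the "clean CUDA state between runs" property listed above.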

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
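Since the retained statistics are just mean/std/min/max, compute_timing_stats can be small; a minimal version (the dict shape is assumed from the API example above, and the real function may differ) might be:

```python
import statistics

def compute_timing_stats(times_ms):
    """Reduce raw per-iteration timings (ms) to the retained metrics.

    Assumed stats shape: {'mean', 'std', 'min', 'max'} — the percentile
    statistics (median, p25/p75/p95/p99) were deliberately dropped.
    """
    return {
        "mean": statistics.fmean(times_ms),
        "std": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
        "min": min(times_ms),
        "max": max(times_ms),
    }
```

Under this assumed shape, the speedup line in the API example divides the PyTorch mean time by the kernel's measured time.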
@meta-cla bot added the "CLA Signed" label on Jan 18, 2026
@kaiming-cheng kaiming-cheng changed the title [Optimization 6/n] Add Optimization worker [Optimization 6/n] Introduce Optimization Worker Jan 18, 2026