43 changes: 42 additions & 1 deletion README.md
@@ -31,6 +31,47 @@ We then extend `KernelBenchEnv` to support:
- **Batching**: `KernelBenchEnvGroupBuilder` groups multiple rollouts for the same problem, enabling **GRPO-style** training where rewards are normalized within groups.
- **Dataset Construction**: `KernelBenchDatasetBuilder` handles the iteration over KernelBench levels and problems, partitioning them into training and evaluation sets. You are welcome to extend it to support more problems beyond what is currently in KernelBench.

### Multi-Turn RL

We extend the single-turn pipeline with multi-turn iterative refinement, following the approach in [Kevin](https://arxiv.org/abs/2507.11948). Instead of generating one kernel per problem, the model generates a kernel, receives evaluation feedback (compilation errors, correctness failures, or speedup results), and refines its solution over multiple turns.

`MultiTurnKernelBenchEnv` manages the multi-turn loop:
- **History management**: Prior turns (prompt, response, feedback) are kept in context with token-based truncation to stay within the context window.
- **Evaluation feedback**: Structured feedback tells the model what went wrong (compilation error, incorrect output, or correct but slow) so it can fix specific issues.
- **Early stopping**: Optionally stop the episode when the kernel passes all correctness tests.
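
The history-management loop above can be sketched as follows. This is a minimal illustration, not the repo's actual API: `truncate_history` and the `(prompt, response, feedback)` tuple shape are hypothetical names, and the tokenizer is abstracted as a `count_tokens` callable.

```python
def truncate_history(turns, max_tokens, count_tokens):
    """Keep the most recent (prompt, response, feedback) turns that fit
    within a token budget, dropping the oldest turns first."""
    kept = []
    total = 0
    for turn in reversed(turns):  # walk newest-first
        cost = sum(count_tokens(part) for part in turn)
        if total + cost > max_tokens:
            break  # this turn (and everything older) no longer fits
        kept.append(turn)
        total += cost
    return list(reversed(kept))  # restore chronological order
```

Dropping whole turns (rather than clipping mid-turn) keeps each remaining prompt/response/feedback triple intact, so the model always sees complete evaluation feedback for the turns it can see at all.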

Training uses GRPO with discounted returns across turns:
- Per-turn scores are computed as `S = 0.3 * correct + speedup`, with the speedup term counted only for correct kernels (incorrect kernels score 0).
- Discounted returns: `R_t = S_t + γ * R_{t+1}` (backward recursion, γ=0.4 by default).
- Advantages are normalized across all `group_size × max_turns` turn-level samples: `(R - mean) / (std + ε)`.
- PPO with asymmetric clipping (Clip-Higher, ε_low=0.2, ε_high=0.28) and constant length normalization.
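
The scoring, discounted-return, and advantage formulas above can be sketched as plain Python. This is an illustration of the math, not the repo's training code; function names are hypothetical.

```python
import statistics

def turn_scores(results, correctness_weight=0.3):
    """Per-turn score S = 0.3 * correct + speedup, with the speedup
    term counted only for correct kernels."""
    return [
        correctness_weight * 1.0 + speedup if correct else 0.0
        for correct, speedup in results
    ]

def discounted_returns(scores, gamma=0.4):
    """Backward recursion R_t = S_t + gamma * R_{t+1}."""
    returns = [0.0] * len(scores)
    running = 0.0
    for t in reversed(range(len(scores))):
        running = scores[t] + gamma * running
        returns[t] = running
    return returns

def normalized_advantages(all_returns, eps=1e-6):
    """Normalize (R - mean) / (std + eps) across all
    group_size x max_turns turn-level returns."""
    mean = statistics.fmean(all_returns)
    std = statistics.pstdev(all_returns)
    return [(r - mean) / (std + eps) for r in all_returns]
```

Because later turns' scores flow backward through `gamma`, an early turn that sets up a correct, fast kernel in a later turn still receives credit, which is what makes the reward signal denser than single-turn GRPO.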

Enable multi-turn via config:
```yaml
multiturn:
enabled: true
max_turns: 4 # Refinement turns per trajectory
gamma: 0.4 # Discount factor
aggregation: "sum" # "sum" or "max"
```

Or via CLI:
```bash
uv run python -m kernelbench_tinker.scripts.train_kernel_rl \
--config src/kernelbench_tinker/config/rl_kernelbench.yaml \
multiturn.enabled=true \
log_path=./runs/my_multiturn_experiment
```

Multi-turn inference is also supported via the eval script:
```bash
uv run python -m kernelbench_tinker.scripts.eval_kernel_rl \
checkpoint_path=<your_checkpoint> \
multiturn_enabled=true \
multiturn_max_turns=8 \
level=1
```


### Directory Structure
```text
@@ -54,6 +95,7 @@ src/kernelbench_tinker/
envs/
kernelbench_client.py # KernelBench Python API wrapper
kernelbench_env.py # Single-turn RL environment
multiturn_kernelbench_env.py # Multi-turn RL environment
training/
models.py # Model/renderer configuration
reward.py # Reward shaping
@@ -282,7 +324,6 @@ Note the scope of this repo is an open-source implementation of KernelBench-Tink

* More reward examples leveraging more fine-grained metrics
* More reward hack checking
* Multi-turn RL to have denser reward signal like [Kevin](https://arxiv.org/abs/2507.11948)
* Improve Step time and training efficiency


54 changes: 54 additions & 0 deletions src/kernelbench_tinker/config/configs.py
@@ -81,3 +81,57 @@ class DatasetConfig:

# Train/test split
test_fraction: float = 0.1


@dataclass
class MultiTurnConfig:
"""
Configuration for multi-turn RL training.

Controls the iterative refinement loop where the model receives
evaluation feedback and can fix errors across multiple turns.
"""

# Enable multi-turn mode (False = single-turn)
enabled: bool = False

# Maximum refinement turns per trajectory
max_turns: int = 4

# Discount factor for multi-turn returns: R_t = S_t + gamma * R_{t+1}
gamma: float = 0.4

# Return aggregation mode: "sum" or "max"
# sum: R_t = Σ γ^(i-t) × S_i (reward turns leading to many good kernels)
# max: R_t = max{ γ^(i-t) × S_i } (reward turns leading to one great kernel)
aggregation: str = "sum"

# Stop the episode early when the kernel is correct.
# Default False for training: model needs post-correctness turns to
# learn speedup optimization. Set True at eval time if desired.
early_stop_on_correct: bool = False

# Optional: require this speedup before early stopping
speedup_threshold: float | None = None

# Prompt
prompt_max_tokens: int | None = None # Token budget for history truncation (None = char fallback)
inject_think_token: bool = False # Append <think>\n to generation prompts

# Generation
temperature: float = 0.9
top_p: float = 1.0
seed: int | None = None

# Response length extension mid-training (0 = disabled)
max_tokens_extended: int = 22000
max_tokens_extend_after_step: int = 30

# Training
loss_fn: str = "ppo"
max_grad_norm: float = 0.05
warmup_ratio: float = 0.03
clip_epsilon_low: float = 0.2
clip_epsilon_high: float = 0.28
constant_length_norm: int = 16384
num_substeps: int = 2
31 changes: 31 additions & 0 deletions src/kernelbench_tinker/config/rl_kernelbench.yaml
@@ -26,6 +26,33 @@ learning_rate: 0.000002 # 2e-6 as explicit float
max_tokens: 16384
temperature: 1.0

# =============================================================================
# Multi-turn Configuration (disabled by default)
# =============================================================================
multiturn:
enabled: false # true to enable iterative refinement
max_turns: 4 # Maximum refinement turns per trajectory
gamma: 0.4 # Discount factor for multi-turn returns
aggregation: "sum" # "sum" (reward many good kernels) or "max" (reward one great kernel)
early_stop_on_correct: false # Stop episode when kernel passes all tests
speedup_threshold: null # Required speedup before early stopping (null = any correct)
# Prompt
prompt_max_tokens: null # Token budget for history truncation (null = char fallback)
inject_think_token: false # Append <think>\n to generation prompts
# Generation
temperature: 0.9 # Generation temperature
top_p: 1.0 # Nucleus sampling (1.0 = disabled)
seed: null # Random seed for generation (null = random)
max_tokens_extended: 22000 # Extend max_tokens mid-training (0 = disabled)
max_tokens_extend_after_step: 30 # Step at which to switch
# Training
loss_fn: "ppo" # Loss function (single-turn uses top-level loss_fn)
max_grad_norm: 0.05 # Gradient clipping (0.0 = disabled)
warmup_ratio: 0.03 # Linear LR warmup fraction
clip_epsilon_low: 0.2 # PPO clip lower bound
  clip_epsilon_high: 0.28 # PPO clip upper bound (Clip-Higher)
constant_length_norm: 16384 # GRPO constant length normalization (0 = disabled)

# =============================================================================
# Training Configuration
# =============================================================================
@@ -57,6 +84,7 @@ dataset_builder:
# Problem Selection
# ---------------------------------------------------------------------------
level: 1 # KernelBench level (1, 2, 3, or 4)
levels: null # Train on multiple levels (e.g. [1, 2]); overrides level when set
start_problem: null # First problem ID (null = start from 1)
end_problem: null # Last problem ID (null = all problems)
dataset_src: "huggingface" # "huggingface" or "local"
@@ -107,6 +135,9 @@ dataset_builder:
reward_correctness_weight: 0.3
reward_speed_weight: 1.0
reward_length_weight: 0.0
reward_speed_max_reward: 10.0 # Cap on speed reward component (set high to uncap)
reward_clip_min: null # Lower bound on total reward (null = no clipping)
reward_clip_max: null # Upper bound on total reward (null = no clipping)

# ---------------------------------------------------------------------------
# Reward Hacking Detection (Static Checker)
150 changes: 150 additions & 0 deletions src/kernelbench_tinker/envs/env_utils.py
@@ -0,0 +1,150 @@
"""
Shared utilities for KernelBench environments.

Contains helpers used by both the single-turn and multi-turn environments:
- System prompt construction
- Step evaluation (parse → evaluate → reward → metrics)
"""

from __future__ import annotations

import logging
import time
from dataclasses import dataclass
from typing import TYPE_CHECKING

from tinker_cookbook import renderers
from tinker_cookbook.rl.types import Action, Metrics

from kernelbench_tinker.config.configs import EvalConfig
from kernelbench_tinker.envs.kernelbench_client import (
KernelBenchProblem,
KernelEvalResult,
ParsedResponse,
evaluate_kernel_async,
parse_structured_response,
)
from kernelbench_tinker.training.reward import (
RewardConfig,
compute_reward,
)

logger = logging.getLogger(__name__)


@dataclass
class EvalStepResult:
"""Result from evaluate_step(), shared by single-turn and multi-turn envs."""

parsed: ParsedResponse
eval_result: KernelEvalResult
format_ok: bool
kernel_code: str
reward: float
metrics: Metrics
response_text: str # Raw response content from renderer (before structured parsing)


def build_system_prompt(backend: str) -> str:
"""Build a backend-specific system prompt for kernel generation.

Used by both single-turn and multi-turn environments.
"""
return (
f"You are an expert GPU kernel developer. Your task is to optimize PyTorch "
f"operations by writing efficient custom {backend.upper()} kernels.\n"
f"\n"
f"When given a PyTorch model, write an optimized kernel implementation.\n"
f"\n"
f"Your solution must:\n"
f"- Be a drop-in replacement as a class named `ModelNew`\n"
f"- Use custom {backend.upper()} kernels, not just PyTorch operations\n"
f"- Be correct and produce the same results as the reference\n"
f"\n"
f"You MUST respond in exactly this format:\n"
f"\n"
f"<KERNEL>\n"
f"```python\n"
f"# Your complete optimized implementation here\n"
f"class ModelNew(nn.Module):\n"
f" ...\n"
f"```\n"
f"</KERNEL>"
)


async def evaluate_step(
problem: KernelBenchProblem,
renderer: renderers.Renderer,
action: Action,
eval_config: EvalConfig,
reward_config: RewardConfig,
step_start: float,
) -> EvalStepResult:
"""Parse, evaluate, and compute reward for a single action.

Shared by KernelBenchEnv.step() and MultiTurnKernelBenchEnv.step().
"""
message, _ = renderer.parse_response(action)
response_text = message.get("content", "")

parsed = parse_structured_response(response_text)
kernel_code = parsed.kernel
format_ok = parsed.format_ok

eval_start = time.perf_counter()
cfg = eval_config
eval_result = await evaluate_kernel_async(
level=problem.level,
problem_id=problem.problem_id,
backend=problem.backend,
kernel_code=kernel_code,
dataset_src=problem.dataset_src,
num_correct_trials=cfg.num_correct_trials,
measure_performance=cfg.measure_performance,
num_perf_trials=cfg.num_perf_trials,
timing_method=cfg.timing_method,
precision=cfg.precision,
check_for_excessive_speedup=cfg.check_for_excessive_speedup,
excessive_speedup_threshold=cfg.excessive_speedup_threshold,
timeout=cfg.modal_timeout,
)
eval_time = time.perf_counter() - eval_start

reward = compute_reward(
eval_result,
reward_config,
kernel_code=kernel_code,
backend=problem.backend,
)

metrics: Metrics = {
"level": problem.level,
"problem_id": problem.problem_id,
"format_ok": float(format_ok),
"compiled": float(eval_result["compiled"]),
"correctness": float(eval_result["correctness"]),
"tests_passed": eval_result["tests_passed"],
"tests_total": eval_result["tests_total"],
}
if eval_result.get("speedup") is not None:
metrics["speedup"] = eval_result["speedup"]
if eval_result.get("runtime_ms") is not None:
metrics["runtime_ms"] = eval_result["runtime_ms"]
metrics["time/eval"] = eval_time
timing_metadata = (eval_result.get("metadata") or {}).get("timings", {})
if "reference_load_s" in timing_metadata:
metrics["time/ref_load"] = timing_metadata["reference_load_s"]
if "modal_eval_s" in timing_metadata:
metrics["time/modal_eval"] = timing_metadata["modal_eval_s"]
metrics["time/step_total"] = time.perf_counter() - step_start

return EvalStepResult(
parsed=parsed,
eval_result=eval_result,
format_ok=format_ok,
kernel_code=kernel_code,
reward=reward,
metrics=metrics,
response_text=response_text,
)