  "6. Exhaust the low-hanging search ladder before custom kernels: baseline -> packaged model-family, checkpoint, or runtime flavor variants -> runtime flags or attention implementation -> dtype or quant or checkpoint variants, including synthesized FP8 conversion when no packaged FP8 artifact exists -> torch.compile or CUDA graphs if supported -> Triton or CuTe or CUDA kernels -> deeper runtime patching.",
  "7. Choose the backend and workflow yourself. Do not assume Triton, CuTe, or CUDA helpers exist. Use generic file tools plus run_command to create, edit, build, verify, and benchmark code.",
  "8. When a command belongs to a candidate stage, call run_command with session, candidate, and stage so Fusion persists the artifact. Use run_benchmark and run_profile with session and candidate for benchmark/profile stages.",
- "9. If compile, correctness, inference, or performance problems appear, inspect the outputs, patch the code or scripts, and retry. Do not stop at the first fixable error or the first small performance win.",
- "10. Verify correctness before claiming success. Prefer explicit tolerances, reproducible seeds, and benchmark evidence.",
- "11. Keep the optimization session state accurate by recording stages and using the candidate workspace instead of ad hoc temp paths.",
- "12. For FP8 or other converted quantization paths, save the calibration recipe, runtime flags, and any fallback higher-precision modules. Compare normalized steady-state metrics, not just raw wall time. When model families produce different output lengths, prefer metrics like rtf, x_real_time, or tokens/sec. Keep download, compile, and warmup overhead separate from steady-state generation speed.",
- "13. Maintain a current best candidate. If a new candidate regresses or breaks correctness, fall back to the current best and continue the search.",
- "14. End only after each applicable candidate family has been tested, rejected with evidence, or blocked by the environment. Then report the best candidate, what changed, what passed, what failed, and the next most valuable experiment if more time remains.",
+ "9. After profile collection, use analyze_profile so Fusion converts raw Nsight output into a BottleneckReport and Prescription before you decide on deeper kernel changes.",
+ "10. Use show_outer_loop_status and record_loop_decision to make the outer-loop state explicit. Do not launch deeper custom kernel search until packaged model, runtime, quantization, compile, and attention-backend branches are exhausted or explicitly blocked.",
+ "11. During kernel search, persist round artifacts with save_round_artifact or record_reflexion under candidates/<id>/rounds/<n> so prompt, diagnosis, prescription, verify, bench, and reflexion data survive across turns.",
+ "12. Use assess_benchmark_runs before ranking performance-sensitive candidates, and use rank_search_candidates to keep a top-K survivor set and promote the current best candidate explicitly.",
+ "13. If compile, correctness, inference, or performance problems appear, inspect the outputs, patch the code or scripts, and retry. Do not stop at the first fixable error or the first small performance win.",
+ "14. Verify correctness before claiming success. Prefer explicit tolerances, reproducible seeds, and benchmark evidence.",
+ "15. Keep the optimization session state accurate by recording stages and using the candidate workspace instead of ad hoc temp paths.",
+ "16. For FP8 or other converted quantization paths, save the calibration recipe, runtime flags, and any fallback higher-precision modules. Compare normalized steady-state metrics, not just raw wall time. When model families produce different output lengths, prefer metrics like rtf, x_real_time, or tokens/sec. Keep download, compile, and warmup overhead separate from steady-state generation speed.",
+ "17. Maintain a current best candidate. If a new candidate regresses or breaks correctness, fall back to the current best and continue the search.",
+ "18. End only after each applicable candidate family has been tested, rejected with evidence, or blocked by the environment. Then report the best candidate, what changed, what passed, what failed, and the next most valuable experiment if more time remains.",
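
The rules on explicit tolerances, steady-state metrics kept separate from warmup, and falling back to the current best candidate can be illustrated with a small sketch. This is not Fusion's implementation and does not call any Fusion tool: `Candidate`, `max_abs_error`, `steady_state_tokens_per_sec`, and `pick_best` are hypothetical names introduced here, and the `generate` callable is assumed to run one generation pass and return the number of tokens it produced.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    """Hypothetical record of one optimization candidate's measured results."""
    name: str
    tokens_per_sec: float  # normalized steady-state throughput
    error: float           # max abs error against the baseline output


def max_abs_error(reference: Sequence[float], output: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two outputs."""
    if len(reference) != len(output):
        raise ValueError("outputs must have the same length to be compared")
    return max((abs(r - o) for r, o in zip(reference, output)), default=0.0)


def steady_state_tokens_per_sec(
    generate: Callable[[], int], warmup: int = 2, iters: int = 5
) -> float:
    """Time only the post-warmup iterations and normalize by tokens produced."""
    for _ in range(warmup):
        generate()  # compile and cache warmup, excluded from the timing
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(iters):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("timed region too short to measure")
    return total_tokens / elapsed


def pick_best(
    current_best: Optional[Candidate], new: Candidate, atol: float
) -> Candidate:
    """Promote `new` only if it is within tolerance and faster; else fall back."""
    if new.error > atol:
        if current_best is None:
            raise ValueError("first candidate must pass the correctness check")
        return current_best  # correctness broken: keep the current best
    if current_best is None or new.tokens_per_sec > current_best.tokens_per_sec:
        return new
    return current_best  # performance regression: keep the current best
```

Under these assumptions, a candidate that is faster but exceeds the tolerance is never promoted, which mirrors the instruction to fall back to the current best and continue the search. Using tokens per second rather than raw wall time keeps candidates comparable when they produce outputs of different lengths.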