Refactor: simplify dsv4 ratio4 decode compressor#427
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR removes the optional Hadamard/rotate post-processing transformation from the DeepSeek v4 decode compressor kernel. The compressor now writes normalized KV directly to cache without intermediate finalization, eliminating the conditional rotate path and simplifying the normalization sequence. Function signatures, golden reference, and test infrastructure are updated accordingly. ChangesHadamard/Rotate Post-Processing Removal
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request simplifies the ratio-4 overlapping compressor model by removing unused rotation logic, including the rotate parameter, the hadamard tensor, and the associated Hadamard transform block. It also refactors the normalization loop to process only the non-RoPE (NOPE) span, while separately re-deriving the RoPE input slice. Additionally, a --golden-data command-line argument has been added to the test runner. No review comments were provided, so there is no feedback to address.
- Drop the hadamard rotate path (dead since hw-native-sys#424 pinned ROTATE=False): remove the rotate/hadamard params, the kv_hadamard/kv_write branch, the kv_final intermediate, and the matching golden/spec plumbing. kv_and_cache_write now reads normed_kv directly. - rmsnorm_rope: normalize only the NOPE span in the loop and re-derive the RoPE input slice locally from pooled_kv (bit-identical) instead of reading normed_kv back, dropping a same-tensor GM round-trip. - Add a --golden-data CLI option for fixed-input validation runs. - Adapt the decode_attention_csa caller to the new compressor signature (it called the ratio4 compressor with rotate=False); verified PASS. rmsnorm_rope stays on pl.range + pl.at (verified 10/10 on a2a3); pl.spmd / pl.parallel are non-deterministic here, see hw-native-sys#419. Refs hw-native-sys#419
Mirror the ratio4 refactor in decode_compressor_ratio128's rmsnorm_rope spmd body: - Narrow the normalization pipeline to the NOPE span (NOPE_HEAD_DIM // HEAD_TILE) instead of the full HEAD_DIM; the rope span is written once below by the RoPE'd rope_buf. - Re-derive the RoPE input slice locally from pooled_kv with the same inv_rms/gamma instead of reading normed_kv back, removing a same-tensor GM write-then-read round-trip that hits the issue hw-native-sys#419 runtime race. Verified by 5 serial a2a3 NPU runs: kv / compress_state / cmp_kv_cache all PASS.
Summary
ROTATE=False): remove therotate/hadamardparams, thekv_hadamard/kv_writebranch, thekv_finalintermediate, and the matching golden/spec plumbing;kv_and_cache_writenow readsnormed_kvdirectly.rmsnorm_rope: normalize only the NOPE span in the loop and re-derive the RoPE input slice locally frompooled_kv(bit-identical), instead of readingnormed_kvback — drops a same-tensor GM round-trip.--golden-dataCLI option for fixed-input validation runs.rmsnorm_ropestays onpl.range + pl.at(verified 10/10 on a2a3);pl.spmd/pl.parallelare non-deterministic here.Related Issues
Refs #419