⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction and 8x unroll#39
⚡ Thunderbolt: softmax_v6 — Single-FMA range reduction and 8x unroll#39bugparty wants to merge 1 commit into
Conversation
Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
📝 WalkthroughWalkthroughThis PR introduces softmax_v6, a new AVX2 softmax implementation combining single-FMA range reduction with ChangesAVX2 Softmax v6 with Single-FMA Range Reduction
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
ml_kernels/src/kernel_bench.cpp (1)
337-342: ⚡ Quick winAlign function-body brace placement with C++ style guideline.
Line 337 and Line 339 define function bodies with opening braces on the same line; this violates the project’s C/C++ brace rule.
Suggested style-only diff
class SoftmaxV6Benchmark : public SoftmaxBenchmark { public: - const char *name() const override { return "softmax_v6"; } + const char *name() const override + { + return "softmax_v6"; + } - void run() override { + void run() override + { ml_kernels::softmax_v6(inputs_[current_idx_].data(), outputs_[current_idx_].data(), inputs_[0].size()); current_idx_ = (current_idx_ + 1) % pool_size_; } };As per coding guidelines, "Keep braces on their own lines for function bodies".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/kernel_bench.cpp` around lines 337 - 342, The function definitions for name() and run() use opening braces on the same line, violating the project's C/C++ brace rule; update both function bodies (the name() override and the run() override that calls ml_kernels::softmax_v6 and updates current_idx_) to place the opening brace on its own line (i.e., move "{" to the next line for the name() and run() functions) while preserving their existing bodies and behavior (including the call to softmax_v6(...) and the current_idx_ = (current_idx_ + 1) % pool_size_ statement).ml_kernels/src/test_naive_ops.cpp (2)
184-211: ⚡ Quick winEliminate code duplication across softmax version tests.
The test logic for
test_softmax_v6()is nearly identical totest_softmax_v5()(lines 155-182),test_softmax_v4(), andtest_softmax_v3(). This duplication violates the DRY principle and increases maintenance burden.Refactor to use a parameterized helper function that accepts the softmax function pointer and version name.
♻️ Proposed refactor to eliminate duplication
void test_softmax_version( void (*softmax_fn)(const float*, float*, std::size_t), const char* version_name) { std::cout << "Running test_softmax_" << version_name << "..." << std::endl; std::vector<float> input = { -2.0f, -0.5f, 1.0f, 3.0f, 0.0f, 0.0f, 0.0f, 0.0f, 100.0f, 100.0f, -100.0f, -100.0f, 5.0f, -5.0f, 2.0f, -2.0f, 1.1f, 1.2f, 1.3f, 1.4f, -1.1f, -1.2f, -1.3f, -1.4f, 10.0f, 20.0f, 30.0f, 40.0f, -10.0f, -20.0f, -30.0f, -40.0f }; std::vector<float> output_naive(input.size(), 0.0f); std::vector<float> output_version(input.size(), 0.0f); ml_kernels::softmax_naive(input.data(), output_naive.data(), input.size()); softmax_fn(input.data(), output_version.data(), input.size()); float sum = 0.0f; for (std::size_t i = 0; i < input.size(); ++i) { assert(std::fabs(output_naive[i] - output_version[i]) < 1e-4f); sum += output_version[i]; } assert(std::fabs(sum - 1.0f) < 1e-4f); std::cout << "test_softmax_" << version_name << " passed!" << std::endl; } void test_softmax_v5() { test_softmax_version(ml_kernels::softmax_v5, "v5"); } void test_softmax_v6() { test_softmax_version(ml_kernels::softmax_v6, "v6"); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, The test_softmax_v6() duplicates logic used by test_softmax_v3/4/5; introduce a reusable helper like test_softmax_version(void (*softmax_fn)(const float*, float*, std::size_t), const char* version_name) that runs ml_kernels::softmax_naive, invokes the provided softmax_fn (e.g., ml_kernels::softmax_v6), compares outputs and checks the sum, then replace test_softmax_v3/4/5/6 to call this helper with the appropriate function pointer and version name to eliminate duplication while keeping ml_kernels::softmax_naive and ml_kernels::softmax_vN references intact.
186-195: ⚡ Quick winAdd test case for non-8-aligned input size.
The current test uses 32 elements, which is evenly divisible by 8 (the unroll factor). This leaves the scalar tail handling code path untested. Per the review stack context, softmax_v6 includes scalar tail handling for the normalization loop.
Add a test case with a size that is not a multiple of 8 (e.g., 35 or 30 elements) to validate the tail handling logic.
🧪 Example test case with non-aligned size
Add this test after the existing validation:
// Test with non-8-aligned size to validate tail handling { std::vector<float> input_tail = { -2.0f, -0.5f, 1.0f, 3.0f, 0.0f, 100.0f, -100.0f, 5.0f, -5.0f, 2.0f, 1.1f, 1.2f, 1.3f, 1.4f, -1.1f, -1.2f, -1.3f, -1.4f, 10.0f, 20.0f, 30.0f, 40.0f, -10.0f, -20.0f, -30.0f, -40.0f, 0.5f, 0.6f, 0.7f, 0.8f }; // 30 elements (not divisible by 8) std::vector<float> output_naive_tail(input_tail.size(), 0.0f); std::vector<float> output_v6_tail(input_tail.size(), 0.0f); ml_kernels::softmax_naive(input_tail.data(), output_naive_tail.data(), input_tail.size()); ml_kernels::softmax_v6(input_tail.data(), output_v6_tail.data(), input_tail.size()); float sum_tail = 0.0f; for (std::size_t i = 0; i < input_tail.size(); ++i) { assert(std::fabs(output_naive_tail[i] - output_v6_tail[i]) < 1e-4f); sum_tail += output_v6_tail[i]; } assert(std::fabs(sum_tail - 1.0f) < 1e-4f); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/src/test_naive_ops.cpp` around lines 186 - 195, Add a new softmax validation case in test_naive_ops.cpp using a non-8-aligned input size so the scalar tail path in softmax_v6 is exercised. Keep the existing 32-element test, then add a separate input/output block with 30 or 35 elements and compare ml_kernels::softmax_naive against ml_kernels::softmax_v6 using the same assertions as the current test. Use the existing softmax_naive and softmax_v6 helpers to locate the right test section.ml_kernels/include/ml_kernels/softmax.h (1)
505-505: ⚡ Quick winPlace function opening braces on their own lines.
Both new function definitions keep
{on the signature line; this file pattern requires newline brace style for function bodies.♻️ Suggested formatting fix
-inline __m256 exp256_ps_v3(__m256 x) { +inline __m256 exp256_ps_v3(__m256 x) +{ @@ -inline void softmax_v6(const float *input, float *output, std::size_t n) { +inline void softmax_v6(const float *input, float *output, std::size_t n) +{As per coding guidelines,
**/*.{c,cpp,cc,h,hpp}must "Keep braces on their own lines for function bodies".Also applies to: 543-543
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` at line 505, The function definitions currently place the opening brace on the same line as the signature (e.g., exp256_ps_v3), violating the file's newline-brace style; update the function signatures (including the other occurrence around line ~543) so the `{` is moved to its own line directly beneath the signature for each function (adjust signatures like exp256_ps_v3 and the other function name at ~543) to comply with the "Keep braces on their own lines for function bodies" guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.jules/thunderbolt.md:
- Around line 1-5: Update the log entry date to 2026-05-20 and correct the
benchmark metadata: in the Evidence line that mentions N=1048576 (Fixed Memory)
replace the unit and value "GFLOP/s improved from 3.56 to 3.78" with "GB/s
improved from 3.57 to 3.78" so it matches the PR objective; keep references to
the implementation names exp256 and softmax_v5 as-is to preserve context.
---
Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Line 505: The function definitions currently place the opening brace on the
same line as the signature (e.g., exp256_ps_v3), violating the file's
newline-brace style; update the function signatures (including the other
occurrence around line ~543) so the `{` is moved to its own line directly
beneath the signature for each function (adjust signatures like exp256_ps_v3 and
the other function name at ~543) to comply with the "Keep braces on their own
lines for function bodies" guideline.
In `@ml_kernels/src/kernel_bench.cpp`:
- Around line 337-342: The function definitions for name() and run() use opening
braces on the same line, violating the project's C/C++ brace rule; update both
function bodies (the name() override and the run() override that calls
ml_kernels::softmax_v6 and updates current_idx_) to place the opening brace on
its own line (i.e., move "{" to the next line for the name() and run()
functions) while preserving their existing bodies and behavior (including the
call to softmax_v6(...) and the current_idx_ = (current_idx_ + 1) % pool_size_
statement).
In `@ml_kernels/src/test_naive_ops.cpp`:
- Around line 184-211: The test_softmax_v6() duplicates logic used by
test_softmax_v3/4/5; introduce a reusable helper like test_softmax_version(void
(*softmax_fn)(const float*, float*, std::size_t), const char* version_name) that
runs ml_kernels::softmax_naive, invokes the provided softmax_fn (e.g.,
ml_kernels::softmax_v6), compares outputs and checks the sum, then replace
test_softmax_v3/4/5/6 to call this helper with the appropriate function pointer
and version name to eliminate duplication while keeping
ml_kernels::softmax_naive and ml_kernels::softmax_vN references intact.
- Around line 186-195: Add a new softmax validation case in test_naive_ops.cpp
using a non-8-aligned input size so the scalar tail path in softmax_v6 is
exercised. Keep the existing 32-element test, then add a separate input/output
block with 30 or 35 elements and compare ml_kernels::softmax_naive against
ml_kernels::softmax_v6 using the same assertions as the current test. Use the
existing softmax_naive and softmax_v6 helpers to locate the right test section.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f149679a-c58c-4470-a279-d6d71d072821
📒 Files selected for processing (4)
.jules/thunderbolt.mdml_kernels/include/ml_kernels/softmax.hml_kernels/src/kernel_bench.cppml_kernels/src/test_naive_ops.cpp
| ## 2024-05-20 - AVX2 Vectorized Softmax with single-FMA range reduction and 8x max unroll | ||
|
|
||
| **Learning:** Replaces the 2-FMA Cody-Waite range reduction in `exp256` with a single FMA using `ln(2)`, removing an instruction from the critical path while remaining within ML precision tolerances. Additionally, unrolling the max reduction 8x (from 4x) to better saturate execution ports yields measurable throughput improvements over `softmax_v5` implementation on larger inputs and fixed memory configurations (e.g. N=1048576, GFLOP/s improved from 3.56 to 3.78). | ||
|
|
||
| **Evidence:** End-to-end framework benchmarks showed an increase in GFLOP/s for N=1048576 (Fixed Memory) from 3.56 to 3.78 and for N=262144 (Fixed Memory) from 4.00 to 4.18. |
There was a problem hiding this comment.
Fix benchmark metadata consistency (date + unit/value).
Line 1 uses 2024-05-20, but this PR was created on 2026-05-20. Also Line 3/Line 5 cites 3.56 -> 3.78 as GFLOP/s, while the PR objective reports 3.57 -> 3.78 for GB/s at N=1,048,576. Please align the log entry to the measured run metadata to avoid ambiguity.
✏️ Suggested doc patch
-## 2024-05-20 - AVX2 Vectorized Softmax with single-FMA range reduction and 8x max unroll
+## 2026-05-20 - AVX2 Vectorized Softmax with single-FMA range reduction and 8x max unroll
@@
-**Learning:** Replaces the 2-FMA Cody-Waite range reduction in `exp256` with a single FMA using `ln(2)`, removing an instruction from the critical path while remaining within ML precision tolerances. Additionally, unrolling the max reduction 8x (from 4x) to better saturate execution ports yields measurable throughput improvements over `softmax_v5` implementation on larger inputs and fixed memory configurations (e.g. N=1048576, GFLOP/s improved from 3.56 to 3.78).
+**Learning:** Replaces the 2-FMA Cody-Waite range reduction in `exp256` with a single FMA using `ln(2)`, removing an instruction from the critical path while remaining within ML precision tolerances. Additionally, unrolling the max reduction 8x (from 4x) to better saturate execution ports yields measurable throughput improvements over `softmax_v5` on larger fixed-memory inputs (e.g. N=1048576, throughput improved from 3.57 to 3.78 GB/s).
@@
-**Evidence:** End-to-end framework benchmarks showed an increase in GFLOP/s for N=1048576 (Fixed Memory) from 3.56 to 3.78 and for N=262144 (Fixed Memory) from 4.00 to 4.18.
+**Evidence:** End-to-end framework benchmarks showed throughput at N=1048576 (Fixed Memory) improving from 3.57 to 3.78 GB/s, and GFLOP/s at N=262144 (Fixed Memory) improving from 4.00 to 4.18.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.jules/thunderbolt.md around lines 1 - 5, Update the log entry date to
2026-05-20 and correct the benchmark metadata: in the Evidence line that
mentions N=1048576 (Fixed Memory) replace the unit and value "GFLOP/s improved
from 3.56 to 3.78" with "GB/s improved from 3.57 to 3.78" so it matches the PR
objective; keep references to the implementation names exp256 and softmax_v5
as-is to preserve context.
💡 What: The optimization implemented is an AVX2-vectorized
softmax_v6kernel. It improvesexp256_pswith a single-FMA range reduction and unrolls the max/normalization loops 8x.🎯 Why: The previous
exp256implementation used a 2-FMA Cody-Waite range reduction sequence that extended the critical path latency. Furthermore, the max and normalize loops were only unrolled 4x, leaving execution port capacity unused.🏗️ How:
fnmaddwith splitln(2)) with a single FMA (fnmaddusing fullln(2)).📊 Impact: Peak throughput on N=1,048,576 Fixed Memory allocation increased from 3.57 GB/s to 3.78 GB/s. For N=262,144 Fixed Memory, GFLOP/s improved from 4.00 to 4.18.
🖥️ Tested on: Haswell+ architecture, Linux, gcc 13.
🔬 How to reproduce:
./build/ml_kernels/ml_kernel_bench --iters 20 --warmup 5 --filter 'softmax_v[56]'PR created automatically by Jules for task 14369317833739531274 started by @bugparty
Summary by CodeRabbit
New Features
Tests
Documentation