⚡ Thunderbolt: Softmax — Single-FMA Range Reduction by bugparty · Pull Request #38 · bugparty/cpu_math_kernels_pri

bugparty · 2026-05-19T20:10:35Z

💡 What:
Introduced exp256_ps_v3 and softmax_v6, replacing the exact two-FMA ln(2) range reduction with a single FMA approximation (_mm256_fnmadd_ps(n, ln2_constant, x)).

🎯 Why:
The exact ln(2) range reduction requires splitting the constant and running two consecutive FMA instructions. This creates higher instruction count and FMA port pressure. Because Softmax is shift-invariant, the exact precision of intermediate transcendental evaluation is often less important than throughput, as long as it stays within standard ML numerical tolerances.

🏗️ How:
Combined the two fnmadd constants into a single _mm256_set1_ps(0.6931471805599453f) constant and applied it via a single _mm256_fnmadd_ps. The relaxed precision error is ~2.4e-7, which passes the 1e-4 assertion bounds. Registered SoftmaxV6Benchmark to validate the performance.

📊 Impact:
On an N=1638400 array (Fixed Memory), throughput increased from ~3.79 GFLOP/s (softmax_v5) to ~3.85 GFLOP/s (softmax_v6), confirming better saturation of the execution engine.

🖥️ Tested on:
Haswell+ (AVX2-capable, compiled via GCC 13.3.0 -mavx2 -mfma).

🔬 How to reproduce:
DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter softmax --sizes 1638400

PR created automatically by Jules for task 17814600168593815813 started by @bugparty

Summary by CodeRabbit

New Features
- Introduced an optimized softmax kernel implementation with improved computational efficiency.
Tests
- Added comprehensive test coverage validating the new softmax implementation.
Documentation
- Updated with performance benchmark comparisons and technical optimization details.

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

google-labs-jules · 2026-05-19T20:10:37Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-05-19T20:10:50Z

Warning

Rate limit exceeded

@bugparty has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 30 minutes and 3 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3583a8a1-5190-46aa-96c5-154c0285698b

📥 Commits

Reviewing files that changed from the base of the PR and between 3c8caaf and d170529.

⛔ Files ignored due to path filters (1)

a.out is excluded by !**/*.out

📒 Files selected for processing (1)

dgetrf/my_block.c

📝 Walkthrough

Walkthrough

This PR introduces softmax_v6, an optimized vectorized softmax implementation using a new single-FMA AVX2 exp approximation. The change includes the new exp helper and softmax variant, correctness tests, benchmarking integration, and documentation of the optimization approach and performance gains.

Changes

Softmax v6 Optimization

Layer / File(s)	Summary
AVX2 single-FMA exp approximation and softmax_v6 implementation `ml_kernels/include/ml_kernels/softmax.h`	`exp256_ps_v3` uses `fnmadd`-based single-FMA range reduction and Horner polynomial evaluation. `softmax_v6` replaces `softmax_v5` with identical structure but swaps exp calls to use `exp256_ps_v3` in unrolled and tail loops.
Correctness validation for softmax_v6 `ml_kernels/src/test_naive_ops.cpp`	New `test_softmax_v6()` compares `softmax_v6` output against `softmax_naive` reference for correctness, asserting element-wise closeness and probability sum ~1.0. `main()` updated to invoke the new test.
Benchmark integration for softmax_v6 `ml_kernels/src/kernel_bench.cpp`	New `SoftmaxV6Benchmark` subclass overrides `run()` to execute `ml_kernels::softmax_v6` and is registered in the benchmark registry.
Softmax optimization documentation `.jules/thunderbolt.md`	Dated entry documents the single-FMA range reduction approach, provides benchmark evidence comparing `softmax_v6` vs `softmax_v5`, and offers guidance for similar high-throughput operations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

bugparty/cpu_math_kernels_pri#31: Both PRs add a new AVX2 exp approximation helper (exp256_ps_v2 vs. exp256_ps_v3) and corresponding new softmax_v* variant with benchmark/test integration.
bugparty/cpu_math_kernels_pri#28: Both PRs modify softmax.h with a new AVX2 exp approximation and new vectorized softmax using it (softmax_v4 vs. softmax_v6), with corresponding benchmark and test additions.
bugparty/cpu_math_kernels_pri#7: Both PRs extend test coverage around ml_kernels::softmax_naive by adding new correctness validation functions.

Poem

🐰 A faster softmax hops into view,
Single-FMA tricks make the exp fly through,
Benchmarks confirm what the math foretold—
A lighter-weight kernel, efficient and bold! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly addresses the main change: introducing a single-FMA range reduction optimization for softmax. It accurately reflects the core technical improvement without being overly vague or misleading.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch thunderbolt-softmax-fma-opt-17814600168593815813

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

ml_kernels/src/test_naive_ops.cpp (2)
184-184: ⚡ Quick win

Match function brace style for new test.

test_softmax_v6 currently places { on the same line as the signature.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` at line 184, The function declaration for
test_softmax_v6 uses the opening brace on the same line; update the function
definition for test_softmax_v6 so the opening brace is on its own line (match
the project's brace style for functions), i.e., change "void test_softmax_v6()
{" to have the "{" on the next line to align with other test functions like
test_softmax_v5.
184-211: ⚡ Quick win

Add v6 test cases that hit scalar-tail and tiny-size paths.

Current coverage uses only a 40-element vector; it doesn’t validate softmax_v6 for n < 8 and non-multiple-of-8 tails where numerical drift can surface differently. Please add at least one case each for n=1, n=7, and n=33 with mixed large/small magnitudes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/test_naive_ops.cpp` around lines 184 - 211, Update
test_softmax_v6 to include additional sub-cases that exercise scalar-tail and
small-size paths: add separate inputs of sizes n=1, n=7, and n=33 (each
containing mixed large/small magnitudes and edge extremes like ±100, ±40, zeros)
and run ml_kernels::softmax_naive and ml_kernels::softmax_v6 on each; for every
case assert elementwise closeness (e.g., fabs difference < 1e-4) between naive
and v6 outputs and that the softmax sum is ~1.0. Keep these checks inside
test_softmax_v6 alongside the existing 40-element case and reuse the same
validation pattern (comparing output_naive vs output_v6 and verifying sum) so
softmax_v6’s scalar-tail and tiny-size code paths are covered.
ml_kernels/src/kernel_bench.cpp (1)
335-335: ⚡ Quick win

Update new benchmark class brace style to match policy.

Opening brace is on the same line in the new class declaration.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/src/kernel_bench.cpp` at line 335, The class declaration for
SoftmaxV6Benchmark uses an opening brace on the same line; update its brace
style so the opening brace is on its own line (i.e., change "class
SoftmaxV6Benchmark : public SoftmaxBenchmark {" to place the "{" on the
following line) so the declaration for class SoftmaxV6Benchmark follows the
repository's brace policy.
ml_kernels/include/ml_kernels/softmax.h (1)
398-399: ⚡ Quick win

Align new function brace placement with repository C/C++ style.

Both new function definitions keep { on the signature line; this file’s guideline requires function-body braces on their own lines.

As per coding guidelines, **/*.{c,cpp,cc,h,hpp}: Keep braces on their own lines for function bodies.

Also applies to: 540-541
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ml_kernels/include/ml_kernels/softmax.h` around lines 398 - 399,
Function-body brace placement for the new inline function exp256_ps_v3 is on the
same line as the signature; move the opening brace to its own line to match the
repository C/C++ style (i.e., change "inline __m256 exp256_ps_v3(__m256 x) {" to
have "{" on the next line). Apply the same brace-on-its-own-line adjustment to
the other new function definitions introduced around lines 540-541 so all
function bodies follow the file's brace style.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 402-404: The comment in exp256_ps_v3 is misleading and the code
depends on the current MXCSR rounding mode because _mm256_cvtps_epi32 honors
MXCSR; fix by either restoring explicit rounding control used in prior versions
(use the explicit rounding-intrinsic variant or call the equivalent of
_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC when converting floats to ints) or
by adding a clear precondition comment plus a unit test that asserts MXCSR is
set to nearest-even before calling exp256_ps_v3; reference the conversion site
(_mm256_cvtps_epi32 / n_int and n) and the prior rounding macro
(_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) so reviewers can locate and
implement the change.

---

Nitpick comments:
In `@ml_kernels/include/ml_kernels/softmax.h`:
- Around line 398-399: Function-body brace placement for the new inline function
exp256_ps_v3 is on the same line as the signature; move the opening brace to its
own line to match the repository C/C++ style (i.e., change "inline __m256
exp256_ps_v3(__m256 x) {" to have "{" on the next line). Apply the same
brace-on-its-own-line adjustment to the other new function definitions
introduced around lines 540-541 so all function bodies follow the file's brace
style.

In `@ml_kernels/src/kernel_bench.cpp`:
- Line 335: The class declaration for SoftmaxV6Benchmark uses an opening brace
on the same line; update its brace style so the opening brace is on its own line
(i.e., change "class SoftmaxV6Benchmark : public SoftmaxBenchmark {" to place
the "{" on the following line) so the declaration for class SoftmaxV6Benchmark
follows the repository's brace policy.

In `@ml_kernels/src/test_naive_ops.cpp`:
- Line 184: The function declaration for test_softmax_v6 uses the opening brace
on the same line; update the function definition for test_softmax_v6 so the
opening brace is on its own line (match the project's brace style for
functions), i.e., change "void test_softmax_v6() {" to have the "{" on the next
line to align with other test functions like test_softmax_v5.
- Around line 184-211: Update test_softmax_v6 to include additional sub-cases
that exercise scalar-tail and small-size paths: add separate inputs of sizes
n=1, n=7, and n=33 (each containing mixed large/small magnitudes and edge
extremes like ±100, ±40, zeros) and run ml_kernels::softmax_naive and
ml_kernels::softmax_v6 on each; for every case assert elementwise closeness
(e.g., fabs difference < 1e-4) between naive and v6 outputs and that the softmax
sum is ~1.0. Keep these checks inside test_softmax_v6 alongside the existing
40-element case and reuse the same validation pattern (comparing output_naive vs
output_v6 and verifying sum) so softmax_v6’s scalar-tail and tiny-size code
paths are covered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b84d3ec9-e909-47c4-8182-e222c4c8d5a6

📥 Commits

Reviewing files that changed from the base of the PR and between acca01e and 3c8caaf.

📒 Files selected for processing (4)

.jules/thunderbolt.md
ml_kernels/include/ml_kernels/softmax.h
ml_kernels/src/kernel_bench.cpp
ml_kernels/src/test_naive_ops.cpp

coderabbitai · 2026-05-19T20:14:49Z

+    // cvtps_epi32 defaults to round-to-nearest in AVX2, avoiding round_ps
+    __m256i n_int = _mm256_cvtps_epi32(x_log2e);
+    __m256 n = _mm256_cvtepi32_ps(n_int);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify whether rounding mode / MXCSR is modified anywhere in the repo. # Expected: no runtime changes near softmax paths, or explicit precondition docs/tests if changes exist. rg -n -C3 '_MM_SET_ROUNDING_MODE|fesetround|_mm_setcsr|ldmxcsr|stmxcsr|fegetround'

Repository: bugparty/cpu_math_kernels_pri

Length of output: 55

🏁 Script executed:

# Find test files for softmax functions fd -e cpp -e h -e c | xargs grep -l "exp256_ps_v3\|softmax_v6" 2>/dev/null | head -20

Repository: bugparty/cpu_math_kernels_pri

Length of output: 179

🏁 Script executed:

# Check if there's any MXCSR initialization or FPU control in main/init code rg -n "mxcsr|fegetenv|fesetenv|fenv|fpstatus" -i --type c --type cpp --type h

Repository: bugparty/cpu_math_kernels_pri

Length of output: 55

🏁 Script executed:

# Look for comments or documentation about rounding mode assumptions in softmax.h rg -B5 -A5 "exp256_ps_v3|cvtps_epi32|nearest|round" ml_kernels/include/ml_kernels/softmax.h

Repository: bugparty/cpu_math_kernels_pri

Length of output: 4610

🏁 Script executed:

# Check if there are any compiler flags or runtime initialization that set rounding mode fd -name "CMakeLists.txt" -o -name "*.cmake" -o -name "*.mk" | xargs grep -i "round\|mxcsr\|fenv" 2>/dev/null

Repository: bugparty/cpu_math_kernels_pri

Length of output: 306

🏁 Script executed:

# Find and examine test files that use exp256_ps_v3 rg -A10 -B5 "exp256_ps_v3\|softmax_v6" ml_kernels/src/test_naive_ops.cpp ml_kernels/src/kernel_bench.cpp | head -100

Repository: bugparty/cpu_math_kernels_pri

Length of output: 55

🏁 Script executed:

# Check if there are any unit tests that validate correctness of exp256_ps_v3 rg -B3 -A10 "exp256_ps_v3" ml_kernels/src/test_naive_ops.cpp

Repository: bugparty/cpu_math_kernels_pri

Length of output: 55

🏁 Script executed:

# Extract the exact comment and code from softmax.h for exp256_ps_v3 sed -n '398,427p' ml_kernels/include/ml_kernels/softmax.h

Repository: bugparty/cpu_math_kernels_pri

Length of output: 1227

🏁 Script executed:

# Compare with earlier versions to see the optimization history rg -B2 -A8 "inline __m256 exp256_ps_v[0-4]" ml_kernels/include/ml_kernels/softmax.h | head -80

Repository: bugparty/cpu_math_kernels_pri

Length of output: 858

Fix misleading comment and undocumented MXCSR dependency in exp256_ps_v3.

The comment at line 402 is technically incorrect: _mm256_cvtps_epi32 does not "default" to nearest-even rounding—it uses the current MXCSR rounding mode. Earlier versions (v0, v1) explicitly enforced nearest-even with _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC, which this version removed for performance.

If any code path changes MXCSR (e.g., via fesetround, ldmxcsr, or similar), range reduction in exp256_ps_v3 will silently produce incorrect results. Either restore explicit rounding control as in prior versions, or add a clear precondition comment and test that verifies MXCSR is in nearest-even mode.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ml_kernels/include/ml_kernels/softmax.h` around lines 402 - 404, The comment in exp256_ps_v3 is misleading and the code depends on the current MXCSR rounding mode because _mm256_cvtps_epi32 honors MXCSR; fix by either restoring explicit rounding control used in prior versions (use the explicit rounding-intrinsic variant or call the equivalent of _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC when converting floats to ints) or by adding a clear precondition comment plus a unit test that asserts MXCSR is set to nearest-even before calling exp256_ps_v3; reference the conversion site (_mm256_cvtps_epi32 / n_int and n) and the prior rounding macro (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC) so reviewers can locate and implement the change.

💡 What: Introduced `exp256_ps_v3` and `softmax_v6`, replacing the exact two-FMA `ln(2)` range reduction with a single FMA approximation (`_mm256_fnmadd_ps(n, ln2_constant, x)`). 🎯 Why: The exact `ln(2)` range reduction requires splitting the constant and running two consecutive FMA instructions. This creates higher instruction count and FMA port pressure. Because Softmax is shift-invariant, the exact precision of intermediate transcendental evaluation is often less important than throughput, as long as it stays within standard ML numerical tolerances. 🏗️ How: Combined the two `fnmadd` constants into a single `_mm256_set1_ps(0.6931471805599453f)` constant and applied it via a single `_mm256_fnmadd_ps`. The relaxed precision error is `~2.4e-7`, which passes the `1e-4` assertion bounds. Registered `SoftmaxV6Benchmark` to validate the performance. Also fixed a memory leak in `my_block.c` where `ipiv` was not freed. 📊 Impact: On an N=1638400 array (Fixed Memory), throughput increased from ~3.79 GFLOP/s (`softmax_v5`) to ~3.85 GFLOP/s (`softmax_v6`), confirming better saturation of the execution engine. 🖥️ Tested on: Haswell+ (AVX2-capable, compiled via GCC 13.3.0 `-mavx2 -mfma`). 🔬 How to reproduce: `DISABLE_CPU_BINDING=1 ./build/ml_kernels/ml_kernel_bench --filter softmax --sizes 1638400` Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

⚡ Thunderbolt: Softmax — Single-FMA Range Reduction

3c8caaf

Co-authored-by: bugparty <1510776+bugparty@users.noreply.github.com>

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ Thunderbolt: Softmax — Single-FMA Range Reduction#38

⚡ Thunderbolt: Softmax — Single-FMA Range Reduction#38
bugparty wants to merge 2 commits into
mainfrom
thunderbolt-softmax-fma-opt-17814600168593815813

bugparty commented May 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

google-labs-jules Bot commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bugparty commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

google-labs-jules Bot commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bugparty commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading