Added Vectorized MinGRU Forward + Backward and test suites by eshau · Pull Request #553 · PufferAI/PufferLib

eshau · 2026-05-03T01:39:49Z

Hi!

Summary

We created vectorized versions of the MinGRU forward / backward kernels in src/models.cu that are slightly faster for large enough (B, T, H). Essentially, instead of each thread doing serial work over the T elements of a single column, each thread works over h consecutive columns of T elements. In practice this is better because it reduces the number of memory instructions (through float2 / float4 loads and stores) and allows vectorized float2 math for some operations, at the cost of increased register pressure. There are three variants: vec32 (loads 32 bits / 2 bfloat16s), vec64 (loads 64 bits / 4 bfloat16s), and vec128 (loads 128 bits / 8 bfloat16s).
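To make the access pattern concrete, here is a minimal host-side C++ sketch of the vec64 idea. It is not the CUDA code from this PR. It assumes a row-major (T, H) layout, a little-endian host, and a generic recurrence h_t = a_t * h_{t-1} + b_t standing in for the actual MinGRU update. The names `scan_scalar` and `scan_vec4` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// bfloat16 is the upper 16 bits of an IEEE-754 float32.
inline float bf16_to_float(uint16_t v) {
    uint32_t bits = static_cast<uint32_t>(v) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Truncating conversion (no rounding), good enough for this illustration.
inline uint16_t float_to_bf16_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return static_cast<uint16_t>(bits >> 16);
}

// Scalar pattern: one column per "thread", one 16-bit load per operand per step.
float scan_scalar(const std::vector<uint16_t>& a, const std::vector<uint16_t>& b,
                  int T, int H, int col) {
    float h = 0.0f;
    for (int t = 0; t < T; ++t) {
        h = bf16_to_float(a[t * H + col]) * h + bf16_to_float(b[t * H + col]);
    }
    return h;
}

// vec64-style pattern: four consecutive columns per "thread", one 64-bit load
// per operand per step, at the cost of four live accumulators (more registers).
// Assumes col0 + 3 < H and a little-endian host.
void scan_vec4(const std::vector<uint16_t>& a, const std::vector<uint16_t>& b,
               int T, int H, int col0, float out[4]) {
    float h[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int t = 0; t < T; ++t) {
        uint64_t pa, pb;
        std::memcpy(&pa, &a[t * H + col0], sizeof(pa));  // 4 bf16 values at once
        std::memcpy(&pb, &b[t * H + col0], sizeof(pb));
        for (int k = 0; k < 4; ++k) {
            uint16_t ak = static_cast<uint16_t>((pa >> (16 * k)) & 0xFFFFu);
            uint16_t bk = static_cast<uint16_t>((pb >> (16 * k)) & 0xFFFFu);
            h[k] = bf16_to_float(ak) * h[k] + bf16_to_float(bk);
        }
    }
    for (int k = 0; k < 4; ++k) out[k] = h[k];
}
```

Both functions perform the same arithmetic per column. The only difference is how many memory operations are issued per step, which is the trade-off described above.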

Results

The following tables show the speedup for the combined forward + backward pass:

We also added a kernel selector that chooses the checkpoint interval as well as the optimal forward and backward kernels depending on (B, T, H). The choices are based on the following three depth-2 decision trees (all numbers are powers of 2):
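As a rough illustration of what a depth-2 decision tree over (B, T, H) could look like in host code, here is a hedged C++ sketch. The `Depth2Tree` type, the `Variant` enum, and every threshold in `kExampleFwd` are hypothetical placeholders, not the trees or values from this PR.

```cpp
#include <array>
#include <cstdint>

enum class Variant { Scalar, Vec32, Vec64, Vec128 };

// One split: compare a single feature against a threshold.
// feature: 0 = B, 1 = T, 2 = H. Take the right branch if value >= threshold.
struct Split {
    int feature;
    int64_t threshold;
};

// A depth-2 tree: a root split, one split per child, and four leaves
// ordered as left-left, left-right, right-left, right-right.
template <typename Leaf>
struct Depth2Tree {
    Split root;
    Split left;
    Split right;
    std::array<Leaf, 4> leaves;

    Leaf eval(int64_t B, int64_t T, int64_t H) const {
        const int64_t x[3] = {B, T, H};
        const bool go_right = x[root.feature] >= root.threshold;
        const Split& child = go_right ? right : left;
        const bool child_right = x[child.feature] >= child.threshold;
        return leaves[(go_right ? 2 : 0) + (child_right ? 1 : 0)];
    }
};

// PLACEHOLDER tree for the forward kernel. The thresholds are made up
// for illustration and do not come from the PR's measurements.
static const Depth2Tree<Variant> kExampleFwd = {
    {2, 256},   // root:  H >= 256 ?
    {1, 64},    // left:  T >= 64 ?
    {0, 1024},  // right: B >= 1024 ?
    {{Variant::Scalar, Variant::Vec32, Variant::Vec64, Variant::Vec128}},
};
```

Under this reading, a selector would hold three such trees, one each for the checkpoint interval (with an integer leaf type), the forward variant, and the backward variant, and evaluate all three for a given (B, T, H).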

The resulting selected forward + backward MinGRU kernels have the following speedups:

Tests

We have three different additional tests in tests/profile_kernels.cu:

fusedscan_correctness: Checks correctness of vec32, vec64, and vec128 over all (B, T, H) combinations.
fusedscan_sweep: Benchmarks speed over all (B, T, H), checkpoint intervals, and kernel variants.
fusedscan_selector_bench: Benchmarks speed over all (B, T, H) against the baseline (scalar load with checkpoint interval 4).
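For readers unfamiliar with this style of test, the two building blocks of a correctness sweep can be sketched in host C++ as follows. This is an assumption about the general shape of such a test, not the code in tests/profile_kernels.cu. The helper names `pow2_shapes` and `allclose`, and any bounds or tolerances passed to them, are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Shape {
    int64_t B, T, H;
};

// Enumerate every (B, T, H) where each dimension is a power of two in [lo, hi].
// lo must itself be a power of two and at least 1.
std::vector<Shape> pow2_shapes(int64_t lo, int64_t hi) {
    std::vector<Shape> out;
    for (int64_t B = lo; B <= hi; B *= 2)
        for (int64_t T = lo; T <= hi; T *= 2)
            for (int64_t H = lo; H <= hi; H *= 2)
                out.push_back({B, T, H});
    return out;
}

// Elementwise check |x - ref| <= atol + rtol * |ref|.
// The negated comparison also rejects NaN values.
bool allclose(const std::vector<float>& x, const std::vector<float>& ref,
              float rtol, float atol) {
    if (x.size() != ref.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!(std::fabs(x[i] - ref[i]) <= atol + rtol * std::fabs(ref[i]))) {
            return false;
        }
    }
    return true;
}
```

A sweep would then run the scalar kernel and each vectorized variant for every shape and compare the outputs with `allclose`, using a tolerance loose enough for bfloat16 inputs.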

…el selector depending on BTH

Merge with Main

drQedwards · 2026-05-06T17:06:13Z

Based. Should be considered for merge.

I will not elaborate further.

drQedwards · 2026-05-06T17:08:23Z

Obviously we can do further vectorization graphing on Kv values.

But that's whatever nonsense I'm adding in as a fourth variable to this.

drQedwards · 2026-05-06T17:35:46Z

For a layer and kernel?

Now take it out to the Q layer

https://x.com/danadvantage/status/2051733458116862320?s=46

eshau and others added 2 commits May 2, 2026 21:08

Add vectorized mingru scan forward and backwards + profiling and kern…

e25abaf

…el selector depending on BTH

Merge pull request #1 from eshau/mingru_scan_vec_load

6040ddd

Merge with Main

eshau wants to merge 2 commits into PufferAI:4.0 from eshau:4.0