
Added Vectorized MinGRU Forward + Backward and test suites #553

Open
eshau wants to merge 2 commits into PufferAI:4.0 from eshau:4.0

Conversation


@eshau eshau commented May 3, 2026

Hi!

Summary

We created vectorized versions of the MinGRU forward / backward kernels in src/models.cu that are slightly faster for large enough (B, T, H). Instead of doing serial work over T elements in each thread, each thread does work over h consecutive columns of T elements. In practice this is faster because it reduces the number of memory instructions (via float2 / float4 loads and stores) and enables vectorized float2 math for some operations, at the cost of increased register pressure. We have three variants: vec32 (load 32 bits / 2 bfloat16s), vec64 (load 64 bits / 4 bfloat16s), and vec128 (load 128 bits / 8 bfloat16s).

Results

The following tables show the speedups for forward + backward:
[Image: forward + backward speedup tables over (B, T, H)]

We also added a kernel selector that chooses the checkpoint interval as well as optimal forwards and backwards depending on (B, T, H). These are based on the following three depth-2 decision trees (all numbers are powers of 2):
[Images: three depth-2 decision trees for checkpoint interval, forward kernel, and backward kernel selection]

The resulting selected forward + backward MinGRU kernels have the following speedups:
[Image: speedup table for the selector-chosen forward + backward kernels]

Tests

We added three tests in tests/profile_kernels.cu:

  • fusedscan_correctness: Checks correctness of vec32, vec64, and vec128 over all (B, T, H) combinations.
  • fusedscan_sweep: Benchmarks speed over all (B, T, H), checkpoint intervals, and kernel variants.
  • fusedscan_selector_bench: Benchmarks speed over all (B, T, H) against the baseline (scalar loads with checkpoint interval 4).

@drQedwards

Based. Should be considered for merge.

I will not elaborate further.

@drQedwards

Obviously we can do further vectorization graphing on Kv values.

But that's whatever nonsense I'm adding in as a fourth variable to this.

@drQedwards

For a layer and kernel?

Now take it out to the Q layer

https://x.com/danadvantage/status/2051733458116862320?s=46

