perf: move to seq-parallel atomic reduction in bwd_b by TimoImhof · Pull Request #3 · TimoImhof/fastaf

TimoImhof · 2026-04-12T08:53:14Z

The bwd_b kernel is currently the primary bottleneck in the fused backward pass.

It iterates over the sequence dimension ($S$) within a single thread block:

# dL/d(out).T @ (X @ A.T)
for s in range(0, tl.cdiv(dL_dout_S, BLOCK_S)):
    # ... inner = X @ A.T ...
    # ... _total_acc = tl.dot(dL_out, _interm_tile) ...

This leads to redundant memory traffic: both $X$ and $A$ are re-loaded from global memory $\lceil S / \text{BLOCK\_S} \rceil$ times, once per loop iteration. Since $S$ is typically the largest dimension, this puts substantial, unnecessary pressure on VRAM bandwidth.
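To make the trip count concrete, here is a small sketch of the ceiling division that governs the loop above. The values of S and BLOCK_S are hypothetical and chosen only for illustration; they are not taken from this PR.

```python
def cdiv(n: int, d: int) -> int:
    """Ceiling division, the same rounding tl.cdiv performs."""
    return (n + d - 1) // d


def loop_trip_count(seq_len: int, block_s: int) -> int:
    """Iterations the sequential loop over S needs to cover seq_len rows."""
    return cdiv(seq_len, block_s)


# Hypothetical sizes, for illustration only.
S, BLOCK_S = 4096, 64
print(loop_trip_count(S, BLOCK_S))  # 64
```

Under these assumed sizes, each thread block walks 64 iterations and issues its global-memory loads on every one of them.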

Proposed Fix: Sequence-Parallelism

Mirroring the successful optimization of bwd_a, this PR refactors bwd_b to:

Parallelize over $S$: Map the grid to the sequence dimension so that multiple blocks process chunks of the sequence concurrently.
Eliminate Redundant Loads: Each block loads its chunk of $X$ and $A$ exactly once.
Atomic Accumulation: Use tl.atomic_add to aggregate the per-block partial gradients into the final gradient of $B$.
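The three steps above can be emulated on the CPU to check that summing per-chunk partial products reproduces the full reduction. This is a minimal pure-Python sketch, not the kernel from this PR: the function names are hypothetical, the shapes ($X$ as S×K, $A$ as R×K, dL/d(out) as S×N) are assumed from the expression `dL/d(out).T @ (X @ A.T)`, and the in-place `+=` merely stands in for `tl.atomic_add`.

```python
import random


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grad_b_reference(dl_dout, x, a):
    """Full reduction over S in one pass: dL/d(out).T @ (X @ A.T)."""
    inner = matmul(x, transpose(a))            # (S, R)
    return matmul(transpose(dl_dout), inner)   # (N, R)


def grad_b_seq_parallel(dl_dout, x, a, block_s):
    """Emulates one 'program' per chunk of S; each adds its partial result."""
    n, r = len(dl_dout[0]), len(a)
    grad_b = [[0.0] * r for _ in range(n)]     # output must start at zero
    a_t = transpose(a)
    for start in range(0, len(x), block_s):    # one iteration == one block
        x_chunk = x[start:start + block_s]     # chunk is read exactly once
        dl_chunk = dl_dout[start:start + block_s]
        partial = matmul(transpose(dl_chunk), matmul(x_chunk, a_t))
        for i in range(n):
            for j in range(r):
                grad_b[i][j] += partial[i][j]  # stand-in for tl.atomic_add
    return grad_b


def random_int_matrix(rows, cols, rng):
    # Integer entries keep every sum exact, so results compare with ==.
    return [[rng.randint(-3, 3) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
S, K, R, N = 10, 4, 2, 3                       # hypothetical sizes
x = random_int_matrix(S, K, rng)
a = random_int_matrix(R, K, rng)
dl_dout = random_int_matrix(S, N, rng)
print(grad_b_seq_parallel(dl_dout, x, a, block_s=4) == grad_b_reference(dl_dout, x, a))
```

Two caveats apply when carrying this over to a GPU kernel: the output buffer has to be zero-initialized before launch, and because floating-point addition is not associative and the order of atomic adds is unspecified, results may differ in the last bits from run to run.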

problem statement

743e1d3

TimoImhof wants to merge 1 commit into main from perf/bwd_b
