Add FMA function for the fltflt data type #1123

tbensonatl · 2026-01-24T00:45:54Z

fltflt_fma() performs a * b + c for fltflt types more efficiently than a fltflt_mul() followed by a fltflt_add(). The fused function can perform one fewer normalization than the separate functions.
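To make the saved normalization concrete, here is a minimal host-side C++ sketch of the idea. It is not the code from `fltflt.h`: the `ff` struct and every function name below are hypothetical stand-ins, and `std::fma` stands in for the CUDA intrinsic. The separate path normalizes once in the multiply and twice in the add. The fused path passes the unnormalized product straight into the addition and normalizes only twice.

```cpp
#include <cmath>

// Hypothetical stand-in for a float-float value: represents hi + lo.
struct ff { float hi; float lo; };

// Knuth two-sum: the returned hi + lo equals a + b exactly.
inline ff two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    float e = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// Dekker fast two-sum, used here as the renormalization step.
inline ff fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}

// Exact product of two floats: p plus the rounding error recovered by an FMA.
inline ff two_prod(float a, float b) {
    float p = a * b;
    return {p, std::fma(a, b, -p)};
}

// Unnormalized product a * b; the tiny a.lo * b.lo term is dropped.
inline ff mul_unnormalized(ff a, ff b) {
    ff p = two_prod(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);
    p.lo = std::fma(a.lo, b.hi, p.lo);
    return p;
}

inline ff ff_mul(ff a, ff b) {
    ff p = mul_unnormalized(a, b);
    return fast_two_sum(p.hi, p.lo);         // normalization 1 of 3
}

inline ff ff_add(ff a, ff b) {
    ff s = two_sum(a.hi, b.hi);
    s = fast_two_sum(s.hi, s.lo + a.lo);     // normalization 2 of 3
    return fast_two_sum(s.hi, s.lo + b.lo);  // normalization 3 of 3
}

// Fused path: the product is consumed without being normalized first.
inline ff ff_fma(ff a, ff b, ff c) {
    ff p = mul_unnormalized(a, b);
    ff s = two_sum(p.hi, c.hi);
    s = fast_two_sum(s.hi, s.lo + p.lo);     // normalization 1 of 2
    return fast_two_sum(s.hi, s.lo + c.lo);  // normalization 2 of 2
}

// Helpers for experimenting: split a double into hi/lo and recombine.
inline ff from_double(double d) {
    float hi = static_cast<float>(d);
    return {hi, static_cast<float>(d - hi)};
}
inline double to_double(ff x) { return static_cast<double>(x.hi) + x.lo; }
```

Under these assumptions, `ff_fma(a, b, c)` and `ff_add(ff_mul(a, b), c)` should agree with a double-precision reference to well beyond single precision; the fused version simply does less work to get there.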

This PR also switches from function names like fltflt_add_float(fltflt, float) to overloads of fltflt_add(). The suffixed names were intended to be more easily usable in a C context, but the file now contains many other C++ features (ctors, conversion operators, comparison operators, etc.).
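As an illustration of the naming change, the sketch below shows how one overloaded name can cover the mixed-argument cases that previously needed a suffix. The names mirror those in the description, but everything lives in a hypothetical `sketch` namespace and the function bodies are assumptions, not the contents of `fltflt.h`.

```cpp
#include <cmath>

namespace sketch {

// Stand-in for the real type: represents hi + lo.
struct fltflt { float hi; float lo; };

// Knuth two-sum: the returned hi + lo equals a + b exactly.
inline fltflt two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    float e = (a - (s - bb)) + (b - bb);
    return {s, e};
}

// Dekker fast two-sum, used as the renormalization step.
inline fltflt fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}

// fltflt + fltflt
inline fltflt fltflt_add(fltflt a, fltflt b) {
    fltflt s = two_sum(a.hi, b.hi);
    s = fast_two_sum(s.hi, s.lo + a.lo);
    return fast_two_sum(s.hi, s.lo + b.lo);
}

// fltflt + float (the case a suffixed name such as fltflt_add_float covered).
// The float operand has no low word, so one accumulate step disappears.
inline fltflt fltflt_add(fltflt a, float b) {
    fltflt s = two_sum(a.hi, b);
    return fast_two_sum(s.hi, s.lo + a.lo);
}

// float + fltflt: addition commutes, so forward to the overload above.
inline fltflt fltflt_add(float a, fltflt b) { return fltflt_add(b, a); }

}  // namespace sketch
```

Call sites then read the same regardless of operand types, for example `sketch::fltflt_add(x, 2.0f)` and `sketch::fltflt_add(x, y)`, and the compiler selects the cheaper mixed-type body when one argument is a plain `float`.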

Signed-off-by: Thomas Benson <tbenson@nvidia.com>

copy-pr-bot · 2026-01-24T00:45:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-01-24T00:48:17Z

Greptile Summary

Added fused multiply-add (fltflt_fma()) function for the fltflt data type with 6 optimized overloads supporting mixed float/fltflt arguments, and refactored existing function names from explicit type suffixes (e.g., fltflt_add_float()) to C++ function overloads (e.g., fltflt_add()).

Key improvements:

- The FMA function computes a * b + c with 2 normalizations instead of the 3 needed by a separate multiply followed by an add
- Applied in ComputeRangeToPixelFloatFloat() for SAR backprojection distance calculations
- Test coverage validates 44+ mantissa bits of precision across all overload combinations
- The switch to overloads makes the API more consistent and more idiomatic C++
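One way to express a "44+ mantissa bits" criterion is to convert the relative error against a double-precision reference into a bit count. The helper below is a hypothetical sketch of that idea; the actual checks in `FloatFloatTests.cu` may be written differently, and `ff`, `from_double`, `to_double`, and `matching_mantissa_bits` are names invented for this example.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical stand-in for a float-float value: represents hi + lo.
struct ff { float hi; float lo; };

inline ff from_double(double d) {
    float hi = static_cast<float>(d);
    return {hi, static_cast<float>(d - hi)};
}
inline double to_double(ff x) { return static_cast<double>(x.hi) + x.lo; }

// Estimates how many leading mantissa bits of `got` agree with `ref`,
// computed as -log2(relative error) and clamped to [0, 53].
inline double matching_mantissa_bits(double ref, double got) {
    const double kMaxBits = 53.0;  // mantissa width of the double reference
    if (got == ref) return kMaxBits;
    if (ref == 0.0) return 0.0;    // relative error is undefined at zero
    double rel = std::fabs((got - ref) / ref);
    return std::min(kMaxBits, std::max(0.0, -std::log2(rel)));
}
```

With this measure, a value rounded to a single `float` agrees with its double reference to roughly 24 to 25 bits, while the same value stored as a hi/lo pair clears the 44-bit threshold.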

Confidence Score: 5/5

- This PR is safe to merge with high confidence
- The implementation follows established algorithms from Thall's paper, includes comprehensive test coverage for all overloads, maintains backward compatibility through the refactoring, and demonstrates practical usage in SAR processing. The FMA optimization is mathematically sound and performance-improving.
- No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| include/matx/kernels/fltflt.h | Adds `fltflt_fma()` function with 6 overloads for mixed float/fltflt types, and refactors function names from `fltflt_add_float()` style to overloaded `fltflt_add()` for consistency |
| include/matx/kernels/sar_bp.cuh | Updated `ComputeRangeToPixelFloatFloat()` to use new `fltflt_fma()` for improved performance computing distance sqrt(dx² + dy² + dz²) |
| test/00_misc/FloatFloatTests.cu | Added comprehensive test coverage for `fltflt_fma()` with 6 different overload combinations, verifying 44+ mantissa bits accuracy |

Sequence Diagram

sequenceDiagram
    participant User
    participant ComputeRangeToPixel
    participant fltflt_fma
    participant fltflt_two_prod_fma
    participant fltflt_two_sum
    participant fltflt_fast_two_sum
    participant fltflt_sqrt

    User->>ComputeRangeToPixel: Compute distance
    ComputeRangeToPixel->>fltflt_fma: fltflt_fma(dx, dx, dy * dy)
    Note over fltflt_fma: Compute dx² + dy²
    fltflt_fma->>fltflt_two_prod_fma: Multiply a.hi * b.hi
    fltflt_two_prod_fma-->>fltflt_fma: Return product with error term
    fltflt_fma->>fltflt_fma: Add cross terms with fmaf_rn()
    fltflt_fma->>fltflt_two_sum: Add product to c (skip intermediate normalization)
    fltflt_two_sum-->>fltflt_fma: Return sum with error term
    fltflt_fma->>fltflt_fma: Add p.lo component
    fltflt_fma->>fltflt_fast_two_sum: Normalize once
    fltflt_fast_two_sum-->>fltflt_fma: Return normalized result
    fltflt_fma->>fltflt_fma: Add c.lo component
    fltflt_fma->>fltflt_fast_two_sum: Final normalization
    fltflt_fast_two_sum-->>fltflt_fma: Return final result
    fltflt_fma-->>ComputeRangeToPixel: dx² + dy²
    ComputeRangeToPixel->>fltflt_fma: fltflt_fma(dz, dz, dx2dy2)
    Note over fltflt_fma: Compute dz² + (dx² + dy²)
    fltflt_fma-->>ComputeRangeToPixel: dx² + dy² + dz²
    ComputeRangeToPixel->>fltflt_sqrt: sqrt(dx² + dy² + dz²)
    fltflt_sqrt-->>ComputeRangeToPixel: Final distance
    ComputeRangeToPixel-->>User: Return range to pixel
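The call sequence in the diagram can be sketched in host-side C++ as follows. This is an illustration under stated assumptions rather than the code in `sar_bp.cuh`: the `ff` type, the helper names, and `range_sketch` are hypothetical, `std::fma` replaces the device intrinsic, and the square root uses a single Newton correction chosen for this example, which may differ from the real `fltflt_sqrt`.

```cpp
#include <cmath>

// Hypothetical stand-in for a float-float value: represents hi + lo.
struct ff { float hi; float lo; };

// Knuth two-sum: the returned hi + lo equals a + b exactly.
inline ff two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}
// Dekker fast two-sum, used as the renormalization step.
inline ff fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}
// Exact product of two floats via a fused multiply-add.
inline ff two_prod(float a, float b) {
    float p = a * b;
    return {p, std::fma(a, b, -p)};
}
// Unnormalized product; the a.lo * b.lo term is dropped.
inline ff mul_unnormalized(ff a, ff b) {
    ff p = two_prod(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);
    p.lo = std::fma(a.lo, b.hi, p.lo);
    return p;
}
inline ff ff_mul(ff a, ff b) {
    ff p = mul_unnormalized(a, b);
    return fast_two_sum(p.hi, p.lo);
}
// a * b + c with two normalizations, following the diagram's steps.
inline ff ff_fma(ff a, ff b, ff c) {
    ff p = mul_unnormalized(a, b);
    ff s = two_sum(p.hi, c.hi);
    s = fast_two_sum(s.hi, s.lo + p.lo);
    return fast_two_sum(s.hi, s.lo + c.lo);
}
// Assumed sqrt: float estimate x plus one Newton correction (a - x*x) / (2x).
inline ff ff_sqrt(ff a) {
    if (a.hi <= 0.0f) return {0.0f, 0.0f};
    float x = std::sqrt(a.hi);
    ff sq = two_prod(x, x);                         // x*x without rounding loss
    float residual = ((a.hi - sq.hi) - sq.lo) + a.lo;
    return fast_two_sum(x, residual / (2.0f * x));
}

// sqrt(dx^2 + dy^2 + dz^2) using the two chained FMA calls from the diagram.
inline ff range_sketch(ff dx, ff dy, ff dz) {
    ff dx2dy2 = ff_fma(dx, dx, ff_mul(dy, dy));     // dx^2 + dy^2
    ff sum = ff_fma(dz, dz, dx2dy2);                // dz^2 + (dx^2 + dy^2)
    return ff_sqrt(sum);
}

// Helpers for experimenting: split a double into hi/lo and recombine.
inline ff from_double(double d) {
    float hi = static_cast<float>(d);
    return {hi, static_cast<float>(d - hi)};
}
inline double to_double(ff x) { return static_cast<double>(x.hi) + x.lo; }
```

Only one explicit multiply remains (for dy²); the other two squares are folded into the FMA calls, which is where the saved normalizations come from.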

tbensonatl requested a review from cliffburdick January 24, 2026 00:45

tbensonatl self-assigned this Jan 24, 2026
