@tbensonatl
Collaborator

fltflt_fma() performs a * b + c for fltflt types more efficiently than a fltflt_mul() followed by a fltflt_add(). The fused function can perform one fewer normalization than the separate functions.
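To make the normalization count concrete, here is a minimal host-side sketch of the idea. It follows the steps listed in the sequence diagram further down this page and is not the code in fltflt.h: the `ff` struct and every `ff_*` / `two_*` name are hypothetical stand-ins, and `std::fma` is used in place of the device FMA intrinsic.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for fltflt: the represented value is hi + lo.
struct ff {
    float hi;
    float lo;
};

inline double to_double(ff x) {
    return static_cast<double>(x.hi) + static_cast<double>(x.lo);
}

// Error-free product: hi + lo == a * b exactly (barring overflow/underflow).
inline ff two_prod_fma(float a, float b) {
    float p = a * b;
    return {p, std::fma(a, b, -p)};
}

// Error-free sum (Knuth): hi + lo == a + b exactly.
inline ff two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}

// Error-free sum that assumes |a| >= |b|; this is the normalization step.
inline ff fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}

// Separate multiply: 1 normalization.
inline ff ff_mul(ff a, ff b) {
    ff p = two_prod_fma(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);  // cross terms
    p.lo = std::fma(a.lo, b.hi, p.lo);
    return fast_two_sum(p.hi, p.lo);    // normalization #1
}

// Separate add: 2 normalizations.
inline ff ff_add(ff a, ff b) {
    ff s = two_sum(a.hi, b.hi);
    ff t = two_sum(a.lo, b.lo);
    s.lo += t.hi;
    s = fast_two_sum(s.hi, s.lo);       // normalization #2
    s.lo += t.lo;
    return fast_two_sum(s.hi, s.lo);    // normalization #3
}

// Fused a * b + c: the product is consumed unnormalized, so only 2 normalizations.
inline ff ff_fma(ff a, ff b, ff c) {
    ff p = two_prod_fma(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);  // cross terms
    p.lo = std::fma(a.lo, b.hi, p.lo);
    ff s = two_sum(p.hi, c.hi);         // no intermediate normalization of p
    s.lo += p.lo;
    s = fast_two_sum(s.hi, s.lo);       // normalization #1
    s.lo += c.lo;
    return fast_two_sum(s.hi, s.lo);    // normalization #2
}
```

In this sketch `ff_fma(a, b, c)` and `ff_add(ff_mul(a, b), c)` are expected to agree to well within float-float precision; the fused path simply skips the `fast_two_sum` that `ff_mul` would otherwise perform before the add.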

This PR also switches from function names like fltflt_add_float(fltflt, float) to overloads of fltflt_add(). The suffixed names were intended to be more easily usable in a C context, but the file now contains many other C++ features (ctors, conversion operators, comparison operators, etc.).
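As an illustration of the naming change, the sketch below shows one overload set replacing a suffixed name. The `ff` type and the `ff_add` overloads are hypothetical stand-ins written for this example; the real signatures, qualifiers, and bodies in fltflt.h may differ.

```cpp
#include <cassert>

// Hypothetical stand-in for fltflt: the represented value is hi + lo.
struct ff {
    float hi;
    float lo;
};

// Error-free sum (Knuth): hi + lo == a + b exactly.
inline ff two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}

// Error-free sum that assumes |a| >= |b|; used as the normalization step.
inline ff fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}

// Old style (C-friendly): one name per argument combination, e.g.
//   ff ff_add_float(ff a, float b);
// New style: a single name, with the compiler selecting the overload.

inline ff ff_add(ff a, ff b) {
    ff s = two_sum(a.hi, b.hi);
    ff t = two_sum(a.lo, b.lo);
    s.lo += t.hi;
    s = fast_two_sum(s.hi, s.lo);
    s.lo += t.lo;
    return fast_two_sum(s.hi, s.lo);
}

// Cheaper path when one operand is a plain float (it has no lo word).
inline ff ff_add(ff a, float b) {
    ff s = two_sum(a.hi, b);
    s.lo += a.lo;
    return fast_two_sum(s.hi, s.lo);
}

// Addition is commutative, so the mirrored overload just forwards.
inline ff ff_add(float a, ff b) {
    return ff_add(b, a);
}
```

A call site then reads `ff_add(x, 1.0e-9f)` regardless of which operand is a float, and the mixed overloads can still take the cheaper code path.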

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl self-assigned this Jan 24, 2026
@copy-pr-bot

copy-pr-bot bot commented Jan 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps bot commented Jan 24, 2026

Greptile Summary

Added fused multiply-add (fltflt_fma()) function for the fltflt data type with 6 optimized overloads supporting mixed float/fltflt arguments, and refactored existing function names from explicit type suffixes (e.g., fltflt_add_float()) to C++ function overloads (e.g., fltflt_add()).

Key improvements:

  • The FMA function computes a * b + c with 2 normalizations instead of the 3 required by a separate multiply and add, which reduces the work per operation
  • Applied in ComputeRangeToPixelFloatFloat() for SAR backprojection distance calculations
  • Comprehensive test coverage validates 44+ mantissa bits of precision across all overload combinations
  • API modernization improves code consistency and makes the library more idiomatic C++
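One way to read the "44+ mantissa bits" figure is as the number of leading bits on which a float-float result agrees with a double-precision reference. The sketch below shows such a check; it is an assumption about how the measurement could be done, not the code in FloatFloatTests.cu, and `ff`, `from_double`, `to_double`, `matching_bits`, and `meets_precision` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for fltflt: the represented value is hi + lo.
struct ff {
    float hi;
    float lo;
};

// Split a double into a float-float pair (hi holds the leading bits).
inline ff from_double(double x) {
    float hi = static_cast<float>(x);
    float lo = static_cast<float>(x - static_cast<double>(hi));
    return {hi, lo};
}

inline double to_double(ff x) {
    return static_cast<double>(x.hi) + static_cast<double>(x.lo);
}

// Approximate number of leading mantissa bits on which got agrees with ref,
// computed as -log2 of the relative error.
inline double matching_bits(double got, double ref) {
    if (got == ref) {
        return std::numeric_limits<double>::infinity();
    }
    if (ref == 0.0) {
        return 0.0;  // relative error is undefined; report no agreement
    }
    return -std::log2(std::fabs(got - ref) / std::fabs(ref));
}

// True when a float-float result reaches the requested precision.
inline bool meets_precision(ff got, double ref, double bits = 44.0) {
    return matching_bits(to_double(got), ref) >= bits;
}
```

For comparison, a single float carries a 24-bit significand, so a plain `float` approximation of the same reference would fall well short of a 44-bit threshold under this measure.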

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The implementation follows established algorithms from Thall's paper, includes comprehensive test coverage for all overloads, maintains backward compatibility through the refactoring, and demonstrates practical usage in SAR processing. The FMA optimization is mathematically sound and performance-improving.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/kernels/fltflt.h | Adds the fltflt_fma() function with 6 overloads for mixed float/fltflt types, and renames fltflt_add_float()-style functions to overloads of fltflt_add() for consistency |
| include/matx/kernels/sar_bp.cuh | Updates ComputeRangeToPixelFloatFloat() to use the new fltflt_fma() when computing the distance sqrt(dx² + dy² + dz²), for improved performance |
| test/00_misc/FloatFloatTests.cu | Adds test coverage for fltflt_fma() across 6 overload combinations, verifying 44+ mantissa bits of accuracy |

Sequence Diagram

sequenceDiagram
    participant User
    participant ComputeRangeToPixel
    participant fltflt_fma
    participant fltflt_two_prod_fma
    participant fltflt_two_sum
    participant fltflt_fast_two_sum
    participant fltflt_sqrt

    User->>ComputeRangeToPixel: Compute distance
    ComputeRangeToPixel->>fltflt_fma: fltflt_fma(dx, dx, dy * dy)
    Note over fltflt_fma: Compute dx² + dy²
    fltflt_fma->>fltflt_two_prod_fma: Multiply a.hi * b.hi
    fltflt_two_prod_fma-->>fltflt_fma: Return product with error term
    fltflt_fma->>fltflt_fma: Add cross terms with fmaf_rn()
    fltflt_fma->>fltflt_two_sum: Add product to c (skip intermediate normalization)
    fltflt_two_sum-->>fltflt_fma: Return sum with error term
    fltflt_fma->>fltflt_fma: Add p.lo component
    fltflt_fma->>fltflt_fast_two_sum: Normalize once
    fltflt_fast_two_sum-->>fltflt_fma: Return normalized result
    fltflt_fma->>fltflt_fma: Add c.lo component
    fltflt_fma->>fltflt_fast_two_sum: Final normalization
    fltflt_fast_two_sum-->>fltflt_fma: Return final result
    fltflt_fma-->>ComputeRangeToPixel: dx² + dy²
    ComputeRangeToPixel->>fltflt_fma: fltflt_fma(dz, dz, dx2dy2)
    Note over fltflt_fma: Compute dz² + (dx² + dy²)
    fltflt_fma-->>ComputeRangeToPixel: dx² + dy² + dz²
    ComputeRangeToPixel->>fltflt_sqrt: sqrt(dx² + dy² + dz²)
    fltflt_sqrt-->>ComputeRangeToPixel: Final distance
    ComputeRangeToPixel-->>User: Return range to pixel
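
The call chain in the diagram can be sketched on the host as follows. This is an illustration under stated assumptions, not the code in sar_bp.cuh or fltflt.h: `ff`, the `ff_*` helpers, and `range_to_pixel` are hypothetical names, `std::fma` stands in for the device intrinsic, and the square root uses a single Newton-style correction step that may differ from what fltflt_sqrt() actually does.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for fltflt: the represented value is hi + lo.
struct ff {
    float hi;
    float lo;
};

inline double to_double(ff x) {
    return static_cast<double>(x.hi) + static_cast<double>(x.lo);
}

// Error-free product: hi + lo == a * b exactly (barring overflow/underflow).
inline ff two_prod_fma(float a, float b) {
    float p = a * b;
    return {p, std::fma(a, b, -p)};
}

// Error-free sum (Knuth): hi + lo == a + b exactly.
inline ff two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}

// Error-free sum that assumes |a| >= |b|; used as the normalization step.
inline ff fast_two_sum(float a, float b) {
    float s = a + b;
    return {s, b - (s - a)};
}

inline ff ff_mul(ff a, ff b) {
    ff p = two_prod_fma(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);
    p.lo = std::fma(a.lo, b.hi, p.lo);
    return fast_two_sum(p.hi, p.lo);
}

// Fused a * b + c with two normalizations, as in the diagram.
inline ff ff_fma(ff a, ff b, ff c) {
    ff p = two_prod_fma(a.hi, b.hi);
    p.lo = std::fma(a.hi, b.lo, p.lo);
    p.lo = std::fma(a.lo, b.hi, p.lo);
    ff s = two_sum(p.hi, c.hi);
    s.lo += p.lo;
    s = fast_two_sum(s.hi, s.lo);
    s.lo += c.lo;
    return fast_two_sum(s.hi, s.lo);
}

// Assumed sqrt: float estimate x, then one correction x + (a - x*x) / (2x).
inline ff ff_sqrt(ff a) {
    if (a.hi <= 0.0f) {
        return {std::sqrt(a.hi), 0.0f};  // 0 for zero, NaN for negative input
    }
    float x = std::sqrt(a.hi);
    ff xx = two_prod_fma(x, x);          // exact x * x
    float d = ((a.hi - xx.hi) - xx.lo) + a.lo;
    return fast_two_sum(x, d / (2.0f * x));
}

// sqrt(dx^2 + dy^2 + dz^2) using two fused operations and one multiply.
inline ff range_to_pixel(ff dx, ff dy, ff dz) {
    ff dx2dy2 = ff_fma(dx, dx, ff_mul(dy, dy));  // dx^2 + dy^2
    ff sum = ff_fma(dz, dz, dx2dy2);             // dz^2 + (dx^2 + dy^2)
    return ff_sqrt(sum);
}
```

Compared with three multiplies and two adds, this arrangement replaces two multiply-then-add pairs with fused calls, which is where the saved normalizations come from.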