Skip to content

Refactor/add scalar dist function and setup cmake with detailed cpu flag#238

Open
richyreachy wants to merge 45 commits intomainfrom
refactor/add_scalar_dist_function_and_setup_cmake_with_detailed_cpu_flag
Open

Refactor/add scalar dist function and setup cmake with detailed cpu flag#238
richyreachy wants to merge 45 commits intomainfrom
refactor/add_scalar_dist_function_and_setup_cmake_with_detailed_cpu_flag

Conversation

@richyreachy
Copy link
Collaborator

@richyreachy richyreachy commented Mar 17, 2026

add scalar dist function and setup cmake with detailed cpu flag

Greptile Summary

This PR is a large-scale refactor that renames all distance/inner-product kernel functions to include their data-type suffix (e.g. InnerProductAVXInnerProductFp32AVX), adds scalar fallback implementations for every data type across all three distance families (Euclidean, inner product, and MIPS), and improves the CMake build system to detect and apply distinct compiler march flags for SSE, AVX2, AVX512, and the new AVX512FP16 tier. It also splits the avx512.cc files to separate the legacy-AVX512 (via conversion) from the native AVX512FP16 kernels.

Key changes:

  • New files: *_scalar.cc for euclidean, inner product, and MIPS distances; *_avx512fp16.cc files split out from *_avx512.cc
  • All dispatch files now fall through to a scalar path when no SIMD feature is available at compile or run time, removing previous platform guards that left the Compute functions conditionally undefined
  • cmake/option.cmake adds a fourth march-flag output variable and new fallback arch lists for AVX512 and AVX512FP16 detection

Remaining issues from prior review threads that are not yet fully addressed:

  • Multiple bugs in cmake/option.cmake remain: wrong loop variable ${_arch} (should be ${_arch_avx512}), non-AVX512 fallback targets in the AVX512 list, missing include(CheckCXXCompilerFlag), and cached result-variable reuse across loop iterations
  • src/ailego/CMakeLists.txt: math_batch/*_dispatch.cc files remain in MATH_FILES_AVX512 while math/*_dispatch.cc files moved to MATH_FILES_AVX512FP16, creating an inconsistency

New findings in this review:

  • Several EuclideanDistance*Scalar variants (EuclideanDistanceFp16Scalar, EuclideanDistanceFp32Scalar, EuclideanDistanceInt4Scalar, EuclideanDistanceInt8Scalar) in euclidean_distance_matrix_scalar.cc are dead code — no dispatch file calls them
  • InnerProductScalar<Float16> accumulates static_cast<float>(m[i] * q[i]) where the multiplication may happen in 16-bit precision before widening; converting each operand to float first would improve scalar fallback accuracy
  • Minor indentation inconsistency in the new foreach(MATH_FILE ${MATH_FILES_AVX512FP16}) loop in src/ailego/CMakeLists.txt

Confidence Score: 2/5

  • Not safe to merge — several previously-flagged CMake correctness bugs remain unresolved and can silently produce broken or suboptimal builds.
  • The core C++ refactor (function renames, scalar fallback wiring, dispatch logic) is largely correct and fixes multiple previously-reported bugs. However, cmake/option.cmake still contains multiple confirmed issues: the ${_arch} typo that prevents MATH_MARCH_FLAG_AVX512 from ever being set correctly, non-AVX512 architectures in the AVX512 fallback list, missing CheckCXXCompilerFlag include, and the cached-variable reuse problem. These will cause silent mis-compilation of *_avx512.cc files on many hosts. The src/ailego/CMakeLists.txt inconsistency between math_batch and math dispatch file groupings also remains. Until the CMake issues are resolved, builds may silently produce binaries that either fail to compile or disable all AVX-512 code paths.
  • cmake/option.cmake requires the most attention due to the loop variable typo, non-AVX512 fallback targets, missing module include, and cached variable reuse. src/ailego/CMakeLists.txt also needs review of the math_batch dispatch inconsistency.

Important Files Changed

Filename Overview
cmake/option.cmake Adds VAR_NAME_AVX512FP16 parameter to setup_compiler_march_for_x86; introduces new AVX512 and AVX512FP16 fallback lists, but includes non-AVX512 targets (core-avx2, x86-64) in AVX512 list, uses a wrong loop variable ${_arch} instead of ${_arch_avx512}, uses check_cxx_compiler_flag without including CheckCXXCompilerFlag, and relies on a cached variable COMPILER_FLAG_SUPPORT that is reused across loop iterations — all previously flagged in the review thread.
src/ailego/CMakeLists.txt Moves dispatch files from MATH_FILES_AVX512 into new MATH_FILES_AVX512FP16 group; the math_batch/*_dispatch.cc files remain in MATH_FILES_AVX512 creating an inconsistency; the new foreach loop has incorrect indentation; AUTO_DETECT_ARCH guard removal was previously flagged.
src/ailego/math/euclidean_distance_matrix_scalar.cc New file providing scalar fallback implementations for all data types (FP16, FP32, INT4, INT8). Logic is generally correct; INT4 nibble-packing table index is consistent. However, EuclideanDistance*Scalar variants are defined but never called from any dispatch file (dead code), and the FP16 template multiplication may accumulate in half-precision before widening.
src/ailego/math/inner_product_matrix_scalar.cc New file providing scalar inner product fallbacks and sparse segment computation infrastructure. The ComputeSegments<T> template correctly negates the sum for Minus variants. Forward declarations of ComputeInnerProductSparseInSegmentFp32/Fp16 are resolved in their respective dispatch files. Float16 multiplication in InnerProductScalar<Float16> may reduce precision before widening to float.
src/ailego/math/mips_euclidean_distance_matrix_scalar.cc New file correctly implementing all MIPS distance scalar fallbacks for INT4, INT8, FP16, and FP32. Spherical and RepeatedQuadratic injection formulas are consistent with NEON and SIMD implementations.
src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc Previously-flagged missing return statements are now fixed; all SIMD branches correctly return after storing the result. Scalar fallback properly wired. NEON path updated to use renamed SquaredEuclideanDistanceFp32NEON.
src/ailego/math/inner_product_matrix_fp16_dispatch.cc Dense dispatch correctly renamed and wired with scalar fallback. MinusInnerProduct AVX path previously-flagged bug now uses correct MinusInnerProductFp16AVX. Sparse MinusInnerProductSparseMatrix::Compute calls scalar-named helper that internally dispatches via SIMD through ComputeInnerProductSparseInSegmentFp16.
src/ailego/math/inner_product_matrix_fp32_dispatch.cc Previously-flagged undeclared NEON function names fixed (InnerProductFp32NEON, MinusInnerProductFp32NEON). ComputeInnerProductSparseInSegmentFp32 now properly defined here and forward-declared in scalar.cc. Scalar fallback correctly wired for all branches.
src/ailego/math/mips_euclidean_distance_matrix_fp32_dispatch.cc Previously-flagged FP32 MIPS scalar fallbacks are now wired up. SSE functions correctly guarded. NEON path folded into the same Compute functions using #if defined(__ARM_NEON) guards. Scalar fallback confirmed for all code paths.
src/ailego/math/inner_product_matrix_fp16_avx512fp16.cc New file containing the moved InnerProductFp16AVX512FP16 and new MinusInnerProductFp16AVX512FP16 implementations using native __AVX512FP16__ intrinsics, plus the sparse InnerProductSparseInSegmentFp16AVX512FP16 implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Dispatch Compute()"] --> B{ARM NEON?}
    B -->|Yes| C["Call *NEON() variant"]
    B -->|No| D{__AVX512FP16__\ncompiled in?}
    D -->|Yes| E{CpuFeatures\n.AVX512_FP16?}
    E -->|Yes| F["Call *AVX512FP16()"]
    E -->|No| G{__AVX512F__\ncompiled in?}
    D -->|No| G
    G -->|Yes| H{CpuFeatures\n.AVX512F?}
    H -->|Yes| I["Call *AVX512()"]
    H -->|No| J{__AVX__ compiled in?}
    G -->|No| J
    J -->|Yes| K{CpuFeatures\n.AVX?}
    K -->|Yes| L["Call *AVX()"]
    K -->|No| M{__SSE__ compiled in?}
    J -->|No| M
    M -->|Yes| N{CpuFeatures\n.SSE?}
    N -->|Yes| O["Call *SSE()"]
    N -->|No| P["Call *Scalar() [NEW]"]
    M -->|No| P
    F --> Q["return result"]
    I --> Q
    L --> Q
    O --> Q
    P --> Q
    C --> Q
Loading

Comments Outside Diff (7)

  1. src/ailego/math/inner_product_matrix_fp32_dispatch.cc, line 386-393 (link)

    P2 InnerProductSparseInSegmentFp32SSE may be undefined — potential linker error

    InnerProductSparseInSegmentFp32SSE is forward-declared here inside #if defined(__SSE4_1__) and called inside ComputeInnerProductSparseInSegmentFp32. The corresponding inner_product_matrix_fp32_sse.cc was significantly refactored in this PR (73 lines changed). If that file does not export a symbol named InnerProductSparseInSegmentFp32SSE (i.e., the rename convention introduced for other SSE functions was applied there too, changing the previous name), any build targeting SSE4.1 will fail at link time with an unresolved symbol.

    Please verify that inner_product_matrix_fp32_sse.cc defines a function with exactly this signature:

    float InnerProductSparseInSegmentFp32SSE(uint32_t m_sparse_count,
                                             const uint16_t *m_sparse_index,
                                             const float *m_sparse_value,
                                             uint32_t q_sparse_count,
                                             const uint16_t *q_sparse_index,
                                             const float *q_sparse_value);
  2. src/ailego/math/inner_product_matrix_scalar.cc, line 544-576 (link)

    P2 Circular call chain between scalar and dispatch translation units

    MinusInnerProductSparseFp32Scalar (this file) → ComputeSegments<float> (this file) → ComputeInnerProductSparseInSegment<float> (this file) → ComputeInnerProductSparseInSegmentFp32 (declared extern here, defined in inner_product_matrix_fp32_dispatch.cc) → InnerProductSparseInSegmentFp32Scalar (back in this file).

    While this is technically valid C++ at link time, the circular dependency between the two TUs makes the sparse FP32 path difficult to reason about. A simpler alternative would be to have ComputeInnerProductSparseInSegment<float> call InnerProductSparseInSegmentFp32Scalar directly for the scalar path, or to inline the segment dispatch entirely here, matching the pattern used for the FP16 variant in inner_product_matrix_fp16_dispatch.cc which has a self-contained ComputeInnerProductSparseInSegmentFp16 definition.

  3. src/ailego/math/euclidean_distance_matrix_int8_sse.cc, line 22-24 (link)

    inline function called across translation unit boundaries

    SquaredEuclideanDistanceInt8SSEInternal is declared inline in this .cc file but is forward-declared (without inline) and called from euclidean_distance_matrix_int8_avx2.cc — a different translation unit. In C++, an inline function definition in a .cc file is not guaranteed to emit an external linkage symbol: the compiler may inline every call within this TU and omit the symbol entirely, causing a linker error in the AVX2 TU.

    The fix is to remove the inline keyword, since the function is intended to be reachable across TUs:

  4. tests/ailego/math/euclidean_distance_matrix_fp16_test.cc, line 142 (link)

    Test coverage gap: small-vector dimensions excluded

    The dimension range was narrowed from [1, 65] to [32, 65], completely removing tests for vectors with 1–31 elements. The primary motivation for this PR is to add scalar fallbacks so that every distance function works on any vector size — yet the exact size range that exercises those new scalar paths (small dimensions, typically < 32 or < 8 elements) is now excluded from testing.

    Bugs in boundary handling (e.g., the masked tail loops, off-by-one in last_aligned calculations, or a wrong scalar accumulation loop) would be invisible under this test configuration. Restoring the lower bound to 1 (or at minimum 1) is needed to validate the new scalar implementations.

    The same narrowing is also applied at line 187.

  5. src/ailego/math/euclidean_distance_matrix_scalar.cc, line 53-65 (link)

    INT4 element-nibble bit-shift indexing inconsistency

    The index used for the second INT4 element in the scalar Euclidean distance loop differs in bit-shift direction from the inner product scalar code, and is internally inconsistent with the element layout assumed by the MIPS scalar and SSE/AVX2 implementations.

    For a packed byte where the low nibble holds element 0 and the high nibble holds element 1:

    • Element 0 table index: ((m_val << 4) & 0xf0) | ((q_val) & 0xf) — uses m_lo and q_lo. ✓
    • Element 1 table index: ((m_val) & 0xf0) | ((q_val >> 4) & 0xf) — uses m_hi and q_hi. ✓

    Both indices are correct individually, but cross-check the InnerProductInt4Scalar in inner_product_matrix_scalar.cc (lines ~56–64) which uses the exact same pattern. If the nibble layout assumed by the lookup table (Int4SquaredDiffTable, Int4MulTable) ever differs from this encoding, results will be silently wrong. Ensure Int4SquaredDiffTable is indexed as (m_nibble << 4) | q_nibble to match the computed indices above.

  6. src/ailego/math/inner_product_matrix_scalar.cc, line 34-41 (link)

    Float16 multiplication precision may silently accumulate in half-precision

    InnerProductScalar<ailego::Float16> (and its Minus variant) computes:

    sum += static_cast<float>(m[i] * q[i]);

    If Float16::operator* returns a Float16 (as is typical for custom half-precision types), the multiplication and intermediate result are in 16-bit precision before being widened to float. For large dim values this causes compounding precision loss that is worse than widening each operand before multiplying:

    This matches how FP32 and INT8 variants behave and avoids potential silent accuracy degradation in the scalar fallback path.

  7. src/ailego/math/euclidean_distance_matrix_scalar.cc, line 40-47 (link)

    EuclideanDistance*Scalar functions are defined but never called

    EuclideanDistanceFp16Scalar, EuclideanDistanceFp32Scalar, EuclideanDistanceInt4Scalar, and EuclideanDistanceInt8Scalar are all defined in this file, but none of the dispatch files forward-declare or call them. All dispatch-level Euclidean distance functions call the squared variant and then take std::sqrt outside the function. These four functions are unreachable dead code and will trigger linker warnings on strict builds.

    If they are intentionally kept as utility functions for future use, they should at minimum be placed in an anonymous namespace or marked [[maybe_unused]] to suppress warnings.

Last reviewed commit: "fix: cmake config"

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

1 similar comment
@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

@richyreachy
Copy link
Collaborator Author

@greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant