Refactor/add scalar dist function and setup cmake with detailed cpu flag by richyreachy · Pull Request #238 · alibaba/zvec

richyreachy · 2026-03-17T09:49:41Z

add scalar dist function and setup cmake with detailed cpu flag

Greptile Summary

This PR is a large-scale refactor that renames all distance/inner-product kernel functions to include their data-type suffix (e.g. InnerProductAVX → InnerProductFp32AVX), adds scalar fallback implementations for every data type across all three distance families (Euclidean, inner product, and MIPS), and improves the CMake build system to detect and apply distinct compiler march flags for SSE, AVX2, AVX512, and the new AVX512FP16 tier. It also splits the avx512.cc files to separate the legacy-AVX512 (via conversion) from the native AVX512FP16 kernels.

Key changes:

New files: *_scalar.cc for euclidean, inner product, and MIPS distances; *_avx512fp16.cc files split out from *_avx512.cc
All dispatch files now fall through to a scalar path when no SIMD feature is available at compile or run time, removing previous platform guards that left the Compute functions conditionally undefined
cmake/option.cmake adds a fourth march-flag output variable and new fallback arch lists for AVX512 and AVX512FP16 detection

Remaining issues from prior review threads that are not yet fully addressed:

Multiple bugs in cmake/option.cmake remain: wrong loop variable ${_arch} (should be ${_arch_avx512}), non-AVX512 fallback targets in the AVX512 list, missing include(CheckCXXCompilerFlag), and cached result-variable reuse across loop iterations
src/ailego/CMakeLists.txt: math_batch/*_dispatch.cc files remain in MATH_FILES_AVX512 while math/*_dispatch.cc files moved to MATH_FILES_AVX512FP16, creating an inconsistency

New findings in this review:

Several EuclideanDistance*Scalar variants (EuclideanDistanceFp16Scalar, EuclideanDistanceFp32Scalar, EuclideanDistanceInt4Scalar, EuclideanDistanceInt8Scalar) in euclidean_distance_matrix_scalar.cc are dead code — no dispatch file calls them
InnerProductScalar<Float16> accumulates static_cast<float>(m[i] * q[i]) where the multiplication may happen in 16-bit precision before widening; converting each operand to float first would improve scalar fallback accuracy
Minor indentation inconsistency in the new foreach(MATH_FILE ${MATH_FILES_AVX512FP16}) loop in src/ailego/CMakeLists.txt

Confidence Score: 2/5

Not safe to merge — several previously-flagged CMake correctness bugs remain unresolved and can silently produce broken or suboptimal builds.
The core C++ refactor (function renames, scalar fallback wiring, dispatch logic) is largely correct and fixes multiple previously-reported bugs. However, cmake/option.cmake still contains multiple confirmed issues: the ${_arch} typo that prevents MATH_MARCH_FLAG_AVX512 from ever being set correctly, non-AVX512 architectures in the AVX512 fallback list, missing CheckCXXCompilerFlag include, and the cached-variable reuse problem. These will cause silent mis-compilation of *_avx512.cc files on many hosts. The src/ailego/CMakeLists.txt inconsistency between math_batch and math dispatch file groupings also remains. Until the CMake issues are resolved, builds may silently produce binaries that either fail to compile or disable all AVX-512 code paths.
cmake/option.cmake requires the most attention due to the loop variable typo, non-AVX512 fallback targets, missing module include, and cached variable reuse. src/ailego/CMakeLists.txt also needs review of the math_batch dispatch inconsistency.

Important Files Changed

Filename	Overview
cmake/option.cmake	Adds `VAR_NAME_AVX512FP16` parameter to `setup_compiler_march_for_x86`; introduces new AVX512 and AVX512FP16 fallback lists, but includes non-AVX512 targets (`core-avx2`, `x86-64`) in AVX512 list, uses a wrong loop variable `${_arch}` instead of `${_arch_avx512}`, uses `check_cxx_compiler_flag` without including `CheckCXXCompilerFlag`, and relies on a cached variable `COMPILER_FLAG_SUPPORT` that is reused across loop iterations — all previously flagged in the review thread.
src/ailego/CMakeLists.txt	Moves dispatch files from `MATH_FILES_AVX512` into new `MATH_FILES_AVX512FP16` group; the `math_batch/*_dispatch.cc` files remain in `MATH_FILES_AVX512` creating an inconsistency; the new foreach loop has incorrect indentation; `AUTO_DETECT_ARCH` guard removal was previously flagged.
src/ailego/math/euclidean_distance_matrix_scalar.cc	New file providing scalar fallback implementations for all data types (FP16, FP32, INT4, INT8). Logic is generally correct; INT4 nibble-packing table index is consistent. However, `EuclideanDistance*Scalar` variants are defined but never called from any dispatch file (dead code), and the FP16 template multiplication may accumulate in half-precision before widening.
src/ailego/math/inner_product_matrix_scalar.cc	New file providing scalar inner product fallbacks and sparse segment computation infrastructure. The `ComputeSegments<T>` template correctly negates the sum for `Minus` variants. Forward declarations of `ComputeInnerProductSparseInSegmentFp32/Fp16` are resolved in their respective dispatch files. Float16 multiplication in `InnerProductScalar<Float16>` may reduce precision before widening to float.
src/ailego/math/mips_euclidean_distance_matrix_scalar.cc	New file correctly implementing all MIPS distance scalar fallbacks for INT4, INT8, FP16, and FP32. Spherical and RepeatedQuadratic injection formulas are consistent with NEON and SIMD implementations.
src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc	Previously-flagged missing `return` statements are now fixed; all SIMD branches correctly return after storing the result. Scalar fallback properly wired. NEON path updated to use renamed `SquaredEuclideanDistanceFp32NEON`.
src/ailego/math/inner_product_matrix_fp16_dispatch.cc	Dense dispatch correctly renamed and wired with scalar fallback. MinusInnerProduct AVX path previously-flagged bug now uses correct `MinusInnerProductFp16AVX`. Sparse `MinusInnerProductSparseMatrix::Compute` calls scalar-named helper that internally dispatches via SIMD through `ComputeInnerProductSparseInSegmentFp16`.
src/ailego/math/inner_product_matrix_fp32_dispatch.cc	Previously-flagged undeclared NEON function names fixed (`InnerProductFp32NEON`, `MinusInnerProductFp32NEON`). `ComputeInnerProductSparseInSegmentFp32` now properly defined here and forward-declared in scalar.cc. Scalar fallback correctly wired for all branches.
src/ailego/math/mips_euclidean_distance_matrix_fp32_dispatch.cc	Previously-flagged FP32 MIPS scalar fallbacks are now wired up. SSE functions correctly guarded. NEON path folded into the same `Compute` functions using `#if defined(__ARM_NEON)` guards. Scalar fallback confirmed for all code paths.
src/ailego/math/inner_product_matrix_fp16_avx512fp16.cc	New file containing the moved `InnerProductFp16AVX512FP16` and new `MinusInnerProductFp16AVX512FP16` implementations using native `__AVX512FP16__` intrinsics, plus the sparse `InnerProductSparseInSegmentFp16AVX512FP16` implementation.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Dispatch Compute()"] --> B{ARM NEON?}
    B -->|Yes| C["Call *NEON() variant"]
    B -->|No| D{__AVX512FP16__\ncompiled in?}
    D -->|Yes| E{CpuFeatures\n.AVX512_FP16?}
    E -->|Yes| F["Call *AVX512FP16()"]
    E -->|No| G{__AVX512F__\ncompiled in?}
    D -->|No| G
    G -->|Yes| H{CpuFeatures\n.AVX512F?}
    H -->|Yes| I["Call *AVX512()"]
    H -->|No| J{__AVX__ compiled in?}
    G -->|No| J
    J -->|Yes| K{CpuFeatures\n.AVX?}
    K -->|Yes| L["Call *AVX()"]
    K -->|No| M{__SSE__ compiled in?}
    J -->|No| M
    M -->|Yes| N{CpuFeatures\n.SSE?}
    N -->|Yes| O["Call *SSE()"]
    N -->|No| P["Call *Scalar() [NEW]"]
    M -->|No| P
    F --> Q["return result"]
    I --> Q
    L --> Q
    O --> Q
    P --> Q
    C --> Q

Comments Outside Diff (7)

src/ailego/math/inner_product_matrix_fp32_dispatch.cc, line 386-393 (link)

InnerProductSparseInSegmentFp32SSE may be undefined — potential linker error

InnerProductSparseInSegmentFp32SSE is forward-declared here inside #if defined(__SSE4_1__) and called inside ComputeInnerProductSparseInSegmentFp32. The corresponding inner_product_matrix_fp32_sse.cc was significantly refactored in this PR (73 lines changed). If that file does not export a symbol named InnerProductSparseInSegmentFp32SSE (i.e., the rename convention introduced for other SSE functions was applied there too, changing the previous name), any build targeting SSE4.1 will fail at link time with an unresolved symbol.

Please verify that inner_product_matrix_fp32_sse.cc defines a function with exactly this signature:
```
float InnerProductSparseInSegmentFp32SSE(uint32_t m_sparse_count,
                                         const uint16_t *m_sparse_index,
                                         const float *m_sparse_value,
                                         uint32_t q_sparse_count,
                                         const uint16_t *q_sparse_index,
                                         const float *q_sparse_value);
```
src/ailego/math/inner_product_matrix_scalar.cc, line 544-576 (link)

Circular call chain between scalar and dispatch translation units

MinusInnerProductSparseFp32Scalar (this file) → ComputeSegments<float> (this file) → ComputeInnerProductSparseInSegment<float> (this file) → ComputeInnerProductSparseInSegmentFp32 (declared extern here, defined in inner_product_matrix_fp32_dispatch.cc) → InnerProductSparseInSegmentFp32Scalar (back in this file).

While this is technically valid C++ at link time, the circular dependency between the two TUs makes the sparse FP32 path difficult to reason about. A simpler alternative would be to have ComputeInnerProductSparseInSegment<float> call InnerProductSparseInSegmentFp32Scalar directly for the scalar path, or to inline the segment dispatch entirely here, matching the pattern used for the FP16 variant in inner_product_matrix_fp16_dispatch.cc which has a self-contained ComputeInnerProductSparseInSegmentFp16 definition.
src/ailego/math/euclidean_distance_matrix_int8_sse.cc, line 22-24 (link)

inline function called across translation unit boundaries

SquaredEuclideanDistanceInt8SSEInternal is declared inline in this .cc file but is forward-declared (without inline) and called from euclidean_distance_matrix_int8_avx2.cc — a different translation unit. In C++, an inline function definition in a .cc file is not guaranteed to emit an external linkage symbol: the compiler may inline every call within this TU and omit the symbol entirely, causing a linker error in the AVX2 TU.

The fix is to remove the inline keyword, since the function is intended to be reachable across TUs:
tests/ailego/math/euclidean_distance_matrix_fp16_test.cc, line 142 (link)

Test coverage gap: small-vector dimensions excluded

The dimension range was narrowed from [1, 65] to [32, 65], completely removing tests for vectors with 1–31 elements. The primary motivation for this PR is to add scalar fallbacks so that every distance function works on any vector size — yet the exact size range that exercises those new scalar paths (small dimensions, typically < 32 or < 8 elements) is now excluded from testing.

Bugs in boundary handling (e.g., the masked tail loops, off-by-one in last_aligned calculations, or a wrong scalar accumulation loop) would be invisible under this test configuration. Restoring the lower bound to 1 (or at minimum 1) is needed to validate the new scalar implementations.

The same narrowing is also applied at line 187.
src/ailego/math/euclidean_distance_matrix_scalar.cc, line 53-65 (link)

INT4 element-nibble bit-shift indexing inconsistency

The index used for the second INT4 element in the scalar Euclidean distance loop differs in bit-shift direction from the inner product scalar code, and is internally inconsistent with the element layout assumed by the MIPS scalar and SSE/AVX2 implementations.

For a packed byte where the low nibble holds element 0 and the high nibble holds element 1:
- Element 0 table index: ((m_val << 4) & 0xf0) | ((q_val) & 0xf) — uses m_lo and q_lo. ✓
- Element 1 table index: ((m_val) & 0xf0) | ((q_val >> 4) & 0xf) — uses m_hi and q_hi. ✓
Both indices are correct individually, but cross-check the InnerProductInt4Scalar in inner_product_matrix_scalar.cc (lines ~56–64) which uses the exact same pattern. If the nibble layout assumed by the lookup table (Int4SquaredDiffTable, Int4MulTable) ever differs from this encoding, results will be silently wrong. Ensure Int4SquaredDiffTable is indexed as (m_nibble << 4) | q_nibble to match the computed indices above.
src/ailego/math/inner_product_matrix_scalar.cc, line 34-41 (link)

Float16 multiplication precision may silently accumulate in half-precision

InnerProductScalar<ailego::Float16> (and its Minus variant) computes:
```
sum += static_cast<float>(m[i] * q[i]);
```
If Float16::operator* returns a Float16 (as is typical for custom half-precision types), the multiplication and intermediate result are in 16-bit precision before being widened to float. For large dim values this causes compounding precision loss that is worse than widening each operand before multiplying:

This matches how FP32 and INT8 variants behave and avoids potential silent accuracy degradation in the scalar fallback path.
src/ailego/math/euclidean_distance_matrix_scalar.cc, line 40-47 (link)

EuclideanDistance*Scalar functions are defined but never called

EuclideanDistanceFp16Scalar, EuclideanDistanceFp32Scalar, EuclideanDistanceInt4Scalar, and EuclideanDistanceInt8Scalar are all defined in this file, but none of the dispatch files forward-declare or call them. All dispatch-level Euclidean distance functions call the squared variant and then take std::sqrt outside the function. These four functions are unreachable dead code and will trigger linker warnings on strict builds.

If they are intentionally kept as utility functions for future use, they should at minimum be placed in an anonymous namespace or marked [[maybe_unused]] to suppress warnings.

_{Last reviewed commit: "fix: cmake config"}

src/ailego/math/euclidean_distance_matrix_fp32_dispatch.cc

src/ailego/math/euclidean_distance_matrix_fp16_dispatch.cc

src/ailego/math/inner_product_matrix_fp16_dispatch.cc

src/ailego/math/inner_product_matrix_fp32_dispatch.cc

cmake/option.cmake

richyreachy · 2026-03-17T12:55:47Z

@greptile

src/ailego/math/inner_product_matrix_int8_dispatch.cc

src/ailego/math/mips_euclidean_distance_matrix_scalar.cc

src/ailego/CMakeLists.txt

richyreachy · 2026-03-18T08:51:22Z

@greptile

…cmake_with_detailed_cpu_flag

richyreachy · 2026-03-19T02:28:25Z

@greptile

src/ailego/math/mips_euclidean_distance_matrix_fp32_dispatch.cc

src/ailego/math/inner_product_matrix_scalar.cc

…cmake_with_detailed_cpu_flag

richyreachy · 2026-03-19T05:42:01Z

@greptile

cmake/option.cmake

…cmake_with_detailed_cpu_flag

richyreachy · 2026-03-19T07:41:31Z

@greptile

richyreachy · 2026-03-19T07:46:45Z

@greptile

richyreachy · 2026-03-19T09:01:44Z

@greptile

…cmake_with_detailed_cpu_flag

richyreachy · 2026-03-19T09:39:27Z

@greptile

richyreachy · 2026-03-19T14:09:12Z

@greptile

richyreachy · 2026-03-19T14:12:39Z

@greptile

richyreachy added 10 commits March 16, 2026 11:05

fix: add scalar

d322818

Merge branch 'main' into fix/fix_performance

625cd0c

refactor: add scalar

4ea17e9

fix: fix scalar

efecee9

fix: remove inline

5f5ef13

refactor: separate avx512 fp16 and use -m flag instead of -m march

58a9cc8

add fp16 avx512fp16

6ed3306

merge with fp16 separation

cf4a2ee

Merge branch 'refactor/separate_fp16' into fix/fix_performance

5ac9e0d

fix: format cmake config

ddd3dc5

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

richyreachy added 4 commits March 17, 2026 18:56

fix: avx512 fp16

0b21b7a

fix: fp16 typo

f8ea918

revert: use march back since performance degrades

2b78014

fix: fix typo according to greptile

f91a91e

greptile-apps bot reviewed Mar 17, 2026

View reviewed changes

richyreachy added 12 commits March 18, 2026 10:05

fix: fix neon

28c5a37

fix: fix naming

61eff0c

fix: fix naming

2f6472d

fix: int4

97586a2

fix: fix sparse

9aebde3

fix: fix sparse

50c3522

fix: fix sparse

c63e206

fix: fix int8 scalar

6e1c474

fix: fix sparse

f2370a1

fix: fix ut

9012959

refactor: change cmake march

6c1c8bb

refactor: change cmake march

bb4e8cd

richyreachy added 8 commits March 18, 2026 17:30

fix: fix avx512fp16

1d37aeb

refactor: change math batch

37166e0

fix: fix cmake config

d00aa56

fix: mips fp16

9061a95

fix: mips fp16

0daf6fe

Merge branch 'main' into refactor/add_scalar_dist_function_and_setup_…

6cd1365

…cmake_with_detailed_cpu_flag

fix: update turbo cmake

5e7b9ac

fix: fip mips fp16 scalar

c1a7132

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

src/ailego/math/mips_euclidean_distance_matrix_fp32_dispatch.cc Show resolved Hide resolved

src/ailego/math/inner_product_matrix_scalar.cc Show resolved Hide resolved

richyreachy added 4 commits March 19, 2026 11:09

fix: add fp32 mips

c1ea0d0

fix: missout icelake

ff58d68

fix: mips fp32 neon

106c513

Merge branch 'main' into refactor/add_scalar_dist_function_and_setup_…

fa958d5

…cmake_with_detailed_cpu_flag

greptile-apps bot reviewed Mar 19, 2026

View reviewed changes

cmake/option.cmake Outdated Show resolved Hide resolved

Merge branch 'main' into refactor/add_scalar_dist_function_and_setup_…

1cd42f7

…cmake_with_detailed_cpu_flag

richyreachy requested review from egolearner, feihongxu0824 and iaojnh March 19, 2026 08:25

fix: fix cmake config

3d8bdf7

Merge branch 'main' into refactor/add_scalar_dist_function_and_setup_…

9804a8f

…cmake_with_detailed_cpu_flag

fix: add avx512fp16

91332ac

fix: cmake config

dc4d33c

Conversation

richyreachy commented Mar 17, 2026 • edited by greptile-apps bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Flowchart

Comments Outside Diff (7)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

richyreachy commented Mar 17, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

richyreachy commented Mar 18, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

Uh oh!

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

richyreachy commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

richyreachy commented Mar 17, 2026 •

edited by greptile-apps bot

Loading