Refactor/add scalar dist function and setup cmake with detailed cpu flag#238
Open
richyreachy wants to merge 45 commits intomainfrom
Open
Refactor/add scalar dist function and setup cmake with detailed cpu flag#238richyreachy wants to merge 45 commits intomainfrom
richyreachy wants to merge 45 commits intomainfrom
Conversation
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
…cmake_with_detailed_cpu_flag
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
…cmake_with_detailed_cpu_flag
Collaborator
Author
|
@greptile |
1 similar comment
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
…cmake_with_detailed_cpu_flag
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
Collaborator
Author
|
@greptile |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
add scalar dist function and setup cmake with detailed cpu flag
Greptile Summary
This PR is a large-scale refactor that renames all distance/inner-product kernel functions to include their data-type suffix (e.g.
InnerProductAVX→InnerProductFp32AVX), adds scalar fallback implementations for every data type across all three distance families (Euclidean, inner product, and MIPS), and improves the CMake build system to detect and apply distinct compiler march flags for SSE, AVX2, AVX512, and the new AVX512FP16 tier. It also splits theavx512.ccfiles to separate the legacy-AVX512 (via conversion) from the native AVX512FP16 kernels.Key changes:
*_scalar.ccfor euclidean, inner product, and MIPS distances;*_avx512fp16.ccfiles split out from*_avx512.ccComputefunctions conditionally undefinedcmake/option.cmakeadds a fourth march-flag output variable and new fallback arch lists for AVX512 and AVX512FP16 detectionRemaining issues from prior review threads that are not yet fully addressed:
cmake/option.cmakeremain: wrong loop variable${_arch}(should be${_arch_avx512}), non-AVX512 fallback targets in the AVX512 list, missinginclude(CheckCXXCompilerFlag), and cached result-variable reuse across loop iterationssrc/ailego/CMakeLists.txt:math_batch/*_dispatch.ccfiles remain inMATH_FILES_AVX512whilemath/*_dispatch.ccfiles moved toMATH_FILES_AVX512FP16, creating an inconsistencyNew findings in this review:
EuclideanDistance*Scalarvariants (EuclideanDistanceFp16Scalar,EuclideanDistanceFp32Scalar,EuclideanDistanceInt4Scalar,EuclideanDistanceInt8Scalar) ineuclidean_distance_matrix_scalar.ccare dead code — no dispatch file calls themInnerProductScalar<Float16>accumulatesstatic_cast<float>(m[i] * q[i])where the multiplication may happen in 16-bit precision before widening; converting each operand tofloatfirst would improve scalar fallback accuracyforeach(MATH_FILE ${MATH_FILES_AVX512FP16})loop insrc/ailego/CMakeLists.txtConfidence Score: 2/5
cmake/option.cmakestill contains multiple confirmed issues: the${_arch}typo that preventsMATH_MARCH_FLAG_AVX512from ever being set correctly, non-AVX512 architectures in the AVX512 fallback list, missingCheckCXXCompilerFlaginclude, and the cached-variable reuse problem. These will cause silent mis-compilation of*_avx512.ccfiles on many hosts. Thesrc/ailego/CMakeLists.txtinconsistency betweenmath_batchandmathdispatch file groupings also remains. Until the CMake issues are resolved, builds may silently produce binaries that either fail to compile or disable all AVX-512 code paths.cmake/option.cmakerequires the most attention due to the loop variable typo, non-AVX512 fallback targets, missing module include, and cached variable reuse.src/ailego/CMakeLists.txtalso needs review of themath_batchdispatch inconsistency.Important Files Changed
VAR_NAME_AVX512FP16parameter tosetup_compiler_march_for_x86; introduces new AVX512 and AVX512FP16 fallback lists, but includes non-AVX512 targets (core-avx2,x86-64) in AVX512 list, uses a wrong loop variable${_arch}instead of${_arch_avx512}, usescheck_cxx_compiler_flagwithout includingCheckCXXCompilerFlag, and relies on a cached variableCOMPILER_FLAG_SUPPORTthat is reused across loop iterations — all previously flagged in the review thread.MATH_FILES_AVX512into newMATH_FILES_AVX512FP16group; themath_batch/*_dispatch.ccfiles remain inMATH_FILES_AVX512creating an inconsistency; the new foreach loop has incorrect indentation;AUTO_DETECT_ARCHguard removal was previously flagged.EuclideanDistance*Scalarvariants are defined but never called from any dispatch file (dead code), and the FP16 template multiplication may accumulate in half-precision before widening.ComputeSegments<T>template correctly negates the sum forMinusvariants. Forward declarations ofComputeInnerProductSparseInSegmentFp32/Fp16are resolved in their respective dispatch files. Float16 multiplication inInnerProductScalar<Float16>may reduce precision before widening to float.returnstatements are now fixed; all SIMD branches correctly return after storing the result. Scalar fallback properly wired. NEON path updated to use renamedSquaredEuclideanDistanceFp32NEON.MinusInnerProductFp16AVX. SparseMinusInnerProductSparseMatrix::Computecalls scalar-named helper that internally dispatches via SIMD throughComputeInnerProductSparseInSegmentFp16.InnerProductFp32NEON,MinusInnerProductFp32NEON).ComputeInnerProductSparseInSegmentFp32now properly defined here and forward-declared in scalar.cc. Scalar fallback correctly wired for all branches.Computefunctions using#if defined(__ARM_NEON)guards. Scalar fallback confirmed for all code paths.InnerProductFp16AVX512FP16and newMinusInnerProductFp16AVX512FP16implementations using native__AVX512FP16__intrinsics, plus the sparseInnerProductSparseInSegmentFp16AVX512FP16implementation.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["Dispatch Compute()"] --> B{ARM NEON?} B -->|Yes| C["Call *NEON() variant"] B -->|No| D{__AVX512FP16__\ncompiled in?} D -->|Yes| E{CpuFeatures\n.AVX512_FP16?} E -->|Yes| F["Call *AVX512FP16()"] E -->|No| G{__AVX512F__\ncompiled in?} D -->|No| G G -->|Yes| H{CpuFeatures\n.AVX512F?} H -->|Yes| I["Call *AVX512()"] H -->|No| J{__AVX__ compiled in?} G -->|No| J J -->|Yes| K{CpuFeatures\n.AVX?} K -->|Yes| L["Call *AVX()"] K -->|No| M{__SSE__ compiled in?} J -->|No| M M -->|Yes| N{CpuFeatures\n.SSE?} N -->|Yes| O["Call *SSE()"] N -->|No| P["Call *Scalar() [NEW]"] M -->|No| P F --> Q["return result"] I --> Q L --> Q O --> Q P --> Q C --> QComments Outside Diff (7)
src/ailego/math/inner_product_matrix_fp32_dispatch.cc, line 386-393 (link)InnerProductSparseInSegmentFp32SSEmay be undefined — potential linker errorInnerProductSparseInSegmentFp32SSEis forward-declared here inside#if defined(__SSE4_1__)and called insideComputeInnerProductSparseInSegmentFp32. The correspondinginner_product_matrix_fp32_sse.ccwas significantly refactored in this PR (73 lines changed). If that file does not export a symbol namedInnerProductSparseInSegmentFp32SSE(i.e., the rename convention introduced for other SSE functions was applied there too, changing the previous name), any build targeting SSE4.1 will fail at link time with an unresolved symbol.Please verify that
inner_product_matrix_fp32_sse.ccdefines a function with exactly this signature:src/ailego/math/inner_product_matrix_scalar.cc, line 544-576 (link)MinusInnerProductSparseFp32Scalar(this file) →ComputeSegments<float>(this file) →ComputeInnerProductSparseInSegment<float>(this file) →ComputeInnerProductSparseInSegmentFp32(declared extern here, defined ininner_product_matrix_fp32_dispatch.cc) →InnerProductSparseInSegmentFp32Scalar(back in this file).While this is technically valid C++ at link time, the circular dependency between the two TUs makes the sparse FP32 path difficult to reason about. A simpler alternative would be to have
ComputeInnerProductSparseInSegment<float>callInnerProductSparseInSegmentFp32Scalardirectly for the scalar path, or to inline the segment dispatch entirely here, matching the pattern used for the FP16 variant ininner_product_matrix_fp16_dispatch.ccwhich has a self-containedComputeInnerProductSparseInSegmentFp16definition.src/ailego/math/euclidean_distance_matrix_int8_sse.cc, line 22-24 (link)inlinefunction called across translation unit boundariesSquaredEuclideanDistanceInt8SSEInternalis declaredinlinein this.ccfile but is forward-declared (withoutinline) and called fromeuclidean_distance_matrix_int8_avx2.cc— a different translation unit. In C++, aninlinefunction definition in a.ccfile is not guaranteed to emit an external linkage symbol: the compiler may inline every call within this TU and omit the symbol entirely, causing a linker error in the AVX2 TU.The fix is to remove the
inlinekeyword, since the function is intended to be reachable across TUs:tests/ailego/math/euclidean_distance_matrix_fp16_test.cc, line 142 (link)Test coverage gap: small-vector dimensions excluded
The dimension range was narrowed from
[1, 65]to[32, 65], completely removing tests for vectors with 1–31 elements. The primary motivation for this PR is to add scalar fallbacks so that every distance function works on any vector size — yet the exact size range that exercises those new scalar paths (small dimensions, typically< 32or< 8elements) is now excluded from testing.Bugs in boundary handling (e.g., the masked tail loops, off-by-one in
last_alignedcalculations, or a wrong scalar accumulation loop) would be invisible under this test configuration. Restoring the lower bound to1(or at minimum1) is needed to validate the new scalar implementations.The same narrowing is also applied at line 187.
src/ailego/math/euclidean_distance_matrix_scalar.cc, line 53-65 (link)INT4 element-nibble bit-shift indexing inconsistency
The index used for the second INT4 element in the scalar Euclidean distance loop differs in bit-shift direction from the inner product scalar code, and is internally inconsistent with the element layout assumed by the MIPS scalar and SSE/AVX2 implementations.
For a packed byte where the low nibble holds element 0 and the high nibble holds element 1:
((m_val << 4) & 0xf0) | ((q_val) & 0xf)— usesm_loandq_lo. ✓((m_val) & 0xf0) | ((q_val >> 4) & 0xf)— usesm_hiandq_hi. ✓Both indices are correct individually, but cross-check the
InnerProductInt4Scalarininner_product_matrix_scalar.cc(lines ~56–64) which uses the exact same pattern. If the nibble layout assumed by the lookup table (Int4SquaredDiffTable,Int4MulTable) ever differs from this encoding, results will be silently wrong. EnsureInt4SquaredDiffTableis indexed as(m_nibble << 4) | q_nibbleto match the computed indices above.src/ailego/math/inner_product_matrix_scalar.cc, line 34-41 (link)Float16multiplication precision may silently accumulate in half-precisionInnerProductScalar<ailego::Float16>(and itsMinusvariant) computes:If
Float16::operator*returns aFloat16(as is typical for custom half-precision types), the multiplication and intermediate result are in 16-bit precision before being widened tofloat. For largedimvalues this causes compounding precision loss that is worse than widening each operand before multiplying:This matches how FP32 and INT8 variants behave and avoids potential silent accuracy degradation in the scalar fallback path.
src/ailego/math/euclidean_distance_matrix_scalar.cc, line 40-47 (link)EuclideanDistance*Scalarfunctions are defined but never calledEuclideanDistanceFp16Scalar,EuclideanDistanceFp32Scalar,EuclideanDistanceInt4Scalar, andEuclideanDistanceInt8Scalarare all defined in this file, but none of the dispatch files forward-declare or call them. All dispatch-level Euclidean distance functions call the squared variant and then takestd::sqrtoutside the function. These four functions are unreachable dead code and will trigger linker warnings on strict builds.If they are intentionally kept as utility functions for future use, they should at minimum be placed in an anonymous namespace or marked
[[maybe_unused]]to suppress warnings.Last reviewed commit: "fix: cmake config"