Feat/mulhilo #1344
Conversation
Some toolchains (notably certain GCC builds) define shift and mul immediate intrinsics as macros that apply a textual C-style cast to their operand. That cast does not traverse the multi-level alias inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd -> avx512f), so a batch<T, ISA> fails to convert to its native register type in those contexts. Declare the conversion operator on batch itself so the native type is always one member-lookup away.
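A minimal sketch of the fix, assuming simd_register stores the raw register in a `data` member (names beyond those mentioned above are illustrative, not the exact xsimd code):

    template <class T, class A>
    class batch : public types::simd_register<T, A>
    {
    public:
        using register_type = typename types::simd_register<T, A>::register_type;

        // Declared on batch itself so a macro-expanded C-style cast such as
        // ((register_type)b) resolves in a single member lookup instead of
        // walking the avx512bw -> ... -> avx512f alias inheritance chain.
        XSIMD_INLINE operator register_type() const noexcept
        {
            return this->data;
        }
    };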
Adds three integer-multiplication primitives exposed via the public API (usage sketch after the list):
- mullo(x, y): low half of the lane-wise product (equivalent to x * y)
- mulhi(x, y): high half of the lane-wise product
- mulhilo(x, y): returns {mulhi, mullo} as a pair
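A usage sketch of the three entry points (free-function spelling and the pair return assumed to match the description above, not verified against the final headers):

    #include <cstdint>
    #include "xsimd/xsimd.hpp"

    void demo(xsimd::batch<std::uint32_t> x, xsimd::batch<std::uint32_t> y)
    {
        auto lo = xsimd::mullo(x, y);   // same lanes as x * y
        auto hi = xsimd::mulhi(x, y);   // upper 32 bits of each 64-bit product
        auto hl = xsimd::mulhilo(x, y); // {mulhi, mullo} in one call
        (void)lo; (void)hi; (void)hl;
    }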
Native kernels are provided for:
- NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
- SVE (svmulh_x)
- RVV (rvvmulh)
- SSE2 (mulhi_epi16 / mulhi_epu16)
- SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
- AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
- AVX-512F (shared 64-bit core)
- AVX-512BW (mulhi_epi16/epu16)
The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
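For reference, a scalar rendering of the same decomposition and sign fixup (helper names here are illustrative; the shared cores spell this with batch operators and the widening-mul functor instead):

    #include <cstdint>

    static inline uint64_t mulhi_u64_sketch(uint64_t a, uint64_t b)
    {
        const uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
        const uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
        const uint64_t ll = a_lo * b_lo; // bits   0..63
        const uint64_t lh = a_lo * b_hi; // bits  32..95
        const uint64_t hl = a_hi * b_lo; // bits  32..95
        const uint64_t hh = a_hi * b_hi; // bits 64..127
        // carry out of bit 63 of the middle partial sum
        const uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
        return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    }

    static inline int64_t mulhi_i64_sketch(int64_t a, int64_t b)
    {
        // signed high half = unsigned high half, minus b where a < 0 and
        // minus a where b < 0; (x >> 63) is the arithmetic-shift-by-63 mask
        const uint64_t u = mulhi_u64_sketch((uint64_t)a, (uint64_t)b);
        const uint64_t fix = ((uint64_t)(a >> 63) & (uint64_t)b)
                           + ((uint64_t)(b >> 63) & (uint64_t)a);
        return (int64_t)(u - fix);
    }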
The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
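Per-lane sketch of those two fallback shapes (names illustrative, not the actual mulhi_helper overloads):

    #include <cstdint>

    // <= 32-bit lanes: widen, multiply, shift the high half back down
    static inline uint16_t mulhi_u16_sketch(uint16_t a, uint16_t b)
    {
        return (uint16_t)((uint32_t(a) * uint32_t(b)) >> 16);
    }

    #if defined(__SIZEOF_INT128__)
    // 64-bit lanes when the compiler provides a 128-bit integer
    static inline uint64_t mulhi_u64_sketch128(uint64_t a, uint64_t b)
    {
        return (uint64_t)(((unsigned __int128)a * b) >> 64);
    }
    #endif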
…vlen=128)

QEMU < 11's RVV TCG emulation is dramatically slower than scalar (QEMU issue #2137). At vlen=128, gcc's RVV codegen for our test_xsimd ends up running long enough under apt-shipped qemu-user-static (8.2.x noble, 9.x plucky, 10.x trixie) to overflow the 6h GHA job timeout while making no observable progress. Measured locally:
- qemu 8.2.2 (Ubuntu 24.04 apt): test_xsimd at vlen=128 times out
- qemu 9.2.1 (Ubuntu 25.04 plucky): ditto
- qemu 10.0.8 (Debian trixie): ditto
- qemu 11.0.0 (Arch) + gcc 15.1: 367 cases / 5664 asserts in < 10 min

Vlens >= 256 stay within the test step budget on apt qemu (smaller emulator slowdown per logical op).

Keep the existing cross-rvv.yml workflow as-is (multi-compiler matrix: gcc-14, clang-17/18, apt qemu-user-static), but drop vector_bits=128 from its matrix and add fail-fast: false plus a 15 min timeout-minutes safety net so a stuck entry doesn't cancel its peers or burn 6h. Add a sibling workflow cross-rvv-arch.yml that runs the build and test inside archlinux:latest (qemu 11 + gcc 15.1) and covers vector_bits=128/256/512. This restores RVV vlen=128 coverage today without waiting for ubuntu-latest to ship qemu 11.

References:
- QEMU 11.0.0 release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/
- QEMU RVV slowdowns issue: https://gitlab.com/qemu-project/qemu/-/issues/2137
- Ubuntu RVV vstart bug: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2095169
serge-sans-paille
left a comment
First round of comments, I haven't reviewed the avx512 nor the rvv parts yet.
| * textual C-style cast inside the macro does not traverse the alias
| * inheritance chain. Declaring the operator here makes it visible on
| * the batch type directly. */
| XSIMD_INLINE operator register_type() const noexcept
you could use `using types::simd_register<T, A>::operator register_type;` instead.
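That is, roughly (a sketch of the suggested shape; the surrounding class is illustrative):

    template <class T, class A>
    class batch : public types::simd_register<T, A>
    {
    public:
        using register_type = typename types::simd_register<T, A>::register_type;
        // re-export the inherited conversion instead of redefining it
        using types::simd_register<T, A>::operator register_type;
    };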
| #if defined(__SIZEOF_INT128__)
| template <class T>
| typename std::enable_if<std::is_integral<T>::value && (sizeof(T) == 8), T>::type
I fear the `is_integral` test might fail here, as this type is not a standard type; see https://godbolt.org/z/3nd7KEj9K (works on gcc though)
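Assuming the godbolt link shows `std::is_integral<__int128>`: libstdc++, for instance, marks `__int128` as integral only in GNU language modes, so the trait can flip under strict `-std=c++NN` flags. Roughly:

    #include <type_traits>

    #if defined(__SIZEOF_INT128__)
    // may hold under -std=gnu++17 (libstdc++) yet fail under strict -std=c++17
    static_assert(std::is_integral<__int128>::value, "__int128 not integral here");
    #endif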
| array_type hi_expected;
| for (std::size_t i = 0; i < size; ++i)
| {
|     lo_expected[i] = static_cast<value_type>(static_cast<UT>(a[i]) * static_cast<UT>(b[i]));
Can you explain this intermediate cast to an unsigned type?
| # So vlen=128 RVV coverage lives in this workflow, which runs the build
| # and test inside an `archlinux:latest` container (qemu 11 + gcc 15.1).
| # The matching ubuntu-runner workflow `cross-rvv.yml` keeps multi-compiler
| # matrix coverage (gcc-14, clang-17/18) for vlens >= 256, where the apt
I guess there's no way to have all compilers running on archlinux?
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mul`                       | per slot multiply                                  |
| +---------------------------------------+----------------------------------------------------+
| | :cpp:func:`mullo`                     | low N bits of the 2N-bit integer product           |
indentation seems odd.
| XSIMD_INLINE batch<uint16_t, A> mulhi(batch<uint16_t, A> const& self, batch<uint16_t, A> const& other, requires_arch<sse2>) noexcept
| {
|     return _mm_mulhi_epu16(self, other);
| }
note to self: we could implement the 8-bit version of all those multiplies, don't know if it makes sense though.
| XSIMD_INLINE batch<int32_t, A> mulhi(batch<int32_t, A> const& self, batch<int32_t, A> const& other, requires_arch<sse4_1>) noexcept
| {
|     __m128i even = _mm_mul_epi32(self, other); // 64-bit products in lanes 0,2
|     __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
you could just `_mm_srli_epi64(self, 32)` instead, right?
and same for other, of course
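That is (a sketch; `_mm_mul_epi32` only reads, and sign-extends, the low 32 bits of each 64-bit lane, so the logical shift is enough, mirroring the unsigned path below):

    __m128i odd = _mm_mul_epi32(_mm_srli_epi64(self, 32),
                                _mm_srli_epi64(other, 32));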
|     __m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
|                                 _mm_shuffle_epi32(other, _MM_SHUFFLE(3, 3, 1, 1)));
|     // hi halves in the low 32 of each 64 lane of (even>>32), and in the high 32 of odd
|     __m128i even_hi = _mm_srli_epi64(even, 32);
you could avoid this shift by calling `_mm_shuffle_ps` directly on even and odd
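One reading of that suggestion (a sketch against the `even`/`odd` values above; the extra `_mm_shuffle_epi32` restores lane order, so this trades shift+blend for shuffle+shuffle):

    // picks elements 1,3 of even then 1,3 of odd, i.e. all four high
    // halves, but ordered [hi(p0), hi(p2), hi(p1), hi(p3)]
    __m128 mixed = _mm_shuffle_ps(_mm_castsi128_ps(even),
                                  _mm_castsi128_ps(odd),
                                  _MM_SHUFFLE(3, 1, 3, 1));
    // reorder to [hi(p0), hi(p1), hi(p2), hi(p3)]
    __m128i hi = _mm_shuffle_epi32(_mm_castps_si128(mixed),
                                   _MM_SHUFFLE(3, 1, 2, 0));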
| {
|     __m128i even = _mm_mul_epu32(self, other);
|     __m128i odd = _mm_mul_epu32(_mm_srli_epi64(self, 32), _mm_srli_epi64(other, 32));
|     __m128i even_hi = _mm_srli_epi64(even, 32);
same here for blend vs. shuffle. You did use _mm_srli_epi64 here though, so there is probably a reason I don't get ^^!
| {
|     int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
|     int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
|     return vcombine_s8(vshrn_n_s16(lo, 8), vshrn_n_s16(hi, 8));
What about `vuzpq_s8`? (and same below)
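Sketch of the vuzp idea using the AArch64 single-result form `vuzp2q_s8` (the quoted `vuzpq_s8` is the ARMv7 pair-returning variant); on little-endian, keeping the odd-indexed bytes of the widened products selects the high 8 bits of every 16-bit lane:

    int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
    int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
    return vuzp2q_s8(vreinterpretq_s8_s16(lo), vreinterpretq_s8_s16(hi));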
Successor to #1334 (closed). Adds mulhi/mullo/mulhilo for integer batches, and splits the RVV cross-compile CI so that vlen=128 runs under qemu 11 (arch container), working around the known RVV slowdowns of qemu < 11 at that vlen (QEMU issue #2137). All 20 fork CI jobs are green on this branch.
Took me forever to find the issue as I am using qemu 11 locally...