
Feat/mulhilo #1344

Open

DiamonDinoia wants to merge 3 commits into xtensor-stack:master from DiamonDinoia:feat/mulhilo

Conversation


@DiamonDinoia (Contributor) commented May 6, 2026

Successor to #1334 (closed). Adds mulhi/mullo/mulhilo for integer batches. The RVV cross-compile CI is split so the vlen=128 job runs under qemu 11 (Arch container), working around the known RVV slowdowns in qemu < 11 (QEMU issue #2137). All 20 fork CI jobs are green on this branch.

Took me forever to find the issue as I am using qemu 11 locally...

Some toolchains (notably certain GCC builds) define shift and mul
immediate intrinsics as macros that apply a textual C-style cast to
their operand. That cast does not traverse the multi-level alias
inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd
-> avx512f), so a batch<T, ISA> fails to convert to its native
register type in those contexts.

Declare the conversion operator on batch itself so the native type is
always one member-lookup away.
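
For illustration only, here is a minimal standalone reduction of the shape of the fix. The toy types below stand in for the simd_register alias chain; a plain int conversion compiles either way in standard C++, so this does not reproduce the toolchain-specific macro failure, it only shows where the conversion now lives.

// Toy reduction (not xsimd code): the conversion to the native register type
// is re-declared on the most-derived batch type instead of being reached
// through several levels of alias inheritance.
struct reg_base { int data; operator int() const noexcept { return data; } };
struct alias_1 : reg_base {};
struct alias_2 : alias_1 {};

struct batch_like : alias_2
{
    // the fix: the conversion is one member lookup away on the batch type itself
    operator int() const noexcept { return data; }
};

// stand-in for a macro-style intrinsic that applies a textual C-style cast
#define TOY_SLLI(x, n) (((int)(x)) << (n))

int main() { batch_like b{}; b.data = 3; return TOY_SLLI(b, 2) == 12 ? 0 : 1; }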
Adds three integer-multiplication primitives exposed via the public API:
  - mullo(x, y): low half of the lane-wise product (equivalent to x * y)
  - mulhi(x, y): high half of the lane-wise product
  - mulhilo(x, y): returns {mulhi, mullo} as a pair
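
A quick usage sketch of the three entry points, assuming they land in the xsimd:: namespace with a pair-like return for mulhilo as described above (illustrative, not taken from the PR's tests):

#include <xsimd/xsimd.hpp>
#include <cstdint>

int main()
{
    using batch_t = xsimd::batch<uint32_t>;
    batch_t a(0x9E3779B9u), b(0x85EBCA6Bu); // broadcast constructors

    batch_t lo = xsimd::mullo(a, b);   // low 32 bits of each 64-bit product (same as a * b)
    batch_t hi = xsimd::mulhi(a, b);   // high 32 bits of each 64-bit product
    auto hilo  = xsimd::mulhilo(a, b); // {mulhi, mullo}

    bool ok = xsimd::all(hilo.first == hi)
           && xsimd::all(hilo.second == lo)
           && xsimd::all(lo == a * b);
    return ok ? 0 : 1;
}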

Native kernels are provided for:
  - NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
  - SVE (svmulh_x)
  - RVV (rvvmulh)
  - SSE2 (mulhi_epi16 / mulhi_epu16)
  - SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
  - AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
  - AVX-512F (shared 64-bit core)
  - AVX-512BW (mulhi_epi16/epu16)

The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
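
For reference, a scalar sketch of the same split and the signed fixup; the function names here are illustrative, and the actual cores express this with batch operators and the arch-specific widening-mul functor:

#include <cstdint>

// high 64 bits of an unsigned 64x64 product via the ll/lh/hl/hh split
static inline uint64_t mulhi_u64_scalar(uint64_t a, uint64_t b)
{
    const uint64_t mask = 0xFFFFFFFFull;
    uint64_t ll = (a & mask) * (b & mask);
    uint64_t lh = (a & mask) * (b >> 32);
    uint64_t hl = (a >> 32) * (b & mask);
    uint64_t hh = (a >> 32) * (b >> 32);
    uint64_t carry = ((ll >> 32) + (lh & mask) + (hl & mask)) >> 32;
    return hh + (lh >> 32) + (hl >> 32) + carry;
}

// signed fixup: subtract the other operand wherever the sign bit is set,
// i.e. the arithmetic-shift-by-63 pattern mentioned above (relies on >> of
// a negative value being an arithmetic shift, as on all mainstream compilers)
static inline int64_t mulhi_i64_scalar(int64_t a, int64_t b)
{
    uint64_t uh = mulhi_u64_scalar(static_cast<uint64_t>(a), static_cast<uint64_t>(b));
    uint64_t fix = (static_cast<uint64_t>(a >> 63) & static_cast<uint64_t>(b))
                 + (static_cast<uint64_t>(b >> 63) & static_cast<uint64_t>(a));
    return static_cast<int64_t>(uh - fix);
}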

The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
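
A scalar sketch of that fallback idea (helper names made up for illustration): widen-and-split for the small types, __int128 for the 64-bit case when the compiler provides it.

#include <cstdint>

// <=32-bit path: promote to the next wider integer, multiply, split
template <class T, class Wide>
static inline void mulhilo_narrow(T a, T b, T& hi, T& lo)
{
    Wide p = static_cast<Wide>(a) * static_cast<Wide>(b);
    lo = static_cast<T>(p);
    hi = static_cast<T>(p >> (8 * sizeof(T))); // arithmetic shift for signed Wide
}

#if defined(__SIZEOF_INT128__)
// 64-bit path when a 128-bit type is available
static inline uint64_t mulhi_u64_int128(uint64_t a, uint64_t b)
{
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}
#endif

// e.g. uint16_t hi, lo; mulhilo_narrow<uint16_t, uint32_t>(x, y, hi, lo);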
…vlen=128)

QEMU < 11's RVV TCG emulation is dramatically slower than scalar
(QEMU issue #2137). At vlen=128, gcc's RVV codegen for our test_xsimd
ends up running long enough under apt-shipped qemu-user-static
(8.2.x noble, 9.x plucky, 10.x trixie) to overflow the 6h GHA job
timeout while making no observable progress.

Measured locally:
  qemu 8.2.2 (Ubuntu 24.04 apt)    : test_xsimd at vlen=128 times out
  qemu 9.2.1 (Ubuntu 25.04 plucky) : ditto
  qemu 10.0.8 (Debian trixie)      : ditto
  qemu 11.0.0 (Arch) + gcc 15.1    : 367 cases / 5664 asserts in <10 min
Vlens >= 256 stay within the test step budget on apt qemu (smaller
emulator slowdown per logical op).

Keep the existing cross-rvv.yml workflow as-is — multi-compiler
matrix (gcc-14, clang-17/18), apt qemu-user-static — but drop
vector_bits=128 from its matrix and add fail-fast: false plus a
15 min timeout-minutes safety net so a stuck entry doesn't
cancel its peers or burn 6h.

Add a sibling workflow cross-rvv-arch.yml that runs the build
and test inside archlinux:latest (qemu 11 + gcc 15.1) and covers
vector_bits=128/256/512. This restores RVV vlen=128 coverage today
without waiting for ubuntu-latest to ship qemu 11.

References:
  QEMU 11.0.0 release notes  https://www.qemu.org/2026/04/22/qemu-11-0-0/
  QEMU RVV slowdowns issue   https://gitlab.com/qemu-project/qemu/-/issues/2137
  Ubuntu RVV vstart bug      https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2095169
@serge-sans-paille (Contributor) left a comment


First round of comments, I haven't reviewed the avx512 nor the rvv part yet.

* textual C-style cast inside the macro does not traverse the alias
* inheritance chain. Declaring the operator here makes it visible on
* the batch type directly. */
XSIMD_INLINE operator register_type() const noexcept

you could use `using types::simd_register<T, A>::operator register_type` instead.
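
A tiny standalone illustration of that alternative: re-exporting an inherited conversion operator with a using-declaration instead of re-implementing it (toy types, not xsimd's).

struct reg { int data; operator int() const noexcept { return data; } };
struct alias : reg {};

struct batch_like : alias
{
    using reg::operator int; // makes the inherited conversion directly visible here
};

int main() { batch_like b{}; b.data = 5; return static_cast<int>(b) == 5 ? 0 : 1; }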

Comment thread test/test_batch.cpp

#if defined(__SIZEOF_INT128__)
template <class T>
typename std::enable_if<std::is_integral<T>::value && (sizeof(T) == 8), T>::type

I fear the `is_integral` test might fail here, as this type is not a standard type; see https://godbolt.org/z/3nd7KEj9K (it works on gcc though)

Comment thread test/test_batch.cpp
array_type hi_expected;
for (std::size_t i = 0; i < size; ++i)
{
lo_expected[i] = static_cast<value_type>(static_cast<UT>(a[i]) * static_cast<UT>(b[i]));

Can you explain this intermediate cast to an unsigned type?

# So vlen=128 RVV coverage lives in this workflow, which runs the build
# and test inside an `archlinux:latest` container (qemu 11 + gcc 15.1).
# The matching ubuntu-runner workflow `cross-rvv.yml` keeps multi-compiler
# matrix coverage (gcc-14, clang-17/18) for vlens >= 256, where the apt

I guess there's no way to have all compilers running on archlinux?

+---------------------------------------+----------------------------------------------------+
| :cpp:func:`mul` | per slot multiply |
+---------------------------------------+----------------------------------------------------+
| :cpp:func:`mullo` | low N bits of the 2N-bit integer product |

indentation seems odd.

XSIMD_INLINE batch<uint16_t, A> mulhi(batch<uint16_t, A> const& self, batch<uint16_t, A> const& other, requires_arch<sse2>) noexcept
{
return _mm_mulhi_epu16(self, other);
}

note to self: we could implement the 8-bit version of all those multiplies; I don't know if it makes sense though.

XSIMD_INLINE batch<int32_t, A> mulhi(batch<int32_t, A> const& self, batch<int32_t, A> const& other, requires_arch<sse4_1>) noexcept
{
__m128i even = _mm_mul_epi32(self, other); // 64-bit products in lanes 0,2
__m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),

you could just _mm_srli_epi32(self, 32) instead, right?


and same for other, of course

__m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
_mm_shuffle_epi32(other, _MM_SHUFFLE(3, 3, 1, 1)));
// hi halves in the low 32 of each 64 lane of (even>>32), and in the high 32 of odd
__m128i even_hi = _mm_srli_epi64(even, 32);

you could avoid this shift by calling _mm_shuffle_ps directly on even and odd

{
__m128i even = _mm_mul_epu32(self, other);
__m128i odd = _mm_mul_epu32(_mm_srli_epi64(self, 32), _mm_srli_epi64(other, 32));
__m128i even_hi = _mm_srli_epi64(even, 32);

same here for blend vs. shuffle. You did use _mm_srli_epi64 here though, so there is probably a reason I don't get ^^!

{
int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
return vcombine_s8(vshrn_n_s16(lo, 8), vshrn_n_s16(hi, 8));

What about vuzpq_s8 (and the same below)?
