
Feat/mulhilo #1344

Open

DiamonDinoia wants to merge 3 commits into xtensor-stack:master from DiamonDinoia:feat/mulhilo

Conversation


@DiamonDinoia (Contributor) commented May 6, 2026

Successor to #1334 (closed). Adds mulhi/mullo/mulhilo for integer batches. The RVV cross-compile CI is split so the vlen=128 job runs under qemu 11 (Arch container), working around the known RVV slowdowns in qemu < 11 (QEMU issue #2137). All 20 fork CI jobs are green on this branch.

Took me forever to find the issue as I am using qemu 11 locally...

Some toolchains (notably certain GCC builds) define shift and mul
immediate intrinsics as macros that apply a textual C-style cast to
their operand. That cast does not traverse the multi-level alias
inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd
-> avx512f), so a batch<T, ISA> fails to convert to its native
register type in those contexts.

Declare the conversion operator on batch itself so the native type is
always one member-lookup away.
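
For illustration only, here is a minimal standalone reduction of the shape of the fix. The toy types below stand in for the simd_register alias chain; a plain int conversion compiles either way in standard C++, so this does not reproduce the toolchain-specific macro failure, it only shows where the conversion now lives.

// Toy reduction (not xsimd code): the conversion to the native register type
// is re-declared on the most-derived batch type instead of being reached
// through several levels of alias inheritance.
struct reg_base { int data; operator int() const noexcept { return data; } };
struct alias_1 : reg_base {};
struct alias_2 : alias_1 {};

struct batch_like : alias_2
{
    // the fix: the conversion is one member lookup away on the batch type itself
    operator int() const noexcept { return data; }
};

// stand-in for a macro-style intrinsic that applies a textual C-style cast
#define TOY_SLLI(x, n) (((int)(x)) << (n))

int main() { batch_like b{}; b.data = 3; return TOY_SLLI(b, 2) == 12 ? 0 : 1; }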
Adds three integer-multiplication primitives exposed via the public API:
  - mullo(x, y): low half of the lane-wise product (equivalent to x * y)
  - mulhi(x, y): high half of the lane-wise product
  - mulhilo(x, y): returns {mulhi, mullo} as a pair
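
A quick usage sketch of the three entry points, assuming they land in the xsimd:: namespace with a pair-like return for mulhilo as described above (illustrative, not taken from the PR's tests):

#include <xsimd/xsimd.hpp>
#include <cstdint>

int main()
{
    using batch_t = xsimd::batch<uint32_t>;
    batch_t a(0x9E3779B9u), b(0x85EBCA6Bu); // broadcast constructors

    batch_t lo = xsimd::mullo(a, b);   // low 32 bits of each 64-bit product (same as a * b)
    batch_t hi = xsimd::mulhi(a, b);   // high 32 bits of each 64-bit product
    auto hilo  = xsimd::mulhilo(a, b); // {mulhi, mullo}

    bool ok = xsimd::all(hilo.first == hi)
           && xsimd::all(hilo.second == lo)
           && xsimd::all(lo == a * b);
    return ok ? 0 : 1;
}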

Native kernels are provided for:
  - NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
  - SVE (svmulh_x)
  - RVV (rvvmulh)
  - SSE2 (mulhi_epi16 / mulhi_epu16)
  - SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
  - AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
  - AVX-512F (shared 64-bit core)
  - AVX-512BW (mulhi_epi16/epu16)

The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
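
For reference, a scalar sketch of the same split and the signed fixup; the function names here are illustrative, and the actual cores express this with batch operators and the arch-specific widening-mul functor:

#include <cstdint>

// high 64 bits of an unsigned 64x64 product via the ll/lh/hl/hh split
static inline uint64_t mulhi_u64_scalar(uint64_t a, uint64_t b)
{
    const uint64_t mask = 0xFFFFFFFFull;
    uint64_t ll = (a & mask) * (b & mask);
    uint64_t lh = (a & mask) * (b >> 32);
    uint64_t hl = (a >> 32) * (b & mask);
    uint64_t hh = (a >> 32) * (b >> 32);
    uint64_t carry = ((ll >> 32) + (lh & mask) + (hl & mask)) >> 32;
    return hh + (lh >> 32) + (hl >> 32) + carry;
}

// signed fixup: subtract the other operand wherever the sign bit is set,
// i.e. the arithmetic-shift-by-63 pattern mentioned above (relies on >> of
// a negative value being an arithmetic shift, as on all mainstream compilers)
static inline int64_t mulhi_i64_scalar(int64_t a, int64_t b)
{
    uint64_t uh = mulhi_u64_scalar(static_cast<uint64_t>(a), static_cast<uint64_t>(b));
    uint64_t fix = (static_cast<uint64_t>(a >> 63) & static_cast<uint64_t>(b))
                 + (static_cast<uint64_t>(b >> 63) & static_cast<uint64_t>(a));
    return static_cast<int64_t>(uh - fix);
}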

The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
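
A scalar sketch of that fallback idea (helper names made up for illustration): widen-and-split for the small types, __int128 for the 64-bit case when the compiler provides it.

#include <cstdint>

// <=32-bit path: promote to the next wider integer, multiply, split
template <class T, class Wide>
static inline void mulhilo_narrow(T a, T b, T& hi, T& lo)
{
    Wide p = static_cast<Wide>(a) * static_cast<Wide>(b);
    lo = static_cast<T>(p);
    hi = static_cast<T>(p >> (8 * sizeof(T))); // arithmetic shift for signed Wide
}

#if defined(__SIZEOF_INT128__)
// 64-bit path when a 128-bit type is available
static inline uint64_t mulhi_u64_int128(uint64_t a, uint64_t b)
{
    return static_cast<uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}
#endif

// e.g. uint16_t hi, lo; mulhilo_narrow<uint16_t, uint32_t>(x, y, hi, lo);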
…vlen=128)

QEMU < 11's RVV TCG emulation is dramatically slower than scalar
(QEMU issue #2137). At vlen=128, gcc's RVV codegen for our test_xsimd
ends up running long enough under apt-shipped qemu-user-static
(8.2.x noble, 9.x plucky, 10.x trixie) to overflow the 6h GHA job
timeout while making no observable progress.

Measured locally:
  qemu 8.2.2 (Ubuntu 24.04 apt)    : test_xsimd at vlen=128 times out
  qemu 9.2.1 (Ubuntu 25.04 plucky) : ditto
  qemu 10.0.8 (Debian trixie)      : ditto
  qemu 11.0.0 (Arch) + gcc 15.1    : 367 cases / 5664 asserts in <10 min
Vlens >= 256 stay within the test step budget on apt qemu (smaller
emulator slowdown per logical op).

Keep the existing cross-rvv.yml workflow as-is — multi-compiler
matrix (gcc-14, clang-17/18), apt qemu-user-static — but drop
vector_bits=128 from its matrix and add fail-fast: false plus a
15 min timeout-minutes safety net so a stuck entry doesn't
cancel its peers or burn 6h.

Add a sibling workflow cross-rvv-arch.yml that runs the build
and test inside archlinux:latest (qemu 11 + gcc 15.1) and covers
vector_bits=128/256/512. This restores RVV vlen=128 coverage today
without waiting for ubuntu-latest to ship qemu 11.

References:
  QEMU 11.0.0 release notes  https://www.qemu.org/2026/04/22/qemu-11-0-0/
  QEMU RVV slowdowns issue   https://gitlab.com/qemu-project/qemu/-/issues/2137
  Ubuntu RVV vstart bug      https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2095169
@serge-sans-paille (Contributor) left a comment


First round of comments, I haven't reviewed the avx512 nor the rvv part yet.

* textual C-style cast inside the macro does not traverse the alias
* inheritance chain. Declaring the operator here makes it visible on
* the batch type directly. */
XSIMD_INLINE operator register_type() const noexcept

you could use `using types::simd_register<T, A>::operator register_type` instead.
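
A tiny standalone illustration of that alternative: re-exporting an inherited conversion operator with a using-declaration instead of re-implementing it (toy types, not xsimd's).

struct reg { int data; operator int() const noexcept { return data; } };
struct alias : reg {};

struct batch_like : alias
{
    using reg::operator int; // makes the inherited conversion directly visible here
};

int main() { batch_like b{}; b.data = 5; return static_cast<int>(b) == 5 ? 0 : 1; }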

Comment thread test/test_batch.cpp

#if defined(__SIZEOF_INT128__)
template <class T>
typename std::enable_if<std::is_integral<T>::value && (sizeof(T) == 8), T>::type

I fear the `is_integral` test might fail here, as this type is not a standard type; see https://godbolt.org/z/3nd7KEj9K (it works on gcc though)

Comment thread test/test_batch.cpp
array_type hi_expected;
for (std::size_t i = 0; i < size; ++i)
{
lo_expected[i] = static_cast<value_type>(static_cast<UT>(a[i]) * static_cast<UT>(b[i]));

Can you explain this intermediate cast to an unsigned type?

# So vlen=128 RVV coverage lives in this workflow, which runs the build
# and test inside an `archlinux:latest` container (qemu 11 + gcc 15.1).
# The matching ubuntu-runner workflow `cross-rvv.yml` keeps multi-compiler
# matrix coverage (gcc-14, clang-17/18) for vlens >= 256, where the apt

I guess there's no way to have all compilers running on archlinux?

+---------------------------------------+----------------------------------------------------+
| :cpp:func:`mul` | per slot multiply |
+---------------------------------------+----------------------------------------------------+
| :cpp:func:`mullo` | low N bits of the 2N-bit integer product |

indentation seems odd.

XSIMD_INLINE batch<uint16_t, A> mulhi(batch<uint16_t, A> const& self, batch<uint16_t, A> const& other, requires_arch<sse2>) noexcept
{
return _mm_mulhi_epu16(self, other);
}

note to self: we could implement the 8-bit version of all those multiplies; I don't know if it makes sense though.

XSIMD_INLINE batch<int32_t, A> mulhi(batch<int32_t, A> const& self, batch<int32_t, A> const& other, requires_arch<sse4_1>) noexcept
{
__m128i even = _mm_mul_epi32(self, other); // 64-bit products in lanes 0,2
__m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),

you could just _mm_srli_epi32(self, 32) instead, right?


and same for other, of course

__m128i odd = _mm_mul_epi32(_mm_shuffle_epi32(self, _MM_SHUFFLE(3, 3, 1, 1)),
_mm_shuffle_epi32(other, _MM_SHUFFLE(3, 3, 1, 1)));
// hi halves in the low 32 of each 64 lane of (even>>32), and in the high 32 of odd
__m128i even_hi = _mm_srli_epi64(even, 32);

you could avoid this shift by calling _mm_shuffle_ps directly on even and odd

{
__m128i even = _mm_mul_epu32(self, other);
__m128i odd = _mm_mul_epu32(_mm_srli_epi64(self, 32), _mm_srli_epi64(other, 32));
__m128i even_hi = _mm_srli_epi64(even, 32);

same here for blend vs. shuffle. You did use _mm_srli_epi64 here though, so there is probably a reason I don't get ^^!

{
int16x8_t lo = vmull_s8(vget_low_s8(lhs), vget_low_s8(rhs));
int16x8_t hi = vmull_s8(vget_high_s8(lhs), vget_high_s8(rhs));
return vcombine_s8(vshrn_n_s16(lo, 8), vshrn_n_s16(hi, 8));

What about vuzpq_s8 (and the same below)?
