Skip to content

Shuffle scalable vector in CodeGen_ARM#8898

Merged
alexreinking merged 27 commits intohalide:mainfrom
stevesuzuki-arm:pr-shuffle_sve2
Mar 23, 2026
Merged

Shuffle scalable vector in CodeGen_ARM#8898
alexreinking merged 27 commits intohalide:mainfrom
stevesuzuki-arm:pr-shuffle_sve2

Conversation

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor

@stevesuzuki-arm stevesuzuki-arm commented Dec 11, 2025

By design, LLVM shufflevector doesn't accept scalable vectors.
So, we try to use llvm.vector.xx intrinsic where possible.
However, those are not enough to cover wide usage of shuffles in Halide.
To handle arbitrary index pattern, we decompose a shuffle operation
to a sequence of multiple native shuffles, which are lowered to
Arm SVE2 intrinsic TBL or TBL2.

Another approach could be to perform shuffle in fixed sized vector
by adding conversion between scalable vector and fixed vector.
However, it seems to be only possible via load/store memory,
which would presumably be poor performance.

This change also includes:

  • Peep-hole the particular predicate pattern to emit WHILELT instruction
  • Shuffle 1bit type scalable vectors as 8bit with type casts
  • Peep-hole concat_vectors for padding to align up vector
  • Fix redundant broadcast in CodeGen_LLVM

Depends on:

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

With this PR and #8888, Halide tests pass without fail on host machine with SVE2 128 bits vector. I confirmed by ctest --exclude-regex 'interpolate|lens_blur|unsharp|tutorial_lesson_12' --label-regex 'internal|correctness|generator|error|warning|tutorial|python' --build-config Release

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

The CI test failure below is a known issue which should be fixed by #8888.

st2w_int32_x8                   (arm-64-linux-no_neon-sve2-vector_bits_256)
StartAssertion failed: (!isScalable() || isZero()) && "Request for a fixed element count on a scalable object", file C:\build_bot\worker\llvm-main-x86-32-windows\llvm-project\llvm\include\llvm/Support/TypeSize.h, line 202

I will rebase once #8888 is merged.

Theoretically, these are llvm common and not ARM specific,
but for now, keep it for ARM only to avoid any affect to
other targets.
The workaround of checking wide_enough in get_vector_type() was
causing the issue of mixing FixedVector and ScalableVector
in generating a intrinsic instruction in SVE2 codegen.
By this change, we select scalable vector for most of the cases.

Note the workaround for vscale > 1 case will be addressed in
a separate commit.
By design, LLVM shufflevector doesn't accept scalable vectors.
So, we try to use llvm.vector.xx intrinsic where possible.
However, those are not enough to cover wide usage of shuffles in Halide.
To handle arbitrary index pattern, we decompose a shuffle operation
to a sequence of multiple native shuffles, which are lowered to
Arm SVE2 intrinsic TBL or TBL2.

Another approach could be to perform shuffle in fixed sized vector
by adding conversion between scalable vector and fixed vector.
However, it seems to be only possible via load/store memory,
which would presumably be poor performance.

This change also includes:
- Peep-hole the particular predicate pattern to emit WHILELT instruction
- Shuffle 1bit type scalable vectors as 8bit with type casts
- Peep-hole concat_vectors for padding to align up vector
- Fix redundant broadcast in CodeGen_LLVM
Modified codegen of vector broadcast in SVE2 to emit
TBL ARM intrin instead of llvm.vector.insert.

Fix performance test failure of nested_vectorization_gemm
@alexreinking
Copy link
Copy Markdown
Member

Rebased on main to integrate #8888

@alexreinking
Copy link
Copy Markdown
Member

@stevesuzuki-arm -- looks like there are some real LLVM integration errors.

- Fix to cover vector_bits > 128 case appropriately
- Add test target with vector_bits = 512
- W is increased to 512 because Bounds given for
  op_st4b_int8_x512_0 did not cover required region
- Refine target compatibility check for can_run_code
@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

I think both #8950 and #8898 are necessary to fix llvm crash in old llvm version. Let's see CI results.

@alexreinking
Copy link
Copy Markdown
Member

I think both #8950 and #8898 are necessary to fix llvm crash in old llvm version. Let's see CI results.

Can one be merged first, or should we join them together as one PR? After all the changes I made to CI, we're trying to adopt a policy of only merging green PRs.

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

This PR has already incorporated #8950, so if the on-going CI result is OK, we should close #8950 without merging and keep this as combined one.

Copy link
Copy Markdown
Member

@alexreinking alexreinking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See in-line comments for my smaller concerns.

I'm not sure DecomposeVectorShuffle is the right design. I think it's needlessly coupling computing the shuffle plan (which doesn't depend on the codegen backend) with generating the code for the plan (which does).

Comment thread src/DecomposeVectorShuffle.cpp Outdated
Comment thread src/DecomposeVectorShuffle.cpp Outdated
Comment thread src/DecomposeVectorShuffle.cpp Outdated
Comment thread src/DecomposeVectorShuffle.cpp Outdated
Comment thread src/DecomposeVectorShuffle.cpp
Comment thread src/DecomposeVectorShuffle.h Outdated
Comment thread src/DecomposeVectorShuffle.h Outdated
Comment thread src/DecomposeVectorShuffle.h Outdated
Comment thread src/CodeGen_ARM.cpp Outdated
Comment thread src/DecomposeVectorShuffle.h Outdated
Comment thread src/CMakeLists.txt
Comment thread test/correctness/simd_op_check_sve2.cpp Outdated
@alexreinking
Copy link
Copy Markdown
Member

If it's okay with you, @stevesuzuki-arm, I can perform the refactoring I requested myself. I'll just push to this branch.

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

If it's okay with you, @stevesuzuki-arm, I can perform the refactoring I requested myself. I'll just push to this branch.

Thank you for reviewing this large set of updates. I really appreciate it.
It’s totally fine with me if you’d like to go ahead and perform the refactoring and push to this branch.

@alexreinking
Copy link
Copy Markdown
Member

Okay, I made all the changes I requested to DecomposeVectorShuffle... now I will look at CodeGen_ARM 🙂

Copy link
Copy Markdown
Member

@alexreinking alexreinking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a little bit more testing, I think. Can you point to a test that covers the new codegen without needing to run SVE2?

Comment thread test/correctness/simd_op_check_sve2.cpp
Comment thread test/correctness/decompose_vector_shuffle.cpp Outdated
In case the caller of shuffle_vectors() set indices of -1,
steps_for_dst_slice in shuffle_plan may be empty,
which ended up with calling concat_vectors with null element.

The fix is to create undef vector for that particular case.

The original issue was found in correctness_interleave
(transposition with sz=27).
@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

@alexreinking Would you tell me if you have any suggestion for me to do in order to move this forward?

@alexreinking
Copy link
Copy Markdown
Member

alexreinking commented Mar 11, 2026

Let's rebase this on main. We now have SVE2 testing through GHA and I'm inclined to accept if green.

@alexreinking
Copy link
Copy Markdown
Member

@stevesuzuki-arm -- Please check to see if the LLVM 21 failures can be easily solved. Otherwise, skip the failing tests on LLVM <22 + SVE2

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

@stevesuzuki-arm -- Please check to see if the LLVM 21 failures can be easily solved. Otherwise, skip the failing tests on LLVM <22 + SVE2

CI failures with llvm-21 are as follows:

correctness_interleave
LLVM ERROR: Don't know how to widen the operands for INSERT_SUBVECTOR
llvm/llvm-project#160134
llvm/llvm-project#169300

correctness_stmt_to_html
tutorial_lesson_05_scheduling_1
Request for a fixed element count on a scalable object
llvm/llvm-project#160127

correctness_predicated_store_load_single_lane
LLVM ERROR: Unable to widen vector store
llvm/llvm-project#54424

Making Workaround in codegen would be tricky, so I'd prefer "we should just say we only support sve2 from llvm 22 up".

@alexreinking
Copy link
Copy Markdown
Member

Making Workaround in codegen would be tricky, so I'd prefer "we should just say we only support sve2 from llvm 22 up".

Works for me. Skip the tests for now. In a follow-up PR, we can add a user_error for trying to compile SVE2 code with LLVM 21. That will also require not detecting SVE2 as part of host when linked against LLVM 21, so it's a larger change we should defer.

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

@alexreinking Can we merge this now?

@alexreinking
Copy link
Copy Markdown
Member

alexreinking commented Mar 15, 2026

Works for me. Skip the tests for now. In a follow-up PR ...

@alexreinking Can we merge this now?

I was waiting for you to add [SKIP] prints to the failing tests.

The following tests FAILED:
	219 - correctness_interleave (Subprocess aborted)       correctness
	283 - correctness_predicated_store_load_single_lane (Subprocess aborted) correctness
	336 - correctness_stmt_to_html (Subprocess aborted)     correctness
	713 - tutorial_lesson_05_scheduling_1 (Subprocess aborted) multithreaded tutorial

Something like this:

    if (Internal::get_llvm_version() < 220 &&
        get_jit_target_from_environment().has_feature(Target::SVE2)) {
        printf("[SKIP] LLVM %d has known SVE backend bugs for this test.\n",
               Internal::get_llvm_version());
        return 0;
    }

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

Thank you for clarifying. I thought the job would be skipped in case of LLVM21. OK, I will do what you suggested.

stevesuzuki-arm and others added 2 commits March 15, 2026 20:38
SVE2 backend has some LLVM issues which have been fixed in LLVM 22.
For now, we skip the tests which fail due to those issues
in case LLVM version < 22.
@alexreinking
Copy link
Copy Markdown
Member

Just merged main into this branch for the new pre-commit tooling. Will merge when green. Thank you so much for your patience and valuable contributions, @stevesuzuki-arm 🎉

@mcourteaux
Copy link
Copy Markdown
Contributor

We might want to hold of on merging this, as this is the type of PR that can bring us a lot of headache. I think it'd be wise to first fully implement our goals in #9044, integrate those strategies here.

@alexreinking alexreinking merged commit 52cdeeb into halide:main Mar 23, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants