Shuffle scalable vector in CodeGen_ARM #8898
Open
stevesuzuki-arm wants to merge 6 commits into halide:main from
With this PR and #8888, the Halide tests pass on a host machine with 128-bit SVE2 vectors. I confirmed by
The CI test failure below is a known issue that should be fixed by #8888. I will rebase once #8888 is merged.
Theoretically, these are common to LLVM and not ARM-specific, but for now, keep them ARM-only to avoid any effect on other targets.
The workaround of checking wide_enough in get_vector_type() caused FixedVector and ScalableVector to be mixed when generating an intrinsic instruction in SVE2 codegen. With this change, we select scalable vectors in most cases. Note that the workaround for the vscale > 1 case will be addressed in a separate commit.
By design, LLVM shufflevector doesn't accept scalable vectors, so we use llvm.vector.xx intrinsics where possible. However, those intrinsics do not cover the wide range of shuffles used in Halide. To handle arbitrary index patterns, we decompose a shuffle operation into a sequence of native shuffles, which are lowered to the Arm SVE2 intrinsics TBL or TBL2.

Another approach would be to perform the shuffle on fixed-size vectors by adding conversions between scalable and fixed vectors. However, that conversion seems to be possible only via a load/store through memory, which would presumably perform poorly.

This change also includes:
- A peephole for a particular predicate pattern to emit the WHILELT instruction
- Shuffling 1-bit scalable vectors as 8-bit vectors with type casts
- A peephole for concat_vectors used as padding to align up a vector
- A fix for a redundant broadcast in CodeGen_LLVM
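To illustrate the decomposition idea in the commit message above, here is a minimal scalar sketch in C++. It is not the code from this PR: `tbl` and `shuffle_by_slices` are hypothetical names, `lanes` stands in for the native register width, and the sketch relies only on the documented SVE TBL behavior that an out-of-range index yields zero. Each native-width slice of the result is built by looking up every native-width slice of the source and OR-ing the partial results together.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates one native table lookup on a single register:
// out[i] = table[idx[i]] if idx[i] is in range, otherwise 0
// (mirroring how SVE TBL zeroes lanes with out-of-range indices).
std::vector<uint8_t> tbl(const std::vector<uint8_t> &table,
                         const std::vector<int> &idx) {
    std::vector<uint8_t> out(idx.size(), 0);
    for (size_t i = 0; i < idx.size(); i++) {
        if (idx[i] >= 0 && idx[i] < static_cast<int>(table.size())) {
            out[i] = table[idx[i]];
        }
    }
    return out;
}

// Hypothetical helper: applies an arbitrary shuffle to `src` using only
// lookups that are at most `lanes` elements wide.
std::vector<uint8_t> shuffle_by_slices(const std::vector<uint8_t> &src,
                                       const std::vector<int> &indices,
                                       size_t lanes) {
    std::vector<uint8_t> dst;
    for (size_t d = 0; d < indices.size(); d += lanes) {
        size_t n = std::min(lanes, indices.size() - d);
        std::vector<uint8_t> acc(n, 0);
        for (size_t s = 0; s < src.size(); s += lanes) {
            size_t m = std::min(lanes, src.size() - s);
            std::vector<uint8_t> table(src.begin() + s, src.begin() + s + m);
            // Rebase the indices so they are relative to this source slice.
            std::vector<int> local(n);
            for (size_t i = 0; i < n; i++) {
                local[i] = indices[d + i] - static_cast<int>(s);
            }
            // Lanes that do not hit this slice come back as 0, so OR merges them.
            std::vector<uint8_t> part = tbl(table, local);
            for (size_t i = 0; i < n; i++) {
                acc[i] |= part[i];
            }
        }
        dst.insert(dst.end(), acc.begin(), acc.end());
    }
    return dst;
}
```

For example, with `lanes = 4`, shuffling the eight values 10..17 by the indices `{7, 0, 5, 2, 3, 3, 6, 1}` takes two lookups per output slice. A two-register lookup such as TBL2 would presumably halve the number of lookups per slice, at the cost of handling source slices in pairs.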
Modified the codegen of vector broadcast in SVE2 to emit the Arm TBL intrinsic instead of llvm.vector.insert. Fixes the performance test failure of nested_vectorization_gemm.
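As a rough illustration of why a table lookup can express a broadcast, the sketch below emulates the lookup in scalar C++. It assumes the broadcast is formed by pointing every index at the same source lane; the commit message does not spell out the exact index pattern used, and `tbl_emulated` and `broadcast_lane` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Emulates a single-register table lookup:
// out[i] = table[idx[i]] if idx[i] is in range, otherwise 0.
std::vector<uint8_t> tbl_emulated(const std::vector<uint8_t> &table,
                                  const std::vector<int> &idx) {
    std::vector<uint8_t> out(idx.size(), 0);
    for (size_t i = 0; i < idx.size(); i++) {
        if (idx[i] >= 0 && idx[i] < static_cast<int>(table.size())) {
            out[i] = table[idx[i]];
        }
    }
    return out;
}

// Hypothetical helper: broadcasts one lane of `v` to every lane by using
// an index vector whose entries all equal `lane`.
std::vector<uint8_t> broadcast_lane(const std::vector<uint8_t> &v, int lane) {
    std::vector<int> idx(v.size(), lane);
    return tbl_emulated(v, idx);
}
```

Calling `broadcast_lane(v, 0)` fills every lane with `v[0]` in a single lookup, with no per-lane insertion.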
Force-pushed from 4a40326 to 9c9e621