
Conversation

@BenBrock
Collaborator

@BenBrock BenBrock commented Apr 24, 2025

Summary:
Use sycl-thrust to support Thrust for device examples with oneMKL, and refactor the cuSPARSE and rocSPARSE backend architecture.

Details:
The changes in this PR now include:

  • Add a get_queue() function that, given an execution policy and a pointer to some data, returns a SYCL queue on which algorithms will be launched.
  • Add execution policies.
  • Compile the pre-existing Thrust examples and tests.
  • Refactor cuSPARSE and rocSPARSE backends to more closely match other vendor backend architectures:
    • Move core SpMV implementation to standalone multiply function
    • Unify descriptor handling with consistent naming (a_descr, b_descr, c_descr)
    • Add proper type validation and transpose operation support

There are currently a few execution policies the user can use:

  • spblas::mkl::par runs where the data is. It will inspect the data and run on that device.
  • spblas::mkl::device runs on a queue obtained with the default_selector_v selector.
  • spblas::mkl::host runs on a queue obtained with the cpu_selector_v selector.

Thrust versions of these also work.

  • The user can also manually create an execution policy with spblas::mkl::device_policy, which they must construct with either a queue or a device and context (a short usage sketch follows below).
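
A minimal usage sketch of these policies, assuming the execution-policy overloads of multiply added in this PR; a_csr, x_span, and y_span are placeholder matrix/vector views, and exact headers and signatures may differ:

// Run wherever the data lives; the par policy inspects the pointer's device.
spblas::multiply(spblas::mkl::par, a_csr, x_span, y_span);

// Run on a user-managed queue via a manually constructed policy.
sycl::queue q(sycl::gpu_selector_v);
spblas::mkl::device_policy policy(q);
spblas::multiply(policy, a_csr, x_span, y_span);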

Merge Checklist:

  • Passing CI
  • Update documentation or README.md
  • Additional Test/example added (if applicable) and passing
  • At least one reviewer approval
  • (optional) Clang sanitizer scan run and triaged
  • Clang formatter applied (verified as part of passing CI)

@BenBrock BenBrock mentioned this pull request Apr 25, 2025
// y = A * x
spblas::spmv_state_t state;
spblas::multiply(state, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);
Contributor

where did the state go ?

Collaborator Author

This should be resolved now---both rocSPARSE and cuSPARSE now take an optional operation_info_t object.

Comment on lines +15 to +17
sycl::queue get_queue(T* ptr) const {
  return spblas::__mkl::get_pointer_queue(ptr);
}
Contributor

so this fills the queue* into ptr ? or what is T used for ?

Collaborator Author

See below

Comment on lines +34 to +42
template <typename T>
sycl::queue get_pointer_queue(T* ptr) {
  try {
    auto&& [device, context] = get_pointer_device(ptr);
    return sycl::queue(context, device);
  } catch (...) {
    return sycl::queue(sycl::cpu_selector_v);
  }
}
Contributor

what is this one for, as well ? what is T ?

Collaborator Author

This uses SYCL's runtime APIs to determine which device is associated with the memory referenced by the pointer. Then it returns a queue on that device.

Contributor

aha, can we put a comment to describe exactly that -- what happens if the pointer is not associated with a context ?

Collaborator Author

Sure, I'll add a comment. If the pointer is not associated with any context, we return a queue associated with the CPU device (see line 40).
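
For reference, a self-contained sketch of the kind of SYCL runtime query involved. This is an illustration, not the PR's exact helper: SYCL's standard pointer queries require a known context, so queue_for_pointer here is a hypothetical name.

#include <sycl/sycl.hpp>

// Return a queue targeting the device that owns a USM pointer, falling back
// to the CPU device if the pointer is not USM memory in the given context.
sycl::queue queue_for_pointer(const void* ptr, const sycl::context& ctxt) {
  if (sycl::get_pointer_type(ptr, ctxt) == sycl::usm::alloc::unknown) {
    return sycl::queue(sycl::cpu_selector_v);
  }
  sycl::device dev = sycl::get_pointer_device(ptr, ctxt);
  return sycl::queue(ctxt, dev);
}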

  sycl::queue queue_;
};

inline parallel_policy par;
Contributor

@spencerpatty spencerpatty May 1, 2025

what does it mean to have an inline declaration here ? we are defining a singleton of spblas::mkl::par but why inline, what does it do ?

Collaborator Author

inline allows the variable to be defined in multiple translation units without violating the one-definition rule; the linker treats all of those definitions as the same entity. Without inline, if you compiled two .o files that both include this header, each would define its own par symbol and you'd get multiple-definition linker errors. This allows the library to be header only.
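
A tiny illustrative sketch of the pattern (not the library's actual header):

// execution_policy.hpp (illustrative)
struct parallel_policy {};

// C++17 inline variable: every translation unit that includes this header
// sees the same single `par` object, so no separate .cpp definition is needed.
inline parallel_policy par;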

Contributor

aha, that makes sense. thanks for the explanation!

sycl::queue q(sycl::cpu_selector_v);
auto a_data = __detail::get_ultimate_base(a).values().data();

auto&& q = __mkl::get_queue(policy, a_data);
Contributor

so do we always need to extract an array to put into get_queue() with the policy to get the queue ? I see that it is querying from the array what device/context are associated and then creating a queue from those ... under which circumstances will it get a common queue that already existed and use it ?

Contributor

doesn't this end up creating multiple queues, so they are not linked for submission events ?

Contributor

to me, ideally we have a single queue/device/context that is being used per policy, right ?

Collaborator Author

If the user creates their own policy with device_policy, everything will be created on the same queue. If we wanted to use the same queue with par, we could create a global hash table of queues keyed on device and context, then select the appropriate queue.

According to the SYCL spec, creating queues should be cheap, and de-allocating a queue should not force completion of associated SYCL events (although that's not always the case with Intel's runtime).
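
A rough sketch of that queue-cache idea (purely illustrative, not part of this PR): one queue is reused per (device, context) pair, written here with a simple linear lookup rather than an actual hash table for brevity.

#include <sycl/sycl.hpp>
#include <mutex>
#include <tuple>
#include <vector>

inline sycl::queue cached_queue(const sycl::device& dev,
                                const sycl::context& ctxt) {
  static std::mutex m;
  static std::vector<std::tuple<sycl::device, sycl::context, sycl::queue>>
      cache;

  std::lock_guard<std::mutex> lock(m);
  for (auto& [d, c, q] : cache) {
    if (d == dev && c == ctxt) {
      return q;  // reuse the queue previously created for this pair
    }
  }
  cache.emplace_back(dev, ctxt, sycl::queue(ctxt, dev));
  return std::get<2>(cache.back());
}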

Contributor

my question is about whether this is a good thing, having separate queues each time (even if they are inorder queues)

sycl::queue q1(dev, ctxt);
sycl::queue q2(dev, ctxt);

auto ev1 = q1.parallel_for(range1, kernel1);
auto ev2 = q2.parallel_for(range2, kernel2);

do we have any ordering between ev1 and ev2 ?

what if we put ev1 into ev2 dependency ?

auto ev1 = q1.parallel_for(range1, kernel1);
auto ev2 = q2.parallel_for(range2, kernel2, {ev1}); 

with inorder or out-of-order queues, they have no relation in the first case, and theoretically they should work in the second case with ordering, but are we providing a nice way to order these things ?

I suppose in our current case, where we are synchronous in each function, we have no problems, but I was thinking it might be good to think forward to the asynchronous case ... creating queues on the fly shouldn't be a big deal, but can we avoid it with your table suggestion? And what about in-order queues? The benefit there is the lack of need for events, but we would still need them with queues created on the fly, right?

Collaborator Author

Currently everything's synchronous. I think for the asynchronous API, there are essentially two choices: a stream-based approach, or an event-based approach. If it's a stream-based approach, the user is almost certainly going to have to create the stream, and it will have a dedicated in-order queue associated with it.

If it's an event-based approach, the only requirement to create dependencies between events is that they belong to the same context. In Intel's runtime at least, we have a guarantee that they will be in the same context, so we (or the user) can create dependencies between different events. If we don't want to depend on this (or want to avoid creating queues on-the-fly), the previously mentioned hash table approach works.

I think for more advanced behavior like asynchrony it's also reasonable to expect the user to do a little more work (e.g. creating their own execution policy).
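
For concreteness, a small standalone SYCL example of the event-based ordering being discussed: two queues sharing one context, with the second kernel explicitly depending on the first. This is an illustration only, not code from this PR.

#include <sycl/sycl.hpp>

int main() {
  sycl::device dev(sycl::default_selector_v);
  sycl::context ctxt(dev);

  // Two separate queues sharing one context.
  sycl::queue q1(ctxt, dev);
  sycl::queue q2(ctxt, dev);

  float* x = sycl::malloc_shared<float>(1024, dev, ctxt);

  // Without a dependency, kernels on q1 and q2 have no ordering guarantee.
  auto ev1 = q1.parallel_for(sycl::range<1>(1024),
                             [=](sycl::id<1> i) { x[i] = 1.0f; });

  // depends_on() gives explicit ordering across queues in the same context.
  auto ev2 = q2.submit([&](sycl::handler& h) {
    h.depends_on(ev1);
    h.parallel_for(sycl::range<1>(1024),
                   [=](sycl::id<1> i) { x[i] += 1.0f; });
  });

  ev2.wait();
  sycl::free(x, ctxt);
}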

Contributor

I agree and see that we are on the same page now

Contributor

I was thinking we were already implementing the common stream-based approach here, but we are not; we are implementing the synchronous API design. How much extra work would it be to add the stream-based approach (while still synchronizing at the end of each function call)? It might be good to put that in place right away, while there are only a few places that use this mechanism ...

Collaborator Author

If someone else wants to take a stab at it, I think that'd be fine, but I don't think we should delay merging this until we have asynchrony working, as there are a few details (mostly related to state and memory allocation) that we'll need to figure out.

Contributor

agreed, we can return to asynchrony later and sort out the state / memory allocation / user experience questions for asynchrony at that time

Contributor

@yhmtsai yhmtsai left a comment

Thanks for starting the execution policy!

For the execution policy, I think it should be optional.
When the user does not provide one, we can assume the data and the queue/stream are the default ones. I believe rocSPARSE and oneMKL GPU passing a queue or stream was just a temporary workaround until the execution policy existed.
For example, CUDA and HIP will use the default stream on the current device, and the device should not be changed between calls.
SYCL will use the default selector, which can also be limited/controlled by ONEAPI_DEVICE_SELECTOR.

Another thing is the state; I think it is intentional. The single-stage API keeps the state as an input argument, like the multi-stage API, since the state must carry information to the next kernel call.

add_executable(${example_name} ${example_name}.cpp)

if (ENABLE_ROCSPARSE)
  find_package(rocthrust REQUIRED)
Contributor

maybe still keep it in the root CMakeLists?

Collaborator Author

Thanks, fixed this.

namespace spblas {

template <matrix A, vector X, vector Y>
template <typename ExecutionPolicy, matrix A, vector X, vector Y>
Contributor

I think the execution policy should be optional. I will describe it in more detail in the comments.

Collaborator Author

Yes, the execution policy is currently optional; for the MKL vendor backend we have overloads both with and without it.

target_link_libraries(spblas-tests spblas fmt GTest::gtest_main)

if (ENABLE_ROCSPARSE)
  find_package(rocthrust REQUIRED)
Contributor

we have had some challenges in the past with repeatedly finding the same package in different folders. I suggest we just find the package once in the root CMakeLists and use it in the subfolders, unless there is another concern.

Collaborator Author

Thanks, fixed! (See above.)

@BenBrock BenBrock marked this pull request as ready for review July 7, 2025 07:10
@BenBrock BenBrock requested review from spencerpatty and yhmtsai July 7, 2025 07:18
Comment on lines +79 to +80
spblas::operation_info_t info;
spblas::multiply(info, a, x_span, y_span);
Contributor

can you explain this change? My understanding was that we would have the following elements in the API:

  • execution policy
  • spmv state object
  • sparse matrix object
  • x and y vector objects

what is operation_info_t ?

Collaborator Author

operation_info_t is the state. I believe in the current proposal we refer to it as operation_state_t. It just hasn't been renamed here (I can do that in a separate PR). (Perhaps confusingly, there is a non-user-visible operation_state_t class, which is an implementation detail. That's how vendor backends store their data inside the state.)

I think one semi-open question is whether to have different state objects for each type of operation. Personally I think it creates more complexity for the user, as they have to juggle different kinds of state objects.

Contributor

no, it should be _state_t as the paper describes.
I still believe that the first step should be an individual state for each operation.
They can be merged later much more easily than creating one big object and splitting it up later.

// y = A * x
spblas::spmv_state_t state;
spblas::multiply(state, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);
Contributor

so no state now in this simple example ?

Collaborator Author

Correct, although you can create a state and pass it in if you want:

spblas::operation_info_t state;
spblas::multiply(state, a, x_span, y_span);

Contributor

@spencerpatty spencerpatty left a comment

I generally approve of getting these changes for the rocSPARSE/cuSPARSE and oneMKL SYCL backends into the main code base. We should definitely discuss naming at some point in the future (not in this PR), as I am not clear on what operation_info_t is supposed to mean compared to an operation_state_t, spmv_state_t, spgemm_state_t, or multiply_state_t object, or whatever we end up calling it (one per operation name, or one total?). I think the word "state" belongs in the name, but other than that I have few preferences.

Approved for merge

@spencerpatty
Contributor

Execution policy -- can be optional, but IMO we should use it more often than not

State object -- I currently think that this should be required, but maybe I don't yet see why it shouldn't be ... if anyone has a different perspective, open to discuss :)

matrix_handle_t objects vs plain xyz_view objects -- this is where users get to choose, with ramifications for overhead and for the internal optimizations that can be reused over and over ... part of me wishes it were always a matrix_handle_t object, but I get the desire for simple, no-fuss API calls with the views, so I suppose we can deal with somewhat more complicated internal logic to handle both cases. I just don't think we want to be switching it up halfway through a multi-step API call sequence, as we are likely storing something in the state objects ...

@BenBrock
Collaborator Author

BenBrock commented Jul 7, 2025

Thanks for the review, @spencerpatty. Couple of responses:

  • operation_info_t is the state. I believe we currently call this operation_state_t in discussions. It just hasn't been renamed yet.
  • We've had a few discussions about whether to have separate state object types for different operations or to have a single state type. Personally I prefer just having an operation_state_t, and I recall that being the consensus in the past.
  • When it comes to operations without state, personally I like the idea of at least spmv and spmm supporting a single call to multiply, with the optional ability to call multiply_inspect and multiply_compute (and/or multiply_fill, depending on the naming scheme, which is still unresolved as I recall). A short sketch of this call pattern appears after this list.
  • matrix_handle_t objects are the next big thing to handle in all the backends. I think we can make sure that all backends have helper functions to make it straightforward to extract matrix handles if they exist.
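
As referenced above, a hedged side-by-side sketch of the single-call and multi-stage patterns; the multiply_inspect / multiply_compute signatures are assumptions mirroring the existing multiply calls, and the naming is still unresolved:

// Multi-stage use: inspect once, then execute.
spblas::operation_info_t info;
spblas::multiply_inspect(info, a, x_span, y_span);
spblas::multiply_compute(info, a, x_span, y_span);

// Single-stage use: one call, with or without the state object.
spblas::multiply(info, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);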

@BenBrock BenBrock merged commit 1fa7b3a into main Jul 7, 2025
32 of 36 checks passed
@BenBrock BenBrock deleted the dev/brock/onemkl-gpu branch July 7, 2025 19:41
@yhmtsai
Contributor

yhmtsai commented Jul 10, 2025

I would prefer that the PR stick to the limited scope of its description and title. Otherwise, I am quite surprised that this sycl_thrust change also modifies cuSPARSE and rocSPARSE; the changes here are too big to fit in this PR's scope.
This PR removes some design choices I want to keep, and that's why I wrote it the way I did:

  • spmv_state_t (<op>_state_t) is changed to operation_info_t, which does not match the paper's description.
  • Object structure -> free function. This removes the ability for the user to call the operation from the state directly, which made it clearer that the state stores something for this operation. Another benefit of the object is that I can keep the data safer, without leaking access to the user and without friend function declarations. Of course, it is not required by the paper, but it has some good points and does not hurt the free-function usage.
