
Conversation

@BenBrock
Collaborator

@BenBrock BenBrock commented Apr 24, 2025

Summary:
Use sycl-thrust to support Thrust for device examples with oneMKL, and refactor the cuSPARSE and rocSPARSE backend architecture.

Details:
The changes in this PR now include:

  • Add a get_queue() function that, given an execution policy and a pointer to some data, returns a SYCL queue on which algorithms will be launched.
  • Add execution policies.
  • Compile the pre-existing Thrust examples and tests.
  • Refactor cuSPARSE and rocSPARSE backends to more closely match other vendor backend architectures:
    • Move core SpMV implementation to standalone multiply function
    • Unify descriptor handling with consistent naming (a_descr, b_descr, c_descr)
    • Add proper type validation and transpose operation support

There are currently a few execution policies the user can use:

  • spblas::mkl::par runs where the data is. It will inspect the data and run on that device.
  • spblas::mkl::device runs on a queue obtained with the default_selector_v selector.
  • spblas::mkl::host runs on a queue obtained with the cpu_selector_v selector.

Thrust versions of these also work.

  • The user can also manually create an execution policy with spblas::mkl::device_policy, which they must construct with either a queue or a device and context (a short usage sketch follows below).
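
A minimal usage sketch of these policies, assuming the execution-policy overloads of multiply added in this PR; a_csr, x_span, and y_span are placeholder matrix/vector views, and exact headers and signatures may differ:

// Run wherever the data lives; the par policy inspects the pointer's device.
spblas::multiply(spblas::mkl::par, a_csr, x_span, y_span);

// Run on a user-managed queue via a manually constructed policy.
sycl::queue q(sycl::gpu_selector_v);
spblas::mkl::device_policy policy(q);
spblas::multiply(policy, a_csr, x_span, y_span);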

Merge Checklist:

  • Passing CI
  • Update documentation or README.md
  • Additional Test/example added (if applicable) and passing
  • At least one reviewer approval
  • (optional) Clang sanitizer scan run and triaged
  • Clang formatter applied (verified as part of passing CI)

@BenBrock BenBrock mentioned this pull request Apr 25, 2025
// y = A * x
spblas::spmv_state_t state;
spblas::multiply(state, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);
Contributor

where did the state go ?

Collaborator Author

This should be resolved now---both rocSPARSE and cuSPARSE now take an optional operation_info_t object.

Comment on lines +15 to +17
sycl::queue get_queue(T* ptr) const {
  return spblas::__mkl::get_pointer_queue(ptr);
}
Contributor

so this fills the queue* into ptr ? or what is T used for ?

Collaborator Author

See below

Comment on lines +34 to +42
template <typename T>
sycl::queue get_pointer_queue(T* ptr) {
  try {
    auto&& [device, context] = get_pointer_device(ptr);
    return sycl::queue(context, device);
  } catch (...) {
    return sycl::queue(sycl::cpu_selector_v);
  }
}
Contributor

what is this one for, as well ? what is T ?

Collaborator Author

This uses SYCL's runtime APIs to determine which device is associated with the memory referenced by the pointer. Then it returns a queue on that device.

Contributor

aha, can we put a comment to describe exactly that -- what happens if the pointer is not associated with a context ?

Collaborator Author

Sure, I'll add a comment. If the pointer is not associated with any context, we return a queue associated with the CPU device (see line 40).
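
For reference, a self-contained sketch of the kind of SYCL runtime query involved. This is an illustration, not the PR's exact helper: SYCL's standard pointer queries require a known context, so queue_for_pointer here is a hypothetical name.

#include <sycl/sycl.hpp>

// Return a queue targeting the device that owns a USM pointer, falling back
// to the CPU device if the pointer is not USM memory in the given context.
sycl::queue queue_for_pointer(const void* ptr, const sycl::context& ctxt) {
  if (sycl::get_pointer_type(ptr, ctxt) == sycl::usm::alloc::unknown) {
    return sycl::queue(sycl::cpu_selector_v);
  }
  sycl::device dev = sycl::get_pointer_device(ptr, ctxt);
  return sycl::queue(ctxt, dev);
}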

  sycl::queue queue_;
};

inline parallel_policy par;
Contributor

@spencerpatty spencerpatty May 1, 2025

what does it mean to have an inline declaration here ? we are defining a singleton of spblas::mkl::par but why inline, what does it do ?

Collaborator Author

inline allows the variable to be defined in multiple translation units without violating the one-definition rule; the linker treats all of those definitions as the same entity. Without inline, if you compiled two .o files that both include this header, each would define its own par symbol and you'd get multiple-definition linker errors. This allows the library to be header only.
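
A tiny illustrative sketch of the pattern (not the library's actual header):

// execution_policy.hpp (illustrative)
struct parallel_policy {};

// C++17 inline variable: every translation unit that includes this header
// sees the same single `par` object, so no separate .cpp definition is needed.
inline parallel_policy par;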

Contributor

aha, that makes sense. thanks for the explanation!

sycl::queue q(sycl::cpu_selector_v);
auto a_data = __detail::get_ultimate_base(a).values().data();

auto&& q = __mkl::get_queue(policy, a_data);
Contributor

so do we always need to extract an array to put into get_queue() with the policy to get the queue ? I see that it is querying from the array what device/context are associated and then creating a queue from those ... under which circumstances will it get a common queue that already existed and use it ?

Contributor

doesn't this end up creating multiple queues, so they are not linked for submission events ?

Contributor

to me, ideally we have a single queue/device/context that is being used per policy, right ?

Collaborator Author

If the user creates their own policy with device_policy, everything will be created on the same queue. If we wanted to use the same queue with par, we could create a global hash table of queues keyed on device and context, then select the appropriate queue.

According to the SYCL spec, creating queues should be cheap, and de-allocating a queue should not force completion of associated SYCL events (although that's not always the case with Intel's runtime).
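
A rough sketch of that queue-cache idea (purely illustrative, not part of this PR): one queue is reused per (device, context) pair, written here with a simple linear lookup rather than an actual hash table for brevity.

#include <sycl/sycl.hpp>
#include <mutex>
#include <tuple>
#include <vector>

inline sycl::queue cached_queue(const sycl::device& dev,
                                const sycl::context& ctxt) {
  static std::mutex m;
  static std::vector<std::tuple<sycl::device, sycl::context, sycl::queue>>
      cache;

  std::lock_guard<std::mutex> lock(m);
  for (auto& [d, c, q] : cache) {
    if (d == dev && c == ctxt) {
      return q;  // reuse the queue previously created for this pair
    }
  }
  cache.emplace_back(dev, ctxt, sycl::queue(ctxt, dev));
  return std::get<2>(cache.back());
}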

Contributor

my question is about whether this is a good thing, having separate queues each time (even if they are inorder queues)

sycl::queue q1(dev, ctxt);
sycl::queue q2(dev, ctxt);

auto ev1 = q1.parallel_for(range1, kernel1);
auto ev2 = q2.parallel_for(range2, kernel2);

do we have any ordering between ev1 and ev2 ?

what if we put ev1 into ev2 dependency ?

auto ev1 = q1.parallel_for(range1, kernel1);
auto ev2 = q2.parallel_for(range2, kernel2, {ev1}); 

with inorder or out-of-order queues, they have no relation in the first case, and theoretically they should work in the second case with ordering, but are we providing a nice way to order these things ?

I suppose in our current case, where we are synchronous in each function, we have no problems, but I was thinking it might be good to think forward to the asynchronous case ... creating queues on the fly shouldn't be a big deal, but can we avoid it with your table suggestion? And what about in-order queues? The benefit there is the lack of need for events, but we would still need them with queues created on the fly, right?

Collaborator Author

Currently everything's synchronous. I think for the asynchronous API, there are essentially two choices: a stream-based approach, or an event-based approach. If it's a stream-based approach, the user is almost certainly going to have to create the stream, and it will have a dedicated in-order queue associated with it.

If it's an event-based approach, the only requirement to create dependencies between events is that they belong to the same context. In Intel's runtime at least, we have a guarantee that they will be in the same context, so we (or the user) can create dependencies between different events. If we don't want to depend on this (or want to avoid creating queues on-the-fly), the previously mentioned hash table approach works.

I think for more advanced behavior like asynchrony it's also reasonable to expect the user to do a little more work (e.g. creating their own execution policy).
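
For concreteness, a small standalone SYCL example of the event-based ordering being discussed: two queues sharing one context, with the second kernel explicitly depending on the first. This is an illustration only, not code from this PR.

#include <sycl/sycl.hpp>

int main() {
  sycl::device dev(sycl::default_selector_v);
  sycl::context ctxt(dev);

  // Two separate queues sharing one context.
  sycl::queue q1(ctxt, dev);
  sycl::queue q2(ctxt, dev);

  float* x = sycl::malloc_shared<float>(1024, dev, ctxt);

  // Without a dependency, kernels on q1 and q2 have no ordering guarantee.
  auto ev1 = q1.parallel_for(sycl::range<1>(1024),
                             [=](sycl::id<1> i) { x[i] = 1.0f; });

  // depends_on() gives explicit ordering across queues in the same context.
  auto ev2 = q2.submit([&](sycl::handler& h) {
    h.depends_on(ev1);
    h.parallel_for(sycl::range<1>(1024),
                   [=](sycl::id<1> i) { x[i] += 1.0f; });
  });

  ev2.wait();
  sycl::free(x, ctxt);
}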

Contributor

I agree and see that we are on the same page now

Contributor

I was thinking we were already implementing the common stream-based approach here, but we are not; we are implementing the synchronous API design. How much extra work would it be to add the stream-based approach (while still synchronizing at the end of each function call)? It might be good to put that in place right away, while there are only a few places that use this mechanism ...

Collaborator Author

If someone else wants to take a stab at it, I think that'd be fine, but I don't think we should delay merging this until we have asynchrony working, as there are a few details (mostly related to state and memory allocation) that we'll need to figure out.

Contributor

agreed, we can return to asynchrony later and sort out the state / memory allocation / user experience questions for asynchrony at that time

Contributor

@yhmtsai yhmtsai left a comment

Thanks for starting the execution policy!

For the execution policy, I think it should be optional.
When the user does not provide one, we can assume the data and the queue/stream are the default ones. I believe rocSPARSE and oneMKL GPU passing a queue or stream was just a temporary workaround until the execution policy existed.
For example, CUDA and HIP will use the default stream on the current device, and the device should not be changed between calls.
SYCL will use the default selector, which can also be limited/controlled by ONEAPI_DEVICE_SELECTOR.

Another thing is the state; I think it is intentional. The single-stage API keeps the state as an input argument, like the multi-stage API, since the state must carry information to the next kernel call.

add_executable(${example_name} ${example_name}.cpp)

if (ENABLE_ROCSPARSE)
  find_package(rocthrust REQUIRED)
Contributor

maybe still keep it in the root CMakeLists?

Collaborator Author

Thanks, fixed this.

namespace spblas {

template <matrix A, vector X, vector Y>
template <typename ExecutionPolicy, matrix A, vector X, vector Y>
Contributor

I think the execution policy should be optional. I will describe it in more detail in the comments.

Collaborator Author

Yes, the execution policy is currently optional; for the MKL vendor backend we have overloads both with and without it.

target_link_libraries(spblas-tests spblas fmt GTest::gtest_main)

if (ENABLE_ROCSPARSE)
  find_package(rocthrust REQUIRED)
Contributor

we have had some challenges in the past with repeatedly finding the same package in different folders. I suggest we just find the package once in the root CMakeLists and use it in the subfolders, unless there is another concern.

Collaborator Author

Thanks, fixed! (See above.)

@BenBrock BenBrock marked this pull request as ready for review July 7, 2025 07:10
@BenBrock BenBrock requested review from spencerpatty and yhmtsai July 7, 2025 07:18
Comment on lines +79 to +80
spblas::operation_info_t info;
spblas::multiply(info, a, x_span, y_span);
Contributor

can you explain this change? My understanding was that we would have the following elements in the API:

  • execution policy
  • spmv state object
  • sparse matrix object
  • x and y vector objects

what is operation_info_t ?

Collaborator Author

operation_info_t is the state. I believe in the current proposal we refer to it as operation_state_t. It just hasn't been renamed here (I can do that in a separate PR). (Perhaps confusingly, there is a non-user-visible operation_state_t class, which is an implementation detail. That's how vendor backends store their data inside the state.)

I think one semi-open question is whether to have different state objects for each type of operation. Personally I think it creates more complexity for the user, as they have to juggle different kinds of state objects.

Contributor

no, it should be _state_t as the paper describes.
I still believe that the first step should be an individual state for each operation.
They can be merged later much more easily than creating one big object and splitting it up later.

// y = A * x
spblas::spmv_state_t state;
spblas::multiply(state, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);
Contributor

so no state now in this simple example ?

Collaborator Author

Correct, although you can create a state and pass it in if you want:

spblas::operation_info_t state;
spblas::multiply(state, a, x_span, y_span);

Contributor

@spencerpatty spencerpatty left a comment

I generally approve of getting these changes for the rocSPARSE/cuSPARSE and oneMKL SYCL backends into the main code base. We should definitely discuss naming at some point in the future (not in this PR), as I am not clear on what operation_info_t is supposed to mean compared to an operation_state_t, spmv_state_t, spgemm_state_t, or multiply_state_t object, or whatever we end up calling it (one per operation name, or one total?). I think the word "state" belongs in the name, but other than that I have few preferences.

Approved for merge

@spencerpatty
Contributor

Execution policy -- can be optional, but IMO we should use it more often than not

State object -- I currently think that this should be required, but maybe I don't yet see why it shouldn't be ... if anyone has a different perspective, open to discuss :)

matrix_handle_t objects vs plain xyz_view objects -- this is where users get to choose, with ramifications for overhead and for the internal optimizations that can be reused over and over ... part of me wishes it were always a matrix_handle_t object, but I get the desire for simple, no-fuss API calls with the views, so I suppose we can deal with somewhat more complicated internal logic to handle both cases. I just don't think we want to be switching it up halfway through a multi-step API call sequence, as we are likely storing something in the state objects ...

@BenBrock
Collaborator Author

BenBrock commented Jul 7, 2025

Thanks for the review, @spencerpatty. Couple of responses:

  • operation_info_t is the state. I believe we currently call this operation_state_t in discussions. It just hasn't been renamed yet.
  • We've had a few discussions about whether to have separate state object types for different operations or to have a single state type. Personally I prefer just having an operation_state_t, and I recall that being the consensus in the past.
  • When it comes to operations without state, personally I like the idea of at least spmv and spmm supporting a single call to multiply, with the optional ability to call multiply_inspect and multiply_compute (and/or multiply_fill, depending on the naming scheme, which is still unresolved as I recall). A short sketch of this call pattern appears after this list.
  • matrix_handle_t objects are the next big thing to handle in all the backends. I think we can make sure that all backends have helper functions to make it straightforward to extract matrix handles if they exist.
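
As referenced above, a hedged side-by-side sketch of the single-call and multi-stage patterns; the multiply_inspect / multiply_compute signatures are assumptions mirroring the existing multiply calls, and the naming is still unresolved:

// Multi-stage use: inspect once, then execute.
spblas::operation_info_t info;
spblas::multiply_inspect(info, a, x_span, y_span);
spblas::multiply_compute(info, a, x_span, y_span);

// Single-stage use: one call, with or without the state object.
spblas::multiply(info, a, x_span, y_span);
spblas::multiply(a, x_span, y_span);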

@BenBrock BenBrock merged commit 1fa7b3a into main Jul 7, 2025
32 of 36 checks passed
@BenBrock BenBrock deleted the dev/brock/onemkl-gpu branch July 7, 2025 19:41
@yhmtsai
Contributor

yhmtsai commented Jul 10, 2025

I would prefer that the PR stick to the limited scope of its description and title. Otherwise, I am quite surprised that this sycl_thrust change also modifies cuSPARSE and rocSPARSE; the changes here are too big to fit in this PR's scope.
This PR removes some design choices I want to keep, and that's why I wrote it the way I did:

  • spmv_state_t (<op>_state_t) is changed to operation_info_t, which does not match the paper's description.
  • Object structure -> free function. This removes the ability for the user to call the operation from the state directly, which made it clearer that the state stores something for this operation. Another benefit of the object is that I can keep the data safer, without leaking access to the user and without friend function declarations. Of course, it is not required by the paper, but it has some good points and does not hurt the free-function usage.
