Conversation

@BenBrock BenBrock (Collaborator) commented Jul 5, 2025

Summary:
Add experimental support for SYCL reference backend.

Details:

  • Clean up CMake for vendor backends.
  • Add support for SYCL reference backend.

Merge Checklist:

  • Passing CI
  • Update documentation or README.md
  • Additional Test/example added (if applicable) and passing
  • At least one reviewer approval
  • (optional) Clang sanitizer scan run and triaged
  • Clang formatter applied (verified as part of passing CI)

Comment on lines 36 to 37
for (auto elem_idx = lid; elem_idx < row.size();
     elem_idx += lsz) {
Contributor

So it looks like you are doing subgroup vector parallelism over the nonzeros in the row of A? For SpMM there are scenarios where it is better to do subgroup vector parallelism over the elements of B, especially when B has more than 32 columns. It is rarer than we think for a sparse matrix to average 32 or more nonzeros per row, so this row.size() bound often means we are not using the full parallelism here.
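
To make the alternative concrete, here is a hypothetical fragment (not from the PR; b, c, row_idx, and n_cols are assumed names, with B and C assumed row-major): the subgroup strides over the columns of B, so the loop bound becomes n_cols rather than row.size().

// Sketch: subgroup vector parallelism over the columns of B instead.
// Each work-item owns a strided subset of the B/C columns; the nonzeros
// of A's row are walked serially and reused across every owned column.
for (auto col = lid; col < n_cols; col += lsz) {
  value_t sum = 0;
  for (std::size_t i = 0; i < row.size(); i++) {
    auto&& [a_val, j] = row[i];          // assumed accessor: nonzero value
    sum += a_val * b[j * n_cols + col];  // and its column index in A
  }
  c[row_idx * n_cols + col] += sum;
}

This keeps the subgroup fully occupied whenever B has at least lsz columns, regardless of how short the sparse row is.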

Contributor

What would be ideal is to have essentially two algorithms that could be selected at runtime: one with vector parallelism over the dense matrix elements, and another where each subgroup owns one or more rows of the sparse matrix and does some sort of segmented reduction. (SYCL doesn't have a segmented scan yet, but we can do a full prefix scan over the set and then subtract the partial sums at segment boundaries to get segmented scans.)
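
A minimal host-side sketch of that trick (hypothetical helper, not part of the PR): do one full inclusive scan, then recover each segment's sum by subtracting the scan value just before the segment start. On device the scan step could use sycl::inclusive_scan_over_group; the subtraction is the same.

#include <cstddef>
#include <numeric>
#include <vector>

// Segmented sums from a single full inclusive prefix scan. `offsets`
// marks segment boundaries (CSR row_ptr style): segment r covers
// [offsets[r], offsets[r + 1]), and its sum is scan[end-1] - scan[start-1].
std::vector<double> segmented_sums(const std::vector<double>& values,
                                   const std::vector<std::size_t>& offsets) {
  std::vector<double> scan(values.size());
  std::inclusive_scan(values.begin(), values.end(), scan.begin());

  std::vector<double> sums(offsets.size() - 1);
  for (std::size_t r = 0; r + 1 < offsets.size(); r++) {
    std::size_t start = offsets[r], end = offsets[r + 1];
    double lo = (start == 0) ? 0.0 : scan[start - 1];
    double hi = (end == 0) ? 0.0 : scan[end - 1];
    sums[r] = hi - lo;
  }
  return sums;
}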

Collaborator Author

I agree. Currently we've got two algorithms, the "split k" and "split j" methods. I'm running some benchmarks now, and those will hopefully tell us when to call which method (and also generally illuminate their performance characteristics).

Comment on lines +128 to +131
double gb = 1e-9 * (nnz * sizeof(value_t) + nnz * sizeof(index_t) +
                    (m + 1) * sizeof(offset_t) + k * n * sizeof(value_t) +
                    m * n * sizeof(value_t));

Contributor

SpMM is one of the few sparse algorithms with the potential to become compute bound rather than just memory bound, so calculating GFLOPs is also helpful. All the others should just be evaluated against the GB/s memory bandwidth limits.
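
For reference, a sketch of the matching FLOP count: with C = A * B, A being m x k sparse with nnz nonzeros and B being k x n dense (the shapes implied by the gb calculation above), each nonzero contributes one multiply and one add per column of B.

// Each nonzero of A is combined with all n columns of B: 2 flops per pair.
double gflops = 1e-9 * 2.0 * nnz * n;
// After timing:
//   double gflops_per_sec = gflops / seconds;
//   double gb_per_sec = gb / seconds;

(The seconds variable is assumed to come from whatever timing loop wraps the kernel.)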

Contributor

It can happen because of the potential cache reuse of the dense B matrix, if we are careful, while streaming the A matrix and limiting accesses to C (ideally not caching C at all).

Contributor

In my opinion, for measuring the peak performance of a kernel, it is a good idea to have a warmup loop with several untimed iterations, then a timed loop that in aggregate takes on the order of seconds (or at least milliseconds) to run, with the average time per run computed and recorded. This increases the chance of repeatable, stable measurements over time and makes runs much more comparable.
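
A minimal sketch of such a harness (hypothetical helper, not part of the PR; num_warmup and num_trials are illustrative defaults), assuming each call to kernel synchronizes with the device, e.g. via q.wait():

#include <chrono>
#include <functional>

// Run `kernel` untimed for num_warmup iterations (JIT, caches, clock
// ramp-up), then time num_trials iterations in aggregate and return the
// average seconds per run. Pick num_trials so the timed loop takes at
// least milliseconds, ideally seconds, in total.
double time_kernel(const std::function<void()>& kernel,
                   int num_warmup = 10, int num_trials = 100) {
  for (int i = 0; i < num_warmup; i++) {
    kernel();
  }
  auto begin = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < num_trials; i++) {
    kernel();
  }
  auto end = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double>(end - begin).count() / num_trials;
}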

Collaborator Author

I've made a few updates that do both things: compute GFLOPs in addition to the bandwidth achieved, and do up to a 2-second warmup before timing.

