Clean up CMake, add option for SYCL reference. #55
Conversation
for (auto elem_idx = lid; elem_idx < row.size();
     elem_idx += lsz) {
So it looks like you are doing subgroup vector parallelism over the nonzeros in the row of A? For SpMM there might be scenarios where it is better to do subgroup vector parallelism over the elements of B, especially when B has more than 32 columns. It is rarer than we think for a sparse matrix to have on average 32 or more nonzeros per row, so this row.size() bound often means we are not using the full parallelism here.
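For illustration, a minimal sketch of that alternative, assuming each subgroup owns one row of A (index `row_idx`), `row[i]` yields a `(value, column)` pair, `b` and `c` are row-major buffers with `n` columns, and `lid`/`lsz` are the subgroup-local id and size as in the snippet above; all names beyond the snippet are assumptions, not the PR's actual code:

```cpp
// Sketch: lanes stride over the n columns of B/C, so the available
// parallelism is bounded by n rather than by row.size().
for (auto col = lid; col < n; col += lsz) {
  value_t acc = 0;
  // Each lane walks the row's nonzeros serially, accumulating its column.
  for (std::size_t i = 0; i < row.size(); ++i) {
    auto&& [a_val, a_col] = row[i];  // hypothetical (value, column) accessor
    acc += a_val * b[a_col * n + col];
  }
  c[row_idx * n + col] += acc;
}
```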
What would be ideal is to have essentially two algorithms that could be selected at runtime: one with vector parallelism over the dense matrix elements, and another where each subgroup owns one or more rows of the sparse matrix and does some sort of segmented reduction. (Of course, SYCL doesn't have a segmented scan yet, but we can do a full prefix scan over the set and then subtract the per-segment prefixes to get a segmented scan.)
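A minimal sketch of that scan-then-subtract trick, assuming each lane holds one partial product `v` and knows `seg_start`, the subgroup lane index where its row segment begins (the function name and this setup are hypothetical, and the helper must be called by all lanes of the subgroup):

```cpp
#include <sycl/sycl.hpp>

// Full inclusive scan over the subgroup, then subtract the prefix that
// accumulated before this lane's segment to recover a segmented scan.
float segmented_scan(sycl::sub_group sg, float v, unsigned seg_start) {
  float full = sycl::inclusive_scan_over_group(sg, v, sycl::plus<float>());
  // Shuffle in the scan value of the last lane before the segment begins.
  unsigned prev = (seg_start == 0) ? 0u : seg_start - 1u;
  float before = sycl::select_from_group(sg, full, prev);
  return (seg_start == 0) ? full : full - before;
}
```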
I agree. Currently we've got two algorithms, the "split k" and "split j" methods. I'm running some benchmarks now, and those will hopefully tell us when to call which method (and also generally illuminate their performance characteristics).
double gb = 1e-9 * (nnz * sizeof(value_t) + nnz * sizeof(index_t) +
                    (m + 1) * sizeof(offset_t) + k * n * sizeof(value_t) +
                    m * n * sizeof(value_t));
SpMM is one of the few sparse algorithms with the potential to get into the compute-bound region instead of just being memory bound, so calculating GFLOP/s is also helpful. All the others should just be judged against the memory-bandwidth limit implied by the GB moved.
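For reference, a sketch of the corresponding GFLOP/s number, assuming one multiply and one add per nonzero of A per column of B (so 2 * nnz * n flops in total) and a measured kernel time `seconds`; the `seconds` variable is an assumption, while `nnz`, `n`, and `gb` come from the snippet above:

```cpp
// SpMM flop count: each nonzero of A is multiplied and accumulated once
// per column of B, i.e. 2 * nnz * n floating-point operations in total.
double flops = 2.0 * static_cast<double>(nnz) * static_cast<double>(n);
double gflops = 1e-9 * flops / seconds;
double gb_per_s = gb / seconds;  // gb from the snippet above
```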
It can happen because of the potential reuse of the dense B matrix from cache, if we are careful, while streaming the A matrix and limiting accesses to C (along with trying not to cache C at all).
In my opinion, for measuring peak performance of a kernel, it is a good idea to have an untimed warmup loop with several iterations, then a timed loop that in aggregate takes on the order of seconds (or at least milliseconds) to run, with the average time per run computed and recorded. This increases the chance of repeatable, stable measurements across runs and over time, and makes them much more comparable.
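A minimal harness sketch along those lines; `run_kernel` is a hypothetical stand-in for the benchmarked call and is assumed to block until the kernel completes (e.g. by waiting on the queue):

```cpp
#include <chrono>

template <typename F>
double average_seconds(F&& run_kernel,
                       int warmup_iters = 10, int timed_iters = 100) {
  for (int i = 0; i < warmup_iters; ++i)
    run_kernel();                        // untimed warmup iterations
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < timed_iters; ++i)
    run_kernel();                        // timed in aggregate
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / timed_iters;  // average seconds per run
}
```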
I've made a few updates that do both things: compute GFLOP/s in addition to the achieved bandwidth, and do up to a 2-second warmup before timing.
…device at runtime.
Summary:
Add experimental support for SYCL reference backend.
Details:
Merge Checklist: