22 changes: 21 additions & 1 deletion .github/workflows/ci.yml
@@ -69,7 +69,7 @@ jobs:
- name: Test
run: |
source /opt/intel/oneapi/setvars.sh
- ./build/test/gtest/spblas-tests
+ ONEMKL_DEVICE_SELECTOR=*:cpu ./build/test/gtest/spblas-tests

macos:
runs-on: 'macos-latest'
@@ -111,3 +111,23 @@ jobs:
shell: bash -l {0}
run: |
./build/test/gtest/spblas-tests

intel-llvm-gpu:
runs-on: 'gpu_intel'
steps:
- uses: actions/checkout@v4
- name: CMake
shell: bash -l {0}
run: |
module load intel-oneapi-compilers intel-oneapi-dpl intel-oneapi-mkl cmake
cmake -B build -DCMAKE_CXX_COMPILER=icpx -DENABLE_ONEMKL_SYCL=ON
- name: Build
shell: bash -l {0}
run: |
module load intel-oneapi-compilers intel-oneapi-dpl intel-oneapi-mkl
make -C build -j `nproc`
- name: Test
shell: bash -l {0}
run: |
module load intel-oneapi-compilers intel-oneapi-dpl intel-oneapi-mkl
ONEMKL_DEVICE_SELECTOR=level_zero:gpu ./build/test/gtest/spblas-gpu-tests
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -22,7 +22,7 @@ include(FetchContent)
if (ENABLE_ONEMKL_SYCL)
find_package(MKL REQUIRED)
target_link_libraries(spblas INTERFACE MKL::MKL_SYCL) # SYCL APIs
- set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DSPBLAS_ENABLE_ONEMKL_SYCL")
+ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -DSPBLAS_ENABLE_ONEMKL_SYCL")
endif()

if (ENABLE_ARMPL)
70 changes: 70 additions & 0 deletions include/spblas/vendor/onemkl_sycl/mkl_allocator.hpp
@@ -0,0 +1,70 @@
#pragma once

#include <sycl/sycl.hpp>

namespace spblas {
namespace mkl {

template <typename T, std::size_t Alignment = 0>
class mkl_allocator {
public:
using value_type = T;
using pointer = T*;
using const_pointer = const T*;
using reference = T&;
using const_reference = const T&;
using size_type = std::size_t;
using difference_type = std::ptrdiff_t;
Comment on lines +8 to +17

Contributor:

Why do we have a typename T on the allocator? Don't we have the potential for creating all sorts of things?

Contributor (Author):

It currently follows the std::allocator design. In the operation, we will use allocator<char> to create the workspace.
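A minimal sketch of that workspace pattern, assuming the allocator is used roughly like std::allocator for raw bytes (run_with_workspace and workspace_bytes are illustrative names, not part of the PR):

#include <cstddef>
#include <spblas/vendor/onemkl_sycl/mkl_allocator.hpp>

// Hypothetical sketch: the operation allocates a raw byte workspace through
// mkl_allocator<char> and releases it when done.
void run_with_workspace(spblas::mkl::mkl_allocator<char> alloc,
                        std::size_t workspace_bytes) {
  char* workspace = alloc.allocate(workspace_bytes); // device USM bytes
  // ... hand `workspace` to a routine that needs temporary scratch space ...
  alloc.deallocate(workspace, workspace_bytes);
}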


mkl_allocator() noexcept {
auto* queue = new sycl::queue{sycl::default_selector_v};
Contributor:

Elsewhere we are using sycl::cpu_selector_v right now, but we should probably switch to this default selector.

Contributor:

Ah, I looked further and it seems you are handling that in spmv by using the queue that the state object is constructed with.

Contributor:

@yhmtsai @BenBrock @upsj @YvanMokwinski and any others interested in the design of the state/policy/allocator interactions here: we should probably huddle sometime soon and discuss ownership and interaction of the queue (stream) that indicates device intent between the two objects that will be input into each operation:

  1. Execution policy -- we originally designed around the idea that the policy should hold the queue (stream) for the operation.
  2. State -- holds an optional allocator and any other stateful objects to be reused. The allocator needs a queue (stream), so should the allocator/state take in an execution policy? If the state has its own queue (stream), it seems possible to end up with multiple queues in the spmv operation -- one from the policy and one from the state. Which one is to be used? What if they are different? For sycl::queues, it will affect the ordering of operations; for streams, if you always use the default stream it shouldn't matter so much, but as soon as a user creates their own stream it could go bad.
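A minimal illustrative sketch of one possible resolution, under the assumption that the policy owns the single queue and the state only borrows it (execution_policy and make_state are hypothetical names, and spmv_state_t is only available when SPBLAS_ENABLE_ONEMKL_SYCL is defined):

#include <sycl/sycl.hpp>
#include <spblas/spblas.hpp>

// Hypothetical: the policy owns the one sycl::queue used for ordering.
struct execution_policy {
  sycl::queue q{sycl::default_selector_v};
};

// The state borrows the policy's queue through the non-owning
// spmv_state_t(sycl::queue*) constructor, so the operation never sees two
// competing queues.
spblas::spmv_state_t make_state(execution_policy& policy) {
  return spblas::spmv_state_t{&policy.q};
}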

queue_manager_ =
std::move(std::shared_ptr<sycl::queue>{queue, [](sycl::queue* q) {
q->wait_and_throw();
delete q;
}});
}

// taking a shallow copy of queue from elsewhere, so we don't own destruction
mkl_allocator(sycl::queue* q) noexcept
: queue_manager_(q, [](sycl::queue* q) {}) {}
Comment on lines +29 to +30

Contributor:

Suggested change -- replace

  mkl_allocator(sycl::queue* q) noexcept
      : queue_manager_(q, [](sycl::queue* q) {}) {}

with

  /* taking a shallow copy of queue from elsewhere, so we don't own destruction */
  mkl_allocator(sycl::queue* q) noexcept
      : queue_manager_(q, [](sycl::queue* q) {}) {}

template <typename U>
mkl_allocator(const mkl_allocator<U, Alignment>& other) noexcept
: queue_manager_(other.queue_) {}

mkl_allocator(const mkl_allocator&) = default;
mkl_allocator& operator=(const mkl_allocator&) = default;
~mkl_allocator() = default;

using is_always_equal = std::false_type;

pointer allocate(std::size_t size) {
return sycl::malloc_device<value_type>(size, *(this->queue()));
}

void deallocate(pointer ptr, std::size_t n = 0) {
if (ptr != nullptr) {
sycl::free(ptr, *(this->queue()));
}
}
Comment on lines +42 to +50

Contributor:

It seems that allocate/deallocate should be templated, and not the allocator class, right?
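A minimal sketch of that alternative, purely illustrative (byte_allocator is not part of this PR): the class itself is untemplated and allocate/deallocate are member templates.

#include <cstddef>
#include <sycl/sycl.hpp>

class byte_allocator {
public:
  explicit byte_allocator(sycl::queue* q) : queue_(q) {}

  // Typed allocation without templating the whole class.
  template <typename T>
  T* allocate(std::size_t n) {
    return sycl::malloc_device<T>(n, *queue_);
  }

  template <typename T>
  void deallocate(T* ptr, std::size_t /*n*/ = 0) {
    if (ptr != nullptr) {
      sycl::free(ptr, *queue_);
    }
  }

private:
  sycl::queue* queue_; // non-owning, as in the shallow-copy constructor above
};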


bool operator==(const mkl_allocator&) const = default;
bool operator!=(const mkl_allocator&) const = default;

template <typename U>
struct rebind {
using other = mkl_allocator<U, Alignment>;
};

sycl::queue* queue() const noexcept {
return queue_manager_.get();
}

private:
// using shared_ptr to support copy constructor
std::shared_ptr<sycl::queue> queue_manager_;
};

} // namespace mkl
} // namespace spblas
36 changes: 31 additions & 5 deletions include/spblas/vendor/onemkl_sycl/spmv_impl.hpp
@@ -2,6 +2,7 @@

#include <oneapi/mkl.hpp>

#include "mkl_allocator.hpp"
#include <spblas/detail/log.hpp>
#include <spblas/detail/operation_info_t.hpp>
#include <spblas/detail/ranges.hpp>
@@ -24,28 +25,53 @@

namespace spblas {

class spmv_state_t {
public:
spmv_state_t() : spmv_state_t(mkl::mkl_allocator<char>{}) {}

spmv_state_t(sycl::queue* q) : spmv_state_t(mkl::mkl_allocator<char>{q}) {}

spmv_state_t(mkl::mkl_allocator<char> alloc) : alloc_(alloc) {}

sycl::queue* queue() {
return alloc_.queue();
}

private:
mkl::mkl_allocator<char> alloc_;
};
Comment on lines +28 to +42

Contributor:

Should we switch to using uint8_t instead of char, to signal the 1 byte = 8 bits intent?

Contributor:

Again, I'm not sure we want a template T on the allocator, but rather on the allocate/deallocate member functions, right?


template <matrix A, vector X, vector Y>
requires((__detail::has_csr_base<A> || __detail::has_csc_base<A>) &&
__detail::has_contiguous_range_base<X> &&
__ranges::contiguous_range<Y>)
- void multiply(A&& a, X&& x, Y&& y) {
+ void multiply(spmv_state_t& state, A&& a, X&& x, Y&& y) {
log_trace("");
auto a_base = __detail::get_ultimate_base(a);
auto x_base = __detail::get_ultimate_base(x);

auto alpha_optional = __detail::get_scaling_factor(a, x);
tensor_scalar_t<A> alpha = alpha_optional.value_or(1);

- sycl::queue q(sycl::cpu_selector_v);
+ auto q_ptr = state.queue();

- auto a_handle = __mkl::create_matrix_handle(q, a_base);
+ auto a_handle = __mkl::create_matrix_handle(*q_ptr, a_base);
auto a_transpose = __mkl::get_transpose(a);

- oneapi::mkl::sparse::gemv(q, a_transpose, alpha, a_handle,
+ oneapi::mkl::sparse::gemv(*q_ptr, a_transpose, alpha, a_handle,
__ranges::data(x_base), 0.0, __ranges::data(y))
.wait();

- oneapi::mkl::sparse::release_matrix_handle(q, &a_handle).wait();
+ oneapi::mkl::sparse::release_matrix_handle(*q_ptr, &a_handle).wait();
}

template <matrix A, vector X, vector Y>
requires((__detail::has_csr_base<A> || __detail::has_csc_base<A>) &&
__detail::has_contiguous_range_base<X> &&
__ranges::contiguous_range<Y>)
void multiply(A&& a, X&& x, Y&& y) {
spmv_state_t state;
multiply(state, a, x, y);
}

} // namespace spblas
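A brief usage sketch of the two overloads above (illustrative only; the matrix and vector arguments are assumed to already satisfy the library's concepts, and spmv_state_t is only available with SPBLAS_ENABLE_ONEMKL_SYCL):

#include <sycl/sycl.hpp>
#include <spblas/spblas.hpp>

// Hypothetical helper: reuse one state (one queue / one allocator) across
// repeated SpMV calls, then fall back to the stateless overload.
template <typename Matrix, typename VecIn, typename VecOut>
void run_spmv(sycl::queue& q, Matrix&& a, VecIn&& x, VecOut&& y) {
  spblas::spmv_state_t state{&q};   // non-owning: the caller's queue is reused
  spblas::multiply(state, a, x, y); // stateful overload added in this PR

  spblas::multiply(a, x, y);        // stateless overload: default-constructed
                                    // state creates its own default-selector queue
}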
8 changes: 8 additions & 0 deletions test/gtest/CMakeLists.txt
@@ -21,3 +21,11 @@ target_link_libraries(spblas-tests spblas fmt GTest::gtest_main)

include(GoogleTest)
gtest_discover_tests(spblas-tests)

# unify it together after cusparse
if(ENABLE_ONEMKL_SYCL)
add_executable(spblas-gpu-tests rocsparse/spmv_test.cpp)
target_include_directories(spblas-gpu-tests PRIVATE ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries(spblas-gpu-tests spblas fmt GTest::gtest_main)
gtest_discover_tests(spblas-gpu-tests)
endif()
59 changes: 59 additions & 0 deletions test/gtest/onemkl/device_vector.hpp
@@ -0,0 +1,59 @@
#pragma once
#include <iterator>
#include <memory>
#include <spblas/vendor/onemkl_sycl/mkl_allocator.hpp>
#include <sycl/sycl.hpp>
#include <vector>

namespace thrust {

template <typename InputIt, typename OutputIt>
requires(std::contiguous_iterator<InputIt> &&
std::contiguous_iterator<OutputIt>)
OutputIt copy(InputIt first, InputIt last, OutputIt d_first) {
sycl::queue queue(sycl::default_selector_v);
using input_value_type = typename std::iterator_traits<InputIt>::value_type;
using output_value_type = typename std::iterator_traits<OutputIt>::value_type;
input_value_type* first_ptr = std::to_address(first);
output_value_type* d_first_ptr = std::to_address(d_first);
auto num = std::distance(first, last);
queue.memcpy(d_first_ptr, first_ptr, num * sizeof(input_value_type))
.wait_and_throw();
return d_first + num;
}

// incomplete implementation of a thrust-like vector for oneMKL, just for test usage
template <typename ValueType>
class device_vector {
public:
device_vector(std::vector<ValueType> host_vector)
: alloc_{}, size_(host_vector.size()), ptr_(nullptr) {
ptr_ = alloc_.allocate(size_);
thrust::copy(host_vector.begin(), host_vector.end(), ptr_);
}

~device_vector() {
alloc_.deallocate(ptr_, size_);
ptr_ = nullptr;
}

ValueType* begin() {
return ptr_;
}

ValueType* end() {
return ptr_ + size_;
}

// just to give data().get()
std::shared_ptr<ValueType> data() {
return std::shared_ptr<ValueType>(ptr_, [](ValueType* ptr) {});
}

private:
spblas::mkl::mkl_allocator<ValueType> alloc_;
std::size_t size_;
ValueType* ptr_;
};

} // namespace thrust
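A small usage sketch of this shim (a hypothetical test snippet, not part of the diff):

#include <cstddef>
#include <vector>
#include "onemkl/device_vector.hpp"

// Copy host data to the device and expose raw pointers through the
// thrust-like data().get() idiom.
void example_usage() {
  std::vector<float> host{1.0f, 2.0f, 3.0f};
  thrust::device_vector<float> dev(host);

  float* raw = dev.data().get();
  std::size_t n = static_cast<std::size_t>(dev.end() - dev.begin());
  (void)raw;
  (void)n;
}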
5 changes: 5 additions & 0 deletions test/gtest/rocsparse/spmv_test.cpp
@@ -3,7 +3,12 @@
#include <spblas/spblas.hpp>

#include <gtest/gtest.h>

#ifdef SPBLAS_ENABLE_ONEMKL_SYCL
#include "onemkl/device_vector.hpp"
#else
Comment on lines +6 to +9

Contributor:

So this generalizes the rocsparse device spmv_test into a single device spmv_test.cpp for rocsparse/mkl_sycl and, in the future, others? Should we rename the rocsparse/ folder to device/ or accelerator/?

Contributor (Author):

Yes, I put it into #40 (changing it to device). I tend to avoid making the same changes in two PRs because it can make it hard to exchange ideas during review within the same PR. I can move that change into here if this PR moves more quickly.

Contributor:

Can I suggest we shift it to thrust_device? Then if we have other device examples that are specific to, say, SYCL, ROCm, or CUDA, we could have sycl_device, rocm_device, or cuda_device folders as well.

Contributor (Author):

Do the different folders here contain tests with different vector allocation, or backend-specific functions? If you mean backend-specific functions, I might put them into device/<vendor>: device contains the uniform tests for all backends, and device/<vendor> contains the tests for a specific vendor. Both ways are clear, so I do not mind choosing the other one unless it makes the CMake setup worse (I do not think it would, but just in case).

#include <thrust/device_vector.h>
#endif

using value_t = float;
using index_t = spblas::index_t;