Skip to content

Initial OpenMP/GPU support#228

Open
pbartholomew08 wants to merge 28 commits intoxcompact3d:mainfrom
pbartholomew08:omp_gpu
Open

Initial OpenMP/GPU support#228
pbartholomew08 wants to merge 28 commits intoxcompact3d:mainfrom
pbartholomew08:omp_gpu

Conversation

@pbartholomew08
Copy link
Copy Markdown
Member

This branch implements the basic offloading of data using openMP with the ability to perform the vecadd operation.

It now behaves polymorphically and will create device fields when
that is the `next` type and host fields otherwise
As libx3d2_backends links libx3d2 and xcompact/tests link both the
linking order must be libx3d2_backends then libx3d2 to prevent
duplicate symbols during linking.
This is to allow initialisation of the base class whether called
by OMP/CPU or the OMP/TGT object
The code runs through the test successfully - need to confirm
offload and check for data movement
This means data resides on the device only
Note this is only working on AMD GPUs (although it should be supported
on NVIDIA GPUs w/Cray compiler)
@pbartholomew08 pbartholomew08 marked this pull request as ready for review April 2, 2026 10:54
@pbartholomew08 pbartholomew08 requested a review from ia267 April 2, 2026 10:54
Copy link
Copy Markdown
Collaborator

@ia267 ia267 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of things to address before merging.

use m_common, only: dp, pi, DIR_X
use m_mesh, only: mesh_t

use m_omptg_allocator, only: omptgt_allocator_t
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: should be m_omptgt_allocator

I guess this file hasn't been added to CMakeListst.txt so it wasn't pickedup.

Comment thread src/CMakeLists.txt
set(CMAKE_Fortran_FLAGS_DEBUG "-g -Og -Wall -Wpedantic -Werror -Wimplicit-interface -Wimplicit-procedure -Wno-unused-dummy-argument")
set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -ffast-math")
if (OMP_TGT)
# A bit of a hack - hardcoded for MI300A
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use CMake cache varibales (e.g. OMP_TGT_ARCH) so we don't need to edit CMake files?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - this was added for development, and not intended as the actual implementation

Comment thread tests/CMakeLists.txt
target_link_libraries(${test_name} PRIVATE OpenMP::OpenMP_Fortran)

if(${backend} STREQUAL omp_tgt)
# Note this is somewhat of a hack - hardcoded to build against MI300A
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use CMake cache variables instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have a list of options as follow:

set(gpu_archs gfx942 CACHE STRING "")
set_property(CACHE gpu_archs PROPERTY STRINGS gfx942 gfx900 gfx902 gfx906 gfx908 gfx909 gfx90a gfx90c)

One other alternative is to add an additional option of the form cmake .. -Dgpu_arch=gfx942

This would avoid updating the list of toggled option every time there is a new architecture coming out but put the responsibility of setting the correct input on the user.

In both case, we should be able to set the parameter as follow
target_compile_options(${test_name} PRIVATE "--offload-arch=${gpu_arch}")

end subroutine

! Deallocates device-resident memory before deallocating the base type
subroutine destroy(self)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This destroy subroutine is not bound to omptgt_allocator_t. Need to add procedure :: destroy => destroy to omptgt_allocator_t

Comment thread tests/performance/perf_cuda_reorder.f90 Outdated
Comment on lines +296 to +306
!$omp target teams distribute parallel do private(out_i, out_j, out_k) collapse(3) map(to:u) has_device_addr(u_)
do k = 1, dims(3)
do j = 1, dims(2)
do i = 1, dims(1)
call get_index_reordering(out_i, out_j, out_k, i, j, k, &
dir_from, dir_to, SZ, cart_padded)
u_(out_i, out_j, out_k) = u(i, j, k)
end do
end do
end do
!$omp end target teams distribute parallel do
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: Re-ordering without using shared (scratchpad) memory is likely to be inefficient on GPU due to non-coalesced memory access

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants