Fix memory issues in CUDA EXX screening #182
vmitq wants to merge 1 commit into wavefunction91:master
Conversation
Pull request overview
This PR fixes correctness issues in CUDA exact-exchange (EXX) EK screening for large workloads by adjusting how task patches are generated/processed on device, and by widening several CUDA screening counters to 64-bit to avoid overflow.
Changes:
- Reworks the CUDA `exx_ek_screening` task loop to process each `generate_buffers()` patch independently (avoiding buffer overwrite across inner-loop iterations).
- Promotes CUDA EXX screening collision counts / position lists from 32-bit to 64-bit.
- Adds a new device-memory requirement flag for reserving EXX collision scratch space, and increases Scheme1 static padding.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/xc_integrator/xc_data/device/xc_device_data.hpp` | Adds `task_exx_collision` and a sizing helper to reserve scratch space for EXX collision work. |
| `src/xc_integrator/xc_data/device/xc_device_aos_data.cxx` | Accounts for the new EXX collision scratch reservation in AoS device memory requirements. |
| `src/xc_integrator/local_work_driver/device/scheme1_data_base.cxx` | Increases static allocation padding used for alignment. |
| `src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu` | Widens collision counts/position lists to 64-bit and updates associated kernel signatures and allocations. |
| `src/xc_integrator/integrator_util/exx_screening.cxx` | Removes the problematic double-loop batching and calls collision per generated task patch. |
Comments suppressed due to low confidence (1)
src/xc_integrator/local_work_driver/device/cuda/kernels/exx_ek_screening_bfn_stats.cu:299
`print_counts` now takes `uint64_t* counts`, but the `printf` format string still uses `%d` for `counts[i_task]` (and `i_task` is also not an `int`). This is undefined behavior and can corrupt output or crash when debugging. Use the correct 64-bit/`size_t` format specifiers (or cast to an explicitly formatted type).
```cuda
__global__ void print_counts(size_t ntasks, uint64_t* counts) {
  for(auto i_task = 0 ; i_task < ntasks; ++i_task) {
    printf("[GPU] ITASK %d: %d\n", i_task, counts[i_task]);
  }
}
```
```diff
 template <int32_t buffer_size, typename buffer_type = uint32_t>
 __global__ void bitvector_to_position_list_shellpair(
     size_t ntasks,
     size_t nsp,
     size_t LD_bit,
     const uint32_t* collisions,
-    const uint32_t* counts,
-    uint32_t* position_list
+    const uint64_t* counts,
+    uint64_t* position_list
 ) {
```
In bitvector_to_position_list_shellpair, the signature was widened to uint64_t counts/position_list, but the kernel body still uses 32-bit int for indices/counters (e.g., current, j_block). For large nsp/counts (the scenario that motivated 64-bit counts), these int variables can overflow and generate incorrect position lists. Promote these loop counters/offsets to size_t/uint64_t consistently.
```diff
 template <int32_t buffer_size, typename buffer_type = uint32_t>
 __global__ void bitvector_to_position_list_shells(
     size_t ntasks,
     size_t nshells,
     size_t LD_bit,
     const uint32_t* collisions,
-    const uint32_t* counts,
+    const uint64_t* counts,
     const int32_t* shell_size,
-    uint32_t* position_list,
+    uint64_t* position_list,
     size_t* nbe_list
```
In bitvector_to_position_list_shells, counts/position_list were widened to 64-bit, but the kernel still uses 32-bit int counters/loop variables (e.g., current, j_block). This can overflow for large shell counts and defeat the purpose of moving to 64-bit counts. Use size_t/uint64_t for these indices and for offset arithmetic into position_list.
```cpp
inline size_t task_exx_collision_size(size_t nshells) {
  const size_t nslt = (nshells * (nshells+1)) / 2
                    + nshells
                    ;
```
task_exx_collision_size computes the scratch size using the dense lower-triangular shell-pair count (nshells*(nshells+1))/2, but ShellPairCollection is stored sparsely (only nonzero primitive-screened pairs). Using the dense upper bound can massively overestimate per-task memory, forcing very small GPU task patches or even preventing batching on large systems. Consider basing this size on the actual allocated nshell_pairs (e.g., global_dims.nshell_pairs) instead of nshells.
Suggested change:

```diff
-inline size_t task_exx_collision_size(size_t nshells) {
-  const size_t nslt = (nshells * (nshells+1)) / 2
-                    + nshells
-                    ;
+inline size_t task_exx_collision_size(size_t nshell_pairs) {
+  // Size the EXX collision workspace based on the actual number of
+  // allocated shell pairs rather than the dense lower-triangular
+  // shell-pair count (which can massively overestimate the required
+  // memory for sparse ShellPairCollection objects).
+  const size_t nslt = nshell_pairs;
```
```cpp
reqt.task_bfn_shell_indirection_size( nbe_bfn ) * sizeof(int32_t) +

// Scratch memory to store shell pairs
reqt.task_exx_collision_size( nshells ) * sizeof(int64_t) +
```
The EXX collision scratch reservation is currently computed from nshells (dense (nshells*(nshells+1))/2 upper bound) rather than the actually allocated sparse shell-pair count (global_dims.nshell_pairs). On large/sparse systems this can drastically inflate get_mem_req, causing generate_buffers to choose much smaller task patches than necessary. Prefer using global_dims.nshell_pairs (or passing nshell_pairs into the sizing helper) for a tighter bound.
Suggested change:

```diff
-reqt.task_exx_collision_size( nshells ) * sizeof(int64_t) +
+reqt.task_exx_collision_size( task.global_dims.nshell_pairs ) * sizeof(int64_t) +
```
This PR addresses memory-related issues in exact exchange calculation on CUDA GPU devices.