@f3sch
Collaborator

@f3sch f3sch commented Apr 17, 2025

Fixes two things for ITS GPU to be deterministic again.

o2::gpu::CAMath::Min(ZBins - 1, utils.getZBinIndex(layerIndex + 1, zRangeMax)),

This uses ZBins, a constexpr equal to 256, but the value should be taken from the params, which set it to 64.
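
A minimal host-side sketch of why the clamp bound matters. All names here (kDefaultZBins, TrackingParams, clampZBin) are hypothetical stand-ins, not the O2 types; only the values 256 and 64 come from the description above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical compile-time default, standing in for the constexpr ZBins.
constexpr int kDefaultZBins = 256;

// Hypothetical runtime configuration, standing in for the tracking params.
struct TrackingParams {
  int zBins = 64;
};

// Clamp a raw z-bin index to the last valid bin, mirroring the
// Min(nZBins - 1, index) pattern in the line quoted above.
inline int clampZBin(int rawBin, int nZBins)
{
  return std::min(nZBins - 1, rawBin);
}
```

With the compile-time bound, a raw index of 100 passes through unchanged and points outside a 64-bin table; with the runtime bound it is clamped to 63.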
nCells[iLayer] + 1, // num_items

This should be foundSeedsTable.size(), since we can find more seeds than nCells.
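
A CPU sketch of the item-count issue, using a plain loop as a stand-in for the device-side scan (it is not the CUB call itself). The function name and the table contents are hypothetical; the point is only that the count passed as num_items has to match the table that is actually scanned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place exclusive prefix sum over the first numItems entries only,
// mimicking a scan API that takes an explicit item count.
// Assumption: a count larger than the table would be an out-of-bounds
// access, so it is rejected here.
inline void exclusiveScanFirstN(std::vector<int>& table, std::size_t numItems)
{
  assert(numItems <= table.size());
  int running = 0;
  for (std::size_t i = 0; i < numItems; ++i) {
    const int count = table[i];
    table[i] = running; // offset of entry i = sum of all earlier counts
    running += count;
  }
}
```

If the count is smaller than the table, the trailing entries keep their raw counts instead of offsets; if it is larger, the scan touches memory past the allocation. Passing table.size() avoids both.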
With this I get the ratio below, although some bins still differ by ~0.01% in 10 pp TFs. Any idea why?
[image: ratio]

The rest is mostly refactoring that happened on the fly, especially in clusterToTracksImpl, since I wanted to be sure that the two algorithms do not diverge at any point.

@github-actions
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@f3sch f3sch marked this pull request as ready for review April 17, 2025 18:55
@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for 8be3ee5 at 2025-04-17 22:43:

## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:802:85: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:815:90: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:821:85: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:834:90: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:898:84: error: use nullptr [modernize-use-nullptr]
++ [[ 0 == 0 ]]
++ exit 1
--

Full log here.
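
For context on the modernize-use-nullptr findings above: the check flags null pointers written as 0 or NULL. The sketch below uses a hypothetical overload pair (not code from TrackingKernels.hip) to show why the literal 0 is a poor null pointer.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical overloads: the literal 0 is an int first and only converts
// to a pointer, so it selects the int overload.
inline const char* describe(int) { return "int"; }
inline const char* describe(void*) { return "pointer"; }
```

Calling describe(0) picks the int overload, while describe(nullptr) can only bind to the pointer overload, which is the intent when passing a null argument.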

@alibuild
Collaborator

alibuild commented Apr 18, 2025

Error while checking build/O2/fullCI_slc9 for 55f819d at 2025-04-18 12:04:

## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:802:85: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:815:90: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:821:85: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:834:90: error: use nullptr [modernize-use-nullptr]
/sw/BUILD/11921aa24f97e95c080dfce14a35f98fd40639b9/O2/Detectors/ITSMFT/ITS/tracking/GPU/hip/TrackingKernels.hip:898:84: error: use nullptr [modernize-use-nullptr]
++ [[ 0 == 0 ]]
++ exit 1
--

Full log here.

@mconcas
Collaborator

mconcas commented Apr 18, 2025

Hi, thanks for looking into this. As per usual it would be better to separate code refactoring from small fixes. Can you split it in two?

@f3sch
Collaborator Author

f3sch commented Apr 18, 2025

Sure. The first commit is now the actual fix, the second is the refactoring of the GPU kernel code, and the third is the refactoring of the hybrid/CPU call chain.

mconcas previously approved these changes Apr 22, 2025
@f3sch f3sch marked this pull request as draft April 22, 2025 09:54
f3sch added 3 commits April 23, 2025 20:25
- compute-sanitizer reveals an invalid write past the allocated table
- ZBins for map lookup was not taken from params
- Removes unused headers
- adds two new functions for exclusive/inclusive scan via cub
- inlines square
- applies deterministic mode blocks=1,threads=1 to all kernels
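
A minimal sketch of how a deterministic mode could override the launch configuration, assuming a single switch applied uniformly to every kernel launch. LaunchConfig and chooseLaunch are hypothetical names, not the O2 helpers; only the blocks=1, threads=1 setting comes from the commit message above.

```cpp
#include <cassert>

// Hypothetical launch configuration: grid size and block size.
struct LaunchConfig {
  int blocks;
  int threads;
};

// In deterministic mode every kernel runs with one block of one thread,
// so the order of writes and accumulations is fixed from run to run.
// Otherwise the requested configuration is passed through unchanged.
constexpr LaunchConfig chooseLaunch(bool deterministic, int blocks, int threads)
{
  return deterministic ? LaunchConfig{1, 1} : LaunchConfig{blocks, threads};
}
```

Routing every launch through one such helper makes it harder for a single kernel to miss the deterministic setting.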
@f3sch
Collaborator Author

f3sch commented Apr 23, 2025

Now it is identical up to the precision of TH1F :)
Both use 20 TFs of high-IR pp.
[image]

@f3sch f3sch marked this pull request as ready for review April 23, 2025 19:07
@mconcas mconcas merged commit 72b50c6 into AliceO2Group:dev Apr 24, 2025
12 checks passed
@f3sch f3sch deleted the its/gpu_fix branch April 24, 2025 06:23