@mconcas
Collaborator

@mconcas mconcas commented Mar 28, 2025

  • Fix missing header
  • Clean up stale ITS GPU code

@github-actions
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@alibuild
Collaborator

alibuild commented Mar 28, 2025

Error while checking build/O2/fullCI_slc9 for b0b0fa3 at 2025-03-30 01:31:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14124-slc9_x86-64/0/Detectors/ITSMFT/ITS/tracking/include/ITStracking/Definitions.h:32:10: fatal error: cuda_runtime.h: No such file or directory
ninja: build stopped: subcommand failed.

Full log here.

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for ec3a797 at 2025-03-31 13:13:

## sw/BUILD/O2-latest/log
/opt/rocm/include/hip/hip_runtime.h:48:10: fatal error: stdint.h: No such file or directory
ninja: build stopped: subcommand failed.

Full log here.

@mconcas
Collaborator Author

mconcas commented Mar 31, 2025

@davidrohr: After some refactoring, it seems that some unwanted includes from Definitions.h slip into some RTC code. Is there a macro I can use to try to hide them from RTC?
My assumption is that cudaStream_t and hipStream_t, specifically, will come from elsewhere if needed.

Edit: I am now trying #ifndef GPUCA_GPUCODE_DEVICE to see whether that is enough for the CI.

@davidrohr
Collaborator

Why is the ITS Definitions.h used for RTC code at all? It will not do any RTC for ITS code...
In any case, I am protecting all standard C/C++ headers with #ifndef GPUCA_GPUCODE_DEVICE, such that they never appear in device code. So that is certainly safe to do.
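
The guard described above can be sketched as follows. This is only an illustration of the pattern: `GPUCA_GPUCODE_DEVICE` is the macro named in this thread, while the namespace, struct, and function names are hypothetical and do not come from the O2 sources.

```cpp
// Sketch of a header that is seen by both the host and the device pass.
// Only GPUCA_GPUCODE_DEVICE comes from the thread; all other names are
// hypothetical and chosen for illustration.
#ifndef GPUCA_GPUCODE_DEVICE
// Host pass only: standard headers never reach device code.
#include <string>
#endif

namespace sketch
{
// Plain data that both passes can use; it relies on built-in types only.
struct ClusterIndex {
  int layer;
  int index;
};

// Usable in both passes because it needs no standard header.
inline int flatten(const ClusterIndex& c, int clustersPerLayer)
{
  return c.layer * clustersPerLayer + c.index;
}

#ifndef GPUCA_GPUCODE_DEVICE
// Host-only helper: it depends on std::string, so it sits behind the guard.
inline std::string describe(const ClusterIndex& c)
{
  return "layer " + std::to_string(c.layer) + ", index " + std::to_string(c.index);
}
#endif
} // namespace sketch
```

When the macro is defined, the translation unit sees neither `<string>` nor `describe`, so anything that must compile in the device pass has to stay outside the guarded regions.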

@alibuild
Collaborator

Error while checking build/O2/fullCI_slc9 for cbc0af5 at 2025-03-31 17:07:

## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
/sw/BUILD/27837940a3dc480df24a7a5f11ae5a0b18de6b94/O2/GPU/GPUTracking/Base/hip/hipify/GPUReconstructionHIP.hip:90:23: error: use '= default' to define a trivial destructor [modernize-use-equals-default]
++ [[ 0 == 0 ]]
++ exit 1
--

Full log here.
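
For reference, the clang-tidy check `modernize-use-equals-default` reported above fires on special member functions with an empty, user-provided body. A minimal sketch of the pattern and its usual fix follows; the struct names are hypothetical, and this is not the code from GPUReconstructionHIP.hip or the change made in the follow-up PR.

```cpp
#include <type_traits>

namespace sketch
{
// The kind of code the check flags: an empty, user-provided destructor.
struct BackendBefore {
  ~BackendBefore() {}
};

// The suggested form: let the compiler generate the destructor.
struct BackendAfter {
  ~BackendAfter() = default;
};

// Beyond style, the two differ observably: a user-provided destructor
// is never trivial, while one defaulted on its first declaration can be.
static_assert(!std::is_trivially_destructible<BackendBefore>::value,
              "user-provided destructor is not trivial");
static_assert(std::is_trivially_destructible<BackendAfter>::value,
              "defaulted destructor stays trivial");
} // namespace sketch
```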

@mconcas
Collaborator Author

mconcas commented Mar 31, 2025

> Why is the ITS Definitions.h used for RTC code at all? It will not do any RTC for ITS code... In any case, I am protecting all standard C/C++ headers with #ifndef GPUCA_GPUCODE_DEVICE, such that they never appear in device code. So that is certainly safe to do.

I am not even sure it is RTC, actually, but as I cannot reproduce the error locally, I guessed it was. Anyway, the protection does work; the remaining error now seems spurious...

@davidrohr
Collaborator

OK, good. Could you just tell me quickly where exactly you added the #ifndef GPUCA_GPUCODE_DEVICE protection? I'd like to check why it was needed. Might indicate that something is not working as intended.

For the current error, I honestly do not understand it... Perhaps let's wait for the FullCI to rerun, if it disappears, let's ignore it.

@mconcas
Collaborator Author

mconcas commented Mar 31, 2025

> OK, good. Could you just tell me quickly where exactly you added the #ifndef GPUCA_GPUCODE_DEVICE protection? I'd like to check why it was needed. Might indicate that something is not working as intended.
>
> For the current error, I honestly do not understand it... Perhaps let's wait for the FullCI to rerun, if it disappears, let's ignore it.

Hold on, I am simplifying things even further: moving the Stream definition into the place where it is used, not exposing anything to headers, and removing two additional files.
Pushing now, let's see the CI.
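
One common way to keep a backend stream type out of public headers is an opaque implementation struct. The sketch below illustrates that general idea only. It is an assumption, not necessarily what this PR does: `GpuWorker`, `Impl`, and `BackendStream_t` are hypothetical names, and `BackendStream_t` is an `int` stand-in for `cudaStream_t` / `hipStream_t` so that the example compiles without a GPU toolkit.

```cpp
#include <memory>

namespace sketch
{
// ---- What the public header would contain ----
// No backend include and no stream type is visible here.
class GpuWorker
{
 public:
  GpuWorker();
  ~GpuWorker();
  void launch();
  int launches() const;

 private:
  struct Impl; // only forward-declared in the header
  std::unique_ptr<Impl> mImpl;
};

// ---- What the single backend source file would contain ----
// Hypothetical stand-in; a real backend would include its runtime header
// here and use cudaStream_t or hipStream_t instead.
using BackendStream_t = int;

struct GpuWorker::Impl {
  BackendStream_t stream = 0;
  int launches = 0;
};

GpuWorker::GpuWorker() : mImpl(std::make_unique<Impl>()) {}
// Defined where Impl is complete, so unique_ptr can destroy it.
GpuWorker::~GpuWorker() = default;
void GpuWorker::launch() { ++mImpl->launches; }
int GpuWorker::launches() const { return mImpl->launches; }
} // namespace sketch
```

With this layout, only the source file that defines `Impl` needs the backend runtime header, so host-only builds that include the public header would not hit a missing `cuda_runtime.h`.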

@mconcas mconcas force-pushed the pr_cleanup_itsgpu branch 2 times, most recently from 29157a6 to 6c692b3 Compare March 31, 2025 17:49
@mconcas mconcas force-pushed the pr_cleanup_itsgpu branch from 6c692b3 to 1e5dfe6 Compare March 31, 2025 17:50
@alibuild
Collaborator

alibuild commented Mar 31, 2025

Error while checking build/O2/fullCI_slc9 for 1e5dfe6 at 2025-04-01 22:12:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:


## sw/BUILD/O2-sim-challenge-test-latest/log
./sim-challenge.logDetected critical problem in logfile mftmchMatch.log
./sim-challenge.logmftmchMatch.log:[15469:internal-dpl-ccdb-backend]: [22:12:09][ERROR] Exception while running: Fatal error. Rethrowing.
./sim-challenge.logmftmchMatch.log-[15469:internal-dpl-ccdb-backend]: [22:12:09][FATAL] Unhandled o2::framework::runtime_error reached the top of main of o2-globalfwd-matcher-workflow, device shutting down. Reason: Fatal error
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][ERROR] CCDBDownloader CURL transfer error - Timeout was reached
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][ERROR] CcdbDownloader finished transfer http://alice-ccdb.cern.ch/MFT/Config/AlpideParam for 1546300800000 (agent_id: alimetal01.cern.ch-1743538324-B1Bjrj) with http code: 0
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][ERROR] File MFT/Config/AlpideParam could not be retrieved. No more hosts to try.
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][FATAL] Unable to find CCDB object MFT/Config/AlpideParam/1546300800000
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][ERROR] Exception while running: Fatal error. Rethrowing.
./sim-challenge.log[15469:internal-dpl-ccdb-backend]: [22:12:09][FATAL] Unhandled o2::framework::runtime_error reached the top of main of o2-globalfwd-matcher-workflow, device shutting down. Reason: Fatal error
./sim-challenge.log[ERROR] Workflow crashed - PID 15469 (internal-dpl-ccdb-backend) did not exit correctly however it's not clear why. Exit code forced to 128.
./sim-challenge.log[ERROR]  - Device internal-dpl-ccdb-backend: pid 15469 (exit 128)
./sim-challenge.log[INFO]    - First error: [22:12:09][FATAL] Unable to find CCDB object MFT/Config/AlpideParam/1546300800000
./sim-challenge.log[ERROR] SEVERE: Device internal-dpl-ccdb-backend (15469) had at least one message above severity 7: Unable to find CCDB object MFT/Config/AlpideParam/1546300800000
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/37}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/38}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/40}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/42}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/43}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/45}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/46}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/47}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/48}
./digi.log[ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/49}
./digi.log[6057:internal-dpl-clock]: [ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/37}
./digi.log[6057:internal-dpl-clock]: [ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/38}
./digi.log[6057:internal-dpl-clock]: [ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/40}
./digi.log[6057:internal-dpl-clock]: [ERROR] Found duplicate input binding with different spec.:collisioncontext {SIM/COLLISIONCONTEXT/42}
[0 more errors; see full log]

Full log here.

@davidrohr
Collaborator

The codechecker issue was genuine; it should be fixed by #14129.

@mconcas mconcas merged commit 720f7c4 into AliceO2Group:dev Apr 2, 2025
11 checks passed
@mconcas mconcas deleted the pr_cleanup_itsgpu branch April 2, 2025 07:04