@aalkin
Member

@aalkin aalkin commented Apr 1, 2025

No description provided.

@aalkin aalkin requested a review from a team as a code owner April 1, 2025 09:21
@github-actions
Contributor

github-actions bot commented Apr 1, 2025

REQUEST FOR PRODUCTION RELEASES:
To request that your PR be included in production software, add the corresponding "async-" labels to it. Add the labels directly (if you have the permissions) or add a comment of the following form (labels are separated by ","):

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available:
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

@aalkin aalkin requested a review from saganatt April 1, 2025 10:48
@alibuild
Collaborator

alibuild commented Apr 1, 2025

Error while checking build/O2/fullCI_slc9 for d72798d at 2025-04-01 14:22:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:


## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile reco_NOGPU.log
reco_NOGPU.log:[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] Exception while running: Fatal error. Rethrowing.
reco_NOGPU.log-[27266:internal-dpl-ccdb-backend]: [14:22:07][FATAL] Unhandled o2::framework::runtime_error reached the top of main of o2-itsmft-stf-decoder-workflow, device shutting down. Reason: Fatal error
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CCDBDownloader CURL transfer error - Timeout was reached
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CcdbDownloader finished transfer http://alice-ccdb.cern.ch/TOF/Calib/LHCphase for 1550600800000 (agent_id: alimetal01.cern.ch-1743510114-iscfhX) with http code: 0
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] File TOF/Calib/LHCphase could not be retrieved. No more hosts to try.
[27266:internal-dpl-ccdb-backend]: [14:22:06][FATAL] Unable to find CCDB object TOF/Calib/LHCphase/1550600800000
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] Exception while running: Fatal error. Rethrowing.
[27266:internal-dpl-ccdb-backend]: [14:22:07][FATAL] Unhandled o2::framework::runtime_error reached the top of main of o2-itsmft-stf-decoder-workflow, device shutting down. Reason: Fatal error
[ERROR] Workflow crashed - PID 27266 (internal-dpl-ccdb-backend) did not exit correctly however it's not clear why. Exit code forced to 128.
[ERROR]  - Device internal-dpl-ccdb-backend: pid 27266 (exit 128)
[INFO]    - First error: [14:22:06][FATAL] Unable to find CCDB object TOF/Calib/LHCphase/1550600800000
[ERROR] SEVERE: Device internal-dpl-ccdb-backend (27266) had at least one message above severity 7: Unable to find CCDB object TOF/Calib/LHCphase/1550600800000


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/57b8f500b155f4d4e155a1a7aed05c36ea4c2be9/slc9_x86-64/o2checkcode/1.0-local52/etc/modulefiles
++ cat
--

[0 more errors; see full log]

Full log here.

@ktf
Member

ktf commented Apr 1, 2025

@shahor02 any idea of what happened to TOF/Calib/LHCphase/1550600800000?

Tested on hyperloop. Works fine.

@ktf ktf merged commit 1447517 into AliceO2Group:dev Apr 1, 2025
10 of 12 checks passed
@shahor02
Collaborator

shahor02 commented Apr 1, 2025

@ktf I've already seen such a timeout in the FST today; perhaps @costing can tell why it happens:

[27266:internal-dpl-ccdb-backend]: [14:21:54][INFO] Init CcdApi with UserAgentID: alimetal01.cern.ch-1743510114-iscfhX, Host: http://alice-ccdb.cern.ch, Curl timeouts: upload:20 download:1
...
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CCDBDownloader CURL transfer error - Timeout was reached
[27266:internal-dpl-ccdb-backend]: 
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CcdbDownloader finished transfer http://alice-ccdb.cern.ch/TOF/Calib/LHCphase for 1550600800000 (agent_id: alimetal01.cern.ch-1743510114-iscfhX) with http code: 0
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] File TOF/Calib/LHCphase could not be retrieved. No more hosts to try.
[27266:internal-dpl-ccdb-backend]: [14:22:06][ALARM] Curl request to http://alice-ccdb.cern.ch/TOF/Calib/LHCphase/1550600800000/, response code: 0
[27266:internal-dpl-ccdb-backend]: [14:22:06][FATAL] Unable to find CCDB object TOF/Calib/LHCphase/1550600800000

@costing
Collaborator

costing commented Apr 2, 2025

Hi @shahor02 ,

The FSTs should not (and probably cannot) access the Offline server. They should contact o2-ccdb.internal instead of alice-ccdb.cern.ch .

Cheers,

.costin

@shahor02
Collaborator

shahor02 commented Apr 2, 2025

@costing, this is the FST running in the FullCI on the build servers. I believe it has always used alice-ccdb.cern.ch, and it usually works fine. In this job, too, plenty of objects were fetched before the timeout happened.

@costing
Collaborator

costing commented Apr 2, 2025

Ah, sorry, I thought this runs on the FST nodes.

I don't find any logs on the server side with the given agent ID. Other requests from the same machine, but with a different process unique key, do show up, so a priori these should also have worked. I don't see why there would be a difference between processes running on the same host.

{"timestamp":1743510115168,...,"method":"GET","status":303,"elapsed_ms":162.595328,"path":"CTP\/Calib\/OrbitReset","sov":1550600800000,"notAfter":3385078236000,"userAgent":"alimetal01.cern.ch-1743510114-Tbb4hn"}
...
{"timestamp":1743510126808,...,"method":"GET","status":303,"elapsed_ms":1047.944325,"path":"TOF\/Calib\/LHCphase","sov":1550600800000,"notAfter":3385078236000,"userAgent":"alimetal01.cern.ch-1743510114-Tbb4hn"}

(88 requests)

Cheers,

.costin
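
The server-side records quoted above can be compared against the client's download timeout. The following is only a minimal sketch: it assumes that the `download:1` curl timeout in the client log is given in seconds and that `elapsed_ms` is the time the server spent on the request. The two sample records are reduced to the fields visible in the thread (the elided `...` fields are left out rather than guessed), and `slow_requests` is a hypothetical helper name. Note that these records carry the agent ID ending in `Tbb4hn`, not the failing `iscfhX`, so the result is a hint to check, not a diagnosis.

```python
import json

# Assumption: "download:1" in the client log means a 1 second timeout.
DOWNLOAD_TIMEOUT_MS = 1 * 1000

# Records from the thread, reduced to the fields that were shown.
SERVER_LOG = """\
{"timestamp":1743510115168,"method":"GET","status":303,"elapsed_ms":162.595328,"path":"CTP/Calib/OrbitReset","userAgent":"alimetal01.cern.ch-1743510114-Tbb4hn"}
{"timestamp":1743510126808,"method":"GET","status":303,"elapsed_ms":1047.944325,"path":"TOF/Calib/LHCphase","userAgent":"alimetal01.cern.ch-1743510114-Tbb4hn"}
"""


def slow_requests(log_text, timeout_ms):
    """Return (path, elapsed_ms, userAgent) for records slower than timeout_ms."""
    slow = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        rec = json.loads(line)
        if rec["elapsed_ms"] > timeout_ms:
            slow.append((rec["path"], rec["elapsed_ms"], rec["userAgent"]))
    return slow


if __name__ == "__main__":
    for path, elapsed, agent in slow_requests(SERVER_LOG, DOWNLOAD_TIMEOUT_MS):
        print(f"{path}: {elapsed:.1f} ms (agent {agent})")
```

With the two sample records, only the `TOF/Calib/LHCphase` request is reported, since its `elapsed_ms` of about 1048 is above the assumed 1000 ms limit.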

@shahor02
Collaborator

shahor02 commented Apr 2, 2025

Hi @costing

In the log https://ali-ci.cern.ch/alice-build-logs/AliceO2Group/AliceO2/14133/d72798d7ddf8e6d8cec53af4e42cf970aa2fe256/build_O2_fullCI_slc9/fullLog.txt I see plenty of successful fetches with this agent-id, before it finally fails on timeout:

grep 'alimetal01.cern.ch-1743510114-iscfhX' fullLog.txt 
[27266:internal-dpl-ccdb-backend]: [14:21:54][INFO] Init CcdApi with UserAgentID: alimetal01.cern.ch-1743510114-iscfhX, Host: http://alice-ccdb.cern.ch, Curl timeouts: upload:20 download:1
[27266:internal-dpl-ccdb-backend]: [14:21:56][INFO] ccdb reads http://alice-ccdb.cern.ch/CTP/Calib/OrbitReset/1550600800000/b07c53e0-b4c0-11ec-b66d-90ce809b250c for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:21:56][INFO] ccdb reads http://alice-ccdb.cern.ch/EMC/Config/RecoParam/1546300800000/d791f4c0-3ffb-11ed-a67e-2a010e0a0b16 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:21:56][INFO] ccdb reads http://alice-ccdb.cern.ch/EMC/Calib/FeeDCS/1546297200000/22914a24-fafa-11ed-9692-200114580202 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:21:56][INFO] ccdb reads http://alice-ccdb.cern.ch/ITS/Calib/ClusterDictionary/0/c245a720-9fe3-11ec-975c-200114580202 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:21:56][INFO] ccdb reads http://alice-ccdb.cern.ch/ITS/Config/AlpideParam/1/ad17417b-8f72-11ee-9a08-200114580202 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
...
[27266:internal-dpl-ccdb-backend]: [14:22:05][INFO] ccdb reads http://alice-ccdb.cern.ch/TRD/Calib/NoiseMapMCM/1/e871e7e5-d22d-11ec-8dd8-511cc1ec24ee for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:22:05][INFO] ccdb reads http://alice-ccdb.cern.ch/TRD/Calib/CalGain/1546300800001/2956c9c2-1a72-11ee-b5e6-200114580204 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), 
[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CcdbDownloader finished transfer http://alice-ccdb.cern.ch/TOF/Calib/LHCphase for 1550600800000 (agent_id: alimetal01.cern.ch-1743510114-iscfhX) with http code: 0
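
The `grep` above can be extended into a small summary of what one agent ID did before failing. This is a sketch under stated assumptions: the `[HH:MM:SS][LEVEL]` layout is inferred from the lines quoted in this thread, `summarize` is a hypothetical helper, and the three sample lines are copied from the excerpt above rather than from the full log.

```python
import re

AGENT_ID = "alimetal01.cern.ch-1743510114-iscfhX"

# Matches the "[HH:MM:SS][LEVEL]" part of a log line as quoted in the thread.
LEVEL_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\[([A-Z]+)\]")

# Three lines copied from the excerpt above (not the complete log).
SAMPLE_LINES = [
    "[27266:internal-dpl-ccdb-backend]: [14:21:54][INFO] Init CcdApi with UserAgentID: alimetal01.cern.ch-1743510114-iscfhX, Host: http://alice-ccdb.cern.ch, Curl timeouts: upload:20 download:1",
    "[27266:internal-dpl-ccdb-backend]: [14:22:05][INFO] ccdb reads http://alice-ccdb.cern.ch/TRD/Calib/CalGain/1546300800001/2956c9c2-1a72-11ee-b5e6-200114580204 for 1550600800000 (load to memory, agent_id: alimetal01.cern.ch-1743510114-iscfhX), ",
    "[27266:internal-dpl-ccdb-backend]: [14:22:06][ERROR] CcdbDownloader finished transfer http://alice-ccdb.cern.ch/TOF/Calib/LHCphase for 1550600800000 (agent_id: alimetal01.cern.ch-1743510114-iscfhX) with http code: 0",
]


def summarize(lines, agent_id):
    """Count log levels on lines mentioning agent_id; also return the first ERROR/FATAL time."""
    counts = {}
    first_error = None
    for line in lines:
        if agent_id not in line:
            continue  # same filter as the grep
        match = LEVEL_RE.search(line)
        if match is None:
            continue  # line has no recognisable time/level prefix
        time, level = match.groups()
        counts[level] = counts.get(level, 0) + 1
        if level in ("ERROR", "FATAL") and first_error is None:
            first_error = time
    return counts, first_error


if __name__ == "__main__":
    counts, first_error = summarize(SAMPLE_LINES, AGENT_ID)
    print(counts, first_error)
```

To run it on the real log, `SAMPLE_LINES` could be replaced by the lines of `fullLog.txt`, for example `open("fullLog.txt").read().splitlines()`.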

@aalkin aalkin deleted the fix-comb-gen-concept branch December 11, 2025 09:37
5 participants