-
Notifications
You must be signed in to change notification settings - Fork 143
Description
We see cancelled stage requests coinciding with logged messages such as:
Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
We believe the cause of this is the dCache hsm code not handling cancellation properly for files that are cancelled before being staged (typically a stage timeout when staging a huge amount of files).
It is unclear to us whether this is a long-standing issue being more exposed due to WLCG/ATLAS staging more files from tape, or of this is due to regressions in dCache.
Details
Within NDGF we are using ENDIT to integrate with our tape systems, lately we have done some work to enhance the dCache ENDIT provider by doing space allocation as files are staged instead of at queue time.
dCache version: 10.2.13
ENDIT provider version: 2.0.0-alpha.14 latealloc branch, https://github.com/neicnordic/dcache-endit-provider/tree/2.0.0-alpha.14
Reproducing
It's easy to reproduce, just queue a stage request and then cancel it, this is an example where we also enable DEBUG logging:
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > log set stdout DEBUG
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh restore 00003B736F6968DE43E282BDA48EC348E637
Fetch request queued.
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh ls
dc02514a-0b02-428d-94e2-d52fcb6f3113 ACTIVE Thu Nov 06 13:30:58 CET 2025 Thu Nov 06 13:30:58 CET 2025 00003B736F6968DE43E282BDA48EC348E637 test:tape
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh kill 00003B736F6968DE43E282BDA48EC348E637
Kill initialized
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh ls
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > log set stdout WARN
This results in the following log:
06 Nov 2025 13:22:04 (ops_hpc2n_umu_se_r02) [] Pool mode changed to enabled
06 Nov 2025 13:30:38 (ops_hpc2n_umu_se_r02) [lu8:bw admin] rep rm: removing 00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Creating new entry for 00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Stage request created for 00003B736F6968DE43E282BDA48EC348E637.
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Activating stage of 00003B736F6968DE43E282BDA48EC348E637.
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [] StageTask start: wrote /grid/pool/request/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] PollingEnditNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] StageTask abort: return true
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] PollingEnditNearlineStorage cancel: return true
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage add: Calling request.failed(e)
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Stage of 00003B736F6968DE43E282BDA48EC348E637 failed with CacheException(rc=10006;msg=Stage was cancelled.).
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] remove entry 00003B736F6968DE43E282BDA48EC348E637: Transfer failed and replica is FROM_STORE
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] cleaning up 00003B736F6968DE43E282BDA48EC348E637
From the DEBUG output we see that the ENDIT provider reports the stage as failed.
However, note the java.nio.file.NoSuchFileException errors produced by the dCache code. This suggests that despite the request being cancelled/failed something tries to treat it as successful anyway...