Skip to content

Cancelled restore is not really cancelled #7938

@ZNikke

Description

@ZNikke

We see cancelled stage requests coinciding with logged messages such as:

Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637

We believe the cause of this is the dCache hsm code not handling cancellation properly for files that are cancelled before being staged (typically a stage timeout when staging a huge amount of files).

It is unclear to us whether this is a long-standing issue being more exposed due to WLCG/ATLAS staging more files from tape, or of this is due to regressions in dCache.

Details

Within NDGF we are using ENDIT to integrate with our tape systems, lately we have done some work to enhance the dCache ENDIT provider by doing space allocation as files are staged instead of at queue time.

dCache version: 10.2.13

ENDIT provider version: 2.0.0-alpha.14 latealloc branch, https://github.com/neicnordic/dcache-endit-provider/tree/2.0.0-alpha.14

Reproducing

It's easy to reproduce, just queue a stage request and then cancel it, this is an example where we also enable DEBUG logging:

[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > log set stdout DEBUG
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh restore 00003B736F6968DE43E282BDA48EC348E637
Fetch request queued.
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh ls
dc02514a-0b02-428d-94e2-d52fcb6f3113 ACTIVE Thu Nov 06 13:30:58 CET 2025 Thu Nov 06 13:30:58 CET 2025 00003B736F6968DE43E282BDA48EC348E637 test:tape
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh kill 00003B736F6968DE43E282BDA48EC348E637
Kill initialized
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > rh ls
[piggy] (ops_hpc2n_umu_se_r02@marengo_hpc2n_umu_se_2Domain) admin > log set stdout WARN

This results in the following log:

06 Nov 2025 13:22:04 (ops_hpc2n_umu_se_r02) [] Pool mode changed to enabled
06 Nov 2025 13:30:38 (ops_hpc2n_umu_se_r02) [lu8:bw admin] rep rm: removing 00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Creating new entry for 00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Stage request created for 00003B736F6968DE43E282BDA48EC348E637.
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Activating stage of 00003B736F6968DE43E282BDA48EC348E637.
06 Nov 2025 13:30:58 (ops_hpc2n_umu_se_r02) [] StageTask start: wrote /grid/pool/request/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] PollingEnditNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] StageTask abort: return true
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] PollingEnditNearlineStorage cancel: return true
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage add: Calling request.failed(e)
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Stage of 00003B736F6968DE43E282BDA48EC348E637 failed with CacheException(rc=10006;msg=Stage was cancelled.).
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] remove entry 00003B736F6968DE43E282BDA48EC348E637: Transfer failed and replica is FROM_STORE
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] ListeningNearlineStorage cancel: called
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] Failed to read file size: java.nio.file.NoSuchFileException: /grid/pool/ops_hpc2n_umu_se_r02/data/00003B736F6968DE43E282BDA48EC348E637
06 Nov 2025 13:31:03 (ops_hpc2n_umu_se_r02) [lu8:bw admin] cleaning up 00003B736F6968DE43E282BDA48EC348E637

From the DEBUG output we see that the ENDIT provider reports the stage as failed.

However, note the java.nio.file.NoSuchFileException errors produced by the dCache code. This suggests that despite the request being cancelled/failed something tries to treat it as successful anyway...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions