When serializing executions of workflows that take directory parameters, CWLProv does not create corresponding directories in the RO bundle: rather, files are always placed in directories whose name consists of the first two characters of the file's sha1 checksum.
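As a rough illustration of this layout, the sketch below computes where a file would land given its contents. The helper name `bundle_path`, the `data` root folder, and the use of the full checksum as the file name are assumptions made for the example; only the two-character directory name comes from the description above.

```python
import hashlib
from pathlib import PurePosixPath


def bundle_path(content: bytes, root: str = "data") -> PurePosixPath:
    """Hypothetical helper: content-addressed location of a file.

    The parent directory is named after the first two characters of the
    file's SHA-1 checksum. The root folder and the file name (full
    checksum) are assumptions for this sketch.
    """
    digest = hashlib.sha1(content).hexdigest()
    return PurePosixPath(root) / digest[:2] / digest


path = bundle_path(b"dummy")
print(path.parent.name, path.name)
```

Since the location depends only on the file's bytes, the original directory structure cannot be read back from these paths.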
When converting from CWLProv we recreate the original directories, giving them a name obtained by concatenating the sorted checksums of all contained files and computing the checksum of the concatenation. This means that directories with the same contents end up being mapped to the same directory in the output RO-Crate. This is especially convenient to avoid data duplication between workflow parameters and tool parameters: for instance, when a directory is an input of the workflow and also of the first step.
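The naming scheme can be sketched as follows. This is a minimal illustration, not the converter's actual code: the helper name `dir_name`, the choice of SHA-1 for the outer checksum, and the plain string concatenation without a separator are assumptions.

```python
import hashlib


def dir_name(file_checksums):
    """Hypothetical helper: derive a directory name from its contents.

    Sorts the checksums of the contained files, concatenates them, and
    returns the SHA-1 hex digest of the concatenation (assumed details).
    """
    joined = "".join(sorted(file_checksums))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


# Two directories holding the same single file get the same name,
# regardless of what they were called on disk.
foo = ["0c8b9d6f753e8d8ec9276bfe98e993a133847642"]
bar = ["0c8b9d6f753e8d8ec9276bfe98e993a133847642"]
print(dir_name(foo) == dir_name(bar))
```

Sorting makes the name independent of the order in which files are listed, so the name is a function of the directory's contents only.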
However, there are cases where this merging is undesirable. For instance, suppose that a workflow takes an array of two directories as input:
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  dir_array: Directory[]
outputs: []
steps:
  date_step:
    label: Prints date of input dirs
    scatter: dir
    in:
      dir: dir_array
    out: []
    run: dirdate.cwl
Where dirdate.cwl is:
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [date, "-r"]
inputs:
  dir:
    type: Directory
    inputBinding:
      position: 1
outputs: []
Suppose the workflow is launched with the following parameters:
dir_array:
  - class: Directory
    location: foo
  - class: Directory
    location: bar
Where foo and bar have the same contents, e.g., they both contain a text file whose content is the string "dummy". What we currently get in the RO-Crate is:
{
    "@id": "packed.cwl#main/dir_array",
    "@type": "FormalParameter",
    "additionalType": "Dataset",
    "multipleValues": "True",
    "name": "dir_array"
},
...
{
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "packed.cwl#main/dir_array"
    },
    "name": "dir_array",
    "value": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        },
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        }
    ]
},
...
{
    "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/",
    "@type": "Dataset",
    "alternateName": "foo",
    "exampleOfWork": {
        "@id": "packed.cwl#dirdate.cwl/dir"
    },
    "hasPart": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/0c8b9d6f753e8d8ec9276bfe98e993a133847642"
        }
    ]
},
Note that the duplicate id in the value of #pv-main/dir_array is a bug: the list should contain only one copy, since the duplicate makes no sense in the RO-Crate JSON-LD. Also, the Dataset has an alternateName of "foo", while "bar" does not appear in the metadata. Thus, in this case, the representation does not reflect the fact that the workflow took a list of two distinct directories as input.
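For the duplicate id alone, a fix could look like the sketch below, which drops repeated references while keeping first-seen order. The function name `unique_refs` is hypothetical and this is not taken from the converter's code. It does not address the second problem: after deduplication the crate still shows a single directory named "foo".

```python
def unique_refs(refs):
    """Hypothetical helper: drop repeated {"@id": ...} references.

    Keeps the first occurrence of each "@id" and preserves order.
    """
    seen = set()
    result = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            result.append(ref)
    return result


value = [
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
]
print(unique_refs(value))
```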