Handle distinct directories with same content

When serializing executions of workflows that take directory parameters, CWLProv does not create corresponding directories in the RO bundle. Instead, files are always placed in directories whose names consist of the first two characters of the file's sha1 checksum.
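As a rough illustration of that layout, the sketch below computes where a file would land given its content. The `data` top-level folder, the use of the full checksum as the file name, and the function name `cwlprov_data_path` are assumptions made for the example, not taken from the CWLProv code.

```python
import hashlib
from pathlib import PurePosixPath


def cwlprov_data_path(content: bytes) -> PurePosixPath:
    """Return a checksum-based path for a file (hypothetical helper).

    Assumption: the file is stored as data/<first two hex chars>/<full sha1>.
    """
    checksum = hashlib.sha1(content).hexdigest()
    return PurePosixPath("data") / checksum[:2] / checksum


# Two files with identical content map to the same path,
# regardless of which original directory they came from.
path_a = cwlprov_data_path(b"dummy")
path_b = cwlprov_data_path(b"dummy")
print(path_a == path_b)
```

Under this scheme the original directory name plays no role in the stored path, which is why the directory structure has to be rebuilt during conversion.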

When converting from CWLProv we recreate the original directories, giving them a name obtained by concatenating the sorted checksums of all contained files and [computing the checksum of the concatenation](https://github.com/ResearchObject/runcrate/blob/a728698cb62110064ae809c46b863d9089634ccf/src/runcrate/convert.py#L227). This means that directories with the same contents end up being mapped to the _same_ directory in the output RO-Crate. This is especially convenient to avoid data duplication between workflow parameters and tool parameters: for instance, when a directory is an input of the workflow and also of the first step.
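The naming scheme described above can be sketched roughly as follows. This is a simplified illustration, not the code in `convert.py`: the function name `directory_name`, the choice of sha1 for the outer checksum, and the plain string concatenation without a separator are assumptions.

```python
import hashlib


def directory_name(file_checksums):
    """Derive a directory name from the checksums of its files.

    Assumption: sorted hex checksums are joined with no separator
    and hashed again with sha1.
    """
    joined = "".join(sorted(file_checksums))
    return hashlib.sha1(joined.encode()).hexdigest()


# Placeholder checksums standing in for the files of two directories.
checksum_1 = hashlib.sha1(b"first file").hexdigest()
checksum_2 = hashlib.sha1(b"second file").hexdigest()

dir_a = [checksum_1, checksum_2]
dir_b = [checksum_2, checksum_1]  # same files, listed in a different order

# Same contents give the same name, whatever the directories were called.
print(directory_name(dir_a) == directory_name(dir_b))
```

Because only file checksums feed into the name, the original directory name and location have no influence on the result.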

However, there are cases where merging directories in this way is undesirable. For instance, suppose that a workflow takes an array of two directories as input:

```yaml
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}

inputs:
  dir_array: Directory[]
outputs: []

steps:
  date_step:
    label: Prints date of input dirs
    scatter: dir
    in:
      dir: dir_array
    out: []
    run: dirdate.cwl
```

Where `dirdate.cwl` is:

```yaml
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [date, "-r"]

inputs:
  dir:
    type: Directory
    inputBinding:
      position: 1
outputs: []
```

Suppose the workflow is launched with the following parameters:

```yaml
dir_array:
  - class: Directory
    location: foo
  - class: Directory
    location: bar
```

Where `foo` and `bar` have the same contents, e.g., they both contain a text file whose content is the string "dummy". What we currently get in the RO-Crate is:

```json
{
    "@id": "packed.cwl#main/dir_array",
    "@type": "FormalParameter",
    "additionalType": "Dataset",
    "multipleValues": "True",
    "name": "dir_array"
},
...
{
    "@id": "#pv-main/dir_array",
    "@type": "PropertyValue",
    "exampleOfWork": {
        "@id": "packed.cwl#main/dir_array"
    },
    "name": "dir_array",
    "value": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        },
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"
        }
    ]
},
...
{
    "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/",
    "@type": "Dataset",
    "alternateName": "foo",
    "exampleOfWork": {
        "@id": "packed.cwl#dirdate.cwl/dir"
    },
    "hasPart": [
        {
            "@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/0c8b9d6f753e8d8ec9276bfe98e993a133847642"
        }
    ]
},
```

Note that the duplicate id in the `value` of `#pv-main/dir_array` is a bug: the list should contain only one copy, since the duplicate makes no sense in the RO-Crate JSON-LD. Also, the `Dataset` has an `alternateName` of "foo", while "bar" does not appear in the metadata. Thus, in this case, the representation does not reflect the fact that the workflow took a list of two distinct directories as input.
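The first point, dropping the repeated reference from the `value` list, could be handled with something like the sketch below. The helper name `unique_refs` is hypothetical and not part of runcrate. This only removes the duplicate entry; it does not address the loss of the distinction between `foo` and `bar`.

```python
def unique_refs(refs):
    """Return references with duplicate "@id" values removed, keeping order.

    Hypothetical helper: each item is a dict of the form {"@id": "..."}.
    """
    seen = set()
    result = []
    for ref in refs:
        if ref["@id"] not in seen:
            seen.add(ref["@id"])
            result.append(ref)
    return result


value = [
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
    {"@id": "df3cc24afc943eab58469eebaff500a2a4a823c5/"},
]
print(unique_refs(value))
```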
