[WIP] refactor: module support #318

Draft
adthrasher wants to merge 5 commits into main from refactor/wdl_modules

@adthrasher
Member

DO NOT MERGE

Demonstration PR for upcoming module support in WDL 1.4.

Before submitting this PR, please make sure:

  • You have added a few sentences describing the PR here.
  • The code passes all CI tests without any errors or warnings.
  • You have added tests (when appropriate).
  • You have added an entry in any relevant CHANGELOGs (when appropriate).
  • If you have made any changes to the scripts/ or docker/ directories, please ensure any image versions have been incremented accordingly!
  • You have updated the README or other documentation to account for these changes (when appropriate).

@adthrasher
Member Author

@a-frantz after some discussion with @claymcleod I mocked up some example modules. I intentionally kept it simple, but I covered both a defined entry point and a default entry point (index.wdl). I'll be interested to see how things work once Sprocket has the more fully featured module support.

@adthrasher
Member Author

Assuming I've understood the module spec properly, I've created a number of examples here.

  • The "per-tool" module: This is what I've done with `fq` and `samtools`. This seems like it will be a huge maintenance burden, as each tool definition gets moved to a folder and has an accompanying `module.json`. So it will essentially double the number of files in the repo.
  • Grouping tools: This is what I did with the new `alignment` subdirectory. This would enable you to do something like `import alignment/bwa` and `import alignment/star` to get the precise aligner.
  • I didn't do this, but we could simply make `tools` a module. Then you'd do something like `import tools/sambamba` to get individual tools.
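
To make the file-count tradeoff between these layouts concrete, here is a rough Python sketch. It assumes each tool lives in a `<tool>.wdl` file and each module carries exactly one `module.json`; the helper names are hypothetical and nothing here is taken from the module spec itself.

```python
from pathlib import PurePosixPath


def per_tool_layout(tools):
    """One directory per tool: <tool>/<tool>.wdl plus <tool>/module.json."""
    paths = []
    for tool in tools:
        paths.append(PurePosixPath(tool) / f"{tool}.wdl")
        paths.append(PurePosixPath(tool) / "module.json")
    return paths


def single_module_layout(tools, module="tools"):
    """One module directory holding every tool plus a single module.json."""
    paths = [PurePosixPath(module) / f"{tool}.wdl" for tool in tools]
    paths.append(PurePosixPath(module) / "module.json")
    return paths


tools = ["fq", "samtools", "sambamba"]
per_tool = per_tool_layout(tools)
grouped = single_module_layout(["bwa", "star"], module="alignment")
single = single_module_layout(tools)

# Per-tool modules need two files per tool; a shared module needs one extra file in total.
print(len(per_tool), len(grouped), len(single))
```

Under these assumptions the per-tool layout always costs `2 * n` files for `n` tools, while the grouped and single-module layouts cost `n + 1` per module.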

For workflows, I think the organization is obvious for something like DNAseq or RNAseq. We'd have a single module with entry points for the FASTQ and BAM. This has the advantage that it also hides the core workflows from end users. I'm less clear on how we should organize the other workflows (e.g. bam-to-fastqs). I have a single example of a standalone module for a workflow.

I also didn't touch the data structures folder. I suspect that should end up as a single module with various sub-paths.

I'm also not sure how the versioning works. Clay's spec says that git-based dependencies are by git tags, so we'd have to rethink how we've been doing releases and follow that specific format (e.g. <module>/<version>).
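
As a sketch of what adopting that tag format might involve, the Python below splits tags of the form `<module>/<version>` and groups versions by module. The version strings are made up for illustration, and splitting on the last `/` (so that a module name could itself contain a slash) is my assumption, not something I've confirmed in the spec.

```python
def parse_module_tag(tag):
    """Split a git tag of the form <module>/<version> into (module, version)."""
    # Assumption: the version is everything after the LAST slash.
    module, sep, version = tag.rpartition("/")
    if not sep or not module or not version:
        raise ValueError(f"not a <module>/<version> tag: {tag!r}")
    return module, version


def group_versions(tags):
    """Map each module name to the versions tagged for it, skipping other tags."""
    grouped = {}
    for tag in tags:
        try:
            module, version = parse_module_tag(tag)
        except ValueError:
            continue  # e.g. an existing repo-wide release tag with no module prefix
        grouped.setdefault(module, []).append(version)
    return grouped


# Hypothetical tags; real tag names would depend on how releases are cut.
example_tags = ["samtools/1.0.0", "samtools/1.1.0", "fq/0.3.0", "v2.0.0"]
versions = group_versions(example_tags)
print(versions)
```

A repo-wide tag such as the hypothetical `v2.0.0` has no module prefix, so it is ignored here, which is the part of the current release process that would need rethinking.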

        "path": "./dnaseq-standard-fastq.wdl"
    },
    "bam": {
        "path": "./dnaseq-standard.wdl"
Member Author

I'm not sure this is right. I think it needs a source object wrapping it.
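
For comparison, here is a small Python sketch of the two shapes in question, using only the `bam` entry visible in the diff context. The `source` key name comes from the comment above; the exact nesting is a guess, and neither shape has been checked against the module spec.

```python
import json

# Shape as written in the diff: the entry point maps directly to a "path".
flat = {"bam": {"path": "./dnaseq-standard.wdl"}}


def wrap_in_source(entry_points):
    """Return a copy where each entry's fields sit under a "source" key (assumed shape)."""
    return {name: {"source": dict(fields)} for name, fields in entry_points.items()}


wrapped = wrap_in_source(flat)
print(json.dumps(wrapped, indent=2))
```

If the spec does call for the wrapper, only the nesting changes; the paths themselves stay the same.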

@claymcleod
Member

I think you've got all of this right. My recommendation is to try the tool model (i.e., the first model) to start out, mainly because it will enable really rich discovery of the modules when we write the registry at OpenWDL (e.g., the metadata about which tools exist in the module, and at what versions and licenses, is going to be easiest to reason about in this mode).

I think the second solution could work as well, especially if the maintenance burden of the first becomes too high.

I would stay away from the third version, as I feel it pulls far too much information into one module.json.

@adthrasher
Member Author

I think I'd be on board with #1 if you could have a flat tools directory (as we do now) with a single module.json. I'd envision that module.json containing, essentially, an array of module declarations. It seems strange to me to have a directory per tool with either a generic index.wdl or a <tool>.wdl and a module.json as the only entries. It feels quite cluttered.
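
A rough Python sketch of what I have in mind: derive one declaration per tool from a flat directory listing and store them all in a single array. The `name` field and the overall shape are hypothetical (the current spec may not allow this at all); `path` mirrors the key already used in the entry point definitions.

```python
import json


def flat_manifest(filenames):
    """Build one declaration per <tool>.wdl file found in a flat tools directory."""
    declarations = []
    for filename in sorted(filenames):
        if not filename.endswith(".wdl"):
            continue  # skip anything that is not a tool definition
        declarations.append({
            "name": filename[: -len(".wdl")],  # hypothetical field
            "path": f"./{filename}",           # same key as in the existing module.json
        })
    return declarations


# Hypothetical directory listing for a flat tools/ folder.
manifest = flat_manifest(["samtools.wdl", "fq.wdl", "README.md"])
print(json.dumps(manifest, indent=2))
```

This keeps the per-tool metadata that the registry would want while adding a single file to the directory rather than one per tool.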
