
Conversation

@erflynn (Collaborator) commented Dec 1, 2023

This PR addresses several feature requests and modifications to the run script and configuration for the single cell pipeline.
Specifically:

  • Fix run_resume.sh script (#47): the script simply had not been updated to match run.sh.
  • Deletes working directory if OOT error (#41): this was a slightly tricky bug. If the parent run.sh script is cancelled by the external system rather than failing from an internal error, its exit code is 0 even though subprocesses failed. To fix this, I moved the working directory deletion from the run script into the individual steps using nextflow's workflow.onComplete handler (see the sketch after this list).
  • Get nextflow to submit to specified partition (#50): @dtm2451 solved this with unset SBATCH_PARTITION, and I implemented it by adding that to the run scripts and the config. I set the default to freecycle for test runs, and to freecycle,krummellab,common for standard runs.
  • Check if we can set a limit to resources consumed by nextflow (#43): I set the maximum number of jobs to 20. I could not find an option to cap the total number of CPUs for SLURM (this can be set for the local executor). We may want to lower this number.
  • I have also changed the default errorStrategy from "terminate" to "finish" (so that already-running steps aren't interrupted if something fails), and added comments on how to switch to another strategy (e.g. ignore, retry) if desired.
  • EDIT 1/16/2024: this also fixes cellranger being very slow on /c4/scratch/ by switching to /scratch/ (runs now take hours instead of days), but future work should implement this more cleanly.
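
A minimal sketch of the workflow.onComplete approach described in the second bullet, under the assumption that cleanup should only happen when nextflow itself reports success (the actual handler in the PR may differ in its details):

```groovy
// Clean up based on nextflow's own completion status rather than the exit code
// of the parent run.sh, which can be 0 even when the run was cancelled externally.
workflow.onComplete {
    if (workflow.success) {
        println "Run succeeded; removing work directory: ${workflow.workDir}"
        workflow.workDir.deleteDir()
    }
    else {
        println "Run failed or was cancelled; keeping ${workflow.workDir} for debugging"
    }
}
```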

… run script to address problems with incorrect cleanup after external cancellation. also moved test queue settings to compute.config
@erflynn (Collaborator, Author) commented Dec 1, 2023

A couple specific questions for discussion:
@AlaaALatif - do you want me to update partition and working directory for the bulk pipeline as well? Happy to do so, can do this easily and push a commit.

@dtm2451 @amadeovezz @AlaaALatif - what are your thoughts on having the default partition be: freecycle,krummellab,common, default number of max jobs to be 60, and errorStrategy to 'finish'?
(note: errorStrategy will be changed when we switch to dynamic retries like for bulk, this is temporary)

@dtm2451 (Collaborator) commented Dec 5, 2023

> what are your thoughts on having the default partition be: freecycle,krummellab,common, default number of max jobs to be 60, and errorStrategy to 'finish' (note: errorStrategy will be changed when we switch to dynamic retries like for bulk, this is temporary)

  • partition bit: sounds great to me generally, but:
    • Just a note: I wonder what nextflow does if a job gets cancelled by the scheduler? We might want to build some sort of catch and retry in if possible... But since we've rarely, if ever, seen freecycle jobs actually get cancelled from primary lab prioritization, probably fine to not build this until needed
    • Is there a way to have this adjust automatically when users don't have access to the krummellab partition? If not, we should document what to do if you see an sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified error.
  • default number of jobs as 60 seems pretty high to me? 60 cellranger jobs would take A LOT of resources.
  • errorStrategy bit -- 👍, idk and happy defaulting to your suggestion there!

Of note, it can be useful to set these values sensibly, and consistently, for all runs by adjusting this file. This is especially valuable with tissue data, where comparing across libraries and batches with consistent cutoff values is a way of assessing relative quality across the entire project.

### c4 partitions, additional configuration
We have the following partitions available on the c4 cluster for DSCoLab members: krummellab,common (the default partitions, set in the environment variable `$SBATCH_PARTITION`) and freecycle. Freecycle allows us to run jobs on any compute that is not in use (24h job max), with the caveat that a job may be cancelled if a partition owner requires those resources. We have set the default for `test` jobs to freecycle, and for standard jobs to `freecycle,krummellab,common`, which means they are submitted first to freecycle, then krummellab, then common. You may want to adjust this and other cluster options, such as the errorStrategy (what happens to the rest of the pipeline when a step errors). See the comments in `nextflow.config` describing what we have set as defaults, and what you may want to adjust.
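
As a rough illustration (not the exact contents of our config; the structure and profile names here are illustrative), the partition defaults described above might look like this:

```groovy
// Sketch of the SLURM queue defaults described above. The run scripts also call
// `unset SBATCH_PARTITION` so that this setting, rather than the environment
// default, is what SLURM sees.
process {
    executor = 'slurm'
    queue    = 'freecycle,krummellab,common'   // standard runs: try freecycle first
}

profiles {
    test {
        process.queue = 'freecycle'            // test runs stay on freecycle only
    }
}
```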
Collaborator:

After "when it errors)." and before "See comments in nextflow.config", perhaps add:

If you do not have access to the krummellab partition on c4, you will need to change the process.queue parameter in your nextflow.config file.

Collaborator Author:

Thankfully, it just defaults to whichever ones you have access to. So if I submit to costellolab,freecycle, it just goes to freecycle. Similarly if the resources are too high for freecycle (e.g. cellranger requests > 24h), it goes to common,krummellab

Collaborator Author:

Added the note to the README too, thanks!

@erflynn (Collaborator, Author) commented Dec 5, 2023

> Just a note: I wonder what nextflow does if a job gets cancelled by the scheduler? We might want to build some sort of catch and retry in if possible... But since we've rarely, if ever, seen freecycle jobs actually get cancelled from primary lab prioritization, probably fine to not build this until needed

I think we should definitely do this! There is a way to retry on specific exit codes, e.g.:
`errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }`
Let me see if I can find out what the exit code is when jobs are cancelled by freecycle, and then I can add this... Let me know if you have examples.
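
If we go this route, the config change might look something like the sketch below. Exit code 140 is from the example above; treating 143 (SIGTERM) as a scheduler-cancellation code is an assumption we would need to verify on freecycle:

```groovy
process {
    // Retry tasks whose exit codes look like scheduler timeouts/cancellations
    // (140, and possibly 143 for SIGTERM); otherwise let running steps finish,
    // matching the current default errorStrategy.
    errorStrategy = { task.exitStatus in [140, 143] ? 'retry' : 'finish' }
    maxRetries    = 2
}
```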

Collaborator:

> Thankfully, it just defaults to whichever ones you have access to. So if I submit to costellolab,freecycle, it just goes to freecycle. Similarly if the resources are too high for freecycle (e.g. cellranger requests > 24h), it goes to common,krummellab

Oh nice! Then... maybe the note I had you add isn't needed. Or at least as is. Cuz it works out fine if they don't change it!

Collaborator Author:

> default number of jobs as 60 seems pretty high to me? 60 cellranger jobs would take A LOT of resources.

Noted - lowering to 20 for the single cell pipeline. I wish there were a way to set a max number of cpus (grrr) because that's more what we want. I'll ask on the nextflow slack.
We should all discuss with @AlaaALatif (60 is what is set for bulk), and also bring up at a group meeting what folks want for this and other defaults.
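
For reference, the 20-job cap itself is just the executor queue size; a sketch of what it might look like in the config (exact placement in our files may differ):

```groovy
// Cap how many jobs nextflow will have queued or running at once.
// Note this limits concurrency, not total CPUs -- there is no built-in
// per-pipeline CPU cap for the SLURM executor.
executor {
    name      = 'slurm'
    queueSize = 20
}
```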

Collaborator Author:

> Oh nice! Then... maybe the note I had you add isn't needed. Or at least as is. Cuz it works out fine if they don't change it!

I think it's good to note that they might want to change it, happy to keep this :)

Collaborator:

looks good as you rephrased, other than likely typo in "you will may to change"

Added suggestions from @dtm2451
Now contains the "[ ]" lists introduced by Amadeo's update
Also mentions the data types must be one of "CITE" or "GEX" (not both) and optionally any other modality, and provides examples
@erflynn (Collaborator, Author) commented Dec 6, 2023

@amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

@amadeovezz (Collaborator) commented:

> @amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

Yep they match!

Minor note: fmx_assign_to_gt and ref_vcf_dir aren't in there yet, but your PR is close to merging so they should be there soon

@erflynn (Collaborator, Author) commented Dec 11, 2023

Helpful feedback from nextflow slack on how to set max.cpus, max.memory w/ SLURM:

> If you're looking to cap / threshold all requests from a pipeline, that's more difficult. You can do it in nf-core pipelines using --max_cpus and --max_memory etc (docs) but that's specific to nf-core and not a general Nextflow feature (yet, there's an issue to add it as core functionality)

nextflow-io/nextflow#640
https://github.com/nf-core/tools/blob/99961bedab1518f592668727a4d692c4ddf3c336/nf_core/pipeline-template/nextflow.config#L206-L237
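
If we adopt the nf-core approach quoted above, a stripped-down sketch of its check_max pattern would look roughly like this (parameter names follow nf-core; the default values shown are placeholders, not decisions):

```groovy
// Simplified from the nf-core pipeline template: clamp requested resources
// to pipeline-wide maximums supplied as params.
params.max_cpus   = 16        // placeholder defaults
params.max_memory = '128.GB'
params.max_time   = '72.h'

def check_max(obj, type) {
    if (type == 'cpus') {
        return Math.min(obj as int, params.max_cpus as int)
    }
    if (type == 'memory') {
        def max = params.max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) > 0 ? max : obj
    }
    if (type == 'time') {
        def max = params.max_time as nextflow.util.Duration
        return obj.compareTo(max) > 0 ? max : obj
    }
    return obj
}

process {
    // Example usage: each process's resource request is passed through check_max.
    cpus   = { check_max(8, 'cpus') }
    memory = { check_max(64.GB * task.attempt, 'memory') }
    time   = { check_max(24.h * task.attempt, 'time') }
}
```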

@erflynn (Collaborator, Author) commented Dec 11, 2023

Lingering to-dos:

  • add max cpus, memory based on nf-core pipelines
  • add fmx-assign-to-gt + ref-vcf-dir to example-inputs/README.md as well as example params
  • check with @AlaaALatif about which features to add to bulk_pipeline (e.g. fix for OOT error, resume script, maxes?) and add these
  • find out if jobs actually get cancelled on freecycle, if yes --> then update to have retries for this
  • look into scratch nextflow option, assess if it improves runtime

@erflynn (Collaborator, Author) commented Jan 16, 2024

Weird error that I have now patched -- @dtm2451 and I observed that cellranger is much slower (days instead of hours) running from the nextflow pipeline. It appears to be the difference between writing to /scratch/<user>/<cr_job_id>/ (what we typically use) and /c4/scratch/<user>/<parent_nf_job>/ (nextflow working directory for sharing across nodes), where /scratch/ is much faster.

I have now changed this, but tagging @amadeovezz @AlaaALatif -- if you have slow steps they should write to /scratch/ instead of /c4/scratch/ and then move the data to /c4/scratch/ after.

We probably want to switch to doing this for all steps that take more than a few minutes? I think this nextflow directive should do it, but I need to look into it more: https://www.nextflow.io/docs/latest/process.html#scratch
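
A minimal sketch of what I mean, assuming the scratch directive behaves as the docs describe (the path is illustrative, and whether outputs get staged back to /c4/scratch/ correctly is exactly what needs checking):

```groovy
process {
    // Run each task in fast node-local scratch instead of the shared /c4/scratch
    // work directory; nextflow stages task outputs back to the work directory
    // when the task completes.
    scratch = '/scratch/$USER'   // single-quoted so $USER resolves on the compute node
}
```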
