
Conversation

@erflynn (Collaborator) commented Dec 1, 2023

This PR addresses several feature requests and modifications to the run script and configuration for the single cell pipeline.
Specifically:

  • Fix run_resume.sh script (#47): the script simply had not been updated to match run.sh.
  • Deletes working directory if OOT error (#41): this was a slightly tricky bug. If the parent run.sh script is cancelled by the external system rather than failing from an internal error, its exit code is 0 even though subprocesses failed. To fix this, I moved the working directory deletion from the run script into the individual steps using nextflow's workflow.onComplete handler (see the sketch after this list).
  • Get nextflow to submit to specified partition (#50): @dtm2451 solved this with unset SBATCH_PARTITION, and I implemented it by adding that to the run scripts and the config. I set the default to freecycle for test runs, and to freecycle,krummellab,common for standard runs.
  • Check if we can set a limit to resources consumed by nextflow (#43): I set the maximum number of jobs to 20. I could not find an option to cap the total number of CPUs for SLURM (this can be set for the local executor). We may want to lower this number.
  • I have also changed the default errorStrategy from "terminate" to "finish" (so that already-running steps aren't interrupted if something fails), and added comments on how to switch to another strategy (e.g. ignore, retry) if desired.
  • EDIT 1/16/2024: this also fixes cellranger being very slow on /c4/scratch/ by switching to /scratch/ (runs now take hours instead of days), but future work should implement this more cleanly.
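
A minimal sketch of the workflow.onComplete approach described in the second bullet, under the assumption that cleanup should only happen when nextflow itself reports success (the actual handler in the PR may differ in its details):

```groovy
// Clean up based on nextflow's own completion status rather than the exit code
// of the parent run.sh, which can be 0 even when the run was cancelled externally.
workflow.onComplete {
    if (workflow.success) {
        println "Run succeeded; removing work directory: ${workflow.workDir}"
        workflow.workDir.deleteDir()
    }
    else {
        println "Run failed or was cancelled; keeping ${workflow.workDir} for debugging"
    }
}
```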

… run script to address problems with incorrect cleanup after external cancellation. also moved test queue settings to compute.config
@erflynn (Collaborator, Author) commented Dec 1, 2023

A couple specific questions for discussion:
@AlaaALatif - do you want me to update partition and working directory for the bulk pipeline as well? Happy to do so, can do this easily and push a commit.

@dtm2451 @amadeovezz @AlaaALatif - what are your thoughts on having the default partition be: freecycle,krummellab,common, default number of max jobs to be 60, and errorStrategy to 'finish'?
(note: errorStrategy will be changed when we switch to dynamic retries like for bulk, this is temporary)

@dtm2451 (Collaborator) commented Dec 5, 2023

> what are your thoughts on having the default partition be: freecycle,krummellab,common, default number of max jobs to be 60, and errorStrategy to 'finish' (note: errorStrategy will be changed when we switch to dynamic retries like for bulk, this is temporary)

  • partition bit: sounds great to me generally, but:
    • Just a note: I wonder what nextflow does if a job gets cancelled by the scheduler? We might want to build some sort of catch and retry in if possible... But since we've rarely, if ever, seen freecycle jobs actually get cancelled from primary lab prioritization, probably fine to not build this until needed
    • Is there a way to have this adjust automatically when users don't have access to the krummellab partition? If not, we should document what to do if you see an sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified error.
  • default number of jobs as 60 seems pretty high to me? 60 cellranger jobs would take A LOT of resources.
  • errorStrategy bit -- 👍, idk and happy defaulting to your suggestion there!

Of note, it can be useful to set these values sensibly, and consistently, for all runs by adjusting this file. This is especially valuable with tissue data, where comparing across libraries and batches with consistent cutoff values is a way of assessing relative quality across the entire project.

### c4 partitions, additional configuration
We have the following partitions available on the c4 cluster for DSCoLab members: krummellab,common (the default partitions, set in the environment variable `$SBATCH_PARTITION`) and freecycle. Freecycle allows us to run jobs on any compute that is not in use (24h job max), with the caveat that a job may be cancelled if a partition owner requires those resources. We have set the default for `test` jobs to freecycle, and for standard jobs to `freecycle,krummellab,common`, which means they are submitted first to freecycle, then krummellab, then common. You may want to adjust this and other cluster options, such as the errorStrategy (what happens to the rest of the pipeline when a step errors). See the comments in `nextflow.config` describing what we have set as defaults, and what you may want to adjust.
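
As a rough illustration (not the exact contents of our config; the structure and profile names here are illustrative), the partition defaults described above might look like this:

```groovy
// Sketch of the SLURM queue defaults described above. The run scripts also call
// `unset SBATCH_PARTITION` so that this setting, rather than the environment
// default, is what SLURM sees.
process {
    executor = 'slurm'
    queue    = 'freecycle,krummellab,common'   // standard runs: try freecycle first
}

profiles {
    test {
        process.queue = 'freecycle'            // test runs stay on freecycle only
    }
}
```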
Collaborator:

After "when it errors)." and before "See comments in nextflow.config", perhaps add:

If you do not have access to the krummellab partition on c4, you will need to change the process.queue parameter in your nextflow.config file.

Collaborator Author:

Thankfully, it just defaults to whichever ones you have access to. So if I submit to costellolab,freecycle, it just goes to freecycle. Similarly if the resources are too high for freecycle (e.g. cellranger requests > 24h), it goes to common,krummellab

Collaborator Author:

Added the note to the README too, thanks!

@erflynn (Collaborator, Author) commented Dec 5, 2023

> Just a note: I wonder what nextflow does if a job gets cancelled by the scheduler? We might want to build some sort of catch and retry in if possible... But since we've rarely, if ever, seen freecycle jobs actually get cancelled from primary lab prioritization, probably fine to not build this until needed

I think we should definitely do this! There is a way to retry on specific exit codes, e.g.:
`errorStrategy { task.exitStatus == 140 ? 'retry' : 'terminate' }`
Let me see if I can find out what the exit code is when jobs are cancelled by freecycle, and then I can add this... Let me know if you have examples.
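
If we go this route, the config change might look something like the sketch below. Exit code 140 is from the example above; treating 143 (SIGTERM) as a scheduler-cancellation code is an assumption we would need to verify on freecycle:

```groovy
process {
    // Retry tasks whose exit codes look like scheduler timeouts/cancellations
    // (140, and possibly 143 for SIGTERM); otherwise let running steps finish,
    // matching the current default errorStrategy.
    errorStrategy = { task.exitStatus in [140, 143] ? 'retry' : 'finish' }
    maxRetries    = 2
}
```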

Collaborator:

> Thankfully, it just defaults to whichever ones you have access to. So if I submit to costellolab,freecycle, it just goes to freecycle. Similarly if the resources are too high for freecycle (e.g. cellranger requests > 24h), it goes to common,krummellab

Oh nice! Then... maybe the note I had you add isn't needed. Or at least as is. Cuz it works out fine if they don't change it!

Collaborator Author:

> default number of jobs as 60 seems pretty high to me? 60 cellranger jobs would take A LOT of resources.

Noted - lowering to 20 for the single cell pipeline. I wish there were a way to set a max number of cpus (grrr) because that's more what we want. I'll ask on the nextflow slack.
We should all discuss with @AlaaALatif (60 is what is set for bulk), and also bring up at a group meeting what folks want for this and other defaults.
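
For reference, the 20-job cap itself is just the executor queue size; a sketch of what it might look like in the config (exact placement in our files may differ):

```groovy
// Cap how many jobs nextflow will have queued or running at once.
// Note this limits concurrency, not total CPUs -- there is no built-in
// per-pipeline CPU cap for the SLURM executor.
executor {
    name      = 'slurm'
    queueSize = 20
}
```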

Collaborator Author:

> Oh nice! Then... maybe the note I had you add isn't needed. Or at least as is. Cuz it works out fine if they don't change it!

I think it's good to note that they might want to change it, happy to keep this :)

Collaborator:

looks good as you rephrased, other than likely typo in "you will may to change"

Added suggestions from @dtm2451
Now contains the "[ ]" lists introduced by Amadeo's update
Also mentions the data types must be one of "CITE" or "GEX" (not both) and optionally any other modality, and provides examples
@erflynn (Collaborator, Author) commented Dec 6, 2023

@amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

@amadeovezz (Collaborator) commented:

> @amadeovezz when you're reviewing, can you also double check the structure described in the updated example-inputs/README.md matches your updated input config structure?

Yep they match!

Minor note: fmx_assign_to_gt and ref_vcf_dir aren't in there yet, but your PR is close to merging so they should be there soon

@erflynn (Collaborator, Author) commented Dec 11, 2023

Helpful feedback from nextflow slack on how to set max.cpus, max.memory w/ SLURM:

> If you're looking to cap / threshold all requests from a pipeline, that's more difficult. You can do it in nf-core pipelines using --max_cpus and --max_memory etc (docs) but that's specific to nf-core and not a general Nextflow feature (yet, there's an issue to add it as core functionality)

nextflow-io/nextflow#640
https://github.com/nf-core/tools/blob/99961bedab1518f592668727a4d692c4ddf3c336/nf_core/pipeline-template/nextflow.config#L206-L237
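
If we adopt the nf-core approach quoted above, a stripped-down sketch of its check_max pattern would look roughly like this (parameter names follow nf-core; the default values shown are placeholders, not decisions):

```groovy
// Simplified from the nf-core pipeline template: clamp requested resources
// to pipeline-wide maximums supplied as params.
params.max_cpus   = 16        // placeholder defaults
params.max_memory = '128.GB'
params.max_time   = '72.h'

def check_max(obj, type) {
    if (type == 'cpus') {
        return Math.min(obj as int, params.max_cpus as int)
    }
    if (type == 'memory') {
        def max = params.max_memory as nextflow.util.MemoryUnit
        return obj.compareTo(max) > 0 ? max : obj
    }
    if (type == 'time') {
        def max = params.max_time as nextflow.util.Duration
        return obj.compareTo(max) > 0 ? max : obj
    }
    return obj
}

process {
    // Example usage: each process's resource request is passed through check_max.
    cpus   = { check_max(8, 'cpus') }
    memory = { check_max(64.GB * task.attempt, 'memory') }
    time   = { check_max(24.h * task.attempt, 'time') }
}
```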

@erflynn (Collaborator, Author) commented Dec 11, 2023

Lingering to-dos:

  • add max cpus, memory based on nf-core pipelines
  • add fmx-assign-to-gt + ref-vcf-dir to example-inputs/README.md as well as example params
  • check with @AlaaALatif about which features to add to bulk_pipeline (e.g. fix for OOT error, resume script, maxes?) and add these
  • find out if jobs actually get cancelled on freecycle, if yes --> then update to have retries for this
  • look into scratch nextflow option, assess if it improves runtime

@erflynn (Collaborator, Author) commented Jan 16, 2024

Weird error that I have now patched -- @dtm2451 and I observed that cellranger is much slower (days instead of hours) running from the nextflow pipeline. It appears to be the difference between writing to /scratch/<user>/<cr_job_id>/ (what we typically use) and /c4/scratch/<user>/<parent_nf_job>/ (nextflow working directory for sharing across nodes), where /scratch/ is much faster.

I have now changed this, but tagging @amadeovezz @AlaaALatif -- if you have slow steps they should write to /scratch/ instead of /c4/scratch/ and then move the data to /c4/scratch/ after.

We probably want to switch to doing this for all steps that take more than a few minutes? I think this nextflow directive should do it, but I need to look into it more: https://www.nextflow.io/docs/latest/process.html#scratch
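
A minimal sketch of what I mean, assuming the scratch directive behaves as the docs describe (the path is illustrative, and whether outputs get staged back to /c4/scratch/ correctly is exactly what needs checking):

```groovy
process {
    // Run each task in fast node-local scratch instead of the shared /c4/scratch
    // work directory; nextflow stages task outputs back to the work directory
    // when the task completes.
    scratch = '/scratch/$USER'   // single-quoted so $USER resolves on the compute node
}
```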
