Setup run hpc #119
Conversation
@ghar1821 I had a look at the new files in scripts/. I do have one question though: is the idea to keep those files in this remote repo? I have the feeling that, as those scripts work only for WEHI servers, they might create confusion. My suggestion would be to either add a disclaimer in the README (e.g.: "Files in scripts/* outline how to run the benchmark on an HPC cluster", so people can adapt them to their needs) or to move that code to a separate location (e.g. another repo?).
Is there a reason why only these two methods were adapted?
| "rpca_to_goal" | ||
| "rpca_to_mid" | ||
| "no_integration" | ||
| # "perfect_integration" |
perfect_integration is commented out?
```sh
# repeat for metrics
# do one for metrics as well later.
# emd has been used
METRICS=(
```
is there a reason why some metrics are commented out?
```
* Tune the resource requirement for each method (PR #119).
  * Low time, mem, cpu for control methods.
  * Mid time, mem, cpu for most methods, except below.
  * High (or very high) time, mem, cpu for computationally ones like rPCA.
```
computationally expensive maybe(?)
```r
  # Environment variable is set, use it
  print(paste0("Using HPC temp dir from env: ", tmp_dir))
} else {
  # Environment variable not set, use meta
```
So, this should work when running the code outside a cluster?
Yes, it should, because the HPC_VIASH_META_TEMP_DIR env variable won't be set, so tmp_dir will be an empty string and we fall into the else branch. But I have no means of testing it atm because I can't run anything on AWS :(
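For context, a minimal sketch of the full fallback, assuming the variable names from the diff above (`meta` is the list Viash injects into the script):

```r
# Prefer the HPC-specific temp dir if the env variable is set; otherwise
# fall back to the temp dir Viash provides. Sys.getenv returns "" when unset.
tmp_dir <- Sys.getenv("HPC_VIASH_META_TEMP_DIR")
if (nzchar(tmp_dir)) {
  # Environment variable is set, use it
  print(paste0("Using HPC temp dir from env: ", tmp_dir))
} else {
  # Environment variable not set, use meta
  tmp_dir <- meta[["temp_dir"]]
  print(paste0("Using Viash meta temp dir: ", tmp_dir))
}
```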
```r
## VIASH END

cat("Reticulate Python config:\n")
print(reticulate::py_config())
```
Isn't CytoNorm native R? If I am not mistaken, we aren't making use of reticulate.
```r
## VIASH START
par <- list(
  input = "resources_test/debug/batchadjust/_viash_par/input_1/censored_split1.h5ad",
  input = "/Users/putri.g/Documents/cytobenchmark/dataset/lille_spectral_flow_cytometry/censored_split1.h5ad",
```
warning: local path
Would it be possible to update the config.vsh.yaml by including the different cases and the SD == 0 problem workaround in the description field?
Describe your changes
This PR introduces a new configuration setup to support running benchmarks on the WEHI HPC environment, along with some implementation changes.
Config to run on WEHI HPC
The HPC's scratch system is prone to task collisions when multiple jobs access the same shared cache/temp folders. It is not smart enough to keep each job isolated (or maybe it is a Nextflow problem, I am not too sure). To resolve this, I introduced env variables (on top of the few that are already mandatory) such as `HPC_VIASH_META_TEMP_DIR`, `NUMBA_CACHE_DIR`, and `APPTAINER_TMPDIR`. These act as "parent" directories: each task creates a subdirectory within them, named after its Task ID, and uses that. This prevents methods that write out temp files (e.g., CytoNorm, BatchAdjust) from overwriting each other's temp files; a minimal sketch of the idea follows.
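The sketch below assumes the Task ID comes from `SLURM_JOB_ID`; the actual scripts may obtain it differently:

```r
# Create a task-specific temp dir under the shared parent directory so
# concurrent jobs never touch each other's temp files.
# SLURM_JOB_ID as the Task ID source is an assumption for illustration.
parent_dir <- Sys.getenv("HPC_VIASH_META_TEMP_DIR")
task_id <- Sys.getenv("SLURM_JOB_ID", unset = as.character(Sys.getpid()))
task_tmp <- file.path(parent_dir, task_id)
dir.create(task_tmp, recursive = TRUE, showWarnings = FALSE)
```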
`HPC_VIASH_META_TEMP_DIR` now overrides `meta["temp_dir"]`, but only if it is present. Viash by default sets the same temp dir for all methods, which can cause temp file collisions; using this env variable when it is available prevents that.

Apptainer config: apart from adding the mandatory cache settings, I had to use `envWhitelist` to ensure the env variables are properly passed into the container. Otherwise, the caches from the containers default to my home directory or the /tmp folders on the node, which I am not allowed to use.
I also introduced a "warmup" script to prevent Apptainer image pull deadlocks and the head job timing out. If I set `ociAutoPull = true`, concurrent tasks running the same methods/metrics will overwrite the same image files simultaneously, leading to cache deadlocks. Alternatively, setting `ociAutoPull = false` lets the head job pull all images before the benchmark run, but that can lead to timeouts even with a 2-day `pullTimeout`. So the solution is a warmup script that submits a single workflow per method/metric to pull and process the required Apptainer images sequentially before firing off the full benchmark; this must be done manually (a rough sketch follows below). We could maybe simplify this in the future with a Slurm job that pulls and processes the images.

I also updated the default retry attempts to 3. This mitigates random "Bus Errors" that somehow resolve themselves on a second or third attempt.
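A rough sketch of the warmup idea, in R for illustration (the real script may well be shell; `main.nf` and the `--method` flag here are hypothetical placeholders, not the repo's actual entry point):

```r
# Hypothetical warmup loop: run one tiny workflow per method, sequentially,
# so each Apptainer image is pulled and converted exactly once before the
# full benchmark is launched.
methods <- c("rpca_to_goal", "rpca_to_mid", "no_integration")  # names from the diff above
for (m in methods) {
  status <- system2("nextflow",
                    args = c("run", "main.nf", "--method", m),
                    wait = TRUE)  # sequential: wait for each pull to finish
  if (status != 0) warning(paste0("Warmup run failed for: ", m))
}
```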
Implementation changes
As I was re-running the tasks, I made some changes to a few of the method/metric implementations:
* Updated the `get_obs_var_for_integrated` helper to handle type mismatches when overriding string-based batch columns with integer maps for perfect integration (see the sketch below).
* The temp dir now defaults to what is set by Viash unless `HPC_VIASH_META_TEMP_DIR` overrides it. See the previous section for why this is needed.
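A hypothetical illustration of the type-mismatch handling (the real `get_obs_var_for_integrated` internals may differ; `override_batch` is a made-up name):

```r
# Made-up helper for illustration: obs$batch is a character column, but
# perfect integration supplies an integer map keyed by batch name. Coercing
# the column to character before the lookup avoids the type mismatch.
override_batch <- function(obs, batch_col, int_map) {
  original <- as.character(obs[[batch_col]])
  obs[[batch_col]] <- as.integer(int_map[original])
  obs
}

# Example usage
obs <- data.frame(batch = c("batch1", "batch2", "batch1"))
int_map <- c(batch1 = 1L, batch2 = 2L)
obs <- override_batch(obs, "batch", int_map)
```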
Checklist before requesting a review
I have performed a self-review of my code
Check the correct box. Does this PR contain:
Proposed changes are described in the CHANGELOG.md
CI Tests succeed and look good!