Setup run hpc #119

ghar1821 · 2026-01-16T04:25:27Z

Describe your changes

This PR introduces a new configuration setup to support running benchmarks on the WEHI HPC environment, along with some implementation changes.

Config to run on WEHI hpc

The HPC's scratch system is prone to task collisions when multiple jobs access the same shared cache/temp folders. It does not keep each job isolated (or maybe it is a Nextflow problem, I am not too sure). To resolve this, I introduced env variables (on top of a few that are already mandatory) such as HPC_VIASH_META_TEMP_DIR, NUMBA_CACHE_DIR, and APPTAINER_TMPDIR. These point to "parent" directories. Each task creates a subdirectory inside the parent, named after its Task ID, and uses that. This prevents methods that write temp files (e.g., CytoNorm, BatchAdjust) from overwriting each other's temp files.

HPC_VIASH_META_TEMP_DIR now overrides meta[temp_dir], but only if it is set. By default, Viash assigns the same temp dir to all methods, which can cause temp file collisions. Using this env variable whenever it is available prevents this.
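The fallback logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this PR (the affected methods are R scripts): the function name `resolve_temp_dir`, the `env` parameter, and the way the task ID is passed in are all assumptions made for the example.

```python
import os
from pathlib import Path


def resolve_temp_dir(meta_temp_dir, task_id, env=None):
    """Pick the temp dir a task should use (hypothetical helper).

    If HPC_VIASH_META_TEMP_DIR is set and non-empty, create and return a
    per-task subdirectory inside it. Otherwise return the Viash default.
    """
    env = os.environ if env is None else env
    parent = env.get("HPC_VIASH_META_TEMP_DIR", "")
    if not parent:
        # Variable unset or empty: keep whatever Viash put in meta["temp_dir"].
        return Path(meta_temp_dir)
    # One subdirectory per task, so concurrent tasks never share temp files.
    task_dir = Path(parent) / str(task_id)
    task_dir.mkdir(parents=True, exist_ok=True)
    return task_dir
```

Treating an empty string the same as an unset variable keeps the behaviour identical when the component runs outside the cluster.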

Apptainer config: apart from adding the mandatory cache, I have to use envWhitelist to ensure these env variables are properly passed into the container. Otherwise, the caches from the containers default to my home directory or the /tmp folders on the node, which I am not allowed to use.

I also introduced a "warmup" script to prevent Apptainer image pull deadlocks and head job timeouts. If I set ociAutoPull=True, concurrent tasks running the same methods/metrics write to the same image files simultaneously, leading to cache deadlocks. Alternatively, setting ociAutoPull=False lets the head job pull all images before the benchmark run, but this can lead to timeouts even with a 2-day pullTimeout. So the solution is a warmup script that submits a single workflow per method/metric to pull and process the required Apptainer images sequentially before firing the full benchmark. This must be done manually. We can maybe simplify this in the future by introducing a SLURM job that pulls and processes the images.
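The key property of the warmup step is that images are pulled strictly one at a time. A minimal Python sketch of that idea is below. It is not the contents of run_warmup.sh: `warm_up` and `build_command` are hypothetical names, and the actual command submitted per method/metric is deliberately left to the caller.

```python
import subprocess


def warm_up(components, build_command, run=subprocess.run):
    """Run one warm-up command per component, one after another.

    `build_command` is a caller-supplied (hypothetical) function that maps a
    component name to the argument list of the command to execute.
    """
    done = []
    for name in components:
        # check=True raises CalledProcessError on a non-zero exit code, so a
        # failed pull stops the loop instead of being silently skipped.
        run(build_command(name), check=True)
        done.append(name)
    return done
```

Because each call blocks until the previous one returns, no two pulls ever touch the image cache at the same time.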

I also set the default number of retry attempts to 3. This mitigates random "Bus Errors" that somehow resolve themselves after a second or third attempt.

Implementation changes

As I was re-running the tasks, I made some changes to the implementations of some methods/metrics:

- CytoVI: Now uses a MinMax scaler fitted on Batch 1 post-correction for normalization.
- Ratio Inconsistent Peaks: Added handling for edge cases where methods return only zeros for a given marker/donor/cell type, preventing division by zero when calculating sd.
- HarmonyPy: Removed a redundant transpose operation. The latest harmonypy updates no longer require it.
- CytoNorm: Fixed a bug in the to_mid methods where recompute was incorrectly set to FALSE (now TRUE).
- Perfect Integration: Fixed a bug where string-based batch columns (vs. integers) resulted in only control samples being returned. Note: Most datasets currently use int for batches, which violates our schema. See issue #121 ("Batch obs is not always str in datasets") for a long-term fix.
- BatchAdjust: Fixed a (dumb) requirement where non-control samples needed "Batch_" somewhere in the sample name.
- Updated the get_obs_var_for_integrated helper to handle type mismatches when overriding string-based batch columns with integer maps for perfect integration.
- Resource Tuning: adjusted time, mem, and cpu requirements:
  - Low: Control methods.
  - Mid: Most methods/metrics.
  - High/Very High: rPCA.
- Updated batchadjust and cytonorm to use the HPC temp dir if the environment variable is set, and otherwise default to what is set by Viash. See the previous section for why this is needed.
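The str/int mismatch behind the Perfect Integration and get_obs_var_for_integrated fixes can be illustrated with a small Python sketch. It is not the code from this PR: `select_batch` is a hypothetical helper, and casting both sides to str is just one way to make the comparison type-safe.

```python
def select_batch(batches, target):
    """Return the indices of cells whose batch label equals `target`.

    Both sides are cast to str first, because in Python 1 == "1" is False
    and a mixed-type comparison would silently match nothing.
    """
    target = str(target)
    return [i for i, batch in enumerate(batches) if str(batch) == target]
```

With this guard, `select_batch([1, 2, 1], "1")` and `select_batch(["1", "2", "1"], 1)` select the same cells, regardless of how the dataset stored the batch column.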

Checklist before requesting a review

I have performed a self-review of my code
Check the correct box. Does this PR contain:
- Breaking changes
- New functionality
- Major changes
- Minor changes
- Bug fixes
Proposed changes are described in the CHANGELOG.md
CI Tests succeed and look good!

ghar1821 · 2026-01-17T23:12:57Z

@LuLeom @rcannood can you please have a look at this pull request? There are a lot of changes, both to the config for running the benchmark on HPC and to the methods/metrics implementations.

I've fired off one full run and everything seems to be running. I will take a look at the results when everything finishes.

LuLeom · 2026-01-19T12:51:29Z

@ghar1821 I had a look at the new files in scripts/*. Unfortunately, I am not very familiar with the frameworks used; I guess @rcannood can provide a better review of that specific part.

I do have one question though: is the idea to keep those files in this remote repo? I have the feeling that, as those scripts work only on WEHI servers, they might create confusion. My suggestion would be either to add a disclaimer in the README (e.g.: "Files in scripts/* outline how to run the benchmark on an hpc cluster", so people can adapt them to their needs) or to move that code to a separate location (e.g. another repo?).

LuLeom · 2026-01-19T14:40:32Z

> Update batchadjust, cytonorm to use HPC temp dir if the environment variable is set or else default to what is set by viash. See previous section why this is needed.

Is there a reason why only these two methods were adapted?

LuLeom · 2026-01-19T12:33:26Z

scripts/run_benchmark/run_warmup.sh

+    "rpca_to_goal"
+    "rpca_to_mid"
+    "no_integration"
+    # "perfect_integration"


perfect_integration is commented out?

LuLeom · 2026-01-19T12:34:39Z

scripts/run_benchmark/run_warmup.sh

+# repeat for metrics
+# do one for metrics as well later.
+# emd has been used
+METRICS=(


Is there a reason why some metrics are commented out?

LuLeom · 2026-01-19T12:37:02Z

CHANGELOG.md

+* Tune the resource requirement for each method (PR #119).
+  * Low time, mem, cpu for control methods.
+  * Mid time, mem, cpu for most methods, except below.
+  * High (or very high) time, mem, cpu for computationally ones like rPCA.


computationally expensive maybe(?)

LuLeom · 2026-01-19T13:07:16Z

src/methods/batchadjust_all_controls/script.R

+  # Environment variable is set, use it
+  print(paste0("Using HPC temp dir from env: ", tmp_dir))
+} else {
+  # Environment variable not set, use meta


So, this should still work when the code is not running on a cluster?

ghar1821 · 2026-01-19

Yes, it should. The HPC_VIASH_META_TEMP_DIR env variable won't be set, so tmp_dir will be an empty string and the code falls into the else branch. But I have no means of testing it at the moment because I can't run anything on AWS :(

LuLeom · 2026-01-19T13:21:41Z

src/methods/batchadjust_one_control/script.R

+  # Environment variable is set, use it
+  print(paste0("Using HPC temp dir from env: ", tmp_dir))
+} else {
+  # Environment variable not set, use meta


So, this should still work when the code is not running on a cluster?

LuLeom · 2026-01-19T13:30:07Z

src/methods/cytonorm_all_controls_to_goal/script.R

 ## VIASH END

+cat("Reticulate Python config:\n")
+print(reticulate::py_config())


Isn't CytoNorm native to R? If I am not mistaken, we aren't making use of reticulate.

LuLeom · 2026-01-19T13:42:18Z

src/methods/batchadjust_all_controls/script.R

 ## VIASH START
 par <- list(
-  input = "resources_test/debug/batchadjust/_viash_par/input_1/censored_split1.h5ad",
+  input = "/Users/putri.g/Documents/cytobenchmark/dataset/lille_spectral_flow_cytometry/censored_split1.h5ad",


warning: local path

LuLeom · 2026-01-19T14:59:59Z

src/metrics/ratio_inconsistent_peaks/script.py

Would it be possible to update config.vsh.yaml so that the description field covers the different cases and the workaround for the SD == 0 problem?
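For readers unfamiliar with the SD == 0 problem, here is a minimal Python sketch of the kind of guard involved. It is an illustration only: it assumes a z-scoring step that divides by the standard deviation, which may not match what the metric actually computes, and `standardize` is a hypothetical name.

```python
import statistics


def standardize(values):
    """Z-score `values`, returning all zeros when the sd is 0.

    An sd of 0 occurs for any constant input, including the all-zero
    output some methods return for a marker/donor/cell type.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # No spread to scale by: skip the division instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]
```

Whether the constant case should yield zeros, NaN, or be skipped entirely is a design decision that the description field could spell out.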

ghar1821 added 28 commits January 15, 2026 19:32

- add scripts to process raw dataset (42effdb)
- editing config to set apptainer cache dir (83ceda2)
- editing pre-run scripts and trying to fix R methods not running. (f3019a9)
- add h5py to setup (f664c48)
- reverting changes to setup (338aa23)
- separate submit scripts (f41f6f4)
- finally the first setting that works!!!! (66c348e)
- update config and settings for control methods (fa1ade7)
- adjusted resources for metrics and methods (5902468)
- update cytovi to use A30 gpu (7497ea4)
- add numba cache dir export to allow jit caching (cf3d35b)
- update cytovi implementation (35e3fdb)
- force recompute for all cytonorm (bad078d)
- add temp dir resolution for hpc (bdcbf46)
- remove transpose from harmonypy (7bada43)
- adding support for hpc (6793f3d)
- update temp dir again (aa4b07a)
- latest config file that works reasonably well with hpc (44def10)
- add some job submit scripts for SLURM (fc4df26)
- update tmp_path for cytonorm (bcc7ddb)
- redirect numba cache dir away from /tmp and to its own folder. (9dae0b6)
- update batch adjust non control samples naming (379b3dd)
- fix bug in perfect integration subsetting (d062157)
- fix bug where we can't replace the batch column if it is not integer (7627554)
- fix bug where the donor loc are somewhat mismatched.. (5ff088b)
- update ratio inconsistent peak where corrected data return only zero (2864b9c)
- Update script.py (ddc57cf)
- update scripts (2312cb2)

ghar1821 requested review from LuLeom and rcannood January 17, 2026 23:11

LuLeom reviewed Jan 19, 2026


3 participants
