Skip to content

Conversation

@ghar1821
Copy link
Contributor

@ghar1821 ghar1821 commented Jan 16, 2026

Describe your changes

This PR introduces new configuration setup to support running benchmarks on the WEHI HPC environment and some implementation changes.

Config to run on WEHI hpc

The hpc's scratch system is prone to task collisions when multiple jobs access the same shared cache/temp folders. It is not smart enough to keep each job isolated (or maybe it is a nextflow problem, I am not too sure). To resolve this, I introduced env variables (on top of few that are already mandatory) like HPC_VIASH_META_TEMP_DIR, NUMBA_CACHE_DIR, and APPTAINER_TMPDIR. These are the "parent" directory. Each task will create a sub directory within it with the Task ID as the directory name and use it. This prevents methods that write out temp files (e.g., CytoNorm, BatchAdjust) from overwriting each other's temp files.

The HPC_VIASH_META_TEMP_DIR now overrides meta[temp_dir] only if it is present. Viash by default set the same temp dir to all methods which can cause temp file collision. By using this env variable when it is available, we can prevent this.

Apptainer Config - apart from adding the mandatory cache, I have to use envwhitelist to ensure these the env variables are properly passed into the container. Otherwise, the caches from the containers will default to my home directory or the /tmp folders in the node, which I am not allowed to use.

I also introduced a "warmup" script to prevent apptainer image pull deadlocks and head job timing out. If I set ociAutoPull=True, concurrent tasks running the same methods/metrics will overwrite the same image files simultaneously, leading to cache deadlocks. Alternatively, setting ociAutoPull=False let the head job pull all images before the benchmark run, but it can lead to timeouts even with a 2-day pullTimeout. So the solution is running a warmup script that submits a single workflow per method/metric to pull and process the required Apptainer images sequentially before firing the full benchmark - must be done manually. We can maybe simplify this in the future by introducing a slurm job that pull and process the images.

I also updated the default retry attempts to 3x. This mitigates random "Bus Errors" that somehow resolves itself after a second or third attempt.

Implementation changes

As I was re-running the tasks, I made some changes to some methods/metrics implementations:

  • CytoVI: Now uses a MinMax scaler fitted on Batch 1 post-correction for normalization.
  • Ratio Inconsistent Peaks: Added handling for edge cases where methods return only zero for a given marker/donor/cell type, preventing division by zero when calculating sd.
  • HarmonyPy: Removed redundant transpose operation. The latest harmonypy updates no longer require this.
  • CytoNorm: Fixed a bug in to mid methods where recompute was incorrectly set to FALSE (now TRUE).
  • Perfect Integration: Fixed a bug where string-based batch columns (vs. integers) resulted in only control samples being returned. Note: Most datasets currently use int for batches, which violates our schema. See Issue Batch obs is not always str in datasets #121 for long-term fix.
  • BatchAdjust: Fixed a (dumb) requirement where non-control samples need "Batch_" somewhere in the sample name.
  • Updated get_obs_var_for_integrated helper to handle type mismatches when overriding string-based batch columns with integer maps for perfect integration.
  • Resource Tuning: adjusted time, mem, and cpu requirements:
  • Low: Control methods.
  • Mid: Most methods/metrics.
  • High/Very High: rPCA.
  • Update batchadjust, cytonorm to use HPC temp dir if the environment variable is set or else
    default to what is set by viash. See previous section why this is needed.

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

@ghar1821 ghar1821 requested review from LuLeom and rcannood January 17, 2026 23:11
@ghar1821
Copy link
Contributor Author

@LuLeom @rcannood can you please have a look at this pull request? There are a lot of changes to the config to run benchmark on hpc and on methods/metrics implementation.

I've fired off one full run and everything seems to be running. I will take a look at the results when everything finishes.

@LuLeom
Copy link
Contributor

LuLeom commented Jan 19, 2026

@ghar1821 I had a look at the new files in scripts/*. Unfortunately I am not super used to the frameworks used, I guess @rcannood can provide a better review on that specific part.

I do have one question though: is the idea to keep those files in this remote repo? I have the feeling that, as those scripts work only for WEHI servers, might create confusion. My suggestion would be to either add a disclaimer in the README (e.g.: "Files in scripts/* outline how to run the benchmark on an hpc cluster", so people can adapt to their needs) or we move that code in a separate location (e.g. another repo?).

@LuLeom
Copy link
Contributor

LuLeom commented Jan 19, 2026

Update batchadjust, cytonorm to use HPC temp dir if the environment variable is set or else
default to what is set by viash. See previous section why this is needed.

Is there are reason why only these two methods were adapted?

"rpca_to_goal"
"rpca_to_mid"
"no_integration"
# "perfect_integration"
Copy link
Contributor

@LuLeom LuLeom Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect_integration is commented out?

# repeat for metrics
# do one for metrics as well later.
# emd has been used
METRICS=(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason why some metrics are commented out?

* Tune the resource requirement for each method (PR #119).
* Low time, mem, cpu for control methods.
* Mid time, mem, cpu for most methods, except below.
* High (or very high) time, mem, cpu for computationally ones like rPCA.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

computationally expensive maybe(?)

# Environment variable is set, use it
print(paste0("Using HPC temp dir from env: ", tmp_dir))
} else {
# Environment variable not set, use meta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this shall work if running the code not in a cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it should. because the HPC_VIASH_META_TEMP_DIR env variable won't be set, and thus tmp_dir will return empty string fall into else. But I have no means of testing it atm because I can't run anything on aws :(

# Environment variable is set, use it
print(paste0("Using HPC temp dir from env: ", tmp_dir))
} else {
# Environment variable not set, use meta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this shall work if running the code not in a cluster?

## VIASH END

cat("Reticulate Python config:\n")
print(reticulate::py_config())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't cytonorm native in R? If I am not mistaken, we aren't making use of reticulate

## VIASH START
par <- list(
input = "resources_test/debug/batchadjust/_viash_par/input_1/censored_split1.h5ad",
input = "/Users/putri.g/Documents/cytobenchmark/dataset/lille_spectral_flow_cytometry/censored_split1.h5ad",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: local path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to update the config.vsh.yaml by including the different cases and the SD == 0 problem workaround in the description field?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants