Skip to content

feat!: SPRAS revision#320

Open
tristan-f-r wants to merge 54 commits intoReed-CompBio:mainfrom
tristan-f-r:hash
Open

feat!: SPRAS revision#320
tristan-f-r wants to merge 54 commits intoReed-CompBio:mainfrom
tristan-f-r:hash

Conversation

@tristan-f-r
Copy link
Collaborator

@tristan-f-r tristan-f-r commented Jul 9, 2025

This change means that output files will not be reused whenever SPRAS is updated if osdf_immutable is true, furthering the immutability goal necessary to get OSDF integration working for SPRAS benchmarking. ('updated' depends on the git commit hash or the actual SPRAS release version)

This adds the unique spras_revision to every single paramater combination (before hashing) and the dataset label, to provide OSDF support on the level of deterministic, non-seeded algorithms when datasets are immutable.

This has the added benefit of allowing SPRAS users to simply upgrade their SPRAS version without needing to clear output, which complements #380. The refactored test also partially covers #165 and #45. (This is also where the majority of the code comes from: The actual feature patch here is a 50 line change.)

See #321 implemented by #335 for handling nondeterministic algorithms / seeded algorithms.


To make this change, a significant test refactor in test/analysis was needed to remove hardcoded paths (which contained the hashes being modified per-commit in this PR.) It turns out that whenever we make any change to the hash, this [original: the patch here fixes this] test breaks! That's why this PR is depended on by so many other PRs.

This adds the unique spras_revision to every single paramater combination (before hashing) and the dataset label, to provide OSDF support on the level of deterministic algorithms.
@tristan-f-r tristan-f-r marked this pull request as ready for review July 9, 2025 20:51
@tristan-f-r tristan-f-r added enhancement New feature or request needed for benchmarking Priority PRs needed for the benchmarking paper labels Jul 9, 2025
@tristan-f-r tristan-f-r changed the title feat: spras_revision feat: SPRAS revision Jul 9, 2025
@tristan-f-r

This comment was marked as outdated.

@tristan-f-r tristan-f-r marked this pull request as draft July 9, 2025 21:37
@tristan-f-r tristan-f-r marked this pull request as ready for review July 10, 2025 19:34
@tristan-f-r tristan-f-r changed the title feat: SPRAS revision feat!: SPRAS revision Jul 10, 2025
@tristan-f-r

This comment was marked as outdated.

@tristan-f-r tristan-f-r added the P-high This is a blocker for many PRs/issues/features label Jul 24, 2025
@github-actions github-actions bot added the merge-conflict This PR has merge conflicts. label Jan 31, 2026
@github-actions github-actions bot removed the merge-conflict This PR has merge conflicts. label Jan 31, 2026
Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments. I still haven't looked through all the test code.

@tristan-f-r
Copy link
Collaborator Author

tristan-f-r commented Jan 31, 2026

Since both past approaches do not scale well, I've decided to only focus on the RECORD file.

This fails specifically in the case where SPRAS is somehow ran without being installed as a python module, and I can't think of a plausible scenario where this happens.

@agitter
Copy link
Collaborator

agitter commented Feb 8, 2026

As a follow up to our meeting discussion, I'm wondering if this type of output file versioning should be optional. Then when running in CHTC and writing to OSDF (or running locally and opting in) it could be enabled. By making it opt in, we would have simpler filenames by default and ensure the user knows they have to install and run SPRAS a specific way for this feature to work.

@tristan-f-r
Copy link
Collaborator Author

That makes the most sense to me as well 👍

Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer and the scope looks better now that the labeled outputs are optional. I still will need a final end-to-end pass because I've been reviewing in chunks.

Comment on lines +41 to +42
(https://packaging.python.org/en/latest/specifications/recording-installed-packages/#the-record-file), which contains
hashes of all files associated with the package distribution [other than itself], and is also included in the package distribution.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this and the linked Python docs and still misunderstood what the RECORD file contains. I didn't realize it is only the installed files for spras, not other packages in the environment. We shoud more documentation here or elsewhere guiding them about what the RECORD is and what this is doing.

My RECORD file:

__editable__.spras-0.5.0.pth,sha256=Bfwel6hOIDjITy_NLfQmXV6gja-XwLim5TDD8-2fzaM,81
__editable___spras_0_5_0_finder.py,sha256=JBwB5Khy9xvKLILhNNp3Q_CCd0HAXfhLPDOuqnTI_TY,3411
__pycache__/__editable___spras_0_5_0_finder.cpython-311.pyc,,
spras-0.5.0.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
spras-0.5.0.dist-info/METADATA,sha256=Cyp5UT6s98dYpmedeIxpvzvpHwyStLBGhahxvcrblYE,9922
spras-0.5.0.dist-info/RECORD,,
spras-0.5.0.dist-info/REQUESTED,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
spras-0.5.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
spras-0.5.0.dist-info/direct_url.json,sha256=VgKtY2-jbwtyENLTmdCCsYx4OapvenhSYDvtbBwa610,102
spras-0.5.0.dist-info/licenses/LICENSE,sha256=JE22iLJ5oyVI79XNYOyJC9uBnyA-RxzahyYy8ANlWwg,1091
spras-0.5.0.dist-info/top_level.txt,sha256=vqYU7OLTmNDzg3UaQ4o_4zjYNAEPqiVCUFc00wsRJB4,6

Copy link
Collaborator Author

@tristan-f-r tristan-f-r Feb 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RECORD file does only contain the installed files for SPRAS and that is intended: SPRAS behavior should only be changing on the SPRAS code and its associated dependency versions. Since we pin all of our dependencies, the behavior should never change. (Note that METADATA is an 'installed file' of SPRAS, which contains all of SPRAS's dependency information.)

[I suspect that your RECORD file attached is a snippet of the RECORD file. Else, you wouldn't be able to do import spras in a Python shell outside of the project's root directory.]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my complete RECORD file, and I can import spras from other directories outside the project root. I added a fake .txt extension so I can upload it.
RECORD.txt

"""
Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

subprocess.run(["snakemake", "--cores", "1", "--configfile", "test/analysis/input/example.yaml"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we calling snakemake directly in a Cytoscape test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to reconstruct the INPUT_PATHWAYS list, which no longer has fixed file names. (Or at least we would want to make sure such input files work when osdf_immutable is enabled.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the goal of this test? I'm not sure how to think about testing with and without the immutable option enabled. Arguably this set of tests should be isolated to Cytoscape behavior and the behavior of the immutable outputs should be tested separately if it needs to be.

Copy link
Collaborator

@agitter agitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some high level comments:

  • We should find a place to document these major changes for users or developers. If we are using config.yaml as the temporary place to document config options, that's fine. Developers won't know what immutability means and he main concepts to be aware of unless they find these comments in the source.
  • The parameters files in the logs/ output are not immutable. Do they need to be?
  • The output subdirectories like data0_72021389-bowtiebuilder-params-4YLQBJ5 take some getting used to. I'm not sure what else we could do to make it more concise other than use a shorter hash. The eval subdirectories also end up included the tag twice: data0_72021389-gs0_72021389-eval.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request needed for benchmarking Priority PRs needed for the benchmarking paper P-high This is a blocker for many PRs/issues/features tuning Workflow-spanning algorithm tuning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants