Conversation
This adds the unique spras_revision to every single paramater combination (before hashing) and the dataset label, to provide OSDF support on the level of deterministic algorithms.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
whoops! accidentally feature-regressed
agitter
left a comment
There was a problem hiding this comment.
A few more comments. I still haven't looked through all the test code.
|
Since both past approaches do not scale well, I've decided to only focus on the RECORD file. This fails specifically in the case where SPRAS is somehow ran without being installed as a python module, and I can't think of a plausible scenario where this happens. |
|
As a follow up to our meeting discussion, I'm wondering if this type of output file versioning should be optional. Then when running in CHTC and writing to OSDF (or running locally and opting in) it could be enabled. By making it opt in, we would have simpler filenames by default and ensure the user knows they have to install and run SPRAS a specific way for this feature to work. |
|
That makes the most sense to me as well 👍 |
what is going on in ci???
okay - sysconfig.get_path("purelib") is correct
we need better typing :/
agitter
left a comment
There was a problem hiding this comment.
This is getting closer and the scope looks better now that the labeled outputs are optional. I still will need a final end-to-end pass because I've been reviewing in chunks.
spras/config/config.py
Outdated
| (https://packaging.python.org/en/latest/specifications/recording-installed-packages/#the-record-file), which contains | ||
| hashes of all files associated with the package distribution [other than itself], and is also included in the package distribution. |
There was a problem hiding this comment.
I read this and the linked Python docs and still misunderstood what the RECORD file contains. I didn't realize it is only the installed files for spras, not other packages in the environment. We shoud more documentation here or elsewhere guiding them about what the RECORD is and what this is doing.
My RECORD file:
__editable__.spras-0.5.0.pth,sha256=Bfwel6hOIDjITy_NLfQmXV6gja-XwLim5TDD8-2fzaM,81
__editable___spras_0_5_0_finder.py,sha256=JBwB5Khy9xvKLILhNNp3Q_CCd0HAXfhLPDOuqnTI_TY,3411
__pycache__/__editable___spras_0_5_0_finder.cpython-311.pyc,,
spras-0.5.0.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
spras-0.5.0.dist-info/METADATA,sha256=Cyp5UT6s98dYpmedeIxpvzvpHwyStLBGhahxvcrblYE,9922
spras-0.5.0.dist-info/RECORD,,
spras-0.5.0.dist-info/REQUESTED,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
spras-0.5.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
spras-0.5.0.dist-info/direct_url.json,sha256=VgKtY2-jbwtyENLTmdCCsYx4OapvenhSYDvtbBwa610,102
spras-0.5.0.dist-info/licenses/LICENSE,sha256=JE22iLJ5oyVI79XNYOyJC9uBnyA-RxzahyYy8ANlWwg,1091
spras-0.5.0.dist-info/top_level.txt,sha256=vqYU7OLTmNDzg3UaQ4o_4zjYNAEPqiVCUFc00wsRJB4,6
There was a problem hiding this comment.
The RECORD file does only contain the installed files for SPRAS and that is intended: SPRAS behavior should only be changing on the SPRAS code and its associated dependency versions. Since we pin all of our dependencies, the behavior should never change. (Note that METADATA is an 'installed file' of SPRAS, which contains all of SPRAS's dependency information.)
[I suspect that your RECORD file attached is a snippet of the RECORD file. Else, you wouldn't be able to do import spras in a Python shell outside of the project's root directory.]
There was a problem hiding this comment.
That was my complete RECORD file, and I can import spras from other directories outside the project root. I added a fake .txt extension so I can upload it.
RECORD.txt
| """ | ||
| Path(OUT_DIR).mkdir(parents=True, exist_ok=True) | ||
|
|
||
| subprocess.run(["snakemake", "--cores", "1", "--configfile", "test/analysis/input/example.yaml"]) |
There was a problem hiding this comment.
Why are we calling snakemake directly in a Cytoscape test?
There was a problem hiding this comment.
We need to reconstruct the INPUT_PATHWAYS list, which no longer has fixed file names. (Or at least we would want to make sure such input files work when osdf_immutable is enabled.)
There was a problem hiding this comment.
Is that the goal of this test? I'm not sure how to think about testing with and without the immutable option enabled. Arguably this set of tests should be isolated to Cytoscape behavior and the behavior of the immutable outputs should be tested separately if it needs to be.
agitter
left a comment
There was a problem hiding this comment.
Some high level comments:
- We should find a place to document these major changes for users or developers. If we are using
config.yamlas the temporary place to document config options, that's fine. Developers won't know what immutability means and he main concepts to be aware of unless they find these comments in the source. - The parameters files in the
logs/output are not immutable. Do they need to be? - The output subdirectories like
data0_72021389-bowtiebuilder-params-4YLQBJ5take some getting used to. I'm not sure what else we could do to make it more concise other than use a shorter hash. The eval subdirectories also end up included the tag twice:data0_72021389-gs0_72021389-eval.
This change means that output files will not be reused whenever SPRAS is updated if
osdf_immutableis true, furthering the immutability goal necessary to get OSDF integration working for SPRAS benchmarking. ('updated' depends on the git commit hash or the actual SPRAS release version)This adds the unique
spras_revisionto every single paramater combination (before hashing) and the dataset label, to provide OSDF support on the level of deterministic, non-seeded algorithms when datasets are immutable.This has the added benefit of allowing SPRAS users to simply upgrade their SPRAS version without needing to clear
output, which complements #380. The refactored test also partially covers #165 and #45. (This is also where the majority of the code comes from: The actual feature patch here is a 50 line change.)See #321 implemented by #335 for handling nondeterministic algorithms / seeded algorithms.
To make this change, a significant test refactor in
test/analysiswas needed to remove hardcoded paths (which contained the hashes being modified per-commit in this PR.) It turns out that whenever we make any change to the hash, this [original: the patch here fixes this] test breaks! That's why this PR is depended on by so many other PRs.