feat: scaffolding, caching, EGFR by tristan-f-r · Pull Request #65 · Reed-CompBio/spras-benchmarking

tristan-f-r · 2026-03-18T03:20:16Z

We bundle EGFR along with the rest of the caching infrastructure. Notes:

All motivation for the caching system lives under cache/README.md.
We removed pra.yaml for now, as the only PRAs are the synthetic data and the ResponseNet data, and soon the DepMap data.
The CONTRIBUTING.md file is in Changes to CONTRIBUTING guide #57.
directory.py contains unnecessary files from other datasets that were deemed universal.


ntalluri

I did a light review of the PR; I did not look too hard at the code itself yet. I was mostly gathering ideas about what was happening from the READMEs.



tristan-f-r · 2026-04-30T01:06:34Z

My two notes:

How important is it that the 10 value stays 10? (e.g. does 1000 work just as well?) The methodology described in the paper says that it just needs to be a value greater than every other prize.
Unless (*) it is the case that we plan to have different algorithm parameter inputs for different dataset collections, I disagree with the methodology that the paper describes for a config per dataset collection, where I would instead believe it to be straightforward to continue as is. Otherwise, unless (*) is true, we directly contradict one of SPRAS's usability goals of running several datasets on several algorithms.
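To make the first note concrete: as described, the constraint only asks for strict dominance over the other prizes. A minimal sketch, assuming prizes are held in a plain node-to-prize mapping; the `dominant_prize_ok` helper and the node names are hypothetical and not part of this PR. It does not address whether the magnitude changes an algorithm's behavior in practice.

```python
def dominant_prize_ok(prizes: dict[str, float], special_node: str) -> bool:
    """Return True when `special_node` has a prize strictly greater than every other prize."""
    special = prizes[special_node]
    others = [value for node, value in prizes.items() if node != special_node]
    return all(special > value for value in others)


# Hypothetical prizes; only the ordering matters for this check.
prizes = {"SOURCE": 10.0, "A": 2.5, "B": 0.7}
print(dominant_prize_ok(prizes, "SOURCE"))  # True

# Under this reading of the constraint, 1000 passes exactly as 10 does.
prizes["SOURCE"] = 1000.0
print(dominant_prize_ok(prizes, "SOURCE"))  # True
```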

ntalluri · 2026-04-30T16:56:26Z

Based on meeting:

1. Update datasets to fetch configs either in directory.py or in a dataset-specific Snakemake file (one or the other, not both).
   - This idea will be updated in the contributing guide as well.
2. Keep the value of 10 for the EGFR prizes.
   - It is hard to justify, but that's the plan for now.
3. We will have 4 separate dataset collection configs (we can also keep the test-config if needed).

tristan-f-r · 2026-05-04T03:55:30Z

I use local and global 'files' to mean dataset-specific and directory.py-configured CacheItems, respectively.

As for (1), there are a few notes to collect:

1. Global files are hard to track, as just seen in DISEASES. Consequently, global files are not only the cause of the majority of the in-code complexity, but they also cause the most complexity for users.
2. Using only local files causes version drift problems: we risk not being able to easily track data files that need updating. (This is commonly solved by the registry pattern, which is precisely what global files are an instance of.)
3. A combination of global and local files lets us heavily dampen the problem with global files mentioned in (1); that is, we can more easily find files associated with a specific dataset when working within that dataset.

Notably, the above design constraints mean that local files are the easiest for users to understand and for us to implement. However, if one believes in (2), or that there are substantial benefits to be gained from using global files that outweigh their implementation consequences, then global files become tempting. Then, it turns out that all of the concerns about implementing global/local files directly apply to just global-only files:

"How does a user know when data is being used?" is a question one has to ask for global files as well.
"How does a user know where data is at," in the global/local file context, is equivalent to "is the data in my local Snakefile" (easily answered), or "where is the data in directory.py?" making this question nearly equivalent to the previous question.
"How is a user meant to know when to make a file local or global?" is answered by "if the data has been used," making this question also nearly equivalent to the first question.

That is, since the code & user complexity of introducing local & global files instead of just global files is minor, the concerns of global/local files are nearly equivalent to that of global files, and global/local files give us (3), one should either implement global/local files or local-only files.

I believe Anna's position was on local-only files, but since I also believe that we stand to gain more from global files than just local files, I'll keep on using global/local files.
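As a rough illustration of the global/local split, here is a minimal sketch of the lookup a user would mentally perform. Everything in it is an assumption for illustration: the `CacheItem` fields, the `GLOBAL_ITEMS` registry standing in for directory.py, the `locate` helper, the keys, and the placeholder URLs are not taken from the PR's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheItem:
    # Hypothetical stand-in for the PR's CacheItem; the real class differs.
    name: str
    url: str


# "Global" files: one shared registry, standing in for directory.py.
GLOBAL_ITEMS = {
    "STRING/9606/links": CacheItem("9606.protein.links.full.txt", "https://example.org/string"),
}


def locate(key: str, local_items: dict[str, CacheItem]) -> tuple[str, CacheItem]:
    """Answer "where is the data?": check the dataset's own items first, then the registry."""
    if key in local_items:
        return "local", local_items[key]
    if key in GLOBAL_ITEMS:
        return "global", GLOBAL_ITEMS[key]
    raise KeyError(f"{key!r} is neither a local nor a global file")


# A dataset declares only the files that are specific to it.
egfr_local = {"prizes": CacheItem("prizes.txt", "https://example.org/egfr-prizes")}

print(locate("prizes", egfr_local)[0])             # local
print(locate("STRING/9606/links", egfr_local)[0])  # global
```

A local-only design would drop `GLOBAL_ITEMS` and the second branch; a global-only design would drop `local_items` and the first.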

tristan-f-r · 2026-05-04T04:21:29Z

As for (3), I'll move these to separate configs. In the future, when we do config autogeneration, if SPRAS gains the capability to run different datasets on different algorithms, we should programmatically merge these configurations. [I may experiment with this in later configurations, especially with #66, which admits a need for configuration generation.]
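A minimal sketch of what such a merge could look like today, assuming each config is a dict with an `algorithms` entry and a `datasets` list of labelled entries. The `merge_configs` helper and the sample values are hypothetical. It only merges configs whose algorithm settings are identical; differing settings are exactly the case that would need the new SPRAS capability.

```python
def merge_configs(configs: list[dict]) -> dict:
    """Merge per-collection configs that share identical algorithm settings."""
    if not configs:
        raise ValueError("nothing to merge")
    algorithms = configs[0]["algorithms"]
    merged = {"algorithms": algorithms, "datasets": []}
    seen_labels = set()
    for config in configs:
        if config["algorithms"] != algorithms:
            raise ValueError("algorithm settings differ; keep these configs separate")
        for dataset in config["datasets"]:
            if dataset["label"] in seen_labels:
                raise ValueError(f"duplicate dataset label: {dataset['label']}")
            seen_labels.add(dataset["label"])
            merged["datasets"].append(dataset)
    return merged


# Hypothetical, heavily trimmed configs.
egfr = {"algorithms": [{"name": "pathlinker"}], "datasets": [{"label": "egfr"}]}
other = {"algorithms": [{"name": "pathlinker"}], "datasets": [{"label": "diseases"}]}
print([d["label"] for d in merge_configs([egfr, other])["datasets"]])  # ['egfr', 'diseases']
```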

ntalluri · 2026-05-05T17:08:18Z

+def main():
+    interactome_df = pandas.read_csv(egfr_directory / "raw" / "9606.protein.links.full.txt", sep=" ")
+    # Rename the columns both to stylistically keep it in-line with SPRAS and functionally for `normalize_interactome`.
+    interactome_df = interactome_df.rename(columns={"protein1": "Interactor1", "protein2": "Interactor2", "combined_score": "Weight"})


is there a reason we are using the combined score instead of experiments for the collection?

I need to double-check what the PANTHER pathways used for the interactomes, but I think we used experiments.

That decision is borrowed from DISEASES, though it isn't appropriate justification for using the same methodology for EGFR.

For how experimental this dataset collection is, experiments seems like the right choice. I'm also not sure what all goes into combined_score.

Actually, both here and DISEASES don't have a particular reason for combined_score usage?

Also, for the paper, we are actually only going to use the STRING interactome.

I'll need to think about this more. It's going to be hard to justify what channel to use quickly.

Also, for the paper, we are actually only going to use the STRING interactome.

I'm aware, but if we find how the original weights for the (PhosphoSitePlus) interactome were assigned, then we can use a similar system for our EGFR interactome to be as close as possible to the data in the original TPS paper.

From [a supplementary section of] the TPS paper:

Likewise, we observe that some proteins, such as RAS and RAF family members, are not included in the TPS pathway because our mass spectrometry data do not detect their phosphorylation. To increase robustness to potential false negatives in the mass spectrometry, the input PPI network could be modified to include edges from relevant reference pathways with high weights (similar to (Patil et al., 2013)) so that PCSF prefers to include these interactions instead of other high-confidence connections in the PPI network. The weight of these prior knowledge edges would control the tradeoff between condition-specific de novo pathway discovery and conformance with prior knowledge.

I still can't find a formal treatment of weights in the PhosphoSitePlus interactome paper. From this little section, though, the hint of modifying the input network (not to be confused with the background network) to include prior knowledge would only be a good hint if the background network itself contained such prior-knowledge scores. (That is, we should use combined_score for EGFR specifically, though we should also, as you've been doing, review the use of the combined score channel in other datasets.)

keeping this open so I can decide what to do.
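To see what the channel choice changes in practice, here is a minimal sketch that takes the score column as a parameter. It uses the standard-library `csv` module rather than the pandas call in the diff so that it runs without dependencies. The sample rows and protein IDs are made up, only a subset of the file's columns is shown, and dropping zero-score edges is an assumption of this sketch, not something the PR does.

```python
import csv
import io

# Tiny made-up sample with a subset of the space-separated columns.
SAMPLE = """protein1 protein2 experiments combined_score
9606.ENSP0001 9606.ENSP0002 0 412
9606.ENSP0001 9606.ENSP0003 880 905
"""


def load_interactome(handle, channel: str) -> list[dict]:
    """Read STRING-style links, using `channel` as the edge weight and dropping zero-score edges."""
    rows = []
    for record in csv.DictReader(handle, delimiter=" "):
        weight = int(record[channel])
        if weight > 0:
            rows.append({"Interactor1": record["protein1"], "Interactor2": record["protein2"], "Weight": weight})
    return rows


print(len(load_interactome(io.StringIO(SAMPLE), "combined_score")))  # 2
print(len(load_interactome(io.StringIO(SAMPLE), "experiments")))     # 1
```

In this toy sample, the first edge has no experimental support and survives only under `combined_score`, so the two channels can yield different networks, not just different weights.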

ntalluri · 2026-05-06T14:45:04Z

That is, since the code & user complexity of introducing local & global files instead of just global files is minor, the concerns of global/local files are nearly equivalent to that of global files, and global/local files give us (3), one should either implement global/local files or local-only files.

I believe Anna's position was on local-only files, but since I also believe that we stand to gain more from global files than just local files, I'll keep on using global/local files.

This is still way too complicated with both options. I would like all the data for a dataset to be in one place: either directory.py or the local Snakemake file. Please work on making this simpler for a new user.

ntalluri

small review; github isn't letting me add any review/comments directly on code to any PRs right now.

ntalluri · 2026-05-06T14:29:55Z

+# BioMart XML Queries
+
+Directory for storing XML queries generated from [the BioMart interface](https://www.ensembl.org/info/data/biomart/index.html),
+which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.


Suggested change

which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.

which provides universal mappings regarding different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.

ntalluri · 2026-05-06T14:32:45Z

+
+This may be expanded in the future, so only depend on this file as a debugging utility.
+
+For example, `python cache/cli.py a/b.c b.c` would download the file under `a`, `b.c` in `directory`


is there some documentation we can have so we know how to use this file to debug? If not, can you add that to the readme? Otherwise this file is useless to new users.

ntalluri · 2026-05-06T14:48:19Z

+    "STRING": {
+        # Our latest STRING files are v12: datasets use 'latest'
+        # when they intend to use the most up-to-date STRING file.
+        "latest": "v12",


latest just overcomplicates things for an already super complicated system.

ntalluri · 2026-05-12T16:12:37Z

+        """
+        Downloads this `CacheItem` to the desired `output`,
+        comparing the `cached` file to the `pinned` and `unpinned` files,
+        warning when `cached` doesn't match `unpinned`, and erroring when


Suggested change

warning when `cached` doesn't match `unpinned`, and erroring when

warning when `cached` doesn't match `unpinned`, and warning when


ntalluri · 2026-05-14T19:33:45Z

+        g: 0
+
+datasets:
+  - label: scoresegfr_string


Suggested change

- label: scoresegfr_string

- label: egfr

tristan-f-r · 2026-05-14T20:43:05Z

We should keep latest. It's two lines of code for a nice abstraction.

ntalluri · 2026-05-14T20:48:47Z

Updating latest to a different version would change all downstream data that references it. To keep data static, I am removing latest.
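Both positions can be seen in a minimal sketch of alias resolution. The `directory` shape, the `resolve` helper, the entry values, and the `v13` version are hypothetical and only loosely modelled on the snippet under review.

```python
# Hypothetical, trimmed-down shape of the versioned entry under discussion.
directory = {
    "STRING": {
        "latest": "v12",
        "v12": "string-v12-links",
        "v13": "string-v13-links",  # hypothetical future version
    }
}


def resolve(source: str, version: str) -> str:
    """Resolve a version name, following the 'latest' alias when it is used."""
    versions = directory[source]
    if version == "latest":
        version = versions["latest"]
    return versions[version]


print(resolve("STRING", "latest"))  # string-v12-links
print(resolve("STRING", "v12"))     # string-v12-links

# Repointing the alias silently changes every dataset that asked for 'latest',
# while datasets pinned to an explicit version are unaffected.
directory["STRING"]["latest"] = "v13"
print(resolve("STRING", "latest"))  # string-v13-links
print(resolve("STRING", "v12"))     # string-v12-links
```

The alias itself is a two-line branch in `resolve`; the cost is that its target is mutable shared state, which is the concern about keeping data static.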

chore: drop other datasets

b49439e

tristan-f-r added the enhancement New feature or request label Mar 18, 2026

tristan-f-r added 2 commits March 17, 2026 20:36

Merge branch 'main' into egfr-and-infrastructure

2018a13

chore: re-include

136e5ff

tristan-f-r mentioned this pull request Mar 18, 2026

Changes to CONTRIBUTING guide #57

Draft

tristan-f-r added 2 commits March 17, 2026 20:42

chore: drop tools

472468d

not needed just yet

chore: re-add tools

a5de971

This was referenced Mar 18, 2026

dataset: DISEASES #66

Open

dataset: yeast osmotic stress #67

Open

dataset: hiv #68

Open

dataset: muscle skeletal (from ResponseNet) #69

Open

dataset: DepMap #70

Draft

tristan-f-r added the dataset Mutating datasets in any way. label Mar 18, 2026

tristan-f-r mentioned this pull request Mar 18, 2026

dataset: synthetic from PANTHER #71

Draft

tristan-f-r added 3 commits March 18, 2026 05:53

docs: cache

8ddccb4

style: fmt

90cc277

docs: on caching

eb23b8f

tristan-f-r mentioned this pull request Mar 18, 2026

chore: delete [temporarily!] #64

Merged

tristan-f-r changed the title ~~feat: initial scaffolding, EGFR~~ feat: scaffolding, caching, EGFR Mar 18, 2026

ntalluri reviewed Mar 18, 2026


Comment thread web/public/favicon.svg Outdated

ntalluri reviewed Mar 18, 2026


tristan-f-r and others added 2 commits March 18, 2026 16:54

docs: suggestions from review

4b524bc

Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

docs: more comments, refactor: mv function out of Snakefile

69fda05

ntalluri reviewed Mar 19, 2026


Comment thread cache/README.md Outdated

ntalluri reviewed Mar 19, 2026


Comment thread cache/README.md Outdated

tristan-f-r added 4 commits March 19, 2026 18:58

docs(datasets): mention responsenet and egfr

15c7ecb

docs(datasets): add old synthetic data branch

729a51b

chore: mv to scores instead of dmmm

922be5d

docs: drop expiration docs

f3d6d41

this is only in github actions

tristan-f-r mentioned this pull request Mar 23, 2026

Hook into loguru to warn for outdated datasets #72

Open

tristan-f-r added 2 commits April 30, 2026 00:56

refactor: change file name

5f6720c

chore: apply suggestions

8755803

tristan-f-r requested a review from ntalluri April 30, 2026 01:04

fix: set header=False for normalized interactome

c7efcba

refactor: use other prize value

1b60dcb

tristan-f-r added 2 commits May 4, 2026 04:20

refactor: move config to egfr

fc09ffb

ci: use EGFR for config name

5837d7f

tristan-f-r requested review from ntalluri and removed request for ntalluri May 4, 2026 04:21

ci: add uv environment tests

1ec407e

ntalluri reviewed May 5, 2026


ntalluri reviewed May 6, 2026


ntalluri reviewed May 12, 2026


ntalluri added 9 commits May 14, 2026 10:03

update the deduplicate tool function, remove normalize function and b…

5276687

…e direct

remove concept of latest

3006983

precommit

3f69499

fix tools test

7595652

cleanup

1f0ca12

precommit

78c0adf

add todo

e682f09

update trim_input_nodes.py, it was an & and it needed to be an |

230a765

update comment and precommit

b0a3165

ntalluri reviewed May 14, 2026


