Skip to content

feat: scaffolding, caching, EGFR#65

Open
tristan-f-r wants to merge 72 commits into
mainfrom
egfr-and-infrastructure
Open

feat: scaffolding, caching, EGFR#65
tristan-f-r wants to merge 72 commits into
mainfrom
egfr-and-infrastructure

Conversation

@tristan-f-r
Copy link
Copy Markdown
Contributor

@tristan-f-r tristan-f-r commented Mar 18, 2026

We bundle EGFR along with the rest of the caching infrastructure. Notes:

  • All motivation for the caching system lives under cache/README.md.
  • We removed pra.yaml for now, as the only PRAs are the synthetic data and the ResponseNet data, and soon the DepMap data.
  • The CONTRIBUTING.md file is in Changes to CONTRIBUTING guide #57.
  • directory.py contains unnecessary files from other datasets that were deemed universal.

@tristan-f-r tristan-f-r added the enhancement New feature or request label Mar 18, 2026
@tristan-f-r tristan-f-r changed the title feat: initial scaffolding, EGFR feat: scaffolding, caching, EGFR Mar 18, 2026
Comment thread web/public/favicon.svg Outdated
Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a light review of the PR; did not look to hard at the code itself yet. I mostly was gathering ideas on what was happening from the READMEs.

Comment thread configs/egfr.yaml
Comment thread tools/README.md
Comment thread .vscode/extensions.json Outdated
Comment thread .devcontainer/devcontainer.json
Comment thread cache/biomart/README.md Outdated
Comment thread README.md Outdated
Comment thread README.md
Comment thread README.md Outdated
Comment thread configs/dmmm.yaml Outdated
Comment thread configs/dmmm.yaml Outdated
tristan-f-r and others added 2 commits March 18, 2026 16:54
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
Comment thread cache/README.md Outdated
Comment thread cache/README.md Outdated
@tristan-f-r tristan-f-r requested a review from ntalluri April 30, 2026 01:04
@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented Apr 30, 2026

My two notes:

  • How important is it that the 10 value stays 10? (e.g. does 1000 work just as well?) The methodology described in the paper says that it just needs to be a value greater than every other prize.
  • Unless (*) it is the case that we plan to have different algorithm parameter inputs for different dataset collections, I disagree with the methodology that the paper describes for a config per dataset collection, where I would instead believe it to be straightforward to continue as is. Otherwise, unless (*) is true, we directly contradict one of SPRAS's usability goals of running several datasets on several algorithms.

@ntalluri
Copy link
Copy Markdown
Collaborator

ntalluri commented Apr 30, 2026

Based on meeting:

  1. update to have datasets fetch configs either in directory.py or in a snakemake file that is dataset specific (one or the other not both)
  • this idea will be updated in the contributing guide as well
  1. keeping the value of 10 for the prizes for egfr
  • hard to justify but that's just the plan for now
  1. we will have 4 separate dataset collection configs (we can also keep the test-config too if needed).

@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented May 4, 2026

I use local and global 'files' to mean dataset-specific and directory.py-configured CacheItems, respectively.

As for (1), there are a few notes to collect:

  1. Global files are hard to track, as just seen in DISEASES: consequently, not only are global files are the cause of the majority of the complexity in-code, but they also cause the most complexity for users.
  2. Using only local files cause version drift problems: we risk not being able to easily track data files that need updating. (This is commonly solved by the registry pattern, which is precisely what global files are an instance of.)
  3. A combination of global & local files allows us to heavily dampen the problem with global files as mentioned with (1), or that is, we can more easily find files associated with a specific dataset when working within one dataset.

Notably, the above design constraints mean that local files are the easiest for users to understand and for us to implement. However, if one believes in (2), or that there are substantial benefits to be gained from using global files that outweigh their implementation consequences, then global files become tempting. Then, it turns out that all of the concerns about implementing global/local files directly apply to just global-only files:

  • "How does a user know when data is being used?" is a question one has to ask for global files as well.
  • "How does a user know where data is at," in the global/local file context, is equivalent to "is the data in my local Snakefile" (easily answered), or "where is the data in directory.py?" making this question nearly equivalent to the previous question.
  • "How is a user meant to know when to make a file local or global?" is answered by "if the data has been used," making this question also nearly equivalent to the first question.

That is, since the code & user complexity of introducing local & global files instead of just global files is minor, the concerns of global/local files are nearly equivalent to that of global files, and global/local files give us (3), one should either implement global/local files or local-only files.

I believe Anna's position was on local-only files, but since I also believe that we stand to gain more from global files than just local files, I'll keep on using global/local files.

@tristan-f-r
Copy link
Copy Markdown
Contributor Author

tristan-f-r commented May 4, 2026

As for (3), I'll move these to separate configs. In the future, when we do config autogeneration, if SPRAS gains the capability to run different datasets on different algorithms, we should programatically merge these configurations. [I may experiment with this in later configurations, especially with #66 which admits a need for configuration generation.]

@tristan-f-r tristan-f-r requested review from ntalluri and removed request for ntalluri May 4, 2026 04:21
def main():
interactome_df = pandas.read_csv(egfr_directory / "raw" / "9606.protein.links.full.txt", sep=" ")
# Rename the columns both to stylistically keep it in-line with SPRAS and functionally for `normalize_interactome`.
interactome_df = interactome_df.rename(columns={"protein1": "Interactor1", "protein2": "Interactor2", "combined_score": "Weight"})
Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we are using the combined score instead of experiments for the collection?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need double check what panther pathways used for the interactomes but I think we used experiments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That decision is borrowed from DISEASES, though it isn't appropriate justification for using the same methodology for EGFR.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For how experimental this dataset collection is, experiments seems like the right choice. I'm also not sure what is all in the combined_score.

Copy link
Copy Markdown
Contributor Author

@tristan-f-r tristan-f-r May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, both here and DISEASES don't have a particular reason for combined_score usage?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for the paper, we are actually only going to use the STRING interactome.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to think about this more. It's going to be hard to justify what channel to use quickly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, for the paper, we are actually only going to use the STRING interactome.

I'm aware, but if we find how the original weights for the (PhosphoSitePlus) interactome were assigned, then we can use a similar system for our EGFR interactome to be as close as possible to the data in the original TPS paper.

Copy link
Copy Markdown
Contributor Author

@tristan-f-r tristan-f-r May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From [a supplementary section of] the TPS paper:

Likewise, we observe that some proteins, such as RAS and RAF family members, are not included in the TPS pathway because our mass spectrometry data do not detect their phosphorylation. To increase robustness to potential false negatives in the mass spectrometry, the input PPI network could be modified to include edges from relevant reference pathways with high weights (similar to (Patil et al., 2013)) so that PCSF prefers to include these interactions instead of other high-confidence connections in the PPI network. The weight of these prior knowledge edges would control the tradeoff between condition-specific de novo pathway discovery and conformance with prior knowledge.

I still can't find a formal treatment of weights in the PhosphoSitePlus interactome paper, but from this little section, it seems that the hint of modifying the input network (not to be confused with the background network) to include prior knowledge would only be a good hint if the background network itself contained such prior knowledge scores. (i.e. that we should use combined_score for EGFR specifically, though we should, and as you've been doing, review the use of the combined score channel in other datasets.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keeping this open so I can decide what to do.

@ntalluri
Copy link
Copy Markdown
Collaborator

ntalluri commented May 6, 2026

That is, since the code & user complexity of introducing local & global files instead of just global files is minor, the concerns of global/local files are nearly equivalent to that of global files, and global/local files give us (3), one should either implement global/local files or local-only files.

I believe Anna's position was on local-only files, but since I also believe that we stand to gain more from global files than just local files, I'll keep on using global/local files.

This is still way to complicated with having both options. I would like to make all the data for a dataset to be one place either directory.py or the local snakemake file. Please work on making this simpler for a new user.

Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small review; github isn't letting me add any review/comments directly on code to any PRs right now.

Comment thread cache/biomart/README.md
# BioMart XML Queries

Directory for storing XML queries generated from [the BioMart interface](https://www.ensembl.org/info/data/biomart/index.html),
which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.
which provides universal mappings regarding different biological datasets. See the martview: https://www.ensembl.org/biomart/martview.

Comment thread cache/cli.py

This may be expanded in the future, so only depend on this file as a debugging utility.

For example, `python cache/cli.py a/b.c b.c` would download the file under `a`, `b.c` in `directory`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there some documentation we can have so we know how to use this file to debug? If not, can you add that to the readme? Otherwise this file is useless to new users.

Comment thread cache/directory.py
Comment thread cache/directory.py Outdated
"STRING": {
# Our latest STRING files are v12: datasets use 'latest'
# when they intend to use the most up-to-date STRING file.
"latest": "v12",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latest just overcomplicates things for an already super complicated system.

Comment thread cache/directory.py
Comment thread cache/directory.py
Comment thread cache/directory.py
"""
Downloads this `CacheItem` to the desired `output`,
comparing the `cached` file to the `pinned` and `unpinned` files,
warning when `cached` doesn't match `unpinned`, and erroring when
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
warning when `cached` doesn't match `unpinned`, and erroring when
warning when `cached` doesn't match `unpinned`, and warning when

Comment thread configs/egfr.yaml
g: 0

datasets:
- label: scoresegfr_string
Copy link
Copy Markdown
Collaborator

@ntalluri ntalluri May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- label: scoresegfr_string
- label: egfr

@tristan-f-r
Copy link
Copy Markdown
Contributor Author

We should keep latest. It's two lines of code for a nice abstraction.

@ntalluri
Copy link
Copy Markdown
Collaborator

Updating latest to a different version would change all downstream data that references it. To keep data static, I am removing latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataset Mutating datasets in any way. enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants