feat: scaffolding, caching, EGFR#65
Conversation
not needed just yet
ntalluri
left a comment
There was a problem hiding this comment.
I did a light review of the PR; did not look to hard at the code itself yet. I mostly was gathering ideas on what was happening from the READMEs.
Co-authored-by: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
|
My two notes:
|
|
Based on meeting:
|
|
I use local and global 'files' to mean dataset-specific and As for (1), there are a few notes to collect:
Notably, the above design constraints mean that local files are the easiest for users to understand and for us to implement. However, if one believes in (2), or that there are substantial benefits to be gained from using global files that outweigh their implementation consequences, then global files become tempting. Then, it turns out that all of the concerns about implementing global/local files directly apply to just global-only files:
That is, since the code & user complexity of introducing local & global files instead of just global files is minor, the concerns of global/local files are nearly equivalent to that of global files, and global/local files give us (3), one should either implement global/local files or local-only files. I believe Anna's position was on local-only files, but since I also believe that we stand to gain more from global files than just local files, I'll keep on using global/local files. |
|
As for (3), I'll move these to separate configs. In the future, when we do config autogeneration, if SPRAS gains the capability to run different datasets on different algorithms, we should programatically merge these configurations. [I may experiment with this in later configurations, especially with #66 which admits a need for configuration generation.] |
| def main(): | ||
| interactome_df = pandas.read_csv(egfr_directory / "raw" / "9606.protein.links.full.txt", sep=" ") | ||
| # Rename the columns both to stylistically keep it in-line with SPRAS and functionally for `normalize_interactome`. | ||
| interactome_df = interactome_df.rename(columns={"protein1": "Interactor1", "protein2": "Interactor2", "combined_score": "Weight"}) |
There was a problem hiding this comment.
is there a reason we are using the combined score instead of experiments for the collection?
There was a problem hiding this comment.
I need double check what panther pathways used for the interactomes but I think we used experiments.
There was a problem hiding this comment.
That decision is borrowed from DISEASES, though it isn't appropriate justification for using the same methodology for EGFR.
There was a problem hiding this comment.
For how experimental this dataset collection is, experiments seems like the right choice. I'm also not sure what is all in the combined_score.
There was a problem hiding this comment.
Actually, both here and DISEASES don't have a particular reason for combined_score usage?
There was a problem hiding this comment.
Also, for the paper, we are actually only going to use the STRING interactome.
There was a problem hiding this comment.
I'll need to think about this more. It's going to be hard to justify what channel to use quickly.
There was a problem hiding this comment.
Also, for the paper, we are actually only going to use the STRING interactome.
I'm aware, but if we find how the original weights for the (PhosphoSitePlus) interactome were assigned, then we can use a similar system for our EGFR interactome to be as close as possible to the data in the original TPS paper.
There was a problem hiding this comment.
From [a supplementary section of] the TPS paper:
Likewise, we observe that some proteins, such as RAS and RAF family members, are not included in the TPS pathway because our mass spectrometry data do not detect their phosphorylation. To increase robustness to potential false negatives in the mass spectrometry, the input PPI network could be modified to include edges from relevant reference pathways with high weights (similar to (Patil et al., 2013)) so that PCSF prefers to include these interactions instead of other high-confidence connections in the PPI network. The weight of these prior knowledge edges would control the tradeoff between condition-specific de novo pathway discovery and conformance with prior knowledge.
I still can't find a formal treatment of weights in the PhosphoSitePlus interactome paper, but from this little section, it seems that the hint of modifying the input network (not to be confused with the background network) to include prior knowledge would only be a good hint if the background network itself contained such prior knowledge scores. (i.e. that we should use combined_score for EGFR specifically, though we should, and as you've been doing, review the use of the combined score channel in other datasets.)
There was a problem hiding this comment.
keeping this open so I can decide what to do.
This is still way to complicated with having both options. I would like to make all the data for a dataset to be one place either directory.py or the local snakemake file. Please work on making this simpler for a new user. |
ntalluri
left a comment
There was a problem hiding this comment.
small review; github isn't letting me add any review/comments directly on code to any PRs right now.
| # BioMart XML Queries | ||
|
|
||
| Directory for storing XML queries generated from [the BioMart interface](https://www.ensembl.org/info/data/biomart/index.html), | ||
| which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview. |
There was a problem hiding this comment.
| which provides universal mappings regardng different biological datasets. See the martview: https://www.ensembl.org/biomart/martview. | |
| which provides universal mappings regarding different biological datasets. See the martview: https://www.ensembl.org/biomart/martview. |
|
|
||
| This may be expanded in the future, so only depend on this file as a debugging utility. | ||
|
|
||
| For example, `python cache/cli.py a/b.c b.c` would download the file under `a`, `b.c` in `directory` |
There was a problem hiding this comment.
is there some documentation we can have so we know how to use this file to debug? If not, can you add that to the readme? Otherwise this file is useless to new users.
| "STRING": { | ||
| # Our latest STRING files are v12: datasets use 'latest' | ||
| # when they intend to use the most up-to-date STRING file. | ||
| "latest": "v12", |
There was a problem hiding this comment.
latest just overcomplicates things for an already super complicated system.
| """ | ||
| Downloads this `CacheItem` to the desired `output`, | ||
| comparing the `cached` file to the `pinned` and `unpinned` files, | ||
| warning when `cached` doesn't match `unpinned`, and erroring when |
There was a problem hiding this comment.
| warning when `cached` doesn't match `unpinned`, and erroring when | |
| warning when `cached` doesn't match `unpinned`, and warning when |
| g: 0 | ||
|
|
||
| datasets: | ||
| - label: scoresegfr_string |
There was a problem hiding this comment.
| - label: scoresegfr_string | |
| - label: egfr |
|
We should keep latest. It's two lines of code for a nice abstraction. |
|
Updating latest to a different version would change all downstream data that references it. To keep data static, I am removing latest. |
We bundle EGFR along with the rest of the caching infrastructure. Notes:
cache/README.md.pra.yamlfor now, as the only PRAs are the synthetic data and the ResponseNet data, and soon the DepMap data.CONTRIBUTING.mdfile is in Changes to CONTRIBUTING guide #57.directory.pycontains unnecessary files from other datasets that were deemed universal.