dataset: synthetic from PANTHER#71
Can you add two configs specific to the PANTHER pathways, showing how we are using them for the computational performance and the pathway accuracy/algorithm similarity assessments?
We might want to consider dataset categories. Separate configs for each dataset would kill parallelism.
| def main(): | ||
| pathways_df = parse_pc_pathways(current_directory / "raw" / "pathways.txt") | ||
| print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]") |
| print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]") | |
| print("Fetching pathways... [This may take some time; around 15 minutes.]") |
Can you add an overview comment on what is happening in this code?
| human_receptors = human_receptors[["NODE", "uniprot"]] | ||
| human_receptors.to_csv(folder / "sources.txt", sep="\t", index=False) | ||
| # Finally, scores |
| # Finally, scores | |
| # Finally, scores and actives |
| # Finally, scores | ||
| scores = pd.concat([human_tfs, human_receptors]).drop_duplicates() | ||
| scores["prizes"] = 1 |
| scores["prizes"] = 1 | |
| scores["prizes"] = 1.0 |
| # Then, we need to get the sources and targets, save them, | ||
| # and mark them with 1.0 prizes: | ||
| # First, for our targets, or transcription factors |
| # First, for our targets, or transcription factors | |
| # First, for our targets (transcription factors) |
| human_tfs = human_tfs[["NODE", "uniprot"]] | ||
| human_tfs.to_csv(folder / "targets.txt", sep="\t", index=False) | ||
| # Then, for our receptors. NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this. |
| # Then, for our receptors. NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this. | |
| # Then, for our receptors (surfaceomes). NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this. |
Why is this trimming to what is in the gold standard? My plan was to trim all data to what is available in the interactome.
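To make the suggestion concrete, here is a minimal sketch of trimming a node table to what is present in the interactome. The helper name `trim_to_interactome`, the `node_column` parameter, and the toy data are all hypothetical and not taken from this PR. It also assumes the node table and the interactome use the same identifier namespace, which may not hold without a mapping step.

```python
import pandas


def trim_to_interactome(nodes_df, interactome_df, node_column="NODE"):
    """Keep only rows whose identifier appears as an endpoint of some interactome edge.

    `node_column` is an assumption: it must hold identifiers in the same
    namespace as the interactome's `protein1` / `protein2` columns.
    """
    interactome_nodes = set(interactome_df["protein1"]) | set(interactome_df["protein2"])
    return nodes_df[nodes_df[node_column].isin(interactome_nodes)].reset_index(drop=True)


# Toy data for illustration only.
nodes_df = pandas.DataFrame({"NODE": ["A", "B", "Z"], "uniprot": ["P1", "P2", "P3"]})
interactome_df = pandas.DataFrame({"protein1": ["A"], "protein2": ["B"], "combined_score": [900]})
trimmed_df = trim_to_interactome(nodes_df, interactome_df)
```

The same helper could be applied to the sources, targets, and gold standard nodes so that every file is trimmed against one reference.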
| interactome_df["protein1"] = interactome_df["protein1"].astype(str).str.removeprefix("9606.") | ||
| interactome_df["protein2"] = interactome_df["protein2"].astype(str).str.removeprefix("9606.") | ||
| # Since this is links.full vs links, we need to restrict to a subset of headers before saving the interactome. | ||
| interactome_df = interactome_df[["protein1", "protein2", "combined_score"]] |
I don't think this is supposed to be combined_score; however, I don't know where the original work is to double-check this.
| # Convert the interactome to SPRAS format | ||
| print("Reading interactome...") | ||
| interactome_df = pandas.read_csv( | ||
| current_directory / ".." / "raw" / "9606.protein.links.full.v12.0.txt", sep=" ", usecols=["protein1", "protein2", "combined_score"] |
This needs to be switched back to experiments >= 1; look at the original work.
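A minimal sketch of what that filter could look like, assuming the `experiments` column of the `links.full` file is read alongside the other columns (it would need to be added to `usecols`). The helper name `filter_by_experiments` and the toy rows are hypothetical; the threshold of 1 comes from the comment above and has not been checked against the original work.

```python
import io

import pandas


def filter_by_experiments(interactome_df, minimum=1):
    """Keep only edges whose `experiments` score is at least `minimum`."""
    kept = interactome_df[interactome_df["experiments"] >= minimum]
    return kept.reset_index(drop=True)


# Toy space-separated input standing in for the real links.full file.
sample = io.StringIO(
    "protein1 protein2 experiments combined_score\n"
    "9606.A 9606.B 0 900\n"
    "9606.A 9606.C 120 450\n"
)
sample_df = pandas.read_csv(sample, sep=" ")
filtered_df = filter_by_experiments(sample_df)
```

Filtering before the column subset is taken keeps `experiments` available for the threshold while still allowing it to be dropped before saving.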
This PR is missing the actual sampling tool to make the thresholded interactomes and the code to upsample all the panther pathways into one dataset. |
The sampling code got lost: sample.py was in the git history. Once #65 gets merged, we will add this code back in.
This is a draft since we don't provide any configs linking to any specific data.