dataset: synthetic from PANTHER by tristan-f-r · Pull Request #71 · Reed-CompBio/spras-benchmarking

tristan-f-r · 2026-03-18T05:41:33Z

This is a draft since we don't provide any configs linking to any specific data.

Blocked by feat: scaffolding, caching, EGFR #65.
Blocked by feat: web #73

not needed just yet

ntalluri

Partly reviewed.

ntalluri · 2026-03-25T18:03:24Z

Can you add two configs specific to the PANTHER pathways, showing how we are using them for the computational performance and pathway accuracy/algorithm similarity assessments?

We might want to consider dataset categories. Separate configs for each dataset would kill parallelism.

ntalluri · 2026-03-25T18:04:25Z

Should be removed by #65.

ntalluri · 2026-03-25T18:06:06Z

+
+def main():
+    pathways_df = parse_pc_pathways(current_directory / "raw" / "pathways.txt")
+    print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]")


Suggested change

print("Fetching pathways... [This may take some time. On the author's desktop machine, it took 15 minutes.]")

print("Fetching pathways... [This may take some time; around 15 minutes.]")

ntalluri · 2026-03-25T18:06:51Z

Can you add an overview comment on what is happening in this code?

ntalluri · 2026-03-25T18:55:19Z

+    human_receptors = human_receptors[["NODE", "uniprot"]]
+    human_receptors.to_csv(folder / "sources.txt", sep="\t", index=False)
+
+    # Finally, scores


Suggested change

# Finally, scores

# Finally, scores and actives

ntalluri · 2026-03-25T18:55:28Z

+
+    # Finally, scores
+    scores = pd.concat([human_tfs, human_receptors]).drop_duplicates()
+    scores["prizes"] = 1


Suggested change

scores["prizes"] = 1

scores["prizes"] = 1.0
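
To illustrate why the literal matters, here is a minimal sketch (not the PR's code) of how pandas treats the two assignments when the frame is written out as tab-separated text. The node identifiers and the `to_tsv` helper are hypothetical and exist only for this example.

```python
import io

import pandas as pd

# Hypothetical nodes, used only for illustration.
nodes = pd.DataFrame({"NODE": ["P00533", "P04637"]})

# Assigning an int literal produces an integer column...
int_scores = nodes.copy()
int_scores["prizes"] = 1

# ...while a float literal produces a float column.
float_scores = nodes.copy()
float_scores["prizes"] = 1.0


def to_tsv(df: pd.DataFrame) -> str:
    """Serialize a frame the same way the diff does (tab-separated, no index)."""
    buffer = io.StringIO()
    df.to_csv(buffer, sep="\t", index=False)
    return buffer.getvalue()


int_tsv = to_tsv(int_scores)
float_tsv = to_tsv(float_scores)
```

With the integer literal the prize column is serialized as `1`; with the float literal it is serialized as `1.0`, which matches the "1.0 prizes" wording used in the comments elsewhere in this diff.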

ntalluri · 2026-03-25T18:55:56Z

+    # Then, we need to get the sources and targets, save them,
+    # and mark them with 1.0 prizes:
+
+    # First, for our targets, or transcription factors


Suggested change

# First, for our targets, or transcription factors

# First, for our targets (transcription factors)

ntalluri · 2026-03-25T18:56:25Z

+    human_tfs = human_tfs[["NODE", "uniprot"]]
+    human_tfs.to_csv(folder / "targets.txt", sep="\t", index=False)
+
+    # Then, for our receptors. NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this.


Suggested change

# Then, for our receptors. NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this.

# Then, for our receptors (surfaceomes). NOTE: we skip the first row since it's empty in the XLSX, so this might break if the surfaceome authors fix this.
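
One possible way to make the skipped-row assumption less fragile is sketched below. This is only an illustration, not what the PR does: it assumes the sheet has been loaded without a header (for example with `header=None`), and the helper name `promote_first_nonempty_row` and the sample rows are hypothetical.

```python
import pandas as pd


def promote_first_nonempty_row(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first row that contains any value as the header.

    Fully empty rows above it are dropped, so the result is the same
    whether or not the leading blank row is present in the source sheet.
    """
    non_empty = raw.notna().any(axis=1).to_numpy()
    if not non_empty.any():
        raise ValueError("no non-empty rows found")
    first = int(non_empty.argmax())  # position of the first non-empty row
    table = raw.iloc[first + 1 :].reset_index(drop=True)
    table.columns = raw.iloc[first].tolist()
    return table


# Hypothetical sheets: one with the blank first row, one where it has been fixed.
with_blank_row = pd.DataFrame([[None, None], ["NODE", "uniprot"], ["EGFR", "P00533"]])
without_blank_row = pd.DataFrame([["NODE", "uniprot"], ["EGFR", "P00533"]])

fixed_a = promote_first_nonempty_row(with_blank_row)
fixed_b = promote_first_nonempty_row(without_blank_row)
```

Both inputs yield the same table, so the script would keep working if the surfaceome authors remove the empty row.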

ntalluri · 2026-04-16T15:21:20Z

Why is this trimming to what is in the gold standard? My plan was to trim all data to what is available in the interactome.
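
For reference, trimming to the interactome could look roughly like the sketch below. It assumes the node files carry a `NODE` column and the interactome carries `protein1`/`protein2` columns, as in the diffs in this thread; the function name `trim_to_interactome` and the sample identifiers are hypothetical.

```python
import pandas as pd


def trim_to_interactome(
    nodes: pd.DataFrame, interactome: pd.DataFrame, node_column: str = "NODE"
) -> pd.DataFrame:
    """Keep only the rows whose node appears as an endpoint of some interactome edge."""
    interactome_nodes = set(interactome["protein1"]) | set(interactome["protein2"])
    kept = nodes[nodes[node_column].isin(interactome_nodes)]
    return kept.reset_index(drop=True)


# Hypothetical data, used only for illustration.
interactome = pd.DataFrame(
    {"protein1": ["A", "B"], "protein2": ["B", "C"], "combined_score": [900, 450]}
)
sources = pd.DataFrame({"NODE": ["A", "C", "Z"]})

trimmed_sources = trim_to_interactome(sources, interactome)
```

Here `Z` is dropped because it never appears in the interactome, regardless of whether it is in the gold standard.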

ntalluri · 2026-05-05T17:12:13Z

+    interactome_df["protein1"] = interactome_df["protein1"].astype(str).str.removeprefix("9606.")
+    interactome_df["protein2"] = interactome_df["protein2"].astype(str).str.removeprefix("9606.")
+    # Since this is links.full vs links, we need to restrict to a subset of headers before saving the interactome.
+    interactome_df = interactome_df[["protein1", "protein2", "combined_score"]]


I don't think this is supposed to be combined_score; however, I don't know where the original work is to double-check this.

ntalluri · 2026-05-05T17:21:57Z

+    # Convert the interactome to SPRAS format
+    print("Reading interactome...")
+    interactome_df = pandas.read_csv(
+        current_directory / ".." / "raw" / "9606.protein.links.full.v12.0.txt", sep=" ", usecols=["protein1", "protein2", "combined_score"]


This needs to be switched back to experiments >= 1; look at the original work.
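
A rough sketch of what filtering on `experiments >= 1` might look like is given below. It is an assumption about the intended behaviour, not the original work's code: the inline `SAMPLE` rows are made up, the function name `load_experimental_links` is hypothetical, and which column should ultimately be saved as the edge weight is left open, as discussed above.

```python
import io

import pandas as pd

# Made-up rows in the space-separated layout of the STRING links.full file.
SAMPLE = """protein1 protein2 experiments combined_score
9606.ENSP00000000233 9606.ENSP00000356607 0 173
9606.ENSP00000000233 9606.ENSP00000427567 128 154
9606.ENSP00000000412 9606.ENSP00000349467 1 512
"""


def load_experimental_links(source) -> pd.DataFrame:
    """Read STRING links and keep only edges with experimental evidence."""
    df = pd.read_csv(
        source,
        sep=" ",
        usecols=["protein1", "protein2", "experiments", "combined_score"],
    )
    # Keep only edges with a nonzero experiments score.
    df = df[df["experiments"] >= 1].copy()
    # Strip the human taxon prefix, as the diff above does.
    for column in ("protein1", "protein2"):
        df[column] = df[column].astype(str).str.removeprefix("9606.")
    return df.reset_index(drop=True)


links = load_experimental_links(io.StringIO(SAMPLE))
```

In the real script, `source` would be the path to `9606.protein.links.full.v12.0.txt` rather than an in-memory buffer.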

ntalluri · 2026-05-08T18:23:33Z

This PR is missing the actual sampling tool to make the thresholded interactomes and the code to upsample all the panther pathways into one dataset.

ntalluri · 2026-05-08T19:20:06Z

The sampling code got lost: sample.py was in the git history.

Once #65 gets merged, we will add this code back in.

tristan-f-r added 6 commits March 18, 2026 03:16

chore: drop other datasets

b49439e

Merge branch 'main' into egfr-and-infrastructure

2018a13

chore: re-include

136e5ff

chore: drop tools

472468d

not needed just yet

chore: re-add tools

a5de971

dataset: synthetic data

50fa813

tristan-f-r added the dataset ("Mutating datasets in any way.") and blocked-by-other-pr ("For PRs that depend on other PRs.") labels Mar 18, 2026

tristan-f-r marked this pull request as draft March 18, 2026 05:42

tristan-f-r mentioned this pull request Mar 18, 2026

feat: scaffolding, caching, EGFR #65

Open

ntalluri reviewed Mar 19, 2026

Comment thread datasets/egfr/README.md

ntalluri reviewed Mar 26, 2026

fix(tools): re-introduce trim.py

d366409

ntalluri reviewed Apr 16, 2026

ntalluri reviewed May 5, 2026

Comment thread datasets/README.md

ntalluri reviewed May 5, 2026
