dataset: DISEASES by tristan-f-r · Pull Request #66 · Reed-CompBio/spras-benchmarking

tristan-f-r · 2026-03-18T05:07:48Z

From #39.

Depends on feat: scaffolding, caching, EGFR #65.

annaritz

I left one comment regarding caching - will save the others for #65.

Working through some namespace mapping issues - I added some code and comments on the scripts/ directory files.

annaritz · 2026-04-02T18:36:41Z

Silly question - why dmmm? Disease module mining ___? Is this noted somewhere?

Not a silly question - Neha pointed this out in #65, so this file was changed to scores.yaml instead. (I didn't keep this branch updated.)

annaritz · 2026-04-02T18:40:17Z

+    GS_string_df = GS_combined_threshold.merge(string_aliases, on="ENSP", how="inner")
+    GS_string_df = GS_string_df.drop_duplicates(subset=["ENSG", "ENSP", "geneName", "diseaseID", "diseaseName"])
+
+    ## THIS HAS A MAJOR ISSUE


Flagging this comment - It's looking like mapping from ENSG to ENSP loses many proteins, so many that diseases that used to have >10 high-confidence genes now have 0 or 1. Proposed approach to check this:

1. Confirm that the print statements below are correctly reporting the number of genes pre-mapping and post-mapping for each disease that passes our filters above.

2. Investigate the mapping and figure out why we are losing so many genes.

3. Once mapping is fixed, plot the distribution of gene set sizes to identify a better GENE_SET_SIZE_MINIMUM value (currently 10).
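As a complement to plotting, the choice of GENE_SET_SIZE_MINIMUM could also be explored by tabulating how many diseases would survive each candidate value. This is only a minimal sketch in plain Python: `survivors_by_minimum` and the toy sizes are hypothetical and not part of this branch, and the inclusive `>=` comparison is an assumption about how the minimum is applied.

```python
def survivors_by_minimum(gene_set_sizes, candidate_minimums):
    """Map each candidate minimum to the number of diseases that would be kept.

    gene_set_sizes: dict of disease name -> number of high-confidence genes.
    Assumption: a disease is kept when its size is >= the minimum.
    """
    return {
        minimum: sum(1 for size in gene_set_sizes.values() if size >= minimum)
        for minimum in candidate_minimums
    }


# Toy sizes for illustration only; real values would come from the gold standard.
sizes = {"disease A": 12, "disease B": 3, "disease C": 10, "disease D": 1}
for minimum, kept in survivors_by_minimum(sizes, [1, 5, 10, 15]).items():
    print(f"GENE_SET_SIZE_MINIMUM={minimum}: {kept} diseases kept")
```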

The mapping was not the issue: printing GS_combined_threshold directly shows that we lose very few genes in the STRING mapping, and most of the loss comes from our choice of CONFIDENCE_SCORE_MINIMUM.

[I also checked against the old DISEASES branch to confirm that choosing BioMart over gProfiler for the ENSP to ENSG mapping was not the issue, and it isn't. However, rereading this, it is unclear to me, both now and when it was first written, why we map to ENSG at all. Dropping that step and doing only the ENSP -> STRING ENSP mapping seems to include more genes than usual, presumably because the drop_duplicates call on the ENSG column matters but is undocumented.]
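To separate the effect of the score threshold from the effect of the STRING mapping, the per-disease gene counts can be reported after each stage. The sketch below uses plain Python tuples in place of the DataFrames; `loss_report`, the `(disease, gene, score)` layout, and the `>=` comparison are assumptions made for illustration, and 4 is simply the threshold value mentioned in the commit log.

```python
from collections import defaultdict

CONFIDENCE_SCORE_MINIMUM = 4  # value mentioned in the commit log; adjust as needed


def genes_per_disease(pairs):
    """Count distinct genes per disease from (disease, gene, score) tuples."""
    genes = defaultdict(set)
    for disease, gene, _score in pairs:
        genes[disease].add(gene)
    return {disease: len(found) for disease, found in genes.items()}


def loss_report(pairs, string_ids, min_score=CONFIDENCE_SCORE_MINIMUM):
    """Return disease -> (raw, after score filter, after STRING filter) counts."""
    after_score = [p for p in pairs if p[2] >= min_score]
    after_string = [p for p in after_score if p[1] in string_ids]
    stages = {"raw": pairs, "score": after_score, "string": after_string}
    counts = {name: genes_per_disease(rows) for name, rows in stages.items()}
    return {
        disease: tuple(counts[stage].get(disease, 0) for stage in stages)
        for disease in counts["raw"]
    }


# Toy data: gene "b" is lost to the score filter, gene "c" to the STRING filter.
toy = [("d1", "a", 5), ("d1", "b", 2), ("d1", "c", 4), ("d2", "x", 1)]
for disease, (raw, scored, mapped) in loss_report(toy, {"a", "x"}).items():
    print(f"{disease}: raw={raw} after_score={scored} after_string={mapped}")
```

A large drop between the first two numbers points at the threshold, and a drop between the last two points at the mapping.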

annaritz · 2026-04-02T18:42:44Z

+'''
+Generates input node set files based on TIGA trait-gene associations and Disease Ontology
+annotations. 
+TODO: not all of these input node set files are necessary for benchmarking; only those with


Flagging this comment - we are actually doing more than we need here by saving all possible input node set files and then filtering out the ones we do not consider in the benchmarking data collection. It's okay for now, but it is not the pipeline described in the README.

annaritz · 2026-04-02T18:43:44Z

+
+    # Filter the SNP dataset for genes in the disease set.
+
+    # UNRESOLVED ISSUE:


Flagging this comment - we should not be re-using the hard-coded 10 threshold in this script. At this point, we should take the gold standard file, the input GWAS node files, and generate the standardized outputs by cross-referencing them. No other parameters should be used.

annaritz · 2026-04-29T18:31:38Z

+    # Get the BioMart ENSP -> ENSG mapping
+    biomart_data = pd.read_csv(diseases_path / "raw" / "ensg-ensp.tsv", sep="\t", names=["ENSP", "ENSG"])
+
+    # The DISEASES data is in the ENSP namespace, but we want to work in ENSG.


It seems like we convert from ENSP to ENSG, do filtering and then convert back from ENSG to ENSP. Why do we do this? Why not keep it in ENSP and convert whatever is in ENSG to ENSP?

annaritz · 2026-04-30T00:11:14Z

-    # Threshold based on GENE_SET_SIZE_MINIMUM
-    GS_score_group = GS_ids_df.groupby("diseaseName")
+    # Threshold the high-confidence gene-gene pairs based on GENE_SET_SIZE_MINIMUM
+    GS_score_group = GS_ids_high_confidence.groupby("diseaseName")


This is where the bug was. The old code:

GS_score_group = GS_ids_df.groupby("diseaseName")

It was using the wrong dataframe (GS_ids_df instead of GS_ids_score_threshold). I have renamed the variables and plan to refactor to make this more apparent.
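The intended order of operations (filter by score first, then group and apply the size minimum to the filtered pairs) can be sketched as follows. This is a plain-Python illustration rather than the pandas code in gold_standard.py; `select_diseases` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def select_diseases(scored_pairs, min_score, min_genes):
    """Keep diseases with at least min_genes high-confidence genes.

    scored_pairs: iterable of (disease, gene, score) tuples.
    """
    high_confidence = [(d, g) for d, g, s in scored_pairs if s >= min_score]

    # Group the *filtered* pairs. Grouping the unfiltered pairs at this point
    # would count low-confidence genes toward the size minimum.
    gene_sets = defaultdict(set)
    for disease, gene in high_confidence:
        gene_sets[disease].add(gene)

    return {d: genes for d, genes in gene_sets.items() if len(genes) >= min_genes}


# "d1" has three genes overall but only one above the score threshold.
toy = [("d1", "a", 5), ("d1", "b", 1), ("d1", "c", 1), ("d2", "x", 5), ("d2", "y", 4)]
print(sorted(select_diseases(toy, min_score=4, min_genes=2)))  # ['d2']
```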

annaritz · 2026-04-30T00:13:03Z

+
+    # Identify the diseases that are in the gold standard and the inputs.
+    GS_string_df = GS_string_df[GS_string_df["diseaseID"].isin(tiga_string_df["id"])]
+    GS_combined_group = GS_string_df.groupby("diseaseName")


I have found another issue - the disease name in the gold standard is capitalized differently from the "trait" in the inputs. I don't think it affects the downstream results, but I noticed this when printing some sanity checks. We should be using the Disease Ontology ID to avoid this.

For this reason, as well as other reasons, I plan to clean up & refactor this code so it's easier to understand.
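A small illustration of why matching on the Disease Ontology ID is more robust than matching on the name: the column names `diseaseID`, `diseaseName`, `id`, and `trait` follow the thread, while `shared_disease_ids` and the placeholder ID are hypothetical.

```python
def shared_disease_ids(gold_standard_rows, trait_rows):
    """Return the disease IDs present in both the gold standard and the inputs."""
    gold_ids = {row["diseaseID"] for row in gold_standard_rows}
    trait_ids = {row["id"] for row in trait_rows}
    return gold_ids & trait_ids


# Placeholder ID for illustration; not a real Disease Ontology entry.
gold = [{"diseaseID": "DOID:0000001", "diseaseName": "Asthma"}]
traits = [{"id": "DOID:0000001", "trait": "asthma"}]

# Matching on names silently misses the disease because of capitalization.
print({r["diseaseName"] for r in gold} & {r["trait"] for r in traits})  # set()
# Matching on IDs does not depend on how the name is written.
print(shared_disease_ids(gold, traits))  # {'DOID:0000001'}
```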

annaritz

Major refactor of this dataset. The README.md includes an updated workflow figure, a description of each function, and the outputs produced by the function.

Major changes:

- gold_standard.py: there is no mapping from ENSP -> ENSG -> ENSP (no longer requires ensg-ensp.tsv file). Code is refactored to cleanly filter low-confidence disease-gene pairs and diseases with only a few high-confidence genes. Writes file to processed/ directory.
- trait_gene_assoc.py (formerly inputs.py): uses EBI's Ontology xRef Service (OxO) API to map TIGA traits (EFO/MONDO) to DOID, as described by the DISEASES2 paper. Note that OBA traits are _not_ mapped since they are missing from OxO. No longer requires HumanDO.tsv and HumanDO.tsv.metadata. Also requires the gold standard file and only retains mapped DOID diseases that are present in the gold standard (we would ignore all others). Writes file to processed/ directory.
- prepare_inputs.py (formerly files.py): Both files above have been mapped to the DOID disease namespace and the ENSP gene/protein namespace. For each disease, write a separate prizes file and gold standard (GS) file. Files are of the form <DOID_DiseaseName>, with disease name in lower case. Store them in prize_files/ and GS_files/ respectively.
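The per-disease output step could look roughly like the sketch below. Only the `<DOID_DiseaseName>` pattern, the lower-cased name, and the prize_files/ and GS_files/ directories come from the description above; the helper names, the replacement of file-unsafe characters with underscores, the `.txt` extension, and the tab-separated layout are all assumptions.

```python
import re
from pathlib import Path


def disease_file_stem(doid, disease_name):
    """Build a <DOID_DiseaseName> stem with the disease name in lower case.

    Assumption: characters that are awkward in file names become underscores.
    """
    raw = f"{doid}_{disease_name.lower()}"
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", raw).strip("_")


def write_disease_files(out_dir, doid, disease_name, prizes, gold_genes):
    """Write one prizes file and one gold standard file for a single disease.

    prizes: dict of ENSP ID -> prize value; gold_genes: iterable of ENSP IDs.
    """
    stem = disease_file_stem(doid, disease_name)
    prize_path = Path(out_dir) / "prize_files" / f"{stem}.txt"
    gs_path = Path(out_dir) / "GS_files" / f"{stem}.txt"
    for path in (prize_path, gs_path):
        path.parent.mkdir(parents=True, exist_ok=True)

    prize_lines = [f"{gene}\t{prize}\n" for gene, prize in sorted(prizes.items())]
    prize_path.write_text("".join(prize_lines))
    gs_path.write_text("".join(f"{gene}\n" for gene in sorted(gold_genes)))
    return prize_path, gs_path
```

Sorting the rows keeps the output deterministic, which makes it easier to check that a later change to the pipeline leaves the files untouched.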

annaritz · 2026-05-03T18:24:57Z

QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process?

annaritz · 2026-05-03T18:26:22Z

OUTSTANDING TODO: The Snakemake file is up-to-date, but three files are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but are they also referenced somewhere else? This is another use case for local vs. global files: how do I know whether these files are being used by other data collections? How do I remove existing files from outdated pipelines? Etc.

annaritz · 2026-05-03T18:28:30Z

QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

annaritz · 2026-05-03T18:30:43Z

OUTSTANDING TODO: The filtered DISEASES files are a fraction of the full files. From the DISEASES download website: "The full files contain all links in the DISEASES database. The filtered files contain only the non-redundant associations that are shown within the web interface when querying for a gene."

We should re-run with the full dataset to ensure that we're not missing diseases. If we do capture more candidate diseases than those with the filtered files, then the filtered files should be swapped with the full files.

ntalluri · 2026-05-04T14:11:05Z

> QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process?

We should be trimming the gold standard to be the ones in the interactome as well. I think there is code for that in #65.
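Trimming the gold standard to the interactome amounts to a set intersection per disease. The sketch below is not the code referenced in #65; `interactome_nodes`, `trim_gold_standard`, and the edge-list representation are hypothetical.

```python
def interactome_nodes(edges):
    """Collect every node that appears in an iterable of (source, target) edges."""
    return {node for source, target in edges for node in (source, target)}


def trim_gold_standard(gold_standard, nodes):
    """Keep only gold standard genes that are present in the interactome.

    gold_standard: dict of disease ID -> iterable of ENSP IDs.
    """
    return {disease: set(genes) & nodes for disease, genes in gold_standard.items()}


# Toy interactome and gold standard for illustration.
nodes = interactome_nodes([("ENSP1", "ENSP2"), ("ENSP2", "ENSP3")])
print(trim_gold_standard({"d1": {"ENSP1", "ENSP9"}}, nodes))  # {'d1': {'ENSP1'}}
```

Because trimming can shrink a gene set, a disease that passed the size minimum before trimming may fall below it afterwards.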

ntalluri · 2026-05-04T14:12:43Z

> QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

Yes, the Snakefile should create all of the 40-ish disease files, so it should also depend on all 40-ish of them, not the two arbitrarily chosen ones.

annaritz · 2026-05-04T14:42:23Z

> > QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.
>
> Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked.

We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

tristan-f-r · 2026-05-04T20:45:06Z

> OUTSTANDING TODO: The Snakemake file is up-to-date, but there are three files that are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but (a) are these also somewhere else? This is another use case for local vs. global files - how do I know whether these files are being used by other data collections? How do I remove existing files from outdated pipelines? Etc.

I talk about this in more detail in a comment under #65, but local files can be removed outright. Global files are harder to track, but one can quickly check whether they are still used by searching the codebase for the (for lack of a better term) 'query tuple' ("BioMart", "ensg-ensp.tsv").
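A rough way to run that search from Python, as an alternative to grep: list every Snakefile or script that mentions all parts of the tuple. `files_using` and the set of searched suffixes are assumptions, and a plain substring check can over-match, so any hit still needs to be read by a person.

```python
from pathlib import Path

# Assumption: these are the file types where a dataset file could be referenced.
SEARCHED_SUFFIXES = {".py", ".smk", ".yaml", ".yml"}


def files_using(root, terms):
    """Return files under root whose text contains every term."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name != "Snakefile" and path.suffix not in SEARCHED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if all(term in text for term in terms):
            hits.append(path)
    return hits


if __name__ == "__main__":
    for hit in files_using(".", ("BioMart", "ensg-ensp.tsv")):
        print(hit)
```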

> > > QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.
> >
> > Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked.
>
> We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

We'll have to use a Snakemake checkpoint. I can add this, but I'm also wary of breaking anything in this pipeline, so as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive, and make sure that they don't change with my new changes.
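Independently of where the baseline copy is stored, checking that the processed files do not change can be done by comparing checksums before and after the pipeline change. This is a minimal sketch using only the standard library; `manifest` and `diff_manifests` are hypothetical helpers, not part of the repository.

```python
import hashlib
from pathlib import Path


def manifest(directory):
    """Map each file's path (relative to directory) to its SHA-256 digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(before, after):
    """Report files that were added, removed, or changed between two manifests."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }
```

Taking `manifest("processed")` before the change and again afterwards, then passing both to `diff_manifests`, gives three empty lists when the outputs are byte-for-byte identical.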

tristan-f-r added 6 commits March 18, 2026 03:16

chore: drop other datasets

b49439e

Merge branch 'main' into egfr-and-infrastructure

2018a13

chore: re-include

136e5ff

chore: drop tools

472468d

not needed just yet

chore: re-add tools

a5de971

feat: diseases

aba68bd

tristan-f-r added the labels dataset ("Mutating datasets in any way.") and blocked-by-other-pr ("For PRs that depend on other PRs.") on Mar 18, 2026

tristan-f-r mentioned this pull request Mar 18, 2026

feat: scaffolding, caching, EGFR #65

Open

ntalluri reviewed Mar 19, 2026

Comment thread datasets/diseases/README.md

annaritz added 2 commits April 2, 2026 11:33

added comments to scripts and notes to README; found a bug in gold standard script and something we need to address in the files script. Will add more to the review.

faaafe6

forgot to add inputs.p

173cf76

annaritz reviewed Apr 2, 2026

tristan-f-r and others added 4 commits April 9, 2026 03:24

address some comments

65f7ca9

Specifically: - Move mentions of fetch.py to ../Snakefile - Clarify some variable names - Note that the STRING id mapping is not the issue

fix: decrease confidence score

afa9f99

From 4->3.

docs: cmt ensp/ensg

d608386

found bug in gold_standard.py; added print statements; reset score thres back to 4

d2064b5

annaritz reviewed Apr 30, 2026

annaritz added 5 commits April 29, 2026 17:27

added debugging print statements to files.py. Updated number of diseases to be 31.

2c32ec9

major refactor

7a79aef

updated Snakefile

32a92eb

removed the data/ directory - those files are now in processed/

aa518c2

added workflow fig

cf62f94

annaritz reviewed May 3, 2026
