Skip to content

dataset: DISEASES#66

Open
tristan-f-r wants to merge 17 commits into
mainfrom
diseases-dataset
Open

dataset: DISEASES#66
tristan-f-r wants to merge 17 commits into
mainfrom
diseases-dataset

Conversation

@tristan-f-r
Copy link
Copy Markdown
Contributor

@tristan-f-r tristan-f-r added dataset Mutating datasets in any way. blocked-by-other-pr For PRs that depend on other PRs. labels Mar 18, 2026
Comment thread datasets/diseases/README.md
annaritz added 2 commits April 2, 2026 11:33
…andard script and something we need to address in the files script. Will add more to the review.
Copy link
Copy Markdown
Contributor

@annaritz annaritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment regarding caching - will save the others for #65.

Working through some namespace mapping issues - I added some code and comments on the scripts/ directory files.

Comment thread cache/README.md
Comment thread configs/dmmm.yaml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly question - why dmmm? Disease module mining ___? Is this noted somewhere?

Copy link
Copy Markdown
Contributor Author

@tristan-f-r tristan-f-r Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a silly question - Neha pointed this out in #65, so this file was changed to scores.yaml instead. (I didn't keep this branch updated.)

GS_string_df = GS_combined_threshold.merge(string_aliases, on="ENSP", how="inner")
GS_string_df = GS_string_df.drop_duplicates(subset=["ENSG", "ENSP", "geneName", "diseaseID", "diseaseName"])

## THIS HAS A MAJOR ISSUE
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this comment - It's looking like mapping from ENSG to ENSP loses many proteins, so many that diseases that used to have >10 high-confidence genes now have 0 or 1. Proposed approach to check this:

  1. Confirm that the print statements below are correctly reporting the number of genes pre-mapping and post-mapping for each disease that passes our filters above.
  2. Investigate the mapping and figure out why we are losing so many genes.
  3. Once mapping is fixed, plot the distribution of gene set sizes to identify a better GENE_SET_SIZE_MINIMUM value (currently 10).

Copy link
Copy Markdown
Contributor Author

@tristan-f-r tristan-f-r Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mapping was not the issue: printing out the GS_combined_threshold directly notes that we lose very few genes with STRING, and the majority of the loss is coming from our choice of CONFIDENCE_SCORE_MINIMUM.

[I also checked against the old DISEASES branch to make sure that the choice of BioMart over gProfiler for mapping from ENSP to ENSG was not the issue, and it isn't. However, reading through this again, it's unclear to me both now and originally why we even map to ENSG in the first place? Dropping it and only doing the ENSP -> String ENSP mapping seems to include more genes than usual, presumably because the drop_duplicates call on the ENSG column is important but lacks documentation.]

Comment thread datasets/diseases/scripts/inputs.py Outdated
Comment thread datasets/diseases/scripts/inputs.py Outdated
'''
Generates input node set files based on TIGA trait-gene associations and Disease Ontology
annotations.
TODO: not all of these input node set files are necessary for benchmarking; only those with
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this comment - we are actually doing more than we need here by saving all possible input node set files, and then filtering out the ones we do not consider in the benchmarking data collection. It's okay for now, but not the pipeline described in the README.

Comment thread datasets/diseases/scripts/files.py Outdated

# Filter the SNP dataset for genes in the disease set.

# UNRESOLVED ISSUE:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this comment - we should not be re-using the hard-coded 10 threshold in this script. At this point, we should take the gold standard file, the input GWAS node files, and generate the standardized outputs by cross-referencing them. No other parameters should be used.

tristan-f-r and others added 4 commits April 9, 2026 03:24
Specifically:
- Move mentions of fetch.py to ../Snakefile
- Clarify some variable names
- Note that the STRING id mapping is not the issue
# Get the BioMart ENSP -> ENSG mapping
biomart_data = pd.read_csv(diseases_path / "raw" / "ensg-ensp.tsv", sep="\t", names=["ENSP", "ENSG"])

# The DISEASES data is in the ENSP namespace, but we want to work in ENSG.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we convert from ENSP to ENSG, do filtering and then convert back from ENSG to ENSP. Why do we do this? Why not keep it in ENSP and convert whatever is in ENSG to ENSP?

# Threshold based on GENE_SET_SIZE_MINIMUM
GS_score_group = GS_ids_df.groupby("diseaseName")
# Threshold the high-confidence gene-gene pairs based on GENE_SET_SIZE_MINIMUM
GS_score_group = GS_ids_high_confidence.groupby("diseaseName")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis was where the bug was. The old code:

 GS_score_group = GS_ids_df.groupby("diseaseName")

It was using the wrong dataframe (GS_ids_df instead of GS_ids_score_threshold). I have renamed the variabes and plan to refactor to make this more apparent.

Comment thread datasets/diseases/scripts/files.py Outdated

# Identify the diseases that are in the gold standard and the inputs.
GS_string_df = GS_string_df[GS_string_df["diseaseID"].isin(tiga_string_df["id"])]
GS_combined_group = GS_string_df.groupby("diseaseName")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found another issue - the Disease Name in the gold standard is capitalized different than the "trait" in the inputs. I don't think it affects the downstream results but I noticed this when printing some sanity checks. We should be using the Disease Ontology ID to avoid this.

For this reason, as well as other reasons, I plan to clean up & refactor this code so it's easier to understand.

Copy link
Copy Markdown
Contributor

@annaritz annaritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major refactor of this dataset. The README.md includes an updated workflow figure, a description of each function, and the outputs produced by the function.

Image

Major changes:

  • gold_standard.py: there is no mapping from ENSP -> ENSG -> ENSP (no longer requires ensg-ensp.tsv file). Code is refactored to cleanly filter low-confidence disease-gene pairs and diseases with only a few high-confidence genes. Writes file to processed/ directory.
  • trait_gene_assoc.py (formerly inputs.py): uses EBI's Ontology xRef Service (OxO) API to map TIGA traits (EFO/MONDO) to DOID, as described by the DISEASES2 paper. Note that OBA traits are _not_mapped since they are missing from OxO. No longer requires HumanDO.tsv and HumanDO.tsv.metadata. Also requires the gold standard file and only retains mapped DOID diseases that are present in the gold standard (we would ignore all others). Writes file to processed/ directory.
  • prepare_inputs.py (formerly files.py): Both files above have been mapped to the DOID disease namespace and the ENSP gene/protein namespace. For each disease, write a separate prizes file and gold standard (GS) file. Files are of the form <DOID_DiseaseName>, with disease name in lower case. Store them in prize_files/ and GS_files/ respectively.

@annaritz
Copy link
Copy Markdown
Contributor

annaritz commented May 3, 2026

QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process?

@annaritz
Copy link
Copy Markdown
Contributor

annaritz commented May 3, 2026

OUTSTANDING TODO: The Snakemake file is up-to-date, but there are three files that are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but (a) are these also somewhere else? This is another use case for local vs. global files - how do I know whether these files are being used by other data collections? How do I remove existing files from outdated pipelines? Etc.

@annaritz
Copy link
Copy Markdown
Contributor

annaritz commented May 3, 2026

QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

@annaritz
Copy link
Copy Markdown
Contributor

annaritz commented May 3, 2026

OUTSTANDING TODO: The filtered DISEASES files are a fraction of the full files. From the DISEASES download website: "The full files contain all links in the DISEASES database. The filtered files contain only the non-redundant associations that are shown within the web interface when querying for a gene."

We should re-run with the full dataset to ensure that we're not missing diseases. If we do capture more candidate diseases than those with the filtered files, then the filtered files should be swapped with the full files.

@ntalluri
Copy link
Copy Markdown
Collaborator

ntalluri commented May 4, 2026

QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process?

We should be trimming the gold standard to be the ones in the interactome as well. I think there is code for that in #65.

@ntalluri
Copy link
Copy Markdown
Collaborator

ntalluri commented May 4, 2026

QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked.

@annaritz
Copy link
Copy Markdown
Contributor

annaritz commented May 4, 2026

QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked.

We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

@tristan-f-r
Copy link
Copy Markdown
Contributor Author

OUTSTANDING TODO: The Snakemake file is up-to-date, but there are three files that are no longer needed (ensg-ensp.tsv, HumanDO.tsv, and HumanDO.tsv.metadata). I removed them from the Snakemake fetch() commands, but (a) are these also somewhere else? This is another use case for local vs. global files - how do I know whether these files are being used by other data collections? How do I remove existing files from outdated pipelines? Etc.

I talk about this in more detail in a comment under #65, but for local files, you can remove them outright. For global files, that is harder to track, but one can quickly search if they are being used by using the (for lack of a better word, 'query tuple') ("BioMart", "ensg-ensp.tsv") across the codebase for any other uses of it.

QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files.

Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked.

We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file."

We'll have to use a Snakemake checkpoint. I can add this, but I'm also wary of breaking anything in this pipeline, so as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive, and make sure that they don't change with my new changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

blocked-by-other-pr For PRs that depend on other PRs. dataset Mutating datasets in any way.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants