dataset: DISEASES#66
Conversation
…andard script and something we need to address in the files script. Will add more to the review.
There was a problem hiding this comment.
Silly question - why dmmm? Disease module mining ___? Is this noted somewhere?
There was a problem hiding this comment.
Not a silly question - Neha pointed this out in #65, so this file was changed to scores.yaml instead. (I didn't keep this branch updated.)
| GS_string_df = GS_combined_threshold.merge(string_aliases, on="ENSP", how="inner") | ||
| GS_string_df = GS_string_df.drop_duplicates(subset=["ENSG", "ENSP", "geneName", "diseaseID", "diseaseName"]) | ||
|
|
||
| ## THIS HAS A MAJOR ISSUE |
There was a problem hiding this comment.
Flagging this comment - It's looking like mapping from ENSG to ENSP loses many proteins, so many that diseases that used to have >10 high-confidence genes now have 0 or 1. Proposed approach to check this:
- Confirm that the print statements below are correctly reporting the number of genes pre-mapping and post-mapping for each disease that passes our filters above.
- Investigate the mapping and figure out why we are losing so many genes.
- Once mapping is fixed, plot the distribution of gene set sizes to identify a better GENE_SET_SIZE_MINIMUM value (currently 10).
There was a problem hiding this comment.
The mapping was not the issue: printing out the GS_combined_threshold directly notes that we lose very few genes with STRING, and the majority of the loss is coming from our choice of CONFIDENCE_SCORE_MINIMUM.
[I also checked against the old DISEASES branch to make sure that the choice of BioMart over gProfiler for mapping from ENSP to ENSG was not the issue, and it isn't. However, reading through this again, it's unclear to me both now and originally why we even map to ENSG in the first place? Dropping it and only doing the ENSP -> String ENSP mapping seems to include more genes than usual, presumably because the drop_duplicates call on the ENSG column is important but lacks documentation.]
| ''' | ||
| Generates input node set files based on TIGA trait-gene associations and Disease Ontology | ||
| annotations. | ||
| TODO: not all of these input node set files are necessary for benchmarking; only those with |
There was a problem hiding this comment.
Flagging this comment - we are actually doing more than we need here by saving all possible input node set files, and then filtering out the ones we do not consider in the benchmarking data collection. It's okay for now, but not the pipeline described in the README.
|
|
||
| # Filter the SNP dataset for genes in the disease set. | ||
|
|
||
| # UNRESOLVED ISSUE: |
There was a problem hiding this comment.
Flagging this comment - we should not be re-using the hard-coded 10 threshold in this script. At this point, we should take the gold standard file, the input GWAS node files, and generate the standardized outputs by cross-referencing them. No other parameters should be used.
Specifically: - Move mentions of fetch.py to ../Snakefile - Clarify some variable names - Note that the STRING id mapping is not the issue
From 4->3.
| # Get the BioMart ENSP -> ENSG mapping | ||
| biomart_data = pd.read_csv(diseases_path / "raw" / "ensg-ensp.tsv", sep="\t", names=["ENSP", "ENSG"]) | ||
|
|
||
| # The DISEASES data is in the ENSP namespace, but we want to work in ENSG. |
There was a problem hiding this comment.
It seems like we convert from ENSP to ENSG, do filtering and then convert back from ENSG to ENSP. Why do we do this? Why not keep it in ENSP and convert whatever is in ENSG to ENSP?
| # Threshold based on GENE_SET_SIZE_MINIMUM | ||
| GS_score_group = GS_ids_df.groupby("diseaseName") | ||
| # Threshold the high-confidence gene-gene pairs based on GENE_SET_SIZE_MINIMUM | ||
| GS_score_group = GS_ids_high_confidence.groupby("diseaseName") |
There was a problem hiding this comment.
THis was where the bug was. The old code:
GS_score_group = GS_ids_df.groupby("diseaseName")
It was using the wrong dataframe (GS_ids_df instead of GS_ids_score_threshold). I have renamed the variabes and plan to refactor to make this more apparent.
|
|
||
| # Identify the diseases that are in the gold standard and the inputs. | ||
| GS_string_df = GS_string_df[GS_string_df["diseaseID"].isin(tiga_string_df["id"])] | ||
| GS_combined_group = GS_string_df.groupby("diseaseName") |
There was a problem hiding this comment.
I have found another issue - the Disease Name in the gold standard is capitalized different than the "trait" in the inputs. I don't think it affects the downstream results but I noticed this when printing some sanity checks. We should be using the Disease Ontology ID to avoid this.
For this reason, as well as other reasons, I plan to clean up & refactor this code so it's easier to understand.
annaritz
left a comment
There was a problem hiding this comment.
Major refactor of this dataset. The README.md includes an updated workflow figure, a description of each function, and the outputs produced by the function.
Major changes:
gold_standard.py: there is no mapping fromENSP -> ENSG -> ENSP(no longer requiresensg-ensp.tsvfile). Code is refactored to cleanly filter low-confidence disease-gene pairs and diseases with only a few high-confidence genes. Writes file toprocessed/directory.trait_gene_assoc.py(formerlyinputs.py): uses EBI's Ontology xRef Service (OxO) API to map TIGA traits (EFO/MONDO) to DOID, as described by the DISEASES2 paper. Note that OBA traits are _not_mapped since they are missing from OxO. No longer requiresHumanDO.tsvandHumanDO.tsv.metadata. Also requires the gold standard file and only retains mapped DOID diseases that are present in the gold standard (we would ignore all others). Writes file toprocessed/directory.prepare_inputs.py(formerlyfiles.py): Both files above have been mapped to the DOID disease namespace and the ENSP gene/protein namespace. For each disease, write a separate prizes file and gold standard (GS) file. Files are of the form<DOID_DiseaseName>, with disease name in lower case. Store them inprize_files/andGS_files/respectively.
|
QUESTION: We currently don't require that the gold standard disease-gene pairs have ENSP IDs in the interactome (STRING); we do require this for the TIGA GWAS inputs. Should we also ensure that all genes are in STRINGDB in the gold standard, or is that caught in a downstream process? |
|
OUTSTANDING TODO: The Snakemake file is up-to-date, but there are three files that are no longer needed ( |
|
QUESTION: The Snakefile currently only has two prize and GS files required (the two are arbitrarily chosen). Should those stand in for all 40-ish disease files? We don't want the Snakefile to be dependent on the outputs of the files their rules generate...this seems very circular. If one of those two files is missing, then it should be fine to regenerate all disease input files. |
|
OUTSTANDING TODO: The We should re-run with the full dataset to ensure that we're not missing diseases. If we do capture more candidate diseases than those with the filtered files, then the filtered files should be swapped with the full files. |
We should be trimming the gold standard to be the ones in the interactome as well. I think there is code for that in #65. |
Yes the snakefile should create all of the 40ish disease files. So it should also be dependent on all 40ish diseases not the 2 arbitrary ones picked. |
We'll have to think about how to do this. I don't think we can use snakemake to generate the files and then have the last rule be "re-generate this Snakemake file." |
I talk about this in more detail in a comment under #65, but for local files, you can remove them outright. For global files, that is harder to track, but one can quickly search if they are being used by using the (for lack of a better word, 'query tuple')
We'll have to use a Snakemake checkpoint. I can add this, but I'm also wary of breaking anything in this pipeline, so as @ntalluri suggested some time ago, I'll upload the processed files to Google Drive, and make sure that they don't change with my new changes. |
From #39.