Skip to content

Conversation

@sanjaynagi
Copy link
Collaborator

@sanjaynagi sanjaynagi commented Sep 5, 2025

This PR adds advanced haplotype clustering, allowing users to cut the tree and assign clusters, and to plot haplotype data (amino acid mutations by default) below the tree.

TODO

  • Add tests, notebook
  • Currently the code loads the SNP genotypes to get SNP effects, this is unnecessary and ideally we can load the SNP effects straight from the haplotypes

@sanjaynagi
Copy link
Collaborator Author

We now use the snp_effects() function... didnt know that existed, but makes sense

@sanjaynagi
Copy link
Collaborator Author

I also add grey lines to the amino acid plot of the diplotype clustering functions - it looks a lot neater.

And the haplotype function can now support multiple transcripts.

@ahernank
Copy link
Collaborator

This is super cool, @sanjaynagi. Thanks very much. I am struggling with bandwidth this week but have asked @leehart and @jonbrenas to take a look so we can merge asap. :)

@sanjaynagi
Copy link
Collaborator Author

Awesome, thanks @ahernank

@leehart
Copy link
Collaborator

leehart commented Sep 18, 2025

Just adding a corresponding issue for this #819

@leehart
Copy link
Collaborator

leehart commented Sep 19, 2025

Many thanks @sanjaynagi - This is great!

Pylint is telling me there's a potential glitch around the following code in the transcript_haplotypes function in the AnophelesHapClustAnalysis class:

        # Get SNP genotype allele counts for the transcript, applying snp_query
        df_eff = (
            self.snp_effects(
                transcript=transcript,
            )
            .query(snp_query)
            .reset_index(drop=True)
        )

Raises problem:

Instance of 'AnophelesHapClustAnalysis' has no 'snp_effects' member

I gather the issue is that AnophelesHapClustAnalysis inherits from AnophelesHapData and AnophelesSnpData, which both inherit from (AnophelesSampleMetadata, AnophelesGenomeFeaturesData, AnophelesGenomeSequenceData) but none of those have snp_effects, which is in AnophelesSnpFrequencyAnalysis instead.

class AnophelesSnpFrequencyAnalysis(AnophelesSnpData, AnophelesFrequencyAnalysis):

Obviously all the tests pass, and the examples in the notebooks appear to work, so I don't quite understand why something like an AttributeError is not already being raised somewhere.

We should probably check to see whether this warning from Pylint is right, anyhow. We might need to modify the inheritance chain, e.g. adding AnophelesSnpFrequencyAnalysis to AnophelesHapClustAnalysis, or move snp_effects into AnophelesSnpData, or maybe there's a different way to call snp_effects, etc.


# Get SNP genotype allele counts for the transcript, applying snp_query
df_eff = (
self.snp_effects(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pylint reports:

Instance of 'AnophelesHapClustAnalysis' has no 'snp_effects' member

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. also not clear why the code was working - but i have added the Frequency class to hapclust class now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks Lee

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sanjaynagi . The snp_effects function is in AnophelesSnpFrequencyAnalysis rather than AnophelesFrequencyAnalysis, and AnophelesFrequencyAnalysis only inherits from AnophelesBase, so probably won't make snp_effects available to AnophelesHapClustAnalysis, if I understand correctly.

Also, since AnophelesSnpFrequencyAnalysis inherits from AnophelesSnpData, I suspect we might need to put it before AnophelesSnpData in the AnophelesHapClustAnalysis base classes list. That made we wonder whether we need AnophelesSnpData there at all, but when I tried removing it locally, I got a method resolution error.

Unfortunately, I'm starting to suspect that there might be a larger architectural issue here, around inheritance, namely https://en.wikipedia.org/wiki/Multiple_inheritance#The_diamond_problem

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad. I will replace the import of AnophelesFrequencyAnalysis and see what happens

@sanjaynagi
Copy link
Collaborator Author

Also, here is a notebook with a demo of the function https://colab.research.google.com/drive/1HqDJL0cz9gAKaeXdwYjcGXx-X1GYJznL?usp=sharing

@leehart
Copy link
Collaborator

leehart commented Sep 30, 2025

Looks like we're still waiting to try using AnophelesSnpFrequencyAnalysis (which has snp_effects) instead of AnophelesFrequencyAnalysis for the AnophelesHapClustAnalysis class, to satisfy Pylint.

I believe AnophelesSnpFrequencyAnalysis will need to be placed before AnophelesSnpData in the list, because AnophelesSnpFrequencyAnalysis inherits from AnophelesSnpData, e.g.

class AnophelesHapClustAnalysis(AnophelesSnpFrequencyAnalysis, AnophelesHapData, AnophelesSnpData):

We might be able to address the potential diamond inheritance problem in a separate issue.

Otherwise, I reckon this looks good to go.

@leehart
Copy link
Collaborator

leehart commented Jan 8, 2026

Hi @sanjaynagi. Is it OK if I bring this PR's branch up-to-date with the base branch, and take a closer look at these CI test failures?

@sanjaynagi
Copy link
Collaborator Author

sanjaynagi commented Jan 8, 2026 via email

@leehart
Copy link
Collaborator

leehart commented Jan 8, 2026

Thanks @sanjaynagi .

Looks like we have a couple of mypy linting errors in the CI checks https://github.com/malariagen/malariagen-data-python/actions/runs/20814001998/job/59784955756?pr=817

malariagen_data/util.py:114: error: "Series[Any]" not callable  [operator]
malariagen_data/anoph/sample_metadata.py:1158: error: Name "pd.core.series.Series" is not defined  [name-defined]
Found 2 errors in 2 files (checked 98 source files)

I also noticed that we have:

class AnophelesHapClustAnalysis(AnophelesHapData, AnophelesSnpData, AnophelesFrequencyAnalysis)

instead of

class AnophelesHapClustAnalysis(AnophelesSnpFrequencyAnalysis, AnophelesHapData, AnophelesSnpData):

In any case, we are still running into issue #835 at https://github.com/malariagen/malariagen-data-python/actions/runs/20814001969/job/59784955727?pr=817

/home/runner/work/_temp/27a2a858-3d62-46e1-9296-6eb374ce0f4e.sh: line 1:  2259 Segmentation fault      (core dumped)

...which is preventing us from the CI checks & tests completing. This segfault issue exists outside the changes in this PR.

@leehart
Copy link
Collaborator

leehart commented Jan 16, 2026

Noting linting failures:

malariagen_data/util.py:114: error: "Series[Any]" not callable [operator]
malariagen_data/anoph/sample_metadata.py:1158: error: Name "pd.core.series.Series" is not defined [name-defined]
Found 2 errors in 2 files (checked 98 source files)

These can be addressed in similar ways to pending PRs #842 and #844

However, the segfault error is still sporadic and unresolved, via issue #835 and PR #842

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants