Skip to content

Conversation

@tristanpwdennis
Copy link
Collaborator

@tristanpwdennis tristanpwdennis commented Jun 17, 2025

This replaces PR #793.

Continuing previous discussion....

Thanks @leehart - I'll keep working on this and adding features (e.g. cohorts analysis and fixing the metadata so it is compatible with Ag3/Af1), before I request a review.

In the meantime, I'm happy to contribute to some of the re-engineering to allow more optional implementations of functionality to new classes (e.g. As1, Adir1, updated Amin1 and potentially others). If that would be helpful, please point me at whatever I can do.

@tristanpwdennis tristanpwdennis mentioned this pull request Jun 17, 2025
@tristanpwdennis tristanpwdennis requested a review from leehart June 17, 2025 02:24
@leehart
Copy link
Collaborator

leehart commented Jun 17, 2025

Hi @tristanpwdennis . Thanks for your work on this. I hope we can figure out the necessary adjustments to make this fit.

I see you've marked this PR ready for review but we'll still need the existing code in anoph/sample_metadata.py to support the existing functionality for ag3.py and af1.py, as well as other modules in the pipe. It looks like we'll need to figure out how to resolve the study_info and terms_of_use_info. I believe the current assumption is that all our data will include that info, otherwise we'll probably need to find a way to conditionally support it where it exists. I believe that terms-of-use info is likely to be quite fundamental, for example, since we want to be able to tell users which sample sets are restricted and which aren't.

Similarly, there is currently an assumption that sequence QC metadata will be available (in a certain place and format), which we will need to resolve. If those data aren't available in this case, then we will need to re-engineer the code (in general) to allow for this, without affecting the other existing classes, functions and data support.

The existing Af1.x support does not include _aim_analysis, so that should be relatively straightforward to resolve. See that malariagen_data/af1.py has:

            aim_analysis=None,
            aim_metadata_dtype=None,
            aim_ids=None,
            aim_palettes=None,

I'm not sure I understand the changes to the anopheles.py code, perhaps you could explain those? Also the addition of the pinned version of ipykernel to the pyproject.toml?

Ultimately, in order to add anything to the master branch, we would also require all of the CI checks to pass.

I've created an issue for this work #797 and assigned you, just to help us track the work.

@tristanpwdennis tristanpwdennis marked this pull request as draft June 18, 2025 03:53
@tristanpwdennis
Copy link
Collaborator Author

tristanpwdennis commented Jun 18, 2025

Hi Lee - thanks for taking a look at this.

Requesting a review must have been an accident - I was just trying to add you back to the PR but probably misunderstood. I've turned it back into a draft (think that's the right thing to do?) and will re-request when I think it's actually ready.

The code, data and metadata in vobs-asia are a few years behind the structure currently used for Af and Ag - so things like the ToU inclusion, inclusion of the QC metadata, broke when I tried to run the current API version with the data. All of these data exist, and can be included, I just need to adapt the data and GCS filesystem structure to be compatible with the API. Working on this now.

The changes to anopheles.py add caching to the roh_hmm function - on reflection, would this have been better as its own PR? If so, can do that I think.

Apologies - a lot of my edits are so that I can get some analysis done first, then make the data and alterations compatible with the API second - but I don't think there is any reason why we can't get it to work?

Not sure about the iykernel version - will take a look.

Is discussion here or on the issue best?

HTH

@leehart
Copy link
Collaborator

leehart commented Jun 19, 2025

Thanks @tristanpwdennis, that's helped to clarify a few things for me.

Yeah, I expect the work to add caching to the roh_hmm might be more suited to a separate issue and PR, as you suggest. It doesn't seem specific to adding support for A. dirus data, but I guess it might be related to this PR, or somehow required? If that gets merged into master first, you could then update this PR with the same changes. Having that work in a separate issue and PR would help us to focus discussion and reviews on that particular update, but it's not unusual (and is often convenient) to bundle related changes into the same PR.

Thanks again for explaining the situation. I'll be on hand to review this again once everything's ready and aligned. Or I can take a look at things along the way, if needs be.

Regarding where to post discussion, it is often a grey area, but generally I reckon comments relating to the issue itself (e.g. describing a bug or the requirements for a new feature) seem to belong with the issue, whereas comments relating to a particular ongoing attempt to resolve an issue (e.g. a particular approach or set of code changes) usually belong with the PR itself. The comments will be seen either way, but having them in the most relevant place can help us when we're looking back later.

Having said that, discussion often starts under the issue until a PR is drafted, and then naturally continues within the PR until the PR is approved, and then the issue is closed, on a good day.

@tristanpwdennis
Copy link
Collaborator Author

Great, thanks Lee. See Issue #798 and PR #799 regarding the caching, and will let you know about Adir1 as we go on.

Copy link
Collaborator

@ahernank ahernank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging shortly, unless any concerns.

@leehart
Copy link
Collaborator

leehart commented Nov 20, 2025

Hi @ahernank - One concern is that this PR is failing still tests, which would cause the master branch to also fail tests when merged in (as well as the obvious problems of safeguards failing).
Also potentially important: the PR for the unrestricted_use_only and surveillance_use_only constructor params #724 hasn't been fully approved and merged in yet.

@jonbrenas
Copy link
Collaborator

I agree with @leehart. The tests fail on an error which actually exists (as opposed to the more transient "we chose an empty region somehow") so I don't think we can merge it yet.

@tristanpwdennis
Copy link
Collaborator Author

Hi guys

I had been meaning to discuss the reasons the tests are failing - but was a little slow. Apologies. If you examine the tracebacks, the tests fail in three areas:

  • The absence of 'contam_pc' and 'contam_LLR' in the metadata. These weren't calculated for the dirus data and cause the sample metadata tests to fail.
  • The absence of phasing data. For some reason, this affects the g123 analysis, despite that g123 is meant to operate on unphased data?
  • The absence of sex calling data. The dirus genome is in pieces, and sex calling was a challenge, so we omitted it for now until we have a better reference to identify the sex chromosomes. This causes the diplotype clustering tests to fail.

Depending on what you guys think, we have three options:

  • Remove dirus and minimus from these coverage tests. This is probably the most expedient, and bearing in mind the rustic nature of the amin1 API as it is now ;).
  • Rewrite the coverage tests - but this may mean areas of ag3 or af1 aren't covered.
  • Add new coverage tests specifically for dirus and minimus (as they will have the same issues).

Please let me know what you think. I am inclined to go for #1 immediately and then add new or modified tests as soon as possible - but this is because we are hoping to submit a paper.

Cheers all, thank you for your eyes on this.

@jonbrenas
Copy link
Collaborator

Thank you, @tristanpwdennis .

I agree that removing these tests for dirus and minimus is probably the most expedient way to progress as long as we remember that it is still in a beta state.

I think some of the issues can be addressed fairly easily, though. For g123, it looks like the issue is that dirus_noneyet is considered to be a valid phasing analysis ID but that doesn't have data. The g123 code uses a phasing analysis ID if one is provided (as it is a convenient filter) but can also use 'all' or 'segregating'. I would advise to run the tests with one of those instead.

For the metadata, the tests are quite important so I would rather add dummy data (NaN or 0) in the missing columns. There is also an errant column 'Unnamed: 0' that needs to be removed.

For the diplotype clustering, it is possible to use a different query, at least for Adir.

@tristanpwdennis
Copy link
Collaborator Author

Thanks @jonbrenas!

I have removed the checks for the g123 and dip clustering. The G123 still looks for haplotype data - unless I am missing something - and I couldn't work out how to modify the query in the dip_clustering_advanced - which I don't think is included in Adir1.x anyway as we don't have CNV calls yet.

Let me know what you think

jonbrenas
jonbrenas previously approved these changes Nov 25, 2025
Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jonbrenas jonbrenas dismissed their stale review November 25, 2025 10:39

We need to make some updates to deal with the new errors caused by the merge with master.

@jonbrenas
Copy link
Collaborator

The integration tests failed because 'unrestricted_use_only' and 'surveillance_use_only' are not instantiated. That was to be expected and can easily be corrected.
The linting issue seems to be an errant empty line. It should also be easy to correct.
The tests seem to fail because of the absence of the sex calls (again). It would probably be a good idea to use a different kind of query.

@tristanpwdennis
Copy link
Collaborator Author

It's hard to think of a value in the metadata that may apply in the same way as sex_call (e.g. by country or sample set may break?, whereas in all datasets we know the result will be M or F). I've added a tag to the cases so that the tests with those queries only run on ag3 and af1.

Linting fixed.

Surveillance and restriction instantiated. The tests are failing, but I am unable to reproduce this error locally - thoughts?

@jonbrenas
Copy link
Collaborator

jonbrenas commented Nov 26, 2025

The tests returned a SEGFAULT (which is not typically an error that you should be able to see in Python) so it seemed likely that it was a transient error with the script. Re-running the tests solved it.

An option for the query would be to build it as "country == C" where C is the country of the first sample in the metadata. Just skipping the test works, at least until we find a better solution, imho.

Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will merge later today if no one objects.

@jonbrenas jonbrenas dismissed their stale review November 26, 2025 11:55

The UI decided to change my review from "Approve" to "Request changes"!

Copy link
Collaborator

@jonbrenas jonbrenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will merge later today if no one objects.

@jonbrenas jonbrenas dismissed leehart’s stale review November 27, 2025 08:39

The changes were effected.

@jonbrenas jonbrenas merged commit fbbd2d0 into master Nov 27, 2025
30 of 58 checks passed
@leehart leehart deleted the adir1.x branch January 13, 2026 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants