Adir1.x #795

tristanpwdennis · 2025-06-17T02:03:56Z

This replaces PR #793.

Continuing previous discussion....

Thanks @leehart - I'll keep working on this and adding features (e.g. cohorts analysis and fixing the metadata so it is compatible with Ag3/Af1), before I request a review.

In the meantime, I'm happy to contribute to some of the re-engineering to allow more optional implementations of functionality to new classes (e.g. As1, Adir1, updated Amin1 and potentially others). If that would be helpful, please point me at whatever I can do.

…en-data-python into adir1.x

leehart · 2025-06-17T11:44:00Z

Hi @tristanpwdennis . Thanks for your work on this. I hope we can figure out the necessary adjustments to make this fit.

I see you've marked this PR ready for review but we'll still need the existing code in anoph/sample_metadata.py to support the existing functionality for ag3.py and af1.py, as well as other modules in the pipe. It looks like we'll need to figure out how to resolve the study_info and terms_of_use_info. I believe the current assumption is that all our data will include that info, otherwise we'll probably need to find a way to conditionally support it where it exists. I believe that terms-of-use info is likely to be quite fundamental, for example, since we want to be able to tell users which sample sets are restricted and which aren't.

Similarly, there is currently an assumption that sequence QC metadata will be available (in a certain place and format), which we will need to resolve. If those data aren't available in this case, then we will need to re-engineer the code (in general) to allow for this, without affecting the other existing classes, functions and data support.

The existing Af1.x support does not include _aim_analysis, so that should be relatively straightforward to resolve. See that malariagen_data/af1.py has:

            aim_analysis=None,
            aim_metadata_dtype=None,
            aim_ids=None,
            aim_palettes=None,

I'm not sure I understand the changes to the anopheles.py code, perhaps you could explain those? Also the addition of the pinned version of ipykernel to the pyproject.toml?

Ultimately, in order to add anything to the master branch, we would also require all of the CI checks to pass.

I've created an issue for this work #797 and assigned you, just to help us track the work.

tristanpwdennis · 2025-06-18T04:08:02Z

Hi Lee - thanks for taking a look at this.

Requesting a review must have been an accident - I was just trying to add you back to the PR but probably misunderstood. I've turned it back into a draft (think that's the right thing to do?) and will re-request when I think it's actually ready.

The code, data and metadata in vobs-asia are a few years behind the structure currently used for Af and Ag - so things like the ToU inclusion, inclusion of the QC metadata, broke when I tried to run the current API version with the data. All of these data exist, and can be included, I just need to adapt the data and GCS filesystem structure to be compatible with the API. Working on this now.

The changes to anopheles.py add caching to the roh_hmm function - on reflection, would this have been better as its own PR? If so, can do that I think.

Apologies - a lot of my edits are so that I can get some analysis done first, then make the data and alterations compatible with the API second - but I don't think there is any reason why we can't get it to work?

Not sure about the iykernel version - will take a look.

Is discussion here or on the issue best?

HTH

leehart · 2025-06-19T11:57:27Z

Thanks @tristanpwdennis, that's helped to clarify a few things for me.

Yeah, I expect the work to add caching to the roh_hmm might be more suited to a separate issue and PR, as you suggest. It doesn't seem specific to adding support for A. dirus data, but I guess it might be related to this PR, or somehow required? If that gets merged into master first, you could then update this PR with the same changes. Having that work in a separate issue and PR would help us to focus discussion and reviews on that particular update, but it's not unusual (and is often convenient) to bundle related changes into the same PR.

Thanks again for explaining the situation. I'll be on hand to review this again once everything's ready and aligned. Or I can take a look at things along the way, if needs be.

Regarding where to post discussion, it is often a grey area, but generally I reckon comments relating to the issue itself (e.g. describing a bug or the requirements for a new feature) seem to belong with the issue, whereas comments relating to a particular ongoing attempt to resolve an issue (e.g. a particular approach or set of code changes) usually belong with the PR itself. The comments will be seen either way, but having them in the most relevant place can help us when we're looking back later.

Having said that, discussion often starts under the issue until a PR is drafted, and then naturally continues within the PR until the PR is approved, and then the issue is closed, on a good day.

tristanpwdennis · 2025-06-19T12:33:20Z

Great, thanks Lee. See Issue #798 and PR #799 regarding the caching, and will let you know about Adir1 as we go on.

Had commented this out, back now!

…s hap/cnv/aim data)

ahernank

Merging shortly, unless any concerns.

leehart · 2025-11-20T09:03:16Z

Hi @ahernank - One concern is that this PR is failing still tests, which would cause the master branch to also fail tests when merged in (as well as the obvious problems of safeguards failing).
Also potentially important: the PR for the unrestricted_use_only and surveillance_use_only constructor params #724 hasn't been fully approved and merged in yet.

jonbrenas · 2025-11-20T09:17:09Z

I agree with @leehart. The tests fail on an error which actually exists (as opposed to the more transient "we chose an empty region somehow") so I don't think we can merge it yet.

tristanpwdennis · 2025-11-20T21:59:28Z

Hi guys

I had been meaning to discuss the reasons the tests are failing - but was a little slow. Apologies. If you examine the tracebacks, the tests fail in three areas:

The absence of 'contam_pc' and 'contam_LLR' in the metadata. These weren't calculated for the dirus data and cause the sample metadata tests to fail.
The absence of phasing data. For some reason, this affects the g123 analysis, despite that g123 is meant to operate on unphased data?
The absence of sex calling data. The dirus genome is in pieces, and sex calling was a challenge, so we omitted it for now until we have a better reference to identify the sex chromosomes. This causes the diplotype clustering tests to fail.

Depending on what you guys think, we have three options:

Remove dirus and minimus from these coverage tests. This is probably the most expedient, and bearing in mind the rustic nature of the amin1 API as it is now ;).
Rewrite the coverage tests - but this may mean areas of ag3 or af1 aren't covered.
Add new coverage tests specifically for dirus and minimus (as they will have the same issues).

Please let me know what you think. I am inclined to go for #1 immediately and then add new or modified tests as soon as possible - but this is because we are hoping to submit a paper.

Cheers all, thank you for your eyes on this.

jonbrenas · 2025-11-22T09:03:29Z

Thank you, @tristanpwdennis .

I agree that removing these tests for dirus and minimus is probably the most expedient way to progress as long as we remember that it is still in a beta state.

I think some of the issues can be addressed fairly easily, though. For g123, it looks like the issue is that dirus_noneyet is considered to be a valid phasing analysis ID but that doesn't have data. The g123 code uses a phasing analysis ID if one is provided (as it is a convenient filter) but can also use 'all' or 'segregating'. I would advise to run the tests with one of those instead.

For the metadata, the tests are quite important so I would rather add dummy data (NaN or 0) in the missing columns. There is also an errant column 'Unnamed: 0' that needs to be removed.

For the diplotype clustering, it is possible to use a different query, at least for Adir.

tristanpwdennis · 2025-11-25T05:12:05Z

Thanks @jonbrenas!

I have removed the checks for the g123 and dip clustering. The G123 still looks for haplotype data - unless I am missing something - and I couldn't work out how to modify the query in the dip_clustering_advanced - which I don't think is included in Adir1.x anyway as we don't have CNV calls yet.

Let me know what you think

jonbrenas

LGTM

We need to make some updates to deal with the new errors caused by the merge with master.

jonbrenas · 2025-11-25T14:57:36Z

The integration tests failed because 'unrestricted_use_only' and 'surveillance_use_only' are not instantiated. That was to be expected and can easily be corrected.
The linting issue seems to be an errant empty line. It should also be easy to correct.
The tests seem to fail because of the absence of the sex calls (again). It would probably be a good idea to use a different kind of query.

tristanpwdennis · 2025-11-26T03:56:11Z

It's hard to think of a value in the metadata that may apply in the same way as sex_call (e.g. by country or sample set may break?, whereas in all datasets we know the result will be M or F). I've added a tag to the cases so that the tests with those queries only run on ag3 and af1.

Linting fixed.

Surveillance and restriction instantiated. The tests are failing, but I am unable to reproduce this error locally - thoughts?

jonbrenas · 2025-11-26T11:51:11Z

The tests returned a SEGFAULT (which is not typically an error that you should be able to see in Python) so it seemed likely that it was a transient error with the script. Re-running the tests solved it.

An option for the query would be to build it as "country == C" where C is the country of the first sample in the metadata. Just skipping the test works, at least until we find a better solution, imho.

jonbrenas

LGTM. I will merge later today if no one objects.

The UI decided to change my review from "Approve" to "Request changes"!

jonbrenas

LGTM. I will merge later today if no one objects.

The changes were effected.

tristanpwdennis added 13 commits June 9, 2025 14:45

add adir1 class for anopheles dirus data release

3fe3752

add notebook exploring / troubleshooting adir1

921b9c4

comment out TOU lookups for metadata

e3e45fe

add annotation to commented out bit of metadata

b8ed3cb

forgot how the alphabet works d'oh

57a9b42

add hashing to roh function

ced9224

add adir1 class for anopheles dirus data release

15c9b28

add notebook exploring / troubleshooting adir1

624a21f

comment out TOU lookups for metadata

6cbce3a

add annotation to commented out bit of metadata

fcbb281

forgot how the alphabet works d'oh

ce7b80a

add hashing to roh function

1b19634

Merge branch 'adir1.x' of https://github.com/tristanpwdennis/malariag…

d1bad60

…en-data-python into adir1.x

tristanpwdennis mentioned this pull request Jun 17, 2025

Adir1.x #793

Closed

tristanpwdennis requested a review from leehart June 17, 2025 02:24

reintroduce cohorts, aims, qc to metadata

6320047

tristanpwdennis marked this pull request as draft June 18, 2025 03:53

tristanpwdennis added 9 commits July 10, 2025 01:33

add cohorts analysis

5cd669f

QC data now being built with metadata

0a9c07c

Had commented this out, back now!

QC metadata back in now

9a3ec96

Update adir1.py

fb8a42e

Resolve poetry.lock conflict

996f545

Add Adir1 docs page, add index entry (docs contain all functions minu…

6fa3c63

…s hap/cnv/aim data)

add updated release path, change from 1.x to 1.0

53a85e4

update paths to release directory

0f58591

tou added back into sample_metadata

edee495

add coverage tests for adir1

1087431

ahernank approved these changes Nov 20, 2025

View reviewed changes

This was referenced Nov 20, 2025

Adir1 and Amin1 test_g123.py test fail - can't find haplotype data #828

Open

Adir1 and Amin testtest_sample_metadata.py fails - no contam_ data #829

Closed

Adir1 and Amin1 fail test_diplotype_clustering.py - no sex_call data #830

Open

tristanpwdennis added 3 commits November 25, 2025 11:24

fix cohort validation in adir1 conftest & add dummy contam data

7303340

remove tests for g123 and dipclust

4989f74

more test fixes

ac9677e

jonbrenas previously approved these changes Nov 25, 2025

View reviewed changes

Merge branch 'master' into adir1.x

b76f818

tristanpwdennis added 4 commits November 26, 2025 08:57

add surveillance/restricted use flags

5956494

add surveillance/restricted use flags again

826a920

Apply pre-commit formatting fixes

bebd524

skip tests for adir1 that have sex call query

5e358c2

jonbrenas previously requested changes Nov 26, 2025

View reviewed changes

jonbrenas approved these changes Nov 26, 2025

View reviewed changes

jonbrenas merged commit fbbd2d0 into master Nov 27, 2025
30 of 58 checks passed

leehart mentioned this pull request Jan 13, 2026

Segmentation fault during CI checks #835

Open

leehart deleted the adir1.x branch January 13, 2026 10:45

Adir1.x #795

Adir1.x #795

Uh oh!

Conversation

tristanpwdennis commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Jun 17, 2025

Uh oh!

tristanpwdennis commented Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leehart commented Jun 19, 2025

Uh oh!

tristanpwdennis commented Jun 19, 2025

Uh oh!

ahernank left a comment

Choose a reason for hiding this comment

Uh oh!

leehart commented Nov 20, 2025

Uh oh!

jonbrenas commented Nov 20, 2025

Uh oh!

tristanpwdennis commented Nov 20, 2025

Uh oh!

jonbrenas commented Nov 22, 2025

Uh oh!

tristanpwdennis commented Nov 25, 2025

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

jonbrenas commented Nov 25, 2025

Uh oh!

tristanpwdennis commented Nov 26, 2025

Uh oh!

jonbrenas commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

jonbrenas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

tristanpwdennis commented Jun 17, 2025 •

edited

Loading

tristanpwdennis commented Jun 18, 2025 •

edited

Loading

jonbrenas commented Nov 26, 2025 •

edited

Loading