Skip to content

Fix: implement classification for in-frame complex variants#1201

Closed
kunal-10-cloud wants to merge 4 commits intomalariagen:masterfrom
kunal-10-cloud:fix/gh-todo-inframe-complex-variants
Closed

Fix: implement classification for in-frame complex variants#1201
kunal-10-cloud wants to merge 4 commits intomalariagen:masterfrom
kunal-10-cloud:fix/gh-todo-inframe-complex-variants

Conversation

@kunal-10-cloud
Copy link
Contributor

@kunal-10-cloud kunal-10-cloud commented Mar 22, 2026

Summary

Replaces TODO placeholder code with a classification helper function that accurately determines effect and impact for in-frame complex variants by comparing amino acid sequences. Includes position-aware detection for START/STOP codon changes and comprehensive multi-codon test coverage.
Fixes: #1198

What Changed

Added _classify_inframe_complex_effect() helper function that:

  • Compares reference vs. alternate amino acid sequences
  • Detects synonymous mutations, missense changes, and functionally important variants (START/STOP codon alterations)
  • Assigns scientifically justified impact levels (LOW, MODERATE, HIGH) instead of UNKNOWN
  • Returns properly classified variant effects consistent with existing SNP patterns
  • Position-aware checks: Handles multi-amino-acid strings (e.g., "MLR", "Q*R") correctly

Added comprehensive test suite with 16 tests validating:

  • All classification branches and edge cases (9 original + 4 new multi-codon tests)
  • Data integrity and field preservation
  • Absence of TODO strings in output
  • Scientific correctness of impact assignments
  • Multi-codon scenarios: START_LOST detection on first position, STOP mutations anywhere in string

Validation

  • ruff linting: Passed
  • mypy type checking: No issues
  • Pre-commit hooks: All checks passed
  • Test results: 16/16 passing (including 4 new multi-codon tests)

Technical Approach

In-frame complex variants have complete amino acid information available, enabling SNP-style classification. Position-aware logic ensures:

  • Accurate START_LOST detection by checking first amino acid only (ref_aa[0] == "M")
  • Accurate STOP mutations detection by checking presence of "" anywhere in string ("" in ref_aa / alt_aa)
  • Reliable handling of rare multi-codon translations
  • Consistent effect naming with existing code patterns

Copilot AI review requested due to automatic review settings March 22, 2026 22:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Issue #1200 by replacing the in-frame complex-variant “TODO” placeholder in veff with a helper-based classification path, and adding tests to validate the new behavior.

Changes:

  • Replaced placeholder effect/impact assignment for mixed MNP+INDEL in-frame variants with _classify_inframe_complex_effect().
  • Added a new unit test module covering the helper’s effect/impact outputs and basic invariants (e.g., no “TODO”, field preservation).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
malariagen_data/veff.py Routes in-frame complex (mixed MNP+INDEL) variants through a new helper to assign non-UNKNOWN effect/impact.
tests/test_veff.py Adds tests for the helper and basic output integrity checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…sification

- Update classification logic with position-aware checks:
  * START_LOST: check first amino acid only (ref_aa[0])
  * STOP mutations: check presence anywhere in string
- Add 4 new test cases for multi-codon scenarios (MLR, Q*R, QR*, etc.)
- Rename TestInframeComplexEffectIntegration to TestClassifyInframeComplexEffectEdgeCases
- Test coverage: 16/16 passing
@jonbrenas
Copy link
Collaborator

Hi @kunal-10-cloud, as I explained in #1181 (which solves the exact same issue you are addressing here), the situation is more complex. Your solution is more complex than what could be found in #1181, granted, but it is also not accurate.

Replace TODO with INFRAME_COMPLEX_VARIANT/MODERATE placeholder.
Code path unreachable in production (SNP-only), proper classification
requires frame-shift analysis per PR malariagen#1181 discussion.
@kunal-10-cloud
Copy link
Contributor Author

hi @jonbrenas, I have tried to simplify the approach. Can you please take a look at it and let me know how can i proceed

@jonbrenas
Copy link
Collaborator

Hi @kunal-10-cloud, I don't think you understood what I was saying. You basically reverted your PR to what was in #1181. That's not what I was asking at all. What I am saying is that:

  1. This code is never reached so you shouldn't bother.
  2. What the effects should be in such a situation is a question of biology and not of computer science or software engineering.
  3. The problem is almost certainly not described properly.

All that to say, this is not an issue that you can or should try to solve and so this PR has no reason to exist.

@kunal-10-cloud
Copy link
Contributor Author

kunal-10-cloud commented Mar 23, 2026

Hi @kunal-10-cloud, I don't think you understood what I was saying. You basically reverted your PR to what was in #1181. That's not what I was asking at all. What I am saying is that:

  1. This code is never reached so you shouldn't bother.
  2. What the effects should be in such a situation is a question of biology and not of computer science or software engineering.
  3. The problem is almost certainly not described properly.

All that to say, this is not an issue that you can or should try to solve and so this PR has no reason to exist.

Hi @jonbrenas i understood your point and i was aware of this thats why in the comments i mentioned the same thing as well and asked how to proceed irrespective of that really thanks for the description and taking out your time.
Will close this issue for now can you please review these PR's instead

#1190(Approved) #1199(review) #1188(review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fragile cache name property pattern for _xpehh_gwss_cache_name and _ihs_gwss_cache_name

3 participants