Skip to content

Gene symbols patch#76

Merged
parashardhapola merged 9 commits intoNygenAnalytics:masterfrom
suu-yi:gene-symbols-patch
Mar 25, 2026
Merged

Gene symbols patch#76
parashardhapola merged 9 commits intoNygenAnalytics:masterfrom
suu-yi:gene-symbols-patch

Conversation

@suu-yi
Copy link
Contributor

@suu-yi suu-yi commented Mar 25, 2026

Catch entrez ids
raise error if auto-detection of gene symbol column contains gene ids > 50%

@suu-yi suu-yi marked this pull request as ready for review March 25, 2026 16:09
@qodo-code-review
Copy link

Review Summary by Qodo

Enhance gene symbol validation and ID detection

🐞 Bug fix 🧪 Tests ✨ Enhancement

Grey Divider

Walkthroughs

Description
• Improve gene ID detection to handle float-stringified Entrez IDs (e.g., "7157.0")
• Add validation to reject gene symbols columns containing >50% gene IDs with informative error
• Remove unused gene_symbols_col parameter from validate_adata function
• Add comprehensive test suite for gene ID detection and percentage calculation
Diagram
flowchart LR
  A["Gene symbols column"] --> B["materialize_canonical_gene_symbols_column"]
  B --> C["clean_gene_names"]
  C --> D["_id_like_percentage check"]
  D --> E{">50% IDs?"}
  E -->|Yes| F["Raise ValueError"]
  E -->|No| G["Store canonical column"]
  H["_is_gene_id_like"] --> D
  I["Tests"] --> H
  I --> D
Loading

Grey Divider

File Changes

1. cytetype/main.py Miscellaneous +2/-1

Remove unused parameter from validation call

• Remove unused gene_symbols_col parameter from validate_adata function call
• Add blank line for code formatting in initialization block

cytetype/main.py


2. cytetype/preprocessing/validation.py ✨ Enhancement +16/-3

Add gene ID validation and improve ID detection

• Update regex pattern in _is_gene_id_like to match float-stringified Entrez IDs (e.g., "7157.0")
• Add validation logic in materialize_canonical_gene_symbols_column to detect and reject columns
 with >50% gene IDs
• Raise informative ValueError with examples and guidance when gene ID percentage exceeds
 threshold
• Remove gene_symbols_col parameter from validate_adata function signature

cytetype/preprocessing/validation.py


3. tests/test_validation.py 🧪 Tests +97/-0

Add comprehensive gene ID detection tests

• Add TestIsGeneIdLike class with parametrized tests for Ensembl IDs, RefSeq IDs, integer Entrez
 IDs, and float-stringified Entrez IDs
• Add tests to verify gene symbols are not incorrectly flagged as gene IDs
• Add edge case tests for empty strings, whitespace, and invalid float formats
• Add TestIdLikePercentage class with tests for percentage calculation across various ID types and
 mixed scenarios

tests/test_validation.py


Grey Divider

Qodo Logo

@qodo-code-review
Copy link

qodo-code-review bot commented Mar 25, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

Grey Divider


Action required

1. 50% threshold mismatch🐞 Bug ✓ Correctness
Description
Auto-detection treats candidates as viable only when ID-like % is <50, but canonical materialization
raises only when ID-like % is >50; data at exactly 50% ID-like will be rejected by auto-detection
yet allowed through materialization, causing inconsistent behavior and unexpected fallbacks.
Code

cytetype/preprocessing/validation.py[R107-109]

+    id_pct = _id_like_percentage(cleaned)
+    if id_pct > 50:
+        examples = [v for v in cleaned[:20] if _is_gene_id_like(v)][:5]
Evidence
The newly added hard-error check in materialize_canonical_gene_symbols_column uses a strict “> 50”
threshold, while resolve_gene_symbols_column filters viable auto-detection candidates with a strict
“< 50” threshold, leaving an inconsistent boundary at exactly 50%. This can cause auto-detection to
refuse a candidate at 50% and fall back even though materialization would accept the same data.

cytetype/preprocessing/validation.py[90-136]
cytetype/preprocessing/validation.py[168-280]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Auto-detection viability and materialization rejection use different inequality operators around the 50% ID-like boundary. This makes exactly-50% cases behave inconsistently (rejected by auto-detection but accepted by materialization).
### Issue Context
- `materialize_canonical_gene_symbols_column` rejects only when `id_pct > 50`.
- `resolve_gene_symbols_column` currently considers candidates viable only when `id_like_pct < 50`.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[90-136]
- cytetype/preprocessing/validation.py[168-280]
### Suggested change
Pick one definition for the boundary and apply it consistently. If the intended behavior is “error only when > 50% are IDs”, then:
- Change viable filtering to include 50% (e.g., `<= 50`).
- Keep the materialization check as `> 50` (or update message if you decide otherwise).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Misleading ID examples🐞 Bug ✓ Correctness
Description
The ValueError examples are taken from the first 20 values, but the ID-like percentage is computed
from a random sample (up to 2000), so the error can report a high ID-like percentage while showing
an empty/non-representative example list.
Code

cytetype/preprocessing/validation.py[R107-113]

+    id_pct = _id_like_percentage(cleaned)
+    if id_pct > 50:
+        examples = [v for v in cleaned[:20] if _is_gene_id_like(v)][:5]
+        raise ValueError(
+            f"The resolved gene symbols from {source_name} contain more than 50% gene IDs rather than "
+            f"gene symbols ({id_pct:.0f}% of values look like identifiers, e.g., {examples}). "
+            f"CyteType requires human-readable gene symbols (e.g., 'TSPAN6', 'DPM1', 'SCYL3'). "
Evidence
_id_like_percentage samples values randomly when the list is large, but the raised error collects
examples only from cleaned[:20]. If IDs are not present in the first 20 entries but are prevalent
elsewhere (and in the random sample), the error message becomes unhelpful (e.g., examples=[]).

cytetype/preprocessing/validation.py[90-136]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The raised ValueError includes example ID-like values from `cleaned[:20]`, but the percentage computation may be based on a different random sample. This can produce misleading/empty examples.
### Issue Context
`_id_like_percentage` uses `random.sample` for large inputs, while examples are collected deterministically from the head of the list.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[90-136]
### Suggested change
Ensure examples come from the same population used to compute `id_pct`. Options:
- Refactor `_id_like_percentage` to optionally return the sample used (or accept a precomputed sample).
- Or compute a sample once in `materialize_canonical_gene_symbols_column` and use it both for `id_pct` and for collecting examples.
- Alternatively, compute `id_pct` on `cleaned[:N]` if you want deterministic behavior, and make that explicit.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. validate_adata API break🐞 Bug ⚙ Maintainability
Description
The exported validate_adata function removed the gene_symbols_col parameter, which is a
backward-incompatible signature change for downstream users importing it from
cytetype.preprocessing.
Code

cytetype/preprocessing/validation.py[L283-284]

-    gene_symbols_col: str | None,
coordinates_key: str,
Evidence
validate_adata is re-exported as part of the public preprocessing module, so removing a parameter
can break external callers even though internal usage was updated. The current validate_adata
implementation does not use gene symbol inputs, suggesting the parameter removal is API-only and
could be preserved for compatibility.

cytetype/preprocessing/init.py[1-17]
cytetype/preprocessing/validation.py[293-340]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`validate_adata` dropped a parameter (`gene_symbols_col`) but is exported from `cytetype.preprocessing`. This can break downstream code that calls `validate_adata(..., gene_symbols_col=...)`.
### Issue Context
The parameter appears unused in the current `validate_adata` body, so it can be retained without changing behavior.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[293-340]
- cytetype/preprocessing/__init__.py[1-17]
### Suggested change
Reintroduce the parameter for compatibility, ideally as optional/kw-only, e.g.:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@suu-yi
Copy link
Contributor Author

suu-yi commented Mar 25, 2026

@parashardhapola
Multi-line error message?

Screenshot 2026-03-25 at 18 31 00

@parashardhapola parashardhapola merged commit f1b0d09 into NygenAnalytics:master Mar 25, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants