Gene symbols patch by suu-yi · Pull Request #76 · NygenAnalytics/CyteType

suu-yi · 2026-03-25T15:46:39Z

Catch entrez ids
raise error if auto-detection of gene symbol column contains gene ids > 50%

qodo-code-review · 2026-03-25T16:09:55Z

Review Summary by Qodo

Enhance gene symbol validation and ID detection

🐞 Bug fix 🧪 Tests ✨ Enhancement

Walkthroughs

Description

• Improve gene ID detection to handle float-stringified Entrez IDs (e.g., "7157.0")
• Add validation to reject gene symbols columns containing >50% gene IDs with informative error
• Remove unused gene_symbols_col parameter from validate_adata function
• Add comprehensive test suite for gene ID detection and percentage calculation

Diagram

flowchart LR
  A["Gene symbols column"] --> B["materialize_canonical_gene_symbols_column"]
  B --> C["clean_gene_names"]
  C --> D["_id_like_percentage check"]
  D --> E{">50% IDs?"}
  E -->|Yes| F["Raise ValueError"]
  E -->|No| G["Store canonical column"]
  H["_is_gene_id_like"] --> D
  I["Tests"] --> H
  I --> D

File Changes

1. cytetype/main.py Miscellaneous +2/-1

Remove unused parameter from validation call

• Remove unused gene_symbols_col parameter from validate_adata function call
• Add blank line for code formatting in initialization block

cytetype/main.py

2. cytetype/preprocessing/validation.py ✨ Enhancement +16/-3

Add gene ID validation and improve ID detection

• Update regex pattern in _is_gene_id_like to match float-stringified Entrez IDs (e.g., "7157.0")
• Add validation logic in materialize_canonical_gene_symbols_column to detect and reject columns
 with >50% gene IDs
• Raise informative ValueError with examples and guidance when gene ID percentage exceeds
 threshold
• Remove gene_symbols_col parameter from validate_adata function signature

cytetype/preprocessing/validation.py

3. tests/test_validation.py 🧪 Tests +97/-0

Add comprehensive gene ID detection tests

• Add TestIsGeneIdLike class with parametrized tests for Ensembl IDs, RefSeq IDs, integer Entrez
 IDs, and float-stringified Entrez IDs
• Add tests to verify gene symbols are not incorrectly flagged as gene IDs
• Add edge case tests for empty strings, whitespace, and invalid float formats
• Add TestIdLikePercentage class with tests for percentage calculation across various ID types and
 mixed scenarios

tests/test_validation.py

qodo-code-review · 2026-03-25T16:09:58Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

1. ~~50% threshold mismatch~~ ☑ 🐞 Bug ✓ Correctness

Description

Auto-detection treats candidates as viable only when ID-like % is <50, but canonical materialization
raises only when ID-like % is >50; data at exactly 50% ID-like will be rejected by auto-detection
yet allowed through materialization, causing inconsistent behavior and unexpected fallbacks.

Code

cytetype/preprocessing/validation.py[R107-109]

+    id_pct = _id_like_percentage(cleaned)
+    if id_pct > 50:
+        examples = [v for v in cleaned[:20] if _is_gene_id_like(v)][:5]

Evidence

The newly added hard-error check in materialize_canonical_gene_symbols_column uses a strict “> 50”
threshold, while resolve_gene_symbols_column filters viable auto-detection candidates with a strict
“< 50” threshold, leaving an inconsistent boundary at exactly 50%. This can cause auto-detection to
refuse a candidate at 50% and fall back even though materialization would accept the same data.

cytetype/preprocessing/validation.py[90-136]
cytetype/preprocessing/validation.py[168-280]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Auto-detection viability and materialization rejection use different inequality operators around the 50% ID-like boundary. This makes exactly-50% cases behave inconsistently (rejected by auto-detection but accepted by materialization).
### Issue Context
- `materialize_canonical_gene_symbols_column` rejects only when `id_pct > 50`.
- `resolve_gene_symbols_column` currently considers candidates viable only when `id_like_pct < 50`.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[90-136]
- cytetype/preprocessing/validation.py[168-280]
### Suggested change
Pick one definition for the boundary and apply it consistently. If the intended behavior is “error only when > 50% are IDs”, then:
- Change viable filtering to include 50% (e.g., `<= 50`).
- Keep the materialization check as `> 50` (or update message if you decide otherwise).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Misleading ID examples~~ ☑ 🐞 Bug ✓ Correctness

Description

The ValueError examples are taken from the first 20 values, but the ID-like percentage is computed
from a random sample (up to 2000), so the error can report a high ID-like percentage while showing
an empty/non-representative example list.

Code

cytetype/preprocessing/validation.py[R107-113]

+    id_pct = _id_like_percentage(cleaned)
+    if id_pct > 50:
+        examples = [v for v in cleaned[:20] if _is_gene_id_like(v)][:5]
+        raise ValueError(
+            f"The resolved gene symbols from {source_name} contain more than 50% gene IDs rather than "
+            f"gene symbols ({id_pct:.0f}% of values look like identifiers, e.g., {examples}). "
+            f"CyteType requires human-readable gene symbols (e.g., 'TSPAN6', 'DPM1', 'SCYL3'). "

Evidence

_id_like_percentage samples values randomly when the list is large, but the raised error collects
examples only from cleaned[:20]. If IDs are not present in the first 20 entries but are prevalent
elsewhere (and in the random sample), the error message becomes unhelpful (e.g., examples=[]).

cytetype/preprocessing/validation.py[90-136]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The raised ValueError includes example ID-like values from `cleaned[:20]`, but the percentage computation may be based on a different random sample. This can produce misleading/empty examples.
### Issue Context
`_id_like_percentage` uses `random.sample` for large inputs, while examples are collected deterministically from the head of the list.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[90-136]
### Suggested change
Ensure examples come from the same population used to compute `id_pct`. Options:
- Refactor `_id_like_percentage` to optionally return the sample used (or accept a precomputed sample).
- Or compute a sample once in `materialize_canonical_gene_symbols_column` and use it both for `id_pct` and for collecting examples.
- Alternatively, compute `id_pct` on `cleaned[:N]` if you want deterministic behavior, and make that explicit.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. ~~validate_adata API break~~ ☑ 🐞 Bug ⚙ Maintainability

Description

The exported validate_adata function removed the gene_symbols_col parameter, which is a
backward-incompatible signature change for downstream users importing it from
cytetype.preprocessing.

Code

cytetype/preprocessing/validation.py[L283-284]
-    gene_symbols_col: str | None,
coordinates_key: str,

Evidence

validate_adata is re-exported as part of the public preprocessing module, so removing a parameter
can break external callers even though internal usage was updated. The current validate_adata
implementation does not use gene symbol inputs, suggesting the parameter removal is API-only and
could be preserved for compatibility.

cytetype/preprocessing/init.py[1-17]
cytetype/preprocessing/validation.py[293-340]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`validate_adata` dropped a parameter (`gene_symbols_col`) but is exported from `cytetype.preprocessing`. This can break downstream code that calls `validate_adata(..., gene_symbols_col=...)`.
### Issue Context
The parameter appears unused in the current `validate_adata` body, so it can be retained without changing behavior.
### Fix Focus Areas
- cytetype/preprocessing/validation.py[293-340]
- cytetype/preprocessing/__init__.py[1-17]
### Suggested change
Reintroduce the parameter for compatibility, ideally as optional/kw-only, e.g.:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

ⓘ The new review experience is currently in Beta. Learn more

cytetype/preprocessing/validation.py

… into gene-symbols-patch

suu-yi · 2026-03-25T17:37:17Z

@parashardhapola
Multi-line error message?

suu-yi and others added 5 commits March 25, 2026 15:59

clean unused arg

f2d07ea

Update validation.py

7dac18b

Create test_validation.py

a421616

Merge branch 'master' into pr/76

e7e3101

comma

15a302e

suu-yi marked this pull request as ready for review March 25, 2026 16:09

qodo-code-review bot reviewed Mar 25, 2026

View reviewed changes

cytetype/preprocessing/validation.py Outdated Show resolved Hide resolved

suu-yi added 4 commits March 25, 2026 18:18

qodo fix 1

dbd59a6

Merge branch 'gene-symbols-patch' of https://github.com/suu-yi/CyteType…

80e1273

… into gene-symbols-patch

Update validation.py

6512bdf

revert arg cleanup

49b0955

parashardhapola merged commit f1b0d09 into NygenAnalytics:master Mar 25, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gene symbols patch#76

Gene symbols patch#76
parashardhapola merged 9 commits intoNygenAnalytics:masterfrom
suu-yi:gene-symbols-patch

suu-yi commented Mar 25, 2026

Uh oh!

qodo-code-review bot commented Mar 25, 2026

Uh oh!

qodo-code-review bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

suu-yi commented Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

suu-yi commented Mar 25, 2026

Uh oh!

qodo-code-review bot commented Mar 25, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-code-review bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

Uh oh!

suu-yi commented Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

qodo-code-review bot commented Mar 25, 2026 •

edited

Loading