Skip to content

Conversation

@Ulthran
Copy link
Contributor

@Ulthran Ulthran commented Jan 26, 2026

Motivation

  • Make HostSpecies, SampleType and machine-prefix mappings authoritative, file-backed TSVs that are easy to reference from other projects.
  • Remove duplication in the database and provide lightweight Python wrappers so other services can import the canonical lists without DB access.
  • Expose machine prefix → machine type mapping for programmatic use.

Description

  • Add sample_registry/standards.py implementing dataclasses and loader/wrapper classes (StandardSampleTypes, StandardHostSpeciesList, MachineTypeMappings) that read packaged TSVs via importlib.resources and expose accessors like get(), names(), values(), and as_dict().
  • Add TSV reference data under sample_registry/data/ (standard_sample_types.tsv, standard_host_species.tsv, machine_types.tsv) and a root machine_types.tsv for easy URL access, and add sample_registry/data to package data in pyproject.toml (include data/*.tsv).
  • Remove the StandardSampleType and StandardHostSpecies ORM models from sample_registry/models.py and delete the database initialization / registration logic that wrote these tables from sample_registry/db.py, sample_registry/register.py, and sample_registry/registrar.py.
  • Update runtime code to use the TSV-backed data: sample_registry/registrar.py now uses MACHINE_TYPE_MAPPINGS for machines, and sample_registry/app.py uses STANDARD_SAMPLE_TYPES and STANDARD_HOST_SPECIES for stats and counts.
  • Update tests to remove DB-driven standard registration checks and add tests/test_standards.py to validate TSV loading and mappings.

Testing

  • Ran the test suite with pytest -q; test collection failed with ModuleNotFoundError: No module named 'flask_sqlalchemy', so automated tests did not complete.

Codex Task

Copilot AI review requested due to automatic review settings January 26, 2026 17:55
@codecov
Copy link

codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 77.19298% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.13%. Comparing base (167adba) to head (4b2e898).

Files with missing lines Patch % Lines
sample_registry/app.py 0.00% 14 Missing ⚠️
sample_registry/standards.py 87.50% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #62      +/-   ##
==========================================
- Coverage   60.23%   60.13%   -0.11%     
==========================================
  Files           8        9       +1     
  Lines         757      765       +8     
==========================================
+ Hits          456      460       +4     
- Misses        301      305       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the standard reference data (sample types, host species, and machine type mappings) from database-backed storage to TSV file-backed storage. This change makes the reference data authoritative, file-backed, and easily importable by other services without requiring database access.

Changes:

  • Added sample_registry/standards.py with dataclasses and loader classes for TSV-backed reference data
  • Added TSV reference files in sample_registry/data/ directory and duplicates at repository root for URL access
  • Removed StandardSampleType and StandardHostSpecies ORM models and related database initialization code
  • Updated runtime code to use TSV-backed data instead of database queries

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
sample_registry/standards.py New module implementing dataclasses and loaders for TSV-backed standard reference data
sample_registry/data/standard_sample_types.tsv TSV file containing standard sample type reference data
sample_registry/data/standard_host_species.tsv TSV file containing standard host species reference data
sample_registry/data/machine_types.tsv TSV file containing machine prefix to type mappings
sample_registry/data/init.py Package initialization for data directory
machine_types.tsv Duplicate of machine_types.tsv at repository root for easy URL access
sample_registry/models.py Removed StandardSampleType and StandardHostSpecies ORM models
sample_registry/db.py Removed database initialization functions for standard types
sample_registry/registrar.py Updated to use MACHINE_TYPE_MAPPINGS from TSV-backed data
sample_registry/register.py Removed register_sample_types and register_host_species functions
sample_registry/app.py Updated stats endpoint to query using TSV-backed standard lists instead of database joins
tests/test_standards.py New tests validating TSV loading and data access
tests/test_registrar.py Removed tests for DB-backed standard type registration
tests/test_register.py Removed tests for DB-backed standard type registration
pyproject.toml Added sample_registry.data package and data/*.tsv to package data; removed obsolete console scripts

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return list(self._by_prefix.keys())

def values(self) -> list[str]:
return list(self._by_prefix.values())
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values() method returns all machine types from the dictionary, which can contain duplicates if multiple prefixes map to the same machine type. In machine_types.tsv, both "M" and "SH" map to "Illumina-MiSeq". This causes duplicate entries in the argparse choices when SampleRegistry.machines is used (see sample_registry/register.py lines 120 and 142). Consider returning unique values by using set() or modifying the return to be distinct.

Suggested change
return list(self._by_prefix.values())
# Return unique machine types, preserving insertion order
return list(dict.fromkeys(self._by_prefix.values()))

Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants