Skip to content

ctd_cast_uuid is not a unique PK: OxAveuM_StaCorr mis-modeled + GPS jitter #53

@bbest

Description

@bbest

Problem

ctd_cast.ctd_cast_uuid is declared the primary key but is not unique — 6,063,332 rows / 5,550,014 distinct uuid → 513,318 duplicate-uuid rows (319,767 dup groups). Pre-existing (the v2026.04.08 release has it too — that ctd_cast had 594,011 dup rows). Distinct from #52 (different mechanism).

Found while verifying the #52 fix: the ctd_raw de-dup (#52) reduced ctd_cast dups 594K→513K but a second mechanism remains.

Two causes

Column-variance breakdown across the 319,767 dup-uuid groups:

column groups where it varies
ox_aveu_m_sta_corr 277,893
geom / lat_dec / lon_dec 140,990 / 81,807 / 95,646
everything else (datetime_utc, site_key, cast_key, data_stage, …) 0

Cause 1 — OxAveuM_StaCorr is a mis-modeled measurement (dominant, ~87%). The source CSV column OxAveuM_StaCorr (oxygen µM average, station-corrected) is a per-depth measurement value, but it is not registered in metadata/measurement_type.csv (the umol/kg oxygen types registered are only ox1u_m* and ox2u_m*, not the ave). So meas_cols doesn't include it, cast_cols <- setdiff(raw_cols, c(meas_cols, ...)) keeps it, and it lands in ctd_cast as if it were cast-level metadata. Because it varies per depth, SELECT DISTINCT cast_cols emits multiple ctd_cast rows per (cruise_key, cast_key, cast_dir, datetime_utc).

Cause 2 — GPS jitter. lat_dec/lon_dec differ in the ~5th–6th decimal (sub-meter) between scans that share a 1-second datetime_utc.

Effect

  • ctd_cast_uuid is not a valid unique PK.
  • ctd_summary (built via ctd_measurement m JOIN ctd_cast c ON ctd_cast_uuid) fans out — confirmed +137,482 extra rows for cruise 2003-02-31JD alone (~4.6% inflation), so ctd_summary.n_obs is over-counted for affected positions.
  • Joining ctd_thinctd_cast for cast metadata (e.g. in the ctd-viz app) would duplicate cast points.

Suggested fix

  1. Register OxAveuM_StaCorr in metadata/measurement_type.csv (e.g. oxygen_umol_kg_ave_sta_corr, mirroring the existing oxygen_ml_l_ave_sta_corr) so it pivots into ctd_measurement and leaves cast_cols. Consider making it the canonical umol/kg oxygen type (the ave is generally preferred over sensor-1).
  2. De-duplicate ctd_cast to one row per (cruise_key, cast_key, cast_dir, datetime_utc) for the residual GPS-jitter case (pick one position deterministically).
  3. Add a guard: assert every raw measurement-like column is registered in measurement_type.csv (so an unregistered column can't silently leak into ctd_cast again).

Once fixed, ctd_cast_uuid is a valid PK and ctd_summary is no longer inflated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions