Skip to content

ctd_measurement has duplicate (ctd_cast_uuid, depth_m, measurement_type) keys #52

@bbest

Description

@bbest

Problem

ctd_measurement (released parquet, v2026.04.08) has duplicate (ctd_cast_uuid, depth_m, measurement_type) keys — its declared natural key. assign_deterministic_uuids_md5() only warns on non-unique keys, so ctd_measurement_uuid is itself non-unique and the duplication shipped silently.

Evidence

Sampled the first 4 cruise partitions of data/parquet/calcofi_ctd-cast/ctd_measurement/:

cruise_key rows duplicate-key rows
1998-02-31JD 1,250,773 0
1998-04-31JD 3,040,202 1,520,101 (every row doubled)
1998-07-32NM 1,582,430 0
2003-02-31JD 6,006,342 3,002,285 (every row doubled)

Across those 4 cruises: 4,522,386 duplicate-key groups. 96.6% have identical measurement_value (true exact duplicates); 3.4% have conflicting values (max spread 43.76 in measurement units) — likely the same cast present in both a final and a preliminary source file, both unzipped and loaded.

Likely cause

In ingest_calcofi_ctd-cast.qmd, the read/bind/filter step or the final-vs-preliminary selection is not deduplicating: some cruises' casts are loaded twice. The "selects final if available, otherwise preliminary" logic may not be excluding the preliminary file when a final exists, or a file is bound twice.

Impact

  • ctd_measurement_uuid is non-unique → breaks any downstream join/PK assumption.
  • Doubled rows inflate the table (~38% of sampled rows are duplicates).
  • Conflicting-value dups mean some (cast, depth, type) positions have two different recorded values with no recorded provenance to disambiguate.

Workaround in place

The new ctd_thin table (#51) de-duplicates defensively — QUALIFY ROW_NUMBER() OVER (PARTITION BY ctd_cast_uuid, depth_m, measurement_type ORDER BY measurement_qual NULLS LAST, measurement_value) = 1 — so it has a unique key regardless. But ctd_measurement itself should be fixed at the source.

Suggested fix

  • Identify why those cruises double-load (final/preliminary dedup, or a double bind)
  • Add a uniqueness assertion (not just a warning) on ctd_measurement's key after the pivot, failing the ingest if violated
  • For genuinely conflicting final/preliminary values, decide a precedence rule (prefer final) and record it

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions