Post-QC runs once per input CSV, after Step 8 has written every per-target Parquet file. Its job is to catch regressions in pipeline-produced columns, not to re-validate the raw input (that is covered in QUALITY_CHECKS.md).
This document covers when post-QC runs, what each check verifies, and the supplementary files it produces. For the entry point in the code, see src/post_quality_check.py.
Post-QC runs inside `process_csv_files`, right after the per-target Steps 3–9 loop ends. It is gated by `--end-at >= 8`, because Step 8 must have produced Parquet output. If the gate isn't met, or the `Step8_FullColumns/` folder is missing or empty, post-QC writes a one-line "skipped: no Parquet files found" log and returns. It is best-effort and never blocks the rest of the pipeline.
If input QC fails for a given file, the pipeline skips that file entirely (including post-QC), so the absence of a `PostQClog_*` file after a failed input QC is expected.
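The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/post_quality_check.py: the function name `should_run_post_qc` and its parameters are hypothetical, and only the `Step8_FullColumns/` folder name and the `>= 8` threshold come from this document.

```python
from pathlib import Path


def should_run_post_qc(end_at: int, processed_dir: Path) -> bool:
    """Return True only if Step 8 was reached and left at least one Parquet file.

    Hypothetical helper: the real gate lives inside process_csv_files.
    """
    if end_at < 8:  # Step 8 never ran, so there is nothing to check
        return False
    step8_dir = processed_dir / "Step8_FullColumns"
    if not step8_dir.is_dir():  # folder missing
        return False
    # any() stops at the first .parquet file found; an empty folder yields False
    return any(step8_dir.glob("*.parquet"))
```

When this returns `False`, the caller would write the one-line "skipped" log and return without raising, which keeps post-QC best-effort.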
- The directory `ProcessedData_<csv_basename>/Step8_FullColumns/` (one `.parquet` per target).
- These are concatenated row-wise into a single DataFrame, and the checks run against the whole thing.
Two files are written next to the input-QC logs, in `ProcessedData_<csv_basename>/`:
- Plain text log: `PostQClog_<YYYYMMDD>_<csv_basename>.log`, grouped by section, same layout as the input-QC log (sections → numbered checks → Overall → Statistics Summary).
- Excel companion: `PostQClog_<YYYYMMDD>_<csv_basename>.xlsx`, one row per check with columns `Section`, `Check #`, `Criteria`, `Status`, `Detail`. Rows are color-coded (green = PASS, red = FAIL, yellow = WARN) and the header row is frozen. A second sheet, `Statistics`, mirrors the stats block from the log.
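As an illustration of the naming scheme, the two output paths could be derived like this. The helper name `post_qc_output_paths` is hypothetical, and the sketch assumes `<csv_basename>` is the input file name without its extension.

```python
from datetime import date
from pathlib import Path


def post_qc_output_paths(processed_dir: Path, csv_basename: str, run_date: date) -> tuple[Path, Path]:
    """Build the .log and .xlsx paths for one post-QC run (hypothetical helper)."""
    # <YYYYMMDD> is rendered with the standard strftime codes
    stem = f"PostQClog_{run_date:%Y%m%d}_{csv_basename}"
    return processed_dir / f"{stem}.log", processed_dir / f"{stem}.xlsx"


log_path, xlsx_path = post_qc_output_paths(
    Path("ProcessedData_example"), "example", date(2024, 3, 5)
)
print(log_path.name)   # PostQClog_20240305_example.log
print(xlsx_path.name)  # PostQClog_20240305_example.xlsx
```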
Total: 23 checks across 5 sections. All reuse the helpers from src/quality_check.py.
- `LABEL` only contains `{0, 1}`: FAIL. Set-membership against `{0, 1}`. Catches the silent-zero regression we hit when the binary-label convention changed (see PIPELINE.md → Step 7).
- `AIRCHECK_LABEL` only contains `{-2, -1, 0, 1, 2, 3, 4}`: FAIL. The full label set produced by Step 6.
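A set-membership check of this kind might look like the sketch below. The real checks reuse helpers from src/quality_check.py and operate on DataFrame columns; here a plain list stands in for a column, and the function name `check_allowed_values` and its `(status, detail)` return shape are assumptions made for illustration.

```python
# Allowed sets as listed in this document
LABEL_VALUES = {0, 1}
AIRCHECK_LABEL_VALUES = {-2, -1, 0, 1, 2, 3, 4}


def check_allowed_values(values, allowed):
    """Return (status, detail); FAIL if any value falls outside `allowed`."""
    unexpected = set(values) - set(allowed)
    if unexpected:
        return "FAIL", f"unexpected values: {sorted(unexpected)}"
    return "PASS", f"all values in {sorted(allowed)}"


print(check_allowed_values([0, 1, 1, 0], LABEL_VALUES))
print(check_allowed_values([0, 0, 2], LABEL_VALUES))
```

The same function covers any discrete column, since only the allowed set changes between checks.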
- `PVALUE` is in `[0, 1]`: FAIL. Inclusive on both ends; Welch's t-test should never produce a value outside this range.
- `TARGET_INTENSITY_VALUE >= 0`: FAIL.
- `NONTARGET_INTENSITY_VALUE >= 0`: FAIL.
- `SELECTIVE_VALUE >= 0`: FAIL.
- `NTC_VALUE >= 0`: FAIL.
- `ENRICHMENT >= 0`: FAIL.
- `EASMS_ENRICHMENT >= 0`: FAIL.
- `SELECTIVE_ENRICHMENT >= 0`: FAIL.
- `MW` (molecular weight) is positive (`> 0`): FAIL. A real molecule can't have zero or negative mass.
- `ALOGP` is in `[-5, 10]` (typical range): WARN. Outliers (e.g. extremely lipophilic or hydrophilic compounds) can be legitimate, so this surfaces as informational rather than fatal. Bounds are `ALOGP_MIN` / `ALOGP_MAX` at the top of src/post_quality_check.py.
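The range checks differ only in their bounds, whether the lower bound is inclusive, and the severity reported on violation. A minimal sketch under those assumptions follows; `check_range` is a hypothetical name, plain lists stand in for DataFrame columns, and treating NaN as out of range is a choice made here rather than documented behavior.

```python
import math

# Bounds as listed in this document
PVALUE_MIN, PVALUE_MAX = 0.0, 1.0
ALOGP_MIN, ALOGP_MAX = -5.0, 10.0


def check_range(values, low=-math.inf, high=math.inf, low_inclusive=True, severity="FAIL"):
    """Return (status, detail); `severity` is reported when any value is out of range."""

    def in_range(v):
        above = v >= low if low_inclusive else v > low
        # NaN compares False on both sides, so it counts as out of range here
        return above and v <= high

    bad = [v for v in values if not in_range(v)]
    if bad:
        return severity, f"{len(bad)} value(s) out of range, first: {bad[0]}"
    return "PASS", "all values in range"


print(check_range([0.0, 0.5, 1.0], PVALUE_MIN, PVALUE_MAX))            # inclusive bounds
print(check_range([0.0, 180.2], low=0.0, low_inclusive=False))          # MW must be > 0
print(check_range([2.1, 12.4], ALOGP_MIN, ALOGP_MAX, severity="WARN"))  # outlier only warns
```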
- `MassSpec_Detected` only contains `{Y, N}`: FAIL. Set-membership.
- `HAD_DUPLICATE_INTENSITY` only contains `{Y, N}`: FAIL. Set-membership.
Each row's fingerprint vector must have exactly the expected number of bits/features. The check works on both fingerprint storage formats — numpy arrays (fp_format="array", the default) and legacy comma-separated strings (fp_format="string").
- `ECFP4` length = 2048: FAIL.
- `ECFP6` length = 2048: FAIL.
- `FCFP4` length = 2048: FAIL.
- `FCFP6` length = 2048: FAIL.
- `MACCS` length = 167: FAIL.
- `RDK` length = 2048: FAIL.
- `AVALON` length = 2048: FAIL.
- `TOPTOR` length = 2048: FAIL.
- `ATOMPAIR` length = 2048: FAIL.
Expected lengths come from the _dimension attribute of each HitGen*FPFunc class in src/fingerprints.py. If a future fingerprint changes dimension, update FP_EXPECTED_LENGTHS at the top of src/post_quality_check.py.
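One way to handle both storage formats in a single length check is sketched below. The dict layout of `FP_EXPECTED_LENGTHS` and the helper names are assumptions; only the fingerprint names and dimensions come from this document. Plain lists stand in for numpy arrays, since `len()` behaves the same for one-dimensional arrays.

```python
# Dimensions as listed in this document; the dict shape is an assumption
FP_EXPECTED_LENGTHS = {
    "ECFP4": 2048, "ECFP6": 2048, "FCFP4": 2048, "FCFP6": 2048,
    "MACCS": 167, "RDK": 2048, "AVALON": 2048, "TOPTOR": 2048, "ATOMPAIR": 2048,
}


def fingerprint_length(fp):
    """Length of one fingerprint in either storage format."""
    if isinstance(fp, str):  # legacy fp_format="string": comma-separated values
        return len(fp.split(","))
    return len(fp)  # fp_format="array": numpy array or any sequence


def check_fp_length(fingerprints, expected_len):
    """Return (status, detail); FAIL if any row has the wrong number of features."""
    bad_rows = [i for i, fp in enumerate(fingerprints) if fingerprint_length(fp) != expected_len]
    if bad_rows:
        return "FAIL", f"{len(bad_rows)} row(s) with wrong length, first at row {bad_rows[0]}"
    return "PASS", f"all rows have length {expected_len}"


maccs_rows = [[0] * 167, ",".join(["0"] * 167), [0] * 166]
print(check_fp_length(maccs_rows, FP_EXPECTED_LENGTHS["MACCS"]))
```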
Post-QC deliberately leaves out two kinds of checks:
- Recomputation checks like "verify `TARGET_INTENSITY_VALUE == mean(POS_INT_REP1..3)`". Those would re-run the pipeline's math against the same code that just produced the value, so the check would only fail if the same code disagreed with itself a second time.
- Re-validation of raw input columns. That happens before Step 1 in QUALITY_CHECKS.md.
Constants at the top of src/post_quality_check.py:
- `LABEL_VALUES`, `AIRCHECK_LABEL_VALUES`: discrete allowed sets.
- `PVALUE_MIN`, `PVALUE_MAX`: p-value bounds (inclusive).
- `ALOGP_MIN`, `ALOGP_MAX`: typical ALOGP range (used for the WARN check).
- `FP_EXPECTED_LENGTHS`: fingerprint vector dimensions, one entry per fingerprint.
- `YN_COLUMNS`: list of `{Y, N}` flag columns.
Adding a new column-content check follows the same pattern as the input-QC `SECTIONS`: add a `(description, function)` tuple to the relevant section in `SECTIONS` at the bottom of the file. The orchestrator picks it up on the next run; no other wiring is needed.
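The registration pattern can be pictured with the sketch below. Everything beyond the `(description, function)` tuple is an assumption: the section names, the shape of `SECTIONS` as a list of `(section_name, checks)` pairs, the check-function signature, and the `run_sections` orchestrator are all hypothetical, and a list of dicts stands in for the concatenated DataFrame.

```python
def label_is_binary(rows):
    """Hypothetical check function: returns (status, detail)."""
    bad = {r["LABEL"] for r in rows} - {0, 1}
    return ("FAIL", f"unexpected values: {sorted(bad)}") if bad else ("PASS", "ok")


def mw_is_positive(rows):
    """Hypothetical check function: MW must be strictly greater than zero."""
    bad = sum(1 for r in rows if not r["MW"] > 0)
    return ("FAIL", f"{bad} non-positive value(s)") if bad else ("PASS", "ok")


# Adding a check means appending one (description, function) tuple to a section
SECTIONS = [
    ("Labels", [("LABEL only contains {0, 1}", label_is_binary)]),
    ("Chemical properties", [("MW is positive (> 0)", mw_is_positive)]),
]


def run_sections(rows, sections=SECTIONS):
    """Run every registered check; one result tuple per check, numbered in order."""
    results = []
    check_no = 0
    for section_name, checks in sections:
        for description, func in checks:
            check_no += 1
            status, detail = func(rows)
            results.append((section_name, check_no, description, status, detail))
    return results


rows = [{"LABEL": 1, "MW": 300.2}, {"LABEL": 0, "MW": 0.0}]
for result in run_sections(rows):
    print(result)
```

Each result tuple lines up with the Excel columns `Section`, `Check #`, `Criteria`, `Status`, `Detail`, which is why a new tuple in `SECTIONS` needs no extra wiring in this sketch.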