Post-Pipeline Quality Checks

Post-QC runs once per input CSV, after Step 8 has written every per-target Parquet file. Its job is to catch regressions in pipeline-produced columns, not to re-validate the raw input (see QUALITY_CHECKS.md for that).

This document covers when post-QC runs, what each check verifies, and the supplementary files it produces. For the entry point in the code, see src/post_quality_check.py.

When it runs

Post-QC runs inside process_csv_files, right after the per-target loop over Steps 3–9 ends. It is gated by --end-at >= 8, because Step 8 must have produced Parquet output. If the gate isn't met, or the Step8_FullColumns/ folder is missing or empty, post-QC writes a one-line "skipped: no Parquet files found" log and returns. It is best-effort and never blocks the rest of the pipeline.
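The gate described above can be sketched roughly as follows. This is an illustration, not the code in src/post_quality_check.py: the helper name should_run_post_qc and its parameters are hypothetical, and only the folder name and the skip message are taken from this document.

```python
from pathlib import Path

STEP8_DIRNAME = "Step8_FullColumns"  # folder name taken from this document
SKIP_MESSAGE = "skipped: no Parquet files found"


def should_run_post_qc(end_at: int, processed_dir: Path) -> tuple[bool, str]:
    """Return (run, reason). Hypothetical helper, for illustration only."""
    # Gate 1: Step 8 must have been part of the run.
    if end_at < 8:
        return False, SKIP_MESSAGE
    # Gate 2: the Step 8 folder must exist and hold at least one Parquet file.
    step8 = Path(processed_dir) / STEP8_DIRNAME
    if not step8.is_dir() or not any(step8.glob("*.parquet")):
        return False, SKIP_MESSAGE
    return True, "ok"
```

Returning a reason instead of raising matches the best-effort behaviour: the caller can log the message and carry on.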

If input QC fails for a given file, the pipeline skips that file entirely, including post-QC, so a missing PostQClog_* file after an input-QC failure is expected.

Inputs

The directory ProcessedData_<csv_basename>/Step8_FullColumns/, which holds one .parquet file per target.
The files are concatenated row-wise into a single DataFrame, and every check runs against the combined frame.

Outputs

Two files written next to the input-QC logs, in ProcessedData_<csv_basename>/:

Plain text log — PostQClog_<YYYYMMDD>_<csv_basename>.log, grouped by section, same layout as the input-QC log (sections → numbered checks → Overall → Statistics Summary).
Excel companion — PostQClog_<YYYYMMDD>_<csv_basename>.xlsx, one row per check with columns Section, Check #, Criteria, Status, Detail. Rows are color-coded (green = PASS, red = FAIL, yellow = WARN) and the header row is frozen. A second sheet, Statistics, mirrors the stats block from the log.
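As a small illustration of the naming scheme above, the two output paths could be built like this. The function name post_qc_paths is hypothetical, and treating <YYYYMMDD> as the date of the run is an assumption; the document only specifies the file name pattern.

```python
from datetime import date
from pathlib import Path


def post_qc_paths(processed_dir: Path, csv_basename: str, run_date: date) -> tuple[Path, Path]:
    """Build the .log and .xlsx paths (hypothetical helper)."""
    stamp = run_date.strftime("%Y%m%d")  # assumption: the stamp is the run date
    stem = f"PostQClog_{stamp}_{csv_basename}"
    return processed_dir / f"{stem}.log", processed_dir / f"{stem}.xlsx"


log_path, xlsx_path = post_qc_paths(Path("ProcessedData_demo"), "demo", date(2024, 1, 5))
print(log_path.name)   # PostQClog_20240105_demo.log
print(xlsx_path.name)  # PostQClog_20240105_demo.xlsx
```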

Checks

Total: 23 checks across 5 sections. All reuse the helpers from src/quality_check.py.

Label Checks

LABEL only contains {0, 1} — FAIL. Set-membership against {0, 1}. Catches the silent-zero regression we hit when the binary-label convention changed (see PIPELINE.md → Step 7).
AIRCHECK_LABEL only contains {-2, -1, 0, 1, 2, 3, 4} — FAIL. The full label set produced by Step 6.
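A set-membership check of this kind might look like the sketch below. The constant names come from the Tuning section; the function check_allowed_values and its (status, detail) return shape are assumptions made for illustration, and missing values are not handled here.

```python
LABEL_VALUES = {0, 1}
AIRCHECK_LABEL_VALUES = {-2, -1, 0, 1, 2, 3, 4}


def check_allowed_values(values, allowed):
    """Return (status, detail) for a set-membership check (hypothetical helper)."""
    # Anything present in the column but absent from the allowed set is a failure.
    # key=repr keeps the sort stable even if stray values have mixed types.
    unexpected = sorted(set(values) - set(allowed), key=repr)
    if unexpected:
        return "FAIL", f"unexpected values: {unexpected}"
    return "PASS", "all values in allowed set"


print(check_allowed_values([0, 1, 1, 0], LABEL_VALUES))
print(check_allowed_values([0, 1, 2], LABEL_VALUES))
```

With a DataFrame, the same function could be called on a column, for example check_allowed_values(df["LABEL"], LABEL_VALUES).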

Score Range Checks

PVALUE is in [0, 1] — FAIL. Inclusive on both ends; Welch's t-test should never produce a value outside this range.
TARGET_INTENSITY_VALUE >= 0 — FAIL.
NONTARGET_INTENSITY_VALUE >= 0 — FAIL.
SELECTIVE_VALUE >= 0 — FAIL.
NTC_VALUE >= 0 — FAIL.
ENRICHMENT >= 0 — FAIL.
EASMS_ENRICHMENT >= 0 — FAIL.
SELECTIVE_ENRICHMENT >= 0 — FAIL.
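The range checks above share one shape: an optional lower bound, an optional upper bound, both inclusive. A minimal sketch follows, assuming that missing values (None or NaN) are ignored by the check; the document does not say how the real helpers treat them, and the name check_in_range is hypothetical.

```python
import math

PVALUE_MIN, PVALUE_MAX = 0.0, 1.0  # inclusive bounds, as named under Tuning


def check_in_range(values, low=None, high=None):
    """Return (status, detail); bounds are inclusive (hypothetical helper)."""
    bad = 0
    for v in values:
        # Assumption: missing values are skipped rather than counted as failures.
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        if (low is not None and v < low) or (high is not None and v > high):
            bad += 1
    if bad:
        return "FAIL", f"{bad} value(s) outside range"
    return "PASS", "all values in range"


print(check_in_range([0.0, 0.5, 1.0], PVALUE_MIN, PVALUE_MAX))  # PVALUE-style check
print(check_in_range([12.3, -0.1], low=0))                      # ">= 0" style check
```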

Molecular Property Checks

MW (molecular weight) is positive (> 0) — FAIL. A real molecule can't have zero or negative mass.
ALOGP is in [-5, 10] (typical range) — WARN. Outliers (e.g. extremely lipophilic / hydrophilic compounds) can be legitimate, so this surfaces as informational rather than fatal. Bounds are ALOGP_MIN / ALOGP_MAX at the top of src/post_quality_check.py.
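The difference between the two severities can be sketched as below: the MW check uses a strict bound and fails, while the ALOGP check only warns. The function names are hypothetical, ALOGP_MIN / ALOGP_MAX are the constants named above, and the sketch assumes the columns contain no missing values.

```python
ALOGP_MIN, ALOGP_MAX = -5.0, 10.0  # typical range, per this document


def check_mw_positive(values):
    """Strict bound: zero counts as a failure (hypothetical helper)."""
    bad = sum(1 for v in values if v <= 0)
    if bad:
        return "FAIL", f"{bad} non-positive MW value(s)"
    return "PASS", "all MW values > 0"


def check_alogp_typical(values):
    """Outliers are reported as WARN, never FAIL (hypothetical helper)."""
    outliers = [v for v in values if v < ALOGP_MIN or v > ALOGP_MAX]
    if outliers:
        return "WARN", f"{len(outliers)} value(s) outside [{ALOGP_MIN}, {ALOGP_MAX}]"
    return "PASS", "all ALOGP values in typical range"


print(check_mw_positive([180.16, 0.0]))
print(check_alogp_typical([2.1, 12.5]))
```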

Flag Column Checks

MassSpec_Detected only contains {Y, N} — FAIL. Set-membership.
HAD_DUPLICATE_INTENSITY only contains {Y, N} — FAIL. Set-membership.

Fingerprint Length Checks

Each row's fingerprint vector must have exactly the expected number of bits/features. The check works on both fingerprint storage formats — numpy arrays (fp_format="array", the default) and legacy comma-separated strings (fp_format="string").

ECFP4 length = 2048 — FAIL.
ECFP6 length = 2048 — FAIL.
FCFP4 length = 2048 — FAIL.
FCFP6 length = 2048 — FAIL.
MACCS length = 167 — FAIL.
RDK length = 2048 — FAIL.
AVALON length = 2048 — FAIL.
TOPTOR length = 2048 — FAIL.
ATOMPAIR length = 2048 — FAIL.

Expected lengths come from the _dimension attribute of each HitGen*FPFunc class in src/fingerprints.py. If a future fingerprint changes dimension, update FP_EXPECTED_LENGTHS at the top of src/post_quality_check.py.
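A length check that accepts both storage formats could be sketched as follows. The dictionary mirrors the lengths listed above; the helper names fingerprint_length and check_fp_length are hypothetical. Plain lists stand in for numpy arrays here, since len() behaves the same way on a 1-D array.

```python
FP_EXPECTED_LENGTHS = {
    "ECFP4": 2048, "ECFP6": 2048, "FCFP4": 2048, "FCFP6": 2048,
    "MACCS": 167, "RDK": 2048, "AVALON": 2048, "TOPTOR": 2048, "ATOMPAIR": 2048,
}


def fingerprint_length(value):
    """Length of one fingerprint in either storage format."""
    if isinstance(value, str):
        # Legacy format: comma-separated string, one field per bit/feature.
        return len(value.split(","))
    # Array format: any sized sequence (list, tuple, 1-D numpy array).
    return len(value)


def check_fp_length(values, name):
    """Return (status, detail) for one fingerprint column (hypothetical helper)."""
    expected = FP_EXPECTED_LENGTHS[name]
    bad = sum(1 for v in values if fingerprint_length(v) != expected)
    if bad:
        return "FAIL", f"{bad} row(s) with {name} length != {expected}"
    return "PASS", f"all rows have {name} length {expected}"


rows = [[0] * 167, ",".join(["0"] * 167)]  # one row per storage format
print(check_fp_length(rows, "MACCS"))
```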

What is deliberately not included

Recomputation checks like "verify TARGET_INTENSITY_VALUE == mean(POS_INT_REP1..3)". Those would re-run the pipeline's math against the same code that just produced the value — the check would only fail if the same code disagreed with itself a second time.
Re-validation of raw input columns. That happens before Step 1 and is documented in QUALITY_CHECKS.md.

Tuning

Constants at the top of src/post_quality_check.py:

LABEL_VALUES, AIRCHECK_LABEL_VALUES — discrete allowed sets.
PVALUE_MIN, PVALUE_MAX — p-value bounds (inclusive).
ALOGP_MIN, ALOGP_MAX — typical ALOGP range (used for the WARN check).
FP_EXPECTED_LENGTHS — fingerprint vector dimensions, one entry per fingerprint.
YN_COLUMNS — list of {Y, N} flag columns.

Adding a new column-content check follows the same pattern as the input-QC SECTIONS: add a (description, function) tuple to the relevant section in SECTIONS at the bottom of the file. The orchestrator picks it up on the next run; no other wiring is needed.
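To make the registration pattern concrete, here is a toy version of a SECTIONS registry and an orchestrator loop. Everything beyond the (description, function) tuple shape is an assumption: the real SECTIONS may not be a dict, check numbering may not restart per section, and the rule used for the overall status is a guess. A dict of column lists stands in for the DataFrame.

```python
def check_label(data):
    unexpected = sorted(set(data["LABEL"]) - {0, 1})
    if unexpected:
        return "FAIL", f"unexpected values: {unexpected}"
    return "PASS", "all values in {0, 1}"


def check_mw_positive(data):
    bad = sum(1 for v in data["MW"] if v <= 0)
    if bad:
        return "FAIL", f"{bad} non-positive value(s)"
    return "PASS", "all values > 0"


# Hypothetical layout: section name -> list of (description, function) tuples.
SECTIONS = {
    "Label Checks": [("LABEL only contains {0, 1}", check_label)],
    "Molecular Property Checks": [("MW is positive (> 0)", check_mw_positive)],
}


def run_sections(data, sections):
    """Run every registered check and collect one result row per check."""
    results = []
    for section, checks in sections.items():
        # Assumption: check numbers restart at 1 within each section.
        for number, (description, func) in enumerate(checks, start=1):
            status, detail = func(data)
            results.append((section, number, description, status, detail))
    # Assumption: any FAIL makes the overall result FAIL; WARN does not.
    overall = "FAIL" if any(r[3] == "FAIL" for r in results) else "PASS"
    return results, overall


results, overall = run_sections({"LABEL": [0, 1, 2], "MW": [180.16]}, SECTIONS)
for row in results:
    print(row)
print("Overall:", overall)
```

In this layout, registering a new check means appending one tuple to the right list; the loop needs no changes.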
