Step 0 of the pipeline. Runs a series of pre-processing validations against every raw input file before any data transformation. If any check returns FAIL the file is skipped and the rest of the pipeline does not run on it; WARN does not block.
This document covers what each check does, what produces a FAIL vs WARN, the supplementary report files some checks generate, and how to add new checks. For the entry point in the code, see src/quality_check.py.
For each input file, QC writes two files into ProcessedData_<csv_basename>/:
- Plain text log — QCaircheck<YYYYMMDD>_<csv_basename>.log, grouped by section.
- Excel companion — QCaircheck<YYYYMMDD>_<csv_basename>.xlsx, one row per check with columns Section, Check #, Criteria, Status, Detail. Rows are color-coded (green = PASS, red = FAIL, yellow = WARN) and the header row is frozen.
Both files contain the same information; pick whichever is easier to read.
Example of the Excel companion log:
The screenshot above shows a run where Check 7 (column-name match against ASMS Meta Data.csv) failed with both missing and extra columns, and Check 15 (duplicate rows) raised a WARN — the pipeline would skip this file because of Check 7, but the duplicate-row warning on its own would not have blocked it.
Some checks also produce supplementary CSVs (e.g. FullyDuplicate_rows_report.csv, invalid_smiles_report.csv, formula_not_in_library_report.csv, …) next to the QC logs. They are listed under the relevant check below.
After all checks run, both the .log and .xlsx outputs include a Statistics Summary section so you can spot-check the dataset at a glance without opening the raw CSV. It runs whether the overall result is PASS or FAIL.
The .log ends with a plain-text block; the .xlsx has a second sheet named Statistics. Both contain the same three sub-sections, sourced from the QC-loaded dataframe (i.e. after fully-duplicate rows are dropped):
- Overview — total rows, total columns, distinct proteins (TARGET_ID), distinct compounds (COMPOUND_ID).
- Per-protein breakdown — for each TARGET_ID: row count and the BINARY_LABEL=0 / BINARY_LABEL=1 counts (label columns only appear when BINARY_LABEL exists in the input).
- Numeric column statistics — for every numeric column (POOL_SIZE, PROTEIN_CONC, POS_INT_REP1/2/3, MZ, RT, …): non-null count, min, max, mean.
Excerpt of what the .log Statistics block looks like:
============================================================
Statistics Summary
============================================================
Total rows: 85,224
Total columns: 26
Distinct proteins (TARGET_ID): 8
Distinct compounds (COMPOUND_ID): 12,000
Per-protein breakdown (rows + BINARY_LABEL counts):
  TARGET_ID               rows     label=0   label=1
  WDR91_A4D1P6_392_747    10,653   10,521    132
  rep_P0DTD1_5325_5925    10,653   10,418    235
  ...
Numeric column statistics:
  Column          non-null   min     max       mean
  POOL_SIZE       85,224     463     1488      876.4
  POS_INT_REP1    85,224     2,103   9.2e+09   1.4e+06
  ...
The summary is defensive by design — each section, each protein row, and each numeric column is wrapped independently, so a failure in one (e.g. a corrupted single column) only replaces that row with (could not compute: <reason>) and never blocks the rest. A catastrophic failure of the whole summary is captured as (Statistics summary failed to render: <reason>) at the bottom of the log.
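The per-column isolation described above can be sketched as follows. This is a minimal illustration, not the code in src/quality_check.py: the function name summarize_columns and the dict-of-lists input are assumptions made for the example, and the real summary works on the QC-loaded dataframe.

```python
def summarize_columns(columns):
    """Return one text line per column; a bad column never aborts the others."""
    lines = []
    for name, values in columns.items():
        try:
            present = [float(v) for v in values if v is not None]
            mean = sum(present) / len(present)  # raises if every value is null
            lines.append(
                f"{name}: n={len(present)} min={min(present):g} "
                f"max={max(present):g} mean={mean:g}"
            )
        except Exception as exc:  # isolate the failure to this one column
            lines.append(f"{name}: (could not compute: {exc})")
    return lines


report = summarize_columns({
    "POOL_SIZE": [463, 1488, 900],
    "BROKEN": ["12", "not-a-number"],  # hypothetical corrupted column
})
for line in report:
    print(line)
```

Only the BROKEN row is replaced by a (could not compute: ...) message; the POOL_SIZE row is still reported.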
- File opens without errors — verifies the OS can open the file for reading.
- File is a CSV (extension) — verifies the file extension is .csv.
- File is not empty — verifies size > 0 bytes.
- File size is under 10 GB — guards against accidentally pointing at something huge. Limit is MAX_FILE_SIZE_GB at the top of src/quality_check.py.
- File encoding is UTF-8 — reads the file in chunks and verifies it decodes cleanly with no encoding errors.
- File is a CSV (parseable content) — uses pandas.read_csv(nrows=5) to confirm the content actually parses as CSV; reports row count, column count, and column names.
- Columns match ASMS Meta Data.csv reference — compares the input file's column names against the canonical list in ASMS Meta Data.csv (see Readme.md → Data Inputs → ASMS Meta Data.csv). On mismatch the log lists missing from file: [...] and extra columns not in reference: [...].
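The column comparison in the last check can be sketched like this. The function name compare_columns is hypothetical, and the sketch assumes names are compared as exact, case-sensitive strings, which the text above does not state.

```python
def compare_columns(file_columns, reference_columns):
    """Return (missing, extra), each kept in its original order."""
    file_set = set(file_columns)
    ref_set = set(reference_columns)
    missing = [c for c in reference_columns if c not in file_set]
    extra = [c for c in file_columns if c not in ref_set]
    return missing, extra


missing, extra = compare_columns(
    ["COMPOUND_ID", "SMILES", "NOTES"],      # columns found in the input file
    ["COMPOUND_ID", "SMILES", "TARGET_ID"],  # canonical reference columns
)
message = f"missing from file: {missing}; extra columns not in reference: {extra}"
print(message)
```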
Files must be named asms_<provider>_<batch>_<library>_<YYYYMMDD>.csv (e.g. asms_acmecorp_01_Chemdiv_9k_20260512.csv).
- Filename has no special characters or spaces — only [A-Za-z0-9_.-] allowed.
- Filename starts with asms_.
- Filename matches overall format — must parse as asms_<provider>_<batchN>_<library>_<YYYYMMDD>.csv.
- Provider acronym is registered — the <provider> token must appear in the acronym column of Providers.csv (see Providers config below).
- Batch number is in valid range — integer between MIN_BATCH_NUMBER (0) and MAX_BATCH_NUMBER (10000). Leading zeros are allowed (01, 0100).
- Library name is registered — the <library> token must match the filename stem of a file in MasterLists/ (excluding MasterList_Information.xlsx).
- Date is valid YYYYMMDD and not in the future — parsed with datetime.strptime; must be ≤ today.
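A rough sketch of how the filename could be split into its tokens and the date validated. The regular expression and the function name parse_filename are assumptions made for illustration; the actual pattern in src/quality_check.py may differ. The sketch requires exactly eight digits before calling strptime, because "%m" and "%d" also accept single-digit fields.

```python
import re
from datetime import date, datetime

# Hypothetical pattern, written for this example only.
FILENAME_RE = re.compile(
    r"^asms_(?P<provider>[A-Za-z0-9]+)_(?P<batch>\d+)"
    r"_(?P<library>.+)_(?P<date>\d{8})\.csv$"
)


def parse_filename(name, today=None):
    """Return the parsed tokens, or None if the name or the date is invalid."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    try:
        parsed = datetime.strptime(match["date"], "%Y%m%d").date()
    except ValueError:  # e.g. month 13 or day 40
        return None
    if parsed > (today or date.today()):  # dates in the future are rejected
        return None
    return {
        "provider": match["provider"],
        "batch": int(match["batch"]),  # int() accepts leading zeros
        "library": match["library"],
        "date": parsed,
    }


tokens = parse_filename(
    "asms_acmecorp_01_Chemdiv_9k_20260512.csv", today=date(2026, 6, 1)
)
print(tokens)
```

Registration of the provider and library tokens would be checked separately against Providers.csv and MasterLists/.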
- No fully duplicate rows — uses pandas.DataFrame.duplicated(keep="first") to detect rows where every column value matches an earlier row. Reports the count and the file line numbers (1-indexed, including the header row) of the first few duplicates.
  - Severity: WARN, not FAIL. Duplicate rows do not block the pipeline — they are removed later by Step 3 (anomaly_selection), which calls df.drop_duplicates().
  - When duplicates are found, the check writes FullyDuplicate_rows_report.csv containing every row that is part of a duplicate group (all copies, not just the dropped ones) with a leading FileLine column indicating the 1-indexed line in the source CSV.
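The mapping from a data row to its file line number can be illustrated with a standard-library analogue of duplicated(keep="first"). The function name duplicate_file_lines is hypothetical, and the sketch assumes the header occupies file line 1 and that no quoted field spans several physical lines.

```python
def duplicate_file_lines(rows):
    """Return 1-indexed file lines of rows identical to an earlier row.

    `rows` holds the data rows only. The header is assumed to be file
    line 1, so data row i (0-based) sits on file line i + 2.
    """
    seen = set()
    lines = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            lines.append(i + 2)  # later copy of a row already seen
        else:
            seen.add(key)        # first occurrence is kept, not flagged
    return lines


rows = [("C1", "CCO"), ("C2", "CCN"), ("C1", "CCO"), ("C1", "CCO")]
dup_lines = duplicate_file_lines(rows)
print(dup_lines)
```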
Before running these, the orchestrator reads the file once and drops fully-duplicate rows (the same rows Check 15 flagged), so column-content checks see the cleaned data — not the raw file. This is QC-internal only; the actual pipeline's Step 3 still does its own drop_duplicates() on the unmodified input.
- COMPOUND_ID is string (VARCHAR) — every non-null value must be a Python str. WARN if any non-string values are found (pandas may auto-cast numeric-looking IDs to int, which is worth flagging but not necessarily fatal).
- COMPOUND_ID has no leading/trailing whitespace — FAIL if any value starts or ends with a space or tab; whitespace tends to break joins against MasterLists/ later.
- COMPOUND_ID has no null values — FAIL if any row's COMPOUND_ID is NaN.
- COMPOUND_ID is unique within each TARGET_ID — assumes each molecule is tested at most once per target. If duplicates exist (same TARGET_ID, same COMPOUND_ID, but the rows differ in some other column — otherwise Check 15 would have caught them), the check WARNs and writes one report CSV per offending target: duplicate_COMPOUND_ID_per_TARGET_ID_<TARGET_ID>.csv (filename is sanitized to alphanumerics + _.-).
QC runs before Step 4 (isomer handling), so SMILES values may still be ;-separated isomer groups (e.g. "CC;CCC"). Checks 23 and 24 split on ; and validate / match each component independently, so isomer rows don't false-fail.
- SMILES is string (VARCHAR) — every non-null value must be a Python str. WARN.
- SMILES has no leading/trailing whitespace — FAIL; matches against the library are exact-string matches.
- SMILES has no null values — FAIL.
- SMILES is valid (non-empty, RDKit-parseable) — FAIL when any row is empty or any isomer component fails rdkit.Chem.MolFromSmiles. Writes invalid_smiles_report.csv with the offending rows and an Issue column (empty or malformed: '<part>').
- SMILES is in the associated library — WARN. Resolves the library for this raw CSV via MasterList_Information.xlsx (FileName → MaterListName → <MaterListName>.xlsx), loads its SMILES column, and checks every input SMILES (or isomer component) against it. Writes smiles_not_in_library_report.csv with each offending row and a MissingPart column showing which component wasn't found. This check FAILs (not WARNs) if the library could not be located at all — that's a configuration error.
- SMILES is unique within each TARGET_ID — WARN. Same idea as Check 19 but on SMILES: if the same molecule appears more than once for the same target, a per-target report is written: duplicate_SMILES_per_TARGET_ID_<TARGET_ID>.csv.
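The component-wise library lookup for ;-separated isomer groups can be sketched as below. The function name missing_parts and the in-memory library set are assumptions for the example; the real check loads the SMILES column of the resolved library file and, as noted above, matches exact strings.

```python
def missing_parts(smiles_value, library_smiles):
    """Return the ;-separated components that are not in the library set."""
    return [
        part for part in smiles_value.split(";")
        if part not in library_smiles
    ]


library = {"CC", "CCC", "CCO"}  # hypothetical library contents
results = {
    value: missing_parts(value, library)
    for value in ["CC;CCC", "CCO", "CC;CCN"]
}
print(results)
```

A row is reported only when its list is non-empty, and each entry corresponds to what the MissingPart column would show.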
Each raw file represents exactly one batch, so every row should share the same ASMS_BATCH_NAME value, formatted as <provider>_<batch_number> (e.g. sgcto_01) where <provider> is one of the acronyms registered in Providers.csv.
- ASMS_BATCH_NAME is string (VARCHAR) — WARN if any non-string values.
- ASMS_BATCH_NAME has no leading/trailing whitespace — FAIL.
- ASMS_BATCH_NAME has no null values — FAIL.
- ASMS_BATCH_NAME is consistent across all rows — FAIL when more than one distinct value appears in the column (one raw file should encode exactly one batch).
- ASMS_BATCH_NAME follows <provider>_<batch_number> — FAIL when a value does not match the regex ^[A-Za-z]+_\d+$ or when its provider segment is not in the loaded Providers.csv list. Lists the offending values (up to five) in the log message.
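A small sketch of the format check, using the regex quoted above. The function name bad_batch_names is hypothetical, and the sketch assumes the provider segment is compared case-sensitively against the acronym list, which is not stated here.

```python
import re

BATCH_NAME_RE = re.compile(r"^[A-Za-z]+_\d+$")  # regex quoted in the check


def bad_batch_names(values, providers, limit=5):
    """Return up to `limit` distinct values with a bad format or provider."""
    bad = []
    for value in dict.fromkeys(values):  # distinct values, order preserved
        provider = value.split("_")[0]
        if not BATCH_NAME_RE.match(value) or provider not in providers:
            bad.append(value)
    return bad[:limit]


offending = bad_batch_names(
    ["sgcto_01", "sgcto_01", "sgcto-01", "unknown_02"],
    providers={"sgcto"},
)
print(offending)
```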
The library file uses the column name formula (lowercase, though the lookup is case-insensitive so Formula / FORMULA also work). Isomer rows in the input may have ;-separated SMILES and ;-separated COMPOUND_FORMULA — the matching check pairs the two component-by-component.
- COMPOUND_FORMULA is string (VARCHAR) — WARN.
- COMPOUND_FORMULA has no leading/trailing whitespace — FAIL.
- COMPOUND_FORMULA has no null values — FAIL.
- COMPOUND_FORMULA is in the associated library — WARN. Checks set membership: every formula in COMPOUND_FORMULA (or each component when the value is ;-separated) must appear in the formula column of the associated library. When some don't, the check writes formula_not_in_library_report.csv with the offending rows plus a leading MissingFormula column showing which component wasn't found. FAILs (instead of WARNing) only when the library itself cannot be located.
- POOL_NAME is string (VARCHAR) — WARN.
- POOL_NAME has no leading/trailing whitespace — FAIL.
- POOL_NAME has no null values — FAIL.
- POOL_ID is string (VARCHAR) — WARN.
- POOL_ID has no leading/trailing whitespace — FAIL.
- POOL_ID has no null values — FAIL.
- POOL_SIZE is integer (INT) — WARN. Accepts either an integer dtype, or a float dtype whose non-null values are all whole numbers (pandas upcasts an integer column to float as soon as one NaN appears, so this is common).
- POOL_SIZE is in valid range [400, 1500] — FAIL. Bounds are configurable via POOL_SIZE_MIN / POOL_SIZE_MAX at the top of src/quality_check.py. The detail message reports both out-of-range values and any cells that couldn't be parsed as numbers.
- POOL_SIZE has no null values — FAIL.
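The range check and its two kinds of findings can be sketched as follows. The function name check_pool_size is hypothetical, and the sketch assumes the bracket notation [400, 1500] means both bounds are inclusive.

```python
POOL_SIZE_MIN, POOL_SIZE_MAX = 400, 1500  # bounds named in the check above


def check_pool_size(values):
    """Return (out_of_range, unparseable) for an iterable of raw cells."""
    out_of_range, unparseable = [], []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            unparseable.append(value)  # not a number at all
            continue
        if not POOL_SIZE_MIN <= number <= POOL_SIZE_MAX:
            out_of_range.append(value)
    return out_of_range, unparseable


out_of_range, unparseable = check_pool_size([463, 1488.0, 399, "n/a", 1501])
print(out_of_range, unparseable)
```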
TARGET_ID must follow <name>_<UniprotID>_<startAA>_<endAA> (e.g. WDR91_A4D1P6_392_747). The Uniprot_ID segment is not yet validated against a registry — see the TODO at the bottom of the TARGET_ID block in src/quality_check.py; a check function and the wiring instructions are sketched there for when a list of valid Uniprot_IDs becomes available.
- TARGET_ID is string (VARCHAR) — WARN.
- TARGET_ID has no leading/trailing whitespace — FAIL.
- TARGET_ID has no null values — FAIL.
- TARGET_ID matches <name>_<UniprotID>_<start>_<end> — FAIL. Regex: ^[A-Za-z0-9]+_[A-Za-z0-9]+_\d+_\d+$. Lists up to five offending values in the log message.
- TARGET_ID start < end (and both numeric) — FAIL. Parses the two trailing digit groups as integers and verifies start < end for every unique TARGET_ID that matched the format.
- All TARGET_IDs have the same number of compounds — WARN. Each batch is expected to test the same library against every target, so all TARGET_IDs should appear with identical COMPOUND_ID counts. When counts differ the detail reports min, max, and the distinct counts observed.
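The format check and the start < end check can be combined in a short sketch that uses the regex quoted above. The function name target_id_problem is an assumption for the example; the real code runs these as two separate checks.

```python
import re

TARGET_ID_RE = re.compile(r"^[A-Za-z0-9]+_[A-Za-z0-9]+_\d+_\d+$")


def target_id_problem(target_id):
    """Return None if the ID is acceptable, else a short reason."""
    if not TARGET_ID_RE.match(target_id):
        return "bad format"
    # The two trailing digit groups are the start and end residues.
    start, end = (int(part) for part in target_id.rsplit("_", 2)[1:])
    if start >= end:
        return f"start {start} is not below end {end}"
    return None


problems = {
    tid: target_id_problem(tid)
    for tid in ["WDR91_A4D1P6_392_747", "WDR91_A4D1P6_747_392", "WDR91-A4D1P6"]
}
print(problems)
```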
A number associated with each protein in the batch. In a batch of 8 proteins, the values are integers 1, 2, …, 8; in a 5-protein batch they would be 1, 2, …, 5. The check is flexible on N (the batch size) but expects the distinct values to be exactly {1, 2, …, N}.
- PROTEIN_NUMBER is integer (INT) — WARN.
- PROTEIN_NUMBER has no null values — FAIL.
- PROTEIN_NUMBER values form {1, 2, ..., N} — WARN. PASSes when the distinct values are exactly the integers 1..N for whatever N is observed in the file. Any deviation (fewer, more, gaps, off-by-one start, non-integer values, unparseable values) WARNs and the detail message lists the actual distinct values plus the expected {1..N} set so you can see what's off.
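One way to express the {1, 2, ..., N} test is sketched below. It assumes N is the number of distinct values observed in the column; the text above does not say how the real check derives N, and the function name protein_number_check is hypothetical.

```python
def protein_number_check(values):
    """Return (status, detail); N is taken as the number of distinct values."""
    distinct = set(values)
    expected = set(range(1, len(distinct) + 1))
    if distinct == expected:
        return "PASS", f"values form 1..{len(distinct)}"
    # key=str keeps sorting safe when the column mixes numbers and strings.
    found = sorted(distinct, key=str)
    return "WARN", f"found {found}, expected {sorted(expected)}"


ok = protein_number_check([1, 2, 3, 3, 2, 1])
gap = protein_number_check([1, 2, 4])
zero_based = protein_number_check([0, 1, 2])
print(ok, gap, zero_based, sep="\n")
```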
PROTEIN_ID holds the Uniprot ID of the protein. Each TARGET_ID represents one protein region, so all rows sharing a TARGET_ID must also share the same PROTEIN_ID.
- PROTEIN_ID is string (VARCHAR) — WARN.
- PROTEIN_ID has no leading/trailing whitespace — FAIL.
- PROTEIN_ID has no null values — FAIL.
- PROTEIN_ID is consistent within each TARGET_ID — FAIL. Groups rows by TARGET_ID and counts distinct PROTEIN_ID values per group; if any group has more than one, the row group is flagged. The detail message lists up to five offending targets along with the conflicting PROTEIN_ID values seen.
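The group-and-count logic can be sketched with the standard library. The real check presumably uses a pandas groupby on the dataframe; the function name inconsistent_targets and the (target_id, protein_id) tuple input are assumptions, and the protein IDs in the sample rows are made up for illustration.

```python
from collections import defaultdict


def inconsistent_targets(rows, limit=5):
    """rows: iterable of (target_id, protein_id) pairs.

    Return up to `limit` targets that map to more than one protein,
    together with the conflicting protein IDs.
    """
    seen = defaultdict(set)
    for target_id, protein_id in rows:
        seen[target_id].add(protein_id)
    bad = {t: sorted(p) for t, p in seen.items() if len(p) > 1}
    return dict(list(bad.items())[:limit])


conflicts = inconsistent_targets([
    ("WDR91_A4D1P6_392_747", "A4D1P6"),
    ("WDR91_A4D1P6_392_747", "A4D1P6"),
    ("rep_P0DTD1_5325_5925", "P0DTD1"),
    ("rep_P0DTD1_5325_5925", "P0DTD2"),  # made-up conflicting value
])
print(conflicts)
```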
INCUBATION_VOLUME is the incubation volume used in the run, in µL.
- INCUBATION_VOLUME is numeric (FLOAT) — WARN. Accepts any numeric dtype, or string values that all coerce cleanly to numbers.
- INCUBATION_VOLUME values are positive (> 0) — FAIL. Reports both non-positive values and any cells that couldn't be parsed as numbers.
- INCUBATION_VOLUME has no null values — FAIL.
Placeholder, not yet active: a check for "within realistic experimental range" is sketched as a commented-out block in src/quality_check.py right above the active INCUBATION_VOLUME checks. When the realistic µL range is decided, set INCUBATION_VOLUME_MIN / INCUBATION_VOLUME_MAX, uncomment the function, and add it to the SECTIONS list.
PROTEIN_CONC is the protein concentration used in the run, in µM. The experimental protocol fixes it at PROTEIN_CONC_EXPECTED = 1.0 µM (the constant lives at the top of the PROTEIN_CONC block in src/quality_check.py; change it if the protocol uses a different fixed value).
- PROTEIN_CONC is numeric (FLOAT) — WARN.
- PROTEIN_CONC equals expected value (1.0) — FAIL. Uses a small floating-point tolerance (atol = 1e-9) so values like 1.0000000001 still pass. The detail message reports both non-matching values and any cells that couldn't be parsed as numbers.
- PROTEIN_CONC has no null values — FAIL.
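The effect of the absolute tolerance can be illustrated with math.isclose. This is only a sketch: it assumes a pure absolute tolerance with no relative component, and the real check may use a different comparison function.

```python
import math

PROTEIN_CONC_EXPECTED = 1.0  # constant named in the text above
ATOL = 1e-9                  # absolute tolerance quoted in the check


def matches_expected(value):
    """True when value is within ATOL of the expected concentration."""
    return math.isclose(
        value, PROTEIN_CONC_EXPECTED, rel_tol=0.0, abs_tol=ATOL
    )


checks = {v: matches_expected(v) for v in (1.0, 1.0000000001, 1.00000001, 0.5)}
print(checks)
```

The value 1.0000000001 differs from 1.0 by about 1e-10 and passes, while 1.00000001 differs by about 1e-8 and does not.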
Placeholder, not yet active: a check for "within realistic experimental range" is sketched as a commented-out block in src/quality_check.py right above the active PROTEIN_CONC checks. When the realistic µM range is decided, set PROTEIN_CONC_MIN / PROTEIN_CONC_MAX, uncomment the function, and add it to the SECTIONS list.
COMPOUND_CONC is the compound concentration used in the run, in µM.
- COMPOUND_CONC is numeric (FLOAT) — WARN.
- COMPOUND_CONC values are positive (> 0) — FAIL. Reports both non-positive values and any cells that couldn't be parsed as numbers.
- COMPOUND_CONC has no null values — FAIL.
- MS_REPRODUCABILITY is boolean (BOOL) — WARN. Verifies the pandas dtype is bool.
- MS_REPRODUCABILITY only contains True/False — FAIL. Set-membership against {True, False}. Catches accidental string "True"/"False" or numeric 0/1 values that may slip in with object-dtype columns.
- MS_REPRODUCABILITY has no null values — FAIL.
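One Python detail is worth keeping in mind when writing a check like this: 0 and 1 compare equal to False and True, so a plain `value in {True, False}` test accepts them. The sketch below therefore tests the type explicitly. How src/quality_check.py handles this is not shown here, so treat the function non_boolean_values as a hypothetical illustration; it works on plain Python values, and numpy.bool_ scalars would need to be allowed explicitly.

```python
def non_boolean_values(values, limit=5):
    """Return up to `limit` distinct values that are not real bool objects."""
    bad = []
    for value in values:
        if type(value) is not bool and value not in bad:
            bad.append(value)
    return bad[:limit]


sample = [True, 0, 1, "True", False]

# Naive membership test: 0 and 1 slip through because 0 == False, 1 == True.
naive = [v for v in sample if v not in {True, False}]

# Type-based test: flags the integers as well as the string.
strict = non_boolean_values(sample)
print(naive, strict)
```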
Threshold POS_INT_REP_MIN = 0 lives at the top of the POS_INT_REP block in src/quality_check.py; change it once to update all three replicates.
- POS_INT_REP1 is numeric (FLOAT) — WARN.
- POS_INT_REP1 values are >= 0 — FAIL. Reports up to five offending values plus any cells that couldn't be parsed as numbers.
- POS_INT_REP1 has no null values — FAIL.
- POS_INT_REP2 is numeric (FLOAT) — WARN.
- POS_INT_REP2 values are >= 0 — FAIL.
- POS_INT_REP2 has no null values — FAIL.
- POS_INT_REP3 is numeric (FLOAT) — WARN.
- POS_INT_REP3 values are >= 0 — FAIL.
- POS_INT_REP3 has no null values — FAIL.
BINARY_LABEL is 1 if significantly enriched, 0 otherwise.
- BINARY_LABEL is integer (INT) — WARN. Accepts integer dtype, or float dtype whose non-null values are all whole numbers.
- BINARY_LABEL only contains {0, 1} — FAIL. Set-membership against {0, 1}. Lists up to five offending values.
- BINARY_LABEL has no null values — FAIL.
The library name in this column must be a single alphanumeric token (e.g. EASMS12kV1) — no underscores, no spaces — and must match one of the library filename stems found in MasterLists/. Each file represents one library, so all rows must share the same value.
- LIBRARY_NAME is string (VARCHAR) — WARN.
- LIBRARY_NAME has no leading/trailing whitespace — FAIL.
- LIBRARY_NAME has no null values — FAIL.
- LIBRARY_NAME is alphanumeric (no underscores/spaces) — FAIL. Regex: ^[A-Za-z0-9]+$.
- LIBRARY_NAME is registered — FAIL. Each value must match a filename stem in MasterLists/ (MasterList_Information.xlsx excluded). Uses the same libraries context as filename Check 13.
- LIBRARY_NAME is consistent across all rows — FAIL when more than one distinct value appears in the column.
- Library name matches filename, column, and MasterLists/ file — FAIL. Cross-check: the <library> segment of the filename, the (single) value in the LIBRARY_NAME column, and a file <library>.xlsx in MasterLists/ must all name the same library. Complements Checks 13 / 85 / 86 (which each look at one source in isolation).
Must be exactly one of the registered data-generator names listed in Providers.csv under the data_generator_name column (e.g. ASMS_SGC_TORONTO, ASMS_NUVISAN_GERMANY, ASMS_AZ_UK).
- DATA_GENERATOR_NAME is string (VARCHAR) — WARN.
- DATA_GENERATOR_NAME has no leading/trailing whitespace — FAIL.
- DATA_GENERATOR_NAME has no null values — FAIL.
- DATA_GENERATOR_NAME is registered (in Providers.csv) — FAIL. Each value must appear in the data_generator_name column of Providers.csv.
- DATA_GENERATOR_NAME is consistent across all rows — FAIL when more than one distinct value appears (each file should encode exactly one data generator).
Date format YYYYMMDD (e.g. 20260513).
- EXPERIMENT_DATE is string (VARCHAR) — WARN.
- EXPERIMENT_DATE has no leading/trailing whitespace — FAIL.
- EXPERIMENT_DATE has no null values — FAIL.
- EXPERIMENT_DATE is valid YYYYMMDD and not in the future — FAIL. Parses with datetime.strptime("%Y%m%d"); rejects bad formats and dates that fall after today.
Allowed values (case-sensitive): achiral, chiral_selective, chiral_not_selective, chiral_undetermined. Edit the CHIRAL_SELECTIVITY_ALLOWED set in src/quality_check.py to change the allowed list.
- CHIRAL_SELECTIVITY is string (VARCHAR) — WARN.
- CHIRAL_SELECTIVITY has no leading/trailing whitespace — FAIL.
- CHIRAL_SELECTIVITY has no null values — FAIL.
- CHIRAL_SELECTIVITY is one of the allowed values — FAIL. Set-membership against CHIRAL_SELECTIVITY_ALLOWED. Lists up to five offending values in the log message.
- MZ is numeric (FLOAT) — WARN.
- MZ is in valid range [150, 600] — FAIL. Inclusive on both ends; bounds are MZ_MIN / MZ_MAX at the top of the MZ block in src/quality_check.py.
- MZ has no null values — FAIL.
- RT is numeric (FLOAT) — WARN.
- RT is in valid range (0, 6) exclusive — FAIL. Strictly greater than 0 and strictly less than 6 (so 0 and 6 themselves both fail). Bounds are RT_MIN / RT_MAX at the top of the RT block in src/quality_check.py.
- RT has no null values — FAIL.
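The difference between the inclusive MZ range and the exclusive RT range shows up only at the boundary values. A minimal sketch, using the constant names from the text with hypothetical helper functions:

```python
MZ_MIN, MZ_MAX = 150, 600
RT_MIN, RT_MAX = 0, 6


def mz_in_range(mz):
    return MZ_MIN <= mz <= MZ_MAX  # inclusive on both ends


def rt_in_range(rt):
    return RT_MIN < rt < RT_MAX    # exclusive on both ends


edges = {
    "mz=150": mz_in_range(150),
    "mz=600": mz_in_range(600),
    "mz=600.1": mz_in_range(600.1),
    "rt=0": rt_in_range(0),
    "rt=6": rt_in_range(6),
    "rt=5.99": rt_in_range(5.99),
}
print(edges)
```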
The protein sequence must be longer than 6 characters. Threshold lives at PROTEIN_SEQ_MIN_LENGTH = 6 near the top of the PROTEIN_SEQ block in src/quality_check.py; bump it if you want a stricter minimum.
- PROTEIN_SEQ is string (VARCHAR) — WARN.
- PROTEIN_SEQ has no leading/trailing whitespace — FAIL.
- PROTEIN_SEQ has no null values — FAIL.
- PROTEIN_SEQ length > 6 — FAIL. Strictly greater than 6 characters (so a 7-character sequence passes, 6 fails). Lists up to five offending values.
- PROTEIN_TAG is string (VARCHAR) — WARN.
- PROTEIN_TAG has no leading/trailing whitespace — FAIL.
- PROTEIN_TAG has no null values — FAIL.
The list of valid provider acronyms is loaded from Providers.csv inside --input-dir (next to RawData/). Expected format:
acronym,name,data_generator_name
acmecorp,Acme Corp Research Labs,ASMS_ACME_CORP
fakelab,FakeLab Pharmaceuticals Inc,ASMS_FAKELAB
genericrx,GenericRx Therapeutics,ASMS_GENERICRX
- acronym is used by filename Check 11 (the <provider> segment of the raw CSV filename) and the ASMS_BATCH_NAME format check.
- data_generator_name is used by column Check 90 (the DATA_GENERATOR_NAME column in the raw CSV must match one of these exact strings).
The real Providers.csv is gitignored (private company info). A fake version with placeholder names lives at Providers_sample.csv — copy it into your --input-dir as Providers.csv and replace the entries with the real acronyms and data-generator names.
Add more checks by appending a (description, function) tuple to one of the SECTIONS entries in src/quality_check.py. Each function takes file_path, **context and returns either:
- (passed: bool, message: str) — short form. passed=True → PASS, passed=False → FAIL.
- (passed: bool, message: str, status: str) — long form. status is one of "PASS", "FAIL", "WARN". Use "WARN" for issues that should be reported in the log but should not block the pipeline (e.g. duplicate rows that downstream steps will clean up).
The orchestrator passes the following keys via context:
- providers — list of valid provider acronyms loaded from Providers.csv
- data_generators — set of valid data-generator names from the data_generator_name column of Providers.csv
- libraries — list of registered library names (filename stems from MasterLists/)
- meta_columns — list of canonical column names from ASMS Meta Data.csv
- output_dir — the same folder as the log file (useful for writing supplementary report CSVs)
- df — the input CSV pre-loaded as a pandas.DataFrame with fully-duplicate rows dropped (for column-content checks). Will be None if the file could not be parsed.
- masterlist_dir — path to the MasterLists/ folder, for checks that need to load a specific library file via MasterList_Information.xlsx.
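As an illustration of the two return forms, here is a sketch of two made-up checks and of how a 2-tuple or 3-tuple might be mapped to a status. Both check functions and the resolve_status helper are hypothetical; the exact layout of SECTIONS and the orchestrator's handling of results live in src/quality_check.py and may differ.

```python
import os


def check_filename_is_lowercase(file_path, **context):
    """Hypothetical short-form check: returns (passed, message)."""
    name = os.path.basename(file_path)
    return name == name.lower(), f"filename: {name}"


def check_provider_list_loaded(file_path, **context):
    """Hypothetical long-form check: returns (passed, message, status)."""
    providers = context.get("providers") or []
    if not providers:
        return False, "no providers loaded", "WARN"  # report, do not block
    return True, f"{len(providers)} providers loaded", "PASS"


def resolve_status(result):
    """Map either return form to PASS / FAIL / WARN (assumed logic)."""
    if len(result) == 3:
        return result[2]
    return "PASS" if result[0] else "FAIL"


# Entries shaped like the (description, function) tuples added to SECTIONS.
new_checks = [
    ("Filename is lowercase", check_filename_is_lowercase),
    ("Provider list is loaded", check_provider_list_loaded),
]
statuses = [
    resolve_status(fn("RawData/asms_Acme_01_Lib_20260512.csv", providers=[]))
    for _, fn in new_checks
]
print(statuses)
```

Accepting **context lets a check read only the keys it needs (for example providers or df) and ignore the rest.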
