What happens to a raw ASMS results CSV once Quality Check passes. Steps 1–9 below run sequentially per input file; each step's output is persisted on disk so any step can be resumed from a saved checkpoint.
For QC details (Step 0), see QUALITY_CHECKS.md. For run instructions and the --start-from / --end-at flags, see Running a Subset of Steps below or USAGE.md.
The entry point is src/Main.py. It iterates over every CSV in RawData/ and runs Step 0 (QC) first, then the steps below for files that pass QC.
1. Split by target — separate_protein_files.split_protein_data
Groups rows in the raw CSV by TARGET_ID and writes one CSV per target into ProcessedData_<csv_basename>/Separated_Files/. Requires PROTEIN_NUMBER, ASMS_BATCH_NAME, and TARGET_ID columns.
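The grouping in this step can be pictured with a short standard-library sketch. It is not the code in separate_protein_files: the helper names `split_rows_by_target` and `rows_to_csv_text` and the sample rows are hypothetical, and only the three required column names come from the description above.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = {"PROTEIN_NUMBER", "ASMS_BATCH_NAME", "TARGET_ID"}


def split_rows_by_target(rows):
    """Group row dicts by TARGET_ID (hypothetical helper, not the pipeline's code)."""
    groups = defaultdict(list)
    for row in rows:
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        groups[row["TARGET_ID"]].append(row)
    return dict(groups)


def rows_to_csv_text(rows):
    """Serialize one target's rows as CSV text (one such file per target)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; COMPOUND_ID is included only to give the rows some payload.
raw_rows = [
    {"PROTEIN_NUMBER": "1", "ASMS_BATCH_NAME": "B1", "TARGET_ID": "T1", "COMPOUND_ID": "C1"},
    {"PROTEIN_NUMBER": "2", "ASMS_BATCH_NAME": "B1", "TARGET_ID": "T2", "COMPOUND_ID": "C1"},
    {"PROTEIN_NUMBER": "1", "ASMS_BATCH_NAME": "B1", "TARGET_ID": "T1", "COMPOUND_ID": "C2"},
]
per_target = split_rows_by_target(raw_rows)
```

Here `per_target` holds two groups, one per `TARGET_ID`, and each group would become its own per-target CSV.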
2. Compute scores — add_scores.compute_and_add_scores
For each per-target file, this step uses the other per-target files in the batch as a non-target reference and adds the following columns:
- `TARGET_VALUE`: mean of the three replicate intensities for the current target: `TARGET_VALUE = mean(POS_INT_REP1, POS_INT_REP2, POS_INT_REP3)` (skipping NaN).
- `SELECTIVE_VALUE`: per compound, the maximum of `TARGET_VALUE` across all other targets in the batch (i.e., the strongest off-target signal observed for this compound).
- `NTC_VALUE`: per compound, the minimum of `TARGET_VALUE` across all other targets (acts as a no-target-control reference: the weakest off-target signal).
- `ENRICHMENT`: current target's signal over the weakest off-target signal: `ENRICHMENT = TARGET_VALUE / NTC_VALUE`.
- `SELECTIVE_ENRICHMENT`: current target's signal over the strongest off-target signal: `SELECTIVE_ENRICHMENT = TARGET_VALUE / SELECTIVE_VALUE`. Values ≥ 1 indicate selectivity for this target.
- `MEAN_NONTARGET_VALUES`: for the current compound, take each other target file, compute the mean of its three replicates, then average those means across files. This is the "average off-target signal" for the compound.
- `EASMS_ENRICHMENT`: current target's signal over the mean off-target signal: `EASMS_ENRICHMENT = TARGET_VALUE / MEAN_NONTARGET_VALUES`.
- `PVALUE`: two-sample Welch's t-test (`scipy.stats.ttest_ind` with `equal_var=False`) comparing:
  - the current target's three replicates (`POS_INT_REP{1,2,3}`)
  - vs. all pooled replicates from the other targets for the same compound.

  Returns `None` if either side has insufficient samples (current < 1 value, or other < 3 values), and `1.0` if either side has zero variance.

Requires the three replicate columns POS_INT_REP1, POS_INT_REP2, POS_INT_REP3 and a COMPOUND_ID column on every input file.
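As a worked example of the score definitions above (excluding PVALUE), here is a minimal plain-Python sketch for a single compound. It is an illustration, not the code in add_scores: the function names and sample intensities are made up, it assumes NaN replicates are skipped for off-target means as well, and it assumes every other target has at least one valid, non-zero replicate (no zero-denominator handling).

```python
import math
from statistics import mean


def nanmean(values):
    """Mean of the non-NaN values; NaN if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return mean(kept) if kept else math.nan


def score_compound(current_target, replicates_by_target):
    """Score one compound for `current_target`.

    `replicates_by_target` maps target id -> [POS_INT_REP1, POS_INT_REP2, POS_INT_REP3].
    """
    target_value = nanmean(replicates_by_target[current_target])
    # TARGET_VALUE of the same compound in every other per-target file.
    off_target = [
        nanmean(reps)
        for target, reps in replicates_by_target.items()
        if target != current_target
    ]
    selective_value = max(off_target)  # strongest off-target signal
    ntc_value = min(off_target)        # weakest off-target signal
    mean_nontarget = mean(off_target)  # mean of the per-file means
    return {
        "TARGET_VALUE": target_value,
        "SELECTIVE_VALUE": selective_value,
        "NTC_VALUE": ntc_value,
        "ENRICHMENT": target_value / ntc_value,
        "SELECTIVE_ENRICHMENT": target_value / selective_value,
        "MEAN_NONTARGET_VALUES": mean_nontarget,
        "EASMS_ENRICHMENT": target_value / mean_nontarget,
    }


# Made-up intensities for one compound measured against three targets.
replicates = {
    "T1": [300.0, 300.0, 300.0],
    "T2": [100.0, 100.0, 100.0],
    "T3": [50.0, math.nan, 50.0],
}
scores = score_compound("T1", replicates)
```

With these numbers the off-target means are 100 and 50, so `ENRICHMENT` is 300 / 50 = 6, `SELECTIVE_ENRICHMENT` is 300 / 100 = 3, and `EASMS_ENRICHMENT` is 300 / 75 = 4.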
3. Filter anomalies — anomaly_selection.filter_anomalous_data
Resolves duplicate-SMILES rows with conflicting enrichment:
- All `ENRICHMENT < 1` → keep the smallest
- All `ENRICHMENT > 10` → keep the largest
- Mixed values → drop all of them (ambiguous)
Anomalies and removals are logged for audit.
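The three rules can be sketched as follows. This is an illustration rather than the code in anomaly_selection: `resolve_duplicate_smiles` is a hypothetical name, and the sketch assumes that any duplicate group which is neither entirely below 1 nor entirely above 10 counts as "mixed".

```python
from collections import defaultdict


def resolve_duplicate_smiles(rows):
    """Return (kept_rows, dropped_rows) after applying the duplicate rules."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["SMILES"]].append(row)

    kept, dropped = [], []
    for group in groups.values():
        if len(group) == 1:
            kept.extend(group)  # no duplicates, nothing to resolve
            continue
        values = [row["ENRICHMENT"] for row in group]
        if all(v < 1 for v in values):
            winner = min(group, key=lambda r: r["ENRICHMENT"])
        elif all(v > 10 for v in values):
            winner = max(group, key=lambda r: r["ENRICHMENT"])
        else:
            winner = None  # assumption: everything else is treated as mixed
        for row in group:
            (kept if row is winner else dropped).append(row)
    return kept, dropped


# Made-up rows: two clean duplicate groups, one mixed group, one singleton.
rows = [
    {"SMILES": "CCO", "ENRICHMENT": 0.4},
    {"SMILES": "CCO", "ENRICHMENT": 0.2},
    {"SMILES": "CCN", "ENRICHMENT": 12.0},
    {"SMILES": "CCN", "ENRICHMENT": 30.0},
    {"SMILES": "CCC", "ENRICHMENT": 0.5},
    {"SMILES": "CCC", "ENRICHMENT": 15.0},
    {"SMILES": "CCCC", "ENRICHMENT": 3.0},
]
kept, dropped = resolve_duplicate_smiles(rows)
```

Returning the dropped rows alongside the kept ones is one way to feed an audit log; how the real module records removals is not shown here.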
4. Handle isomers — isomer_handling.handle_isomers
Splits rows whose SMILES contains multiple isomers separated by ; into one row per isomer, and records the original group in a new ISOMERS column.
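A minimal sketch of the split, assuming rows are plain dicts. `split_isomers` is a hypothetical helper, and leaving `ISOMERS` empty for rows without a `;` is an assumption, since the value used for single-isomer rows is not specified above.

```python
def split_isomers(rows):
    """Expand rows whose SMILES holds several ';'-separated isomers."""
    expanded = []
    for row in rows:
        isomers = [part.strip() for part in row["SMILES"].split(";") if part.strip()]
        if len(isomers) <= 1:
            # Assumption: rows without an isomer group get an empty ISOMERS value.
            expanded.append({**row, "ISOMERS": ""})
            continue
        for smiles in isomers:
            # One row per isomer; ISOMERS keeps the original combined string.
            expanded.append({**row, "SMILES": smiles, "ISOMERS": row["SMILES"]})
    return expanded


rows = [
    {"COMPOUND_ID": "C1", "SMILES": "C[C@H](N)O;C[C@@H](N)O"},
    {"COMPOUND_ID": "C2", "SMILES": "CCO"},
]
expanded = split_isomers(rows)
```

The first input row becomes two output rows that share `COMPOUND_ID` and `ISOMERS` but differ in `SMILES`; the second row passes through unchanged apart from the added column.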
5. Add negative samples — add_negatives.add_negative_samples_from_masterlist
Looks up the master-list file for the current raw CSV via MasterList_Information.xlsx, loads it, and adds compounds from the master list as negative samples. Each master list must contain a SMILES column.
6. Generate ML labels — produce_ml_labels.generate_ml_labels
Assigns an AIRCHECK_LABEL integer based on EASMS_ENRICHMENT, PVALUE, ISOMERS, and HAD_DUPLICATE_INTENSITY (values range from −2 to 4 — see the module's docstring for the exact rules). This is the last CSV-format step.
7. Extract fingerprints + rename + binary label — fingerprint_extraction.extract_fingerprints
Three transformations applied together:
- Fingerprints / descriptors (via src/fingerprints.py, src/utils.py): `MW`, `ALOGP`, and `ECFP4`, `ECFP6`, `FCFP4`, `FCFP6`, `MACCS`, `RDK`, `AVALON`, `TOPTOR`, `ATOMPAIR`.
- Column renames (inline in Main.py): `TARGET_VALUE` → `TARGET_INTENSITY_VALUE`, `MEAN_NONTARGET_VALUES` → `NONTARGET_INTENSITY_VALUE`.
- Binary label: `LABEL = 1 if BINARY_LABEL == "Y" else 0`.
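The rename and binary-label transformations are simple enough to show on a single row. The sketch below uses a plain dict and a hypothetical helper `rename_and_label`; the actual step operates on whole per-target files and also computes the fingerprints, which are omitted here.

```python
RENAMES = {
    "TARGET_VALUE": "TARGET_INTENSITY_VALUE",
    "MEAN_NONTARGET_VALUES": "NONTARGET_INTENSITY_VALUE",
}


def rename_and_label(row):
    """Apply the two column renames and derive the integer LABEL."""
    renamed = {RENAMES.get(name, name): value for name, value in row.items()}
    renamed["LABEL"] = 1 if renamed.get("BINARY_LABEL") == "Y" else 0
    return renamed


# Made-up row for illustration.
row = {"COMPOUND_ID": "C1", "TARGET_VALUE": 300.0, "MEAN_NONTARGET_VALUES": 75.0, "BINARY_LABEL": "Y"}
result = rename_and_label(row)
```

Any `BINARY_LABEL` other than the exact string "Y" (including a missing value) maps to 0, matching the expression above.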
This is the first Parquet-format step (CSV is dropped because the wide fingerprint columns make it slow and huge).
Each fingerprint column (ECFP4, MACCS, …) can be stored in one of two formats. Set the TypeOfFp constant near the top of the __main__ block in src/Main.py:

| `TypeOfFp` | Stored as | When to use |
|---|---|---|
| `"array"` (default) | `numpy.float32` array per row, ready to feed directly into a model | New runs. Downstream code can read the column without any string-to-array conversion. |
| `"string"` | Comma-separated string per row (e.g. `"1,0,0,1,…"`) | Legacy format from earlier pipeline versions. Use if downstream code expects the string layout (e.g. `np.fromstring(x, sep=',', dtype=np.float32)`). |

Both formats survive a Parquet round-trip; "array" is just one step closer to the form a model consumes.
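Downstream code that has to accept files written with either `TypeOfFp` setting can normalize each cell on read. The sketch below is deliberately NumPy-free so it runs on its own, and `fingerprint_to_floats` is a hypothetical helper, not part of the pipeline; with NumPy available, the `np.fromstring` call quoted in the table does the same job for the string layout.

```python
from array import array


def fingerprint_to_floats(value):
    """Return a float32 array for either storage format."""
    if isinstance(value, str):
        # Legacy "string" layout, e.g. "1,0,0,1"
        return array("f", (float(bit) for bit in value.split(",")))
    # "array" layout: any sequence of numbers, e.g. a numpy.float32 array
    return array("f", (float(bit) for bit in value))


from_string = fingerprint_to_floats("1,0,0,1")
from_sequence = fingerprint_to_floats([1, 0, 0, 1])
```

Both calls yield the same four float32 values, so model code after this point does not need to know which format the Parquet file was written in.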
8. Select full column set — column_selection.select_final_columns
Reads the Step 7 output and selects DesiredColumns (46 cols: all metadata + scores + labels + fingerprints). Saved as Parquet.
9. Select key column set — column_selection.select_final_columns
Reads the same Step 7 output (parallel branch — not downstream of Step 8) and selects DesiredColumns2 (19 cols: IDs, target info, scores, label, MW, ALOGP, fingerprints). Saved as Parquet.
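The relationship between Steps 8 and 9 is easiest to see as two independent selections over the same input. In this sketch `select_columns`, `FULL_COLUMNS`, and `KEY_COLUMNS` are hypothetical, heavily shortened stand-ins for `select_final_columns`, `DesiredColumns`, and `DesiredColumns2`, and the row values are made up.

```python
def select_columns(rows, columns):
    """Keep only `columns`, in the given order, for every row."""
    return [{name: row[name] for name in columns} for row in rows]


# Shortened stand-ins; the real lists have 46 and 19 columns respectively.
FULL_COLUMNS = ["COMPOUND_ID", "TARGET_ID", "EASMS_ENRICHMENT", "PVALUE", "LABEL", "MW", "ALOGP", "ECFP4", "MACCS"]
KEY_COLUMNS = ["COMPOUND_ID", "TARGET_ID", "EASMS_ENRICHMENT", "LABEL", "MW", "ALOGP", "ECFP4"]

step7_rows = [
    {
        "COMPOUND_ID": "C1", "TARGET_ID": "T1", "EASMS_ENRICHMENT": 4.0, "PVALUE": 0.01,
        "LABEL": 1, "MW": 46.07, "ALOGP": -0.1, "ECFP4": "1,0,0,1", "MACCS": "0,1",
        "SCRATCH": "not selected by either step",
    }
]

step8_rows = select_columns(step7_rows, FULL_COLUMNS)  # Step 8 reads Step 7
step9_rows = select_columns(step7_rows, KEY_COLUMNS)   # Step 9 also reads Step 7, not Step 8
```

Because both selections start from the Step 7 output, Step 9 can be re-run on its own without Step 8 having been run.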
Post-pipeline QC — post_quality_check.run_post_quality_checks
After all per-target Parquet files have been written by Step 8, a lightweight QC pass runs once per input CSV against the concatenated Step 8 output. Its job is to catch regressions in pipeline-produced columns (label values, score ranges, fingerprint lengths) — not to re-validate the raw input. Best-effort: only runs when --end-at >= 8; if Step 8 hasn't produced output it writes a one-line "skipped" log and returns.
Outputs (next to the input-QC logs):
- `PostQClog_<YYYYMMDD>_<csv_basename>.log`
- `PostQClog_<YYYYMMDD>_<csv_basename>.xlsx`
See POST_QC.md for the full check list (23 checks across 5 sections), tuning constants, and design notes.
Running a Subset of Steps

Use `--start-from N` and `--end-at N` to resume from or stop at a specific step. Earlier steps that have already been saved are loaded from disk; later steps are skipped.

    # Re-run only fingerprint extraction onward (steps 1-6 are loaded from disk)
    python src/Main.py --start-from 7

    # Run only step 9 (re-derive the key column subset) using Step 7's saved Parquet
    python src/Main.py --start-from 9 --end-at 9

    # Run just the early cleaning (steps 1-5), stop before label generation
    python src/Main.py --end-at 5

`--start-from` accepts 0–9 (0 = Quality Check), `--end-at` accepts 0–9. Defaults: `--start-from 0 --end-at 9` (run everything, including QC). Step 0 (QC) runs only when `--start-from 0`.
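The flag semantics described here can be sketched with `argparse`. This is an illustration, not the parser in src/Main.py: `parse_step_range` and `plan_steps` are hypothetical names, and rejecting a start step greater than the end step is an assumption that the text above does not confirm.

```python
import argparse

STEP_CHOICES = range(0, 10)  # 0 = Quality Check, 1-9 = pipeline steps


def parse_step_range(argv=None):
    """Parse --start-from / --end-at with the documented defaults (0 and 9)."""
    parser = argparse.ArgumentParser(description="Illustrative step-range parsing")
    parser.add_argument("--start-from", type=int, choices=STEP_CHOICES, default=0)
    parser.add_argument("--end-at", type=int, choices=STEP_CHOICES, default=9)
    args = parser.parse_args(argv)
    if args.start_from > args.end_at:
        # Assumption: an inverted range is treated as a usage error.
        parser.error("--start-from must not exceed --end-at")
    return args.start_from, args.end_at


def plan_steps(start_from, end_at):
    """Return (run_qc, steps_to_run) for the requested range."""
    run_qc = start_from == 0  # Step 0 (QC) runs only when starting from 0
    steps = [step for step in range(1, 10) if start_from <= step <= end_at]
    return run_qc, steps


start, end = parse_step_range(["--start-from", "7"])
run_qc, steps = plan_steps(start, end)
```

For `--start-from 7` this gives `run_qc == False` and steps 7 through 9; steps before the start of the range are expected to be loaded from their saved outputs rather than recomputed.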
For each input CSV, the pipeline creates one ProcessedData_<csv_basename>/ folder inside --output-dir (which defaults to --input-dir). Each step's output lives in its own folder so any step can be re-run from saved checkpoints:

    ProcessedData_<csv_basename>/
    ├── QCaircheck<YYYYMMDD>_<csv_basename>.log    # step 0 plain-text log (date the check was run)
    ├── QCaircheck<YYYYMMDD>_<csv_basename>.xlsx   # step 0 same data in Excel (color-coded)
    ├── PostQClog_<YYYYMMDD>_<csv_basename>.log    # post-pipeline QC plain-text log
    ├── PostQClog_<YYYYMMDD>_<csv_basename>.xlsx   # post-pipeline QC Excel (color-coded)
    ├── Step1_Separated/                           # step 1: split by target (CSV)
    │   └── <target>.csv
    ├── Step2_WithScores/                          # step 2: score columns added (CSV)
    ├── Step3_AnomalyFiltered/                     # step 3: anomalies resolved (CSV)
    ├── Step4_IsomerHandled/                       # step 4: isomers split (CSV)
    ├── Step5_WithNegatives/                       # step 5: masterlist negatives (CSV)
    ├── Step6_MLReady/                             # step 6: labels added (CSV)
    ├── Step7_WithFingerprints/                    # step 7: FPs + rename + LABEL (Parquet)
    │   └── <target>.parquet
    ├── Step8_FullColumns/                         # step 8: full column subset (Parquet)
    └── Step9_KeyColumns/                          # step 9: slim column subset (Parquet)

Step 0 may also write supplementary report CSVs (FullyDuplicate_rows_report.csv, invalid_smiles_report.csv, etc.) alongside the QC log files when checks find issues — see QUALITY_CHECKS.md for the full list.
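For scripts that need to locate a checkpoint, the layout above maps onto a small path helper. The sketch below is hypothetical (`step_output_path` and `qc_log_path` are not pipeline functions), and it assumes every step folder names its files `<target>.csv` or `<target>.parquet`, which the tree only shows explicitly for Steps 1 and 7.

```python
from datetime import date
from pathlib import Path

STEP_FOLDERS = {
    1: "Step1_Separated",
    2: "Step2_WithScores",
    3: "Step3_AnomalyFiltered",
    4: "Step4_IsomerHandled",
    5: "Step5_WithNegatives",
    6: "Step6_MLReady",
    7: "Step7_WithFingerprints",
    8: "Step8_FullColumns",
    9: "Step9_KeyColumns",
}


def step_output_path(output_dir, csv_basename, step, target):
    """Path of one target's checkpoint file for a given step (1-9)."""
    suffix = ".csv" if step <= 6 else ".parquet"  # steps 7-9 are Parquet
    root = Path(output_dir) / f"ProcessedData_{csv_basename}"
    return root / STEP_FOLDERS[step] / f"{target}{suffix}"


def qc_log_path(output_dir, csv_basename, run_date):
    """Path of the step 0 plain-text log for the date the check was run."""
    root = Path(output_dir) / f"ProcessedData_{csv_basename}"
    return root / f"QCaircheck{run_date:%Y%m%d}_{csv_basename}.log"


# Made-up output directory, basename, target, and date.
checkpoint = step_output_path("out", "batch1", 7, "T1")
qc_log = qc_log_path("out", "batch1", date(2024, 1, 5))
```

With the sample values, `checkpoint` points at `out/ProcessedData_batch1/Step7_WithFingerprints/T1.parquet`, the file a `--start-from 8` or `--start-from 9` run would read for target T1 under this naming assumption.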