Data Processing Pipeline

What happens to a raw ASMS results CSV once Quality Check passes. Steps 1–9 below run sequentially per input file; each step's output is persisted on disk so any step can be resumed from a saved checkpoint.

For QC details (Step 0), see QUALITY_CHECKS.md. For run instructions and the --start-from / --end-at flags, see Running a Subset of Steps below or USAGE.md.

Pipeline Steps

The entry point is src/Main.py. It iterates over every CSV in RawData/ and runs Step 0 (QC) first, then the steps below for files that pass QC.

1. Split by target — `separate_protein_files.split_protein_data`

Groups rows in the raw CSV by TARGET_ID and writes one CSV per target into ProcessedData_<csv_basename>/Step1_Separated/ (see Output Layout). Requires PROTEIN_NUMBER, ASMS_BATCH_NAME, and TARGET_ID columns.
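
The split described above can be sketched with a pandas `groupby`. This is a minimal illustration, not the code in `separate_protein_files`: the function name `split_by_target`, the `out_dir` parameter, and the `<target>.csv` file naming (taken from the Output Layout tree) are assumptions.

```python
from pathlib import Path

import pandas as pd

# Columns the step is documented to require.
REQUIRED = ["PROTEIN_NUMBER", "ASMS_BATCH_NAME", "TARGET_ID"]


def split_by_target(raw: pd.DataFrame, out_dir: Path) -> list[Path]:
    """Write one CSV per TARGET_ID into out_dir (hypothetical helper)."""
    missing = [c for c in REQUIRED if c not in raw.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for target_id, rows in raw.groupby("TARGET_ID", sort=True):
        path = out_dir / f"{target_id}.csv"
        rows.to_csv(path, index=False)
        written.append(path)
    return written
```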

2. Compute scores — `add_scores.compute_and_add_scores`

For each per-target file, this step uses the other per-target files in the batch as a non-target reference and adds the following columns:

- TARGET_VALUE: mean of the three replicate intensities for the current target, skipping NaN: TARGET_VALUE = mean(POS_INT_REP1, POS_INT_REP2, POS_INT_REP3).
- SELECTIVE_VALUE: per compound, the maximum of TARGET_VALUE across all other targets in the batch (i.e., the strongest off-target signal observed for this compound).
- NTC_VALUE: per compound, the minimum of TARGET_VALUE across all other targets (acts as a no-target-control reference: the weakest off-target signal).
- ENRICHMENT: current target's signal over the weakest off-target signal: ENRICHMENT = TARGET_VALUE / NTC_VALUE.
- SELECTIVE_ENRICHMENT: current target's signal over the strongest off-target signal: SELECTIVE_ENRICHMENT = TARGET_VALUE / SELECTIVE_VALUE. Values ≥ 1 indicate selectivity for this target.
- MEAN_NONTARGET_VALUES: for the current compound, take each other target file, compute the mean of its three replicates, then average those means across files. This is the "average off-target signal" for the compound.
- EASMS_ENRICHMENT: current target's signal over the mean off-target signal: EASMS_ENRICHMENT = TARGET_VALUE / MEAN_NONTARGET_VALUES.
- PVALUE: two-sample Welch's t-test (scipy.stats.ttest_ind with equal_var=False) comparing:
  - the current target's three replicates (POS_INT_REP{1,2,3})
  - vs. all pooled replicates from the other targets for the same compound.

  The result is None if either side has too few samples (no value for the current target, or fewer than 3 values for the other targets), and 1.0 if either side has zero variance.
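
The PVALUE rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `welch_pvalue` is hypothetical, and dropping NaN values before counting samples is an assumption not confirmed by the source.

```python
import numpy as np
from scipy import stats


def welch_pvalue(current, others):
    """Welch's t-test p-value with the documented None / 1.0 fallbacks."""
    cur = np.asarray(current, dtype=float)
    oth = np.asarray(others, dtype=float)
    # Assumption: missing replicates are ignored rather than propagated.
    cur = cur[~np.isnan(cur)]
    oth = oth[~np.isnan(oth)]
    # Too few samples on either side: no p-value.
    if cur.size < 1 or oth.size < 3:
        return None
    # Zero variance on either side (including a single value): return 1.0.
    if np.var(cur) == 0 or np.var(oth) == 0:
        return 1.0
    return float(stats.ttest_ind(cur, oth, equal_var=False).pvalue)
```

Note that under these rules a single valid replicate for the current target passes the sample-count check but has zero variance, so it yields 1.0 rather than None.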

Requires the three replicate columns POS_INT_REP1, POS_INT_REP2, POS_INT_REP3 and a COMPOUND_ID column on every input file.
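
The ratio scores above can be sketched in pandas as follows. This is an illustration only, not the code in `add_scores`: the function name `add_ratio_scores` and the `frames` dict (target name to per-target DataFrame) are hypothetical, and the sketch assumes COMPOUND_ID is unique within each per-target file.

```python
import pandas as pd

REPS = ["POS_INT_REP1", "POS_INT_REP2", "POS_INT_REP3"]


def add_ratio_scores(frames: dict[str, pd.DataFrame], current: str) -> pd.DataFrame:
    """Add the ratio-based score columns for one target (hypothetical helper)."""
    # One column per target: replicate mean per compound (mean skips NaN by default).
    means = pd.DataFrame({
        name: df.set_index("COMPOUND_ID")[REPS].mean(axis=1)
        for name, df in frames.items()
    })
    others = means.drop(columns=current)

    out = frames[current].set_index("COMPOUND_ID").copy()
    out["TARGET_VALUE"] = means[current]
    out["SELECTIVE_VALUE"] = others.max(axis=1)        # strongest off-target signal
    out["NTC_VALUE"] = others.min(axis=1)              # weakest off-target signal
    out["MEAN_NONTARGET_VALUES"] = others.mean(axis=1) # average off-target signal
    out["ENRICHMENT"] = out["TARGET_VALUE"] / out["NTC_VALUE"]
    out["SELECTIVE_ENRICHMENT"] = out["TARGET_VALUE"] / out["SELECTIVE_VALUE"]
    out["EASMS_ENRICHMENT"] = out["TARGET_VALUE"] / out["MEAN_NONTARGET_VALUES"]
    return out.reset_index()
```

A zero denominator produces `inf` (or NaN for 0/0) in pandas; how the real step treats that case is not stated here.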

3. Filter anomalies — `anomaly_selection.filter_anomalous_data`

Resolves duplicate-SMILES rows with conflicting enrichment:

- All ENRICHMENT < 1 → keep the smallest
- All ENRICHMENT > 10 → keep the largest
- Mixed values → drop all of them (ambiguous)

Anomalies and removals are logged for audit.
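
The three rules above can be sketched as a per-SMILES group decision. This is a minimal illustration, not the code in `anomaly_selection`: the function name is hypothetical, the frame is assumed to have a unique index, and any duplicate group that satisfies neither "all < 1" nor "all > 10" is treated as mixed, which is an assumption about the unstated middle range.

```python
import pandas as pd


def resolve_duplicate_smiles(df: pd.DataFrame) -> pd.DataFrame:
    """Keep at most one row per duplicated SMILES (hypothetical helper)."""
    keep = []
    for _, grp in df.groupby("SMILES", sort=False):
        enrich = grp["ENRICHMENT"]
        if len(grp) == 1:
            keep.append(grp.index[0])      # not a duplicate: keep as-is
        elif (enrich < 1).all():
            keep.append(enrich.idxmin())   # all below 1: keep the smallest
        elif (enrich > 10).all():
            keep.append(enrich.idxmax())   # all above 10: keep the largest
        # Otherwise the group is ambiguous and every row is dropped.
    return df[df.index.isin(keep)]
```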

4. Handle isomers — `isomer_handling.handle_isomers`

Splits rows whose SMILES contains multiple isomers separated by ; into one row per isomer, and records the original group in a new ISOMERS column.
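
The row split above maps naturally onto `DataFrame.explode`. This is a sketch under assumptions: the function name is hypothetical, and leaving ISOMERS empty for rows without a `;` is an assumption about what the real step writes.

```python
import pandas as pd


def split_isomers(df: pd.DataFrame) -> pd.DataFrame:
    """One row per isomer; ISOMERS keeps the original ';'-joined group."""
    out = df.copy()
    has_multi = out["SMILES"].str.contains(";", regex=False, na=False)
    # Record the original group only for rows that actually hold several isomers.
    out["ISOMERS"] = out["SMILES"].where(has_multi)
    out["SMILES"] = out["SMILES"].str.split(";")
    out = out.explode("SMILES", ignore_index=True)
    out["SMILES"] = out["SMILES"].str.strip()
    return out
```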

5. Add negative samples — `add_negatives.add_negative_samples_from_masterlist`

Looks up the master-list file for the current raw CSV via MasterList_Information.xlsx, loads it, and adds compounds from the master list as negative samples. Each master list must contain a SMILES column.
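
One plausible shape for this step is sketched below. It is heavily assumption-based: the source does not say which master-list compounds become negatives or how they are marked, so this sketch appends every master-list SMILES not already present in the target file and flags it with a hypothetical IS_NEGATIVE column. The function name is also hypothetical.

```python
import pandas as pd


def append_negatives(target_df: pd.DataFrame, master_df: pd.DataFrame) -> pd.DataFrame:
    """Append unseen master-list SMILES as negative rows (illustrative only)."""
    if "SMILES" not in master_df.columns:
        raise ValueError("master list must contain a SMILES column")
    seen = set(target_df["SMILES"].dropna())
    candidates = (
        master_df.loc[~master_df["SMILES"].isin(seen), ["SMILES"]]
        .dropna()
        .drop_duplicates()
        .assign(IS_NEGATIVE=True)   # hypothetical marker column
    )
    observed = target_df.assign(IS_NEGATIVE=False)
    # Score columns are left empty (NaN) for the appended rows.
    return pd.concat([observed, candidates], ignore_index=True)
```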

6. Generate ML labels — `produce_ml_labels.generate_ml_labels`

Assigns an AIRCHECK_LABEL integer based on EASMS_ENRICHMENT, PVALUE, ISOMERS, and HAD_DUPLICATE_INTENSITY (values range from −2 to 4 — see the module's docstring for the exact rules). This is the last CSV-format step.

7. Extract fingerprints + rename + binary label — `fingerprint_extraction.extract_fingerprints`

Three transformations applied together:

- Fingerprints / descriptors (via src/fingerprints.py, src/utils.py): MW, ALOGP, and ECFP4, ECFP6, FCFP4, FCFP6, MACCS, RDK, AVALON, TOPTOR, ATOMPAIR.
- Column renames (inline in Main.py): TARGET_VALUE → TARGET_INTENSITY_VALUE, MEAN_NONTARGET_VALUES → NONTARGET_INTENSITY_VALUE.
- Binary label: LABEL = 1 if BINARY_LABEL == "Y" else 0.

This is the first Parquet-format step. CSV is dropped from here on because the wide fingerprint columns make CSV files slow to read and very large.
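
The rename and binary-label transformations can be sketched in a few lines of pandas. The function name `rename_and_label` is hypothetical, and the sketch assumes a BINARY_LABEL column is already present on the input; fingerprint generation is left out.

```python
import pandas as pd

# Renames documented for this step.
RENAMES = {
    "TARGET_VALUE": "TARGET_INTENSITY_VALUE",
    "MEAN_NONTARGET_VALUES": "NONTARGET_INTENSITY_VALUE",
}


def rename_and_label(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the column renames and derive the integer LABEL column."""
    out = df.rename(columns=RENAMES)
    # LABEL is 1 only for an exact "Y"; anything else (including NaN) becomes 0.
    out["LABEL"] = (out["BINARY_LABEL"] == "Y").astype(int)
    return out
```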

Fingerprint storage format — `TypeOfFp`

Each fingerprint column (ECFP4, MACCS, …) can be stored in one of two formats. Set the TypeOfFp constant near the top of the __main__ block in src/Main.py:

| `TypeOfFp` | Stored as | When to use |
| --- | --- | --- |
| `"array"` (default) | `numpy.float32` array per row, ready to feed directly into a model | New runs. Downstream code can read the column without any string-to-array conversion. |
| `"string"` | Comma-separated string per row (e.g. `"1,0,0,1,…"`) | Legacy format from earlier pipeline versions. Use if downstream code expects the string layout (e.g. `np.fromstring(x, sep=',', dtype=np.float32)`). |

Both formats survive a Parquet round-trip; "array" is just one step closer to the form a model consumes.
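
Converting between the two layouts is a one-liner in each direction. The helpers below are hypothetical names for illustration, and the string encoder assumes fingerprint entries are integer-valued (bits or counts).

```python
import numpy as np


def fp_to_string(fp: np.ndarray) -> str:
    """Encode a fingerprint array in the legacy comma-separated layout."""
    return ",".join(str(int(v)) for v in fp)


def fp_from_string(text: str) -> np.ndarray:
    """Decode the legacy string layout back into a float32 array."""
    return np.array([float(tok) for tok in text.split(",")], dtype=np.float32)
```

For example, `fp_from_string(fp_to_string(fp))` gives back an array equal to `fp` when `fp` holds integer-valued float32 entries.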

8. Select full column set — `column_selection.select_final_columns`

Reads the Step 7 output and selects DesiredColumns (46 cols: all metadata + scores + labels + fingerprints). Saved as Parquet.

9. Select key column set — `column_selection.select_final_columns`

Reads the same Step 7 output (parallel branch — not downstream of Step 8) and selects DesiredColumns2 (19 cols: IDs, target info, scores, label, MW, ALOGP, fingerprints). Saved as Parquet.
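
Steps 8 and 9 both amount to taking a fixed, ordered column subset of the same Step 7 frame. A minimal sketch, assuming a hypothetical helper name and failing loudly on missing columns (the real function's behavior on missing columns is not documented here):

```python
import pandas as pd


def select_columns(df: pd.DataFrame, desired: list[str]) -> pd.DataFrame:
    """Return df restricted to `desired`, in that order (hypothetical helper)."""
    missing = [c for c in desired if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in Step 7 output: {missing}")
    return df.loc[:, desired].copy()
```

Because both branches read the Step 7 output independently, the key subset does not need to be a subset of the full column list.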

Post-pipeline QC — `post_quality_check.run_post_quality_checks`

After all per-target Parquet files have been written by Step 8, a lightweight QC pass runs once per input CSV against the concatenated Step 8 output. Its job is to catch regressions in pipeline-produced columns (label values, score ranges, fingerprint lengths) — not to re-validate the raw input. Best-effort: only runs when --end-at >= 8; if Step 8 hasn't produced output it writes a one-line "skipped" log and returns.

Outputs (next to the input-QC logs):

PostQClog_<YYYYMMDD>_<csv_basename>.log
PostQClog_<YYYYMMDD>_<csv_basename>.xlsx

See POST_QC.md for the full check list (23 checks across 5 sections), tuning constants, and design notes.

Running a Subset of Steps

Use --start-from N and --end-at N to resume from or stop at a specific step. Earlier steps that have already been saved are loaded from disk; later steps are skipped.

# Re-run only fingerprint extraction onward (steps 1-6 are loaded from disk)
python src/Main.py --start-from 7

# Run only step 9 (re-derive the key column subset) using Step 7's saved Parquet
python src/Main.py --start-from 9 --end-at 9

# Run just the early cleaning (steps 1-5), stop before label generation
python src/Main.py --end-at 5

--start-from accepts 0–9 (0 = Quality Check), --end-at accepts 0–9. Defaults: --start-from 0 --end-at 9 (run everything, including QC). Step 0 (QC) runs only when --start-from 0.
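
The flag semantics above can be sketched with `argparse`. This is an illustration rather than the parser in src/Main.py: the function names are hypothetical, and rejecting `--start-from` greater than `--end-at` is an assumption.

```python
import argparse


def parse_step_range(argv=None):
    """Parse --start-from / --end-at with the documented ranges and defaults."""
    parser = argparse.ArgumentParser(description="Run a subset of pipeline steps")
    parser.add_argument("--start-from", type=int, choices=range(0, 10),
                        default=0, metavar="N", help="first step to run (0 = QC)")
    parser.add_argument("--end-at", type=int, choices=range(0, 10),
                        default=9, metavar="N", help="last step to run")
    args = parser.parse_args(argv)
    if args.start_from > args.end_at:
        # Assumption: an inverted range is treated as a usage error.
        parser.error("--start-from must not exceed --end-at")
    return args


def steps_to_run(start_from: int, end_at: int) -> list[int]:
    """Steps executed in this run; earlier steps are loaded from disk."""
    return list(range(start_from, end_at + 1))
```

With this shape, Step 0 appears in the run list only when `--start-from` is 0, matching the rule stated above.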

Output Layout

For each input CSV, the pipeline creates one ProcessedData_<csv_basename>/ folder inside --output-dir (which defaults to --input-dir). Each step's output lives in its own folder so any step can be re-run from saved checkpoints:

ProcessedData_<csv_basename>/
├── QCaircheck<YYYYMMDD>_<csv_basename>.log    # step 0 plain-text log (date the check was run)
├── QCaircheck<YYYYMMDD>_<csv_basename>.xlsx   # step 0 same data in Excel (color-coded)
├── PostQClog_<YYYYMMDD>_<csv_basename>.log    # post-pipeline QC plain-text log
├── PostQClog_<YYYYMMDD>_<csv_basename>.xlsx   # post-pipeline QC Excel (color-coded)
├── Step1_Separated/             # step 1 — split by target           (CSV)
│   └── <target>.csv
├── Step2_WithScores/            # step 2 — score columns added       (CSV)
├── Step3_AnomalyFiltered/       # step 3 — anomalies resolved        (CSV)
├── Step4_IsomerHandled/         # step 4 — isomers split             (CSV)
├── Step5_WithNegatives/         # step 5 — masterlist negatives      (CSV)
├── Step6_MLReady/               # step 6 — labels added              (CSV)
├── Step7_WithFingerprints/      # step 7 — FPs + rename + LABEL      (Parquet)
│   └── <target>.parquet
├── Step8_FullColumns/           # step 8 — full column subset        (Parquet)
└── Step9_KeyColumns/            # step 9 — slim column subset        (Parquet)

Step 0 may also write supplementary report CSVs (FullyDuplicate_rows_report.csv, invalid_smiles_report.csv, etc.) alongside the QC log files when checks find issues — see QUALITY_CHECKS.md for the full list.
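
When resuming from a checkpoint, the layout above makes each step's file path predictable. The sketch below is a hypothetical helper; it assumes every step folder names its files `<target>.csv` or `<target>.parquet`, which the tree shows explicitly only for Steps 1 and 7.

```python
from pathlib import Path

# Folder names as listed in the layout tree.
STEP_DIRS = {
    1: "Step1_Separated",
    2: "Step2_WithScores",
    3: "Step3_AnomalyFiltered",
    4: "Step4_IsomerHandled",
    5: "Step5_WithNegatives",
    6: "Step6_MLReady",
    7: "Step7_WithFingerprints",
    8: "Step8_FullColumns",
    9: "Step9_KeyColumns",
}


def step_output_path(output_dir, csv_basename: str, step: int, target: str) -> Path:
    """Path of one target's checkpoint file for a given step."""
    ext = "csv" if step <= 6 else "parquet"  # Steps 1-6 are CSV, 7-9 Parquet
    root = Path(output_dir) / f"ProcessedData_{csv_basename}"
    return root / STEP_DIRS[step] / f"{target}.{ext}"
```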
