Create a virtual environment and install dependencies:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

Other shells:
# Git Bash
python -m venv .venv
source .venv/Scripts/activate
pip install -r requirements.txt

:: Windows cmd.exe
python -m venv .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt

.\.venv\Scripts\Activate.ps1

The pipeline takes two paths:
--input-dir: folder containing RawData/, MasterLists/, and Providers.csv. Defaults to the current directory.
--output-dir: folder where ProcessedData_<csv>/ is created. Defaults to --input-dir.
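
The defaulting rule for the two paths can be sketched with argparse. This is a minimal illustration, not the actual code in src/Main.py: the names `build_parser` and `resolve_dirs` are hypothetical, and only the two flags and their documented defaults come from the text above.

```python
from __future__ import annotations

import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two documented path flags."""
    parser = argparse.ArgumentParser(description="Path handling sketch")
    parser.add_argument(
        "--input-dir",
        default=".",  # documented default: current directory
        help="Folder containing RawData/, MasterLists/, and Providers.csv",
    )
    parser.add_argument(
        "--output-dir",
        default=None,  # None means "fall back to --input-dir"
        help="Folder where the ProcessedData output folder is created",
    )
    return parser


def resolve_dirs(argv: list[str]) -> tuple[Path, Path]:
    """Return (input_dir, output_dir); output falls back to input when omitted."""
    args = build_parser().parse_args(argv)
    input_dir = Path(args.input_dir)
    output_dir = Path(args.output_dir) if args.output_dir is not None else input_dir
    return input_dir, output_dir
```

Keeping the `--output-dir` default as `None` rather than `"."` is what lets the output follow whatever `--input-dir` was given, instead of always landing in the current directory.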
# Run from the repo root with everything in place
python src/Main.py
# Specify input only (outputs land inside the same folder)
python src/Main.py --input-dir "D:\my\dataset"
# Separate input and output
python src/Main.py --input-dir "D:\my\dataset" --output-dir "D:\my\results"

Help text:
python src/Main.py --help

Use --start-from N and --end-at N to control which steps execute. Pipeline steps are numbered 1–9 (see Readme.md for what each step does); step 0 is the Quality Check. Skipped earlier steps are loaded from their saved output on disk.
# Run only the Quality Check (step 0) and stop
python src/Main.py --end-at 0
# Run through step 2 (Quality Check, then steps 1 and 2)
python src/Main.py --end-at 2
# Run through step 1 (Quality Check, then step 1)
python src/Main.py --end-at 1
# Re-run from fingerprint extraction onward (steps 1-6 are loaded from disk)
python src/Main.py --start-from 7
# Run exactly one step (e.g. step 5)
python src/Main.py --start-from 5 --end-at 5

Defaults: --start-from 0 --end-at 9 (run everything, including QC). Quality checks (step 0) run only when --start-from 0.
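
The step-selection rules can be sketched as a small function. This is an illustration under assumptions, not the pipeline's actual logic: `plan_steps` is a hypothetical name, and the sketch assumes the Quality Check (step 0) has no saved output to reload, so only steps 1 and up are ever loaded from disk.

```python
FIRST_STEP = 0  # Quality Check
LAST_STEP = 9   # final pipeline step


def plan_steps(start_from: int = FIRST_STEP, end_at: int = LAST_STEP):
    """Return (loaded, executed) step numbers for a given range.

    Hypothetical helper: steps before start_from (except step 0) are
    treated as loaded from saved output; steps in the range are executed.
    """
    if not (FIRST_STEP <= start_from <= end_at <= LAST_STEP):
        raise ValueError("expected 0 <= start_from <= end_at <= 9")
    loaded = list(range(1, start_from))              # skipped earlier steps
    executed = list(range(start_from, end_at + 1))   # inclusive of end_at
    return loaded, executed
```

Under these assumptions, `plan_steps(start_from=7)` reports steps 1 to 6 as loaded and 7 to 9 as executed, and `plan_steps(end_at=2)` executes steps 0, 1, and 2 because the Quality Check is part of the default range.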