Add JSONL conversion + clean preprocessing stages and uv support #21
Zehui127 wants to merge 2 commits into
New downstream Snakemake targets:
- `jsonl`: convert {forwards,pairwise}-{train,test}-NNN.tar.zst shards to
json_sdiff JSONL via pegasus-evals/fasta_to_jsonl.py, in parallel per shard
with a resumable per-shard cache under results/{analysis}/jsonl/shards/.
Emits {forwards,pairwise,combined}_{train,test}.jsonl.
- `clean`: filter JSONLs through clean.py from /home/ubuntu/datasets,
emitting filtered JSONL + per-file manifest CSV under jsonl_clean/.
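A rough illustration of the resumable per-shard cache idea behind the `jsonl` target, not the code in this PR. In the pipeline itself Snakemake schedules the shards and fasta_to_jsonl.py does the conversion; here `convert_shard`, `convert_all`, the plain-text "shards", and the thread pool are all hypothetical stand-ins chosen so the sketch runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import json
import os


def convert_shard(shard: Path, out: Path) -> None:
    """Hypothetical stand-in for the real conversion: one JSON object per input line."""
    tmp = out.with_name(out.name + ".tmp")
    with shard.open() as src, tmp.open("w") as dst:
        for line in src:
            dst.write(json.dumps({"shard": shard.name, "text": line.strip()}) + "\n")
    # Rename only once the file is complete, so an interrupted run
    # never leaves a partial file that looks like a finished cache entry.
    os.replace(tmp, out)


def convert_all(shards, cache_dir: Path, merged: Path, workers: int = 4):
    """Convert only shards with no cached output, then rebuild the merged JSONL."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    outputs = [cache_dir / (s.name + ".jsonl") for s in shards]
    todo = [(s, o) for s, o in zip(shards, outputs) if not o.exists()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces evaluation so worker exceptions surface here.
        list(pool.map(lambda pair: convert_shard(*pair), todo))
    with merged.open("w") as dst:
        for out in outputs:  # fixed shard order keeps the merged file deterministic
            dst.write(out.read_text())
    return [s.name for s, _ in todo]
```

Calling `convert_all` a second time with the same shards returns an empty list, because every per-shard output already exists; only the cheap merge step is repeated.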
Both targets honour `target_analyses` and `trajectory_mode`; defaults live
in defaults/config.yaml under the `jsonl:` / `clean:` keys with per-analysis
overrides under analysis.{name}.{jsonl,clean}.
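One way the defaults-plus-override lookup described above could work, sketched as a shallow merge. The helper name `stage_config`, the `workers` key, and all values are assumptions for illustration; the PR's Snakefile may resolve overrides differently (for example with a deep merge).

```python
def stage_config(config: dict, analysis: str, stage: str) -> dict:
    """Return stage defaults with analysis.{name}.{stage} overrides applied on top."""
    merged = dict(config.get(stage, {}))  # copy so the defaults are not mutated
    overrides = config.get("analysis", {}).get(analysis, {}).get(stage, {})
    merged.update(overrides)  # shallow merge: an override replaces the whole value
    return merged


# Hypothetical config contents; only the key layout follows the description above.
config = {
    "jsonl": {"workers": 4},
    "clean": {"max_hunk_len": 50},
    "analysis": {"n450-xs": {"clean": {"max_hunk_len": 20}}},
}

print(stage_config(config, "n450-xs", "clean"))  # override wins
print(stage_config(config, "n450-xs", "jsonl"))  # falls back to defaults
```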
uv support:
- pyproject.toml mirroring requirements.txt (requires-python>=3.10,
tool.uv.package=false so uv only manages deps).
- uv.lock for reproducible installs.
- .gitignore: .venv/ and .venv-*/.
- README: install via `uv sync` or `pip install -r requirements.txt`;
prefix commands with `uv run` or activate the venv.
requirements.txt is left intact for plain-pip users.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the `jsonl` and `clean` rules pointed at scripts living in sibling repos by absolute path (/home/ubuntu/pegasus-evals, /home/ubuntu/datasets), making the pipeline non-portable. Vendor both files into scripts/ so a fresh clone works without any neighbouring repos:
- scripts/fasta_to_jsonl.py (from pegasus-evals; numpy is the only ext dep)
- scripts/clean_lib.py (from pegasus-datasets src/clean.py; stdlib only)
scripts/run_clean.py now imports clean_lib as a sibling module; the --datasets-root arg is gone.
defaults/config.yaml drops the absolute-path defaults (fasta_to_jsonl_script, datasets_root, the pegasus-evals venv python path) and relies on `python` resolving to whatever interpreter is active, which under `uv run snakemake ...` is the project venv with numpy already installed via biopython/augur.
README updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the PR @Zehui127. Working through this with Claude Code, it flagged just one thing as significant
The …and updating the call site in Does this make sense to you?
Separately, I have a small pedantic issue: I've used to
Summary
- `jsonl`: parallel per-shard FASTA → json_sdiff JSONL conversion with a resumable per-shard cache, emitting `{forwards,pairwise,combined}_{train,test}.jsonl` per analysis.
- `clean`: quality filter (`max_hunk_len`, `gap_allele_frac`, `ref_gap_frac`, `mut_density`) emitting filtered JSONL + per-file drop manifest CSV.
- `scripts/fasta_to_jsonl.py` (from pegasus-evals) and `scripts/clean_lib.py` (from pegasus-datasets) are vendored so the pipeline is self-contained; no sibling repos required.
- Both targets honour `target_analyses` and `trajectory_mode`. Defaults live under new `jsonl:` / `clean:` blocks in `defaults/config.yaml`; per-dataset overrides go under `analysis.{name}.jsonl` / `analysis.{name}.clean`.
- `pyproject.toml` + `uv.lock` for `uv sync` / `uv run snakemake ...`. `requirements.txt` is left intact for plain-pip users.

Test plan
- `uv sync` resolves and installs all deps
- `uv run snakemake --configfile defaults/viral.yaml --cores 4 -p clean --config target_analyses='["n450-xs"]'` runs end-to-end and produces 4 trajectory shards, 6 JSONL files, 6 cleaned JSONL files + manifests
- `-n` shows the expected DAG for `trajectory_mode=both` (20 jobs) and reduced DAGs for `forwards` / `pairwise` only

🤖 Generated with Claude Code
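For readers unfamiliar with the shape of the `clean` stage, here is a minimal sketch of a threshold filter that writes a drop manifest. It is not the vendored clean_lib.py: the record schema (a precomputed `metrics` dict per JSONL line), the "drop when a metric exceeds its limit" rule, the manifest columns, and every threshold value are assumptions made for illustration. Only the four threshold names come from the summary.

```python
import csv
import io
import json

# Threshold names are taken from the PR summary; the values are made up.
THRESHOLDS = {
    "max_hunk_len": 30,
    "gap_allele_frac": 0.1,
    "ref_gap_frac": 0.1,
    "mut_density": 0.05,
}


def clean(lines, thresholds):
    """Return (kept_lines, manifest_rows) for an iterable of JSONL strings."""
    kept, manifest = [], []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        metrics = record["metrics"]  # assumed precomputed; the real script may derive them
        failed = [name for name, limit in thresholds.items()
                  if metrics.get(name, 0) > limit]
        if failed:
            manifest.append({"line": lineno, "reason": ";".join(failed)})
        else:
            kept.append(line)
    return kept, manifest


def manifest_csv(rows):
    """Serialise manifest rows to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["line", "reason"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Recording the failed metric names per dropped line keeps the manifest useful for tuning thresholds later, since it shows which filter is responsible for most of the drops.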