Add JSONL conversion + clean preprocessing stages and uv support by Zehui127 · Pull Request #21 · blab/trajectories

Zehui127 · 2026-05-31T19:40:14Z

Summary

Two new Snakemake targets wrapping downstream preprocessing:
- jsonl — parallel per-shard FASTA → json_sdiff JSONL conversion with a resumable per-shard cache, emitting {forwards,pairwise,combined}_{train,test}.jsonl per analysis.
- clean — quality filter (max_hunk_len, gap_allele_frac, ref_gap_frac, mut_density) emitting filtered JSONL + per-file drop manifest CSV.
scripts/fasta_to_jsonl.py (from pegasus-evals) and scripts/clean_lib.py (from pegasus-datasets) are vendored so the pipeline is self-contained — no sibling repos required.
Both targets honour target_analyses and trajectory_mode. Defaults live under new jsonl: / clean: blocks in defaults/config.yaml; per-dataset overrides go under analysis.{name}.jsonl / analysis.{name}.clean.
Adds pyproject.toml + uv.lock for `uv sync` / `uv run snakemake ...`. requirements.txt is left intact for plain-pip users.
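As a rough illustration of the override layout described above, the sketch below overlays `analysis.{name}.{stage}` values on the stage defaults. It is not the Snakefile's actual logic: the `stage_params` helper is hypothetical, the threshold values are invented, and only the key names (`jsonl`, `clean`, `context_size`, `max_hunk_len`, `mut_density`, `n450-xs`) come from this PR.

```python
from copy import deepcopy

# Hypothetical stand-in for defaults/config.yaml after YAML parsing.
# Key names follow the PR description; the clean threshold values are made up.
config = {
    "jsonl": {"context_size": 16},
    "clean": {"max_hunk_len": 100, "mut_density": 0.1},
    "analysis": {
        "n450-xs": {"clean": {"max_hunk_len": 50}},
    },
}


def stage_params(config, analysis, stage):
    """Return the stage defaults overlaid with analysis.{name}.{stage} overrides."""
    # Copy so the shared defaults are never mutated by one analysis.
    params = deepcopy(config.get(stage, {}))
    overrides = config.get("analysis", {}).get(analysis, {}).get(stage, {})
    params.update(overrides)
    return params


print(stage_params(config, "n450-xs", "clean"))  # override wins for max_hunk_len
print(stage_params(config, "n450-xs", "jsonl"))  # no override, defaults only
```

A shallow `update` is enough here because each stage block is assumed to be a flat mapping of scalar thresholds; nested blocks would need a recursive merge.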

Test plan

- `uv sync` resolves and installs all deps
- `uv run snakemake --configfile defaults/viral.yaml --cores 4 -p clean --config target_analyses='["n450-xs"]'` runs end-to-end and produces 4 trajectory shards, 6 JSONL files, 6 cleaned JSONL files + manifests
- Dry-run `-n` shows the expected DAG for trajectory_mode=both (20 jobs) and reduced DAGs for forwards / pairwise only

🤖 Generated with Claude Code

New downstream Snakemake targets:
- `jsonl`: convert {forwards,pairwise}-{train,test}-NNN.tar.zst shards to json_sdiff JSONL via pegasus-evals/fasta_to_jsonl.py, in parallel per shard with a resumable per-shard cache under results/{analysis}/jsonl/shards/. Emits {forwards,pairwise,combined}_{train,test}.jsonl.
- `clean`: filter JSONLs through clean.py from /home/ubuntu/datasets, emitting filtered JSONL + per-file manifest CSV under jsonl_clean/.

Both targets honour `target_analyses` and `trajectory_mode`; defaults live in defaults/config.yaml under the `jsonl:` / `clean:` keys with per-analysis overrides under analysis.{name}.{jsonl,clean}.

uv support:
- pyproject.toml mirroring requirements.txt (requires-python>=3.10, tool.uv.package=false so uv only manages deps).
- uv.lock for reproducible installs.
- .gitignore: .venv/ and .venv-*/.
- README: install via `uv sync` or `pip install -r requirements.txt`; prefix commands with `uv run` or activate the venv.

requirements.txt is left intact for plain-pip users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
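The resumable per-shard cache mentioned in this commit can be pictured roughly as follows. This is a sketch under assumptions, not the PR's implementation: `convert_shard_cached` and `convert_all` are hypothetical names, the `convert` callable stands in for whatever fasta_to_jsonl.py does per shard, and a thread pool is used only to keep the example short.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_shard_cached(shard, cache_dir, convert):
    """Convert one shard unless its cached JSONL already exists (hypothetical helper)."""
    out = Path(cache_dir) / (Path(shard).name + ".jsonl")
    if out.exists():
        return out  # resumed run: reuse the finished shard
    out.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename atomically,
    # so an interrupted run never leaves a partial file in the cache.
    fd, tmp = tempfile.mkstemp(dir=out.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        for line in convert(shard):
            handle.write(line + "\n")
    os.replace(tmp, out)
    return out


def convert_all(shards, cache_dir, convert, workers=4):
    """Convert shards in parallel, returning cache paths in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: convert_shard_cached(s, cache_dir, convert), shards))
```

Calling `convert_all` a second time with the same cache directory skips every shard whose output file is already present, which is the property that makes an interrupted run cheap to restart.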

Previously the `jsonl` and `clean` rules pointed at scripts living in sibling repos by absolute path (/home/ubuntu/pegasus-evals, /home/ubuntu/datasets), making the pipeline non-portable. Vendor both files into scripts/ so a fresh clone works without any neighbouring repos:
- scripts/fasta_to_jsonl.py (from pegasus-evals; numpy is the only ext dep)
- scripts/clean_lib.py (from pegasus-datasets src/clean.py; stdlib only)

scripts/run_clean.py now imports clean_lib as a sibling module, so the --datasets-root arg is gone. defaults/config.yaml drops the absolute-path defaults (fasta_to_jsonl_script, datasets_root, the pegasus-evals venv python path) and relies on `python` resolving to whatever interpreter is active, which under `uv run snakemake ...` is the project venv with numpy already installed via biopython/augur. README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

trvrb · 2026-06-02T02:10:55Z

Thanks for the PR @Zehui127. Working through this with Claude Code, it flagged just one thing as significant:

--context-size 16 actually produces 12-bp context (scripts/fasta_to_jsonl.py)

format_json_sdiff_hunk calls find_unique_context_size(min_context=context_size, max_context=12) with max_context=12 hardcoded. With CLI default context_size=16, range(16, 13) is empty — the loop never executes and it falls through to return 12. I verified at runtime: the hunks in forwards_train.jsonl have left-context strings of exactly 12 chars (e.g. 'TGGCAGGAATCT'), not 16.

The defaults/config.yaml value jsonl.context_size: 16 is silently a no-op above 12. Three possible fixes; the cleanest is probably making max_context track min_context:

  def find_unique_context_size(ref_seq, pos, ref_allele, min_context=8, max_context=16):
      for ctx_size in range(min_context, max(min_context, max_context) + 1):
          ...

…and updating the call site in format_json_sdiff_hunk to pass max_context=context_size (or a separate max_context_size arg). Worth deciding whether the CLI default should be 12 or 16 with the broader range — the original pegasus-evals semantics may inform that.
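To make the empty-range behaviour concrete, here is a small self-contained sketch. The `_is_unique` check is a hypothetical placeholder rather than the vendored script's logic, and returning the upper bound as the fallback in the proposed version is an assumption; only the `range(...)` expressions and the hardcoded 12 are taken from the description above.

```python
def _is_unique(ref_seq, pos, ref_allele, ctx_size):
    # Hypothetical uniqueness test: the window around the allele occurs only once.
    start = max(0, pos - ctx_size)
    end = min(len(ref_seq), pos + len(ref_allele) + ctx_size)
    window = ref_seq[start:end]
    first = ref_seq.find(window)
    return ref_seq.find(window, first + 1) == -1


def find_context_current(ref_seq, pos, ref_allele, min_context=8, max_context=12):
    # Mirrors the reported behaviour: range(16, 13) is empty when min_context=16.
    for ctx_size in range(min_context, max_context + 1):
        if _is_unique(ref_seq, pos, ref_allele, ctx_size):
            return ctx_size
    return max_context


def find_context_proposed(ref_seq, pos, ref_allele, min_context=8, max_context=16):
    # Proposed range: the upper bound can never fall below min_context.
    upper = max(min_context, max_context)
    for ctx_size in range(min_context, upper + 1):
        if _is_unique(ref_seq, pos, ref_allele, ctx_size):
            return ctx_size
    return upper  # assumption: fall back to the widest context tried


ref = "ACGTTGCAAGCTTAGGCTAACGTTAGCCTAGGATCCGATTACAGGCTTAACCGGTTAAGC"
print(find_context_current(ref, 30, "G", min_context=16, max_context=12))   # 12
print(find_context_proposed(ref, 30, "G", min_context=16, max_context=12))  # 16
```

With the proposed range, a requested `context_size` of 16 is honoured even if the caller still passes a smaller `max_context`.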

Does this make sense to you?

trvrb · 2026-06-02T02:29:28Z

Separately, I have a small pedantic issue: I'm used to `snakemake clean` as a workflow endpoint that removes temporary build artifacts, in analogy to `make clean`. Can you rename this to something else? Maybe jsonl-qc?

Zehui and others added 2 commits May 31, 2026 19:31

Zehui127 wants to merge 2 commits into blab:main from Zehui127:feature/preprocessing
