Add JSONL conversion + clean preprocessing stages and uv support#21

Open
Zehui127 wants to merge 2 commits into
blab:mainfrom
Zehui127:feature/preprocessing

Conversation

@Zehui127 Zehui127 commented May 31, 2026

Summary

  • Two new Snakemake targets wrapping downstream preprocessing:
    • jsonl — parallel per-shard FASTA → json_sdiff JSONL conversion with a resumable per-shard cache, emitting {forwards,pairwise,combined}_{train,test}.jsonl per analysis.
    • clean — quality filter (max_hunk_len, gap_allele_frac, ref_gap_frac, mut_density) emitting filtered JSONL + per-file drop manifest CSV.
  • scripts/fasta_to_jsonl.py (from pegasus-evals) and scripts/clean_lib.py (from pegasus-datasets) are vendored so the pipeline is self-contained — no sibling repos required.
  • Both targets honour target_analyses and trajectory_mode. Defaults live under new jsonl: / clean: blocks in defaults/config.yaml; per-dataset overrides go under analysis.{name}.jsonl / analysis.{name}.clean.
  • Adds pyproject.toml + uv.lock so the pipeline can be installed with uv sync and run with uv run snakemake ...; requirements.txt is left intact for plain-pip users.
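
The override layering described above can be illustrated with a small sketch. It assumes a shallow merge in which keys under analysis.{name}.jsonl or analysis.{name}.clean replace the matching keys from the top-level jsonl: / clean: defaults. The helper name resolve_stage_config and the threshold values are hypothetical and are not taken from the PR; the actual Snakefile may resolve overrides differently.

```python
from copy import deepcopy


def resolve_stage_config(config, analysis_name, stage):
    """Overlay per-analysis overrides for `stage` on top of the global defaults.

    Hypothetical helper: assumes a shallow, key-by-key override.
    """
    merged = deepcopy(config.get(stage, {}))
    overrides = config.get("analysis", {}).get(analysis_name, {}).get(stage, {})
    merged.update(overrides)
    return merged


# Stand-in for the parsed defaults/config.yaml; the numeric values are made up.
config = {
    "jsonl": {"context_size": 16},
    "clean": {"max_hunk_len": 200, "mut_density": 0.05},
    "analysis": {"n450-xs": {"clean": {"max_hunk_len": 50}}},
}

clean_cfg = resolve_stage_config(config, "n450-xs", "clean")
jsonl_cfg = resolve_stage_config(config, "n450-xs", "jsonl")
print(clean_cfg)
print(jsonl_cfg)
```

With this layering, an analysis that overrides only max_hunk_len still inherits every other threshold from the defaults block.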

Test plan

  • uv sync resolves and installs all deps
  • uv run snakemake --configfile defaults/viral.yaml --cores 4 -p clean --config target_analyses='["n450-xs"]' runs end-to-end and produces 4 trajectory shards, 6 JSONL files, 6 cleaned JSONL files + manifests
  • Dry-run -n shows the expected DAG for trajectory_mode=both (20 jobs) and reduced DAGs for forwards / pairwise only

🤖 Generated with Claude Code

Zehui and others added 2 commits May 31, 2026 19:31
New downstream Snakemake targets:
- `jsonl`: convert {forwards,pairwise}-{train,test}-NNN.tar.zst shards to
  json_sdiff JSONL via pegasus-evals/fasta_to_jsonl.py, in parallel per shard
  with a resumable per-shard cache under results/{analysis}/jsonl/shards/.
  Emits {forwards,pairwise,combined}_{train,test}.jsonl.
- `clean`: filter JSONLs through clean.py from /home/ubuntu/datasets,
  emitting filtered JSONL + per-file manifest CSV under jsonl_clean/.

Both targets honour `target_analyses` and `trajectory_mode`; defaults live
in defaults/config.yaml under the `jsonl:` / `clean:` keys with per-analysis
overrides under analysis.{name}.{jsonl,clean}.

uv support:
- pyproject.toml mirroring requirements.txt (requires-python>=3.10,
  tool.uv.package=false so uv only manages deps).
- uv.lock for reproducible installs.
- .gitignore: .venv/ and .venv-*/.
- README: install via `uv sync` or `pip install -r requirements.txt`;
  prefix commands with `uv run` or activate the venv.

requirements.txt is left intact for plain-pip users.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the `jsonl` and `clean` rules pointed at scripts living in sibling
repos by absolute path (/home/ubuntu/pegasus-evals, /home/ubuntu/datasets),
making the pipeline non-portable. Vendor both files into scripts/ so a fresh
clone works without any neighbouring repos:

- scripts/fasta_to_jsonl.py (from pegasus-evals; numpy is the only ext dep)
- scripts/clean_lib.py       (from pegasus-datasets src/clean.py; stdlib only)

scripts/run_clean.py now imports clean_lib as a sibling module — the
--datasets-root arg is gone.

defaults/config.yaml drops the absolute-path defaults
(fasta_to_jsonl_script, datasets_root, the pegasus-evals venv python path) and
relies on `python` resolving to whatever interpreter is active — which under
`uv run snakemake ...` is the project venv with numpy already installed via
biopython/augur. README updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

trvrb commented Jun 2, 2026

Thanks for the PR @Zehui127. Working through this with Claude Code, it flagged just one thing as significant:


--context-size 16 actually produces 12-bp context (scripts/fasta_to_jsonl.py)

format_json_sdiff_hunk calls find_unique_context_size(min_context=context_size, max_context=12) with max_context=12 hardcoded. With CLI default context_size=16, range(16, 13) is empty — the loop never executes and it falls through to return 12. I verified at runtime: the hunks in forwards_train.jsonl have left-context strings of exactly 12 chars (e.g. 'TGGCAGGAATCT'), not 16.
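
A minimal reproduction of that fall-through, as a sketch: only the loop bounds and the final return mirror the behaviour described here. The uniqueness test (counting occurrences of the left flank) is an assumed stand-in, since the body of the vendored function is not shown in this thread.

```python
def find_unique_context_size(ref_seq, pos, ref_allele, min_context=8, max_context=12):
    """Return the smallest flank size whose left context is unique in ref_seq.

    Assumption: uniqueness is checked with str.count on the left flank only;
    ref_allele is accepted to match the reported signature but unused here.
    """
    for ctx_size in range(min_context, max_context + 1):
        left = ref_seq[max(0, pos - ctx_size):pos]
        if ref_seq.count(left) == 1:
            return ctx_size
    # Reached when no size qualifies, or when the range is empty.
    return max_context


ref_seq = "ACGT" * 10  # toy reference, not real data

# With min_context=16 and max_context=12 there is nothing to iterate over.
empty_range = list(range(16, 12 + 1))

size = find_unique_context_size(ref_seq, pos=20, ref_allele="A",
                                min_context=16, max_context=12)
print(empty_range, size)
```

The requested 16 is never tried, so the caller receives 12 regardless of the sequence.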

So the jsonl.context_size: 16 value in defaults/config.yaml is silently ignored: any setting above 12 yields 12-bp context. Three possible fixes; the cleanest is probably making max_context track min_context:

  def find_unique_context_size(ref_seq, pos, ref_allele, min_context=8, max_context=16):
      for ctx_size in range(min_context, max(min_context, max_context) + 1):
          ...

…and updating the call site in format_json_sdiff_hunk to pass max_context=context_size (or a separate max_context_size arg). Worth deciding whether the CLI default should be 12 or 16 with the broader range — the original pegasus-evals semantics may inform that.
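
A runnable sketch of that proposal, under the same caveat that the uniqueness test is an assumed stand-in for the real one. The loop bound follows the suggested range(min_context, max(min_context, max_context) + 1); returning the same upper bound on fall-through is an additional assumption, made so that the result is never smaller than the requested minimum.

```python
def find_unique_context_size(ref_seq, pos, ref_allele, min_context=8, max_context=16):
    """Variant in which the upper bound can never drop below min_context.

    Assumption: uniqueness is checked with str.count on the left flank only;
    ref_allele is accepted to match the reported signature but unused here.
    """
    upper = max(min_context, max_context)
    for ctx_size in range(min_context, upper + 1):
        left = ref_seq[max(0, pos - ctx_size):pos]
        if ref_seq.count(left) == 1:
            return ctx_size
    return upper


ref_seq = "ACGT" * 10  # toy reference, not real data
context_size = 16      # the CLI default mentioned above

# Call-site idea: let max_context follow the requested context size.
tracked = find_unique_context_size(ref_seq, pos=20, ref_allele="A",
                                   min_context=context_size,
                                   max_context=context_size)

# Even if a caller still passes the old hardcoded 12, the result is not truncated.
legacy = find_unique_context_size(ref_seq, pos=20, ref_allele="A",
                                  min_context=context_size, max_context=12)
print(tracked, legacy)
```

Whether the fall-through should return the upper bound or signal that no unique context was found is a separate design decision that the original pegasus-evals semantics may settle.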


Does this make sense to you?

Member

trvrb commented Jun 2, 2026

Separately, I have a small pedantic issue: I'm used to snakemake clean being a workflow endpoint that removes temporary build artifacts, by analogy with make clean. Could you rename this target to something else? Maybe jsonl-qc?
