PechaBridge CLI Reference

This document contains the command-line workflow and script reference. If you are a regular user, prefer the Workbench commands documented in ../README.md.

Main Scripts

  • scripts/generate_training_data.py
  • scripts/train_model.py
  • scripts/inference_sbb.py
  • scripts/ocr_on_detections.py
  • scripts/pseudo_label_from_vlm.py
  • scripts/layout_rule_filter.py
  • scripts/run_pseudo_label_workflow.py
  • scripts/download_openpecha_line_segmentation.py
  • scripts/train_line_segmentation.py
  • cli.py (unified diffusion + retrieval-encoder commands)

Install

pip install -r requirements.txt

requirements.txt is the unified dependency file for CLI, UI, VLM, diffusion/LoRA, and retrieval encoder training. Optional compatibility wrappers live in ../requirements/.

Unified CLI (cli.py)

Use:

python cli.py -h

Available subcommands:

  • prepare-texture-lora-dataset
  • train-texture-lora
  • texture-augment
  • train-image-encoder
  • train-text-encoder
  • export-text-hierarchy
  • gen-patches
  • weak-ocr-label
  • ocr-workbench
  • layout-workbench
  • transformer-layout-workbench
  • mine-mnn-pairs
  • train-text-hierarchy-vit
  • eval-text-hierarchy-vit
  • faiss-text-hierarchy-search
  • eval-faiss-crosspage
  • prepare-donut-ocr-dataset
  • eval-ocr-tokenizer
  • train-donut-ocr
  • run-donut-ocr-workflow
  • semantic-search-workbench
  • download-openpecha-ocr-lines
  • download-openpecha-line-segmentation
  • train-line-segmentation

Semantic Search Workbench

Launch the transcript retrieval UI through the unified CLI:

python cli.py semantic-search-workbench \
  --config pechabridge/semantic_search_workbench/semantic-search-config.yaml

Useful flags:

  • --reindex rebuilds the local Qdrant collection before the Gradio app starts
  • --reindex-only rebuilds the collection and exits without launching the UI
  • --api-only starts only the FastAPI microservice
  • --no-api disables the FastAPI microservice for this run even if api.enabled is true in the config

The workbench expects:

  • transcript files grouped by pecha folder
  • one text file per page
  • a metadata.json file in each pecha folder
  • OpenAI credentials in pechabridge/semantic_search_workbench/.env

For the standard transcript filename layout PPN337138764X-00000001.txt, the trailing numeric block is treated as the page number. The bundled example config uses page_number_pattern: ".*-([0-9]+)$" for this.
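
As a minimal sketch, the pattern from the example config can be tried out with Python's `re` module. The helper name `page_number_from_filename` is hypothetical, and the sketch assumes the pattern is matched against the filename stem (the name without `.txt`), since the pattern ends with `$` directly after the digits:

```python
import re
from pathlib import Path
from typing import Optional

# Pattern copied from the bundled example config (page_number_pattern).
PAGE_NUMBER_PATTERN = re.compile(r".*-([0-9]+)$")


def page_number_from_filename(filename: str) -> Optional[int]:
    """Return the page number encoded in a transcript filename, or None.

    Assumption: the pattern is applied to the stem, i.e. without ".txt".
    """
    stem = Path(filename).stem
    match = PAGE_NUMBER_PATTERN.match(stem)
    if match is None:
        return None
    # int() drops the zero padding of the trailing numeric block.
    return int(match.group(1))


print(page_number_from_filename("PPN337138764X-00000001.txt"))
```

Files that do not end in a hyphen followed by digits return `None` here; how the workbench itself treats such files is not covered by this sketch.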

The bundled example config lives at:

pechabridge/semantic_search_workbench/semantic-search-config.yaml

Retrieval flow:

DE / EN query -> OpenAI translation -> custom embeddings -> Qdrant similarity search
Tibetan query -> direct embedding -> Qdrant similarity search
Wylie / EWTS query -> pyewts conversion -> custom embeddings -> Qdrant similarity search
Qdrant hit -> context window reconstruction -> metadata.json lookup -> page-scan resolution -> optional back-translation
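
The query-side routing above can be pictured as a small dispatch function. This is only an illustrative sketch: the mode identifiers (`de_en`, `tibetan`, `wylie`) and the two injected callables are hypothetical stand-ins for the OpenAI translation step and the pyewts conversion, not the workbench's actual names:

```python
from typing import Callable


def prepare_query_text(
    query: str,
    mode: str,
    translate: Callable[[str], str],
    wylie_to_tibetan: Callable[[str], str],
) -> str:
    """Return the text that would be embedded and searched in Qdrant."""
    if mode == "de_en":
        # DE / EN queries pass through the translation step first.
        return translate(query)
    if mode == "tibetan":
        # Tibetan queries are embedded directly.
        return query
    if mode == "wylie":
        # Wylie / EWTS queries are converted before embedding.
        return wylie_to_tibetan(query)
    raise ValueError(f"unknown query mode: {mode!r}")
```

Passing the two steps in as callables keeps the routing testable without network access or API credentials.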

UI workflow:

  • choose the query mode explicitly: DE / EN, Tibetan, or Wylie (EWTS)
  • inspect ranked hit cards with matched lines, context windows, source links, and page scans
  • use the Research Workspace to filter by pecha, focus one hit with its scan, pin up to five hits for comparison, and export selected evidence as Markdown or JSON

FastAPI microservice:

  • enabled via the api section in semantic-search-config.yaml
  • exposes GET /health, GET /config, GET /index, POST /index/rebuild, and POST /search
  • can run alongside Gradio or by itself with --api-only
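
A client for these endpoints could be sketched with the standard library as follows. The endpoint paths come from the list above; the helper name `build_request` and the base URL in the usage note are assumptions, and the JSON body for `POST /search` is left to the caller because its schema is not documented here:

```python
import json
import urllib.request
from typing import Optional

# Endpoints listed in this document, as (HTTP method, path) pairs.
ENDPOINTS = {
    "health": ("GET", "/health"),
    "config": ("GET", "/config"),
    "index": ("GET", "/index"),
    "rebuild": ("POST", "/index/rebuild"),
    "search": ("POST", "/search"),
}


def build_request(
    base_url: str, name: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a request for one of the listed endpoints."""
    method, path = ENDPOINTS[name]
    data = None
    headers = {}
    if payload is not None:
        # The payload schema is not specified here; the caller supplies it.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

To send a request, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_request("http://127.0.0.1:8000", "health"))`, replacing the host and port with the values from your config.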

Metadata strategy:

  • Qdrant stores lightweight per-line references such as pecha_title, pecha_path, metadata_file, page_number, and source_file
  • the full metadata.json is loaded lazily from disk when a hit is rendered
  • this keeps the vector payloads smaller while still allowing page-image lookup via pages[].source_url
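
The lazy-loading idea can be sketched like this. The payload keys `metadata_file` and `page_number` and the `pages[].source_url` field come from the description above; the `page_number` key inside each `pages` entry and both function names are assumptions made for illustration:

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=128)
def load_metadata(metadata_file: str) -> dict:
    """Read one pecha's metadata.json from disk; repeated calls are cached."""
    return json.loads(Path(metadata_file).read_text(encoding="utf-8"))


def page_source_url(hit_payload: dict) -> Optional[str]:
    """Resolve the page-scan URL for a hit from its lightweight payload."""
    metadata = load_metadata(hit_payload["metadata_file"])
    for page in metadata.get("pages", []):
        # Assumption: every page entry stores its own "page_number".
        if page.get("page_number") == hit_payload["page_number"]:
            return page.get("source_url")
    return None
```

Only the file path and page number travel with each vector; the full metadata file is read once per pecha and then served from the cache.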

Example CLI Workflow

1) Generate synthetic dataset

python scripts/generate_training_data.py \
  --train_samples 100 \
  --val_samples 100 \
  --font_path_tibetan ext/Microsoft\ Himalaya.ttf \
  --font_path_chinese ext/simkai.ttf \
  --dataset_name tibetan-yolo

Optional: apply LoRA-based texture augmentation directly during data generation:

python scripts/generate_training_data.py \
  --train_samples 100 \
  --val_samples 20 \
  --font_path_tibetan ext/Microsoft\ Himalaya.ttf \
  --font_path_chinese ext/simkai.ttf \
  --dataset_name tibetan-yolo \
  --lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
  --lora_augment_splits train \
  --lora_augment_targets images

2) Train model

python scripts/train_model.py --dataset tibetan-yolo --epochs 100 --export

3) Inference on SBB

python scripts/inference_sbb.py --ppn 337138764X --model runs/detect/train/weights/best.pt

4) OCR / parser inference

List available parsers:

python scripts/ocr_on_detections.py --list-parsers

Legacy parser:

python scripts/ocr_on_detections.py --source image.jpg --parser legacy --model runs/detect/train/weights/best.pt --lang bod

MinerU2.5 parser:

python scripts/ocr_on_detections.py --source image.jpg --parser mineru25 --mineru-command mineru

Transformer parser examples:

python scripts/ocr_on_detections.py --source image.jpg --parser paddleocr_vl
python scripts/ocr_on_detections.py --source image.jpg --parser qwen25vl
python scripts/ocr_on_detections.py --source image.jpg --parser qwen3_vl
python scripts/ocr_on_detections.py --source image.jpg --parser granite_docling
python scripts/ocr_on_detections.py --source image.jpg --parser deepseek_ocr
python scripts/ocr_on_detections.py --source image.jpg --parser florence2
python scripts/ocr_on_detections.py --source image.jpg --parser groundingdino

5) Donut-style OCR workflow (Label 1 only)

End-to-end (generate synthetic data + prepare manifests + train OCR model):

python cli.py run-donut-ocr-workflow \
  --dataset_name tibetan-donut-ocr-label1 \
  --dataset_output_dir ./datasets \
  --font_path_tibetan "ext/Microsoft Himalaya.ttf" \
  --font_path_chinese ext/simkai.ttf \
  --train_samples 2000 \
  --val_samples 200 \
  --target_newline_token "<NL>" \
  --model_output_dir ./models/donut-ocr-label1
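
To illustrate what `--target_newline_token "<NL>"` implies for the OCR targets, here is a minimal sketch. It assumes that line breaks in multi-line target text are replaced by the token without surrounding spaces; the function names are hypothetical and the generator's exact behavior may differ:

```python
NEWLINE_TOKEN = "<NL>"  # value passed via --target_newline_token


def encode_target(text: str, newline_token: str = NEWLINE_TOKEN) -> str:
    """Flatten multi-line text into a single-line OCR target."""
    # splitlines() handles both "\n" and "\r\n" and drops a trailing break.
    return newline_token.join(text.splitlines())


def decode_target(target: str, newline_token: str = NEWLINE_TOKEN) -> str:
    """Restore line breaks in a predicted or stored target string."""
    return target.replace(newline_token, "\n")
```

Keeping the targets on one line makes them easy to store in JSONL manifests, while the token preserves the original line structure for the model.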

Optional: apply LoRA augmentation during the generation step:

python cli.py run-donut-ocr-workflow \
  --dataset_name tibetan-donut-ocr-label1 \
  --dataset_output_dir ./datasets \
  --font_path_tibetan "ext/Microsoft Himalaya.ttf" \
  --font_path_chinese ext/simkai.ttf \
  --lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
  --lora_augment_splits train \
  --lora_augment_targets images_and_ocr_crops \
  --model_output_dir ./models/donut-ocr-label1

Manual step-by-step:

# A) Synthetic data + OCR crops/targets (label 1 only for crops)
python scripts/generate_training_data.py \
  --dataset_name tibetan-donut-ocr-label1 \
  --output_dir ./datasets \
  --font_path_tibetan "ext/Microsoft Himalaya.ttf" \
  --font_path_chinese ext/simkai.ttf \
  --train_samples 2000 \
  --val_samples 200 \
  --save_rendered_text_targets \
  --save_ocr_crops \
  --ocr_crop_labels 1 \
  --target_newline_token "<NL>"

# B) Prepare JSONL manifests from ocr_targets/ocr_crops (label_id=1)
python cli.py prepare-donut-ocr-dataset \
  --dataset_dir ./datasets/tibetan-donut-ocr-label1 \
  --output_dir ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1 \
  --label_id 1

# C) Train VisionEncoderDecoder OCR model
python cli.py train-donut-ocr \
  --train_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/train_manifest.jsonl \
  --val_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/val_manifest.jsonl \
  --output_dir ./models/donut-ocr-label1 \
  --model_name_or_path microsoft/trocr-base-stage1

Recommended for OpenPecha OCR line datasets (BoSentencePiece, no tokenizer retraining):

# A) Download and merge OpenPecha OCR HF datasets into train/test/eval line format
python cli.py download-openpecha-ocr-lines \
  --output-dir ./datasets/openpecha_ocr_lines

# B) Prepare Donut manifests from line metadata (val auto-maps to eval)
python cli.py prepare-donut-ocr-dataset \
  --dataset_dir ./datasets/openpecha_ocr_lines \
  --output_dir ./datasets/openpecha_ocr_lines/donut_manifests \
  --splits train,val \
  --text_field text

# C) Compare BoSentencePiece vs baselines before training
python cli.py eval-ocr-tokenizer \
  --manifests-dir ./datasets/openpecha_ocr_lines/donut_manifests \
  --tokenizer openpecha/BoSentencePiece \
  --with-baselines \
  --output-json ./datasets/openpecha_ocr_lines/donut_manifests/tokenizer_compare.json

# D) Train Donut OCR with the same tokenizer used in evaluation
python cli.py train-donut-ocr \
  --train_manifest ./datasets/openpecha_ocr_lines/donut_manifests/train_manifest.jsonl \
  --val_manifest ./datasets/openpecha_ocr_lines/donut_manifests/val_manifest.jsonl \
  --output_dir ./models/donut-openpecha-ocr \
  --model_name_or_path microsoft/trocr-base-stage1 \
  --tokenizer_path openpecha/BoSentencePiece

Note: The Donut OCR training flow always reuses the tokenizer at the configured tokenizer path directly; there is no tokenizer retraining flag.

6) OpenPecha line segmentation dataset + YOLO training

Download the Hugging Face line-coordinate dataset and convert it into an Ultralytics segment dataset:

python cli.py download-openpecha-line-segmentation \
  --output-dir ./datasets/openpecha_line_segmentation

If you want to create a second dataset with vertically expanded line polygons, you can derive it from the raw base dataset:

python cli.py expand-line-segmentation-dataset \
  --dataset ./datasets/openpecha_line_segmentation/data.yaml \
  --output-dir ./datasets/openpecha_line_segmentation_padded \
  --top-ratio 0.20 \
  --bottom-ratio 0.20
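
One plausible reading of `--top-ratio` and `--bottom-ratio` is sketched below: points in the upper half of a line polygon move up by `top_ratio` times the polygon height, points in the lower half move down by `bottom_ratio` times the height, and the result is clamped to the image. This is an assumption about the geometry, not the command's confirmed implementation, and the function name is hypothetical:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def expand_polygon_vertically(
    points: List[Point],
    top_ratio: float,
    bottom_ratio: float,
    image_height: float,
) -> List[Point]:
    """Pad a line polygon upward and downward (pixel coordinates, y grows down)."""
    ys = [y for _, y in points]
    y_min, y_max = min(ys), max(ys)
    height = y_max - y_min
    y_mid = (y_min + y_max) / 2.0
    expanded = []
    for x, y in points:
        if y < y_mid:
            # Upper edge: move up, but not past the top of the image.
            y = max(0.0, y - top_ratio * height)
        elif y > y_mid:
            # Lower edge: move down, but not past the bottom of the image.
            y = min(float(image_height), y + bottom_ratio * height)
        expanded.append((x, y))
    return expanded
```

Ultralytics segment labels store normalized coordinates, so they would need to be scaled by the image size before and after a pixel-space operation like this one.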

If you want to write a copy of the dataset without tall/narrow line polygons to a separate dataset root, use the dedicated filter CLI:

python cli.py filter-line-segmentation-dataset \
  --dataset ./datasets/openpecha_line_segmentation_padded/data.yaml \
  --output-dir ./datasets/openpecha_line_segmentation_padded_filtered \
  --min-width-height-ratio 1.0
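
The effect of `--min-width-height-ratio` can be sketched as a bounding-box test. This assumes the ratio is computed from each polygon's axis-aligned bounding box in pixel units and that polygons below the threshold are dropped; the function names are hypothetical:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def width_height_ratio(points: List[Point]) -> float:
    """Width divided by height of the polygon's axis-aligned bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    if height == 0:
        # A degenerate flat polygon is treated as infinitely wide.
        return float("inf")
    return width / height


def keep_polygon(points: List[Point], min_ratio: float = 1.0) -> bool:
    """True if the polygon is at least min_ratio times as wide as it is tall."""
    return width_height_ratio(points) >= min_ratio
```

With a threshold of 1.0, a text line 400 px wide and 30 px tall is kept, while a polygon taller than it is wide is filtered out.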

Train a YOLO segmentation model on the converted dataset. The line-image preprocessing now belongs to the training run, not to the downloader:

python cli.py train-line-segmentation \
  --dataset ./datasets/openpecha_line_segmentation/data.yaml \
  --model yolo11n-seg.pt \
  --image-preprocess-pipeline gray \
  --epochs 100 \
  --project ./runs/segment \
  --name tibetan-line-seg

The OCR Workbench can then switch between Classical CV line splitting and Pretrained YOLO Model. The training command defaults to gray, matching the DONUT OCR gray preprocessing semantics (min_rgb, binarize=false), while the downloaded dataset stays raw.
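
As a rough illustration of the `gray` semantics, the sketch below assumes that `min_rgb` means taking the per-pixel minimum of the R, G, and B channels, with no binarization afterwards. The function name and the nested-list image representation are chosen only to keep the example dependency-free:

```python
from typing import List, Tuple

RGBPixel = Tuple[int, int, int]


def min_rgb_gray(image: List[List[RGBPixel]]) -> List[List[int]]:
    """Convert rows of (r, g, b) pixels to gray via the channel minimum.

    Assumption: this mirrors the "min_rgb" mode; binarize=false means the
    gray values are kept as-is rather than thresholded to 0/255.
    """
    return [[min(pixel) for pixel in row] for row in image]
```

Compared with a weighted luminance conversion, the channel minimum maps saturated colored ink to a darker gray value, because at least one channel of such a pixel is low.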

7) Patch Retrieval Dataset + mp-InfoNCE ViT Training (current)

Generate the patch dataset (patches/ + meta/patches.parquet) from page images:

python cli.py gen-patches \
  --model ./models/layoutModels/layout_model.pt \
  --input-dir ./sbb_images \
  --output-dir ./datasets/text_patches \
  --no-samples 100 \
  --debug-dump 10

Optional: generate weak OCR labels:

python cli.py weak-ocr-label \
  --dataset ./datasets/text_patches \
  --meta ./datasets/text_patches/meta/patches.parquet \
  --out ./datasets/text_patches/meta/weak_ocr.parquet \
  --num_workers 8 \
  --resume

Mine robust cross-page MNN positives:

python cli.py mine-mnn-pairs \
  --dataset ./datasets/text_patches \
  --meta ./datasets/text_patches/meta/patches.parquet \
  --out ./datasets/text_patches/meta/mnn_pairs.parquet \
  --config ./configs/mnn_mining.yaml \
  --num-workers 8 \
  --debug-dump 20
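
Assuming MNN stands for mutual nearest neighbours, the core idea can be sketched as follows: two patches from different pages form a positive pair only if each is the other's most similar cross-page patch. The sketch uses brute-force cosine similarity and hypothetical names; the actual miner's additional robustness filters are not reproduced here:

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutual_nearest_neighbors(
    embeddings: List[Sequence[float]], page_ids: List[str]
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that are mutual cross-page neighbours."""
    nearest: List[Optional[int]] = []
    for i, query in enumerate(embeddings):
        best_j, best_sim = None, -math.inf
        for j, candidate in enumerate(embeddings):
            if page_ids[j] == page_ids[i]:
                continue  # skip same-page patches (this also skips i itself)
            sim = cosine(query, candidate)
            if sim > best_sim:
                best_j, best_sim = j, sim
        nearest.append(best_j)
    # Keep a pair only when the nearest-neighbour relation holds both ways.
    return [
        (i, j)
        for i, j in enumerate(nearest)
        if j is not None and i < j and nearest[j] == i
    ]
```

The mutual check is what makes the pairs conservative: a patch that merely resembles a popular neighbour is not paired unless that neighbour also prefers it.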

Train a pretrained ViT/DINOv2 retrieval encoder with mp-InfoNCE using mnn, ocr, or both weak positive sources:

python cli.py train-text-hierarchy-vit \
  --dataset-dir ./datasets/text_patches \
  --output-dir ./models/text_hierarchy_vit_mpnce \
  --model-name-or-path facebook/dinov2-base \
  --train-mode patch_mpnce \
  --positive-sources both \
  --pairs-parquet ./datasets/text_patches/meta/mnn_pairs.parquet \
  --weak-ocr-parquet ./datasets/text_patches/meta/weak_ocr.parquet \
  --phase1-epochs 2 \
  --phase2-epochs 8 \
  --unfreeze-last-n-blocks 2
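
For orientation, one common multi-positive InfoNCE formulation is sketched below in plain Python. For each anchor it contrasts the summed positive scores against all other items in the batch. Whether the trainer uses exactly this variant, and the default temperature of 0.07, are assumptions; the function names are hypothetical:

```python
import math
from typing import Dict, List, Sequence


def _log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mp_info_nce(
    similarity: List[List[float]],
    positives: Dict[int, List[int]],
    temperature: float = 0.07,
) -> float:
    """Mean over anchors i of -log(sum_p exp(s_ip/t) / sum_{j != i} exp(s_ij/t)).

    similarity[i][j] is the similarity of items i and j in the batch;
    positives[i] lists the indices treated as positives for anchor i.
    """
    losses = []
    for i, row in enumerate(similarity):
        pos = [j for j in positives.get(i, []) if j != i]
        if not pos:
            continue  # anchors without positives do not contribute
        all_logits = [row[j] / temperature for j in range(len(row)) if j != i]
        pos_logits = [row[j] / temperature for j in pos]
        losses.append(_log_sum_exp(all_logits) - _log_sum_exp(pos_logits))
    return sum(losses) / len(losses) if losses else 0.0
```

With `--positive-sources both`, the positives for an anchor would plausibly be the union of its MNN partners and its weak-OCR matches, which is why a loss that accepts several positives per anchor is needed.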

Cross-page FAISS evaluation from exported embeddings (same-page results excluded):

python cli.py eval-faiss-crosspage \
  --embeddings-npy ./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy \
  --embeddings-meta ./models/text_hierarchy_vit_mpnce/faiss_embeddings_meta.parquet \
  --mnn-pairs ./datasets/text_patches/meta/mnn_pairs.parquet \
  --output-dir ./models/text_hierarchy_vit_mpnce/eval_crosspage \
  --recall-ks 1,5,10 \
  --exclude-same-page
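
The metric behind `--recall-ks` and `--exclude-same-page` can be sketched as follows. The sketch assumes Recall@K is the fraction of queries with at least one known positive among the top K candidates after same-page candidates are removed; the function and parameter names are hypothetical:

```python
from typing import Dict, List, Sequence, Set


def recall_at_ks(
    rankings: Dict[int, List[int]],
    positives: Dict[int, Set[int]],
    page_ids: Sequence[str],
    ks: Sequence[int] = (1, 5, 10),
    exclude_same_page: bool = True,
) -> Dict[int, float]:
    """rankings[q] lists candidate indices sorted by descending similarity."""
    hits = {k: 0 for k in ks}
    evaluated = 0
    for query, ranked in rankings.items():
        targets = set(positives.get(query, ()))
        if not targets:
            continue  # queries without known positives are not scored
        candidates = [
            c
            for c in ranked
            if c != query
            and not (exclude_same_page and page_ids[c] == page_ids[query])
        ]
        evaluated += 1
        for k in ks:
            if targets.intersection(candidates[:k]):
                hits[k] += 1
    return {k: (hits[k] / evaluated if evaluated else 0.0) for k in ks}
```

Removing same-page candidates before cutting the list at K prevents near-duplicate neighbours on the query's own page from occupying the top ranks.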

FAISS similarity search on a query crop (interactive inspection):

python cli.py faiss-text-hierarchy-search \
  --query-image ./some_query.png \
  --dataset-dir ./datasets/text_patches \
  --backbone-dir ./models/text_hierarchy_vit_mpnce/text_hierarchy_vit_backbone \
  --projection-head-path ./models/text_hierarchy_vit_mpnce/text_hierarchy_projection_head.pt \
  --output-dir ./models/text_hierarchy_vit_mpnce/faiss_search \
  --top-k 10

8) Legacy TextHierarchy export + ViT retrieval training (still supported)

Export line/word hierarchy crops from page images:

python cli.py export-text-hierarchy \
  --model ./models/layoutModels/layout_model.pt \
  --input-dir ./sbb_images \
  --output-dir ./datasets/text_hierarchy \
  --no_samples 100

Train on the legacy hierarchy layout:

python cli.py train-text-hierarchy-vit \
  --dataset-dir ./datasets/text_hierarchy \
  --output-dir ./models/text_hierarchy_vit \
  --train-mode legacy \
  --model-name-or-path facebook/dinov2-base \
  --target-height 64 \
  --width-buckets 256,384,512,768 \
  --max-width 1024
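
How `--target-height`, `--width-buckets`, and `--max-width` might interact is sketched below. The sketch assumes each crop is resized to the target height with its aspect ratio preserved, then assigned to the smallest bucket that fits, with anything wider falling back to the maximum width. This is an illustrative guess at the bucketing logic with hypothetical function names, not the trainer's confirmed behavior:

```python
from typing import Sequence


def resized_width(width: int, height: int, target_height: int = 64) -> int:
    """Width after scaling the crop to target_height, keeping aspect ratio."""
    return max(1, round(width * target_height / height))


def pick_width_bucket(
    width: int,
    buckets: Sequence[int] = (256, 384, 512, 768),
    max_width: int = 1024,
) -> int:
    """Smallest bucket that holds the width; wider crops use max_width."""
    for bucket in sorted(buckets):
        if width <= bucket:
            return bucket
    return max_width
```

Grouping crops of similar width into a few fixed sizes limits padding within a batch while keeping the number of distinct tensor shapes small.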

Evaluate legacy hierarchy retrieval quality:

python cli.py eval-text-hierarchy-vit \
  --dataset-dir ./datasets/text_hierarchy \
  --backbone-dir ./models/text_hierarchy_vit/text_hierarchy_vit_backbone \
  --projection-head-path ./models/text_hierarchy_vit/text_hierarchy_projection_head.pt \
  --output-dir ./models/text_hierarchy_vit/eval \
  --recall-ks 1,5,10

Label Studio (CLI)

export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=$(pwd)/datasets/tibetan-yolo

label-studio-converter import yolo \
  -i datasets/tibetan-yolo/train \
  -o ls-tasks.json \
  --image-ext ".png" \
  --image-root-url "/data/local-files/?d=train/images"

Start Label Studio:

label-studio

Additional Docs