This document contains the command-line workflow and script reference.
If you are a regular user, prefer the Workbench commands documented in ../README.md.
- scripts/generate_training_data.py
- scripts/train_model.py
- scripts/inference_sbb.py
- scripts/ocr_on_detections.py
- scripts/pseudo_label_from_vlm.py
- scripts/layout_rule_filter.py
- scripts/run_pseudo_label_workflow.py
- scripts/download_openpecha_line_segmentation.py
- scripts/train_line_segmentation.py
- cli.py (unified diffusion + retrieval-encoder commands)

pip install -r requirements.txt
requirements.txt is the unified dependency file for CLI, UI, VLM, diffusion/LoRA, and retrieval encoder training.
Optional compatibility wrappers live in ../requirements/.
Use:
python cli.py -h
Available subcommands:
- prepare-texture-lora-dataset
- train-texture-lora
- texture-augment
- train-image-encoder
- train-text-encoder
- export-text-hierarchy
- gen-patches
- weak-ocr-label
- ocr-workbench
- layout-workbench
- transformer-layout-workbench
- mine-mnn-pairs
- train-text-hierarchy-vit
- eval-text-hierarchy-vit
- faiss-text-hierarchy-search
- eval-faiss-crosspage
- prepare-donut-ocr-dataset
- eval-ocr-tokenizer
- train-donut-ocr
- run-donut-ocr-workflow
- semantic-search-workbench
- download-openpecha-ocr-lines
- download-openpecha-line-segmentation
- train-line-segmentation
Launch the transcript retrieval UI through the unified CLI:
python cli.py semantic-search-workbench \
--config pechabridge/semantic_search_workbench/semantic-search-config.yaml
Useful flags:
- --reindex rebuilds the local Qdrant collection before the Gradio app starts
- --reindex-only rebuilds the collection and exits without launching the UI
- --api-only starts only the FastAPI microservice
- --no-api disables the FastAPI microservice for this run even if api.enabled is true in the config
The workbench expects:
- transcript files grouped by pecha folder
- one text file per page
- a metadata.json file in each pecha folder
- OpenAI credentials in pechabridge/semantic_search_workbench/.env
For the standard transcript filename layout PPN337138764X-00000001.txt, the trailing numeric block is treated as the page number.
The bundled example config uses page_number_pattern: ".*-([0-9]+)$" for this.
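As a minimal sketch of how that pattern picks out the page number, the snippet below applies it to the filename stem. The function name `page_number_from_filename` is hypothetical, and matching against the stem (without the `.txt` extension) is an assumption inferred from the `$` anchor in the pattern.

```python
import re
from pathlib import Path
from typing import Optional

# Pattern copied from the bundled example config (page_number_pattern).
PAGE_NUMBER_PATTERN = re.compile(r".*-([0-9]+)$")


def page_number_from_filename(filename: str) -> Optional[int]:
    """Return the trailing numeric block of the filename stem as an int."""
    # Assumption: the pattern is matched against the stem, not the full name.
    stem = Path(filename).stem
    match = PAGE_NUMBER_PATTERN.match(stem)
    return int(match.group(1)) if match else None


print(page_number_from_filename("PPN337138764X-00000001.txt"))  # 1
```

Filenames without a trailing `-<digits>` block return `None` in this sketch, so they would need a different pattern in the config.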
The bundled example config lives at:
pechabridge/semantic_search_workbench/semantic-search-config.yaml
Retrieval flow:
DE / EN query -> OpenAI translation -> custom embeddings -> Qdrant similarity search
Tibetan query -> direct embedding -> Qdrant similarity search
Wylie / EWTS query -> pyewts conversion -> custom embeddings -> Qdrant similarity search
Qdrant hit -> context window reconstruction -> metadata.json lookup -> page-scan resolution -> optional back-translation
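The flow above can be read as a per-mode stage list followed by a shared post-processing chain. The sketch below only transcribes that ordering; the mode keys (`de_en`, `tibetan`, `wylie`), the stage names, and `plan_retrieval` are hypothetical labels, not identifiers from the workbench code.

```python
from typing import Dict, List

# Query-side stages per mode, transcribed from the retrieval flow above.
QUERY_PIPELINES: Dict[str, List[str]] = {
    "de_en": ["openai_translation", "custom_embeddings", "qdrant_similarity_search"],
    "tibetan": ["direct_embedding", "qdrant_similarity_search"],
    "wylie": ["pyewts_conversion", "custom_embeddings", "qdrant_similarity_search"],
}

# Stages applied to every Qdrant hit, regardless of query mode.
HIT_PIPELINE: List[str] = [
    "context_window_reconstruction",
    "metadata_json_lookup",
    "page_scan_resolution",
]


def plan_retrieval(mode: str, back_translate: bool = False) -> List[str]:
    """Return the ordered stage names for one query in the given mode."""
    if mode not in QUERY_PIPELINES:
        raise ValueError(f"unknown query mode: {mode!r}")
    stages = QUERY_PIPELINES[mode] + HIT_PIPELINE
    if back_translate:
        # Back-translation is optional and always runs last.
        stages = stages + ["back_translation"]
    return stages


print(plan_retrieval("wylie"))
```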
UI workflow:
- choose the query mode explicitly: DE / EN, Tibetan, or Wylie (EWTS)
- inspect ranked hit cards with matched lines, context windows, source links, and page scans
- use the Research Workspace to filter by pecha, focus one hit with its scan, pin up to five hits for comparison, and export selected evidence as Markdown or JSON
FastAPI microservice:
- enabled via the api section in semantic-search-config.yaml
- exposes GET /health, GET /config, GET /index, POST /index/rebuild, and POST /search
- can run alongside Gradio or by itself with --api-only
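A small standard-library client for these endpoints might look like the sketch below. The base URL, the helper names, and the JSON response format are assumptions; the request schema of POST /search is not documented here, so the `{"query": ...}` payload in the usage note is purely illustrative.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "http://127.0.0.1:8000"  # assumption: adjust to the host/port in your config

# Method and path per endpoint, as listed above.
ENDPOINTS = {
    "health": ("GET", "/health"),
    "config": ("GET", "/config"),
    "index": ("GET", "/index"),
    "rebuild": ("POST", "/index/rebuild"),
    "search": ("POST", "/search"),
}


def build_request(name: str, payload: Optional[dict] = None,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for a named endpoint."""
    method, path = ENDPOINTS[name]
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def call(name: str, payload: Optional[dict] = None,
         base_url: str = BASE_URL, timeout: float = 10.0) -> Any:
    """Send the request and decode the body, assuming a JSON response."""
    request = build_request(name, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service started via --api-only, `call("health")` would be a reasonable first check; a search call would look like `call("search", {"query": "..."})` once the real payload fields are confirmed against the running service.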
Metadata strategy:
- Qdrant stores lightweight per-line references such as pecha_title, pecha_path, metadata_file, page_number, and source_file
- the full metadata.json is loaded lazily from disk when a hit is rendered
- this keeps the vector payloads smaller while still allowing page-image lookup via pages[].source_url
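The lazy-loading idea can be sketched as a cached file read plus a lookup over `pages[]`. This is an illustration under assumptions, not the workbench implementation: `load_metadata` and `resolve_page_scan` are hypothetical names, and each `pages[]` entry is assumed to carry a `page_number` field next to `source_url`.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Optional


@lru_cache(maxsize=128)
def load_metadata(metadata_file: str) -> dict:
    """Read metadata.json from disk once per path and cache the parsed result."""
    return json.loads(Path(metadata_file).read_text(encoding="utf-8"))


def resolve_page_scan(payload: dict) -> Optional[str]:
    """Map a lightweight Qdrant payload to the page scan URL, if one is listed."""
    metadata = load_metadata(payload["metadata_file"])
    for page in metadata.get("pages", []):
        # Assumption: pages[] entries expose "page_number" alongside "source_url".
        if page.get("page_number") == payload["page_number"]:
            return page.get("source_url")
    return None
```

Because only `metadata_file` and `page_number` travel with each vector, the full metadata is parsed at most once per pecha folder while hits are being rendered.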
python scripts/generate_training_data.py \
--train_samples 100 \
--val_samples 100 \
--font_path_tibetan ext/Microsoft\ Himalaya.ttf \
--font_path_chinese ext/simkai.ttf \
--dataset_name tibetan-yolo
Optional: apply LoRA-based texture augmentation directly during data generation:
python scripts/generate_training_data.py \
--train_samples 100 \
--val_samples 20 \
--font_path_tibetan ext/Microsoft\ Himalaya.ttf \
--font_path_chinese ext/simkai.ttf \
--dataset_name tibetan-yolo \
--lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
--lora_augment_splits train \
--lora_augment_targets images

python scripts/train_model.py --dataset tibetan-yolo --epochs 100 --export

python scripts/inference_sbb.py --ppn 337138764X --model runs/detect/train/weights/best.pt

List available parsers:
python scripts/ocr_on_detections.py --list-parsers
Legacy parser:
python scripts/ocr_on_detections.py --source image.jpg --parser legacy --model runs/detect/train/weights/best.pt --lang bod
MinerU2.5 parser:
python scripts/ocr_on_detections.py --source image.jpg --parser mineru25 --mineru-command mineru
Transformer parser examples:
python scripts/ocr_on_detections.py --source image.jpg --parser paddleocr_vl
python scripts/ocr_on_detections.py --source image.jpg --parser qwen25vl
python scripts/ocr_on_detections.py --source image.jpg --parser qwen3_vl
python scripts/ocr_on_detections.py --source image.jpg --parser granite_docling
python scripts/ocr_on_detections.py --source image.jpg --parser deepseek_ocr
python scripts/ocr_on_detections.py --source image.jpg --parser florence2
python scripts/ocr_on_detections.py --source image.jpg --parser groundingdino

End-to-end (generate synthetic data + prepare manifests + train OCR model):
python cli.py run-donut-ocr-workflow \
--dataset_name tibetan-donut-ocr-label1 \
--dataset_output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--train_samples 2000 \
--val_samples 200 \
--target_newline_token "<NL>" \
--model_output_dir ./models/donut-ocr-label1
Optional with LoRA augmentation during the generation step:
python cli.py run-donut-ocr-workflow \
--dataset_name tibetan-donut-ocr-label1 \
--dataset_output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
--lora_augment_splits train \
--lora_augment_targets images_and_ocr_crops \
--model_output_dir ./models/donut-ocr-label1
Manual step-by-step:
# A) Synthetic data + OCR crops/targets (label 1 only for crops)
python scripts/generate_training_data.py \
--dataset_name tibetan-donut-ocr-label1 \
--output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--train_samples 2000 \
--val_samples 200 \
--save_rendered_text_targets \
--save_ocr_crops \
--ocr_crop_labels 1 \
--target_newline_token "<NL>"
# B) Prepare JSONL manifests from ocr_targets/ocr_crops (label_id=1)
python cli.py prepare-donut-ocr-dataset \
--dataset_dir ./datasets/tibetan-donut-ocr-label1 \
--output_dir ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1 \
--label_id 1
# C) Train VisionEncoderDecoder OCR model
python cli.py train-donut-ocr \
--train_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/train_manifest.jsonl \
--val_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/val_manifest.jsonl \
--output_dir ./models/donut-ocr-label1 \
--model_name_or_path microsoft/trocr-base-stage1

Recommended for OpenPecha OCR line datasets (BoSentencePiece, no tokenizer retraining):
# A) Download and merge OpenPecha OCR HF datasets into train/test/eval line format
python cli.py download-openpecha-ocr-lines \
--output-dir ./datasets/openpecha_ocr_lines
# B) Prepare Donut manifests from line metadata (val auto-maps to eval)
python cli.py prepare-donut-ocr-dataset \
--dataset_dir ./datasets/openpecha_ocr_lines \
--output_dir ./datasets/openpecha_ocr_lines/donut_manifests \
--splits train,val \
--text_field text
# C) Compare BoSentencePiece vs baselines before training
python cli.py eval-ocr-tokenizer \
--manifests-dir ./datasets/openpecha_ocr_lines/donut_manifests \
--tokenizer openpecha/BoSentencePiece \
--with-baselines \
--output-json ./datasets/openpecha_ocr_lines/donut_manifests/tokenizer_compare.json
# D) Train Donut OCR with the same tokenizer used in evaluation
python cli.py train-donut-ocr \
--train_manifest ./datasets/openpecha_ocr_lines/donut_manifests/train_manifest.jsonl \
--val_manifest ./datasets/openpecha_ocr_lines/donut_manifests/val_manifest.jsonl \
--output_dir ./models/donut-openpecha-ocr \
--model_name_or_path microsoft/trocr-base-stage1 \
--tokenizer_path openpecha/BoSentencePiece
Note: The Donut OCR training flow now always reuses the configured tokenizer path directly (no tokenizer retraining flag).
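The `--target_newline_token "<NL>"` option used in the synthetic-data commands earlier suggests that line breaks in multi-line OCR targets are stored as a literal token. The sketch below shows that round trip in isolation; `encode_target` and `decode_target` are hypothetical helpers, and the exact normalization performed by the generator is an assumption.

```python
NEWLINE_TOKEN = "<NL>"  # value passed via --target_newline_token


def encode_target(text: str, newline_token: str = NEWLINE_TOKEN) -> str:
    """Flatten a multi-line transcription into a single-line OCR target."""
    # Normalize Windows and old Mac line endings before substituting the token.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return newline_token.join(normalized.split("\n"))


def decode_target(target: str, newline_token: str = NEWLINE_TOKEN) -> str:
    """Restore line breaks in a predicted or stored OCR target."""
    return target.replace(newline_token, "\n")


encoded = encode_target("first line\nsecond line")
print(encoded)  # first line<NL>second line
print(decode_target(encoded) == "first line\nsecond line")  # True
```

If the token is used this way, whichever tokenizer is configured for training needs to handle `<NL>` consistently, which is one more reason to evaluate the tokenizer on the prepared manifests before training.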
Download the Hugging Face line-coordinate dataset and convert it into an Ultralytics segment dataset:
python cli.py download-openpecha-line-segmentation \
--output-dir ./datasets/openpecha_line_segmentation
If you want to create a second dataset with vertically expanded line polygons, you can derive it from the raw base dataset:
python cli.py expand-line-segmentation-dataset \
--dataset ./datasets/openpecha_line_segmentation/data.yaml \
--output-dir ./datasets/openpecha_line_segmentation_padded \
--top-ratio 0.20 \
--bottom-ratio 0.20
To drop tall or narrow line polygons and write the result to a separate dataset root, use the dedicated filter CLI:
python cli.py filter-line-segmentation-dataset \
--dataset ./datasets/openpecha_line_segmentation_padded/data.yaml \
--output-dir ./datasets/openpecha_line_segmentation_padded_filtered \
--min-width-height-ratio 1.0
Train a YOLO segmentation model on the converted dataset. The line-image preprocessing now belongs to the training run, not to the downloader:
python cli.py train-line-segmentation \
--dataset ./datasets/openpecha_line_segmentation/data.yaml \
--model yolo11n-seg.pt \
--image-preprocess-pipeline gray \
--epochs 100 \
--project ./runs/segment \
--name tibetan-line-seg
The OCR Workbench can then switch between Classical CV line splitting and Pretrained YOLO Model.
The training command defaults to gray, matching the DONUT OCR gray preprocessing semantics (min_rgb, binarize=false), while the downloaded dataset stays raw.
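To make the gray semantics concrete, the sketch below treats `min_rgb` as a per-pixel minimum over the three colour channels and leaves the values continuous, matching `binarize=false`. That reading of `min_rgb` is an assumption based on the name, and `gray_min_rgb` is a hypothetical helper written with plain lists so it runs without extra dependencies.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def gray_min_rgb(image: List[List[Pixel]]) -> List[List[int]]:
    """Convert an RGB image (rows of (r, g, b) tuples) to single-channel gray.

    Each output value is the minimum of the pixel's three channels.
    No thresholding is applied, so intermediate gray levels are preserved.
    """
    return [[min(pixel) for pixel in row] for row in image]


# One row: a white background pixel followed by a reddish ink pixel.
sample = [[(255, 255, 255), (200, 30, 40)]]
print(gray_min_rgb(sample))  # [[255, 30]]
```

Under this reading, coloured ink stays dark after conversion because its weakest channel determines the gray value, while a white background remains white.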
Generate the patch dataset (patches/ + meta/patches.parquet) from page images:
python cli.py gen-patches \
--model ./models/layoutModels/layout_model.pt \
--input-dir ./sbb_images \
--output-dir ./datasets/text_patches \
--no-samples 100 \
--debug-dump 10
Optional: generate weak OCR labels:
python cli.py weak-ocr-label \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/weak_ocr.parquet \
--num_workers 8 \
--resume
Mine robust cross-page MNN positives:
python cli.py mine-mnn-pairs \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/mnn_pairs.parquet \
--config ./configs/mnn_mining.yaml \
--num-workers 8 \
--debug-dump 20
Train a pretrained ViT/DINOv2 retrieval encoder with mp-InfoNCE using mnn, ocr, or both weak positive sources:
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/text_patches \
--output-dir ./models/text_hierarchy_vit_mpnce \
--model-name-or-path facebook/dinov2-base \
--train-mode patch_mpnce \
--positive-sources both \
--pairs-parquet ./datasets/text_patches/meta/mnn_pairs.parquet \
--weak-ocr-parquet ./datasets/text_patches/meta/weak_ocr.parquet \
--phase1-epochs 2 \
--phase2-epochs 8 \
--unfreeze-last-n-blocks 2
Cross-page FAISS evaluation from exported embeddings (same-page results excluded):
python cli.py eval-faiss-crosspage \
--embeddings-npy ./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy \
--embeddings-meta ./models/text_hierarchy_vit_mpnce/faiss_embeddings_meta.parquet \
--mnn-pairs ./datasets/text_patches/meta/mnn_pairs.parquet \
--output-dir ./models/text_hierarchy_vit_mpnce/eval_crosspage \
--recall-ks 1,5,10 \
--exclude-same-page
FAISS similarity search on a query crop (interactive inspection):
python cli.py faiss-text-hierarchy-search \
--query-image ./some_query.png \
--dataset-dir ./datasets/text_patches \
--backbone-dir ./models/text_hierarchy_vit_mpnce/text_hierarchy_vit_backbone \
--projection-head-path ./models/text_hierarchy_vit_mpnce/text_hierarchy_projection_head.pt \
--output-dir ./models/text_hierarchy_vit_mpnce/faiss_search \
--top-k 10
Export line/word hierarchy crops from page images:
python cli.py export-text-hierarchy \
--model ./models/layoutModels/layout_model.pt \
--input-dir ./sbb_images \
--output-dir ./datasets/text_hierarchy \
--no_samples 100
Train on the legacy hierarchy layout:
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/text_hierarchy \
--output-dir ./models/text_hierarchy_vit \
--train-mode legacy \
--model-name-or-path facebook/dinov2-base \
--target-height 64 \
--width-buckets 256,384,512,768 \
--max-width 1024
Evaluate legacy hierarchy retrieval quality:
python cli.py eval-text-hierarchy-vit \
--dataset-dir ./datasets/text_hierarchy \
--backbone-dir ./models/text_hierarchy_vit/text_hierarchy_vit_backbone \
--projection-head-path ./models/text_hierarchy_vit/text_hierarchy_projection_head.pt \
--output-dir ./models/text_hierarchy_vit/eval \
--recall-ks 1,5,10

export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=$(pwd)/datasets/tibetan-yolo
label-studio-converter import yolo \
-i datasets/tibetan-yolo/train \
-o ls-tasks.json \
--image-ext ".png" \
--image-root-url "/data/local-files/?d=train/images"
Start Label Studio:
label-studio

- Pseudo-labeling and Label Studio import details: pseudo_labeling_label_studio.md
- Patch dataset generation: dataset_generation.md
- MNN mining (cross-page positives): mnn_mining.md
- Retrieval training (mp-InfoNCE + MNN/OCR): retrieval_mpnce_training.md
- Weak OCR labeling: weak_ocr.md
- Diffusion + LoRA details: texture_augmentation.md
- Retrieval roadmap: tibetan_ngram_retrieval_plan.md