This repository contains a plan-based extraction pipeline (V2) for extracting structured data from clinical trial research papers (PDFs) using Large Language Models. The pipeline has four stages: Chunking → Planning → Extraction → Evaluation, with explicit extraction plans for interpretability and category-aware evaluation.
- Pipeline Overview
- Architecture & Complete Flow
- Module Reference (`src/`)
- Configuration
- Output Structure
- Setup and Dependencies
- Usage
- Preprocessing (used by Chunking)
- Architecture Notes
High-level flow:
```mermaid
flowchart TD
    A[PDF Input] --> B[Stage 1: Chunking]
    B --> C[pdf_chunked.json]
    C --> D[Stage 2: Planning]
    D --> E[Extraction Plans]
    E --> F[Stage 3: Extraction]
    F --> G[extraction_metadata.json + CSV]
    G --> H[Stage 4: Evaluation]
    H --> I[evaluation_results.json + summary_metrics.json]
    style B fill:#e1f5ff
    style D fill:#fff0e1
    style F fill:#fff4e1
    style H fill:#e8f5e9
```
- Stage 1 (Chunking): PDF → text/table/figure chunks, with optional LLM-based page classification.
- Stage 2 (Planning): For each column group, an LLM (with PDF + chunks) produces a free-form extraction plan; a local structurer turns it into JSON (where to look, page, source_type, confidence).
- Stage 3 (Extraction): Plans are executed: for each group, the LLM extracts values from plan-relevant chunks; structurer produces structured extractions; results are merged into one row + metadata.
- Stage 4 (Evaluation): Extracted row is compared to ground truth with category-aware scoring (exact_match, numeric_tolerance, structured_text); correctness/completeness per column and summary metrics.
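To make the data flow concrete, the chunk records that Stage 1 hands to later stages can be sketched as plain dicts. Field names (`type`, `content`, `page`) follow the description above; the example values and the helper function are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of Stage 1 output records; the real schema in
# pdf_chunked.json may carry additional fields.
chunks = [
    {"type": "text", "content": "Patients were randomized 1:1 ...", "page": 3},
    {"type": "table", "content": "Table 2. Baseline characteristics ...", "page": 4},
    {"type": "figure", "content": "Figure 1. CONSORT flow diagram ...", "page": 2},
]

def chunks_of_type(chunks, chunk_type):
    """Filter chunks by type, as later stages do when selecting sources."""
    return [c for c in chunks if c["type"] == chunk_type]

table_chunks = chunks_of_type(chunks, "table")
```

Planning and extraction both consume lists shaped like this, which is what makes the plan's `source_type`/`page` hints actionable.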
```
PDF
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: CHUNKING                                               │
│ process_pdf() → optional PageClassifier → PDFChunker.chunk()    │
│ Output: chunking/pdf_chunked.json                               │
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: PLANNING                                               │
│ PlanGenerator.generate_plans(pdf, chunks)                       │
│ • LLM (e.g. Gemini) + PDF → free-form plan per group            │
│ • OutputStructurer (local Qwen) → GroupExtractionPlanV2 JSON    │
│ Output: planning/*_plan.json, plans_all_columns.json            │
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: EXTRACTION                                             │
│ PlanExecutor.execute_plans(pdf, chunks, plans)                  │
│ • For each group: find_relevant_chunks(plan) → LLM + PDF        │
│ • OutputStructurer → GroupExtractionV2 → merge                  │
│ Output: extraction/extraction_metadata.json, extracted_table.csv│
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: EVALUATION                                             │
│ EvaluatorV2.run()                                               │
│ • Load extraction + ground truth + definitions (with category)  │
│ • Batches by category → Gemini judge → structurer → scores      │
│ Output: evaluation/evaluation_results.json, summary_metrics.json│
└─────────────────────────────────────────────────────────────────┘
```
| Stage / Cross-cutting | Directory / File | Role |
|---|---|---|
| Entry point | `src/main/main_v2.py` | CLI menu; runs choices 1–6 (chunk only, plan only, extract only, eval only, full pipeline, or plan→extract→eval from existing chunks). |
| Config | `src/config/config.py` | Paths, API keys, per-stage providers/models, chunking/planning/extraction/eval workers, structurer URL. |
| Chunking | `src/chunking/` | Page classification, PDF chunking (text, table, figure), preprocessing hooks. |
| Planning | `src/planning/plan_generator.py` | `PlanGenerator`: LLM + PDF → plan text → structurer → `GroupExtractionPlanV2` per group. |
| Extraction | `src/extraction/plan_executor.py` | `PlanExecutor`: load plans, find relevant chunks, LLM extract → structurer → `GroupExtractionV2`, merge and write CSV + metadata. |
| LLM | `src/LLMProvider/provider.py` | `LLMProvider`: Gemini, OpenAI, Novita, Groq, DeepInfra; `generate()`, `generate_with_pdf()`, `upload_pdf()`/`cleanup_pdf()`. |
| Structurer | `src/LLMProvider/structurer.py` | `OutputStructurer`: free-form text → JSON via local model (e.g. vLLM Qwen) with Pydantic schema. |
| Table defs | `src/table_definitions/definitions.py` | `load_definitions()`: column groups from CSV (Label → list of Column Name + Definition). |
| Evaluation | `src/evaluation/evaluator_v2.py` | `EvaluatorV2`: load data, group by eval category, batch, Gemini judge, structurer, aggregate and save. |
| Preprocessing | `src/preprocessing/pdf_margin_preprocessing.py` | Header/footer detection and cleaning used during chunking. |
| Utils | `src/utils/logging_utils.py` | `setup_logger()` for consistent logging. |
- Entry point: `main()` (interactive PDF path + menu) or `run_pipeline_from_args(pdf_path, choice)` for programmatic/web use.
- Choices: `1` Chunking only; `2` Planning only; `3` Extraction only; `4` Evaluation only; `5` Full pipeline; `6` Planning → Extraction → Evaluation (reuse existing chunks).
- Helpers: `run_chunking()`, `run_planning()`, `run_extraction()`, `run_evaluation()`; `create_versioned_output_dir()` when `VERSION_OUTPUTS` is True; `_find_existing()` to resolve existing chunk/plan/extraction paths for skip-if-exists behavior.
- Output root: `RESULTS_BASE_DIR / pdf_name` (optionally with a versioned `run_YYYY-MM-DD_HH-MM-SS` directory and a `latest` symlink).
- API keys / cloud config: from `.env`: `VERTEX_API_KEY` (local Vertex testing), `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, `OPENAI_API_KEY`, `NOVITA_API_KEY`, `LLAMA_KEY` (Groq), `DEEPINFRA_API_KEY`.
- Per-task LLM: `CHUNKING_*`, `EXTRACTION_*`, `EVALUATION_*` (legacy); V2: `PLANNING_PROVIDER`/`PLANNING_MODEL`, `EXTRACTION_PROVIDER_V2`/`EXTRACTION_MODEL_V2`, `EVALUATION_PROVIDER_V2`/`EVALUATION_MODEL_V2`; workers: `PLANNING_WORKERS`, `EXTRACTION_WORKERS`, `EVALUATION_WORKERS`.
- Structurer (local): `STRUCTURER_BASE_URL` (e.g. `http://localhost:8001/v1`), `STRUCTURER_MODEL` (e.g. `Qwen/Qwen3-8B`).
- Chunking: `TEXT_CHUNK_MIN_SIZE`, `CHUNKING_MODE`, `PIXMAP_RESOLUTION`, `USE_LLM_PAGE_CLASSIFICATION`, `PAGE_CLASSIFICATION_MODEL`.
- Paths: `DEFINITIONS_CSV_PATH`, `DEFINITIONS_EVAL_CATEGORY_PATH`, `GOLD_TABLE_JSON_PATH`, `RESULTS_BASE_DIR`.
- Pipeline behavior: `VERSION_OUTPUTS`, `SKIP_STAGE_IF_EXISTS`, `EXTRACTION_MODE` (e.g. `"plan"`).
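A minimal `.env` might look like the following. All values are placeholders; set only the keys for the providers you actually use:

```shell
# Example .env (placeholder values, not real credentials)
VERTEX_API_KEY=your-vertex-key
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
OPENAI_API_KEY=your-openai-key
NOVITA_API_KEY=your-novita-key
LLAMA_KEY=your-groq-key
DEEPINFRA_API_KEY=your-deepinfra-key
```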
`chunking.py`
- `process_pdf(pdf_path, output_path, use_llm_classification)`: Top-level entry. Optionally runs `PageClassifier.classify()`, then builds `PDFChunker(pdf_path, page_metadata).chunk()` and saves JSON.
- `PDFChunker`: Holds `table_pages`/`figure_pages` from metadata (or processes all pages). For each page: `_process_page_text()` (accumulate cleaned text), `_process_tables()` (LLM + optional pdfplumber fallback), `_process_figures()` (regex + LLM description). After all pages: `_create_large_text_chunks()` (sentence/paragraph chunking). Returns a list of chunks (type: text/table/figure, content, page, etc.).

`page_classifier.py`
- `PageClassifier`: Uses Gemini to classify which pages have tables/figures; uses `OutputStructurer` to get `TablesResponse`/`FiguresResponse`. `classify()` returns `{"tables": [...], "figures": [...]}` for targeted chunking.

`utils_chunking.py`
- Text chunking (`text_chunking()`), table extraction helpers (`extract_tables_pdfplumber()`, `parse_table_extraction_response()`), image/LLM helpers (`ask_gemini_with_image()`, `extract_caption_from_gemini()`), `save_chunks_to_json()`; heuristic header/footer filtering used with preprocessing.
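The large-text chunking step can be sketched roughly as follows. This is a simplified stand-in for `_create_large_text_chunks()`: the real implementation works on cleaned page text with sentence-level splitting and the configured `TEXT_CHUNK_MIN_SIZE`, while this sketch splits on blank lines only:

```python
def chunk_text(pages, min_size=200):
    """Greedily merge paragraphs into chunks of at least `min_size` chars.

    Sketch only: `pages` is a list of (page_number, text) pairs; the
    repo's chunker uses sentence boundaries rather than blank lines.
    """
    chunks, buf, start_page = [], "", None
    for page_no, text in pages:
        for para in text.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if start_page is None:
                start_page = page_no
            buf = f"{buf} {para}".strip()
            if len(buf) >= min_size:
                chunks.append({"type": "text", "content": buf, "page": start_page})
                buf, start_page = "", None
    if buf:  # flush the trailing partial chunk
        chunks.append({"type": "text", "content": buf, "page": start_page})
    return chunks
```

Each chunk records the page where its first paragraph started, which is what lets plans point extraction back to a page.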
`plan_generator.py`
- Data structures: `Column`, `ColumnGroup`; `ColumnExtractionPlanV2` (column_index, column_name, found_in_pdf, page, source_type, confidence, extraction_plan), `GroupExtractionPlanV2` (group_name, columns).
- `PlanGenerator(provider, definitions, structurer=None, name_policy)`: Builds column groups from definitions; uses an optional `OutputStructurer` (default from config).
- `generate_plan_for_group(group, pdf_handle, chunks, output_dir)`: Builds a prompt with chunk summaries and canonical columns; calls `provider.generate_with_pdf()`; writes the raw plan to `logs/{stem}_raw.txt`; calls `structurer.structure()` with the `GroupExtractionPlanV2` schema; validates/normalizes with `validate_and_normalize_group_plan()`; saves `{stem}_plan.json`.
- `generate_plans(pdf_path, chunks, output_dir, workers)`: Uploads the PDF, runs `generate_plan_for_group` per group in parallel (ThreadPoolExecutor), writes `plans_all_columns.json`, returns `{group_name: plan_data}`.
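The plan data structures are, in spirit, small records per column. The sketch below uses dataclasses as a dependency-free stand-in (the repo validates these with Pydantic schemas); field names follow the description above, while the types and example values are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColumnExtractionPlanV2:
    # Field names from the README; concrete types are assumptions.
    column_index: int
    column_name: str
    found_in_pdf: bool
    page: Optional[int]
    source_type: str          # e.g. "text", "table", "figure"
    confidence: float
    extraction_plan: str      # free-form "where/how to look" instructions

@dataclass
class GroupExtractionPlanV2:
    group_name: str
    columns: List[ColumnExtractionPlanV2] = field(default_factory=list)

# Hypothetical plan for one group
plan = GroupExtractionPlanV2(
    group_name="Enrollment",
    columns=[
        ColumnExtractionPlanV2(
            column_index=0,
            column_name="Sample size",
            found_in_pdf=True,
            page=4,
            source_type="table",
            confidence=0.9,
            extraction_plan="Read total randomized N from Table 1 on page 4.",
        )
    ],
)
```

Keeping the plan this explicit is what makes Stage 3 traceable: every extracted value can be tied back to a planned page and source type.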
`plan_executor.py`
- Data structures: `ColumnExtractionV2` (column_index, column_name, value, evidence, page, confidence), `GroupExtractionV2` (group_name, extractions).
- Helpers: `find_relevant_chunks(plan.columns, chunks)` (by source_type and page); `format_chunks()`, `format_columns_for_prompt()`; `validate_and_normalize_plan()`, `validate_and_normalize_extraction()`; `_generate_outputs()` (writes `extraction_metadata.json` and `extracted_table.csv`).
- `_extract_group(...)`: For a single group: filter the plan to `found_in_pdf` columns, get relevant chunks, build a prompt with columns + chunks; `provider.generate_with_pdf()`; log the raw output; `structurer.structure(GroupExtractionV2)` (with OpenAI/Gemini fallback on failure); normalize the extraction and return `GroupExtractionV2`.
- `PlanExecutor(provider, structurer, name_policy).execute_plans(pdf_path, chunks, plans, output_path, workers)`: Loads definitions, validates plans, uploads the PDF; runs `_extract_group` per group in parallel; calls `_generate_outputs()`; returns the loaded metadata dict.
- `load_plans_from_dir(plans_dir)`: Loads from `plans_all_columns.json` or `*_plan.json` in the directory.
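The chunk-selection step can be sketched as follows. This is a simplified stand-in for `find_relevant_chunks()`, matching chunks against each planned column's `source_type` and `page`; the `page_slack` tolerance is an assumption, not something the source documents:

```python
def find_relevant_chunks(plan_columns, chunks, page_slack=1):
    """Return chunks whose type and page match some planned column.

    Sketch: `plan_columns` are dicts with `found_in_pdf`, `source_type`,
    and `page`; `page_slack` (assumed) tolerates off-by-one page numbers.
    """
    relevant = []
    for chunk in chunks:
        for col in plan_columns:
            if not col.get("found_in_pdf", True):
                continue  # column not expected in this PDF; skip
            same_type = chunk["type"] == col["source_type"]
            near_page = (col["page"] is not None
                         and abs(chunk["page"] - col["page"]) <= page_slack)
            if same_type and near_page:
                relevant.append(chunk)
                break  # one match is enough to keep this chunk
    return relevant
```

Sending only these chunks (rather than the whole document) is what keeps the extraction prompt focused on plan-relevant evidence.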
`provider.py`
- `LLMProvider(provider, model)`: Unified interface for Gemini, OpenAI, Novita, Groq, DeepInfra. Methods: `generate()`, `generate_with_image()`, `generate_with_pdf(prompt, pdf_handle, ...)`, `upload_pdf(path)`/`cleanup_pdf(handle)`, and batch helpers. Returns `LLMResponse` (text, tokens, cost, success, error).
`structurer.py`
- `OutputStructurer(base_url, model, api_key, enable_thinking)`: Uses an OpenAI-compatible client (e.g. vLLM) to turn free-form text into JSON. `structure(text, schema, max_retries, temperature, return_dict)` builds a schema prompt, calls the model, parses the JSON, and validates it with Pydantic; returns `StructurerResponse` (data, success, attempts, error).
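The structurer's contract (free-form text in, schema-validated JSON out, with retries) can be sketched with a stub model call. Here `required_keys` stands in for the Pydantic schema and `call_model(prompt)` for the OpenAI-compatible client; both are simplifications of the real `OutputStructurer.structure()`:

```python
import json

def structure(text, required_keys, call_model, max_retries=3):
    """Ask a model for JSON and retry until it parses and has the keys.

    Sketch only: the real structurer builds its prompt from a Pydantic
    schema and validates the parsed JSON with Pydantic.
    """
    prompt = f"Return JSON with keys {sorted(required_keys)} for:\n{text}"
    last_error = None
    for attempt in range(1, max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            if not required_keys.issubset(data):
                raise ValueError(f"missing keys: {required_keys - set(data)}")
            return {"data": data, "success": True, "attempts": attempt, "error": None}
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = str(exc)
    return {"data": None, "success": False, "attempts": max_retries, "error": last_error}
```

The retry-until-valid loop is the point: a small local model occasionally emits malformed JSON, and re-asking is cheaper than failing the whole group.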
`definitions.py`
- `load_definitions(csv_path=None, cols_to_test_path=None)`: Reads a CSV (default from config); groups rows by `Label`; each group is a list of `{"Column Name", "Definition"}`. Optional filtering via `cols_to_test_path` (included labels).
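The grouping logic can be sketched in a few lines. This simplified stand-in for `load_definitions()` takes CSV text directly (the real function takes a path and supports label filtering); the column headers `Label`, `Column Name`, and `Definition` come from the description above:

```python
import csv
import io

def load_definitions(csv_text):
    """Group definition rows by Label.

    Sketch: returns {label: [{"Column Name": ..., "Definition": ...}]}.
    """
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["Label"], []).append(
            {"Column Name": row["Column Name"], "Definition": row["Definition"]}
        )
    return groups
```

Each resulting group becomes one planning/extraction unit, so the CSV's `Label` column effectively controls the pipeline's parallelism granularity.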
`evaluator_v2.py`
- `EvaluatorV2(extraction_file, ground_truth_file, definitions_file, document_name, output_dir)`: Loads the extraction JSON (flat column → value), the ground truth JSON (document row by `Document Name`), and definitions with eval categories (`Definitions_with_eval_category.csv`).
- `load_data()`: Fills `predicted_values`, `ground_truth_values`, `column_categories`, `column_definitions`, `column_labels`.
- `group_columns_by_category()`: Groups common columns into `exact_match`, `numeric_tolerance`, `structured_text`.
- `build_prompt(category, columns)`: Category-specific instructions (exact match, numeric tolerance, structured text) and column-wise GT vs Pred.
- `evaluate_batch(category, columns)`: Gemini evaluation prompt → `structure_response()` (Qwen structurer or Gemini fallback) → list of `{column, correctness, completeness, reason}`.
- `evaluate_all(max_workers)`: Batches by category/label, runs batches in parallel, stores results in `self.results`.
- `aggregate_metrics()`: Overall and per-category average correctness/completeness/overall.
- `save_results()`: Writes `evaluation_results.json`, `summary_metrics.json`, `llm_logs/gemini_calls.jsonl`, `structurer_calls.jsonl`.
- `run()`: `load_data` → `evaluate_all` → `save_results`.
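The aggregation step can be sketched as a plain averaging pass. This is a simplified stand-in for `aggregate_metrics()`: per-column results carry a category plus correctness/completeness scores, and "overall" here is assumed to be the mean of the two (the repo's actual weighting may differ):

```python
from statistics import mean

def aggregate_metrics(results):
    """Average correctness/completeness overall and per category.

    Sketch: `results` is a list of
    {"category": str, "correctness": float, "completeness": float}.
    """
    def summarize(rows):
        c = mean(r["correctness"] for r in rows)
        p = mean(r["completeness"] for r in rows)
        # "overall" as the mean of the two scores is an assumption.
        return {"correctness": c, "completeness": p, "overall": (c + p) / 2}

    by_cat = {}
    for r in results:
        by_cat.setdefault(r["category"], []).append(r)
    return {
        "overall": summarize(results),
        "by_category": {cat: summarize(rows) for cat, rows in by_cat.items()},
    }
```

The per-category breakdown is what makes the summary actionable: a low `numeric_tolerance` average points at number extraction, not at the pipeline as a whole.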
`pdf_margin_preprocessing.py`
- `detect_repeating_patterns(pdf_path, sample_pages)`: Learns top/bottom repeating text patterns from the first N pages.
- `extract_text_blocks_with_position()`: PyMuPDF text blocks with bounding boxes.
- `is_header_or_footer_by_position()`/`by_pattern()`/`by_heuristics()`: Filter blocks (used from chunking/utils).
- `clean_page_text_advanced(page, page_height, patterns)`: Applies position → pattern → heuristic filtering; returns cleaned page text.
`logging_utils.py`
- `setup_logger(name)`: Returns the logger used across pipeline components.
Key settings in `src/config/config.py`:
| Area | Variables |
|---|---|
| V2 Planning | PLANNING_PROVIDER, PLANNING_MODEL, PLANNING_WORKERS |
| V2 Extraction | EXTRACTION_PROVIDER_V2, EXTRACTION_MODEL_V2, EXTRACTION_WORKERS |
| V2 Evaluation | EVALUATION_PROVIDER_V2, EVALUATION_MODEL_V2, EVALUATION_WORKERS |
| Structurer (local) | STRUCTURER_BASE_URL, STRUCTURER_MODEL |
| Chunking | TEXT_CHUNK_MIN_SIZE, CHUNKING_MODE, USE_LLM_PAGE_CLASSIFICATION, PAGE_CLASSIFICATION_MODEL, PIXMAP_RESOLUTION |
| Paths | DEFINITIONS_CSV_PATH, DEFINITIONS_EVAL_CATEGORY_PATH, GOLD_TABLE_JSON_PATH, RESULTS_BASE_DIR |
| Behavior | VERSION_OUTPUTS, SKIP_STAGE_IF_EXISTS |
API keys / env (from `.env`): `VERTEX_API_KEY`, `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, `OPENAI_API_KEY`, `NOVITA_API_KEY`, `LLAMA_KEY`, `DEEPINFRA_API_KEY`.
With versioning enabled (`VERSION_OUTPUTS = True`), each run can create `run_YYYY-MM-DD_HH-MM-SS` under the document folder and a `latest` symlink. Example:
```
RESULTS_BASE_DIR / {pdf_name} /
├── run_2025-02-05_12-00-00/    # or "latest" -> run_...
│   ├── chunking/
│   │   └── pdf_chunked.json
│   ├── planning/
│   │   ├── *_plan.json
│   │   ├── plans_all_columns.json
│   │   └── logs/
│   ├── extraction/
│   │   ├── extraction_metadata.json
│   │   ├── extracted_table.csv
│   │   └── logs/
│   └── evaluation/
│       ├── evaluation_results.json
│       ├── summary_metrics.json
│       └── llm_logs/
```
- `extraction_metadata.json`: Per-column value, evidence, page, and plan info (e.g. `plan_found_in_pdf`, `plan_page`, `plan_source_type`).
- `summary_metrics.json`: Overall and by-category average correctness, completeness, and overall score.
- Python: Install from `src/requirements.txt`: `pip install -r src/requirements.txt`
- spaCy: `python -m spacy download en_core_web_sm`
- Environment: `.env` in the project root with API keys (see Configuration).
- Local structurer (vLLM): For planning/extraction/structuring, run vLLM (e.g. `./run_vllm.sh`) so `STRUCTURER_BASE_URL` is reachable.
- Column definitions: `src/table_definitions/Definitions_open_ended.csv` (and `Definitions_with_eval_category.csv` for evaluation).
- Ground truth: `dataset/Manual_Benchmark_GoldTable_cleaned.json` (or the path set in `GOLD_TABLE_JSON_PATH`) for evaluation.
Interactive (CLI): run `python src/main/main_v2.py`, enter the PDF path when prompted, then choose 1–6 (chunk only, plan only, extract only, eval only, full pipeline, or plan→extract→eval from existing chunks).
Programmatic / Web:

```python
from pathlib import Path
from src.main.main_v2 import run_pipeline_from_args

run_dir, extraction_file, err = run_pipeline_from_args(
    Path("path/to/document.pdf"),
    "5",  # full pipeline
)
if err:
    print("Error:", err)
else:
    print("Run dir:", run_dir, "Extraction:", extraction_file)
```

Run a single stage: use choices 1–4 and ensure the prior stage's outputs exist (or run choice 5 once). With `SKIP_STAGE_IF_EXISTS`, existing chunk/plan/extraction/eval files are reused when their paths are found.
Chunking uses `src/preprocessing/pdf_margin_preprocessing.py` to clean page text before building text chunks:
- Position-based: Drop blocks in top/bottom margin regions (`TOP_MARGIN`, `BOTTOM_MARGIN`).
- Pattern-based: `detect_repeating_patterns()` on the first N pages; drop blocks matching learned top/bottom patterns.
- Heuristic-based (in `utils_chunking`): Short blocks, copyright keywords, journal/volume/issue, dates, URLs, page numbers.
`clean_page_text_advanced()` applies all three; only blocks that pass every filter are kept for accumulation and later text chunking.
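The pattern-based step can be sketched as follows. This is a simplified stand-in for `detect_repeating_patterns()` plus the cleaning pass: the real code works on positioned text blocks, while this sketch only inspects each page's first and last non-empty line; the `min_fraction` threshold is an assumption:

```python
from collections import Counter

def detect_repeating_lines(pages, sample_pages=5, min_fraction=0.6):
    """Find first/last lines that repeat across most sampled pages.

    Sketch: `pages` is a list of page-text strings; keep lines seen on
    at least `min_fraction` of the sample (threshold assumed).
    """
    sample = pages[:sample_pages]
    counts = Counter()
    for text in sample:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            counts[lines[0]] += 1   # candidate header
            counts[lines[-1]] += 1  # candidate footer
    threshold = max(2, int(min_fraction * len(sample)))
    return {line for line, n in counts.items() if n >= threshold}

def clean_page(text, patterns):
    """Drop lines matching the learned header/footer patterns."""
    return "\n".join(ln for ln in text.splitlines() if ln.strip() not in patterns)
```

Note that varying footers like page numbers slip past this literal-match sketch; that is exactly why the pipeline layers position and heuristic filters on top of the pattern filter.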
- Plan-based extraction: Planning produces explicit (group, column, page, source_type, confidence) plans; extraction follows these plans and only sends relevant chunks to the LLM, improving consistency and traceability.
- Dual LLM roles: Cloud LLM (e.g. Gemini/OpenAI) for PDF-aware planning and extraction; local structurer (e.g. Qwen via vLLM) for turning free-form text into strict JSON against Pydantic schemas.
- Parallelism: Planning and extraction run per-group in parallel (configurable workers); evaluation runs batches in parallel by category.
- Skip-if-exists: When `SKIP_STAGE_IF_EXISTS` is True, `main_v2` looks for existing chunk/plan/extraction/eval outputs and skips re-running that stage.
- Logging: All stages use `setup_logger()` from `src/utils/logging_utils.py`; planning and extraction write raw LLM outputs under `logs/` in their output directories.
[Add license information here]
[Add citation information here]