This repository contains a plan-based extraction pipeline (V2) for extracting structured data from clinical trial research papers (PDFs) using Large Language Models. The pipeline has four stages: Chunking → Planning → Extraction → Evaluation, with explicit extraction plans for interpretability and category-aware evaluation.
- Pipeline Overview
- Architecture & Complete Flow
- Module Reference (`src/`)
- Configuration
- Output Structure
- Setup and Dependencies
- Usage
- Preprocessing (used by Chunking)
- Architecture Notes
High-level flow:
```mermaid
flowchart TD
    A[PDF Input] --> B[Stage 1: Chunking]
    B --> C[pdf_chunked.json]
    C --> D[Stage 2: Planning]
    D --> E[Extraction Plans]
    E --> F[Stage 3: Extraction]
    F --> G[extraction_metadata.json + CSV]
    G --> H[Stage 4: Evaluation]
    H --> I[evaluation_results.json + summary_metrics.json]
    style B fill:#e1f5ff
    style D fill:#fff0e1
    style F fill:#fff4e1
    style H fill:#e8f5e9
```
- Stage 1 (Chunking): PDF → text/table/figure chunks, with optional LLM-based page classification.
- Stage 2 (Planning): For each column group, an LLM (with PDF + chunks) produces a free-form extraction plan; a local structurer turns it into JSON (where to look, page, source_type, confidence).
- Stage 3 (Extraction): Plans are executed: for each group, the LLM extracts values from plan-relevant chunks; structurer produces structured extractions; results are merged into one row + metadata.
- Stage 4 (Evaluation): Extracted row is compared to ground truth with category-aware scoring (exact_match, numeric_tolerance, structured_text); correctness/completeness per column and summary metrics.
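To make the data flow concrete, the chunk records that Stage 1 hands to later stages can be sketched as plain dicts. Field names (`type`, `content`, `page`) follow the description above; the example values and the helper function are illustrative, not the repo's actual API:

```python
# Hypothetical sketch of Stage 1 output records; the real schema in
# pdf_chunked.json may carry additional fields.
chunks = [
    {"type": "text", "content": "Patients were randomized 1:1 ...", "page": 3},
    {"type": "table", "content": "Table 2. Baseline characteristics ...", "page": 4},
    {"type": "figure", "content": "Figure 1. CONSORT flow diagram ...", "page": 2},
]

def chunks_of_type(chunks, chunk_type):
    """Filter chunks by type, as later stages do when selecting sources."""
    return [c for c in chunks if c["type"] == chunk_type]

table_chunks = chunks_of_type(chunks, "table")
```

Planning and extraction both consume lists shaped like this, which is what makes the plan's `source_type`/`page` hints actionable.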
```
PDF
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: CHUNKING                                               │
│ process_pdf() → optional PageClassifier → PDFChunker.chunk()    │
│ Output: chunking/pdf_chunked.json                               │
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: PLANNING                                               │
│ PlanGenerator.generate_plans(pdf, chunks)                       │
│ • LLM (e.g. Gemini) + PDF → free-form plan per group            │
│ • OutputStructurer (local Qwen) → GroupExtractionPlanV2 JSON    │
│ Output: planning/*_plan.json, plans_all_columns.json            │
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: EXTRACTION                                             │
│ PlanExecutor.execute_plans(pdf, chunks, plans)                  │
│ • For each group: find_relevant_chunks(plan) → LLM + PDF        │
│ • OutputStructurer → GroupExtractionV2 → merge                  │
│ Output: extraction/extraction_metadata.json, extracted_table.csv│
└─────────────────────────────────────────────────────────────────┘
 │
 ▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: EVALUATION                                             │
│ EvaluatorV2.run()                                               │
│ • Load extraction + ground truth + definitions (with category)  │
│ • Batches by category → Gemini judge → structurer → scores      │
│ Output: evaluation/evaluation_results.json, summary_metrics.json│
└─────────────────────────────────────────────────────────────────┘
```
| Stage / Cross-cutting | Directory / File | Role |
|---|---|---|
| Entry point | `src/main/main_v2.py` | CLI menu; runs choices 1–6 (chunk only, plan only, extract only, eval only, full pipeline, or plan→extract→eval from existing chunks). |
| Config | `src/config/config.py` | Paths, API keys, per-stage providers/models, chunking/planning/extraction/eval workers, structurer URL. |
| Chunking | `src/chunking/` | Page classification, PDF chunking (text, table, figure), preprocessing hooks. |
| Planning | `src/planning/plan_generator.py` | `PlanGenerator`: LLM + PDF → plan text → structurer → `GroupExtractionPlanV2` per group. |
| Extraction | `src/extraction/plan_executor.py` | `PlanExecutor`: load plans, find relevant chunks, LLM extract → structurer → `GroupExtractionV2`, merge and write CSV + metadata. |
| LLM | `src/LLMProvider/provider.py` | `LLMProvider`: Gemini, OpenAI, Novita, Groq, DeepInfra; `generate()`, `generate_with_pdf()`, `upload_pdf()`/`cleanup_pdf()`. |
| Structurer | `src/LLMProvider/structurer.py` | `OutputStructurer`: free-form text → JSON via local model (e.g. vLLM Qwen) with Pydantic schema. |
| Table defs | `src/table_definitions/definitions.py` | `load_definitions()`: column groups from CSV (Label → list of Column Name + Definition). |
| Evaluation | `src/evaluation/evaluator_v2.py` | `EvaluatorV2`: load data, group by eval category, batch, Gemini judge, structurer, aggregate and save. |
| Preprocessing | `src/preprocessing/pdf_margin_preprocessing.py` | Header/footer detection and cleaning used during chunking. |
| Utils | `src/utils/logging_utils.py` | `setup_logger()` for consistent logging. |
- Entry point: `main()` (interactive PDF path + menu) or `run_pipeline_from_args(pdf_path, choice)` for programmatic/web use.
- Choices: `1` Chunking only; `2` Planning only; `3` Extraction only; `4` Evaluation only; `5` Full pipeline; `6` Planning → Extraction → Evaluation (reuse existing chunks).
- Helpers: `run_chunking()`, `run_planning()`, `run_extraction()`, `run_evaluation()`; `create_versioned_output_dir()` when `VERSION_OUTPUTS` is True; `_find_existing()` to resolve existing chunk/plan/extraction paths for skip-if-exists behavior.
- Output root: `RESULTS_BASE_DIR / pdf_name` (optionally with a versioned `run_YYYY-MM-DD_HH-MM-SS` directory and a `latest` symlink).
- API keys / cloud config: from `.env`: `VERTEX_API_KEY` (local Vertex testing), `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, `OPENAI_API_KEY`, `NOVITA_API_KEY`, `LLAMA_KEY` (Groq), `DEEPINFRA_API_KEY`.
- Per-task LLM: `CHUNKING_*`, `EXTRACTION_*`, `EVALUATION_*` (legacy); V2: `PLANNING_PROVIDER`/`PLANNING_MODEL`, `EXTRACTION_PROVIDER_V2`/`EXTRACTION_MODEL_V2`, `EVALUATION_PROVIDER_V2`/`EVALUATION_MODEL_V2`; workers: `PLANNING_WORKERS`, `EXTRACTION_WORKERS`, `EVALUATION_WORKERS`.
- Structurer (local): `STRUCTURER_BASE_URL` (e.g. `http://localhost:8001/v1`), `STRUCTURER_MODEL` (e.g. `Qwen/Qwen3-8B`).
- Chunking: `TEXT_CHUNK_MIN_SIZE`, `CHUNKING_MODE`, `PIXMAP_RESOLUTION`, `USE_LLM_PAGE_CLASSIFICATION`, `PAGE_CLASSIFICATION_MODEL`.
- Paths: `DEFINITIONS_CSV_PATH`, `DEFINITIONS_EVAL_CATEGORY_PATH`, `GOLD_TABLE_JSON_PATH`, `RESULTS_BASE_DIR`.
- Pipeline behavior: `VERSION_OUTPUTS`, `SKIP_STAGE_IF_EXISTS`, `EXTRACTION_MODE` (e.g. `"plan"`).
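A minimal `.env` might look like the following. All values are placeholders; set only the keys for the providers you actually use:

```shell
# Example .env (placeholder values, not real credentials)
VERTEX_API_KEY=your-vertex-key
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
OPENAI_API_KEY=your-openai-key
NOVITA_API_KEY=your-novita-key
LLAMA_KEY=your-groq-key
DEEPINFRA_API_KEY=your-deepinfra-key
```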
`chunking.py`
- `process_pdf(pdf_path, output_path, use_llm_classification)`: Top-level entry. Optionally runs `PageClassifier.classify()`, then builds `PDFChunker(pdf_path, page_metadata).chunk()` and saves JSON.
- `PDFChunker`: Holds `table_pages`/`figure_pages` from metadata (or processes all pages). For each page: `_process_page_text()` (accumulate cleaned text), `_process_tables()` (LLM + optional pdfplumber fallback), `_process_figures()` (regex + LLM description). After all pages: `_create_large_text_chunks()` (sentence/paragraph chunking). Returns a list of chunks (type: text/table/figure, content, page, etc.).

`page_classifier.py`
- `PageClassifier`: Uses Gemini to classify which pages have tables/figures; uses `OutputStructurer` to get `TablesResponse`/`FiguresResponse`. `classify()` returns `{"tables": [...], "figures": [...]}` for targeted chunking.

`utils_chunking.py`
- Text chunking (`text_chunking()`), table extraction helpers (`extract_tables_pdfplumber()`, `parse_table_extraction_response()`), image/LLM helpers (`ask_gemini_with_image()`, `extract_caption_from_gemini()`), `save_chunks_to_json()`; heuristic header/footer filtering used with preprocessing.
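The large-text chunking step can be sketched roughly as follows. This is a simplified stand-in for `_create_large_text_chunks()`: the real implementation works on cleaned page text with sentence-level splitting and the configured `TEXT_CHUNK_MIN_SIZE`, while this sketch splits on blank lines only:

```python
def chunk_text(pages, min_size=200):
    """Greedily merge paragraphs into chunks of at least `min_size` chars.

    Sketch only: `pages` is a list of (page_number, text) pairs; the
    repo's chunker uses sentence boundaries rather than blank lines.
    """
    chunks, buf, start_page = [], "", None
    for page_no, text in pages:
        for para in text.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if start_page is None:
                start_page = page_no
            buf = f"{buf} {para}".strip()
            if len(buf) >= min_size:
                chunks.append({"type": "text", "content": buf, "page": start_page})
                buf, start_page = "", None
    if buf:  # flush the trailing partial chunk
        chunks.append({"type": "text", "content": buf, "page": start_page})
    return chunks
```

Each chunk records the page where its first paragraph started, which is what lets plans point extraction back to a page.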
`plan_generator.py`
- Data structures: `Column`, `ColumnGroup`; `ColumnExtractionPlanV2` (column_index, column_name, found_in_pdf, page, source_type, confidence, extraction_plan), `GroupExtractionPlanV2` (group_name, columns).
- `PlanGenerator(provider, definitions, structurer=None, name_policy)`: Builds column groups from definitions; uses an optional `OutputStructurer` (default from config).
- `generate_plan_for_group(group, pdf_handle, chunks, output_dir)`: Builds a prompt with chunk summaries and canonical columns; calls `provider.generate_with_pdf()`; writes the raw plan to `logs/{stem}_raw.txt`; calls `structurer.structure()` with the `GroupExtractionPlanV2` schema; validates/normalizes with `validate_and_normalize_group_plan()`; saves `{stem}_plan.json`.
- `generate_plans(pdf_path, chunks, output_dir, workers)`: Uploads the PDF, runs `generate_plan_for_group` per group in parallel (ThreadPoolExecutor), writes `plans_all_columns.json`, returns `{group_name: plan_data}`.
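The plan data structures are, in spirit, small records per column. The sketch below uses dataclasses as a dependency-free stand-in (the repo validates these with Pydantic schemas); field names follow the description above, while the types and example values are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColumnExtractionPlanV2:
    # Field names from the README; concrete types are assumptions.
    column_index: int
    column_name: str
    found_in_pdf: bool
    page: Optional[int]
    source_type: str          # e.g. "text", "table", "figure"
    confidence: float
    extraction_plan: str      # free-form "where/how to look" instructions

@dataclass
class GroupExtractionPlanV2:
    group_name: str
    columns: List[ColumnExtractionPlanV2] = field(default_factory=list)

# Hypothetical plan for one group
plan = GroupExtractionPlanV2(
    group_name="Enrollment",
    columns=[
        ColumnExtractionPlanV2(
            column_index=0,
            column_name="Sample size",
            found_in_pdf=True,
            page=4,
            source_type="table",
            confidence=0.9,
            extraction_plan="Read total randomized N from Table 1 on page 4.",
        )
    ],
)
```

Keeping the plan this explicit is what makes Stage 3 traceable: every extracted value can be tied back to a planned page and source type.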
`plan_executor.py`
- Data structures: `ColumnExtractionV2` (column_index, column_name, value, evidence, page, confidence), `GroupExtractionV2` (group_name, extractions).
- Helpers: `find_relevant_chunks(plan.columns, chunks)` (by source_type and page); `format_chunks()`, `format_columns_for_prompt()`; `validate_and_normalize_plan()`, `validate_and_normalize_extraction()`; `_generate_outputs()` (writes `extraction_metadata.json` and `extracted_table.csv`).
- `_extract_group(...)`: For a single group: filter the plan to `found_in_pdf` columns, get relevant chunks, build a prompt with columns + chunks; `provider.generate_with_pdf()`; log the raw output; `structurer.structure(GroupExtractionV2)` (with OpenAI/Gemini fallback on failure); normalize the extraction and return `GroupExtractionV2`.
- `PlanExecutor(provider, structurer, name_policy).execute_plans(pdf_path, chunks, plans, output_path, workers)`: Loads definitions, validates plans, uploads the PDF; runs `_extract_group` per group in parallel; calls `_generate_outputs()`; returns the loaded metadata dict.
- `load_plans_from_dir(plans_dir)`: Loads from `plans_all_columns.json` or `*_plan.json` in the directory.
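The chunk-selection step can be sketched as follows. This is a simplified stand-in for `find_relevant_chunks()`, matching chunks against each planned column's `source_type` and `page`; the `page_slack` tolerance is an assumption, not something the source documents:

```python
def find_relevant_chunks(plan_columns, chunks, page_slack=1):
    """Return chunks whose type and page match some planned column.

    Sketch: `plan_columns` are dicts with `found_in_pdf`, `source_type`,
    and `page`; `page_slack` (assumed) tolerates off-by-one page numbers.
    """
    relevant = []
    for chunk in chunks:
        for col in plan_columns:
            if not col.get("found_in_pdf", True):
                continue  # column not expected in this PDF; skip
            same_type = chunk["type"] == col["source_type"]
            near_page = (col["page"] is not None
                         and abs(chunk["page"] - col["page"]) <= page_slack)
            if same_type and near_page:
                relevant.append(chunk)
                break  # one match is enough to keep this chunk
    return relevant
```

Sending only these chunks (rather than the whole document) is what keeps the extraction prompt focused on plan-relevant evidence.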
`provider.py`
- `LLMProvider(provider, model)`: Unified interface for Gemini, OpenAI, Novita, Groq, DeepInfra. Methods: `generate()`, `generate_with_image()`, `generate_with_pdf(prompt, pdf_handle, ...)`, `upload_pdf(path)`/`cleanup_pdf(handle)`, and batch helpers. Returns `LLMResponse` (text, tokens, cost, success, error).
`structurer.py`
- `OutputStructurer(base_url, model, api_key, enable_thinking)`: Uses an OpenAI-compatible client (e.g. vLLM) to turn free-form text into JSON. `structure(text, schema, max_retries, temperature, return_dict)` builds a schema prompt, calls the model, parses the JSON, and validates it with Pydantic; returns `StructurerResponse` (data, success, attempts, error).
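The structurer's contract (free-form text in, schema-validated JSON out, with retries) can be sketched with a stub model call. Here `required_keys` stands in for the Pydantic schema and `call_model(prompt)` for the OpenAI-compatible client; both are simplifications of the real `OutputStructurer.structure()`:

```python
import json

def structure(text, required_keys, call_model, max_retries=3):
    """Ask a model for JSON and retry until it parses and has the keys.

    Sketch only: the real structurer builds its prompt from a Pydantic
    schema and validates the parsed JSON with Pydantic.
    """
    prompt = f"Return JSON with keys {sorted(required_keys)} for:\n{text}"
    last_error = None
    for attempt in range(1, max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            if not required_keys.issubset(data):
                raise ValueError(f"missing keys: {required_keys - set(data)}")
            return {"data": data, "success": True, "attempts": attempt, "error": None}
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = str(exc)
    return {"data": None, "success": False, "attempts": max_retries, "error": last_error}
```

The retry-until-valid loop is the point: a small local model occasionally emits malformed JSON, and re-asking is cheaper than failing the whole group.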
`definitions.py`
- `load_definitions(csv_path=None, cols_to_test_path=None)`: Reads a CSV (default from config); groups rows by `Label`; each group is a list of `{"Column Name", "Definition"}`. Optional filtering via `cols_to_test_path` (included labels).
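The grouping logic can be sketched in a few lines. This simplified stand-in for `load_definitions()` takes CSV text directly (the real function takes a path and supports label filtering); the column headers `Label`, `Column Name`, and `Definition` come from the description above:

```python
import csv
import io

def load_definitions(csv_text):
    """Group definition rows by Label.

    Sketch: returns {label: [{"Column Name": ..., "Definition": ...}]}.
    """
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row["Label"], []).append(
            {"Column Name": row["Column Name"], "Definition": row["Definition"]}
        )
    return groups
```

Each resulting group becomes one planning/extraction unit, so the CSV's `Label` column effectively controls the pipeline's parallelism granularity.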
`evaluator_v2.py`
- `EvaluatorV2(extraction_file, ground_truth_file, definitions_file, document_name, output_dir)`: Loads the extraction JSON (flat column → value), the ground truth JSON (document row by `Document Name`), and definitions with eval categories (`Definitions_with_eval_category.csv`).
- `load_data()`: Fills `predicted_values`, `ground_truth_values`, `column_categories`, `column_definitions`, `column_labels`.
- `group_columns_by_category()`: Groups common columns into `exact_match`, `numeric_tolerance`, `structured_text`.
- `build_prompt(category, columns)`: Category-specific instructions (exact match, numeric tolerance, structured text) and column-wise GT vs Pred.
- `evaluate_batch(category, columns)`: Gemini evaluation prompt → `structure_response()` (Qwen structurer or Gemini fallback) → list of `{column, correctness, completeness, reason}`.
- `evaluate_all(max_workers)`: Batches by category/label, runs batches in parallel, stores results in `self.results`.
- `aggregate_metrics()`: Overall and per-category average correctness/completeness/overall.
- `save_results()`: Writes `evaluation_results.json`, `summary_metrics.json`, `llm_logs/gemini_calls.jsonl`, `structurer_calls.jsonl`.
- `run()`: `load_data` → `evaluate_all` → `save_results`.
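The aggregation step can be sketched as a plain averaging pass. This is a simplified stand-in for `aggregate_metrics()`: per-column results carry a category plus correctness/completeness scores, and "overall" here is assumed to be the mean of the two (the repo's actual weighting may differ):

```python
from statistics import mean

def aggregate_metrics(results):
    """Average correctness/completeness overall and per category.

    Sketch: `results` is a list of
    {"category": str, "correctness": float, "completeness": float}.
    """
    def summarize(rows):
        c = mean(r["correctness"] for r in rows)
        p = mean(r["completeness"] for r in rows)
        # "overall" as the mean of the two scores is an assumption.
        return {"correctness": c, "completeness": p, "overall": (c + p) / 2}

    by_cat = {}
    for r in results:
        by_cat.setdefault(r["category"], []).append(r)
    return {
        "overall": summarize(results),
        "by_category": {cat: summarize(rows) for cat, rows in by_cat.items()},
    }
```

The per-category breakdown is what makes the summary actionable: a low `numeric_tolerance` average points at number extraction, not at the pipeline as a whole.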
`pdf_margin_preprocessing.py`
- `detect_repeating_patterns(pdf_path, sample_pages)`: Learns top/bottom repeating text patterns from the first N pages.
- `extract_text_blocks_with_position()`: PyMuPDF text blocks with bounding boxes.
- `is_header_or_footer_by_position()`/`by_pattern()`/`by_heuristics()`: Filter blocks (used from chunking/utils).
- `clean_page_text_advanced(page, page_height, patterns)`: Applies position → pattern → heuristic filtering; returns cleaned page text.
`logging_utils.py`
- `setup_logger(name)`: Returns the logger used across pipeline components.
Key settings in `src/config/config.py`:
| Area | Variables |
|---|---|
| V2 Planning | PLANNING_PROVIDER, PLANNING_MODEL, PLANNING_WORKERS |
| V2 Extraction | EXTRACTION_PROVIDER_V2, EXTRACTION_MODEL_V2, EXTRACTION_WORKERS |
| V2 Evaluation | EVALUATION_PROVIDER_V2, EVALUATION_MODEL_V2, EVALUATION_WORKERS |
| Structurer (local) | STRUCTURER_BASE_URL, STRUCTURER_MODEL |
| Chunking | TEXT_CHUNK_MIN_SIZE, CHUNKING_MODE, USE_LLM_PAGE_CLASSIFICATION, PAGE_CLASSIFICATION_MODEL, PIXMAP_RESOLUTION |
| Paths | DEFINITIONS_CSV_PATH, DEFINITIONS_EVAL_CATEGORY_PATH, GOLD_TABLE_JSON_PATH, RESULTS_BASE_DIR |
| Behavior | VERSION_OUTPUTS, SKIP_STAGE_IF_EXISTS |
API keys / env (from `.env`): `VERTEX_API_KEY`, `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, `OPENAI_API_KEY`, `NOVITA_API_KEY`, `LLAMA_KEY`, `DEEPINFRA_API_KEY`.
With versioning enabled (`VERSION_OUTPUTS = True`), each run can create `run_YYYY-MM-DD_HH-MM-SS` under the document folder and a `latest` symlink. Example:
```
RESULTS_BASE_DIR / {pdf_name} /
├── run_2025-02-05_12-00-00/    # or "latest" -> run_...
│   ├── chunking/
│   │   └── pdf_chunked.json
│   ├── planning/
│   │   ├── *_plan.json
│   │   ├── plans_all_columns.json
│   │   └── logs/
│   ├── extraction/
│   │   ├── extraction_metadata.json
│   │   ├── extracted_table.csv
│   │   └── logs/
│   └── evaluation/
│       ├── evaluation_results.json
│       ├── summary_metrics.json
│       └── llm_logs/
```
- `extraction_metadata.json`: Per-column value, evidence, page, and plan info (e.g. `plan_found_in_pdf`, `plan_page`, `plan_source_type`).
- `summary_metrics.json`: Overall and by-category average correctness, completeness, and overall score.
- Python: Install from `src/requirements.txt`: `pip install -r src/requirements.txt`
- spaCy: `python -m spacy download en_core_web_sm`
- Environment: `.env` in the project root with API keys (see Configuration).
- Local structurer (vLLM): For planning/extraction/structuring, run vLLM (e.g. `./run_vllm.sh`) so `STRUCTURER_BASE_URL` is reachable.
- Column definitions: `src/table_definitions/Definitions_open_ended.csv` (and `Definitions_with_eval_category.csv` for evaluation).
- Ground truth: `dataset/Manual_Benchmark_GoldTable_cleaned.json` (or the path set in `GOLD_TABLE_JSON_PATH`) for evaluation.
Interactive (CLI): run `python src/main/main_v2.py`, enter the PDF path when prompted, then choose 1–6 (chunk only, plan only, extract only, eval only, full pipeline, or plan→extract→eval from existing chunks).
Programmatic / Web:

```python
from pathlib import Path
from src.main.main_v2 import run_pipeline_from_args

run_dir, extraction_file, err = run_pipeline_from_args(
    Path("path/to/document.pdf"),
    "5",  # full pipeline
)
if err:
    print("Error:", err)
else:
    print("Run dir:", run_dir, "Extraction:", extraction_file)
```

Run a single stage: use choices 1–4 and ensure the prior stage's outputs exist (or run choice 5 once). With `SKIP_STAGE_IF_EXISTS`, existing chunk/plan/extraction/eval files are reused when their paths are found.
Chunking uses `src/preprocessing/pdf_margin_preprocessing.py` to clean page text before building text chunks:
- Position-based: Drop blocks in top/bottom margin regions (`TOP_MARGIN`, `BOTTOM_MARGIN`).
- Pattern-based: `detect_repeating_patterns()` on the first N pages; drop blocks matching learned top/bottom patterns.
- Heuristic-based (in `utils_chunking`): Short blocks, copyright keywords, journal/volume/issue, dates, URLs, page numbers.
`clean_page_text_advanced()` applies all three; only blocks that pass every filter are kept for accumulation and later text chunking.
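The pattern-based step can be sketched as follows. This is a simplified stand-in for `detect_repeating_patterns()` plus the cleaning pass: the real code works on positioned text blocks, while this sketch only inspects each page's first and last non-empty line; the `min_fraction` threshold is an assumption:

```python
from collections import Counter

def detect_repeating_lines(pages, sample_pages=5, min_fraction=0.6):
    """Find first/last lines that repeat across most sampled pages.

    Sketch: `pages` is a list of page-text strings; keep lines seen on
    at least `min_fraction` of the sample (threshold assumed).
    """
    sample = pages[:sample_pages]
    counts = Counter()
    for text in sample:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            counts[lines[0]] += 1   # candidate header
            counts[lines[-1]] += 1  # candidate footer
    threshold = max(2, int(min_fraction * len(sample)))
    return {line for line, n in counts.items() if n >= threshold}

def clean_page(text, patterns):
    """Drop lines matching the learned header/footer patterns."""
    return "\n".join(ln for ln in text.splitlines() if ln.strip() not in patterns)
```

Note that varying footers like page numbers slip past this literal-match sketch; that is exactly why the pipeline layers position and heuristic filters on top of the pattern filter.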
- Plan-based extraction: Planning produces explicit (group, column, page, source_type, confidence) plans; extraction follows these plans and only sends relevant chunks to the LLM, improving consistency and traceability.
- Dual LLM roles: Cloud LLM (e.g. Gemini/OpenAI) for PDF-aware planning and extraction; local structurer (e.g. Qwen via vLLM) for turning free-form text into strict JSON against Pydantic schemas.
- Parallelism: Planning and extraction run per-group in parallel (configurable workers); evaluation runs batches in parallel by category.
- Skip-if-exists: When `SKIP_STAGE_IF_EXISTS` is True, `main_v2` looks for existing chunk/plan/extraction/eval outputs and skips re-running that stage.
- Logging: All stages use `setup_logger()` from `src/utils/logging_utils.py`; planning and extraction write raw LLM outputs under `logs/` in their output directories.
[Add license information here]
[Add citation information here]