dre4ft/pdfvalidator

🛡️ PDF Validator - Advanced PDF Security Pipeline

A complete security solution for PDF files combining YARA detection, PDF/A conversion, and secure deletion, with a modern web interface.

🎯 Overview

PDF Validator is a Python pipeline that automatically analyzes PDF files to:

  • Detect malicious or suspicious content using YARA rules
  • Neutralize active content by converting to PDF/A-2b
  • Apply recursive logic for suspicious files
  • Perform secure deletion (multi-pass + encryption) of dangerous files
  • Provide complete traceability via logs and web interface

🏗️ Architecture

Main Components

Component          Role
web_server.py      FastAPI server exposing the web interface and APIs
api.py             FastAPI endpoints for analysis, upload, and YARA rules management
pdf_validator.py   Main pipeline processing engine
yara_detection.py  Loading, compilation, and execution of YARA rules
ghostscript.py     PDF → PDF/A-2 conversion (neutralization)
shredder.py        Secure deletion with multi-pass overwrite + AES-256 encryption
static/            Web interface (HTML/CSS/JS)
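
To illustrate the neutralization step attributed to ghostscript.py, here is a minimal sketch that builds a Ghostscript command line for PDF/A-2 output. The function name `build_pdfa_command` and the chosen flag set are assumptions for illustration and are not taken from the project. A strictly conformant conversion usually also needs a PDF/A definition file and an ICC profile, which this sketch leaves out.

```python
from typing import List


def build_pdfa_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Return an argv list asking Ghostscript to rewrite `src` as PDF/A-2 in `dst`.

    Hypothetical helper: the project's real invocation may differ.
    """
    return [
        gs_binary,
        "-dPDFA=2",                       # target PDF/A-2
        "-dBATCH",                        # exit when processing is done
        "-dNOPAUSE",                      # do not prompt between pages
        "-dPDFACompatibilityPolicy=1",    # drop features PDF/A does not allow
        "-sColorConversionStrategy=RGB",  # convert colors to a single space
        "-sDEVICE=pdfwrite",              # write a PDF file
        f"-sOutputFile={dst}",
        src,
    ]
```

The list can be passed to `subprocess.run(..., check=True)`. Building an argv list rather than a shell string avoids quoting problems with unusual filenames.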

Processing Flow

┌─────────────────────────────────────────────────────────────┐
│  1️⃣  Upload PDF                                             │
└────────────────┬────────────────────────────────────────────┘
                 ↓
┌─────────────────────────────────────────────────────────────┐
│  2️⃣  YARA Analysis → Score + Verdict                       │
│      • Score < 40  = Benign ✅                              │
│      • 40 ≤ Score < 70 = Suspect ⚠️                         │
│      • Score ≥ 70  = Malicious ❌                           │
└────┬──────────────┬──────────────────────────┬──────────────┘
     │              │                          │
     ↓ Benign ✅    ↓ Suspect ⚠️               ↓ Malicious ❌
     │              │                          │
 ┌───────────┐  ┌──────────────┐         ┌──────────────┐
 │ Conversion│  │ Conversion   │         │ Secure       │
 │ PDF/A     │  │ PDF/A        │         │ Deletion     │
 │ + Delete  │  │ + Reanalysis │         │ (10 passes)  │
 │ Original  │  │ Recursive    │         │ + AES-256    │
 └───────────┘  └──────────────┘         └──────────────┘
     ↓              ↓                          ↓
     └──────────┬───┴──────────┬──────────────┘
                ↓
         📋 Timestamped logs
         📁 Quarantined files
         🔍 Web interface

📋 Detailed Processing Logic

Verdict by YARA Score Threshold

  • Score < 40: Benign PDF

    • ✅ PDF/A conversion (removal of active content)
    • ✅ Original PDF deletion
    • ✅ PDF/A archival
  • Score 40-69: Suspicious PDF

    • ⚠️ PDF/A conversion
    • ⚠️ Original PDF deletion
    • ⚠️ Recursive reanalysis of converted PDF/A
    • 🛡️ Anti-loop protection: stop after 3 conversions
  • Score ≥ 70: Malicious PDF

    • 🚫 Immediate and secure deletion
    • 🔒 Multi-pass overwrite (10 passes)
    • 🔐 AES-256 encryption
    • ❌ Complete destruction
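
The thresholds above can be sketched as a small classification function. The constant and function names are hypothetical and the verdict strings are chosen for illustration; only the cut-off values 40 and 70 come from this README.

```python
BENIGN_MAX = 40      # scores strictly below this are benign
MALICIOUS_MIN = 70   # scores at or above this are malicious


def verdict_for_score(score: float) -> str:
    """Map a YARA score to one of the three verdicts described above."""
    if score < BENIGN_MAX:
        return "benign"
    if score < MALICIOUS_MIN:
        return "suspect"
    return "malicious"
```

Note that the boundaries are half-open: a score of exactly 40 is already suspect, and a score of exactly 70 is already malicious.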

Anti-loop Protection

Converted files receive the suffix _pdfa.pdf. If this suffix appears more than twice in the filename, the file is moved to quarantine (the suspect_files/ folder) to prevent infinite loops.
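
A minimal sketch of this check, assuming each conversion adds one `_pdfa` marker to the filename (for example `doc_pdfa_pdfa.pdf` or `doc_pdfa.pdf_pdfa.pdf`). The exact naming scheme is not documented here, so the marker string and the function name `should_quarantine` are assumptions.

```python
from pathlib import Path

PDFA_MARKER = "_pdfa"   # assumed marker left by each conversion
MAX_MARKERS = 2         # more than this many markers triggers quarantine


def should_quarantine(path: str) -> bool:
    """Return True when the filename shows more than MAX_MARKERS conversions."""
    # Only the final path component is inspected, so directory names
    # that happen to contain the marker do not count.
    return Path(path).name.count(PDFA_MARKER) > MAX_MARKERS
```

Counting the marker in the filename keeps the check stateless: no database or counter file is needed to know how many times a document has been converted.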


🚀 Installation and Setup

Prerequisites

  • Python 3.8+
  • Ghostscript (for PDF/A conversion)
    # macOS
    brew install ghostscript
    
    # Linux
    sudo apt-get install ghostscript
    
    # Windows
    # Download from https://www.ghostscript.com/download/gsdnld.html

Installation

  1. Clone/access the project

    git clone https://github.com/dre4ft/pdfvalidator.git
    cd pdfvalidator
  2. Create virtual environment (optional but recommended)

    python3 -m venv venv
    source venv/bin/activate  # macOS/Linux
    # or
    venv\Scripts\activate  # Windows
  3. Install dependencies

    pip install -r requirements.txt

Startup

python3 web_server.py

The application will be accessible at: http://127.0.0.1:8000


📦 Dependencies

Package           Role
fastapi           Modern web framework
uvicorn           ASGI server for FastAPI
yara-python       Threat detection via YARA rules
pypdf             PDF file manipulation
fpdf2             PDF generation
cryptography      AES-256 encryption
python-multipart  Multipart form parsing

🌐 REST API

POST /api/scan/remote

Analyzes and processes one or more PDF files.

Parameters:

  • files: PDF files (multipart/form-data)

Response:

{
  "mode": "remote",
  "received_paths": ["document.pdf"],
  "status": {
    "document.pdf": "Benign file, PDF/A conversion completed."
  }
}
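
A client might turn a response of this shape into a per-file summary as sketched below. The helper name `summarize_scan_response` and the fallback text are hypothetical; the keys `received_paths` and `status` are the ones shown in the example response.

```python
import json
from typing import Dict


def summarize_scan_response(body: str) -> Dict[str, str]:
    """Map each received file name to its status message."""
    payload = json.loads(body)
    statuses = payload.get("status", {})
    return {
        name: statuses.get(name, "no status reported")  # assumed fallback text
        for name in payload.get("received_paths", [])
    }
```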

GET /api/yara/rules

Retrieves the current YARA rules.

Response:

{
  "rules": "rule example { ... }"
}

POST /api/yara/update

Adds new YARA rules.

Parameters:

  • body: New rules (text/plain)

Response:

{
  "status": "YARA rules updated successfully."
}
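
As a sketch of how a client could prepare this call with the Python standard library, the helper below builds the request without sending it. The function name and the default base URL are assumptions (the URL matches the local address given in the Startup section).

```python
import urllib.request


def build_rules_update_request(
    rules_text: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build a POST request carrying new YARA rules as a text/plain body."""
    return urllib.request.Request(
        url=f"{base_url}/api/yara/update",
        data=rules_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server. Keeping construction separate from sending makes the request easy to inspect first.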

📁 Directory Structure

.
├── api.py                          # FastAPI endpoints
├── web_server.py                   # Main server
├── pdf_validator.py                # Processing pipeline
├── yara_detection.py               # YARA engine
├── ghostscript.py                  # PDF/A conversion
├── shredder.py                     # Secure deletion
├── requirements.txt                # Python dependencies
│
├── static/                         # Web interface
│   ├── index.html
│   ├── app.js
│   └── styles.css
│
├── yara_rules/                     # Detection rules
│   ├── pdf.yara
│   ├── pdf2.yara
│   └── pdf.yara.old
│
├── to_analyze/                     # PDFs waiting for analysis
├── benign/                         # Archive of benign PDFs (converted)
├── suspect_files/                  # Quarantined PDFs (anti-loop)
├── malicious/                      # Malicious PDFs (deleted)
├── suspicious_pdfs/                # Detailed logs
│
├── test/                           # Test suite
│   ├── main.py
│   ├── runner.py
│   ├── caster.py
│   ├── clean_result.py
│   ├── kpi.py
│   └── gen_mal_pdf/                # Malicious PDF generator
│
└── pipeline.log                    # Timestamped log

🎮 Web Interface

"Scan PDF" Tab

  • Drop zone: Drag & drop or click to select PDFs
  • Real-time logs: Track processing (verdict, conversion, deletion)
  • Result consultation: Complete processing history

"YARA Rules" Tab

  • Visualization: Displays all currently active rules
  • Rule addition: Add new detection rules
  • Live updates: Changes are applied immediately

🔍 Usage Examples

Via web interface

  1. Access http://127.0.0.1:8000
  2. Go to "Scan PDF" tab
  3. Click on the drop zone or perform a drag & drop
  4. Select your PDF files
  5. Check logs to follow processing

Via API (curl)

curl -X POST "http://127.0.0.1:8000/api/scan/remote" \
  -F "files=@document.pdf"

Via Python

python3 pdf_validator.py path/to/file.pdf

📊 Log Files

pipeline.log: Timestamped record of each processing step

2026-01-24 14:32:15 - /Users/romain_travail/pdfvalidator/to_analyze/doc.pdf : [+] Benign file, PDF/A conversion completed.
2026-01-24 14:32:18 - /Users/romain_travail/pdfvalidator/to_analyze/suspect.pdf : [*] Suspect file, PDF/A conversion completed additional analysis in progress...
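
Log lines in the format shown above can be parsed with a short regular expression. This is a sketch that assumes every line follows the `timestamp - path : [tag] message` layout of the two samples. The names `LogEntry` and `parse_log_line` are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Layout assumed from the sample lines: "<timestamp> - <path> : [<tag>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"(?P<path>.+?) : \[(?P<tag>.)\] (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: datetime
    path: str
    tag: str
    message: str


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one pipeline.log line, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        timestamp=datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        path=match["path"],
        tag=match["tag"],
        message=match["message"],
    )
```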

Quarantine files: Stored in suspect_files/ (anti-loop protection)


⚠️ Limitations and Warnings

  • Depends on YARA rules: Result quality depends directly on the configured rules
  • False positives/negatives: YARA rules can produce incorrect detections
  • Ghostscript required: PDF/A conversion requires a local Ghostscript installation
  • Content loss: PDF/A conversion may drop complex content (scripts, advanced forms)

🛠️ Troubleshooting

Error: "Ghostscript not found"

  • Verify Ghostscript installation: gs --version
  • Ensure gs is in the PATH
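
The same PATH check can be scripted. The sketch below looks for the usual executable names (`gs` on macOS/Linux, `gswin64c` or `gswin32c` on Windows). The function name `find_ghostscript` is hypothetical and is not part of the project.

```python
import shutil
from typing import Optional, Sequence


def find_ghostscript(
    candidates: Sequence[str] = ("gs", "gswin64c", "gswin32c"),
) -> Optional[str]:
    """Return the full path of the first Ghostscript executable found on PATH."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

A return value of `None` means none of the candidate names was found on the PATH.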

Error: "YARA rules not found"

  • Verify yara_rules/ folder contains pdf.yara
  • Check YARA rules syntax

Files are not being analyzed

  • Verify "Scan PDF" tab is active in the interface
  • Check browser console (F12) for JavaScript errors
  • Check Python logs in terminal

📝 Developer Notes

  • The pipeline applies recursive logic for suspicious files
  • Anti-loop protection prevents infinite conversions
  • Secure deletion uses 10 overwrite passes + AES-256 encryption
  • Logs are fully timestamped for traceability
  • Web interface uses Fetch API for asynchronous calls
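
To make the multi-pass overwrite concrete, here is a minimal sketch that overwrites a file with random bytes several times and then removes it. It is an illustration under assumptions, not the contents of shredder.py: the function name is hypothetical, random data is an assumed fill pattern, and the AES-256 step is omitted.

```python
import os


def overwrite_and_delete(path: str, passes: int = 10, chunk_size: int = 64 * 1024) -> None:
    """Overwrite `path` in place `passes` times with random bytes, then delete it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as fh:
        for _ in range(passes):
            fh.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                fh.write(os.urandom(n))   # assumed fill pattern: random bytes
                remaining -= n
            fh.flush()
            os.fsync(fh.fileno())         # push this pass to the storage device
    os.remove(path)
```

In-place overwriting is only as strong as the storage underneath: on SSDs and on journaling or copy-on-write filesystems, old blocks may survive an overwrite.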

📄 License

To be defined according to your needs.


👤 Author

Advanced PDF security project.

About

Web app to check and neutralize potentially malicious PDFs.
