A complete security solution for PDF files combining YARA detection, PDF/A conversion, and secure deletion, with a modern web interface.
PDF Validator is a Python pipeline that automatically analyzes PDF files to:
- Detect malicious or suspicious content using YARA rules
- Neutralize active content by converting to PDF/A-2b
- Re-analyze suspicious files recursively after conversion
- Perform secure deletion (multi-pass overwrite + encryption) of dangerous files
- Provide complete traceability via logs and web interface
| Component | Role |
|---|---|
| web_server.py | FastAPI server exposing web interface and APIs |
| api.py | FastAPI endpoints for analysis, upload and YARA rules management |
| pdf_validator.py | Main pipeline processing engine |
| yara_detection.py | Loading, compilation and execution of YARA rules |
| ghostscript.py | PDF → PDF/A-2 conversion (neutralization) |
| shredder.py | Secure deletion with multi-pass + AES-256 encryption |
| static/ | Web interface (HTML/CSS/JS) |
┌───────────────────────────────────────────────────────────┐
│ 1. Upload PDF                                             │
└─────────────────────────────┬─────────────────────────────┘
                              ▼
┌───────────────────────────────────────────────────────────┐
│ 2. YARA Analysis → Score + Verdict                        │
│    • Score < 40 = Benign                                  │
│    • 40 ≤ Score < 70 = Suspect                            │
│    • Score ≥ 70 = Malicious                               │
└───────┬─────────────────────┬─────────────────────┬───────┘
        │ Benign              │ Suspect             │ Malicious
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Conversion    │     │ Conversion    │     │ Secure        │
│ PDF/A         │     │ PDF/A         │     │ Deletion      │
│ + Delete      │     │ + Reanalysis  │     │ (10 passes)   │
│ Original      │     │ Recursive     │     │ + AES-256     │
└───────┬───────┘     └───────┬───────┘     └───────┬───────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              ▼
                      Timestamped logs
                      Quarantined files
                      Web interface
- Score < 40: Benign PDF
  - PDF/A conversion (removal of active content)
  - Original PDF deletion
  - PDF/A archival
- Score 40-69: Suspicious PDF
  - PDF/A conversion
  - Original PDF deletion
  - Recursive reanalysis of the converted PDF/A
  - Anti-loop protection: stop after 3 conversions
- Score ≥ 70: Malicious PDF
  - Immediate and secure deletion
  - Multi-pass overwrite (10 passes)
  - AES-256 encryption
  - Complete destruction
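The score-to-verdict mapping above can be sketched in a few lines of Python. This is an illustration only: the names `Verdict`, `classify` and `ACTIONS`, and the action labels, are hypothetical and are not taken from `pdf_validator.py`. Only the thresholds (40 and 70) come from this README.

```python
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    BENIGN = "benign"
    SUSPECT = "suspect"
    MALICIOUS = "malicious"


# Thresholds as documented in this README.
SUSPECT_THRESHOLD = 40
MALICIOUS_THRESHOLD = 70

# Hypothetical action labels, used here only to illustrate the three branches.
ACTIONS: Dict[Verdict, List[str]] = {
    Verdict.BENIGN: ["convert_to_pdfa", "delete_original", "archive_pdfa"],
    Verdict.SUSPECT: ["convert_to_pdfa", "delete_original", "reanalyze_pdfa"],
    Verdict.MALICIOUS: ["secure_delete"],
}


def classify(score: int) -> Verdict:
    """Map a YARA score to a verdict using the documented thresholds."""
    if score >= MALICIOUS_THRESHOLD:
        return Verdict.MALICIOUS
    if score >= SUSPECT_THRESHOLD:
        return Verdict.SUSPECT
    return Verdict.BENIGN
```

With these definitions, `ACTIONS[classify(55)]` yields the suspect branch: conversion, deletion of the original, then reanalysis of the converted file.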
Each converted file receives the suffix `_pdfa.pdf`. If the suffix appears more than twice in a filename, the file is moved to quarantine (the `suspect_files/` folder) to prevent infinite conversion loops.
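A minimal sketch of this anti-loop check follows. It assumes that each conversion inserts `_pdfa` just before the `.pdf` extension (so a third conversion produces `doc_pdfa_pdfa_pdfa.pdf`). The README does not spell out the naming scheme, so that assumption and the helper names `pdfa_name` and `should_quarantine` are hypothetical.

```python
from pathlib import Path

# Assumption: every conversion adds this marker before the ".pdf" extension.
PDFA_MARKER = "_pdfa"
# Documented rule: more than 2 occurrences sends the file to quarantine.
MAX_OCCURRENCES = 2


def pdfa_name(filename: str) -> str:
    """Return the name a file would get after one more PDF/A conversion."""
    p = Path(filename)
    return f"{p.stem}{PDFA_MARKER}{p.suffix}"


def should_quarantine(filename: str) -> bool:
    """True when the marker appears more than MAX_OCCURRENCES times."""
    return Path(filename).name.count(PDFA_MARKER) > MAX_OCCURRENCES
```

Counting the marker in the filename keeps the check stateless: no database or side file is needed to know how many times a document has already been converted.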
- Python 3.8+
- Ghostscript (for PDF/A conversion)
  - macOS: `brew install ghostscript`
  - Linux: `sudo apt-get install ghostscript`
  - Windows: download from https://www.ghostscript.com/download/gsdnld.html
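Once Ghostscript is installed, a PDF/A-2 conversion call could look like the sketch below. It is not the content of `ghostscript.py`: the function names are hypothetical and the flag set is one common combination. Full PDF/A compliance usually also requires a PDF/A definition file and an ICC profile, as described in the Ghostscript documentation.

```python
import shutil
import subprocess
from typing import List


def build_pdfa_command(src: str, dst: str, gs_binary: str = "gs") -> List[str]:
    """Build a Ghostscript command line that requests PDF/A-2 output."""
    return [
        gs_binary,
        "-dPDFA=2",                       # request PDF/A-2
        "-dBATCH",                        # exit after processing
        "-dNOPAUSE",                      # do not prompt between pages
        "-dSAFER",                        # restrict file system access
        "-sDEVICE=pdfwrite",              # PDF output device
        "-sColorConversionStrategy=RGB",  # convert to a single color space
        "-dPDFACompatibilityPolicy=1",    # drop features PDF/A does not allow
        f"-sOutputFile={dst}",
        src,
    ]


def convert_to_pdfa(src: str, dst: str) -> bool:
    """Run the conversion; return False if gs is missing or the run fails."""
    gs = shutil.which("gs")  # on Windows the binary is usually gswin64c
    if gs is None:
        return False
    result = subprocess.run(build_pdfa_command(src, dst, gs), capture_output=True)
    return result.returncode == 0
```

Passing the arguments as a list (rather than a shell string) avoids shell interpretation of untrusted filenames, which matters when the inputs are potentially malicious uploads.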
- Clone/access the project

      git clone https://github.com/dre4ft/pdfvalidator.git
      cd pdfvalidator

- Create virtual environment (optional but recommended)

      python3 -m venv venv
      source venv/bin/activate   # macOS/Linux
      # or
      venv\Scripts\activate      # Windows

- Install dependencies

      pip install -r requirements.txt

Start the server:

    python3 web_server.py

The application will be accessible at: http://127.0.0.1:8000
| Package | Role |
|---|---|
| fastapi | Modern web framework |
| uvicorn | ASGI server for FastAPI |
| yara-python | Threat detection via YARA rules |
| pypdf | PDF file manipulation |
| fpdf2 | PDF generation |
| cryptography | AES-256 encryption |
| python-multipart | Multipart form parsing |
Analyzes and processes one or more PDF files

Parameters:
files: PDF files (multipart/form-data)

Response:
{
  "mode": "remote",
  "received_paths": ["document.pdf"],
  "status": {
    "document.pdf": "Benign file, PDF/A conversion completed."
  }
}

Retrieves current YARA rules

Response:
{
  "rules": "rule example { ... }"
}

Adds new YARA rules

Parameters:
body: New rules (text/plain)

Response:
{
  "status": "YARA rules updated successfully."
}

.
├── api.py                  # FastAPI endpoints
├── web_server.py           # Main server
├── pdf_validator.py        # Processing pipeline
├── yara_detection.py       # YARA engine
├── ghostscript.py          # PDF/A conversion
├── shredder.py             # Secure deletion
├── requirements.txt        # Python dependencies
│
├── static/                 # Web interface
│   ├── index.html
│   ├── app.js
│   └── styles.css
│
├── yara_rules/             # Detection rules
│   ├── pdf.yara
│   ├── pdf2.yara
│   └── pdf.yara.old
│
├── to_analyze/             # PDFs waiting for analysis
├── benign/                 # Benign PDFs archival (converted)
├── suspect_files/          # Quarantined PDFs (anti-loop)
├── malicious/              # Malicious PDFs (deleted)
├── suspicious_pdfs/        # Detailed logs
│
├── test/                   # Test suite
│   ├── main.py
│   ├── runner.py
│   ├── caster.py
│   ├── clean_result.py
│   ├── kpi.py
│   └── gen_mal_pdf/        # Malicious PDF generator
│
└── pipeline.log            # Timestamped journal
- Drop zone: Drag & drop or click to select PDFs
- Real-time logs: Track processing (verdict, conversion, deletion)
- Result consultation: Complete processing history
- Visualization: Displays all currently active rules
- Rule addition: Add new detection rules
- Live updates: Changes are applied immediately
- Open http://127.0.0.1:8000
- Go to the "Scan PDF" tab
- Click the drop zone or drag and drop your files onto it
- Select your PDF files
- Check the logs to follow processing
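For scripted use, the JSON returned by the scan endpoint (see the response format documented in the API section) can be read with the standard library. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the `"Benign file"` prefix is inferred from the single sample response in this README rather than from a documented contract.

```python
import json
from typing import Dict, List


def scan_statuses(raw: str) -> Dict[str, str]:
    """Extract the per-file status messages from a scan response body."""
    payload = json.loads(raw)
    status = payload.get("status", {})
    return {name: str(message) for name, message in status.items()}


def benign_files(raw: str) -> List[str]:
    """List files whose status message starts with 'Benign file' (assumed prefix)."""
    return [
        name
        for name, message in scan_statuses(raw).items()
        if message.startswith("Benign file")
    ]
```

Matching on message text is fragile. If the API exposes a structured verdict field, that would be the better thing to test.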
With cURL:

    curl -X POST "http://127.0.0.1:8000/api/scan/remote" \
      -F "files=@document.pdf"

From the command line:

    python3 pdf_validator.py path/to/file.pdf

pipeline.log: Detailed processing timestamps
2026-01-24 14:32:15 - /Users/romain_travail/pdfvalidator/to_analyze/doc.pdf : [+] Benign file, PDF/A conversion completed.
2026-01-24 14:32:18 - /Users/romain_travail/pdfvalidator/to_analyze/suspect.pdf : [*] Suspect file, PDF/A conversion completed additional analysis in progress...
Quarantine files: Stored in suspect_files/ (anti-loop protection)
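The sample `pipeline.log` lines above follow a `timestamp - path : [marker] message` shape, which makes them easy to parse for reporting. The parser below is a sketch based only on those two sample lines. The `LogEntry` type and the regular expression are hypothetical, and other line shapes may exist in real logs.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: datetime
    path: str
    marker: str   # e.g. "+" or "*", as seen in the sample lines
    message: str


# Pattern inferred from the sample lines in this README.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"(?P<path>.+?) : \[(?P<marker>.)\] (?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the pattern."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogEntry(
        timestamp=datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        path=match["path"],
        marker=match["marker"],
        message=match["message"],
    )
```

Returning `None` for unrecognized lines lets a caller skip tracebacks or blank lines without raising.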
- Depends on YARA rules: detection quality depends directly on the configured rules
- False positives/negatives: YARA rules can produce incorrect detections
- Ghostscript required: PDF/A conversion requires a local Ghostscript installation
- Content loss: PDF/A conversion may drop complex content (scripts, advanced forms)
- Verify Ghostscript installation: `gs --version`
- Ensure `gs` is in the PATH
- Verify the `yara_rules/` folder contains `pdf.yara`
- Check YARA rules syntax
- Verify "Scan PDF" tab is active in the interface
- Check browser console (F12) for JavaScript errors
- Check Python logs in terminal
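The first two checks in this list (Ghostscript on the PATH, rules file present) can be automated before starting the server. The `preflight` helper below is a hypothetical convenience and is not part of the project. It assumes the rules file lives at `yara_rules/pdf.yara` relative to the project root, as shown in the project tree.

```python
import shutil
from pathlib import Path
from typing import List


def preflight(project_root: str = ".") -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    # Ghostscript must be reachable through the PATH for PDF/A conversion.
    if shutil.which("gs") is None:
        problems.append("Ghostscript ('gs') was not found in the PATH")

    # The YARA engine needs at least the main rules file.
    rules = Path(project_root) / "yara_rules" / "pdf.yara"
    if not rules.is_file():
        problems.append(f"Missing YARA rules file: {rules}")

    return problems
```

Calling `preflight()` from the project directory and printing the returned messages gives a quick diagnosis without opening the web interface.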
- The pipeline applies recursive logic for suspicious files
- Anti-loop protection prevents infinite conversions
- Secure deletion uses 10 overwrite passes plus AES-256 encryption
- Logs are fully timestamped for traceability
- Web interface uses Fetch API for asynchronous calls
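To illustrate the multi-pass overwrite mentioned in these notes, here is a minimal standard-library sketch. It is not the code of `shredder.py`: the function name is hypothetical, random data is assumed as the fill pattern, and the AES-256 step is omitted because it would rely on the `cryptography` package. On SSDs and on journaling or copy-on-write filesystems, overwriting in place does not guarantee that old blocks are destroyed.

```python
import os


def overwrite_and_delete(path: str, passes: int = 10, chunk_size: int = 64 * 1024) -> None:
    """Overwrite a file with random bytes `passes` times, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                count = min(chunk_size, remaining)
                handle.write(os.urandom(count))
                remaining -= count
            # Push each pass to disk before starting the next one.
            handle.flush()
            os.fsync(handle.fileno())
    os.remove(path)
```

Writing in fixed-size chunks keeps memory use constant even for large PDFs, and the `fsync` after each pass reduces the chance that several passes are merged in the page cache.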
To be defined according to your needs.
Advanced PDF security project.