Fast document text extraction API built with Rust and Extractous. Supports multiple formats with OCR capabilities.
- Multi-Format: PDF, DOCX, TXT, PNG, JPG, HTML, EML
- OCR Support: Tesseract integration for images and scanned PDFs
- Fast: Async processing with proper error handling
- Docker Ready: Complete containerization
- Monitoring: Health checks and structured logging
Run with Docker:

```bash
# Start the service
docker-compose up -d

# Check health
curl http://localhost:8280/health

# Upload a file
curl -X POST -F "file=@document.pdf" http://localhost:8280/parse/files
```

Run locally:

```bash
# Install Rust if needed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Run the API
cargo run

# Test
curl -X POST -F "file=@document.pdf" http://localhost:8080/parse/files
```

API endpoints:

`GET /health`

`POST /parse/files` (`Content-Type: multipart/form-data`)

```bash
# Upload files
curl -X POST -F "file=@document.pdf" http://localhost:8080/parse/files
```

`POST /parse/urls` (`Content-Type: application/json`)

```json
{
  "urls": ["https://example.com/document.pdf"]
}
```
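For a programmatic client, the same `/parse/urls` call can be made from Rust. This is a minimal sketch assuming the `reqwest` crate (with the `blocking` and `json` features) and `serde_json`; only the URL and request body come from the example above, the rest is illustrative.

```rust
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Illustrative only: assumes reqwest = { features = ["blocking", "json"] } and serde_json.
    let client = reqwest::blocking::Client::new();

    // Same body as the /parse/urls example above.
    let body = serde_json::json!({ "urls": ["https://example.com/document.pdf"] });

    let response = client
        .post("http://localhost:8080/parse/urls")
        .json(&body)
        .send()?
        .error_for_status()?;

    // Print the raw JSON; a multi-file response example is shown below.
    println!("{}", response.text()?);
    Ok(())
}
```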
When uploading multiple files, the API ensures you can map results back to your original files:

```bash
curl -X POST http://localhost:8080/parse/files \
  -F "file=@document1.pdf" \
  -F "file=@document2.docx" \
  -F "file=@Прайс_файл.pdf"
```

```json
{
  "request_id": "uuid-here",
  "results": [
    {
      "id": "uuid-file-0",
      "file_name": "document1.pdf",
      "input_index": 0,
      "extracted_text": "...",
      "error": null
    },
    {
      "id": "uuid-file-1",
      "file_name": "document2.docx",
      "input_index": 1,
      "extracted_text": "...",
      "error": null
    },
    {
      "id": "uuid-file-2",
      "file_name": "Прайс_файл.pdf",
      "input_index": 2,
      "extracted_text": "...",
      "error": null
    }
  ],
  "total_files_processed": 3,
  "successful_extractions": 3,
  "failed_extractions": 0
}
```

Key Fields for Mapping:
- `input_index`: 0-based index corresponding to the order files were uploaded
- `file_name`: Original filename as uploaded (preserves Unicode characters)
- `id`: Unique identifier for this specific extraction
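For a typed Rust client, the response can be modeled roughly as below. The field names come from the example response above; the exact types, optionality, and whether the service itself uses such structs are assumptions.

```rust
use serde::Deserialize;

// Rough model of the response shown above; types are inferred from the
// example JSON, not taken from the service's source code.
#[derive(Debug, Deserialize)]
struct ParseResponse {
    request_id: String,
    results: Vec<FileResult>,
    total_files_processed: u32,
    successful_extractions: u32,
    failed_extractions: u32,
}

#[derive(Debug, Deserialize)]
struct FileResult {
    id: String,
    file_name: String,
    input_index: usize,             // 0-based position in the upload order
    extracted_text: Option<String>, // assumed nullable when extraction fails
    error: Option<String>,          // null on success in the example above
}
```

A client can then deserialize the body with `serde_json::from_str::<ParseResponse>(...)` and use `input_index` to map each result back to its original file, just like the JavaScript example below.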
Client Implementation:

```javascript
// When uploading files
const files = [file1, file2, file3]; // Your input files

// After receiving the response
response.results.forEach(result => {
  const originalFile = files[result.input_index];
  console.log(`File: ${originalFile.name}`);
  console.log(`Server filename: ${result.file_name}`);
  console.log(`Extracted text: ${result.extracted_text}`);
});
```

Handling Unicode Filenames: The server sanitizes filenames internally to avoid Unicode encoding issues with the Java/Tika layer, but always returns the original filename in the response. This means (see the sketch after this list):
- `file_name` in the response = original filename (e.g., "Прайс list на 2 квартал 2025 г v1.pdf")
- Internal processing uses an ASCII-safe filename (e.g., "___list____2_______2025___v1.pdf")
- Client mapping works correctly regardless of Unicode characters
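The service's actual sanitization rules are not shown here; the sketch below only illustrates the kind of ASCII-safe mapping implied by the example filenames (non-ASCII characters and spaces replaced with underscores), using a hypothetical function name.

```rust
/// Hypothetical illustration of ASCII-safe filename sanitization:
/// keep ASCII alphanumerics, '.', '-' and '_', replace everything else
/// (spaces, Cyrillic characters, ...) with '_'. The real service's exact
/// mapping may differ.
fn sanitize_filename(original: &str) -> String {
    original
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || matches!(c, '.' | '-' | '_') {
                c
            } else {
                '_'
            }
        })
        .collect()
}

fn main() {
    // Cyrillic characters and spaces are replaced with underscores.
    println!("{}", sanitize_filename("Прайс list на 2 квартал 2025 г v1.pdf"));
}
```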
```bash
# Logging level
RUST_LOG=info

# Server configuration
SERVER_HOST=0.0.0.0
SERVER_PORT=8080

# OCR languages (use + to combine multiple)
TESSERACT_LANGUAGES=eng+rus

# CORS allowed origins (comma-separated, default: '*')
ACTIX_CORS_ORIGIN="*"
# Example: ACTIX_CORS_ORIGIN="https://myapp.com,https://admin.myapp.com"

# Bearer token for authentication (optional)
ACTIX_BEARER_TOKEN="your_secret_token"
# If set, /parse/files and /parse/urls require an 'Authorization: Bearer <token>' header
```
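As an illustration of how these variables could be consumed, here is a minimal sketch of reading them with defaults in Rust. The `Config` struct and the defaults are assumptions based on the values above, not the service's actual configuration code.

```rust
use std::env;

// Illustrative configuration loader; the names mirror the environment
// variables documented above, the defaults are assumptions.
struct Config {
    host: String,
    port: u16,
    tesseract_languages: String,
    cors_origin: String,
    bearer_token: Option<String>, // None => authentication disabled
}

impl Config {
    fn from_env() -> Config {
        Config {
            host: env::var("SERVER_HOST").unwrap_or_else(|_| "0.0.0.0".into()),
            port: env::var("SERVER_PORT")
                .ok()
                .and_then(|p| p.parse().ok())
                .unwrap_or(8080),
            tesseract_languages: env::var("TESSERACT_LANGUAGES")
                .unwrap_or_else(|_| "eng".into()),
            cors_origin: env::var("ACTIX_CORS_ORIGIN").unwrap_or_else(|_| "*".into()),
            bearer_token: env::var("ACTIX_BEARER_TOKEN").ok(),
        }
    }
}
```

When `bearer_token` is set, requests to `/parse/files` and `/parse/urls` must carry the `Authorization: Bearer <token>` header described above.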
To add more OCR languages:

- Install language packs:

  ```bash
  # Examples
  sudo apt-get install tesseract-ocr-fra  # French
  sudo apt-get install tesseract-ocr-deu  # German
  sudo apt-get install tesseract-ocr-spa  # Spanish
  ```

- Set the environment variable:

  ```bash
  export TESSERACT_LANGUAGES=eng+rus+fra+deu
  ```

- For Docker, update `docker-compose.yml`:

  ```yaml
  environment:
    - TESSERACT_LANGUAGES=eng+rus+fra+deu
  ```
Common language codes: eng (English), rus (Russian), fra (French), deu (German), spa (Spanish), ita (Italian), por (Portuguese), chi_sim (Simplified Chinese), jpn (Japanese), kor (Korean), ara (Arabic)
| Format | Extensions | OCR Support |
|---|---|---|
| PDF | .pdf | ✅ Auto OCR for scanned docs |
| Word | .docx | ✅ Native text extraction |
| Text | .txt | ✅ Direct reading |
| Images | .png, .jpg, .jpeg | ✅ Full OCR |
| HTML | .html | ✅ Text extraction |
| Email | .eml | ✅ Content extraction |
- Maximum 10 URLs per request
- Maximum 50MB per file upload
- OCR timeout: 240 seconds
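These limits are enforced on the server; a client may want to check them before sending. A small illustrative sketch in Rust (the constant names and error handling are assumptions, not part of the API):

```rust
use std::fs;
use std::io;
use std::path::Path;

const MAX_FILE_BYTES: u64 = 50 * 1024 * 1024; // 50MB per file upload
const MAX_URLS_PER_REQUEST: usize = 10;       // 10 URLs per /parse/urls request

/// Illustrative pre-flight check mirroring the documented upload limit.
fn check_upload(path: &Path) -> io::Result<()> {
    let size = fs::metadata(path)?.len();
    if size > MAX_FILE_BYTES {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} is {} bytes, over the 50MB limit", path.display(), size),
        ));
    }
    Ok(())
}

/// Split a URL list into batches of at most 10 per /parse/urls request.
fn batch_urls(urls: &[String]) -> Vec<&[String]> {
    urls.chunks(MAX_URLS_PER_REQUEST).collect()
}
```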
```bash
# Run tests
cargo test

# Health check
curl http://localhost:8080/health

# Upload test file
curl -X POST -F "file=@test.pdf" http://localhost:8080/parse/files
```

Build issues:

```bash
# Install dependencies
sudo apt-get install build-essential tesseract-ocr tesseract-ocr-eng
```

Check logs:

```bash
# Docker
docker-compose logs -f

# Local
RUST_LOG=debug cargo run
```

Performance: Monitor memory usage during OCR processing. Scale horizontally for high load.
```
├── src/
│   ├── main.rs            # HTTP server and routing
│   └── processor.rs       # Document processing
├── Dockerfile             # Container build
├── docker-compose.yml     # Service orchestration
└── README.md              # This file
```
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass: `cargo test`
- Submit a pull request