Extractous Document Processing API

A document text extraction API built with Rust and Extractous. It accepts file uploads and URLs, supports multiple document formats, and applies OCR to images and scanned PDFs.

Features

📄 Multi-Format: PDF, DOCX, TXT, PNG, JPG, HTML, EML
🔍 OCR Support: Tesseract integration for images and scanned PDFs
⚡ Fast: Async processing, with per-file errors reported in the response
🐳 Docker Ready: Complete containerization
📊 Monitoring: Health checks and structured logging

Quick Start

Docker (Recommended)

# Start the service
docker-compose up -d

# Check health
curl http://localhost:8280/health

# Upload a file
curl -X POST -F "file=@document.pdf" http://localhost:8280/parse/files

Local Development

# Install Rust if needed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Run the API
cargo run

# Test
curl -X POST -F "file=@document.pdf" http://localhost:8080/parse/files

API Endpoints

Health Check

GET /health

Process Files

POST /parse/files
Content-Type: multipart/form-data

# Upload files
curl -X POST -F "file=@document.pdf" http://localhost:8080/parse/files
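
For clients that do not use curl, the same multipart request can be assembled in JavaScript. The sketch below assumes a runtime with global `FormData` and `Blob` (browsers, or Node.js 18 and later). `buildUploadForm` and the `{ name, data }` input shape are illustrative names, not part of the API; the only detail taken from the curl examples is the form field name `file`.

```javascript
// Build the multipart body for POST /parse/files.
// `buildUploadForm` and the { name, data } shape are hypothetical helpers for this sketch.
function buildUploadForm(files) {
  const form = new FormData();
  for (const { name, data } of files) {
    // Every file goes under the same field name, "file", as in the curl examples.
    form.append("file", new Blob([data]), name);
  }
  return form;
}

// Usage (not executed here, since it needs a running server):
// const form = buildUploadForm([{ name: "document.pdf", data: pdfBytes }]);
// const res = await fetch("http://localhost:8080/parse/files", { method: "POST", body: form });
// const body = await res.json();
```

When the body is a `FormData` object, `fetch` sets the multipart `Content-Type` header and boundary itself, so the header should not be set by hand.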

Process URLs

POST /parse/urls
Content-Type: application/json

{
  "urls": ["https://example.com/document.pdf"]
}
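
The request above can also be built programmatically. This is a minimal sketch: `buildUrlsRequest` is a hypothetical helper name, and rejecting empty or malformed input on the client is a choice made for this example, not documented server behavior.

```javascript
// Build the fetch() options for POST /parse/urls.
// `buildUrlsRequest` is a hypothetical helper name for this sketch.
function buildUrlsRequest(urls) {
  if (!Array.isArray(urls) || urls.length === 0) {
    throw new Error("urls must be a non-empty array");
  }
  // new URL() throws a TypeError on malformed input, so bad URLs fail before any request is sent.
  urls.forEach(u => new URL(u));
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ urls }),
  };
}

// Usage (needs a running server):
// const res = await fetch("http://localhost:8080/parse/urls",
//   buildUrlsRequest(["https://example.com/document.pdf"]));
```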

Client-Server File Mapping

When uploading multiple files, you can map each result back to its original file using the input_index, file_name, and id fields returned with every result.

File Upload Example

curl -X POST http://localhost:8080/parse/files \
  -F "file=@document1.pdf" \
  -F "file=@document2.docx" \
  -F "file=@Прайс_файл.pdf"

Response Structure

{
  "request_id": "uuid-here",
  "results": [
    {
      "id": "uuid-file-0",
      "file_name": "document1.pdf",
      "input_index": 0,
      "extracted_text": "...",
      "error": null
    },
    {
      "id": "uuid-file-1",
      "file_name": "document2.docx",
      "input_index": 1,
      "extracted_text": "...",
      "error": null
    },
    {
      "id": "uuid-file-2",
      "file_name": "Прайс_файл.pdf",
      "input_index": 2,
      "extracted_text": "...",
      "error": null
    }
  ],
  "total_files_processed": 3,
  "successful_extractions": 3,
  "failed_extractions": 0
}
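
A client may want to cross-check the summary counters against the `results` array before trusting them. The sketch below does that under two assumptions that the response example suggests but does not state: a result counts as failed exactly when its `error` field is set, and `total_files_processed` equals the number of results. `checkSummary` is a hypothetical helper name.

```javascript
// Recompute the summary counters from `results` and compare them with the ones the server sent.
// Assumption: a result counts as failed exactly when `error` is set (neither null nor undefined).
function checkSummary(response) {
  const failed = response.results.filter(r => r.error != null).length;
  const successful = response.results.length - failed;
  return {
    successful,
    failed,
    consistent:
      response.total_files_processed === response.results.length &&
      response.successful_extractions === successful &&
      response.failed_extractions === failed,
  };
}
```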

Mapping Results to Input Files

Key Fields for Mapping:

input_index: 0-based index corresponding to the order files were uploaded
file_name: Original filename as uploaded (preserves Unicode characters)
id: Unique identifier for this specific extraction

Client Implementation:

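// When uploading files, keep them in the order you send them
const files = [file1, file2, file3]; // Your input files

// After receiving the response, look up each original file by input_index
response.results.forEach(result => {
  const originalFile = files[result.input_index];
  console.log(`File: ${originalFile.name}`);
  console.log(`Server filename: ${result.file_name}`);
  if (result.error) {
    // extracted_text is not meaningful when the extraction failed
    console.log(`Extraction failed: ${result.error}`);
  } else {
    console.log(`Extracted text: ${result.extracted_text}`);
  }
});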

Handling Unicode Filenames: The server sanitizes filenames internally to avoid Unicode encoding issues with the Java/Tika layer, but always returns the original filename in the response. This means:

file_name in response = original filename (e.g., "Прайс list на 2 квартал 2025 г v1.pdf")
Internal processing uses ASCII-safe filename (e.g., "___list____2_______2025___v1.pdf")
Client mapping works correctly regardless of Unicode characters
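
The mapping rules above can be wrapped in a small helper that does not depend on the order of `results`. This is a sketch only: `pairResults` is a hypothetical name, `files` is any array of objects with a `name` property, and comparing names after NFC normalization is a defensive assumption made here, not something the API requires.

```javascript
// Pair each result with the file it came from, using input_index as the key.
// `pairResults` is a hypothetical helper; `files` must be in upload order.
function pairResults(files, response) {
  return response.results.map(result => {
    const file = files[result.input_index];
    if (file === undefined) {
      throw new Error(`No input file at index ${result.input_index}`);
    }
    return {
      file,
      text: result.extracted_text,
      error: result.error,
      // file_name is documented as the original name, so this is expected to be true
      // even for Unicode names. Both sides are NFC-normalized before comparing.
      nameMatches: file.name.normalize("NFC") === result.file_name.normalize("NFC"),
    };
  });
}
```

Because the lookup goes through `input_index`, the pairing holds even if results arrive in a different order than the uploads.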

Configuration

Environment Variables

# Logging level
RUST_LOG=info

# Server configuration
SERVER_HOST=0.0.0.0
SERVER_PORT=8080

# OCR languages (use + to combine multiple)
TESSERACT_LANGUAGES=eng+rus

# CORS allowed origins (comma-separated, default: '*')
ACTIX_CORS_ORIGIN="*"
# Example: ACTIX_CORS_ORIGIN="https://myapp.com,https://admin.myapp.com"

# Bearer token for authentication (optional)
ACTIX_BEARER_TOKEN="your_secret_token"
# If set, /parse/files and /parse/urls require 'Authorization: Bearer <token>' header
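
On the client side, the bearer token setting translates into one extra request header. A minimal sketch, assuming the client is given the same value the server has in ACTIX_BEARER_TOKEN; `buildAuthHeaders` is a hypothetical helper name.

```javascript
// Build request headers for /parse/files and /parse/urls.
// `buildAuthHeaders` is a hypothetical helper for this sketch.
function buildAuthHeaders(token) {
  // With no token configured, the Authorization header is simply omitted.
  if (!token) return {};
  return { Authorization: `Bearer ${token}` };
}

// Usage (needs a running server):
// const res = await fetch("http://localhost:8080/parse/urls", {
//   method: "POST",
//   headers: { "Content-Type": "application/json", ...buildAuthHeaders("your_secret_token") },
//   body: JSON.stringify({ urls: ["https://example.com/document.pdf"] }),
// });
```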

Adding OCR Languages

Install language packs:

# Examples
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-spa  # Spanish

Set environment variable:

export TESSERACT_LANGUAGES=eng+rus+fra+deu

For Docker, update docker-compose.yml:

environment:
  - TESSERACT_LANGUAGES=eng+rus+fra+deu

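Common language codes: eng (English), rus (Russian), fra (French), deu (German), spa (Spanish), ita (Italian), por (Portuguese), chi_sim (Chinese, Simplified), jpn (Japanese), kor (Korean), ara (Arabic). Note that apt package names use hyphens (tesseract-ocr-chi-sim) while the language code itself uses an underscore (chi_sim).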
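
If the language list is generated by a deployment script, the `+`-separated value can be built from an array. This is a small sketch under stated assumptions: `joinLanguages` is a hypothetical helper, and it only checks the shape of the string; it cannot tell whether the matching language pack is installed.

```javascript
// Join Tesseract language codes into the "+"-separated form used by TESSERACT_LANGUAGES.
// `joinLanguages` is a hypothetical helper; it does not check that a language pack is installed.
function joinLanguages(codes) {
  const cleaned = codes.map(c => c.trim()).filter(c => c.length > 0);
  if (cleaned.length === 0) {
    throw new Error("at least one language code is required");
  }
  for (const code of cleaned) {
    // "+" is the separator, so it cannot appear inside a single code.
    if (/[+\s]/.test(code)) {
      throw new Error(`invalid language code: ${code}`);
    }
  }
  // Drop duplicates while keeping the position of the first occurrence.
  return [...new Set(cleaned)].join("+");
}
```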

Supported Formats

| Format | Extensions | OCR Support |
|--------|------------|-------------|
| PDF | `.pdf` | ✅ Auto OCR for scanned docs |
| Word | `.docx` | ❌ Native text extraction |
| Text | `.txt` | ❌ Direct reading |
| Images | `.png`, `.jpg`, `.jpeg` | ✅ Full OCR |
| HTML | `.html` | ❌ Text extraction |
| Email | `.eml` | ❌ Content extraction |
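
A client can use the table above as a pre-check before uploading. The sketch below is an assumption-laden convenience: it decides by file extension, case-insensitively, whereas the server may well detect formats from file content. `SUPPORTED` and `describeFile` are hypothetical names.

```javascript
// Lookup table mirroring the Supported Formats table: extension -> whether OCR is applied.
// `SUPPORTED` and `describeFile` are hypothetical names for this sketch.
const SUPPORTED = new Map([
  ["pdf", true], // OCR is applied automatically to scanned PDFs
  ["docx", false],
  ["txt", false],
  ["png", true],
  ["jpg", true],
  ["jpeg", true],
  ["html", false],
  ["eml", false],
]);

function describeFile(fileName) {
  const dot = fileName.lastIndexOf(".");
  // No dot, or only a leading dot (e.g. ".gitignore"), means no usable extension.
  const ext = dot > 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  return { supported: SUPPORTED.has(ext), ocr: SUPPORTED.get(ext) === true };
}
```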

Limits

Maximum 10 URLs per request
Maximum 50MB per file upload
OCR timeout: 240 seconds
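
Clients with more than 10 URLs need to split them across requests, and can reject oversized files before uploading. A minimal sketch of both checks follows. The constant and function names are hypothetical, and "50MB" is read conservatively as 50,000,000 bytes, so a file that passes the check is within the limit whether the server counts in decimal or binary megabytes.

```javascript
const MAX_URLS_PER_REQUEST = 10;
// Assumption: "50MB" is taken as 50 * 1000 * 1000 bytes, the stricter of the two common readings.
const MAX_FILE_BYTES = 50 * 1000 * 1000;

// Split a list of URLs into batches that each fit in one /parse/urls request.
function chunkUrls(urls, size = MAX_URLS_PER_REQUEST) {
  const batches = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}

// True when a file of this size is within the assumed upload limit.
function fitsUploadLimit(byteLength) {
  return byteLength <= MAX_FILE_BYTES;
}
```

Each batch returned by `chunkUrls` can then be sent as the `urls` array of its own request.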

Testing

# Run tests
cargo test

# Health check
curl http://localhost:8080/health

# Upload test file
curl -X POST -F "file=@test.pdf" http://localhost:8080/parse/files

Troubleshooting

Build issues:

# Install dependencies
sudo apt-get install build-essential tesseract-ocr tesseract-ocr-eng

Check logs:

# Docker
docker-compose logs -f

# Local
RUST_LOG=debug cargo run

Performance: Monitor memory usage during OCR processing, and scale horizontally (run additional instances of the service) under high load.

Project Structure

├── src/
│   ├── main.rs          # HTTP server and routing
│   └── processor.rs     # Document processing
├── Dockerfile           # Container build
├── docker-compose.yml   # Service orchestration
└── README.md            # This file

Contributing

Fork the repository
Create a feature branch
Add tests for new functionality
Ensure all tests pass: cargo test
Submit a pull request
