127 changes: 127 additions & 0 deletions integrations/amazon-textract.md
@@ -0,0 +1,127 @@
---
layout: integration
name: Amazon Textract
description: Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/amazon-textract-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract
type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/aws.png
version: Haystack 2.0
toc: true
---

### **Table of Contents**
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)

## Overview

[`AmazonTextractConverter`](https://docs.haystack.deepset.ai/docs/amazontextractconverter) provides an integration of [Amazon Textract](https://aws.amazon.com/textract/) with Haystack.

This component uses Amazon Textract's synchronous API to convert images and single-page PDFs into Haystack `Document` objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents.

**Supported file formats**: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
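
Because unsupported inputs are only rejected once the request reaches Textract, it can help to screen files locally first. The following is a minimal sketch, assuming the formats and size cap listed above; `is_supported_source` and `filter_sources` are hypothetical helpers, not part of the integration, and the 10 MB cap is interpreted here as 10 MiB.

```python
from pathlib import Path

# Assumption: these values mirror the formats and size cap listed above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MB, interpreted here as 10 MiB


def is_supported_source(path: Path, size_bytes: int) -> bool:
    """Return True if the suffix is supported and the size is within the cap."""
    return path.suffix.lower() in SUPPORTED_SUFFIXES and size_bytes <= MAX_BYTES


def filter_sources(paths: list[Path]) -> list[Path]:
    """Keep only existing files that pass the suffix and size checks."""
    return [p for p in paths if p.is_file() and is_supported_source(p, p.stat().st_size)]
```

This check does not verify that a PDF has a single page; that would need a PDF library.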

**Key features**:
- Plain text extraction with `DetectDocumentText`
- Table, form, signature, and layout detection with `AnalyzeDocument`
- Natural-language queries to extract specific answers from documents
- Access to the raw Textract response for downstream processing

## Installation

Install the Amazon Textract integration:

```bash
pip install amazon-textract-haystack
```

## Usage

The component uses the standard boto3 credential chain. You can supply AWS credentials and the region in any of these ways: set the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` environment variables; configure `~/.aws/credentials` and `~/.aws/config`; rely on an IAM role when running on AWS infrastructure; or pass them explicitly as [Secret](https://docs.haystack.deepset.ai/docs/secret-management) arguments.

The Textract API is selected automatically based on how you configure the component: `DetectDocumentText` is used by default for plain text extraction, while `AnalyzeDocument` is used whenever you set `feature_types` or pass `queries` at runtime.
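
The selection rule can be sketched as a small function. This only illustrates the behavior described above; `select_textract_api` is a hypothetical name and the component's internal logic may differ.

```python
from typing import Optional, Sequence


def select_textract_api(
    feature_types: Optional[Sequence[str]] = None,
    queries: Optional[Sequence[str]] = None,
) -> str:
    """Return the name of the Textract operation implied by the configuration."""
    # Any feature type or runtime query requires the richer AnalyzeDocument call.
    if feature_types or queries:
        return "AnalyzeDocument"
    # Otherwise plain OCR is enough.
    return "DetectDocumentText"
```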

### Basic text extraction

Extract plain text from a document with the default configuration, which calls `DetectDocumentText`:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]

print(documents[0].content)
```
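
To see how OCR output relates to plain text, the sketch below joins the `LINE` blocks of a Textract response into one string. The sample response is hand-written for illustration, and `lines_to_text` is a hypothetical helper; the converter may assemble `Document.content` differently.

```python
def lines_to_text(response: dict) -> str:
    """Join the text of all LINE blocks, one line of text per block."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


# Hand-written sample in the shape of a DetectDocumentText response.
sample_response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE"},
        {"Id": "l1", "BlockType": "LINE", "Text": "Invoice 42"},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "42"},
        {"Id": "l2", "BlockType": "LINE", "Text": "Total: 10.00"},
    ]
}

print(lines_to_text(sample_response))
```

`WORD` blocks are skipped because each `LINE` block already carries the text of its words.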

### Table and form analysis

Use `AnalyzeDocument` to detect tables and forms by setting `feature_types`:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])

documents = results["documents"]
raw_responses = results["raw_textract_response"]
```
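
The raw response can be post-processed to recover table structure. The sketch below rebuilds each `TABLE` block as rows of cell text by following `CHILD` relationships. It assumes each entry of `raw_textract_response` is the dict returned by Textract; the sample is hand-written, and `child_blocks` and `tables_from_response` are hypothetical helpers.

```python
from typing import Dict, List


def child_blocks(block: dict, by_id: Dict[str, dict], block_type: str) -> List[dict]:
    """Return the CHILD blocks of `block` that have the given BlockType."""
    found = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") != "CHILD":
            continue
        for block_id in rel.get("Ids", []):
            child = by_id.get(block_id)
            if child is not None and child.get("BlockType") == block_type:
                found.append(child)
    return found


def tables_from_response(response: dict) -> List[List[List[str]]]:
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    tables = []
    for block in blocks:
        if block.get("BlockType") != "TABLE":
            continue
        cells = child_blocks(block, by_id, "CELL")
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [[""] * n_cols for _ in range(n_rows)]
        for cell in cells:
            words = [w["Text"] for w in child_blocks(cell, by_id, "WORD")]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


def _cell(cell_id: str, row: int, col: int, word_ids: List[str]) -> dict:
    """Build a sample CELL block (illustration only)."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


# Hand-written sample: a 2x2 table.
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        _cell("c1", 1, 1, ["w1"]),
        _cell("c2", 1, 2, ["w2"]),
        _cell("c3", 2, 1, ["w3"]),
        _cell("c4", 2, 2, ["w4", "w5"]),
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Widget"},
        {"Id": "w4", "BlockType": "WORD", "Text": "10.00"},
        {"Id": "w5", "BlockType": "WORD", "Text": "USD"},
    ]
}

print(tables_from_response(sample_response))
```

The sketch ignores merged cells and selection elements such as checkboxes, which a production parser would need to handle.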

Valid `feature_types` values: `"TABLES"`, `"FORMS"`, `"SIGNATURES"`, `"LAYOUT"`.
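
A typo in a feature type is easier to diagnose before any request is made. The sketch below checks a list against the values above; `validate_feature_types` is a hypothetical helper, and the component may already perform its own validation.

```python
from typing import Iterable, List

# The four values listed above; matching is case-sensitive.
VALID_FEATURE_TYPES = frozenset({"TABLES", "FORMS", "SIGNATURES", "LAYOUT"})


def validate_feature_types(feature_types: Iterable[str]) -> List[str]:
    """Return the feature types as a list, or raise ValueError on unknown values."""
    feature_list = list(feature_types)
    unknown = sorted(set(feature_list) - VALID_FEATURE_TYPES)
    if unknown:
        raise ValueError(
            f"Unknown feature types: {unknown}. Valid values: {sorted(VALID_FEATURE_TYPES)}"
        )
    return feature_list
```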

### Natural-language queries

Ask questions about a document and get extracted answers. The `QUERIES` feature type is enabled automatically when you pass the `queries` parameter at runtime:

```python
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)

documents = results["documents"]
raw_responses = results["raw_textract_response"]
```
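
In a raw Textract response, each question appears as a `QUERY` block linked to `QUERY_RESULT` blocks through an `ANSWER` relationship. The sketch below pairs questions with their first answer. The sample is hand-written, `query_answers` is a hypothetical helper, and it assumes each entry of `raw_textract_response` is the dict returned by Textract.

```python
from typing import Dict, Optional


def query_answers(response: dict) -> Dict[str, Optional[str]]:
    """Map each query text to its first answer, or None if Textract found none."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    answers: Dict[str, Optional[str]] = {}
    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        question = block["Query"]["Text"]
        answer_ids = [
            block_id
            for rel in block.get("Relationships") or []
            if rel.get("Type") == "ANSWER"
            for block_id in rel.get("Ids", [])
        ]
        texts = [by_id[i]["Text"] for i in answer_ids if i in by_id]
        answers[question] = texts[0] if texts else None
    return answers


# Hand-written sample: one answered query and one without an answer.
sample_response = {
    "Blocks": [
        {"Id": "q1", "BlockType": "QUERY",
         "Query": {"Text": "What is the patient name?"},
         "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
        {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Jane Doe", "Confidence": 98.5},
        {"Id": "q2", "BlockType": "QUERY",
         "Query": {"Text": "What is the date of birth?"}},
    ]
}

print(query_answers(sample_response))
```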

Queries can be combined with `feature_types` for both structural and question-based extraction:

```python
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)
```
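
To illustrate how feature types and queries might combine into a single request, the sketch below builds keyword arguments in the shape accepted by boto3's Textract `analyze_document` call. `build_analyze_request` is a hypothetical helper; whether the converter assembles its request this way is an assumption.

```python
from typing import Any, Dict, Optional, Sequence


def build_analyze_request(
    document_bytes: bytes,
    feature_types: Optional[Sequence[str]] = None,
    queries: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Build AnalyzeDocument keyword arguments from feature types and queries."""
    features = list(feature_types or [])
    request: Dict[str, Any] = {"Document": {"Bytes": document_bytes}}
    if queries:
        # Queries only take effect when the QUERIES feature type is enabled.
        if "QUERIES" not in features:
            features.append("QUERIES")
        request["QueriesConfig"] = {"Queries": [{"Text": q} for q in queries]}
    request["FeatureTypes"] = features
    return request


request = build_analyze_request(
    b"<image bytes>",
    feature_types=["TABLES", "FORMS"],
    queries=["What is the total amount due?"],
)
print(request["FeatureTypes"])
```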

### Explicit credentials

To bypass the default credential chain, pass the credentials and region as `Secret` arguments:

```python
from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)
```
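
When credentials come from custom environment variables, a quick check before building the converter can make a missing variable easier to spot. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the variable names are the ones used in the example above.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(["MY_AWS_KEY", "MY_AWS_SECRET"])
if missing:
    print(f"Set these environment variables first: {missing}")
```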

For more details on Amazon Textract capabilities and setup, refer to the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/what-is.html).

## License

`amazon-textract-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.