Add Amazon Textract integration page #484
Open
bogdankostic wants to merge 2 commits into main from amazon_textract
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| --- | ||
| layout: integration | ||
| name: Amazon Textract | ||
| description: Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents | ||
| authors: | |
|     - name: deepset | |
|       socials: | |
|         github: deepset-ai | |
|         twitter: deepset_ai | |
|         linkedin: https://www.linkedin.com/company/deepset-ai/ | |
| pypi: https://pypi.org/project/amazon-textract-haystack | ||
| repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/amazon_textract | ||
| type: Data Ingestion | ||
| report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues | ||
| logo: /logos/aws.png | ||
| version: Haystack 2.0 | ||
| toc: true | ||
| --- | ||
|
|
||
| ### **Table of Contents** | ||
| - [Overview](#overview) | ||
| - [Installation](#installation) | ||
| - [Usage](#usage) | ||
|
|
||
| ## Overview | ||
|
|
||
| [`AmazonTextractConverter`](https://docs.haystack.deepset.ai/docs/amazontextractconverter) integrates [Amazon Textract](https://aws.amazon.com/textract/) with Haystack. | |
|
|
||
| This component uses Amazon Textract's synchronous API to convert images and single-page PDFs into Haystack `Document` objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents. | ||
|
|
||
| **Supported file formats**: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB). | ||
|
|
||
| **Key features**: | ||
| - Plain text extraction with `DetectDocumentText` | ||
| - Table, form, signature, and layout detection with `AnalyzeDocument` | ||
| - Natural-language queries to extract specific answers from documents | ||
| - Access to the raw Textract response for downstream processing | ||
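The format and size limits listed above can be checked locally before any request is sent. The following is a minimal sketch, not part of the integration: `check_source`, `SUPPORTED_SUFFIXES`, and `MAX_BYTES` are hypothetical names, the extension list is inferred from the formats named above, and the 10 MB limit is applied to every format, which is an assumption. It does not verify that a PDF has a single page.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: extensions inferred from the "Supported file formats" line.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: "up to 10 MB" is read as 10 * 1024 * 1024 bytes for all formats.
MAX_BYTES = 10 * 1024 * 1024


def check_source(path: str) -> list[str]:
    """Return a list of problems found for one source file (empty if none)."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
    elif p.stat().st_size > MAX_BYTES:
        problems.append(f"file too large: {p.stat().st_size} bytes")
    return problems
```

Running such a check over `sources` before calling `converter.run(...)` surfaces obvious problems without spending a Textract call.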
|
|
||
| ## Installation | ||
|
|
||
| Install the Amazon Textract integration: | ||
|
|
||
| ```bash | ||
| pip install amazon-textract-haystack | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| The component uses the standard boto3 credential chain. You can set AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`) as environment variables, configure them via `~/.aws/credentials` and `~/.aws/config`, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as [Secret](https://docs.haystack.deepset.ai/docs/secret-management) arguments. | ||
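If you choose the environment-variable route, a quick check can show which of the three variables named above are unset. This is a small sketch using only the standard library; `missing_aws_env_vars` is a hypothetical helper, not part of the integration. A non-empty result is not necessarily an error, because the other sources in the credential chain (config files, IAM role, explicit `Secret` arguments) may still supply the values.

```python
import os

# The three variables named in the paragraph above.
AWS_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(env=None):
    """Return the names of the AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in AWS_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    # Lists whichever variables are missing in the current shell.
    print(missing_aws_env_vars())
```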
|
|
||
| The Textract API is selected automatically based on how you configure the component: `DetectDocumentText` is used by default for plain text extraction, while `AnalyzeDocument` is used whenever you set `feature_types` or pass `queries` at runtime. | ||
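The selection rule can be written out as a small function to make the behavior concrete. This is an illustrative sketch of the rule as described on this page, not the component's actual code; `select_textract_api` is a hypothetical name. It also reflects the statement in the queries section that `QUERIES` is enabled automatically when `queries` is passed.

```python
def select_textract_api(feature_types=None, queries=None):
    """Return (api_name, effective_feature_types) for a given configuration.

    Illustrative only: mirrors the rule described in the text.
    """
    effective = list(feature_types or [])
    # Passing queries implies the QUERIES feature type.
    if queries and "QUERIES" not in effective:
        effective.append("QUERIES")
    if effective:
        return "AnalyzeDocument", effective
    return "DetectDocumentText", []


print(select_textract_api())                         # ('DetectDocumentText', [])
print(select_textract_api(["TABLES"]))               # ('AnalyzeDocument', ['TABLES'])
print(select_textract_api(queries=["Total due?"]))   # ('AnalyzeDocument', ['QUERIES'])
```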
|
|
||
| ### Basic text extraction | ||
|
|
||
| Extract plain text from a document with the default configuration, which calls `DetectDocumentText`: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter | ||
|
|
||
| converter = AmazonTextractConverter() | ||
| results = converter.run(sources=["document.png"]) | ||
| documents = results["documents"] | ||
|
|
||
| print(documents[0].content) | ||
| ``` | ||
|
|
||
| ### Table and form analysis | ||
|
|
||
| Use `AnalyzeDocument` to detect tables and forms by setting `feature_types`: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter | ||
|
|
||
| converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"]) | ||
| results = converter.run(sources=["invoice.png"]) | ||
|
|
||
| documents = results["documents"] | ||
| raw_responses = results["raw_textract_response"] | ||
| ``` | ||
|
|
||
| Valid `feature_types` values: `"TABLES"`, `"FORMS"`, `"SIGNATURES"`, `"LAYOUT"`. | ||
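For downstream processing, table cells can be rebuilt from a raw response. The sketch below assumes each entry of `raw_textract_response` is a dictionary in the shape returned by Textract's `AnalyzeDocument` (a `Blocks` list whose `TABLE` blocks point to `CELL` blocks, which point to `WORD` blocks, through `CHILD` relationships). `tables_from_response` is a hypothetical helper and `sample` is hand-written illustrative data, not real Textract output.

```python
def tables_from_response(response):
    """Rebuild each TABLE block as a list of rows of cell text."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships of a block.
        return [
            i
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for i in rel["Ids"]
        ]

    tables = []
    for block in response.get("Blocks", []):
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[w].get("Text", "")
                for w in child_ids(cell)
                if blocks[w]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def _cell(cid, row, col, word_ids):
    return {"Id": cid, "BlockType": "CELL", "RowIndex": row, "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": word_ids}]}


def _word(wid, text):
    return {"Id": wid, "BlockType": "WORD", "Text": text}


# Hand-written sample in the assumed response shape.
sample = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    _cell("c1", 1, 1, ["w1"]), _cell("c2", 1, 2, ["w2"]),
    _cell("c3", 2, 1, ["w3"]), _cell("c4", 2, 2, ["w4", "w5"]),
    _word("w1", "Item"), _word("w2", "Price"), _word("w3", "Widget"),
    _word("w4", "9.99"), _word("w5", "USD"),
]}

print(tables_from_response(sample))
# [[['Item', 'Price'], ['Widget', '9.99 USD']]]
```

Merged cells and selection elements are ignored here to keep the sketch short.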
|
|
||
| ### Natural-language queries | ||
|
|
||
| Ask questions about a document and get extracted answers. The `QUERIES` feature type is enabled automatically when you pass the `queries` parameter at runtime: | ||
|
|
||
| ```python | ||
| from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter | ||
|
|
||
| converter = AmazonTextractConverter() | ||
| results = converter.run( | |
|     sources=["medical_form.png"], | |
|     queries=["What is the patient name?", "What is the date of birth?"], | |
| ) | |
|
|
||
| documents = results["documents"] | ||
| raw_responses = results["raw_textract_response"] | ||
| ``` | ||
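Query answers can also be read directly from a raw response. The sketch below assumes each entry of `raw_textract_response` is a dictionary in the shape returned by Textract's `AnalyzeDocument`, where a `QUERY` block links to its `QUERY_RESULT` blocks through an `ANSWER` relationship. `answers_from_response` is a hypothetical helper and `sample` is hand-written illustrative data, not real Textract output.

```python
def answers_from_response(response):
    """Map each query text to its answer text (None if no answer was returned)."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        answer_ids = [
            i
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for i in rel["Ids"]
        ]
        texts = [blocks[i]["Text"] for i in answer_ids if i in blocks]
        answers[block["Query"]["Text"]] = " ".join(texts) if texts else None
    return answers


# Hand-written sample in the assumed response shape.
sample = {"Blocks": [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the patient name?"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "Jane Doe", "Confidence": 98.0},
    # A query without an ANSWER relationship has no result.
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the date of birth?"}},
]}

print(answers_from_response(sample))
# {'What is the patient name?': 'Jane Doe', 'What is the date of birth?': None}
```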
|
|
||
| Queries can be combined with `feature_types` for both structural and question-based extraction: | ||
|
|
||
| ```python | ||
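| from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter | |
|
|
||
| converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"]) | |
| results = converter.run( | |
|     sources=["invoice.png"], | |
|     queries=["What is the total amount due?"], | |
| ) | |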
| ``` | ||
|
|
||
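| ### Explicit credentials | |
|
|
||
| Instead of relying on the boto3 credential chain, pass credentials explicitly as `Secret` arguments: | |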
|
|
||
| ```python | ||
| from haystack.utils import Secret | ||
| from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter | ||
|
|
||
| converter = AmazonTextractConverter( | |
|     aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"), | |
|     aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"), | |
|     aws_region_name=Secret.from_token("us-east-1"), | |
| ) | |
| ``` | ||
|
|
||
| For more details on Amazon Textract capabilities and setup, refer to the [Amazon Textract documentation](https://docs.aws.amazon.com/textract/latest/dg/what-is.html). | ||
|
|
||
| ## License | |
|
|
||
| `amazon-textract-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license. | ||