Conversation

@Adityav369
Collaborator

No description provided.

@jazzberry-ai
jazzberry-ai bot commented Dec 3, 2025

Bug Report

Name: Missing Tesseract dependency for OCR functionality
Severity: High
Example test case:

  1. Ingest a document with images.
  2. Call db.retrieve_chunks with use_colpali=True and output_format="text".
  3. Observe that the OCR conversion fails because Tesseract is not installed.

Description: The output_format="text" option in retrieve_chunks and retrieve_chunks_grouped relies on Tesseract for OCR conversion. However, Tesseract is not included in the base Docker image, and it is not explicitly installed on all platforms (specifically, the self-hosting guide for Windows does not mention it). As a result, OCR conversion will fail in Docker and may fail on Windows, leading to a broken user experience. The documentation should state that Tesseract must be installed, and the Docker image should include it by default.
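Until the image and docs are updated, the failure in step 3 could be surfaced earlier with a pre-flight check. The following is a minimal sketch, not the project's actual code: `find_tesseract` and `require_tesseract` are hypothetical helper names, and it assumes the OCR path needs a `tesseract` executable on `PATH` (how the project actually invokes Tesseract is not shown in this report).

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


def require_tesseract(binary: str = "tesseract") -> str:
    """Return the executable path, or raise a clear error before any OCR is attempted.

    Hypothetical helper: the name and the error message are illustrative only.
    """
    path = find_tesseract(binary)
    if path is None:
        raise RuntimeError(
            'Tesseract was not found on PATH; output_format="text" needs it '
            "for OCR conversion. Install Tesseract and try again."
        )
    return path


if __name__ == "__main__":
    # Report the status without raising, e.g. as a container start-up check.
    location = find_tesseract()
    print(f"tesseract: {location}" if location else "tesseract: not installed")
```

Calling `require_tesseract()` before a request with output_format="text" would turn the late OCR failure into an immediate, descriptive error.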

@Adityav369 Adityav369 merged commit ad9fa73 into main Dec 3, 2025
7 checks passed
2 participants