Merged

Changes from all commits
33 changes: 33 additions & 0 deletions concepts/colpali.mdx
@@ -5,11 +5,11 @@

## Introduction

Up to now, we've seen RAG techniques that **i)** parse a given document, **ii)** convert it to text, and **iii)** embed the text for retrieval. These techniques have been particularly text-heavy. Embedding models expect text in, knowledge graphs expect text in, and parsers break down when provided with documents that aren't text-dominant. This motivates the question:

> When was the last time you looked at a document and only saw text?

Most business documents, research papers, reports, and presentations we encounter daily are rich visual experiences: tables organizing crucial data, charts illuminating trends, infographics explaining complex concepts, and visual layouts that guide our understanding. These visual elements aren't just decorative—they're fundamental to how information is communicated.

However, most RAG systems treat these elements as second-class citizens. They are either ignored, or captioned and embedded as text. This leads to poor retrieval performance, especially for tasks that require visual reasoning.

@@ -26,7 +26,7 @@
## How does it work?

### Embedding Process
The embedding process for ColPali borrows heavily from models like CLIP. That is, the vision encoder part of the model (as seen in the diagram above) is trained via a technique called **Contrastive Learning**. As we've discussed in previous explainers, an encoder is a function (usually a neural network or a transformer) that maps a given input to a fixed-length vector. Contrastive learning is a technique that allows us to train two encoders of different input types (such as image and text) to produce vectors in the "same embedding space". That is, the embedding of the word "dog" would be very close to the embedding of an image of a dog. The way we can achieve this is simple in theory:

1) Take a large dataset of image and text pairs.
2) Pass the image and text through the vision and text encoders respectively.
@@ -40,7 +40,7 @@
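
To make the contrastive objective concrete, here is a minimal NumPy sketch of a CLIP-style symmetric InfoNCE loss over a batch of paired image/text embeddings. This is an illustrative toy under the stated assumptions, not ColPali's actual training code:

```python
import numpy as np

def clip_style_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss. Both inputs are (batch, dim) and assumed
    L2-normalized, so row i of img_emb is the pair of row i of txt_emb."""
    logits = (img_emb @ txt_emb.T) / temperature  # (batch, batch) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l: np.ndarray) -> float:
        # Row-wise softmax cross-entropy with the diagonal as the target:
        # each embedding should rank its own pair above all in-batch negatives.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```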

### Retrieval Process

The retrieval process for ColPali borrows from late-interaction reranking techniques such as [ColBERT](https://arxiv.org/abs/2004.12832). The idea is that instead of directly embedding an image or an entire block of text, we embed individual patches and tokens. Then, instead of using the regular dot product or cosine similarity, we employ a slightly different scoring function: it looks at the most similar patches and tokens, and sums those similarities to obtain a final score.
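
To make the scoring concrete, here is a minimal NumPy sketch of that late-interaction (MaxSim) score. It assumes the query-token and document-patch embeddings have already been computed and L2-normalized:

```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray, doc_patches: np.ndarray) -> float:
    """MaxSim scoring: for each query token, find its most similar document
    patch, then sum those per-token maxima into a single relevance score.
    Shapes: query_tokens (n_tokens, dim), doc_patches (n_patches, dim)."""
    sims = query_tokens @ doc_patches.T   # (n_tokens, n_patches) cosine similarities
    return float(sims.max(axis=1).sum())  # best patch per token, summed
```

Ranking then amounts to computing this score against every candidate page and sorting, which is why the patch embeddings can be precomputed once at ingestion time.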

![ColBERT Architecture](/assets/colbert.png)

@@ -50,7 +50,7 @@

## How to use ColPali?

With Morphik, using ColPali is as simple as adding a single `true/false` parameter to the `ingest_file` function and the `query` function. Here is what an example ingestion pathway looks like:

```python
from morphik import Morphik
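
# (the middle of this example is collapsed in the diff; a minimal
# reconstruction, assuming a default client and an illustrative file name)
db = Morphik()
db.ingest_file("financial_report.pdf", use_colpali=True)

# Query with ColPali-backed retrieval: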
db.query("At what time-step did we see the highest GDP growth rate?", use_colpali=True)
```

So instead of having to implement the ColPali pipeline from scratch, you can use Morphik to do it for you in a single line of code!

## Controlling Output Format

When retrieving ColPali chunks (which are page images), you can control how the images are returned using the `output_format` parameter:

```python
# Return as base64-encoded data (default)
chunks = db.retrieve_chunks("quarterly results", use_colpali=True)

# Return as presigned URLs (useful for web UIs)
chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="url")

# Convert images to markdown text via OCR
chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="text")
```

The three output formats are:
- **`"base64"`** (default): Returns base64-encoded image data
- **`"url"`**: Returns presigned HTTPS URLs, convenient for LLMs and UIs that accept remote image URLs
- **`"text"`**: Converts page images to markdown text via OCR

### Choosing Between Formats

**base64 vs url**: Both formats pass images to LLMs for visual understanding and produce similar inference results. However, `url` is lighter on network transfer since only the URL is sent to your application (the LLM fetches the image directly). This can result in faster response times, especially when working with multiple images.
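
For instance, here is a minimal sketch of passing URL-format chunks to a vision-capable model. It assumes an OpenAI client, the `db` client from the earlier examples, and that each image chunk's `content` holds the presigned URL:

```python
from openai import OpenAI

client = OpenAI()
chunks = db.retrieve_chunks("quarterly results", use_colpali=True, output_format="url")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the quarterly results."},
            # The model fetches each image from its presigned URL, so only
            # these short URL strings travel through your application.
            *[{"type": "image_url", "image_url": {"url": chunk.content}} for chunk in chunks],
        ],
    }],
)
print(response.choices[0].message.content)
```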

**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when:

- You need **faster inference** speeds
- Your documents are **primarily text-based** (reports, articles, contracts)
- You're hitting **context length limits**

<Note>
If you're experiencing context limit issues with image-based retrieval, it may be because images aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
</Note>

28 changes: 24 additions & 4 deletions python-sdk/retrieve_chunks.mdx
@@ -1,6 +1,6 @@
---
title: "retrieve_chunks"
description: "Retrieve relevant chunks from Morphik"
---

<Tabs>
@@ -45,7 +45,10 @@
- `use_colpali` (bool, optional): Whether to use ColPali-style embedding model to retrieve the chunks (only works for documents ingested with `use_colpali=True`). Defaults to True.
- `folder_name` (str | List[str], optional): Optional folder scope. Accepts a single folder name or a list of folder names.
- `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks (ColPali only). Defaults to 0.
- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` to receive presigned URLs; omit or set to `"base64"` (default) to receive base64 content.
- `output_format` (str, optional): Controls how image chunks are returned:
  - `"base64"` (default): Returns base64-encoded image data
  - `"url"`: Returns presigned HTTPS URLs
  - `"text"`: Converts images to markdown text via OCR
- `query_image` (str, optional): Base64-encoded image for reverse image search. Mutually exclusive with `query`. Requires `use_colpali=True`.
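
A minimal call sketch combining several of these parameters (the query and folder name are illustrative, and `db` is assumed to be an authenticated Morphik client):

```python
chunks = db.retrieve_chunks(
    "what drove Q3 revenue growth?",
    use_colpali=True,                # required for image/page retrieval
    folder_name="finance-reports",   # hypothetical folder scope
    padding=1,                       # one extra page before/after each match
    output_format="url",             # presigned URLs instead of base64
)
```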

## Metadata Filters
@@ -126,7 +129,7 @@

The `FinalChunkResult` objects returned by this method have the following properties:

- `content` (str | PILImage): Chunk content (text or image)
- `score` (float): Relevance score
- `document_id` (str): Parent document ID
- `chunk_number` (int): Chunk sequence number
@@ -135,13 +138,30 @@
- `filename` (Optional[str]): Original filename
- `download_url` (Optional[str]): URL to download full document
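
For example, a quick pass over results (a sketch that assumes `chunks` came from a prior `retrieve_chunks` call):

```python
for chunk in chunks:
    print(f"{chunk.document_id}#{chunk.chunk_number}  score={chunk.score:.3f}  file={chunk.filename}")
```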

## Image URL output
## Output Format Options

- When `output_format="url"` is provided, image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
- When `output_format` is omitted or set to `"base64"` (default), image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
- **`"base64"` (default)**: Image chunks are returned as base64 data (the SDK attempts to decode these into a `PIL.Image` for `FinalChunkResult.content`).
- **`"url"`**: Image chunks are returned as presigned HTTPS URLs in `content`. This is convenient for UIs and LLMs that accept remote image URLs (e.g., via `image_url`).
- **`"text"`**: Image chunks are converted to markdown text via OCR. Use this when you need faster inference or when documents are mostly text-based.
- Text chunks are unaffected by `output_format` and are always returned as strings.
- The `download_url` field may be populated for image chunks. When using `output_format="url"`, it will typically match `content` for those chunks.

### When to Use Each Format

| Format | Best For |
|--------|----------|
| `base64` | Direct image processing, local applications |
| `url` | Web UIs, LLMs with vision capabilities (lighter on network) |
| `text` | Faster inference, text-heavy documents, context length concerns |

<Note>
**base64 vs url**: Both formats pass images to LLMs for visual understanding and produce similar results. However, `url` is lighter on network transfer since only the URL is sent to your application (the LLM fetches the image directly). This can result in faster response times, especially with multiple images.

**When to use text**: Passing images to LLMs for inference can be slow and consume significant context tokens. Use `output_format="text"` when you need faster inference speeds or when your documents are primarily text-based.

If you're hitting context limits with images, it may be because they aren't being passed correctly to the model. See [Generating Completions with Retrieved Chunks](/cookbooks/generating-completions-with-retrieved-chunks) for examples of properly passing images (both base64 and URLs) to vision-capable models like GPT-4o.
</Note>

Tip: To download the original raw file for a document, use [`get_document_download_url`](./get_document_download_url).

## Reverse Image Search
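
A minimal sketch (the file name is illustrative; per the parameter docs above, `query_image` is base64-encoded, mutually exclusive with `query`, and requires `use_colpali=True`):

```python
import base64

with open("sample_chart.png", "rb") as f:
    query_image = base64.b64encode(f.read()).decode("utf-8")

# Find pages that visually match the query image
chunks = db.retrieve_chunks(query_image=query_image, use_colpali=True)
```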
5 changes: 4 additions & 1 deletion python-sdk/retrieve_chunks_grouped.mdx
@@ -1,5 +1,5 @@
---
title: "retrieve_chunks_grouped"
description: "Retrieve relevant chunks with grouping for UI display"
---

@@ -53,11 +53,14 @@
- `k` (int, optional): Number of results. Defaults to 4.
- `min_score` (float, optional): Minimum similarity threshold. Defaults to 0.0.
- `use_colpali` (bool, optional): Whether to use ColPali-style embedding model. Defaults to True.
- `use_reranking` (bool, optional): Override workspace reranking configuration for this request.
- `folder_name` (str | List[str], optional): Optional folder scope (single name or list of names)
- `end_user_id` (str, optional): Optional end-user scope
- `padding` (int, optional): Number of additional chunks/pages to retrieve before and after matched chunks. Defaults to 0.
- `output_format` (str, optional): Controls how image chunks are returned. Set to `"url"` for presigned URLs or `"base64"` (default) for base64 content.
- `output_format` (str, optional): Controls how image chunks are returned:
  - `"base64"` (default): Returns base64-encoded image data
  - `"url"`: Returns presigned HTTPS URLs
  - `"text"`: Converts images to markdown text via OCR (faster inference, best for text-heavy documents)
- `graph_name` (str, optional): Name of the graph to use for knowledge graph-enhanced retrieval
- `hop_depth` (int, optional): Number of relationship hops to traverse in the graph. Defaults to 1.
- `include_paths` (bool, optional): Whether to include relationship paths in the response. Defaults to False.
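
A minimal call sketch (the query, folder, and graph names are illustrative, and `db` is assumed to be a Morphik client):

```python
result = db.retrieve_chunks_grouped(
    "which suppliers caused shipment delays?",
    k=6,
    padding=1,                   # pull neighboring pages into each group
    output_format="url",
    graph_name="supply-chain",   # hypothetical knowledge graph
    hop_depth=2,                 # traverse up to 2 relationship hops
    include_paths=True,          # include relationship paths in the response
)
```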
@@ -182,7 +185,7 @@

- This method is similar to [`retrieve_chunks`](./retrieve_chunks) but provides additional grouping for UI display.
- The `chunks` list provides backward compatibility with flat chunk lists.
- The `groups` list organizes results with their padding context, ideal for building search result UIs.
- When `padding` is specified, surrounding chunks are included in `padding_chunks` for each group.
- Knowledge graph parameters (`graph_name`, `hop_depth`, `include_paths`) enable graph-enhanced retrieval.

7 changes: 3 additions & 4 deletions self-hosting.mdx
@@ -1,9 +1,9 @@
---
title: "Installation"
description: "Install Morphik on your own infrastructure"
---

For users who need to run Morphik on their own infrastructure, we provide two installation options: Direct Installation and Docker.

<Tabs>
<Tab title="Self Host - Direct Installation (Advanced)">
@@ -103,22 +103,21 @@
<Tab title="macOS">
```bash
# Install via Homebrew
-brew install poppler tesseract libmagic
+brew install poppler libmagic
```
</Tab>
<Tab title="Ubuntu/Debian">
```bash
# Install via apt
sudo apt-get update
-sudo apt-get install -y poppler-utils tesseract-ocr libmagic-dev
+sudo apt-get install -y poppler-utils libmagic-dev
```
</Tab>
<Tab title="Windows">
For Windows, you may need to install these dependencies manually:

1. **Poppler**: Download from [poppler for Windows](https://github.com/oschwartz10612/poppler-windows/releases/)
-2. **Tesseract**: Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)
-3. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
+2. **libmagic**: This is included in the python-magic-bin package which will be installed with pip
</Tab>
</Tabs>
If you encounter database initialization issues within Docker, you may need to manually initialize the schema: