@@ -2,144 +2,130 @@
title: Configure the Chunk Settings
---

After uploading content to the knowledge base, the next step is chunking and data cleaning. **This stage involves content preprocessing and structuring, where long texts are divided into multiple smaller chunks.**
## What is Chunking?

<Accordion title="What is the Chunking and Cleaning Strategy?">
* **Chunking**
Documents imported into knowledge bases are split into smaller segments called **chunks**. Think of chunking like organizing a large book into chapters and paragraphs—you can't quickly find specific information in one massive block of text, but well-organized sections make retrieval efficient.

Due to the limited context window of LLMs, it is hard to process and transmit the entire knowledge base content at once. Instead, long texts in documents must be split into content chunks. Even though some advanced models now support uploading complete documents, studies show that retrieval efficiency remains weaker compared to querying individual content chunks.
When users ask questions, the system searches through these chunks for relevant information and provides it to the LLM as context. Without chunking, processing entire documents for every query would be slow and inefficient.

The ability of an LLM to accurately answer questions based on the knowledge base depends on the retrieval effectiveness of content chunks. This process is similar to finding key chapters in a manual for quick answers, without the need to analyze the entire document line by line. After chunking, the knowledge base uses a Top-K retrieval method to identify the most relevant content chunks based on user queries, supplementing key information to enhance the accuracy of responses.
**Key Chunk Parameters**

The size of the content chunks is critical during semantic matching between queries and chunks. Properly sized chunks enable the model to locate the most relevant content accurately while minimizing noise. Overly large or small chunks can negatively impact retrieval effectiveness.
- **Delimiter**: The character or sequence where text is split. For example, `\n\n` splits at paragraph breaks, `\n` at line breaks.

Dify offers two chunking modes: **General Mode** and **Parent-child Mode**, tailored to different document structures and application scenarios. These modes are designed to meet varying requirements for retrieval efficiency and accuracy in knowledge bases.
<Note>
Delimiters are removed during chunking. For example, using `A` as the delimiter splits `CBACD` into `CB` and `CD`.

To avoid information loss, use non-content characters that don't naturally appear in your documents.
</Note>

* **Cleaning**
- **Maximum chunk length**: The maximum size of each chunk in characters. Text exceeding this limit is force-split regardless of delimiter settings.
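
The interaction between the two parameters can be sketched roughly as follows. This is a simplified illustration, not Dify's actual splitter: the function name `chunk_text` and its signature are hypothetical, and a real implementation may also merge short pieces or measure length differently.

```python
def chunk_text(text: str, delimiter: str, max_length: int) -> list[str]:
    """Hypothetical sketch: split at the delimiter, then force-split long pieces."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks: list[str] = []
    # str.split drops the delimiter itself, matching the note above.
    for piece in text.split(delimiter):
        if not piece:
            continue  # skip empty pieces produced by adjacent delimiters
        # Any piece longer than max_length is cut regardless of the delimiter.
        for start in range(0, len(piece), max_length):
            chunks.append(piece[start:start + max_length])
    return chunks
```

With these assumptions, `chunk_text("CBACD", "A", max_length=10)` returns `['CB', 'CD']`, and `chunk_text("abcdefgh\n\nij", "\n\n", max_length=5)` returns `['abcde', 'fgh', 'ij']` because the first paragraph exceeds the limit.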

To ensure effective text retrieval, it’s essential to clean the data before uploading it to the knowledge base. For instance, meaningless characters or empty lines can affect the quality of query responses and should be removed.
</Accordion>
## Choose a Chunk Mode

Whether an LLM can accurately answer knowledge base queries depends on how effectively the system retrieves relevant content chunks. High-relevance chunks are crucial for AI applications to produce precise and comprehensive responses.
<Note>
The chunk mode cannot be changed once the knowledge base is created. However, chunk settings like the delimiter and maximum chunk length can be adjusted at any time.
</Note>

In an AI customer chatbot scenario, for example, directing the LLM to the key content chunks in a tool manual is sufficient to quickly answer user questions—no need to repeatedly analyze the entire document. This approach saves tokens during the analysis phase while boosting the overall quality of the AI-generated answers.
### Mode Overview

### Chunk Mode
<Tabs>
<Tab title="General">

The knowledge base supports two chunking modes: **General Mode** and **Parent-child Mode**. If you are creating a knowledge base for the first time, it is recommended to choose Parent-Child Mode.
In General mode, all chunks share the same settings. Matched chunks are returned directly as retrieval results.

<Info>
**Please note**: The original **“Automatic Chunking and Cleaning”** mode has been automatically updated to **“General”** mode. No changes are required, and you can continue to use the default setting.
**Chunk Settings**

Once the chunk mode is selected and the knowledge base is created, it cannot be changed later. Any new documents added to the knowledge base will follow the same chunking strategy.
</Info>
Beyond delimiter and maximum chunk length, you can also configure **Chunk overlap** to specify how many characters overlap between adjacent chunks. This helps preserve semantic connections and prevents important information from being split across chunk boundaries.

![General mode and Parent-child mode](https://assets-docs.dify.ai/2024/12/b3052a6aae6e4d0e5701dde3a859e326.png)
For example, with a 50-character overlap, the last 50 characters of one chunk will also appear as the first 50 characters of the next chunk.
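
A minimal sketch of the overlap idea, assuming fixed-size character windows. The helper `sliding_chunks` is hypothetical and ignores delimiters for brevity; it only illustrates how adjacent chunks come to share characters.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Hypothetical sketch: fixed-size chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far each new chunk starts after the previous one
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks
```

For instance, `sliding_chunks("abcdefghij", size=4, overlap=2)` returns `['abcd', 'cdef', 'efgh', 'ghij']`: the last two characters of each chunk reappear as the first two of the next.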

</Tab>
<Tab title="Parent-child">

#### General Mode
In Parent-child mode, text is split into two tiers: smaller **child chunks** and larger **parent chunks**. When a query matches a child chunk, its entire parent chunk is returned as the retrieval result.

Content will be divided into independent chunks. When a user submits a query, the system automatically calculates the relevance between the chunks and the query keywords. The top-ranked chunks are then retrieved and sent to the LLM for processing the answers.
This solves a common retrieval dilemma: smaller chunks enable precise query matching but lack context, while larger chunks provide rich context but reduce retrieval accuracy.

In this mode, you need to manually define text chunking rules based on different document formats or specific scenario requirements. Refer to the following configuration options for guidance:
Parent-child mode balances both—retrieving with precision and responding with context.
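
The two-tier lookup can be sketched as below. This is an illustration only: the names `ChildChunk`, `build_index`, and `retrieve` are hypothetical, the `\n` child delimiter is an assumption, and shared-word counting stands in for the embedding-based matching a real knowledge base would use.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_index(parents: list[str], child_delimiter: str = "\n") -> list[ChildChunk]:
    """Split every parent chunk into child chunks that remember their parent."""
    return [
        ChildChunk(piece.strip(), parent_id)
        for parent_id, parent in enumerate(parents)
        for piece in parent.split(child_delimiter)
        if piece.strip()
    ]


def score(query: str, text: str) -> int:
    """Toy relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, parents: list[str], children: list[ChildChunk]) -> str:
    """Match the query against child chunks, then return the whole parent chunk."""
    best = max(children, key=lambda child: score(query, child.text))
    return parents[best.parent_id]
```

Matching happens on the small child chunks, but the caller receives the full parent text, which is the trade-off this mode is designed around.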

* **Chunk identifier**: The system will automatically execute chunking whenever it detects the specified delimiter. The default value is `\n\n`, which means the text will be chunked by paragraphs.
**Parent Chunk Settings**

![Chunk results of different identifier syntaxes](https://assets-docs.dify.ai/2024/12/2c19c1c1a0446c00e3c07d6f4c8968e4.png)
* **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking.
* **Overlapping chunk length**: When data is chunked, adjacent chunks share a certain amount of overlapping content. This overlap helps retain information across chunk boundaries and improves retrieval results. A setting of 10-25% of the maximum chunk length (in tokens) is recommended.
Parent chunks can be created in **Paragraph** or **Full Doc** mode.

**Text Preprocessing Rules**: These rules help filter out irrelevant content from the knowledge base. The following options are available:
<Tabs>
<Tab title="Paragraph">
The document is split into multiple parent chunks based on the specified delimiter and maximum chunk length.

* Replace consecutive spaces, newline characters, and tabs
* Remove all URLs and email addresses
Suitable for lengthy documents with well-structured sections where each section provides meaningful context independently.
</Tab>
<Tab title="Full Doc">
The entire document serves as a single parent chunk.

Once configured, click **“Preview Chunk”** to see the chunking results. You can view the character count for each chunk. If you modify the chunking rules, click the button again to view the latest generated text chunks.
Suitable for small, cohesive documents where the full context is essential for understanding any specific detail.

If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.
<Note>
In **Full Doc** mode:

![General mode](https://assets-docs.dify.ai/2024/12/b3ec2ce860550563234ca22967abdd17.png)
- Only the first 10,000 tokens are processed. Content beyond this limit will be truncated.

After setting the chunking rules, the next step is to specify the indexing method. General mode supports the **High-Quality Indexing Method** and the **Economical Indexing Method**. For more details, please refer to [Set up the Indexing Method](/en/use-dify/knowledge/create-knowledge/setting-indexing-methods).
- The parent chunk cannot be edited once created. To modify it, you must upload a new document.

#### Parent-child Mode
</Note>
</Tab>
</Tabs>

Compared to **General mode**, **Parent-child mode** uses a two-tier data structure that balances precise retrieval with comprehensive context, combining accurate matching and richer contextual information.
**Child Chunk Settings**

In this mode, parent chunks (e.g., paragraphs) serve as larger text units to provide context, while child chunks (e.g., sentences) focus on pinpoint retrieval. The system searches child chunks first to ensure relevance, then fetches the corresponding parent chunk to supply the full context—thereby guaranteeing both accuracy and a complete background in the final response. You can customize how parent and child chunks are split by configuring delimiters and maximum chunk lengths.
Each parent chunk is further split into child chunks using their own delimiter and maximum chunk length settings.

For example, in an AI-powered customer chatbot case, a user query can be mapped to a specific sentence within a support document. The paragraph or chapter containing that sentence is then provided to the LLM, filling in the overall context so the answer is more precise.
</Tab>
</Tabs>

Its fundamental mechanism includes:
### Quick Comparison

* **Query Matching with Child Chunks:**
* Small, focused pieces of information, often as concise as a single sentence within a paragraph, are used to match the user's query.
* These child chunks enable precise and relevant initial retrieval.
* **Contextual Enrichment with Parent Chunks:**
* Larger, encompassing sections—such as a paragraph, a section, or even an entire document—that include the matched child chunks are then retrieved.
* These parent chunks provide comprehensive context for the Language Model (LLM).
| Dimension | General Mode | Parent-child Mode |
|:----------|:-------------|:------------------|
| Chunking Strategy | Single-tier: all chunks use the same settings | Two-tier: separate settings for parent and child chunks |
| Retrieval Workflow | Matched chunks are directly returned | Child chunks are used for matching queries; parent chunks are returned to provide broader context |
| Compatible [Index Method](/en/use-dify/knowledge/create-knowledge/setting-indexing-methods) | High Quality, Economical | High Quality only |
| Best For | Simple, self-contained content like glossaries or FAQs | Information-dense documents like technical manuals or research papers where context matters |

![Parent-child mode schematic](https://assets-docs.dify.ai/2024/12/3e6820c10bd7c5f6884930e3a14e7b66.png)
## Pre-process Text Before Chunking

In this mode, you need to manually configure separate chunking rules for both parent and child chunks based on different document formats or specific scenario requirements.
Before splitting text into chunks, you can clean up irrelevant content to improve retrieval quality.

**Parent Chunk**
- **Replace consecutive spaces, newlines, and tabs**

The parent chunk settings offer the following options:
- Three or more consecutive newlines → two newlines

* **Paragraph**
- Multiple spaces → single space

This mode splits the text into paragraphs based on delimiters and the maximum chunk length, using the split text as the parent chunk for retrieval. Each paragraph is treated as a parent chunk, suitable for documents with large volumes of text, clear content, and relatively independent paragraphs. The following settings are supported:
- Tabs, form feeds, and special Unicode spaces → regular space

* **Chunk Delimiter**: The system automatically chunks the text whenever the specified delimiter appears. The default value is `\n\n`, which chunks text by paragraphs.
* **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking.
* **Full Doc**
- **Remove all URLs and email addresses**\

Instead of splitting the text into paragraphs, the entire document is used as the parent chunk and retrieved directly. For performance reasons, only the first 10,000 tokens of the text are retained. This setting is ideal for smaller documents where paragraphs are interrelated, requiring full doc retrieval.
<Info>
This setting is ignored in **Full Doc** mode.
</Info>
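
The cleaning rules above can be approximated with a few regular expressions. This is a hedged sketch, not the product's implementation: the function names, the order of the substitutions, the exact set of "special Unicode spaces", and the URL and email patterns are all assumptions made for illustration.

```python
import re

# Assumed set of whitespace-like characters: tab, form feed, and a few Unicode spaces.
SPECIAL_SPACES = re.compile(r"[\t\f\u00a0\u2000-\u200a\u202f\u205f\u3000]")
# Deliberately simple patterns; real-world URL and email matching is more involved.
EMAIL = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9.-]+")
URL = re.compile(r"https?://\S+")


def clean_whitespace(text: str) -> str:
    """Collapse redundant newlines and spaces, normalising unusual spaces first."""
    text = re.sub(r"\n{3,}", "\n\n", text)  # three or more newlines -> two
    text = SPECIAL_SPACES.sub(" ", text)    # tabs, form feeds, Unicode spaces -> space
    text = re.sub(r" {2,}", " ", text)      # multiple spaces -> single space
    return text


def remove_urls_and_emails(text: str) -> str:
    """Delete anything that looks like an email address or an http(s) URL."""
    text = EMAIL.sub("", text)
    return URL.sub("", text)
```

Running `clean_whitespace` after `remove_urls_and_emails` also tidies the double spaces that deletions leave behind.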

![The difference between Paragraph and Full Doc](https://assets-docs.dify.ai/2024/12/e3814336710d445a99a9ded3d251622b.png)
## Enable Summary Auto-Gen

**Child Chunk**
<Info>Available for self-hosted deployments only.</Info>

Child chunks are derived from parent chunks by splitting them based on delimiter rules. They are used to identify and match the most relevant and direct information to the query keywords. When using the default child chunking rules, the segmentation typically results in the following:
Automatically generate summaries for all chunks to enhance their retrievability.

* If the parent chunk is a paragraph, child chunks correspond to individual sentences within each paragraph.
* If the parent chunk is the full document, child chunks correspond to the individual sentences within the document.
Summaries are embedded and indexed for retrieval as well. When a summary matches a query, its corresponding chunk is also returned.

You can configure the following chunk settings:
You can manually edit auto-generated summaries or regenerate them for specific documents later. See [Manage Knowledge Content](/en/use-dify/knowledge/manage-knowledge/maintain-knowledge-documents) for details.

* **Chunk Delimiter**: The system automatically chunks the text whenever the specified delimiter appears. The default value is `\n`, which chunks text by sentences.
* **Maximum chunk length:** Specifies the maximum number of text characters allowed per chunk. If this limit is exceeded, the system will automatically enforce chunking.
<Tip>
If you select a vision-capable LLM, summaries will be generated based on both the chunk text and any attached images.
</Tip>

You can also use **text preprocessing rules** to filter out irrelevant content from the knowledge base:
## Preview Chunks

* Replace consecutive spaces, newline characters, and tabs
* Remove all URLs and email addresses
Click **Preview** to see how your content will be chunked. A limited number of chunks will be displayed for a quick review.

After completing the configuration, click **“Preview Chunks”** to view the results. You can see the total character count of the parent chunk.
If the results don't perfectly match your expectations, choose the closest configuration—you can manually fine-tune chunks later. See [Manage Knowledge Content](/en/use-dify/knowledge/manage-knowledge/maintain-knowledge-documents) for details.

Once configured, click **“Preview Chunk”** to see the chunking results. You can see the total character count of the parent chunk. Characters highlighted in blue represent child chunks, and the character count for the current child chunk is also displayed for reference.

If you modify the chunking rules, click the button again to view the latest generated text chunks.

If multiple documents are uploaded in bulk, you can switch between them by clicking the document titles at the top to review the chunk results for other documents.

![Parent-child mode](https://assets-docs.dify.ai/2024/12/af5c9a68f85120a6ea687bf93ecfb80a.png)

To ensure accurate content retrieval, the Parent-child chunk mode only supports the [High-Quality Indexing](/en/use-dify/knowledge/create-knowledge/setting-indexing-methods#high-quality) method.

### What's the Difference Between Two Modes?

The difference between the two modes lies in the structure of the content chunks. **General Mode** produces multiple independent content chunks, whereas **Parent-child Mode** uses a two-layer chunking approach. In this way, a single parent chunk (e.g., the entire document or a paragraph) contains multiple child chunks (e.g., sentences).

Different chunking methods influence how effectively the LLM can search the knowledge base. When used on the same document, Parent-child Retrieval provides more comprehensive context while maintaining high precision, making it significantly more effective than the traditional single-layer approach.

![The Difference Between General Mode and Parent-child Mode](https://assets-docs.dify.ai/2024/12/0b614c6a07c6ea2151fe17d85ce6a1d1.png)

### Reference

After choosing the chunking mode, refer to the following documentation to configure the indexing method and retrieval method and finish the creation of your knowledge base.


<Card title="Specify the Index Method and Retrieval Settings" icon="link" href="/en/use-dify/knowledge/create-knowledge/setting-indexing-methods">
Check for more details.
</Card>
For multiple documents, click the file name at the top of the preview panel to switch between them.
@@ -432,6 +432,19 @@ You can also refer to the table below for information on configuring chunk struc
| Parent-child Mode | High Quality Only | Embedding Model | Vector Retrieval <br /> Full-text Retrieval <br /> Hybrid Retrieval |
| Q&A Mode | High Quality Only | Embedding Model | Vector Retrieval <br /> Full-text Retrieval <br /> Hybrid Retrieval |

### Summary Auto-Gen

<Info>Available for self-hosted deployments only.</Info>

Automatically generate summaries for all chunks to enhance their retrievability.

Summaries are embedded and indexed for retrieval as well. When a summary matches a query, its corresponding chunk is also returned.

You can manually edit auto-generated summaries or regenerate them for specific documents later. See [Manage Knowledge Content](/en/use-dify/knowledge/manage-knowledge/maintain-knowledge-documents) for details.

<Tip>
If you select a vision-capable LLM, summaries will be generated based on both the chunk text and any attached images.
</Tip>
---

## Step 4: Create User Input Form