Bug: TextChunker.SplitPlainTextParagraphs sometimes overcounts chunk sizes

Hi,

The `TextChunker` occasionally overcounts the size of the last paragraph because the orphan-chunk gluing logic in `ProcessParagraphs` uses the [number of words instead of the number of tokens](https://github.com/microsoft/semantic-kernel/blob/5f282a912853ebf39ef39398a8f9d48a952f63bb/dotnet/src/SemanticKernel.Core/Text/TextChunker.cs#L195-L201), which can cause the last chunk to exceed the target length. As a result, `TextChunker.SplitPlainTextParagraphs` sometimes returns chunks that are too large, causing (frequently silent) loss of information when generating embeddings or using a reranker.
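To make the mismatch concrete, here is a minimal sketch of the two ways a merge check could measure the glued chunk. It is not the Semantic Kernel source: the class `OrphanMergeSketch`, its method names, and the stand-in counter (roughly one token per four characters) are all assumptions made for illustration.

```csharp
using System;

// Hypothetical illustration only; not the Semantic Kernel implementation.
public static class OrphanMergeSketch
{
    // Stand-in token counter (assumption): roughly one token per 4 characters, rounded up.
    public static int CountTokens(string text) => (text.Length + 3) / 4;

    // Number of whitespace-separated words in the text.
    public static int CountWords(string text) =>
        text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

    // Merge check that compares word counts against a token budget
    // (the kind of mismatch described in this issue).
    public static bool FitsByWords(string previous, string last, int maxTokens) =>
        CountWords(previous) + CountWords(last) <= maxTokens;

    // Merge check that measures the merged text with the same counter
    // that defines the budget.
    public static bool FitsByTokens(string previous, string last, int maxTokens) =>
        CountTokens(previous + " " + last) <= maxTokens;
}
```

With a budget of 10 tokens, `FitsByWords("internationalization considerations", "notwithstanding", 10)` returns `true` (3 words), while `FitsByTokens` with the same arguments returns `false` under the stand-in counter (51 characters, 13 tokens). A word-based check would therefore glue the two chunks together even though the result is over budget.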

**Platform**
 - Language: C#
 - Source: 


Bug: TextChunker.SplitPlainTextParagraphs sometimes overcounts chunk sizes #13713
