Bug: TextChunker.SplitPlainTextParagraphs sometimes overcounts chunk sizes #13713

@onyxmaster

Description

Hi,

TextChunker occasionally produces a last paragraph that exceeds the target length, because the orphan-chunk gluing logic in ProcessParagraphs counts words instead of tokens. As a result, TextChunker.SplitPlainTextParagraphs sometimes returns chunks that are too large, causing (frequently silent) loss of information when generating embeddings or using a reranker.
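
The mismatch can be illustrated with a minimal, self-contained sketch. This is not the library's code: `OrphanMergeSketch`, `MergeTrailingChunk`, and the toy `CountTokens` heuristic (about one token per four characters) are hypothetical names and assumptions made up for illustration. The sketch only shows how checking the merged size in words can let a chunk through that is over the limit in tokens.

```csharp
using System;
using System.Collections.Generic;

public static class OrphanMergeSketch
{
    // Toy token counter (assumption): roughly one token per 4 characters, rounded up.
    public static int CountTokens(string text) => (text.Length + 3) / 4;

    // Word count: number of space-separated, non-empty pieces.
    public static int CountWords(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

    // Hypothetical orphan-gluing step: if the last chunk is short, append it to the
    // previous chunk, provided the combined size reported by `measure` fits maxTokens.
    public static List<string> MergeTrailingChunk(
        List<string> chunks, int maxTokens, Func<string, int> measure)
    {
        var result = new List<string>(chunks);
        if (result.Count < 2)
        {
            return result;
        }

        string last = result[result.Count - 1];
        string previous = result[result.Count - 2];

        bool lastIsShort = CountTokens(last) < maxTokens / 4;
        bool fits = measure(previous) + measure(last) <= maxTokens;

        if (lastIsShort && fits)
        {
            result[result.Count - 2] = previous + " " + last;
            result.RemoveAt(result.Count - 1);
        }

        return result;
    }
}
```

With a limit of 16 tokens and the chunks `"internationalization standardization characterization"` and `"tokenization"`, passing `OrphanMergeSketch.CountWords` as `measure` merges the two chunks (4 words fit the limit), and the merged chunk is 17 tokens under the toy counter. Passing `OrphanMergeSketch.CountTokens` instead leaves the chunks separate, since 14 + 3 tokens already exceed 16.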

Platform

  • Language: C#
  • Source:


Labels: bug (Something isn't working)
