@VedantMadane

@VedantMadane VedantMadane commented Jan 17, 2026

Summary

Fixes #8798

This PR expands the functionality of DocumentCleaner with two new parameters:

1. strip_whitespace: bool = False

When True, removes leading and trailing whitespace from document content using Python's str.strip().
Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).

2. regex_replace: dict[str, str] | None = None

A dictionary mapping regex patterns to replacement strings. This allows custom replacements instead of just removal. For example:

  • {r'\n\n+': '\n'} replaces multiple consecutive newlines with a single newline
  • {r'\s{2,}': ' '} replaces each run of two or more whitespace characters (including newlines) with a single space
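
The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `apply_replacements` is a hypothetical helper, and the sample text is made up.

```python
import re


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    # Hypothetical helper (not the PR's code): apply each pattern in
    # dictionary insertion order using re.sub.
    for pattern, replacement in replacements.items():
        text = re.sub(pattern, replacement, text)
    return text


raw = "\n\n  # Title\n\n\n\nSome   text\n\n"

# str.strip() only touches the two ends; the internal newlines and the
# triple space stay as they are.
stripped = raw.strip()

# Collapse runs of two or more newlines into a single newline.
collapsed = apply_replacements(stripped, {r"\n\n+": "\n"})

# \s also matches newlines, so this pattern flattens line breaks as well.
flattened = apply_replacements(stripped, {r"\s{2,}": " "})

print(repr(stripped))
print(repr(collapsed))
print(repr(flattened))
```

Note that the second pattern is broader than it may look: because `\s` covers newlines and tabs, it merges lines, which may not be wanted for markdown content.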

Changes

  • Added strip_whitespace parameter to DocumentCleaner.__init__()
  • Added regex_replace parameter to DocumentCleaner.__init__()
  • Added _replace_regex() method for custom regex replacements
  • Added comprehensive unit tests for both new features

Test plan

  • Added unit tests for strip_whitespace
  • Added unit tests for regex_replace with single/multiple patterns
  • Added test for combined usage of both features
  • Added test for initialization with new parameters

Add two new parameters to DocumentCleaner:

1. strip_whitespace - removes leading/trailing whitespace using str.strip()

2. regex_replace - maps regex patterns to replacement strings

Fixes deepset-ai#8798
@VedantMadane VedantMadane requested a review from a team as a code owner January 17, 2026 11:28
@VedantMadane VedantMadane requested review from julian-risch and removed request for a team January 17, 2026 11:28
@vercel

vercel bot commented Jan 17, 2026

@VedantMadane is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

@CLAassistant

CLAassistant commented Jan 17, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jan 17, 2026
@VedantMadane VedantMadane marked this pull request as draft January 17, 2026 11:36
@VedantMadane VedantMadane marked this pull request as ready for review January 17, 2026 11:57
Member

@julian-risch julian-risch left a comment

Thank you for opening this pull request, @VedantMadane! The changes already look quite good to me.
My main change request is to keep "\f" unchanged (see the implementation of _remove_regex and my inline comment).
The only other thing I would suggest changing before we merge the PR is the parameter names: strip_whitespaces instead of strip_whitespace, and replace_regexes instead of regex_replace. That way, the naming is more consistent with the other parameter names (remove_extra_whitespaces, remove_regex).
Besides that, I'd like to note that remove_regex is a subset of replace_regexes and in a future version of Haystack, we could decide to deprecate and then later remove remove_regex. However, for now, I'd like to avoid a breaking change. 👍
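
The subset relationship mentioned above can be illustrated with plain `re`, assuming that removing a pattern amounts to substituting the empty string. The sample text and pattern are made up, and nothing here calls the actual DocumentCleaner.

```python
import re

text = "Page 3 of 10: results"
pattern = r"Page \d+ of \d+: "

# Removal expressed directly as a substitution with the empty string.
removed = re.sub(pattern, "", text)

# The same effect expressed as a pattern-to-replacement mapping with an
# empty replacement string.
via_mapping = text
for pat, repl in {pattern: ""}.items():
    via_mapping = re.sub(pat, repl, via_mapping)

print(removed)
print(via_mapping)
```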

cleaner = DocumentCleaner(
remove_empty_lines=False,
remove_extra_whitespaces=False,
regex_replace={r"\[REDACTED\]": "***", r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1"},
Member

Just a note for our team: This is quite powerful and I think we could include such an example in our documentation. That would make it easier for users to understand what the parameter can be used for.
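
As a rough idea of what such a documentation example could show, the two patterns from the test above can be applied with plain `re.sub`. The input sentence is made up, and the loop is only an assumption about how the mapping is applied (one pattern after another, in dictionary order).

```python
import re

replacements = {
    r"\[REDACTED\]": "***",                      # literal "[REDACTED]" -> "***"
    r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1",     # YYYY-MM-DD -> MM/DD/YYYY
}

text = "Name: [REDACTED], joined 2024-01-15"
for pattern, replacement in replacements.items():
    text = re.sub(pattern, replacement, text)

print(text)
```

The second entry relies on backreferences (`\1`, `\2`, `\3`) in the replacement string, which is what makes replacement more expressive than plain removal.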

:param text: Text to clean.
:param regex_replace: A dictionary mapping regex patterns to their replacement strings.
:returns: The text with the regex matches replaced by the specified strings.
"""
Member

We should do texts = text.split("\f") here first, then apply the logic to each text in texts and then return "\f".join(cleaned_text)

Member

Using an additional variable cleaned_text instead of the input text is also preferred.

Member

For a detailed explanation of why we want to keep "\f", you can have a look at #8078 in case you are interested.
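
The suggestion in this thread can be sketched roughly as follows. This is an assumption about the shape of the fix, not the merged code: `replace_regexes_per_page` is a hypothetical standalone function, and the sample text is made up.

```python
import re


def replace_regexes_per_page(text: str, replacements: dict[str, str]) -> str:
    # Hypothetical sketch: split on form feeds so that no pattern can
    # match (and thereby delete) a page separator.
    pages = text.split("\f")
    cleaned_pages = []
    for page in pages:
        # Work on a separate variable instead of overwriting the input.
        cleaned_page = page
        for pattern, replacement in replacements.items():
            cleaned_page = re.sub(pattern, replacement, cleaned_page)
        cleaned_pages.append(cleaned_page)
    # Re-insert exactly the separators that were split on.
    return "\f".join(cleaned_pages)


sample = "page one  end\f  page two"
rules = {r"\s{2,}": " "}

# Applying the pattern to the whole text swallows the page break,
# because \s also matches "\f".
naive = re.sub(r"\s{2,}", " ", sample)

# Applying it page by page keeps the "\f" in place.
per_page = replace_regexes_per_page(sample, rules)

print(repr(naive))
print(repr(per_page))
```

The split-and-join approach keeps the number of "\f" characters constant regardless of which patterns the user supplies, so the page count of the text does not change.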

@julian-risch
Member

Just let me know if you need help with adding a release note (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes) or with ruff formatting (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#run-code-quality-checks-locally). I understand this is your first Haystack PR and I'll be happy to help!

Add release notes documenting the new strip_whitespace and regex_replace
parameters added to the DocumentCleaner component.

- strip_whitespace: Removes leading/trailing whitespace while preserving internal formatting
- regex_replace: Allows custom regex-based text transformations with replacement strings

Addresses review feedback requesting release notes for PR deepset-ai#10400.

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expand the functionality of the DocumentCleaner

3 participants