feat: add strip_whitespace and regex_replace to DocumentCleaner #10400
Add two new parameters to DocumentCleaner:
1. strip_whitespace - removes leading/trailing whitespace using str.strip()
2. regex_replace - maps regex patterns to replacement strings

Fixes deepset-ai#8798
@VedantMadane is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it.
julian-risch
left a comment
Thank you for opening this pull request, @VedantMadane! The changes already look quite good to me.
My main change request is to keep "\f" unchanged (see the implementation of _remove_regex and my inline comment).
The only other thing I would suggest changing before we merge the PR is the parameter names. Instead of strip_whitespace, I suggest strip_whitespaces, and instead of regex_replace, I suggest replace_regexes. That way, the naming is more consistent with the existing parameter names (remove_extra_whitespaces, remove_regex).
Besides that, I'd like to note that remove_regex is a subset of replace_regexes, and in a future version of Haystack we could decide to deprecate and later remove remove_regex. For now, however, I'd like to avoid a breaking change. 👍
    cleaner = DocumentCleaner(
        remove_empty_lines=False,
        remove_extra_whitespaces=False,
        regex_replace={r"\[REDACTED\]": "***", r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1"},
Just a note for our team: This is quite powerful and I think we could include such an example in our documentation. That would make it easier for users to understand what the parameter can be used for.
    :param text: Text to clean.
    :param regex_replace: A dictionary mapping regex patterns to their replacement strings.
    :returns: The text with the regex matches replaced by the specified strings.
    """
We should first do texts = text.split("\f") here, then apply the logic to each text in texts, and finally return "\f".join(cleaned_text).
Using an additional variable cleaned_text instead of the input text is also preferred.
For a detailed explanation of why we want to keep "\f", you can have a look at #8078 in case you are interested.
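The split/apply/join approach suggested in this thread could look roughly like the sketch below. It is not the code from the PR: the function name `replace_regexes_per_page` and the sample strings are hypothetical, and only Python's standard `re` module is used.

```python
import re


def replace_regexes_per_page(text: str, replacements: dict[str, str]) -> str:
    """Apply each pattern/replacement pair page by page, keeping form-feed separators."""
    pages = text.split("\f")  # "\f" (form feed) marks a page break
    cleaned_pages = []
    for page in pages:
        cleaned = page  # separate variable, so the input is never reassigned
        for pattern, replacement in replacements.items():
            cleaned = re.sub(pattern, replacement, cleaned)
        cleaned_pages.append(cleaned)
    return "\f".join(cleaned_pages)


# Hypothetical two-page input; the patterns are the ones from the test snippet above.
sample = "Page one [REDACTED]\fPage two 2024-01-31"
result = replace_regexes_per_page(
    sample,
    {r"\[REDACTED\]": "***", r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1"},
)

# A whitespace pattern cannot swallow the page break, because no page contains "\f".
kept = replace_regexes_per_page("a \f b", {r"\s+": " "})
lost = re.sub(r"\s+", " ", "a \f b")  # without splitting, "\f" is matched by \s
```

One practical effect of splitting first: in Python's `re`, `\s` also matches the form feed, so a user-supplied whitespace pattern applied to the whole text would remove page breaks, while the per-page version leaves them in place.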
Just let me know if you need help with adding a release note (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes) or with ruff formatting (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#run-code-quality-checks-locally). I understand this is your first Haystack PR and I'll be happy to help!
Add release notes documenting the new strip_whitespace and regex_replace parameters added to the DocumentCleaner component.
- strip_whitespace: Removes leading/trailing whitespace while preserving internal formatting
- regex_replace: Allows custom regex-based text transformations with replacement strings

Addresses review feedback requesting release notes for PR deepset-ai#10400.
Summary
Fixes #8798
This PR expands the functionality of DocumentCleaner with two new parameters:
1. strip_whitespace: bool = False
When True, removes leading and trailing whitespace from document content using Python's str.strip().
Unlike remove_extra_whitespaces, this only affects the beginning and end of the text, preserving internal whitespace (useful for markdown formatting).
2. regex_replace: dict[str, str] | None = None
A dictionary mapping regex patterns to replacement strings. This allows custom replacements instead of just removal. For example:
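A minimal sketch of both behaviors, written against Python's `str.strip()` and `re.sub()` directly rather than against DocumentCleaner itself. The sample strings and variable names are made up for illustration; the two patterns are the ones from the test snippet quoted in the review.

```python
import re

# 1. strip_whitespace-like behavior: only the ends of the text are trimmed.
markdown_text = "  \n# Title\n\n    indented code block   \n\n"
stripped = markdown_text.strip()
# Internal blank lines and the four-space indentation are untouched.

# 2. regex_replace-like behavior: each pattern maps to a replacement string,
#    which may use backreferences such as \1 to reorder captured groups.
replacements = {
    r"\[REDACTED\]": "***",
    r"(\d{4})-(\d{2})-(\d{2})": r"\2/\3/\1",  # YYYY-MM-DD -> MM/DD/YYYY
}
content = "Signed by [REDACTED] on 2024-01-31"
for pattern, replacement in replacements.items():
    content = re.sub(pattern, replacement, content)

print(repr(stripped))
print(content)
```

Because the patterns are applied one after another in dictionary order, a later pattern sees the output of the earlier ones; that ordering is a property of this sketch and is assumed, not confirmed, for the component.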
Changes
Test plan