[WIP] fix: preserve whitespace when removing translation tokens (fix #1302) by serhii73 · Pull Request #1324 · scrapinghub/dateparser

serhii73 · 2026-04-09T11:51:30Z

- Fixed whitespace collapsing when skip tokens (like 'klo' in Finnish) are removed - When a skip token is removed between spaces, preserve the maximum of surrounding spaces - Added comprehensive tests in test_whitespace_preservation.py - Updated existing affected test expectations to match corrected behavior - Fixes issue where double spaces were created when skip tokens were removed

codecov · 2026-04-09T11:53:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.10%. Comparing base (b72ed09) to head (c80acb4).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1324      +/-   ##
==========================================
+ Coverage   97.08%   97.10%   +0.02%     
==========================================
  Files         235      235              
  Lines        2877     2904      +27     
==========================================
+ Hits         2793     2820      +27     
  Misses         84       84

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Fixed 6 edge case tests related to _clear_future_words method - Most cases now correctly expect single spaces (he, nl, pl, vi-lúc, da) - Vietnamese pipe case ('|') correctly expects double space preservation - All 1,342 language and whitespace tests now passing

Copilot

Pull request overview

Fixes issue #1302 by preventing extra whitespace from being introduced when translation “skip” tokens are removed (e.g., Finnish “klo”), and updates tests to reflect the corrected spacing behavior.

Changes:

Update Locale.translate() token post-processing to merge whitespace around removed (empty) skip tokens.
Adjust many expected translations in tests/test_languages.py to remove unintended double spaces caused by skip-token removal.
Add dedicated whitespace-focused tests, plus a standalone script (currently pytest-discoverable).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
`dateparser/languages/locale.py`	Adds token filtering/whitespace-merging logic before joining translated tokens.
`tests/test_languages.py`	Updates expected translations to reflect corrected whitespace after skip token removal.
`tests/test_whitespace_preservation.py`	New parameterized tests covering single/multi-space cases for translations.
`test_whitespace.py`	Adds a standalone debugging script (currently looks like a test to pytest).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-09T12:51:06Z

dateparser/languages/locale.py

+        # Remove empty tokens (skip words) and handle adjacent whitespace
+        # When a skip token is removed between spaces, keep the maximum number of spaces
+        filtered_tokens = []
+        i = 0
+        while i < len(date_string_tokens):
+            token = date_string_tokens[i]
+
+            # Skip empty tokens (removed skip words)
+            if not token:
+                # Count preceding spaces already in filtered_tokens
+                prev_spaces = 0
+                j = len(filtered_tokens) - 1
+                while j >= 0 and filtered_tokens[j] == " ":
+                    prev_spaces += 1
+                    j -= 1
+
+                # Count following spaces in the remaining tokens
+                next_spaces = 0
+                j = i + 1
+                while j < len(date_string_tokens) and date_string_tokens[j] == " ":
+                    next_spaces += 1
+                    j += 1
+
+                # If surrounded by spaces, keep max(prev_spaces, next_spaces)
+                if prev_spaces > 0 and next_spaces > 0:
+                    # Remove prev_spaces from filtered_tokens
+                    for _ in range(prev_spaces):
+                        filtered_tokens.pop()
+
+                    # Add back the maximum number of spaces
+                    max_spaces = max(prev_spaces, next_spaces)
+                    for _ in range(max_spaces):
+                        filtered_tokens.append(" ")
+
+                    # Skip the empty token and all following spaces
+                    i += next_spaces + 1
+                    continue
+
+                i += 1
+                continue
+
+            filtered_tokens.append(token)
+            i += 1
+
        return self._join(
-            list(filter(bool, date_string_tokens)),
+            filtered_tokens,
            separator="" if keep_formatting else " ",
            settings=settings,
        )


Locale.translate now performs a custom pass over tokens to merge spaces around removed skip tokens. Given how central translate() is, please add/extend tests that exercise this behavior with keep_formatting=True as well (where separator="" is used) to ensure the change doesn't introduce spacing regressions in formatting-preserving mode.

Copilot · 2026-04-09T12:51:06Z

tests/test_languages.py

+                "cs", "22. prosinec 2014 v 2:38", "22. december 2014  2:38"
+            ),  # Issue #1302: v becomes in, (cleared, leaves 2 spaces)


The Czech test case still expects a double space between "2014" and "2:38" after removing the skip token "v". With the new whitespace-merging logic in Locale.translate, a skip token between single spaces should yield a single space (max(1,1)=1). Update this expected translation (or adjust the test input) to match the intended behavior for issue #1302.

Suggested change

"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"

), # Issue #1302: v becomes in, (cleared, leaves 2 spaces)

"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"

), # Issue #1302: skip token "v" is removed and surrounding spaces are merged

Copilot · 2026-04-09T12:51:07Z

test_whitespace.py

+    print(f"Input1:    |{input1}|")
+    print(f"Result1:   |{result1}|")
+    print("Expected1: |28 march 9:37|")
+    print(f"Match: {result1 == '28 march 9:37'}")
+    print()
+
+    # Test 2: Double spaces
+    input2 = "28  maalis  klo  9:37"
+    result2 = locale.translate(input2, settings=settings)
+    print(f"Input2:    |{input2}|")
+    print(f"Result2:   |{result2}|")
+    print("Expected2: |28  march  9:37|")
+    print(f"Match: {result2 == '28  march  9:37'}")
+    print()
+
+    # Test 3: Triple spaces
+    input3 = "28   maalis   klo   9:37"
+    result3 = locale.translate(input3, settings=settings)
+    print(f"Input3:    |{input3}|")
+    print(f"Result3:   |{result3}|")
+    print("Expected3: |28   march   9:37|")
+    print(f"Match: {result3 == '28   march   9:37'}")
+    print()


This file will be collected by pytest (it matches the default test discovery pattern) but it only prints results and never asserts, so it won't fail even if whitespace handling regresses and it may add noisy output in CI. Consider removing this script from the repository, or converting it into a proper test under tests/ with assertions.

Suggested change

print(f"Input1: |{input1}|")

print(f"Result1: |{result1}|")

print("Expected1: |28 march 9:37|")

print(f"Match: {result1 == '28 march 9:37'}")

print()

# Test 2: Double spaces

input2 = "28 maalis klo 9:37"

result2 = locale.translate(input2, settings=settings)

print(f"Input2: |{input2}|")

print(f"Result2: |{result2}|")

print("Expected2: |28 march 9:37|")

print(f"Match: {result2 == '28 march 9:37'}")

print()

# Test 3: Triple spaces

input3 = "28 maalis klo 9:37"

result3 = locale.translate(input3, settings=settings)

print(f"Input3: |{input3}|")

print(f"Result3: |{result3}|")

print("Expected3: |28 march 9:37|")

print(f"Match: {result3 == '28 march 9:37'}")

print()

assert result1 == "28 march 9:37"

# Test 2: Double spaces

input2 = "28 maalis klo 9:37"

result2 = locale.translate(input2, settings=settings)

assert result2 == "28 march 9:37"

# Test 3: Triple spaces

input3 = "28 maalis klo 9:37"

result3 = locale.translate(input3, settings=settings)

assert result3 == "28 march 9:37"

serhii73 requested a review from Copilot April 9, 2026 12:44

Copilot started reviewing on behalf of serhii73 April 9, 2026 12:45 View session

Copilot AI reviewed Apr 9, 2026

View reviewed changes

serhii73 changed the title ~~fix: preserve whitespace when removing translation tokens (fix #1302)~~ [WIP] fix: preserve whitespace when removing translation tokens (fix #1302) Apr 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324

[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324
serhii73 wants to merge 2 commits intomasterfrom
fix/1302-preserve-whitespace

serhii73 commented Apr 9, 2026

Uh oh!

codecov bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Copilot AI Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"
		), # Issue #1302: v becomes in, (cleared, leaves 2 spaces)

Conversation

serhii73 commented Apr 9, 2026

Uh oh!

codecov bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Apr 9, 2026 •

edited

Loading