[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324
[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324
Conversation
- Fixed whitespace collapsing when skip tokens (like 'klo' in Finnish) are removed - When a skip token is removed between spaces, preserve the maximum of surrounding spaces - Added comprehensive tests in test_whitespace_preservation.py - Updated existing affected test expectations to match corrected behavior - Fixes issue where double spaces were created when skip tokens were removed
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1324 +/- ##
==========================================
+ Coverage 97.08% 97.10% +0.02%
==========================================
Files 235 235
Lines 2877 2904 +27
==========================================
+ Hits 2793 2820 +27
Misses 84 84 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Fixed 6 edge case tests related to _clear_future_words method
- Most cases now correctly expect single spaces (he, nl, pl, vi-lúc, da)
- Vietnamese pipe case ('|') correctly expects double space preservation
- All 1,342 language and whitespace tests now passing
There was a problem hiding this comment.
Pull request overview
Fixes issue #1302 by preventing extra whitespace from being introduced when translation “skip” tokens are removed (e.g., Finnish “klo”), and updates tests to reflect the corrected spacing behavior.
Changes:
- Update
Locale.translate()token post-processing to merge whitespace around removed (empty) skip tokens. - Adjust many expected translations in
tests/test_languages.pyto remove unintended double spaces caused by skip-token removal. - Add dedicated whitespace-focused tests, plus a standalone script (currently pytest-discoverable).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
dateparser/languages/locale.py |
Adds token filtering/whitespace-merging logic before joining translated tokens. |
tests/test_languages.py |
Updates expected translations to reflect corrected whitespace after skip token removal. |
tests/test_whitespace_preservation.py |
New parameterized tests covering single/multi-space cases for translations. |
test_whitespace.py |
Adds a standalone debugging script (currently looks like a test to pytest). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Remove empty tokens (skip words) and handle adjacent whitespace | ||
| # When a skip token is removed between spaces, keep the maximum number of spaces | ||
| filtered_tokens = [] | ||
| i = 0 | ||
| while i < len(date_string_tokens): | ||
| token = date_string_tokens[i] | ||
|
|
||
| # Skip empty tokens (removed skip words) | ||
| if not token: | ||
| # Count preceding spaces already in filtered_tokens | ||
| prev_spaces = 0 | ||
| j = len(filtered_tokens) - 1 | ||
| while j >= 0 and filtered_tokens[j] == " ": | ||
| prev_spaces += 1 | ||
| j -= 1 | ||
|
|
||
| # Count following spaces in the remaining tokens | ||
| next_spaces = 0 | ||
| j = i + 1 | ||
| while j < len(date_string_tokens) and date_string_tokens[j] == " ": | ||
| next_spaces += 1 | ||
| j += 1 | ||
|
|
||
| # If surrounded by spaces, keep max(prev_spaces, next_spaces) | ||
| if prev_spaces > 0 and next_spaces > 0: | ||
| # Remove prev_spaces from filtered_tokens | ||
| for _ in range(prev_spaces): | ||
| filtered_tokens.pop() | ||
|
|
||
| # Add back the maximum number of spaces | ||
| max_spaces = max(prev_spaces, next_spaces) | ||
| for _ in range(max_spaces): | ||
| filtered_tokens.append(" ") | ||
|
|
||
| # Skip the empty token and all following spaces | ||
| i += next_spaces + 1 | ||
| continue | ||
|
|
||
| i += 1 | ||
| continue | ||
|
|
||
| filtered_tokens.append(token) | ||
| i += 1 | ||
|
|
||
| return self._join( | ||
| list(filter(bool, date_string_tokens)), | ||
| filtered_tokens, | ||
| separator="" if keep_formatting else " ", | ||
| settings=settings, | ||
| ) |
There was a problem hiding this comment.
Locale.translate now performs a custom pass over tokens to merge spaces around removed skip tokens. Given how central translate() is, please add/extend tests that exercise this behavior with keep_formatting=True as well (where separator="" is used) to ensure the change doesn't introduce spacing regressions in formatting-preserving mode.
| "cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38" | ||
| ), # Issue #1302: v becomes in, (cleared, leaves 2 spaces) |
There was a problem hiding this comment.
The Czech test case still expects a double space between "2014" and "2:38" after removing the skip token "v". With the new whitespace-merging logic in Locale.translate, a skip token between single spaces should yield a single space (max(1,1)=1). Update this expected translation (or adjust the test input) to match the intended behavior for issue #1302.
| "cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38" | |
| ), # Issue #1302: v becomes in, (cleared, leaves 2 spaces) | |
| "cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38" | |
| ), # Issue #1302: skip token "v" is removed and surrounding spaces are merged |
| print(f"Input1: |{input1}|") | ||
| print(f"Result1: |{result1}|") | ||
| print("Expected1: |28 march 9:37|") | ||
| print(f"Match: {result1 == '28 march 9:37'}") | ||
| print() | ||
|
|
||
| # Test 2: Double spaces | ||
| input2 = "28 maalis klo 9:37" | ||
| result2 = locale.translate(input2, settings=settings) | ||
| print(f"Input2: |{input2}|") | ||
| print(f"Result2: |{result2}|") | ||
| print("Expected2: |28 march 9:37|") | ||
| print(f"Match: {result2 == '28 march 9:37'}") | ||
| print() | ||
|
|
||
| # Test 3: Triple spaces | ||
| input3 = "28 maalis klo 9:37" | ||
| result3 = locale.translate(input3, settings=settings) | ||
| print(f"Input3: |{input3}|") | ||
| print(f"Result3: |{result3}|") | ||
| print("Expected3: |28 march 9:37|") | ||
| print(f"Match: {result3 == '28 march 9:37'}") | ||
| print() |
There was a problem hiding this comment.
This file will be collected by pytest (it matches the default test discovery pattern) but it only prints results and never asserts, so it won't fail even if whitespace handling regresses and it may add noisy output in CI. Consider removing this script from the repository, or converting it into a proper test under tests/ with assertions.
| print(f"Input1: |{input1}|") | |
| print(f"Result1: |{result1}|") | |
| print("Expected1: |28 march 9:37|") | |
| print(f"Match: {result1 == '28 march 9:37'}") | |
| print() | |
| # Test 2: Double spaces | |
| input2 = "28 maalis klo 9:37" | |
| result2 = locale.translate(input2, settings=settings) | |
| print(f"Input2: |{input2}|") | |
| print(f"Result2: |{result2}|") | |
| print("Expected2: |28 march 9:37|") | |
| print(f"Match: {result2 == '28 march 9:37'}") | |
| print() | |
| # Test 3: Triple spaces | |
| input3 = "28 maalis klo 9:37" | |
| result3 = locale.translate(input3, settings=settings) | |
| print(f"Input3: |{input3}|") | |
| print(f"Result3: |{result3}|") | |
| print("Expected3: |28 march 9:37|") | |
| print(f"Match: {result3 == '28 march 9:37'}") | |
| print() | |
| assert result1 == "28 march 9:37" | |
| # Test 2: Double spaces | |
| input2 = "28 maalis klo 9:37" | |
| result2 = locale.translate(input2, settings=settings) | |
| assert result2 == "28 march 9:37" | |
| # Test 3: Triple spaces | |
| input3 = "28 maalis klo 9:37" | |
| result3 = locale.translate(input3, settings=settings) | |
| assert result3 == "28 march 9:37" |
Close #1302