Skip to content

[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324

Open
serhii73 wants to merge 2 commits intomasterfrom
fix/1302-preserve-whitespace
Open

[WIP] fix: preserve whitespace when removing translation tokens (fix #1302)#1324
serhii73 wants to merge 2 commits intomasterfrom
fix/1302-preserve-whitespace

Conversation

@serhii73
Copy link
Copy Markdown
Collaborator

@serhii73 serhii73 commented Apr 9, 2026

Close #1302

- Fixed whitespace collapsing when skip tokens (like 'klo' in Finnish) are removed
- When a skip token is removed between spaces, preserve the maximum of surrounding spaces
- Added comprehensive tests in test_whitespace_preservation.py
- Updated existing affected test expectations to match corrected behavior
- Fixes issue where double spaces were created when skip tokens were removed
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.10%. Comparing base (b72ed09) to head (c80acb4).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1324      +/-   ##
==========================================
+ Coverage   97.08%   97.10%   +0.02%     
==========================================
  Files         235      235              
  Lines        2877     2904      +27     
==========================================
+ Hits         2793     2820      +27     
  Misses         84       84              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Fixed 6 edge case tests related to _clear_future_words method
- Most cases now correctly expect single spaces (he, nl, pl, vi-lúc, da)
- Vietnamese pipe case ('|') correctly expects double space preservation
- All 1,342 language and whitespace tests now passing
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes issue #1302 by preventing extra whitespace from being introduced when translation “skip” tokens are removed (e.g., Finnish “klo”), and updates tests to reflect the corrected spacing behavior.

Changes:

  • Update Locale.translate() token post-processing to merge whitespace around removed (empty) skip tokens.
  • Adjust many expected translations in tests/test_languages.py to remove unintended double spaces caused by skip-token removal.
  • Add dedicated whitespace-focused tests, plus a standalone script (currently pytest-discoverable).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
dateparser/languages/locale.py Adds token filtering/whitespace-merging logic before joining translated tokens.
tests/test_languages.py Updates expected translations to reflect corrected whitespace after skip token removal.
tests/test_whitespace_preservation.py New parameterized tests covering single/multi-space cases for translations.
test_whitespace.py Adds a standalone debugging script (currently looks like a test to pytest).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +156 to 204
# Remove empty tokens (skip words) and handle adjacent whitespace
# When a skip token is removed between spaces, keep the maximum number of spaces
filtered_tokens = []
i = 0
while i < len(date_string_tokens):
token = date_string_tokens[i]

# Skip empty tokens (removed skip words)
if not token:
# Count preceding spaces already in filtered_tokens
prev_spaces = 0
j = len(filtered_tokens) - 1
while j >= 0 and filtered_tokens[j] == " ":
prev_spaces += 1
j -= 1

# Count following spaces in the remaining tokens
next_spaces = 0
j = i + 1
while j < len(date_string_tokens) and date_string_tokens[j] == " ":
next_spaces += 1
j += 1

# If surrounded by spaces, keep max(prev_spaces, next_spaces)
if prev_spaces > 0 and next_spaces > 0:
# Remove prev_spaces from filtered_tokens
for _ in range(prev_spaces):
filtered_tokens.pop()

# Add back the maximum number of spaces
max_spaces = max(prev_spaces, next_spaces)
for _ in range(max_spaces):
filtered_tokens.append(" ")

# Skip the empty token and all following spaces
i += next_spaces + 1
continue

i += 1
continue

filtered_tokens.append(token)
i += 1

return self._join(
list(filter(bool, date_string_tokens)),
filtered_tokens,
separator="" if keep_formatting else " ",
settings=settings,
)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locale.translate now performs a custom pass over tokens to merge spaces around removed skip tokens. Given how central translate() is, please add/extend tests that exercise this behavior with keep_formatting=True as well (where separator="" is used) to ensure the change doesn't introduce spacing regressions in formatting-preserving mode.

Copilot uses AI. Check for mistakes.
Comment on lines +119 to +120
"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"
), # Issue #1302: v becomes in, (cleared, leaves 2 spaces)
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Czech test case still expects a double space between "2014" and "2:38" after removing the skip token "v". With the new whitespace-merging logic in Locale.translate, a skip token between single spaces should yield a single space (max(1,1)=1). Update this expected translation (or adjust the test input) to match the intended behavior for issue #1302.

Suggested change
"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"
), # Issue #1302: v becomes in, (cleared, leaves 2 spaces)
"cs", "22. prosinec 2014 v 2:38", "22. december 2014 2:38"
), # Issue #1302: skip token "v" is removed and surrounding spaces are merged

Copilot uses AI. Check for mistakes.
Comment on lines +14 to +36
print(f"Input1: |{input1}|")
print(f"Result1: |{result1}|")
print("Expected1: |28 march 9:37|")
print(f"Match: {result1 == '28 march 9:37'}")
print()

# Test 2: Double spaces
input2 = "28 maalis klo 9:37"
result2 = locale.translate(input2, settings=settings)
print(f"Input2: |{input2}|")
print(f"Result2: |{result2}|")
print("Expected2: |28 march 9:37|")
print(f"Match: {result2 == '28 march 9:37'}")
print()

# Test 3: Triple spaces
input3 = "28 maalis klo 9:37"
result3 = locale.translate(input3, settings=settings)
print(f"Input3: |{input3}|")
print(f"Result3: |{result3}|")
print("Expected3: |28 march 9:37|")
print(f"Match: {result3 == '28 march 9:37'}")
print()
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file will be collected by pytest (it matches the default test discovery pattern) but it only prints results and never asserts, so it won't fail even if whitespace handling regresses and it may add noisy output in CI. Consider removing this script from the repository, or converting it into a proper test under tests/ with assertions.

Suggested change
print(f"Input1: |{input1}|")
print(f"Result1: |{result1}|")
print("Expected1: |28 march 9:37|")
print(f"Match: {result1 == '28 march 9:37'}")
print()
# Test 2: Double spaces
input2 = "28 maalis klo 9:37"
result2 = locale.translate(input2, settings=settings)
print(f"Input2: |{input2}|")
print(f"Result2: |{result2}|")
print("Expected2: |28 march 9:37|")
print(f"Match: {result2 == '28 march 9:37'}")
print()
# Test 3: Triple spaces
input3 = "28 maalis klo 9:37"
result3 = locale.translate(input3, settings=settings)
print(f"Input3: |{input3}|")
print(f"Result3: |{result3}|")
print("Expected3: |28 march 9:37|")
print(f"Match: {result3 == '28 march 9:37'}")
print()
assert result1 == "28 march 9:37"
# Test 2: Double spaces
input2 = "28 maalis klo 9:37"
result2 = locale.translate(input2, settings=settings)
assert result2 == "28 march 9:37"
# Test 3: Triple spaces
input3 = "28 maalis klo 9:37"
result3 = locale.translate(input3, settings=settings)
assert result3 == "28 march 9:37"

Copilot uses AI. Check for mistakes.
@serhii73 serhii73 changed the title fix: preserve whitespace when removing translation tokens (fix #1302) [WIP] fix: preserve whitespace when removing translation tokens (fix #1302) Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Issue: Extra space in translation

2 participants