Skip to content

Conversation

@Kaua-Klassmann
Copy link
Contributor

Description

This PR fixes the numeric sort to recognize the the locale's thousands separator

Changes

  • Added function to export the locale grouping separators on i18n from uucore
  • Changed "thousands_separator" on NumInfoParseSettings from Option to Option to properly support Unicode separators (like non-breaking spaces)
  • Updated to get the locale grouping separators to set on NumInfoParseSettings to sort

Fixes #10316

@Kaua-Klassmann Kaua-Klassmann changed the title fix :numeric sort (-n) does not recognize thousand separators fix: numeric sort (-n) does not recognize thousand separators Jan 19, 2026
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/sort/sort-debug-keys. tests/sort/sort-debug-keys is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/tty/tty-eof (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/sort/sort-h-thousands-sep is no longer failing!

@codspeed-hq
Copy link

codspeed-hq bot commented Jan 19, 2026

Merging this PR will not alter performance

✅ 142 untouched benchmarks
⏩ 180 skipped benchmarks1


Comparing Kaua-Klassmann:fix_numeric_sort_does_not_recognize_thousand_separators (e3a4b05) with main (cbbff30)

Open in CodSpeed

Footnotes

  1. 180 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@sylvestre
Copy link
Contributor

Congrats! The gnu test tests/sort/sort-h-thousands-sep is no longer failing!

well done!

But some jobs are failing

@sylvestre
Copy link
Contributor

please also add some high level integration tests in tests/by-util/test_sort.rs

@ChrisDryden
Copy link
Collaborator

It seems that this ICU library and the libc call have differing values for the separator, For example french here is U+202F but using libc its a regular space

@Kaua-Klassmann Kaua-Klassmann force-pushed the fix_numeric_sort_does_not_recognize_thousand_separators branch from 4fa661e to a751737 Compare January 19, 2026 17:11
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/dd/stderr. tests/dd/stderr is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/sort/sort-debug-keys. tests/sort/sort-debug-keys is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/tac/tac-2-nonseekable. tests/tac/tac-2-nonseekable is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/tail/follow-stdin. tests/tail/follow-stdin is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/rm/rm1 (fails in this run but passes in the 'main' branch)
Congrats! The gnu test tests/sort/sort-h-thousands-sep is no longer failing!

@Kaua-Klassmann
Copy link
Contributor Author

My PR and issue are correct? I world like to contribute to the repository, but I'm beginner in opensource projects.

@sylvestre sylvestre force-pushed the fix_numeric_sort_does_not_recognize_thousand_separators branch from a751737 to 1fa6681 Compare January 20, 2026 22:24
@sylvestre
Copy link
Contributor

yeah, it looks good. i just need more time to verify it

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/unexpand. tests/misc/unexpand is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/sort/sort-debug-keys. tests/sort/sort-debug-keys is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/sort/sort-h-thousands-sep is no longer failing!


let mut first_char = true;

for (idx, &char) in num.iter().enumerate() {
Copy link
Collaborator

@ChrisDryden ChrisDryden Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be outside the scope of this PR since this logic appears to have been here before, but this for loop is incompatible with thousand separators that are multi byte characters (my understanding is that the french one is?). I think the approach might have to be converting this into a while loop so that you can skip over multiple bytes when the thousand separator is matched

echo -e "100\n1\xe2\x80\xaf234\n5\xe2\x80\xaf678" | LC_ALL=fr_FR.utf8 ./target/debug/sort -n

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow I am mistaken, it appears that this is actually a known limitation of sort, GNU sort currently checks to see if it is multi byte and it it is it disables it.

pub struct NumInfoParseSettings {
pub accept_si_units: bool,
pub thousands_separator: Option<u8>,
pub thousands_separator: Option<char>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may seem really counterintuitive but this should stay a u8, if you look at the other comment its because GNU thousand seperation actually doesn't work for multi byte chars so to keep compatibility you need to check to see if its multi byte to see if it should be enabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's strange... but I can change it without any problem

// Get the thousands separator from the locale, handling cases where the separator is empty or multi-character
let locale_thousands_separator = i18n::decimal::locale_grouping_separator().as_bytes();

let thousands_separator = match locale_thousands_separator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind adding a comment here about upstream GNU not being able to handle this, for people reading the code to understand why we're doing this

@Kaua-Klassmann Kaua-Klassmann force-pushed the fix_numeric_sort_does_not_recognize_thousand_separators branch from 3885736 to e3a4b05 Compare January 21, 2026 13:48
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/sort/sort-debug-keys. tests/sort/sort-debug-keys is passing on 'main'. Maybe you have to rebase?

@ChrisDryden
Copy link
Collaborator

So theres two PRs here that implement the thousands seperator for sort, the other one doesn't have the changes in the test data. I think its because this pr falls back through LC_ALL → LC_NUMERIC → LANG.

I'm still trying to reconcile the differences related to the two PR's. I think we still need the tokenization fixes that are added in #9848 trying to come up with explicit examples that show the deviation with gnu sort for that use case

@ChrisDryden
Copy link
Collaborator

echo -e "1 000\n500" | LC_ALL=fr_FR.utf8 /usr/bin/sort -n

Shows differently in gnu sort compared to this implementation. I think you need to detect if its a blank thousands seperator and also this is missing the HumanNumeric handling, for example:

echo -e "1 K\n500" | LC_ALL=fr_FR.utf8 /usr/bin/sort -h

@ChrisDryden
Copy link
Collaborator

ChrisDryden commented Jan 21, 2026

If I was to strategize about the best way forward here, I really like the way that you are using the ICU library to get the locale separation data instead of calling libc, and that you have the correct fallback for getting the fallback data, then @mattsu2020 has the implementations for the two things that you are missing here.

@sylvestre mind if we merge this and then we re-base @mattsu2020's pr for this with the additional changes for HumanNumeric and the tokenization changes, to get the best of both?

@sylvestre sylvestre merged commit 1a64417 into uutils:main Jan 21, 2026
130 of 131 checks passed
@sylvestre
Copy link
Contributor

sounds good!

@Kaua-Klassmann Kaua-Klassmann deleted the fix_numeric_sort_does_not_recognize_thousand_separators branch January 21, 2026 21:26
mattsu2020 pushed a commit to mattsu2020/coreutils that referenced this pull request Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sort: numeric sort (-n) does not recognize thousand separators

3 participants