-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
fix: numeric sort (-n) does not recognize thousand separators #10339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: numeric sort (-n) does not recognize thousand separators #10339
Conversation
|
GNU testsuite comparison: |
Merging this PR will not alter performance
Comparing Footnotes
|
well done! But some jobs are failing |
|
please also add some high level integration tests in tests/by-util/test_sort.rs |
|
It seems that this ICU library and the libc call have differing values for the separator, For example french here is U+202F but using libc its a regular space |
4fa661e to
a751737
Compare
|
GNU testsuite comparison: |
|
My PR and issue are correct? I world like to contribute to the repository, but I'm beginner in opensource projects. |
a751737 to
1fa6681
Compare
|
yeah, it looks good. i just need more time to verify it |
|
GNU testsuite comparison: |
|
|
||
| let mut first_char = true; | ||
|
|
||
| for (idx, &char) in num.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be outside the scope of this PR since this logic appears to have been here before, but this for loop is incompatible with thousand separators that are multi byte characters (my understanding is that the french one is?). I think the approach might have to be converting this into a while loop so that you can skip over multiple bytes when the thousand separator is matched
echo -e "100\n1\xe2\x80\xaf234\n5\xe2\x80\xaf678" | LC_ALL=fr_FR.utf8 ./target/debug/sort -n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow I am mistaken, it appears that this is actually a known limitation of sort, GNU sort currently checks to see if it is multi byte and it it is it disables it.
src/uu/sort/src/numeric_str_cmp.rs
Outdated
| pub struct NumInfoParseSettings { | ||
| pub accept_si_units: bool, | ||
| pub thousands_separator: Option<u8>, | ||
| pub thousands_separator: Option<char>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may seem really counterintuitive but this should stay a u8, if you look at the other comment its because GNU thousand seperation actually doesn't work for multi byte chars so to keep compatibility you need to check to see if its multi byte to see if it should be enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's strange... but I can change it without any problem
…housand_separators
| // Get the thousands separator from the locale, handling cases where the separator is empty or multi-character | ||
| let locale_thousands_separator = i18n::decimal::locale_grouping_separator().as_bytes(); | ||
|
|
||
| let thousands_separator = match locale_thousands_separator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind adding a comment here about upstream GNU not being able to handle this, for people reading the code to understand why we're doing this
3885736 to
e3a4b05
Compare
|
GNU testsuite comparison: |
|
So theres two PRs here that implement the thousands seperator for sort, the other one doesn't have the changes in the test data. I think its because this pr falls back through LC_ALL → LC_NUMERIC → LANG. I'm still trying to reconcile the differences related to the two PR's. I think we still need the tokenization fixes that are added in #9848 trying to come up with explicit examples that show the deviation with gnu sort for that use case |
Shows differently in gnu sort compared to this implementation. I think you need to detect if its a blank thousands seperator and also this is missing the HumanNumeric handling, for example: |
|
If I was to strategize about the best way forward here, I really like the way that you are using the ICU library to get the locale separation data instead of calling libc, and that you have the correct fallback for getting the fallback data, then @mattsu2020 has the implementations for the two things that you are missing here. @sylvestre mind if we merge this and then we re-base @mattsu2020's pr for this with the additional changes for HumanNumeric and the tokenization changes, to get the best of both? |
|
sounds good! |
Description
This PR fixes the numeric sort to recognize the the locale's thousands separator
Changes
Fixes #10316