While implementing the word segmenter, I ran into a case where 'unicode-segmenter'를 produces different results between the spec and Node.js:
Chrome/V8 (6 segments): ', unicode, -, segmenter, ', 를
Unicode spec-compliant implementation (4 segments): ', unicode, -, segmenter'를
JSC and SpiderMonkey follow the spec, but Intl.Segmenter in Chrome/V8 behaves differently:
can't -> can't
a'b -> a'b
a'가 -> a, ', 가
a'α -> a'α
a'א -> a'א
a'中 -> a, ', 中
a'あ -> a, ', あ
a'ア -> a, ', ア
a'ก -> a'ก
a'ア -> a, ', ア
가'나 -> 가, ', 나
α'β -> α'β
א'ב -> א'ב
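The differences above can be checked in any engine with Intl.Segmenter using word granularity. A minimal sketch (the helper name `wordSegments` is mine, not part of any library; the per-engine outputs are those reported above, not guaranteed on every ICU version):

```javascript
// Split a string into word-granularity segments with Intl.Segmenter.
function wordSegments(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // segment() returns an iterable of { segment, index, isWordLike, ... }.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// UAX #29 keeps an apostrophe between two letters joined (WB6/WB7),
// and every engine agrees here:
console.log(wordSegments("can't"));

// Per the comparison above, Chrome/V8 splits this into a, ', 가,
// while a strictly spec-compliant segmenter keeps it together:
console.log(wordSegments("a'가"));

console.log(wordSegments("'unicode-segmenter'를"));
```

Running this in Node.js vs. JSC/SpiderMonkey shells makes the tailoring visible directly, since all three expose the same Intl.Segmenter API.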
That implies Chrome/V8 is applying tailoring beyond UAX #29 rather than using the ICU library as-is.
As a native Korean speaker, that makes perfect sense: the apostrophe-joining rule is strange in CJK text.
There may be other divergences around sentence segmentation too.
The question is: what should unicode-segmenter follow? Should I strictly follow the spec, or match the practical behavior of the most popular environment?