While implementing the word segmenter, I ran into a case where 'unicode-segmenter'를 produces different results between the spec and Node.js:
Chrome/V8 (6 segments): ', unicode, -, segmenter, ', 를
Unicode spec-compliant implementation (4 segments): ', unicode, -, segmenter'를
JSC and SpiderMonkey follow the spec, but Intl.Segmenter in Chrome/V8 behaves differently:
can't -> can't
a'b -> a'b
a'가 -> a, ', 가
a'α -> a'α
a'א -> a'א
a'中 -> a, ', 中
a'あ -> a, ', あ
a'ア -> a, ', ア
a'ก -> a'ก
a'ア -> a, ', ア
가'나 -> 가, ', 나
α'β -> α'β
א'ב -> א'ב
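The differences above can be checked in any engine with Intl.Segmenter using word granularity. A minimal sketch (the helper name `wordSegments` is mine, not part of any library; the per-engine outputs are those reported above, not guaranteed on every ICU version):

```javascript
// Split a string into word-granularity segments with Intl.Segmenter.
function wordSegments(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // segment() returns an iterable of { segment, index, isWordLike, ... }.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// UAX #29 keeps an apostrophe between two letters joined (WB6/WB7),
// and every engine agrees here:
console.log(wordSegments("can't"));

// Per the comparison above, Chrome/V8 splits this into a, ', 가,
// while a strictly spec-compliant segmenter keeps it together:
console.log(wordSegments("a'가"));

console.log(wordSegments("'unicode-segmenter'를"));
```

Running this in Node.js vs. JSC/SpiderMonkey shells makes the tailoring visible directly, since all three expose the same Intl.Segmenter API.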
That implies Chrome/V8 is applying tailoring beyond UAX #29 rather than using the ICU library as-is.
As a native Korean speaker, that makes perfect sense: the apostrophe-joining rule is strange in CJK text.
There may be other divergences around sentence segmentation too.
The question is: what should unicode-segmenter follow? Should I strictly follow the spec, or match the practical behavior of the most popular environment?