Fix GraphemeCursor for GB11 case on a chunk boundary #172

Merged
Manishearth merged 1 commit into unicode-rs:master from apohrebniak:zwj on May 24, 2026
@apohrebniak apohrebniak commented May 21, 2026

Fixes #118

Given two chunks, 👩 and \u{200d}🔬, the cursor loses track of the fact that the ZWJ code point has already been seen once the "pre-context" chunk is processed. As a result it expects the ZWJ again, does not find it in the first chunk, and reports an incorrect boundary.

This commit fixes the problem by keeping this "substate" in the GraphemeState::Emoji variant.
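
To make the chunk layout concrete, here is a minimal sketch using only the standard library. It assumes the position under test is the one just before 🔬; the function name `chunk_layout` is hypothetical and is not part of the crate.

```rust
/// Byte layout of the two chunks from the description.
/// Returns (start of chunk 2, candidate boundary offset, total length).
/// `chunk_layout` is an illustrative helper, not a crate API.
fn chunk_layout() -> (usize, usize, usize) {
    let chunk1 = "\u{1f469}"; // 👩, 4 bytes in UTF-8
    let chunk2 = "\u{200d}\u{1f52c}"; // ZWJ (3 bytes) + 🔬 (4 bytes)

    // Chunk 2 starts where chunk 1 ends.
    let chunk2_start = chunk1.len();

    // The candidate boundary sits between the ZWJ and 🔬.
    let candidate = chunk2_start + '\u{200d}'.len_utf8();

    // Inside chunk 2, the only character before the candidate is the ZWJ,
    // so deciding GB11 needs the preceding chunk as pre-context.
    let before_in_chunk2: Vec<char> =
        chunk2[..candidate - chunk2_start].chars().collect();
    debug_assert_eq!(before_in_chunk2, vec!['\u{200d}']);

    (chunk2_start, candidate, chunk1.len() + chunk2.len())
}
```

Under these assumptions the ZWJ is consumed while scanning chunk 2, and the leading 👩 only becomes visible once chunk 1 is supplied as pre-context, which is the point where the "ZWJ already seen" fact has to survive.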

Copilot AI left a comment

Pull request overview

Fixes an incorrect grapheme boundary result in GraphemeCursor for GB11 when the ZWJ occurs at a chunk boundary by preserving intermediate emoji-state across PreContext requests.

Changes:

  • Extends GraphemeState::Emoji to carry a seen_zwj sub-state so GB11 handling can resume correctly across chunks.
  • Updates the GB11 backtracking logic (handle_emoji) to use/persist the seen_zwj flag when requesting additional pre-context.
  • Adds regression and coverage tests for chunked emoji/ZWJ boundary scenarios.
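
The changes listed above can be illustrated with a simplified, self-contained model. This is a sketch, not the crate's code: the `Cat` and `Scan` types, the toy `classify` table, and the `keep_state` switch are all assumptions made for illustration. Only the idea of a `seen_zwj` flag carried by the emoji state comes from the PR.

```rust
/// Toy grapheme-break categories (assumption: only what GB11 needs).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Cat { ExtPict, Extend, Zwj, Other }

/// Hypothetical classifier covering just the characters used here.
fn classify(c: char) -> Cat {
    match c {
        '\u{200d}' => Cat::Zwj,
        '\u{fe0f}' => Cat::Extend,                 // variation selector-16
        '\u{1f469}' | '\u{1f52c}' => Cat::ExtPict, // 👩, 🔬
        _ => Cat::Other,
    }
}

/// Backward-scan state for "ExtPict Extend* ZWJ × ExtPict" (GB11).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Scan {
    /// Undecided; `seen_zwj` records that the ZWJ was already consumed.
    Emoji { seen_zwj: bool },
    NoBreak,
    Break,
}

/// Scans one chunk right to left, resuming from `state`.
fn scan_back(chunk: &str, mut state: Scan) -> Scan {
    for c in chunk.chars().rev() {
        state = match (state, classify(c)) {
            (Scan::Emoji { seen_zwj: false }, Cat::Zwj) => Scan::Emoji { seen_zwj: true },
            (Scan::Emoji { seen_zwj: true }, Cat::Extend) => state,
            (Scan::Emoji { seen_zwj: true }, Cat::ExtPict) => Scan::NoBreak,
            (Scan::Emoji { .. }, _) => Scan::Break,
            (done, _) => return done, // already decided
        };
    }
    state
}

/// `chunks_before` holds the text preceding the candidate boundary, in
/// document order. With `keep_state == false` the flag is dropped between
/// chunks, which models the behaviour described as the bug.
fn gb11_applies(chunks_before: &[&str], keep_state: bool) -> bool {
    let mut state = Scan::Emoji { seen_zwj: false };
    for chunk in chunks_before.iter().rev() {
        match scan_back(chunk, state) {
            Scan::NoBreak => return true,
            Scan::Break => return false,
            next @ Scan::Emoji { .. } => {
                state = if keep_state { next } else { Scan::Emoji { seen_zwj: false } };
            }
        }
    }
    false // ran out of text before finding the leading pictograph
}
```

In this model, `gb11_applies(&["\u{1f469}", "\u{200d}"], true)` finds the ZWJ in the nearer chunk, keeps `seen_zwj: true`, and then accepts 👩 from the pre-context chunk. With `keep_state` set to `false`, the second chunk is scanned as if no ZWJ had been seen, so 👩 is rejected and a break is reported.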

Comment thread src/grapheme.rs
Comment on lines +180 to +181
/// Whether the ZWJ char has already been seen, so that only the "\p{Extended_Pictographic} Extend*"
/// part of GB11 remains to be checked
@Manishearth Manishearth merged commit 9a42b9d into unicode-rs:master May 24, 2026
4 checks passed
Development

Successfully merging this pull request may close these issues.

GraphemeCursor::next_boundary() returns incorrect boundary when chunk starts with ZWJ

3 participants