fix: handle surrogate pairs split across chunk boundaries in Utf8EncodeWorker #963

Yuof · 2026-01-28T13:03:02Z

Summary

When DataWorker chunks a string using substring(), it can split UTF-16 surrogate pairs (characters above U+FFFF) across chunk boundaries. The Utf8EncodeWorker was encoding each chunk independently, causing lone surrogates to be encoded as 3-byte CESU-8 sequences instead of being combined into proper 4-byte UTF-8 sequences.

Problem

DataWorker uses DEFAULT_BLOCK_SIZE = 16 * 1024 and chunks strings with substring(index, nextIndex)
substring() operates on UTF-16 code units, so it can split a surrogate pair
Characters above U+FFFF (emoji, rare CJK, supplementary Private Use Areas, etc.) require surrogate pairs in JavaScript's UTF-16 strings
When a surrogate pair lands at a chunk boundary, the high surrogate ends up in one chunk and the low surrogate in the next
Utf8EncodeWorker.processChunk() had no handling for this case, causing each surrogate to be encoded as a 3-byte sequence (CESU-8) instead of being combined into a 4-byte UTF-8 sequence
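
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not JSZip's actual code: encodeChunk is a hypothetical, simplified encoder that pairs surrogates only when both halves sit in the same string, which is assumed to approximate the old per-chunk behaviour.

```javascript
// Hypothetical simplified encoder (an assumption, not the library's implementation).
// It combines a surrogate pair only if both halves are inside `str`.
function encodeChunk(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i);
    // Combine a high surrogate with the following low surrogate, if present.
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      const c2 = str.charCodeAt(i + 1);
      if (c2 >= 0xdc00 && c2 <= 0xdfff) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      // A lone surrogate falls through to here and becomes 3 bytes.
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      bytes.push(
        0xf0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3f),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f)
      );
    }
  }
  return bytes;
}

const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

const text = "a\u{1F600}"; // "a" + U+1F600: three UTF-16 code units
const whole = encodeChunk(text);
// Split between the two surrogates, as substring() would at an unlucky boundary.
const split = [...encodeChunk(text.substring(0, 2)), ...encodeChunk(text.substring(2))];

console.log(hex(whole)); // 61 f0 9f 98 80
console.log(hex(split)); // 61 ed a0 bd ed b8 80
```

Encoding the whole string yields the expected 4-byte sequence, while encoding the two halves separately yields two 3-byte surrogate encodings.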

Solution

Add leftOver handling to Utf8EncodeWorker (similar to what Utf8DecodeWorker already has for incomplete UTF-8 sequences):

In processChunk(): Check if the chunk ends with a high surrogate (0xD800-0xDBFF). If so, save it and exclude it from the current chunk's encoding.
At the start of each processChunk(): Prepend any saved high surrogate to the current chunk's data.
Add a flush() method to emit any leftover high surrogate at stream end.
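
The three steps above can be sketched roughly as follows. ChunkedUtf8Encoder is a hypothetical stand-in for Utf8EncodeWorker, and its shape is an assumption rather than the code in lib/utf8.js. It uses the standard TextEncoder for the byte encoding, so a high surrogate that is still unpaired at flush() would come out as U+FFFD here, which may differ from what the library emits.

```javascript
// Hypothetical stand-in for Utf8EncodeWorker; names and structure are assumptions.
class ChunkedUtf8Encoder {
  constructor() {
    this.leftOver = ""; // trailing high surrogate held back from the previous chunk
    this.encoder = new TextEncoder();
  }

  // Encode one chunk, holding back a trailing high surrogate if there is one.
  processChunk(chunk) {
    // Prepend whatever was saved from the previous chunk.
    let data = this.leftOver + chunk;
    this.leftOver = "";
    if (data.length > 0) {
      const last = data.charCodeAt(data.length - 1);
      if (last >= 0xd800 && last <= 0xdbff) {
        // Save the high surrogate and exclude it from this chunk's encoding.
        this.leftOver = data.slice(-1);
        data = data.slice(0, -1);
      }
    }
    return this.encoder.encode(data);
  }

  // Emit whatever is still buffered at end of stream.
  flush() {
    const out = this.encoder.encode(this.leftOver);
    this.leftOver = "";
    return out;
  }
}

const text = "a\u{1F600}b";
const enc = new ChunkedUtf8Encoder();
const parts = [
  enc.processChunk(text.substring(0, 2)), // ends with the high surrogate
  enc.processChunk(text.substring(2)), // starts with the low surrogate
  enc.flush(),
];
const bytes = parts.flatMap((p) => Array.from(p));
console.log(bytes.map((b) => b.toString(16)).join(" ")); // 61 f0 9f 98 80 62
```

Holding back only the final code unit keeps the buffer at most one character long, so the approach adds no meaningful memory cost per chunk.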

Impact

This fix ensures that astral plane characters are correctly encoded as 4-byte UTF-8 sequences regardless of their position in the input string, fixing silent data corruption in generated ZIP files.

Test

Added a test case that positions an astral character (U+1F600, 😀) at the exact chunk boundary and verifies that no CESU-8 sequences (0xED followed by 0xA0-0xBF) appear in the output.
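
As an illustration of the check described above (not the QUnit test from this PR), the sketch below builds an input whose high surrogate is the last code unit of the first block and scans a byte sequence for encoded surrogates. hasEncodedSurrogate and BLOCK_SIZE are hypothetical names; the block size is assumed to match the DEFAULT_BLOCK_SIZE quoted in the Problem section.

```javascript
// True if the byte sequence contains a UTF-8-encoded surrogate
// (0xED followed by a byte in 0xA0-0xBF), which valid UTF-8 never contains.
function hasEncodedSurrogate(bytes) {
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xed && bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf) {
      return true;
    }
  }
  return false;
}

const BLOCK_SIZE = 16 * 1024; // assumed to mirror DEFAULT_BLOCK_SIZE
// Pad so the high surrogate is the last code unit of the first block.
const input = "a".repeat(BLOCK_SIZE - 1) + "\u{1F600}";
const firstBlock = input.substring(0, BLOCK_SIZE);
const lastUnit = firstBlock.charCodeAt(firstBlock.length - 1);

console.log(lastUnit.toString(16)); // d83d
console.log(hasEncodedSurrogate([0x61, 0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80])); // true
console.log(hasEncodedSurrogate(new TextEncoder().encode(input))); // false
```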

…deWorker When DataWorker chunks a string using substring(), it can split UTF-16 surrogate pairs (characters above U+FFFF) across chunk boundaries. The Utf8EncodeWorker was encoding each chunk independently, causing lone surrogates to be encoded as 3-byte CESU-8 sequences instead of being combined into proper 4-byte UTF-8 sequences. This fix adds leftOver handling to Utf8EncodeWorker (similar to what Utf8DecodeWorker already has) to preserve high surrogates at chunk boundaries and prepend them to the next chunk. Fixes the issue where astral plane characters (emoji, rare CJK, etc.) positioned at 16KB boundaries would produce invalid UTF-8 output.

Copilot

Pull request overview

Fixes incorrect UTF-8 output when DataWorker splits UTF-16 surrogate pairs across chunk boundaries, which previously caused CESU-8-style 3-byte encodings for lone surrogates.

Changes:

Buffer a trailing high surrogate in Utf8EncodeWorker.processChunk() and prepend it to the next chunk before encoding.
Add Utf8EncodeWorker.flush() to emit any remaining buffered high surrogate at stream end.
Add a QUnit regression test placing an astral character exactly on the internal chunk boundary and asserting no CESU-8 surrogate encodings appear.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
lib/utf8.js	Adds cross-chunk surrogate-pair handling in `Utf8EncodeWorker`, plus a `flush()` implementation.
test/asserts/unicode.js	Adds a regression test ensuring astral chars at chunk boundaries don’t produce CESU-8 bytes and round-trip correctly.


Copilot AI review requested due to automatic review settings January 28, 2026 13:03

Copilot started reviewing on behalf of Yuof January 28, 2026 13:03

Copilot AI reviewed Jan 28, 2026

Yuof mentioned this pull request Jan 29, 2026

fix: pre-encode XML to UTF-8 to avoid surrogate pair corruption in JSZip dolanmiu/docx#3329

Open

Address review feedback: improve comment and test

20c716d
