@Yuof Yuof commented Jan 28, 2026

Summary

When DataWorker chunks a string using substring(), it can split a UTF-16 surrogate pair (the two code units that represent a character above U+FFFF) across a chunk boundary. Utf8EncodeWorker encoded each chunk independently, so each lone surrogate was encoded as its own 3-byte sequence (as in CESU-8) instead of the pair being combined into a single valid 4-byte UTF-8 sequence.

Problem

  • DataWorker uses DEFAULT_BLOCK_SIZE = 16 * 1024 and chunks strings with substring(index, nextIndex)
  • substring() operates on UTF-16 code units, which can split surrogate pairs
  • Characters above U+FFFF (emoji, rare CJK, the supplementary Private Use Areas, etc.) require surrogate pairs in JavaScript's UTF-16 strings
  • When a surrogate pair lands at a chunk boundary, the high surrogate ends up in one chunk and the low surrogate in the next
  • Utf8EncodeWorker.processChunk() had no handling for this case, causing each surrogate to be encoded as a 3-byte sequence (CESU-8) instead of being combined into a 4-byte UTF-8 sequence
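
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not JSZip's actual code: `encodeChunk` is a hypothetical stand-in for a per-chunk encoder that combines surrogate pairs only when both halves are in the same string, and the split at index 2 stands in for a chunk boundary that happens to fall inside the pair.

```javascript
// Hypothetical per-chunk encoder (illustrative only, not JSZip's implementation).
// It combines a surrogate pair only if both halves are in the same string.
function encodeChunk(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i);
    // Combine a high surrogate with the low surrogate that follows it, if any.
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      const c2 = str.charCodeAt(i + 1);
      if (c2 >= 0xdc00 && c2 <= 0xdfff) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) {
      bytes.push(c);
    } else if (c < 0x800) {
      bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0x10000) {
      // A lone surrogate falls through to this 3-byte branch.
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    } else {
      bytes.push(
        0xf0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3f),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f)
      );
    }
  }
  return bytes;
}

const sample = "a\u{1F600}"; // 'a' + U+1F600 = 3 UTF-16 code units

// Encoding the whole string keeps the pair together: 61 F0 9F 98 80.
const wholeBytes = encodeChunk(sample);

// Splitting with substring() between the two surrogates and encoding each
// part independently yields two 3-byte sequences: 61 ED A0 BD ED B8 80.
const splitBytes = [
  ...encodeChunk(sample.substring(0, 2)),
  ...encodeChunk(sample.substring(2)),
];
```

The split output is one byte pair longer than the correct output and is not valid UTF-8, which matches the corruption described in this PR.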

Solution

Add leftOver handling to Utf8EncodeWorker (similar to what Utf8DecodeWorker already has for incomplete UTF-8 sequences):

  1. In processChunk(): Check if the chunk ends with a high surrogate (0xD800-0xDBFF). If so, save it and exclude it from the current chunk's encoding.
  2. At the start of each processChunk(): Prepend any saved high surrogate to the current chunk's data.
  3. Add a flush() method that emits any high surrogate still buffered when the stream ends.
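
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in lib/utf8.js: `SurrogateSafeEncoder` and `encodeUtf8` are hypothetical names, the class does not extend JSZip's worker classes, and what flush() should emit for an unpaired high surrogate is an assumption (here it is encoded as-is).

```javascript
// Hypothetical encoder: combines surrogate pairs found within one string.
function encodeUtf8(str) {
  const bytes = [];
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      const c2 = str.charCodeAt(i + 1);
      if (c2 >= 0xdc00 && c2 <= 0xdfff) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) bytes.push(c);
    else if (c < 0x800) bytes.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    else if (c < 0x10000)
      bytes.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    else
      bytes.push(
        0xf0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3f),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f)
      );
  }
  return bytes;
}

// Hypothetical stand-in for the leftOver logic added to Utf8EncodeWorker.
class SurrogateSafeEncoder {
  constructor() {
    this.leftOver = ""; // at most one buffered high surrogate
  }

  processChunk(chunk) {
    // Step 2: prepend any high surrogate saved from the previous chunk.
    let data = this.leftOver + chunk;
    this.leftOver = "";

    // Step 1: if the data ends with a high surrogate, hold it back.
    if (data.length > 0) {
      const last = data.charCodeAt(data.length - 1);
      if (last >= 0xd800 && last <= 0xdbff) {
        this.leftOver = data.charAt(data.length - 1);
        data = data.slice(0, -1);
      }
    }
    return encodeUtf8(data);
  }

  // Step 3: emit whatever is still buffered at stream end (assumed behavior).
  flush() {
    const out = encodeUtf8(this.leftOver);
    this.leftOver = "";
    return out;
  }
}

const encoder = new SurrogateSafeEncoder();
const firstOut = encoder.processChunk("a\uD83D");  // high surrogate held back
const secondOut = encoder.processChunk("\uDE00b"); // pair reunited, then 'b'
const flushOut = encoder.flush();                  // nothing left to emit
```

With the pair split across the two calls, the first call emits only the byte for 'a', and the second emits the 4-byte sequence F0 9F 98 80 followed by 'b'.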

Impact

This fix ensures that astral plane characters are correctly encoded as 4-byte UTF-8 sequences regardless of their position in the input string, fixing silent data corruption in generated ZIP files.

Test

Added a test case that positions an astral character (U+1F600, 😀) at the exact chunk boundary and verifies that no CESU-8 sequences (0xED followed by 0xA0-0xBF) appear in the output.
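
The byte-level check described above can be sketched as a small scanner. `findCesu8Surrogates` is a hypothetical helper name, not taken from the test file. It relies on the fact that in valid UTF-8 a lead byte of 0xED is only ever followed by 0x80-0x9F, so 0xED followed by 0xA0-0xBF indicates an encoded surrogate.

```javascript
// Hypothetical helper: return the offsets at which an encoded surrogate
// (0xED followed by 0xA0-0xBF) starts in a UTF-8 byte sequence.
function findCesu8Surrogates(bytes) {
  const offsets = [];
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xed && bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf) {
      offsets.push(i);
    }
  }
  return offsets;
}

// Correct 4-byte encoding of U+1F600 after 'a': no surrogate bytes.
const validHits = findCesu8Surrogates([0x61, 0xf0, 0x9f, 0x98, 0x80]);

// The same character encoded as two separate surrogates: two hits.
const brokenHits = findCesu8Surrogates([0x61, 0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80]);
```

This scan is meant for the UTF-8 encoded content itself; arbitrary binary data such as compressed ZIP payloads may contain the same byte pattern by chance.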

Related

  • The Utf8DecodeWorker already has similar leftOver handling for incomplete UTF-8 sequences at chunk boundaries
  • This bug was discovered in the context of the docx library which uses JSZip for DOCX file generation

…deWorker

When DataWorker chunks a string using substring(), it can split UTF-16
surrogate pairs (characters above U+FFFF) across chunk boundaries. The
Utf8EncodeWorker was encoding each chunk independently, causing lone
surrogates to be encoded as 3-byte CESU-8 sequences instead of being
combined into proper 4-byte UTF-8 sequences.

This fix adds leftOver handling to Utf8EncodeWorker (similar to what
Utf8DecodeWorker already has) to preserve high surrogates at chunk
boundaries and prepend them to the next chunk.

Fixes the issue where astral plane characters (emoji, rare CJK, etc.)
positioned at 16KB boundaries would produce invalid UTF-8 output.
Copilot AI review requested due to automatic review settings January 28, 2026 13:03

Copilot AI left a comment

Pull request overview

Fixes incorrect UTF-8 output when DataWorker splits UTF-16 surrogate pairs across chunk boundaries, which previously caused CESU-8-style 3-byte encodings for lone surrogates.

Changes:

  • Buffer a trailing high surrogate in Utf8EncodeWorker.processChunk() and prepend it to the next chunk before encoding.
  • Add Utf8EncodeWorker.flush() to emit any remaining buffered high surrogate at stream end.
  • Add a QUnit regression test placing an astral character exactly on the internal chunk boundary and asserting no CESU-8 surrogate encodings appear.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| lib/utf8.js | Adds cross-chunk surrogate-pair handling in Utf8EncodeWorker, plus a flush() implementation. |
| test/asserts/unicode.js | Adds a regression test ensuring astral chars at chunk boundaries don’t produce CESU-8 bytes and round-trip correctly. |
