fix: handle surrogate pairs split across chunk boundaries in Utf8EncodeWorker #963
Summary
When DataWorker chunks a string using substring(), it can split UTF-16 surrogate pairs (characters above U+FFFF) across chunk boundaries. The Utf8EncodeWorker was encoding each chunk independently, causing lone surrogates to be encoded as 3-byte CESU-8 sequences instead of being combined into proper 4-byte UTF-8 sequences.
Problem
- DataWorker uses DEFAULT_BLOCK_SIZE = 16 * 1024 and chunks strings with substring(index, nextIndex)
- substring() operates on UTF-16 code units, which can split surrogate pairs
- Utf8EncodeWorker.processChunk() had no handling for this case, causing each surrogate to be encoded as a 3-byte sequence (CESU-8) instead of being combined into a 4-byte UTF-8 sequence
Solution
Add leftOver handling to Utf8EncodeWorker (similar to what Utf8DecodeWorker already has for incomplete UTF-8 sequences):
- processChunk(): Check if the chunk ends with a high surrogate (0xD800-0xDBFF). If so, save it and exclude it from the current chunk's encoding.
- processChunk(): Prepend any saved high surrogate to the current chunk's data.
- flush() method to handle any leftover at stream end.
Impact
This fix ensures that astral plane characters are correctly encoded as 4-byte UTF-8 sequences regardless of their position in the input string, fixing silent data corruption in generated ZIP files.
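For reviewers who want to see the effect in isolation, here is a minimal standalone sketch of the carry-over idea. It is not the code from this PR: encodeChunk and SurrogateSafeEncoder are hypothetical names, the encoder is a simplified hand-written one that pairs surrogates only within a single chunk, and the choice to encode a dangling high surrogate as-is in flush() is an assumption.

```javascript
// Hypothetical helper: encodes one string chunk to UTF-8 bytes. It pairs
// surrogates only when both halves are in the same chunk, so a lone
// surrogate falls through to the 3-byte branch.
function encodeChunk(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i);
    if ((c & 0xfc00) === 0xd800 && i + 1 < str.length) {
      const c2 = str.charCodeAt(i + 1);
      if ((c2 & 0xfc00) === 0xdc00) {
        c = 0x10000 + ((c - 0xd800) << 10) + (c2 - 0xdc00);
        i++;
      }
    }
    if (c < 0x80) out.push(c);
    else if (c < 0x800) out.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    else if (c < 0x10000)
      out.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    else
      out.push(0xf0 | (c >> 18), 0x80 | ((c >> 12) & 0x3f),
               0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
  }
  return out;
}

// Hypothetical stand-in for the worker: a trailing high surrogate is held
// back and prepended to the next chunk instead of being encoded alone.
class SurrogateSafeEncoder {
  constructor() {
    this.leftOver = "";
  }
  processChunk(chunk) {
    let data = this.leftOver + chunk;
    this.leftOver = "";
    if (data.length > 0) {
      const last = data.charCodeAt(data.length - 1);
      if (last >= 0xd800 && last <= 0xdbff) {
        this.leftOver = data.slice(-1);
        data = data.slice(0, -1);
      }
    }
    return encodeChunk(data);
  }
  // Assumption: an unpaired high surrogate at stream end is encoded as-is.
  flush() {
    const rest = this.leftOver;
    this.leftOver = "";
    return encodeChunk(rest);
  }
}

const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

const text = "ab\u{1F600}cd";
// Split between the two halves of the surrogate pair, as substring() can.
const chunks = [text.substring(0, 3), text.substring(3)];

const naive = chunks.flatMap((c) => encodeChunk(c));
const enc = new SurrogateSafeEncoder();
const fixed = [...chunks.flatMap((c) => enc.processChunk(c)), ...enc.flush()];

console.log(hex(naive)); // 61 62 ed a0 bd ed b8 80 63 64
console.log(hex(fixed)); // 61 62 f0 9f 98 80 63 64
```

The naive path produces two 3-byte sequences for the split pair, while the carry-over path produces the single 4-byte sequence f0 9f 98 80, the same bytes a whole-string encode yields.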
Test
Added a test case that positions an astral character (U+1F600, 😀) at the exact chunk boundary and verifies that no CESU-8 sequences (0xED followed by 0xA0-0xBF) appear in the output.
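The check described above can be sketched roughly as follows. This is an illustration, not the test added in the PR: BLOCK_SIZE mirrors the DEFAULT_BLOCK_SIZE value quoted in the description, hasCesu8Surrogate is a hypothetical helper name, and the reference bytes come from the platform TextEncoder rather than from the library under test.

```javascript
// Assumption: mirrors DEFAULT_BLOCK_SIZE = 16 * 1024 from the description.
const BLOCK_SIZE = 16 * 1024;

// Returns true if the bytes contain 0xED followed by 0xA0-0xBF. Valid UTF-8
// never contains that pair, because it would decode to a surrogate.
function hasCesu8Surrogate(bytes) {
  for (let i = 0; i + 1 < bytes.length; i++) {
    if (bytes[i] === 0xed && bytes[i + 1] >= 0xa0 && bytes[i + 1] <= 0xbf) {
      return true;
    }
  }
  return false;
}

// Pad so the high surrogate is the last code unit of the first block and
// the low surrogate is the first code unit of the second block.
const input = "a".repeat(BLOCK_SIZE - 1) + "\u{1F600}" + "tail";
const high = input.charCodeAt(BLOCK_SIZE - 1);
const low = input.charCodeAt(BLOCK_SIZE);
const straddlesBoundary =
  high >= 0xd800 && high <= 0xdbff && low >= 0xdc00 && low <= 0xdfff;

// Reference encoding of the whole string, which must be free of CESU-8.
const reference = new TextEncoder().encode(input);

// The byte pattern a split pair produces when each half is encoded alone.
const brokenSample = [0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80];

console.log(straddlesBoundary);               // true
console.log(hasCesu8Surrogate(reference));    // false
console.log(hasCesu8Surrogate(brokenSample)); // true
```

In a real test, the bytes passed to hasCesu8Surrogate would be the output of the encoding pipeline for input rather than the TextEncoder reference.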
Related
Utf8DecodeWorker already has similar leftOver handling for incomplete UTF-8 sequences at chunk boundaries