Skip to content

fix: prevent UTF-8 slicing panic in Differ::fancy_replace#10

Open
RandyMcMillan wants to merge 4 commits intoDimaKudosh:masterfrom
gnostr-org:1876/947305/418560/f035fb8/0009f77-difflib-parsing
Open

fix: prevent UTF-8 slicing panic in Differ::fancy_replace#10
RandyMcMillan wants to merge 4 commits intoDimaKudosh:masterfrom
gnostr-org:1876/947305/418560/f035fb8/0009f77-difflib-parsing

Conversation

@RandyMcMillan
Copy link
Copy Markdown

Address a thread panic at src/differ.rs:271:52 caused by byte-indexed string slicing on multi-byte (UTF-8) character boundaries.

The previous implementation assumed character indices matched byte offsets, leading to crashes when diffing lines containing emojis, mathematical symbols, or non-ASCII characters.

Changes:

  • Replaced direct byte slicing with character-aware indexing using .char_indices().
  • Added a regression test 'test_utf8_intraline_diff_no_panic' to ensure stability when comparing strings with multi-byte characters.

Verified on:

  • x86_64-unknown-linux-gnu
  • aarch64-apple-darwin

03-000ef697

Address a thread panic at src/differ.rs:271:52 caused by byte-indexed
string slicing on multi-byte (UTF-8) character boundaries.

The previous implementation assumed character indices matched byte
offsets, leading to crashes when diffing lines containing emojis,
mathematical symbols, or non-ASCII characters.

Changes:
- Replaced direct byte slicing with character-aware indexing using .char_indices().
- Added a regression test 'test_utf8_intraline_diff_no_panic' to ensure
  stability when comparing strings with multi-byte characters.

Verified on:
- x86_64-unknown-linux-gnu
- aarch64-apple-darwin

03-000ef697
Copilot AI review requested due to automatic review settings April 30, 2026 15:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a panic in Differ::fancy_replace/qformat triggered when handling multi-byte UTF-8 characters by avoiding unsafe byte-indexed slicing, and adds a regression test covering emoji input.

Changes:

  • Introduces a helper to split &str at a character index via char_indices() and uses it when trimming tag prefixes.
  • Adds a UTF-8/emoji comparison test to ensure the differ output remains stable and does not panic.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/differ.rs Adds character-aware splitting and applies it when trimming diff tag strings in qformat.
tests/tests.rs Adds a regression test that diffs two strings containing emojis.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/tests.rs
Comment thread src/differ.rs Outdated
@@ -268,8 +277,8 @@ impl Differ {
);
common = cmp::min(common, count_leading(first_tags.split_at(common).0, ' '));
common = cmp::min(common, count_leading(first_tags.split_at(common).0, ' '));
qformat now trims the shared prefix against both tag lines instead of
 checking first_tags twice, which keeps the generated ? lines aligned when the
 second tag line has different leading spaces.
Add a regression test for the alignment case.

00-0013c970
@RandyMcMillan
Copy link
Copy Markdown
Author

#10 (comment): based in @copilot suggestion

Drop the redundant mut binding when building the second-sequence index.
 The Vec returned from or_insert_with is still mutated, but the binding itself
 does not need to be mutable.

02-000ca690
Rename the regression test to test_utf8_intraline_diff_no_panic so it
 matches the PR description and stays easy to search when debugging similar
 UTF-8 intraline diff issues.

02-0033b485
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a UTF-8 panic in Differ’s intraline diff formatting by removing byte-indexed slicing and replacing it with character-aware splitting, ensuring diffs work on multi-byte Unicode (e.g., emojis). It also corrects tag-line alignment logic and adds regression coverage.

Changes:

  • Add split_at_char_boundary() and use it in Differ::qformat() to avoid slicing &str at non–UTF-8 boundaries.
  • Fix qformat() alignment to consider both tag lines (previously applied the first tag line twice).
  • Add regression tests covering UTF-8 intraline diffs and the corrected alignment behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
src/differ.rs Introduces char-boundary splitting in qformat() and fixes tag-line alignment; adds a unit test for alignment.
tests/tests.rs Adds an integration regression test ensuring UTF-8 intraline diffs don’t panic and include both Unicode variants.
src/sequencematcher.rs Minor cleanup: removes an unnecessary mut binding in chain_second_seq().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a panic in Differ::fancy_replace/qformat when diffing lines that contain multi-byte UTF-8 characters by avoiding byte-indexed slicing, and adds regression coverage to keep UTF-8 intraline diffs stable.

Changes:

  • Added split_at_char_boundary to convert a character index into a safe byte boundary before splitting.
  • Updated qformat to align based on both tag lines and to split tags using character-aware indexing.
  • Added a UTF-8 regression test to ensure comparing lines with emojis does not panic.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/differ.rs Introduces character-boundary splitting and fixes tag alignment logic in qformat; adds a targeted unit test.
tests/tests.rs Adds a regression test ensuring UTF-8 intraline diffs don’t panic and preserve the original Unicode content.
src/sequencematcher.rs Minor cleanup in chain_second_seq (removes unnecessary mut binding).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/differ.rs
Comment on lines +280 to +281
first_tags = Self::split_at_char_boundary(first_tags, common).1.trim_right();
second_tags = Self::split_at_char_boundary(second_tags, common).1.trim_right();
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants