Skip to content

Fix string offset parameters to use character positions instead of byte offsets#31

Open
Claude wants to merge 2 commits intomainfrom
claude/fix-byte-offset-issue
Open

Fix string offset parameters to use character positions instead of byte offsets#31
Claude wants to merge 2 commits intomainfrom
claude/fix-byte-offset-issue

Conversation

@Claude
Copy link
Copy Markdown

@Claude Claude AI commented Apr 1, 2026

The Find(string, int) and Captures(string, int) methods were treating the offset parameter as a byte position in the UTF-8 encoded string, causing incorrect behavior with multi-byte characters (emojis, CJK characters, etc.).

Changes

  • Modified Find(string haystack, int offset) and Captures(string haystack, int offset): Convert character offset to byte offset using Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset)) before passing to the underlying byte-based methods
  • Updated XML documentation: Clarified that string-based methods accept character offsets while byte-based methods accept byte offsets
  • Added test coverage: 6 new tests covering multi-byte UTF-8 scenarios (4-byte emojis, 3-byte CJK characters)

Example

Regex re = new("world");
string haystack = "Hello 🌍 world";  // 🌍 is 4 bytes in UTF-8

// Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8)...
// Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11)...

// Now correctly interprets offset as character position
var match = re.Find(haystack, 8);  // Starts at character 'w', not mid-emoji
Assert.Equal("world", match.ExtractedText);

- Modified Find(string, int) to convert character offset to byte offset using Encoding.UTF8.GetByteCount
- Modified Captures(string, int) to convert character offset to byte offset
- Updated XML documentation to clarify that string methods use character offsets, while byte methods use byte offsets
- Added comprehensive tests for multi-byte UTF-8 characters (emojis and Asian characters)
- All 71 tests passing (6 new tests added)

Agent-Logs-Url: https://github.com/crispthinking/IronRe2/sessions/2d26941d-972d-4c97-a3ec-ad34e99c1ff3

Co-authored-by: corpo-iwillspeak <265613520+corpo-iwillspeak@users.noreply.github.com>
@Claude Claude AI changed the title [WIP] Fix offset interpretation in string haystack Fix string offset parameters to use character positions instead of byte offsets Apr 1, 2026
@Claude Claude AI requested a review from corpo-iwillspeak April 1, 2026 10:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the offset parameter semantics for the string-based Regex.Find(string, int) and Regex.Captures(string, int) APIs so callers can pass a string index (character position) rather than a UTF-8 byte offset, improving correctness with multi-byte UTF-8 characters.

Changes:

  • Convert string offsets to UTF-8 byte offsets before calling the byte-based implementations.
  • Clarify XML docs to distinguish character offsets (string overloads) vs byte offsets (byte overloads).
  • Add multi-byte UTF-8 test coverage (emoji/CJK) for the updated behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/IronRe2/Regex.cs Converts string offsets to byte offsets and updates XML docs to specify offset units.
test/IronRe2.Tests/RegexTests.cs Adds new tests covering multi-byte UTF-8 scenarios for Find/Captures behavior.
Comments suppressed due to low confidence (2)

src/IronRe2/Regex.cs:216

  • The byte-based overload Find(ReadOnlyMemory<byte> hayBytes, int offset) forwards offset directly into the native cre2_match call (via RawMatch) with no range/negative validation. A negative offset or an offset greater than hayBytes.Length can lead to undefined behavior in native code. Consider validating offset (0..hayBytes.Length) and throwing an ArgumentOutOfRangeException (or returning Match.Empty, if that’s the established contract) before calling RawMatch.
    public Match Find(ReadOnlyMemory<byte> hayBytes, int offset)
    {
        var ranges = RawMatch(hayBytes.Span, offset, 1);
        return ranges.Length != 1 ? Match.Empty : new Match(hayBytes, ranges[0]);
    }

src/IronRe2/Regex.cs:328

  • Similarly, Captures(ReadOnlyMemory<byte> haystack, int offset) forwards offset into native matching without validating that it’s within [0, haystack.Length]. Negative or out-of-range offsets can cause undefined behavior in cre2_match. Consider adding an explicit range check and throwing ArgumentOutOfRangeException (or returning Captures.Empty if that matches the intended API contract).
    public Captures Captures(ReadOnlyMemory<byte> haystack, int offset)
    {
        var ranges = RawMatch(haystack.Span, offset, CaptureGroupCount + 1);
        return ranges.Length == 0 ? IronRe2.Captures.Empty : new Captures(haystack, ranges);
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/IronRe2/Regex.cs
Comment on lines 188 to 193
public Match Find(string haystack, int offset)
{
var hayBytes = Encoding.UTF8.GetBytes(haystack);
return Find(hayBytes, offset);
var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
return Find(hayBytes, byteOffset);
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find(string, int offset) currently computes byteOffset via haystack.AsSpan(0, offset) without validating offset. This will throw for negative/out-of-range offsets after already allocating hayBytes, and (more importantly) allows offsets that split a UTF-16 surrogate pair, producing a byteOffset that can point into the middle of the UTF-8 sequence and reintroduce the original bug for emoji/surrogate scenarios. Consider validating offset (0..haystack.Length) and rejecting offsets that land on a low-surrogate preceded by a high-surrogate (i.e., inside a surrogate pair) before computing byteOffset, and compute byteOffset before encoding the whole haystack to avoid unnecessary work on invalid input.

Copilot uses AI. Check for mistakes.
Comment thread src/IronRe2/Regex.cs
Comment on lines 289 to 294
public Captures Captures(string haystack, int offset)
{
var hayBytes = Encoding.UTF8.GetBytes(haystack);
return Captures(hayBytes, offset);
var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
return Captures(hayBytes, byteOffset);
}
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Captures(string, int offset) has the same surrogate-pair and bounds concerns as Find(string, int): GetByteCount(haystack.AsSpan(0, offset)) can compute a byte offset that does not correspond to a valid UTF-8 boundary when offset splits a UTF-16 surrogate pair, and it will throw for invalid offsets after allocating hayBytes. Add explicit offset validation (including surrogate-pair boundary) and compute byteOffset before encoding the full haystack.

Copilot uses AI. Check for mistakes.
Comment on lines +403 to +405
// Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8) o(9) r(10) l(11) d(12)
// Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11) o(12) r(13) l(14) d(15)

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The newly added index explanation treats the 🌍 emoji as a single "character" at string index 6, but in .NET a string index is a UTF-16 code-unit offset (🌍 is a surrogate pair, length 2). Since the API now documents a "character offset", this comment is likely to mislead readers about which offsets are valid; consider updating the comment (and any stated positions) to reflect UTF-16 indices or explicitly say "Unicode scalar value" if that’s the intended meaning.

Copilot uses AI. Check for mistakes.
Assert.Equal("🌍", match.ExtractedText);
}

[Fact]
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the new offset conversion logic for string overloads, it would be good to add a regression test covering offsets that split a UTF-16 surrogate pair (e.g., offset pointing at the low surrogate of an emoji). Today that case can compute a byte offset that lands mid–UTF-8 sequence unless the implementation explicitly rejects it; a test here would lock in the intended behavior (throw vs. treat as invalid vs. adjust to nearest boundary).

Suggested change
[Fact]
[Fact]
public void StringOverloadsRejectOffsetInsideSurrogatePair()
{
Regex re = new("world");
const string haystack = "Hello 🌍 world";
// In UTF-16, the emoji at visual character position 6 is a surrogate pair:
// high surrogate at index 6, low surrogate at index 7.
// Passing offset 7 would split the surrogate pair and could otherwise
// map to a byte offset in the middle of the UTF-8 encoding.
const int lowSurrogateOffset = 7;
Assert.ThrowsAny<ArgumentException>(() => re.Find(haystack, lowSurrogateOffset));
Assert.ThrowsAny<ArgumentException>(() => re.Captures(haystack, lowSurrogateOffset));
}
[Fact]

Copilot uses AI. Check for mistakes.
Comment on lines +417 to +426
public void FindWithCharacterOffsetBeforeMultiByteCharacter()
{
Regex re = new("🌍");
const string haystack = "Hello 🌍 world";

// Find starting at character offset 6 (the emoji itself)
var match = re.Find(haystack, 6);

Assert.True(match.Matched);
Assert.Equal("🌍", match.ExtractedText);
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FindWithCharacterOffsetBeforeMultiByteCharacter uses offset = 6, which happens to be both the UTF-16 index of the emoji’s first surrogate and also the UTF-8 byte index where the emoji begins in this particular string. That means this test likely would have passed even before the fix (when offset was treated as a byte index), so it doesn’t strongly validate the new character-offset behavior. Consider adjusting it to use an offset where byte vs. character interpretation diverges (e.g., the offset immediately after the emoji, or explicitly testing the surrogate-boundary case if you choose to define behavior there).

Suggested change
public void FindWithCharacterOffsetBeforeMultiByteCharacter()
{
Regex re = new("🌍");
const string haystack = "Hello 🌍 world";
// Find starting at character offset 6 (the emoji itself)
var match = re.Find(haystack, 6);
Assert.True(match.Matched);
Assert.Equal("🌍", match.ExtractedText);
public void FindWithCharacterOffsetAfterMultiByteCharacterDoesNotMatchPreviousCharacter()
{
Regex re = new("🌍");
const string haystack = "Hello 🌍 world";
// Start searching immediately after the emoji. If the offset is interpreted as a
// character offset, the previous emoji must not be matched.
var match = re.Find(haystack, 7);
Assert.False(match.Matched);

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Its not clear that offsets in string haystack refer to byte offsets

3 participants