Fix string offset parameters to use character positions instead of byte offsets by Claude · Pull Request #31 · crispthinking/IronRe2

Claude · 2026-04-01T10:01:32Z

The Find(string, int) and Captures(string, int) methods were treating the offset parameter as a byte position in the UTF-8 encoded string, causing incorrect behavior with multi-byte characters (emojis, CJK characters, etc.).

Changes

Modified Find(string haystack, int offset) and Captures(string haystack, int offset): Convert character offset to byte offset using Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset)) before passing to the underlying byte-based methods
Updated XML documentation: Clarified that string-based methods accept character offsets while byte-based methods accept byte offsets
Added test coverage: 6 new tests covering multi-byte UTF-8 scenarios (4-byte emojis, 3-byte CJK characters)

Example

Regex re = new("world");
string haystack = "Hello 🌍 world";  // 🌍 is 4 bytes in UTF-8

// Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8)...
// Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11)...

// Now correctly interprets offset as character position
var match = re.Find(haystack, 8);  // Starts at character 'w', not mid-emoji
Assert.Equal("world", match.ExtractedText);

- Modified Find(string, int) to convert character offset to byte offset using Encoding.UTF8.GetByteCount - Modified Captures(string, int) to convert character offset to byte offset - Updated XML documentation to clarify that string methods use character offsets, while byte methods use byte offsets - Added comprehensive tests for multi-byte UTF-8 characters (emojis and Asian characters) - All 71 tests passing (6 new tests added) Agent-Logs-Url: https://github.com/crispthinking/IronRe2/sessions/2d26941d-972d-4c97-a3ec-ad34e99c1ff3 Co-authored-by: corpo-iwillspeak <265613520+corpo-iwillspeak@users.noreply.github.com>

Copilot

Pull request overview

This PR fixes the offset parameter semantics for the string-based Regex.Find(string, int) and Regex.Captures(string, int) APIs so callers can pass a string index (character position) rather than a UTF-8 byte offset, improving correctness with multi-byte UTF-8 characters.

Changes:

Convert string offsets to UTF-8 byte offsets before calling the byte-based implementations.
Clarify XML docs to distinguish character offsets (string overloads) vs byte offsets (byte overloads).
Add multi-byte UTF-8 test coverage (emoji/CJK) for the updated behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
`src/IronRe2/Regex.cs`	Converts string offsets to byte offsets and updates XML docs to specify offset units.
`test/IronRe2.Tests/RegexTests.cs`	Adds new tests covering multi-byte UTF-8 scenarios for `Find`/`Captures` behavior.

Comments suppressed due to low confidence (2)

src/IronRe2/Regex.cs:216

The byte-based overload Find(ReadOnlyMemory<byte> hayBytes, int offset) forwards offset directly into the native cre2_match call (via RawMatch) with no range/negative validation. A negative offset or an offset greater than hayBytes.Length can lead to undefined behavior in native code. Consider validating offset (0..hayBytes.Length) and throwing an ArgumentOutOfRangeException (or returning Match.Empty, if that’s the established contract) before calling RawMatch.

    public Match Find(ReadOnlyMemory<byte> hayBytes, int offset)
    {
        var ranges = RawMatch(hayBytes.Span, offset, 1);
        return ranges.Length != 1 ? Match.Empty : new Match(hayBytes, ranges[0]);
    }

src/IronRe2/Regex.cs:328

Similarly, Captures(ReadOnlyMemory<byte> haystack, int offset) forwards offset into native matching without validating that it’s within [0, haystack.Length]. Negative or out-of-range offsets can cause undefined behavior in cre2_match. Consider adding an explicit range check and throwing ArgumentOutOfRangeException (or returning Captures.Empty if that matches the intended API contract).

    public Captures Captures(ReadOnlyMemory<byte> haystack, int offset)
    {
        var ranges = RawMatch(haystack.Span, offset, CaptureGroupCount + 1);
        return ranges.Length == 0 ? IronRe2.Captures.Empty : new Captures(haystack, ranges);
    }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-23T13:26:42Z

    public Match Find(string haystack, int offset)
    {
        var hayBytes = Encoding.UTF8.GetBytes(haystack);
-        return Find(hayBytes, offset);
+        var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
+        return Find(hayBytes, byteOffset);
    }


Find(string, int offset) currently computes byteOffset via haystack.AsSpan(0, offset) without validating offset. This will throw for negative/out-of-range offsets after already allocating hayBytes, and (more importantly) allows offsets that split a UTF-16 surrogate pair, producing a byteOffset that can point into the middle of the UTF-8 sequence and reintroduce the original bug for emoji/surrogate scenarios. Consider validating offset (0..haystack.Length) and rejecting offsets that land on a low-surrogate preceded by a high-surrogate (i.e., inside a surrogate pair) before computing byteOffset, and compute byteOffset before encoding the whole haystack to avoid unnecessary work on invalid input.

Copilot · 2026-04-23T13:26:42Z

    public Captures Captures(string haystack, int offset)
    {
        var hayBytes = Encoding.UTF8.GetBytes(haystack);
-        return Captures(hayBytes, offset);
+        var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset));
+        return Captures(hayBytes, byteOffset);
    }


Captures(string, int offset) has the same surrogate-pair and bounds concerns as Find(string, int): GetByteCount(haystack.AsSpan(0, offset)) can compute a byte offset that does not correspond to a valid UTF-8 boundary when offset splits a UTF-16 surrogate pair, and it will throw for invalid offsets after allocating hayBytes. Add explicit offset validation (including surrogate-pair boundary) and compute byteOffset before encoding the full haystack.

Copilot · 2026-04-23T13:26:43Z

+        // Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8) o(9) r(10) l(11) d(12)
+        // Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11) o(12) r(13) l(14) d(15)
+


The newly added index explanation treats the 🌍 emoji as a single "character" at string index 6, but in .NET a string index is a UTF-16 code-unit offset (🌍 is a surrogate pair, length 2). Since the API now documents a "character offset", this comment is likely to mislead readers about which offsets are valid; consider updating the comment (and any stated positions) to reflect UTF-16 indices or explicitly say "Unicode scalar value" if that’s the intended meaning.

Copilot · 2026-04-23T13:26:43Z

+        Assert.Equal("🌍", match.ExtractedText);
+    }
+
+    [Fact]


Given the new offset conversion logic for string overloads, it would be good to add a regression test covering offsets that split a UTF-16 surrogate pair (e.g., offset pointing at the low surrogate of an emoji). Today that case can compute a byte offset that lands mid–UTF-8 sequence unless the implementation explicitly rejects it; a test here would lock in the intended behavior (throw vs. treat as invalid vs. adjust to nearest boundary).

Suggested change

[Fact]

[Fact]

public void StringOverloadsRejectOffsetInsideSurrogatePair()

{

Regex re = new("world");

const string haystack = "Hello 🌍 world";

// In UTF-16, the emoji at visual character position 6 is a surrogate pair:

// high surrogate at index 6, low surrogate at index 7.

// Passing offset 7 would split the surrogate pair and could otherwise

// map to a byte offset in the middle of the UTF-8 encoding.

const int lowSurrogateOffset = 7;

Assert.ThrowsAny<ArgumentException>(() => re.Find(haystack, lowSurrogateOffset));

Assert.ThrowsAny<ArgumentException>(() => re.Captures(haystack, lowSurrogateOffset));

}

[Fact]

Copilot · 2026-04-23T13:26:43Z

+    public void FindWithCharacterOffsetBeforeMultiByteCharacter()
+    {
+        Regex re = new("🌍");
+        const string haystack = "Hello 🌍 world";
+
+        // Find starting at character offset 6 (the emoji itself)
+        var match = re.Find(haystack, 6);
+
+        Assert.True(match.Matched);
+        Assert.Equal("🌍", match.ExtractedText);


FindWithCharacterOffsetBeforeMultiByteCharacter uses offset = 6, which happens to be both the UTF-16 index of the emoji’s first surrogate and also the UTF-8 byte index where the emoji begins in this particular string. That means this test likely would have passed even before the fix (when offset was treated as a byte index), so it doesn’t strongly validate the new character-offset behavior. Consider adjusting it to use an offset where byte vs. character interpretation diverges (e.g., the offset immediately after the emoji, or explicitly testing the surrogate-boundary case if you choose to define behavior there).

Suggested change

public void FindWithCharacterOffsetBeforeMultiByteCharacter()

{

Regex re = new("🌍");

const string haystack = "Hello 🌍 world";

// Find starting at character offset 6 (the emoji itself)

var match = re.Find(haystack, 6);

Assert.True(match.Matched);

Assert.Equal("🌍", match.ExtractedText);

public void FindWithCharacterOffsetAfterMultiByteCharacterDoesNotMatchPreviousCharacter()

{

Regex re = new("🌍");

const string haystack = "Hello 🌍 world";

// Start searching immediately after the emoji. If the offset is interpreted as a

// character offset, the previous emoji must not be matched.

var match = re.Find(haystack, 7);

Assert.False(match.Matched);

Initial plan

dd1884c

Claude AI assigned Claude and corpo-iwillspeak Apr 1, 2026

Claude started work on behalf of corpo-iwillspeak April 1, 2026 10:01 View session

Claude AI linked an issue Apr 1, 2026 that may be closed by this pull request

Its not clear that offsets in string haystack refer to byte offsets #13

Open

Claude AI changed the title ~~[WIP] Fix offset interpretation in string haystack~~ Fix string offset parameters to use character positions instead of byte offsets Apr 1, 2026

Claude finished work on behalf of corpo-iwillspeak April 1, 2026 10:08

Claude AI requested a review from corpo-iwillspeak April 1, 2026 10:08

corpo-iwillspeak approved these changes Apr 23, 2026

View reviewed changes

corpo-iwillspeak marked this pull request as ready for review April 23, 2026 13:21

corpo-iwillspeak requested review from abbottdev-kroll and Copilot April 23, 2026 13:21

Copilot started reviewing on behalf of corpo-iwillspeak April 23, 2026 13:21 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix string offset parameters to use character positions instead of byte offsets#31

Fix string offset parameters to use character positions instead of byte offsets#31
Claude wants to merge 2 commits intomainfrom
claude/fix-byte-offset-issue

Claude AI commented Apr 1, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		// Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8) o(9) r(10) l(11) d(12)
		// Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11) o(12) r(13) l(14) d(15)

-    [Fact]
+    [Fact]
+    public void StringOverloadsRejectOffsetInsideSurrogatePair()
+    {
+        Regex re = new("world");
+        const string haystack = "Hello 🌍 world";
+        // In UTF-16, the emoji at visual character position 6 is a surrogate pair:
+        // high surrogate at index 6, low surrogate at index 7.
+        // Passing offset 7 would split the surrogate pair and could otherwise
+        // map to a byte offset in the middle of the UTF-8 encoding.
+        const int lowSurrogateOffset = 7;
+        Assert.ThrowsAny<ArgumentException>(() => re.Find(haystack, lowSurrogateOffset));
+        Assert.ThrowsAny<ArgumentException>(() => re.Captures(haystack, lowSurrogateOffset));
+    }
+    [Fact]

Conversation

Claude AI commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Example

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Claude AI commented Apr 1, 2026 •

edited

Loading