Fix string offset parameters to use character positions instead of byte offsets#31
Fix string offset parameters to use character positions instead of byte offsets#31
Conversation
- Modified Find(string, int) to convert character offset to byte offset using Encoding.UTF8.GetByteCount - Modified Captures(string, int) to convert character offset to byte offset - Updated XML documentation to clarify that string methods use character offsets, while byte methods use byte offsets - Added comprehensive tests for multi-byte UTF-8 characters (emojis and Asian characters) - All 71 tests passing (6 new tests added) Agent-Logs-Url: https://github.com/crispthinking/IronRe2/sessions/2d26941d-972d-4c97-a3ec-ad34e99c1ff3 Co-authored-by: corpo-iwillspeak <265613520+corpo-iwillspeak@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes the offset parameter semantics for the string-based Regex.Find(string, int) and Regex.Captures(string, int) APIs so callers can pass a string index (character position) rather than a UTF-8 byte offset, improving correctness with multi-byte UTF-8 characters.
Changes:
- Convert string offsets to UTF-8 byte offsets before calling the byte-based implementations.
- Clarify XML docs to distinguish character offsets (string overloads) vs byte offsets (byte overloads).
- Add multi-byte UTF-8 test coverage (emoji/CJK) for the updated behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
src/IronRe2/Regex.cs |
Converts string offsets to byte offsets and updates XML docs to specify offset units. |
test/IronRe2.Tests/RegexTests.cs |
Adds new tests covering multi-byte UTF-8 scenarios for Find/Captures behavior. |
Comments suppressed due to low confidence (2)
src/IronRe2/Regex.cs:216
- The byte-based overload
Find(ReadOnlyMemory<byte> hayBytes, int offset)forwardsoffsetdirectly into the nativecre2_matchcall (viaRawMatch) with no range/negative validation. A negative offset or an offset greater thanhayBytes.Lengthcan lead to undefined behavior in native code. Consider validatingoffset(0..hayBytes.Length) and throwing anArgumentOutOfRangeException(or returningMatch.Empty, if that’s the established contract) before callingRawMatch.
public Match Find(ReadOnlyMemory<byte> hayBytes, int offset)
{
var ranges = RawMatch(hayBytes.Span, offset, 1);
return ranges.Length != 1 ? Match.Empty : new Match(hayBytes, ranges[0]);
}
src/IronRe2/Regex.cs:328
- Similarly,
Captures(ReadOnlyMemory<byte> haystack, int offset)forwardsoffsetinto native matching without validating that it’s within[0, haystack.Length]. Negative or out-of-range offsets can cause undefined behavior incre2_match. Consider adding an explicit range check and throwingArgumentOutOfRangeException(or returningCaptures.Emptyif that matches the intended API contract).
public Captures Captures(ReadOnlyMemory<byte> haystack, int offset)
{
var ranges = RawMatch(haystack.Span, offset, CaptureGroupCount + 1);
return ranges.Length == 0 ? IronRe2.Captures.Empty : new Captures(haystack, ranges);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public Match Find(string haystack, int offset) | ||
| { | ||
| var hayBytes = Encoding.UTF8.GetBytes(haystack); | ||
| return Find(hayBytes, offset); | ||
| var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset)); | ||
| return Find(hayBytes, byteOffset); | ||
| } |
There was a problem hiding this comment.
Find(string, int offset) currently computes byteOffset via haystack.AsSpan(0, offset) without validating offset. This will throw for negative/out-of-range offsets after already allocating hayBytes, and (more importantly) allows offsets that split a UTF-16 surrogate pair, producing a byteOffset that can point into the middle of the UTF-8 sequence and reintroduce the original bug for emoji/surrogate scenarios. Consider validating offset (0..haystack.Length) and rejecting offsets that land on a low-surrogate preceded by a high-surrogate (i.e., inside a surrogate pair) before computing byteOffset, and compute byteOffset before encoding the whole haystack to avoid unnecessary work on invalid input.
| public Captures Captures(string haystack, int offset) | ||
| { | ||
| var hayBytes = Encoding.UTF8.GetBytes(haystack); | ||
| return Captures(hayBytes, offset); | ||
| var byteOffset = Encoding.UTF8.GetByteCount(haystack.AsSpan(0, offset)); | ||
| return Captures(hayBytes, byteOffset); | ||
| } |
There was a problem hiding this comment.
Captures(string, int offset) has the same surrogate-pair and bounds concerns as Find(string, int): GetByteCount(haystack.AsSpan(0, offset)) can compute a byte offset that does not correspond to a valid UTF-8 boundary when offset splits a UTF-16 surrogate pair, and it will throw for invalid offsets after allocating hayBytes. Add explicit offset validation (including surrogate-pair boundary) and compute byteOffset before encoding the full haystack.
| // Character positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6) (7) w(8) o(9) r(10) l(11) d(12) | ||
| // Byte positions: H(0) e(1) l(2) l(3) o(4) (5) 🌍(6-9) (10) w(11) o(12) r(13) l(14) d(15) | ||
|
|
There was a problem hiding this comment.
The newly added index explanation treats the 🌍 emoji as a single "character" at string index 6, but in .NET a string index is a UTF-16 code-unit offset (🌍 is a surrogate pair, length 2). Since the API now documents a "character offset", this comment is likely to mislead readers about which offsets are valid; consider updating the comment (and any stated positions) to reflect UTF-16 indices or explicitly say "Unicode scalar value" if that’s the intended meaning.
| Assert.Equal("🌍", match.ExtractedText); | ||
| } | ||
|
|
||
| [Fact] |
There was a problem hiding this comment.
Given the new offset conversion logic for string overloads, it would be good to add a regression test covering offsets that split a UTF-16 surrogate pair (e.g., offset pointing at the low surrogate of an emoji). Today that case can compute a byte offset that lands mid–UTF-8 sequence unless the implementation explicitly rejects it; a test here would lock in the intended behavior (throw vs. treat as invalid vs. adjust to nearest boundary).
| [Fact] | |
| [Fact] | |
| public void StringOverloadsRejectOffsetInsideSurrogatePair() | |
| { | |
| Regex re = new("world"); | |
| const string haystack = "Hello 🌍 world"; | |
| // In UTF-16, the emoji at visual character position 6 is a surrogate pair: | |
| // high surrogate at index 6, low surrogate at index 7. | |
| // Passing offset 7 would split the surrogate pair and could otherwise | |
| // map to a byte offset in the middle of the UTF-8 encoding. | |
| const int lowSurrogateOffset = 7; | |
| Assert.ThrowsAny<ArgumentException>(() => re.Find(haystack, lowSurrogateOffset)); | |
| Assert.ThrowsAny<ArgumentException>(() => re.Captures(haystack, lowSurrogateOffset)); | |
| } | |
| [Fact] |
| public void FindWithCharacterOffsetBeforeMultiByteCharacter() | ||
| { | ||
| Regex re = new("🌍"); | ||
| const string haystack = "Hello 🌍 world"; | ||
|
|
||
| // Find starting at character offset 6 (the emoji itself) | ||
| var match = re.Find(haystack, 6); | ||
|
|
||
| Assert.True(match.Matched); | ||
| Assert.Equal("🌍", match.ExtractedText); |
There was a problem hiding this comment.
FindWithCharacterOffsetBeforeMultiByteCharacter uses offset = 6, which happens to be both the UTF-16 index of the emoji’s first surrogate and also the UTF-8 byte index where the emoji begins in this particular string. That means this test likely would have passed even before the fix (when offset was treated as a byte index), so it doesn’t strongly validate the new character-offset behavior. Consider adjusting it to use an offset where byte vs. character interpretation diverges (e.g., the offset immediately after the emoji, or explicitly testing the surrogate-boundary case if you choose to define behavior there).
| public void FindWithCharacterOffsetBeforeMultiByteCharacter() | |
| { | |
| Regex re = new("🌍"); | |
| const string haystack = "Hello 🌍 world"; | |
| // Find starting at character offset 6 (the emoji itself) | |
| var match = re.Find(haystack, 6); | |
| Assert.True(match.Matched); | |
| Assert.Equal("🌍", match.ExtractedText); | |
| public void FindWithCharacterOffsetAfterMultiByteCharacterDoesNotMatchPreviousCharacter() | |
| { | |
| Regex re = new("🌍"); | |
| const string haystack = "Hello 🌍 world"; | |
| // Start searching immediately after the emoji. If the offset is interpreted as a | |
| // character offset, the previous emoji must not be matched. | |
| var match = re.Find(haystack, 7); | |
| Assert.False(match.Matched); |
The
Find(string, int)andCaptures(string, int)methods were treating the offset parameter as a byte position in the UTF-8 encoded string, causing incorrect behavior with multi-byte characters (emojis, CJK characters, etc.).Changes
Find(string haystack, int offset)andCaptures(string haystack, int offset): Convert character offset to byte offset usingEncoding.UTF8.GetByteCount(haystack.AsSpan(0, offset))before passing to the underlying byte-based methodsExample