Skip to content

Conversation

@jneen
Copy link
Contributor

@jneen jneen commented Nov 29, 2025

Supersedes #11785, addresses most of #11259

This includes the following non-breaking changes to StringScanner:

  • Refactor #peek to trust that @byte_offset is a valid char index, and use Char::Reader to read a specified number of characters ahead. The previous implementation would cause at least two full scans from the beginning of the string if it was not single byte optimizable, which could result in O(n^2) parses if #peek is used during a hot loop.
  • Add overloads #scan(Int) and #skip(Int) to advance a specified number of characters, again using Char::Reader for non-optimizable strings. It is also a much more performant alternative to scanner.offset += n, which can cause two full string scans, both to retrieve the offset, and find the target offset.
  • Adds #read_char : Char? to scan forward by one (possibly multibyte) character and return it, or nil if at the end of the string. This is equivalent to Ruby's #getch, but the method name was chosen to be closer to IO.
  • Adds #current_char and #previous_char, as well as UInt8-typed #current_byte and #previous_byte, which do not modify any state.
  • Adds #rewind(Int) and #peek_behind(Int) to rewind the head or scan backwards by a specified number of chars, using Char::Reader#previous_char to find the byte offset in the presence of multibyte chars.
  • Adds #beginning_of_line? (checks backwards one char for a newline, same as the Ruby impelementation) and #unscan, which rewinds the bytesize of the last match.
  • Adds #matched? : Bool to check the presence of @last_match, similar to Ruby's implementation.
  • Adjusts #inspect to show where the scan head is.

@HertzDevil
Copy link
Contributor

Related: #11259

@jneen
Copy link
Contributor Author

jneen commented Nov 30, 2025

Happy to add those in this PR as well tbh, I don't think it'd be that difficult

@jneen
Copy link
Contributor Author

jneen commented Nov 30, 2025

Actually, ideally we'd be able to get an encoded character from the byte offset without scanning the whole string for multibyte chars, which String#[] (used by #peek) is wont to do. May require some extra functionality on String to be performant, I'll look into it.

@jneen
Copy link
Contributor Author

jneen commented Nov 30, 2025

...in fact, now that I look at it, it might be better for StringScanner to be backed by Char::Reader instead of String

@jneen jneen changed the title add StringScanner#current_byte and #peek_bytes StringScanner#scan(Int), #skip(Int), current_char significantly speed up operations on multibyte characters Nov 30, 2025
@jneen jneen force-pushed the feature.string-scannner-bytes branch 2 times, most recently from f45db28 to 0563636 Compare November 30, 2025 22:17
@jneen jneen changed the title StringScanner#scan(Int), #skip(Int), current_char significantly speed up operations on multibyte characters StringScanner#scan(Int), #skip(Int), current_char, and significantly speed up operations on multibyte characters Nov 30, 2025
@jneen jneen changed the title StringScanner#scan(Int), #skip(Int), current_char, and significantly speed up operations on multibyte characters StringScanner#scan(Int), #skip(Int), #current_char, etc, and significantly speed up operations on multibyte characters Dec 1, 2025
@jneen jneen force-pushed the feature.string-scannner-bytes branch from 642af07 to 029620d Compare December 1, 2025 14:48
@jneen jneen force-pushed the feature.string-scannner-bytes branch 4 times, most recently from 3b27ef0 to 34ffe24 Compare December 2, 2025 14:34
@jneen jneen force-pushed the feature.string-scannner-bytes branch from 34ffe24 to b59e4af Compare December 2, 2025 16:40
@jneen
Copy link
Contributor Author

jneen commented Dec 5, 2025

Forgot to mention here, but this is ready for a review.

@jneen jneen requested a review from Sija December 7, 2025 18:51
Copy link
Member

@straight-shoota straight-shoota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are lacking and the class API docs need to be updated for the new features.
I'm also not entirely sure all of these additions make sense.
And we might want to split this big PR into separate ones for the individual changes.
But it should be okay to discuss everything here for now.

# the scan offset.
def peek(len) : String
@str[offset, len]
return "" if len.zero?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: I'm wondering if this optimization for empty string is even relevant? lookahead_byte_length and String#byte_slice already return early for zero length.
On the other hand, it doesn't hurt to have this explicit here 🤷
Ditto below.

def peek_behind(len) : String
return "" if len.zero?
byte_len = lookbehind_byte_length(len)
byte_len = @byte_offset if byte_len > @byte_offset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

polish: Simplify with clamp

Suggested change
byte_len = @byte_offset if byte_len > @byte_offset
byte_len = byte_len.clamp(..@byte_offset)


# Returns the byte before the scan head.
def previous_byte? : UInt8?
@str.byte_at?(@byte_offset - 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This method returns the last byte of the string when @byte_offset == 0 because negative indices count from the back in String#byte_at. It should return nil instead.
This is an indicator that these new methods need more testing, especially for edge cases.

Ditto below.

Comment on lines +464 to +466
@[AlwaysInline]
private def char_reader : Char::Reader
Char::Reader.new(@str, @byte_offset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: @[AlwaysInline] seems unnecessary here. It's a private method with a trivial body. The optimizer will certainly inline it anyway, even without the annotation. (And if not, might have a good reason to do so).

I don't think it's a good idea to sprinkle @[AlwaysInline] everywhere. In general, the optimizer should be expected to make the right choices.
Explicit annotations should be limited to cases where inlining is required for semantic correctness, or when the optimizer might not recognize the proven potential of inlining. I don't think either is the case here.

Suggested change
@[AlwaysInline]
private def char_reader : Char::Reader
Char::Reader.new(@str, @byte_offset)
private def char_reader : Char::Reader
Char::Reader.new(@str, @byte_offset)

s.inspect.should eq(%(#<StringScanner 0/16 |"this ">))
s.scan(/\w+\s/)
s.inspect.should eq(%(#<StringScanner 5/16 "s is " >))
s.inspect.should eq(%(#<StringScanner 5/16 "s "|"is ">))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: Adding an indicator for the cursor position is great. But I'm not sure I like the additional quotes.
I suppose the intention is to avoid ambiguity if the string window contains the vertical bar character?
It also creates additional noise...

Maybe we could use a different character for the cursor. One that would be less likely to show up in a string?
For example, the broken bar ¦, or the caret .

end

# Reads a single character from the string, or returns nil if at the end.
def read_char : Char?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: This method sticks out as a bit odd. It works very much like a scan method. But it has different semantics as it does not set last_match. Maybe it should?
Or is it even necessary to have this method? It should be equivalent to scanner.current_char; scanner.skip(1).

Thinking about this, maybe an alternative or enhancement could be a scan(Int) method which consumes a fixed amount of characters. Then read_char would be equivalent to scan(1)[0].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants