-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
StringScanner#scan(Int), #skip(Int), #current_char, etc, and significantly speed up operations on multibyte characters #16455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Related: #11259 |
|
Happy to add those in this PR as well tbh, I don't think it'd be that difficult |
|
Actually, ideally we'd be able to get an encoded character from the byte offset without scanning the whole string for multibyte chars, which |
|
...in fact, now that I look at it, it might be better for StringScanner to be backed by Char::Reader instead of String |
f45db28 to
0563636
Compare
642af07 to
029620d
Compare
3b27ef0 to
34ffe24
Compare
34ffe24 to
b59e4af
Compare
|
Forgot to mention here, but this is ready for a review. |
straight-shoota
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are lacking and the class API docs need to be updated for the new features.
I'm also not entirely sure all of these additions make sense.
And we might want to split this big PR into separate ones for the individual changes.
But it should be okay to discuss everything here for now.
| # the scan offset. | ||
| def peek(len) : String | ||
| @str[offset, len] | ||
| return "" if len.zero? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: I'm wondering if this optimization for empty string is even relevant? lookahead_byte_length and String#byte_slice already return early for zero length.
On the other hand, it doesn't hurt to have this explicit here 🤷
Ditto below.
| def peek_behind(len) : String | ||
| return "" if len.zero? | ||
| byte_len = lookbehind_byte_length(len) | ||
| byte_len = @byte_offset if byte_len > @byte_offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
polish: Simplify with clamp
| byte_len = @byte_offset if byte_len > @byte_offset | |
| byte_len = byte_len.clamp(..@byte_offset) |
|
|
||
| # Returns the byte before the scan head. | ||
| def previous_byte? : UInt8? | ||
| @str.byte_at?(@byte_offset - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: This method returns the last byte of the string when @byte_offset == 0 because negative indices count from the back in String#byte_at. It should return nil instead.
This is an indicator that these new methods need more testing, especially for edge cases.
Ditto below.
| @[AlwaysInline] | ||
| private def char_reader : Char::Reader | ||
| Char::Reader.new(@str, @byte_offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: @[AlwaysInline] seems unnecessary here. It's a private method with a trivial body. The optimizer will certainly inline it anyway, even without the annotation. (And if not, might have a good reason to do so).
I don't think it's a good idea to sprinkle @[AlwaysInline] everywhere. In general, the optimizer should be expected to make the right choices.
Explicit annotations should be limited to cases where inlining is required for semantic correctness, or when the optimizer might not recognize the proven potential of inlining. I don't think either is the case here.
| @[AlwaysInline] | |
| private def char_reader : Char::Reader | |
| Char::Reader.new(@str, @byte_offset) | |
| private def char_reader : Char::Reader | |
| Char::Reader.new(@str, @byte_offset) |
| s.inspect.should eq(%(#<StringScanner 0/16 |"this ">)) | ||
| s.scan(/\w+\s/) | ||
| s.inspect.should eq(%(#<StringScanner 5/16 "s is " >)) | ||
| s.inspect.should eq(%(#<StringScanner 5/16 "s "|"is ">)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: Adding an indicator for the cursor position is great. But I'm not sure I like the additional quotes.
I suppose the intention is to avoid ambiguity if the string window contains the vertical bar character?
It also creates additional noise...
Maybe we could use a different character for the cursor. One that would be less likely to show up in a string?
For example, the broken bar ¦, or the caret ‸.
| end | ||
|
|
||
| # Reads a single character from the string, or returns nil if at the end. | ||
| def read_char : Char? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: This method sticks out as a bit odd. It works very much like a scan method. But it has different semantics as it does not set last_match. Maybe it should?
Or is it even necessary to have this method? It should be equivalent to scanner.current_char; scanner.skip(1).
Thinking about this, maybe an alternative or enhancement could be a scan(Int) method which consumes a fixed amount of characters. Then read_char would be equivalent to scan(1)[0].
Supersedes #11785, addresses most of #11259
This includes the following non-breaking changes to StringScanner:
#peekto trust that@byte_offsetis a valid char index, and useChar::Readerto read a specified number of characters ahead. The previous implementation would cause at least two full scans from the beginning of the string if it was not single byte optimizable, which could result in O(n^2) parses if#peekis used during a hot loop.#scan(Int)and#skip(Int)to advance a specified number of characters, again using Char::Reader for non-optimizable strings. It is also a much more performant alternative toscanner.offset += n, which can cause two full string scans, both to retrieve the offset, and find the target offset.#read_char : Char?to scan forward by one (possibly multibyte) character and return it, or nil if at the end of the string. This is equivalent to Ruby's#getch, but the method name was chosen to be closer to IO.#current_charand#previous_char, as well as UInt8-typed#current_byteand#previous_byte, which do not modify any state.#rewind(Int)and#peek_behind(Int)to rewind the head or scan backwards by a specified number of chars, usingChar::Reader#previous_charto find the byte offset in the presence of multibyte chars.#beginning_of_line?(checks backwards one char for a newline, same as the Ruby impelementation) and#unscan, which rewinds the bytesize of the last match.#matched? : Boolto check the presence of@last_match, similar to Ruby's implementation.#inspectto show where the scan head is.