Add cut operator (^) to grammar
#2104
This is the beginning of a new mdbook book that will house all of the guidelines for contributors. This is published via GitHub Pages.
This is just some light editing. I expect that this chapter will have larger edits in the future, but I want to defer that till later.
This is just a stub, with the expectation that it will be expanded/rewritten later.
This has been superseded by the contributor guide.
This introduction should probably receive deeper edits to make it fit into the guide better.
8b74468 to 24690d2
The cut operator (`^`) is a backtracking fence. Once the expression to its left succeeds, we become committed to the alternative; the remainder of the expression must parse successfully or parsing will fail. See *Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space*, Mizushima et al., <https://kmizu.github.io/papers/paste513-mizushima.pdf>.

This operator solves a problem for us with C string literals. These literals cannot contain a null escape. But if we simply fail to lex the literal (e.g. `c"\0"`), we may instead lex it successfully as two separate tokens (`c "\0"`), and that would be incorrect.

As long as we only use cut to express constraints that can be expressed in a regular language and we keep our alternations disjoint, the grammar can still be mechanically converted to a CFG.

Let's add the cut operator to our grammar and use it for C string literals and some similar constructs.

In the railroad diagrams, we'll render the cut as a "no backtracking" box around the expression or sequence of expressions after the cut. The idea is that once you enter the box, the only way out is forward.
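To make the `c"\0"` problem concrete, here is a minimal sketch of a toy lexer with and without a cut after the `c"` prefix. Everything in it is hypothetical and heavily simplified: the names `Attempt`, `scan_body`, `c_string_literal`, `string_literal`, `identifier`, and `lex` are invented for illustration, only the `\0` escape is checked, and none of this is the Reference's actual grammar or rustc's lexer.

```rust
/// Outcome of trying one token rule at the start of the input.
#[derive(Debug, PartialEq)]
enum Attempt {
    /// The rule matched and consumed this many bytes.
    Matched(usize),
    /// The rule failed before any cut; other alternatives may be tried.
    NoMatch,
    /// The rule failed after passing a cut; no backtracking is allowed.
    Committed,
}

/// Scans a simplified string body (the text after the opening quote).
/// Returns the number of bytes up to and including the closing quote.
fn scan_body(body: &str, allow_nul_escape: bool) -> Option<usize> {
    let bytes = body.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'"' => return Some(i + 1),
            b'\\' if !allow_nul_escape && bytes.get(i + 1) == Some(&b'0') => return None,
            b'\\' => i += 2, // skip the escaped byte (toy handling of escapes)
            _ => i += 1,
        }
    }
    None // unterminated
}

/// Toy rule: `c"` ^ body `"`, where the body rejects `\0`.
fn c_string_literal(input: &str, use_cut: bool) -> Attempt {
    let Some(body) = input.strip_prefix("c\"") else {
        return Attempt::NoMatch;
    };
    // The cut sits here: having seen `c"`, a later failure is final.
    match scan_body(body, false) {
        Some(n) => Attempt::Matched(2 + n),
        None if use_cut => Attempt::Committed,
        None => Attempt::NoMatch,
    }
}

/// Toy rule for an ordinary `"..."` literal, which may contain `\0`.
fn string_literal(input: &str) -> Attempt {
    let Some(body) = input.strip_prefix('"') else {
        return Attempt::NoMatch;
    };
    scan_body(body, true).map_or(Attempt::NoMatch, |n| Attempt::Matched(1 + n))
}

/// Toy rule for an ASCII identifier.
fn identifier(input: &str) -> Attempt {
    let n = input
        .bytes()
        .take_while(|b| b.is_ascii_alphabetic() || *b == b'_')
        .count();
    if n > 0 { Attempt::Matched(n) } else { Attempt::NoMatch }
}

/// Splits the input into tokens, trying the rules in a fixed order.
fn lex(mut input: &str, use_cut: bool) -> Result<Vec<&str>, String> {
    let mut tokens = Vec::new();
    while !input.is_empty() {
        let attempt = match c_string_literal(input, use_cut) {
            Attempt::NoMatch => match string_literal(input) {
                Attempt::NoMatch => identifier(input),
                other => other,
            },
            other => other, // Matched or Committed: later rules are not tried
        };
        match attempt {
            Attempt::Matched(n) => {
                let (token, rest) = input.split_at(n);
                tokens.push(token);
                input = rest;
            }
            Attempt::Committed => return Err(format!("committed rule failed at {input:?}")),
            Attempt::NoMatch => return Err(format!("no rule matches at {input:?}")),
        }
    }
    Ok(tokens)
}
```

In this sketch, `lex("c\"\\0\"", false)` falls back to the identifier rule and produces the two tokens `c` and `"\0"`, which is the incorrect outcome described above, while `lex("c\"\\0\"", true)` returns an error because the failure happens after the cut.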
24690d2 to fc646a1
When you're defining a cut operator, it's important to specify the scope over […]. For the lexer's purposes it would be fine to make that scope unlimited, saying that if the right-hand side of a production containing a cut fails after reaching the cut then the entire lexing process fails.

But in this PR the nearest thing to a definition of the cut operator is the reference to <https://kmizu.github.io/papers/paste513-mizushima.pdf>. That paper defines a cut operator with a narrower scope: it allows a cut only on the left-hand side of an ordered choice expression, and cancels only the re-attempt of the right-hand side of that expression. That definition doesn't work for the positions in which this PR is placing cuts.
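The difference between the two scopes can be sketched with a toy PEG evaluator. This is only an illustration under assumptions: `Expr`, `CutScope`, `Fail`, `run`, and `example_grammar` are hypothetical names, and the `NearestChoice` mode is my reading of the narrower scope as summarised in this comment, not a faithful implementation of the paper.

```rust
/// How far a failure after a cut propagates (both modes are illustrative).
#[derive(Clone, Copy)]
enum CutScope {
    /// The cut only suppresses the right-hand side of the nearest ordered choice.
    NearestChoice,
    /// A failure after a cut aborts the whole parse.
    Unlimited,
}

enum Expr {
    Lit(&'static str),
    Cut,
    Seq(Vec<Expr>),
    Choice(Box<Expr>, Box<Expr>),
}

#[derive(Debug, PartialEq)]
enum Fail {
    /// Ordinary failure; an enclosing choice may try its next alternative.
    Soft,
    /// Failure after a cut.
    Hard,
}

/// Tries to match `e` at byte offset `pos`, returning the new offset.
fn run(e: &Expr, input: &str, pos: usize, scope: CutScope) -> Result<usize, Fail> {
    match e {
        Expr::Lit(s) => {
            if input[pos..].starts_with(*s) {
                Ok(pos + s.len())
            } else {
                Err(Fail::Soft)
            }
        }
        // A cut consumes nothing; `Seq` is what gives it meaning.
        Expr::Cut => Ok(pos),
        Expr::Seq(items) => {
            let mut at = pos;
            let mut committed = false;
            for item in items {
                if matches!(item, Expr::Cut) {
                    committed = true;
                    continue;
                }
                match run(item, input, at, scope) {
                    Ok(next) => at = next,
                    // Any failure after the cut becomes a hard failure.
                    Err(_) if committed => return Err(Fail::Hard),
                    Err(f) => return Err(f),
                }
            }
            Ok(at)
        }
        Expr::Choice(left, right) => match run(left, input, pos, scope) {
            Ok(next) => Ok(next),
            Err(Fail::Soft) => run(right, input, pos, scope),
            Err(Fail::Hard) => match scope {
                // Skip `right`, but let outer choices carry on as usual.
                CutScope::NearestChoice => Err(Fail::Soft),
                // Propagate the hard failure all the way up.
                CutScope::Unlimited => Err(Fail::Hard),
            },
        },
    }
}

/// Hypothetical grammar: ((`a` ^ `b`) / `ac`) / `a`
fn example_grammar() -> Expr {
    use Expr::*;
    let committed = Seq(vec![Lit("a"), Cut, Lit("b")]);
    let inner = Choice(Box::new(committed), Box::new(Lit("ac")));
    Choice(Box::new(inner), Box::new(Lit("a")))
}
```

On the input `ac`, the narrow scope skips the `ac` alternative but still lets the outer choice match `a`, whereas the unlimited scope fails outright. Which behaviour is wanted is exactly the question the definition needs to answer.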
If you're still planning to use prioritised choice more widely in the Reference, then (given that the Reference already has the notion of reserved forms) perhaps the simplest way to define cut is to say that:
[…]
is a shorthand for
[…]
with […]
That characterisation also illustrates why adding the notion of a cut doesn't buy very much. In particular, if you use prioritised choice for the token rules then (for the lexing dialect used in Rust 2021 and later) you can simplify the existing reserved token rules, getting rid of the "except b or c or r or br or cr" business, and end up with something like this:
[…]
This way […]
(H/t to @ehuss for suggesting the cut operator to solve this problem.)
cc @ehuss
This is stacked on #2097 and should merge after it.