Skip to content

feat(gf-cypher): replace LALRPOP with recursive-descent + Pratt parser scaffold#654

Merged
DecisionNerd merged 5 commits into
mainfrom
feature/554-lalrpop-grammar
May 29, 2026
Merged

feat(gf-cypher): replace LALRPOP with recursive-descent + Pratt parser scaffold#654
DecisionNerd merged 5 commits into
mainfrom
feature/554-lalrpop-grammar

Conversation

@DecisionNerd
Copy link
Copy Markdown
Owner

@DecisionNerd DecisionNerd commented May 29, 2026

Summary

  • Removes LALRPOP entirely from gf-cypher after 50 unresolvable structural shift/reduce conflicts in LR(1) mode (see ADR 0003 for full history)
  • Adds hand-written Tok lexer with compound tokens (NOT IN, IS NULL, STARTS WITH, etc.)
  • Adds TokenStream scaffold (peek, peek_n, eat, eat_if, advance, span helpers) — 11 unit tests
  • Updates gf-ast with UnionClause, FunctionCall::star, UnterminatedBlockComment
  • Rewrites ADR 0003 to document the full decision arc (LALR→LR(1)→RD+Pratt)
  • Updates all docs: 7-verb API, algorithms catalog, parser diagram, Swift/Kotlin bindings, roadmap

parser::parse() currently returns Err(unimplemented) — the clause and expression parsers follow in child issues.

Child issues (implement in order after this merges)

Test plan

  • cargo build -p gf-cypher — clean, no conflict warnings, no grammar generation step
  • cargo test -p gf-cypher — 11 TokenStream unit tests passing
  • cargo test --workspace — 66 tests, 0 failures
  • CodeRabbit CLI — 2 findings, both fixed (misleading doc comment, misleading inline comment)

Closes #554 (tracking — child issues carry the implementation)

🤖 Generated with Claude Code

Note

Replace LALRPOP with hand-written recursive-descent + Pratt parser in gf-cypher

  • Removes lalrpop-util and logos dependencies from gf-cypher/Cargo.toml and replaces them with a hand-written Lexer and TokenStream.
  • The new lexer tokenizes all Cypher keywords (case-insensitive), literals, operators, relationship-arrow composites, and multi-word compound tokens (NOT IN, IS NULL, STARTS WITH, etc.) with precise span-based error reporting.
  • TokenStream eagerly lexes input into a Vec of spanned tokens and provides lookahead, consume, and error helpers to support the upcoming recursive-descent clause parser and Pratt expression parser.
  • Extends the AST with UnionClause, AstClause::Union, a star field on FunctionCall, and an UnterminatedBlockComment parse error kind.
  • Risk: parser::parse is a stub that always returns UnexpectedEof; no Cypher queries can be parsed yet.

Macroscope summarized 0533882.

Summary by CodeRabbit

  • New Features

    • Support for UNION and UNION ALL in Cypher queries
    • Function-call star syntax (e.g., COUNT(*)) supported
    • Integrated Cypher lexer + parser to enable richer query parsing
  • Bug Fixes

    • Better detection and reporting for unterminated block comments
  • Documentation

    • Added architecture docs and navigation entries for parser/algorithm design

Review Change Stack

…r scaffold (#554)

LALRPOP produced 50 unresolvable structural shift/reduce conflicts in LR(1)
mode after full openCypher operator coverage. State-merging contaminated
FOLLOW sets across unrelated grammar contexts — not fixable within LALRPOP.

Replace with hand-written recursive-descent (clauses) + Pratt (expressions),
the architecture used by Neo4j, PostgreSQL, and SQLite. Conflict-free by
construction; no build-time grammar generation.

Changes:
- Remove lalrpop, lalrpop-util, logos from gf-cypher Cargo.toml
- Delete grammar.lalrpop and build.rs (no-op remnant also removed)
- Add lexer.rs: hand-written Tok lexer with compound tokens (NOT IN,
  IS NULL, IS NOT NULL, STARTS WITH, ENDS WITH)
- Add parser/mod.rs: TokenStream scaffold (peek, peek_n, eat, eat_if,
  advance, span helpers) — 11 unit tests passing
- lib.rs: delegates directly to parser::parse()
- gf-ast: add UnionClause, FunctionCall::star field,
  UnterminatedBlockComment error variant
- ADR 0003: rewritten to document full history (LALR→LR(1)→RD+Pratt)
  and final decision rationale
- Docs: 7-verb API surface, algorithms catalog, RD+Pratt parser diagram,
  Swift/Kotlin UniFFI bindings, roadmap, api reference all updated
- CLAUDE.md: updated for current architecture

Implementation continues in child issues:
- #650 Pratt expression parser (expr.rs)
- #651 Pattern parser (patterns.rs)
- #652 Read-only clauses (clauses.rs)
- #653 Write clauses + final wire-up

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Walkthrough

Adds a hand-written Cypher lexer (Tok, Lexer) with span-aware errors, extends gf-ast for UNION and f(*) (star), records unterminated block-comment parse errors, provides TokenStream parser scaffolding (peek/eat/span helpers and tests), and wires parser::parse as the crate entrypoint; docs/Cargo updates included.

Changes

Lexer and Parser Foundation

Layer / File(s) Summary
AST Extensions for UNION and FunctionCall
crates/gf-ast/src/ast.rs, crates/gf-ast/src/lib.rs, crates/gf-ast/src/tests.rs
AstClause::Union(UnionClause) variant added with all: bool and span; FunctionCall extended with star: bool; AstClause::span() updated; tests and re-exports adjusted.
Parse Error Handling for Block Comments
crates/gf-ast/src/parse_error.rs
Added ParseErrorKind::UnterminatedBlockComment and Display case "unterminated block comment".
Hand-written Lexer Implementation
crates/gf-cypher/src/lexer.rs
Adds Tok enum and Lexer<'input> producing Spanned<Tok>; handles compound keywords, strings/backticks, numeric literals, $ parameters, punctuation/operators/arrows, and reports span-aware errors for unterminated/invalid constructs.
Parser Infrastructure and Wiring
crates/gf-cypher/Cargo.toml, crates/gf-cypher/src/lib.rs, crates/gf-cypher/src/parser/mod.rs, mkdocs.yml
Adds parser and lexer modules; parse() delegates to parser::parse(); TokenStream<'input> provides eager token collection, peek/eat/advance/span helpers, error helpers, and unit tests; Cargo.toml deps updated; docs nav entries added.

Sequence Diagram(s)

sequenceDiagram
  participant Source as Source Text
  participant Lexer as gf_cypher::lexer::Lexer
  participant TokenStream as gf_cypher::parser::TokenStream
  participant Parser as gf_cypher::parser::parse
  Source->>Lexer: provide characters
  Lexer->>Lexer: skip whitespace/comments, lex token
  Lexer->>TokenStream: emit Spanned<Tok> (start,tok,end)
  TokenStream->>Parser: peek/eat/advance tokens
  Parser->>TokenStream: request spans (span_from/current_span)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive PR description includes most required sections but lacks comprehensive testing evidence and incomplete checklist confirmation. Add explicit confirmation of all test types run (unit, integration, TCK), document specific test commands executed, and complete the pre-submission checklist items to verify code quality compliance.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: replacing LALRPOP with a recursive-descent + Pratt parser scaffold in gf-cypher.
Linked Issues check ✅ Passed The PR successfully delivers the TokenStream scaffold, lexer updates, AST extensions, and wire-up specified in #554, with child issues properly listed for follow-up work.
Out of Scope Changes check ✅ Passed All changes directly support the recursive-descent + Pratt parser transition: lexer, TokenStream, AST updates, and documentation reflect the stated objectives.
Docstring Coverage ✅ Passed Docstring coverage is 83.72% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/554-lalrpop-grammar

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.09%. Comparing base (6a70655) to head (0533882).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #654   +/-   ##
=======================================
  Coverage   84.09%   84.09%           
=======================================
  Files          49       49           
  Lines       12921    12921           
  Branches     3628     3628           
=======================================
  Hits        10866    10866           
  Misses       1270     1270           
  Partials      785      785           
Flag Coverage Δ
full-coverage 84.09% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
parser 92.93% <ø> (ø)
planner 79.90% <ø> (ø)
executor 75.46% <ø> (ø)
storage 98.68% <ø> (ø)
ast 97.51% <ø> (ø)
types 90.66% <ø> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6a70655...0533882. Read the comment docs.

Comment thread docs/architecture/ast-and-planning.md Outdated
Comment thread docs/reference/api.md Outdated
Comment thread docs/architecture/overview.md
Comment thread docs/adr/0002-rust-core.md
DecisionNerd and others added 2 commits May 29, 2026 08:23
- cargo fmt: lexer.rs enum variants one-per-line (CI fmt check)
- mkdocs.yml: add algorithms.md and ADR 0003 to nav (docs build)
- overview.md: fix broken ../adr/0004-uniffi-bindings.md link → 0002

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cargo fmt expands Ok(Self { tokens, pos: 0, input }) to multi-line form
when line length exceeds the configured threshold on CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread crates/gf-cypher/src/lexer.rs
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
crates/gf-ast/src/tests.rs (1)

237-251: ⚡ Quick win

Add explicit tests for f(*) and UNION AST paths.

Line 241 only exercises star: false. Please add one roundtrip test for FunctionCall { star: true, args: vec![] } and one for AstClause::Union span + JSON roundtrip.

Proposed test additions
 #[test]
 fn ast_function_call_roundtrip() {
@@
     assert_eq!(e, back);
 }
+
+#[test]
+fn ast_function_call_star_roundtrip() {
+    let e = Expr::FunctionCall(FunctionCall {
+        name: vec!["count".into()],
+        distinct: false,
+        star: true,
+        args: vec![],
+        span: Span::new(0, 8),
+    });
+    let json = serde_json::to_string(&e).unwrap();
+    let back: Expr = serde_json::from_str(&json).unwrap();
+    assert_eq!(e, back);
+}
+
+#[test]
+fn ast_union_clause_roundtrip_and_span() {
+    let clause = AstClause::Union(UnionClause {
+        all: true,
+        span: Span::new(5, 14),
+    });
+    assert_eq!(clause.span(), Span::new(5, 14));
+    let json = serde_json::to_string(&clause).unwrap();
+    let back: AstClause = serde_json::from_str(&json).unwrap();
+    assert_eq!(clause, back);
+}

Based on learnings: Write comprehensive unit tests for each layer (parser, planner, executor), integration tests for end-to-end behavior, and aim for 100% coverage on new code (90% minimum).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-ast/src/tests.rs` around lines 237 - 251, The existing
ast_function_call_roundtrip test only covers FunctionCall with star: false; add
two tests: (1) a new test (e.g., ast_function_call_star_roundtrip) that
constructs Expr::FunctionCall(FunctionCall { name: vec!["f".into()], distinct:
false, star: true, args: vec![], span: Span::new(...)}), serializes with
serde_json::to_string and deserializes with serde_json::from_str and asserts
equality; (2) a new test (e.g., ast_clause_union_roundtrip) that constructs an
AstClause::Union with an appropriate Span, serializes/deserializes via
serde_json and asserts equality to exercise the UNION AST path; reuse existing
patterns from ast_function_call_roundtrip (serde_json::to_string/from_str and
assert_eq!) and import/construct Span as done elsewhere in tests.rs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gf-cypher/src/lexer.rs`:
- Around line 217-231: The Unicode-escape branch handling Some('u') in lexer.rs
currently ignores malformed escapes; change it to reject invalid escapes by
returning a lexer error instead of silently skipping: ensure you collect exactly
four hex digits (fail if fewer than 4), validate u32::from_str_radix(&hex, 16)
succeeds, and validate char::from_u32(n) yields Some(ch); on any failure
propagate/return the lexical error using the lexer’s existing error path (the
surrounding scanning function’s error/Err return), rather than pushing a
replacement or continuing.

In `@crates/gf-cypher/src/parser/mod.rs`:
- Around line 10-13: The stub parser error uses a fixed Span::new(0, 0) which
misreports EOF location; update the ParseError::new call (the UnexpectedEof
construction in crates/gf-cypher/src/parser/mod.rs) to use the actual EOF
position by replacing Span::new(0, 0) with a span anchored at the end of the
input (e.g., Span::new(input.len(), input.len())), keeping the rest of the
ParseError::new(ParseErrorKind::UnexpectedEof { ... }, ..., "parser not yet
implemented") intact.

---

Nitpick comments:
In `@crates/gf-ast/src/tests.rs`:
- Around line 237-251: The existing ast_function_call_roundtrip test only covers
FunctionCall with star: false; add two tests: (1) a new test (e.g.,
ast_function_call_star_roundtrip) that constructs
Expr::FunctionCall(FunctionCall { name: vec!["f".into()], distinct: false, star:
true, args: vec![], span: Span::new(...)}), serializes with
serde_json::to_string and deserializes with serde_json::from_str and asserts
equality; (2) a new test (e.g., ast_clause_union_roundtrip) that constructs an
AstClause::Union with an appropriate Span, serializes/deserializes via
serde_json and asserts equality to exercise the UNION AST path; reuse existing
patterns from ast_function_call_roundtrip (serde_json::to_string/from_str and
assert_eq!) and import/construct Span as done elsewhere in tests.rs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3cb7c02c-7616-4b3e-998b-53884bf3ca9d

📥 Commits

Reviewing files that changed from the base of the PR and between 6a70655 and 1a8b72c.

⛔ Files ignored due to path filters (15)
  • CLAUDE.md is excluded by !**/*.md
  • Cargo.lock is excluded by !**/*.lock, !**/*.lock
  • docs/adr/0002-rust-core.md is excluded by !**/*.md, !**/docs/**
  • docs/adr/0003-lr1-grammar.md is excluded by !**/*.md, !**/docs/**
  • docs/architecture/algorithms.md is excluded by !**/*.md, !**/docs/**
  • docs/architecture/ast-and-planning.md is excluded by !**/*.md, !**/docs/**
  • docs/architecture/execution-model.md is excluded by !**/*.md, !**/docs/**
  • docs/architecture/overview.md is excluded by !**/*.md, !**/docs/**
  • docs/development/contributing.md is excluded by !**/*.md, !**/docs/**
  • docs/development/workflow.md is excluded by !**/*.md, !**/docs/**
  • docs/index.md is excluded by !**/*.md, !**/docs/**
  • docs/reference/RELEASE_TRACKING.md is excluded by !**/*.md, !**/docs/**
  • docs/reference/api.md is excluded by !**/*.md, !**/docs/**
  • docs/reference/opencypher-compatibility.md is excluded by !**/*.md, !**/docs/**
  • docs/releases/roadmap.md is excluded by !**/*.md, !**/docs/**
📒 Files selected for processing (8)
  • crates/gf-ast/src/ast.rs
  • crates/gf-ast/src/lib.rs
  • crates/gf-ast/src/parse_error.rs
  • crates/gf-ast/src/tests.rs
  • crates/gf-cypher/Cargo.toml
  • crates/gf-cypher/src/lexer.rs
  • crates/gf-cypher/src/lib.rs
  • crates/gf-cypher/src/parser/mod.rs

Comment thread crates/gf-cypher/src/lexer.rs
Comment on lines +10 to +13
Err(ParseError::new(
ParseErrorKind::UnexpectedEof { expected: vec![] },
Span::new(0, 0),
"parser not yet implemented",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use EOF span in the stub error for accurate diagnostics.

At Line 12, returning Span::new(0, 0) for all inputs makes downstream error reporting misleading. Use input.len() for the placeholder error span.

Proposed fix
 pub fn parse(input: &str) -> Result<AstQuery, ParseError> {
     let _ts = TokenStream::new(input)?;
+    let eof = input.len();
     Err(ParseError::new(
         ParseErrorKind::UnexpectedEof { expected: vec![] },
-        Span::new(0, 0),
+        Span::new(eof, eof),
         "parser not yet implemented",
     ))
 }
🧰 Tools
🪛 GitHub Actions: PR `#654` / 0_Analyze (rust).txt

[warning] 11-11: macro expansion failed for 'vec'

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gf-cypher/src/parser/mod.rs` around lines 10 - 13, The stub parser
error uses a fixed Span::new(0, 0) which misreports EOF location; update the
ParseError::new call (the UnexpectedEof construction in
crates/gf-cypher/src/parser/mod.rs) to use the actual EOF position by replacing
Span::new(0, 0) with a span anchored at the end of the input (e.g.,
Span::new(input.len(), input.len())), keeping the rest of the
ParseError::new(ParseErrorKind::UnexpectedEof { ... }, ..., "parser not yet
implemented") intact.

DecisionNerd and others added 2 commits May 29, 2026 10:45
- ast-and-planning.md: fix broken backtick nesting in Pratt binding powers line
- api.md: common parameters table now shows per-verb availability instead of
  claiming all three params are shared by all verbs
- adr/0002-rust-core.md: "four analyst-intent verbs" → seven (correct count)
- algorithms.md: "four analyst-intent verbs" → five (algorithm verbs only,
  correct count); add forge.similar() to overview code block

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tly dropping

Invalid hex digits or surrogate code points now return ParseError with the
offending escape span. Previously \uZZZZ would silently produce an empty
string fragment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment on lines +288 to +293
for _ in 0..4 {
match self.advance() {
Some(h) => hex.push(h),
None => break,
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium src/lexer.rs:288

In read_string, the \uXXXX escape handler consumes any 4 characters rather than exactly 4 hex digits. For input like "\uAB", the closing quote is swallowed as part of the hex sequence, producing hex = "AB\" and corrupting the lexer state. The loop should validate each character is a hex digit and break early if not.

-                            for _ in 0..4 {
-                                match self.advance() {
-                                    Some(h) => hex.push(h),
-                                    None => break,
-                                }
-                            }
+                            for _ in 0..4 {
+                                match self.peek() {
+                                    Some(h) if h.is_ascii_hexdigit() => {
+                                        self.advance();
+                                        hex.push(h);
+                                    }
+                                    _ => break,
+                                }
+                            }
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file crates/gf-cypher/src/lexer.rs around lines 288-293:

In `read_string`, the `\uXXXX` escape handler consumes any 4 characters rather than exactly 4 hex digits. For input like `"\uAB"`, the closing quote is swallowed as part of the hex sequence, producing `hex = "AB\"` and corrupting the lexer state. The loop should validate each character is a hex digit and break early if not.

Evidence trail:
crates/gf-cypher/src/lexer.rs lines 284-306 at REVIEWED_COMMIT: The `for _ in 0..4` loop (line 288) calls `self.advance()` and pushes any `Some(h)` character without checking `h.is_ascii_hexdigit()`. Line 290 `Some(h) => hex.push(h)` accepts any character. Line 275 `Some(c) if c == quote => break` is the quote-termination check but it's in the outer loop and never reached during the inner `for` loop's execution.

@DecisionNerd DecisionNerd merged commit 12565bd into main May 29, 2026
44 checks passed
@DecisionNerd DecisionNerd deleted the feature/554-lalrpop-grammar branch May 29, 2026 18:36
DecisionNerd added a commit that referenced this pull request May 29, 2026
* feat(gf-cypher): Pratt expression parser — all Expr variants (#650)

Implements parse_expr(ts, min_bp) in crates/gf-cypher/src/parser/expr.rs.

Binding power table:
  OR(10/11), XOR(20/21), AND(30/31), NOT prefix(35),
  comparisons(40/41), IS NULL/NOT IN/STARTS WITH/etc(40/40),
  +/-(50/51), */%( 60/61), ^(70/70 right-assoc),
  ./[(80/81 postfix)

Produces all Expr AST variants:
  Literal, Var, Param, BinaryOp, UnaryOp, FunctionCall,
  Property, List, Map, Case, ListComprehension,
  IsNull, InList, StringOp, RegexMatch, Parenthesized

- COUNT(*), COUNT(DISTINCT x), keyword-named functions (exists,
  any, none, single, filter, extract, reduce, shortestPath)
- Simple and searched CASE expressions
- List comprehensions with optional WHERE filter and | projection
- Subscript [idx] and slice [lo..hi] encoded as synthetic _subscript
  and _slice FunctionCall nodes until AST gains a dedicated Slice node
- eat_ident: error span now references the actual bad token
- parse_map_literal: duplicate keys are now a ParseError

49 tests passing (11 TokenStream + 38 expression).
Part of #554. Depends on #650 scaffolding (merged in #654).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(parser): extend eat_ident keyword-as-property coverage

Add NOT, AND, OR, XOR, OPTIONAL, ANY, NONE, SINGLE, FILTER, EXTRACT,
REDUCE, SHORTESTPATH, ALLSHORTESTPATHS to tok_as_keyword_str so that
property accesses like n.not or n.any parse correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(parser): use exact token end positions for literal spans

Atomic literal/param prefixes now capture (l, _, r) from advance() and
use Span::new(l, r) directly, so spans don't bleed into trailing
whitespace before the next token.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rust: implement recursive-descent + Pratt parser for gf-cypher

1 participant