
detect_error dedup loses specific Q-codes when GLR has forked earlier in the parse #230

@rundel

It is possible to create circumstances where the tree-sitter parser detects multiple errors at the same location. Claude believes this occurs because the GLR engine forks after earlier errors, but I have not verified this.

The problem is that errors sharing the same (row, column) are deduplicated without taking state or sym into account. As a result, when the surviving GLR versions disagree on the lookahead symbol, the dedup keeps one of them essentially arbitrarily, which can produce a generic "Parse error" rather than a more specific Q-code from the corpus.

I am not sure whether it is possible to construct a situation where multiple errors with distinct Q-codes are generated at the same location, or how that should be handled.

Example

# Heading {key=val .class}

Second *bad paragraph.
printf  "%s\n" "# Heading {key=val .class}" "" "Second *bad paragraph." \
  | pampa --no-prune-errors -t native
Error: [Q-2-3] Key-value Pair Before Class Specifier in Attribute
   ╭─[ <stdin>:1:20 ]
   │
 1 │ # Heading {key=val .class}
   │            ───┬─── ───┬──
   │               ╰──────────── This key-value pair cannot appear before the class specifier.
   │                       │
   │                       ╰──── This class specifier appears after the key-value pair.
───╯

Error: Parse error
   ╭─[ <stdin>:3:23 ]
   │
 3 │ Second *bad paragraph.
   │                       ┬
   │                       ╰── unexpected character or token here
───╯

Control

First good paragraph.

Second *bad paragraph.
printf  "%s\n" "First good paragraph." "" "Second *bad paragraph." \
  | pampa --no-prune-errors -t native
Error: [Q-2-12] Unclosed Star Emphasis
   ╭─[ <stdin>:3:23 ]
   │
 3 │ Second *bad paragraph.
   │ ───┬──                ┬
   │    ╰───────────────────── This is the opening '*' mark.
   │                       │
   │                       ╰── I reached the end of the block before finding a closing '*' for the emphasis.
───╯

Claude's assessment on what is happening under the hood

Running pampa with -v on the reproduction input shows that after the attribute-list error in the heading on line 1, GLR forks:

detect_error lookahead:attribute_class
recover_to_previous state:2590, depth:2
process version:0, version_count:2, state:0, row:0, col:25
process version:1, version_count:2, state:2590, row:0, col:19

Both versions continue parsing forward through the blank line and into the paragraph on line 3, and both reach the unclosed * at the same source position. Each emits its own detect_error event:

process version:0, version_count:2, state:1922, row:2, col:22
detect_error lookahead:_commonmark_single_quote_string_token2

process version:1, version_count:2, state:1922, row:2, col:22
detect_error lookahead:_close_block

Both events have (state=1922, row=2, col=22) but different lookahead symbols:

  • version 0: (state=1922, sym=_commonmark_single_quote_string_token2) — no entry in crates/pampa/resources/error-corpus/_autogen-table.json, so Merr falls through to the generic "Parse error" fallback at error_generation.rs:243-249.
  • version 1: (state=1922, sym=_close_block) — matches the Q-2-12 corpus entry at _autogen-table.json line ~2236 and would produce the specific diagnostic seen in the control case.
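The difference between the two bullets can be sketched as a table lookup keyed on (state, sym). This is only an illustration: `Table`, `build_table`, and `diagnose` are hypothetical names, and the single entry is copied from the log above rather than read from the real _autogen-table.json.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the corpus table: (parser state, lookahead symbol) -> Q-code.
type Table = HashMap<(usize, String), String>;

// Builds a one-entry table using the (state, sym) pair reported for version 1 above.
fn build_table() -> Table {
    let mut table = Table::new();
    table.insert((1922, "_close_block".to_string()), "Q-2-12".to_string());
    table
}

// Returns the specific Q-code when the pair is known, else the generic fallback text.
fn diagnose(table: &Table, state: usize, sym: &str) -> String {
    match table.get(&(state, sym.to_string())) {
        Some(code) => code.clone(),
        None => "Parse error".to_string(),
    }
}

fn main() {
    let table = build_table();
    // Version 0's lookahead has no entry, so it falls through to the fallback.
    println!("{}", diagnose(&table, 1922, "_commonmark_single_quote_string_token2"));
    // Version 1's lookahead matches the entry.
    println!("{}", diagnose(&table, 1922, "_close_block"));
}
```

Same state, different lookahead: only one of the two keys hits the table, which is why it matters which event survives the dedup.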

The collector in produce_diagnostic_messages (error_generation.rs:39-62) is:

let mut seen_errors: HashSet<(usize, usize)> = HashSet::new();
for parse in &tree_sitter_log.parses {
    for process_log in parse.processes.values() {
        for state in process_log.error_states.iter() {
            if seen_errors.contains(&(state.row, state.column)) {
                continue;
            }
            seen_errors.insert((state.row, state.column));
            ...

The dedup key is just (row, column). Whichever GLR version's detect_error event is visited first in the HashMap iteration claims that source position; the other version's event is silently dropped before it ever reaches lookup_error_entry.
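One possible direction, sketched under assumptions: keep one event per (row, column), but prefer an event whose (state, sym) has a corpus entry over one that does not. `ErrorEvent`, `has_entry`, `pick_per_position`, and the `known` set are hypothetical stand-ins, not pampa's actual types, and the sample data is copied from the log above.

```rust
use std::collections::btree_map::Entry;
use std::collections::{BTreeMap, HashSet};

// Hypothetical, simplified shape of one detect_error event.
#[derive(Debug, Clone, PartialEq)]
struct ErrorEvent {
    row: usize,
    column: usize,
    state: usize,
    sym: String,
}

// True when the event's (state, sym) pair has a specific corpus entry.
fn has_entry(known: &HashSet<(usize, String)>, ev: &ErrorEvent) -> bool {
    known.contains(&(ev.state, ev.sym.clone()))
}

// Keeps one event per (row, column); a corpus-matching event replaces a non-matching one.
fn pick_per_position(events: &[ErrorEvent], known: &HashSet<(usize, String)>) -> Vec<ErrorEvent> {
    let mut chosen: BTreeMap<(usize, usize), ErrorEvent> = BTreeMap::new();
    for ev in events {
        match chosen.entry((ev.row, ev.column)) {
            Entry::Vacant(slot) => {
                slot.insert(ev.clone());
            }
            Entry::Occupied(mut slot) => {
                if has_entry(known, ev) && !has_entry(known, slot.get()) {
                    slot.insert(ev.clone());
                }
            }
        }
    }
    chosen.into_values().collect()
}

// The two events from the log, with version 0 (no corpus entry) visited first.
fn sample_events() -> Vec<ErrorEvent> {
    ["_commonmark_single_quote_string_token2", "_close_block"]
        .iter()
        .map(|sym| ErrorEvent { row: 2, column: 22, state: 1922, sym: sym.to_string() })
        .collect()
}

// The one (state, sym) pair that the log says matches Q-2-12.
fn sample_known() -> HashSet<(usize, String)> {
    HashSet::from([(1922, "_close_block".to_string())])
}

fn main() {
    let picked = pick_per_position(&sample_events(), &sample_known());
    println!("{:?}", picked);
}
```

With this rule the surviving event no longer depends on visit order for the reproduction case. It does not settle what to do when two events at the same position both match distinct Q-codes; here the first one seen is kept.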

parse.processes is a HashMap<usize, TreeSitterProcessLog> keyed on GLR version number and iterated via .values(). Rust's standard HashMap does not guarantee any iteration order, and its default hasher is randomly seeded, so which version is visited first is not something the code controls. In practice the result was stable on this build: running the reproduction ten times in a row produced the generic "Parse error" output every time, indicating that version 0 (the non-corpus-matching version) consistently wins the dedup.
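If a defined visit order is wanted regardless of the hasher, a BTreeMap iterates in ascending key order. The sketch below uses a hypothetical `visit_order` helper and string placeholders instead of TreeSitterProcessLog. Note that a fixed order alone would not fix this issue, since version 0 would still be visited first.

```rust
use std::collections::BTreeMap;

// Returns the GLR version numbers in the order the map would be iterated.
fn visit_order(processes: &BTreeMap<usize, &str>) -> Vec<usize> {
    processes.keys().copied().collect()
}

fn main() {
    let mut processes = BTreeMap::new();
    // Insertion order is deliberately reversed; BTreeMap still iterates by key.
    processes.insert(1, "log for version 1");
    processes.insert(0, "log for version 0");
    println!("{:?}", visit_order(&processes));
}
```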
