bugfix: multi-line block quote inside list item (bd-vet6)#176

Open: cscheid wants to merge 6 commits into main from bugfix/multiline-list-item

@cscheid cscheid commented May 11, 2026

Summary

The tree-sitter qmd parser failed on a Pandoc-valid CommonMark construct: a list item containing a block quote whose paragraph spans multiple lines using the > continuation marker (e.g. - > a\n > b\n). Pandoc parses this as BulletList[[BlockQuote[Para[Str a, SoftBreak, Str b]]]]; we returned a parse error at line 2 col 6. The bug fires for every list marker (-, *, +, 1., 1), etc.), for the 3+-line case, and for any variant with content following the block quote. The original 2-line variant escapes only when the input has no trailing \n, because the scanner's EOF path then runs before the buggy code. pampa auto-appends a \n to its input, so that escape never applies in practice and users always hit the bug.

Root cause is in crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c. After a SOFT_LINE_ENDING, the scanner sets STATE_MATCHING | STATE_WAS_SOFT_LINE_BREAK. When the parser then asks for the next token at the trailing \n of the second block-quote-marked line, the STATE_MATCHING block at line 2040 calls match_line, which routes LIST_ITEM through its case 2 "blank-line continuation" branch — that branch advances past the \n. By the time control reaches the line-ending gate at line 2233 (which checks lookahead == '\n'), the newline is gone, so the gate skips. The scan returns false, tree-sitter retries with a different lex-state and gets _close_block which has no shift at the current parse state, and the parse errors out. The BLOCK_QUOTE match has no analogous \n-consuming branch, which is why nested block quotes (> > a\n> > b\n) work fine.
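The ordering problem described above can be illustrated with a toy model. This is only a sketch, not the code in scanner.c: ToyLexer, toy_match_list_item_blank, toy_line_ending_gate, and toy_scan_at_line_end are hypothetical names, and the real scanner works on tree-sitter's lexer and block stack rather than a plain string. It shows only that a gate checking lookahead == '\n' cannot fire once an earlier step has consumed that newline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the tree-sitter lexer: a cursor over a C string. */
typedef struct {
    const char *input;
    size_t pos;
} ToyLexer;

static char toy_lookahead(const ToyLexer *lx) { return lx->input[lx->pos]; }

static void toy_advance(ToyLexer *lx) {
    if (lx->input[lx->pos] != '\0') lx->pos++;
}

/* Models the LIST_ITEM "blank-line continuation" branch (assumed shape):
 * if the cursor sits on a newline, consume it and report a match. */
static bool toy_match_list_item_blank(ToyLexer *lx) {
    if (toy_lookahead(lx) == '\n') {
        toy_advance(lx);
        return true;
    }
    return false;
}

/* Models the line-ending gate: it can fire only while the newline
 * is still the lookahead character. */
static bool toy_line_ending_gate(const ToyLexer *lx) {
    return toy_lookahead(lx) == '\n';
}

/* Returns true if the line-ending gate fires for the remaining input.
 * run_matching_first selects whether the newline-consuming match runs
 * before the gate, as in the buggy ordering. */
static bool toy_scan_at_line_end(const char *rest_of_input, bool run_matching_first) {
    assert(rest_of_input != NULL);
    ToyLexer lx = { rest_of_input, 0 };
    if (run_matching_first) {
        toy_match_list_item_blank(&lx);
    }
    return toy_line_ending_gate(&lx);
}
```

In this model, toy_scan_at_line_end("\n", true) returns false because the match step has already moved past the newline, while toy_scan_at_line_end("\n", false) returns true.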

The fix bypasses the STATE_MATCHING block when STATE_WAS_SOFT_LINE_BREAK is set AND lookahead is \n/\r — the soft-line-break already accounted for the continuation prefix, so re-running match_line against the trailing newline of the same logical line is wrong; the line-ending gate handles it cleanly. Full investigation, neighborhood characterization, fix proposal, and end-to-end verification are in claude-notes/plans/2026-05-11-bq-multiline-in-list-item.md. The branch is structured as five commits, one per phase (failing tests → characterization → fix → e2e verification → close-out), so the reviewer can step through the work in order.
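As a rough sketch of the guard condition, assuming the scanner state is a small bitmask: the flag names come from the description above, but the bit values, the helper names (is_line_end, should_run_matching), and the standalone-function form are assumptions made for illustration. The actual change in scanner.c may be written differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag names follow the PR description; the bit values are assumed. */
enum {
    STATE_MATCHING = 1 << 0,
    STATE_WAS_SOFT_LINE_BREAK = 1 << 1
};

/* Hypothetical helper: true for the two characters the guard checks. */
static bool is_line_end(int32_t c) { return c == '\n' || c == '\r'; }

/* Hypothetical helper: decides whether the STATE_MATCHING block should run.
 * It is skipped when a soft line break was just emitted and the lookahead
 * is the trailing newline of the same logical line, so that the
 * line-ending gate still sees the newline. */
static bool should_run_matching(uint8_t state, int32_t lookahead) {
    if (!(state & STATE_MATCHING)) {
        return false;
    }
    bool bypass = (state & STATE_WAS_SOFT_LINE_BREAK) && is_line_end(lookahead);
    return !bypass;
}

/* Small self-check covering the cases named in the description. */
static void guard_examples(void) {
    /* After a soft line break, at a trailing newline: skip matching. */
    assert(!should_run_matching(STATE_MATCHING | STATE_WAS_SOFT_LINE_BREAK, '\n'));
    /* After a soft line break, at ordinary content: match as usual. */
    assert(should_run_matching(STATE_MATCHING | STATE_WAS_SOFT_LINE_BREAK, '>'));
    /* No preceding soft line break: the newline does not trigger the bypass. */
    assert(should_run_matching(STATE_MATCHING, '\n'));
}
```

Keeping the condition this narrow (both the soft-line-break flag and a newline lookahead) is what lets ordinary blank-line continuation in list items keep going through match_line as before.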

Test plan

  • tree-sitter test: 476/476 (6 new corpus tests for marker variants + 3-line case)
  • cargo nextest run -p pampa: 3685/3685 (7 new pandoc-match fixtures, all match Pandoc's native AST)
  • cargo nextest run --workspace: 8804/8804
  • cargo xtask verify (full, 9/9 steps including hub-client WASM build + tests): passed
  • Manual end-to-end on the original reporter file and all marker / multi-line / follow-up content variants; pampa output matches pandoc -t native byte-for-structure
  • Regression sanity checks: bq-in-bq, blank-line-separated paragraphs in list items, nested lists, and lazy continuation all still produce the same trees as before
  • No .snap files touched

Closes bd-vet6.

cscheid added 6 commits May 11, 2026 10:53
Plan: claude-notes/plans/2026-05-11-bq-multiline-in-list-item.md
Adds 6 failing tree-sitter corpus tests (24-29) and 7 pampa
pandoc-match fixtures that exercise the multi-line block quote inside
a list item bug. Tests fail because of the LIST_ITEM match() newline
branch consuming the \n that the line-ending gate needs (see plan doc
for full root cause).

Refined bug scope captured in plan: the bug requires a trailing \n
after the second blockquote-marked line. Pampa auto-appends one
(Q-7-1), which is why users hit it.

Maps every STATE_MATCHING / STATE_WAS_SOFT_LINE_BREAK / match_line
site, identifies the invariant the bug violates, surveys existing
corpus tests that depend on LIST_ITEM case 2, and writes down the
exact proposed guard before implementing.

Skip the STATE_MATCHING block in scanner.c when
STATE_WAS_SOFT_LINE_BREAK is set AND lookahead is \n or \r. The
LIST_ITEM match() case 2 was advancing past the trailing newline,
leaving the line-ending gate at line 2233 with nothing to match
against. The line-ending gate (or the EOF handler above) handles this
case cleanly when not bypassed by match_line first.

Results:
- tree-sitter test: 476/476 pass (6 new corpus tests for marker
  variants and 3-line case)
- cargo nextest -p pampa: 3685/3685 pass
- cargo nextest --workspace: 8804/8804 pass

No regressions in any existing tests.

Pampa output matches pandoc native AST for the original reporter file
and every marker variant + multi-line + follow-up content case.
Regression sanity checks confirm bq-in-bq, blank-line list paragraphs,
nested lists, and lazy continuation all still produce the same trees
as before.

Full cargo xtask verify passed (9/9 steps, including hub-client and
trace-viewer). No snapshot files changed. Branch ready for review.