bugfix: multi-line block quote inside list item (bd-vet6) #176
Open
cscheid wants to merge 6 commits into
Plan: claude-notes/plans/2026-05-11-bq-multiline-in-list-item.md
Adds 6 failing tree-sitter corpus tests (24-29) and 7 pampa pandoc-match fixtures that exercise the multi-line block quote inside a list item bug. Tests fail because of the LIST_ITEM match() newline branch consuming the \n that the line-ending gate needs (see plan doc for full root cause). Refined bug scope captured in plan: the bug requires a trailing \n after the second blockquote-marked line. Pampa auto-appends one (Q-7-1), which is why users hit it.
Maps every STATE_MATCHING / STATE_WAS_SOFT_LINE_BREAK / match_line site, identifies the invariant the bug violates, surveys existing corpus tests that depend on LIST_ITEM case 2, and writes down the exact proposed guard before implementing.
Skip the STATE_MATCHING block in scanner.c when STATE_WAS_SOFT_LINE_BREAK is set AND lookahead is \n or \r. The LIST_ITEM match() case 2 was advancing past the trailing newline, leaving the line-ending gate at line 2233 with nothing to match against. The line-ending gate (or the EOF handler above) handles this case cleanly when not bypassed by match_line first.

Results:
- tree-sitter test: 476/476 pass (6 new corpus tests for marker variants and 3-line case)
- cargo nextest -p pampa: 3685/3685 pass
- cargo nextest --workspace: 8804/8804 pass

No regressions in any existing tests.
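The guard described in this commit can be sketched as a small predicate. This is only an illustration, not the code in scanner.c: the flag values and the function name `should_run_matching` are assumptions made for the example, and the real scanner defines its own state bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real scanner.c defines its own. */
enum {
    STATE_MATCHING            = 1 << 0,
    STATE_WAS_SOFT_LINE_BREAK = 1 << 1,
};

/* Hypothetical helper: decides whether the STATE_MATCHING block should run.
 * It is skipped when the previous token was a soft line break and the
 * lookahead is a newline, so the line-ending gate gets to see that newline. */
static bool should_run_matching(uint8_t state, int32_t lookahead) {
    if (!(state & STATE_MATCHING)) {
        return false;
    }
    bool at_newline = (lookahead == '\n' || lookahead == '\r');
    if ((state & STATE_WAS_SOFT_LINE_BREAK) && at_newline) {
        return false; /* leave the newline for the line-ending gate */
    }
    return true;
}
```

Under these assumptions, the matching block still runs for every other lookahead after a soft line break, and for newlines when no soft line break preceded them, which is consistent with the existing LIST_ITEM case 2 tests continuing to pass.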
Pampa output matches pandoc native AST for the original reporter file and every marker variant + multi-line + follow-up content case. Regression sanity checks confirm bq-in-bq, blank-line list paragraphs, nested lists, and lazy continuation all still produce the same trees as before.
Full cargo xtask verify passed (9/9 steps, including hub-client and trace-viewer). No snapshot files changed. Branch ready for review.
4 tasks
Summary
The tree-sitter qmd parser failed on a Pandoc-valid CommonMark construct: a list item containing a block quote whose paragraph spans multiple lines using the `>` continuation marker (e.g. `- > a\n > b\n`). Pandoc parses this as `BulletList[[BlockQuote[Para[Str a, SoftBreak, Str b]]]]`; we returned a parse error at line 2 col 6. The bug fires for every list marker (`-`, `*`, `+`, `1.`, `1)`, etc.), the 3+-line case, and any variant with content following the block quote. The original 2-line variant only escapes because the scanner's EOF path runs before the buggy code (pampa hides this by auto-appending a `\n`, so users always hit it).

Root cause is in `crates/tree-sitter-qmd/tree-sitter-markdown/src/scanner.c`. After a `SOFT_LINE_ENDING`, the scanner sets `STATE_MATCHING | STATE_WAS_SOFT_LINE_BREAK`. When the parser then asks for the next token at the trailing `\n` of the second block-quote-marked line, the `STATE_MATCHING` block at line 2040 calls `match_line`, which routes `LIST_ITEM` through its `case 2` "blank-line continuation" branch, and that branch advances past the `\n`. By the time control reaches the line-ending gate at line 2233 (which checks `lookahead == '\n'`), the newline is gone, so the gate skips. The scan returns false, tree-sitter retries with a different lex-state and gets `_close_block`, which has no shift at the current parse state, and the parse errors out. The `BLOCK_QUOTE` match has no analogous `\n`-consuming branch, which is why nested block quotes (`> > a\n> > b\n`) work fine.

The fix bypasses the `STATE_MATCHING` block when `STATE_WAS_SOFT_LINE_BREAK` is set AND lookahead is `\n`/`\r`. The soft-line-break already accounted for the continuation prefix, so re-running `match_line` against the trailing newline of the same logical line is wrong; the line-ending gate handles it cleanly. Full investigation, neighborhood characterization, fix proposal, and end-to-end verification are in claude-notes/plans/2026-05-11-bq-multiline-in-list-item.md. The branch is structured as five commits, one per phase (failing tests → characterization → fix → e2e verification → close-out), so the reviewer can step through the work in order.

Test plan
- `tree-sitter test`: 476/476 (6 new corpus tests for marker variants + 3-line case)
- `cargo nextest run -p pampa`: 3685/3685 (7 new pandoc-match fixtures, all match Pandoc's native AST)
- `cargo nextest run --workspace`: 8804/8804
- `cargo xtask verify` (full, 9/9 steps including hub-client WASM build + tests): passed
- Output matches `pandoc -t native` byte-for-structure
- No `.snap` files touched

Closes bd-vet6.
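The root cause described in the summary can be reproduced in a toy model. The sketch below is not the real scanner: `ToyScanner`, `toy_match_list_item`, `toy_line_ending_gate`, and `toy_scan` are hypothetical stand-ins, the flag values are made up, and the only behavior modeled is "the matching step consumes the newline before the line-ending gate can see it".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
enum { STATE_MATCHING = 1 << 0, STATE_WAS_SOFT_LINE_BREAK = 1 << 1 };

/* Toy scanner: an input string, a cursor, and the state flags. */
typedef struct {
    const char *input;
    size_t pos;
    unsigned state;
} ToyScanner;

static char toy_lookahead(const ToyScanner *s) { return s->input[s->pos]; }

/* Stand-in for the LIST_ITEM "blank-line continuation" branch:
 * if the cursor sits on a newline, it advances past it. */
static void toy_match_list_item(ToyScanner *s) {
    if (toy_lookahead(s) == '\n') {
        s->pos++;
    }
}

/* Stand-in for the line-ending gate: succeeds only if a newline is still there. */
static bool toy_line_ending_gate(ToyScanner *s) {
    char c = toy_lookahead(s);
    if (c == '\n' || c == '\r') {
        s->pos++;
        return true;
    }
    return false;
}

/* One scan step. With with_guard set, the matching step is bypassed when the
 * previous token was a soft line break and the lookahead is a newline. */
static bool toy_scan(ToyScanner *s, bool with_guard) {
    char c = toy_lookahead(s);
    bool at_newline = (c == '\n' || c == '\r');
    bool bypass = with_guard && (s->state & STATE_WAS_SOFT_LINE_BREAK) && at_newline;
    if ((s->state & STATE_MATCHING) && !bypass) {
        toy_match_list_item(s);
    }
    return toy_line_ending_gate(s);
}
```

In this model, a scanner positioned on a trailing `\n` with both flags set returns false from `toy_scan` without the guard, because the matching step has already consumed the newline, and returns true with the guard. The real scanner has many more block types and states, so this only shows the ordering problem, not the full fix.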