refactor(mdxish): add micromark tokenizer for MDX component parsing#1426
Conversation
```
  });
});

describe('given there is a line content right above the Tabs component', () => {
```
As of now, this would throw an error
This test file tests the whole mdxish pipeline, not just the component transformers, so it focuses on verifying that various forms of components are still parsable
The changes in this file are mainly:
- Adding more coverage for the helper functions such as parseTag & parseAttributes
- Accounting for the simpler code in mdxish-component-blocks.ts
- Improving robustness of some tests by matching the resulting AST against the expected object, which is simpler to write and gives stricter checking
@eaglethrost what do you mean by
maximilianfalco
left a comment
looks good overall! just some minor comments
```
attributes: parseAttributes(attrString),
selfClosing: !!selfClosing,
contentAfterTag,
attrString // Just for debugging purposes
```
this can be deleted right?
Not really because I want to use it for testing, even though it's not used in the rendering code. Do you have a suggestion to make it cleaner?
Sorry, there was actually a `<Table>` in between "the" and "tokenizer", but GitHub parsed it as HTML so it didn't show 🤦 Fixed this
```diff
 const processor = remark()
-  .data('micromarkExtensions', [magicBlock()])
   .data('fromMarkdownExtensions', [magicBlockFromMarkdown()])
+  .data('micromarkExtensions', [magicBlock(), mdxComponent()])
```
Added the tokenizer here as well since mdxishComponentBlocks is simplified, so they go in tandem. This also allows a wider variety of component tags to be captured.
Follow up: I think we can actually remove 1) the snake case preprocessing and 2) the mdxishSelfClosingBlocks transformers now. Will do so in a follow-up to reduce PR size and decrease blast radius
kevinports
left a comment
Had a few minor comments but otherwise code looks good.
I spent quite a bit of time QAing this locally on a clone of the developers project and the page rendering / editor loading seems fine.
BUT I did follow the repro steps for this ticket linked to this PR:
https://linear.app/readme-io/issue/CX-3155/mdxish-editor-invalid-mdx-syntax-saves-silently-but-causes-500-error
and used the snippet provided in the reproduction steps, and the code doesn't serialize and cannot be rendered:
CleanShot.2026-04-15.at.15.53.37.mp4
```
import type { CompileContext, Extension as FromMarkdownExtension, Handle } from 'mdast-util-from-markdown';

function enterMdxComponent(this: CompileContext, token: Parameters<Handle>[0]): void {
  this.enter({ type: 'html', value: '' } as Html, token);
```
Took me a sec to realize we're just capturing the block-level JSX component syntax as an html node here when I'd initially expected mdxJsxFlowElement nodes to be emitted.
It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized MdxComponent as an opaque html MDAST node at this stage, and it actually gets processed and structured into an mdxJsxFlowElement node downstream in the mdxish transformers.
> Took me a sec to realize we're just capturing the block-level JSX component syntax as an `html` node here when I'd initially expected `mdxJsxFlowElement` nodes to be emitted. It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized `MdxComponent` as an opaque `html` MDAST node at this stage, and it actually gets processed and structured into an `mdxJsxFlowElement` node downstream in the mdxish transformers.
Added here!
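For readers skimming the thread, here's a rough sketch of that two-stage representation. The node shapes are illustrative approximations (the `Callout` example and attribute contents are invented), not the exact mdast-util types used in the repo:

```typescript
// Stage 1 (tokenizer output): the whole component block is captured as one
// opaque html node, so CommonMark can't fragment it across paragraphs.
const stage1 = {
  type: 'html',
  value: '<Callout icon="📘">\n  Some **markdown** body\n</Callout>',
};

// Stage 2 (after the mdxish transformers): the same block, restructured into
// a JSX flow element with parsed attributes and transformed children.
const stage2 = {
  type: 'mdxJsxFlowElement',
  name: 'Callout',
  attributes: [{ type: 'mdxJsxAttribute', name: 'icon', value: '📘' }],
  children: [] as unknown[], // transformed markdown children of the body
};

console.log(stage1.type, '->', stage2.type);
```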
Should these const declarations be hoisted up to a root location if they're shared by this tokenizer and also a few transformers?
Yeah good point, I've hoisted it to the tokenizer level, but it also makes sense to hoist it further up. Moved it to lib/constants.ts
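For illustration, the hoisted shared constants might look something like this (the name and members here are hypothetical; the real `lib/constants.ts` may differ):

```typescript
// Hypothetical sketch: one shared source of truth for component names the
// generic tokenizer must skip because dedicated tokenizers/transformers exist.
const COMPONENTS_WITH_DEDICATED_HANDLING: ReadonlySet<string> = new Set([
  'Table',
  'HTMLBlock',
  'Glossary',
  'Anchor',
]);

console.log(COMPONENTS_WITH_DEDICATED_HANDLING.has('Table')); // true
```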
```
/**
 * Inline-only custom components. These must not use generic block-level
 * `<Component>...</Component>` handling (MDXish transforms) or be treated as
 * flow elements where inline semantics apply.
 */
```
This comment is a bit confusing because it just explains what but not why.
Suggested change:

```
/**
 * Inline-only custom components that appear as phrasing content within paragraph
 * nodes. Excluding them from the generic mdxComponent micromark construct lets
 * the dedicated inline transformer handle them instead.
 * @see mdxish-inline-components.ts
 */
```
@kevinports Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and once I successfully link the repo. Here's a demo: Screen.Recording.2026-04-16.at.10.42.57.am.mov
```diff
 // Find the first closing tag
-const closingTagIndex = contentAfterTag.indexOf(closingTagStr);
 const componentInnerContent = contentAfterTag.substring(0, closingTagIndex).trim();
+const closingTagIndex = contentAfterTag.lastIndexOf(closingTagStr);
```
Found a potential bug: if there are nested components of the same type, it will cut off at the first closing tag, which is not what we want. We want the outermost one, because it's the one that matches the first opening tag.
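To illustrate the fix, here's a hedged, self-contained reduction (`innerContent` is a hypothetical helper, not the real transformer code):

```typescript
// Hypothetical reduction of the transformer's inner-content extraction.
// `contentAfterTag` is everything after the outer opening <Tabs> tag.
function innerContent(contentAfterTag: string, closingTagStr: string, useLast: boolean): string {
  const idx = useLast
    ? contentAfterTag.lastIndexOf(closingTagStr) // outermost closing tag
    : contentAfterTag.indexOf(closingTagStr);    // innermost closing tag (buggy)
  return contentAfterTag.substring(0, idx).trim();
}

const body = 'outer start <Tabs>inner</Tabs> outer end </Tabs>';

// indexOf stops at the *inner* closing tag, dropping "outer end":
console.log(innerContent(body, '</Tabs>', false)); // "outer start <Tabs>inner"

// lastIndexOf matches the outermost closing tag, keeping the full body:
console.log(innerContent(body, '</Tabs>', true)); // "outer start <Tabs>inner</Tabs> outer end"
```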
kevinports
left a comment
> Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and if I successfully link the repo. Here's a demo:
Ok yeah checked again this morning and things are fine. I think I linked the wrong branch. My bad!
QA'd again locally and everything appears to be in order.
## Version 14.0.0
### ⚠ BREAKING CHANGES
* **mdxish:** magic blocks), the parser had no way to recognize `{/* ... */}`
as a single expression
> [!NOTE]
> Technically, introducing the flow construct of the `mdxExpression` would partly fix this issue, but we cannot do this because it would break magic block parsing: magic blocks usually contain `{ \n \n }` JSON payloads, and the flow construct would eat them before the magic block tokenizer got a chance to run.
* created a dedicated `jsxComment` micromark tokenizer for JSX comments
* enabled for all branches instead of being gated by `newEditorTypes`
> [!NOTE]
> In the future, as a follow-up, we can work on integrating this new tokenizer and fully removing the `removeJSXComments` preprocessing step, similar to what we do in `stripComments`.
## 🧪 QA tips
Try pasting this in the editor (both rich and raw):
```
{/* Dropbox hosted. What was it?
[block:image]
{
"images": [
{
"image": [
"https://paper-attachments.dropboxusercontent.com/s_924615C008BE616D2ABE30441190A9401D739971D9AD94E8B5B46A426C5F05E8_1683123013834_Dev+Dash+-+hero+docs+shot.png",
null,
"The My Developers page in your ReadMe project dashboard"
],
"align": "center",
"border": true,
"caption": "The My Developers page in your ReadMe project dashboard"
}
]
}
[/block]
*/}
```
It should be treated as one big comment block instead of a rendered
image.
## 📸 Screenshot or Loom
https://github.com/user-attachments/assets/9b619f54-3d6e-456f-83be-27d4076a7a3a
### ✨ New & Improved
* **mdxish:** add micromark tokenizer for MDX component parsing ([#1426](#1426)) ([a77707d](a77707d))
* move numbered lists for 1 -> a -> i ([#1438](#1438)) ([7e614d4](7e614d4))
### 🛠 Fixes & Updates
* **mdxish:** properly parse JSX comments in editor deserialization ([#1419](#1419)) ([95e1969](95e1969))
<!--SKIP CI-->
This PR was released! 🚀 Changes included in v14.0.0
| 🎫 Resolve RM-15852 |
| :-----------------: |

## Context

- In the MDXish Editor, component attribute expressions were being evaluated during preprocessing. Example: `<MyComp attr={1 + 1} />` became `<MyComp attr="2" />`, which permanently lost the original expression
- This came from a fundamental MDXish workaround: because we were not using full MDX parsing for attributes, preprocessing normalized non-quoted JSX-style attributes into quoted HTML-safe values to avoid breaking AST nodes
- We can't just use safeMode because we still need the inline expressions to be parsed as expression nodes, so there still needs to be expression parsing

## 🎯 What does this PR do?

- With the new MDX tokenizer work in #1426 that this is based upon, we can now preserve attribute syntax safely (including expression forms) without that preprocessing conversion, so this PR entirely removes the `evaluateJSXExpression` step in the preprocessing so the attributes don't get evaluated. This means the output of the `mdxishAstProcessor` will contain the raw expression value instead of the evaluated one
- There was a step in `evaluateJSXExpression` that encoded object / array attributes so they don't get split in the `rehypeRaw` step. Since that's removed, I moved that code to `mdxish-handlers`
- Removed the jsxContext argument since we don't actually use it anywhere in the readme app
- Since attribute expressions are no longer evaluated in preprocessing, the `parseAttributes` code needs to be beefed up to deal with expressions. I created a mini tokenizer / traverser in `mdxish-component-tag-parser.ts` to do this

The crux of the change is not much, but there are quite a lot of tests that need to be modified or are no longer needed.


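A toy sketch of the behavioral change described above (`oldBehavior` / `newBehavior` are hypothetical stand-ins, not the actual `evaluateJSXExpression` or `parseAttributes` code):

```typescript
// Old preprocessing: evaluate the attribute expression, losing the source.
function oldBehavior(expr: string): string {
  // e.g. "1 + 1" -> "2" (the original expression is gone for good)
  return String(eval(expr));
}

// New behavior: keep the raw expression so it round-trips through the editor.
function newBehavior(expr: string): { type: string; raw: string } {
  return { type: 'expression', raw: expr };
}

console.log(oldBehavior('1 + 1'));     // "2"
console.log(newBehavior('1 + 1').raw); // "1 + 1"
```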
## QA Tips

See my comment; it's better to test this in the readme app directly and locally link the markdown repo to it

## 📸 Screenshot or Loom

E2E test with the readme app: in the Mdxish Editor the attribute expressions are now retained and not evaluated, but the rendering stays the same.

Before:

<img width="1070" height="636" alt="Screenshot 2026-04-09 at 11 28 18 pm" src="https://github.com/user-attachments/assets/1ac3be2e-ccc0-44ff-9a1f-b6ac1157b3ed" />

After:

<img width="1095" height="687" alt="Screenshot 2026-04-09 at 5 45 56 pm" src="https://github.com/user-attachments/assets/86982b7f-32ad-4993-b14f-dbb9c18e101e" />
## Version 14.1.0

### ⚠ BREAKING CHANGES

* **mdxish:** AST nodes - We can't just use safeMode because we still need the inline expressions to be parsed as expression nodes, so there still needs to be expression parsing

### ✨ New & Improved

* **mdxish:** remove jsx attribute expression preprocessing ([#1429](#1429)) ([9545472](9545472)), closes [#1426](#1426)

### 🛠 Fixes & Updates

* add readme components tests ([#1434](#1434)) ([72ea080](72ea080)), closes [#1354](#1354)
* **deps:** bump actions/upload-artifact from 5 to 7 ([#1417](#1417)) ([3ca6d32](3ca6d32))
* **mdxish:** legacy glossary syntax in callout title crashing ([#1441](#1441)) ([9d1b18b](9d1b18b))

<!--SKIP CI-->

🎯 What does this PR do?
- A new micromark tokenizer (`lib/micromark/mdx-component/syntax.ts`) that captures `<Component>...</Component>` blocks as single tokens during parsing, preventing CommonMark from fragmenting them across multiple HTML/paragraph nodes. It supports snake case components as well
- Simplifies `mdxish-component-blocks.ts` by removing the multi-sibling stitching logic (Case 3: `scanForClosingTag`, `stripClosingTagFromParagraph`, `MAX_LOOKAHEAD`), ~200 lines removed
- Handles quoted attributes (`"`, `'`), JSX expression attributes (`{...}`), template literals (backticks with `${...}` interpolation), code spans, and fenced code blocks inside component bodies

Why
The existing pipeline relied on string-level preprocessing and post-parse stitching to handle MDX components that remark fragmented across multiple AST nodes. This was fragile and sensitive to whitespace, blank lines, and edge cases. A tokenizer captures component boundaries at parse time, which is more robust and reduces the need for preprocessing workarounds. It also allows capturing more complex forms of attributes that otherwise might have broken up the tree or been difficult to parse.
Key design decisions
- Emits `html` nodes (not `mdxJsxFlowElement`), since inner content still needs markdown transformation, which happens in the simplified `mdxish-component-blocks` transformer (Case 1 + Case 2 only)
- `Table`, `HTMLBlock`, `Glossary`, `Anchor` are skipped (they have dedicated tokenizers/transformers)
- Uses a `templateStack` to track `${...}` interpolation nesting, critical for safeMode where `preprocessJSXExpressions` is skipped and the tokenizer sees raw expressions
- `safeDeindent` in the transformer strips just enough indentation (down to 3 spaces) to prevent deeply nested component content from being treated as code blocks, while preserving whitespace that becomes HAST text nodes

I've looked at several unified / remark libraries that already parse / tokenize MDX components, but they enforce stricter syntaxes under which existing tests no longer pass, which goes back to the original MDX problem that MDXish tries to solve. Hence I think we need to create the tokenizer manually.
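A minimal sketch of the `templateStack` idea (hypothetical and self-contained; the real tokenizer works on micromark character codes, not plain strings, and handles more edge cases):

```typescript
// Scan `src` starting at the opening backtick (index `start`) and return the
// index just past the matching closing backtick, honoring ${ ... } nesting.
function scanTemplateLiteral(src: string, start: number): number {
  const templateStack: string[] = []; // tracks open ${ ... } interpolations
  let i = start + 1; // skip the opening backtick
  while (i < src.length) {
    const ch = src[i];
    if (ch === '\\') { i += 2; continue; }            // skip escaped character
    if (ch === '$' && src[i + 1] === '{') { templateStack.push('${'); i += 2; continue; }
    if (ch === '}' && templateStack.length > 0) { templateStack.pop(); i += 1; continue; }
    if (ch === '`' && templateStack.length === 0) return i + 1; // closing backtick
    i += 1;
  }
  return -1; // unterminated template literal
}

// Without the stack, the backtick opening the nested template would be
// mistaken for the closing one and the attribute would be cut short.
const src = 'attr={`with ${obj.key} and ${`nested ${x}`} end`} />';
const open = src.indexOf('`');
console.log(src.slice(open, scanTemplateLiteral(src, open)));
```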
Next steps
- } wrapping part inside the tags, but we should create that next since it can simplify the HTML transformers as well

🧪 QA tips
- Check that `{...}` expression attributes render correctly in both normal and safeMode
- Test nested components of the same type (`<Tabs>` inside `<Tabs>`)

📸 Screenshot or Loom
There should be no visual changes.