Skip to content

refactor(mdxish): add micromark tokenizer for MDX component parsing#1426

Merged
kevinports merged 22 commits intonextfrom
dimas/rm-15995-refactor-mdx-component-parsing-to-use-a-tokenizer-instead
Apr 16, 2026
Merged

refactor(mdxish): add micromark tokenizer for MDX component parsing#1426
kevinports merged 22 commits intonextfrom
dimas/rm-15995-refactor-mdx-component-parsing-to-use-a-tokenizer-instead

Conversation

@eaglethrost
Copy link
Copy Markdown
Contributor

@eaglethrost eaglethrost commented Apr 7, 2026

🎫 Resolve RM-15995, CX-3155

🎯 What does this PR do?

  • Adds a micromark flow tokenizer (lib/micromark/mdx-component/syntax.ts) that captures <Component>...</Component> blocks as single tokens during parsing, preventing CommonMark from fragmenting them across multiple HTML/paragraph nodes. It supports snake case components as well
  • Simplifies mdxish-component-blocks.ts by removing the multi-sibling stitching logic (Case 3: scanForClosingTag, stripClosingTagFromParagraph, MAX_LOOKAHEAD) — ~200 lines removed
  • The tokenizer handles: PascalCase tag names, depth tracking for same-name nesting, quoted attributes (", '), JSX expression attributes ({...}), template literals (` with ${...} interpolation), code spans, and fenced code blocks inside component bodies

Why

The existing pipeline relied on string-level preprocessing and post-parse stitching to handle MDX components that remark fragmented across multiple AST nodes. This was fragile and sensitive to whitespace, blank lines, and edge cases. A tokenizer captures component boundaries at parse time, which is more robust and reduces the need for preprocessing workarounds. It also allows capture of more complex forms of attributes that otherwise might have broken up the tree / difficult to parse.

Key design decisions

  • Outputs html nodes (not mdxJsxFlowElement) — inner content still needs markdown transformation, which happens in the simplified mdxish-component-blocks transformer (Case 1 + Case 2 only)
  • Excluded tagsTable, HTMLBlock, Glossary, Anchor are skipped (they have dedicated tokenizers/transformers)
  • Template literal awareness — uses a templateStack to track ${...} interpolation nesting, critical for safeMode where preprocessJSXExpressions is skipped and the tokenizer sees raw expressions
  • safeDeindent in the transformer strips just enough indentation (down to 3 spaces) to prevent deeply nested component content from being treated as code blocks, while preserving whitespace that becomes HAST text nodes

I've looked at several unified / remark libraries that already parses / tokenizes MDX components, but they resulted in more strict syntaxes existing test no longer pass, which goes back to the original MDX problem that MDXish tries to solve. Hence I think we would need to manually create the tokenizer.

Next steps

  • I think we probably can still remove / simplify more code that's involved in component parsing. We can also look to remove the <Table> tokenizer and reuse this new one. Currently I'm leaving it as is to avoid increasing the blast radius of this refactor
  • We can't reuse this tokenizer for HTMLBlock because it has the { } wrapping part inside the tags, but we should create that next since it can simplify the HTML transformers as well

🧪 QA tips

  • Verify components with {...} expression attributes render correctly in both normal and safeMode
  • Test deeply nested same-name components (e.g., <Tabs> inside <Tabs>)
  • Verify components with fenced code blocks in their body don't break (closing tag inside code shouldn't terminate the component)
  • In the markdown playground, chose test fixtures that has components on it

📸 Screenshot or Loom

There should be no visual changes.

@eaglethrost eaglethrost changed the title feat: add micromark tokenizer for MDX component parsing refactor(mdxish): add micromark tokenizer for MDX component parsing Apr 7, 2026
@eaglethrost eaglethrost requested review from Jadenzzz, maximilianfalco and rafegoldberg and removed request for rafegoldberg April 7, 2026 13:32
@eaglethrost eaglethrost marked this pull request as draft April 7, 2026 13:33
@eaglethrost eaglethrost marked this pull request as ready for review April 8, 2026 08:25
});
});

describe('given there is a line content right above the Tabs component', () => {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, this would throw an error

Copy link
Copy Markdown
Contributor Author

@eaglethrost eaglethrost Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test file tests the whole mdxish pipeline and not just the component transformers, so it focuses on testing of various forms of components are still parsable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file are mainly:

  • Adding more coverage for the helper functions such as parseTag & parseAttributes
  • Accounting for the simpler code in mdxish-component-blocks.ts
  • Improving robustness of some tests by matching resulting ast with the expected object so it's simpler to write & more strict checking

@maximilianfalco
Copy link
Copy Markdown
Contributor

@eaglethrost what do you mean by

We can also look to remove the tokenizer and reuse this new one

Copy link
Copy Markdown
Contributor

@maximilianfalco maximilianfalco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good overall! just some minor comments

Comment thread lib/micromark/mdx-component/syntax.ts Outdated
Comment thread lib/micromark/mdx-component/syntax.ts
Comment thread __tests__/lib/mdxish/components.test.ts Outdated
Comment thread __tests__/lib/render-mdxish/Tabs.test.tsx Outdated
attributes: parseAttributes(attrString),
selfClosing: !!selfClosing,
contentAfterTag,
attrString // Just for debugging purposes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
attrString // Just for debugging purposes

this can be deleted right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really because I want to use it for testing, even though it's not used in the rendering code. Do you have a suggestion to make it cleaner?

@eaglethrost
Copy link
Copy Markdown
Contributor Author

@eaglethrost what do you mean by

We can also look to remove the tokenizer and reuse this new one

Sorry there was actually a <Table> in between "the" and "tokenizer", but GitHub parsed it as HTML hence it didn't show 🤦 Fixed this

Comment thread lib/mdxishTags.ts
const processor = remark()
.data('micromarkExtensions', [magicBlock()])
.data('fromMarkdownExtensions', [magicBlockFromMarkdown()])
.data('micromarkExtensions', [magicBlock(), mdxComponent()])
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the tokenizer here as well since mdxishComponentBlocks is simplified, so they both go in tandem. This also allows more variety of components tags to be captured.

@eaglethrost
Copy link
Copy Markdown
Contributor Author

eaglethrost commented Apr 15, 2026

Follow up: I think we can actually remove the 1) snake case preprocessing 2) mdxishSelfClosingBlocks transformers now. Will do so in a follow up to reduce PR size and decrease blast radius

Copy link
Copy Markdown
Contributor

@kevinports kevinports left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few minor comments but otherwise code looks good.
I spent quite a bit of time QAing this locally on a clone of the developers project and the page rendering / editor loading seems fine.

BUT I did follow the repro steps for this ticket linked to this PR:
https://linear.app/readme-io/issue/CX-3155/mdxish-editor-invalid-mdx-syntax-saves-silently-but-causes-500-error
and used the snippet provided in the reproduction steps, and the code doesn't serialize and cannot be rendered:

CleanShot.2026-04-15.at.15.53.37.mp4

import type { CompileContext, Extension as FromMarkdownExtension, Handle } from 'mdast-util-from-markdown';

function enterMdxComponent(this: CompileContext, token: Parameters<Handle>[0]): void {
this.enter({ type: 'html', value: '' } as Html, token);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a sec to realize we're just capturing the block-level JSX component syntax as an html node here when I'd initially expected mdxJsxFlowElement nodes to be emitted.

It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized MdxComponent as an opaque html MDAST node at this stage, and it actually gets processed and structured into an mdxJsxFlowElement node downstream in the mdxish transformers.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a sec to realize we're just capturing the block-level JSX component syntax as an html node here when I'd initially expected mdxJsxFlowElement nodes to be emitted.

It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized MdxComponent as an opaque html MDAST node at this stage, and it actually gets processed and structured into an mdxJsxFlowElement node downstream in the mdxish transformers.

Added here!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these const declarations be hoisted up to a root location if they're shared by this tokenizer and also a few transformers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I've hoisted it to the tokenizer level but it also makes sense to further hoisted it up. Moved it to lib/constants.ts

Comment on lines +1 to +5
/**
* Inline-only custom components. These must not use generic block-level
* `<Component>...</Component>` handling (MDXish transforms) or be treated as
* flow elements where inline semantics apply.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit confusing because it just explains what but not why.

Suggested change
/**
* Inline-only custom components. These must not use generic block-level
* `<Component>...</Component>` handling (MDXish transforms) or be treated as
* flow elements where inline semantics apply.
*/
/**
* Inline-only custom components that appear as phrasing content within paragraph
* nodes. Excluding them from the generic mdxComponent micromark construct lets
* the dedicated inline transformer handle them instead.
* @see mdxish-inline-components.ts
*/

@eaglethrost
Copy link
Copy Markdown
Contributor Author

eaglethrost commented Apr 16, 2026

BUT I did follow the repro steps for this ticket linked to this PR:
https://linear.app/readme-io/issue/CX-3155/mdxish-editor-invalid-mdx-syntax-saves-silently-but-causes-500-error
and used the snippet provided in the reproduction steps, and the code doesn't serialize and cannot be rendered:

@kevinports Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and if I successfully link the repo. Here's a demo:

Screen.Recording.2026-04-16.at.10.42.57.am.mov

@eaglethrost eaglethrost requested a review from kevinports April 16, 2026 03:08
// Find the first closing tag
const closingTagIndex = contentAfterTag.indexOf(closingTagStr);
const componentInnerContent = contentAfterTag.substring(0, closingTagIndex).trim();
const closingTagIndex = contentAfterTag.lastIndexOf(closingTagStr);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found a potential bug where if there's nested components of the same type, it will cut off and get the first closing tag, which is what we don't actually want. We want the outermost cause it's the one that matches with the first opening tag

Copy link
Copy Markdown
Contributor

@kevinports kevinports left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and if I successfully link the repo. Here's a demo:

Ok yeah checked again this morning and things are fine. I think I linked the wrong branch. My bad!

QA'd again locally and everything appears to be in order.

@kevinports kevinports merged commit a77707d into next Apr 16, 2026
7 checks passed
@kevinports kevinports deleted the dimas/rm-15995-refactor-mdx-component-parsing-to-use-a-tokenizer-instead branch April 16, 2026 13:44
rafegoldberg pushed a commit that referenced this pull request Apr 16, 2026
## Version 14.0.0
### ⚠ BREAKING CHANGES

* **mdxish:** magic blocks), the parser had no way to recognize `{/* ... */}`
as a single expression

> [!NOTE]
> Technically, introducing the flow construct of the `mdxExpression`
would partly fix this issue but we cannot do this because it would break
magic block parsing. magic blocks usually contain `{ \n \n }` JSON
payloads, and the flow construct would eat them before the magic block
tokenizer got a chance to run.

created a dedicated `jsxComment` micromark tokenizer for JSX comments.
enabled for all branches instead of being gated by `newEditorTypes`.

> [!NOTE]
In the future as a follow up, we can work in integrating this new
tokenizer and fully removing the `removeJSXComments` preprocessing step.
similar to what we do in the `stripComments`

## 🧪 QA tips

Try pasting this in the editor (both rich and raw):
```
{/* Dropbox hosted. What was it?

[block:image]
{
  "images": [
    {
      "image": [
        "https://paper-attachments.dropboxusercontent.com/s_924615C008BE616D2ABE30441190A9401D739971D9AD94E8B5B46A426C5F05E8_1683123013834_Dev+Dash+-+hero+docs+shot.png",
        null,
        "The My Developers page in your ReadMe project dashboard"
      ],
      "align": "center",
      "border": true,
      "caption": "The My Developers page in your ReadMe project dashboard"
    }
  ]
}
[/block]

*/}
```
It should be treated as one big comment block instead of a rendered
image.

## 📸 Screenshot or Loom

https://github.com/user-attachments/assets/9b619f54-3d6e-456f-83be-27d4076a7a3a

### ✨ New & Improved

* **mdxish:** add micromark tokenizer for MDX component parsing ([#1426](#1426)) ([a77707d](a77707d))
* move numbered lists for 1 -> a -> i ([#1438](#1438)) ([7e614d4](7e614d4))

### 🛠 Fixes & Updates

* **mdxish:** properly parse JSX comments in editor deserialization ([#1419](#1419)) ([95e1969](95e1969))

<!--SKIP CI-->
@rafegoldberg
Copy link
Copy Markdown
Contributor

This PR was released!

🚀 Changes included in v14.0.0

eaglethrost added a commit that referenced this pull request Apr 22, 2026
| 🎫 Resolve RM-15852 |
| :-----------------: |

## Context

- In the MDXish Editor, component attribute expressions were being
evaluated during preprocessing
Example: `<MyComp attr={1 + 1} />` became `<MyComp attr="2" />`, which
permanently lost the original expression
- This came from a fundamental MDXish workaround: because we were not
using full MDX parsing for attributes, preprocessing normalized
non-quoted JSX-style attributes into quoted HTML-safe values to avoid
breaking AST nodes
- We can't just use safeMode because we still need the inline
expressions to be parsed as expression nodes, so there still needs to be
an expression parsing

## 🎯 What does this PR do?

- With the new MDX tokenizer work in #1426 that this is based upon, we
can now preserve attribute syntax safely (including expression forms)
without that preprocessing conversion, so this PR entirely removes the
`evaluateJSXExpression` step in the preprocessing so the attributes
don't get evaluated. This means the output of the `mdxishAstProcessor`
will contain the raw expression value instead of the evaluated one
- There was a step in `evaluateJSXExpression` where I encode object /
array attributes so it doesn't get split in `rehypeRaw` step. Since
that's removed, I moved that code to the `mdxish-handlers`
- Removed jsxContext argument since we don't actually use it anywhere in
the readme app
- Since attribute expressions are no longer evaluated in preprocessing,
the `parseAttributes` code need to be beefed up to deal with
expressions. I created a mini tokenizer / traveler in
`mdxish-component-tag-parser.ts` to do such thing

The crux of the change is not much, but there's quite a lot of tests
that needs to be modified / no longer needed.

## QA Tips

See my comment, better to test this in the readme app directly & locally
link the markdown repo to it

## 📸 Screenshot or Loom

E2E test with the readme app, now in the Mdxish Editor the attribute
expressions are retained & not evaluated, but the rendering stay the
same

Before:
<img width="1070" height="636" alt="Screenshot 2026-04-09 at 11 28
18 pm"
src="https://github.com/user-attachments/assets/1ac3be2e-ccc0-44ff-9a1f-b6ac1157b3ed"
/>

After:
<img width="1095" height="687" alt="Screenshot 2026-04-09 at 5 45 56 pm"
src="https://github.com/user-attachments/assets/86982b7f-32ad-4993-b14f-dbb9c18e101e"
/>
rafegoldberg pushed a commit that referenced this pull request Apr 22, 2026
## Version 14.1.0
### ⚠ BREAKING CHANGES

* **mdxish:** AST nodes
- We can't just use safeMode because we still need the inline
expressions to be parsed as expression nodes, so there still needs to be
an expression parsing

## 🎯 What does this PR do?

### ✨ New & Improved

* **mdxish:** remove jsx attribute expression preprocessing ([#1429](#1429)) ([9545472](9545472)), closes [#1426](#1426)

### 🛠 Fixes & Updates

* add readme components tests ([#1434](#1434)) ([72ea080](72ea080)), closes [#1354](#1354)
* **deps:** bump actions/upload-artifact from 5 to 7 ([#1417](#1417)) ([3ca6d32](3ca6d32))
* **mdxish:** legacy glossary syntax in callout title crashing ([#1441](#1441)) ([9d1b18b](9d1b18b))

<!--SKIP CI-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants