refactor(mdxish): add micromark tokenizer for MDX component parsing by eaglethrost · Pull Request #1426 · readmeio/markdown

eaglethrost · 2026-04-07T13:30:15Z

🎫 Resolve RM-15995, CX-3155

🎯 What does this PR do?

Adds a micromark flow tokenizer (lib/micromark/mdx-component/syntax.ts) that captures <Component>...</Component> blocks as single tokens during parsing, preventing CommonMark from fragmenting them across multiple HTML/paragraph nodes. It supports snake case components as well
Simplifies mdxish-component-blocks.ts by removing the multi-sibling stitching logic (Case 3: scanForClosingTag, stripClosingTagFromParagraph, MAX_LOOKAHEAD) — ~200 lines removed
The tokenizer handles: PascalCase tag names, depth tracking for same-name nesting, quoted attributes (", '), JSX expression attributes ({...}), template literals (` with ${...} interpolation), code spans, and fenced code blocks inside component bodies

Why

The existing pipeline relied on string-level preprocessing and post-parse stitching to handle MDX components that remark fragmented across multiple AST nodes. This was fragile and sensitive to whitespace, blank lines, and edge cases. A tokenizer captures component boundaries at parse time, which is more robust and reduces the need for preprocessing workarounds. It also allows capture of more complex forms of attributes that otherwise might have broken up the tree / difficult to parse.

Key design decisions

Outputs html nodes (not mdxJsxFlowElement) — inner content still needs markdown transformation, which happens in the simplified mdxish-component-blocks transformer (Case 1 + Case 2 only)
Excluded tags — Table, HTMLBlock, Glossary, Anchor are skipped (they have dedicated tokenizers/transformers)
Template literal awareness — uses a templateStack to track ${...} interpolation nesting, critical for safeMode where preprocessJSXExpressions is skipped and the tokenizer sees raw expressions
safeDeindent in the transformer strips just enough indentation (down to 3 spaces) to prevent deeply nested component content from being treated as code blocks, while preserving whitespace that becomes HAST text nodes

I've looked at several unified / remark libraries that already parses / tokenizes MDX components, but they resulted in more strict syntaxes existing test no longer pass, which goes back to the original MDX problem that MDXish tries to solve. Hence I think we would need to manually create the tokenizer.

Next steps

I think we probably can still remove / simplify more code that's involved in component parsing. We can also look to remove the <Table> tokenizer and reuse this new one. Currently I'm leaving it as is to avoid increasing the blast radius of this refactor
We can't reuse this tokenizer for HTMLBlock because it has the { } wrapping part inside the tags, but we should create that next since it can simplify the HTML transformers as well

🧪 QA tips

Verify components with {...} expression attributes render correctly in both normal and safeMode
Test deeply nested same-name components (e.g., <Tabs> inside <Tabs>)
Verify components with fenced code blocks in their body don't break (closing tag inside code shouldn't terminate the component)
In the markdown playground, chose test fixtures that has components on it

📸 Screenshot or Loom

There should be no visual changes.

…g-to-use-a-tokenizer-instead

eaglethrost · 2026-04-08T12:11:02Z

    });
  });
+
+  describe('given there is a line content right above the Tabs component', () => {


As of now, this would throw an error

eaglethrost · 2026-04-08T12:12:10Z

This test file tests the whole mdxish pipeline and not just the component transformers, so it focuses on testing of various forms of components are still parsable

eaglethrost · 2026-04-08T12:15:20Z

The changes in this file are mainly:

Adding more coverage for the helper functions such as parseTag & parseAttributes

Accounting for the simpler code in mdxish-component-blocks.ts

Improving robustness of some tests by matching resulting ast with the expected object so it's simpler to write & more strict checking

maximilianfalco · 2026-04-13T07:20:15Z

@eaglethrost what do you mean by

We can also look to remove the tokenizer and reuse this new one

maximilianfalco

looks good overall! just some minor comments

maximilianfalco · 2026-04-13T08:34:08Z

    attributes: parseAttributes(attrString),
    selfClosing: !!selfClosing,
    contentAfterTag,
+    attrString // Just for debugging purposes


Suggested change

attrString // Just for debugging purposes

this can be deleted right?

Not really because I want to use it for testing, even though it's not used in the rendering code. Do you have a suggestion to make it cleaner?

eaglethrost · 2026-04-14T00:47:47Z

@eaglethrost what do you mean by

We can also look to remove the tokenizer and reuse this new one

Sorry there was actually a <Table> in between "the" and "tokenizer", but GitHub parsed it as HTML hence it didn't show 🤦 Fixed this

…g-to-use-a-tokenizer-instead

eaglethrost · 2026-04-14T13:26:45Z

  const processor = remark()
-    .data('micromarkExtensions', [magicBlock()])
-    .data('fromMarkdownExtensions', [magicBlockFromMarkdown()])
+    .data('micromarkExtensions', [magicBlock(), mdxComponent()])


Add the tokenizer here as well since mdxishComponentBlocks is simplified, so they both go in tandem. This also allows more variety of components tags to be captured.

…g-to-use-a-tokenizer-instead

eaglethrost · 2026-04-15T08:48:16Z

Follow up: I think we can actually remove the 1) snake case preprocessing 2) mdxishSelfClosingBlocks transformers now. Will do so in a follow up to reduce PR size and decrease blast radius

kevinports

Had a few minor comments but otherwise code looks good.
I spent quite a bit of time QAing this locally on a clone of the developers project and the page rendering / editor loading seems fine.

BUT I did follow the repro steps for this ticket linked to this PR:
https://linear.app/readme-io/issue/CX-3155/mdxish-editor-invalid-mdx-syntax-saves-silently-but-causes-500-error
and used the snippet provided in the reproduction steps, and the code doesn't serialize and cannot be rendered:

CleanShot.2026-04-15.at.15.53.37.mp4

kevinports · 2026-04-15T10:39:11Z

+import type { CompileContext, Extension as FromMarkdownExtension, Handle } from 'mdast-util-from-markdown';
+
+function enterMdxComponent(this: CompileContext, token: Parameters<Handle>[0]): void {
+  this.enter({ type: 'html', value: '' } as Html, token);


Took me a sec to realize we're just capturing the block-level JSX component syntax as an html node here when I'd initially expected mdxJsxFlowElement nodes to be emitted.

It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized MdxComponent as an opaque html MDAST node at this stage, and it actually gets processed and structured into an mdxJsxFlowElement node downstream in the mdxish transformers.

Took me a sec to realize we're just capturing the block-level JSX component syntax as an html node here when I'd initially expected mdxJsxFlowElement nodes to be emitted.

It's probably worth a JSDoc comment somewhere in this file or in the tokenizer (or both?) noting that we're only emitting the tokenized MdxComponent as an opaque html MDAST node at this stage, and it actually gets processed and structured into an mdxJsxFlowElement node downstream in the mdxish transformers.

Added here!

kevinports · 2026-04-15T11:05:10Z

Should these const declarations be hoisted up to a root location if they're shared by this tokenizer and also a few transformers?

Yeah good point, I've hoisted it to the tokenizer level but it also makes sense to further hoisted it up. Moved it to lib/constants.ts

kevinports · 2026-04-15T11:06:23Z

+/**
+ * Inline-only custom components. These must not use generic block-level
+ * `<Component>...</Component>` handling (MDXish transforms) or be treated as
+ * flow elements where inline semantics apply.
+ */


This comment is a bit confusing because it just explains what but not why.

Suggested change

/**

* Inline-only custom components. These must not use generic block-level

* `<Component>...</Component>` handling (MDXish transforms) or be treated as

* flow elements where inline semantics apply.

*/

/**

* Inline-only custom components that appear as phrasing content within paragraph

* nodes. Excluding them from the generic mdxComponent micromark construct lets

* the dedicated inline transformer handle them instead.

* @see mdxish-inline-components.ts

*/

eaglethrost · 2026-04-16T00:49:44Z

BUT I did follow the repro steps for this ticket linked to this PR:
https://linear.app/readme-io/issue/CX-3155/mdxish-editor-invalid-mdx-syntax-saves-silently-but-causes-500-error
and used the snippet provided in the reproduction steps, and the code doesn't serialize and cannot be rendered:

@kevinports Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and if I successfully link the repo. Here's a demo:

Screen.Recording.2026-04-16.at.10.42.57.am.mov

eaglethrost · 2026-04-16T13:08:24Z

      // Find the first closing tag
-      const closingTagIndex = contentAfterTag.indexOf(closingTagStr);
-      const componentInnerContent = contentAfterTag.substring(0, closingTagIndex).trim();
+      const closingTagIndex = contentAfterTag.lastIndexOf(closingTagStr);


Found a potential bug where if there's nested components of the same type, it will cut off and get the first closing tag, which is what we don't actually want. We want the outermost cause it's the one that matches with the first opening tag

kevinports

Oh I think your local markdown repo is not properly linked to the readme app, because I can replicate this exactly in production, but it already works in the markdown playground and if I successfully link the repo. Here's a demo:

Ok yeah checked again this morning and things are fine. I think I linked the wrong branch. My bad!

QA'd again locally and everything appears to be in order.

## Version 14.0.0 ### ⚠ BREAKING CHANGES * **mdxish:** magic blocks), the parser had no way to recognize `{/* ... */}` as a single expression > [!NOTE] > Technically, introducing the flow construct of the `mdxExpression` would partly fix this issue but we cannot do this because it would break magic block parsing. magic blocks usually contain `{ \n \n }` JSON payloads, and the flow construct would eat them before the magic block tokenizer got a chance to run. created a dedicated `jsxComment` micromark tokenizer for JSX comments. enabled for all branches instead of being gated by `newEditorTypes`. > [!NOTE] In the future as a follow up, we can work in integrating this new tokenizer and fully removing the `removeJSXComments` preprocessing step. similar to what we do in the `stripComments` ## 🧪 QA tips Try pasting this in the editor (both rich and raw): ``` {/* Dropbox hosted. What was it? [block:image] { "images": [ { "image": [ "https://paper-attachments.dropboxusercontent.com/s_924615C008BE616D2ABE30441190A9401D739971D9AD94E8B5B46A426C5F05E8_1683123013834_Dev+Dash+-+hero+docs+shot.png", null, "The My Developers page in your ReadMe project dashboard" ], "align": "center", "border": true, "caption": "The My Developers page in your ReadMe project dashboard" } ] } [/block] */} ``` It should be treated as one big comment block instead of a rendered image. ## 📸 Screenshot or Loom https://github.com/user-attachments/assets/9b619f54-3d6e-456f-83be-27d4076a7a3a ### ✨ New & Improved * **mdxish:** add micromark tokenizer for MDX component parsing ([#1426](#1426)) ([a77707d](a77707d)) * move numbered lists for 1 -> a -> i ([#1438](#1438)) ([7e614d4](7e614d4)) ### 🛠 Fixes & Updates * **mdxish:** properly parse JSX comments in editor deserialization ([#1419](#1419)) ([95e1969](95e1969))

rafegoldberg · 2026-04-16T13:50:33Z

This PR was released!

🚀 Changes included in v14.0.0

| 🎫 Resolve RM-15852 | | :-----------------: | ## Context - In the MDXish Editor, component attribute expressions were being evaluated during preprocessing Example: `<MyComp attr={1 + 1} />` became `<MyComp attr="2" />`, which permanently lost the original expression - This came from a fundamental MDXish workaround: because we were not using full MDX parsing for attributes, preprocessing normalized non-quoted JSX-style attributes into quoted HTML-safe values to avoid breaking AST nodes - We can't just use safeMode because we still need the inline expressions to be parsed as expression nodes, so there still needs to be an expression parsing ## 🎯 What does this PR do? - With the new MDX tokenizer work in #1426 that this is based upon, we can now preserve attribute syntax safely (including expression forms) without that preprocessing conversion, so this PR entirely removes the `evaluateJSXExpression` step in the preprocessing so the attributes don't get evaluated. This means the output of the `mdxishAstProcessor` will contain the raw expression value instead of the evaluated one - There was a step in `evaluateJSXExpression` where I encode object / array attributes so it doesn't get split in `rehypeRaw` step. Since that's removed, I moved that code to the `mdxish-handlers` - Removed jsxContext argument since we don't actually use it anywhere in the readme app - Since attribute expressions are no longer evaluated in preprocessing, the `parseAttributes` code need to be beefed up to deal with expressions. I created a mini tokenizer / traveler in `mdxish-component-tag-parser.ts` to do such thing The crux of the change is not much, but there's quite a lot of tests that needs to be modified / no longer needed. ## QA Tips See my comment, better to test this in the readme app directly & locally link the markdown repo to it ## 📸 Screenshot or Loom E2E test with the readme app, now in the Mdxish Editor the attribute expressions are retained & not evaluated, but the rendering stay the same Before: <img width="1070" height="636" alt="Screenshot 2026-04-09 at 11 28 18 pm" src="https://github.com/user-attachments/assets/1ac3be2e-ccc0-44ff-9a1f-b6ac1157b3ed" /> After: <img width="1095" height="687" alt="Screenshot 2026-04-09 at 5 45 56 pm" src="https://github.com/user-attachments/assets/86982b7f-32ad-4993-b14f-dbb9c18e101e" />

## Version 14.1.0 ### ⚠ BREAKING CHANGES * **mdxish:** AST nodes - We can't just use safeMode because we still need the inline expressions to be parsed as expression nodes, so there still needs to be an expression parsing ## 🎯 What does this PR do? ### ✨ New & Improved * **mdxish:** remove jsx attribute expression preprocessing ([#1429](#1429)) ([9545472](9545472)), closes [#1426](#1426) ### 🛠 Fixes & Updates * add readme components tests ([#1434](#1434)) ([72ea080](72ea080)), closes [#1354](#1354) * **deps:** bump actions/upload-artifact from 5 to 7 ([#1417](#1417)) ([3ca6d32](3ca6d32)) * **mdxish:** legacy glossary syntax in callout title crashing ([#1441](#1441)) ([9d1b18b](9d1b18b))

eaglethrost added 4 commits April 6, 2026 18:49

test: add tabs test

c6a9fb2

feat: first pass at mdx component tokenizer

38ef867

feat: support multiline components & template literal tracking

2dfb082

chore: bump bundle

6e8d510

eaglethrost changed the title ~~feat: add micromark tokenizer for MDX component parsing~~ refactor(mdxish): add micromark tokenizer for MDX component parsing Apr 7, 2026

eaglethrost requested review from Jadenzzz, maximilianfalco and rafegoldberg and removed request for rafegoldberg April 7, 2026 13:32

eaglethrost marked this pull request as draft April 7, 2026 13:33

eaglethrost added 7 commits April 8, 2026 13:36

Merge branch 'next' into dimas/rm-15995-refactor-mdx-component-parsin…

c79bab9

…g-to-use-a-tokenizer-instead

fix: extend tag attribute regex to account double quotes

ddaa125

test: unquoted attributes in components

d34b283

fix: remove lib

24fb09d

refactor: make components test general, remove attribute specific tests

440565b

test: parse tag & attributes

b5495c4

tests: update mdxish component blocks

b107b00

eaglethrost marked this pull request as ready for review April 8, 2026 08:25

eaglethrost commented Apr 8, 2026

View reviewed changes

eaglethrost added 2 commits April 8, 2026 22:18

style: format test file

99d928a

test: component not parsed in code

4ac7e40

eaglethrost requested review from kevinports and rafegoldberg April 8, 2026 12:38

maximilianfalco reviewed Apr 13, 2026

View reviewed changes

eaglethrost added 2 commits April 14, 2026 10:59

style: address falco's comments

4209f3e

refactor: unify excluded component tags

0df8246

eaglethrost added 2 commits April 14, 2026 15:11

Merge branch 'next' into dimas/rm-15995-refactor-mdx-component-parsin…

333a6b3

…g-to-use-a-tokenizer-instead

fix: type erros in component blocks test

75840f5

eaglethrost mentioned this pull request Apr 14, 2026

refactor(mdxish): remove jsx attribute expression preprocessing #1429

Merged

fix: add tokenizer to mdxish tags to capture components

985a8ec

eaglethrost commented Apr 14, 2026

View reviewed changes

Merge branch 'next' into dimas/rm-15995-refactor-mdx-component-parsin…

65ca45d

…g-to-use-a-tokenizer-instead

kevinports reviewed Apr 15, 2026

View reviewed changes

refactor: address comments, move tag declarations to root

b7c2d1e

eaglethrost requested a review from kevinports April 16, 2026 03:08

eaglethrost added 2 commits April 16, 2026 14:05

fix: import linting

2d9ae7a

fix: use last index to get outermost closing tag

b96a464

eaglethrost commented Apr 16, 2026

View reviewed changes

kevinports approved these changes Apr 16, 2026

View reviewed changes

kevinports merged commit a77707d into next Apr 16, 2026
7 checks passed

kevinports deleted the dimas/rm-15995-refactor-mdx-component-parsing-to-use-a-tokenizer-instead branch April 16, 2026 13:44

rafegoldberg added the released label Apr 16, 2026

-/**
- * Inline-only custom components. These must not use generic block-level
- * `<Component>...</Component>` handling (MDXish transforms) or be treated as
- * flow elements where inline semantics apply.
- */
+/**
+ * Inline-only custom components that appear as phrasing content within paragraph
+ * nodes. Excluding them from the generic mdxComponent micromark construct lets
+ * the dedicated inline transformer handle them instead.
+ * @see mdxish-inline-components.ts
+ */

Conversation

eaglethrost commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🎯 What does this PR do?

Why

Key design decisions

Next steps

🧪 QA tips

📸 Screenshot or Loom

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaglethrost Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maximilianfalco commented Apr 13, 2026

Uh oh!

maximilianfalco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaglethrost commented Apr 14, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaglethrost commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kevinports left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eaglethrost commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kevinports left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rafegoldberg commented Apr 16, 2026

This PR was released!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eaglethrost commented Apr 7, 2026 •

edited

Loading

eaglethrost Apr 8, 2026 •

edited

Loading

eaglethrost commented Apr 15, 2026 •

edited

Loading

eaglethrost commented Apr 16, 2026 •

edited

Loading