Skip to content

feat: render Formula elements as $$ blocks with optional normalization#4308

Open
claytonlin1110 wants to merge 8 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/formula-markdown-latex
Open

feat: render Formula elements as $$ blocks with optional normalization#4308
claytonlin1110 wants to merge 8 commits intoUnstructured-IO:mainfrom
claytonlin1110:feat/formula-markdown-latex

Conversation

@claytonlin1110
Copy link
Copy Markdown
Contributor

Summary

  • Render Formula elements as Markdown math blocks ($$ ... $$) during markdown staging export.
  • Add optional normalize_formula flag (default True) to control conservative Unicode-math to LaTeX-like normalization.
  • Add tests for:
    • formula block rendering
    • normalization on/off behavior
    • passthrough of normalize_formula via create_file_from_elements()
  • Update changelog and bump package version to 0.22.11.

Closes #3868

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from 17e10f5 to 199318a Compare March 31, 2026 19:58
Copy link
Copy Markdown
Contributor

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Severity: High

Description: elements_to_md() silently breaks backward compatibility for positional callers by inserting normalize_formula before encoding.

def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    encoding: str = "utf-8",
) -> str:
def elements_to_md(
    elements: Iterable[Element],
    filename: Optional[str] = None,
    exclude_binary_image_data: bool = False,
    encoding: str = "utf-8",
) -> str:

Why it is risky: old valid calls like elements_to_md(elements, path, False, "latin-1") now bind "latin-1" to normalize_formula instead of encoding. The call succeeds, so the regression is silent, but writes with UTF-8 instead of the requested encoding.

Concrete failure scenario: a document-export worker writes markdown for a downstream Latin-1 consumer. After this PR, retries all succeed but produce incorrectly encoded output, causing corrupted artifacts or decode failures downstream.

Severity: High

Description: create_file_from_elements() has the same silent positional-argument break by inserting normalize_formula before no_group_by_page.

def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
    no_group_by_page: bool = True,
) -> str:
def create_file_from_elements(
    elements: Iterable[Element],
    output_format: str = "markdown",
    filename: Optional[str] = None,
    encoding: str = "utf-8",
    exclude_binary_image_data: bool = False,
    no_group_by_page: bool = True,
) -> str:

Why it is risky: older positional callers now pass their intended no_group_by_page value into normalize_formula, and no_group_by_page silently reverts to its default. That changes output semantics with no exception.

Concrete failure scenario: an HTML export path calls create_file_from_elements(elements, "html", out, "utf-8", False, False) expecting grouped-by-page behavior. After this change, the last False disables formula normalization instead, while HTML grouping silently changes, so production exports return structurally different HTML under load and retries reproduce the same bad output.

Severity: High

Description: your finding is real. Mapping to \sqrt{} is not conservative normalization; it corrupts the math expression unless the radicand is explicitly reparsed and wrapped.

def _normalize_formula_for_markdown(text: str) -> str:
    ...
    substitutions = {
        "−": "-",  # Unicode minus -> ASCII hyphen-minus
        "×": r"\times",
        ...
        "√": r"\sqrt{}",
    }
    normalized = text
    for source, target in substitutions.items():
        normalized = normalized.replace(source, target)
    return normalized

Why it is risky: √x becomes \sqrt{}x, which changes the semantics of the formula rather than just normalizing syntax. Since normalize_formula=True is the default, this corrupting behavior is on the default path.

Concrete failure scenario: a formula extracted as √2 ≤ x is exported as $$\n\sqrt{}2 \leq x\n$$, and downstream math rendering shows an empty radical plus 2, or fails to render correctly. The stored/exported markdown is wrong on every retry.

What I Would Not Keep At Critical/High

outputmarkdowndiff.txt

I agree it should be removed, but I would not rate it Critical. It looks like an accidental artifact and repo hygiene issue, not a production correctness/concurrency problem. If this review is strictly limited to critical/high, I would leave it out or mention it separately as non-blocking cleanup.

Version bump / changelog edit

I would not keep this as High without explicit repo policy saying feature PRs must not touch versioning. Merge-conflict risk is real, but that is usually workflow friction rather than a high-severity product bug. In many repos, feature branches intentionally update changelogs and versions, so this is too process-dependent to call high-confidence/high-severity from the diff alone.

Missing High-Value Tests

The highest-value missing tests are:

  • A regression test for old positional elements_to_md(..., encoding) usage.
  • A regression test for old positional create_file_from_elements(..., no_group_by_page) usage.
  • A normalization test such as Formula("√2") and Formula("√(x+1)").

Verdict

Do Not Merge

The merged critical/high set is:

  • High: silent positional API break in elements_to_md()
  • High: silent positional API break in create_file_from_elements()
  • High: incorrect normalization corrupts formula output

No Critical findings with high confidence.

Copy link
Copy Markdown
Contributor

@PastelStorm PastelStorm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

Severity: High
Description: Every Formula is now serialized as a display-math block, even when the extracted payload is clearly not valid LaTeX/math syntax.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"
The corrosion rate (CR) was calculated using Eq. (1) [15]
$$
Corrosion rate CRð Þ ¼ 87:6W DAT ð1Þ
$$
where: W is weight loss in mg, A is specimen surface area, T is immersion period in hours and D is the specimen density. From the corrosion rate, the surface coverage (θ) and inhibition efficiencies (IE %) were determined using Eqs. (2) and (3) respectively
$$
_ CRo—CR O= OR
$$
$$
CRo=CR , 100 IE (0) = CR
$$

Why it is risky: This changes the abstraction from “emit extracted text” to “assert this text is safe display math.” Markdown consumers that run KaTeX/MathJax will now parse OCR-corrupted formula text as math source, which can fail to parse, render blank/error output, or distort surrounding content. The golden updates already show real extracted Formula payloads that are noisy OCR, not reliable math markup.

Concrete failure scenario: A production ingestion job exports markdown for PDFs, and a downstream service renders all $$...$$ blocks with KaTeX. A biomedical document containing OCR-heavy formulas like _ CRo—CR O= OR now produces math parse failures, so end users see missing equations or broken sections where the old code at least showed readable plain text.


Severity: High
Description: Raw Formula.text is inserted inside $$ fences with no escaping or sanitization of embedded math delimiters.

def element_to_md(
    element: Element,
    exclude_binary_image_data: bool = False,
    normalize_formula: bool = True,
) -> str:
    match element:
        case Title(text=text):
            return f"# {text}"
        case Formula(text=text):
            formula_text = text.strip()
            if normalize_formula:
                formula_text = _normalize_formula_for_markdown(formula_text)
            return f"$$\n{formula_text}\n$$"

Why it is risky: If extracted formula text itself contains $$, or delimiter-like sequences introduced by OCR, the serializer can terminate the display-math block early and leak the remainder back into normal markdown parsing. That is a structural corruption bug, not just a cosmetic rendering issue.

Concrete failure scenario: One tenant uploads a PDF whose extracted Formula.text contains a $$ b or a similar OCR artifact. The generated markdown closes the math block at the inner delimiter, and the rest of the document is reinterpreted as normal markdown. In a batch/high-concurrency pipeline this yields hard-to-debug, document-specific corruption that only appears for certain source files and downstream renderers.

Missing High-Value Tests

  • A Formula containing embedded $$, lone $, or multiline delimiter-like OCR noise, to prove markdown structure cannot be broken by the new serializer.
  • An end-to-end markdown-rendering test using one of the newly updated noisy fixtures, to verify the new $$ output is actually consumable by a real math renderer or to document that invalid math should remain plain text instead.
  • A policy test for malformed/non-LaTeX Formula content: either “wrap anyway,” “fallback to plain text,” or “reject/sanitize.” Right now the branch changes behavior without locking down that contract.

Verdict

Not Merge

I did not find any critical/high-confidence concurrency or shared-state bugs in the diff itself; the blocker is correctness and downstream production risk from unconditionally treating extracted Formula text as safe display math.

@claytonlin1110 claytonlin1110 force-pushed the feat/formula-markdown-latex branch from ec3c4b9 to 1c7d90f Compare April 2, 2026 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat/formula-detection

2 participants