feat: render Formula elements as $$ blocks with optional normalization#4308
feat: render Formula elements as $$ blocks with optional normalization#4308claytonlin1110 wants to merge 8 commits intoUnstructured-IO:mainfrom
Conversation
17e10f5 to
199318a
Compare
PastelStorm
left a comment
There was a problem hiding this comment.
Review
Severity: High
Description: elements_to_md() silently breaks backward compatibility for positional callers by inserting normalize_formula before encoding.
def elements_to_md(
elements: Iterable[Element],
filename: Optional[str] = None,
exclude_binary_image_data: bool = False,
normalize_formula: bool = True,
encoding: str = "utf-8",
) -> str:def elements_to_md(
elements: Iterable[Element],
filename: Optional[str] = None,
exclude_binary_image_data: bool = False,
encoding: str = "utf-8",
) -> str:Why it is risky: old valid calls like elements_to_md(elements, path, False, "latin-1") now bind "latin-1" to normalize_formula instead of encoding. The call succeeds, so the regression is silent, but writes with UTF-8 instead of the requested encoding.
Concrete failure scenario: a document-export worker writes markdown for a downstream Latin-1 consumer. After this PR, retries all succeed but produce incorrectly encoded output, causing corrupted artifacts or decode failures downstream.
Severity: High
Description: create_file_from_elements() has the same silent positional-argument break by inserting normalize_formula before no_group_by_page.
def create_file_from_elements(
elements: Iterable[Element],
output_format: str = "markdown",
filename: Optional[str] = None,
encoding: str = "utf-8",
exclude_binary_image_data: bool = False,
normalize_formula: bool = True,
no_group_by_page: bool = True,
) -> str:def create_file_from_elements(
elements: Iterable[Element],
output_format: str = "markdown",
filename: Optional[str] = None,
encoding: str = "utf-8",
exclude_binary_image_data: bool = False,
no_group_by_page: bool = True,
) -> str:Why it is risky: older positional callers now pass their intended no_group_by_page value into normalize_formula, and no_group_by_page silently reverts to its default. That changes output semantics with no exception.
Concrete failure scenario: an HTML export path calls create_file_from_elements(elements, "html", out, "utf-8", False, False) expecting grouped-by-page behavior. After this change, the last False disables formula normalization instead, while HTML grouping silently changes, so production exports return structurally different HTML under load and retries reproduce the same bad output.
Severity: High
Description: your √ finding is real. Mapping √ to \sqrt{} is not conservative normalization; it corrupts the math expression unless the radicand is explicitly reparsed and wrapped.
def _normalize_formula_for_markdown(text: str) -> str:
...
substitutions = {
"−": "-", # Unicode minus -> ASCII hyphen-minus
"×": r"\times",
...
"√": r"\sqrt{}",
}
normalized = text
for source, target in substitutions.items():
normalized = normalized.replace(source, target)
return normalizedWhy it is risky: √x becomes \sqrt{}x, which changes the semantics of the formula rather than just normalizing syntax. Since normalize_formula=True is the default, this corrupting behavior is on the default path.
Concrete failure scenario: a formula extracted as √2 ≤ x is exported as $$\n\sqrt{}2 \leq x\n$$, and downstream math rendering shows an empty radical plus 2, or fails to render correctly. The stored/exported markdown is wrong on every retry.
What I Would Not Keep At Critical/High
outputmarkdowndiff.txt
I agree it should be removed, but I would not rate it Critical. It looks like an accidental artifact and repo hygiene issue, not a production correctness/concurrency problem. If this review is strictly limited to critical/high, I would leave it out or mention it separately as non-blocking cleanup.
Version bump / changelog edit
I would not keep this as High without explicit repo policy saying feature PRs must not touch versioning. Merge-conflict risk is real, but that is usually workflow friction rather than a high-severity product bug. In many repos, feature branches intentionally update changelogs and versions, so this is too process-dependent to call high-confidence/high-severity from the diff alone.
Missing High-Value Tests
The highest-value missing tests are:
- A regression test for old positional
elements_to_md(..., encoding)usage. - A regression test for old positional
create_file_from_elements(..., no_group_by_page)usage. - A
√normalization test such asFormula("√2")andFormula("√(x+1)").
Verdict
Do Not Merge
The merged critical/high set is:
High: silent positional API break inelements_to_md()High: silent positional API break increate_file_from_elements()High: incorrect√normalization corrupts formula output
No Critical findings with high confidence.
PastelStorm
left a comment
There was a problem hiding this comment.
Findings
Severity: High
Description: Every Formula is now serialized as a display-math block, even when the extracted payload is clearly not valid LaTeX/math syntax.
def element_to_md(
element: Element,
exclude_binary_image_data: bool = False,
normalize_formula: bool = True,
) -> str:
match element:
case Title(text=text):
return f"# {text}"
case Formula(text=text):
formula_text = text.strip()
if normalize_formula:
formula_text = _normalize_formula_for_markdown(formula_text)
return f"$$\n{formula_text}\n$$"The corrosion rate (CR) was calculated using Eq. (1) [1–5]
$$
Corrosion rate CRð Þ ¼ 87:6W DAT ð1Þ
$$
where: W is weight loss in mg, A is specimen surface area, T is immersion period in hours and D is the specimen density. From the corrosion rate, the surface coverage (θ) and inhibition efficiencies (IE %) were determined using Eqs. (2) and (3) respectively
$$
_ CRo—CR O= OR
$$
$$
CRo=CR , 100 IE (0) = CR
$$Why it is risky: This changes the abstraction from “emit extracted text” to “assert this text is safe display math.” Markdown consumers that run KaTeX/MathJax will now parse OCR-corrupted formula text as math source, which can fail to parse, render blank/error output, or distort surrounding content. The golden updates already show real extracted Formula payloads that are noisy OCR, not reliable math markup.
Concrete failure scenario: A production ingestion job exports markdown for PDFs, and a downstream service renders all $$...$$ blocks with KaTeX. A biomedical document containing OCR-heavy formulas like _ CRo—CR O= OR now produces math parse failures, so end users see missing equations or broken sections where the old code at least showed readable plain text.
Severity: High
Description: Raw Formula.text is inserted inside $$ fences with no escaping or sanitization of embedded math delimiters.
def element_to_md(
element: Element,
exclude_binary_image_data: bool = False,
normalize_formula: bool = True,
) -> str:
match element:
case Title(text=text):
return f"# {text}"
case Formula(text=text):
formula_text = text.strip()
if normalize_formula:
formula_text = _normalize_formula_for_markdown(formula_text)
return f"$$\n{formula_text}\n$$"Why it is risky: If extracted formula text itself contains $$, or delimiter-like sequences introduced by OCR, the serializer can terminate the display-math block early and leak the remainder back into normal markdown parsing. That is a structural corruption bug, not just a cosmetic rendering issue.
Concrete failure scenario: One tenant uploads a PDF whose extracted Formula.text contains a $$ b or a similar OCR artifact. The generated markdown closes the math block at the inner delimiter, and the rest of the document is reinterpreted as normal markdown. In a batch/high-concurrency pipeline this yields hard-to-debug, document-specific corruption that only appears for certain source files and downstream renderers.
Missing High-Value Tests
- A
Formulacontaining embedded$$, lone$, or multiline delimiter-like OCR noise, to prove markdown structure cannot be broken by the new serializer. - An end-to-end markdown-rendering test using one of the newly updated noisy fixtures, to verify the new
$$output is actually consumable by a real math renderer or to document that invalid math should remain plain text instead. - A policy test for malformed/non-LaTeX
Formulacontent: either “wrap anyway,” “fallback to plain text,” or “reject/sanitize.” Right now the branch changes behavior without locking down that contract.
Verdict
Not Merge
I did not find any critical/high-confidence concurrency or shared-state bugs in the diff itself; the blocker is correctness and downstream production risk from unconditionally treating extracted Formula text as safe display math.
ec3c4b9 to
1c7d90f
Compare
Summary
Formulaelements as Markdown math blocks ($$ ... $$) during markdown staging export.normalize_formulaflag (defaultTrue) to control conservative Unicode-math to LaTeX-like normalization.normalize_formulaviacreate_file_from_elements()0.22.11.Closes #3868