quoteAttribute corrupts attribute values containing '\\' / '\t' / '\n' — round-trip parse(el.toString()) lossy #306

@zhangjiashuo-cs

Description

HTMLElement.quoteAttribute() silently drops or transforms backslash characters in attribute values during serialization, so setAttribute → toString → parse is not a round-trip.

Reproducer (node-html-parser 7.1.0, latest on npm)

const { parse } = require("node-html-parser");

const doc = parse("<div></div>");
const div = doc.firstChild;
div.setAttribute("path", "C:\\Users\\test");   // string is exactly:  C:\Users\test

console.log("set value : C:\\Users\\test");
console.log("serialized: " + div.toString());
//  serialized:  <div path="C:Users\test"></div>          <-- one backslash gone, '\U' lost

const doc2 = parse(div.toString());
console.log("round-trip:", JSON.stringify(doc2.firstChild.getAttribute("path")));
//  round-trip: "C:Users\test"   <-- different from what was set

Other inputs (all silently corrupted):

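| setAttribute value | toString() output (attribute value) | round-trip via parse() |
|--------------------|-------------------------------------|------------------------|
| a\\b               | ab (single \\)                      | "ab"                   |
| C:\\Users\\test    | C:Users\\test                       | "C:Users\\test" (still wrong) |
| \\                 | (attribute empty)                   | undefined              |
| path\\to\\file     | pathto\\file                        | "pathto\\file"         |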

Root cause

src/nodes/html.ts (the source for dist/nodes/html.js), quoteAttribute:

return JSON.stringify(attr.replace(/"/g, "&quot;"))
    .replace(/\\t/g, "\t")
    .replace(/\\n/g, "\n")
    .replace(/\\r/g, "\r")
    .replace(/\\/g, "");      // <-- removes EVERY remaining backslash

The intent is to undo the escaping that JSON.stringify adds (it doubles each backslash). But the four .replace calls run unconditionally on the escaped string, so they also:

  1. Turn every backslash followed by t, n, or r in the literal input into the corresponding control character.
  2. Delete every other backslash in the literal input.

The net effect: any attribute value containing a backslash is corrupted, and (worse) a value containing \t etc. gets a real tab character spliced into the HTML, which parse() then drops as whitespace.
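To make the two effects concrete, here is a minimal standalone sketch that copies the chain quoted above into a function and records the string after each stage. It does not import node-html-parser; `traceQuote` is a hypothetical name, and the sketch assumes the quoted chain is the whole transformation.

```javascript
// Standalone copy of the quoted replace chain, split into stages so each
// intermediate string can be inspected. `traceQuote` is a hypothetical helper,
// not part of node-html-parser.
function traceQuote(attr) {
  const stages = [];
  let s = JSON.stringify(attr.replace(/"/g, "&quot;"));
  stages.push(["after JSON.stringify", s]);
  s = s.replace(/\\t/g, "\t"); // fires on any backslash directly followed by "t"
  stages.push(["after \\t replace", s]);
  s = s.replace(/\\n/g, "\n");
  stages.push(["after \\n replace", s]);
  s = s.replace(/\\r/g, "\r");
  stages.push(["after \\r replace", s]);
  s = s.replace(/\\/g, ""); // strips every backslash that is left
  stages.push(["after \\ removal", s]);
  return stages;
}

// Final string produced by the chain for a given input.
function finalStage(attr) {
  const stages = traceQuote(attr);
  return stages[stages.length - 1][1];
}

// The value is exactly: C:\Users\test
for (const [label, value] of traceQuote("C:\\Users\\test")) {
  // JSON.stringify here is only for display, so tabs show up as \t.
  console.log(label.padEnd(22), JSON.stringify(value));
}
```

In this trace the doubled backslash before "test" is what triggers the tab: the second backslash of the escaped pair plus the following "t" matches /\\t/, and the remaining backslashes are then removed by the last stage.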

Why this matters

Backslashes show up routinely in attribute values: Windows paths (C:\Users\…), regex patterns (data-pattern="\\d+"), URI templates with escaped sequences, and JSON snippets stored in data-* attributes. Any code that does el.setAttribute("data-x", JSON.stringify(obj)), or stores JSON or a path in any other attribute, and later reads it back via parse(el.toString()).getAttribute("data-x") silently gets back a different value whenever the string contains a backslash.

A server returning HTML with such attributes makes the client lose data on every parse-serialize-parse cycle.
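The JSON case can be sketched without the library. The snippet below is an illustration under assumptions: `buggyQuoteAttribute` is a standalone copy of the chain quoted under "Root cause", and `readAttribute` is a hypothetical stand-in for parse() plus getAttribute() that only strips the surrounding quotes and decodes &quot; and &amp;.

```javascript
// Standalone copy of the quoted chain (not imported from node-html-parser).
function buggyQuoteAttribute(attr) {
  return JSON.stringify(attr.replace(/"/g, "&quot;"))
    .replace(/\\t/g, "\t")
    .replace(/\\n/g, "\n")
    .replace(/\\r/g, "\r")
    .replace(/\\/g, "");
}

// Hypothetical stand-in for parse() + getAttribute(): drop the outer quotes
// and decode the two entities involved here. &amp; is decoded last.
function readAttribute(quoted) {
  return quoted.slice(1, -1).replace(/&quot;/g, '"').replace(/&amp;/g, "&");
}

// JSON whose string value contains quotes, so the payload contains backslashes.
const payload = JSON.stringify({ msg: 'say "hi"' });
const readBack = readAttribute(buggyQuoteAttribute(payload));

let outcome;
try {
  outcome = JSON.parse(readBack);
} catch (err) {
  outcome = err; // the escaping backslashes are gone, so the JSON is malformed
}

console.log("stored   :", payload);
console.log("read back:", readBack);
console.log("parse    :", outcome instanceof Error ? outcome.name : outcome);
```

Under these assumptions the value read back is no longer valid JSON, so the failure surfaces later as a parse error rather than at serialization time.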

Suggested fix

Don't double-substitute. The minimal correct quoting just escapes & and the quote character:

quoteAttribute(attr: string | null): string {
    if (attr == null) return "null";
    return `"${attr.replace(/&/g, "&amp;").replace(/"/g, "&quot;")}"`;
}

(or use he.encode for full attribute-context entity encoding). No JSON.stringify, no backslash dance.
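As a quick sanity check of the suggested quoting, here is a standalone JavaScript sketch run against the inputs from the table above. `readAttribute` is a hypothetical stand-in for parse() plus getAttribute() that assumes the parser strips the quotes and decodes &quot; and &amp;; it is not the library's decoder.

```javascript
// Suggested quoting from this issue, as a plain function:
// escape & first, then the double quote, and wrap in quotes.
function quoteAttribute(attr) {
  if (attr == null) return "null";
  return `"${attr.replace(/&/g, "&amp;").replace(/"/g, "&quot;")}"`;
}

// Hypothetical stand-in for parse() + getAttribute().
// &amp; must be decoded last so that an escaped "&quot;" text survives.
function readAttribute(quoted) {
  return quoted.slice(1, -1).replace(/&quot;/g, '"').replace(/&amp;/g, "&");
}

const samples = [
  "a\\b",
  "C:\\Users\\test",
  "\\",
  "path\\to\\file",
  'say "hi" & bye',
  "literal &quot; text",
];

for (const value of samples) {
  const back = readAttribute(quoteAttribute(value));
  console.log(JSON.stringify(value), back === value ? "round-trips" : "differs");
}
```

Backslashes pass through untouched because nothing in this quoting treats them as special; only the two characters that are significant inside a double-quoted attribute are rewritten.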

Environment

  • node-html-parser: 7.1.0 (latest)
  • node: 20+

Discovered via fast-check on the property parse(el.toString()).getAttribute(name) === el.getAttribute(name). Happy to PR.
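For readers without fast-check installed, the same round-trip property can be approximated with a dependency-free sketch that enumerates every short string over a small alphabet instead of generating random ones. This is not the original fast-check test; `buggyQuote` copies the chain quoted under "Root cause", `suggestedQuote` copies the suggested fix, and `readAttribute`, `shortStrings`, and `findCounterexample` are hypothetical helpers.

```javascript
// Standalone copy of the chain quoted under "Root cause".
function buggyQuote(attr) {
  return JSON.stringify(attr.replace(/"/g, "&quot;"))
    .replace(/\\t/g, "\t")
    .replace(/\\n/g, "\n")
    .replace(/\\r/g, "\r")
    .replace(/\\/g, "");
}

// Standalone copy of the suggested fix.
function suggestedQuote(attr) {
  return `"${attr.replace(/&/g, "&amp;").replace(/"/g, "&quot;")}"`;
}

// Hypothetical stand-in for parse() + getAttribute().
function readAttribute(quoted) {
  return quoted.slice(1, -1).replace(/&quot;/g, '"').replace(/&amp;/g, "&");
}

// Yields "", then all strings of length 1, then length 2, ... up to maxLen.
function* shortStrings(alphabet, maxLen) {
  let level = [""];
  yield "";
  for (let len = 1; len <= maxLen; len++) {
    const next = [];
    for (const prefix of level) {
      for (const ch of alphabet) {
        next.push(prefix + ch);
        yield prefix + ch;
      }
    }
    level = next;
  }
}

// Returns the first value that does not survive quote -> read, or null.
function findCounterexample(quote, alphabet, maxLen) {
  for (const value of shortStrings(alphabet, maxLen)) {
    if (readAttribute(quote(value)) !== value) return value;
  }
  return null;
}

const ALPHABET = ["a", "t", "\\", '"', "&", ";"];
const buggyFailure = findCounterexample(buggyQuote, ALPHABET, 4);
const suggestedFailure = findCounterexample(suggestedQuote, ALPHABET, 4);

console.log("buggy chain    :", JSON.stringify(buggyFailure));
console.log("suggested quote:", JSON.stringify(suggestedFailure));
```

Exhaustive enumeration keeps the check deterministic, at the cost of covering only short strings over six characters; a property-based run over arbitrary strings would exercise far more inputs.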
