Description
HTMLElement.quoteAttribute() silently drops or transforms backslash characters in attribute values during serialization, so setAttribute → toString → parse is not a round-trip.
Reproducer (node-html-parser 7.1.0, latest on npm)
const { parse } = require("node-html-parser");
const doc = parse("<div></div>");
const div = doc.firstChild;
div.setAttribute("path", "C:\\Users\\test"); // string is exactly: C:\Users\test
console.log("set value : C:\\Users\\test");
console.log("serialized: " + div.toString());
// serialized: <div path="C:Users\test"></div> <-- one backslash gone, '\U' lost
const doc2 = parse(div.toString());
console.log("round-trip:", JSON.stringify(doc2.firstChild.getAttribute("path")));
// round-trip: "C:Users\test" <-- different from what was set
Other inputs (all silently corrupted):
| setAttribute value | toString() output (attribute value) | round-trip via parse() |
| --- | --- | --- |
| a\\b | ab (single \\) | "ab" |
| C:\\Users\\test | C:Users\\test | "C:Users\\test" (still wrong) |
| \\ | (attribute empty) | undefined |
| path\\to\\file | pathto\\file | "pathto\\file" |
Root cause
src/nodes/html.ts (the source for dist/nodes/html.js), quoteAttribute:
return JSON.stringify(attr.replace(/"/g, "&quot;"))
  .replace(/\\t/g, "\t")
  .replace(/\\n/g, "\n")
  .replace(/\\r/g, "\r")
  .replace(/\\/g, ""); // <-- removes EVERY remaining backslash
The idea is to undo the \\ that JSON.stringify adds before each backslash. But because the four .replace calls run unconditionally, this also:
- Converts every \t / \n / \r from the literal input into the corresponding control character.
- Deletes every other backslash in the literal input.
The net effect: attribute values containing any backslash are corrupted, and (worse) values containing \t etc. get a real tab character spliced into the HTML — which parse() then drops as whitespace.
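To make the failure easier to follow, here is a minimal sketch that runs the replace chain quoted above one step at a time. It is a standalone copy of that chain, not the library itself; `traceQuoteAttribute` is a hypothetical helper name, and the behaviour shown is only that of the five quoted lines.

```javascript
// Standalone copy of the replace chain quoted above, split into steps so that
// every intermediate string can be inspected. Nothing is imported from
// node-html-parser; traceQuoteAttribute is a hypothetical helper name.
function traceQuoteAttribute(attr) {
  const steps = [];
  let s = JSON.stringify(attr.replace(/"/g, "&quot;"));
  steps.push(["JSON.stringify", s]);
  s = s.replace(/\\t/g, "\t"); // backslash + "t" becomes a real tab
  steps.push(["undo \\t", s]);
  s = s.replace(/\\n/g, "\n"); // backslash + "n" becomes a real newline
  steps.push(["undo \\n", s]);
  s = s.replace(/\\r/g, "\r"); // backslash + "r" becomes a real carriage return
  steps.push(["undo \\r", s]);
  s = s.replace(/\\/g, ""); // every backslash that is left is deleted
  steps.push(["strip backslashes", s]);
  return steps;
}

const traced = traceQuoteAttribute("C:\\Users\\test"); // the string C:\Users\test
for (const [label, value] of traced) {
  // JSON.stringify is used for display only, so control characters show up as escapes.
  console.log(label.padEnd(18), JSON.stringify(value));
}
```

With the chain as quoted, the second step already turns the `\t` of `\test` into a tab, because the pattern matches the second of the two backslashes that JSON.stringify emitted plus the following `t`. The last step then deletes the backslashes that remain.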
Why this matters
Backslashes show up routinely in attribute values: Windows paths (C:\Users\…), regex patterns (data-pattern="\\d+"), URI templates with escaped sequences, JSON snippets stored in data-* attributes. Any code that does el.setAttribute("data-x", JSON.stringify(obj)) (or stores JSON or a path in any attribute) and later reads it back via parse(el.toString()).getAttribute("data-x") will silently get back a different value.
A server returning HTML with such attributes makes the client lose data on every parse-serialize-parse cycle.
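The JSON-in-attribute case can be sketched without the library by reusing the quoted chain. Both function names below are hypothetical, and `readBack` is an assumption standing in for `parse(...).getAttribute(...)`: it only strips the surrounding quotes and decodes `&quot;`.

```javascript
// Same chain as quoted in the report, as one standalone function
// (quoteAttributeAsQuoted is a hypothetical name).
function quoteAttributeAsQuoted(attr) {
  return JSON.stringify(attr.replace(/"/g, "&quot;"))
    .replace(/\\t/g, "\t")
    .replace(/\\n/g, "\n")
    .replace(/\\r/g, "\r")
    .replace(/\\/g, "");
}

// Assumption: reading the attribute back strips the surrounding quotes and
// decodes &quot;. This stands in for parse(...).getAttribute(...).
function readBack(quoted) {
  return quoted.slice(1, -1).replace(/&quot;/g, '"');
}

const original = { pattern: "\\d+" }; // pattern is the 3-character string \d+
const stored = JSON.stringify(original);
const recovered = JSON.parse(readBack(quoteAttributeAsQuoted(stored)));

console.log(JSON.stringify(original.pattern));  // what was stored
console.log(JSON.stringify(recovered.pattern)); // what comes back
console.log(original.pattern === recovered.pattern);
```

Under these assumptions the payload still parses as JSON, so nothing throws; the regex simply comes back without its backslash.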
Suggested fix
Don't double-substitute. The minimal correct quoting only escapes & and the quote character:
quoteAttribute(attr: string | null): string {
  if (attr == null) return "null";
  return `"${attr.replace(/&/g, "&amp;").replace(/"/g, "&quot;")}"`;
}
(or use he.encode for full attribute-context entity encoding). No JSON.stringify, no backslash dance.
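As a rough check of the proposed quoting, the sketch below writes it as a standalone JavaScript function and pairs it with a hypothetical inverse. `quoteAttributeFixed` and `unquoteAttribute` are illustrative names; the inverse is an assumption about how a parser would decode the value, not node-html-parser's actual decoding.

```javascript
// Proposed quoting from the report, as a standalone function:
// escape & first, then the double quote, and leave backslashes alone.
function quoteAttributeFixed(attr) {
  if (attr == null) return "null";
  return `"${attr.replace(/&/g, "&amp;").replace(/"/g, "&quot;")}"`;
}

// Hypothetical inverse, assuming the reader strips the quotes and decodes the
// two entities. &quot; is decoded before &amp; so that a literal "&quot;" in
// the input (encoded as "&amp;quot;") is not decoded twice.
function unquoteAttribute(quoted) {
  return quoted.slice(1, -1).replace(/&quot;/g, '"').replace(/&amp;/g, "&");
}

const samples = [
  "C:\\Users\\test",   // C:\Users\test
  "a\\b",              // a\b
  "\\",                // a single backslash
  "path\\to\\file",    // path\to\file
  'say "hi" & bye',
  "&quot;",            // literal entity text
];

const results = samples.map((value) => ({
  value,
  quoted: quoteAttributeFixed(value),
  roundTrips: unquoteAttribute(quoteAttributeFixed(value)) === value,
}));

for (const r of results) {
  console.log(JSON.stringify(r.value), "->", r.quoted, r.roundTrips);
}
```

Escaping `&` before `"` matters: doing it the other way round would re-escape the ampersand of each freshly inserted `&quot;`.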
Environment
- node-html-parser: 7.1.0 (latest)
- node: 20+
Discovered via fast-check on the property parse(el.toString()).getAttribute(name) === el.getAttribute(name). Happy to PR.