fix(to_xml): strip XML-invalid C0 control chars from attribute values#155
Merged
villelaitila merged 2 commits intoApr 29, 2026
Merged
Conversation
Attribute values containing C0 control characters (0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F) produced XML that parsers reject outright — XML 1.0 forbids these characters even as numeric entities. enc_xml_a_v now strips them before escaping. TAB, LF, CR remain supported (LF is escaped as &softagram#10;). Add a roundtrip test that builds an element with every forbidden C0 char plus <&"'> and DEL, serialises, and re-parses to confirm the output is valid.
Both branches of the previous `or` collapsed to the same expression (`'after<&"\'>' in got`), so the assertion only ever exercised one predicate. Reduce to the single equivalent check for readability.
Looking good 🎉
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SGraph.to_xmlproduced output that XML 1.0 parsers reject when an attribute value contained C0 control characters (0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F). Per XML 1.0 §2.2, those characters are forbidden in any XML content, even as numeric character references — so a single such byte in a model attribute renders the entire serialised file unparseable.enc_xml_a_vnow strips the forbidden C0 characters before applying the existing entity escaping. TAB (0x09), LF (0x0A) and CR (0x0D) are preserved as before; LF continues to be escaped as . DEL (0x7F) is also preserved, matching XML 1.0 (which permits it).SGraph._XML_INVALID_CONTROL_RE), so there is no per-call compilation cost on large models.Why
Models built from real-world source data occasionally pick up C0 control bytes — typically through binary files, encoded blobs, or copy-paste artefacts that survive into attribute values such as descriptions or commit metadata. The previous behaviour silently produced an XML file that
parse_xml_file_or_streamcould not load back, breaking serialise → reload roundtrips. The fix is the smallest correct change: drop only the bytes XML actually forbids, leave everything else untouched, and document the spec link in the code comment.This is a behaviour-preserving fix for valid input — only previously-broken outputs change, and they change from "unparseable" to "valid XML with the offending control characters elided".
Test plan
test_to_xml_strips_invalid_control_chars_and_roundtrips) constructs an attribute value containing every forbidden C0 character plus<&"'>, DEL, and a TAB, serialises the graph, parses the output back, and asserts that- all forbidden C0 bytes have been stripped from the parsed attribute,
- the surrounding visible content (including the entity-escaped chars and DEL) survives intact.
pytestsuite passes locally.