Skip to content

fix(to_xml): strip XML-invalid C0 control chars from attribute values#155

Merged
villelaitila merged 2 commits into
softagram:mainfrom
villelaitila:fix/strip-xml-invalid-control-chars
Apr 29, 2026
Merged

fix(to_xml): strip XML-invalid C0 control chars from attribute values#155
villelaitila merged 2 commits into
softagram:mainfrom
villelaitila:fix/strip-xml-invalid-control-chars

Conversation

@villelaitila
Copy link
Copy Markdown
Contributor

Summary

  • SGraph.to_xml produced output that XML 1.0 parsers reject when an attribute value contained C0 control characters (0x00–0x08, 0x0B, 0x0C, 0x0E–0x1F). Per XML 1.0 §2.2, those characters are forbidden in any XML content, even as numeric character references — so a single such byte in a model attribute renders the entire serialised file unparseable.
  • enc_xml_a_v now strips the forbidden C0 characters before applying the existing entity escaping. TAB (0x09), LF (0x0A) and CR (0x0D) are preserved as before; LF continues to be escaped as 
. DEL (0x7F) is also preserved, matching XML 1.0 (which permits it).
  • The strip is performed via a precompiled class-level regex (SGraph._XML_INVALID_CONTROL_RE), so there is no per-call compilation cost on large models.

Why

Models built from real-world source data occasionally pick up C0 control bytes — typically through binary files, encoded blobs, or copy-paste artefacts that survive into attribute values such as descriptions or commit metadata. The previous behaviour silently produced an XML file that parse_xml_file_or_stream could not load back, breaking serialise → reload roundtrips. The fix is the smallest correct change: drop only the bytes XML actually forbids, leave everything else untouched, and document the spec link in the code comment.

This is a behaviour-preserving fix for valid input — only previously-broken outputs change, and they change from "unparseable" to "valid XML with the offending control characters elided".

Test plan

  • New roundtrip test (test_to_xml_strips_invalid_control_chars_and_roundtrips) constructs an attribute value containing every forbidden C0 character plus <&"'>, DEL, and a TAB, serialises the graph, parses the output back, and asserts that
    - all forbidden C0 bytes have been stripped from the parsed attribute,
    - the surrounding visible content (including the entity-escaped chars and DEL) survives intact.
  • Full pytest suite passes locally.

Attribute values containing C0 control characters (0x00-0x08, 0x0B, 0x0C,
0x0E-0x1F) produced XML that parsers reject outright — XML 1.0 forbids
these characters even as numeric entities. enc_xml_a_v now strips them
before escaping. TAB, LF, CR remain supported (LF is escaped as &softagram#10;).

Add a roundtrip test that builds an element with every forbidden C0 char
plus <&"'> and DEL, serialises, and re-parses to confirm the output is
valid.
Both branches of the previous `or` collapsed to the same expression
(`'after<&"\'>' in got`), so the assertion only ever exercised one
predicate. Reduce to the single equivalent check for readability.
@softagram-bot
Copy link
Copy Markdown

Looking good 🎉

See the full Softagram report

@villelaitila villelaitila merged commit 47215c4 into softagram:main Apr 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants