
gh-139489: Add xml.is_valid_text()#149412

Merged
serhiy-storchaka merged 3 commits into python:main from serhiy-storchaka:xml-is_valid_text-1.0
May 6, 2026


@serhiy-storchaka
Member

serhiy-storchaka commented May 5, 2026

@read-the-docs-community

read-the-docs-community Bot commented May 5, 2026

@vstinner
Member

vstinner commented May 5, 2026

What do you think of adding also a sanitize function which takes a callback? Example:

import re

ILLEGAL_XML_CHARS_RE = re.compile(
    '['
    # Control characters; newline (\x0A and \x0D) and TAB (\x09) are legal
    '\x00-\x08\x0B\x0C\x0E-\x1F'
    # Surrogate characters
    '\uD800-\uDFFF'
    # Special Unicode characters
    '\uFFFE'
    '\uFFFF'
    # Match multiple sequential invalid characters for better efficiency
    ']+')

def sanitize(text, replace_func):
    def callback(regs):
        return replace_func(regs[0])

    return ILLEGAL_XML_CHARS_RE.sub(callback, text)

def escape(text):
    return ''.join(f'\\x{ord(char):02x}' for char in text)

invalid = '\x00'
test = f'a{invalid}b'
print(sanitize(test, lambda text: '#' * len(text)))
print(sanitize(test, escape))

It would be useful for sanitize_xml() of Lib/test/libregrtest/utils.py.
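
For reference, the character class in ILLEGAL_XML_CHARS_RE is the complement of the XML 1.0 Char production (TAB, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF). A minimal sketch of the matching validity check follows. The helper name is_xml10_text is hypothetical and is not the xml.is_valid_text() added by this PR, whose implementation is not shown in this thread.

```python
import re

# XML 1.0 "Char" production: TAB, LF, CR, and three code point ranges.
# Anything outside this class is what the sanitize() regex above targets.
_XML10_TEXT_RE = re.compile(
    '[\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]*')


def is_xml10_text(text):
    """Hypothetical helper: True if every character is an XML 1.0 Char."""
    return _XML10_TEXT_RE.fullmatch(text) is not None


if __name__ == '__main__':
    for sample in ('a\tb', 'a\x00b', '\ud800', '\ufffe'):
        print(ascii(sample), is_xml10_text(sample))
```

A string returned by sanitize() with a replacement made only of legal characters would pass this check, which is one way to test a sanitizer.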

Comment thread Doc/library/xml.rst Outdated
Comment thread Doc/library/xml.rst Outdated
Comment thread Doc/library/xml.rst Outdated
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@serhiy-storchaka
Member Author

What do you think of adding also a sanitize function which takes a callback?

I left it for a separate PR. I have a similar function in a different branch, but it also accepts the name of a registered error handler ('strict', 'ignore', 'replace', 'backslashreplace', etc.) and the callback has a different interface. It needs a separate discussion.
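
To make the error-handler idea concrete, here is a rough sketch. It is not the function from the other branch, which is not shown in this thread. It assumes a hypothetical sanitize_with_handler() that looks up a registered handler with codecs.lookup_error() and passes it a UnicodeEncodeError for each run of invalid characters. The 'xml' encoding label and the reason string are made up for illustration, and the position returned by the handler is ignored for simplicity.

```python
import codecs
import re

# Same set of characters as ILLEGAL_XML_CHARS_RE in the earlier comment.
_ILLEGAL_RE = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]+')


def sanitize_with_handler(text, errors='strict'):
    """Hypothetical sketch: replace invalid runs via a codecs error handler."""
    handler = codecs.lookup_error(errors)

    def repl(match):
        # 'xml' is only a label here; no codec with that name is involved.
        exc = UnicodeEncodeError('xml', text, match.start(), match.end(),
                                 'invalid XML character')
        replacement, _newpos = handler(exc)  # 'strict' raises exc instead
        return replacement

    return _ILLEGAL_RE.sub(repl, text)


if __name__ == '__main__':
    for name in ('ignore', 'replace', 'backslashreplace'):
        print(name, ascii(sanitize_with_handler('a\x00b', name)))
```

With the standard handlers this gives 'ab' for 'ignore', 'a?b' for 'replace' and 'a\\x00b' for 'backslashreplace', while 'strict' raises UnicodeEncodeError. Not every registered handler fits: 'xmlcharrefreplace' would emit a character reference such as &#0;, which is still not well-formed XML 1.0.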

Member

@vstinner vstinner left a comment


LGTM

Comment thread Misc/NEWS.d/next/Library/2026-05-05-13-12-58.gh-issue-139489.a8qqIM.rst Outdated
@vstinner
Member

vstinner commented May 6, 2026

I left it for a separate PR. (...). It needs a separate discussion.

Ok, that makes sense.

…8qqIM.rst

Co-authored-by: Victor Stinner <vstinner@python.org>
@serhiy-storchaka serhiy-storchaka enabled auto-merge (squash) May 6, 2026 14:11
@serhiy-storchaka serhiy-storchaka merged commit a5c7a74 into python:main May 6, 2026
55 checks passed
3 participants