gh-149468: Add option to validate ElementTree during serialization#149469

Open
serhiy-storchaka wants to merge 10 commits into python:main from serhiy-storchaka:xml-etree-validate

@serhiy-storchaka
Member

@serhiy-storchaka serhiy-storchaka commented May 6, 2026

@serhiy-storchaka
Member Author

This PR also fixes some bugs in serialization to HTML. I am going to extract them into a separate issue before merging this PR, but for now it is more convenient to review them together.

@@ -0,0 +1,3 @@
Add the *validate* option to :mod:`xml.etree.ElementTree` serialization
functions, which allows to validate the element or element tree before
Contributor

@sethmlarson sethmlarson May 6, 2026

Suggestion on wording: "...which validates that the element or element tree names and values contain only allowed/escaped characters prior to serialization"

Member Author

No, this is not good. Characters should not be escaped prior to serialization. An element tree has no names; I wrote "the element or element tree" because some methods/functions serialize an element, and others work with the element tree. We check not only for invalid characters but also for invalid sequences of characters, and some characters/sequences are only invalid at the start or the end of the name or content.

I updated the entry according to description proposed by @ezio-melotti.
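
To make the distinction between invalid characters, invalid sequences, and position-dependent cases concrete, here is a minimal sketch for XML comment content. It is not the PR's implementation: `check_comment_text` is a hypothetical helper, and it encodes only the two XML 1.0 grammar rules that a comment may not contain `--` and may not end with `-`.

```python
def check_comment_text(text):
    """Hypothetical helper: reject text that cannot appear in an XML comment.

    Not taken from the PR; it only encodes two XML 1.0 rules.
    """
    # An invalid sequence: "--" may not appear anywhere inside a comment.
    if '--' in text:
        raise ValueError('comment contains "--"')
    # A position-dependent case: "-" is fine in the middle, but a
    # trailing "-" would produce "--->" when the comment is serialized.
    if text.endswith('-'):
        raise ValueError('comment ends with "-"')
    return text
```

A single hyphen in the middle (`'a-b'`) passes, while `'a--b'` and `'ab-'` are rejected, so a plain per-character check would not be enough here.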

Comment thread Lib/xml/etree/ElementTree.py
* The content of comments, processing instructions and elements "xmp",
  "iframe", "noembed", "noframes", and "plaintext" is no longer escaped.
* The "plaintext" element no longer has a closing tag.
* Add support for empty attributes (with value None).
@serhiy-storchaka
Member Author

Rebased it onto #149490.

Comment thread Doc/whatsnew/3.15.rst Outdated


xml.etree.ElementTree
---------------------
Member

You should now move this entry to Doc/whatsnew/3.16.rst.

@ned-deily ned-deily removed the request for review from a team May 16, 2026 04:55
Comment thread Doc/library/xml.etree.elementtree.rst Outdated
Comment thread Lib/test/test_xml_etree.py
Comment on lines +1429 to +1435
    def test_invalid_attr_value(self):
        self.check(ET.Element('tag', attrib={'key': '\x00'}))
        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))
Member

This and several other methods could use subTests if you think it's an improvement, e.g.:

Suggested change
-    def test_invalid_attr_value(self):
-        self.check(ET.Element('tag', attrib={'key': '\x00'}))
-        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
-        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))
+    @support.subTests('value', ('\x00', '\ud8ff', '\ufffe'))
+    def test_invalid_attr_value(self, value):
+        self.check(ET.Element('tag', attrib={'key': value}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName(value)}))

Member Author

I'll think about it.

The drawback is that this can make tests less flexible. It will be difficult to add different assertions in the same method. We also cannot use ET outside the method body, because it is only defined at test run time.
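
One possible middle ground, sketched here with the standard `unittest.TestCase.subTest` context manager rather than the `support.subTests` decorator, keeps `ET` inside the method body and still leaves room for other assertions in the same method. The `check` method below is only a stand-in, since the real helper in `test_xml_etree.py` is not shown in this thread.

```python
import unittest
import xml.etree.ElementTree as ET


class InvalidAttrValueTest(unittest.TestCase):
    def check(self, elem):
        # Stand-in for the real helper (an assumption): only assert that
        # every attribute value contains a null, a surrogate or U+FFFE.
        for value in elem.attrib.values():
            text = value.text if isinstance(value, ET.QName) else value
            self.assertRegex(text, '[\x00\ud800-\udfff\ufffe]')

    def test_invalid_attr_value(self):
        # Each value is reported as a separate subtest on failure.
        for value in ('\x00', '\ud8ff', '\ufffe'):
            with self.subTest(value=value):
                self.check(ET.Element('tag', attrib={'key': value}))
                self.check(ET.Element('tag', attrib={'key': ET.QName(value)}))
```

Because the loop lives in the method body, extra assertions can be added before or after it without restructuring the test.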

Comment thread Lib/test/test_xml_etree.py
        write("<!--%s-->" % text)
    elif tag is ProcessingInstruction:
        if validate:
            m = re.search('[ \t\r\n]', text)
Member

Is there a test for this?

Member Author

test_invalid_processing_instruction.

However, the use of non-space delimiters is not completely tested (only the negative case).
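
For readers unfamiliar with how ElementTree stores processing instructions: `ET.ProcessingInstruction(target, text)` joins the target and the text with a space in the element's `.text`, so a serializer that wants to look at the target alone has to split at the first whitespace character. The sketch below shows one way the `re.search()` result could be used for that. `split_pi` is a hypothetical helper, and what the PR actually does after the search is not shown in this thread.

```python
import re


def split_pi(text):
    """Split processing-instruction text into (target, data).

    Hypothetical helper; only a guess at the intent of the search above.
    """
    m = re.search('[ \t\r\n]', text)
    if m is None:
        # No whitespace at all: the whole text is the target.
        return text, ''
    # Everything before the first delimiter is the target,
    # everything after that one character is the data.
    return text[:m.start()], text[m.end():]
```

The character class accepts a tab, carriage return, or line feed as the delimiter as well as a space, which is the case the reply above says is only partly tested.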

    if text:
        if validate:
            if '\0' in text:
                raise ValueError('invalid characters')
Member

Suggested change
-                raise ValueError('invalid characters')
+                raise ValueError('invalid character ("\\0")')

Similarly the other error messages could specify what is invalid.

Member Author

We should perhaps also forbid surrogate code points. Unlike the null character, they are not explicitly mentioned in the tokenizer definition, but they are invalid in all standard encodings, and using numeric references for them is also invalid (https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state).

As for what exactly is invalid, that is not always easily accessible. xml.is_valid_name() and xml.is_valid_text() simply return a boolean.

We could include the name of the parent element, but this may complicate the code to handle special cases of undefined or None.
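
Where the check is a simple character search, reporting the offending character is cheap. The following is a minimal sketch under that assumption: `find_invalid_char` and `check_text` are hypothetical names, not part of the PR, and the character class covers only the null character and the surrogate range discussed above.

```python
import re

# Null character and the surrogate range U+D800..U+DFFF.
_INVALID = re.compile('[\x00\ud800-\udfff]')


def find_invalid_char(text):
    """Return the first null or surrogate character in text, or None.

    Hypothetical helper, not part of the PR.
    """
    m = _INVALID.search(text)
    return m.group() if m else None


def check_text(text):
    bad = find_invalid_char(text)
    if bad is not None:
        # %a applies ascii(), so the message stays printable
        # even when the offending character is a lone surrogate.
        raise ValueError('invalid character %a' % bad)
```

The point about encodings can be checked directly: `'\ud8ff'.encode('utf-8')` raises `UnicodeEncodeError`, so text containing a lone surrogate cannot be written out in UTF-8 without a special error handler.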

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Member Author

@serhiy-storchaka serhiy-storchaka left a comment

Thank you for your review, @ezio-melotti. See also the issue. "XML validation" has its own specific meaning, so there is a question about using the name "validate" for our much more lenient checks.

We do not check that the output is well-formed. There are much stricter rules for what the name of an element or attribute should be. I use the rules of the HTML parser, which accepts much more.
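
For context, here is a small demonstration of why even lenient checks are useful. It uses only the existing `xml.etree.ElementTree` API and does not call the new *validate* option, whose exact signature is not shown in this thread. In current releases, as far as I can tell, the serializer writes a control character in an attribute value as-is, and the parser then rejects the result.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('tag', attrib={'key': '\x01'})
# Serialization succeeds; the control character is written unescaped.
data = ET.tostring(elem, encoding='unicode')

try:
    ET.fromstring(data)
except ET.ParseError:
    # The parser rejects the document that the serializer just produced.
    roundtrips = False
else:
    roundtrips = True
```

With no check at serialization time, the problem only surfaces later, when some consumer tries to parse the output.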

@read-the-docs-community

Documentation build overview

📚 cpython-previews | 🛠️ Build #32922382 | 📁 Comparing 3d0fdd2 against main (350e9de)

  🔍 Preview build  

3 files changed
± library/xml.etree.elementtree.html
± whatsnew/3.16.html
± whatsnew/changelog.html
