gh-149468: Add option to validate ElementTree during serialization by serhiy-storchaka · Pull Request #149469 · python/cpython

serhiy-storchaka · 2026-05-06T19:25:48Z

Issue: Add option to validate ElementTree during serialization #149468

serhiy-storchaka · 2026-05-06T19:30:37Z

This PR also fixes some bugs in serialization to HTML. I am going to extract them into a separate issue before merging this PR, but for now it is more convenient to review them together.

sethmlarson · 2026-05-06T20:30:50Z

@@ -0,0 +1,3 @@
+Add the *validate* option to :mod:`xml.etree.ElementTree` serialization
+functions, which allows to validate the element or element tree before


Suggestion on wording: "...which validates the element or element tree names and values only contain allowed/escaped characters prior to serialization"

No, this is not good. Characters should not be escaped prior to serialization. An element tree has no names; I wrote "the element or element tree" because some methods and functions serialize an element, while others work with the element tree. We check not only for invalid characters but also for invalid sequences of characters, and some characters or sequences are invalid only at the start or the end of a name or content.

I updated the entry according to description proposed by @ezio-melotti.
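To make the point about invalid sequences and position-dependent characters concrete, here is a minimal sketch based on the XML 1.0 rule for comments, where "--" is forbidden anywhere and a single "-" is forbidden only at the end. The helper name `check_comment_text` is hypothetical and this is not the check implemented in the PR.

```python
import re

# "--" anywhere, or a "-" at the very end, would make "<!--%s-->" ill-formed.
_BAD_COMMENT = re.compile(r"--|-\Z")


def check_comment_text(text):
    """Raise ValueError if *text* cannot be the body of an XML comment.

    Hypothetical helper for illustration only.
    """
    m = _BAD_COMMENT.search(text)
    if m is not None:
        raise ValueError(
            "invalid sequence %r at position %d in comment"
            % (m.group(), m.start()))
    return text
```

A single "-" in the middle of the text passes, which is why a per-character allow-list is not enough to express this rule.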

* The content of comments, processing instructions, and the elements "xmp", "iframe", "noembed", "noframes", and "plaintext" is no longer escaped.
* The "plaintext" element no longer has a closing tag.
* Add support for empty attributes (with value None).
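The HTML serialization changes listed in this commit message can be illustrated with a toy serializer for a single element. This is a rough sketch under stated assumptions, not the code in this PR: `render_html_element` and `RAW_TEXT_ELEMENTS` are hypothetical names, and including "script" and "style" in the set is an assumption.

```python
from html import escape

# Assumption: elements whose text content is written without escaping.
RAW_TEXT_ELEMENTS = {
    "script", "style", "xmp", "iframe", "noembed", "noframes", "plaintext",
}


def render_html_element(tag, attrib, text):
    """Serialize one element with text content (hypothetical helper)."""
    parts = ["<" + tag]
    for key, value in attrib.items():
        if value is None:
            # Empty attribute: write the name only, e.g. <p hidden>.
            parts.append(" " + key)
        else:
            parts.append(' %s="%s"' % (key, escape(value, quote=True)))
    parts.append(">")
    if tag.lower() in RAW_TEXT_ELEMENTS:
        parts.append(text)                       # written as is
    else:
        parts.append(escape(text, quote=False))  # &, < and > are escaped
    if tag.lower() != "plaintext":
        # "plaintext" consumes the rest of the document, so it is never closed.
        parts.append("</%s>" % tag)
    return "".join(parts)
```

For example, `render_html_element("p", {"hidden": None}, "a<b")` gives `<p hidden>a&lt;b</p>`, while the same text inside "xmp" is left unescaped.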

serhiy-storchaka · 2026-05-07T17:28:25Z

Rebased it onto #149490.

vstinner · 2026-05-13T11:58:37Z



+xml.etree.ElementTree
+---------------------


You should now move this entry to Doc/whatsnew/3.16.rst.

ezio-melotti · 2026-05-30T19:50:06Z

+    def test_invalid_attr_value(self):
+        self.check(ET.Element('tag', attrib={'key': '\x00'}))
+        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
+        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))


This and several other methods could use subTests if you think it's an improvement, e.g.:

Suggested change

-    def test_invalid_attr_value(self):
-        self.check(ET.Element('tag', attrib={'key': '\x00'}))
-        self.check(ET.Element('tag', attrib={'key': '\ud8ff'}))
-        self.check(ET.Element('tag', attrib={'key': '\ufffe'}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')}))
-        self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')}))
+    @support.subTests('value', ('\x00', '\ud8ff', '\ufffe'))
+    def test_invalid_attr_value(self, value):
+        self.check(ET.Element('tag', attrib={'key': value}))
+        self.check(ET.Element('tag', attrib={'key': ET.QName(value)}))

serhiy-storchaka · 2026-05-31T08:05:13Z

I'll think about it.

The drawback is that this can make the tests less flexible: it becomes difficult to add different assertions to the same method. We also cannot use ET outside the method body, because it is only defined when the tests run.
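One way to get per-value reporting while keeping the parameters inside the method body is the standard `unittest.TestCase.subTest` context manager. The sketch below is not what the PR does: `is_xml_char` is a hypothetical stand-in for the PR's `self.check`, written from the XML 1.0 `Char` production, and `run_tests` is only a helper for running the case programmatically.

```python
import unittest


def is_xml_char(ch):
    """True if *ch* matches the XML 1.0 ``Char`` production (stand-in check)."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


class InvalidAttrValueTest(unittest.TestCase):
    def test_invalid_attr_value(self):
        # The values are created at run time, inside the method body.
        for value in ('\x00', '\ud8ff', '\ufffe'):
            with self.subTest(value=value):
                self.assertFalse(all(is_xml_char(ch) for ch in value))


def run_tests():
    """Run the test case and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(
        InvalidAttrValueTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With this form, other assertions can still be added to the same method before or after the loop.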

ezio-melotti · 2026-05-30T19:53:39Z

        write("<!--%s-->" % text)
    elif tag is ProcessingInstruction:
+        if validate:
+            m = re.search('[ \t\r\n]', text)


Is there a test for this?

serhiy-storchaka · 2026-05-31T08:11:23Z

Yes, test_invalid_processing_instruction.

However, the use of non-space delimiters is not completely tested (only the negative case).
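For context, ElementTree stores a processing instruction as a single string, "target data", so the regular expression in the hunk above locates the first whitespace character that separates the two parts. The following is a minimal sketch of how such a split could feed a check; `split_pi` and `check_pi` are hypothetical names, the rules shown (no empty target, no reserved "xml" target, no "?>" in the data) come from XML 1.0, and checking that the target is a valid name is omitted.

```python
import re


def split_pi(text):
    """Split ElementTree-style PI text into (target, data)."""
    m = re.search('[ \t\r\n]', text)
    if m is None:
        return text, ''
    # Any of the four whitespace characters may act as the delimiter.
    return text[:m.start()], text[m.end():].lstrip(' \t\r\n')


def check_pi(text):
    """Raise ValueError for PI text that cannot be serialized as XML."""
    target, data = split_pi(text)
    if not target:
        raise ValueError('processing instruction has no target: %r' % text)
    if target.lower() == 'xml':
        raise ValueError('reserved processing instruction target: %r' % target)
    if '?>' in data:
        raise ValueError('"?>" is not allowed in processing instruction data')
    return target, data
```

A positive test for non-space delimiters could then pass text such as `'target\tdata'` and assert that it is accepted.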

ezio-melotti · 2026-05-30T20:03:02Z

            if text:
+                if validate:
+                    if '\0' in text:
+                        raise ValueError('invalid characters')


Suggested change

-                        raise ValueError('invalid characters')
+                        raise ValueError('invalid character ("\\0")')

Similarly the other error messages could specify what is invalid.

serhiy-storchaka · 2026-05-31T08:22:16Z

We should perhaps also forbid surrogate code points. Unlike the null character, they are not explicitly mentioned in the tokenizer definition, but they are invalid in all standard encodings, and using numeric character references for them is also invalid (https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state).

As for what exactly is invalid, that information is not always easily accessible: xml.is_valid_name() and xml.is_valid_text() simply return a boolean.

We could include the name of the parent element, but this may complicate the code, which would have to handle the special cases of an undefined name or None.
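As an illustration of both points (more specific messages, and rejecting surrogates along with the null character), a check could scan the text and report the first offending code point. This is a hedged sketch with a hypothetical helper name, `check_html_text`, and is not the code that was committed.

```python
def check_html_text(text):
    """Raise ValueError naming the first NUL or surrogate in *text*.

    Hypothetical helper for illustration only.
    """
    for pos, ch in enumerate(text):
        if ch == '\0' or '\ud800' <= ch <= '\udfff':
            # Report the code point in U+XXXX form so that the message
            # itself never contains an unencodable character.
            raise ValueError(
                'invalid character U+%04X at position %d' % (ord(ch), pos))
    return text
```

Reporting the code point and position does not depend on knowing the parent element, so it avoids the special cases mentioned above.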

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

serhiy-storchaka

Thank you for your review @ezio-melotti. See also the issue. "XML validation" has its own specific meaning, so there is a question about using the name "validate" for our much more lenient checks.

We do not check that the output is well-formed. There are much stricter rules for what the name of an element or attribute should be. I use the rules of the HTML parser, which accepts much more.

read-the-docs-community · 2026-05-31T11:31:01Z

Documentation build overview

📚 cpython-previews | 🛠️ Build #32922382 | 📁 Comparing 3d0fdd2 against main (350e9de)

🔍 Preview build

3 files changed

± library/xml.etree.elementtree.html
± whatsnew/3.16.html
± whatsnew/changelog.html

serhiy-storchaka requested a review from AA-Turner as a code owner May 6, 2026 19:25

bedevere-app Bot mentioned this pull request May 6, 2026

Add option to validate ElementTree during serialization #149468

Open

bedevere-app Bot added the awaiting core review label May 6, 2026

sethmlarson reviewed May 6, 2026

merwok reviewed May 6, 2026

Comment thread Lib/xml/etree/ElementTree.py

serhiy-storchaka requested review from a team, FFY00, ZeroIntensity, corona10, emmatyping, encukou, ericsnowcurrently and erlend-aasland as code owners May 7, 2026 17:18

serhiy-storchaka added 2 commits May 7, 2026 20:21

pythongh-149468: Add option to validate ElementTree during serialization

a134c0b

serhiy-storchaka force-pushed the xml-etree-validate branch from 73e1b24 to a134c0b Compare May 7, 2026 17:26

vstinner reviewed May 13, 2026

ned-deily removed the request for review from a team May 16, 2026 04:55

Merge branch 'main' into xml-etree-validate

c81fe70

serhiy-storchaka force-pushed the xml-etree-validate branch from ca19970 to c81fe70 Compare May 29, 2026 21:42

ezio-melotti reviewed May 30, 2026

Apply suggestions from code review

ea414fa

Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>

serhiy-storchaka commented May 31, 2026

serhiy-storchaka added 4 commits May 31, 2026 11:46

Merge branch 'main' into xml-etree-validate

452480d

Add more tests for processing instructions

474411f

Add more details in exceptions.

22e5543

Check also for surrogates in HTML.

e12e88d

serhiy-storchaka added 2 commits May 31, 2026 14:16

Move the What's New entry to 3.16.

8cc34e7

Update the NEWS entry.

3d0fdd2
