gh-149468: Add option to validate ElementTree during serialization #149469
serhiy-storchaka wants to merge 10 commits into
This PR also fixes some bugs in serialization to HTML. I am going to extract them into a separate issue before merging this PR, but for now it is more convenient to review them together.
| @@ -0,0 +1,3 @@ | |||
| Add the *validate* option to :mod:`xml.etree.ElementTree` serialization | |||
| functions, which allows to validate the element or element tree before | |||
Suggestion on wording: "...which validates the element or element tree names and values only contain allowed/escaped characters prior to serialization"
No, this is not good. Characters should not be escaped prior to serialization. An element tree has no names; I wrote "the element or element tree" because some methods and functions serialize an element, while others work with the element tree. We check not only for invalid characters but also for invalid sequences of characters, and some characters or sequences are invalid only at the start or the end of a name or content.
I updated the entry according to description proposed by @ezio-melotti.
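To illustrate the point about position-dependent checks, here is a minimal sketch for XML comment content. The helper name `check_comment_text` is hypothetical and this is not the code from the PR; it only follows the XML 1.0 comment grammar, where "--" is forbidden anywhere and a single "-" is a problem only at the very end.

```python
def check_comment_text(text):
    """Raise ValueError if *text* cannot appear inside an XML comment.

    Hypothetical helper for illustration only, not part of the PR.
    """
    # The sequence "--" is forbidden anywhere inside a comment.
    pos = text.find('--')
    if pos >= 0:
        raise ValueError(f'"--" is not allowed in a comment (offset {pos})')
    # A single "-" is fine in the middle, but at the very end it would
    # merge with the closing "-->".
    if text.endswith('-'):
        raise ValueError('a comment cannot end with "-"')
    return text
```

Note that no individual character is rejected here: "a-b" passes, while "a--b" and "a-" do not.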
* The content of comments, processing instructions and elements "xmp", "iframe", "noembed", "noframes", and "plaintext" is no longer escaped.
* The "plaintext" element no longer has a closing tag.
* Add support for empty attributes (with value None).
73e1b24 to a134c0b
Rebased it onto #149490.
| xml.etree.ElementTree | ||
| --------------------- |
You should now move this entry to Doc/whatsnew/3.16.rst.
ca19970 to c81fe70
| def test_invalid_attr_value(self): | ||
| self.check(ET.Element('tag', attrib={'key': '\x00'})) | ||
| self.check(ET.Element('tag', attrib={'key': '\ud8ff'})) | ||
| self.check(ET.Element('tag', attrib={'key': '\ufffe'})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')})) |
This and several other methods could use subTests if you think it's an improvement, e.g.:
| def test_invalid_attr_value(self): | |
| self.check(ET.Element('tag', attrib={'key': '\x00'})) | |
| self.check(ET.Element('tag', attrib={'key': '\ud8ff'})) | |
| self.check(ET.Element('tag', attrib={'key': '\ufffe'})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')})) | |
| @support.subTests('value', ('\x00', '\ud8ff', '\ufffe')) | |
| def test_invalid_attr_value(self, value): | |
| self.check(ET.Element('tag', attrib={'key': value})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName(value)})) |
I'll think about it.
The drawback is that this can make tests less flexible: it becomes difficult to add different assertions to the same method. We also cannot use ET outside the method body, because it is only defined when the tests run.
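One possible middle ground is the standard `unittest.TestCase.subTest` context manager, which keeps the parametrization inside the method body. The sketch below is self-contained, so it uses a hypothetical `has_invalid_char` helper in place of the test suite's `self.check`; the regular expression follows the XML 1.0 `Char` production and is an assumption about what is being checked, not the PR's implementation.

```python
import re
import unittest

# Hypothetical stand-in for the PR's check: matches any character that is
# outside the XML 1.0 "Char" production.
_INVALID_XML_CHAR = re.compile(
    '[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')


def has_invalid_char(value):
    return _INVALID_XML_CHAR.search(value) is not None


class InvalidAttrValueTest(unittest.TestCase):
    def test_invalid_attr_value(self):
        # The values are evaluated when the test runs, so names that only
        # exist at run time can be used here.
        for value in ('\x00', '\ud8ff', '\ufffe'):
            with self.subTest(value=value):
                self.assertTrue(has_invalid_char(value))
        # Unrelated assertions can still live in the same method.
        self.assertFalse(has_invalid_char('plain text'))
```

Each value is reported separately on failure, and the method stays free to mix in other assertions.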
| write("<!--%s-->" % text) | ||
| elif tag is ProcessingInstruction: | ||
| if validate: | ||
| m = re.search('[ \t\r\n]', text) |
test_invalid_processing_instruction.
Although the use of non-space delimiters is not completely tested (only the negative case).
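For readers following along, a rough sketch of how splitting processing-instruction text at the first whitespace character could feed a check. The diff only shows the `re.search` call, so everything after it is an assumption: `split_pi` and `check_pi` are hypothetical names, and the rules come from the XML 1.0 grammar (the target must not be "xml" in any case, and the data must not contain "?>").

```python
import re


def split_pi(text):
    """Split processing-instruction text into (target, data)."""
    m = re.search('[ \t\r\n]', text)
    if m is None:
        # No delimiter: the whole text is the target.
        return text, ''
    return text[:m.start()], text[m.end():].lstrip(' \t\r\n')


def check_pi(text):
    """Hypothetical check, not the PR's implementation."""
    target, data = split_pi(text)
    if not target:
        raise ValueError('processing instruction has no target')
    if target.lower() == 'xml':
        raise ValueError('"xml" is a reserved target')
    if '?>' in data:
        raise ValueError('"?>" is not allowed in processing instruction data')
    return target, data
```

Covering the positive case would mean passing text delimited by a tab or a newline, for example `check_pi('target\tdata')`.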
| if text: | ||
| if validate: | ||
| if '\0' in text: | ||
| raise ValueError('invalid characters') |
| raise ValueError('invalid characters') | |
| raise ValueError('invalid character ("\\0")') |
Similarly the other error messages could specify what is invalid.
We perhaps should also forbid surrogate code points. Unlike the null character, they are not explicitly mentioned in the tokenizer definition, but they are invalid in all standard encodings, and numeric character references to them are also invalid (https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state).
As for what exactly is invalid, that is not always easily accessible: xml.is_valid_name() and xml.is_valid_text() simply return a boolean.
We could include the name of the parent element, but this may complicate the code to handle special cases of undefined or None.
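Where a regular expression does the checking, the offending character is available from the match object, so a more specific message is cheap. A minimal sketch, assuming a hypothetical `check_text` helper that rejects the null character and lone surrogates (this is not the PR's code, and it does not help with the boolean-only helpers mentioned above):

```python
import re

# Null character plus the surrogate range U+D800..U+DFFF.
_FORBIDDEN = re.compile(r'[\x00\ud800-\udfff]')


def check_text(text):
    """Hypothetical check that reports which character is invalid."""
    m = _FORBIDDEN.search(text)
    if m is not None:
        ch = m.group()
        raise ValueError(
            f'invalid character U+{ord(ch):04X} at offset {m.start()}')
    return text
```

The surrogate case is also easy to confirm independently: `'\ud8ff'.encode('utf-8')` raises `UnicodeEncodeError`.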
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
serhiy-storchaka left a comment
Thank you for your review @ezio-melotti. See also the issue. "XML validation" has its own specific meaning, so there is a question about using the name "validate" for our much more lenient checks.
We do not check that the output is well-formed. There are much stricter rules for what the name of an element or attribute may be. I use the rules of the HTML parser, which accepts much more.
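For context, a small sketch of the existing behaviour that motivates any check at all. It uses only the current `xml.etree.ElementTree` API and not the new option. The tag name with a space is my own example; I would expect even lenient rules to reject it, but that is an assumption and has not been checked against the PR.

```python
import xml.etree.ElementTree as ET

# A tag name containing a space is accepted when building the tree ...
elem = ET.Element('a b')

# ... and is written out unchanged by the serializer.
data = ET.tostring(elem)

# The result cannot be parsed back: the parser sees an element "a" with a
# valueless attribute "b", which is not well-formed XML.
try:
    ET.fromstring(data)
except ET.ParseError as exc:
    roundtrip_error = exc
else:
    roundtrip_error = None
```

Here `roundtrip_error` ends up holding the `ParseError`, so the serialized output does not round-trip.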