Python XML Processing With LXML
John W. Shipman
2013-08-24 12:39
Abstract
Describes the lxml package for reading and writing XML files with the Python programming
language.
This publication is available in Web form (http://www.nmt.edu/tcc/help/pubs/pylxml/) and also
as a PDF document (http://www.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf). Please forward any
comments to tcc-doc@nmt.edu.
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0
Unported License (http://creativecommons.org/licenses/by-nc/3.0/).
Table of Contents
1. Introduction: Python and XML
2. How ElementTree represents XML
3. Reading an XML document
4. Handling multiple namespaces
    4.1. Glossary of namespace terms
    4.2. The syntax of multi-namespace documents
    4.3. Namespace maps
5. Creating a new XML document
6. Modifying an existing XML document
7. Features of the etree module
    7.1. The Comment() constructor
    7.2. The Element() constructor
    7.3. The ElementTree() constructor
    7.4. The fromstring() function: Create an element from a string
    7.5. The parse() function: build an ElementTree from a file
    7.6. The ProcessingInstruction() constructor
    7.7. The QName() constructor
    7.8. The SubElement() constructor
    7.9. The tostring() function: Serialize as XML
    7.10. The XMLID() function: Convert text to XML with a dictionary of id values
8. class ElementTree: A complete XML document
    8.1. ElementTree.find()
    8.2. ElementTree.findall(): Find matching elements
    8.3. ElementTree.findtext(): Retrieve the text content from an element
    8.4. ElementTree.getiterator(): Make an iterator
    8.5. ElementTree.getroot(): Find the root element
New Mexico Tech Computer Center Python XML processing with lxml 1
    8.6. ElementTree.xpath(): Evaluate an XPath expression
    8.7. ElementTree.write(): Translate back to XML
9. class Element: One element in the tree
    9.1. Attributes of an Element instance
    9.2. Accessing the list of child elements
    9.3. Element.append(): Add a new element child
    9.4. Element.clear(): Make an element empty
    9.5. Element.find(): Find a matching sub-element
    9.6. Element.findall(): Find all matching sub-elements
    9.7. Element.findtext(): Extract text content
    9.8. Element.get(): Retrieve an attribute value with defaulting
    9.9. Element.getchildren(): Get element children
    9.10. Element.getiterator(): Make an iterator to walk a subtree
    9.11. Element.getroottree(): Find the ElementTree containing this element
    9.12. Element.insert(): Insert a new child element
    9.13. Element.items(): Produce attribute names and values
    9.14. Element.iterancestors(): Find an element's ancestors
    9.15. Element.iterchildren(): Find all children
    9.16. Element.iterdescendants(): Find all descendants
    9.17. Element.itersiblings(): Find other children of the same parent
    9.18. Element.keys(): Find all attribute names
    9.19. Element.remove(): Remove a child element
    9.20. Element.set(): Set an attribute value
    9.21. Element.xpath(): Evaluate an XPath expression
10. XPath processing
    10.1. An XPath example
11. The art of Web-scraping: Parsing HTML with Beautiful Soup
12. Automated validation of input files
    12.1. Validation with a Relax NG schema
    12.2. Validation with an XSchema (XSD) schema
13. etbuilder.py: A simplified XML builder module
    13.1. Using the etbuilder module
    13.2. CLASS(): Adding class attributes
    13.3. FOR(): Adding for attributes
    13.4. subElement(): Adding a child element
    13.5. addText(): Adding text content to an element
14. Implementation of etbuilder
    14.1. Features differing from Lundh's original
    14.2. Prologue
    14.3. CLASS(): Helper function for adding CSS class attributes
    14.4. FOR(): Helper function for adding XHTML for attributes
    14.5. subElement(): Add a child element
    14.6. addText(): Add text content to an element
    14.7. class ElementMaker: The factory class
    14.8. ElementMaker.__init__(): Constructor
    14.9. ElementMaker.__call__(): Handle calls to the factory instance
    14.10. ElementMaker.__handleArg(): Process one positional argument
    14.11. ElementMaker.__getattr__(): Handle arbitrary method calls
    14.12. Epilogue
    14.13. testetbuilder: A test driver for etbuilder
15. rnc_validate: A module to validate XML against a Relax NG schema
    15.1. Design of the rnc_validate module
    15.2. Interface to the rnc_validate module
    15.3. rnc_validate.py: Prologue
    15.4. RelaxException
    15.5. class RelaxValidator
    15.6. RelaxValidator.validate()
    15.7. RelaxValidator.__init__(): Constructor
    15.8. RelaxValidator.__makeRNG(): Find or create an .rng file
    15.9. RelaxValidator.__getModTime(): When was this file last changed?
    15.10. RelaxValidator.__trang(): Translate .rnc to .rng format
16. rnck: A standalone script to validate XML against a Relax NG schema
    16.1. rnck: Prologue
    16.2. rnck: main()
    16.3. rnck: checkArgs()
    16.4. rnck: usage()
    16.5. rnck: fatal()
    16.6. rnck: message()
    16.7. rnck: validateFile()
    16.8. rnck: Epilogue
2. How ElementTree represents XML
<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>
The above diagram shows the conceptual structure of the XML. The lxml view of an XML document,
by contrast, builds a tree of only one node type: the Element.
The main difference between the ElementTree view used in lxml and the classical view is the
association of text with elements, which lxml handles very differently.
An instance of lxml's Element class contains these attributes:
.tag
The name of the element, such as "p" for a paragraph or "em" for emphasis.
.text
The text inside the element, if any, up to the first child element. This attribute is None if the element
is empty or has no text before the first child element.
.tail
The text following the element. This is the most unusual departure. In the DOM model, any text
following an element E is associated with the parent of E; in lxml, that text is considered the tail
of E.
.attrib
A Python dictionary containing the element's XML attribute names and their corresponding values.
For example, for the element <h2 class="arch" id="N15">, that element's .attrib would
be the dictionary {"class": "arch", "id": "N15"}.
(element children)
To access sub-elements, treat an element as a list. For example, if node is an Element instance,
node[0] is the first sub-element of node. If node doesn't have any sub-elements, this operation
will raise an IndexError exception.
You can find out the number of sub-elements using the len() function. For example, if node has
five children, len(node) will return a value of 5.
One advantage of the lxml view is that a tree is now made of only one type of node: each node is an
Element instance. Here is our XML fragment again, and a picture of its representation in lxml.
<p>To find out <em>more</em>, see the
<a href="http://www.w3.org/XML">standard</a>.</p>
Notice that in the lxml view, the text ", see the\n" (which includes the newline) is contained in
the .tail attribute of the em element, not associated with the p element as it would be in the DOM
view. Also, the "." at the end of the paragraph is in the .tail attribute of the a (link) element.
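The placement of text in .text and .tail can be checked directly. The following is a minimal sketch, written for Python 3, that parses the same fragment; the variable names are illustrative, and the import falls back to the standard library's xml.etree.ElementTree (which shares these attributes) only so that the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

SOURCE = ('<p>To find out <em>more</em>, see the\n'
          '<a href="http://www.w3.org/XML">standard</a>.</p>')

p = etree.fromstring(SOURCE)
em, a = p[0], p[1]            # children are reached by list indexing

print(repr(p.text))           # text before the first child of p
print(repr(em.text), repr(em.tail))   # em's content, then the text following em
print(repr(a.text), repr(a.tail))     # a's content, then the final "."
print(a.attrib["href"], len(p))
```

Note that p itself has no tail here, because nothing follows its end tag.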
Now that you know how XML is represented in lxml, there are three general application areas.
Section 3, Reading an XML document (p. 5).
Section 5, Creating a new XML document (p. 9).
Section 6, Modifying an existing XML document (p. 10).
3. Reading an XML document
2. Typically your XML document will be in a file somewhere. Suppose your file is named test.xml;
to read the document, you might say something like:
doc = etree.parse('test.xml')
The returned value doc is an instance of the ElementTree class that represents your XML document
in tree form.
Once you have your document in this form, refer to Section 8, class ElementTree: A complete
XML document (p. 17) to learn how to navigate around the tree and extract the various parts of its
structure.
For other methods of creating an ElementTree, refer to Section 7, Features of the etree
module (p. 10).
4. Handling multiple namespaces
A namespace in XML is a collection of element and attribute names. For example, in the XHTML namespace
we find element names like body, link and h1, and attribute names like href and align.
For simple documents, all the element and attribute names in a single document may be in the same namespace.
In general, however, an XML document may include element and attribute names from many namespaces.
See Section 4.1, Glossary of namespace terms (p. 6) to familiarize yourself with the terminology.
Section 4.2, The syntax of multi-namespace documents (p. 7) discusses how namespaces are
represented in an XML file.
Note
The W3C Recommendation Namespaces in XML 1.0 (http://www.w3.org/TR/xml-names/) prefers the
term namespace name for the more widely used NSURI.
For example, here is the NSURI that identifies the XHTML 1.0 Strict dialect of XHTML:
http://www.w3.org/1999/xhtml
For example, many XHTML pages use a blank namespace because all the names are in the same
namespace and because browsers don't need the NSURI in order to display them correctly.
"{NSURI}name"
For example, when a properly constructed XHTML 1.0 Strict document is parsed into an ElementTree,
the .tag attribute of the document's root element will be:
"{http://www.w3.org/1999/xhtml}html"
Note
Clark notation does not actually appear in the XML source file. It is employed only within the
ElementTree representation of the document.
For element and attribute names in the blank namespace, the Clark notation is just the name without
the {NSURI} prefix.
4.1.5. Ancestor
The ancestors of an element include its immediate parent, its parent's parent, and so forth up to the root
of the tree. The root node has no ancestors.
4.1.6. Descendant
The descendants of an element include its direct children, its children's children, and so on out to the
leaves of the document tree.
<fo:inline font-style='italic' font-family='sans-serif'>
<xsl:copy-of select="$content"/>
</fo:inline>
The inline element is in the XSL-FO namespace, which in this document uses the namespace prefix
fo:. The copy-of element is in the XSLT namespace, whose prefix is xsl:.
Within your document, you must define the NSURI corresponding to each namespace prefix. This can
be done in multiple ways.
Any element may contain an attribute of the form xmlns:P="NSURI", where P is the namespace
prefix for that NSURI.
Any element may contain an attribute of the form xmlns="NSURI". This defines the NSURI associated
with the blank namespace.
If an element does not carry a namespace prefix, it belongs to the blank namespace declared by the
closest xmlns="NSURI" attribute on the element itself or on one of its ancestors, if there is one.
An attribute without a prefix is not in any namespace.
Certain attributes may occur anywhere in any document in the xml: namespace, which is always
defined.
For example, any element may carry a xml:id attribute that serves to identify a unique element
within the document.
Here is a small complete XHTML file with all the decorations recommended by the W3C organization:
The xmlns attribute of the html element specifies that all its descendant elements are in the XHTML
1.0 Strict namespace.
The xml:lang="en" attribute specifies that the document is in English.
Here is a more elaborate example. This is the root element of an XSLT stylesheet. Prefix xsl: is used
for the XSLT elements; prefix fo: is used for the XSL-FO elements; and a third namespace with prefix
date: is also included. This document does not use a blank namespace.
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:date="http://exslt.org/dates-and-times">
4.3. Namespace maps
Namespace maps are used in several roles.
When reading an XML file with multiple namespaces, you can use a namespace map in the process
of searching for and retrieving elements and attributes from an ElementTree. See, for example,
Section 9.5, Element.find(): Find a matching sub-element (p. 21).
When creating a new XML document that has elements in multiple namespaces, you can use a
namespace map to specify what namespace prefixes will appear when the ElementTree is serialized
to XML form. See Section 7.2, The Element() constructor (p. 11) and Section 7.8, The
SubElement() constructor (p. 15) for particulars.
For example, at the end of Section 4.2, The syntax of multi-namespace documents (p. 7) there is an
xsl:stylesheet start tag that defines xsl: as the prefix for the XSLT namespace, fo: for the XSL-
FO namespace, and date: for a date-and-time extension package. Here is a namespace map that describes
those same relationships of prefixes to NSURIs:
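A namespace map of this kind is an ordinary Python dictionary from prefix to NSURI. The sketch below uses the three NSURIs from the xsl:stylesheet start tag in Section 4.2; the helper function clark() is a hypothetical convenience added for illustration and is not part of lxml.

```python
# Namespace map for the xsl:stylesheet start tag: prefix -> NSURI.
NS_MAP = {
    "xsl": "http://www.w3.org/1999/XSL/Transform",
    "fo": "http://www.w3.org/1999/XSL/Format",
    "date": "http://exslt.org/dates-and-times",
}


def clark(prefix, local):
    """Return the Clark-notation name "{NSURI}local" for a prefix in NS_MAP."""
    return "{%s}%s" % (NS_MAP[prefix], local)


print(clark("xsl", "stylesheet"))
print(clark("fo", "inline"))
```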
To define the NSURI of the blank namespace, use an entry whose key is None. For example, this
namespace map would define elements without a namespace prefix as belonging to XHTML, and elements
with namespace prefix xl: as belonging to the XLink namespace (http://en.wikipedia.org/wiki/XLink):
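As a minimal sketch of such a map in use (this one requires lxml itself, because the nsmap argument is an lxml feature), the code below builds a small tree and serializes it. The element and attribute chosen here are assumptions made for illustration; the XLink NSURI is the standard one.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
XLINK_NS = "http://www.w3.org/1999/xlink"

# None is the key for the blank namespace; "xl" is the XLink prefix.
NS_MAP = {None: XHTML_NS, "xl": XLINK_NS}

# Names are given in Clark notation; the map decides how they are written out.
root = etree.Element("{%s}html" % XHTML_NS, nsmap=NS_MAP)
link = etree.SubElement(root, "{%s}a" % XHTML_NS)
link.set("{%s}href" % XLINK_NS, "http://www.nmt.edu/")

xml = etree.tostring(root, encoding="unicode")
print(xml)
```

In the serialized form the XHTML elements carry no prefix, while the XLink attribute appears as xl:href.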
5. Creating a new XML document
2. Create the root element. For example, suppose you're creating a Web page; the root element is html.
Use the etree.Element() constructor to build that element.
page = etree.Element('html')
3. Next, use the etree.ElementTree() constructor to make a new document tree, using our html
element as its root:
doc = etree.ElementTree(page)
4. The etree.SubElement() constructor is perfect for adding new child elements to our document.
Here's the code to add a head element, and then a body element, as new children of the html
element:
5. Your page will need a title element child under the head element. Add text to this element by
storing a string in its .text attribute:
6. To supply attribute values, use keyword arguments to the SubElement() constructor. For example,
suppose you want a stylesheet link inside the head element that looks like this:
7. Continue building your new document using the various functions described in Section 7, Features
of the etree module (p. 10) and Section 9, class Element: One element in the tree (p. 19).
8. When the document is completely built, write it to a file using the ElementTree instance's .write()
method, which takes a file argument.
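The steps above can be sketched end to end as follows. This is an illustration for Python 3, not the author's listing: the title text, the stylesheet attributes (rel, href, type) and the output file name are assumptions. The import falls back to the standard library's ElementTree, which supports the same calls, so that the sketch runs even without lxml.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

page = etree.Element("html")               # step 2: the root element
doc = etree.ElementTree(page)              # step 3: the document tree
head = etree.SubElement(page, "head")      # step 4: children of html
body = etree.SubElement(page, "body")
title = etree.SubElement(head, "title")    # step 5: text goes in .text
title.text = "Your page title here"
# Step 6: keyword arguments become XML attributes (values are hypothetical).
link = etree.SubElement(head, "link", rel="stylesheet",
                        href="mystyle.css", type="text/css")

# Step 8: write the finished document to a file, then read it back.
with tempfile.TemporaryDirectory() as out_dir:
    out_path = os.path.join(out_dir, "page.html")
    doc.write(out_path)
    with open(out_path, "rb") as handle:
        written = handle.read()

print(written.decode("ascii"))
```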
6. Modifying an existing XML document
linkNode.attrib['href'] = 'http://www.nmt.edu/'
3. Finally, write the document back out to a file as described in Section 5, Creating a new XML
document (p. 9).
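A minimal sketch of the read, modify, write cycle follows. The input document and the old href value are invented for illustration, and in-memory byte streams stand in for real files so that the example is self-contained; the import falls back to the standard library's ElementTree, which offers the same calls.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

# Stand-in for an existing file on disk (hypothetical content).
source = io.BytesIO(
    b'<html><body><a href="http://old.example/">NMT</a></body></html>')

doc = etree.parse(source)                 # read the document
linkNode = doc.find(".//a")               # locate the element to change
linkNode.attrib['href'] = 'http://www.nmt.edu/'   # modify it in place

out = io.BytesIO()                        # stand-in for the output file
doc.write(out)                            # write the document back out
print(out.getvalue().decode("ascii"))
```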
7. Features of the etree module
7.1. The Comment() constructor
etree.Comment(text=None)
text
The text to be placed within the comment. When serialized back into XML form, this text will be
preceded by <!-- and followed by -->. Current lxml releases write the text exactly as supplied,
without padding, so include any leading or trailing space in the text yourself.
The return value is an instance of the Comment class. Use the .append() method on the parent element
to place the comment into your document.
For example, suppose bodyElt is an HTML body element. To add a comment under this element
containing string s, you would use this code:
newComment = etree.Comment(s)
bodyElt.append(newComment)
7.2. The Element() constructor
etree.Element(tag, attrib=None, nsmap=None, **extras)
tag
A string containing the name of the element to be created.
attrib
A dictionary containing attribute names and values to be added to the element. The default is to
have no attributes.
nsmap
If your document contains multiple XML namespaces, you can supply a namespace map that defines
the namespace prefixes you would like to use when this document is converted to XML. See
Section 4.3, Namespace maps (p. 8).
If you supply this argument, it will also apply to all descendants of the created node, unless the
descendant node supplies a different namespace map.
extras
Any keyword arguments of the form name=value that you supply to the constructor are added
to the element's attributes. For example, this code:
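As a minimal sketch of keyword arguments becoming attributes, the code below rebuilds the h2 element from Section 2. Because class is a reserved word in Python, that attribute has to be passed through the attrib dictionary, while id can be given as a keyword. The import falls back to the standard library's ElementTree, which accepts the same arguments.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

# "class" cannot be a keyword argument, so it goes in the attrib dictionary;
# id=... is an ordinary keyword argument that becomes an attribute.
h2 = etree.Element("h2", {"class": "arch"}, id="N15")

print(sorted(h2.attrib.items()))
```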
Here is an example of creation of a document with multiple namespaces using the nsmap keyword
argument.
#!/usr/bin/env python
import sys
from lxml import etree as et
HTML_NS = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
XSL_NS = "http://www.w3.org/1999/XSL/Transform"
NS_MAP = {None: HTML_NS,
          "xsl": XSL_NS}
When this root element is serialized into XML, it will look something like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <xsl:template match="/">
    <html>
      <head>
        <title>Heading title</title>
      </head>
      <body>
        <h1>Body heading</h1>
        <p>Paragraph text</p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
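One way to build a tree that serializes like the XML above is sketched here. It reuses the namespace constants from the listing, but the element-building statements are a reconstruction under that assumption, not the author's original code. It requires lxml, since nsmap and pretty_print are lxml features.

```python
from lxml import etree as et

HTML_NS = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
XSL_NS = "http://www.w3.org/1999/XSL/Transform"
NS_MAP = {None: HTML_NS,
          "xsl": XSL_NS}

# The root carries the namespace map; descendants inherit it.
root = et.Element(et.QName(XSL_NS, "stylesheet"), nsmap=NS_MAP)
template = et.SubElement(root, et.QName(XSL_NS, "template"), match="/")

# Elements in the blank (HTML) namespace are written without a prefix.
html = et.SubElement(template, et.QName(HTML_NS, "html"))
head = et.SubElement(html, et.QName(HTML_NS, "head"))
et.SubElement(head, et.QName(HTML_NS, "title")).text = "Heading title"
body = et.SubElement(html, et.QName(HTML_NS, "body"))
et.SubElement(body, et.QName(HTML_NS, "h1")).text = "Body heading"
et.SubElement(body, et.QName(HTML_NS, "p")).text = "Paragraph text"

xml = et.tostring(root, pretty_print=True, encoding="unicode")
print(xml)
```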
There is one minor pathology of this constructor. If you pass in a pre-constructed dictionary as the
attrib argument, and you also supply keyword arguments, the values of the keyword arguments will
be added into that dictionary as if you had used the .update() method on the attrib dictionary.
Here is a conversational example showing this side effect:
7.3. The ElementTree() constructor
etree.ElementTree(element=None, file=None)
element
An Element instance to be used as the root element.
file
To construct an ElementTree that represents an existing file, pass either a readable file object,
or a string containing the name of the file. Do not also use the element argument; if you do, the file
argument will be ignored.
For example, to transform a file named balrog.xml into an ElementTree, use this statement:
balrogTree = etree.ElementTree(file='balrog.xml')
>>> try:
...     bad = etree.fromstring("<a>\n<<oops>\n</a>")
... except etree.XMLSyntaxError as err:
...     detail = err
...
>>> detail
<etree.XMLSyntaxError instance at 0xb7eba10c>
>>> detail.error_log
<string>:2:FATAL:PARSER:ERR_NAME_REQUIRED: StartTag: invalid element name
<string>:3:FATAL:PARSER:ERR_TAG_NAME_MISMATCH: Opening and ending tag mismatch: oops line 2 and a
<string>:3:FATAL:PARSER:ERR_TAG_NOT_FINISHED: Premature end of data in tag a line 1
>>>
7.4. The fromstring() function: Create an element from a string
etree.fromstring(s)
where s is a string.
Here's an example:
>>> milne = '''<monster name="Heffalump">
... <trail>Woozle</trail>
... <eeyore mood="boggy"/>
... </monster>'''
>>> doc = etree.fromstring(milne)
>>> print(etree.tostring(doc, encoding="unicode"))
<monster name="Heffalump">
<trail>Woozle</trail>
<eeyore mood="boggy"/>
</monster>
>>>
7.5. The parse() function: build an ElementTree from a file
etree.parse(source)
where source is the name of the file, or a file object containing the XML. If the file is well-formed,
the function returns an ElementTree instance.
Exceptions raised include:
IOError
The file is nonexistent or not readable.
etree.XMLSyntaxError
The file is readable, but does not contain well-formed XML. The returned exception contains an
.error_log attribute that you can print to see where the error occurred. For an example of the
display of the error_log, see Section 7.3, The ElementTree() constructor (p. 12).
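The two failure modes above can be handled separately, as in the following sketch for Python 3 (where IOError is an alias of OSError). The helper name load() and the missing file name are assumptions made for illustration; in-memory byte streams stand in for real files. It requires lxml, because XMLSyntaxError and error_log are lxml features.

```python
import io

from lxml import etree


def load(source):
    """Return an ElementTree for source, or None if it cannot be read or parsed."""
    try:
        return etree.parse(source)
    except etree.XMLSyntaxError as err:
        # The log lists each problem with its line number.
        print("not well-formed:")
        print(err.error_log)
    except IOError as err:
        print("cannot read file:", err)
    return None


good = load(io.BytesIO(b"<a><b/></a>"))
bad = load(io.BytesIO(b"<a>\n<<oops>\n</a>"))
missing = load("pylxml-no-such-file.xml")   # assumed not to exist
```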
7.6. The ProcessingInstruction() constructor
etree.ProcessingInstruction(target, text=None)
target
A string containing the target portion of the processing instruction.
text
An optional string containing the rest of the processing instruction. The default value is empty.
Here's an example:
pi = etree.ProcessingInstruction('decor', 'danish,modern,ducksOnWall')
When converted back to XML, this processing instruction would look like this:
<?decor danish,modern,ducksOnWall?>
7.7. The QName() constructor
Although it is not legal in XML element names, there is a convention called Clark notation (after
James Clark) that combines the namespace URI and the local name in a string of this form:
{nsURI}local
etree.QName(text, tag=None)
If the fully qualified element name is already in Clark notation, call the QName constructor with this
argument alone.
If you would like to pass the namespace URI and the local name separately, call QName with the
namespace URI as the text argument, and the local name as the tag argument.
Here are two examples for creating a QName instance representing a qualified name in the XSLT
namespace with a local name of template:
In Clark notation:
qn = etree.QName("{http://www.w3.org/1999/XSL/Transform}template")
With the namespace URI and the local name supplied separately:
qn = etree.QName("http://www.w3.org/1999/XSL/Transform", "template")
7.8. The SubElement() constructor
etree.SubElement(parent, tag, attrib=None, nsmap=None, **extras)
The first argument, parent, is the Element instance under which the newly created Element instance
is to be added as its next child. The tag, attrib, nsmap, and **extras arguments work exactly the
same as they do in the call to Element() described in Section 7.2, The Element() constructor (p. 11).
The return value is the newly constructed Element.
Here's an example. Suppose you want to build this XML:
Here's the code to build it, and then display it, interactively:
</county></state>
>>>
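7.9. The tostring() function: Serialize as XML
To convert an Element and its content back to XML, call etree.tostring(elt), where elt is an Element
instance. The function returns a string containing the XML (a bytes object under Python 3). For an
example, see Section 7.8, The SubElement() constructor (p. 15).
If you set the optional pretty_print argument to True, the function will attempt to insert line breaks
to keep line lengths short where possible.
To output Unicode, use the keyword argument encoding=unicode (under Python 3, use encoding=str
or encoding="unicode").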
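The effect of these arguments can be seen in a short sketch for Python 3. The element names are borrowed from the example in Section 7.4; the tree itself is built here only for illustration. It requires lxml, since pretty_print is an lxml feature.

```python
from lxml import etree

monster = etree.Element("monster", name="Heffalump")
etree.SubElement(monster, "trail").text = "Woozle"

# Default: a compact byte string with no added whitespace.
compact = etree.tostring(monster)

# pretty_print adds line breaks and indentation; encoding="unicode"
# returns a text string instead of bytes.
pretty = etree.tostring(monster, pretty_print=True, encoding="unicode")

print(compact)
print(pretty)
```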
7.10. The XMLID() function: Convert text to XML with a dictionary of id values
To convert XML in the form of a string into an Element structure, use the fromstring() function
described in Section 7.4, The fromstring() function: Create an element from a string (p. 13).
However, there is a similar function named
etree.XMLID() that does this and also provides a dictionary that allows you to find elements in a
tree by their unique id attribute values.
The XML standard stipulates that any element in any document can have an id attribute, but each value
of this attribute must be unique within the document. The intent of this feature is that applications can
refer to any element using its id value.
Here is the general form for this function:
etree.XMLID(text)
#!/usr/bin/env python
from lxml import etree
for id in sorted(idMap.keys()):
    elt = idMap[id].text or "(none)"
    print("Tag {0}, text is '{1}'".format(id, elt.strip()))
Tag Fido, text is 'Woof!'
Tag Fluff, text is 'Mao?'
Tag ZR, text is '(none)'
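A complete, runnable version of such a script might look like the sketch below. The SOURCE document is an assumption chosen to reproduce the three output lines above; only the id values and text come from that output. XMLID() returns a tuple of the root element and the id dictionary. The import falls back to the standard library's ElementTree, which also provides XMLID().

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

# Hypothetical input; the element names are illustrative.
SOURCE = '''<dog id="Fido">
Woof!
<cat id="Fluff">Mao?</cat>
<rhino id="ZR"/>
</dog>'''

tree, idMap = etree.XMLID(SOURCE)   # root element, {id value: element}

for key in sorted(idMap.keys()):
    text = idMap[key].text or "(none)"
    print("Tag {0}, text is '{1}'".format(key, text.strip()))
```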
8. class ElementTree: A complete XML document
8.1. ElementTree.find()
ET.find(path[, namespaces=D])
This method is used to find a specific single element in the document. It is essentially equivalent to
calling the .find() method on the document's root element; see Section 9.5, Element.find(): Find
a matching sub-element (p. 21).
For example, if doc is an ElementTree instance, this call:
doc.find('h1')
is equivalent to:
doc.getroot().find('h1')
8.2. ElementTree.findall(): Find matching elements
ET.findall(path[, namespaces=N])
This method works exactly the same as calling the .findall() method on the document's root element.
See Section 9.6, Element.findall(): Find all matching sub-elements (p. 22).
8.3. ElementTree.findtext(): Retrieve the text content from an element
This method is essentially the same as calling the .findtext() method on the document's root element;
see Section 9.7, Element.findtext(): Extract text content (p. 22).
8.4. ElementTree.getiterator(): Make an iterator
ET.getiterator(tag=None)
If you omit the argument, you will get an iterator that generates every element in the tree, in document
order.
If you want to visit only tags with a certain name, pass that name as the argument.
Here are some examples. In these examples, assume that page is an ElementTree instance that contains
an XHTML page. The first example would print every tag name in the page, in document order.
The second example would look at every div element in the page, and for those that have a class
attribute, it prints those attributes.
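Both traversals are sketched below for Python 3. The sample page is invented for illustration. The sketch calls .iter(), which takes the same optional tag argument; .getiterator() is deprecated in lxml and has been removed from recent versions of the standard library's ElementTree, which the import falls back to.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for everything used below
    import xml.etree.ElementTree as etree

# Hypothetical page content.
page = etree.ElementTree(etree.fromstring(
    '<html><body>'
    '<div class="nav"><p>Hi</p></div>'
    '<div><p>Bye</p></div>'
    '</body></html>'))

# First example: every tag name in the page, in document order.
tags = [elt.tag for elt in page.iter()]
print(tags)

# Second example: the class attribute of each div that has one.
classes = [div.attrib["class"] for div in page.iter("div")
           if "class" in div.attrib]
print(classes)
```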
8.5. ElementTree.getroot(): Find the root element
ET.getroot()
The return value will normally be the Element instance at the root of the tree. However, if you have
created your ElementTree instance without specifying either a root element or an input file, this
method will return None.
8.6. ElementTree.xpath(): Evaluate an XPath expression
ET.xpath(s)
This method returns the result of evaluating the XPath expression s. For a general discussion of
XPath, see Section 10, XPath processing (p. 30).
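The kind of value returned depends on the expression, as this minimal sketch suggests. The sample page and the three expressions are assumptions chosen for illustration. It requires lxml, since .xpath() is an lxml feature.

```python
from lxml import etree

# Hypothetical page content.
page = etree.ElementTree(etree.fromstring(
    '<html><body><p class="intro">One</p><p>Two</p></body></html>'))

paragraphs = page.xpath("//p")                        # list of elements
intro_text = page.xpath("//p[@class='intro']/text()")  # list of strings
count = page.xpath("count(//p)")                      # a number

print([p.text for p in paragraphs], intro_text, count)
```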
8.7. ElementTree.write(): Translate back to XML
ET.write(file, pretty_print=False)
You must supply a writeable file object, or the name of a file to be written. If you set argument
pretty_print=True, the method will attempt to fold long lines and indent the XML for legibility.
For example, if you have an ElementTree instance in a variable page containing an XHTML page,
and you want to write it to the standard output stream, this statement would do it:
import sys
page.write(sys.stdout)
Under Python 3, lxml writes bytes, so pass sys.stdout.buffer rather than sys.stdout.
9. class Element: One element in the tree
Each XML element is represented by an instance of the Element class.
See Section 9.1, Attributes of an Element instance (p. 19) for attributes of an Element instance
in the Python sense, as opposed to XML attributes.
See Section 9.2, Accessing the list of child elements (p. 19) for the various ways to access the element
children of an element.
The various methods on Element instances follow in alphabetical order, starting with Section 9.3,
Element.append(): Add a new element child (p. 20).
E[i] returns the child element of E at position i, if there is one. If there is no child element at that
position, this operation raises an IndexError exception.
E[i:j] returns a list of the child elements between positions i and j.
For example, node[2:4] returns a list containing the third and fourth children of node.
You can replace one child of an element E with a new element c using a statement of this form:
E[i] = c
If i is not the position of an existing child, this operation will raise an IndexError.
You can replace a sequence of adjacent children of an element E using slice assignment:
E[i:j] = seq
del E[i]
del E[i:j]
You can iterate over the children of an element with a for loop. For example, if node is an Element
instance, this code would print the tags of all its children:
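Here is a minimal sketch of the child-access operations described in this section: indexing, slicing, deletion, and iteration. The element names are hypothetical, and the fallback to the standard library's ElementTree is an addition of this sketch so that it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: the stdlib module behaves the same here.
    import xml.etree.ElementTree as etree

node = etree.fromstring("<mom><aaron/><betty/><clarence/><dana/></mom>")

# Indexing and slicing work as they do for Python lists.
first = node[0].tag
middle = [child.tag for child in node[1:3]]

# A position with no child raises IndexError.
try:
    node[10]
    out_of_range = False
except IndexError:
    out_of_range = True

# Delete the first child, then print the tags of the remaining children.
del node[0]
remaining = [child.tag for child in node]
for child in node:
    print(child.tag)
```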
If you need to test whether a given child node is a processing instruction or a comment, you can use
Python's built-in function isinstance(I, C), which tests whether an object I is an instance of class C
or of a subclass of C.
For instance, to test whether node is a comment, you can use this test, which returns True if node is
a comment, False otherwise.
isinstance(node, etree._Comment)
E.append(c)
You can use this method to add Comment and ProcessingInstruction instances as children of an
element, as well as Element instances.
Here is a conversational example:
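A minimal sketch of .append() in use, written as a script rather than an interactive session. The tag names and comment text are hypothetical; the fallback import is an addition of this sketch so that it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

parent = etree.Element("parent")

# Append an ordinary element child, then a comment child.
parent.append(etree.Element("child"))
parent.append(etree.Comment("a comment child"))

child_count = len(parent)
first_tag = parent[0].tag
comment_text = parent[1].text
print(etree.tostring(parent))
```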
E.find(path[, namespaces=D])
This method searches the Element and its descendants for a single element that fits the pattern described
by the path argument.
If there is exactly one matching element, this method returns that element as an Element instance.
If there are multiple matching elements, the method returns the one that appears first in document
order.
If there are no matching elements, it returns None.
The path argument is a string describing the element for which you are searching. Possible values
include:
"tag"
Find the first child element whose name is "tag".
"tag1/tag2/.../tagn"
Find the first child element whose name is tag1; then, under that child element, find its first child
named tag2; and so forth.
For example, if node is an Element instance that has an element child with a tag "county", and that
child in turn has an element child with tag "seat", this expression will return the Element
corresponding to the "seat" element:
node.find("county/seat")
The optional namespaces argument is a namespace map; see Section 4.3, Namespace maps (p. 8).
If supplied, this map is used to interpret namespace prefixes in the path argument.
For example, suppose you have an element someNode, and you want to find a child element named
roundtable in the namespace named http://example.com/mphg/, and under that you want to
find a child element named knight in the namespace named http://example.org/sirs/ns/.
This call would do it:
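A minimal sketch of such a call follows. The prefixes mp and k, the dictionary name nsd, and the sample document are all hypothetical choices made for this illustration; only the two namespace URIs come from the text above. The fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: stdlib .find() also accepts namespaces=.
    import xml.etree.ElementTree as etree

# Hypothetical document; note that its own prefixes (m, s) differ from
# the prefixes used in the search path below.
SOURCE = """<court xmlns:m="http://example.com/mphg/"
       xmlns:s="http://example.org/sirs/ns/">
  <m:roundtable><s:knight>Galahad</s:knight></m:roundtable>
</court>"""
someNode = etree.fromstring(SOURCE)

# Namespace map: prefix -> NSURI. The prefixes are local to this call.
nsd = {"mp": "http://example.com/mphg/",
       "k": "http://example.org/sirs/ns/"}
knight = someNode.find("mp:roundtable/k:knight", namespaces=nsd)
print(knight.tag, knight.text)
```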
Note that the namespace prefixes you define in this way do not need to have any particular value, or
to match the namespace prefixes that might be used for these NSURIs in some document's external form.
Warning
The namespaces keyword argument to the .find() method is available only for version 2.3.0 or later
of etree.
E.findall(path[, namespaces=N])
The way that the path argument describes the desired set of nodes works the same way as the path
argument described in Section 9.5, Element.find(): Find a matching sub-element (p. 21).
For example, if an article element named root has zero or more children named section, this call
would set sectionList to a list containing Element instances representing those children.
sectionList = root.findall('section')
The optional namespaces keyword argument allows you to specify a namespace map. If supplied,
this namespace map is used to interpret namespace prefixes in the path; see Section 9.5,
Element.find(): Find a matching sub-element (p. 21) for details.
Warning
The namespaces keyword argument is available only since release 2.3.0 of lxml.etree.
The path argument specifies the desired element in the same way as does the path argument in
Section 9.5, Element.find(): Find a matching sub-element (p. 21).
If any descendants of E exist that match the given path, this method returns the text content of the
first matching element.
If there is at least one matching element but it has no text content, the returned value will be the
empty string.
If no elements match the specified path, the method will return None, or the value of the default=
keyword argument if you provided one.
Here's a conversational example.
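The sketch below walks through the three cases just described: a match with text, a match with no text, and no match at all. The element names are hypothetical, and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<page><title>Owls</title><subtitle/></page>")

with_text = root.findtext("title")                     # matching element has text
no_text = root.findtext("subtitle")                    # matches, but empty
missing = root.findtext("author")                      # nothing matches
fallback = root.findtext("author", default="Unknown")  # nothing matches, default given
print(with_text, repr(no_text), missing, fallback)
```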
The optional namespaces keyword argument allows you to specify namespace prefixes for
multi-namespace documents; for details, see Section 9.5, Element.find(): Find a matching
sub-element (p. 21).
E.get(key, default=None)
The key argument is the name of the attribute whose value you want to retrieve.
If E has an attribute by that name, the method returns that attribute's value as a string.
If E has no such attribute, the method returns the default argument, which itself has a default value
of None.
Here's an example:
None
>>> print node.get('source', 'Unknown')
Unknown
>>>
E.getchildren()
Here's an example:
E.getiterator(tag=None)
If you omit the argument, you will get an iterator that visits E first, then all its element children and
their children, in a preorder traversal of that subtree.
If you want to visit only elements with a certain tag name, pass the desired tag name as the argument.
Preorder traversal of a tree means that we visit the root first, then the subtrees from left to right (that
is, in document order). This is also called a depth-first traversal: we visit the root, then its first child,
then its first child's first child, and so on until we run out of descendants. Then we move back up to the
last element with more children, and repeat.
Here is an example showing the traversal of an entire tree. First, a diagram showing the tree structure:
A preorder traversal of this tree goes in this order: a, b, c, d, e.
Note in the above example that the iterator visits the Verdin element even though it is not a direct child
of the root element.
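A minimal sketch of a preorder traversal follows. The tree shape is an assumption made for this illustration (a has children b and e, and b has children c and d); it is one of several shapes that yield the order a, b, c, d, e. The sketch uses .iter(), which newer lxml releases recommend in place of .getiterator(), and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

# Hypothetical tree: a -> (b -> (c, d), e)
root = etree.fromstring("<a><b><c/><d/></b><e/></a>")

# With no argument, the iterator visits the root and every descendant.
order = [elt.tag for elt in root.iter()]
print(order)

# With a tag name, only matching elements are visited, at any depth.
only_d = [elt.tag for elt in root.iter("d")]
print(only_d)
```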
E.getroottree()
9.12. Element.insert(): Insert a new child element
Use the .insert() method on an Element instance E to add a new element child elt in an arbitrary
position. (To append a new element child at the last position, see Section 9.3, Element.append():
Add a new element child (p. 20).)
E.insert(index, elt)
The index argument specifies the position into which element elt is inserted. For example, if you
specify index 0, the new child will be inserted before any other children of E.
The lxml module is quite permissive about the values of the index argument: if it is negative, or
greater than the position of the last existing child, the new child is added after all existing children.
Here is an example showing insertions at positions 0 and 2.
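A minimal sketch of insertions at positions 0 and 2, using hypothetical tag names. The fallback import is an addition of this sketch so that it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

node = etree.fromstring("<parent><a/><b/></parent>")

# Position 0: the new child goes before all existing children.
node.insert(0, etree.Element("first"))

# Position 2: the new child goes between a and b.
node.insert(2, etree.Element("third"))

tags = [child.tag for child in node]
print(tags)
```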
E.iterancestors(tag=None)
If you omit the argument, the iterator will visit all ancestors. If you wish to visit only ancestors with a
specific tag name, pass that tag name as an argument.
Examples:
>>> xml = '''<class sci='Aves' eng='Birds'>
... <order sci='Strigiformes' eng='Owls'>
... <family sci='Tytonidae' eng='Barn-Owls'>
... <genus sci='Tyto'>
... <species sci='Tyto alba' eng='Barn Owl'/>
... </genus>
... </family>
... </order>
... </class>'''
>>> root = etree.fromstring(xml)
>>> barney = root.xpath('//species')[0]
>>> print "%s: %s" % (barney.get('sci'), barney.get('eng'))
Tyto alba: Barn Owl
>>> for ancestor in barney.iterancestors():
...     print ancestor.tag,
genus family order class
>>> for fam in barney.iterancestors('family'):
...     print "%s: %s" % (fam.get('sci'), fam.get('eng'))
Tytonidae: Barn-Owls
E.iterchildren(reversed=False, tag=None)
Normally, the resulting iterator will visit the children in document order. However, if you pass
reversed=True, it will visit them in the opposite order.
If you want the iterator to visit only children with a specific name N, pass an argument tag=N.
Example:
>>> root=etree.fromstring("<mom><aaron/><betty/><clarence/><dana/></mom>")
>>> for kid in root.iterchildren():
...     print kid.tag
...
aaron
betty
clarence
dana
>>> for kid in root.iterchildren(reversed=True):
...     print kid.tag
...
dana
clarence
betty
aaron
>>>
For an Element instance E, this method returns an iterator that visits all of E's descendants in document
order.
E.iterdescendants(tag=None)
If you want the iterator to visit only elements with a specific tag name N, pass an argument tag=N.
Example:
E.itersiblings(preceding=False)
If the preceding argument is false, the iterator will visit the siblings following E in document order.
If you pass preceding=True, the iterator will visit the siblings that precede E in document order.
Example:
>>> root=etree.fromstring(
...     "<mom><aaron/><betty/><clarence/><dana/></mom>")
>>> betty=root.find('betty')
>>> for sib in betty.itersiblings(preceding=True):
...     print sib.tag
...
aaron
>>> for sib in betty.itersiblings():
...     print sib.tag
...
clarence
dana
>>>
E.keys()
Here's an example:
E.remove(C)
E.set(A, V)
Here's an example.
This method is one of two ways to create or change an attribute value. The other method is to store
values into the .attrib dictionary of the Element instance.
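The two ways of setting an attribute can be sketched as follows. The element and attribute names are hypothetical, and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

node = etree.Element("div")

# First way: the .set() method.
node.set("class", "warning")

# Second way: store into the .attrib dictionary.
node.attrib["id"] = "notice-1"

# .keys() lists the attribute names; .get() retrieves a value.
names = sorted(node.keys())
class_value = node.get("class")
print(names, class_value)
```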
For a general discussion of the use of XPath, see Section 10, XPath processing (p. 30).
s
An XPath expression to be evaluated.
N
A namespace map that relates namespace prefixes to NSURIs; see Section 4.3, Namespace
maps (p. 8). The namespace map is used to interpret namespace prefixes in the XPath expression.
var=value
You may use additional keyword arguments to define the values of XPath variables to be used in
the evaluation of s. For example, if you pass an argument count=17, the value of variable $count
in the XPath expression will be 17.
The returned value may be any of:
A list of zero or more selected Element instances.
A Python bool value for true/false tests.
A Python float value for numeric results.
A string for string results.
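The four kinds of return value, and the use of an XPath variable, can be sketched as below. The document, the variable name low, and the helper demo_xpath() are hypothetical. Because .xpath() is specific to lxml, this sketch has no standard-library fallback; the guard around the import is an addition so that the script degrades quietly when lxml is absent.

```python
try:
    from lxml import etree
except ImportError:
    etree = None  # .xpath() needs lxml; there is no stdlib substitute

def demo_xpath():
    """Evaluate one XPath expression for each kind of result."""
    root = etree.fromstring(
        "<list><item n='1'>a</item><item n='2'>b</item>"
        "<item n='3'>c</item></list>")
    return {
        # A list of Element instances; $low is bound by the keyword argument.
        "elements": [e.get("n") for e in root.xpath("item[@n > $low]", low=1)],
        # A bool for a true/false test.
        "boolean": root.xpath("boolean(item[@n = 2])"),
        # A float for a numeric result.
        "number": root.xpath("count(item)"),
        # A string for a string result.
        "string": root.xpath("string(item[1])"),
    }

results = demo_xpath() if etree is not None else None
print(results)
```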
For further information on lxml's XPath features, see XML Path Language (XPath) [13].
descendant-or-self::text()
The descendant-or-self:: is an axis selector that limits the search to the context node, its children,
their children, and so on out to the leaves of the tree. The text() function selects only text nodes,
discarding any elements, comments, and other non-textual content. The return value is a list of strings.
Here's an example of this expression in practice.
>>> node=etree.fromstring('''<a>
... a-text <b>b-text</b> b-tail <c>c-text</c> c-tail
... </a>''')
>>> alltext = node.xpath('descendant-or-self::text()')
>>> alltext
['\n a-text ', 'b-text', ' b-tail ', 'c-text', ' c-tail\n']
>>> clump = "".join(alltext)
>>> clump
'\n a-text b-text b-tail c-text c-tail\n'
>>>
[13] http://www.w3.org/TR/xpath
[14] http://en.wikipedia.org/wiki/Web_scraping
[15] http://en.wikipedia.org/wiki/Tag_soup
[16] http://lxml.de/elementsoup.html
There are two functions in this module.
soupparser.parse(input)
The input argument specifies a Web page's HTML source as either a file name or a file-like object.
The return value is an ElementTree instance whose root element is an html element as an Element
instance.
soupparser.fromstring(s)
The s argument is a string containing some tag soup. The return value is a tree of nodes representing
s. The root node of this tree will always be an html element as an Element instance.
Once you have the schema available as an .rng file, use these steps to validate an element tree ET.
1. Parse the .rng file into its own ElementTree, as described in Section 7.3, The ElementTree()
constructor (p. 12).
2. Use the constructor etree.RelaxNG(S) to convert that tree into a schema instance, where S
is the ElementTree instance, containing the schema, from the previous step.
If the tree is not a valid Relax NG schema, the constructor will raise an
etree.RelaxNGParseError exception.
3. Use the .validate(ET) method of the schema instance to validate ET.
This method returns 1 if ET validates against the schema, or 0 if it does not.
If the method returns 0, the schema instance has an attribute named .error_log containing all
the errors detected by the schema instance. You can print .error_log.last_error to see the
most recent error detected.
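The three steps above can be sketched as follows. The tiny schema, the sample documents, and the helper validate_against() are hypothetical, and the schema is read from an in-memory buffer rather than an .rng file to keep the sketch self-contained. Relax NG validation is specific to lxml, so the guarded import (an addition of this sketch) simply skips the demonstration when lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    etree = None  # Relax NG validation needs lxml

# Hypothetical schema in .rng (XML) syntax: one greeting element with text.
RNG_SCHEMA = b"""<element name="greeting"
    xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>"""

def validate_against(schema_bytes, doc_bytes):
    """Return (is_valid, last_error) for one document."""
    # Step 1: parse the schema into its own ElementTree.
    schema_tree = etree.parse(io.BytesIO(schema_bytes))
    # Step 2: convert that tree into a schema instance.
    schema = etree.RelaxNG(schema_tree)
    # Step 3: validate the document tree against the schema.
    doc = etree.parse(io.BytesIO(doc_bytes))
    ok = bool(schema.validate(doc))
    return ok, (None if ok else schema.error_log.last_error)

if etree is not None:
    results = [validate_against(RNG_SCHEMA, b"<greeting>hi</greeting>")[0],
               validate_against(RNG_SCHEMA, b"<farewell/>")[0]]
else:
    results = None
print(results)
```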
Presented later in this document are two examples of the use of this validation technique:
Section 15, rnc_validate: A module to validate XML against a Relax NG schema (p. 45).
Section 16, rnck: A standalone script to validate XML against a Relax NG schema (p. 52).
mainTitle = et.Element('h1')
mainTitle.text = "Welcome to Your Title Here!"
The brilliant and productive Fredrik Lundh has written a very nice module called builder.py that
makes building XML a lot easier.
See Lundh's original page, An ElementTree Builder [18], for an older version of his module, with
documentation and examples.
You may wish to use the current version of builder.py from Lundh's SVN repository page [19].
The author has written a modified version based heavily on Lundh's version. The source for this
etbuilder.py module is available online [20].
For the instructions for use of the author's version, see Section 13.1, Using the etbuilder
module (p. 33).
For the actual implementation in lightweight literate programming form [21], see Section 14,
Implementation of etbuilder (p. 36).
[18] http://effbot.org/zone/element-builder.htm
[19] http://svn.effbot.org/public/stuff/sandbox/elementlib/
[20] http://www.nmt.edu/tcc/help/pubs/pylxml/etbuilder.py
[21] http://www.nmt.edu/~shipman/soft/litprog/
E(tag, *p, **kw)
The first argument, tag, is the element's name as a string. The return value is a new et.Element
instance.
You can supply any number of positional arguments p, followed by any number of keyword arguments.
The interpretation of each argument depends on its type. The displays with >>> prompts are interactive
examples.
Any keyword argument of the form name=value becomes an XML attribute name='value'
of the new element.
An argument of type int is converted to a string and added to the tag's content.
If you pass a dictionary to the factory, its members also become XML attributes. For instance, you
might create an XHTML table cell element like this:
You can pass in an et.Element instance, and it becomes a child element of the element being built.
This allows you to nest calls within calls, like this:
This module has one more nice wrinkle. If the name of the tag you are creating is also a valid Python
name, you can use that name as the name of a method call on the E instance. That is,
E.name(...)
is functionally equivalent to
E("name", ...)
Here is an example:
>>> head = E.head(
...     E.title('Your title'),
...     E.link(rel='stylesheet', href='/tcc/style.css'))
>>> print et.tostring(head, pretty_print=True)
<head>
  <title>Your title</title>
  <link href="/tcc/style.css" rel="stylesheet" />
</head>
<div class='warning'>
Your brain may not be the boss!
</div>
Because class is a reserved word in Python, you can't use it as an argument keyword. Therefore, the
package includes a helper function named CLASS() that takes one or more names as arguments, and
returns a dictionary that can be passed to the E() constructor to add a class= attribute with the argu-
ment value. This example does work to generate the above XML:
This generates:
This function adds child as the next child of parent, and returns the child.
13.5. addText(): Adding text content to an element
This convenience function handles the special logic used to add text content to an ElementTree-style
node. The problem is that if the node does not have any children, the new text is appended to the node's
.text attribute, but if there are any children, the new text must be appended to the .tail attribute
of the last child. Refer to Section 2, How ElementTree represents XML (p. 3) for a discussion of
why this is necessary.
Here is the general calling sequence to add some text string s to an existing node:
addText(node, s)
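To make the .text versus .tail distinction concrete, here is a minimal sketch. The addText() defined here is a local re-implementation of the logic just described, written so that the sketch does not depend on etbuilder being installed; the element names are hypothetical and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

def addText(node, s):
    """Local stand-in: append text s after node's current content."""
    if len(node) == 0:
        # No children yet: the text belongs to the node itself.
        node.text = (node.text or "") + s
    else:
        # Otherwise it follows the last child, so it goes in that child's tail.
        last = node[-1]
        last.tail = (last.tail or "") + s

para = etree.Element("p")
addText(para, "To ")            # stored in para.text
bold = etree.Element("b")
bold.text = "boldly"
para.append(bold)
addText(para, " go")            # stored in bold.tail
addText(para, " forth")         # appended to bold.tail

serialized = etree.tostring(para)
print(serialized)
```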
14.2. Prologue
The module begins with a comment pointing back to this documentation, and acknowledging Fredrik
Lundh's work.
etbuilder.py
#================================================================
# Imports
#----------------------------------------------------------------
The functools.partial() function [22] is used to curry a function call in Section 14.11,
ElementMaker.__getattr__(): Handle arbitrary method calls (p. 44).
However, the functools module is new in Python 2.5. In order to make this module work in a Python
2.4 install, we will anticipate a possible failure to import functools, providing that functionality with
a substitute partial() function. This function is stolen directly from the Python Library Reference [23].
etbuilder.py
try:
    from functools import partial
except ImportError:
    def partial(func, *args, **keywords):
        def newfunc(*fargs, **fkeywords):
            newkeywords = keywords.copy()
            newkeywords.update(fkeywords)
            return func(*(args + fargs), **newkeywords)
        newfunc.func = func
        newfunc.args = args
        newfunc.keywords = keywords
        return newfunc
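For readers unfamiliar with currying, here is a minimal sketch of what partial() does. The function make_tag() is a hypothetical stand-in that only records its arguments; it is not part of etbuilder.

```python
from functools import partial

def make_tag(tag, *content, **attrs):
    """Hypothetical stand-in: just report how it was called."""
    return (tag, content, attrs)

# Freeze the first positional argument; the rest is supplied later.
h1 = partial(make_tag, "h1")

# Equivalent to make_tag("h1", "Title", id="top").
result = h1("Title", id="top")
print(result)
```

The curried object also remembers what it wraps: h1.func is make_tag and h1.args is ("h1",).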
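# - - - C L A S S
def CLASS(*names):
    '''Helper function for adding 'class=...' attributes to tags.
    '''
    # Body reconstructed from the description of CLASS() in the
    # usage section: the names become one class= attribute value.
    return {'class': ' '.join(names)}
# - - - F O R
def FOR(id):
    '''Helper function for adding 'for=ID' attributes to tags.
    '''
    return {'for': id}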
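To see what these helpers hand back, here is a minimal sketch. It assumes that CLASS() joins its names with spaces, as the usage description suggests, and it passes the resulting dictionaries straight to etree.Element() as a stand-in for the E() factory. The parameter name id_ and the sample values are hypothetical, and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

def CLASS(*names):
    # Assumed behavior: several names become one space-separated value.
    return {"class": " ".join(names)}

def FOR(id_):
    return {"for": id_}

# Both helpers exist because class and for are reserved words in Python
# and so cannot be written as keyword arguments.
div = etree.Element("div", CLASS("warning", "big"))
label = etree.Element("label", FOR("email"))

print(div.get("class"), label.get("for"))
```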
[22] http://docs.python.org/library/functools.html
[23] http://docs.python.org/library/functools.html
14.5. subElement(): Add a child element
See Section 13.4, subElement(): Adding a child element (p. 35).
etbuilder.py
# - - - s u b E l e m e n t
    #-- 2 --
    return child
# - - - a d d T e x t
    #-- 2 --
    if len(node) == 0:
        node.text = (node.text or "") + s
    else:
        lastChild = node[-1]
        lastChild.tail = (lastChild.tail or "") + s
etbuilder.py
# - - - - - c l a s s E l e m e n t M a k e r
class ElementMaker(object):
    '''ElementTree element factory class

      Exports:
        ElementMaker(typeMap=None):
          [ (typeMap is an optional dictionary whose keys are
            type objects T, and each corresponding value is a
            function with calling sequence
              f(elt, item)
            and generic intended function
              [ (elt is an et.Element) and
                (item has type T) ->
                  elt := elt with item added ]) ->
              return a new ElementMaker instance that has
              calling sequence
                E(*p, **kw)
              and intended function
                [ p[0] exists and is a str ->
                    return a new et.Element instance whose name
                    is p[0], and remaining elements of p become
                    string content of that element (for types
                    str, unicode, and int) or attributes (for
                    type dict, and members of kw) or children
                    (for type et.Element), plus additional
                    handling from typeMap if it is provided ]
              and allows arbitrary method calls of the form
                E.tag(*p, **kw)
              with intended function
                [ return a new et.Element instance whose name
                  is (tag), and elements of p and kw have
                  the same effects as E(*(p[1:]), **kw) ]
    '''
The functions that process arguments all have this generic calling sequence:
f(elt, item)
where elt is the et.Element being built, and item is the argument to be processed.
The first step is to initialize the .__typeMap dictionary. In most cases, the user will be satisfied with
the type set described in Section 13.1, Using the etbuilder module (p. 33). However, as a
convenience, Lundh's original builder.py design allows the caller to supply a dictionary of additional
type-function pairs as an optional argument; in that case, we will copy the supplied dictionary as the
initial value of self.__typeMap.
etbuilder.py
# - - - E l e m e n t M a k e r . _ _ i n i t _ _
The first types we'll need to handle are the str and unicode types. These types will use a function we
define locally named addText(). Adding text to an element in the ElementTree world has two cases.
If the element has no children, the text is added to the element's .text attribute. If the element has any
children, the new text is added to the last child's .tail attribute. See Section 2, How ElementTree
represents XML (p. 3) for a review of text handling.
etbuilder.py
        #-- 2 --
        # [ self.__typeMap[str], self.__typeMap[unicode] :=
        #     a function with calling sequence
        #       addText(elt, item)
        #     and intended function
        #       [ (elt is an et.Element) and
        #         (item is a str or unicode instance) ->
        #           if elt has no children and elt.text is None ->
        #             elt.text := item
        #           else if elt has no children ->
        #             elt.text +:= item
        #           else if elt's last child has .tail==None ->
        #             that child's .tail := item
        #           else ->
        #             that child's .tail +:= item ]
        def addText(elt, item):
            if len(elt):
                elt[-1].tail = (elt[-1].tail or "") + item
            else:
                elt.text = (elt.text or "") + item
        self.__typeMap[str] = self.__typeMap[unicode] = addText
Lundh's original module did not handle arguments of type int, but this ability is handy for many
common tags, such as <table border='8'>, which becomes E.table(border=8).
A little deviousness is required here. The addInt() function can't call the addText() function above
directly, because the name addText is bound to that function only inside the constructor. The instance
does not know that name. However, we can assume that self.__typeMap[str] is bound to that
function, so we call it from there.
etbuilder.py
        #-- 3 --
        # [ self.__typeMap[int] :=
        #     a function with calling sequence
        #       addInt(elt, item)
        #     and intended function
        #       [ (elt is an et.Element) and
        #         (item is an int instance) ->
        #           if elt has no children and elt.text is None ->
        #             elt.text := str(item)
        #           else if elt has no children ->
        #             elt.text +:= str(item)
        #           else if elt's last child has .tail==None ->
        #             that child's .tail := str(item)
        #           else ->
        #             that child's .tail +:= str(item) ]
        def addInt(elt, item):
            self.__typeMap[str](elt, str(item))
        self.__typeMap[int] = addInt
The next type we need to handle is dict. Each key-value pair from the dictionary becomes an XML
attribute. For user convenience, if the value is not a string, we'll use the str() function on it, allowing
constructs like E.table({'border': 1}).
etbuilder.py
        #-- 4 --
        # [ self.__typeMap[dict] := a function with calling
        #     sequence
        #       addDict(elt, item)
        #     and intended function
        #       [ (elt is an et.Element) and
        #         (item is a dictionary) ->
        #           elt := elt with an attribute made from
        #                  each key-value pair from item ]
        def addDict(elt, item):
            for key, value in item.items():
                if isinstance(value, basestring):
                    elt.attrib[key] = value
                else:
                    elt.attrib[key] = str(value)
        self.__typeMap[dict] = addDict
Note
In Lundh's original, the last line of the previous block was the equivalent of this:
                    elt.attrib[key] = \
                        self.__typeMap[type(value)](None, value)
I'm not entirely sure what he had in mind here. If you have any good theories, please forward them to
<tcc-doc@nmt.edu>.
Next up is the handler for arguments that are instances of et.Element. We'll actually create an
et.Element to be sure that self.__typeMap uses the correct key.
etbuilder.py
        #-- 5 --
        # [ self.__typeMap[type(et.Element instances)] := a
        #     function with calling sequence
        #       addElement(elt, item)
        #     and intended function
        #       [ (elt and item are et.Element instances) ->
        #           elt := elt with item added as its next
        #                  child element ]
        def addElement(elt, item):
            elt.append(item)
        sample = et.Element('sample')
        self.__typeMap[type(sample)] = addElement
# - - - E l e m e n t M a k e r . _ _ c a l l _ _
First we create a new, empty element with the given tag name.
etbuilder.py
        #-- 1 --
        # [ elt := a new et.Element with name (tag) ]
        elt = et.Element(tag)
If the attr dictionary has anything in it, we can use the function stored in self.__typeMap[dict]
to process those attributes.
etbuilder.py
        #-- 2 --
        # [ if attr is not empty ->
        #     elt := elt with attributes made from the key-value
        #            pairs in attr
        #   else -> I ]
        if attr:
            self.__typeMap[dict](elt, attr)
Next, process the positional arguments in a loop, using each argument's type to extract from
self.__typeMap the proper handler for that type. For this logic, see Section 14.10,
ElementMaker.__handleArg(): Process one positional argument (p. 43).
etbuilder.py
        #-- 3 --
        # [ if the types of all the members of argList are also
        #   keys in self.__typeMap ->
        #     elt := elt modified as per the corresponding
        #            functions from self.__typeMap
        #   else -> raise TypeError ]
        for arg in argList:
            #-- 3 body --
            # [ if type(arg) is a key in self.__typeMap ->
            #     elt := elt modified as per self.__typeMap[type(arg)]
            #   else -> raise TypeError ]
            self.__handleArg(elt, arg)
        #-- 4 --
        return elt
# - - - E l e m e n t M a k e r . _ _ h a n d l e A r g
As a convenience, if the caller passes some callable object, we'll call that object and use its result.
Otherwise we'll use the object itself. (This is another Lundh feature, the utility of which I don't fully
understand.)
etbuilder.py
        #-- 1 --
        # [ if arg is callable ->
        #     value := arg()
        #   else ->
        #     value := arg ]
        if callable(arg):
            value = arg()
        else:
            value = arg
Next we look up the value's type in self.__typeMap, and call the corresponding function.
etbuilder.py
        #-- 2 --
        # [ if type(value) is a key in self.__typeMap ->
        #     elt := elt modified as per self.__typeMap[type(value)]
        #   else -> raise TypeError ]
        try:
            handler = self.__typeMap[type(value)]
            handler(elt, value)
        except KeyError:
            raise TypeError("Invalid argument type: %r" % value)
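The dispatch-by-type idea can be tried out in isolation with the sketch below. It is a simplified stand-in, not the etbuilder implementation: TYPE_MAP, handle_arg(), and the three handlers are hypothetical names, the map is a module-level dictionary rather than a private attribute, and the fallback import is an addition of this sketch.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption of this sketch: fall back to the stdlib implementation.
    import xml.etree.ElementTree as etree

def add_text(elt, item):
    """Append text to elt.text, or to the last child's tail."""
    if len(elt):
        elt[-1].tail = (elt[-1].tail or "") + item
    else:
        elt.text = (elt.text or "") + item

def add_int(elt, item):
    add_text(elt, str(item))

def add_dict(elt, item):
    for key, value in item.items():
        elt.set(key, str(value))

# Hypothetical type map: argument type -> handler function.
TYPE_MAP = {str: add_text, int: add_int, dict: add_dict}

def handle_arg(elt, arg):
    # A callable argument is called first and its result is used.
    value = arg() if callable(arg) else arg
    try:
        handler = TYPE_MAP[type(value)]
    except KeyError:
        raise TypeError("Invalid argument type: %r" % (value,))
    handler(elt, value)

cell = etree.Element("td")
for arg in ("Total: ", 42, {"colspan": 2}, lambda: "!"):
    handle_arg(cell, arg)
print(cell.text, cell.get("colspan"))
```

Calling the handler outside the try block is a choice of this sketch, so that a KeyError raised inside a handler is not mistaken for an unknown argument type.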
# - - - E l e m e n t M a k e r . _ _ g e t a t t r _ _
14.12. Epilogue
The last step is to create the factory instance E.
etbuilder.py
# - - - - - m a i n
E = ElementMaker()
<html>
  <head>
    <title>Sample page</title>
    <link href="/tcc/style.css" rel="stylesheet"/>
  </head>
  <body>
    <h1 class='big-title'>Sample page title</h1>
    <p>A paragraph containing 1 <a href='http://www.nmt.edu/'
      >link to the NMT homepage</a>.</p>
  </body>
</html>
#!/usr/bin/env python
from __future__ import print_function
from etbuilder import E, et, CLASS
page = E.html(
    E.head(
        E.title("Sample page"),
        E.link(href='/tcc/style.css', rel='stylesheet')),
    E.body(
        E.h1(CLASS('big-title'), "Sample page title"),
        E.p("A paragraph containing ", 1, " ",
            E.a("link to the NMT homepage",
                href='http://www.nmt.edu/'),
            ".")))
print(et.tostring(page, pretty_print=True, encoding=unicode), end='')
15.2. Interface to the rnc_validate module
Our module rnc_validate.py exports this interface.
RelaxException
An exception class that inherits from Python's standard Exception class. This exception will be
raised when an XML file is found not to be valid against the given Relax NG schema. The str()
function, applied to an instance of this exception, returns a textual description of the validity error.
RelaxValidator(schemaPath)
This class constructor takes one argument, a path name to a schema in either .rnc or .rng format.
Assuming that the situation meets all the assumptions enumerated in Section 15.1, Design of the
rnc_validate module (p. 45), it returns a new RelaxValidator instance that can be used to
validate XML files against that schema.
If anything goes wrong, the constructor raises a ValueError exception. This can happen for
several reasons, for example: failure to read the schema; failure to write the .rng file if translating
from .rnc format; if the .rng file is not well-formed or not a valid Relax NG schema.
RV.validate(tree)
For a RelaxValidator instance RV, this method takes as its argument an ElementTree instance
containing an XML document. If that document is valid against the schema, this method returns
None. If there is a validation problem, it raises RelaxException.
  Exports:
    class RelaxException(Exception)
    class RelaxValidator
      RelaxValidator(schemaPath):
        [ schemaPath is a string ->
            if schemaPath names a readable, valid .rng schema ->
              return a RelaxValidator that validates against that schema
            else if (schemaPath, with .rnc appended if there is no
            extension, names a readable, valid .rnc schema) ->
              if the corresponding .rng schema is readable, valid, and
              newer than the .rnc schema ->
                return a RelaxValidator that validates against the
                .rng schema
              else if (we have write access to the corresponding .rng
              schema) and (trang is locally installed) ->
                corresponding .rng schema := trang's translation of
                    the .rnc schema into .rng
                return a RelaxValidator that validates against the
                translated schema
              else -> raise ValueError ]
      .validate(tree):
        [ tree is an etree.ElementTree ->
            if tree validates against self -> I
            else -> raise RelaxException ]
'''
Next come module imports. We need the standard Python os and stat modules to check file
modification times.
rnc_validate.py
# - - - - - I m p o r t s
import os
import stat
import pexpect
# - - - - - M a n i f e s t c o n s t a n t s
RNC_SUFFIX = '.rnc'
RNG_SUFFIX = '.rng'
15.4. RelaxException
This pro-forma exception is used to signal a validity problem.
rnc_validate.py
# - - - - - c l a s s R e l a x E x c e p t i o n
class RelaxException(Exception):
    pass
rnc_validate.py
# - - - - - c l a s s R e l a x V a l i d a t o r
class RelaxValidator(object):
    '''Represents an XML validator for a given Relax NG schema.

      State/Invariants:
        .__schema:
          [ an etree.RelaxNG instance representing the effective schema ]
    '''
15.6. RelaxValidator.validate()
This method passes the ElementTree to the .validate() method of the stored RelaxNG instance,
which returns a bool value, True iff the tree is valid. We translate a False return value to an exception.
rnc_validate.py
# - - - R e l a x V a l i d a t o r . v a l i d a t e
# - - - R e l a x V a l i d a t o r . _ _ i n i t _ _
If the desired schema is in .rng form, we're ready to proceed. If it is an .rnc schema, though, we need
an .rng version that is up to date. See Section 15.8, RelaxValidator.__makeRNG(): Find or create
an .rng file (p. 49). If the file suffix isn't either, that's an error.
rnc_validate.py
        #-- 2 --
        # [ if suffix == RNG_SUFFIX ->
        #     I
        #   else if (file cName is readable) and (gName names a
        #   readable file that is newer than cName) ->
        #     I
        #   else if (cName names a readable, valid RNC file) and
        #   (we have write access to path gName) and
        #   (trang is locally installed) ->
        #     file gName := trang's translation of file cName into RNG
        #   else -> raise ValueError ]
        if suffix == RNC_SUFFIX:
            self.__makeRNG(cName, gName)
        elif suffix != RNG_SUFFIX:
            raise ValueError("File suffix not %s or %s: %s" %
                             (RNC_SUFFIX, RNG_SUFFIX, suffix))
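How cName, gName, and suffix might be derived from a single schema path can be sketched with os.path.splitext. This is an assumption for illustration: the helper names split_schema_path and check_suffix are hypothetical and are not part of the module.

```python
import os.path

RNC_SUFFIX = '.rnc'
RNG_SUFFIX = '.rng'


def split_schema_path(path):
    """Return (cName, gName, suffix) for a schema path.

    Hypothetical helper: both forms share a base name and differ
    only in their extension.
    """
    basePath, suffix = os.path.splitext(path)
    cName = basePath + RNC_SUFFIX
    gName = basePath + RNG_SUFFIX
    return cName, gName, suffix


def check_suffix(suffix):
    """Raise ValueError unless suffix is one of the two known forms."""
    if suffix not in (RNC_SUFFIX, RNG_SUFFIX):
        raise ValueError("File suffix not %s or %s: %s" %
                         (RNC_SUFFIX, RNG_SUFFIX, suffix))
```

For example, split_schema_path("schemas/book.rnc") yields the pair of names schemas/book.rnc and schemas/book.rng together with the suffix .rnc.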
At this point we have a known good .rng version of the schema. Read that, make it into a RelaxNG
instance (assuming it is valid Relax NG), and store it in self.__schema.
rnc_validate.py
        #-- 3 --
        # [ if gName names a readable, valid XML file ->
        #     doc := an et.ElementTree representing that file
        #   else -> raise ValueError ]
        try:
            doc = et.parse(gName)
        except IOError as details:
            raise ValueError("Can't open the schema file '%s': %s" %
                             (gName, str(details)))
        #-- 4 --
        # [ if doc is a valid RNG schema ->
        #     self.__schema := an et.RelaxNG instance that represents
        #                      doc
        #   else -> raise ValueError ]
        try:
            self.__schema = et.RelaxNG(doc)
        except et.RelaxNGParseError as details:
            raise ValueError("Schema file '%s' is not valid: %s" %
                             (gName, str(details)))
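The same two steps can be tried in isolation with a tiny in-memory schema. This is a sketch under stated assumptions: the schema text, the load_schema helper, and the sample documents are made up for illustration, and the sketch also traps et.XMLSyntaxError, which the code above does not.

```python
import io
import lxml.etree as et

# Illustrative schema: a single <greeting> element containing text.
RNG_SOURCE = b'''<element name="greeting"
    xmlns="http://relaxng.org/ns/structure/1.0">
  <text/>
</element>'''


def load_schema(source):
    """Parse RNG bytes and build an et.RelaxNG, or raise ValueError."""
    try:
        doc = et.parse(io.BytesIO(source))
    except et.XMLSyntaxError as details:
        raise ValueError("Schema is not well-formed: %s" % details)
    try:
        return et.RelaxNG(doc)
    except et.RelaxNGParseError as details:
        raise ValueError("Schema is not valid: %s" % details)


schema = load_schema(RNG_SOURCE)
good = et.parse(io.BytesIO(b"<greeting>hello</greeting>"))
bad = et.parse(io.BytesIO(b"<farewell/>"))
```

Calling schema.validate(good) and schema.validate(bad) then reports whether each tree conforms, and schema.error_log holds the details of the most recent failure.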
# - - - R e l a x V a l i d a t o r . _ _ m a k e R N G
[ (cName names an RNC file) and (gName names an RNG file) ->
if (file cName is readable) and (gName names a
readable file that is newer than cName) ->
I
else if (cName names a readable, valid RNC file) and
(we have write access to path gName) and
(trang is locally installed) ->
file gName := trang's translation of file cName into RNG
First we get the modification time of the .rnc file. See Section 15.9, RelaxValidator.__getModTime():
When was this file last changed? (p. 51). If anything goes wrong, we raise a ValueError.
rnc_validate.py
        #-- 1 --
        # [ if we can stat file (cName) ->
        #     cTime := epoch modification timestamp of that file
        #   else -> raise ValueError ]
        try:
            cTime = self.__getModTime(cName)
        except (IOError, OSError) as details:
            raise ValueError("Can't read the RNC file '%s': %s" %
                             (cName, str(details)))
Then we try to get the modification time of the .rng file. If that file exists and the modification time is
newer, we're done, because the .rng is up to date against the requested .rnc schema. If either the file
doesn't exist or it's out of date, fall through to the next step.
rnc_validate.py
        #-- 2 --
        # [ if (we can stat file (gName)) and
        #   (that file's modification time is more recent than cTime) ->
        #     return
        #   else -> I ]
        try:
            gTime = self.__getModTime(gName)
            if gTime > cTime:
                return
        except (IOError, OSError):
            pass
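The freshness test can be exercised on its own. The following sketch assumes a hypothetical helper is_up_to_date and uses os.path.getmtime in place of the class's private method; the temporary files and timestamps exist only for the demonstration.

```python
import os
import tempfile


def is_up_to_date(sourcePath, derivedPath):
    """True if derivedPath exists and is newer than sourcePath."""
    sourceTime = os.path.getmtime(sourcePath)   # may raise OSError
    try:
        return os.path.getmtime(derivedPath) > sourceTime
    except OSError:
        # Derived file is missing or unreadable: it must be rebuilt.
        return False


# Small demonstration with temporary files and explicit timestamps.
workDir = tempfile.mkdtemp()
cName = os.path.join(workDir, "demo.rnc")
gName = os.path.join(workDir, "demo.rng")
open(cName, "w").close()
missing_result = is_up_to_date(cName, gName)   # no .rng file yet
open(gName, "w").close()
os.utime(cName, (1000, 1000))
os.utime(gName, (2000, 2000))
fresh_result = is_up_to_date(cName, gName)     # .rng is newer
os.utime(gName, (500, 500))
stale_result = is_up_to_date(cName, gName)     # .rng is older
```

A missing derived file and a stale one are treated the same way: both mean the translation has to be run again.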
Now, try to recreate the .rng file by running the .rnc file through trang. See Section 15.10,
RelaxValidator.__trang(): Translate .rnc to .rng format (p. 51).
rnc_validate.py
        #-- 3 --
        # [ if (file (cName) is a valid RNC file) and
        #   (we have write access to path gName) and
        #   (trang is locally installed) ->
        #     file (gName) := an RNG representation of file (cName)
        #   else -> raise ValueError ]
        self.__trang(cName, gName)
15.9. RelaxValidator.__getModTime(): When was this file last changed?
The returned value is an epoch time, the number of seconds since January 1, 1970.
rnc_validate.py
# - - - R e l a x V a l i d a t o r . _ _ g e t M o d T i m e
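A minimal sketch of reading a modification time with the os and stat modules imported earlier. The function name get_mod_time and the temporary file are illustrative assumptions, not the module's own code.

```python
import os
import stat
import tempfile


def get_mod_time(fileName):
    """Return the modification time of fileName as epoch seconds.

    Raises OSError if the file cannot be examined.
    """
    return os.stat(fileName)[stat.ST_MTIME]


# Demonstration: stamp a temporary file one day after the epoch.
handle, path = tempfile.mkstemp()
os.close(handle)
os.utime(path, (86400, 86400))
modTime = get_mod_time(path)
os.remove(path)
```

Indexing the os.stat() result with stat.ST_MTIME gives whole seconds, which is enough resolution for comparing a schema with its translation.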
The pexpect.run() function returns the entire output of the run as a string. The output from trang is
empty if the translation succeeded; otherwise it contains the error message.
rnc_validate.py
# - - - R e l a x V a l i d a t o r . _ _ t r a n g
        #-- 1 --
        # [ output := all output from the execution of the command
        #   "trang (cName) (gName)" ]
        output = pexpect.run("trang %s %s" % (cName, gName))
        #-- 2 --
        # [ if output is empty -> I
        #   else -> raise ValueError ]
        if len(output) > 0:
            raise ValueError("Could not create '%s' from '%s':\n%s" %
                             (gName, cName, output))
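Where pexpect is not available, the same "empty output means success" convention can be sketched with the standard subprocess module. This swaps in a different technique from the one above: the function name run_translator and its command parameter are hypothetical, and the default assumes that trang is on the search path.

```python
import subprocess


def run_translator(cName, gName, command=("trang",)):
    """Run the translator and raise ValueError if it reports anything.

    command is a tuple so that another program can be substituted;
    the default assumes trang is on the search path.
    """
    argv = list(command) + [cName, gName]
    try:
        process = subprocess.Popen(argv, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
    except OSError as details:
        # The program could not be started at all.
        raise ValueError("Could not run '%s': %s" % (argv[0], details))
    output = process.communicate()[0]
    if len(output) > 0:
        raise ValueError("Could not create '%s' from '%s':\n%s" %
                         (gName, cName, output.decode("utf-8", "replace")))
```

Passing the arguments as a list instead of a formatted command string means that file names containing spaces need no quoting.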
16. rnck: A standalone script to validate XML against a Relax NG schema
Here we present a script that uses the rnc_validate module to validate one or more XML files against
a given Relax NG schema.
Command line arguments take this form:
    rnck schema file ...
schema
Names a Relax NG schema as either an .rnc file or an .rng file.
file
Names of one or more XML files to be validated against schema.
#!/usr/bin/env python
#================================================================
# rnck: Validate XML files against an RNC schema.
# For documentation, see:
# http://www.nmt.edu/tcc/help/pubs/pylxml/
#----------------------------------------------------------------
Next come module imports. We use the Python 3.x print() function in place of the print statement. We
need the standard Python sys module for standard I/O streams and command line arguments.
rnck
# - - - - - I m p o r t s
We'll need the lxml.etree module to read the XML files, but we'll call it et for short.
rnck
import lxml.etree as et
Finally, import the rnc_validate module described in Section 15, rnc_validate: A module to
validate XML against a Relax NG schema (p. 45).
rnck
import rnc_validate
30
http://www.nmt.edu/~shipman/soft/litprog/
16.2. rnck: main()
rnck
# - - - - - m a i n
def main():
    """Validate one or more files against an RNC schema.
Processing of the arguments is handled in Section 16.3, rnck: checkArgs() (p. 54). We get back two
items: the path to the schema, and a list of XML file names to be validated.
rnck
    #-- 1 --
    # [ if sys.argv is a valid command line ->
    #     schemaPath := the SCHEMA argument
    #     fileList := list of FILE arguments
    #   else ->
    #     sys.stderr +:= error message
    #     stop execution ]
    schemaPath, fileList = checkArgs()
    #-- 2 --
    # [ if schemaPath names a readable, valid .rng schema ->
    #     return a RelaxValidator that validates against that schema
    #   else if (schemaPath, with .rnc appended if there is no
    #   extension, names a readable, valid .rnc schema) ->
    #     if the corresponding .rng schema is readable, valid, and
    #     newer than the .rnc ->
    #       return a RelaxValidator that validates against the
    #       .rng schema
    #     else if (we have write access to the corresponding .rng
    #     schema) and (trang is locally installed) ->
    #       corresponding .rng schema := trang's translation of
    #           the .rnc schema into .rng
    #       return a RelaxValidator that validates against
    #       translated schema
    #   else ->
    #     sys.stderr +:= error message
    #     stop execution ]
    validator = rnc_validate.RelaxValidator(schemaPath)
For the logic that validates one XML file against our validator, see Section 16.7, rnck:
validateFile() (p. 55).
rnck
    #-- 3 --
    # [ sys.stderr +:= messages about any files from (fileList) that
    #   are unreadable or not valid against (validator) ]
    for fileName in fileList:
        validateFile(validator, fileName)
# - - - c h e c k A r g s
def checkArgs():
    '''Check the command line arguments.
For the usage message, see Section 16.4, rnck: usage() (p. 54).
rnck
    #-- 2 --
    if len(argList) < 2:
        usage("You must supply at least two arguments.")
    else:
        schemaPath, fileList = argList[0], argList[1:]
    #-- 3 --
    return (schemaPath, fileList)
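The same check could also be delegated to the standard argparse module. This is an alternative sketch, not the script's own approach: the function name parse_args is hypothetical, and argparse writes its own usage message to sys.stderr instead of calling usage().

```python
import argparse


def parse_args(argList):
    """Return (schemaPath, fileList) parsed from argList.

    argparse prints a usage message and exits if the schema or
    the file names are missing.
    """
    parser = argparse.ArgumentParser(prog="rnck")
    parser.add_argument("schema", help="an .rnc or .rng schema")
    parser.add_argument("files", nargs="+", metavar="file",
                        help="XML files to validate")
    args = parser.parse_args(argList)
    return args.schema, args.files
```

A call such as parse_args(sys.argv[1:]) would then stand in for the argument-count test, with nargs="+" enforcing at least one file name.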
# - - - u s a g e
def usage(*L):
    '''Write an error message and terminate.
      [ L is a list of strings ->
          sys.stderr +:= (concatenation of elements of L)
          stop execution ]
    '''
    fatal("*** Usage:\n"
          "*** %s SCHEMA FILE ...\n"
          "*** %s" %
          (sys.argv[0], ''.join(L)))
    raise SystemExit
# - - - f a t a l
def fatal(*L):
    '''Write an error message and terminate.
# - - - m e s s a g e
def message(*L):
    '''Write an error message to stderr.
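A minimal sketch of how two helpers of this kind might be written, assuming that each takes a list of strings, concatenates them, and writes the result to sys.stderr. The bodies and the exit status of 1 are illustrative assumptions. Under Python 2 the print() call would also need from __future__ import print_function at the top of the script.

```python
import sys


def message(*L):
    """Write the concatenation of the strings in L to stderr."""
    print(''.join(L), file=sys.stderr)


def fatal(*L):
    """Write a message to stderr, then stop execution."""
    message(*L)
    # Exit status 1 is an assumption; any nonzero value signals failure.
    raise SystemExit(1)
```

Keeping message() separate from fatal() lets the script report a bad file and still go on to check the remaining ones.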
# - - - v a l i d a t e F i l e
          sys.stderr +:= error message ]
    '''
    #-- 1 --
    # [ if fileName names a readable, well-formed XML file ->
    #     doc := an et.ElementTree instance representing that file
    #   else ->
    #     sys.stderr +:= error message
    #     return ]
    try:
        doc = et.parse(fileName)
    except et.XMLSyntaxError as details:
        message("*** File '%s' not well-formed: %s" %
                (fileName, str(details)))
        return
    except IOError as details:
        message("*** Can't read file '%s': %s" %
                (fileName, str(details)))
        return
    #-- 2 --
    # [ if doc is valid against validator ->
    #     I
    #   else ->
    #     sys.stderr +:= failure report ]
    try:
        validator.validate(doc)
    except rnc_validate.RelaxException as details:
        message("*** File '%s' is not valid:\n%s" %
                (fileName, details))
# - - - - - E p i l o g u e
if __name__ == "__main__":
    main()