This document discusses processing XML documents in Python. It introduces XML and its features like namespaces and provides an example XML file. It explains how to use the lxml library to read the XML file into a document object model, extract elements and attributes using XPath queries, and access element tags and text. Namespaces are important in XML and are represented in Clark notation when examining element tags. The document shows how to parse the example XML, extract specific elements matching an XPath, and print out names and other attributes or text of the elements.
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
XML namespaces and XPath with Python
1. XML namespaces and XPath with Python
March 2, 2016
1 XML namespaces and XPath with Python
Thomas Aglassinger http://www.roskakori.at
2 XML
• eXtensible Markup Language
• a blueprint for other file formats
• can represent sequences and hierarchies
• text based (binary somewhat possible using e.g. UUEncode)
• human readable
• somewhat verbose
• supports a Document Object Model (DOM)
2.1 XML with Python
• xml: part if the standard library
• xml.etree.ElementTree - XML as pythonic Trees
• xml.dom.mindom - DOM, warts and all
• xml.sax - sequential parsing of large documents
• works, but has limited support for namespaces, XPath etc.
• lxml: available from http://lxml.de/
• Python wrapper to C based XML libraries
• full support for namespaces, XPath, schemas etc
• universally used for “serious” XML processing
2.2 Example XML file
<?xml version="1.0" encoding="utf-8"?>
<people:list xmlns:people="https://www.example.org/xml/people">
<people:updated date="2016-02-16" />
<people:person name="Alice" phone="0650/12345678" size="172" />
<people:person name="Bob" phone="0654/23456789" size="167" />
<people:person name="B¨arbel" phone="0699/34567890" size="182" />
<people:person name="G¨unther" size="172">
<people:note>Ask for phone number.</people:note>
</people:person>
</people:list>
1
2. 2.3 XML namespaces
In our example
xmlns:people="https://www.example.org/xml/people"
assigns the shortcut people to the namespace identified by https://www.example.org/xml/people.
2.4 XPath
XPath is a query language to find nodes in XML documents. Examples:
• /people:list/people:person - all person elements in the document
• /people:list/people:person[@phone] - all person elements in the document with a phone attribute
Tutorial: http://www.w3schools.com/xsl/xpath intro.asp
3 Extract information from XML
3.1 Read the document root
Compute the path to our example XML file:
In [1]: import os.path
people_xml_path = os.path.join(’examples’, ’people.xml’)
Build the document root from the file:
In [2]: from lxml import etree
people_root = etree.parse(people_xml_path)
3.2 Setup the namespace
In [3]: NAMESPACES = {
’people’: ’https://www.example.org/xml/people’,
}
3.3 Find persons and print details
In [4]: # Find persons matching XPath.
person_elements = people_root.xpath(
’/people:list/people:person[@phone]’,
namespaces=NAMESPACES)
# Print name and phone of persons found.
for person_element in person_elements:
print(
person_element.attrib[’name’] + ’: ’ +
person_element.attrib[’phone’])
Alice: 0650/12345678
Bob: 0654/23456789
B¨arbel: 0699/34567890
2
3. 3.4 Examining XML elements
Elements have a tag, where namespaces are represente using the Clark notation {namespace}tag:
In [5]: person_element.tag
Out[5]: ’{https://www.example.org/xml/people}person’
XML attributes are a simlpe dictionary:
In [6]: person_element.attrib
Out[6]: {’phone’: ’0699/34567890’, ’name’: ’B¨arbel’, ’size’: ’182’}
3.5 Text nodes
Print notes about persons without a phone:
In [7]: note_elements_for_persons_without_phone =
people_root.xpath(
’/people:list/people:person[not(@phone)]/people:note’,
namespaces=NAMESPACES)
for note_element in note_elements_for_persons_without_phone:
person_element = note_element.getparent()
person_name = person_element.attrib[’name’]
note_text = note_element.text
print(person_name + ’: ’ + note_text)
G¨unther: Ask for phone number.
Use getparent() to access the enclosing XML element (as seen above).
4 Summary
• XML will be around for the foreseeable future so learn to deal with it.
• Use lxmlfor any serious XML work in Python.
• Namespaces and XPath can be taimed.
3