Working with XML Data
What is XML?
XML structures data using nested tags, similar to HTML but for arbitrary data. Here’s a minimal example:
<book>
    <title>Deep Learning</title>
    <author>Ian Goodfellow</author>
    <year>2016</year>
</book>
Each piece of data lives between an opening tag such as <title> and its closing tag </title>. Tags can nest inside other tags, creating a tree structure.
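To make the tree structure concrete, here is a minimal sketch that lists each tag indented by its nesting depth. It relies on Python's built-in xml.etree.ElementTree module, which the next section introduces; the outline helper is a hypothetical name chosen for illustration, not part of the library.

```python
import xml.etree.ElementTree as ET

xml_string = """
<book>
    <title>Deep Learning</title>
    <author>Ian Goodfellow</author>
    <year>2016</year>
</book>
"""

def outline(element, depth=0):
    """Return one line per element, indented by nesting depth (hypothetical helper)."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(xml_string))
print("\n".join(tree_lines))
```

The root <book> sits at depth 0 and its three children at depth 1; a deeper document would simply produce more indentation levels.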
Python’s ElementTree
Python includes xml.etree.ElementTree for parsing XML without external dependencies. Think of it as converting XML text into Python objects you can navigate.
import xml.etree.ElementTree as ET
# Parse XML from a string
xml_string = """
<book>
<title>Deep Learning</title>
<author>Ian Goodfellow</author>
<year>2016</year>
</book>
"""
root = ET.fromstring(xml_string)
The root variable now holds a Python object representing the <book> element. Every element has:
.tag - the tag name
.text - the content between tags
.attrib - a dictionary of attributes
print(root.tag) # 'book'
print(root[0].tag) # 'title' (first child)
print(root[0].text) # 'Deep Learning'
Finding Elements
XML often contains repeated structures. Consider a simplified ArXiv API response (the namespace declaration is left out here and covered later):
<feed>
    <entry>
        <id>http://arxiv.org/abs/1234.5678</id>
        <title>Paper Title One</title>
        <author>
            <name>Alice Smith</name>
        </author>
        <author>
            <name>Bob Jones</name>
        </author>
    </entry>
    <entry>
        <id>http://arxiv.org/abs/2345.6789</id>
        <title>Paper Title Two</title>
        <author>
            <name>Carol White</name>
        </author>
    </entry>
</feed>
To extract data from this structure:
# arxiv_xml is a string holding the XML shown above
root = ET.fromstring(arxiv_xml)
# Find all entry elements
for entry in root.findall('entry'):
    # Within each entry, find the title
    title = entry.find('title').text
    print(f"Title: {title}")
    # Find all authors within this entry
    authors = []
    for author in entry.findall('author'):
        name = author.find('name').text
        authors.append(name)
    print(f"Authors: {', '.join(authors)}")
The key distinction:
find() returns the first matching element (or None)
findall() returns a list of all matching elements
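The difference is easiest to see side by side. The sketch below uses a trimmed copy of the feed above and also shows a detail worth knowing: a plain tag name matches direct children only, while the './/' path prefix searches all descendants. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the feed shown earlier (no namespace declared)
arxiv_xml = """
<feed>
    <entry>
        <title>Paper Title One</title>
        <author><name>Alice Smith</name></author>
        <author><name>Bob Jones</name></author>
    </entry>
    <entry>
        <title>Paper Title Two</title>
        <author><name>Carol White</name></author>
    </entry>
</feed>
"""
root = ET.fromstring(arxiv_xml)

first_entry = root.find('entry')        # a single Element (the first match)
all_entries = root.findall('entry')     # a list of Elements
missing = root.find('abstract')         # no such child -> None
none_found = root.findall('abstract')   # no such child -> empty list

# <name> is a grandchild of <entry>, so a plain tag name does not reach it
direct_names = first_entry.findall('name')
# The './/' prefix searches all descendants of the element
nested_names = [n.text for n in first_entry.findall('.//name')]

print(first_entry.find('title').text)
print(len(all_entries), missing, none_found)
print(direct_names, nested_names)
```

Because findall() returns an empty list rather than None, a loop over its result is safe even when nothing matches; only find() needs a None check.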
Handling Missing Elements
Real XML data often has optional fields. find() returns None when no element matches, so calling .text on the result raises an AttributeError:
# This breaks if no abstract exists
abstract = entry.find('abstract').text # AttributeError!
# Safe approach
abstract_elem = entry.find('abstract')
if abstract_elem is not None:
    abstract = abstract_elem.text
else:
    abstract = ""
A compact pattern for extraction:
def get_text(parent, tag, default=""):
    elem = parent.find(tag)
    # Fall back to default if the element is missing or empty (text is None)
    if elem is None or elem.text is None:
        return default
    return elem.text
# Usage
title = get_text(entry, 'title', 'Untitled')
abstract = get_text(entry, 'abstract')
Attributes in XML
XML elements can have attributes - key-value pairs inside the opening tag:
<entry updated="2024-01-15">
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
</entry>
Access attributes using .get():
entry = root.find('entry')
updated_date = entry.get('updated') # '2024-01-15'
# Extract all category terms
categories = []
for cat in entry.findall('category'):
    term = cat.get('term')
    if term:
        categories.append(term)
# categories = ['cs.LG', 'cs.AI']
Complete ArXiv Example
Here’s how to parse a simplified ArXiv response:
def parse_arxiv_response(xml_content):
    """Extract paper data from ArXiv API XML."""
    root = ET.fromstring(xml_content)
    papers = []
    # Note: ArXiv uses namespaces - the {url} prefix
    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
        paper = {}
        # Extract ID from full URL
        id_elem = entry.find('{http://www.w3.org/2005/Atom}id')
        if id_elem is not None:
            # "http://arxiv.org/abs/1234.5678" -> "1234.5678"
            full_id = id_elem.text
            paper['id'] = full_id.split('/')[-1]
        # Extract title (cleaning whitespace)
        title_elem = entry.find('{http://www.w3.org/2005/Atom}title')
        if title_elem is not None:
            paper['title'] = ' '.join(title_elem.text.split())
        # Collect all authors
        authors = []
        for author in entry.findall('{http://www.w3.org/2005/Atom}author'):
            name_elem = author.find('{http://www.w3.org/2005/Atom}name')
            if name_elem is not None:
                authors.append(name_elem.text)
        paper['authors'] = authors
        papers.append(paper)
    return papers
Namespaces: The Curly Brace Notation
Real XML often declares namespaces - URIs that identify the vocabulary:
<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <title>Some Title</title>
    </entry>
</feed>
When a namespace is declared, you must include it when searching:
# Without namespace - finds nothing!
entry = root.find('entry') # Returns None
# With namespace - works
entry = root.find('{http://www.w3.org/2005/Atom}entry')
# Or use a namespace map
ns = {'atom': 'http://www.w3.org/2005/Atom'}
entry = root.find('atom:entry', ns)
Reading from URLs
To fetch and parse XML from a URL:
import urllib.request
url = "http://export.arxiv.org/api/query?search_query=cat:cs.LG&max_results=5"
# Fetch the XML
with urllib.request.urlopen(url, timeout=10) as response:
    xml_data = response.read()
# Parse it (fromstring accepts the raw bytes returned by read())
root = ET.fromstring(xml_data)
# Now process as before
for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
    # ... extract data ...
    pass
Method Reference
# Core operations
root = ET.fromstring(xml_string) # Parse from string
tree = ET.parse('file.xml') # Parse from file
root = tree.getroot() # Get root element
# Navigation
element.find('tag') # First child with tag
element.findall('tag') # All children with tag
element.iter('tag') # All descendants with tag
# Data extraction
element.text # Text content
element.tag # Tag name
element.get('attr') # Attribute value
element.attrib # All attributes as dict
# Iteration
for child in element: # Direct children only
    process(child)
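The navigation methods in the reference differ mainly in how deep they look. As a rough illustration, the sketch below runs them against a small made-up document (the <library> XML is invented for this example) so the results can be compared directly.

```python
import xml.etree.ElementTree as ET

# Invented sample: <title> appears at three different depths
xml_string = """
<library>
    <book id="b1">
        <title>Deep Learning</title>
        <chapter><title>Introduction</title></chapter>
    </book>
    <title>Catalogue</title>
</library>
"""
root = ET.fromstring(xml_string)

# Iterating an element visits its direct children only
children = [child.tag for child in root]

# findall('title') also looks at direct children only
direct_titles = [t.text for t in root.findall('title')]

# iter('title') walks every descendant in document order
all_titles = [t.text for t in root.iter('title')]

# Attribute access: get() returns None (or a given default) when absent
book = root.find('book')
book_id = book.get('id')
isbn = book.get('isbn', 'unknown')

print(children)
print(direct_titles)
print(all_titles)
print(book_id, isbn, book.attrib)
```

In practice, iter() suits "find this tag anywhere" searches, while find() and findall() are the better fit when the position of an element in the tree matters.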