Working with XML Data

What is XML?

XML structures data using nested tags, similar to HTML but for arbitrary data. Here’s a minimal example:

<book>
  <title>Deep Learning</title>
  <author>Ian Goodfellow</author>
  <year>2016</year>
</book>

Each piece of data lives between an opening tag such as <title> and its matching closing tag </title>. Tags can nest inside other tags, which gives the document a tree structure.

Python’s ElementTree

Python includes xml.etree.ElementTree for parsing XML without external dependencies. Think of it as converting XML text into Python objects you can navigate.

import xml.etree.ElementTree as ET

# Parse XML from a string
xml_string = """
<book>
  <title>Deep Learning</title>
  <author>Ian Goodfellow</author>
  <year>2016</year>
</book>
"""

root = ET.fromstring(xml_string)

The root variable now holds a Python object representing the <book> element. Every element has:

  • .tag - the tag name
  • .text - the text directly inside the element, before its first child (None if there is none)
  • .attrib - a dictionary of attributes

print(root.tag)        # 'book'
print(root[0].tag)     # 'title' (first child)
print(root[0].text)    # 'Deep Learning'
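
As a small sketch of how these properties combine, the snippet below walks the direct children of the same <book> element and collects them into a dictionary. The variable names book and year are only illustrative.

```python
import xml.etree.ElementTree as ET

xml_string = """
<book>
  <title>Deep Learning</title>
  <author>Ian Goodfellow</author>
  <year>2016</year>
</book>
"""

root = ET.fromstring(xml_string)

# Iterating over an element yields its direct children in document order.
book = {child.tag: child.text for child in root}
print(book)  # {'title': 'Deep Learning', 'author': 'Ian Goodfellow', 'year': '2016'}

# This example has no attributes, so .attrib is an empty dictionary.
print(root.attrib)  # {}

# .text is a string (or None), so numbers must be converted explicitly.
year = int(book["year"])
print(year + 1)  # 2017
```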

Finding Elements

XML often contains repeated structures. Consider a simplified ArXiv API response (real responses also declare a namespace, which is covered in a later section):

<feed>
  <entry>
    <id>http://arxiv.org/abs/1234.5678</id>
    <title>Paper Title One</title>
    <author>
      <name>Alice Smith</name>
    </author>
    <author>
      <name>Bob Jones</name>
    </author>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/2345.6789</id>
    <title>Paper Title Two</title>
    <author>
      <name>Carol White</name>
    </author>
  </entry>
</feed>

To extract data from this structure:

# arxiv_xml is assumed to hold the <feed> document above as a string
root = ET.fromstring(arxiv_xml)

# Find all entry elements
for entry in root.findall('entry'):
    # Within each entry, find the title
    title = entry.find('title').text
    print(f"Title: {title}")
    
    # Find all authors within this entry
    authors = []
    for author in entry.findall('author'):
        name = author.find('name').text
        authors.append(name)
    print(f"Authors: {', '.join(authors)}")

The key distinction:

  • find() returns the first matching element (or None)
  • findall() returns a list of all matching elements
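
The difference in return values matters most when nothing matches. The sketch below uses a shortened copy of the feed above (sample data only) to show both cases.

```python
import xml.etree.ElementTree as ET

# Sample data, shortened from the feed above.
feed = ET.fromstring("""
<feed>
  <entry>
    <title>Paper Title One</title>
    <author><name>Alice Smith</name></author>
    <author><name>Bob Jones</name></author>
  </entry>
</feed>
""")

entry = feed.find("entry")

first_author = entry.find("author")    # first match only
all_authors = entry.findall("author")  # list of every match

print(first_author.find("name").text)  # Alice Smith
print(len(all_authors))                # 2

# No match: find() gives None, findall() gives an empty list.
missing_one = entry.find("abstract")
missing_all = entry.findall("abstract")
print(missing_one)  # None
print(missing_all)  # []

# Looping over findall() is therefore safe without a None check.
for abstract in missing_all:
    print(abstract.text)  # never runs here
```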

Handling Missing Elements

Real XML data often has optional fields. When a field is absent, find() returns None, and accessing .text on None raises an AttributeError:

# This breaks if no abstract exists
abstract = entry.find('abstract').text  # AttributeError!

# Safe approach
abstract_elem = entry.find('abstract')
if abstract_elem is not None:
    abstract = abstract_elem.text
else:
    abstract = ""

A compact pattern for extraction:

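def get_text(parent, tag, default=""):
    """Return the text of the first matching child, or default."""
    elem = parent.find(tag)
    # An empty element such as <abstract/> exists, but its .text is None
    if elem is None or elem.text is None:
        return default
    return elem.text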

# Usage
title = get_text(entry, 'title', 'Untitled')
abstract = get_text(entry, 'abstract')
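
The standard library also provides Element.findtext(), which covers much of what a helper like get_text does. The sketch below uses made-up sample data; the empty <note/> element is there only to show how an element without text behaves.

```python
import xml.etree.ElementTree as ET

# Sample data: <abstract> is absent and <note/> is present but empty.
entry = ET.fromstring("""
<entry>
  <title>Paper Title One</title>
  <note/>
</entry>
""")

# Element found: its text is returned.
title = entry.findtext("title", default="Untitled")
print(title)  # Paper Title One

# Element missing: the default is returned (None if no default is given).
abstract = entry.findtext("abstract", default="")
print(repr(abstract))  # ''

# Element found but empty: findtext() returns an empty string,
# whereas .text on the element itself is None.
note = entry.findtext("note", default="n/a")
print(repr(note))                     # ''
print(repr(entry.find("note").text))  # None
```

One consequence of this design is that findtext() cannot tell "present but empty" apart from an empty default, so a custom helper is still useful when that distinction matters.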

Attributes in XML

XML elements can have attributes - key-value pairs inside the opening tag:

<entry updated="2024-01-15">
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
</entry>

Access attributes using .get():

entry = root.find('entry')
updated_date = entry.get('updated')  # '2024-01-15'

# Extract all category terms
categories = []
for cat in entry.findall('category'):
    term = cat.get('term')
    if term:
        categories.append(term)
# categories = ['cs.LG', 'cs.AI']
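
A self-contained sketch of the same attribute lookups, including what happens when an attribute is missing. The attribute name published is hypothetical and is used only to show the missing case.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring("""
<entry updated="2024-01-15">
  <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
</entry>
""")

# .attrib is a plain dictionary of the element's own attributes.
print(entry.attrib)  # {'updated': '2024-01-15'}

# .get() returns None (or a supplied default) for a missing attribute,
# whereas entry.attrib['published'] would raise KeyError.
print(entry.get("published"))             # None
print(entry.get("published", "unknown"))  # unknown

# Collect one attribute from every matching child.
terms = [c.get("term") for c in entry.findall("category") if c.get("term")]
print(terms)  # ['cs.LG', 'cs.AI']

# ElementTree's path syntax can also select by attribute value.
ai = entry.find("category[@term='cs.AI']")
print(ai.get("scheme"))  # http://arxiv.org/schemas/atom
```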

Complete ArXiv Example

Here’s how to parse a simplified ArXiv response:

def parse_arxiv_response(xml_content):
    """Extract paper data from ArXiv API XML."""
    
    root = ET.fromstring(xml_content)
    papers = []
    
    for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
        # Note: ArXiv uses namespaces - the {url} prefix
        paper = {}
        
        # Extract ID from full URL
        id_elem = entry.find('{http://www.w3.org/2005/Atom}id')
        if id_elem is not None:
            # "http://arxiv.org/abs/1234.5678" -> "1234.5678"
            full_id = id_elem.text
            paper['id'] = full_id.split('/')[-1]
        
        # Extract title (cleaning whitespace)
        title_elem = entry.find('{http://www.w3.org/2005/Atom}title')
        if title_elem is not None:
            paper['title'] = ' '.join(title_elem.text.split())
        
        # Collect all authors
        authors = []
        for author in entry.findall('{http://www.w3.org/2005/Atom}author'):
            name_elem = author.find('{http://www.w3.org/2005/Atom}name')
            if name_elem is not None:
                authors.append(name_elem.text)
        paper['authors'] = authors
        
        papers.append(paper)
    
    return papers
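
Splitting on the last slash works for the IDs shown here. The sketch below is a slightly more defensive variant, under two assumptions that should be checked against real responses: that the URL may end in a version suffix such as v2, and that older identifiers contain a slash of their own. The helper name extract_arxiv_id is hypothetical.

```python
import re

def extract_arxiv_id(url):
    """Return the identifier from an abs URL, without any version suffix."""
    # Keep everything after "/abs/" so an identifier with a slash stays whole.
    _, sep, ident = url.partition("/abs/")
    if not sep:
        # Unexpected URL shape: fall back to the last path segment.
        ident = url.rstrip("/").split("/")[-1]
    # Drop a trailing version marker such as "v2" (assumed format).
    return re.sub(r"v\d+$", "", ident)

print(extract_arxiv_id("http://arxiv.org/abs/1234.5678"))         # 1234.5678
print(extract_arxiv_id("http://arxiv.org/abs/1234.5678v2"))       # 1234.5678
print(extract_arxiv_id("http://arxiv.org/abs/hep-ex/0307015v1"))  # hep-ex/0307015
```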

Namespaces: The Curly Brace Notation

Real XML often declares namespaces - URIs that identify the vocabulary:

<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Some Title</title>
  </entry>
</feed>

When a default namespace is declared, ElementTree stores every tag as {namespace-uri}name, so searches must include the namespace:

# Without namespace - finds nothing!
entry = root.find('entry')  # Returns None

# With namespace - works
entry = root.find('{http://www.w3.org/2005/Atom}entry')

# Or use a namespace map
ns = {'atom': 'http://www.w3.org/2005/Atom'}
entry = root.find('atom:entry', ns)
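
The following sketch puts the namespace map to work on the small feed above. The prefix atom is an arbitrary local choice; only the URI has to match the document. The {*} wildcard shown at the end requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Some Title</title>
  </entry>
</feed>
""")

# Tags are stored in expanded {uri}name form.
print(root.tag)  # {http://www.w3.org/2005/Atom}feed

ns = {"atom": "http://www.w3.org/2005/Atom"}

# The same map works for find(), findall(), and findtext().
titles = [
    entry.findtext("atom:title", default="", namespaces=ns)
    for entry in root.findall("atom:entry", ns)
]
print(titles)  # ['Some Title']

# Python 3.8+: {*} matches the tag name in any namespace.
wild = root.find("{*}entry")
print(wild is not None)  # True

# Strip the namespace when only the local name is wanted, e.g. for display.
local_name = root.tag.split("}", 1)[-1]
print(local_name)  # feed
```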

Reading from URLs

To fetch and parse XML from a URL:

import urllib.request

url = "http://export.arxiv.org/api/query?search_query=cat:cs.LG&max_results=5"

# Fetch the XML
with urllib.request.urlopen(url, timeout=10) as response:
    xml_data = response.read()

# Parse it
root = ET.fromstring(xml_data)

# Now process as before
for entry in root.findall('{http://www.w3.org/2005/Atom}entry'):
    # ... extract data ...
    pass  # placeholder so the loop body is valid Python
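
Rather than writing the query string by hand, it can be assembled with urllib.parse.urlencode, which escapes special characters. This is a minimal sketch: the names BASE_URL and build_query_url are hypothetical, and the parameters are the two that appear in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"

def build_query_url(search_query, max_results=5):
    """Build a query URL from the two parameters used above."""
    params = {"search_query": search_query, "max_results": max_results}
    # safe=":" keeps the colon in "cat:cs.LG" readable; other special
    # characters (spaces, ampersands, ...) are still escaped.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"

url = build_query_url("cat:cs.LG")
print(url)
# http://export.arxiv.org/api/query?search_query=cat:cs.LG&max_results=5
```

The resulting string can be passed to urllib.request.urlopen exactly as before.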

Method Reference

# Core operations
root = ET.fromstring(xml_string)     # Parse from string
tree = ET.parse('file.xml')          # Parse from file
root = tree.getroot()                # Get root element

# Navigation
element.find('tag')                  # First child with tag
element.findall('tag')               # All children with tag
element.iter('tag')                  # All descendants with tag

# Data extraction  
element.text                         # Text content
element.tag                          # Tag name
element.get('attr')                  # Attribute value
element.attrib                       # All attributes as dict

# Iteration
for child in element:                # Direct children only
    process(child)
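
The difference between the navigation methods is easiest to see side by side. The sketch below uses sample data modelled on the earlier feed: findall() with a plain tag name sees only direct children, while iter() and the './/' path prefix search the whole subtree.

```python
import xml.etree.ElementTree as ET

# Sample data modelled on the earlier feed.
feed = ET.fromstring("""
<feed>
  <entry>
    <title>Paper Title One</title>
    <author><name>Alice Smith</name></author>
  </entry>
  <entry>
    <title>Paper Title Two</title>
    <author><name>Carol White</name></author>
  </entry>
</feed>
""")

# <name> is three levels down, so a plain tag search from <feed> finds nothing.
direct = feed.findall("name")
print(len(direct))  # 0

# iter() walks every descendant in document order.
names = [n.text for n in feed.iter("name")]
print(names)  # ['Alice Smith', 'Carol White']

# './/name' asks findall() for the same descendant search.
deep = feed.findall(".//name")
print(len(deep))  # 2

# Iterating over the element itself yields direct children only.
children = [child.tag for child in feed]
print(children)  # ['entry', 'entry']
```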