Skip to content
On this page

XML

Although XML is more complex than JSON and not as commonly used in web applications today, it's still important to understand how to work with XML.

DOM vs SAX

There are two primary methods for parsing XML: DOM and SAX.

  • DOM loads the entire XML document into memory, creating a tree structure. This consumes more memory and can be slower, but allows for easy traversal of nodes.
  • SAX operates in a streaming fashion, parsing the XML as it reads it. This method is more memory-efficient and faster, but requires the programmer to handle events manually.

In most cases, it's preferable to use SAX due to its lower memory usage.

SAX Example

Using SAX to parse XML in Python is straightforward. You typically define three key event handlers: start_element, end_element, and char_data. Here's a simple example:

python
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        print('sax:start_element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('sax:end_element: %s' % name)

    def char_data(self, text):
        print('sax:char_data: %s' % text)

xml = r'''<?xml version="1.0"?>
<ol>
    <li><a href="/python">Python</a></li>
    <li><a href="/ruby">Ruby</a></li>
</ol>
'''

handler = DefaultSaxHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
parser.Parse(xml)

Note: When reading large strings, the CharacterDataHandler might be called multiple times, so you'll need to concatenate the data and process it in the EndElementHandler.

Generating XML

Most of the time, generated XML structures are simple. The most straightforward way to create XML is by concatenating strings:

python
L = []
L.append(r'<?xml version="1.0"?>')
L.append(r'<root>')
L.append(encode('some & data'))
L.append(r'</root>')
return ''.join(L)

For more complex XML, it's advisable to use JSON instead.

Summary

When parsing XML, focus on the nodes of interest, save the data during event handling, and process it after parsing is complete.

Exercise

Here's an exercise to parse the XML format of weather forecasts from WeatherAPI using SAX:

python
from xml.parsers.expat import ParserCreate
from urllib import request

class WeatherSaxHandler(object):
    def __init__(self):
        self.city = ''
        self.condition = ''
        self.temperature = ''
        self.wind = ''
        self.current_tag = ''

    def start_element(self, name, attrs):
        self.current_tag = name

    def end_element(self, name):
        if name == 'location':
            self.city = self.current_data
        elif name == 'current':
            pass  # You could process the current weather here

    def char_data(self, text):
        if self.current_tag == 'name':
            self.current_data = text.strip()
        elif self.current_tag == 'condition':
            self.condition = text.strip()
        elif self.current_tag == 'temp_c':
            self.temperature = text.strip()
        elif self.current_tag == 'wind_kph':
            self.wind = text.strip()

def parseXml(xml_str):
    handler = WeatherSaxHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(xml_str)
    
    return {
        'city': handler.city,
        'weather': {
            'condition': handler.condition,
            'temperature': handler.temperature,
            'wind': handler.wind,
        }
    }

# Test
URL = 'https://api.weatherapi.com/v1/current.xml?key=b4e8f86b44654e6b86885330242207&q=Beijing&aqi=no'

with request.urlopen(URL, timeout=4) as f:
    data = f.read()

result = parseXml(data.decode('utf-8'))
assert result['city'] == 'Beijing'
print(result)

This code defines a SAX handler for weather data and verifies that the parsed city matches "Beijing."

XML has loaded