
HTMLParser

If you're building a search engine, the first step is to use a crawler to download the target website's pages. The second step is to parse each HTML page and work out what it contains: news, images, or videos.

Parsing HTML with HTMLParser

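HTML looks a lot like XML, but it is not a subset of it, and its syntax is far more lenient: tags such as <br> may be left unclosed, and named entities such as &nbsp; are used without being declared. A strict XML parser (DOM or SAX) therefore rejects most real-world HTML.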
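
To see why an XML parser is a poor fit, here is a minimal sketch using the standard library's xml.etree.ElementTree. The helper name is_well_formed_xml is made up for this illustration; it only reports whether the XML parser accepts a snippet.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Hypothetical helper: True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Ordinary HTML: <br> is never closed, so the XML parser rejects it.
print(is_well_formed_xml('<p>line<br>break</p>'))
# The same snippet written as strict XML is accepted.
print(is_well_formed_xml('<p>line<br/>break</p>'))
```

The first call fails because of the unclosed <br>. A snippet containing &nbsp; fails in the same way, because XML does not predefine that entity.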

Fortunately, Python provides the HTMLParser class, which makes it easy to parse HTML with just a few lines of code. Here's a simple example:

python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

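# convert_charrefs defaults to True, which decodes entities into the text passed
# to handle_data; pass False so handle_entityref and handle_charref are called.
parser = MyHTMLParser(convert_charrefs=False)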
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href="#">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')

The feed() method can be called multiple times, allowing you to parse the HTML in chunks rather than all at once.
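
As a rough illustration of chunked parsing, the sketch below feeds a document in pieces that deliberately cut through a tag. ChunkCollector and parse_in_chunks are hypothetical names invented for this example, not part of the library.

```python
from html.parser import HTMLParser


class ChunkCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they arrive."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Text split across chunks may arrive in several calls, so collect pieces.
        self.text_parts.append(data)


def parse_in_chunks(chunks):
    parser = ChunkCollector()
    for chunk in chunks:
        parser.feed(chunk)  # an unfinished tag is buffered until the rest arrives
    parser.close()          # process anything still buffered
    return parser.tags, "".join(parser.text_parts)


tags, text = parse_in_chunks(['<p>Hel', 'lo <a hr', 'ef="#">wor', 'ld</a></p>'])
print(tags)  # ['p', 'a']
print(text)  # Hello world
```

Note that handle_data may be called more than once for a single run of text, so the sketch joins the pieces rather than assuming one call per text node. Calling close() at the end tells the parser that no more input is coming.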

Special Characters

HTML has special characters represented in two ways:

  • Named entities, like &nbsp;
  • Numeric character references, like &#1234;

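HTMLParser handles both. By default (convert_charrefs=True) it decodes them into the text passed to handle_data; with convert_charrefs=False it reports them through handle_entityref and handle_charref instead.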
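
The difference between the two modes can be sketched as follows. EntityLogger and parse_entities are hypothetical names used only for this illustration; convert_charrefs is the real constructor argument of HTMLParser.

```python
from html.parser import HTMLParser


class EntityLogger(HTMLParser):
    """Hypothetical helper: logs which handler receives each piece of input."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.log = []

    def handle_data(self, data):
        self.log.append(("data", data))

    def handle_entityref(self, name):
        # Only called when convert_charrefs is False.
        self.log.append(("entityref", name))

    def handle_charref(self, name):
        # Only called when convert_charrefs is False.
        self.log.append(("charref", name))


def parse_entities(html, convert_charrefs):
    parser = EntityLogger(convert_charrefs)
    parser.feed(html)
    parser.close()
    return parser.log


sample = '<p>A&amp;B &#65;</p>'
print(parse_entities(sample, True))   # entities decoded inside the text
print(parse_entities(sample, False))  # entities reported by name or number
```

With the default setting, the text arrives already decoded (A&B A in this sample). With convert_charrefs=False, the handlers receive the raw names amp and 65, and decoding them is left to your code.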

Summary

Using HTMLParser, you can extract text, images, and other content from web pages.
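
As one possible sketch of such extraction, the subclass below collects image URLs, link targets, and visible text. ContentExtractor and extract_content are hypothetical names, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Hypothetical helper: gathers image sources, link targets, and text."""

    def __init__(self):
        super().__init__()
        self.images = []
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.text_parts.append(text)


def extract_content(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.images, parser.links, " ".join(parser.text_parts)


images, links, text = extract_content(
    '<p>Logo: <img src="/logo.png" alt="logo"> see <a href="/docs">docs</a></p>'
)
print(images)  # ['/logo.png']
print(links)   # ['/docs']
print(text)    # Logo: see docs
```

Self-closing tags such as <img src="..."/> are covered too, because the default handle_startendtag calls handle_starttag followed by handle_endtag.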

Exercise

Try parsing a webpage, such as Python Events. First, view the source in your browser and copy the HTML. Then use HTMLParser to parse it and print each event's time, name, and location.
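
One possible starting point is sketched below. It assumes, without having checked the live page, that each event title sits in an h3 with class event-title, the date in a time element, and the place in a span with class event-location. The SAMPLE markup, the class names, and the EventParser and parse_events names are all assumptions made for illustration, so adjust them to the source you actually copy.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the copied page source.
SAMPLE = '''
<ul class="list-recent-events">
  <li>
    <h3 class="event-title"><a href="/events/1/">Example Conf</a></h3>
    <p><time datetime="2030-01-15T00:00:00+00:00">15 Jan. 2030</time>
       <span class="event-location">Example City, Country</span></p>
  </li>
</ul>
'''


class EventParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []
        self._field = None  # which field the next text belongs to, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "event-title" in classes:
            # Assumed: a title heading marks the start of a new event.
            self.events.append({"name": "", "time": "", "location": ""})
            self._field = "name"
        elif tag == "time" and self.events:
            self._field = "time"
        elif tag == "span" and "event-location" in classes and self.events:
            self._field = "location"

    def handle_endtag(self, tag):
        if tag in ("h3", "time", "span"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.events:
            self.events[-1][self._field] += data.strip()


def parse_events(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


for event in parse_events(SAMPLE):
    print(event["time"], "|", event["name"], "|", event["location"])
```

The parser keeps a single piece of state, the field currently being filled, which is enough for flat markup like the sample. Real pages often nest extra elements inside these tags, in which case the end-tag handling needs to track depth as well.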
