HTMLParser
If you're building a search engine, the first step is to use a crawler to download the target website's pages. The second step is to parse each HTML page and work out what kind of content it holds: news, images, or videos.
Parsing HTML with HTMLParser
HTML looks a lot like XML, but its syntax is much less strict, so a standard XML DOM or SAX parser cannot be relied on to parse it.
Fortunately, Python provides the HTMLParser class, which makes it easy to parse HTML with just a few lines of code. Here's a simple example:
python
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    def handle_endtag(self, tag):
        print('</%s>' % tag)

    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    def handle_data(self, data):
        print(data)

    def handle_comment(self, data):
        print('<!--', data, '-->')

    # The two handlers below are called only when the parser is created with
    # convert_charrefs=False. By default (Python 3.5+), references are decoded
    # and passed to handle_data() instead.
    def handle_entityref(self, name):
        print('&%s;' % name)

    def handle_charref(self, name):
        print('&#%s;' % name)

parser = MyHTMLParser()
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
<p>Some <a href="#">html</a> HTML tutorial...<br>END</p>
</body></html>''')
The feed() method can be called multiple times, allowing you to parse the HTML in chunks rather than all at once.
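To illustrate chunked parsing, here is a minimal sketch that feeds one small document in three pieces, with a tag deliberately split across two calls. The class name TagAndTextCollector and the sample markup are made up for this illustration. Note that a single run of text may reach handle_data() in more than one call, so the sketch joins the pieces at the end.

```python
from html.parser import HTMLParser

class TagAndTextCollector(HTMLParser):
    """Hypothetical helper: records start tags and text pieces in order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Text may arrive in several pieces, so just collect it here.
        self.text.append(data)

parser = TagAndTextCollector()
# The <a ...> tag and the word "tutorial" are both split across chunks.
for piece in ['<p>Some <a hr', 'ef="#">html</a> tuto', 'rial</p>']:
    parser.feed(piece)
parser.close()  # tell the parser that no more input is coming

print(parser.tags)
print(''.join(parser.text))
```

The parser buffers the incomplete tag until the rest of it arrives, so the result should match what a single feed() call with the whole string would produce.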
Special Characters
HTML represents special characters in two ways:
- Named entities, like &nbsp;
- Numeric character references, like &#1234; (which renders as Ӓ)
Both types can be parsed using HTMLParser.
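The sketch below shows how the two kinds of reference reach the handlers, using &nbsp; and &#1234; as sample input. It relies on the convert_charrefs constructor argument of HTMLParser. The class name ReferenceRecorder and the helper parse() are invented for this example.

```python
from html.parser import HTMLParser

class ReferenceRecorder(HTMLParser):
    """Hypothetical helper: records which handler saw which piece of input."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entityref', name))

    def handle_charref(self, name):
        self.events.append(('charref', name))

def parse(html, convert_charrefs):
    recorder = ReferenceRecorder(convert_charrefs)
    recorder.feed(html)
    recorder.close()
    return recorder.events

sample = '<p>A&nbsp;B&#1234;</p>'

# Default behaviour: references are decoded and delivered as ordinary text.
decoded = parse(sample, convert_charrefs=True)
# With conversion off, each reference goes to its own handler, undecoded.
raw = parse(sample, convert_charrefs=False)

print(decoded)
print(raw)
```

With conversion left on, handle_entityref() and handle_charref() are never called, so a parser that needs to see the references themselves has to pass convert_charrefs=False.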
Summary
Using HTMLParser, you can extract text, images, and other content from web pages.
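As one possible illustration of extracting content, the sketch below collects link targets and image sources from a small fragment. The class name LinkAndImageParser and the sample markup are assumptions made for this example, not part of the original tutorial.

```python
from html.parser import HTMLParser

class LinkAndImageParser(HTMLParser):
    """Hypothetical helper: collects href values of <a> and src values of <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('href'):
            self.links.append(attrs['href'])
        elif tag == 'img' and attrs.get('src'):
            self.images.append(attrs['src'])

sample = ('<p><a href="/about">About</a> '
          '<img src="logo.png" alt="logo"> <img src="banner.png" /></p>')

parser = LinkAndImageParser()
parser.feed(sample)
parser.close()

print(parser.links)
print(parser.images)
```

Self-closing tags such as `<img ... />` are covered as well, because the default handle_startendtag() calls handle_starttag() followed by handle_endtag().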
Exercise
Try parsing a webpage, such as Python Events. First, view the source in your browser and copy the HTML. Then use HTMLParser to parse it and output the event times, names, and locations.
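One way to approach the exercise is sketched below. It runs against a small made-up fragment rather than the real page, and it assumes that each event is marked up with an h3 element of class event-title, a time element, and a span of class event-location. Those tag and class names, the class EventParser, and the sample events are all assumptions for illustration, so check them against the source you copied and adjust as needed.

```python
from html.parser import HTMLParser

class EventParser(HTMLParser):
    """Hypothetical sketch: collects name, time and location per event."""

    def __init__(self):
        super().__init__()
        self.events = []     # one dict per event
        self._field = None   # field that incoming text belongs to, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class') or ''
        if tag == 'h3' and 'event-title' in css_class:
            # Assumed markup: a title heading starts a new event.
            self.events.append({'name': '', 'time': '', 'location': ''})
            self._field = 'name'
        elif self.events and tag == 'time':
            self._field = 'time'
        elif self.events and tag == 'span' and 'event-location' in css_class:
            self._field = 'location'

    def handle_endtag(self, tag):
        if tag in ('h3', 'time', 'span'):
            self._field = None

    def handle_data(self, data):
        if self._field and self.events:
            self.events[-1][self._field] += data

# Made-up sample data in the assumed structure.
sample = '''<ul>
<li><h3 class="event-title"><a href="/e/1">Example Conf</a></h3>
<p><time datetime="2030-01-02">02 Jan. 2030</time>
<span class="event-location">Sample City</span></p></li>
<li><h3 class="event-title"><a href="/e/2">Demo Meetup</a></h3>
<p><time datetime="2030-03-04">04 March 2030</time>
<span class="event-location">Another Town</span></p></li>
</ul>'''

parser = EventParser()
parser.feed(sample)
parser.close()

events = [{key: value.strip() for key, value in event.items()}
          for event in parser.events]
for event in events:
    print(event['time'], '|', event['name'], '|', event['location'])
```

If the real markup nests further elements inside the time or location tags, the end-tag handling here would need to track nesting depth instead of clearing the current field on the first closing tag.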