
urllib

Python's built-in urllib module provides a set of functions for working with URLs and makes it easy to fetch the content behind a URL.

GET Requests

The request module in urllib makes it straightforward to send GET requests. For example, to fetch data from a specific URL like https://api.douban.com/v2/book/2129650:

python
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

This prints the HTTP status line, the response headers, and the JSON body returned by the API.
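
Since the body is JSON, it can be parsed into a Python object with the standard json module. A minimal sketch, assuming the Douban response contains a title field:

python
from urllib import request
import json

# fetch the same endpoint and parse the JSON body into a dict
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    book = json.loads(f.read().decode('utf-8'))
# 'title' is an assumed field name, used only for illustration
print(book.get('title'))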

To simulate a browser sending a GET request, you can use the Request object to add HTTP headers. For instance, simulating an iPhone request to Douban's homepage:

python
req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 ...')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
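
Note that urlopen() does not return a response object for failed requests: an error status code raises urllib.error.HTTPError, and network-level failures raise urllib.error.URLError. A minimal sketch (the nonexistent path is made up purely for illustration):

python
from urllib import request, error

try:
    with request.urlopen('http://www.douban.com/nonexistent') as f:
        print(f.read().decode('utf-8'))
except error.HTTPError as e:
    # the server answered, but with an error status such as 404
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # the request itself failed (DNS error, refused connection, ...)
    print('URLError:', e.reason)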

POST Requests

To send data with a POST request, encode the form parameters as bytes and pass them to urlopen() via its data argument. For example, to log in to Weibo:

python
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    # ... other site-specific form fields omitted
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 ...')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&...')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
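
The Weibo form fields above are site-specific and partly omitted. For a self-contained experiment, the same pattern can be pointed at the public echo service httpbin.org, used here only as a stand-in endpoint; passing the data argument is what turns the request into a POST:

python
from urllib import request, parse

# hypothetical form fields; httpbin.org/post simply echoes the request back
form = parse.urlencode([('username', 'alice'), ('password', 'secret')])

req = request.Request('https://httpbin.org/post')
with request.urlopen(req, data=form.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    print('Data:', f.read().decode('utf-8'))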

Using a Proxy

If you need to route requests through a proxy, you can use ProxyHandler:

python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)

with opener.open('http://www.example.com/login.html') as f:
    pass
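
If every later request should go through the proxy, the opener can also be installed as the module-wide default so that plain urlopen() calls pick it up. A short sketch continuing the example above:

python
# make the proxy-aware opener the default for all subsequent urlopen() calls
urllib.request.install_opener(opener)

with urllib.request.urlopen('http://www.example.com/login.html') as f:
    pass  # this request is now routed through the proxy as well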

Summary

urllib allows for executing various HTTP requests programmatically. To simulate browser behavior, you need to mimic the headers sent by the browser, especially the User-Agent header.

Exercise

To read JSON data using urllib and parse it into a Python object:

python
from urllib import request
import json

def fetch_data(url):
    with request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

# Test
URL = 'https://api.weatherapi.com/v1/current.json?key=b4e8f86b44654e6b86885330242207&q=Beijing&aqi=no'
data = fetch_data(URL)
print(data)
assert data['location']['name'] == 'Beijing'
print('ok')