
urllib

Python's built-in urllib module provides a set of functions for working with URLs and makes it easy to fetch the content behind a URL.

GET Requests

The request module in urllib makes it straightforward to send GET requests. For example, to fetch data from a specific URL like https://api.douban.com/v2/book/2129650:

python
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

This prints the HTTP status line, the response headers, and the JSON body returned by the API.
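
Since the body is JSON, it can be parsed into a Python object with the standard json module. A minimal sketch, assuming the Douban response contains a title field:

python
from urllib import request
import json

# fetch the same endpoint and parse the JSON body into a dict
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    book = json.loads(f.read().decode('utf-8'))
# 'title' is an assumed field name, used only for illustration
print(book.get('title'))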

To simulate a browser sending a GET request, you can use the Request object to add HTTP headers. For instance, simulating an iPhone request to Douban's homepage:

python
req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 ...')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
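
Note that urlopen() does not return a response object for failed requests: an error status code raises urllib.error.HTTPError, and network-level failures raise urllib.error.URLError. A minimal sketch (the nonexistent path is made up purely for illustration):

python
from urllib import request, error

try:
    with request.urlopen('http://www.douban.com/nonexistent') as f:
        print(f.read().decode('utf-8'))
except error.HTTPError as e:
    # the server answered, but with an error status such as 404
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # the request itself failed (DNS error, refused connection, ...)
    print('URLError:', e.reason)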

POST Requests

To send data with a POST request, encode the form parameters as bytes and pass them to urlopen() via its data argument. For example, to log in to Weibo:

python
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    # ... other site-specific form fields omitted
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 ...')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&...')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
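
The Weibo form fields above are site-specific and partly omitted. For a self-contained experiment, the same pattern can be pointed at the public echo service httpbin.org, used here only as a stand-in endpoint; passing the data argument is what turns the request into a POST:

python
from urllib import request, parse

# hypothetical form fields; httpbin.org/post simply echoes the request back
form = parse.urlencode([('username', 'alice'), ('password', 'secret')])

req = request.Request('https://httpbin.org/post')
with request.urlopen(req, data=form.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    print('Data:', f.read().decode('utf-8'))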

Using a Proxy

If you need to route requests through a proxy, you can use ProxyHandler:

python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)

with opener.open('http://www.example.com/login.html') as f:
    pass
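
If every later request should go through the proxy, the opener can also be installed as the module-wide default so that plain urlopen() calls pick it up. A short sketch continuing the example above:

python
# make the proxy-aware opener the default for all subsequent urlopen() calls
urllib.request.install_opener(opener)

with urllib.request.urlopen('http://www.example.com/login.html') as f:
    pass  # this request is now routed through the proxy as well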

Summary

urllib allows for executing various HTTP requests programmatically. To simulate browser behavior, you need to mimic the headers sent by the browser, especially the User-Agent header.

Exercise

To read JSON data using urllib and parse it into a Python object:

python
from urllib import request
import json

def fetch_data(url):
    with request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

# Test
URL = 'https://api.weatherapi.com/v1/current.json?key=b4e8f86b44654e6b86885330242207&q=Beijing&aqi=no'
data = fetch_data(URL)
print(data)
assert data['location']['name'] == 'Beijing'
print('ok')