urllib
The urllib module provides a set of functions for working with URLs, making it easy to fetch the content of a URL.
GET Requests
The request module in urllib makes it straightforward to send GET requests. For example, to fetch data from a specific URL such as https://api.douban.com/v2/book/2129650:
python
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()  # response body as bytes
    print('Status:', f.status, f.reason)
    # Print every response header as "name: value".
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
This prints the HTTP status line, the response headers, and the JSON data returned by the server.
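A request can also fail, so it may help to see how the same fetch looks with error handling. The following is a minimal sketch, not part of the original example: the helper name fetch_text and the 10-second timeout are assumptions made for illustration. It uses a data: URL in the demo so that it runs without network access; a real call would pass an http or https URL instead.

```python
from urllib import error, request

def fetch_text(url, timeout=10):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper, not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout) as f:
            # Use the charset declared in Content-Type, falling back to UTF-8.
            charset = f.headers.get_content_charset() or 'utf-8'
            return f.read().decode(charset)
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError must be caught first because it subclasses URLError.
        print('HTTP error:', e.code, e.reason)
    except error.URLError as e:
        # The request never got a usable answer (DNS failure, bad scheme, ...).
        print('Request failed:', e.reason)
    return None

# A data: URL is handled locally by urllib, so no network is needed here.
print(fetch_text('data:text/plain;charset=utf-8,hello'))
```

Returning None on failure keeps the calling code simple; a larger program might prefer to let the exception propagate.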
To simulate a browser sending a GET request, you can use a Request object to add HTTP headers. For instance, to simulate an iPhone requesting Douban's homepage:
python
from urllib import request

req = request.Request('http://www.douban.com/')
# Present the request as coming from an iPhone browser.
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 ...')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
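Because a Request object only describes a request, it can be inspected before anything is sent. The sketch below assumes a hypothetical helper build_browser_request and a made-up User-Agent string; it passes the header through the headers argument of Request, which is equivalent to calling add_header afterwards.

```python
from urllib import request

def build_browser_request(url, user_agent):
    """Return a Request carrying the given User-Agent header.

    build_browser_request is a hypothetical helper for illustration.
    """
    return request.Request(url, headers={'User-Agent': user_agent})

# Made-up User-Agent string, used only for this illustration.
EXAMPLE_UA = 'ExampleBrowser/1.0 (illustrative)'
req = build_browser_request('http://www.douban.com/', EXAMPLE_UA)

# Nothing has been sent yet; these calls only read the prepared request.
print(req.get_method())              # GET, because no data was attached
# Request stores header names capitalized, hence 'User-agent'.
print(req.get_header('User-agent'))
```

Checking the prepared request this way is a cheap way to confirm that the headers are what you intended before calling urlopen.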
POST Requests
To send a POST request, encode the parameters as bytes and pass them through the data argument. For example, to log in to Weibo:
python
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
# URL-encode the form fields into a single query string.
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    # ... (remaining form fields are elided in the original)
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 ...')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&...')

# Passing data makes urlopen send a POST instead of a GET.
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
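To see what the encoding step actually produces, the form body can be built and examined without contacting any server. This is a minimal sketch under stated assumptions: the helper build_form_request, the URL https://example.com/login, and the sample credentials are all hypothetical and unrelated to the Weibo login above.

```python
from urllib import parse, request

def build_form_request(url, fields):
    """Return a POST Request whose body is the URL-encoded form fields.

    build_form_request is a hypothetical helper for illustration.
    """
    body = parse.urlencode(fields).encode('utf-8')
    return request.Request(url, data=body)

# Hypothetical endpoint and credentials; no request is sent.
req = build_form_request('https://example.com/login', [
    ('username', 'alice@example.com'),
    ('password', 'p@ss word'),
])

print(req.get_method())  # POST, because the request carries data
print(req.data)          # special characters are percent-encoded, spaces become '+'
# parse_qs reverses the encoding, which is handy for checking the body.
print(parse.parse_qs(req.data.decode('utf-8')))
```

Attaching the body to the Request is equivalent to passing data to urlopen; either way the presence of data is what switches the method to POST.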
Using a Proxy
If you need to route requests through a proxy, you can use ProxyHandler:
python
import urllib.request

# Send plain HTTP requests through the proxy at www.example.com:3128.
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
# Supply credentials in case the proxy asks for basic authentication.
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
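The 'realm' and 'host' placeholders above have to match what the proxy announces. One way to avoid guessing the realm is a default-realm password manager, sketched below. The proxy address and credentials are hypothetical, and this is only one possible arrangement rather than a required one. Building the opener sends nothing over the network.

```python
from urllib import request

# Hypothetical proxy address, used only for illustration.
PROXY_URL = 'http://proxy.example.com:3128/'

# Passing None as the realm makes the credentials apply to any realm
# the proxy announces for this address.
password_mgr = request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, PROXY_URL, 'username', 'password')

opener = request.build_opener(
    request.ProxyHandler({'http': PROXY_URL, 'https': PROXY_URL}),
    request.ProxyBasicAuthHandler(password_mgr),
)

# Confirm which credentials would be offered to the proxy.
print(password_mgr.find_user_password(None, PROXY_URL))

# To make plain urlopen() calls use this opener as well:
# request.install_opener(opener)
```

Calling opener.open(url) uses the proxy for that call only, while install_opener changes the behavior of urlopen for the whole process.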
Summary
urllib lets a program send various kinds of HTTP requests. To simulate browser behavior, mimic the headers the browser sends, especially the User-Agent header.
Exercise
Use urllib to read JSON data and parse it into a Python object:
python
from urllib import request
import json

def fetch_data(url):
    # Fetch the URL and parse the JSON body into a Python object.
    with request.urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))

# Test
URL = 'https://api.weatherapi.com/v1/current.json?key=b4e8f86b44654e6b86885330242207&q=Beijing&aqi=no'
data = fetch_data(URL)
print(data)
assert data['location']['name'] == 'Beijing'
print('ok')