urllib
Overview¶
urllib is a standard-library package for working with URLs. It contains four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser. This page focuses on urllib.request.
Get Started¶
The urllib.request module opens and reads URLs. The example below sends a GET request with custom headers. For HTTP and HTTPS URLs, the response is an instance of http.client.HTTPResponse.
from urllib import request

# build a request object
req = request.Request('https://www.baidu.com', headers={
    'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}, method='GET')

# issue the request and print the result
with request.urlopen(req, timeout=30) as response:
    print(response.read().decode('utf-8'))
To process the result further (commonly an HTML page), use a parser such as BeautifulSoup.
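Requests can fail, and urllib.error (listed in the overview) defines the exceptions raised in that case. The following is a minimal sketch, not part of the original example: `fetch_text` is a hypothetical helper name, and returning `None` on failure is an assumption made here for brevity.

```python
from urllib import error, request


def fetch_text(url, timeout=30, encoding='utf-8'):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper, not part of urllib.
    """
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print(f'HTTP error {exc.code}: {exc.reason}')
    except error.URLError as exc:
        # No usable response was obtained (for example a DNS failure,
        # a refused connection, or a missing local file).
        print(f'Failed to reach {url}: {exc.reason}')
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first; otherwise the more general handler would hide the status code.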
HTTP Header¶
See more details here.
Request¶
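Header | Description | Example |
---|---|---|
Accept | Content types that the client accepts | text/html,application/xml |
Accept-Charset | Character sets that the client accepts | |
Accept-Encoding | Content encodings (compression) that the client accepts | gzip, deflate |
Accept-Language | Languages that the client prefers | zh-CN,zh;q=0.9,zh-TW;q=0.8 |
Host | netloc of the target URL | www.douban.com |
Origin | scheme://netloc of the requesting page. Sent with POST and CORS requests | |
Referer | URL of the previous page, from which the request originated | |
User-Agent | Identifies the client software | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36 |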
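The Host and Origin rows are described in terms of scheme and netloc, which are the component names used by urllib.parse. As a small illustration, the sketch below derives both values from a URL; `host_and_origin` is a hypothetical helper, and the sample URL path is made up for the example.

```python
from urllib.parse import urlparse


def host_and_origin(url):
    """Return (Host, Origin) values derived from url.

    host_and_origin is a hypothetical helper used for illustration only.
    """
    parts = urlparse(url)
    host = parts.netloc                           # e.g. www.douban.com
    origin = f'{parts.scheme}://{parts.netloc}'   # scheme://netloc
    return host, origin


# Example call with an illustrative URL:
# host_and_origin('https://www.douban.com/group/explore?tag=a')
```

Note that netloc also contains any user information embedded in the URL (user:password@host), which should not be sent in these headers; this sketch assumes URLs without it.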
Response¶
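Header | Description | Example |
---|---|---|
Content-Encoding | Encoding (such as compression) applied to the body | |
Content-Language | Language of the content | |
Content-Length | Size of the body in bytes | |
Content-Location | Alternate URL for the returned content | |
Content-MD5 | Base64-encoded MD5 digest of the body (obsolete) | |
Content-Type | MIME type of the body | |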
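These response headers matter when reading a body with urllib, because urllib.request does not decompress the body or pick a charset automatically. The sketch below shows one way to honor Content-Encoding and Content-Type; `decode_body` is a hypothetical helper, it handles only gzip, and falling back to UTF-8 is an assumption. The `headers` argument is expected to behave like `response.headers`, which is an `email.message.Message` subclass.

```python
import gzip
from email.message import Message


def decode_body(body: bytes, headers: Message, default_charset: str = 'utf-8') -> str:
    """Decode a response body according to its headers.

    decode_body is a hypothetical helper; only gzip compression is handled.
    """
    # Undo gzip compression if the server applied it.
    if headers.get('Content-Encoding', '').lower() == 'gzip':
        body = gzip.decompress(body)
    # Use the charset parameter of Content-Type when present.
    charset = headers.get_content_charset() or default_charset
    return body.decode(charset, errors='replace')
```

With a real response, this could be called as `decode_body(response.read(), response.headers)`.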
Download¶
urllib offers two ways to download files from the Internet.
urllib.request.urlretrieve¶
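from urllib import request

def reporthook(block_count, block_size, total_size):
    # total_size is not positive when the server does not report a size
    if total_size > 0:
        # the last block may be partial, so cap the value at 100%
        percent = min(100.0, 100 * block_count * block_size / total_size)
        print('\rDownloading: %.2f%%' % percent, end='')

url = 'https://www.baidu.com/'
filename, headers = request.urlretrieve(url, filename='baidu.html', reporthook=reporthook, data=None)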
filename denotes the target path of the downloaded file. If omitted, the content is saved to a temporary file with a generated name.
reporthook is a callback that is called once when the connection is established and once after each block is read. It receives the number of blocks transferred so far, the block size in bytes, and the total file size, so it can be used to print download progress.
data specifies data to send in the request body. When it is given, the request is sent as POST.
The function returns a tuple of the local file path and the response headers.
Multi-Thread¶
The second way is to issue requests directly with urllib.request and speed up the downloads with threads.
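The original code for this approach is not shown here, so the following is only one possible sketch: each file is streamed to disk with `urlopen`, and several files are fetched concurrently with a thread pool. The names `download` and `download_all`, the `(url, dest)` job format, and the default of four workers are assumptions made for this example.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def download(url, dest, timeout=30):
    """Stream one URL to the file dest and return dest (hypothetical helper)."""
    with request.urlopen(url, timeout=timeout) as response, open(dest, 'wb') as out:
        # Copy in chunks so the whole body is never held in memory.
        shutil.copyfileobj(response, out)
    return dest


def download_all(jobs, max_workers=4):
    """Download every (url, dest) pair in jobs concurrently (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(download, url, dest) for url, dest in jobs]
        # result() re-raises any exception that occurred in the worker thread.
        return [future.result() for future in futures]
```

Threads help here because downloads spend most of their time waiting on the network. A call such as `download_all([(url_a, 'a.html'), (url_b, 'b.html')])` returns the destination paths in the same order as the jobs.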