urllib

Overview¶

urllib is a standard-library package for working with URLs. It contains four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser. This page focuses mainly on urllib.request.

Get Started¶

The urllib.request module opens and reads URLs. The example below builds a request with custom headers and sends it. For an HTTP or HTTPS URL, the response is an instance of http.client.HTTPResponse.

from urllib import request

# build a request object
req = request.Request('https://www.baidu.com', headers={
    'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}, method='GET')

# issue the request and print the result
with request.urlopen(req, timeout=30) as response:
    print(response.read().decode('utf-8'))

To process the result further (it is commonly an HTML page), a parser such as BeautifulSoup can be used.
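The example above assumes the request succeeds. As a minimal sketch of how urllib.error fits in, the helper below (the name fetch_text is hypothetical, not part of urllib) wraps urlopen and reports the two exception types the package defines. HTTPError is a subclass of URLError, so it is caught first.

```python
from urllib import error, request


def fetch_text(url, timeout=30):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', exc.code, exc.reason)
    except error.URLError as exc:
        # No valid response was obtained (DNS failure, refused connection, ...).
        print('Failed to reach the server:', exc.reason)
    return None
```

Returning None on failure is only one possible design; re-raising the exception may suit callers that need to distinguish the cases.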

HTTP Header¶

See more details here.

Request¶

Header	Description	Example
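Accept	Content types the client accepts	text/html,application/xml
Accept-Charset	Character sets the client accepts
Accept-Encoding	Content encodings (compression) the client accepts	gzip, deflate
Accept-Language	Natural languages the client prefers	zh-CN,zh;q=0.9,zh-TW;q=0.8
Host	Host name and optional port of the server (the netloc of the URL)	www.douban.com
Origin	scheme://netloc of the requesting page. Sent with POST requests and CORS requests
Referer	URL of the page from which the request originated
User-Agent	Identifies the client software	Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36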
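The Host and Origin rows are described in terms of the URL's scheme and netloc. A small sketch with urllib.parse.urlsplit shows where those values come from; the path and query string in the URL are made up for illustration.

```python
from urllib.parse import urlsplit

# Illustrative URL; only the host name comes from the table above.
parts = urlsplit('https://www.douban.com/people/example/?start=20')

host = parts.netloc                          # value for the Host header
origin = f'{parts.scheme}://{parts.netloc}'  # value for the Origin header

print(host)    # www.douban.com
print(origin)  # https://www.douban.com
```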

Response¶

Header	Description	Example
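Content-Encoding	Encoding (such as gzip) applied to the body
Content-Language	Natural language of the intended audience
Content-Length	Size of the body in bytes
Content-Location	Alternate location (URL) for the returned content
Content-MD5	Base64-encoded MD5 digest of the body
Content-Type	MIME type of the body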
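Response headers can be read from the object returned by urlopen through its headers attribute. The sketch below uses two hypothetical helpers (summarize_headers and body_text are not part of urllib) to collect a few headers from the table and to decode the body with the charset declared in Content-Type, falling back to UTF-8 as an assumption when none is declared.

```python
def summarize_headers(response):
    """Return a few response headers as a dict (missing ones map to None)."""
    names = ('Content-Type', 'Content-Length', 'Content-Encoding')
    return {name: response.headers.get(name) for name in names}


def body_text(response, default_charset='utf-8'):
    """Decode the body using the charset from Content-Type, if present."""
    charset = response.headers.get_content_charset() or default_charset
    return response.read().decode(charset)
```

Both helpers expect a response opened with urllib.request.urlopen, for example inside the with block of the Get Started example.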

Download¶

There are two ways to download files from the Internet with this package.

urllib.request.urlretrieve ¶

from urllib import request


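def reporthook(block_count, block_size, total_size):
    # total_size is -1 when the size is unknown (e.g. no Content-Length header)
    if total_size <= 0:
        return
    # the last block may be partial, so cap the value at 100%
    percent = min(100.0, 100 * block_count * block_size / total_size)
    print('\rDownloading: %.2f%%' % percent, end='')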


url = 'https://www.baidu.com/'
filename, headers = request.urlretrieve(url, filename='baidu.html', reporthook=reporthook, data=None)

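filename is the target path of the downloaded file. If it is omitted, the file is saved as a temporary file with a generated name in the operating system's temp directory.

reporthook is a callback that is called once when the connection is established and once after each block is read. It receives three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the file. It can be used to print the download progress.

data, if given, must be a bytes object. Supplying it turns the request into a POST, with data sent as the request body.

The function returns a tuple (filename, headers): the local path of the saved file and the headers of the response.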
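As a sketch of how a value for the data parameter might be prepared, the snippet below encodes form fields with urllib.parse.urlencode and converts the result to bytes. The field names and the example.com URL are placeholders for illustration. Building a Request object does not touch the network; it is used here only to show that supplying data switches the method to POST.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
fields = {'q': 'urllib', 'page': 1}

# urlencode returns a str; encode it to get the bytes object that `data` expects.
data = parse.urlencode(fields).encode('ascii')

# No request is sent here; the object only shows the effect of `data`.
req = request.Request('https://www.example.com/search', data=data)
print(req.get_method())  # POST
```

The same bytes object could be passed as the data argument of urlretrieve.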

Multi-Thread¶

The second way is to issue the requests directly with urllib.request.urlopen and run them in several threads to speed up the download.
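One possible reading of this approach is sketched below: several files are downloaded concurrently, one request per thread, using a thread pool. The helper names download and download_all are hypothetical, and the use of concurrent.futures.ThreadPoolExecutor is an assumption, since the text does not say how the threads are managed.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from urllib import request


def download(url, path, timeout=30):
    """Fetch one URL and stream the response body to `path`."""
    with request.urlopen(url, timeout=timeout) as response, open(path, 'wb') as f:
        shutil.copyfileobj(response, f)
    return path


def download_all(jobs, max_workers=4):
    """Download (url, path) pairs concurrently; return the paths in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: download(*job), jobs))
```

For instance, download_all([('https://www.baidu.com/', 'baidu.html')]) would save that page to baidu.html. Threads help here because each download spends most of its time waiting on the network rather than using the CPU.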
