Spider
Overview
A web spider (also called a web crawler) is a script that automatically collects information from the Internet.
Architecture
A spider typically follows the process Request -> Get response -> Parse the content -> Save useful data, and consists of five parts:
- Scheduler: coordinates the following parts so that they work together.
- URL Manager: keeps track of the URLs already visited and the URLs still to visit, using memory, a database or a cache.
- Downloader: downloads web pages from the specified URLs. Logging in, a proxy or cookies may be required.
- Parser: parses the downloaded pages to extract useful information, either by walking the DOM tree or by searching the whole page as a string. Regular expressions, html.parser, beautifulsoup and lxml are all options.
- Application: makes use of the extracted information.
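The five parts can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a production crawler: the names `URLManager`, `download`, `LinkTitleParser`, `parse` and `crawl` are hypothetical, the URL manager lives in memory only, the downloader handles no login, proxy or cookies, and the "application" step simply stores each page title.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class URLManager:
    """URL Manager: in-memory record of URLs seen and URLs still to visit."""

    def __init__(self):
        self.pending = deque()
        self.seen = set()

    def add(self, url):
        # Queue a URL only the first time it is encountered.
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def has_next(self):
        return bool(self.pending)

    def next(self):
        return self.pending.popleft()


def download(url):
    """Downloader: fetch one page and return its text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkTitleParser(HTMLParser):
    """Parser: collects the <title> text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(base_url, html):
    """Return (title, absolute links) for one downloaded page."""
    parser = LinkTitleParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.links]
    return parser.title.strip(), links


def crawl(start_url, fetch=download, max_pages=10):
    """Scheduler: drives the URL manager, downloader and parser."""
    urls = URLManager()
    urls.add(start_url)
    results = {}
    while urls.has_next() and len(results) < max_pages:
        url = urls.next()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        title, links = parse(url, html)
        results[url] = title  # Application: here we only keep the titles
        for link in links:
            # Follow only HTTP(S) links; mailto:, javascript: etc. are ignored.
            if link.startswith(("http://", "https://")):
                urls.add(link)
    return results
```

Calling `crawl("https://example.com/")` would fetch real pages over the network. The `fetch` parameter is there so that a different downloader (for example one that adds cookies or a proxy, or a stub for offline testing) can be swapped in without touching the scheduler.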