Wiki-Kingen


Spider

Overview

A web spider (also called a web crawler) is a script that automatically fetches pages from the Internet and extracts information from them.

Architecture

A spider typically follows the process Request -> Get response -> Parse the content -> Save useful data. It consists of five parts:

Scheduler: coordinates the other parts so that they work together.
URL Manager: tracks the URLs that have been visited and those still to visit, stored in memory, a database, or a cache.
Downloader: downloads web pages from the specified URLs. Logging in, proxies, and cookies may be required.
Parser: extracts useful information from the downloaded pages, either by traversing the DOM tree or by matching against the raw page string. Regular expressions, html.parser, BeautifulSoup, and lxml are all options.
Application: makes use of the extracted information.
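
The five parts above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the names `UrlManager`, `download`, `LinkAndTitleParser`, `parse`, and `crawl` are hypothetical, the User-Agent string is an assumption, and the "application" step simply records each page's title. Login, proxy, and cookie handling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class UrlManager:
    """URL Manager: in-memory record of URLs to visit and URLs already seen."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, url):
        # Queue a URL only the first time it is encountered.
        if url not in self._seen:
            self._seen.add(url)
            self._pending.append(url)

    def has_next(self):
        return bool(self._pending)

    def next(self):
        return self._pending.popleft()


def download(url, timeout=10):
    """Downloader: fetch a page and return its decoded text."""
    # The User-Agent value is an arbitrary placeholder (assumption).
    request = Request(url, headers={"User-Agent": "example-spider/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkAndTitleParser(HTMLParser):
    """Parser: collect the <title> text and every <a href> target."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(base_url, html):
    """Return (title, absolute links) for one downloaded page."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    links = [urljoin(base_url, href) for href in parser.links]
    return parser.title.strip(), links


def crawl(start_url, max_pages=10, downloader=download):
    """Scheduler: drive URL Manager -> Downloader -> Parser -> Application."""
    urls = UrlManager()
    urls.add(start_url)
    results = {}  # Application: here just a url -> title mapping
    while urls.has_next() and len(results) < max_pages:
        url = urls.next()
        try:
            html = downloader(url)
        except OSError:  # urllib.error.URLError and timeouts derive from OSError
            continue
        title, links = parse(url, html)
        results[url] = title
        for link in links:
            # Follow only web links; skip mailto:, javascript:, and similar.
            if link.startswith(("http://", "https://")):
                urls.add(link)
    return results
```

Calling `crawl("https://example.com/", max_pages=5)` would return a dictionary mapping each visited URL to its title. The `downloader` parameter is there so that the network step can be replaced, for example with a function that adds cookies or reads pages from a local cache.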
