codesamples-python3-gevent-site-parser
Gevent site parser
Description
- Site parser for news.ycombinator.com that parses the main page and the comments page. It also includes an allowed-clicks option so the site is not hammered with too many requests. The configuration script should be run once every 2 days - enough time for the domain to accumulate a decent amount of items on each page
- news.ycombinator.com is a tough domain - actually the second harshest in my practice, after the now non-existent Monarch Flights site. After 10-15 concurrent attempts to connect, it bans the given IP and returns 403 for several hours, so potential clicks on the “More” button are limited to 3, plus 1 for the start page
- The main page of news.ycombinator.com appears to be the same as its news page, so the latter was used in the form news?p=1 to speed up development
- Since, from my end, news.ycombinator.com only allows 10-15 connections/requests within the same very short period of time, proxy usage is implemented in the script and a standard list of proxy IPs is used. If your connection to news.ycombinator.com is faster and you have better proxies at your disposal, you can edit the config file at ./codesamples-python3-gevent-site-parser/config/config.ini and raise the value of clicksallowed from 3 to whatever you like, but please also change proxylist to your better proxy servers, otherwise you may lose access to news.ycombinator.com with a 403 error
- During tests I safely raised the clicksallowed value up to 10 with the proxylist mentioned in the file; I limited it to 3 + 1 for the starting page of each site section in order to be polite to the site community and not get a 403 accidentally
- Configuration of the crawler builds the list of pages for each site section (main, comments) based on the athing class on the very first page (last comment in the current minute) or page 1 (for the main/news page), in single-thread mode, because Greenlets cannot communicate the way Goroutines (Go threads) do: I cannot pass values between Greenlets on the fly or finish them based on each other's results (for example, when one Greenlet reaches the anticipated end). Something equivalent is possible in Python too, but it would require more development time, which had to be skipped due to the urgency of the interview. Besides, the configuration only needs refreshing once every 2 days given how slowly the news.ycombinator.com pages are updated - to me they still contain almost the same items as yesterday (January 22nd 2021 vs January 23rd 2021) and the last page stays within the same boundaries.
- Normal crawling for links is, according to the task description, implemented using the gevent module, and the results can be displayed on screen (all of them - a huge amount) or stored into JSON output; a minimal sketch of this crawl step follows this list
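Below is a minimal, illustrative sketch of this crawl step, assuming requests and lxml are available; the function names, XPath expression, and page range are placeholders and not necessarily what run_crawler.py actually uses.

```python
# Illustrative sketch of gevent-based crawling of news.ycombinator.com pages;
# names and the XPath below are assumptions, not the code used in this repo.
import gevent
from gevent import monkey
monkey.patch_all()  # patch sockets so blocking requests yield to other greenlets

import requests
from lxml import html


def fetch_links(url, proxies=None):
    """Fetch one page and return links found in rows carrying the 'athing' class."""
    resp = requests.get(url, proxies=proxies, timeout=10)
    tree = html.fromstring(resp.content)
    return tree.xpath('//tr[contains(@class, "athing")]//a/@href')


if __name__ == "__main__":
    pages = ["https://news.ycombinator.com/news?p=%d" % p for p in range(1, 4)]
    jobs = [gevent.spawn(fetch_links, url) for url in pages]
    gevent.joinall(jobs, timeout=30)
    for job in jobs:
        print(job.value)
```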
Limitations
- Single-thread config - can be fixed in the future, but would require some time to re-implement
- A proxy module is used, but for better results a Tor module should be used, as that approach provides better anonymity and a random IP for each request. It is not implemented because it would add extra modules to the deployment (at least SOCKS5 support and Tor itself) and would require restarting the Tor server each time on the local machine where the script runs
- Requests with Sessions could also be used to avoid the limit on concurrent connections and to log into news.ycombinator.com, which may allow more concurrent connections to the site in a given time period (registered users may get better access conditions); see the sketch after this list
- Proxies with better connection speed may be used
- One (n * n) loop could be re-implemented as a threaded loop, which is currently not part of the task
- It is possible to make the Crawler a little more generic and scalable; at the moment it seems possible to add new site sections to config.ini, but only for the single domain mentioned in the global section of config.ini
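As a rough illustration of the Session and Tor points above, a hedged sketch follows; the login form fields and the local Tor SOCKS port (9050) are assumptions, and requests[socks] plus a running Tor daemon would be required for the proxy part.

```python
# Sketch of the Session/Tor ideas from the limitations above; the login form
# fields and the Tor port are assumptions, not verified values.
import requests

session = requests.Session()

# Reuse one connection pool and keep cookies between requests; logging in is
# assumed to be a plain form POST here.
session.post("https://news.ycombinator.com/login",
             data={"acct": "user", "pw": "secret"}, timeout=10)

# Route everything through a local Tor SOCKS proxy instead of plain HTTP proxies
# (requires `pip install requests[socks]` and a running Tor daemon).
session.proxies.update({
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
})

resp = session.get("https://news.ycombinator.com/news?p=1", timeout=10)
print(resp.status_code)
```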
Possible modifications
- Multi-threaded config mode - need to invent/find a way to communicate between Gevent Greenlets (one option is sketched after this section)
- (n * n) loop to multi threaded loop
- Proxy requests to Tor-styled requests
- Enable the Crawler to be scalable in terms of how many pages/sites it can crawl - all of this should be changed/added in config.ini
Please kindly note that these modifications were not initially mentioned in the task description (see below) and, due to the urgency of the interview, were not implemented in order to save development time
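For the first modification, one possible way to get Goroutine-style communication between Greenlets is gevent.queue.Queue; the sketch below is only an illustration with made-up worker/collector names, not code from this repository.

```python
# Illustrative sketch: greenlets exchanging results through gevent.queue.Queue,
# roughly analogous to Go channels; names here are invented for the example.
import gevent
from gevent.queue import Queue

results = Queue()


def worker(page_no):
    # A real worker would fetch and parse the page here.
    results.put((page_no, "parsed-items-for-page-%d" % page_no))


def collector(expected):
    for _ in range(expected):
        page_no, items = results.get()  # blocks until a worker puts a result
        print(page_no, items)


pages = range(1, 4)
jobs = [gevent.spawn(worker, p) for p in pages]
jobs.append(gevent.spawn(collector, len(pages)))
gevent.joinall(jobs)
```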
Terms
- Pages - parts of the pagination (“More” button) at the bottom of the main page and the comments page on news.ycombinator.com
- Site sections - the aforementioned main page and comments page on news.ycombinator.com
Purposes
To demonstrate the ability to work with Gevent and multithreading in Python 3.x
Requirements
- Python 3.x
- virtualenv, gevent, lxml, requests, configparser, optparse and their dependencies
- Linux OS - tested on Ubuntu 18.04
- news.ycombinator.com to be in a healthy state
Installation instructions (approximate, not necessarily the exact steps to follow):
- git clone this project
- sudo pip3 install virtualenv
- cd codesamples-python3-gevent-site-parser
- virtualenv codesamples-python3-gevent-site-parser
- source codesamples-python3-gevent-site-parser/bin/activate
- pip install gevent
- pip install lxml
- pip install requests
- pip install configparser
- [optional] pip install optparse
- [optional] edit config.ini, if needed, at ./codesamples-python3-gevent-site-parser/config/config.ini (see the sketch after this list)
- [optional] deactivate
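For reference, a hedged sketch of reading the options mentioned above with configparser; only clicksallowed, proxylist and the global section are named in this README, so the exact layout and fallbacks shown here are assumptions.

```python
# Sketch of how the settings mentioned above could be read with configparser;
# the section layout and fallback values are assumptions, not the repo's code.
import configparser

config = configparser.ConfigParser()
config.read("config/config.ini")

clicks_allowed = config.getint("global", "clicksallowed", fallback=3)
proxy_list = [p.strip() for p in
              config.get("global", "proxylist", fallback="").split(",") if p.strip()]

print(clicks_allowed, proxy_list)
```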
How to run?:
- python run_crawler.py -h - will provide help information
- python run_crawler.py -c - will configure parser in single thread mode
- python run_crawler.py -r - will crawl the site in gevent/multithread mode
- there are also additional options, which only work alongside the -r option:
- python run_crawler.py -r -v - view the results on screen (huge list so increase output limits in your Linux shell/PUTTY)
- python run_crawler.py -r -s - will store the results into ./codesamples-python3-gevent-site-parser/output/ separately for each site section
- the -r, -v and -s options can be combined as follows: python run_crawler.py -r -v -s, so the script will both display the results and store them (a rough sketch of how these flags could be parsed with optparse follows below)
- You can also use the included BASH script ./menu.sh to see some functions organized in a menu
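For context, a hypothetical optparse setup matching the flags listed above; the actual run_crawler.py may name and wire its options differently.

```python
# Hypothetical optparse setup matching the flags listed above; the real
# run_crawler.py may define its options differently.
from optparse import OptionParser

parser = OptionParser(usage="usage: %prog [-c] [-r [-v] [-s]]")
parser.add_option("-c", "--configure", action="store_true", dest="configure",
                  help="configure the parser (single-thread mode)")
parser.add_option("-r", "--run", action="store_true", dest="run",
                  help="crawl the site in gevent mode")
parser.add_option("-v", "--view", action="store_true", dest="view",
                  help="print the results to the screen")
parser.add_option("-s", "--store", action="store_true", dest="store",
                  help="store the results as JSON under ./output/")

options, args = parser.parse_args()
if (options.view or options.store) and not options.run:
    parser.error("-v and -s only make sense together with -r")
```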
Task Description
Write a webcrawler using python that crawls a single domain. For example, given the URL ‘news.ycombinator.com’, it should crawl the main page and the comments, but not any external links. After finishing crawling, the program should print the links between pages.
The goal is to make the crawler work as fast as possible by using gevent.