Scrape the Recurship site and extract all the links; then navigate to each link one by one and extract information about the images.
### Demo code

```python
import requests
from parsel import Selector
import time

start = time.time()

all_images = {}  # website links as "keys" and image links as "values"
response = requests.get('http://recurship.com/')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()

for link in href_links:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            # build a fresh selector from the page we just fetched
            selector = Selector(response.text)
            image_links = selector.xpath('//img/@src').getall()
            all_images[link] = image_links
    except Exception as exp:
        print('Error navigating to link : ', link)

print(all_images)
end = time.time()
print("Time taken in seconds : ", (end - start))
```
Task II takes 22 seconds to complete. Throughout, we are using the Python *parsel* and *requests* packages.

Let's look at some of the features these packages offer.
### Requests package
The Python `requests` module offers the following features (see the sketch after this list):

- HTTP method calls
- Working with response codes and headers
- Redirection handling and request history
- Session management
- Working with cookies
- Error and exception handling
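Here is a minimal sketch of a few of these features in action. It reuses the site we are already scraping; everything else is the standard `requests` API:

```python
import requests

# a Session keeps cookies and connection state across requests
session = requests.Session()

try:
    # GET call; redirects are followed by default
    response = session.get('http://recurship.com/', timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx response codes

    print(response.status_code)                  # response code
    print(response.headers.get('Content-Type'))  # response headers
    print([r.url for r in response.history])     # redirection history
    print(session.cookies.get_dict())            # cookies kept by the session
except requests.exceptions.RequestException as exp:
    print('Request failed:', exp)
```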
### Parsel package
The Python `parsel` package offers the following features (see the sketch after this list):

- Extracting text using CSS or XPath selectors
- Regular expression helper methods
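A minimal sketch of both features, using a small made-up HTML snippet:

```python
from parsel import Selector

html = '<html><body><h1>Price: $42</h1><a href="/about">About</a></body></html>'
selector = Selector(text=html)

# extract text with CSS or XPath selectors
print(selector.css('h1::text').get())        # 'Price: $42'
print(selector.xpath('//a/@href').getall())  # ['/about']

# regular expression helpers run on top of any selector
print(selector.css('h1::text').re_first(r'\$(\d+)'))  # '42'
```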
### Crawler service using requests and parsel
The code:
```python
import logging
import os
import random
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)

LOG_PREFIX = 'RequestManager:'

# simple alias for the context dicts passed around below
RequestContext = dict


class RequestManager:
    crawler_name = None
    session = requests.session()
    # This is for agent spoofing...
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
        'Mozilla/4.0 (X11; Linux x86_64) AppleWebKit/567.36 (KHTML, like Gecko) Chrome/62.0.3239.108 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
    ]
    headers = {}
    cookie = None
    debug = True

    def __init__(self):
        self.set_user_agents()  # keep the user-agent the same throughout one request

    def file_name(self, context: RequestContext, response, request_type: str = 'GET'):
        url = urlparse(response.url).path.replace("/", "|")
        return f'{time.time()}_{context.get("key")}_{context.get("category")}_{request_type}_{response.status_code}_{url}'

    # write a file, safely
    def write(self, name, text):
        if self.debug:
            os.makedirs('logs', exist_ok=True)
            with open(f'logs/{name}.html', 'w') as file:
                file.write(text)

    def set_user_agents(self):
        self.headers.update({
            'user-agent': random.choice(self.user_agents)
        })

    def set_headers(self, headers):
        logger.info(f'{LOG_PREFIX}:SETHEADER set headers {headers}')
        self.session.headers.update(headers)

    def get(self, url: str, context: RequestContext = {}, withCookie: bool = False):
        logger.info(f'{LOG_PREFIX}-{self.crawler_name}:GET making get request {url} {context} {withCookie}')
        cookies = self.cookie if withCookie else None
        response = self.session.get(url=url, cookies=cookies, headers=self.headers)
        self.write(self.file_name(context, response), response.text)
        return response

    def post(self, url: str, data, withCookie: bool = False, allow_redirects=True, context: RequestContext = {}):
        logger.info(f'{LOG_PREFIX}:POST making post request {url} {data} {context} {withCookie}')
        cookies = self.cookie if withCookie else None
        response = self.session.post(url=url, data=data, cookies=cookies, allow_redirects=allow_redirects)
        self.write(self.file_name(context, response, request_type='POST'), response.text)
        return response

    def set_cookie(self, cookie):
        self.cookie = cookie
        logger.info(f'{LOG_PREFIX}-{self.crawler_name}:SET_COOKIE set cookie {self.cookie}')


Request = RequestManager()

context = {
    "key": "demo",
    "category": "history"
}
START_URI = "DUMMY_URL"  # URL of the signup portal
LOGIN_API = "DUMMY_LOGIN_API"
response = Request.get(url=START_URI, context=context)

Request.set_cookie('SOME_DUMMY_COOKIE')
Request.set_headers({'x-dummy-header': 'DUMMY_VALUE'})  # headers must be a dict

response = Request.post(url=LOGIN_API, data={'username': '', 'passphrase': ''}, context=context)
```
The `RequestManager` class offers a few features, listed below:

- POST and GET calls with logging
- Saving the response of each HTTP request as a log file
- Setting headers and cookies
- Session management
- Agent spoofing
## Scraping with AsyncIO
All we have to do is scrape the Recurship site and extract all the links; this time we navigate to each link asynchronously and extract information from the images.

### Demo code

```python
import requests
import aiohttp
import asyncio
from parsel import Selector
import time

start = time.time()
all_images = {}  # website links as "keys" and image links as "values"

async def fetch(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as exp:
        return '<html></html>'  # empty html for the invalid-URI case

async def main(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(fetch(session, url))
        # fetch all pages concurrently
        htmls = await asyncio.gather(*tasks)
    for index, html in enumerate(htmls):
        selector = Selector(html)
        image_links = selector.xpath('//img/@src').getall()
        all_images[urls[index]] = image_links
    print('*** all images : ', all_images)


response = requests.get('http://recurship.com/')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()

# run the event loop until every link has been processed
asyncio.run(main(href_links))

print("All done !")
end = time.time()
print("Time taken in seconds : ", (end - start))
```

With AsyncIO, scraping took almost 21 seconds. We can achieve even better performance on this task.
## Open-source Python frameworks for spiders
Python has multiple frameworks that take care of this optimization for us and give us different patterns to work with. There are three popular frameworks, namely:
1. Scrapy
2. PySpider
3. MechanicalSoup
Let's use Scrapy for the next demo.
### Scrapy
Scrapy is a scraping framework supported by an active community, and we can build our own scraping tools with it.
Here are a few of the features Scrapy provides:
- Scraping and parsing tools
- Easy export of collected data in a number of formats, like JSON or CSV, to a backend of your choosing
- A number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and more
- An API for easily building your own additions
- A cloud to host spiders, where spiders scale on demand and run anywhere from thousands to billions of requests. It's like the *Heroku* of spiders.
Once again, all we have to do is scrape the Recurship site and extract all the links; Scrapy then follows each link and extracts information from the images.

### Demo code

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'Links'

    start_urls = ['http://recurship.com/']
    images_data = {}

    def parse(self, response):
        # follow every link found on the page
        for img in response.css('a::attr(href)'):
            yield response.follow(img, self.parse_images)

        # Below commented portion is for following all pages
        # follow pagination links
        # for href in response.css('a::attr(href)'):
        #     yield response.follow(href, self.parse)

    def parse_images(self, response):
        # print("URL: " + response.request.url)
        def extract_with_css(query):
            return response.css(query).extract()

        yield {
            'URL': response.request.url,
            'image_link': extract_with_css('img::attr(src)')
        }
```
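If you save the spider as, say, `links_spider.py` (the file name here is just an example), you can run it without creating a full Scrapy project via `scrapy runspider links_spider.py -o images.json`; each dictionary the spider yields becomes one record in `images.json`.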