Scrape the Recurship site and extract all the links; then navigate to each link one by one and extract information about the images.
### Demo code

```python
import requests
from parsel import Selector
import time

start = time.time()

all_images = {}  # website links as "keys" and image links as "values"
response = requests.get('http://recurship.com/')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()

for link in href_links:
    try:
        response = requests.get(link)
        if response.status_code == 200:
            # build a fresh selector from the page we just fetched
            selector = Selector(response.text)
            image_links = selector.xpath('//img/@src').getall()
            all_images[link] = image_links
    except Exception as exp:
        print('Error navigating to link : ', link)

print(all_images)
end = time.time()
print("Time taken in seconds : ", (end - start))
```
Task II takes 22 seconds to complete. Throughout, we are using the Python *parsel* and *requests* packages.

Let's look at some of the features these packages offer.
### Requests package
The Python `requests` module offers the following features (see the sketch after this list):

- HTTP method calls
- Working with response codes and headers
- Redirection handling and request history
- Session management
- Working with cookies
- Error and exception handling
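Here is a minimal sketch of a few of these features in action. It reuses the site we are already scraping; everything else is the standard `requests` API:

```python
import requests

# a Session keeps cookies and connection state across requests
session = requests.Session()

try:
    # GET call; redirects are followed by default
    response = session.get('http://recurship.com/', timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx response codes

    print(response.status_code)                  # response code
    print(response.headers.get('Content-Type'))  # response headers
    print([r.url for r in response.history])     # redirection history
    print(session.cookies.get_dict())            # cookies kept by the session
except requests.exceptions.RequestException as exp:
    print('Request failed:', exp)
```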
### Parsel package
The Python `parsel` package offers the following features (see the sketch after this list):

- Extracting text using CSS or XPath selectors
- Regular expression helper methods
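A minimal sketch of both features, using a small made-up HTML snippet:

```python
from parsel import Selector

html = '<html><body><h1>Price: $42</h1><a href="/about">About</a></body></html>'
selector = Selector(text=html)

# extract text with CSS or XPath selectors
print(selector.css('h1::text').get())        # 'Price: $42'
print(selector.xpath('//a/@href').getall())  # ['/about']

# regular expression helpers run on top of any selector
print(selector.css('h1::text').re_first(r'\$(\d+)'))  # '42'
```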
### Crawler service using requests and parsel
The code:
```python
import logging
import os
import random
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)

LOG_PREFIX = 'RequestManager:'

# simple alias for the context dicts passed around below
RequestContext = dict


class RequestManager:
    crawler_name = None
    session = requests.session()
    # This is for agent spoofing...
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246',
        'Mozilla/4.0 (X11; Linux x86_64) AppleWebKit/567.36 (KHTML, like Gecko) Chrome/62.0.3239.108 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
    ]
    headers = {}
    cookie = None
    debug = True

    def __init__(self):
        self.set_user_agents()  # keep the user-agent the same throughout one request

    def file_name(self, context: RequestContext, response, request_type: str = 'GET'):
        url = urlparse(response.url).path.replace("/", "|")
        return f'{time.time()}_{context.get("key")}_{context.get("category")}_{request_type}_{response.status_code}_{url}'

    # write a file, safely
    def write(self, name, text):
        if self.debug:
            os.makedirs('logs', exist_ok=True)
            with open(f'logs/{name}.html', 'w') as file:
                file.write(text)

    def set_user_agents(self):
        self.headers.update({
            'user-agent': random.choice(self.user_agents)
        })

    def set_headers(self, headers):
        logger.info(f'{LOG_PREFIX}:SETHEADER set headers {headers}')
        self.session.headers.update(headers)

    def get(self, url: str, context: RequestContext = {}, withCookie: bool = False):
        logger.info(f'{LOG_PREFIX}-{self.crawler_name}:GET making get request {url} {context} {withCookie}')
        cookies = self.cookie if withCookie else None
        response = self.session.get(url=url, cookies=cookies, headers=self.headers)
        self.write(self.file_name(context, response), response.text)
        return response

    def post(self, url: str, data, withCookie: bool = False, allow_redirects=True, context: RequestContext = {}):
        logger.info(f'{LOG_PREFIX}:POST making post request {url} {data} {context} {withCookie}')
        cookies = self.cookie if withCookie else None
        response = self.session.post(url=url, data=data, cookies=cookies, allow_redirects=allow_redirects)
        self.write(self.file_name(context, response, request_type='POST'), response.text)
        return response

    def set_cookie(self, cookie):
        self.cookie = cookie
        logger.info(f'{LOG_PREFIX}-{self.crawler_name}:SET_COOKIE set cookie {self.cookie}')


Request = RequestManager()

context = {
    "key": "demo",
    "category": "history"
}
START_URI = "DUMMY_URL"  # URL of the signup portal
LOGIN_API = "DUMMY_LOGIN_API"
response = Request.get(url=START_URI, context=context)

Request.set_cookie('SOME_DUMMY_COOKIE')
Request.set_headers({'x-dummy-header': 'DUMMY_VALUE'})  # headers must be a dict

response = Request.post(url=LOGIN_API, data={'username': '', 'passphrase': ''}, context=context)
```
The `RequestManager` class offers a few features, listed below:

- POST and GET calls with logging
- Saving the response of each HTTP request as a log file
- Setting headers and cookies
- Session management
- Agent spoofing
## Scraping with AsyncIO
All we have to do is scrape the Recurship site and extract all the links; this time we navigate to each link asynchronously and extract information from the images.

### Demo code

```python
import requests
import aiohttp
import asyncio
from parsel import Selector
import time

start = time.time()
all_images = {}  # website links as "keys" and image links as "values"

async def fetch(session, url):
    try:
        async with session.get(url) as response:
            return await response.text()
    except Exception as exp:
        return '<html></html>'  # empty html for the invalid-URI case

async def main(urls):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            tasks.append(fetch(session, url))
        # fetch all pages concurrently
        htmls = await asyncio.gather(*tasks)
    for index, html in enumerate(htmls):
        selector = Selector(html)
        image_links = selector.xpath('//img/@src').getall()
        all_images[urls[index]] = image_links
    print('*** all images : ', all_images)


response = requests.get('http://recurship.com/')
selector = Selector(response.text)
href_links = selector.xpath('//a/@href').getall()

# run the event loop until every link has been processed
asyncio.run(main(href_links))

print("All done !")
end = time.time()
print("Time taken in seconds : ", (end - start))
```

With AsyncIO, scraping took almost 21 seconds. We can achieve even better performance on this task.
## Open-source Python frameworks for spiders
Python has multiple frameworks that take care of this optimization for us and give us different patterns to work with. There are three popular frameworks, namely:
1. Scrapy
2. PySpider
3. MechanicalSoup
Let's use Scrapy for the next demo.
### Scrapy
Scrapy is a scraping framework supported by an active community, and we can build our own scraping tools with it.
Here are a few of the features Scrapy provides:
- Scraping and parsing tools
- Easy export of collected data in a number of formats, like JSON or CSV, to a backend of your choosing
- A number of built-in extensions for tasks like cookie handling, user-agent spoofing, restricting crawl depth, and more
- An API for easily building your own additions
- A cloud to host spiders, where spiders scale on demand and run anywhere from thousands to billions of requests. It's like the *Heroku* of spiders.
Once again, all we have to do is scrape the Recurship site and extract all the links; Scrapy then follows each link and extracts information from the images.

### Demo code

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'Links'

    start_urls = ['http://recurship.com/']
    images_data = {}

    def parse(self, response):
        # follow every link found on the page
        for img in response.css('a::attr(href)'):
            yield response.follow(img, self.parse_images)

        # Below commented portion is for following all pages
        # follow pagination links
        # for href in response.css('a::attr(href)'):
        #     yield response.follow(href, self.parse)

    def parse_images(self, response):
        # print("URL: " + response.request.url)
        def extract_with_css(query):
            return response.css(query).extract()

        yield {
            'URL': response.request.url,
            'image_link': extract_with_css('img::attr(src)')
        }
```
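If you save the spider as, say, `links_spider.py` (the file name here is just an example), you can run it without creating a full Scrapy project via `scrapy runspider links_spider.py -o images.json`; each dictionary the spider yields becomes one record in `images.json`.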