Python

How to web scrape with Python in 4 minutes

Web Scraping:

Web scraping is used to extract data from websites, and it can save both time and effort. In this article, we will be extracting hundreds of files from the New York MTA. Some people find web scraping daunting, but it does not have to be: this article breaks the process into simple steps to get you comfortable with it.

New York MTA Data:

We will download the data from the below website:

http://web.mta.info/developers/turnstile.html

Turnstile data is compiled every week from May 2010 till now, so there are many files that exist on this site. For instance, below is an example of what data looks like.

You can right-click on the link and can save it to your desktop. That is web scraping!

Important Notes about Web scraping:

  1. Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes.
  2. Make sure you are not downloading data at too rapid a rate because this may break the website. You may potentially be blocked from the site as well.

Inspecting the website:

The first thing we need to find out is which HTML tag contains the information we want to scrape. A page contains a lot of code and many HTML tags, so we have to identify the one that holds our data and reference it in our code so that the data related to it can be extracted.

When you are on the website, right-click and choose the “inspect” option from the menu that appears. Click on it to see the code behind the page.

You can see the arrow symbol at the top of the console. 

If you click on the arrow and then click any text or element on the website, the tag related to the element you clicked will be highlighted.

I clicked on the Saturday, September 22, 2018 file, and the corresponding tag was highlighted in blue in the console.

<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>

You will see that all the .txt files come in <a> tags. <a> tags are used for hyperlinks.

Now that we know where the data lives, we can start coding!

Python Code:

The first and foremost step is importing the libraries:

import requests

import urllib.request

import time

from bs4 import BeautifulSoup

Now we have to set the url and access the website:

url = 'http://web.mta.info/developers/turnstile.html'

response = requests.get(url)

Now, we can use the features of beautiful soup for scraping.

soup = BeautifulSoup(response.text, "html.parser")

We will use the method findAll to get all the <a> tags.

soup.findAll('a')

This function will give us all the <a> tags.

Now, we will extract the actual link that we want.

one_a_tag = soup.findAll('a')[38]

link = one_a_tag['href']

This code extracts the link to the first turnstile .txt file on the page (the <a> tag at index 38) and stores it in the variable link.

download_url = 'http://web.mta.info/developers/'+ link

urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])

For pausing our code we will use the sleep function.

time.sleep(1)

To download all the data files, we have to wrap these steps in a for loop. I am attaching the entire code below so that you won’t face any problems.
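
The full script was attached as an image in the original post, so here is a minimal sketch of that loop, assuming the page structure described above (the turnstile .txt links begin around index 38 of the <a> tags; the range of ten files is just for illustration):

import time
import urllib.request

import requests
from bs4 import BeautifulSoup

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every <a> tag; the turnstile .txt links start around index 38.
all_a_tags = soup.findAll('a')
for one_a_tag in all_a_tags[38:48]:   # first ten data files, as an example
    link = one_a_tag.get('href', '')
    if not link.endswith('.txt'):
        continue                      # skip anything that is not a data file
    download_url = 'http://web.mta.info/developers/' + link
    # Save the file locally under its original turnstile_*.txt name.
    urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_') + 1:])
    time.sleep(1)                     # pause so we do not overwhelm the server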

I hope you understood the concept of web scraping.

Enjoy reading and have fun while scraping!

An Intro to Web Scraping with lxml and Python:

Sometimes the data we want is not available through an API, or the API does not expose it. In the absence of an API, the only choice left is to build a web scraper. The scraper's job is to collect all the information we want easily and in very little time.

Below is an example of a typical API response in JSON; this particular response comes from Reddit.

There are various Python libraries that help with web scraping, namely Scrapy, lxml, and Beautiful Soup.

Many articles explain how to use Beautiful Soup and Scrapy, but I will be focusing on lxml: I will show you what XPaths are and how to use them to extract data from HTML documents.

Getting the data:

If you are into gaming, then you must be familiar with the website Steam.

We will be extracting the data from the “popular new release” information.

Now, right-click on the website and you will see the inspect option. Click on it and select the HTML tag.

We want the anchor tags because every listed title is encapsulated in an <a> tag.

The anchor tags lie in a div tag with the id tab_newreleases_content. We are mentioning the id because there are two tabs on this page and we only want the information from the Popular New Releases tab.

Now, create your python file and start coding. You can name the file according to your preference. Start importing the below libraries:

import requests 

import lxml.html

If you don’t have requests installed, then type the below command in your terminal:

$ pip install requests

Requests module helps us open the webpage in python.

Extracting and processing the information:

Now, let’s open the web page using the requests and pass that response to lxml.html.fromstring.

html = requests.get('https://store.steampowered.com/explore/new/') 

doc = lxml.html.fromstring(html.content)

This provides us with a structured way to extract information from an HTML document. Now we will write an XPath for extracting the div which contains the “Popular New Releases” tab.

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

We are taking only one element ([0]) and that would be our required div. Let us break down the path and understand it.

  • // tells lxml that we want to search all tags in the HTML document which match our requirements.
  • div tells lxml that we want to find div tags.
  • @id="tab_newreleases_content" tells lxml that we are only interested in divs whose id is tab_newreleases_content.

Awesome! Now we understand what it means so let’s go back to inspect and check under which tag the title lies.

The title name lies in a div tag with the class tab_item_name. Now we will run an XPath query to get the title names.

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')







We can see that the names of the popular releases came. Now, we will extract the price by writing the following code:

prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

Now, we can see that the prices are also scraped. We will extract the tags by writing the following command:

tags = new_releases.xpath('.//div[@class="tab_item_top_tags"]')

total_tags = []

for tag in tags:

    total_tags.append(tag.text_content())

We are extracting the div containing the tags for each game, then looping over the extracted divs and pulling out their text with the tag.text_content() method.

Now, the only thing remaining is to extract the platforms associated with each title. Here is the HTML markup:

The major difference here is that platforms are not contained as texts within a specific tag. They are listed as class name so some titles only have one platform associated with them:

 

<span class="platform_img win"></span>

 

While others have 5 platforms like this:

 

<span class="platform_img win"></span><span class="platform_img mac"></span><span class="platform_img linux"></span><span class="platform_img hmd_separator"></span> <span title="HTC Vive" class="platform_img htcvive"></span> <span title="Oculus Rift" class="platform_img oculusrift"></span>

The span tag contains platform types as the class name. The only thing common between them is they all contain platform_img class.

First of all, we have to extract the div tags containing the tab_item_details class. Then we will extract the span containing the platform_img class. Lastly, we will extract the second class name from those spans. Refer to the below code:

platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')

total_platforms = []

for game in platforms_div:
    temp = game.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [t.get('class').split(' ')[-1] for t in temp]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')
    total_platforms.append(platforms)

Now we just need this to return a JSON response so that we can easily turn it into a Flask-based API.

output = []
for info in zip(titles, prices, tags, total_platforms):
    resp = {}
    resp['title'] = info[0]
    resp['price'] = info[1]
    resp['tags'] = info[2]
    resp['platforms'] = info[3]
    output.append(resp)

We are using the zip function to loop over all of the lists in parallel. Then we create a dictionary for each game with the title, price, tags, and platforms as keys.
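
As a rough sketch, the output list could then be served through Flask like this (the route name and port are illustrative, and the snippet assumes the output list built above is available in the same module):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/new-releases')
def new_releases_endpoint():
    # Serve the list of game dictionaries built above as JSON.
    return jsonify(output)

if __name__ == '__main__':
    app.run(port=5000)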

Wrapping up:

I hope this article is understandable and you find the coding easy.

Enjoy reading!

 


How To Scrape LinkedIn Public Company Data

How To Scrape LinkedIn Public Company Data – Beginners Guide

Nowadays everybody is familiar with how big the LinkedIn community is. LinkedIn is one of the largest professional social networking sites in the world which holds a wealth of information about industry insights, data on professionals, and job data.

Now, the only way to get the entire data out of LinkedIn is through Web Scraping.

Why Scrape LinkedIn public data?

There are multiple reasons why one might want to scrape data from LinkedIn. The scraped data can be useful when you are working on a related project, or for hiring, where you can review candidates' profile data and select the people who fit the company best.

Scraping is less time-consuming and automates the process of collecting thousands of records into a single file, which makes the task easy.

Another benefit of scraping is job-search automation. Every job site has thousands of openings for different kinds of roles, which makes the search hectic for people looking for a job in their own field. Scraping can help them automate their search by applying filters and extracting all the relevant information onto a single page.

In this tutorial, we will be scraping the data from LinkedIn using Python.

Prerequisites:

In this tutorial, we will use basic Python programming as well as some python packages- LXML and requests.

But first, you need to install the following things:

  1. Python accessible here (https://www.python.org/downloads/)
  2. Python requests accessible here(http://docs.python-requests.org/en/master/user/install/)
  3. Python LXML( Study how to install it here: http://lxml.de/installation.html)

Once you are done installing these, we will write the Python code to extract the LinkedIn public data from company pages.

The code below will only run on Python 2, not Python 3, because it relies on sys.setdefaultencoding(), which is not available in Python 3.

import json

import re

from importlib import reload

import lxml.html

import requests

import sys

reload(sys)

sys.setdefaultencoding('cp1251')




HEADERS = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',

          'accept-encoding': 'gzip, deflate, sdch',

          'accept-language': 'en-US,en;q=0.8',

          'upgrade-insecure-requests': '1',

          'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}

file = open('company_data.json', 'w')

file.write('[')

file.close()

COUNT = 0




def increment():

   global COUNT

   COUNT = COUNT+1




def fetch_request(url):

   try:

       fetch_url = requests.get(url, headers=HEADERS)

   except:

       try:

           fetch_url = requests.get(url, headers=HEADERS)

       except:

           try:

               fetch_url = requests.get(url, headers=HEADERS)

           except:

               fetch_url = ''

   return fetch_url




def parse_company_urls(company_url):




   if company_url:

       if '/company/' in company_url:

           parse_company_data(company_url)

       else:

           parent_url = company_url

           fetch_company_url=fetch_request(company_url)

           if fetch_company_url:

               sel = lxml.html.fromstring(fetch_company_url.content)

               COMPANIES_XPATH = '//div[@class="section last"]/div/ul/li/a/@href'

               companies_urls = sel.xpath(COMPANIES_XPATH)

               if companies_urls:

                   if '/company/' in companies_urls[0]:

                       print('Parsing From Category ', parent_url)

                       print('-------------------------------------------------------------------------------------')

                   for company_url in companies_urls:

                       parse_company_urls(company_url)

           else:

               pass







def parse_company_data(company_data_url):




   if company_data_url:

       fetch_company_data = fetch_request(company_data_url)

       if fetch_company_data.status_code == 200:

           try:

               source = fetch_company_data.content.decode('utf-8')

               sel = lxml.html.fromstring(source)

               # CODE_XPATH = '//code[@id="stream-promo-top-bar-embed-id-content"]'

               # code_text = sel.xpath(CODE_XPATH).re(r'<!--(.*)-->')

               code_text = sel.get_element_by_id(

                   'stream-promo-top-bar-embed-id-content')

               if len(code_text) > 0:

                   code_text = str(code_text[0])

                   code_text = re.findall(r'<!--(.*)-->', str(code_text))

                   code_text = code_text[0].strip() if code_text else '{}'

                   json_data = json.loads(code_text)

                   if json_data.get('squareLogo', ''):

                       company_pic = 'https://media.licdn.com/mpr/mpr/shrink_200_200' + \

                                     json_data.get('squareLogo', '')

                   elif json_data.get('legacyLogo', ''):

                       company_pic = 'https://media.licdn.com/media' + \

                                     json_data.get('legacyLogo', '')

                   else:

                       company_pic = ''

                   company_name = json_data.get('companyName', '')

                   followers = str(json_data.get('followerCount', ''))




                   # CODE_XPATH = '//code[@id="stream-about-section-embed-id-content"]'

                   # code_text = sel.xpath(CODE_XPATH).re(r'<!--(.*)-->')

                   code_text = sel.get_element_by_id(

                       'stream-about-section-embed-id-content')

               if len(code_text) > 0:

                   code_text = str(code_text[0]).encode('utf-8')

                   code_text = re.findall(r'<!--(.*)-->', str(code_text))

                   code_text = code_text[0].strip() if code_text else '{}'

                   json_data = json.loads(code_text)

                   company_industry = json_data.get('industry', '')

                   item = {'company_name': str(company_name.encode('utf-8')),

                           'followers': str(followers),

                           'company_industry': str(company_industry.encode('utf-8')),

                           'logo_url': str(company_pic),

                           'url': str(company_data_url.encode('utf-8')), }

                   increment()

                   print(item)

                   file = open('company_data.json', 'a')

                   file.write(str(item)+',\n')

                   file.close()

           except:

               pass

       else:

           pass
fetch_company_dir = fetch_request('https://www.linkedin.com/directory/companies/')

if fetch_company_dir:

   print('Starting Company Url Scraping')

   print('-----------------------------')

   sel = lxml.html.fromstring(fetch_company_dir.content)

   SUB_PAGES_XPATH = '//div[@class="bucket-list-container"]/ol/li/a/@href'

   sub_pages = sel.xpath(SUB_PAGES_XPATH)

   print('Company Category URL list')

   print('--------------------------')

   print(sub_pages)

   if sub_pages:

       for sub_page in sub_pages:

           parse_company_urls(sub_page)

else:

   pass


Best 5 stock markets APIs in 2020

There are various stock market data sources available online, but among all of them it's hard to figure out which site you should visit or which one will be useful.

In this article, we will be discussing the 5 best stock market APIs.

What is Stock market data API?

Real-time or historical data on financial assets that are currently being traded in the markets are offered by stock market data APIs.

Prices of public stocks, ETFs, and ETNs are specially offered by them.

Data:

In the article, we will be more inclined towards the price information. We will be talking about the following APIs and how they are useful:

  1. Yahoo Finance
  2. Google Finance in Google sheets.
  3. IEX cloud
  4. AlphaVantage
  5. World trading data
  6. Other APIs( Polygon.io, intrinio, Quandl)

1. Yahoo Finance:

The API was shut down in 2017 but came back after 2019, and the amazing thing is that we can still use Yahoo Finance to get free stock data. It is employed by both individual and enterprise-level users.

It is free and reliable and provides access to more than 5 years of daily OHLC price data.

yfinance is a Python module that wraps the new Yahoo Finance API.

>pip install yfinance

The GitHub link is provided for the code, but I will also attach a short example below for your reference.
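
As a minimal illustrative sketch (the ticker and date range below are arbitrary examples, not from the original post):

import yfinance as yf

# Download daily OHLC price data for a single ticker.
data = yf.download("AAPL", start="2019-01-01", end="2019-12-31")
print(data.head())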

2. Google Finance in Google Sheets:

Google Finance was shut down in 2012, but some features remained. One of these lets you get stock market data and is known as GOOGLEFINANCE in Google Sheets.

All we have to do is type the below command and we will get the data.

 

GOOGLEFINANCE("GOOG", "price")

Furthermore, the syntax is:

GOOGLEFINANCE(ticker, [attribute], [start_date], [end_date|num_days], [interval])

  • ticker: the security to look up.
  • attribute: the value to fetch ("price" by default).
  • start_date: the date from which you want historical data.
  • end_date|num_days: the date until which you want data, or the number of days from start_date.
  • interval: the frequency of returned data, either "DAILY" or "WEEKLY".

3. IEX Cloud:

IEX Cloud is a new financial service, released just this year. It is an independent business separate from IEX Group's flagship stock exchange, and it is a high-performance financial data platform that connects developers and financial data creators.

It is very cheap compared to the others, and you can easily get all the data you want. It also provides a free trial.

You can easily check it out at :

 

Iexfinance

4. AlphaVantage:

You can refer to the website:

https://www.alphavantage.co/

It is one of the best and leading providers of free APIs. It provides access to data on stocks, FX, and cryptocurrencies.

AlphaVantage allows up to 5 API requests per minute and 500 API requests per day on the free tier.
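
As an illustration, daily price data can be fetched with a plain HTTP request (a sketch; the "demo" key only works for Alpha Vantage's example symbols, so replace it with your own API key):

import requests

url = "https://www.alphavantage.co/query"
params = {
    "function": "TIME_SERIES_DAILY",   # daily OHLC time series
    "symbol": "IBM",
    "apikey": "demo",                  # replace with your own API key
}
response = requests.get(url, params=params)
print(response.json())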

5. World Trading Data:

You can refer to the website for World Trading data:

https://www.worldtradingdata.com/

With this service, you can access the full intraday API and currency API. Plans range from $8 to $32 per month.

There are different types of plans available. On the free plan you can request up to 5 stocks per request and make 250 requests per day in total.

The response is in JSON format, and there is no Python module that wraps their API.

6. Other APIs:

Website: https://polygon.io

Polygon.io

It is only for the US stock market and is available at $199 per month. This is not a good choice for beginners.

Website: https://intrinio.com/

Intrinio

Real-time stock data is available at $75 per month. EOD price data is $40 per month, but you can get that for free on other platforms. So, it might not be a good choice for independent traders.

Website: https://www.quandl.com/

Quandl

It is a marketplace for financial, economic, and other related APIs. It aggregates APIs from third parties so that users can purchase whichever APIs they want to use.

Each API has its own pricing; some are free and others are paid.

Quandl also has an analysis tool built into the website, which is convenient.

It is a platform which will be most suitable if you can spend a lot of money.

Wrapping up:

I hope you find this tutorial useful and will refer to the websites given for stock market data.

Trading is a quite complex field and learning it is not so easy. You have to spend some time and practice understanding the stock market data and its uses.


Python – Ways to remove duplicates from list

A list is an important container, used in almost every piece of day-to-day programming as well as web development; the more it is used, the greater the need to master it, and hence knowledge of its operations is necessary. This article focuses on one such operation: getting a unique list from a list that may contain duplicates. Removing duplicates from a list has a large number of applications, so it is a good technique to know.

How to Remove Duplicates From a Python List

Method 1 : Naive method

In the naive method, we simply traverse the list and append the first occurrence of each element to a new list, ignoring all other occurrences of that particular element.

# Using Naive methods:

Using Naive method

Output :

Remove duplicates from list using Naive method output
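
The snippet itself was embedded as an image; a minimal sketch of the naive approach, with an example list, looks like this:

test_list = [1, 3, 5, 6, 3, 5, 6, 1]
result = []
for item in test_list:
    if item not in result:    # keep only the first occurrence
        result.append(item)
print(result)                 # [1, 3, 5, 6]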

Method 2 : Using list comprehension

List comprehensions are a Python construct used for creating new sequences (such as lists, tuples, etc.) from previously created sequences. This makes code more efficient and easier to understand. This method works like the one above, but it is a one-liner shorthand written with the help of a list comprehension.

# Using list comprehension

Remove duplicates from list using list comprehension
Output :

Remove duplicates from list using list comprehension output
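
Again the original snippet was an image; a sketch of the same idea as a one-liner:

test_list = [1, 3, 5, 6, 3, 5, 6, 1]
result = []
[result.append(item) for item in test_list if item not in result]
print(result)                 # [1, 3, 5, 6]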

Method 3 : Using set():

We can remove duplicates from a list using the built-in set(). A set always holds distinct elements, so converting the list to a set removes the duplicates. The main and notable drawback of this approach is that the ordering of the elements is lost.

# Using set()

Remove duplicates from list using set method
Output :

Remove duplicates from list using set method output
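
A sketch of the set()-based approach (note that the ordering of the result is not guaranteed):

test_list = [1, 5, 3, 6, 3, 5, 6, 1]
result = list(set(test_list))
print(result)                 # e.g. [1, 3, 5, 6] -- original order is lost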

Method 4 : Using list comprehension + enumerate():

enumerate() can also be used for removing duplicates when combined with a list comprehension. It basically looks for elements that have already occurred and skips adding them. It preserves the list ordering.

# Using list comprehension + enumerate()

Using list comprehension + enumerate()
Output :

Using list comprehension + enumerate() output
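
A sketch of the enumerate()-based version, which keeps the original order:

test_list = [1, 3, 5, 6, 3, 5, 6, 1]
result = [item for index, item in enumerate(test_list) if item not in test_list[:index]]
print(result)                 # [1, 3, 5, 6]
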
Method 5 : Using collections.OrderedDict.fromkeys():

This is the fastest method to achieve this particular task. It removes the duplicates and returns a dictionary whose keys have to be converted back to a list. It works well with strings too.

# Using collections.OrderedDict.fromkeys()

Using collections.OrderedDict.fromkeys()
Output :

Using collections.OrderedDict.fromkeys() output
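
A sketch of the OrderedDict.fromkeys() approach:

from collections import OrderedDict

test_list = [1, 3, 5, 6, 3, 5, 6, 1]
result = list(OrderedDict.fromkeys(test_list))
print(result)                 # [1, 3, 5, 6]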

Conclusion :

In conclusion, you now know several ways to remove duplicates from a list in Python. There are different approaches, but the collections.OrderedDict.fromkeys() method is the best in terms of efficiency.


How To Scrape Amazon Data Using Python Scrapy

How To Scrape Amazon Data Using Python Scrapy

Wouldn't it be good if all the information related to a product were placed in a single table? It would be really convenient if we could get all of that information in one place.

Since Amazon is a huge website containing millions of data points, scraping it is quite challenging. Amazon is a tough website for beginners to scrape, and people often get blocked by Amazon's anti-scraping technology.

In this blog, we aim to introduce Scrapy and show how to scrape the Amazon website with it.

What is Scrapy?

Scrapy is a free and open-source web-crawling framework for Python. It was originally designed for web scraping, extracting data using APIs, and general-purpose web crawling.

This framework is used in data mining, information processing, and historical archival. It is widely used across different industries and has proven very useful. It can scrape data not only from websites but also from web services, for example the Amazon API, Facebook API, and many more.

How to install Scrapy?

Firstly, there is some third-party software which needs to be installed in order to install the Scrapy module.

  • Python: As Scrapy has the base of the Python language, one has to install it first.
  • pip: pip is a python package manager tool which maintains a package repository and installs python libraries, and its dependencies automatically. It is better to install pip according to system OS, and then try to follow the standard way of installing Scrapy.

There are different ways in which we can download Scrapy globally as well as locally but the most standard way of downloading it is by using pip.

Run the below command to install Scrapy using pip:

pip install scrapy

How to get started with Scrapy?

Scrapy is an application framework and provides multiple commands to create applications and use them. But before anything else, we have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject new_project

This will create a directory:

Scrapy is an application framework which follows object oriented programming style for the definition of items and spiders for overall applications.

The project structure contains the following files:

  1. scrapy.cfg : This file lives in the root directory of the project and includes the project name along with the project settings.
  2. new_project : The application directory (it takes the project's name), with many different files which actually make running and scraping from the web URLs work.
  3. items.py : Items are containers that will be loaded with the scraped data; they work like simple Python dictionaries. Items provide additional protection against typos and populating undeclared fields.
  4. pipelines.py : After an item has been scraped by the spider, it is sent to the item pipeline, which processes it through several components. Each class has to implement a method called process_item for processing scraped items.
  5. settings.py : It allows customization of the behaviour of all Scrapy components, including the core, extensions, pipelines, and spiders themselves.
  6. spiders : A directory which contains all spiders/crawlers as Python classes.

Scrape Amazon Data: How to Scrape an Amazon Web Page

For a better understanding of how Scrapy works, we will scrape the product name, price, category, and availability from Amazon.com.

Let’s name this project amazon_pro. You can use the project name according to your choice.

Start by writing the below code:

scrapy startproject amazon_pro

The directory will be created in the local folder by the name mentioned above.

Now we need three things which will help in the scraping process.

  1. Update items.py with the fields we want to scrape, for example name, price, category, and availability.
  2. Create a new spider with all the necessary elements in it, like allowed domains, start_urls, and a parse method.
  3. Update the pipelines.py file for data processing.

Now, after creating the project, generate the spider with the following command in the terminal:

  1. scrapy genspider amazon amazon.com

Now, we need to define the name, URLs, and possible domains to scrape the data.

How to Scrape an Amazon Web Page 2

An item object is defined in the parse method and is filled with the required information using XPath queries on the response object. XPath is a search function used to find elements in the HTML tree structure. Lastly we yield the item object so that Scrapy can do further processing on it.
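
The spider itself is only shown as a screenshot; a simplified sketch of such a spider might look like the following (the start URL and XPath expressions are placeholders and would have to be adapted to Amazon's current markup):

import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/s?k=external+hard+drive']  # example start URL

    def parse(self, response):
        # Each search result block becomes one item.
        for product in response.xpath('//div[@data-component-type="s-search-result"]'):
            yield {
                'product_name': product.xpath('.//h2//text()').get(),
                'product_sale_price': product.xpath('.//span[@class="a-offscreen"]/text()').get(),
            }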

Next, after scraping data, scrapy calls Item pipelines to process them. These are called Pipeline classes and we can use these classes to store data in a file or database or in any other way. It is a default class like Items that scrapy generates for users.

How to Scrape an Amazon Web Page 3

Each pipeline class implements the process_item method, which is called for every item yielded by a spider. It takes the item and the spider class as arguments and returns a dict object. For this example, we just return the item dict as it is.
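
A minimal pass-through pipeline matching that description could look like this (the class name is an assumption based on the project name used here):

class AmazonProPipeline:
    def process_item(self, item, spider):
        # No transformation here; just return the item dict unchanged.
        return item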

Now, we have to enable ITEM_PIPELINES in the settings.py file.

How to Scrape an Amazon Web Page 4
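
Enabling the pipeline is a one-line change in settings.py (the module path assumes the project and pipeline names used in this sketch):

ITEM_PIPELINES = {
    'amazon_pro.pipelines.AmazonProPipeline': 300,   # lower numbers run earlier
}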

Now, after completing the entire code, we need to scrape the item by sending requests and accepting response objects.

We will call a spider by its unique name and Scrapy will easily find it.

scrapy crawl amazon

Now, after the items have been scraped, we can save them to different formats using their extensions, for example .json, .csv, and many more.

scrapy crawl amazon -o data.csv

The above command will save the scraped data in CSV format in the data.csv file.

Here is the output of the above code:

[ 
{"product_category": "Electronics,Computers & Accessories,Data Storage,External Hard Drives", "product_sale_price": "$949.95", "product_name": "G-Technology G-SPEED eS PRO High-Performance Fail-Safe RAID Solution for HD/2K Production 8TB (0G01873)", "product_availability": "Only 1 left in stock."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,USB Flash Drives", "product_sale_price": "", "product_name": "G-Technology G-RAID with Removable Drives High-Performance Storage System 4TB (Gen7) (0G03240)", "product_availability": "Available from these sellers."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,USB Flash Drives", "product_sale_price": "$549.95", "product_name": "G-Technology G-RAID USB Removable Dual Drive Storage System 8TB (0G04069)", "product_availability": "Only 1 left in stock."},
{"product_category": "Electronics,Computers & Accessories,Data Storage,External Hard Drives", "product_sale_price": "$89.95", "product_name": "G-Technology G-DRIVE ev USB 3.0 Hard Drive 500GB (0G02727)", "product_availability": "Only 1 left in stock."}
]

We have successfully scraped the data from Amazon.com using scrapy.


How to Scrape Wikipedia Articles with Python

How to Scrape Wikipedia Articles with Python

We are going to make a scraper which will scrape a Wikipedia page.

The scraper will open the Wikipedia page and then follow a random link.

I guess it will be fun looking at which pages the scraper ends up visiting.

Setting up the scraper:

Here, I will be using Google Colaboratory, but you can use Pycharm or any other platform you want to do your python coding.

I will be making a Colab notebook named Wikipedia. If you use any other Python platform, then you need to create a .py file and give it any name you like.

To make the HTTP requests, we will be installing the requests module available in python.

pip install requests

 We will be using a wiki page for the starting point.

import requests

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")

print(response.status_code)

 When we run the above command, it will show 200 as a status code.

How to Scrape Wikipedia Articles with Python 1

Okay!!! Now we are ready to step on the next thing!!

Extracting the data from the page:

We will be using Beautiful Soup to make our task easier. The initial step is to install it:

pip install beautifulsoup4

Beautiful soup allows you to find an element by the ID tag.

title = soup.find(id="firstHeading")

 Bringing everything together, our code will look like:

How to Scrape Wikipedia Articles with Python 3
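
The combined code is only shown as a screenshot; a small sketch of what it amounts to:

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, "html.parser")

title = soup.find(id="firstHeading")
print(title.text)    # prints the article title, "Web scraping"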

As we can see, when the program is run, the output is the title of the wiki article i.e Web Scraping.

 Scraping other links:

Other than scraping the title of the article, now we will be focusing on the rest of the things we want.

We will grab an <a> tag pointing to another Wikipedia article and scrape that page.

To do this, we will scrape all the <a> tags within the article and then shuffle them.

Do not forget to import the random module.

How to Scrape Wikipedia Articles with Python 3
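
Since that screenshot is not reproduced here, this is a sketch of the step, assuming the article body sits in the element with id "bodyContent":

import random

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://en.wikipedia.org/wiki/Web_scraping")
soup = BeautifulSoup(response.content, "html.parser")

# Collect all links inside the article body and pick one at random.
all_links = soup.find(id="bodyContent").find_all("a")
random.shuffle(all_links)
for link in all_links:
    # Only follow internal links that lead to other Wikipedia articles.
    if link.get("href", "").startswith("/wiki/"):
        print(link["href"])
        break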

You can see, the link is directed to some other wikipedia article page named as IP address.

Creating an endless scraper:

Now, we have to make the scraper scrape the new links.

To do this, we have to move everything into a scrapeWikiArticle function.

How to Scrape Wikipedia Articles with Python 4

The scrapeWikiArticle function will extract the links and the title. Then it will call itself again, creating an endless cycle of scrapers bouncing around Wikipedia.
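
A sketch of that function, folding the previous steps together (again assuming the "bodyContent" container):

import random

import requests
from bs4 import BeautifulSoup

def scrapeWikiArticle(url):
    response = requests.get(url=url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Print the title of the current article.
    title = soup.find(id="firstHeading")
    print(title.text)

    # Pick a random internal link and follow it, forever.
    all_links = soup.find(id="bodyContent").find_all("a")
    random.shuffle(all_links)
    for link in all_links:
        if link.get("href", "").startswith("/wiki/"):
            scrapeWikiArticle("https://en.wikipedia.org" + link["href"])
            break

scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")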

After running the program, we got:

How to Scrape Wikipedia Articles with Python 5

Wonderful! In only a few steps, we went from “Web scraping” to “Wikipedia articles with NLK identifiers”.

Conclusion:

We hope that this article is useful to you and that you learned how to extract random Wikipedia pages. The scraper wanders around Wikipedia by following random links.


How to Code a Scraping Bot with Selenium and Python

How to Code a Scraping Bot with Selenium and Python

Selenium is a powerful tool for controlling web browsers through programs and performing browser automation. Selenium is also used in Python for scraping data. It is useful as well for interacting with a page before collecting the data, which is the case we will discuss in this article.

In this article, we will be scraping the investing.com to extract the historical data of dollar exchange rates against one or more currencies.

There are other tools in python by which we can extract the financial information. However, here we want to explore how selenium helps with data extraction.

The Website we are going to Scrape:

Understanding of the website is the initial step before moving on to further things.

The website contains historical data for the exchange rate of the dollar against the euro.

On this page, we will find a table for which we can set the date range we want.

That is the thing which we will be using.

We only want each currency's exchange rate against the dollar. If that's not what you need, replace the “usd” in the URL.

The Scraper’s Code:

The initial step is the imports: from selenium, the sleep function to pause the code for some time, and pandas to manipulate the data whenever necessary.

How to Code a Scraping Bot with Selenium and Python
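
The import block is shown as an image in the original; it corresponds roughly to the following (these are the imports the full code later in the article relies on):

from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait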

Now, we will write the scraping function. The function will take:

  • A list of currency codes.
  • A start date.
  • An End date.
  • A boolean flag to export the data to a .csv file; we will use False as the default.

We want the scraper to handle multiple currencies, so we also initialise an empty list to store the scraped data.

How to Code a Scraping Bot with Selenium and Python 1
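
The screenshot boils down to a function signature along these lines (a sketch; the parameter and variable names are assumptions):

def currency_exchange_rates(currencies, start, end, export_csv=False):
    """Scrape historical USD exchange rates for every currency code in `currencies`."""
    frames = []   # one DataFrame per currency will be collected here
    # ... the for-loop shown later in this article goes here ...
    return frames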

As we can see, the function takes a list of currencies, and our plan is to iterate over this list and get the data.

For each currency we will create a URL, instantiate the driver object, and we will get the page by using it.

Then the window will be maximized, but it will only be visible if we keep option.headless as False.

Otherwise, Selenium will do all the work without showing you the browser.

How to Code a Scraping Bot with Selenium and Python 2

Now, we want to get the data for any time period.

Selenium provides some awesome functionalities for getting connected to the website.

We will click on the date and fill the start date and end dates with the dates we want and then we will hit apply.

We will use WebDriverWait, ExpectedConditions, and By to make sure that the driver will wait for the elements we want to interact with.

The waiting time is 20 seconds, but it is up to you to set it however you want.

We have to select the date button using its XPath.

The same process will be followed by the start_bar, end_bar, and apply_button.

The start_date field will take in the date from which we want the data.

End_bar will select the date till which we want the data.

When we are done with this, the apply_button comes into play.

How to Code a Scraping Bot with Selenium and Python 3

Now, we will use the pandas.read_html file to get all the content of the page. The source code of the page will be revealed and then finally we will quit the driver.

How to Code a Scraping Bot with Selenium and Python 4

How to handle Exceptions In Selenium:

The data collection process is done. But Selenium is sometimes a little unstable and may fail to perform the actions we are performing here.

To handle this, we put the code in a try/except block so that whenever it runs into a problem the except block is executed.

So, the code will be like:

for currency in currencies:

        while True:

            try:

                # Opening the connection and grabbing the page

                my_url = f'https://br.investing.com/currencies/usd-{currency.lower()}-historical-data'

                option = Options()

                option.headless = False

                driver = webdriver.Chrome(options=option)

                driver.get(my_url)

                driver.maximize_window()

                  

                # Clicking on the date button

                date_button = WebDriverWait(driver, 20).until(

                            EC.element_to_be_clickable((By.XPATH,

                            "/html/body/div[5]/section/div[8]/div[3]/div/div[2]/span")))

               

                date_button.click()

               

                # Sending the start date

                start_bar = WebDriverWait(driver, 20).until(

                            EC.element_to_be_clickable((By.XPATH,

                            "/html/body/div[7]/div[1]/input[1]")))

                           

                start_bar.clear()

                start_bar.send_keys(start)




                # Sending the end date

                end_bar = WebDriverWait(driver, 20).until(

                            EC.element_to_be_clickable((By.XPATH,

                            "/html/body/div[7]/div[1]/input[2]")))

                           

                end_bar.clear()

                end_bar.send_keys(end)

              

                # Clicking on the apply button

                apply_button = WebDriverWait(driver,20).until(

                      EC.element_to_be_clickable((By.XPATH,

                      "/html/body/div[7]/div[5]/a")))

               

                apply_button.click()

                sleep(5)

               

                # Getting the tables on the page and quiting

                dataframes = pd.read_html(driver.page_source)

                driver.quit()

                print(f'{currency} scraped.')

                break

           

            except:

                driver.quit()

                print(f'Failed to scrape {currency}. Trying again in 30 seconds.')

                sleep(30)

                continue

For each DataFrame in the dataframes list, we check if its name matches the table we want; if it does, we append it to the list we created at the beginning.

Then we need to export a .csv file. This is the last step, after which the extraction is complete.

How to Code a Scraping Bot with Selenium and Python 5
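
A sketch of that final step (it assumes the dataframes, frames, currency, and export_csv names used in the surrounding code, and uses the first column name as a simple way to spot the historical-data table):

# Keep the table whose first column is "Date" -- that is the historical-data table.
for dataframe in dataframes:
    if dataframe.columns[0] == 'Date':
        frames.append(dataframe)
        # Optionally write each currency's table to its own .csv file.
        if export_csv:
            dataframe.to_csv(f'{currency}_usd_historical.csv', index=False)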

Wrapping up:

This is all about extracting data from the website. So far this code gets the historical exchange rates of a list of currencies against the dollar and returns a list of DataFrames and several .csv files.

https://www.investing.com/currencies/usd-eur-historical-data


Python Interview Questions on Array Sequence

We have compiled most frequently asked Python Interview Questions which will help you with different expertise levels.

Python Interview Questions on Array Sequence

Low Level Arrays

Let’s try to understand how information is stored in low-level computer architecture in order to understand how array sequences work. Here is a small brush-up on computer science basics on memory units.

We know that the smallest unit of data in a computer is a bit (0 or 1). 8 bits together make a byte. A byte has 8 binary digits. Characters such as alphabets, numbers, or symbols are stored in bytes. A computer system memory has a huge number of bytes and tracking of how information is stored in these bytes is done with the help of a memory address. Every byte has a unique memory address which makes tracking information easier.

The following diagram depicts a lower-level computer memory. It shows a small section of memory of individual bytes with consecutive addresses.

Python Interview Questions on Array Sequence chapter 9 img 1

Computer system hardware is designed in such a manner that the main memory can easily access any byte in the system. The primary memory is located in the CPU itself and is known as RAM. Any byte irrespective of the address can be accessed easily. Every byte in memory can be stored or retrieved in constant time, hence its Big-O notation for the time complexity would be O(1).

There is a link between an identifier of a value and the memory address where it is stored, the programming language keeps a track of this association. So, a variable student_name may store name details for a student and class_ teacher would store the name of a class teacher. While programming, it is often required to keep a track of all related objects.

So, if you want to keep a track of scores in various subjects for a student, then it is a wise idea to group these values under one single name, assign each value an index and use the index to retrieve the desired value. This can be done with the help of Arrays. An array is nothing but a contiguous block of memory.

Python internally stores every Unicode character in 2 bytes. So, if you want to store a 5-letter word (let’s say ‘state’) in python, this is how it would get stored in memory:

Python Interview Questions on Array Sequence chapter 9 img 2

Since each Unicode character occupies 2 bytes, the word STATE is stored in 10 consecutive bytes in the memory. So, this is a case of an array of 5 characters. Every location of an array is referred to as a cell. Every array element is numbered and its position is called index. In other words, the index describes the location of an element.


Every cell of an array must utilize the same number of bytes.

The actual address of the 1st element of the array is called the Base Address. Let’s say the name of the array mentioned above is my_array[ ]. The Base address of my_array[ ] is 1826. If we have this information it is easy to calculate the address of any element in the array.

Address of my_array[index] = Base Address + (storage size in bytes of one element) * index

Therefore, Address of my_array[3] = 1826 + 2*3 = 1826 + 6 = 182C (the addresses here are hexadecimal).

After that slight look at how things work at a lower level, let's get back to the higher level of programming, where the programmer is only concerned with the elements and indexes of the array.

Referential Array

We know that in an array, every cell must occupy the same number of bytes. Suppose we have to save string values for a food menu. The names can be of different lengths. We could reserve enough space for the longest possible name we can think of, but that does not seem wise: a lot of space would be wasted, and there might still be a name longer than what we have catered for. A smarter solution in this case is to use an array of object references.

Python Interview Questions on Array Sequence chapter 9 img 3

In this case, every element of the array is actually a reference to an object. The benefit of this is that every object which is of string value can be of different length but the addresses will occupy the same number of cells. This helps in maintaining the constant time factor of order O(1).

In Python, lists are referential in nature. They store pointers to addresses in the memory. Every memory address requires a space of 64-bits which is fixed.

Question 1.
You have a list integer list having integer values. What happens when you give the following command?
integer_list[1] += 7
Answer:
In this case, the integer object referenced at index 1 does not change; rather, index 1 ends up referring to a different location in memory that stores the new value, i.e. the sum integer_list[1] + 7.

Question 2.
State whether True or False:
A single list instance may include multiple references to the same object as elements of the list.
Answer:
True

Question 3.
Can a single object be an element of two or more lists?
Answer:
Yes

Question 4.
What happens when you compute a slice of the list?
Answer:
When you compute a slice of the list, a new list instance is created. This new list actually contains references to the same elements that were in the parent list.
For example:

>>> my_list = [1, 2,8,9, "cat", "bat", 18]
>>> slice_list = my_list[2:6]
>>> slice_list
[8, 9, 'cat' , 'bat' ]
>>>

This is shown in the following diagram:

Python Interview Questions on Array Sequence chapter 9 img 4

Question 5.
Suppose we change the value of the element at index 1 of the slice just to 18 (preceding diagram). How will you represent this in a diagram?
Answer:
When we say slice_list[1] = 18, we are changing the reference that earlier pointed to 9 so that it now points to the value 18. The actual integer object is not changed; only the reference is shifted from one object to the other.

Python Interview Questions on Array Sequence chapter 9 img 5

Deep Copy and Shallow Copy in Python

Python has a module named “copy” that allows deep copy or shallow copy mutable objects.

Assignment statements can be used to create a binding between a target and an object, however, they cannot be used for copying purposes.

Deep Copy

The copy module of python defines a deepcopy( ) function which allows the object to be copied into another object. Any changes made to the new object will not be reflected in the original object.

In the case of a shallow copy, references to nested objects are copied into the new object, as a result of which changes made to those nested objects through the copy are reflected in the original. This is shown in the following code:

Python Interview Questions on Array Sequence chapter 9 img 6

Python Interview Questions on Array Sequence chapter 9 img 7
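
The two screenshots above demonstrate the difference; a short equivalent example:

import copy

original = [[1, 2], [3, 4]]

shallow = copy.copy(original)      # copies only the outer list; the inner lists are shared
deep = copy.deepcopy(original)     # recursively copies the nested lists as well

original[0][0] = 99
print(shallow[0][0])               # 99 -- the change is visible through the shallow copy
print(deep[0][0])                  # 1  -- the deep copy is unaffected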

It is important to note here that shallow and deep copying functions should be used when dealing with objects that contain other objects (lists or class instances), A shallow copy will create a compound object and insert into it, the references the way they exist in the original object. A deep copy on the other hand creates a new compound and recursively inserts copies of the objects the way they exist in the original list.

Question 6.
What would be the result of the following statement: my_list = [7] * 10?
Answer:
It would create a list named my_list as follows:
[7, 7, 7, 7, 7, 7, 7, 7, 7, 7]

Python Interview Questions on Array Sequence chapter 9 img 8

All 10 cells of the list my_list refer to the same element, which in this case is 7.

Question 7.
Have a look at the following piece of code:

>>> a = [1, 2, 3, 4, 5, 6, 7]
>>> b = a
>>> b[0] = 8
>>> b
[8, 2, 3, 4, 5, 6, 7]
>>> a
[8, 2, 3, 4, 5, 6, 7]
>>>

Here, we have only used the assignment operator, yet changes made to ‘b’ are reflected in ‘a’. Why?
Answer:
When you use the assignment operator you are just establishing a binding between the target and the object; you are merely setting another reference to the same list. There are two solutions to this:

(I)

>>> a
[8, 2, 3, 4, 5, 6, 7]
>>> a = [1, 2, 3, 4, 5, 6, 7]
>>> b = a[:]
>>> b[0] = 9
>>> b
[9, 2, 3, 4, 5, 6, 7]
>>> a
[1, 2, 3, 4, 5, 6, 7]
>>>

(II)

>>> a = [1, 2, 3, 4, 5, 6, 7]
>>> b = list(a)
>>> b
[1, 2, 3, 4, 5, 6, 7]
>>> b[0] = 9
>>> b
[9, 2, 3, 4, 5, 6, 7]
>>> a
[1, 2, 3, 4, 5, 6, 7]
>>>

Question 8.
Take a look at the following piece of code:

>>> import copy
>>> a = [1, 2, 3, 4, 5, 6]
>>> b = copy.copy(a)
>>> b
[1, 2, 3, 4, 5, 6]
>>> b[2] = 9
>>> b
[1, 2, 9, 4, 5, 6]
>>> a
[1, 2, 3, 4, 5, 6]
>>>

‘b’ is a shallow copy of ‘a’ however when we make changes to ‘b’ it is not reflected in ‘a’, why? How can this be resolved?
Answer:
List ‘a’ is a mutable object (a list) that consists of immutable objects (integers). A shallow copy only shares nested mutable objects, so reassigning a top-level element of ‘b’ does not affect ‘a’. You can use b = a if you want changes in ‘b’ to be reflected in ‘a’.

Question 9.
Look at the following code:

>>> my_list = [["apples", "banana"], ["Rose", "Lotus"], ["Rice", "Wheat"]]
>>> copy_list = list (my_list)
>>> copy_list[2][0]="cereals"

What would happen to the content of my_list? Does it change or remains the same?
Answer:
Content of my_list will change:

>>> my_list
[['apples', 'banana'], ['Rose', 'Lotus'], ['cereals', 'Wheat']]
>>>

Question 10.
Look at the following code:

>>> my_list = [ ["apples", "banana"], ["Rose", "Lotus"], ["Rice", "Wheat"]]
>>> copy_list = my_list.copy( )
>>> copy_list[2][0]="cereals"

What would happen to the content of my_list? Does it change or remains the same?
Answer:
Content of my_list would change:

>>> my_list
[['apples', 'banana'], ['Rose', 'Lotus'], ['cereals', 'Wheat']]
>>>

Question 11.
When the base address of immutable objects is copied it is called_________________.
Answer:
Shallow copy

Question 12.
What happens when a nested list undergoes deep copy?
Answer:
When we create a deep copy of an object, copies of nested objects in the original object are recursively added to the new object. Thus, a deep copy will create a completely independent copy of not only of the object but also of its nested objects.

Question 13.
What happens when a nested list undergoes shallow copy?
Answer:
A shallow copy just copies references to the nested objects; copies of the nested objects themselves are not created.

Dynamic Arrays

As the name suggests dynamic array is a contiguous block of memory that grows dynamically when new data is inserted. It has the ability to adjust its size automatically as and when there is a requirement, as a result of which, we need not specify the size of the array at the time of allocation and later we can use it to store as many elements as we want.

When a new element is inserted into the array, if there is space then the element is added at the end; otherwise a new array is created that is double the size of the current array, the elements are moved from the old array to the new one, and the old array is deleted to free its memory. The new element is then added at the end of the expanded array.

Let's try to execute a small piece of code. This example is executed on a 32-bit machine architecture; the result can differ on 64-bit, but the logic remains the same.
On a 32-bit system, 32 bits (i.e. 4 bytes) are used to hold a memory address. So, now let's try to understand how this works:
When we create a blank list structure, it occupies 36 bytes. Look at the following code:

import sys
my_dynamic_list =[ ]
print("length = ",len(my_dynamic_list) , ".", "size in bytes = ", sys.getsizeof(my_dynamic_list),".")

Here we have imported the sys module so that we can make use of getsizeof( ) function to find the size, the list occupies in the memory.
The output is as follows:
Length = 0.
Size in bytes = 36 .
Now, suppose we have a list of only one element, let’s see how much size it occupies in the memory.

import sys
my_dynamic_list =[1]
print("length = ",len(my_dynamic_list),".", "size in bytes = ", sys.getsizeof(my_dynamic_list),".")
length = 1 . size in bytes = 40 .

These 36 bytes are just the overhead of the list data structure on a 32-bit architecture.
If the list has one element, it contains one reference into memory, and on a 32-bit system an address occupies 4 bytes. Therefore, the size of a list with one element is 36 + 4 = 40 bytes.
Now, let's see what happens when we append to an empty list.

import sys
my_dynamic_list = []
value = 0
for i in range(20):
    print("i = ", i, ".", "Length of my_dynamic_list = ", len(my_dynamic_list), ".", "size in bytes = ", sys.getsizeof(my_dynamic_list), ".")
    my_dynamic_list.append(value)
    value += 1

Output
i = 0 .   Length of my_dynamic_list = 0 .   size in bytes = 36
i = 1 .   Length of my_dynamic_list = 1 .   size in bytes = 52
i = 2 .   Length of my_dynamic_list = 2 .   size in bytes = 52
i = 3 .   Length of my_dynamic_list = 3 .   size in bytes = 52
i = 4 .   Length of my_dynamic_list = 4 .   size in bytes = 52
i = 5 .   Length of my_dynamic_list = 5 .   size in bytes = 68
i = 6 .   Length of my_dynamic_list = 6 .   size in bytes = 68
i = 7 .   Length of my_dynamic_list = 7 .   size in bytes = 68
i = 8 .   Length of my_dynamic_list = 8 .   size in bytes = 68
i = 9 .   Length of my_dynamic_list = 9 .   size in bytes = 100
i = 10 .  Length of my_dynamic_list = 10 .  size in bytes = 100
i = 11 .  Length of my_dynamic_list = 11 .  size in bytes = 100
i = 12 .  Length of my_dynamic_list = 12 .  size in bytes = 100
i = 13 .  Length of my_dynamic_list = 13 .  size in bytes = 100
i = 14 .  Length of my_dynamic_list = 14 .  size in bytes = 100
i = 15 .  Length of my_dynamic_list = 15 .  size in bytes = 100
i = 16 .  Length of my_dynamic_list = 16 .  size in bytes = 100
i = 17 .  Length of my_dynamic_list = 17 .  size in bytes = 136
i = 18 .  Length of my_dynamic_list = 18 .  size in bytes = 136
i = 19 .  Length of my_dynamic_list = 19 .  size in bytes = 136

Now, let's have a look at how things worked here:

When you call the append() function on a list, resizing takes place as per the list_resize() function defined in Python's Objects/listobject.c file. The job of this function is to allocate cells proportional to the list size, thereby making space for additional growth.
The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, …

Amortization

Let’s suppose that there is a man called Andrew, who wants to start his own car repair shop and has a small garage. His business starts and he gets his first customer however, the garage has space only to keep one car. So, he can only have one car in his garage.

Python Interview Questions on Array Sequence chapter 9 img 9

Seeing a car in his garage, another person wants to give his car to him. Andrew, for the ease of his business, wants to keep all cars in one place. So, in order to keep two cars, he must look for space to keep two cars, move the old car to the new space and also move the new car to the new space and see how it works.

So, basically, he has to:

  1. Buy new space
  2. Sell the old space

Let’s say, this process takes one unit of time. Now, he also has to:

  1. Move the old car to the new location
  2. Move the new car to the new location
Moving each car takes one unit of time.

Python Interview Questions on Array Sequence chapter 9 img 10

Andrew is new in the business. He does not know how his business would expand, also what is the right size for the garage. So, he comes up with the idea that if there is space in his garage then he would simply add the new car to space, and when the space is full he will get a new space twice as big as the present one and then move all cars there and get rid of old space. So, the moment he gets his new car, it’s time to buy a new space twice the old space and get rid of the old space.

Python Interview Questions on Array Sequence chapter 9 img 11

Now, when the fourth car arrives Andrew need not worry. He has space for the car.
Python Interview Questions on Array Sequence chapter 9 img 12
And now again when he gets the fifth car he will have to buy a space that is double the size of the space that he currently has and sell the old space.

So, let's now take a look at the time complexity. Let's analyze how long it takes to add a car to Andrew's garage when there are already n cars in it.

Here is what we have to see:

1. If there is space available, Andrew just has to move one car into a new space and that takes only one unit of time. This action is independent of the n (number of cars in the garage). Moving a car in a garage that has space is constant time i.e. O(1).

2. When there is no spot left and a new car arrives, Andrew has to do the following:

  • Buy a new space that takes 1 unit of time.
  • Move all the cars into the new space one by one and then move the new car into the free spot. Moving each car takes one unit of time.

So, if there were n cars already in the old garage, plus the one new car, it takes n + 1 units of time to move the cars.
So, the total time taken in step two is 1 + n + 1, and in Big O notation this is O(n), as the constants do not matter.

At first glance this may look like too much work, but every business plan should be analyzed correctly. Andrew buys a new garage only when the space he has is completely filled up. On spreading the cost out over a period of time, one realizes that an addition takes a good amount of time only when the space is full; in a scenario where there is free space, adding a car does not take much time.

Now, keeping this example in mind, let us try to understand the amortization of dynamic arrays. We have already learned that in a dynamic array, when the array is full and a new value has to be added, the contents of the array are moved to a new array that is double in size, and the space occupied by the old array is then released.

 


It may seem like the task of replacing the old array with a new one is likely to slow down the system. When the array is full, appending a new element may require O(n) time. However, once the new array has been created, we can add new elements to it in constant O(1) time until it has to be replaced again. We will now see, with the help of amortized analysis, that this strategy is actually quite efficient.

When there are two elements and append is called, the array has to double in size; the same happens after the 4th and the 8th element. So, at sizes 2, 4, 8, 16, … append will be O(n), and for all other cases it will be O(1).
The steps involved in a resize are as follows (a minimal code sketch follows the list):

  1. When the array is full, allocate memory for the new array and the size of the new array should typically be twice the size of the old array.
  2. All contents from the old array should be copied to the new array.
  3. The space occupied by the old array should be released.
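
These three steps can be made concrete with a minimal sketch of a dynamic array. This is only an illustration: the class name DynamicArray, the helper _make_array(), and the use of ctypes are our own choices here, not CPython’s actual C implementation.

import ctypes

class DynamicArray:
    """Minimal dynamic array that doubles its capacity whenever it is full."""

    def __init__(self):
        self._n = 0                                 # number of elements currently stored
        self._capacity = 1                          # size of the underlying low-level array
        self._array = self._make_array(self._capacity)

    def __len__(self):
        return self._n

    def __getitem__(self, k):
        if not 0 <= k < self._n:
            raise IndexError('index out of range')
        return self._array[k]

    def append(self, value):
        if self._n == self._capacity:               # step 1: array is full, allocate twice the space
            self._resize(2 * self._capacity)
        self._array[self._n] = value
        self._n += 1

    def _resize(self, new_capacity):
        new_array = self._make_array(new_capacity)
        for k in range(self._n):                    # step 2: copy all contents to the new array
            new_array[k] = self._array[k]
        self._array = new_array                     # step 3: the old array is released (garbage collected)
        self._capacity = new_capacity

    def _make_array(self, capacity):
        # Allocate a raw array of Python object references with the given capacity
        return (capacity * ctypes.py_object)()

Appending to this class behaves like list.append: most calls run in O(1), and only the calls that trigger _resize() pay the O(n) copying cost.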

 


The analysis would be as follows: over n append operations, the copying work done during resizes is 1 + 2 + 4 + … + n, which adds up to less than 2n. Spreading this cost across the n appends gives an amortized cost of O(1) per append, even though an individual append that triggers a resize costs O(n).
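
A tiny simulation can confirm this (illustrative only; the function name copies_for_n_appends is our own). It counts how many element copies a capacity-doubling strategy performs over n appends, and the total always stays below 2n.

# Count the element copies made by a capacity-doubling strategy over n appends
def copies_for_n_appends(n):
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:       # array is full: double the capacity and copy every element
            copies += size
            capacity *= 2
        size += 1                  # place the new element itself
    return copies

for n in (10, 100, 1000, 10000):
    print(n, copies_for_n_appends(n), 2 * n)   # total copies is always below 2*n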


Python hashlib.sha3_256() Function

Python hashlib Module:

To generate a message digest or secure hash from the source message, we can utilize the Python hashlib library.

The hashlib module is required to generate a secure hash message in Python.

The hashlib hashing function in Python takes a variable length of bytes and converts it to a fixed-length sequence. This function only works in one direction. This means that when you hash a message, you obtain a fixed-length sequence. However, those fixed-length sequences do not allow you to obtain the original message.

A hash algorithm is considered better in cryptography if the original message cannot be recovered from the hash value. Changing even one byte in the original message also has a big impact (change) on the message digest value.

Python secure hash values are used to store passwords in hashed form, so that even the application’s owner does not have access to the user’s password. When the user enters the password again, the hash value is calculated and compared to the stored value.
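
As a rough sketch of that idea (the helper names hash_password and verify_password are hypothetical, and a real application should prefer a salted, slow key-derivation function such as hashlib.pbkdf2_hmac over a plain hash):

import hashlib

def hash_password(password):
    # Store only this digest, never the plain-text password itself
    return hashlib.sha3_256(password.encode('utf-8')).hexdigest()

def verify_password(password, stored_digest):
    # Re-hash the entered password and compare it with the stored digest
    return hash_password(password) == stored_digest

stored = hash_password("my-secret")
print(verify_password("my-secret", stored))    # True
print(verify_password("wrong-guess", stored))  # False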

Hashing Algorithms That Are Available:

  • The algorithms_available function returns a list of all the algorithms available in the system, including those accessible via OpenSSL. Duplicate algorithm names can also be found.
  • The algorithms_guaranteed function lists the algorithms that are guaranteed to be supported by the module on all platforms.
import hashlib
# Printing list of all the algorithms
print(hashlib.algorithms_available)
# Viewing algorithms
print(hashlib.algorithms_guaranteed)

Output:

{'sha384', 'blake2s', 'sha3_384', 'sha224', 'md5', 'shake_256', 'blake2b', 'sha3_512', 'sha1', 'shake_128', 'sha512', 'sha3_256', 'sha256', 'sha3_224'}
{'sha384', 'blake2s', 'sha3_384', 'sha224', 'md5', 'shake_256', 'blake2b', 'sha3_512', 'sha1', 'shake_128', 'sha512', 'sha3_256', 'sha256', 'sha3_224'}

Functions:

You only need to know a few functions to use the Python hashlib module.

  • You can hash the entire message at once by using the hashlib.<algorithm_name>(b"message") constructor, for example hashlib.sha3_256(b"message").
  • Additionally, the update() function can be used to feed the byte message into the hash object piece by piece. The output will be the same in both cases (see the sketch after this list). Finally, the secure hash can be obtained by using the digest() function.
  • It’s worth noting that b is written to the left of the message to be hashed. This b indicates that the string is a byte string.
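
A small sketch (using sha3_256 as the algorithm; the message is our own example) showing that hashing a message in one call and feeding it in pieces with update() produce the same digest:

import hashlib

# Hash the whole message at once
digest_one_shot = hashlib.sha3_256(b"hello world").digest()

# Feed the same message in pieces using update()
obj = hashlib.sha3_256()
obj.update(b"hello ")
obj.update(b"world")
digest_incremental = obj.digest()

print(digest_one_shot == digest_incremental)  # True
print(obj.hexdigest())                        # the same digest as a readable hex string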

hashlib.sha3_256() Function:

We can compute the SHA3-256 hash of a message given as a byte string using the hashlib.sha3_256() function. Passwords and important files can be hashed with the hashlib.sha3_256() method to help secure them.

Syntax:

hashlib.sha3_256()

Return Value:

The sha3_256() function returns a hash object; the hash code for the given string is obtained by calling digest() on it.

Differences

Shortly after the discovery of cost-effective brute force operations against SHA-1, SHA-2 was created. It is a family of two similar hash algorithms, SHA-256 and SHA-512, with varying block sizes.

  • The fundamental distinction between SHA-256 and SHA-512 is word size.
  • SHA-256 uses 32-bit words, whereas SHA-512 employs 64-bit words.
  • Each standard also has modified versions called SHA-224, SHA-384, SHA-512/224, and SHA-512/256. Today, the most often used SHA function is SHA-256, which provides adequate safety at current computer processing capabilities.
  • SHA-384 is a cryptographic hash that belongs to the SHA-2 family. It generates a 384-bit digest of a message.
  • On 64-bit processors, SHA-384 is around 50% faster than SHA-224 and SHA-256, despite having a longer digest. The increased speed is due to the internal computation using 64-bit words, whereas the other two hash algorithms use 32-bit words.
  • For the same reason, SHA-512, SHA-512/224, and SHA-512/256 are faster on 64-bit processors.

Algorithm – digest size (the larger the better; a quick check with hashlib follows the list):

MD5 –> 128 bits
SHA-1 –> 160 bits
SHA-256 –> 256 bits
SHA-512 –> 512 bits
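
These digest sizes can be checked quickly with hashlib itself (a small sketch; digest_size is reported in bytes, so we multiply by 8 to get bits):

import hashlib

for name in ("md5", "sha1", "sha256", "sha512"):
    print(name, "->", hashlib.new(name).digest_size * 8, "bits")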

hashlib.sha3_256() Function in Python

Method #1: Using sha3_256() Function (Static Input)

Here, we encrypt the byte string or passwords to secure them using the hashlib.sha3_256() function.

Approach:

  • Import hashlib module using the import keyword
  • Create a hash object by calling the sha3_256() function of the hashlib module and store it in a variable
  • Give the string as static input(here b represents byte string) and store it in another variable.
  • Call the update() function on the above-created object, passing the given byte string as an argument to it
  • This feeds the byte string into the hash object.
  • Get the secure hash using the digest() function.
  • The Exit of the Program.

Below is the implementation:

# Import hashlib module using the import keyword
import hashlib

# Create a sha3_256 hash object by calling hashlib.sha3_256()
# and store it in a variable
obj = hashlib.sha3_256()

# Give the string as static input(here b represents byte string) and store it in another variable.
gvn_str = b'Python-programs'

# Call the update() function on the above created object, passing the given byte string
# as an argument to it; this feeds the byte string into the hash object
obj.update(gvn_str)
# Get the secure hash using the digest() function.
print(obj.digest())

Output:

b'\xf7\x97[\xc6b{ua\x90bn\xbb\xf54\xc4$\xab\x08\xde\xe6\x11\xb3\xd3\xca\x99_\x89\x0b\xa99>\x9c'

Method #2: Using sha3_256() Function (User Input)

Approach:

  • Import hashlib module using the import keyword
  • Create a hash object by calling the sha3_256() function of the hashlib module and store it in a variable
  • Give the string as user input using the input() function and store it in another variable.
  • Convert the given string into a byte string using the bytes() function by passing the given string, ‘utf-8’ as arguments to it.
  • Call the update() function on the above-created object, passing the byte string as an argument to it
  • This feeds the byte string into the hash object.
  • Get the secure hash using the digest() function.
  • The Exit of the Program.

Below is the implementation:

# Import hashlib module using the import keyword
import hashlib

# Create a sha3_256 hash object by calling hashlib.sha3_256()
# and store it in a variable
obj = hashlib.sha3_256()

# Give the string as user input using the input() function and store it in another variable.
gvn_str = input("Enter some random string = ")
# Convert the given string into byte string using the bytes() function by passing given string, 
# 'utf-8' as arguments to it 
gvn_str=bytes(gvn_str, 'utf-8')

# Call the update() function on the above created object, passing the byte string
# as an argument to it; this feeds the byte string into the hash object
obj.update(gvn_str)
# Get the secure hash using the digest() function.
print(obj.digest())

Output:

Enter some random string = welcome to Python-programs
b'\xd2\xd5!\xb3\xe4\xfaM\x93</8#\xf7\xa1\xdb\xces\nE^\xc1\xb2ukW\x8eF\x8e\xa0y\x8c\x05'


Python hashlib.sha3_384() Function

Python hashlib Module:

To generate a message digest or secure hash from the source message, we can utilize the Python hashlib library.

The hashlib module is required to generate a secure hash message in Python.

The hashlib hashing function in Python takes a variable length of bytes and converts it to a fixed-length sequence. This function only works in one direction. This means that when you hash a message, you obtain a fixed-length sequence. However, those fixed-length sequences do not allow you to obtain the original message.

A hash algorithm is considered better in cryptography if the original message cannot be recovered from the hash value. Changing even one byte in the original message also has a big impact (change) on the message digest value.

Python secure hash values are used to store passwords in hashed form, so that even the application’s owner does not have access to the user’s password. When the user enters the password again, the hash value is calculated and compared to the stored value.

Hashing Algorithms That Are Available:

  • The algorithms_available function returns a list of all the algorithms available in the system, including those accessible via OpenSSL. Duplicate algorithm names can also be found.
  • The algorithms_guaranteed function lists the algorithms that are guaranteed to be supported by the module on all platforms.
import hashlib
# Printing list of all the algorithms
print(hashlib.algorithms_available)
# Viewing algorithms
print(hashlib.algorithms_guaranteed)

Output:

{'sha384', 'blake2s', 'sha3_384', 'sha224', 'md5', 'shake_256', 'blake2b', 'sha3_512', 'sha1', 'shake_128', 'sha512', 'sha3_256', 'sha256', 'sha3_224'}
{'sha384', 'blake2s', 'sha3_384', 'sha224', 'md5', 'shake_256', 'blake2b', 'sha3_512', 'sha1', 'shake_128', 'sha512', 'sha3_256', 'sha256', 'sha3_224'}

Functions:

You only need to know a few functions to use the Python hashlib module.

  • You can hash the entire message at once by using the hashlib.<algorithm_name>(b"message") constructor, for example hashlib.sha3_384(b"message").
  • Additionally, the update() function can be used to feed the byte message into the hash object piece by piece. The output will be the same in both cases (see the short example after this list). Finally, the secure hash can be obtained by using the digest() function.
  • It’s worth noting that b is written to the left of the message to be hashed. This b indicates that the string is a byte string.
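
A quick sketch for sha3_384 (the message is our own example): hashing in one call and reading the digest as hex shows the 384-bit output.

import hashlib

# One-shot hashing with sha3_384 and a readable hex digest
hex_digest = hashlib.sha3_384(b"Python-programs").hexdigest()
print(hex_digest)
print(len(hex_digest))   # 96 hex characters = 48 bytes = 384 bits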

hashlib.sha3_384() Function:

We can compute the SHA3-384 hash of a message given as a byte string using the hashlib.sha3_384() function. Passwords and important files can be hashed with the hashlib.sha3_384() method to help secure them.

Syntax:

hashlib.sha3_384()

Return Value:

The sha3_384() function returns a hash object; the hash code for the given string is obtained by calling digest() on it.

Differences

Shortly after the discovery of cost-effective brute force operations against SHA-1, SHA-2 was created. It is a family of two similar hash algorithms, SHA-256 and SHA-512, with varying block sizes.

  • The fundamental distinction between SHA-256 and SHA-512 is word size.
  • SHA-256 uses 32-bit words, whereas SHA-512 employs 64-bit words.
  • Each standard also has modified versions called SHA-224, SHA-384, SHA-512/224, and SHA-512/256. Today, the most often used SHA function is SHA-256, which provides adequate safety at current computer processing capabilities.
  • SHA-384 is a cryptographic hash that belongs to the SHA-2 family. It generates a 384-bit digest of a message.
  • On 64-bit processors, SHA-384 is around 50% faster than SHA-224 and SHA-256, despite having a longer digest. The increased speed is due to the internal computation using 64-bit words, whereas the other two hash algorithms use 32-bit words.
  • For the same reason, SHA-512, SHA-512/224, and SHA-512/256 are faster on 64-bit processors.

Algorithm – digest size (the larger the better):

MD5 –> 128 bits
SHA-1 –> 160 bits
SHA-256 –> 256 bits
SHA-512 –> 512 bits

hashlib.sha3_384() Function in Python

Method #1: Using sha3_384() Function (Static Input)

Here, we encrypt the byte string or passwords to secure them using the hashlib.sha3_384() function.

Approach:

  • Import hashlib module using the import keyword
  • Create a hash object by calling the sha3_384() function of the hashlib module and store it in a variable
  • Give the string as static input(here b represents byte string) and store it in another variable.
  • Call the update() function on the above-created object, passing the given byte string as an argument to it
  • This feeds the byte string into the hash object.
  • Get the secure hash using the digest() function.
  • The Exit of the Program.

Below is the implementation:

# Import hashlib module using the import keyword
import hashlib

# Create a sha3_384 hash object by calling hashlib.sha3_384()
# and store it in a variable
obj = hashlib.sha3_384()

# Give the string as static input(here b represents byte string) and store it in another variable.
gvn_str = b'Python-programs'

# Call the update() function on the above created object, passing the given byte string
# as an argument to it; this feeds the byte string into the hash object
obj.update(gvn_str)
# Get the secure hash using the digest() function.
print(obj.digest())

Output:

b'\xf5;.\x1f\xb3$\xfd\xc2s7\xa5\x13\xa9\xc9\xceB\xd1A,u\x7fy\xeaw\xab\x07\x0b,lN\xefg\x86\x04t\xb0~\xd9\xe4_\xed\xc2\x07\xdb\xad$\xf4\xea'

Method #2: Using sha3_384() Function (User Input)

Approach:

  • Import hashlib module using the import keyword
  • Create a hash object by calling the sha3_384() function of the hashlib module and store it in a variable
  • Give the string as user input using the input() function and store it in another variable.
  • Convert the given string into a byte string using the bytes() function by passing the given string, ‘utf-8’ as arguments to it.
  • Call the update() function on the above-created object, passing the byte string as an argument to it
  • This feeds the byte string into the hash object.
  • Get the secure hash using the digest() function.
  • The Exit of the Program.

Below is the implementation:

# Import hashlib module using the import keyword
import hashlib

# Create a sha3_384 hash object by calling hashlib.sha3_384()
# and store it in a variable
obj = hashlib.sha3_384()

# Give the string as user input using the input() function and store it in another variable.
gvn_str = input("Enter some random string = ")
# Convert the given string into byte string using the bytes() function by passing given string, 
# 'utf-8' as arguments to it 
gvn_str=bytes(gvn_str, 'utf-8')

# Call the update() function on the above created object, passing the byte string
# as an argument to it; this feeds the byte string into the hash object
obj.update(gvn_str)
# Get the secure hash using the digest() function.
print(obj.digest())

Output:

Enter some random string = welcome to Python-programs
b'\xe4\xf2\xe03\xcbG\xe5YH.\xe2\x13W\xbc\xd0pQ\xd4dYshc=\xdc\xeeY\xf9\xa5dX\xda\x8c\x08\x18\xa1\xe4,\xe6z\x99x\xf2\xa4l\xe4\xc4$'
