Python Set symmetric_difference_update() Method with Examples

Set symmetric_difference_update() Method:

The symmetric_difference_update() method updates the original set in place: it removes the items that are present in both sets and inserts the items that are present only in the other set.

The set of elements that are in either P or Q but not in their intersection is the symmetric difference of two sets P and Q.

For Example:

Suppose P and Q are two sets. Calling P.symmetric_difference_update(Q) works as follows:

Let P={4, 5, 6, 7}

Q={5, 6, 8, 9}

Here {5, 6} are the common elements in both sets.

So, set P is updated to contain all the items from both sets except the common elements.

Set Q, on the other hand, remains unchanged.

Output : P={4, 7, 8, 9} and Q={5, 6, 8, 9}

Syntax:

set.symmetric_difference_update(set)

Parameters

set: Required. The other set (any iterable is accepted) to compare against the original set.

Return Value:

symmetric_difference_update() returns None. Rather than returning a new set, it modifies the set on which it is called.
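
The P and Q example above can be checked with a short sketch. It also contrasts the in-place method with symmetric_difference(), which returns a new set; the variable names A, result and new_set are chosen here only for illustration.

```python
# Sets from the example above
P = {4, 5, 6, 7}
Q = {5, 6, 8, 9}

# In-place update: returns None and modifies P, Q is left unchanged
result = P.symmetric_difference_update(Q)
print(result)      # None
print(sorted(P))   # [4, 7, 8, 9]
print(sorted(Q))   # [5, 6, 8, 9]

# The non-mutating counterpart returns a new set and leaves both inputs untouched
A = {4, 5, 6, 7}
new_set = A.symmetric_difference(Q)
print(sorted(new_set))  # [4, 7, 8, 9]
print(sorted(A))        # [4, 5, 6, 7]
```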

Examples:

Example1:

Input:

Given first set = {10, 20, 50, 60, 30}
Given second set = {20, 30, 80, 90}

Output:

The given first set =  {10, 80, 50, 90, 60}
The given second set =  {80, 90, 20, 30}
The result after applying symmetric_difference_update() method =  None

Example2:

Input:

Given first set = {'a', 'b', 'c'}
Given second set = {'a', 'b', 'c', 'd', 'e'}

Output:

The given first set =  {'d', 'e'}
The given second set =  {'a', 'e', 'b', 'd', 'c'}
The result after applying symmetric_difference_update() method =  None

Set symmetric_difference_update() Method with Examples in Python

Method #1: Using Built-in Functions (Static Input)

Approach:

  • Give the first set as static input and store it in a variable.
  • Give the second set as static input and store it in another variable.
  • Apply symmetric_difference_update() method to the given two sets and Store it in another variable.
  • Print the given first set.
  • Print the given second set.
  • Print the result after applying the symmetric_difference_update() method for the given two sets.
  • The Exit of the Program.

Below is the implementation:

# Give the first set as static input and store it in a variable.
fst_set = {10, 20, 50, 60, 30}
# Give the second set as static input and store it in another variable.
scnd_set = {20, 30, 80, 90}
# Apply symmetric_difference_update() method to the given two sets and
# Store it in another variable.
rslt = fst_set.symmetric_difference_update(scnd_set)
# Print the given first set
print("The given first set = ", fst_set)
# Print the given second set
print("The given second set = ", scnd_set)
# Print the result after applying symmetric_difference_update() method for the
# given two sets
print("The result after applying symmetric_difference_update() method = ", rslt)

Output:

The given first set =  {10, 80, 50, 90, 60}
The given second set =  {80, 90, 20, 30}
The result after applying symmetric_difference_update() method =  None

Method #2: Using Built-in Functions (User Input)

Approach:

  • Give the first set as user input using set(),map(),input(),and split() functions.
  • Store it in a variable.
  • Give the second set as user input using set(),map(),input(),and split() functions.
  • Store it in another variable.
  • Apply symmetric_difference_update() method to the given two sets and Store it in another variable.
  • Print the given first set.
  • Print the given second set.
  • Print the result after applying the symmetric_difference_update() method for the given two sets.
  • The Exit of the Program.

Below is the implementation:

# Give the first set as user input using set(),map(),input(),and split() functions.
# Store it in a variable.
fst_set = set(map(int, input(
   'Enter some random Set Elements separated by spaces = ').split()))
# Give the second set as user input using set(),map(),input(),and split() functions.
# Store it in another variable.
scnd_set = set(map(int, input(
   'Enter some random Set Elements separated by spaces = ').split()))

# Apply symmetric_difference_update() method to the given two sets and
# Store it in another variable.
rslt = fst_set.symmetric_difference_update(scnd_set)
# Print the given first set
print("The given first set = ", fst_set)
# Print the given second set
print("The given second set = ", scnd_set)
# Print the result after applying symmetric_difference_update() method for the
# given two sets
print("The result after applying symmetric_difference_update() method = ", rslt)

Output:

Enter some random Set Elements separated by spaces = 2 5 6 7 1
Enter some random Set Elements separated by spaces = 6 9 1 4 0
The given first set = {0, 2, 4, 5, 7, 9}
The given second set = {0, 1, 4, 6, 9}
The result after applying symmetric_difference_update() method = None

Python Programs for Calculus Using SymPy Module

Calculus:

Calculus is a branch of mathematics. Limits, functions, derivatives, integrals, and infinite series are all topics in calculus. To do calculus in Python, we will utilize the SymPy package.

Derivatives:

How steep is a function at a given point? Derivatives can be used to get the answer to this question. It measures the rate of change of a function at a specific point.

Integration:

What is the area beneath the graph over a particular region? Integration can be used to get the answer to this question. It combines the function’s values over a range of numbers.

SymPy Module:

SymPy is a Python symbolic mathematics library. It aspires to be a full-featured computer algebra system (CAS) while keeping the code as basic or simple as possible in order to be understandable and easily expandable. SymPy is entirely written in Python.

Before performing calculus operations install the sympy module as shown below:

Installation of sympy:

pip install sympy

To write any sympy expression, we must first declare its symbolic variables. To accomplish this, we can employ the following two functions:

sympy.Symbol(): This function is used to declare a single variable by passing it as a string into its parameter.

sympy.symbols(): This function is used to declare multiple variables at once. Pass the variable names as a single string, with the names separated by spaces.
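
A minimal sketch of both functions, assuming SymPy is installed. The symbol names x, a and b and the expression are chosen only for illustration.

```python
import sympy as sp

# Declare a single symbolic variable
x = sp.Symbol('x')

# Declare several symbolic variables at once (names separated by spaces)
a, b = sp.symbols('a b')

# Symbols combine into symbolic expressions
expr = (a + b)**2
expanded = sp.expand(expr)
print(expanded)   # a**2 + 2*a*b + b**2
```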

1)Limits Calculation in Python

Limits are used in calculus to define the continuity, derivatives, and integrals of a function. In Python, we use the following syntax to calculate limits:

Syntax:

sympy.limit(function,variable,value)

For Example:

lim f(x) as x -> k

The parameters of limit() map onto this expression as follows:

f(x): The function on which the limit operation will be conducted.

x: The function’s variable.

k: The value that x tends to.
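
As a small sketch of the syntax, the calls below compute a limit at a point, a limit at infinity, and the two one-sided limits of 1/x at 0. The functions are chosen only for illustration; dir is limit()'s optional direction argument and defaults to '+'.

```python
import sympy as sp

x = sp.Symbol('x')

# sin(x)/x is undefined at 0, but its limit there is 1
lim_point = sp.limit(sp.sin(x) / x, x, 0)
print(lim_point)     # 1

# Limit at infinity, using sympy's oo symbol
lim_inf = sp.limit(1 / x, x, sp.oo)
print(lim_inf)       # 0

# One-sided limits: dir='+' approaches from the right, dir='-' from the left
lim_right = sp.limit(1 / x, x, 0, dir='+')
lim_left = sp.limit(1 / x, x, 0, dir='-')
print(lim_right)     # oo
print(lim_left)      # -oo
```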

Example1: limit of sin(y) / y as y -> 0.4

Approach:

  • Import sympy module as ‘sp’ using the import keyword
  • Pass the argument y to the Symbol() function to declare the variable of the limit and store it in a variable
  • Create the function whose limit is required using the above variable and the sin function of the sympy module.
  • Pass the given function, variable, value as the arguments to the limit() function to get the limit value.
  • Store it in a variable.
  • Print the above-obtained limit value for the given function.
  • The Exit of the Program.

Below is the implementation:

# limit of sin(y) / y as y -> 0.4

# Import sympy module as 'sp' using the import keyword
import sympy as sp
# Pass the argument y to the Symbol() function to declare the variable of the limit and store it
y = sp.Symbol('y')
# Create the function whose limit is required using the above variable and the sin function
func = sp.sin(y)/y
# Pass the given function, variable, value as the arguments to the limit() function
# to get the limit value.
# Store it in a variable.
rslt_lmt = sp.limit(func, y, 0.4)
# Print the above obtained limit value for the given function.
print("The result limit value for the given function = ", rslt_lmt)

Output:

The result limit value for the given function = 0.973545855771626

Example2: limit of sin(3y) / y as y -> 0

# limit of sin(3y) / y as y -> 0

# Import sympy module as 'sp' using the import keyword
import sympy as sp
# Pass the argument y to the Symbol() function to declare the variable of the limit and store it
y = sp.Symbol('y')
# Create the function whose limit is required using the above variable and the sin function
func = sp.sin(3*y)/y
# Pass the given function, variable, value as the arguments to the limit() function
# to get the limit value.
# Store it in a variable.
rslt_lmt = sp.limit(func, y, 0)
# Print the above obtained limit value for the given function.
print("The result limit value for the given function = ", rslt_lmt)

Output:

The result limit value for the given function = 3

2)Derivatives Calculation in Python

Derivatives are an important aspect of conducting calculus in Python. We use the following syntax to differentiate a function, that is, to find its derivative:

Syntax:

sympy.diff(function,variable)

Example: f(y) = sin(y) + y^2 + e^(3y)

Below is the implementation:

# f(y) = sin(y) + y^2 + e^(3y)

# Import sympy module as 'sp' using the import keyword
import sympy as sp
# Pass the argument y to the Symbol() function to declare the variable and store it
y = sp.Symbol('y')
# Create the function using the above variable and the sin and exp functions of the sympy module
func = sp.sin(y) + y**2 + sp.exp(3*y)
# Get the first derivative by passing the function and the variable to diff() and print it
fst_diff = sp.diff(func, y)
print('The value of first differentiation of function', func, 'is :\n', fst_diff)
# Get the second derivative by passing the extra argument 2 (the order of differentiation) and print it
scnd_diff = sp.diff(func, y, 2)
print('The value of second differentiation of function', func, 'is :\n', scnd_diff)

Output:

The value of first differentiation of function y**2 + exp(3*y) + sin(y) is :
 2*y + 3*exp(3*y) + cos(y)
The value of second differentiation of function y**2 + exp(3*y) + sin(y) is :
 9*exp(3*y) - sin(y) + 2

Example: f(y) = cos(y) + y^2 + e^(3y)

# f(y) = cos(y) + y^2 + e^(3y)

# Import sympy module as 'sp' using the import keyword
import sympy as sp
# Pass the argument y to the Symbol() function to declare the variable and store it
y = sp.Symbol('y')
# Create the function using the above variable and the cos and exp functions of the sympy module
func = sp.cos(y) + y**2 + sp.exp(3*y)
# Get the first derivative by passing the function and the variable to diff() and print it
fst_diff = sp.diff(func, y)
print('The value of first differentiation of function', func, 'is :\n', fst_diff)
# Get the second derivative by passing the extra argument 2
# (the order of differentiation) and print it
scnd_diff = sp.diff(func, y, 2)
print('The value of second differentiation of function', func, 'is :\n', scnd_diff)

Output:

The value of first differentiation of function y**2 + exp(3*y) + cos(y) is :
 2*y + 3*exp(3*y) - sin(y)
The value of second differentiation of function y**2 + exp(3*y) + cos(y) is :
 9*exp(3*y) - cos(y) + 2
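
A derivative measures how steep a function is at a specific point, so a natural follow-up is to evaluate the symbolic derivative at a value. The sketch below uses subs() for this; the function and the point y = 0 are chosen only for illustration.

```python
import sympy as sp

y = sp.Symbol('y')
func = y**2 + sp.sin(y)

# Symbolic derivative: 2*y + cos(y)
deriv = sp.diff(func, y)

# Slope at y = 0, obtained by substituting the point into the derivative
slope_at_zero = deriv.subs(y, 0)
print(slope_at_zero)   # 1
```

Here the slope is 1 because 2*0 + cos(0) = 1.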

3)Integration Calculation in Python

SymPy’s integrals module provides the integrate() function. In Python, the syntax for calculating an integral is as follows:

Syntax:

integrate(function, variable)
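
integrate() also accepts a (variable, lower, upper) tuple, which gives a definite integral, i.e. the area over a region. A short sketch, with a polynomial chosen only for illustration:

```python
import sympy as sp

x = sp.Symbol('x')
expr = 3*x**2 + 1

# Indefinite integral (antiderivative, without the constant of integration)
antiderivative = sp.integrate(expr, x)
print(antiderivative)   # x**3 + x

# Definite integral from 0 to 2: pass a (variable, lower, upper) tuple
area = sp.integrate(expr, (x, 0, 2))
print(area)             # 10
```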

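Example: x^3 + 2x + 5

Below is the implementation:

# Function: x^3 + 2x + 5

# Import the required functions from the sympy module using the import keyword
from sympy import symbols, integrate
# Declare the symbolic variable
x = symbols('x')
gvn_expresn = x**3 + 2*x + 5
# Integrate the expression with respect to x and print the result
print("The integration for the given expression is:")
print(integrate(gvn_expresn, x))

Output:

The integration for the given expression is:
x**4/4 + x**2 + 5*x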

Python: Differences Between List and Array

List:

In Python, a list is a collection of items that can contain elements of multiple data types, such as numeric, character, and logical values. It is an ordered collection that allows for negative indexing. Using [], you can create a list with data values.
List contents can easily be merged and copied using Python’s built-in functions.

The core difference between a Python list and a Python array is that a list is built into the language, whereas an array requires the “array” module to be imported. With a few exceptions, lists in Python take the place of the array data structure.

Array:

An array is a sequence that contains homogeneous items, that is, elements of the same data type. Elements are stored in contiguous memory, allowing for efficient access and modification. To declare arrays in Python, we must use the array module. If an item does not match the array’s declared type, a TypeError is raised.
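
The homogeneity requirement can be seen with a short sketch using the standard array module. The values and the rejected flag are chosen only for illustration.

```python
import array

# 'i' is the type code for signed integers
numbers = array.array('i', [15, 65, 25, 48])
numbers.append(7)          # same type: accepted
print(numbers)             # array('i', [15, 65, 25, 48, 7])

# A value of a different type is rejected with a TypeError
rejected = False
try:
    numbers.append("hello")
except TypeError as exc:
    rejected = True
    print("TypeError:", exc)
```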

Differences: List vs Array

LIST | ARRAY
Elements of many data types may be present in a list. | Only elements of the same data type are included in an array.
There is no need to import a module manually for declaration. | It is necessary to explicitly import a module for declaration.
Cannot do mathematical operations directly. | Can do arithmetic operations directly in an array.
Preferred for shorter data item sequences. | Preferred for longer data sequences.
It is possible to nest elements to contain multiple types of elements. | All nested items must be of the same size.
Data can be easily modified (added, deleted) with greater freedom. | Less flexibility because addition and deletion must be done element by element.
Without any explicit looping, the complete list can be printed. | To print or access the array’s components, a loop must be created.
Lists consume more memory to facilitate the insertion of elements. | The consumed memory size is somewhat smaller.

1)Storing Data – List vs Array:

Data structures, as we all know, are used to effectively store data.
In this scenario, a list can contain heterogeneous data values. In other words, data objects of various data types can be handled in a Python List.

list:

# Give the list as static input and store it in a variable.
# List may contain heterogenous datatypes like int, float, strings etc.
gvn_lst = [8, 2.5, 1, "hello", 'Python-programs']
# Print the given list
print(gvn_lst)

Output:

[8, 2.5, 1, 'hello', 'Python-programs']

Arrays:

Arrays, on the other hand, store homogenous elements into them, that is, elements of the same kind.

# Import array using the import keyword
import array
# Give the array as static input by passing it as an argument to
# the array() function and store it in a variable.
# Arrays contain homogeneous datatypes
gvn_arry = array.array('i', [15, 65, 25, 48])
# Print the given array
print(gvn_arry)

Output:

array('i', [15, 65, 25, 48])

2)Declaration

Lists: 

Python provides a built-in data structure called “List.” As a result, lists in Python do not need to be declared or imported before use.

# Give the list as static input and store it in a variable.
gvn_lst = [8, 2.5, 1, "hello", 'Python-programs']

Arrays:

Arrays must be declared in Python. We can declare an array using the following methods:

Array Module:

import array
ArrayName = array.array('format-code', [items])

Numpy Module:

import numpy
ArrayName = numpy.array([items])

3)Mathematical Operations:

Arrays:

When it comes to conducting Mathematical operations, arrays have an advantage. The NumPy module provides us with an array structure to store and manipulate data values.

Example

# Import numpy module using the import keyword
import numpy
# Give the array as static input by passing it as an argument to
# the array() function and store it in a variable.
gvn_arry = numpy.array([15, 10, 5, 4])
# Multiply each element of the array with 5 and store it in another variable.
rslt = gvn_arry*5
# Print the above result
print(rslt)

Output:

[75 50 25 20]

Lists behave differently. The * operator does not multiply each element of a list; it repeats the list the given number of times, as shown in the example below using list operations.

In this case, we attempted to multiply the list by the constant value (5), but the elements are not multiplied. Instead, the result is the original list repeated five times, and the original list is left unchanged.

So, if we wish to multiply the elements of the list by 5, we must multiply each element of the list individually.

Lists:

# Give the list as static input and store it in a variable.
gvn_lst = [15, 10, 5, 4]
# Multiply the list with 5 and store it in another variable.
rslt = gvn_lst*5
# Print the given list (it is unchanged)
print(gvn_lst)
# Print the result (the list repeated 5 times)
print(rslt)

Output:

[15, 10, 5, 4]
[15, 10, 5, 4, 15, 10, 5, 4, 15, 10, 5, 4, 15, 10, 5, 4, 15, 10, 5, 4]
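
Multiplying each element individually can be sketched with a list comprehension, which is one common way to do it:

```python
gvn_lst = [15, 10, 5, 4]

# Multiply each element individually with a list comprehension
rslt = [item * 5 for item in gvn_lst]
print(rslt)   # [75, 50, 25, 20]
```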

4)Changing the size of the data structure

Python Lists, as an inbuilt data structure, may be enlarged or resized quickly and easily.

NumPy arrays, on the other hand, perform poorly when resized. A NumPy array has a fixed size, so growing it means creating a new array and copying the existing elements into it.
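
The difference can be sketched as follows, assuming NumPy is installed. The values and the name bigger are chosen only for illustration.

```python
import numpy as np

# A list grows in place
gvn_lst = [15, 10, 5]
gvn_lst.append(4)
print(gvn_lst)               # [15, 10, 5, 4]

# numpy.append() does not modify its input; it returns a new, larger array
gvn_arry = np.array([15, 10, 5])
bigger = np.append(gvn_arry, 4)
print(gvn_arry.tolist())     # [15, 10, 5]
print(bigger.tolist())       # [15, 10, 5, 4]
```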

 

Python Numpy Broadcasting

The term broadcasting refers to how NumPy handles arrays of different shapes during arithmetic operations. The smaller array is “broadcast” across the bigger array, subject to certain constraints, so that their shapes are compatible. Broadcasting allows you to vectorize array operations so that looping happens in C rather than in Python.

For Example:

To understand NumPy’s broadcasting method, we add a scalar to a one-dimensional array.

#import numpy as np using the import keyword
import numpy as np
# Pass some random number(length of array) to the arange() function and store it in a variable.
gvn_arry = np.arange(4)
# Add a number to the given array and store it in another variable.
result = gvn_arry + 6
# Print the above result
print(result)

Output:

[6 7 8 9]

In this case, the given array has one dimension (axis) of length 4, whereas 6 is a simple integer with no dimensions. Because they have different dimensions, NumPy broadcasts (conceptually stretches) the scalar along that axis so that the shapes match for the arithmetic operation.

The Numpy Broadcasting Rules

NumPy broadcasting follows a strict set of rules to ensure that array operations are consistent and fail-safe. The following are two general broadcasting rules in NumPy:

When we perform an operation on two NumPy arrays, NumPy compares their shapes dimension by dimension, from right to left. Two dimensions are compatible only when they are equal or one of them is 1. If the two dimensions are equal, that dimension is kept as it is. If one of them is 1, the array is broadcast (stretched) along that dimension. If neither condition is met, NumPy raises a ValueError, indicating that the arrays cannot be broadcast together. The arrays are broadcast if and only if all dimensions are compatible.

The arrays being compared do not have to have the same number of dimensions. The array with fewer dimensions is treated as if its shape were padded with 1s on the left, so it can be stretched along the missing dimensions.
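
The two rules can be sketched in plain Python. The helper broadcast_shape() below is hypothetical and written only for illustration; it is not part of NumPy, and it only mirrors the right-to-left comparison described above.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError (illustrative helper)."""
    result = []
    # Walk both shapes from right to left; a missing dimension counts as 1
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible dimensions: {dim_a} and {dim_b}")
    return tuple(reversed(result))

print(broadcast_shape((5, 3), (5, 1)))            # (5, 3)
print(broadcast_shape((6, 3), (3,)))              # (6, 3)
print(broadcast_shape((5, 4, 3, 2), (4, 3, 2)))   # (5, 4, 3, 2)
```

Calling broadcast_shape((5, 3), (4, 1)) raises ValueError, because 5 and 4 are neither equal nor 1.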

Implementation

Let the two arrays be arr_1 and arr_2, with shapes (5, 3) and (5, 1).

The sum of arrays with compatible dimensions: The arrays have compatible shapes (5, 3) and (5, 1). To match the shape of arr_1, the array arr_2 is stretched along the second dimension.

# Import numpy module as np using the import keyword.
import numpy as np
# Pass the rowsize*columnsize as argument to the arange() function and
# pass the rowsize, columnsize as arguments to the reshape() function
# Store it in a variable.
arry1 = np.arange(15).reshape(5, 3)
# Print the shape(rowsize, columnsize) using the shape attribute
print("First Array shape = ", arry1.shape)
# Similarly get the other array
arry2 = np.arange(5).reshape(5, 1)
print("Second Array shape = ", arry2.shape)
# Print the sum of both the arrays by adding the above two variables.
print("Adding both arrays and printing the sum of it:")
print(arry1 + arry2)

Output:

First Array shape = (5, 3) 
Second Array shape = (5, 1) 
Adding both arrays and printing the sum of it: 
[[ 0 1 2] 
 [ 4 5 6] 
 [ 8 9 10] 
 [12 13 14] 
 [16 17 18]]

Example2:

# Import numpy module as np using the import keyword.
import numpy as np
# Pass the rowsize*columnsize as argument to the arange() function and
# pass the rowsize, columnsize as arguments to the reshape() function
# Store it in a variable.
arry1 = np.arange(15).reshape(5, 4)
# Print the shape(rowsize, columnsize) using the shape attribute
print("First Array shape = ", arry1.shape)
# Similarly get the other array
arry2 = np.arange(5).reshape(5, 1)
print("Second Array shape = ", arry2.shape)
# Print the sum of both the arrays by adding the above two variables.
print("Adding both arrays and printing the sum of it:")
print(arry1 + arry2)

Output:

ValueError                                Traceback (most recent call last)
<ipython-input-17-3bb527438935> in <module>()
      4 # pass the rowsize, columnsize  as arguments to the reshape() function
      5 # Store it in a variable.
----> 6 arry1 = np.arange(15).reshape(5, 4)
      7 # Print the shape(rowsize, columnsize) using the shape function
      8 print("First Array shape = ", arry1.shape)

ValueError: cannot reshape array of size 15 into shape (5,4)

Explanation:

Here, reshape(5, 4) asks for 5 rows and 4 columns, which needs 5*4 = 20
elements, but np.arange(15) provides only 15. The ValueError therefore comes
from reshape(), before any broadcasting is attempted.
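
For comparison, here is a sketch of a failure that does come from broadcasting. The shapes (5, 3) and (4, 1) and the raised flag are chosen only for illustration.

```python
import numpy as np

arry1 = np.arange(15).reshape(5, 3)
arry2 = np.arange(4).reshape(4, 1)

# Right to left: 3 vs 1 is compatible, but 5 vs 4 is neither equal nor 1
raised = False
try:
    arry1 + arry2
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```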

Example3:

# Import numpy module as np using the import keyword.
import numpy as np
# Pass the rowsize*columnsize as argument to the arange() function and
# pass the rowsize, columnsize as arguments to the reshape() function
# Store it in a variable.
arry1 = np.arange(18).reshape(6, 3)
# Print the shape(rowsize, columnsize) using the shape attribute
print("First Array shape = ", arry1.shape)
# Similarly get the other array
arry2 = np.arange(3)
print("Second Array shape = ", arry2.shape)
# Print the sum of both the arrays by adding the above two variables.
print("Adding both arrays and printing the sum of it:")
print(arry1 + arry2)

Output:

First Array shape = (6, 3) 
Second Array shape = (3,) 
Adding both arrays and printing the sum of it: 
[[ 0 2 4] 
 [ 3 5 7] 
 [ 6 8 10] 
 [ 9 11 13] 
 [12 14 16] 
 [15 17 19]]

Example4:

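# Import numpy module as np using the import keyword.
import numpy as np
# Create a four-dimensional array and print its shape
arry1 = np.arange(120).reshape(5, 4, 3, 2)
print("First Array shape = ", arry1.shape)
# Create a three-dimensional array and print its shape
arry2 = np.arange(24).reshape(4, 3, 2)
print("Second Array shape = ", arry2.shape)
# Add both arrays and print the shape of the sum
print("Adding both arrays and printing the sum of it: \n", (arry1 + arry2).shape)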

Output:

First Array shape = (5, 4, 3, 2) 
Second Array shape = (4, 3, 2) 
Adding both arrays and printing the sum of it: 
(5, 4, 3, 2)

Explanation:

Broadcasting can also supply missing dimensions. Array1 has shape
(5, 4, 3, 2), while array2 has shape (4, 3, 2). Array2 is treated as if it
had shape (1, 4, 3, 2) and is stretched along the first dimension to match
array1, which is left unchanged, yielding a result of shape
(5, 4, 3, 2).

Broadcasting’s Speed Advantages

NumPy broadcasting is more efficient than looping through the array. Take the first example: instead of using broadcasting, the user could loop through the array and add the same number to each element. This is slower for two reasons. First, the loop runs in the Python interpreter rather than in NumPy’s compiled C code. Second, broadcasting does not copy data: NumPy uses strides, and setting a stride to 0 lets the same elements be reused along a dimension without incurring any memory overhead.
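
The zero-stride idea can be observed with np.broadcast_to(), which returns a stretched view without copying. This is a small sketch for illustration; the array and target shape are arbitrary.

```python
import numpy as np

row = np.arange(3)

# broadcast_to returns a read-only view with the requested shape
stretched = np.broadcast_to(row, (4, 3))
print(stretched.shape)        # (4, 3)

# The stride along the stretched axis is 0: every "row" is the same memory
print(stretched.strides[0])   # 0

# No data was copied: the view shares memory with the original array
shared = np.shares_memory(row, stretched)
print(shared)                 # True
```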

 

Python Program to Remove Stop Words with NLTK

Pre-processing is the process of transforming data into a form that a computer can work with. Filtering out unhelpful data is a common type of pre-processing. In natural language processing, words that carry little useful information are called stop words.

Stop Words:

A stop word is a commonly used word (for example, “the”, “a”, “an”, “is”, or “in”) that a search engine has been configured to ignore, both while indexing entries for searching and while retrieving them as the result of a search query.
We don’t want these words taking up space in our database or using precious processing time. We can easily eliminate them by storing a list of words that we consider to be stop words. Python’s NLTK (Natural Language Toolkit) contains lists of stop words for a number of languages. You may find them in the nltk_data directory, which is located at home/folder/nltk_data/corpora/stopwords.

Note: Don’t forget to modify the name of your home directory.

Before going to the coding part, download the corpus including stop words from the NLTK module.

# Import nltk module using the import keyword.
import nltk
# Pass the 'stopwords' as an argument to the download() function to download all the
# stop words package
nltk.download('stopwords')

Output:

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True

Printing the stop words list from the corpus:

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Print all the stopwords in english language using the words() function in
# stopwords.
print(stopwords.words('english'))

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', 
"it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",
 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',
 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn',
 "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
 "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn',
 "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 
'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

You can also select stopwords from the other languages based on requirements.

Get all the Languages list that can be used:

The languages available in the NLTK ‘stopwords’ corpus are listed below.

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Get all the Languages list that can be used using the fileids() function in
# stopwords
print(stopwords.fileids())

Output:

['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish',
 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 
'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 
'spanish', 'swedish', 'tajik', 'turkish']

Adding our own stop words to the corpus:

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Get all the stopwords in english language using the words() function in
# stopwords.
# Store it in a variable
our_stopwords = stopwords.words('english')
# Append some random stop word to the above obtained stopwords list using the
# append() function
our_stopwords.append('forexample')
print(our_stopwords)

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 
'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 
'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same',
 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",
 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 
'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', 
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 
'won', "won't", 'wouldn', "wouldn't", 'forexample']

The user-supplied stop word appears at the end of the list in the output.

Removal of stop words:

Below is the code for removing all the stop words from a given string/sentence.

Tokenization:

Tokenization is the process of converting a piece of text into smaller parts known as tokens. These tokens are the core of NLP.

Tokenization is used to convert a sentence into a list of words.

Approach:

  • Import nltk module using the import keyword.
  • Import stopwords from nltk.corpus using the import keyword.
  • Download ‘stopwords’,’punkt’ from nltk module using the download() function.
  • Import word_tokenize from nltk.tokenize using the import keyword.
  • Give the random string as static input and store it in a variable.
  • Pass the given string to the word_tokenize() function to convert the given string into a list of words.
  • Remove the stop words from the given string using the list comprehension and store it in another variable.
  • Print the string after removing stopwords.
  • The Exit of the Program.

Below is the implementation:

# Import nltk module using the import keyword.
import nltk
# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Download 'stopwords','punkt' from nltk module using the download() function.
nltk.download('stopwords')
nltk.download('punkt')
# Import word_tokenize from nltk.tokenize using the import keyword.
from nltk.tokenize import word_tokenize
# Give the random string as static input and store it in a variable.
gvn_str = "hello this is btechgeeks in is good morning all is a"
# Pass the given string to the word_tokenize() function to convert the given
# string into a list of words.
text_tokens = word_tokenize(gvn_str)
# Build the set of English stop words once, then remove the stop words from
# the tokens using a list comprehension and store the result in another variable.
english_stopwords = set(stopwords.words('english'))
stopwrds_removd = [word for word in text_tokens if word not in english_stopwords]
# Print the string after removing stopwords.
print(stopwrds_removd)

Output:

['hello', 'btechgeeks', 'good', 'morning']
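
Stop-word lists such as NLTK’s are lower-case, so capitalised tokens slip through unless they are normalised first. The sketch below illustrates this without NLTK: the stop_words set is a small hand-written stand-in, and str.split() stands in for a real tokenizer.

```python
# Hypothetical, hand-written stop-word set used only for this illustration
stop_words = {"this", "is", "in", "a", "all", "the"}

gvn_str = "Hello This is btechgeeks In is good morning all"

# str.split() is a simple stand-in for a real tokenizer
tokens = gvn_str.split()

# Compare the lower-cased token so that "This" and "In" are also removed
filtered = [word for word in tokens if word.lower() not in stop_words]
print(filtered)   # ['Hello', 'btechgeeks', 'good', 'morning']
```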

 

In Python, How do you Normalize Data?

This post will teach you how to normalize data in Pandas.

Pandas:

Pandas is an open-source library developed on top of the NumPy library. It is a Python module that contains a variety of data structures and procedures for manipulating numerical data and statistics. It’s mostly used to make importing and evaluating data easier. Pandas is fast, high-performance, and productive for users.

Data Normalization:

Data Normalization is a common approach in machine learning that involves rescaling numeric columns to a standard scale. Some features have values many times larger than others, and without normalization the features with the largest values will dominate the learning process.

Before we get into normalisation, let us first grasp why it is necessary.

  • Feature scaling is an important stage in data analysis and data preparation for modelling. In this section, we make the data scale-free for easier analysis.
  • One of the feature scaling strategies is normalisation. We use normalisation most often when the data is skewed on either axis, i.e. when the data does not match the Gaussian distribution.
  • Normalization converts data features from different scales to a similar scale, making it easier to handle the data for modelling. As a result, all of the data features (variables) have a similar impact on the modelling section.

We normalise each feature using the formula below by subtracting the minimum data value from the data variable and then dividing it by the variable’s range, as shown below:

Formula:

x_normalized = (x - x_min) / (x_max - x_min)

As a result, we convert the data to a range between [0,1].
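
The min-max rule described above can be sketched in plain Python before turning to any library. This is only an illustration: the helper name min_max_normalize is hypothetical, and the values are the calories column of the dummy dataset used in this article.

```python
# Calories column of the dummy dataset used in this article.
calories = [70, 120, 70, 50, 110, 110, 110, 130, 90, 90,
            120, 110, 120, 110, 110, 110, 100, 110, 110]


def min_max_normalize(values):
    """Rescale values to the range 0 to 1 using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(x - lowest) / value_range for x in values]


normalized_calories = min_max_normalize(calories)
# Print the first four normalized values.
print(normalized_calories[:4])
```

The smallest value (50) maps to 0 and the largest (130) maps to 1, which is the behaviour we expect from min-max normalization.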

Methods for Normalizing Data in Python

Python has several approaches that you can use to do normalization.

Let us take an example of a dummy dataset here. You can download some other dataset and test it out.

1) MinMaxScaler

Importing the Dataset

Import the dataset into a Pandas Dataframe.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
dummy_dataset = pd.read_csv("dummy_data.csv")
print(dummy_dataset)

Output:

    id  calories  protein  fat
0    0        70        4    1
1    1       120        3    5
2    2        70        4    1
3    3        50        4    0
4    4       110        2    2
5    5       110        2    2
6    6       110        2    0
7    7       130        3    2
8    8        90        2    1
9    9        90        3    0
10  10       120        1    2
11  11       110        6    2
12  12       120        1    3
13  13       110        3    2
14  14       110        1    1
15  15       110        2    0
16  16       100        2    0
17  17       110        1    0
18  18       110        1    1

Normalizing the above-given dataset by applying the MinMaxScaler function

Approach:

  • Import pandas module as pd using the import keyword.
  • Import MinMaxScaler function from sklearn.preprocessing module using the import keyword.
  • Import dataset using read_csv() function by passing the dataset name as an argument to it.
  • Store it in a variable.
  • Create an object for the MinMaxScaler() function and store it in a variable.
  • Normalize (rescale every column to the range 0 to 1) the given dataset using the fit_transform() function and store it in another variable.
  • Print the Normalized data values.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import MinMaxScaler function from sklearn.preprocessing module using the import keyword
from sklearn.preprocessing import MinMaxScaler
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
dummy_dataset = pd.read_csv("dummy_data.csv")
# Create an object for the MinMaxScaler() function and store it in a variable.
scaler_val = MinMaxScaler()

# Normalize (rescale every column to the range 0 to 1) the given dataset using the
# fit_transform() function and store it in another variable.
normalizd_data = pd.DataFrame(scaler_val.fit_transform(dummy_dataset),
                              columns=dummy_dataset.columns, index=dummy_dataset.index)
# Print the Normalized data values
print(normalizd_data)

Output:

As can be seen, we have processed and normalised the data values between 0 and 1.

          id  calories  protein  fat
0   0.000000     0.250      0.6  0.2
1   0.055556     0.875      0.4  1.0
2   0.111111     0.250      0.6  0.2
3   0.166667     0.000      0.6  0.0
4   0.222222     0.750      0.2  0.4
5   0.277778     0.750      0.2  0.4
6   0.333333     0.750      0.2  0.0
7   0.388889     1.000      0.4  0.4
8   0.444444     0.500      0.2  0.2
9   0.500000     0.500      0.4  0.0
10  0.555556     0.875      0.0  0.4
11  0.611111     0.750      1.0  0.4
12  0.666667     0.875      0.0  0.6
13  0.722222     0.750      0.4  0.4
14  0.777778     0.750      0.0  0.2
15  0.833333     0.750      0.2  0.0
16  0.888889     0.625      0.2  0.0
17  0.944444     0.750      0.0  0.0
18  1.000000     0.750      0.0  0.2

In Brief:

As a result of the preceding explanation, the following conclusions can be drawn–

  • Normalisation is used when the data values are skewed and do not follow a Gaussian distribution.
  • The data values are transformed to lie between 0 and 1.
  • Normalization makes the data scale-free.

We can also use another method for normalization:

The maximum absolute scaling:

By dividing each observation by the maximum absolute value of its feature, maximum absolute scaling rescales each feature to the range -1 to 1. Using the .max() and .abs() methods in Pandas, we can achieve maximum absolute scaling.

However, MinMaxScaler is the more popular approach, so it is the main focus of this article.
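
For completeness, maximum absolute scaling can be sketched with the .abs() and .max() methods as follows. The DataFrame and its column names are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
sample_data = pd.DataFrame({"temperature": [-10.0, 5.0, 20.0],
                            "calories": [50.0, 100.0, 200.0]})

# Divide every column by the largest absolute value found in that column.
max_abs_scaled = sample_data / sample_data.abs().max()
print(max_abs_scaled)
```

Unlike min-max normalization, negative inputs stay negative, so the results lie between -1 and 1.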

 


Python Program to Extract Digits from a String – 2 Easy Ways

When working with strings, we frequently need to extract all of the digits they contain. This type of problem is common in competitive programming as well as web development. Let’s solve it now!

Program to Extract Digits from a String – 2 Easy Ways in Python

Method #1: Using Built-in Functions (Static Input)

1) Using Python isdigit() Function:

The Python isdigit() method returns True if the string is non-empty and every character in it is a digit; otherwise it returns False.

Syntax:

string.isdigit()
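
A quick check makes the behaviour of isdigit() concrete. The sample strings below are hypothetical and chosen only to show the edge cases.

```python
# Hypothetical sample strings chosen to show the edge cases.
samples = ["678", "678_Good", "", "7"]

# isdigit() is True only for a non-empty string made up entirely of digits.
digit_checks = {text: text.isdigit() for text in samples}

for text, is_all_digits in digit_checks.items():
    print(repr(text), "->", is_all_digits)
```

Because a mixed string such as "678_Good" returns False, the programs below call isdigit() on one character at a time.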

Approach:

  • Give the string as static input and store it in a variable.
  • Take a variable and initialize it with an empty string.
  • Iterate in the given string using the for loop.
  • Inside the for loop, check if the character in a given string is a digit or not using the isdigit() function and if conditional statement.
  •  If it is true, then concatenate the character to the above declared empty string using the ‘+’ operator and store it in the same variable.
  • Print all the digits from a given string.
  • The Exit of the Program.

Below is the implementation:

# Give the string as static input and store it in a variable.
gvn_strng = "678_Goodmorning 123 hello all"
print("The Given string = ", gvn_strng)
# Take a variable and initialize it with an empty string.
new_str = ""
# Iterate in the given string using the for loop.
for chrctr in gvn_strng:
    # Inside the for loop, check if the character in a given string is a digit
    # or not using the isdigit() function and if conditional statement.
    if chrctr.isdigit():
        # If it is true, then concatenate the character to the above declared empty
        # string using the '+' operator and store it in the same variable.
        new_str = new_str + chrctr
# Print all the digits from a given string.
print("The digits present in a given string = ", new_str)

Output:

The Given string =  678_Goodmorning 123 hello all
The digits present in a given string =  678123

2)Using List comprehension:

# Give the string as static input and store it in a variable.
gvn_strng = "678_Goodmorning 123 hello all"
# Print the given string
print("The Given string = ", gvn_strng)
# Using list comprehension to get all the digits present in a given string
new_lst = [int(chrctr) for chrctr in gvn_strng if chrctr.isdigit()]
# Print all the digits from a given string.
print("The digits present in a given string = ", new_lst)

Output:

The Given string =  678_Goodmorning 123 hello all
The digits present in a given string =  [6, 7, 8, 1, 2, 3]

3)Using regex Library:

Python’s regular expression module, re (often called the regex library), lets us detect specific kinds of characters in a string, such as digits, special characters, and so on.
Before proceeding, import the re module into the Python environment.

import re

r'\d+'  – to extract numbers from the string

The pattern '\d+' matches one or more consecutive digits, so the findall() function returns each run of digits as a separate string.
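
The difference between matching a single digit and matching a run of digits is easy to see side by side. This is a small sketch; the variable names are illustrative.

```python
import re

sample_text = "6 Goodmorning 17 hello all"

# r'\d' matches one digit at a time.
single_digits = re.findall(r'\d', sample_text)
# r'\d+' matches each run of consecutive digits as one string.
digit_runs = re.findall(r'\d+', sample_text)
# findall() returns strings, so convert them if numbers are needed.
numbers = [int(run) for run in digit_runs]

print(single_digits)
print(digit_runs)
print(numbers)
```

With r'\d' the number 17 is split into '1' and '7', while r'\d+' keeps it together as '17'.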

Approach:

  • Import regex library using the import keyword.
  • Give the string as static input and store it in a variable.
  • Print the given string.
  • Pass r'\d+' and the given string as arguments to the re.findall() function to extract numbers from the string and store it in another variable.
  • Here '\d+' makes the findall() function match every run of one or more digits.
  • Print all the digits from a given string.
  • The Exit of the Program.

Below is the implementation:

# Import regex library using the import keyword.
import re
# Give the string as static input and store it in a variable.
gvn_strng = "6 Goodmorning 17 hello all"
# Print the given string
print("The Given string = ", gvn_strng)
# Pass r'\d+'  and given string as arguments to the re.findall() function to
# extract numbers from the string and store it in another variable.
# Here '\d+' helps the findall() function in identifying the existence of any digit.
rslt_digts = re.findall(r'\d+', gvn_strng)
# Print all the digits from a given string.
print(rslt_digts)

Output:

The Given string = 6 Goodmorning 17 hello all
['6', '17']

Method #2: Using Built-in Functions (User Input)

1) Using Python isdigit() Function:

Approach:

  • Give the string as user input using the input() function and store it in a variable.
  • Take a variable and initialize it with an empty string.
  • Iterate in the given string using the for loop.
  • Inside the for loop, check if the character in a given string is a digit or not using the isdigit() function and if conditional statement.
  •  If it is true, then concatenate the character to the above declared empty string using the ‘+’ operator and store it in the same variable.
  • Print all the digits from a given string.
  • The Exit of the Program.

Below is the implementation:

# Give the string as user input using the input() function and store it in a variable.
gvn_strng = input("Enter some random string = ")
print("The Given string = ", gvn_strng)
# Take a variable and initialize it with an empty string.
new_str = ""
# Iterate in the given string using the for loop.
for chrctr in gvn_strng:
    # Inside the for loop, check if the character in a given string is a digit
    # or not using the isdigit() function and if conditional statement.
    if chrctr.isdigit():
        # If it is true, then concatenate the character to the above declared empty
        # string using the '+' operator and store it in the same variable.
        new_str = new_str + chrctr
# Print all the digits from a given string.
print("The digits present in a given string = ", new_str)

Output:

Enter some random string = welcome6477 to Python-programs
The Given string = welcome6477 to Python-programs
The digits present in a given string = 6477

2)Using List comprehension:

# Give the string as user input using the input() function and store it in a variable.
gvn_strng = input("Enter some random string = ")
# Print the given string
print("The Given string = ", gvn_strng)
# Using list comprehension to get all the digits present in a given string
new_lst = [int(chrctr) for chrctr in gvn_strng if chrctr.isdigit()]
# Print all the digits from a given string.
print("The digits present in a given string = ", new_lst)

Output:

Enter some random string = 65 hello 231 all
The Given string = 65 hello 231 all
The digits present in a given string = [6, 5, 2, 3, 1]

3)Using regex Library:

Approach:

  • Import regex library using the import keyword.
  • Give the string as user input using the input() function and store it in a variable.
  • Print the given string.
  • Pass r'\d+' and the given string as arguments to the re.findall() function to extract numbers from the string and store it in another variable.
  • Here '\d+' makes the findall() function match every run of one or more digits.
  • Print all the digits from a given string.
  • The Exit of the Program.

Below is the implementation:

# Import regex library using the import keyword.
import re
# Give the string as user input using the input() function and store it in a variable.
gvn_strng = input("Enter some random string = ")
# Print the given string
print("The Given string = ", gvn_strng)
# Pass r'\d+'  and given string as arguments to the re.findall() function to
# extract numbers from the string and store it in another variable.
# Here '\d+' helps the findall() function in identifying the existence of any digit.
rslt_digts = re.findall(r'\d+', gvn_strng)
# Print all the digits from a given string.
print(rslt_digts)

Output:

Enter some random string = hello this is python 35 program
The Given string = hello this is python 35 program
['35']


Python astype() Method with Examples

In this tutorial, we will go over an important idea in detail: Data Type Conversion of Columns in a DataFrame Using Python astype() Method.

Python is a superb language for data analysis, owing to its fantastic ecosystem of data-centric Python packages. Pandas is one of these packages, and it greatly simplifies data import and analysis.

astype() Method:

DataFrame.astype() method is used to convert pandas object to a given datatype. The astype() function can also convert any acceptable existing column to a categorical type.

We frequently come across a stage in the realm of Data Science and Machine Learning when we need to pre-process and transform the data. To be more specific, the transformation of data values is the first step toward modeling.
This is when data column conversion comes into play.

The Python astype() method allows us to convert the data type of an existing data column in a dataset or data frame.

Using the astype() function, we can convert the data values of a single column or of multiple columns to a completely different type.

Syntax:

DataFrame.astype(dtype, copy=True, errors='raise')

Parameters

dtype: The data type to apply to the entire data frame, or a dictionary that maps column names to data types.
copy: If set to True (the default), the method returns a new copy of the data with the changes incorporated.
errors: If set to ‘raise’ (the default), the method raises an exception when a conversion fails. If set to ‘ignore’, exceptions are suppressed and the original object is returned on error.
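
The parameters above can be tried on a small frame. This is a minimal sketch: the data is made up, and the variable names sample_frame, converted and conversion_error are hypothetical.

```python
import pandas as pd

# Hypothetical data: numbers stored as text, with one value that cannot be parsed.
sample_frame = pd.DataFrame({"ID": ["11", "12", "13"],
                             "salary": ["10000", "25000", "unknown"]})

# Passing a dictionary converts only the listed columns.
converted = sample_frame.astype({"ID": "int64"})
print(converted.dtypes)

# With errors='raise' (the default), a failed conversion raises an exception.
conversion_error = None
try:
    sample_frame.astype({"salary": "int64"})
except ValueError as exc:
    conversion_error = str(exc)
    print("Conversion failed:", conversion_error)
```

The 'ID' column converts cleanly, while the text 'unknown' in 'salary' cannot be turned into an integer and triggers the exception.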

1)astype() – with DataFrame

Below is the implementation:

# Import pandas module using the import keyword
import pandas as pd
# Give the dictionary as static input and store it in a variable.
# (data given in the dictionary form)
gvn_data = {"ID": [11, 12, 13, 14, 15, 16],
            "Name": ["peter", "irfan", "mary", "riya", "virat", "sunny"],
            "salary": [10000, 25000, 15000, 50000, 30000, 22000]}
# Pass the given data to the DataFrame() function and store it in another variable
block_data = pd.DataFrame(gvn_data)
# Print the above result
print("The given input Dataframe: ")
print(block_data)
print()
# Print the data type of each column of the above block data
print(block_data.dtypes)

Output:

The given input Dataframe: 
   ID   Name  salary
0  11  peter   10000
1  12  irfan   25000
2  13   mary   15000
3  14   riya   50000
4  15  virat   30000
5  16  sunny   22000

ID         int64
Name      object
salary     int64
dtype: object

Now, apply the astype() method on the ‘Name’ column to change the data type to ‘category’

# Import pandas module using the import keyword
import pandas as pd
# Give the dictionary as static input and store it in a variable.
# (data given in the dictionary form)
gvn_data = {"ID": [11, 12, 13, 14, 15, 16],
            "Name": ["peter", "irfan", "mary", "riya", "virat", "sunny"],
            "salary": [10000, 25000, 15000, 50000, 30000, 22000]}
# Pass the given data to the DataFrame() function and store it in another variable
block_data = pd.DataFrame(gvn_data)
# Apply the astype() method on the 'Name' column to change the data type to 'category'
block_data['Name'] = block_data['Name'].astype('category')
# Print the data type of each column of the above block data
print(block_data.dtypes)

Output:

ID           int64
Name      category
salary       int64
dtype: object

Note:

 You can also change to datatype 'string'
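
As a small sketch of the note above, the same kind of column can be converted to the 'string' type. This assumes pandas 1.0 or newer, where the dedicated string dtype is available; the variable names are illustrative.

```python
import pandas as pd

names = pd.Series(["peter", "irfan", "mary"])
# Convert the object column to the dedicated pandas string dtype.
names_as_string = names.astype("string")
print(names_as_string.dtype)
```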

2)astype() Method – with a Dataset in Python

Use the pandas.read_csv() function to import the dataset. The dataset can be found here.

Approach:

  • Import pandas library using the import keyword.
  • Import some random dataset using the pandas.read_csv() function by passing the filename as an argument to it.
  • Store it in a variable.
  • Apply dtypes to the above dataset.
  • The Exit of the Program.

Below is the implementation:

# Import pandas library using the import keyword
import pandas
# Import some random dataset using the pandas.read_csv() function by passing
# the filename as an argument to it.
# Store it in a variable.
cereal_dataset = pandas.read_csv("cereal.csv")
# Print the data type of each column of the above dataset
print(cereal_dataset.dtypes)

Output:

name         object
mfr          object
type         object
calories      int64
protein       int64
fat           int64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

Now change the datatype of the variables ‘name’ and ‘fat’ to string and float64 respectively. This shows that the astype() function allows us to change the data types of multiple columns in one go.

# Import pandas library using the import keyword
import pandas
# Import some random dataset using the pandas.read_csv() function by passing
# filename as an argument to it.
# Store it in a variable.
cereal_dataset = pandas.read_csv("cereal.csv")
# Change the datatype of the variables 'name' and 'fat' using the astype() function
cereal_dataset = cereal_dataset.astype({"name": 'string', "fat": 'float64'})
print("The dataset after changing datatypes:")
# Print the data type of each column of the above dataset
print(cereal_dataset.dtypes)

Output:

The dataset after changing datatypes:
name         string
mfr          object
type         object
calories      int64
protein       int64
fat         float64
sodium        int64
fiber       float64
carbo       float64
sugars        int64
potass        int64
vitamins      int64
shelf         int64
weight      float64
cups        float64
rating      float64
dtype: object

 

 


Python Program for Calculating Summary Statistics

To calculate summary statistics in Python, use the describe() method of a pandas Series or DataFrame. The describe() method can be used on both numeric and object data, such as strings or timestamps.

The result for the two will differ in terms of fields.

For numerical data, the outcome will be as follows:

  • count
  • mean
  • standard deviation
  • minimum
  • maximum
  • 25th percentile
  • 50th percentile (median)
  • 75th percentile

For Objects, the outcome will be as follows:

  • count
  • top
  • unique
  • freq

On a DataFrame, a large number of methods compute descriptive statistics and related operations. The majority of these are aggregations, such as sum() and mean(), although some, such as cumsum(), produce an object of the same size. In general, these methods accept an axis argument, like ndarray.{sum, std,…}, but the axis can be supplied by name or integer.
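
The difference between an aggregation and a same-size result, and the two ways of naming the axis, can be sketched as follows. The small frame is made up for illustration.

```python
import pandas as pd

# Hypothetical data used only to illustrate the axis argument.
scores = pd.DataFrame({"p": [1, 2, 3], "q": [10, 20, 30]})

# Aggregations collapse one axis: one total per column.
column_totals = scores.sum(axis=0)
# The same axis can be given by name instead of by integer.
column_totals_by_name = scores.sum(axis="index")
# One total per row.
row_totals = scores.sum(axis="columns")
# cumsum() keeps the original shape and holds running totals.
running_totals = scores.cumsum()

print(column_totals)
print(row_totals)
print(running_totals)
```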

Using Python’s describe() function, compute Summary Statistics

Let us now have a look at how to calculate summary statistics for object and numerical data by using the describe() method.

1)Calculation of Summary Statistics for Numerical data:

Approach:

  • Import pandas module using the import keyword.
  • Give the list as static input and store it in a variable.
  • Pass the given list as an argument to the pandas.Series() function and store it in another variable. (defining series)
  • Apply describe() function to the above series to get the summary statistics for the given series.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Give the list as static input and store it in a variable.
gvn_lst = [9, 5, 8, 2, 1]
# Pass the given list as an argument to the pandas.Series() function and store it in
# another variable.(defining series)
rslt_seris = pandas.Series(gvn_lst)
# Apply describe() function to the above series to get the summary statistics
# for the given series and print the result.
print(rslt_seris.describe())

Output:

count    5.000000
mean     5.000000
std      3.535534
min      1.000000
25%      2.000000
50%      5.000000
75%      8.000000
max      9.000000
dtype: float64

Here each value has a definition. They are:

count: It is the number of total entries

mean: It is the mean of all the entries

std: It is the standard deviation of all the entries.

min: It is the minimum value of all the entries.

25%: It is the 25th percentile mark

50%: It is the 50th percentile mark, i.e. the median

75%: It is the 75th percentile mark

max: It is the maximum value of all the entries.
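
To see where these numbers come from, the same statistics can be recomputed with Python's built-in statistics module. This is only a cross-check sketch (it assumes Python 3.8 or newer for statistics.quantiles); the 'inclusive' quantile method is chosen because, for this data, it gives the same quartiles as the describe() output above.

```python
import statistics

values = [9, 5, 8, 2, 1]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    # Sample standard deviation (divides by n - 1), as in the output above.
    "std": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}
# Quartiles with linear interpolation between the sorted values.
q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
summary.update({"25%": q25, "50%": q50, "75%": q75})

for name, value in summary.items():
    print(name, value)
```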

2)Calculation of Summary Statistics for Object data:

Approach:

  • Import pandas module using the import keyword.
  • Give the list of characters as static input and store it in a variable.
  • Pass the given list as an argument to the pandas.Series() function and store it in another variable. (defining series)
  • Apply describe() function to the above series to get the summary statistics for the given series.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Give the list of characters as static input and store it in a variable.
gvn_lst = ['p', 'e', 'r', 'g', 'e', 'p', 'e']
# Pass the given list as an argument to the pandas.Series() function and store it in
# another variable.(defining series)
rslt_seris = pandas.Series(gvn_lst)
# Apply describe() function to the above series to get the summary statistics
# for the given series and print the result.
print(rslt_seris.describe())

Output:

count     7
unique    4
top       e
freq      3
dtype: object

Where

count: It is the number of total entries

unique: It is the total number of unique/distinct entries.

top: It is the value that occurred most frequently

freq: It is the frequency of the most frequent entry i.e here ‘e’ occurred 3 times hence its freq is 3.
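
The four fields can also be reproduced by hand with collections.Counter, which makes their meaning concrete. This is a small cross-check sketch; the variable names are illustrative.

```python
from collections import Counter

letters = ['p', 'e', 'r', 'g', 'e', 'p', 'e']

# Count how many times each distinct value occurs.
counts = Counter(letters)
# most_common(1) returns the most frequent value together with its frequency.
top_value, top_freq = counts.most_common(1)[0]

object_summary = {
    "count": len(letters),
    "unique": len(counts),
    "top": top_value,
    "freq": top_freq,
}
print(object_summary)
```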

Calculation of Summary Statistics for Huge dataset:

Importing the Dataset first and applying the describe() method to get Summary Statistics

Let us take an example of a cereal dataset

Import the dataset into a Pandas Dataframe.

Approach:

  • Import pandas module using the import keyword.
  • Import dataset using read_csv() function by passing the dataset name as an argument to it.
  • Store it in a variable.
  • Apply describe() method to the above-given dataset to get the Summary Statistics of the dataset.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply describe() method to the above-given dataset to get the Summary Statistics
# of the dataset and print the result.
print(cereal_dataset.describe())

Output:

         calories    protein        fat      sodium      fiber      carbo     sugars      potass    vitamins      shelf     weight       cups     rating
count   77.000000  77.000000  77.000000   77.000000  77.000000  77.000000  77.000000   77.000000   77.000000  77.000000  77.000000  77.000000  77.000000
mean   106.883117   2.545455   1.012987  159.675325   2.151948  14.597403   6.922078   96.077922   28.246753   2.207792   1.029610   0.821039  42.665705
std     19.484119   1.094790   1.006473   83.832295   2.383364   4.278956   4.444885   71.286813   22.342523   0.832524   0.150477   0.232716  14.047289
min     50.000000   1.000000   0.000000    0.000000   0.000000  -1.000000  -1.000000   -1.000000    0.000000   1.000000   0.500000   0.250000  18.042851
25%    100.000000   2.000000   0.000000  130.000000   1.000000  12.000000   3.000000   40.000000   25.000000   1.000000   1.000000   0.670000  33.174094
50%    110.000000   3.000000   1.000000  180.000000   2.000000  14.000000   7.000000   90.000000   25.000000   2.000000   1.000000   0.750000  40.400208
75%    110.000000   3.000000   2.000000  210.000000   3.000000  17.000000  11.000000  120.000000   25.000000   3.000000   1.000000   1.000000  50.828392
max    160.000000   6.000000   5.000000  320.000000  14.000000  23.000000  15.000000  330.000000  100.000000   3.000000   1.500000   1.500000  93.704912

The result includes summary statistics for all of the numeric columns in our dataset; the object columns (name, mfr and type) are left out by default.

Calculation of Summary Statistics for timestamp series:

The describe() method is also used to obtain summary statistics for a timestamp series.

Approach:

  • Import pandas module using the import keyword.
  • Import datetime module using the import keyword.
  • Import numpy module as np using the import keyword
  • Give the timestamp as static input using the np.datetime64() function.
  • Store it in a variable.
  • Pass the given timestamp as an argument to the pandas.Series() function and store it in another variable (defining series).
  • Apply describe() function to the above series to get the summary statistics for the given series.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Import datetime module using the import keyword
import datetime
# Import numpy module as np using the import keyword
import numpy as np
# Give the timestamp as static input using the np.datetime64() function
# Store it in a variable.
gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64(
    "2008-05-01"), np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]
# Pass the given timestamp as an argument to the pandas.Series() function and
# store it in another variable.(defining series)
rslt_seris = pandas.Series(gvn_timestmp)
# Apply describe() function to the above series to get the summary statistics
# for the given series and print the result.
print(rslt_seris.describe())

Output:

count                       4
unique                      3
top       2008-05-01 00:00:00
freq                        2
first     2003-01-07 00:00:00
last      2008-05-01 00:00:00
dtype: object

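You can also tell the describe() method to treat datetime values as numeric. The result is then displayed in a way similar to that of numerical data: you get the mean, median, 25th percentile, and 75th percentile in datetime format. The datetime_is_numeric argument exists in pandas 1.1 through 1.5; pandas 2.0 removed it and always summarises datetime data numerically, so there a plain describe() call gives this result.

rslt_seris.describe(datetime_is_numeric=True)

Below is the implementation:
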
# Import pandas module using the import keyword
import pandas
# Import datetime module using the import keyword
import datetime
# Import numpy module as np using the import keyword
import numpy as np
# Give the timestamp as static input using the np.datetime64() function
# Store it in a variable.
gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64(
    "2008-05-01"), np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]
# Pass the given timestamp as an argument to the pandas.Series() function and
# store it in another variable.(defining series)
rslt_seris = pandas.Series(gvn_timestmp)
# Apply describe() function to the above series to get the summary statistics
# for the given series and print the result.
print(rslt_seris.describe(datetime_is_numeric=True))

Output:

count                      4
mean     2006-03-26 18:00:00
min      2003-01-07 00:00:00
25%      2004-09-10 18:00:00
50%      2006-10-17 00:00:00
75%      2008-05-01 00:00:00
max      2008-05-01 00:00:00
dtype: object
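
Since the datetime_is_numeric keyword is not available in every pandas release, a version-tolerant call can be sketched as below. It assumes that pandas releases which reject the keyword already summarise datetimes numerically by default; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2005-04-03", "2008-05-01", "2008-05-01", "2003-01-07"]))

try:
    # Releases that know the keyword need it to get the numeric summary.
    date_summary = dates.describe(datetime_is_numeric=True)
except TypeError:
    # Assumption: releases that reject the keyword treat datetimes as numeric anyway.
    date_summary = dates.describe()

print(date_summary)
```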

 


Python Interpolation To Fill Missing Entries

Interpolation is a technique for estimating unknown data points between two known data points. While preprocessing data in Python, interpolation is commonly used to fill in missing values in a dataframe or series.

Interpolation is also used in image processing to estimate pixel values using neighboring pixels when extending or expanding an image.

Interpolation is also used by financial analysts to forecast the financial future based on known datapoints from the past.

Interpolation is commonly employed when working with time-series data since we want to fill missing values with the preceding one or two values in time-series data. For example, if we are talking about temperature, we would always prefer to fill today’s temperature with the mean of the last two days rather than the mean of the month. Interpolation can also be used to calculate moving averages.

A Pandas DataFrame has an interpolate() method that can be used to fill in the missing entries in your data.

The dataframe.interpolate() function in Pandas is mostly used to fill NA values in a dataframe or series. However, this is a really powerful function for filling in the blanks. Rather than hard-coding the value, it employs various interpolation techniques to fill in the missing data.

Interpolation for Missing Values in Series Data

Create a pandas Series with missing values as shown below:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import numpy module as np using the import keyword
import numpy as np
# Pass some random list as an argument to the pd.Series() method
# and store it in another variable.(defining series)
k = pd.Series([2, 3, 1,  np.nan, 4, 5, 8])

1)Linear Interpolation:

Linear interpolation estimates a missing value by connecting the known points on either side of it with a straight line and reading the value off that line. Linear is the default method of interpolate(), so we do not need to specify it.

The value at index 3 (the fourth element) in the above code is nan. Use the following code to interpolate the data:

k.interpolate()

In the absence of a method specification, linear interpolation is used as default.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import numpy module as np using the import keyword
import numpy as np
# Pass some random list as an argument to the pd.Series() method
# and store it in another variable.(defining series)
k = pd.Series([2, 3, 1,  np.nan, 4, 5, 8])
# Apply interpolate() function to the above series to fill the
# missing values(nan) and print the result.
print(k.interpolate())

Output:

0    2.0
1    3.0
2    1.0
3    2.5
4    4.0
5    5.0
6    8.0
dtype: float64
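
The filled value 2.5 is simply the point halfway along the straight line between its neighbours 1 and 4. The calculation can be sketched in plain Python, assuming equally spaced entries; the helper name interpolate_linear is hypothetical, and this is an illustration of the idea rather than how pandas implements it.

```python
def interpolate_linear(values):
    """Fill interior None entries along a straight line between known neighbours."""
    filled = list(values)
    # Positions of the entries that are already known.
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for step in range(1, gap):
            # Fraction of the way from the left neighbour to the right neighbour.
            fraction = step / gap
            filled[left + step] = filled[left] + fraction * (filled[right] - filled[left])
    return filled


series_values = [2, 3, 1, None, 4, 5, 8]
filled_values = interpolate_linear(series_values)
print(filled_values)
```

Here the missing entry sits halfway between 1 and 4, so it becomes 1 + 0.5 * (4 - 1) = 2.5.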

2)Polynomial Interpolation:

Polynomial interpolation fits a polynomial of the given order through the existing data points and reads the missing values off that curve, so the filled values follow a curve, such as a parabola, rather than a straight line. In pandas, this method relies on SciPy being installed.

Polynomial interpolation requires an order to be specified. Here we interpolate with order 2.

k.interpolate(method='polynomial', order=2)

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import numpy module as np using the import keyword
import numpy as np
# Pass some random list as an argument to the pd.Series() method
# and store it in another variable.(defining series)
k = pd.Series([2, 3, 1,  np.nan, 4, 5, 8])
# Apply interpolate() function to the above series by giving the method as
# "polynomial" and order= 2 as the arguments to fill the missing values(nan).
print(k.interpolate(method='polynomial', order=2))

Output:

0    2.000000
1    3.000000
2    1.000000
3    1.921053
4    4.000000
5    5.000000
6    8.000000
dtype: float64

When you use polynomial interpolation with order 1, you get the same result as linear interpolation. This is due to the fact that a polynomial of degree 1 is linear.

3)Interpolation Via Padding

Interpolation via padding involves copying the value just preceding a missing item.

When using padding interpolation, you can set a limit. The limit is the maximum number of consecutive nans that the function will fill.

So, if you’re working on a real-world project and want to fill missing values with previous values, it is sensible to set a limit so that long runs of missing rows are not filled with stale values.

k.interpolate(method='pad', limit=2)

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import numpy module as np using the import keyword
import numpy as np
# Pass some random list as an argument to the pd.Series() method
# and store it in a variable (defining the series).
k = pd.Series([2, 3, 1,  np.nan, 4, 5, 8])
# Apply interpolate() function to the above series by giving the method as 
# "pad" and limit = 2 as the arguments to fill the missing values(nan).
# The limit= 2 is the maximum number of nans that the function 
# can fill consecutively.
k.interpolate(method='pad', limit=2)

Output:

0    2.0
1    3.0
2    1.0
3    1.0
4    4.0
5    5.0
6    8.0
dtype: float64

The value of the missing entry is the same as the value of the entry preceding it.

We set the limit to two, so let’s see what happens if three consecutive nans occur.

k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])
k.interpolate(method='pad', limit=2)

Output:

0    0.0
1    1.0
2    1.0
3    1.0
4    NaN
5    3.0
6    4.0
7    5.0
8    7.0
dtype: float64

Here, the third nan is unaltered.
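Note that newer pandas releases (2.1 and later, to our knowledge) warn that method='pad' in interpolate() is deprecated and recommend ffill() instead. The sketch below assumes such a version and reproduces the same fill with Series.ffill(limit=2). The variable name filled is illustrative.

```python
import numpy as np
import pandas as pd

# Same series as above: three consecutive NaNs after the value 1
k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])

# Forward-fill: copy the previous value into at most 2 consecutive NaNs
filled = k.ffill(limit=2)

print(filled)
```

The first two nans take the value 1.0 and the third stays NaN, matching the output above.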

Pandas DataFrames Interpolation

Interpolation can also be used to fill missing values in a Pandas DataFrame.

Example

Approach:

  • Import pandas module as pd using the import keyword.
  • Pass some random data(as dictionary) to the pd.DataFrame() function to create a dataframe.
  • Store it in a variable.
  • Print the above-given dataframe.
  • The Exit of the Program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
  
# Pass some random data(as dictionary) to the pd.DataFrame() function
# to create a dataframe.
# Store it in a variable.
rslt_datafrme = pd.DataFrame({"p":[11, 3, 2, None, 1],
                   "q":[None, 1, 26, 8, None],
                   "r":[14, 10, None, 9, 4],
                   "s":[18, 5, None, None, 2]})
  
# Print the above given dataframe
print(rslt_datafrme)

Output:

      p     q     r     s
0  11.0   NaN  14.0  18.0
1   3.0   1.0  10.0   5.0
2   2.0  26.0   NaN   NaN
3   NaN   8.0   9.0   NaN
4   1.0   NaN   4.0   2.0
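Before interpolating, it can help to see how many values are missing in each column. The following is a minimal sketch using isna().sum() on the same dataframe; the name missing_per_column is illustrative.

```python
import pandas as pd

rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                              "q": [None, 1, 26, 8, None],
                              "r": [14, 10, None, 9, 4],
                              "s": [18, 5, None, None, 2]})

# isna() marks every missing cell as True; sum() counts them per column
missing_per_column = rslt_datafrme.isna().sum()

print(missing_per_column)
```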

Pandas Dataframe Linear Interpolation

To apply linear interpolation to the dataframe, call interpolate() with no arguments:

rslt_datafrme.interpolate()
# Import pandas module as pd using the import keyword
import pandas as pd
  
# Pass some random data(as dictionary) to the pd.DataFrame() function
# to create a dataframe.
# Store it in another variable.
rslt_datafrme = pd.DataFrame({"p":[11, 3, 2, None, 1],
                   "q":[None, 1, 26, 8, None],
                   "r":[14, 10, None, 9, 4],
                   "s":[18, 5, None, None, 2]})
  
# Apply interpolate() function to the above dataframe
rslt_datafrme.interpolate()

Output:

      p     q     r     s
0  11.0   NaN  14.0  18.0
1   3.0   1.0  10.0   5.0
2   2.0  26.0   9.5   4.0
3   1.5   8.0   9.0   3.0
4   1.0   8.0   4.0   2.0

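In the above example, the first value in the ‘q’ column is still nan because there is no known data point before it to interpolate from. The last value in ‘q’ is filled with the preceding value, 8.0, since by default interpolate() fills forward past the last known point.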
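If a leading nan such as the one in row 0 of column 'q' should be filled as well, interpolate() accepts a limit_direction argument. The following is a minimal sketch, assuming that copying the nearest known value into the leading gap is acceptable for your data; the name filled_both is illustrative.

```python
import pandas as pd

rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                              "q": [None, 1, 26, 8, None],
                              "r": [14, 10, None, 9, 4],
                              "s": [18, 5, None, None, 2]})

# limit_direction="both" lets interpolate() fill NaNs that come before
# the first known value as well as those after the last one
filled_both = rslt_datafrme.interpolate(limit_direction="both")

print(filled_both)
```

With this setting, no nan is left in the result; the leading gap in 'q' takes the first known value of that column.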

Individual columns of a dataframe can also be interpolated.

rslt_datafrme['r'].interpolate()
# Import pandas module as pd using the import keyword
import pandas as pd
  
# Pass some random data(as dictionary) to the pd.DataFrame() function
# to create a dataframe.
# Store it in another variable.
rslt_datafrme = pd.DataFrame({"p":[11, 3, 2, None, 1],
                   "q":[None, 1, 26, 8, None],
                   "r":[14, 10, None, 9, 4],
                   "s":[18, 5, None, None, 2]})
  
# Apply interpolate() function to the 'r' column of the above dataframe
rslt_datafrme['r'].interpolate()

Output:

0    14.0
1    10.0
2     9.5
3     9.0
4     4.0
Name: r, dtype: float64
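By default, interpolate() returns a new object and leaves the dataframe unchanged, so the result has to be assigned back if you want to keep it. A minimal sketch, with r_filled as an illustrative name:

```python
import pandas as pd

rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                              "q": [None, 1, 26, 8, None],
                              "r": [14, 10, None, 9, 4],
                              "s": [18, 5, None, None, 2]})

# interpolate() returns a new Series; the dataframe still holds the NaN
r_filled = rslt_datafrme["r"].interpolate()
print(rslt_datafrme["r"].isna().sum())   # NaN count before assignment

# Write the interpolated column back into the dataframe
rslt_datafrme["r"] = r_filled
print(rslt_datafrme["r"].isna().sum())   # NaN count after assignment
```

Only column 'r' is changed; the other columns keep their missing values.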

Interpolation Via Padding

Apply the interpolate() function to the dataframe with method='pad' and limit=2 to fill the missing values (nan):

rslt_datafrme.interpolate(method='pad', limit=2)
# Import pandas module as pd using the import keyword
import pandas as pd
  
# Pass some random data(as dictionary) to the pd.DataFrame() function
# to create a dataframe.
# Store it in another variable.
rslt_datafrme = pd.DataFrame({"p":[11, 3, 2, None, 1],
                   "q":[None, 1, 26, 8, None],
                   "r":[14, 10, None, 9, 4],
                   "s":[18, 5, None, None, 2]})
  
# Apply interpolate() function to the above dataframe by giving the method as 
# "pad" and limit = 2 as the arguments to fill the missing values(nan).
rslt_datafrme.interpolate(method='pad', limit=2)

Output:

      p     q     r     s
0  11.0   NaN  14.0  18.0
1   3.0   1.0  10.0   5.0
2   2.0  26.0  10.0   5.0
3   2.0   8.0   9.0   5.0
4   1.0   8.0   4.0   2.0

 

 

 
