Author name: Satyabrata Jena

Python: Check if all values are same in a Numpy Array (both 1D and 2D)

Checking if all the values are same in Numpy Array (both 1D and 2D) :

In this article we discuss how to check whether all elements of a 1D or 2D Numpy array are the same. We also check which rows or columns of a 2D array contain identical values.

Check if all elements are equal in a 1D Numpy Array using numpy.all() :

Here we compare every element with the first element of the array. The comparison returns a bool array of the same size: an entry is True where the element equals the first element and False otherwise.

If every entry of the bool array is True, then all elements of the original array are the same. numpy.all() performs that check.

import numpy as sc

# creating a 1D numpy array
num_arr = sc.array([14, 14, 14, 14, 14, 14])

# here we check if all values in the array are the same as the first element
res = sc.all(num_arr == num_arr[0])

if res:
    print('All elements are same')
else:
    print('All elements are not same')
Output :
All elements are same

Check if all elements are equal in a 1D Numpy Array using min() & max() :

If the maximum value in the array is equal to the minimum value, then all values in the array must be the same.

import numpy as sc

# creating a 1D numpy array
num_arr = sc.array([14, 14, 14, 14, 14, 14])

# if min element == max element then all values in the array are the same
res = sc.max(num_arr) == sc.min(num_arr)

if res:
    print('All elements are equal')
else:
    print('All elements are not equal')
Output :
All elements are equal

Check if all elements are equal in a Multidimensional Numpy Array or Matrix :

The numpy.ravel() function returns a flattened 1D view of an array. So we can flatten a multidimensional array to 1D and then compare the first element with all the elements to check whether they are all the same.

import numpy as sc

mul_arr = sc.array([[1, 1, 1],
                    [1, 1, 1],
                    [1, 1, 1]])

# to get a flattened 1D view of the multidimensional numpy array
flatn_arr = sc.ravel(mul_arr)

# to check if all values in the flattened array equal its first element
res = sc.all(flatn_arr == flatn_arr[0])

if res:
    print('All elements are same')
else:
    print('All elements are not same')
Output :
All elements are same
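
The same idea can be wrapped in a small helper that works for any number of dimensions. This is only a minimal sketch: the name all_equal and the decision to treat an empty array as "all equal" are assumptions made here, not something Numpy provides. It uses the conventional np alias instead of sc.

```python
import numpy as np

def all_equal(arr):
    """Return True if every element of arr equals the first one (hypothetical helper)."""
    arr = np.asarray(arr)
    if arr.size == 0:
        # Assumption: an empty array is treated as trivially "all equal".
        return True
    # arr.flat[0] is the first element regardless of the number of dimensions.
    return bool(np.all(arr == arr.flat[0]))

print(all_equal([14, 14, 14]))         # True
print(all_equal([[1, 1], [1, 2]]))     # False
print(all_equal(np.ones((2, 3, 4))))   # True
```

Because NaN never compares equal to itself, an array that contains NaN is reported as not all equal by this sketch.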

Find rows or columns with same values in a matrix or 2D Numpy array :

Find rows with same values in a matrix or 2D Numpy array :

Similarly to check if all elements are same in each row, we can compare elements of each row with first element.

import numpy as sc
mul_arr = sc.array([[1, 1, 1],
[1, 21, 1],
[1, 1, 1]])
# print the index of every row whose elements all equal the row's first element
print("Following rows have same elements")
for i in range(mul_arr.shape[0]):
    if sc.all(mul_arr[i] == mul_arr[i][0]):
        print('Row:', i)
Output :
Following rows have same elements
Row: 0
Row: 2

Find columns with same values in a matrix or 2D Numpy array :

Here also to check if all elements are same in each column, we can compare elements of each column with first element.

 

import numpy as sc
mul_arr = sc.array([[1, 1, 1],
[1, 21, 1],
[1, 1, 1]])
# transpose so that each column becomes a row, then compare it with its first element
trans_Arra = mul_arr.T
print("Following columns have same elements")
for i in range(trans_Arra.shape[0]):
    if sc.all(trans_Arra[i] == trans_Arra[i][0]):
        print('Column: ', i)
Output :
Following columns have same elements
Column:  0
Column:  2
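
The row and column checks can also be written without an explicit loop by letting Numpy broadcast the comparison. The following is a sketch under that assumption; the names rows_same and cols_same are chosen here for illustration, and the conventional np alias is used.

```python
import numpy as np

mul_arr = np.array([[1, 1, 1],
                    [1, 21, 1],
                    [1, 1, 1]])

# Compare every row with its own first element (kept as a column so it broadcasts).
rows_same = (mul_arr == mul_arr[:, [0]]).all(axis=1)
# Compare every column with its own first element (the first row broadcasts down).
cols_same = (mul_arr == mul_arr[0]).all(axis=0)

print(np.flatnonzero(rows_same))   # [0 2]
print(np.flatnonzero(cols_same))   # [0 2]
```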


How to Reverse a 1D & 2D numpy array using np.flip() and [] operator in Python

Reversing a 1D and 2D numpy array using np.flip() and [] operator in Python.

Reverse 1D Numpy array using ‘[]’ operator :

When no start or end is passed, the slice covers the complete array by default. Because the step size is -1, the elements are selected from last to first.

# Program :

import numpy as sc

# Creating a numpy array
num_arr = sc.array([11,22,33,44,55,66])

print('Original Array: ',num_arr)

# To get reverse of numpy array
rev_arr = num_arr[::-1]

print('Reversed Array : ', rev_arr)
Output :
Original Array:  [11 22 33 44 55 66]                                
Reversed Array :  [66 55 44 33 22 11]

Reverse Array is View Only :

The reversed array is only a view of the original array, so any modification made to the reversed array is also reflected in the original array.

# Program :

import numpy as sc

# Creating a numpy array
num_arr = sc.array([11,22,33,44,55,66])

# To get reverse of numpy array
rev_arr = num_arr[::-1]
rev_arr[4]=63

print('Modified reversed Array : ', rev_arr)
print('Original array is: ',num_arr)
Output :
Modified reversed Array :  [66 55 44 33 63 11]                      
Original array is:  [11 63 33 44 55 66]
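
If the original array must stay untouched, one option is to take an explicit copy of the reversed view. A short sketch (the variable names rev_view and rev_copy are illustrative, and the conventional np alias is used):

```python
import numpy as np

num_arr = np.array([11, 22, 33, 44, 55, 66])

rev_view = num_arr[::-1]          # view: shares memory with num_arr
rev_copy = num_arr[::-1].copy()   # independent copy

rev_copy[4] = 63                  # changes only the copy

print(np.shares_memory(num_arr, rev_view))   # True
print(np.shares_memory(num_arr, rev_copy))   # False
print(num_arr)                               # [11 22 33 44 55 66]
```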

Reverse Numpy array using np.flip() :

The flip() function provided by Python’s numpy module reverses the order of the elements of a numpy array along the given axis.

Syntax : numpy.flip(arr, axis=None)

where,

  • arr :  A numpy array
  • axis : Axis or axes along which contents need to be flipped. If None (the default), contents are flipped along all axes of the array.

Reverse 1D Numpy array using np.flip() :

As this is a 1-D Numpy array, there is no need for the axis parameter.

# Program :

import numpy as sc

# To create a Numpy array
num_arr = sc.array([11,22,33,44,55])

print('Original array: ',num_arr)

# To get reverse of numpy array
rev_arr = sc.flip(num_arr)

print('Reversed Array is : ', rev_arr)
Output :
Original array:  [11 22 33 44 55]
Reversed Array is :  [55 44 33 22 11]

Reverse 2D Numpy Array using np.flip() :

Reverse contents in all rows and all columns of 2D Numpy Array :

If we do not provide the axis parameter to np.flip(), the contents are reversed along both axes of the 2-D Numpy array.

# Program :

import numpy as sc

# to create a 2D Numpy array
twoD_Arr = sc.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print('Original Array is: ',twoD_Arr)

# to reverse 2D numpy array
rev_Arr = sc.flip(twoD_Arr)

print('Reversed Array : ')
print(rev_Arr)
Output :
Original Array is:
[[1 2 3]
[4 5 6]
[7 8 9]]
Reversed Array :
[[9 8 7]
[6 5 4]
[3 2 1]]

Reverse contents of all rows only in 2D Numpy Array :

If we provide the axis parameter as axis=1, the contents of each row are reversed.

# Program :

import numpy as sc

# to create a 2D Numpy array
twoD_Arr = sc.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print('Original Array is: ',twoD_Arr)

# to reverse the content of each row in array
rev_Arr = sc.flip(twoD_Arr, axis=1)

print('Reversed Array : ')
print(rev_Arr)
Output :
Original Array is:  [[1 2 3]
[4 5 6]
[7 8 9]]
Reversed Array :
[[3 2 1]
[6 5 4]
[9 8 7]]

Reverse contents of all columns only in 2D Numpy Array :

If we provide the axis parameter as axis=0, the contents of each column are reversed, i.e. the order of the rows is reversed.

# Program :

import numpy as sc

# to create a 2D Numpy array
twoD_Arr = sc.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print('Original Array is: ',twoD_Arr)

# to reverse the content of each column in array
rev_Arr = sc.flip(twoD_Arr, axis=0)

print('Reversed Array : ')
print(rev_Arr)
Output :
Original Array is:
[[1 2 3]
[4 5 6]
[7 8 9]]
Reversed Array :
[[7 8 9]
[4 5 6]
[1 2 3]]
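
The axis argument of np.flip() corresponds directly to slicing with a step of -1 along that axis. The sketch below checks this equivalence, together with np.flipud() and np.fliplr(), which are Numpy's shorthand functions for the two 2-D cases (the conventional np alias is used):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# axis=0 reverses the order of the rows: same as arr[::-1] or np.flipud(arr)
print(np.array_equal(np.flip(arr, axis=0), arr[::-1]))        # True
print(np.array_equal(np.flip(arr, axis=0), np.flipud(arr)))   # True

# axis=1 reverses the elements within each row: same as arr[:, ::-1] or np.fliplr(arr)
print(np.array_equal(np.flip(arr, axis=1), arr[:, ::-1]))     # True
print(np.array_equal(np.flip(arr, axis=1), np.fliplr(arr)))   # True

# no axis: both axes are reversed at once
print(np.array_equal(np.flip(arr), arr[::-1, ::-1]))          # True
```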

Reverse contents of only one row in 2D Numpy Array :

Suppose we want to reverse only the 1st row of a Numpy array.

# Program :

import numpy as sc

# to create a 2D Numpy array
twoD_Arr = sc.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print('Original Array is: ',twoD_Arr)

# to reverse only 1st row
twoD_Arr[0] = sc.flip(twoD_Arr[0])

print('Reversed Array : ')
print(twoD_Arr)
Output :
Original Array is:
[[1 2 3]
[4 5 6]
[7 8 9]]
Reversed Array :
[[3 2 1]
[4 5 6]
[7 8 9]]

Reverse contents of only one column in 2D Numpy Array :

Suppose we want to reverse only the 3rd column of a Numpy array.

# Program :

import numpy as sc

# to create a 2D Numpy array
twoD_Arr = sc.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print('Original Array is: ',twoD_Arr)

# to reverse the content of 3rd column in array
twoD_Arr[:,2] = sc.flip(twoD_Arr[:,2])

print('Reversed Array : ')
print(twoD_Arr)
Output :
Original Array is:
[[1 2 3]
[4 5 6]
[7 8 9]]
Reversed Array :
[[1 2 9]
[4 5 6]
[7 8 3]]

 


Python : How to move files and Directories ?

How to move files and directories in python.

In Python, the shutil module offers various file-related operations, including moving files and directories.

Syntax- shutil.move(src, dst)

It accepts a source path and a destination path and moves the file or directory at src to dst.

Move a file to another directory :

We pass the source file path as the first argument and the destination directory as the second argument, both as strings. Some points to keep in mind:

  • If the destination directory contains no file with that name, the file is moved into it under the same name.
  • If a file with the same name already exists in the destination directory, shutil.move() raises shutil.Error instead of overwriting it.
  • If the source file does not exist, or the destination path points inside a directory that does not exist, a FileNotFoundError is raised.
import shutil
cre_path = shutil.move('satya.txt', 'document')    
print(cre_path)
Output (when satya.txt does not exist) :
FileNotFoundError: [Errno 2] No such file or directory: 'satya.txt'

Move a file with a new name :

If we give a new file name in the destination path, the file is moved to that location and renamed to the new name.

  • If a file with that name already exists at the destination, it is normally overwritten.
  • If the directory in the destination path does not exist, an error is raised.

 

import shutil
cre_path = shutil.move('satya.txt', 'document/sample.txt')
print(cre_path)
Output (when satya.txt does not exist) :
FileNotFoundError: [Errno 2] No such file or directory: 'satya.txt'

Move all files in a directory to another directory recursively :

Suppose we want to move all the files from one directory to another directory. To do this with shutil.move(), we iterate over all entries in the source directory and move each one to the destination directory.

import shutil, os, glob

def mov_every_Files_Dir(sourDir, destDir):
    print(sourDir)
    print(destDir)
    # check that both paths are existing directories
    if os.path.isdir(sourDir) and os.path.isdir(destDir):
        # iterate over all entries in the source directory
        for filePath in glob.glob(sourDir + '/*'):
            # move each entry to the destination directory
            print(filePath)
            shutil.move(filePath, destDir)
    else:
        print("sourDir & destDir should be directories")

def main():
    srcDir = '/users/sjones/chem/notes'
    desDir = '/users/sjones/chem/notes_backup'
    mov_every_Files_Dir(srcDir, desDir)

if __name__ == '__main__':
    main()
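
To try this approach without touching real data, the move can be exercised inside a temporary directory. The following is a minimal sketch, not the article's own function: move_all_files is a hypothetical helper that uses pathlib instead of glob and moves only regular files, and the directory and file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path

def move_all_files(src_dir, dest_dir):
    """Move every regular file directly inside src_dir into dest_dir (hypothetical helper)."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    moved = []
    for path in sorted(src_dir.iterdir()):
        if path.is_file():
            # shutil.move returns the path of the moved file
            moved.append(shutil.move(str(path), str(dest_dir)))
    return moved

# Demonstration in a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "notes")
    dst = Path(tmp, "notes_backup")
    src.mkdir()
    dst.mkdir()
    for name in ("a.txt", "b.txt"):
        (src / name).write_text("sample")

    move_all_files(src, dst)
    remaining = sorted(p.name for p in src.iterdir())
    moved_names = sorted(p.name for p in dst.iterdir())

print(remaining)     # []
print(moved_names)   # ['a.txt', 'b.txt']
```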

Move file and Create Intermediate directories :

We know that if a directory in the given destination path does not exist, shutil.move() raises an error. So we create a function that moves the file to the destination directory and also creates all missing directories in the given path.

import shutil, os

def moven_cre_Dir(srcPath, destDir):
    # create the destination directory (and any missing parents) if needed
    if not os.path.isdir(destDir):
        os.makedirs(destDir)
    shutil.move(srcPath, destDir)

def main():
    srcFile = 'certificate/document.txt'
    destDir = 'certificate/document9'
    moven_cre_Dir(srcFile, destDir)

if __name__ == '__main__':
    main()
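
A variation on the same idea is to let os.makedirs() handle the "already exists" case through its exist_ok flag. The sketch below runs inside a temporary directory; move_and_create_dirs and the path names are hypothetical and chosen only for illustration.

```python
import os
import shutil
import tempfile

def move_and_create_dirs(src_path, dest_dir):
    """Create dest_dir (and any missing parents), then move src_path into it."""
    os.makedirs(dest_dir, exist_ok=True)   # no error if the directory already exists
    return shutil.move(src_path, dest_dir)

with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "document.txt")
    with open(src_file, "w") as fh:
        fh.write("sample")

    # None of these intermediate directories exist yet
    dest_dir = os.path.join(tmp, "certificate", "archive", "2021")
    new_path = move_and_create_dirs(src_file, dest_dir)

    moved_ok = os.path.isfile(new_path)
    source_gone = not os.path.exists(src_file)

print(moved_ok, source_gone)   # True True
```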

Move symbolic links :

If the source is a symbolic link, a new link pointing to the same target is created at the destination path, and the source link is then deleted.

Move a directory to another directory :

We can move an entire directory to another location, keeping the following points in mind:

  • If the destination directory already exists, the source directory is moved inside it.
  • If the destination directory does not exist, the source directory is moved to that path, i.e. it is renamed.
  • If an intermediate directory is missing or the path is invalid, an error is raised.
  • If the destination directory already contains a directory with the same name as the source, an error is raised.
import shutil
sour_dir = 'satya'
dest_dir =  'satya1'
shutil.move(sour_dir, dest_dir)
Output (when the directory satya does not exist) :
FileNotFoundError: [Errno 2] No such file or directory: 'satya'

 

 

 


Python: How to delete specific lines in a file in a memory-efficient way?

How to delete specific lines in a file in a memory-efficient way ?

In this article, we will see how to delete a set of lines from a file in various ways.

We cannot delete lines from a file in place, so we first create a temporary file and write into it all the lines we want to keep. Then we delete the original file and rename the temporary file to the original name. Since only one line is held in memory at a time, this also works for large files.

We will be using the following file for demonstration

file.txt

Line 1

Line 2

Line 3

Line 4

Line 5

Delete a line from a file by specific line number in python :

The algorithm will be-

  • Open the original file in read mode
  • Take the number of the line to delete
  • Create a new temporary file and open it in write mode
  • Read the contents of the original file line by line while keeping a count of the lines
    1. If the counter reaches the line we have to delete, skip that line; otherwise write it to the temporary file
  • If any line was removed, delete the original file and rename the temporary file to the name of the original file.
  • Else delete the temporary file.
import os
def deleteLine(originFile, lineNo):
    isSkipped = False
    index = 0
    tempFile = originFile + '.bak'
    #Open original file in read and temp file in write mode
    with open(originFile, 'r') as readObj, open(tempFile, 'w') as writeObj:
        #Copy all data line by line
        for line in readObj:
            #When the loop counter is same as the line number skip
            if index != lineNo:
                writeObj.write(line)
            else:
                isSkipped = True
            index += 1
    #If a line was skipped, replace the original file with the temp file
    if isSkipped:
        os.remove(originFile)
        os.rename(tempFile, originFile)
    else:
        os.remove(tempFile)

#Line numbering starts from 0
deleteLine('file.txt',1)

After execution,

file.txt

Line 1

Line 3

Line 4

Line 5

Delete multiple lines from a file by line numbers :

The algorithm will be-

  • Open the original file in read mode
  • Take the line numbers to delete as a list
  • Create a new temporary file and open it in write mode
  • Read the contents of the original file line by line while keeping a count of the lines
    1. If the counter matches one of the given line numbers, skip that line; otherwise write it to the temporary file
  • If any line was removed, delete the original file and rename the temporary file to the name of the original file.
  • Else delete the temporary file.
import os
def deleteLine(originFile, lineNo):
    isSkipped = False
    index = 0
    tempFile = originFile + '.bak'
    #Open original file in read and temp file in write mode
    with open(originFile, 'r') as readObj, open(tempFile, 'w') as writeObj:
        #Copy all data line by line
        for line in readObj:
            #When the loop counter is same as the line number skip
            if index not in lineNo:
                writeObj.write(line)
            else:
                isSkipped = True
            index += 1
    #If any line was skipped, replace the original file with the temp file
    if isSkipped:
        os.remove(originFile)
        os.rename(tempFile, originFile)
    else:
        os.remove(tempFile)

#Line numbering starts from 0
deleteLine('file.txt',[0,2,3])

After execution,

file.txt

Line 2

Line 5 

Delete a specific line from the file by matching content :

The algorithm will be-

  • Open the original file in read mode
  • Create a temporary file and open it in write mode
  • Copy all contents from the original file to the temp file line by line. If a line matches the line we want to delete, skip it
  • If any line was skipped, delete the original file and rename the temp file to the original name. Else delete the temp file.
import os
def deleteLine(originFile, lineToDelete):
    isSkipped = False
    tempFile = originFile + '.bak'
    #Open original file in read and temp file in write mode
    with open(originFile, 'r') as readObj, open(tempFile, 'w') as writeObj:
        #Copy all data line by line
        for line in readObj:
            #Strip the trailing newline before comparing
            currentLine = line
            if line[-1] == '\n':
                currentLine = line[:-1]
            #If currentLine matches the given line then skip it
            if currentLine != lineToDelete:
                writeObj.write(line)
            else:
                isSkipped = True
    #If any line was skipped, replace the original file with the temp file
    if isSkipped:
        os.remove(originFile)
        os.rename(tempFile, originFile)
    else:
        os.remove(tempFile)

#Delete the line whose content is exactly 'Line 4'
deleteLine('file.txt','Line 4')

After execution,

file.txt

Line 1

Line 2

Line 3

Line 5

Delete specific lines from a file that matches the given conditions :

The algorithm will be-

  • Accept the original file name together with a call-back function.
  • Open the original file in read mode
  • Create a new temporary file and open it in write mode
  • Read the contents of the original file line by line
    1. Pass each line into the function; if it returns true then skip the line, otherwise write it to the temporary file
  • If any line was removed, delete the original file and rename the temporary file to the name of the original file.
  • Else delete the temporary file.
import os
def deleteLine(originFile, conditionalFunc):
    isSkipped = False
    tempFile = originFile + '.bak'
    #Open original file in read and temp file in write mode
    with open(originFile, 'r') as readObj, open(tempFile, 'w') as writeObj:
        #Copy all data line by line
        for line in readObj:
            #Check each line by passing it into the function
            if not conditionalFunc(line):
                writeObj.write(line)
            else:
                isSkipped = True
    #If any line was skipped, replace the original file with the temp file
    if isSkipped:
        os.remove(originFile)
        os.rename(tempFile, originFile)
    else:
        os.remove(tempFile)

#Example condition (an assumption for illustration): delete lines containing 'Line 3'
def conditionalFunction(line):
    return 'Line 3' in line

deleteLine('file.txt',conditionalFunction)

We can pass any function that checks for our condition; whenever it returns true for a line, that line is skipped. The condition can be anything, such as the length of the line or whether the line contains a particular word.
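
The call-back approach can be tried end to end on a throwaway file. The following is only a sketch under a few assumptions: delete_lines is a hypothetical variant that returns the number of removed lines and swaps the files with os.replace(), and the sample content mirrors the demonstration file used above.

```python
import os
import tempfile

def delete_lines(path, should_delete):
    """Rewrite path without the lines for which should_delete(line) is true.

    Returns the number of lines removed. Hypothetical helper for illustration.
    """
    tmp_path = path + ".bak"
    removed = 0
    with open(path, "r") as src, open(tmp_path, "w") as dst:
        for line in src:                  # only one line is held in memory at a time
            if should_delete(line):
                removed += 1
            else:
                dst.write(line)
    if removed:
        os.replace(tmp_path, path)        # swap the files in a single call
    else:
        os.remove(tmp_path)
    return removed

with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "file.txt")
    with open(sample, "w") as fh:
        fh.write("Line 1\nLine 2\nLine 3\nLine 4\nLine 5\n")

    # Example condition: drop the lines that end in 2 or 4
    count = delete_lines(sample, lambda line: line.strip().endswith(("2", "4")))
    with open(sample) as fh:
        result = fh.read().splitlines()

print(count)    # 2
print(result)   # ['Line 1', 'Line 3', 'Line 5']
```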


Python: String to datetime or date object

String to datetime or date object in Python.

In this article we will see how to convert strings in different formats to a datetime or date object. We do so using strptime( ), a function provided by Python’s datetime module that parses a string into a datetime object.

Syntax-

datetime.strptime(datetime_str, format)

Arguments:

  • datetime_str – The string that contains the date and time information
  • format – The format codes that describe how datetime_str is laid out; the object is created by parsing the string according to this format

The function returns a datetime object when the string matches the format. If the string does not match the format, the function raises a ValueError.

Convert string (‘DD/MM/YY HH:MM:SS’) to datetime object :

We can parse a date-time string in any layout by passing the matching format codes as the second argument.

  • %d – Day of the month
  • %m – Month (as a number)
  • %y – Year without century (two digits); %Y is the four-digit year
  • %H – Hour (24-hour clock)
  • %M – Minute
  • %S – Second
  • %z – UTC offset (time zone)
  • %f – Microsecond
from datetime import datetime
dateTimeString = '15/5/21 11:12:13'
# Converting the String ( ‘DD/MM/YY HH:MM:SS ‘) into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%d/%m/%y %H:%M:%S')
print(dateTimeObj)
Output :
2021-05-15 11:12:13

Convert string (‘MM/DD/YYYY HH:MM:SS’) to datetime object :

To parse the ‘MM/DD/YYYY HH:MM:SS’ format we change the order of the format codes from the previous example and use %Y for the four-digit year.

from datetime import datetime
dateTimeString = '5/15/2021 11:12:13'
#Converting the String ( ‘MM/DD/YYYY HH:MM:SS’) into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%m/%d/%Y %H:%M:%S')
print(dateTimeObj)
Output :
2021-05-15 11:12:13

Convert string to datetime and handle ValueError :

If we pass a format that does not match the string, the function raises a ValueError.

We can handle that error with a try/except block so that the program execution does not stop.

(For correct format)

from datetime import datetime
dateTimeString = '5/15/2021 11:12:13'
#Converting the String ( ‘MM/DD/YYYY HH:MM:SS’) into a datetime object
try:
    dateTimeObj = datetime.strptime(dateTimeString, '%m/%d/%Y %H:%M:%S')
    print(dateTimeObj)
except ValueError as e:
    print(e)
Output :
2021-05-15 11:12:13

(For wrong format)

from datetime import datetime
dateTimeString = '5/15/2021 11:12:13'
#Trying to convert the String ( ‘MM/DD/YYYY HH:MM:SS’) with a format that does not match
try:
    dateTimeObj = datetime.strptime(dateTimeString, '%d-%m-%Y %H:%M:%S')
    print(dateTimeObj)
except ValueError as e:
    print(e)
Output :
time data '5/15/2021 11:12:13' does not match format '%d-%m-%Y %H:%M:%S'
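
The same try/except pattern can be extended to input whose layout is not known in advance, by trying several formats in turn. This is a sketch only: parse_datetime and the list of candidate formats are assumptions made for this example, not part of the datetime module.

```python
from datetime import datetime

def parse_datetime(text, formats):
    """Return the first successful strptime result, or None (hypothetical helper)."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue   # this format did not match; try the next one
    return None

candidates = ['%d-%m-%Y %H:%M:%S', '%m/%d/%Y %H:%M:%S']
print(parse_datetime('5/15/2021 11:12:13', candidates))   # 2021-05-15 11:12:13
print(parse_datetime('not a date', candidates))           # None
```

Returning None for unparseable input is a design choice of this sketch; re-raising the ValueError would be equally reasonable when bad input should stop the program.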

Python: Convert string to datetime – ( string format yyyy-mm-dd hh-mm-ss) :

from datetime import datetime
dateTimeString = '2021-5-15 11-12-13'
#Converting the String ( ‘yyyy-mm-dd hh-mm-ss‘) into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%Y-%m-%d %H-%M-%S')
print(dateTimeObj)
Output :
2021-05-15 11:12:13

Python: Convert string to datetime – ( string format MMM DD YYYY HH:MM:SS) :

from datetime import datetime
dateTimeString = 'May 15 2021 11:12:13'
#Converting the String ( ‘MMM DD YYYY HH:MM:SS‘) into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%b %d %Y %H:%M:%S')
print(dateTimeObj)
Output :
2021-05-15 11:12:13

Python: Convert string to datetime with fractional seconds – ( string format DD/MM/YY HH:MM:SS.FFFFFF) :

If the string carries fractional seconds, the %f code parses them as microseconds, so a millisecond value such as .453 is read as 453000 microseconds.

from datetime import datetime
dateTimeString = '15/5/21 11:12:13.453'
#Converting the String ( ‘DD/MM/YY HH:MM:SS.FFFFFF’) into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%d/%m/%y %H:%M:%S.%f')
print(dateTimeObj)
Output :
2021-05-15 11:12:13.453000
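
The parsed fraction is available through the microsecond attribute of the datetime object. A short sketch showing how the value relates to milliseconds (the variable name parsed is illustrative):

```python
from datetime import datetime

parsed = datetime.strptime('15/5/21 11:12:13.453', '%d/%m/%y %H:%M:%S.%f')

print(parsed.microsecond)           # 453000
print(parsed.microsecond // 1000)   # 453 (milliseconds)
```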

Python: Convert string to datetime with timezone :

To add the time zone in the format we have to include %z

from datetime import datetime
dateTimeString = '15/5/21 11:12:13+05:30'
#Converting the String with timezone into a datetime object
dateTimeObj = datetime.strptime(dateTimeString, '%d/%m/%y %H:%M:%S%z')
print(dateTimeObj)
Output :
2021-05-15 11:12:13+05:30
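
A datetime parsed with %z is timezone-aware, so its offset can be inspected and the value converted to another zone. A minimal sketch, assuming Python 3.7 or later (where %z accepts an offset written with a colon); the variable name aware is illustrative:

```python
from datetime import datetime, timezone

aware = datetime.strptime('15/5/21 11:12:13+05:30', '%d/%m/%y %H:%M:%S%z')

print(aware.utcoffset())                # 5:30:00
print(aware.astimezone(timezone.utc))   # 2021-05-15 05:42:13+00:00
```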

Python: Convert string to date object :

To convert a date string into a date object we first convert it into a datetime object using strptime( ) and then call the date( ) method on the result to get the date object.

from datetime import datetime
dateTimeString = '2021-5-15'
#Converting the String into a date object
dateTimeObj = datetime.strptime(dateTimeString, '%Y-%m-%d').date()
print(dateTimeObj)
Output :
2021-05-15

Python: Convert string to time object :

To convert a time string into a time object we first convert it into a datetime object using strptime( ) and then call the time( ) method on the result to get the time object.

from datetime import datetime
dateTimeString = '11:12:13'
#Converting the String into a time object
dateTimeObj = datetime.strptime(dateTimeString, '%H:%M:%S').time()
print(dateTimeObj)
Output :
11:12:13

 


Pandas : How to merge Dataframes by index using Dataframe.merge()

How to merge Dataframes by index using Dataframe.merge() in Python ?

In this article we are going to see how we can merge two dataframes by using the index of both dataframes, or by using the index of one dataframe and some columns of the other, and how we can keep a merged dataframe with similar indices. So, let’s start exploring the topic.

DataFrame.merge()

SYNTAX :

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

For this short example, we will only focus on some arguments :

  • on : The column name(s) on which the merge is to be done; they must be present in both dataframes.
  • left_on : The column names of the left dataframe to use as join keys
  • right_on : The column names of the right dataframe to use as join keys
  • left_index : It takes a boolean value whose default value is False. If it is True, the index of the left dataframe is used as the join key.
  • right_index : It takes a boolean value whose default value is False. If it is True, the index of the right dataframe is used as the join key.

To demonstrate we will be taking the following two dataframes :

Left dataframe :

   Regd     Name   Age      City  Exp
0    10     Jill  16.0     Tokyo   10
1    11   Rachel  38.0     Texas    5
2    12    Kirti  39.0  New York    7
3    13    Veena  40.0     Texas   21
4    14  Lucifer   NaN     Texas   30
5    15    Pablo  30.0  New York    7
6    16   Lionel  45.0  Colombia   11

Right dataframe :

   Regd     Exp   Wage  Bonus
0    10  Junior  75000   2000
1    11  Senior  72200   1000
2    12  Expert  90999   1100
3    13  Expert  90000   1000
4    14  Junior  20000   2000
5    15  Junior  50000   1500
6    16  Senior  81000   1000

Merging two Dataframes by index of both the dataframes :

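In the program below both dataframes are created with the same index labels (‘a’ to ‘g’), so we can merge them on the index by passing left_index and right_index as True in the function. Columns that appear in both dataframes (‘Regd’ and ‘Exp’) receive the suffixes _x and _y in the result.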

# Program :


# Importing the module
import pandas as pd
import numpy as np

#Left Dataframe
students = [(10,'Jill',    16,     'Tokyo',    10),
            (11,'Rachel',  38,     'Texas',     5),
            (12,'Kirti',   39,     'New York',  7),
            (13,'Veena',   40,     'Texas',    21),
            (14,'Lucifer', np.nan, 'Texas',    30),
            (15,'Pablo',   30,     'New York',  7),
            (16,'Lionel',  45,     'Colombia', 11) ]
lDfObj = pd.DataFrame(students, columns=['Regd','Name','Age','City','Exp'],index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

#Right dataframe
wage = [(10, 'Junior', 75000, 2000) ,
        (11, 'Senior', 72200, 1000) ,
        (12, 'Expert', 90999, 1100) ,
        (13, 'Expert', 90000, 1000) ,
        (14, 'Junior', 20000, 2000) ,
        (15, 'Junior', 50000, 1500) ,
        (16, 'Senior', 81000, 1000)]
rDfObj = pd.DataFrame(wage, columns=['Regd','Exp','Wage','Bonus'] , index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

#Merging both the dataframes
newDF = lDfObj.merge(rDfObj, left_index=True, right_index=True)

#printing the merged dataframe
print("The merged dataframe is-")
print(newDF)

Output :
The merged dataframe is-
   Regd_x     Name   Age      City  Exp_x  Regd_y   Exp_y   Wage  Bonus
a      10     Jill  16.0     Tokyo     10      10  Junior  75000   2000
b      11   Rachel  38.0     Texas      5      11  Senior  72200   1000
c      12    Kirti  39.0  New York      7      12  Expert  90999   1100
d      13    Veena  40.0     Texas     21      13  Expert  90000   1000
e      14  Lucifer   NaN     Texas     30      14  Junior  20000   2000
f      15    Pablo  30.0  New York      7      15  Junior  50000   1500
g      16   Lionel  45.0  Colombia     11      16  Senior  81000   1000

Finally, if we want to merge a dataframe by the index of the first dataframe with some column from the other dataframe, we can also do that.
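
One way this might look is to combine left_index=True with right_on. The sketch below uses two small made-up frames (names and wages are illustrative, not the article's data): the left frame carries the registration numbers in its index, the right frame carries them in an ordinary ‘Regd’ column.

```python
import pandas as pd

# Left frame: the index holds the registration numbers
names = pd.DataFrame({'Name': ['Jill', 'Rachel', 'Kirti']},
                     index=[10, 11, 12])

# Right frame: the registration numbers are an ordinary column
wages = pd.DataFrame({'Regd': [10, 11, 12],
                      'Wage': [75000, 72200, 90999]})

# Join the left index against the right 'Regd' column
merged = names.merge(wages, left_index=True, right_on='Regd')
print(merged)
```

The mirror case, an ordinary column on the left and the index on the right, uses left_on together with right_index=True.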


Select Rows & Columns by Name or Index in DataFrame using loc & iloc | Python Pandas

How to select rows and columns by Name or Index in DataFrame using loc and iloc in Python ?

We will discuss several methods to select rows and columns in a dataframe. To select rows or columns we can use the loc[ ] and iloc[ ] indexers or the [ ] operator.

To demonstrate the various methods we will be using the following dataset :

      Name  Score      City
0     Jill   16.0     Tokyo
1   Rachel   38.0     Texas
2    Kirti   39.0  New York
3    Veena   40.0     Texas
4  Lucifer    NaN     Texas
5    Pablo   30.0  New York
6   Lionel   45.0  Colombia

Method-1 : DataFrame.loc | Select Column & Rows by Name

We can use the loc[ ] indexer to select rows and columns by their labels.

Syntax :

dataFrame.loc[<ROWS RANGE> , <COLUMNS RANGE>]

We pass the row labels and the column labels (a single label, a list of labels or a slice), and the matching rows and columns are selected.

If we don’t give a value and pass ‘:’ instead, it will select all the rows or columns.

Select a Column by Name in DataFrame using loc[ ] :

As we need to select a single column only, we have to pass ‘:’ in row range place.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'])
#Selecting the 'Score' column
columnD = dfObj.loc[:,'Score']
print(columnD)
Output :
0    16.0
1    38.0
2    39.0
3    40.0
4     NaN
5    30.0
6    45.0
Name: Score, dtype: float64

Select multiple Columns by Name in DataFrame using loc[ ] :

To select multiple columns, we have to pass the column names as a list inside loc[ ].

So, let’s see the implementation of it.

#Program

import pandas as pd
import numpy as np
#data
students = [('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple columns i.e 'Name' and 'Score' column
columnD = dfObj.loc[:,['Name','Score']]
print(columnD)
Output :
      Name  Score
a     Jill   16.0
b   Rachel   38.0
c    Kirti   39.0
d    Veena   40.0
e  Lucifer    NaN
f    Pablo   30.0
g   Lionel   45.0

Select a single row by Index Label in DataFrame using loc[ ] :

Just as with a column, we can select a single row by passing its index label and passing ‘:’ in place of the column range.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting a single row i.e 'b' row
selectData = dfObj.loc['b',:]
print(selectData)

Output :
Name     Rachel
Score      38.0
City      Texas
Name: b, dtype: object

Select multiple rows by Index labels in DataFrame using loc[ ] :

To select multiple rows we have to pass the index labels as a list inside loc[ ].

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple rows i.e 'd' and 'g'
selectData = dfObj.loc[['d','g'],:]
print(selectData)

Output :
     Name  Score      City
d   Veena   40.0     Texas
g  Lionel   45.0  Colombia

Select multiple row & columns by Labels in DataFrame using loc[ ] :

To select multiple rows and columns, we pass a list of row labels and a list of column names to loc[ ].

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple rows and columns i.e 'd' and 'g' rows and 'Name' , 'City' column
selectData = dfObj.loc[['d','g'],['Name','City']]
print(selectData)

Output :
     Name      City
d   Veena     Texas
g  Lionel  Colombia

Method-2 : DataFrame.iloc | Select Column Indexes & Rows Index Positions

We can use the iloc[ ] indexer to select rows and columns by integer position. It works much like loc[ ], but takes positions instead of labels.

Syntax-

dataFrame.iloc[<ROWS INDEX RANGE> , <COLUMNS INDEX RANGE>]

iloc[ ] selects rows and columns in the dataframe by the index positions passed to it. Just as with loc[ ], passing ‘:’ selects all the rows or all the columns.
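The difference between label-based and position-based selection is easiest to see side by side. The sketch below uses a small hypothetical DataFrame (the name df and its values are illustrative only) and relies on one detail worth remembering: an iloc[ ] slice excludes its stop position, while a loc[ ] slice includes its stop label.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
df = pd.DataFrame(
    {'Name': ['Jill', 'Rachel', 'Kirti'], 'Score': [16, 38, 39]},
    index=['a', 'b', 'c'],
)

# Position-based: rows at positions 0 and 1 (stop position 2 is excluded)
by_position = df.iloc[0:2, :]

# Label-based: rows 'a' through 'b' (stop label 'b' is included)
by_label = df.loc['a':'b', :]

# Both selections contain the same two rows
print(by_position.equals(by_label))
```

Here both expressions select rows 'a' and 'b', even though the two slices look different.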

Select a single column by Index position :

We have to pass the index of the column with ‘:’ in place of the row index.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting a single column at the index 2
selectData = dfObj.iloc[:,2]
print(selectData)
Output :
a       Tokyo
b       Texas
c    New York
d       Texas
e       Texas
f    New York
g    Colombia
Name: City, dtype: object

Select multiple columns by Indices in a list :

To select multiple columns by position, we pass their indices as a list in the column position.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple columns at the index 0 & 2
selectData = dfObj.iloc[:,[0,2]]
print(selectData)
Output :
      Name      City
a     Jill     Tokyo
b   Rachel     Texas
c    Kirti  New York
d    Veena     Texas
e  Lucifer     Texas
f    Pablo  New York
g   Lionel  Colombia

Select multiple columns by Index range :

To select a range of columns, we pass a slice start:stop in the column position. The stop index is excluded, so 1:3 selects the columns at positions 1 and 2.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple columns from index 1 up to (but not including) index 3
selectData = dfObj.iloc[:,1:3]
print(selectData)
Output :
   Score      City
a   16.0     Tokyo
b   38.0     Texas
c   39.0  New York
d   40.0     Texas
e    NaN     Texas
f   30.0  New York
g   45.0  Colombia

Select single row by Index Position :

Just as with columns, we can select a row by passing its index position, with ‘:’ in place of the column index.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting a single row with index 2
selectData = dfObj.iloc[2,:]
print(selectData)
Output :
Name        Kirti
Score        39.0
City     New York
Name: c, dtype: object

Select multiple rows by Index positions in a list :

To do this, we pass the list of row positions as the first argument of iloc[ ].

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple rows by passing a list i.e. 2 & 5
selectData = dfObj.iloc[[2,5],:]
print(selectData)
Output :
    Name  Score      City
c  Kirti   39.0  New York
f  Pablo   30.0  New York

Select multiple rows by Index range :

To select a range of rows, we pass a slice start:stop as the first argument. The stop position is excluded, so 2:5 selects the rows at positions 2, 3 and 4.

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple rows by range i.e. positions 2 to 4 (stop position 5 is excluded)
selectData = dfObj.iloc[2:5,:]
print(selectData)
Output :
      Name  Score      City
c    Kirti   39.0  New York
d    Veena   40.0     Texas
e  Lucifer    NaN     Texas

Select multiple rows & columns by Index positions :

To select multiple rows and columns at once, we pass a list of row positions and a list of column positions to iloc[ ].

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Selecting multiple rows and columns
selectData = dfObj.iloc[[1,2],[1,2]]
print(selectData)
Output :
   Score      City
b   38.0     Texas
c   39.0  New York

Method-3 : Selecting Columns in DataFrame using [ ] operator

The [ ] operator selects columns by the name passed to it. If a label that does not exist is passed, it raises a KeyError.
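The KeyError behaviour can be checked with a short sketch. The frame df and the column name 'Country' below are hypothetical, chosen only so that the lookup fails. The sketch also shows DataFrame.get( ), which returns a default value (None unless specified) instead of raising.

```python
import pandas as pd

# Hypothetical frame with no 'Country' column
df = pd.DataFrame({'Name': ['Jill', 'Rachel'], 'City': ['Tokyo', 'Texas']})

# [ ] raises KeyError for a column name that does not exist
try:
    df['Country']
    missing = False
except KeyError:
    missing = True

# get() returns None for a missing column instead of raising
fallback = df.get('Country')

print(missing, fallback)
```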

Select a Column by Name :

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Select a single column name using [ ]
selectData = dfObj['Name']
print(selectData)
Output :
a       Jill
b     Rachel
c      Kirti
d      Veena
e    Lucifer
f      Pablo
g     Lionel
Name: Name, dtype: object

Select multiple columns by Name :

To select multiple columns we just pass a list of their names into [ ].

So, let’s see the implementation of it.

#Program :

import pandas as pd
import numpy as np
#data
students = [
('Jill',    16,     'Tokyo',),
('Rachel',  38,     'Texas',),
('Kirti',   39,     'New York'),
('Veena',   40,     'Texas',),
('Lucifer', np.nan, 'Texas'),
('Pablo',   30,     'New York'),
('Lionel',  45,     'Colombia',)]
#Creating the dataframe object
dfObj = pd.DataFrame(students, columns=['Name','Score','City'], index=['a','b','c','d','e','f','g'])
#Select multiple columns using [ ]
selectData = dfObj[['Name','City']]
print(selectData)
Output :
      Name      City
a     Jill     Tokyo
b   Rachel     Texas
c    Kirti  New York
d    Veena     Texas
e  Lucifer     Texas
f    Pablo  New York
g   Lionel  Colombia


Pandas: Create Series from list in python

How to create series from list in Python ?

In this article we will see how to convert a list to a Series using Pandas.

The Series class in Pandas provides the following constructor.

Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

Where,

  • data : An array-like or iterable sequence. Its elements become the values of the Series.
  • index : An array-like or iterable sequence. Its elements become the index labels of the Series.
  • dtype : The data type of the resulting Series.
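As a minimal sketch of how these three arguments fit together, the example below passes all of them as keywords. The values, labels and the variable name s are illustrative assumptions, not part of the examples that follow.

```python
import pandas as pd

# data supplies the values, index the labels, dtype the element type
s = pd.Series(data=[1, 2, 3], index=['x', 'y', 'z'], dtype=float)

print(s)
```

The integers are stored as floats because of the dtype argument, and each value can be looked up by its label, for example s['y'].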

Now, we will use the Series class constructor to create a Pandas Series object from a list. So, let’s start exploring the topic.

Creating Pandas Series from a list :

Let’s take an example where we will convert a list into Pandas series object by passing the list in Series class constructor.

Let’s see how it is actually implemented.

#Program :


import pandas as sc

# List for strings
list_words = ['alexa', 'siri', 'try', 'sometime', 'why', 'it']

# To create a series object from string list
series_objs = sc.Series(list_words)

print('New series object: ')
print(series_objs)
Output:
New series object: 
0       alexa
1        siri
2         try
3    sometime
4         why
5          it
dtype: object

Creating Pandas Series from two lists :

Suppose we want specific index labels in the Series object. In that case we pass a second list of the same size to the Series class constructor as the index argument.

Let’s see how it is actually implemented.

#Program :


import pandas as sc

# List for values
list_words = ['alexa', 'siri', 'please', 'try', 'sometime', 'it']

# List for index
index_tags = ['i', 'ii', 'iii', 'iv', 'v', 'vi']

# To create a series from two list i.e. one for value, one for index
series_objs = sc.Series(list_words, index=index_tags)

print('New series object: ')
print(series_objs)
Output:
New series object: 
i         alexa
ii         siri
iii      please
iv          try
v      sometime
vi           it
dtype: object

If no index argument is given, the index labels default to 0 through N-1, where N is the number of elements in the Series object.
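The default index can be inspected directly. This short sketch uses a hypothetical three-item list and prints the labels that Pandas assigns when no index argument is given.

```python
import pandas as pd

# No index argument: labels are generated automatically
s = pd.Series(['alexa', 'siri', 'try'])

default_labels = list(s.index)
print(default_labels)
```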

Creating Pandas Series object from a list but with dissimilar datatype :

Suppose we want to create a Series object from a list of integers, but the items should be stored as strings inside the Series object. To do this, we pass the dtype argument to the Series constructor, which converts the integers to strings.

Let’s see how it is actually implemented.

#Program :


import pandas as sc

# List for integers
list_nums = [101, 100, 153, 36, 58]

# To create a series with list of different type i.e. str
series_objs = sc.Series(list_nums, index= ['i', 'ii', 'iii', 'iv', 'v'],dtype=str)

print('New Series Object:')
print(series_objs)
Output:
New Series Object:
i      101
ii     100
iii    153
iv      36
v       58
dtype: object

Converting a heterogeneous list to Pandas Series object :

Let’s take an example of a heterogeneous list. If we don’t provide any dtype argument in the Series constructor, each item keeps its original Python type and the Series gets the generic object dtype.

Let’s see how it is actually implemented.

#Program :


import pandas as sc

# List for mixed datatypes
list_mixed = ['some',99,'will','be',10.57,'yes']

series_objs = sc.Series(list_mixed,index= ['i', 'ii', 'iii', 'iv', 'v', 'vi'])

print(series_objs)
Output:
i       some
ii        99
iii     will
iv        be
v      10.57
vi       yes
dtype: object
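One way to see what is actually stored in such a Series is to look at the type of each element. The sketch below uses a shorter hypothetical list (the names mixed and types are illustrative) and shows that the object dtype simply holds the original Python objects.

```python
import pandas as pd

# Hypothetical mixed list: a string, an integer and a float
mixed = pd.Series(['some', 99, 10.57])

# Collect the Python type name of every stored element
types = [type(v).__name__ for v in mixed]

print(mixed.dtype)
print(types)
```

If strings are really wanted, passing dtype=str to the constructor (as in the previous section) is one option.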

Converting a Boolean list to Pandas Series object :

Let’s take an example where we create a Series object from a list of Boolean values.

Let’s see how it is actually implemented.

#Program :

import pandas as sc

booli_list = [False, True, False, False, True]

# Convert a booli list to Series object of bool data type.
series_objs = sc.Series(booli_list,index=['i', 'ii', 'iii', 'iv', 'v'])

print('New Series Object:')
print(series_objs)
Output:
New Series Object:
i      False
ii      True
iii    False
iv     False
v       True
dtype: bool


Convert NumPy array to list in python

How to convert NumPy array to list in python ?

In this article we are going to discuss how we can convert a 1D, 2D or 3D Numpy Array to a list or list of lists.

Converting Numpy Array to List :

The ndarray class of the Numpy module provides the tolist() function, which can be used to convert a 1D array to a list. All the elements of the 1D array become the items of the list.

So, let’s see how it actually works.

# Program :

import numpy as np
# NumPy array created
arr = np.array([1, 2, 3, 4, 5])
# Printing the NumPy array
print('Numpy Array:', arr)

# Converting 1D Numpy Array to list
num_list = arr.tolist()

# Printing the list
print('List: ', num_list)
Output :
Numpy Array: [1 2 3 4 5]
List:  [1, 2, 3, 4, 5]
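A detail that is easy to miss is that tolist() also converts each element to a plain Python scalar, whereas the built-in list() keeps NumPy scalar objects. The sketch below compares the two on a small hypothetical array; the variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3])

# tolist() yields built-in Python ints
via_tolist = arr.tolist()

# list() yields NumPy integer scalars
via_list = list(arr)

print(type(via_tolist[0]).__name__)
print(isinstance(via_list[0], np.integer))
```

This matters, for example, when the list is later passed to code that expects built-in types.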

Converting 2D Numpy array to list of lists :

We can use the same tolist() function to convert a 2D Numpy array to a list of lists, also called a nested list.

So, let’s see how it actually works.

# Program :

import numpy as np
# 2D Numpy Array created
arr = np.array([[11, 12, 13, 14],
                [15, 16, 17, 18],
                [19, 20, 21, 22]])
# Printing 2D numpy array                
print('2D Numpy Array:')
print(arr)

# Converting Numpy Array to list of lists
list_of_lists = arr.tolist()

#Printing the nested list
print('List of lists:')
print(list_of_lists)
Output :
2D Numpy Array:
[[11 12 13 14]
 [15 16 17 18]
 [19 20 21 22]]
List of lists:
[[11, 12, 13, 14], [15, 16, 17, 18], [19, 20, 21, 22]]

Converting 2D Numpy Array to a flattened list :

We can also convert the 2D NumPy array to a flat list. For that first we have to convert the 2D NumPy array to 1D array by using flatten() function. Then convert 1D array to list by using tolist() function.

So, let’s see how it actually works.

# Program :

import numpy as np
# 2D Numpy Array created
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [3, 3, 3, 3]])
# Printing 2D Numpy array                
print('2D Numpy Array:')
print(arr)

# Converting 2D Numpy array to a single list
num_list = arr.flatten().tolist()

#Printing list
print('List:', num_list)
Output :
2D Numpy Array:
[[1 2 3 4]
 [5 6 7 8]
 [3 3 3 3]]
List: [1, 2, 3, 4, 5, 6, 7, 8, 3, 3, 3, 3]
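It may help to know that flatten() always returns a new copy of the data, so changing the flattened array leaves the original untouched. The sketch below demonstrates this on a small hypothetical array and also shows ravel(), a related NumPy function that can be used in the same way before calling tolist().

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# flatten() returns a copy; modifying it does not affect arr
flat_copy = arr.flatten()
flat_copy[0] = 99

# ravel() also gives a 1D array, which tolist() turns into a flat list
flat_list = arr.ravel().tolist()

print(arr[0, 0])
print(flat_list)
```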

Convert 3D Numpy array to nested list :

Similarly the 3D Numpy array can also be converted by using tolist() function into list of nested lists.

So, let’s see how it actually works.

# Program :

import numpy as np

# 3D Numpy Array created
arr = np.ones( (2,4,5) , dtype=np.int64)

# Printing 3D Numpy array
print('3D Numpy Array:')
print(arr)

# Converting 3D Numpy Array to nested list
nested_list = arr.tolist()

# Printing nested list
print('Nested list:')
print(nested_list)
Output :
3D Numpy Array:
[[[1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]]

 [[1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]]]
Nested list:
[[[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]], [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]]

Converting 3D Numpy Array to a flat list :

The process is the same as for the 2D Numpy array. Use the flatten() function to convert the 3D Numpy array to a 1D array, then convert the 1D array to a flat list with the tolist() function.

So, let’s see how it actually works.

# Program :

import numpy as np
# 3D Numpy Array created
arr = np.ones( (2,4,5) , dtype=np.int64)
# Printing 3D Numpy array
print('3D Numpy Array:')
print(arr)

# Converting 3D Numpy Array to flat list
flat_list = arr.flatten().tolist()

# Printing the list
print('Flat list:')
print(flat_list)
Output :
3D Numpy Array:
[[[1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]]

 [[1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]
  [1 1 1 1 1]]]
Flat list:
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

 


Pandas : Find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated() in Python

How to find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated() in Python ?

In this article we will discuss how to find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated(). First let’s look at the duplicated() function, then we will see how it actually works.

DataFrame.duplicated()

The DataFrame class in Python’s Pandas library provides the duplicated() function, which helps in finding duplicate rows based on specific columns or on all columns.

Syntax : DataFrame.duplicated(subset=None, keep='first')

where,

  • subset : It represents single or multiple column labels which will be used for the duplication check. If it is not provided then all columns will be checked for finding duplicate rows.
  • keep : It determines which occurrences are marked as duplicates. Its value can be 'first' (the default: every duplicate row except its first occurrence is marked True), 'last' (every duplicate row except its last occurrence is marked True) or False (every occurrence of a duplicate row is marked True).
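The effect of the three keep values is easiest to compare on a tiny frame. The sketch below uses a hypothetical single-column DataFrame in which 'CSK' appears three times, and collects the Boolean result of duplicated() for each setting.

```python
import pandas as pd

# Hypothetical data: 'CSK' occurs at positions 0, 2 and 3
df = pd.DataFrame({'Team': ['CSK', 'MI', 'CSK', 'CSK']})

# Mark duplicates, sparing the first occurrence
first = df.duplicated(keep='first').tolist()

# Mark duplicates, sparing the last occurrence
last = df.duplicated(keep='last').tolist()

# Mark every occurrence of a duplicated row
all_dups = df.duplicated(keep=False).tolist()

print(first)
print(last)
print(all_dups)
```

In each result, True marks a row that would be selected when the Boolean values are used to filter the DataFrame.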

Find Duplicate rows based on all columns :

To find all the duplicate rows based on all columns, we should not pass any subset argument while calling DataFrame.duplicated(). If any duplicate rows are found, True is returned in place of the duplicated rows except the first occurrence, as the default value of the keep argument is first.

import pandas as sc

# List of Tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456),
           ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 221),
           ('CSK', 'Dhoni', 446),
           ('PK', 'Malan', 298)]

# To create a DataFrame object
dfObjs = sc.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# To select duplicate rows based on all columns except the first occurrence
dupliRows = dfObjs[dfObjs.duplicated()]

print("Duplicate rows based on all column excluding first occurrence is:")
print(dupliRows)
Output :
Duplicate rows based on all column excluding first occurrence is:
   Team  Player  Runs
3   CSK  Jadeja   456


In the above example, all duplicate rows are returned except the first occurrence, because the default value of keep is first.

Note : If we set the keep argument to last, the last occurrence is ignored instead while finding the duplicate rows.

Find Duplicate Rows based on selected columns :

If we want to compare rows on selected columns only, we should pass the column names as an argument to duplicated(), which will return the duplicate rows based on the selected columns. In this case also the first occurrence is ignored.

#Program :

import pandas as sc

# List of Tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('DC', 'Pant', 337),
           ('CSK', 'Dhoni', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337),
           ('DC', 'Iyer', 337),
           ('PK', 'Malan', 298)]

# To create a DataFrame object
dfObjs = sc.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# To select the duplicated rows based on the two columns passed as argument
dupliRows = dfObjs[dfObjs.duplicated(['Team','Runs'])]

print("Duplicate Rows based on selected columns are:", dupliRows, sep='\n')
Output :
Duplicate Rows based on selected columns are:
   Team  Player  Runs
6    DC  Rahane   337
7    DC    Iyer   337

Since the default value of keep is first, only the first matching row is ignored. Here the duplicate rows are found based on the selected columns (Team and Runs), on which 3 rows match, i.e.

'DC', 'Pant', 337
'DC', 'Rahane', 337
'DC', 'Iyer', 337

Of these, the last two rows are displayed and the first row is ignored, as the keep value is first (the default value).

So it returned last two rows as output i.e

'DC', 'Rahane', 337
'DC', 'Iyer', 337
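If the first matching row should be listed as well, one option is to combine the subset argument with keep=False. The sketch below is an illustration on a reduced, hypothetical version of the player data (four rows only); the names players and all_matches are assumptions made for this example.

```python
import pandas as pd

# Reduced sample: three 'DC' rows share the same Team and Runs values
players = pd.DataFrame(
    [('DC', 'Pant', 337),
     ('KKR', 'Gill', 337),
     ('DC', 'Rahane', 337),
     ('DC', 'Iyer', 337)],
    columns=['Team', 'Player', 'Runs'],
)

# keep=False marks every row of a duplicated (Team, Runs) pair
all_matches = players[players.duplicated(subset=['Team', 'Runs'], keep=False)]

print(all_matches['Player'].tolist())
```

The 'KKR' row is left out because its Team value differs, even though its Runs value matches.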
