How to find duplicate rows in a DataFrame based on all or selected columns using DataFrame.duplicated() in Python?
In this article we will discuss how to find duplicate rows in a DataFrame based on all or selected columns using DataFrame.duplicated(). Let's first look at the duplicated() function, and then see how it works in practice.
DataFrame.duplicated()
The DataFrame class in Python's pandas library provides a method, duplicated(), that helps in finding duplicate rows based on specific or all columns. It returns a Boolean Series with one entry per row, where True marks a row as a duplicate.
Syntax : DataFrame.duplicated(subset=None, keep='first')
where,
- subset : It represents single or multiple column labels which will be used for duplication check. If it is not provided then all columns will be checked for finding duplicate rows.
- keep : It controls which occurrences are marked as duplicates. Its value can be 'first' (the default: every duplicate row except its first occurrence is marked), 'last' (every duplicate row except its last occurrence is marked) or False (all occurrences of a duplicated row are marked).
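The effect of each keep value is easiest to see on the Boolean Series itself. The following is a minimal sketch on a small, hypothetical DataFrame (the data and the variable names are made up for illustration) in which rows 0, 1 and 3 are identical:

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical
df = pd.DataFrame({"Team": ["CSK", "CSK", "MI", "CSK"],
                   "Runs": [456, 456, 487, 456]})

mask_first = df.duplicated()             # keep='first' is the default
mask_last = df.duplicated(keep="last")   # last occurrence is not marked
mask_all = df.duplicated(keep=False)     # every occurrence is marked

print(mask_first.tolist())  # [False, True, False, True]
print(mask_last.tolist())   # [True, True, False, False]
print(mask_all.tolist())    # [True, True, False, True]
```

Each mask can be used to index the DataFrame, e.g. df[mask_first], to select the marked rows.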
Find Duplicate Rows based on all columns :
To find duplicate rows based on all columns, we do not pass any argument for subset when calling DataFrame.duplicated(). If duplicate rows are found, True is returned at the positions of the duplicated rows except the first occurrence, because the default value of the keep argument is 'first'.
import pandas as pd

# List of tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456),
           ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 221),
           ('CSK', 'Dhoni', 446),
           ('PK', 'Malan', 298)]

# Create a DataFrame object
dfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# Select duplicate rows based on all columns, except the first occurrence
dupliRows = dfObjs[dfObjs.duplicated()]
print("Duplicate rows based on all columns, excluding the first occurrence:")
print(dupliRows)

Output :
Duplicate rows based on all columns, excluding the first occurrence:
  Team  Player  Runs
3  CSK  Jadeja   456
In the above example, all duplicate rows are returned except the first occurrence, because the default value of keep is 'first'.
Note : If we set the keep argument to 'last', then the last occurrence is ignored while finding the duplicate rows, and the earlier occurrences are returned instead.
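As a rough illustration of this note, the sketch below reuses the players data from the example above and compares keep='last' with keep=False. The variable names df, dup_last and dup_all are chosen here for illustration only.

```python
import pandas as pd

# Same players data as in the example above
players = [('MI', 'Surya', 487), ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456), ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337), ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 221), ('CSK', 'Dhoni', 446),
           ('PK', 'Malan', 298)]
df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# keep='last': the last occurrence (index 3) is ignored,
# so the earlier copy (index 2) is returned
dup_last = df[df.duplicated(keep='last')]
print(dup_last.index.tolist())  # [2]

# keep=False: every occurrence of a duplicated row is returned
dup_all = df[df.duplicated(keep=False)]
print(dup_all.index.tolist())   # [2, 3]
```

With keep=False no occurrence is treated as the original, which can be convenient when all copies need to be inspected side by side.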
Find Duplicate Rows based on selected columns :
If we want to find duplicate rows by comparing only selected columns, then we should pass the column names as the subset argument of duplicated(), which will mark the duplicate rows based on the selected columns. In this case also, the first occurrence is ignored by default.
# Program :
import pandas as pd

# List of tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('DC', 'Pant', 337),
           ('CSK', 'Dhoni', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337),
           ('DC', 'Iyer', 337),
           ('PK', 'Malan', 298)]

# Create a DataFrame object
dfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# Select the duplicated rows based on the columns passed as argument
# (here 'Team' and 'Runs'), except the first occurrence
dupliRows = dfObjs[dfObjs.duplicated(['Team', 'Runs'])]
print("Duplicate rows based on the selected columns are:", dupliRows, sep='\n')

Output :
Duplicate rows based on the selected columns are:
  Team  Player  Runs
6   DC  Rahane   337
7   DC    Iyer   337
The default value of keep is 'first', so only the first matching row is ignored. Here, the rows are compared on the selected columns (Team and Runs), on which 3 rows match, i.e.
'DC', 'Pant', 337
'DC', 'Rahane', 337
'DC', 'Iyer', 337
Of these, the first row is ignored because the keep value is 'first' (the default value), so the last two rows are returned as output, i.e.
'DC', 'Rahane', 337
'DC', 'Iyer', 337
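The subset and keep arguments can also be combined. The sketch below, which assumes the same players data as the program above (the names df, all_matches and runs_dups are illustrative), returns all three matching rows with keep=False and also shows subset given as a single column label:

```python
import pandas as pd

# Same players data as in the program above
players = [('MI', 'Surya', 487), ('RR', 'Buttler', 438),
           ('DC', 'Pant', 337), ('CSK', 'Dhoni', 456),
           ('KKR', 'Gill', 337), ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337), ('DC', 'Iyer', 337),
           ('PK', 'Malan', 298)]
df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# All rows that share the same (Team, Runs) pair, first occurrence included
all_matches = df[df.duplicated(subset=['Team', 'Runs'], keep=False)]
print(all_matches['Player'].tolist())  # ['Pant', 'Rahane', 'Iyer']

# subset can also be a single column label: compare rows on 'Runs' only
runs_dups = df[df.duplicated(subset='Runs')]
print(runs_dups['Player'].tolist())    # ['Gill', 'Rahane', 'Iyer']
```

Comparing on 'Runs' alone also marks 'Gill', because 337 first appears in the 'Pant' row, regardless of team.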