Pandas : Find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated() in Python

How to find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated() in Python ?

In this article we will discuss how to find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated(). First let’s look at the duplicated() function itself, then we will see how it actually works.

DataFrame.duplicated()

Python’s Pandas library contains the DataFrame class, which provides a function duplicated() that helps in finding duplicate rows based on all or specific columns. It returns a Boolean Series in which True marks each row that is considered a duplicate.

Syntax : DataFrame.duplicated(subset=None, keep='first')

where,

  • subset : A single column label or a sequence of column labels used for the duplicate check. If it is not provided, all columns are compared to find duplicate rows.
  • keep : It controls which occurrences are marked as duplicates. Its value can be 'first' (the default: every duplicate row except the first occurrence is marked True), 'last' (every duplicate row except the last occurrence is marked True) or False (every occurrence of a duplicated row is marked True).
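
As a rough illustration of the three keep settings described above, the sketch below builds a small made-up DataFrame (its column names and values are assumptions chosen only for demonstration) and prints the Boolean mask that each setting returns.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 2 and 3 are identical
df = pd.DataFrame({'Team': ['CSK', 'MI', 'CSK', 'CSK'],
                   'Runs': [456, 487, 456, 456]})

# keep='first' (default): every repeat after the first occurrence is True
mask_first = df.duplicated(keep='first')

# keep='last': every repeat before the last occurrence is True
mask_last = df.duplicated(keep='last')

# keep=False: every row that has at least one duplicate is True
mask_all = df.duplicated(keep=False)

print(mask_first.tolist())  # [False, False, True, True]
print(mask_last.tolist())   # [True, False, True, False]
print(mask_all.tolist())    # [True, False, True, True]
```

Indexing the DataFrame with any of these masks, for example df[mask_all], selects the rows marked True.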

Find Duplicate Rows based on all columns :

To find the duplicate rows based on all columns, we should not pass any argument for subset while calling DataFrame.duplicated(). If any duplicate rows are found, True is returned in place of each duplicated row except the first occurrence, because the default value of the keep argument is 'first'.

import pandas as pd

# List of tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456),
           ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 221),
           ('CSK', 'Dhoni', 446),
           ('PK', 'Malan', 298)]

# Create a DataFrame object
dfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# Select duplicate rows based on all columns, excluding the first occurrence
dupliRows = dfObjs[dfObjs.duplicated()]

print("Duplicate rows based on all column excluding first occurrence is:")
print(dupliRows)
Output :
Duplicate rows based on all column excluding first occurrence is:
  Team  Player  Runs
3  CSK  Jadeja   456

In the above example, all duplicate rows are returned except the first occurrence, because the default value of keep is 'first'.

Note : If we set the keep argument to 'last', then the last occurrence of each set of duplicate rows is ignored and the earlier occurrences are returned instead.
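
To make the note concrete, here is a minimal sketch that reuses a shortened version of the players list (trimmed to five rows for brevity, which is an assumption of this example) and compares keep='last' with keep=False.

```python
import pandas as pd

# Shortened copy of the players data; rows 2 and 3 are identical
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456),
           ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337)]
df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# keep='last': the last occurrence (index 3) is ignored, index 2 is returned
last_ignored = df[df.duplicated(keep='last')]

# keep=False: no occurrence is ignored, both copies are returned
all_copies = df[df.duplicated(keep=False)]

print(last_ignored.index.tolist())  # [2]
print(all_copies.index.tolist())    # [2, 3]
```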

Find Duplicate Rows based on selected columns :

If we want to compare rows on selected columns only, then we should pass the column names as an argument to duplicated(), which will return the duplicate rows based on the selected columns. In this case too, the first occurrence is ignored by default.

#Program :

import pandas as pd

# List of tuples
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('DC', 'Pant', 337),
           ('CSK', 'Dhoni', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337),
           ('DC', 'Iyer', 337),
           ('PK', 'Malan', 298)]

# Create a DataFrame object
dfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# Select duplicate rows based on the 'Team' and 'Runs' columns only,
# excluding the first occurrence
dupliRows = dfObjs[dfObjs.duplicated(['Team', 'Runs'])]

print("Duplicate Rows based on a selected column are:", dupliRows, sep='\n')
Output :
Duplicate Rows based on a selected column are:
  Team  Player  Runs
6   DC  Rahane   337
7   DC    Iyer   337

Since the default value of keep is 'first', only the first matching row is ignored. Here the rows are compared on the selected columns (Team and Runs), and 3 rows match, i.e.

'DC', 'Pant', 337
'DC', 'Rahane', 337
'DC', 'Iyer', 337

Of these, the first row is ignored because keep is 'first' (the default value), so only the last two rows are returned as output, i.e.

'DC', 'Rahane', 337
'DC', 'Iyer', 337
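
The same data can also be checked on a single column, or with keep=False so that the first match is included as well. The following is only a sketch of those two variations. The variable names are illustrative and not part of the original program.

```python
import pandas as pd

players = [('MI', 'Surya', 487), ('RR', 'Buttler', 438), ('DC', 'Pant', 337),
           ('CSK', 'Dhoni', 456), ('KKR', 'Gill', 337), ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337), ('DC', 'Iyer', 337), ('PK', 'Malan', 298)]
df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# subset can be a single column label: compare rows on 'Runs' only
by_runs = df[df.duplicated('Runs')]

# keep=False with a subset: return every row of each matching group,
# including the first occurrence ('DC', 'Pant', 337)
every_match = df[df.duplicated(['Team', 'Runs'], keep=False)]

print(by_runs.index.tolist())      # [4, 6, 7]
print(every_match.index.tolist())  # [2, 6, 7]
```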
