{"id":5280,"date":"2021-05-10T09:32:13","date_gmt":"2021-05-10T04:02:13","guid":{"rendered":"https:\/\/python-programs.com\/?p=5280"},"modified":"2021-11-22T18:42:49","modified_gmt":"2021-11-22T13:12:49","slug":"pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python","status":"publish","type":"post","link":"https:\/\/python-programs.com\/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python\/","title":{"rendered":"Pandas : Find duplicate rows in a Dataframe based on all or selected columns using DataFrame.duplicated() in Python"},"content":{"rendered":"
In this article we will discuss how to find duplicate rows in a Dataframe based on all or selected columns using Python’s Pandas library contains the DataFrame class, which provides a function i.e. where,<\/p>\n To find the duplicate rows based on all columns, we should not pass any argument for subset <\/em>while calling In the above example all duplicate rows are returned except the first occurrence, because the default value of Note : If we set the If we want to find duplicate rows by comparing selected columns only, then we should pass the column names as the subset argument of duplicated(), which will return the duplicate rows based on the selected columns. In this case too, the first occurrence is ignored.<\/p>\n The default value of Out of these, the last two rows are displayed and the first row is ignored, as the keep value is So it returned the last two rows as output, i.e.<\/p>\n DataFrame.duplicated()<\/code>. So first let’s look at this
duplicated()<\/code> function, then we will see how it actually works.<\/p>\n
DataFrame.duplicated()<\/h3>\n
duplicated()<\/code> that helps in finding duplicate rows based on specific or all columns.<\/p>\n
Syntax : DataFrame.duplicated(subset=None, keep='first')<\/pre>\n
\n
Find Duplicate rows based on all columns :<\/h3>\n
DataFrame.duplicated()<\/code>. If any duplicate rows are found,
True<\/code> <\/em>will be returned in place of the duplicated rows except the first occurrence, as the default value of the
keep<\/code> <\/em>argument is
first<\/code>.<\/p>\n
import pandas as pd\r\n\r\n# List of Tuples\r\n\r\nplayers = [('MI', 'Surya', 487),\r\n\r\n('RR', 'Buttler', 438),\r\n\r\n('CSK', 'Jadeja', 456),\r\n\r\n('CSK', 'Jadeja', 456),\r\n\r\n('KKR', 'Gill', 337),\r\n\r\n('SRH', 'Roy', 241),\r\n\r\n('DC', 'Rahane', 221),\r\n\r\n('CSK', 'Dhoni', 446),\r\n\r\n('PK', 'Malan', 298)\r\n\r\n]\r\n\r\n# To create a DataFrame object\r\n\r\ndfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])\r\n\r\n# To select duplicate rows based on all columns, excluding the first occurrence\r\n\r\ndupliRows = dfObjs[dfObjs.duplicated()]\r\n\r\nprint(\"Duplicate rows based on all column excluding first occurrence is:\")\r\n\r\nprint(dupliRows)<\/pre>\n
Output :\r\nDuplicate rows based on all column excluding first occurrence is:\r\nTeam\u00a0 Player\u00a0 Runs\r\n3\u00a0 CSK\u00a0 Jadeja\u00a0\u00a0 456\r\n\r\n\r\n<\/pre>\n
keep<\/code> is
first<\/code>.<\/p>\n
keep<\/code><\/em> argument as
last<\/code> <\/em>then the last occurrence will be ignored while finding the duplicate rows.<\/p>\n
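The note above can be illustrated with a minimal sketch that reuses the sample data from the previous example. The variable names df, dup_keep_last and dup_all are chosen here for illustration only. Besides 'first' and 'last', the keep argument also accepts False, which flags every row of a duplicate group.

```python
import pandas as pd

# Same sample data as the example above
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('CSK', 'Jadeja', 456),
           ('CSK', 'Jadeja', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 221),
           ('CSK', 'Dhoni', 446),
           ('PK', 'Malan', 298)]

df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# keep='last': the last occurrence is left unmarked, earlier copies are flagged
dup_keep_last = df[df.duplicated(keep='last')]

# keep=False: every row that belongs to a duplicate group is flagged
dup_all = df[df.duplicated(keep=False)]

print("Duplicates with keep='last':", dup_keep_last, sep='\n')
print("Duplicates with keep=False:", dup_all, sep='\n')
```

With keep='last' the row at index 2 is selected instead of the row at index 3, and with keep=False both rows are selected.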
Find Duplicate Rows based on selected columns :<\/h3>\n
#Program :\r\n\r\nimport pandas as pd\r\n\r\n# List of Tuples\r\n\r\nplayers = [('MI', 'Surya', 487),\r\n\r\n('RR', 'Buttler', 438),\r\n\r\n('DC', 'Pant', 337),\r\n\r\n('CSK', 'Dhoni', 456),\r\n\r\n('KKR', 'Gill', 337),\r\n\r\n('SRH', 'Roy', 241),\r\n\r\n('DC', 'Rahane', 337),\r\n\r\n('DC', 'Iyer', 337),\r\n\r\n('PK', 'Malan', 298)\r\n\r\n]\r\n\r\n# To create a DataFrame object\r\n\r\ndfObjs = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])\r\n\r\n# To select the duplicate rows based on the columns passed as argument (Team and Runs),\r\n\r\n# excluding the first occurrence\r\n\r\ndupliRows = dfObjs[dfObjs.duplicated(['Team','Runs'])]\r\n\r\nprint(\"Duplicate Rows based on a selected column are:\", dupliRows, sep='\\n')\r\n<\/pre>\n
Output :\r\nDuplicate Rows based on a selected column are:\r\n \u00a0 \u00a0Team\u00a0 \u00a0Player\u00a0 \u00a0 \u00a0 Runs\r\n6\u00a0 \u00a0 DC\u00a0 \u00a0 \u00a0 Rahane\u00a0 \u00a0 337\r\n7\u00a0 \u00a0 DC\u00a0 \u00a0 \u00a0 Iyer\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 337<\/pre>\n
keep<\/code> is first, so only the first matching row is ignored. Here, we have found rows based on selected columns. In this example we selected the columns Team and Runs, on which 3 rows match, i.e.<\/p>\n
'DC', 'Pant', 337\r\n'DC', 'Rahane', 337\r\n'DC', 'Iyer', 337<\/pre>\n
first<\/code> (the default value).<\/p>\n
'DC', 'Rahane', 337\r\n'DC', 'Iyer', 337<\/pre>\n
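To see all three matching rows together, including the first occurrence, one option is to combine the subset argument with keep=False. The following is a minimal sketch using the same sample data. The names df, all_matches and runs_only are illustrative and not part of the original program.

```python
import pandas as pd

# Same sample data as the selected-columns example above
players = [('MI', 'Surya', 487),
           ('RR', 'Buttler', 438),
           ('DC', 'Pant', 337),
           ('CSK', 'Dhoni', 456),
           ('KKR', 'Gill', 337),
           ('SRH', 'Roy', 241),
           ('DC', 'Rahane', 337),
           ('DC', 'Iyer', 337),
           ('PK', 'Malan', 298)]

df = pd.DataFrame(players, columns=['Team', 'Player', 'Runs'])

# Compare only Team and Runs; keep=False flags every matching row,
# so the first occurrence ('DC', 'Pant', 337) is included as well
all_matches = df[df.duplicated(subset=['Team', 'Runs'], keep=False)]

# A single column label can also be passed as subset;
# with the default keep='first' the first row with Runs == 337 is not flagged
runs_only = df[df.duplicated(subset='Runs')]

print("All rows matching on Team and Runs:", all_matches, sep='\n')
print("Duplicates based on Runs only:", runs_only, sep='\n')
```

Comparing on Runs alone also picks up the KKR row, because it shares the value 337 with the DC rows.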
\n