{"id":7296,"date":"2023-11-03T08:13:54","date_gmt":"2023-11-03T02:43:54","guid":{"rendered":"https:\/\/python-programs.com\/?p=7296"},"modified":"2023-11-10T12:15:06","modified_gmt":"2023-11-10T06:45:06","slug":"how-to-find-and-drop-duplicate-columns-in-a-dataframe-python-pandas","status":"publish","type":"post","link":"https:\/\/python-programs.com\/how-to-find-and-drop-duplicate-columns-in-a-dataframe-python-pandas\/","title":{"rendered":"How to Find and Drop duplicate columns in a DataFrame | Python Pandas"},"content":{"rendered":"
In this article we will learn to find duplicate columns in a Pandas dataframe and drop them.<\/p>\n
The Pandas library provides direct APIs for finding duplicate rows, but there is no direct API for finding duplicate columns, so we have to build one ourselves. First, let’s create a dataframe with duplicate columns.<\/p>\n
import pandas as pd\r\n# List of Tuples\r\nplayers = [('Nathan', 35, 'Australia', 35, 'Australia', 35),\r\n ('Vishal', 24, 'India', 24, 'India', 24),\r\n ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),\r\n ('Trevor', 28, 'England', 28, 'England', 28),\r\n ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),\r\n ]\r\n# Create a DataFrame object\r\nPlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])\r\nprint(\"Original Dataframe is:\")\r\nprint(PlayerObj)\r\n<\/pre>\nOutput :\r\nOriginal Dataframe is:\r\n \u00a0\u00a0\u00a0\u00a0 Name\u00a0 Age\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Country\u00a0 Address\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Citizen\u00a0 Jersey\r\n0\u00a0\u00a0 Nathan\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0 35\r\n1\u00a0\u00a0 Vishal\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0 24\r\n2\u00a0 Abraham\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0 34\r\n3\u00a0\u00a0 Trevor\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0 28\r\n4\u00a0\u00a0\u00a0 Kumar\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0 42<\/pre>\nFind duplicate columns in a DataFrame :<\/h3>\n
To find the duplicate columns in a dataframe, we iterate over each column and check whether any later column has the same contents. If it does, the name of that later column is stored in a list of duplicates, and at the end our API returns that list of duplicate column names.<\/p>\n
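The comparison at the heart of this approach is Series.equals. A minimal sketch of how it behaves, assuming a small hypothetical frame with columns a, b and c (these names are for illustration only and are not part of the article's data):

```python
import pandas as pd

# Hypothetical frame: "a" and "b" hold identical integers,
# "c" holds the same numbers stored as floats.
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]})

# equals() compares the contents of two columns; the column names are ignored.
same_ab = df["a"].equals(df["b"])

# equals() also requires matching dtypes, so int64 vs float64 is not a match.
same_ac = df["a"].equals(df["c"])

print(same_ab, same_ac)
```

Because the dtype must match as well, two columns that look the same when printed may still not be reported as duplicates by this check.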
import pandas as pd\r\n\r\ndef getDuplicateColumns(df):\r\n    '''\r\n    Get a list of duplicate columns.\r\n    It iterates over all the columns and finds the duplicate columns in the dataframe.\r\n    :param df: Dataframe object\r\n    :return: List of names of columns whose contents match an earlier column\r\n    '''\r\n    duplicateColumnNames = set()\r\n    # Iterate over all the columns\r\n    for x in range(df.shape[1]):\r\n        # Select column at xth index of dataframe.\r\n        col = df.iloc[:, x]\r\n        # Iterate over all the columns from (x+1)th index till end\r\n        for y in range(x + 1, df.shape[1]):\r\n            # Select column at yth index of dataframe.\r\n            otherCol = df.iloc[:, y]\r\n            # Check if two columns x & y are equal\r\n            if col.equals(otherCol):\r\n                duplicateColumnNames.add(df.columns.values[y])\r\n    return list(duplicateColumnNames)\r\n\r\ndef main():\r\n    # List of Tuples\r\n    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),\r\n               ('Vishal', 24, 'India', 24, 'India', 24),\r\n               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),\r\n               ('Trevor', 28, 'England', 28, 'England', 28),\r\n               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),\r\n               ]\r\n    # Creation of DataFrame object\r\n    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])\r\n    print(\"Original Dataframe is:\")\r\n    print(PlayerObj)\r\n    # To get list of duplicate columns (a set is used internally, so the order may vary)\r\n    duplicateColumnNames = getDuplicateColumns(PlayerObj)\r\n    print('Duplicate Columns are: ')\r\n    for ele in duplicateColumnNames:\r\n        print('Column name is : ', ele)\r\n\r\nif __name__ == '__main__':\r\n    main()\r\n<\/pre>\nOutput :\r\nOriginal Dataframe is:\r\n \u00a0\u00a0\u00a0\u00a0 Name\u00a0 Age\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Country\u00a0 Address\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Citizen\u00a0 Jersey\r\n0\u00a0\u00a0 Nathan\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0 35\r\n1\u00a0\u00a0 Vishal\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0 24\r\n2\u00a0 Abraham\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0 34\r\n3\u00a0\u00a0 Trevor\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0 28\r\n4\u00a0\u00a0\u00a0 Kumar\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0 42\r\nDuplicate Columns are:\r\nColumn name is :\u00a0 Citizen\r\nColumn name is :\u00a0 Jersey\r\nColumn name is :\u00a0 Address<\/pre>\nDrop duplicate columns in a DataFrame :<\/h3>\n
To drop (remove) the duplicate columns, we pass the list of duplicate column names returned by our API to dataframe.drop().<\/em><\/p>\n
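As a minimal sketch of that call on its own, assuming a hypothetical two-row frame and a hard-coded list of duplicate names (in the full program the list comes from getDuplicateColumns):

```python
import pandas as pd

# Hypothetical frame in which "Address" and "Jersey" repeat the "Age" values.
df = pd.DataFrame({"Age": [35, 24], "Address": [35, 24], "Jersey": [35, 24]})

# Hard-coded here for illustration only.
duplicate_names = ["Address", "Jersey"]

# drop(columns=...) returns a new DataFrame; df itself is left unchanged.
trimmed = df.drop(columns=duplicate_names)

print(list(trimmed.columns))
print(list(df.columns))
```

Since drop returns a new object by default, the result has to be assigned to a variable if the trimmed frame is to be used later.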
import pandas as pd\r\n\r\ndef getDuplicateColumns(df):\r\n    '''\r\n    Get a list of duplicate columns.\r\n    It iterates over all the columns and finds the duplicate columns in the dataframe.\r\n    :param df: Dataframe object\r\n    :return: List of names of columns whose contents match an earlier column\r\n    '''\r\n    duplicateColumnNames = set()\r\n    # Iterate over all the columns\r\n    for x in range(df.shape[1]):\r\n        # Select column at xth index of dataframe.\r\n        col = df.iloc[:, x]\r\n        # Iterate over all the columns from (x+1)th index till end\r\n        for y in range(x + 1, df.shape[1]):\r\n            # Select column at yth index of dataframe.\r\n            otherCol = df.iloc[:, y]\r\n            # Check if two columns x & y are equal\r\n            if col.equals(otherCol):\r\n                duplicateColumnNames.add(df.columns.values[y])\r\n    return list(duplicateColumnNames)\r\n\r\ndef main():\r\n    # List of Tuples\r\n    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),\r\n               ('Vishal', 24, 'India', 24, 'India', 24),\r\n               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),\r\n               ('Trevor', 28, 'England', 28, 'England', 28),\r\n               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),\r\n               ]\r\n    # Creation of DataFrame object\r\n    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])\r\n    print(\"Original Dataframe is:\")\r\n    print(PlayerObj)\r\n    # To get list of duplicate columns (a set is used internally, so the order may vary)\r\n    duplicateColumnNames = getDuplicateColumns(PlayerObj)\r\n    print('Duplicate Columns are: ')\r\n    for ele in duplicateColumnNames:\r\n        print('Column name is : ', ele)\r\n\r\n    # Delete duplicate columns, reusing the list computed above\r\n    print('After removing duplicate columns new data frame becomes: ')\r\n    newDf = PlayerObj.drop(columns=duplicateColumnNames)\r\n    print(\"Modified Dataframe is: \", newDf)\r\n\r\nif __name__ == '__main__':\r\n    main()\r\n<\/pre>\nOutput :\r\nOriginal Dataframe is:\r\n \u00a0\u00a0\u00a0\u00a0 Name\u00a0 Age\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Country\u00a0 Address\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Citizen\u00a0 Jersey\r\n0\u00a0\u00a0 Nathan\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\u00a0\u00a0\u00a0\u00a0\u00a0 35\r\n1\u00a0\u00a0 Vishal\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\u00a0\u00a0\u00a0\u00a0\u00a0 24\r\n2\u00a0 Abraham\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 34\u00a0 South Africa\u00a0\u00a0\u00a0\u00a0\u00a0 34\r\n3\u00a0\u00a0 Trevor\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\u00a0\u00a0\u00a0\u00a0\u00a0 28\r\n4\u00a0\u00a0\u00a0 Kumar\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka\u00a0\u00a0\u00a0\u00a0\u00a0 42\r\nDuplicate Columns are:\r\nColumn name is :\u00a0 Jersey\r\nColumn name is :\u00a0 Citizen\r\nColumn name is :\u00a0 Address\r\nAfter removing duplicate columns new data frame becomes:\r\nModified Dataframe is:\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Name\u00a0 Age\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Country\r\n0\u00a0\u00a0 Nathan\u00a0\u00a0 35\u00a0\u00a0\u00a0\u00a0 Australia\r\n1\u00a0\u00a0 Vishal\u00a0\u00a0 24\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 India\r\n2\u00a0 Abraham\u00a0\u00a0 34\u00a0 South Africa\r\n3\u00a0\u00a0 Trevor\u00a0\u00a0 28\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 England\r\n4\u00a0\u00a0\u00a0 Kumar\u00a0\u00a0 42\u00a0\u00a0\u00a0\u00a0\u00a0 SriLanka<\/pre>\n<\/p>\n","protected":false},"excerpt":{"rendered":"
Find & Drop duplicate columns in a DataFrame | Python Pandas In this article we will learn to find duplicate columns in a Pandas dataframe and drop them. The Pandas library provides direct APIs for finding duplicate rows, but there is no direct API for finding duplicate columns, so we have to build an API …<\/p>\n