Find & Drop duplicate columns in a DataFrame | Python Pandas
In this article we will learn to find duplicate columns in a Pandas dataframe and drop them.
The pandas library provides direct APIs for finding duplicate rows, such as DataFrame.duplicated() and DataFrame.drop_duplicates(), but it has no direct API for finding duplicate columns. We therefore have to write our own function for that. First, let's create a dataframe with duplicate columns.
import pandas as pd

# List of tuples
players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
           ('Trevor', 28, 'England', 28, 'England', 28),
           ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
           ]

# Create a DataFrame object
PlayerObj = pd.DataFrame(players,
                         columns=['Name', 'Age', 'Country',
                                  'Address', 'Citizen', 'Jersey'])

print("Original Dataframe is:")
print(PlayerObj)
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
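For comparison, here is a minimal sketch of how pandas' built-in row-level helpers behave. The small frame `rows_df` is hypothetical and used only for illustration. `DataFrame.duplicated()` and `DataFrame.drop_duplicates()` compare whole rows, which is why they do not directly answer the duplicate-column question.

```python
import pandas as pd

# Hypothetical frame with one repeated row (illustration only)
rows_df = pd.DataFrame(
    [('Nathan', 35), ('Vishal', 24), ('Nathan', 35)],
    columns=['Name', 'Age'],
)

# duplicated() marks every row that repeats an earlier row
row_mask = rows_df.duplicated()
print(row_mask.tolist())   # [False, False, True]

# drop_duplicates() keeps the first occurrence of each row
unique_rows = rows_df.drop_duplicates()
print(unique_rows.shape)   # (2, 2)
```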
Find duplicate columns in a DataFrame :
To find the duplicate columns in a dataframe, we iterate over each column and check whether any later column has the same content. If so, the name of that later column is stored in a set of duplicate column names, and at the end our function returns those names as a list. Because the names are collected in a set, the order of the returned list is not guaranteed.
import pandas as pd


def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the ones whose
    contents repeat an earlier column in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents are the same
             as an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)


def main():
    # List of tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
               ('Vishal', 24, 'India', 24, 'India', 24),
               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
               ('Trevor', 28, 'England', 28, 'England', 28),
               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
               ]

    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players,
                             columns=['Name', 'Age', 'Country',
                                      'Address', 'Citizen', 'Jersey'])

    print("Original Dataframe is:")
    print(PlayerObj)

    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)

    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)


if __name__ == '__main__':
    main()
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are: 
Column name is :  Citizen
Column name is :  Jersey
Column name is :  Address
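The list above tells us which columns are duplicates, but not which column each one copies. As a minimal sketch built on the same pairwise `Series.equals()` comparison, the hypothetical helper `map_duplicates_to_original` below (not part of the original article) records, for each duplicate, the first column with the same content. It assumes that column names are unique, since the names are used as dictionary keys.

```python
import pandas as pd


def map_duplicates_to_original(df):
    """Return {duplicate column name: name of the first column with equal content}.

    Hypothetical helper for illustration; assumes unique column names.
    """
    mapping = {}
    for x in range(df.shape[1]):
        name_x = df.columns[x]
        if name_x in mapping:
            # Already known to be a copy of an earlier column
            continue
        for y in range(x + 1, df.shape[1]):
            name_y = df.columns[y]
            if name_y not in mapping and df.iloc[:, x].equals(df.iloc[:, y]):
                mapping[name_y] = name_x
    return mapping


players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24)]
df = pd.DataFrame(players,
                  columns=['Name', 'Age', 'Country',
                           'Address', 'Citizen', 'Jersey'])

dup_map = map_duplicates_to_original(df)
print(dup_map)
# {'Address': 'Age', 'Jersey': 'Age', 'Citizen': 'Country'}
```

Because a dictionary keeps insertion order, the duplicates are reported in the order they are discovered, unlike the set-based list.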
Drop duplicate columns in a DataFrame :
To drop the duplicate columns, we pass the list of duplicate column names returned by our function to DataFrame.drop().
import pandas as pd


def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the ones whose
    contents repeat an earlier column in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents are the same
             as an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)


def main():
    # List of tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
               ('Vishal', 24, 'India', 24, 'India', 24),
               ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
               ('Trevor', 28, 'England', 28, 'England', 28),
               ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
               ]

    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players,
                             columns=['Name', 'Age', 'Country',
                                      'Address', 'Citizen', 'Jersey'])

    print("Original Dataframe is:")
    print(PlayerObj)

    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)

    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)

    # Delete duplicate columns, reusing the list computed above
    print('After removing duplicate columns new data frame becomes: ')
    newDf = PlayerObj.drop(columns=duplicateColumnNames)

    # Print the dataframe on its own line so the column header stays aligned
    print("Modified Dataframe is:")
    print(newDf)


if __name__ == '__main__':
    main()
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are: 
Column name is :  Jersey
Column name is :  Citizen
Column name is :  Address
After removing duplicate columns new data frame becomes: 
Modified Dataframe is:
      Name  Age       Country
0   Nathan   35     Australia
1   Vishal   24         India
2  Abraham   34  South Africa
3   Trevor   28       England
4    Kumar   42      SriLanka
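As an alternative to the hand-written loop, a shorter variant can reuse the row-level `DataFrame.duplicated()` on the transposed frame. This is a sketch of a different technique, not the method used in this article. Transposing a frame with mixed column types produces object-typed data, so this version compares values only and may treat columns as equal where `Series.equals()` would not (for example, when only the dtypes differ). It can also be slow on large dataframes.

```python
import pandas as pd

players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34)]
df = pd.DataFrame(players,
                  columns=['Name', 'Age', 'Country',
                           'Address', 'Citizen', 'Jersey'])

# Transposing turns columns into rows, so the row-level duplicated()
# flags every column that repeats an earlier column
is_duplicate = df.T.duplicated()

# Keep only the columns that are not flagged as duplicates
deduped = df.loc[:, ~is_duplicate.to_numpy()]
print(list(deduped.columns))   # ['Name', 'Age', 'Country']
```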