How to Find and Drop duplicate columns in a DataFrame | Python Pandas

In this article we will learn to find duplicate columns in a Pandas dataframe and drop them.

The Pandas library provides direct APIs for finding duplicate rows, but it has no dedicated API for finding duplicate columns, so we have to build one ourselves. First, let’s create a dataframe with duplicate columns.

import pandas as pd
# List of Tuples
players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
            ('Vishal', 24, 'India', 24, 'India', 24),
            ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
            ('Trevor', 28, 'England', 28, 'England', 28),
            ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
            ]
# Create a DataFrame object
PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])
print("Original Dataframe is:")
print(PlayerObj)
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
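
For comparison, the row-level APIs mentioned above are `DataFrame.duplicated()` and `DataFrame.drop_duplicates()`. The following is a minimal sketch on a small made-up frame; the `rows` data is hypothetical and is not part of the players example.

```python
import pandas as pd

# Hypothetical data with one repeated row, for illustration only
rows = pd.DataFrame(
    [('Nathan', 35), ('Vishal', 24), ('Nathan', 35)],
    columns=['Name', 'Age'],
)

# Boolean Series: True for every row that repeats an earlier row
rowMask = rows.duplicated()
print(rowMask.tolist())

# Keep only the first occurrence of each row
uniqueRows = rows.drop_duplicates()
print(uniqueRows)
```

Both calls compare rows only, which is why a separate helper is needed for columns.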

Find duplicate columns in a DataFrame :

To find the duplicate columns in a dataframe, we iterate over each column and check whether any later column has the same contents. If so, the name of that later column is stored in a set of duplicate column names, and at the end our function returns them as a list. Because a set is used, the order of the returned names is not guaranteed.

import pandas as pd

def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the duplicate columns in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents duplicate an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe.
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe.
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

def main():
    # List of Tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
            ('Vishal', 24, 'India', 24, 'India', 24),
            ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
            ('Trevor', 28, 'England', 28, 'England', 28),
            ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
            ]
    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])
    print("Original Dataframe is:")
    print(PlayerObj)
    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)
    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)

if __name__ == '__main__':
    main()
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are:
Column name is :  Citizen
Column name is :  Jersey
Column name is :  Address
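
The same check can also be written without the nested loop. The following is a sketch of an alternative technique, not the function above: it transposes the dataframe so that columns become rows and then reuses the row-level `duplicated()` API. The helper name `getDuplicateColumnsByTranspose` is made up for this example. Transposing a frame with mixed column types creates an object-dtype copy, so this is probably best suited to small or medium-sized frames.

```python
import pandas as pd

def getDuplicateColumnsByTranspose(df):
    '''
    Hypothetical helper: return the names of columns whose contents
    repeat an earlier column, in column order.
    '''
    # After transposing, each original column is a row, so the
    # row-level duplicated() flags repeated columns.
    mask = df.T.duplicated()
    # Convert the mask to a plain boolean array and select by position
    return df.columns[mask.to_numpy()].tolist()

players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
           ('Trevor', 28, 'England', 28, 'England', 28),
           ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
           ]
PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])

duplicateNames = getDuplicateColumnsByTranspose(PlayerObj)
print(duplicateNames)
```

Unlike the set-based function, this sketch returns the names in the order the columns appear in the dataframe.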

Drop duplicate columns in a DataFrame :

To drop the duplicate columns, we pass the list of duplicate column names returned by our function to dataframe.drop.

import pandas as pd

def getDuplicateColumns(df):
    '''
    Get a list of duplicate columns.
    It iterates over all the columns and finds the duplicate columns in the dataframe.
    :param df: Dataframe object
    :return: List of names of columns whose contents duplicate an earlier column
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns
    for x in range(df.shape[1]):
        # Select column at xth index of dataframe.
        col = df.iloc[:, x]
        # Iterate over all the columns from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index of dataframe.
            otherCol = df.iloc[:, y]
            # Check if two columns x & y are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)

def main():
    # List of Tuples
    players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
            ('Vishal', 24, 'India', 24, 'India', 24),
            ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
            ('Trevor', 28, 'England', 28, 'England', 28),
            ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
            ]
    # Creation of DataFrame object
    PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])
    print("Original Dataframe is:")
    print(PlayerObj)
    # To get list of duplicate columns
    duplicateColumnNames = getDuplicateColumns(PlayerObj)
    print('Duplicate Columns are: ')
    for ele in duplicateColumnNames:
        print('Column name is : ', ele)

    # Delete duplicate columns, reusing the list computed above
    print('After removing duplicate columns new data frame becomes: ')
    newDf = PlayerObj.drop(columns=duplicateColumnNames)
    print("Modified Dataframe is: ", newDf)

if __name__ == '__main__':
    main()
Output :
Original Dataframe is:
      Name  Age       Country  Address       Citizen  Jersey
0   Nathan   35     Australia       35     Australia      35
1   Vishal   24         India       24         India      24
2  Abraham   34  South Africa       34  South Africa      34
3   Trevor   28       England       28       England      28
4    Kumar   42      SriLanka       42      SriLanka      42
Duplicate Columns are:
Column name is :  Jersey
Column name is :  Citizen
Column name is :  Address
After removing duplicate columns new data frame becomes:
Modified Dataframe is:        Name  Age       Country
0   Nathan   35     Australia
1   Vishal   24         India
2  Abraham   34  South Africa
3   Trevor   28       England
4    Kumar   42      SriLanka
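
The find and drop steps can also be combined into a single selection. The following is a minimal sketch of an alternative to the `getDuplicateColumns` plus `drop` approach, not a replacement for it: it transposes the dataframe, uses the row-level `duplicated()` API to mark repeated columns, and keeps only the first occurrence of each. The variable names `keepMask` and `dedupedDf` are made up for this example.

```python
import pandas as pd

players = [('Nathan', 35, 'Australia', 35, 'Australia', 35),
           ('Vishal', 24, 'India', 24, 'India', 24),
           ('Abraham', 34, 'South Africa', 34, 'South Africa', 34),
           ('Trevor', 28, 'England', 28, 'England', 28),
           ('Kumar', 42, 'SriLanka', 42, 'SriLanka', 42),
           ]
PlayerObj = pd.DataFrame(players, columns=['Name', 'Age', 'Country', 'Address', 'Citizen', 'Jersey'])

# True for columns that are the first occurrence of their contents
keepMask = ~PlayerObj.T.duplicated()

# Select the kept columns by position using the boolean mask
dedupedDf = PlayerObj.loc[:, keepMask.to_numpy()]
print(dedupedDf)
```

Because the mask selects columns by position rather than by label, this sketch should also behave sensibly when two columns share the same name, a case where dropping by label would remove every column carrying that label.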