Pandas : Merge Dataframes on specific columns or on index in Python – Part 2

Merge Dataframes on specific columns or on index in Python

In this article, we will learn to merge dataframes on basis of given columns or index.

Dataframe.merge() :

Dataframe class of Python’s Pandas library provide a function i.e. merge() which helps in merging of two DataFrames.

Syntax: DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Arguments:-

  • right : A dataframe that is to be merged with the calling dataframe.
  • how : (Merge type). Some values are : left, right, outer, inner. Its default value is ‘inner’. If the two dataframes contains different columns, then based how value, columns will be considered accordingly for merging.
  • on : It is the column name on which merge will be done. If not provided then merged done on basis of indexes.
  • left_on : Column in left dataframe where merging is to be done.
  • right_on : Column in right dataframe, where merging is to be done.
  • left_index : (bool), default is False (If found True index index from left dataframe selected as join key)
  • right_index : (bool), default is False (If found True index index from right dataframe selected as join key)
  • suffixes : tuple of (str, str), default (‘_x’, ‘_y’) (Suffix that is to be applied on overlapping columns in left and right dataframes respectively.)

Merging Dataframe on a given column name as join key :

Let’s take a scenario where the columns names are same, but contents are different i.e. one column data is of int type and other column data is of string type. And if we apply merge() on them without passing any argument, it wouldn’t merge here. Here, we can merge dataframes on a single column by passing on argument in merge() function.

And as both dataframes have common column i.e. sponsered, so after merging columns are named by default. It will splitted by taking a suffix  i.e. Sponsered_x and Sponsered_y as left and right dataframe respectively.

import pandas as sc
# List of Tuples
players = [(15,'Smith','Pune', 17,12000),
            (99,'Rana', 'Mumbai', 20,2000),
            (51,'Jaydev','Kolkata', 22,25640),
            (31,'Shikhar','Hyderabad', 28,85410),
            (12,'Sanju','Rajasthan', 21,63180),
            (35,'Raina','Gujarat', 18,62790)
            ]
# Creation of DataFrame object
playDFObj = sc.DataFrame(players, columns=['JersyN','Name', 'Team', 'Age','Salary'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 1 : ')
print(playDFObj)
# List of Tuples
moreInfo = [(15, 13, 180000, 'Nissin') ,
           (99, 2, 195200, 'Jio') ,
           (51, 7, 15499, 'Lays') ,
           (31, 17, 654000, 'AmbujaC') ,
           (12, 5, 201000, 'AsianP') ,
           (35, 14, 741000, 'Airtel')
            ]
# Creation of DataFrame object
moreinfoObj = sc.DataFrame(moreInfo, columns=['JersyN', 'PLayingSince' , 'Salary', 'Sponsered'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 2 : ')
print(moreinfoObj)
# Merge two Dataframes on basis of common column by default INNER JOIN
mergedDataf = playDFObj.merge(moreinfoObj, on='JersyN')
print('After merging: ')
print(mergedDataf)
Output :

DataFrame 1 : 
 JersyN   Name    Team       Age    Salary
I    15    Smith      Pune         17    12000
II   99    Rana       Mumbai    20    2000
III  51    Jaydev    Kolkata      22   25640
IV  31   Shikhar   Hyderabad 28   85410
V  12    Sanju      Rajasthan   21   63180
VI  35    Raina     Gujarat       18   62790
DataFrame 2 : 
   JersyN PLayingSince   Salary       Sponsered
I    15             13            180000       Nissin
II    99            2              195200         Jio
III   51            7              15499          Lays
IV   31          17              654000     AmbujaC
V   12           5                201000       AsianP
VI   35          14              741000         Airtel
After merging: 
  JersyN     Name    Team            Age Salary_x    PLayingSince Salary_y Sponsered
0  15          Smith    Pune            17       12000       13              180000   Nissin
1  99          Rana     Mumbai       20       2000          2               195200   Jio
2  51          Jaydev   Kolkata        22       25640        7               15499    Lays
3  31          Shikhar  Hyderabad   28      85410       17              654000  AmbujaC
4  12          Sanju     Rajasthan     21      63180        5               201000  AsianP
5  5           Raina    Gujarat           18      62790       14              741000  Airtel

Merging Dataframe on a given column with suffix for similar column names :

In previous example, for common columns with dissimilar contents suffix x & y are added. We can also add our own custom suffix.

import pandas as sc
# List of Tuples
players = [(15,'Smith','Pune', 17,12000),
            (99,'Rana', 'Mumbai', 20,2000),
            (51,'Jaydev','Kolkata', 22,25640),
            (31,'Shikhar','Hyderabad', 28,85410),
            (12,'Sanju','Rajasthan', 21,63180),
            (35,'Raina','Gujarat', 18,62790)
            ]
# Creation of DataFrame object
playDFObj = sc.DataFrame(players, columns=['JersyN','Name', 'Team', 'Age','Sponsered'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 1 : ')
print(playDFObj)
# List of Tuples
moreInfo = [(15, 13, 180000, 'Nissin') ,
           (99, 2, 195200, 'Jio') ,
           (51, 7, 15499, 'Lays') ,
           (31, 17, 654000, 'AmbujaC') ,
           (12, 5, 201000, 'AsianP') ,
           (35, 14, 741000, 'Airtel')
            ]
# Creation of DataFrame object
moreinfoObj = sc.DataFrame(moreInfo, columns=['JersyN', 'PLayingSince' , 'Salary', 'Sponsered'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 2 : ')
print(moreinfoObj)
# Merge two Dataframes on basis of common column by default INNER JOIN
mergedDataf = playDFObj.merge(moreinfoObj, on='JersyN',suffixes=('_Price', '_Companies'))
print('After merging: ')
print(mergedDataf)
Output :
DataFrame 1 : 
  JersyN Name  Team          Age Sponsered
I 15       Smith   Pune           17   12000
II 99      Rana     Mumbai     20   2000
III 51     Jaydev   Kolkata      22   25640
IV 31    Shikhar  Hyderabad 28   85410
V 12     Sanju     Rajasthan   21   63180
VI 35    Raina     Gujarat      18    62790
DataFrame 2 : 
  JersyN PLayingSince Salary     Sponsered
I   15      13                 180000   Nissin
II  99     2                    195200    Jio
III  51     7                   15499      Lays
IV  31   17                   654000   AmbujaC
V  12     5                    201000    AsianP
VI  35   14                   741000     Airtel
After merging: 
JersyN Name Team ... PLayingSince Salary Sponsered_Companies
0 15 Smith Pune ... 13 180000 Nissin
1 99 Rana Mumbai ... 2 195200 Jio
2 51 Jaydev Kolkata ... 7 15499 Lays
3 31 Shikhar Hyderabad ... 17 654000 AmbujaC
4 12 Sanju Rajasthan ... 5 201000 AsianP
5 35 Raina Gujarat ... 14 741000 Airtel

Merging Dataframe different columns :

Now let’s take a scenario of changing name of JersyN column of a dataframe and try to merge it with another dataframe.

import pandas as sc
# List of Tuples
players = [(15,'Smith','Pune', 17,12000),
            (99,'Rana', 'Mumbai', 20,2000),
            (51,'Jaydev','Kolkata', 22,25640),
            (31,'Shikhar','Hyderabad', 28,85410),
            (12,'Sanju','Rajasthan', 21,63180),
            (35,'Raina','Gujarat', 18,62790)
            ]
# Creation of DataFrame object
playDFObj = sc.DataFrame(players, columns=['JersyN','Name', 'Team', 'Age','Salary'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 1 : ')
print(playDFObj)
# List of Tuples
moreInfo = [(15, 13, 180000, 'Nissin') ,
           (99, 2, 195200, 'Jio') ,
           (51, 7, 15499, 'Lays') ,
           (31, 17, 654000, 'AmbujaC') ,
           (12, 5, 201000, 'AsianP') ,
           (35, 14, 741000, 'Airtel')
            ]
# Creation of DataFrame object
moreinfoObj = sc.DataFrame(moreInfo, columns=['JersyN', 'PLayingSince' , 'Salary', 'Sponsered'], index=['I', 'II', 'III', 'IV', 'V', 'VI'])
print('DataFrame 2 : ')
print(moreinfoObj)
# Rename column JersyN to ShirtN
moreinfoObj.rename(columns={'JersyN': 'ShirtN'}, inplace=True)
# Merge two Dataframes on basis of common column by default INNER JOIN
mergedDataf = playDFObj.merge(moreinfoObj, left_on='JersyN', right_on='ShirtN')
print('After merging: ')
print(mergedDataf)
Output ;
DataFrame 1 : 
 JersyN Name Team         Age   Salary
I 15    Smith     Pune           17   12000
II 99   Rana      Mumbai      20    2000
III 51  Jaydev   Kolkata        22    25640
IV 31  Shikhar  Hyderabad  28    85410
V 12   Sanju     Rajasthan     21   63180
VI 35  Raina     Gujarat        18    62790
DataFrame 2 : 
   JersyN  PLayingSince   Salary    Sponsered
I   15              13           180000    Nissin
II   99             2             195200    Jio
III  51             7             15499     Lays
IV  31           17             654000   AmbujaC
V  12             5              201000   AsianP
VI  35           14             741000   Airtel
After merging: 
JersyN Name Team        Age ... ShirtN PLayingSince Salary Sponsered_y
0 15 Smith   Pune            17 ...   15           13             180000   Nissin
1 99 Rana     Mumbai      20 ...   99            2              195200   Jio
2 51 Jaydev  Kolkata       22 ...    51            7              15499    Lays
3 31 Shikhar Hyderabad 28 ...    31            17            654000  AmbujaC
4 12 Sanju    Rajasthan    21 ...   12             5             201000  AsianP
5 35 Raina    Gujarat       18 ...    35            14            741000  Airtel

[6 rows x 9 columns]

Want to expert in the python programming language? Exploring Python Data Analysis using Pandas tutorial changes your knowledge from basic to advance level in python concepts.

Read more Articles on Python Data Analysis Using Padas