Sum rows in Dataframe ( all or certain rows) in Python
In this article we will discuss how we can merge rows into a dataframe and add values as a new queue to the same dataframe.
So, let’s start exploring the topic.
First, we will build a Dataframe,
import pandas as pd import numpy as np # The List of Tuples salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111), ('Rabi', 2122, 3022, 3456, 3111, 2109), ('Abhi', np.NaN, 2334, 2077, np.NaN, 3122), ('Naresh', 3050, 3050, 2010, 2122, 1111), ('Suman', 2023, 2232, 3050, 2123, 1099), ('Viroj', 2050, 2510, np.NaN, 3012, 2122), ('Nabin', 4000, 2000, 2050, np.NaN, 2111)] # By Creating a DataFrame object from list of tuples test = pd.DataFrame(salary_of_employees, columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May']) # To Set column Name as the index of dataframe test.set_index('Name', inplace=True) print(test)
Output : Jan Feb March April May Name Amit 2000.0 2050 1099.0 2134.0 2111 Rabi 2122.0 3022 3456.0 3111.0 2109 Abhi NaN 2334 2077.0 NaN 3122 Naresh 3050.0 3050 2010.0 2122.0 1111 Suman 2023.0 2232 3050.0 2123.0 1099 Viroj 2050.0 2510 NaN 3012.0 2122 Nabin 4000.0 2000 2050.0 NaN 2111
This Dataframe contains employee salaries from January to May. We’ve created a column name as a data name index. Each line of this dataframe contains the employee’s salary from January to May.
Get the sum of all rows in a Pandas Dataframe :
Let’s say in the above dataframe, we want to get details about the total salary paid each month. Basically, we want a Series that contains the total number of rows and columns eg. each item in the Series should contain a total column value.
Let’s see how we can find that series,
import pandas as pd import numpy as np # The List of Tuples salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111), ('Rabi', 2122, 3022, 3456, 3111, 2109), ('Abhi', np.NaN, 2334, 2077, np.NaN, 3122), ('Naresh', 3050, 3050, 2010, 2122, 1111), ('Suman', 2023, 2232, 3050, 2123, 1099), ('Viroj', 2050, 2510, np.NaN, 3012, 2122), ('Nabin', 4000, 2000, 2050, np.NaN, 2111)] # By Creating a DataFrame object from list of tuples test = pd.DataFrame(salary_of_employees, columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May']) # To Set column Name as the index of dataframe test.set_index('Name', inplace=True) #By getting sum of all rows in the Dataframe as a Series total = test.sum() print('Total salary paid in each month:') print(total)
Output : Total salary paid in each month: Jan 15245.0 Feb 17198.0 March 13742.0 April 12502.0 May 13785.0 dtype: float64
We have called the sum()
function in dataframe without parameter. So, it automatically considered the axis as 0 and added all the columns wisely i.e. added all values to each column and returned a string item containing those values. Each item in this series item contains the total amount paid in monthly installments and the name of the month in the index label for that entry.
We can add this Series as a new line to the dataframe i.e.
import pandas as pd import numpy as np # The List of Tuples salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111), ('Rabi', 2122, 3022, 3456, 3111, 2109), ('Abhi', np.NaN, 2334, 2077, np.NaN, 3122), ('Naresh', 3050, 3050, 2010, 2122, 1111), ('Suman', 2023, 2232, 3050, 2123, 1099), ('Viroj', 2050, 2510, np.NaN, 3012, 2122), ('Nabin', 4000, 2000, 2050, np.NaN, 2111)] # By Creating a DataFrame object from list of tuples test = pd.DataFrame(salary_of_employees, columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May']) # To Set column Name as the index of dataframe test.set_index('Name', inplace=True) # By getting sum of all rows as a new row in Dataframe total = test.sum() total.name = 'Total' # By assignimg sum of all rows of DataFrame as a new Row test = test.append(total.transpose()) print(test)
Output : Jan Feb March April May Name Amit 2000.0 2050.0 1099.0 2134.0 2111.0 Rabi 2122.0 3022.0 3456.0 3111.0 2109.0 Abhi NaN 2334.0 2077.0 NaN 3122.0 Naresh 3050.0 3050.0 2010.0 2122.0 1111.0 Suman 2023.0 2232.0 3050.0 2123.0 1099.0 Viroj 2050.0 2510.0 NaN 3012.0 2122.0 Nabin 4000.0 2000.0 2050.0 NaN 2111.0 Total 15245.0 17198.0 13742.0 12502.0 13785.0
Added a new line to the dataframe and ‘Total’ reference label. Each entry in this line contains the amount of details paid per month.
How did it work?
We have passed the Series to create a Dataframe in one line. All references in the series became columns in the new dataframe. Then add this new data name to the original dataframe. The result was that I added a new line to the dataframe.
Get Sum of certain rows in Dataframe by row numbers :
In the previous example we added all the rows of data but what if we want to get a total of only a few rows of data? As with the data above we want the total value in the top 3 lines eg to get the total monthly salary for only 3 employees from the top,
import pandas as pd import numpy as np # The List of Tuples salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111), ('Rabi', 2122, 3022, 3456, 3111, 2109), ('Abhi', np.NaN, 2334, 2077, np.NaN, 3122), ('Naresh', 3050, 3050, 2010, 2122, 1111), ('Suman', 2023, 2232, 3050, 2123, 1099), ('Viroj', 2050, 2510, np.NaN, 3012, 2122), ('Nabin', 4000, 2000, 2050, np.NaN, 2111)] # By Creating a DataFrame object from list of tuples test = pd.DataFrame(salary_of_employees, columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May']) # To Set column Name as the index of dataframe test.set_index('Name', inplace=True) #By getting sum of values of top 3 DataFrame rows, sumtabOf = test.iloc[0:3].sum() print(sumtabOf)
Output : Jan 4122.0 Feb 7406.0 March 6632.0 April 5245.0 May 7342.0 dtype: float64
We selected the first 3 lines of the data file and called the total () for that. Returns a series containing the total monthly salary paid to selected employees only which means for the first three lines of the actual data list.
Get the sum of specific rows in Pandas Dataframe by index/row label :
Unlike the previous example, we can select specific lines with the reference label and find the value of values in those selected lines only i.e.
import pandas as pd import numpy as np # The List of Tuples salary_of_employees = [('Amit', 2000, 2050, 1099, 2134, 2111), ('Rabi', 2122, 3022, 3456, 3111, 2109), ('Abhi', np.NaN, 2334, 2077, np.NaN, 3122), ('Naresh', 3050, 3050, 2010, 2122, 1111), ('Suman', 2023, 2232, 3050, 2123, 1099), ('Viroj', 2050, 2510, np.NaN, 3012, 2122), ('Nabin', 4000, 2000, 2050, np.NaN, 2111)] # By Creating a DataFrame object from list of tuples test = pd.DataFrame(salary_of_employees, columns=['Name', 'Jan', 'Feb', 'March', 'April', 'May']) # To Set column Name as the index of dataframe test.set_index('Name', inplace=True) # By getting sum of 3 DataFrame rows (selected by index labels) sumtabOf = test.loc[['Amit', 'Naresh', 'Viroj']].sum() print(sumtabOf)
Output : Jan 7100.0 Feb 7610.0 March 3109.0 April 7268.0 May 5344.0 dtype: float64
We have selected 3 lines of data name with the reference label namely ‘Amit’, ‘Naresh’ and ‘Viroj’. We then added the queue values for these selected employees only. Return a series with the total amount of salary paid per month to those selected employees per month only wisely.
Conclusion:
So in the above cases we found out that to sum the multiple rows given in a dataframe.