Pandas Math Functions for Data Analysis

Python is well suited to data analysis, largely because of its ecosystem of data-centric packages. Pandas is one of these packages, and it greatly simplifies importing and analyzing data.
Pandas provides several essential math operations that can be applied to a Series or a DataFrame to simplify data analysis and save a significant amount of time.

Data analysis is the extraction of meaningful information from a raw data source. This information gives us an idea of how the data is distributed and structured.

Let us go through the following Pandas math functions:

  • mean() function
  • sum() function
  • median() function
  • min() and max() functions
  • value_counts() function
  • describe() function

This article uses the cereal dataset as an example.

Before looking at these functions, first import the dataset.

Importing the Dataset

Import the dataset into a Pandas DataFrame.

Approach:

  • Import the pandas module as pd using the import keyword.
  • Import the dataset using the read_csv() function, passing the dataset's file name as an argument.
  • Store the result in a variable.
  • Print the dataset if you want to inspect it (here we only import it).
  • Exit the program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')

This stores the dataset in the variable ‘cereal_dataset’ as a DataFrame.
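To check that an import worked, it can help to look at the shape, column types, and first rows of the resulting DataFrame. The sketch below is only an illustration: since 'cereal.csv' may not be on hand, it reads a made-up three-row sample from an in-memory string (the rows and the name `sample` are assumptions, not part of the cereal dataset).

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for 'cereal.csv'; the values are made up.
csv_text = """name,mfr,type,calories,protein
Cereal A,N,C,70,4
Cereal B,Q,C,120,3
Cereal C,K,H,100,2
"""

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # shows which columns are text and which are numeric
print(sample.head(2))  # first two rows
```

The same three calls work unchanged on cereal_dataset once the real file has been loaded.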

1)mean() function in Pandas

The mean is a statistic that summarizes a whole distribution of data in a single value: the sum of the values divided by their count.

We can get the mean of a single column or of many columns, up to the complete dataset, by using the DataFrame.mean() function.

Apply the mean() function to the dataset to get the mean of every numeric column. The cereal dataset also contains text columns (name, mfr, type), so pass numeric_only=True; pandas 2.0 and later raise a TypeError when asked to average text columns.

cereal_dataset.mean(numeric_only=True)

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the mean() function to the above dataset to get the mean of all the
# numeric columns in the dataset.
cereal_dataset.mean(numeric_only=True)

Output:

calories    106.883117
protein       2.545455
fat           1.012987
sodium      159.675325
fiber         2.151948
carbo        14.597403
sugars        6.922078
potass       96.077922
vitamins     28.246753
shelf         2.207792
weight        1.029610
cups          0.821039
rating       42.665705
dtype: float64
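The text mentions taking the mean of a single column or of several columns, so here is a minimal sketch of both. The DataFrame `sample` and its values are invented for illustration and are not taken from cereal.csv.

```python
import pandas as pd

# Hypothetical data; not taken from cereal.csv.
sample = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C", "Cereal D"],
    "calories": [70, 120, 100, 110],
    "protein": [4, 3, 2, 3],
})

# Mean of one column returns a single number.
calories_mean = sample["calories"].mean()

# Mean of selected columns returns a Series with one entry per column.
subset_mean = sample[["calories", "protein"]].mean()

# Mean of every numeric column; numeric_only=True skips the text column.
all_mean = sample.mean(numeric_only=True)

print(calories_mean)
print(subset_mean)
print(all_mean)
```

Selecting the columns first, as in `subset_mean`, is one way to avoid text columns without relying on numeric_only.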

2)sum() function in Pandas

In addition to the mean() function, we can use the Pandas sum() function to get the total of the values in each column. This gives us a more quantitative view of the data.

Apply the sum() function to the dataset to calculate the sum of each column in the entire dataset. For text columns such as name, mfr, and type, the "sum" is the concatenation of all the strings, as the output below shows.

cereal_dataset.sum()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the sum() function to the above dataset to calculate the 
# sum of each column in the entire dataset.
cereal_dataset.sum()

Output:

name        100% Bran100% Natural BranAll-BranAll-Bran wit...
mfr         NQKKRGKGRPQGGGGRKKGKNKGRKKKPKPPGPPPQGPKKGQGARR...
type        CCCCCCCCCCCCCCCCCCCCHCCCCCCCCCCCCCCCCCCCCCCHCC...
calories                                                 8230
protein                                                   196
fat                                                        78
sodium                                                  12295
fiber                                                   165.7
carbo                                                    1124
sugars                                                    533
potass                                                   7398
vitamins                                                 2175
shelf                                                     170
weight                                                  79.28
cups                                                    63.22
rating                                                3285.26
dtype: object
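The concatenated strings in the output above are rarely useful. As a small sketch, the totals can be limited to numeric columns, or taken for one column at a time. The DataFrame `sample` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; not taken from cereal.csv.
sample = pd.DataFrame({
    "mfr": ["N", "Q", "K"],
    "fat": [1, 5, 0],
    "sugars": [6, 8, 5],
})

# Sum only the numeric columns; the text column 'mfr' is left out.
numeric_totals = sample.sum(numeric_only=True)

# Sum of a single column returns a single number.
fat_total = sample["fat"].sum()

print(numeric_totals)
print(fat_total)
```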

3)median() function in Pandas

The median() function returns the 50th percentile, or central value, of a set of data.

Apply the median() function to the dataset to get the 50th percentile of every numeric column. Pass numeric_only=True so that the text columns (name, mfr, type) are skipped; pandas 2.0 and later raise a TypeError otherwise.

cereal_dataset.median(numeric_only=True)

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the median() function on the dataset to get the 50th percentile or
# central value of all numeric columns of the dataset.
cereal_dataset.median(numeric_only=True)

Output:

calories    110.000000
protein       3.000000
fat           1.000000
sodium      180.000000
fiber         2.000000
carbo        14.000000
sugars        7.000000
potass       90.000000
vitamins     25.000000
shelf         2.000000
weight        1.000000
cups          0.750000
rating       40.400208
dtype: float64
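One reason to look at the median alongside the mean is that the median is much less affected by extreme values. The sketch below uses an invented column with one deliberately large value to show the difference; it also checks that the median matches the 0.5 quantile.

```python
import pandas as pd

# Hypothetical values with one deliberate outlier (900); not taken from cereal.csv.
sample = pd.DataFrame({"sodium": [100, 110, 120, 130, 900]})

sodium_mean = sample["sodium"].mean()      # pulled upward by the outlier
sodium_median = sample["sodium"].median()  # middle value of the sorted data
sodium_q50 = sample["sodium"].quantile(0.5)  # the 50th percentile

print(sodium_mean)    # 272.0
print(sodium_median)  # 120.0
print(sodium_q50)     # 120.0
```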

4)min() and max() functions in Pandas

We can get the minimum and maximum values of every column in the dataset, or of a single column, using the min() and max() functions.

Apply the max() function to the dataset to get the maximum value of each column.

cereal_dataset.max()

Similarly, apply the min() function to get the minimum value of each column.

cereal_dataset.min()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the max() function on the dataset to get the maximum values
# of each column in the dataset.
print("The maximum values of each column in the dataset:")
print(cereal_dataset.max())

Output:

The maximum values of each column in the dataset:
name        Wheaties Honey Gold
mfr                           R
type                          H
calories                    160
protein                       6
fat                           5
sodium                      320
fiber                        14
carbo                        23
sugars                       15
potass                      330
vitamins                    100
shelf                         3
weight                      1.5
cups                        1.5
rating                  93.7049
dtype: object
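The example above only shows max(). As a rough sketch of the other cases mentioned in this section, the code below takes the minimum of each numeric column, the maximum of a single column, and uses idxmax() to find which row holds that maximum. The DataFrame `sample` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; not taken from cereal.csv.
sample = pd.DataFrame({
    "name": ["Cereal A", "Cereal B", "Cereal C"],
    "calories": [70, 120, 100],
    "rating": [68.4, 33.9, 59.4],
})

# Minimum of each numeric column.
col_min = sample.min(numeric_only=True)

# Maximum of a single column.
rating_max = sample["rating"].max()

# idxmax() returns the index label of the row holding the maximum,
# which .loc can then use to fetch the whole row.
best_row = sample.loc[sample["rating"].idxmax()]

print(col_min)
print(rating_max)
print(best_row["name"])
```

For text columns, min() and max() compare strings alphabetically, which is why the output above lists the last name in alphabetical order as the "maximum".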

5)value_counts() function in Pandas

The value_counts() function returns the number of times each distinct value (category or group) occurs in a variable. It is useful when dealing with categorical variables.

Apply the value_counts() function to the vitamins column of the dataset to get the count of each distinct value in that column.

cereal_dataset.vitamins.value_counts()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the value_counts() function on the dataset vitamins variable to obtain the 
# count of each group in the variable as a separate category.
cereal_dataset.vitamins.value_counts()

Output:

25     63
0       8
100     6
Name: vitamins, dtype: int64
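Counts are often easier to compare as proportions. A minimal sketch, using an invented vitamins column rather than the real data, shows the normalize=True option, which returns each value's share of the total instead of its raw count.

```python
import pandas as pd

# Hypothetical values; not taken from cereal.csv.
sample = pd.DataFrame({"vitamins": [25, 25, 0, 100, 25, 0]})

# Raw counts, sorted from most to least frequent.
counts = sample["vitamins"].value_counts()

# Proportions instead of counts; the values add up to 1.
shares = sample["vitamins"].value_counts(normalize=True)

print(counts)
print(shares)
```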

6)describe() function in Pandas

The describe() function returns the summary statistics of the dataset all at once: the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

Apply the describe() function to the dataset to obtain all of these statistics in a single table.

cereal_dataset.describe()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() function to the dataset to obtain the statistical
# information of the given dataset all at once 
cereal_dataset.describe()

Output:

         calories     protein         fat      sodium       fiber       carbo      sugars      potass    vitamins       shelf      weight        cups      rating
count   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000
mean   106.883117    2.545455    1.012987  159.675325    2.151948   14.597403    6.922078   96.077922   28.246753    2.207792    1.029610    0.821039   42.665705
std     19.484119    1.094790    1.006473   83.832295    2.383364    4.278956    4.444885   71.286813   22.342523    0.832524    0.150477    0.232716   14.047289
min     50.000000    1.000000    0.000000    0.000000    0.000000   -1.000000   -1.000000   -1.000000    0.000000    1.000000    0.500000    0.250000   18.042851
25%    100.000000    2.000000    0.000000  130.000000    1.000000   12.000000    3.000000   40.000000   25.000000    1.000000    1.000000    0.670000   33.174094
50%    110.000000    3.000000    1.000000  180.000000    2.000000   14.000000    7.000000   90.000000   25.000000    2.000000    1.000000    0.750000   40.400208
75%    110.000000    3.000000    2.000000  210.000000    3.000000   17.000000   11.000000  120.000000   25.000000    3.000000    1.000000    1.000000   50.828392
max    160.000000    6.000000    5.000000  320.000000   14.000000   23.000000   15.000000  330.000000  100.000000    3.000000    1.500000    1.500000   93.704912
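The result of describe() is itself a DataFrame, so individual statistics can be read from it with .loc. The sketch below, on made-up data, also shows include="all", which adds text columns to the summary (with count, unique, top, and freq rows). The DataFrame `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data; not taken from cereal.csv.
sample = pd.DataFrame({
    "mfr": ["K", "K", "G", "K"],
    "calories": [70, 120, 100, 110],
    "fiber": [10.0, 2.0, 1.0, 3.0],
})

# By default only numeric columns are summarized.
summary = sample.describe()

# Read single statistics by row label and column name.
calories_mean = summary.loc["mean", "calories"]
fiber_max = summary.loc["max", "fiber"]

# include="all" also summarizes the text column.
summary_all = sample.describe(include="all")

print(summary)
print(calories_mean, fiber_max)
print(summary_all.loc[["count", "unique", "top", "freq"], "mfr"])
```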