Pandas Math Functions for Data Analysis

Python is well suited to data analysis, largely because of its ecosystem of data-centric packages. Pandas is one of these packages, and it greatly simplifies importing and analyzing data.
Pandas provides several essential math operations that can be applied to a Series or a DataFrame, which makes data analysis in Python easier and saves a significant amount of time.

Data analysis is the extraction of meaningful information from a raw data source. This information gives us an idea of how the data is distributed and structured.

Let us go through the following Pandas math functions:

mean() function
sum() function
median() function
min() and max() functions
value_counts() function
describe() function

The examples below use the cereal dataset.

Before looking at these Pandas math functions, first import the dataset.

Importing the Dataset

Import the dataset into a Pandas DataFrame.

Approach:

Import the pandas module as pd using the import keyword.
Import the dataset using the read_csv() function by passing the dataset file name as an argument to it.
Store it in a variable.
Print the dataset if you want to inspect it (here we only import it).
Exit the program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')

This stores the dataset in the variable ‘cereal_dataset’ as a DataFrame.
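If cereal.csv is not at hand, the import step can still be tried out on a small stand-in. The sketch below is only an illustration: the three rows and their names ("Sample A" and so on) are made up rather than taken from the real file, and io.StringIO is used so that read_csv() reads from a string instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for cereal.csv (made-up rows).
csv_text = """name,mfr,type,calories,protein
Sample A,N,C,70,4
Sample B,Q,C,120,3
Sample C,K,H,110,2
"""

# read_csv() accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Inspect the size and the column types that pandas inferred.
print(sample_df.shape)
print(sample_df.dtypes)
```

Checking the shape and the dtypes right after the import is a quick way to confirm that numeric columns were actually read as numbers.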

1)mean() function in Pandas

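The mean is a statistical value that summarizes a distribution with a single number: the sum of the values divided by how many values there are.

We can compute the mean of a single column, of several columns, or of the complete dataset by using the DataFrame.mean() function.

Apply the mean() function to the dataset to get the mean of every numeric column. In pandas 2.0 and later, pass numeric_only=True so that text columns such as name, mfr, and type are skipped instead of raising a TypeError.

cereal_dataset.mean(numeric_only=True)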

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the mean() function to the above dataset to get the mean of all the
# numeric columns in the dataset (numeric_only=True skips the text columns).
cereal_dataset.mean(numeric_only=True)

Output:

calories    106.883117
protein       2.545455
fat           1.012987
sodium      159.675325
fiber         2.151948
carbo        14.597403
sugars        6.922078
potass       96.077922
vitamins     28.246753
shelf         2.207792
weight        1.029610
cups          0.821039
rating       42.665705
dtype: float64
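The mean of a single column or of a few selected columns works the same way. The sketch below uses a small hand-made DataFrame with hypothetical values, not the real cereal data, so that the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to check.
sample_df = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C", "Sample D"],
    "calories": [70, 120, 110, 100],
    "protein": [4, 3, 2, 3],
})

# Mean of one column: (70 + 120 + 110 + 100) / 4 = 100.0
calories_mean = sample_df["calories"].mean()

# Mean of a chosen subset of columns (returns a Series).
subset_mean = sample_df[["calories", "protein"]].mean()

# Mean of every numeric column; the text column "name" is skipped.
all_numeric_mean = sample_df.mean(numeric_only=True)

print(calories_mean)
print(subset_mean)
```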

2)sum() function in Pandas

In addition to the mean() function, we can use the Pandas sum() function to get the total of the values in each column. This gives us a more quantitative view of the data.

Apply the sum() function to the dataset to calculate the sum of each column in the entire dataset. For text columns such as name, mfr, and type, sum() concatenates the strings, as the output below shows.

cereal_dataset.sum()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the sum() function to the above dataset to calculate the 
# sum of each column in the entire dataset.
cereal_dataset.sum()

Output:

name        100% Bran100% Natural BranAll-BranAll-Bran wit...
mfr         NQKKRGKGRPQGGGGRKKGKNKGRKKKPKPPGPPPQGPKKGQGARR...
type        CCCCCCCCCCCCCCCCCCCCHCCCCCCCCCCCCCCCCCCCCCCHCC...
calories                                                 8230
protein                                                   196
fat                                                        78
sodium                                                  12295
fiber                                                   165.7
carbo                                                    1124
sugars                                                    533
potass                                                   7398
vitamins                                                 2175
shelf                                                     170
weight                                                  79.28
cups                                                    63.22
rating                                                3285.26
dtype: object
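The concatenated text in the name, mfr, and type rows is rarely useful. A minimal sketch of restricting the totals to numeric columns follows; it assumes a small DataFrame with hypothetical values rather than the real cereal data.

```python
import pandas as pd

# Hypothetical values, not rows from cereal.csv.
sample_df = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C"],
    "calories": [70, 120, 110],
    "fat": [1, 5, 1],
})

# numeric_only=True leaves the text column out of the result.
numeric_totals = sample_df.sum(numeric_only=True)

# Total of a single column: 70 + 120 + 110 = 300
calories_total = sample_df["calories"].sum()

print(numeric_totals)
print(calories_total)
```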

3)median() function in Pandas

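The median() function returns the 50th percentile, that is, the central value of a sorted set of data.

Apply the median() function to the dataset to get the 50th percentile (central value) of every numeric column. As with mean(), pass numeric_only=True in pandas 2.0 and later so that the text columns are skipped.

cereal_dataset.median(numeric_only=True)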

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the median() function on the dataset to get the 50th percentile or
# central value of all numeric columns of the dataset.
cereal_dataset.median(numeric_only=True)

Output:

calories    110.000000
protein       3.000000
fat           1.000000
sodium      180.000000
fiber         2.000000
carbo        14.000000
sugars        7.000000
potass       90.000000
vitamins     25.000000
shelf         2.000000
weight        1.000000
cups          0.750000
rating       40.400208
dtype: float64
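One reason to look at the median alongside the mean is that the median is far less sensitive to extreme values. The following sketch illustrates this with a hypothetical column containing one deliberately large value; the numbers are made up for the illustration.

```python
import pandas as pd

# Hypothetical values with one outlier (400).
calories = pd.Series([100, 110, 110, 120, 400], name="calories")

# The middle value of the sorted data is 110.
calories_median = calories.median()

# The outlier pulls the mean up: 840 / 5 = 168.0
calories_mean = calories.mean()

print(calories_median, calories_mean)
```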

4)min() and max() functions in Pandas

We can get the minimum and maximum values of every column in the dataset, or of a single column of the DataFrame, using the min() and max() functions.

Apply the max() function to the dataset to get the maximum value of each column in the dataset.

cereal_dataset.max()

Similarly, apply the min() function to get the minimum value of each column of the dataset.

cereal_dataset.min()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the max() function on the dataset to get the maximum values
# of each column in the dataset.
print("The maximum values of each column in the dataset:")
cereal_dataset.max()

Output:

The maximum values of each column in the dataset:
name        Wheaties Honey Gold
mfr                           R
type                          H
calories                    160
protein                       6
fat                           5
sodium                      320
fiber                        14
carbo                        23
sugars                       15
potass                      330
vitamins                    100
shelf                         3
weight                      1.5
cups                        1.5
rating                  93.7049
dtype: object
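For text columns, max() returns the string that sorts last alphabetically, which is why the name row above shows Wheaties Honey Gold. When only the numeric extremes matter, the numeric columns can be selected first. The sketch below assumes a small DataFrame with hypothetical values and also shows one way to find which row holds the maximum.

```python
import pandas as pd

# Hypothetical values, not rows from cereal.csv.
sample_df = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C"],
    "calories": [70, 160, 110],
    "rating": [68.4, 33.9, 59.4],
})

# Keep only the numeric columns before taking the extremes.
numeric = sample_df.select_dtypes(include="number")
mins = numeric.min()
maxs = numeric.max()

# idxmax() gives the index label of the row with the largest rating.
top_rated = sample_df.loc[sample_df["rating"].idxmax(), "name"]

print(mins)
print(maxs)
print(top_rated)
```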

5)value_counts() function in Pandas

The value_counts() function returns how many times each distinct value occurs in a column. It is most useful for categorical variables.

Apply the value_counts() function to the vitamins column of the dataset to obtain the number of rows in each category.

cereal_dataset.vitamins.value_counts()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the value_counts() function on the dataset vitamins variable to obtain the 
# count of each group in the variable as a separate category.
cereal_dataset.vitamins.value_counts()

Output:

25     63
0       8
100     6
Name: vitamins, dtype: int64
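Counts are often easier to compare as proportions. A brief sketch on a hypothetical six-value column follows; normalize=True is a standard value_counts() parameter that returns fractions instead of raw counts.

```python
import pandas as pd

# Hypothetical values for a categorical-style numeric column.
vitamins = pd.Series([25, 25, 0, 100, 25, 0], name="vitamins")

# Raw counts, sorted from most to least frequent.
counts = vitamins.value_counts()

# Same categories expressed as fractions of the total.
shares = vitamins.value_counts(normalize=True)

print(counts)
print(shares)
```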

6)describe() function in Pandas

The describe() function gives us the main summary statistics of the dataset all at once: the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column.

Apply the describe() function to the dataset to obtain these statistics in a single table.

cereal_dataset.describe()

Example

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() function to the dataset to obtain the statistical
# information of the given dataset all at once 
cereal_dataset.describe()

Output:

	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
count	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000
mean	106.883117	2.545455	1.012987	159.675325	2.151948	14.597403	6.922078	96.077922	28.246753	2.207792	1.029610	0.821039	42.665705
std	19.484119	1.094790	1.006473	83.832295	2.383364	4.278956	4.444885	71.286813	22.342523	0.832524	0.150477	0.232716	14.047289
min	50.000000	1.000000	0.000000	0.000000	0.000000	-1.000000	-1.000000	-1.000000	0.000000	1.000000	0.500000	0.250000	18.042851
25%	100.000000	2.000000	0.000000	130.000000	1.000000	12.000000	3.000000	40.000000	25.000000	1.000000	1.000000	0.670000	33.174094
50%	110.000000	3.000000	1.000000	180.000000	2.000000	14.000000	7.000000	90.000000	25.000000	2.000000	1.000000	0.750000	40.400208
75%	110.000000	3.000000	2.000000	210.000000	3.000000	17.000000	11.000000	120.000000	25.000000	3.000000	1.000000	1.000000	50.828392
max	160.000000	6.000000	5.000000	320.000000	14.000000	23.000000	15.000000	330.000000	100.000000	3.000000	1.500000	1.500000	93.704912
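The min row above shows -1 for carbo, sugars, and potass, which cannot be real measurements. The sketch below assumes that -1 is a placeholder for a missing value (an assumption, not something the output confirms) and replaces it with NaN before summarizing, using hypothetical data. It also shows that describe() on a text column reports the count, the number of unique values, the most frequent value, and its frequency.

```python
import numpy as np
import pandas as pd

# Hypothetical values; -1.0 is assumed to mark a missing measurement.
sample_df = pd.DataFrame({
    "mfr": ["K", "G", "K", "N"],
    "carbo": [14.0, -1.0, 21.0, 12.0],
})

# Replace the assumed placeholder with NaN so it is excluded from the statistics.
cleaned = sample_df.assign(carbo=sample_df["carbo"].replace(-1, np.nan))

# Numeric summary: count drops to 3 and min becomes 12.0.
summary = cleaned.describe()

# Summary of a text column: count, unique, top, freq.
text_summary = cleaned["mfr"].describe()

print(summary)
print(text_summary)
```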