Python is a superb language for data analysis, owing to its fantastic ecosystem of data-centric Python tools. Pandas is one of these packages, and it greatly simplifies data import and analysis.
There are several essential math operations that can be done on a pandas series to ease data analysis in Python and save a significant amount of time.
Data analysis is basically the extraction of meaningful information from a raw data source. This information provides us with an idea of how the data is distributed and structured.
Let us go through the following Pandas math functions:
- mean() function
- sum() function
- median() function
- min() and max() functions
- value_counts() function
- describe() function
This article uses the cereal dataset as an example.
Before looking at these Pandas math functions, first import the dataset.
Importing the Dataset
Import the dataset into a Pandas Dataframe.
Approach:
- Import pandas module as pd using the import keyword.
- Import dataset using read_csv() function by passing the dataset name as an argument to it.
- Store it in a variable.
- Print the dataset if you want to inspect it (here we only import it).
- End of the program.
Below is the implementation:
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
This stores the dataset in the variable 'cereal_dataset' as a DataFrame.
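After importing, it helps to confirm that the data loaded as expected. The sketch below is only an illustration: it reads a hypothetical three-row stand-in for cereal.csv from an in-memory string (the values and the names csv_text and sample_df are made up) so that it runs without the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for cereal.csv; the values are invented for illustration.
csv_text = """name,mfr,calories,protein
Sample A,N,70,4
Sample B,Q,120,3
Sample C,K,110,2
"""

# read_csv() accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # number of rows and columns
print(sample_df.dtypes)  # data type inferred for each column
print(sample_df.head())  # first rows of the DataFrame
```

With the real file, the same three checks can be run on cereal_dataset.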
1)mean() function in Pandas
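The mean is the arithmetic average of a set of values: their sum divided by their count. It summarizes a distribution in a single number.
We can get the mean of a single column, several columns, or the complete dataset by using the dataframe.mean() function.
Apply the mean() function to the dataset to get the mean of all the numeric columns. Passing numeric_only=True skips text columns such as name, mfr, and type; recent pandas versions (2.0 and later) raise an error on those columns without it.
cereal_dataset.mean(numeric_only=True)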
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the mean() function to the above dataset to get the mean of all the
# numeric columns in a dataset (numeric_only=True skips the text columns).
cereal_dataset.mean(numeric_only=True)
Output:
calories    106.883117
protein       2.545455
fat           1.012987
sodium      159.675325
fiber         2.151948
carbo        14.597403
sugars        6.922078
potass       96.077922
vitamins     28.246753
shelf         2.207792
weight        1.029610
cups          0.821039
rating       42.665705
dtype: float64
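The mean can also be taken for one column or a chosen subset of columns. The following is a minimal sketch on a small made-up DataFrame (sample_df and its values are assumptions for illustration, not the real cereal data):

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
sample_df = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C"],
    "calories": [70, 120, 110],
    "protein": [4, 3, 2],
})

# Mean of a single column (a Series): (70 + 120 + 110) / 3
calories_mean = sample_df["calories"].mean()

# Mean of every numeric column; the text column "name" is skipped.
column_means = sample_df.mean(numeric_only=True)

# Mean of an explicitly chosen subset of columns.
subset_means = sample_df[["calories", "protein"]].mean()

print(calories_mean)
print(column_means)
```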
2)sum() function in Pandas
In addition to the mean() function, we can use the Pandas sum() function to get the total of the values in each column. Totals give an absolute view of the data, whereas the mean gives a per-row average.
Apply the sum() function to the dataset to calculate the sum of each column in the entire dataset. For text columns such as name, mfr, and type, sum() concatenates the strings, which is why they appear as long strings in the output below.
cereal_dataset.sum()
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the sum() function to the above dataset to calculate the
# sum of each column in the entire dataset.
cereal_dataset.sum()
Output:
name        100% Bran100% Natural BranAll-BranAll-Bran wit...
mfr         NQKKRGKGRPQGGGGRKKGKNKGRKKKPKPPGPPPQGPKKGQGARR...
type        CCCCCCCCCCCCCCCCCCCCHCCCCCCCCCCCCCCCCCCCCCCHCC...
calories    8230
protein     196
fat         78
sodium      12295
fiber       165.7
carbo       1124
sugars      533
potass      7398
vitamins    2175
shelf       170
weight      79.28
cups        63.22
rating      3285.26
dtype: object
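The concatenated text columns in the output above are rarely useful. One way to leave them out, sketched here on a small made-up DataFrame (sample_df and its values are assumptions for illustration), is to pass numeric_only=True; the axis argument switches from column totals to row totals.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
sample_df = pd.DataFrame({
    "mfr": ["N", "Q", "K"],
    "calories": [70, 120, 110],
    "fat": [1, 5, 0],
})

# Column totals for numeric columns only; the text column "mfr" is skipped.
numeric_sums = sample_df.sum(numeric_only=True)

# Row totals: axis=1 adds the values across the selected columns of each row.
row_sums = sample_df[["calories", "fat"]].sum(axis=1)

print(numeric_sums)
print(row_sums)
```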
3)median() function in Pandas
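The median() function returns the 50th percentile, that is, the central value of a set of data once it is sorted.
Apply the median() function on the dataset to get the 50th percentile of all numeric columns of the dataset. As with mean(), numeric_only=True skips the text columns, which recent pandas versions (2.0 and later) would otherwise reject with an error.
cereal_dataset.median(numeric_only=True)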
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the median() function on the dataset to get the 50th percentile or
# central value of all numeric columns of the dataset.
cereal_dataset.median(numeric_only=True)
Output:
calories    110.000000
protein       3.000000
fat           1.000000
sodium      180.000000
fiber         2.000000
carbo        14.000000
sugars        7.000000
potass       90.000000
vitamins     25.000000
shelf         2.000000
weight        1.000000
cups          0.750000
rating       40.400208
dtype: float64
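To see why the median is worth reporting alongside the mean, the sketch below compares the two on a small made-up Series containing one extreme value (the names and values are assumptions for illustration). The median stays at the central value while the mean is pulled toward the outlier.

```python
import pandas as pd

# Hypothetical values with one outlier (200.0); invented for illustration.
ratings = pd.Series([30.0, 35.0, 40.0, 45.0, 200.0])

ratings_median = ratings.median()  # middle value of the sorted data
ratings_mean = ratings.mean()      # pulled upward by the outlier

# With an even number of values, the median is the average of the two middle ones.
even_median = pd.Series([1, 2, 3, 4]).median()

# The median is the same as the 0.5 quantile.
half_quantile = ratings.quantile(0.5)

print(ratings_median, ratings_mean, even_median, half_quantile)
```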
4)min() and max() functions in Pandas
Using the min() and max() functions, we can get the minimum and maximum values of every column of the dataset or of a single column of the dataframe.
Apply the max() function on the dataset to get the maximum value of each column in the dataset. For text columns, the maximum is the last value in alphabetical order.
cereal_dataset.max()
Similarly, apply the min() function to get the minimum value of each column of the dataset.
cereal_dataset.min()
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the max() function on the dataset to get the maximum values
# of each column in the dataset.
print("The maximum values of each column in the dataset:")
cereal_dataset.max()
Output:
The maximum values of each column in the dataset:
name        Wheaties Honey Gold
mfr         R
type        H
calories    160
protein     6
fat         5
sodium      320
fiber       14
carbo       23
sugars      15
potass      330
vitamins    100
shelf       3
weight      1.5
cups        1.5
rating      93.7049
dtype: object
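The same calls work on a single column, and idxmax() can report which row holds the maximum. This is a minimal sketch on a small made-up DataFrame (sample_df and its values are assumptions for illustration, not the real cereal data):

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
sample_df = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C"],
    "calories": [70, 120, 110],
    "rating": [68.4, 33.9, 59.4],
})

# Minimum and maximum of a single column.
calories_min = sample_df["calories"].min()
calories_max = sample_df["calories"].max()

# Minimum of each numeric column; the text column "name" is skipped.
numeric_min = sample_df.min(numeric_only=True)

# idxmax() returns the index label of the row with the largest rating,
# which can then be used to look up that row's name.
top_rated_name = sample_df.loc[sample_df["rating"].idxmax(), "name"]

print(calories_min, calories_max, top_rated_name)
```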
5)value_counts() function in Pandas
The value_counts() function returns how many times each distinct value occurs in a column. It is useful when dealing with categorical variables.
Apply the value_counts() function on the vitamins column of the dataset to obtain the count of each group in that column.
cereal_dataset.vitamins.value_counts()
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the value_counts() function on the vitamins column of the dataset to
# obtain the count of each group in the column.
cereal_dataset.vitamins.value_counts()
Output:
25     63
0       8
100     6
Name: vitamins, dtype: int64
6)describe() function in Pandas
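Counts are sorted from most to least frequent by default. The sketch below, on a small made-up Series (the name vitamins_sample and its values are assumptions for illustration), also shows normalize=True for proportions and sort_index() for ordering by value instead of frequency.

```python
import pandas as pd

# Hypothetical sample values; invented for illustration.
vitamins_sample = pd.Series([25, 25, 0, 100, 25, 0], name="vitamins")

# Absolute counts, most frequent value first.
counts = vitamins_sample.value_counts()

# Proportions instead of counts (they add up to 1.0).
shares = vitamins_sample.value_counts(normalize=True)

# Counts ordered by the value itself rather than by frequency.
counts_by_value = vitamins_sample.value_counts().sort_index()

print(counts)
print(shares)
print(counts_by_value)
```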
The describe() function returns summary statistics for every numeric column in one call: count, mean, standard deviation, minimum, quartiles, and maximum.
Apply the describe() function to the dataset to obtain these statistics all at once.
cereal_dataset.describe()
Example
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() function to the dataset to obtain the statistical
# information of the given dataset all at once.
cereal_dataset.describe()
Output:
| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 |
mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.029610 | 0.821039 | 42.665705 |
std | 19.484119 | 1.094790 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
min | 50.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 | -1.000000 | 0.000000 | 1.000000 | 0.500000 | 0.250000 | 18.042851 |
25% | 100.000000 | 2.000000 | 0.000000 | 130.000000 | 1.000000 | 12.000000 | 3.000000 | 40.000000 | 25.000000 | 1.000000 | 1.000000 | 0.670000 | 33.174094 |
50% | 110.000000 | 3.000000 | 1.000000 | 180.000000 | 2.000000 | 14.000000 | 7.000000 | 90.000000 | 25.000000 | 2.000000 | 1.000000 | 0.750000 | 40.400208 |
75% | 110.000000 | 3.000000 | 2.000000 | 210.000000 | 3.000000 | 17.000000 | 11.000000 | 120.000000 | 25.000000 | 3.000000 | 1.000000 | 1.000000 | 50.828392 |
max | 160.000000 | 6.000000 | 5.000000 | 320.000000 | 14.000000 | 23.000000 | 15.000000 | 330.000000 | 100.000000 | 3.000000 | 1.500000 | 1.500000 | 93.704912 |
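describe() can also be applied to a single column, including a text column, and it accepts a percentiles argument. The following is a minimal sketch on a small made-up DataFrame (sample_df and its values are assumptions for illustration, not the real cereal data):

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
sample_df = pd.DataFrame({
    "mfr": ["K", "K", "G", "N"],
    "calories": [70, 120, 110, 100],
})

# Default: summary statistics for the numeric columns only.
numeric_summary = sample_df.describe()

# On a text column, describe() reports count, unique, top, and freq instead.
mfr_summary = sample_df["mfr"].describe()

# Custom percentiles (given as fractions between 0 and 1).
custom_summary = sample_df["calories"].describe(percentiles=[0.1, 0.9])

print(numeric_summary)
print(mfr_summary)
print(custom_summary)
```

Here top is the most frequent value in the column and freq is the number of times it occurs.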