Python Program for Calculating Summary Statistics

To calculate summary statistics in Python, use the describe() method that pandas provides on Series and DataFrame objects. The describe() method can be used on numeric data as well as on object data, such as strings or timestamps.

The fields in the result differ between the two kinds of data.

For numerical data, the result contains the following fields:

count
mean
standard deviation (std)
minimum
25th percentile
50th percentile (median)
75th percentile
maximum

For object data, the result contains the following fields:

count
unique
top
freq

pandas provides a large number of methods on DataFrame for computing descriptive statistics and related operations. Most of these are aggregations, such as sum() and mean(), which produce a smaller result, while some, such as cumsum(), produce an object of the same size. In general, these methods, like ndarray.{sum, std,…}, accept an axis argument, and the axis can be supplied by name or by integer.
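As a minimal sketch of the difference between an aggregation and a same-size method, and of passing the axis by integer or by name, consider the following. The DataFrame and its column names "a" and "b" are made up for illustration only.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Aggregation with the axis given as an integer: one sum per column.
col_sums = df.sum(axis=0)

# Aggregation with the axis given by name: one sum per row.
row_sums = df.sum(axis="columns")

# cumsum() is not an aggregation: the result has the same shape as df.
running = df.cumsum()

print(col_sums)
print(row_sums)
print(running)
```

Here sum() reduces the frame to one value per column or per row, whereas cumsum() keeps every row and replaces each value with the running total of its column.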

Computing Summary Statistics with the pandas describe() Method

Let us now have a look at how to calculate summary statistics for object and numerical data by using the describe() method.

1) Calculation of Summary Statistics for Numerical Data:

Approach:

Import pandas module using the import keyword.
Give the list as static input and store it in a variable.
Pass the given list as an argument to the pandas.Series() constructor and store the result in another variable (this defines the series).
Apply the describe() method to the above series to get its summary statistics.
Exit the program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Give the list as static input and store it in a variable.
gvn_lst = [9, 5, 8, 2, 1]
# Pass the given list as an argument to the pandas.Series() constructor and store
# the result in another variable (this defines the series).
rslt_seris = pandas.Series(gvn_lst)
# Apply the describe() method to the above series to get its summary statistics.
# print() makes the result visible when the code is run as a script.
print(rslt_seris.describe())

Output:

count    5.000000
mean     5.000000
std      3.535534
min      1.000000
25%      2.000000
50%      5.000000
75%      8.000000
max      9.000000
dtype: float64

Each field in the output has the following meaning:

count: It is the number of non-null entries.

mean: It is the mean of all the entries.

std: It is the sample standard deviation of the entries (computed with n - 1 in the denominator).

min: It is the minimum value of all the entries.

25%: It is the 25th percentile.

50%: It is the 50th percentile, i.e., the median.

75%: It is the 75th percentile.

max: It is the maximum value of all the entries.
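To make these definitions concrete, the sketch below recomputes each field with the corresponding individual Series method and prints it next to the value reported by describe(). It reuses the same five numbers as above; the variable names are chosen only for this illustration.

```python
import pandas as pd

values = pd.Series([9, 5, 8, 2, 1])
summary = values.describe()

# Each describe() field recomputed with a dedicated Series method.
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),            # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}

for name, value in manual.items():
    print(name, value, summary[name])
```

Each line should show the same number twice, which confirms that describe() is a convenient bundle of these individual calculations.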

2) Calculation of Summary Statistics for Object Data:

Approach:

Import pandas module using the import keyword.
Give the list of characters as static input and store it in a variable.
Pass the given list as an argument to the pandas.Series() constructor and store the result in another variable (this defines the series).
Apply the describe() method to the above series to get its summary statistics.
Exit the program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Give the list of characters as static input and store it in a variable.
gvn_lst = ['p', 'e', 'r', 'g', 'e', 'p', 'e']
# Pass the given list as an argument to the pandas.Series() constructor and store
# the result in another variable (this defines the series).
rslt_seris = pandas.Series(gvn_lst)
# Apply the describe() method to the above series to get its summary statistics.
# print() makes the result visible when the code is run as a script.
print(rslt_seris.describe())

Output:

count     7
unique    4
top       e
freq      3
dtype: object

Where:

count: It is the number of non-null entries.

unique: It is the number of unique/distinct entries.

top: It is the value that occurs most frequently.

freq: It is the frequency of the most frequent entry, i.e., here 'e' occurs 3 times, hence its freq is 3.
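The same four fields can be derived by hand, which may help show where they come from. The following is a small sketch that uses count(), nunique(), and value_counts() on the same list of characters; the variable names are chosen only for this illustration.

```python
import pandas as pd

letters = pd.Series(['p', 'e', 'r', 'g', 'e', 'p', 'e'])

# value_counts() sorts the distinct values by how often they occur,
# most frequent first.
counts = letters.value_counts()

print("count :", letters.count())
print("unique:", letters.nunique())
print("top   :", counts.index[0])
print("freq  :", counts.iloc[0])
```

The first entry of value_counts() is the most frequent value, so its label corresponds to top and its count corresponds to freq.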

Calculation of Summary Statistics for a Larger Dataset:

For a dataset stored in a file, first import it into a pandas DataFrame and then apply the describe() method to get the summary statistics.

Let us take a cereal dataset as an example.

Approach:

Import pandas module using the import keyword.
Import the dataset using the read_csv() function by passing the path of the CSV file as an argument to it.
Store it in a variable.
Apply the describe() method to the dataset to get its summary statistics.
Exit the program.

Below is the implementation:

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by passing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() method to the dataset to get its summary statistics.
# In a notebook the result is displayed automatically; in a script, wrap the
# call in print() to see it.
cereal_dataset.describe()

Output:

	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
count	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000
mean	106.883117	2.545455	1.012987	159.675325	2.151948	14.597403	6.922078	96.077922	28.246753	2.207792	1.029610	0.821039	42.665705
std	19.484119	1.094790	1.006473	83.832295	2.383364	4.278956	4.444885	71.286813	22.342523	0.832524	0.150477	0.232716	14.047289
min	50.000000	1.000000	0.000000	0.000000	0.000000	-1.000000	-1.000000	-1.000000	0.000000	1.000000	0.500000	0.250000	18.042851
25%	100.000000	2.000000	0.000000	130.000000	1.000000	12.000000	3.000000	40.000000	25.000000	1.000000	1.000000	0.670000	33.174094
50%	110.000000	3.000000	1.000000	180.000000	2.000000	14.000000	7.000000	90.000000	25.000000	2.000000	1.000000	0.750000	40.400208
75%	110.000000	3.000000	2.000000	210.000000	3.000000	17.000000	11.000000	120.000000	25.000000	3.000000	1.000000	1.000000	50.828392
max	160.000000	6.000000	5.000000	320.000000	14.000000	23.000000	15.000000	330.000000	100.000000	3.000000	1.500000	1.500000	93.704912

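The result includes summary statistics for every numeric column in the dataset. By default, describe() on a DataFrame leaves out non-numeric columns; pass include='all' to summarize those as well.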
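The effect of the include and exclude parameters of describe() can be sketched on a tiny frame with one text column and one numeric column. The frame below is hypothetical: its column names and values are made up for illustration and are not taken from the cereal dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; the values are made up for illustration.
df = pd.DataFrame({
    "name": ["A", "B", "C", "A"],
    "calories": [70, 120, 110, 70],
})

# Default: only the numeric column is summarized.
numeric_only = df.describe()

# Excluding numeric types leaves only the text column.
text_only = df.describe(exclude=[np.number])

# include="all" summarizes every column; fields that do not apply
# to a column are filled with NaN.
everything = df.describe(include="all")

print(numeric_only)
print(text_only)
print(everything)
```

With include="all" the result has both sets of fields (count, unique, top, freq and mean, std, min, percentiles, max), so expect NaN wherever a field has no meaning for a column.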

Calculation of Summary Statistics for timestamp series:

The describe() method can also be used to obtain summary statistics for a series of timestamps.

Approach:

Import pandas module using the import keyword.
Import datetime module using the import keyword.
Import numpy module as np using the import keyword
Give the timestamps as static input using the np.datetime64() function.
Store them in a variable.
Pass the given timestamps as an argument to the pandas.Series() constructor and store the result in another variable (this defines the series).
Apply the describe() method to the above series to get its summary statistics.
Exit the program.

Below is the implementation:

# Import pandas module using the import keyword
import pandas
# Import datetime module using the import keyword
import datetime
# Import numpy module as np using the import keyword
import numpy as np
# Give the timestamps as static input using the np.datetime64() function
# Store them in a variable.
gvn_timestmp = [
    np.datetime64("2005-04-03"),
    np.datetime64("2008-05-01"),
    np.datetime64("2008-05-01"),
    np.datetime64("2003-01-07"),
]
# Pass the given timestamps as an argument to the pandas.Series() constructor and
# store the result in another variable (this defines the series).
rslt_seris = pandas.Series(gvn_timestmp)
# Apply the describe() method to the above series to get its summary statistics.
# print() makes the result visible when the code is run as a script.
print(rslt_seris.describe())

Output (pandas versions before 2.0):

count                       4
unique                      3
top       2008-05-01 00:00:00
freq                        2
first     2003-01-07 00:00:00
last      2008-05-01 00:00:00
dtype: object

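In pandas 1.1 and the later 1.x releases, you can also tell the describe() method to treat datetime values as numeric by passing datetime_is_numeric=True. The result is then laid out like that of numerical data, with the mean, median, 25th percentile, and 75th percentile reported as timestamps. In pandas 2.0 and later, datetime data is always summarized this way and the datetime_is_numeric argument has been removed, so a plain describe() call already produces this kind of output.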

rslt_seris.describe(datetime_is_numeric=True)

# Import pandas module using the import keyword
import pandas
# Import datetime module using the import keyword
import datetime
# Import numpy module as np using the import keyword
import numpy as np
# Give the timestamps as static input using the np.datetime64() function
# Store them in a variable.
gvn_timestmp = [
    np.datetime64("2005-04-03"),
    np.datetime64("2008-05-01"),
    np.datetime64("2008-05-01"),
    np.datetime64("2003-01-07"),
]
# Pass the given timestamps as an argument to the pandas.Series() constructor and
# store the result in another variable (this defines the series).
rslt_seris = pandas.Series(gvn_timestmp)
# Apply the describe() method to the above series, treating datetimes as numeric.
# The datetime_is_numeric argument exists only in pandas 1.1 through 1.5;
# it was removed in pandas 2.0.
print(rslt_seris.describe(datetime_is_numeric=True))

Output:

count                      4
mean     2006-03-26 18:00:00
min      2003-01-07 00:00:00
25%      2004-09-10 18:00:00
50%      2006-10-17 00:00:00
75%      2008-05-01 00:00:00
max      2008-05-01 00:00:00
dtype: object
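If the code has to run regardless of the installed pandas version, one option is to skip describe() for timestamps and call the individual Series methods instead. The following is a minimal sketch under that assumption; it uses pd.to_datetime() to build the series, which is a substitution for the np.datetime64() calls used above, and the variable names are chosen only for this illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2005-04-03", "2008-05-01", "2008-05-01", "2003-01-07"]))

# Individual statistics; none of these calls needs datetime_is_numeric.
print("count:", dates.count())
print("mean :", dates.mean())
print("min  :", dates.min())
print(dates.quantile([0.25, 0.5, 0.75]))
print("max  :", dates.max())
```

The mean and the quantiles are returned as timestamps and should match the values in the output above.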