Python Program for Calculating Summary Statistics

To calculate summary statistics in Python, use the describe() method that pandas provides on both Series and DataFrame objects. The describe() method can be used on numeric data as well as non-numeric data, such as strings or timestamps.

The fields in the result differ between the two kinds of data.

For numerical data, the outcome will be as follows:

  • count
  • mean
  • standard deviation
  • minimum
  • 25th percentile
  • 50th percentile (median)
  • 75th percentile
  • maximum

For object data, the outcome will be as follows:

  • count
  • unique
  • top
  • freq

DataFrame provides a large number of methods that compute descriptive statistics and related operations. Most of these are aggregations, such as sum() and mean(), which produce a lower-dimensional result, although some, such as cumsum(), produce an object of the same size. In general, these methods accept an axis argument, just like ndarray.{sum, std, …}, but the axis can be specified by name or by integer.
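
The difference between an aggregation and a same-size method, and the two ways of naming an axis, can be sketched as follows. This is a minimal illustration; the frame df and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Aggregations collapse one axis and return a smaller object.
col_totals = df.sum(axis=0)          # axis given as an integer: one total per column
row_totals = df.sum(axis="columns")  # axis given by name: one total per row

# cumsum() is not an aggregation: the result has the same shape as df.
running = df.cumsum()

print(col_totals)
print(row_totals)
print(running)
```

Here axis=0 and axis="index" mean the same thing, as do axis=1 and axis="columns".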

Using the pandas describe() method to compute Summary Statistics

Let us now look at how to calculate summary statistics for object and numerical data using the describe() method.

1) Calculation of Summary Statistics for Numerical data:

Approach:

  • Import the pandas module using the import keyword.
  • Give the list as static input and store it in a variable.
  • Pass the given list as an argument to the pandas.Series() constructor and store the result in another variable (defining the series).
  • Apply the describe() method to the above series to get the summary statistics for the given series.
  • End of the program.

Below is the implementation:

# Import the pandas module using the import keyword
import pandas
# Give the list as static input and store it in a variable.
gvn_lst = [9, 5, 8, 2, 1]
# Pass the given list as an argument to the pandas.Series() constructor and
# store the resulting series in another variable.
rslt_seris = pandas.Series(gvn_lst)
# Apply the describe() method to the series and print its summary statistics.
print(rslt_seris.describe())

Output:

count    5.000000
mean     5.000000
std      3.535534
min      1.000000
25%      2.000000
50%      5.000000
75%      8.000000
max      9.000000
dtype: float64

Each field in the output has the following meaning:

count: the number of non-missing entries

mean: the arithmetic mean of the entries

std: the sample standard deviation of the entries

min: the smallest entry

25%: the 25th percentile

50%: the 50th percentile, i.e. the median

75%: the 75th percentile

max: the largest entry
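
Each of these fields can also be computed on its own, which makes the definitions concrete. The following is a minimal sketch using the same list as above; the variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([9, 5, 8, 2, 1])

count = values.count()   # number of non-missing entries
mean = values.mean()     # arithmetic mean
std = values.std()       # sample standard deviation (divides by n - 1)
quartiles = values.quantile([0.25, 0.5, 0.75])  # 25th, 50th, 75th percentiles

# describe() bundles the same figures into a single Series.
summary = values.describe()

print(count, mean, std)
print(quartiles)
```

For this list the squared deviations from the mean of 5 sum to 50, so std is the square root of 50 / 4, which is about 3.535534 and matches the output above.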

2) Calculation of Summary Statistics for Object data:

Approach:

  • Import the pandas module using the import keyword.
  • Give the list of characters as static input and store it in a variable.
  • Pass the given list as an argument to the pandas.Series() constructor and store the result in another variable (defining the series).
  • Apply the describe() method to the above series to get the summary statistics for the given series.
  • End of the program.

Below is the implementation:

# Import the pandas module using the import keyword
import pandas
# Give the list of characters as static input and store it in a variable.
gvn_lst = ['p', 'e', 'r', 'g', 'e', 'p', 'e']
# Pass the given list as an argument to the pandas.Series() constructor and
# store the resulting series in another variable.
rslt_seris = pandas.Series(gvn_lst)
# Apply the describe() method to the series and print its summary statistics.
print(rslt_seris.describe())

Output:

count     7
unique    4
top       e
freq      3
dtype: object

Where

count: the number of non-missing entries

unique: the number of distinct entries

top: the value that occurs most frequently

freq: the frequency of the most frequent value; here 'e' occurs 3 times, so its freq is 3.
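
The top and freq fields can be cross-checked with value_counts(), which tallies how often each distinct value occurs. This is a minimal sketch using the same list of characters; the variable names are illustrative only.

```python
import pandas as pd

letters = pd.Series(['p', 'e', 'r', 'g', 'e', 'p', 'e'])

# value_counts() sorts by frequency, most frequent value first.
counts = letters.value_counts()

top = counts.index[0]       # most frequent value
freq = counts.iloc[0]       # how often it occurs
unique = letters.nunique()  # number of distinct values

print(top, freq, unique)
```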

Calculation of Summary Statistics for a Large Dataset:

For a larger dataset, first import it into a pandas DataFrame and then apply the describe() method to get its summary statistics.

Let us take a cereal dataset as an example.

Approach:

  • Import the pandas module using the import keyword.
  • Load the dataset using the read_csv() function by passing the path of the CSV file as an argument to it.
  • Store it in a variable.
  • Apply the describe() method to the resulting DataFrame to get the summary statistics of the dataset.
  • End of the program.

Below is the implementation:

# Import the pandas module as pd using the import keyword
import pandas as pd
# Load the dataset using the read_csv() function by passing the path of the
# CSV file as an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() method to the DataFrame and print the summary statistics
# of the dataset.
print(cereal_dataset.describe())

Output:

        calories    protein        fat     sodium      fiber      carbo     sugars     potass   vitamins      shelf     weight       cups     rating
count  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000  77.000000
mean  106.883117   2.545455   1.012987 159.675325   2.151948  14.597403   6.922078  96.077922  28.246753   2.207792   1.029610   0.821039  42.665705
std    19.484119   1.094790   1.006473  83.832295   2.383364   4.278956   4.444885  71.286813  22.342523   0.832524   0.150477   0.232716  14.047289
min    50.000000   1.000000   0.000000   0.000000   0.000000  -1.000000  -1.000000  -1.000000   0.000000   1.000000   0.500000   0.250000  18.042851
25%   100.000000   2.000000   0.000000 130.000000   1.000000  12.000000   3.000000  40.000000  25.000000   1.000000   1.000000   0.670000  33.174094
50%   110.000000   3.000000   1.000000 180.000000   2.000000  14.000000   7.000000  90.000000  25.000000   2.000000   1.000000   0.750000  40.400208
75%   110.000000   3.000000   2.000000 210.000000   3.000000  17.000000  11.000000 120.000000  25.000000   3.000000   1.000000   1.000000  50.828392
max   160.000000   6.000000   5.000000 320.000000  14.000000  23.000000  15.000000 330.000000 100.000000   3.000000   1.500000   1.500000  93.704912

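The result includes summary statistics for every numeric column in our dataset. By default, describe() on a DataFrame with mixed column types skips the non-numeric columns; pass include='all' to summarize those as well.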
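
The include and percentiles arguments of DataFrame.describe() control which columns and which percentiles are reported. The following is a minimal sketch on a small made-up frame; it is a hypothetical stand-in and not the cereal dataset itself.

```python
import pandas as pd

# Hypothetical stand-in for a dataset with text and numeric columns.
df = pd.DataFrame({
    "name": ["A", "B", "C", "B"],
    "calories": [70, 120, 110, 120],
    "rating": [68.4, 34.0, 59.4, 34.0],
})

# Default: only the numeric columns are summarized.
numeric_only = df.describe()

# include="all" adds the non-numeric columns; fields that do not apply
# to a column (for example mean for text) are filled with NaN.
everything = df.describe(include="all")

# percentiles replaces the default 25% and 75% marks with custom ones.
custom = df.describe(percentiles=[0.1, 0.9])

print(numeric_only)
print(everything)
print(custom)
```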

Calculation of Summary Statistics for timestamp series:

The describe() method can also be used to obtain summary statistics for a series of timestamps.

Approach:

  • Import the pandas module using the import keyword.
  • Import the numpy module as np using the import keyword.
  • Give the timestamps as static input using the np.datetime64() function.
  • Store them in a variable.
  • Pass the given timestamps as an argument to the pandas.Series() constructor and store the result in another variable (defining the series).
  • Apply the describe() method to the above series to get the summary statistics for the given series.
  • End of the program.

Below is the implementation:

# Import the pandas module using the import keyword
import pandas
# Import the numpy module as np using the import keyword
import numpy as np
# Give the timestamps as static input using the np.datetime64() function
# Store them in a variable.
gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64("2008-05-01"),
                np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]
# Pass the given timestamps as an argument to the pandas.Series() constructor
# and store the resulting series in another variable.
rslt_seris = pandas.Series(gvn_timestmp)
# Apply the describe() method to the series and print its summary statistics.
print(rslt_seris.describe())

Output (pandas versions before 2.0):

count                       4
unique                      3
top       2008-05-01 00:00:00
freq                        2
first     2003-01-07 00:00:00
last      2008-05-01 00:00:00
dtype: object

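You can also tell the describe() method to treat datetime values as numeric. The result is then laid out like that of numerical data: you get the mean, median, 25th percentile, and 75th percentile as timestamps. In pandas 1.1 through 1.5 this is done by passing datetime_is_numeric=True. pandas 2.0 removed that argument and always treats datetime values as numeric, so a plain describe() call gives this result there.

Below is the implementation:
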
# Import the pandas module using the import keyword
import pandas
# Import the numpy module as np using the import keyword
import numpy as np
# Give the timestamps as static input using the np.datetime64() function
# Store them in a variable.
gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64("2008-05-01"),
                np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]
# Pass the given timestamps as an argument to the pandas.Series() constructor
# and store the resulting series in another variable.
rslt_seris = pandas.Series(gvn_timestmp)
# Apply the describe() method to the series and print its summary statistics.
# datetime_is_numeric exists in pandas 1.1 to 1.5 only; on pandas 2.0 and
# later, call describe() with no arguments instead.
print(rslt_seris.describe(datetime_is_numeric=True))

Output:

count                      4
mean     2006-03-26 18:00:00
min      2003-01-07 00:00:00
25%      2004-09-10 18:00:00
50%      2006-10-17 00:00:00
75%      2008-05-01 00:00:00
max      2008-05-01 00:00:00
dtype: object
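
Because the datetime_is_numeric argument is not available in every pandas version, the same figures can also be computed directly from the series. The following is a minimal sketch, assuming a pandas version in which Series.mean() and Series.quantile() accept datetime data; the name stamps is illustrative only.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(
    ["2005-04-03", "2008-05-01", "2008-05-01", "2003-01-07"]))

count = stamps.count()
mean = stamps.mean()                            # average point in time
quartiles = stamps.quantile([0.25, 0.5, 0.75])  # interpolated between timestamps
earliest = stamps.min()
latest = stamps.max()

print(count, mean, earliest, latest)
print(quartiles)
```

The quartiles are interpolated between neighbouring timestamps, which is why the 25% mark in the output above falls on a date that does not appear in the input.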