To calculate summary statistics in Python, use the pandas describe() method, which is available on both Series and DataFrame objects. The describe() method can be used on numeric data as well as object data, such as strings or timestamps.
The fields in the result differ between the two kinds of data.
For numerical data, the outcome will be as follows:
- count
- mean
- standard deviation
- minimum
- maximum
- 25th percentile
- 50th percentile
- 75th percentile
For object data, the outcome will be as follows:
- count
- top
- unique
- freq
A DataFrame offers a large number of methods for computing descriptive statistics and related operations. Most of these are aggregations, such as sum() and mean(), which produce a lower-dimensional result, while some, such as cumsum(), return an object of the same size. Like ndarray.{sum, std,…}, these methods generally accept an axis argument, but the axis can be specified either by name or by integer.
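As a minimal sketch of the difference between aggregations and same-size methods, and of the two ways to name an axis, consider the following. The DataFrame `df` and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Aggregations reduce the data: one value per column or per row.
col_sums = df.sum(axis=0)          # axis given as an integer (same as axis="index")
row_sums = df.sum(axis="columns")  # axis given by name (same as axis=1)

# cumsum() is not an aggregation: the result has the same shape as df.
running = df.cumsum()

print(col_sums)
print(row_sums)
print(running)
```

Here `col_sums` has one entry per column, `row_sums` has one entry per row, and `running` keeps all three rows and both columns.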
Computing Summary Statistics with the describe() Method
Let us now have a look at how to calculate summary statistics for object and numerical data by using the describe() method.
1) Calculation of Summary Statistics for Numerical data:
Approach:
- Import pandas module using the import keyword.
- Give the list as static input and store it in a variable.
- Pass the given list as an argument to the pandas.Series() function and store it in another variable. (defining series)
- Apply describe() function to the above series to get the summary statistics for the given series.
- The Exit of the Program.
Below is the implementation:
    # Import pandas module using the import keyword
    import pandas

    # Give the list as static input and store it in a variable.
    gvn_lst = [9, 5, 8, 2, 1]

    # Pass the given list as an argument to the pandas.Series() function and
    # store it in another variable. (defining series)
    rslt_seris = pandas.Series(gvn_lst)

    # Apply describe() function to the above series to get the summary
    # statistics for the given series, and print the result.
    print(rslt_seris.describe())
Output:
    count    5.000000
    mean     5.000000
    std      3.535534
    min      1.000000
    25%      2.000000
    50%      5.000000
    75%      8.000000
    max      9.000000
    dtype: float64
Here is what each value means:
count: It is the number of non-null entries.
mean: It is the arithmetic mean of all the entries.
std: It is the sample standard deviation of all the entries.
min: It is the minimum value of all the entries.
25%: It is the 25th percentile (first quartile).
50%: It is the 50th percentile, i.e. the median.
75%: It is the 75th percentile (third quartile).
max: It is the maximum value of all the entries.
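To make these definitions concrete, the sketch below recomputes each field with the corresponding individual Series method and prints it next to the value reported by describe(). The variable names `values`, `summary`, and `manual` are chosen here for illustration.

```python
import pandas as pd

values = pd.Series([9, 5, 8, 2, 1])
summary = values.describe()

# Each describe() field recomputed with its own Series method.
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),            # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}

for name, value in manual.items():
    print(name, value, summary[name])
```

Each pair of printed values should agree, which shows that describe() is a convenience wrapper around statistics you could also request one at a time.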
2) Calculation of Summary Statistics for Object data:
Approach:
- Import pandas module using the import keyword.
- Give the list of characters as static input and store it in a variable.
- Pass the given list as an argument to the pandas.Series() function and store it in another variable. (defining series)
- Apply describe() function to the above series to get the summary statistics for the given series.
- The Exit of the Program.
Below is the implementation:
    # Import pandas module using the import keyword
    import pandas

    # Give the list of characters as static input and store it in a variable.
    gvn_lst = ['p', 'e', 'r', 'g', 'e', 'p', 'e']

    # Pass the given list as an argument to the pandas.Series() function and
    # store it in another variable. (defining series)
    rslt_seris = pandas.Series(gvn_lst)

    # Apply describe() function to the above series to get the summary
    # statistics for the given series, and print the result.
    print(rslt_seris.describe())
Output:
    count     7
    unique    4
    top       e
    freq      3
    dtype: object
Where:
count: It is the number of non-null entries.
unique: It is the number of unique/distinct entries.
top: It is the value that occurs most frequently.
freq: It is the frequency of the most frequent entry. Here 'e' occurs 3 times, so freq is 3.
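The same four fields can be derived by hand, which may help clarify where they come from. The following is a small sketch using value_counts() and nunique(); the variable names are illustrative.

```python
import pandas as pd

letters = pd.Series(['p', 'e', 'r', 'g', 'e', 'p', 'e'])

# value_counts() sorts by frequency, most frequent value first.
counts = letters.value_counts()

count = letters.count()     # number of non-null entries
unique = letters.nunique()  # number of distinct values
top = counts.index[0]       # most frequent value
freq = counts.iloc[0]       # how often the most frequent value occurs

print(count, unique, top, freq)
```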
Calculation of Summary Statistics for a Larger Dataset:
Import the dataset first, then apply the describe() method to get its summary statistics.
Let us take the cereal dataset as an example.
Import the dataset into a pandas DataFrame.
Approach:
- Import pandas module using the import keyword.
- Import dataset using read_csv() function by passing the dataset name as an argument to it.
- Store it in a variable.
- Apply describe() method to the above-given dataset to get the Summary Statistics of the dataset.
- The Exit of the Program.
Below is the implementation:
    # Import pandas module as pd using the import keyword
    import pandas as pd

    # Import dataset using read_csv() function by passing the dataset name as
    # an argument to it.
    # Store it in a variable.
    cereal_dataset = pd.read_csv('cereal.csv')

    # Apply describe() method to the above-given dataset to get the Summary
    # Statistics of the dataset.
    cereal_dataset.describe()
Output:
statistic | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 |
mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.029610 | 0.821039 | 42.665705 |
std | 19.484119 | 1.094790 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
min | 50.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 | -1.000000 | 0.000000 | 1.000000 | 0.500000 | 0.250000 | 18.042851 |
25% | 100.000000 | 2.000000 | 0.000000 | 130.000000 | 1.000000 | 12.000000 | 3.000000 | 40.000000 | 25.000000 | 1.000000 | 1.000000 | 0.670000 | 33.174094 |
50% | 110.000000 | 3.000000 | 1.000000 | 180.000000 | 2.000000 | 14.000000 | 7.000000 | 90.000000 | 25.000000 | 2.000000 | 1.000000 | 0.750000 | 40.400208 |
75% | 110.000000 | 3.000000 | 2.000000 | 210.000000 | 3.000000 | 17.000000 | 11.000000 | 120.000000 | 25.000000 | 3.000000 | 1.000000 | 1.000000 | 50.828392 |
max | 160.000000 | 6.000000 | 5.000000 | 320.000000 | 14.000000 | 23.000000 | 15.000000 | 330.000000 | 100.000000 | 3.000000 | 1.500000 | 1.500000 | 93.704912 |
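The result includes summary statistics for every numeric column in the dataset. By default, describe() on a DataFrame with mixed column types omits the non-numeric columns, such as text columns.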
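To see how describe() handles a DataFrame that mixes numeric and text columns, here is a small sketch. The three-row DataFrame below is invented for illustration and is not taken from cereal.csv; include="all" is a documented option of DataFrame.describe().

```python
import pandas as pd

# Hypothetical mini-dataset (not the real cereal.csv contents).
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "calories": [70, 120, 110],
    "rating": [68.4, 33.9, 59.4],
})

# Default: only the numeric columns are summarized.
numeric_only = df.describe()

# include="all": every column is summarized; fields that do not
# apply to a column (e.g. mean for text) are filled with NaN.
everything = df.describe(include="all")

print(numeric_only)
print(everything)
```

With include="all", the result combines the object-style fields (unique, top, freq) and the numeric fields (mean, std, percentiles) in a single table.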
Calculation of Summary Statistics for timestamp series:
The describe() method is also used to obtain summary statistics for a timestamp series.
Approach:
- Import pandas module using the import keyword.
- Import datetime module using the import keyword.
- Import numpy module as np using the import keyword
- Give the timestamp as static input using the np.datetime64() function.
- Store it in a variable.
- Pass the given timestamps as an argument to the pandas.Series() function and store it in another variable (defining series).
- Apply describe() function to the above series to get the summary statistics for the given series.
- The Exit of the Program.
Below is the implementation:
    # Import pandas module using the import keyword
    import pandas
    # Import datetime module using the import keyword
    import datetime
    # Import numpy module as np using the import keyword
    import numpy as np

    # Give the timestamps as static input using the np.datetime64() function
    # Store them in a variable.
    gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64("2008-05-01"),
                    np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]

    # Pass the given timestamps as an argument to the pandas.Series() function
    # and store it in another variable. (defining series)
    rslt_seris = pandas.Series(gvn_timestmp)

    # Apply describe() function to the above series to get the summary
    # statistics for the given series, and print the result.
    print(rslt_seris.describe())
Output (pandas versions before 2.0):
    count                       4
    unique                      3
    top       2008-05-01 00:00:00
    freq                        2
    first     2003-01-07 00:00:00
    last      2008-05-01 00:00:00
    dtype: object
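You can also tell the describe() method to treat datetime values as numeric by passing datetime_is_numeric=True. The result is then laid out like that of numerical data: you get the mean, the minimum and maximum, and the 25th, 50th (median), and 75th percentiles, all expressed as datetimes. Note that this argument exists only in pandas 1.1 through 1.5; it was removed in pandas 2.0, where datetime values are always summarized this way.
    rslt_seris.describe(datetime_is_numeric=True)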
    # Import pandas module using the import keyword
    import pandas
    # Import datetime module using the import keyword
    import datetime
    # Import numpy module as np using the import keyword
    import numpy as np

    # Give the timestamps as static input using the np.datetime64() function
    # Store them in a variable.
    gvn_timestmp = [np.datetime64("2005-04-03"), np.datetime64("2008-05-01"),
                    np.datetime64("2008-05-01"), np.datetime64("2003-01-07")]

    # Pass the given timestamps as an argument to the pandas.Series() function
    # and store it in another variable. (defining series)
    rslt_seris = pandas.Series(gvn_timestmp)

    # Apply describe() function to the above series, treating the datetime
    # values as numeric, and print the result.
    # (datetime_is_numeric is accepted by pandas 1.1 to 1.5 only.)
    print(rslt_seris.describe(datetime_is_numeric=True))
Output:
    count                      4
    mean     2006-03-26 18:00:00
    min      2003-01-07 00:00:00
    25%      2004-09-10 18:00:00
    50%      2006-10-17 00:00:00
    75%      2008-05-01 00:00:00
    max      2008-05-01 00:00:00
    dtype: object
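If you would rather not depend on how a particular pandas version's describe() treats datetimes, the same numeric-style summary can be assembled from individual Series methods. This is a sketch under that assumption, not the only way to do it; the names `stamps` and `stats` are illustrative.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(
    ["2005-04-03", "2008-05-01", "2008-05-01", "2003-01-07"]))

# Build the numeric-style summary field by field.
stats = {
    "count": stamps.count(),
    "mean": stamps.mean(),
    "min": stamps.min(),
    "25%": stamps.quantile(0.25),
    "50%": stamps.quantile(0.50),
    "75%": stamps.quantile(0.75),
    "max": stamps.max(),
}

for name, value in stats.items():
    print(name, value)
```

The values should match the describe() output shown above, since mean() and quantile() operate directly on datetime data.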