{"id":26368,"date":"2021-12-21T09:28:08","date_gmt":"2021-12-21T03:58:08","guid":{"rendered":"https:\/\/python-programs.com\/?p=26368"},"modified":"2021-12-21T09:28:08","modified_gmt":"2021-12-21T03:58:08","slug":"python-program-for-calculating-summary-statistics","status":"publish","type":"post","link":"https:\/\/python-programs.com\/python-program-for-calculating-summary-statistics\/","title":{"rendered":"Python Program for Calculating Summary Statistics"},"content":{"rendered":"
To calculate summary statistics in Python, use the pandas describe()<\/strong> method, which is available on both Series and DataFrame objects. The describe() method can be used on both numeric and object data, such as strings or timestamps.<\/p>\n The fields in the result differ between the two.<\/p>\n For numerical data, the outcome will be as follows:<\/strong> count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.<\/p>\n For objects, the outcome will be as follows:<\/strong> count, unique, top, and freq.<\/p>\n A DataFrame provides a large number of methods that compute descriptive statistics and related operations. The majority of these are aggregations, such as sum() and mean(), which produce a lower-dimensional result, although some, such as cumsum(), produce an object of the same size. In general, these methods accept an axis argument, just like ndarray.{sum, std,…}, but the axis can be supplied by name or integer.<\/p>\n Computing Summary Statistics Using the pandas describe() Method<\/strong><\/p>\n Let us now look at how to calculate summary statistics for numerical and object data by using the describe() method.<\/p>\n 1) Calculation of Summary Statistics for Numerical Data:<\/strong><\/p>\n Approach:<\/strong><\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n Each value in the output has a definition. 
They are:<\/p>\n count:<\/strong> The number of non-null entries.<\/p>\n mean:<\/strong> The mean of all the entries.<\/p>\n std:<\/strong> The sample standard deviation of all the entries.<\/p>\n min:<\/strong> The minimum value of all the entries.<\/p>\n 25%:<\/strong> The 25th percentile.<\/p>\n 50%:<\/strong> The 50th percentile, i.e., the median.<\/p>\n 75%:<\/strong> The 75th percentile.<\/p>\n max:<\/strong> The maximum value of all the entries.<\/p>\n 2) Calculation of Summary Statistics for Object Data:<\/strong><\/p>\n Approach:<\/strong><\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n Where<\/p>\n count:<\/strong> The number of non-null entries.<\/p>\n unique:<\/strong> The number of unique\/distinct entries.<\/p>\n top:<\/strong> The value that occurred most frequently.<\/p>\n freq:<\/strong> The frequency of the most frequent entry, i.e., here ‘e’<\/strong> occurred 3 times, so its freq is 3.<\/p>\n Importing a Dataset and Applying the describe() Method to Get Summary Statistics<\/strong><\/p>\n Let us take a cereal<\/strong> dataset as an example.<\/p>\n Import the dataset into a pandas DataFrame.<\/p>\n Approach:<\/strong><\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n The result includes summary statistics for all of the numeric columns in our dataset; by default, describe() on a DataFrame summarizes only numeric columns.<\/p>\n The describe() method can also be used to obtain summary statistics for a timestamp series.<\/p>\n Approach:<\/strong><\/p>\n Below is the implementation:<\/strong><\/p>\n Output:<\/strong><\/p>\n You can also tell the describe() method to treat datetime values as numeric by passing datetime_is_numeric=True. This parameter is accepted in pandas 1.1 through 1.5; pandas 2.0 removed it and always summarizes datetime data numerically. The result will be displayed in a way similar to that of numerical data. 
For datetime values, you can get the mean, median, 25th percentile, and 75th percentile.<\/p>\n Output:<\/strong><\/p>\n <\/p>\n","protected":false},"excerpt":{"rendered":" To calculate summary statistics in Python, use the pandas describe() method. The describe() method can be used on both numeric and object data, such as strings or timestamps. The result for the two will differ in terms of fields. For numerical data, the outcome will be as follows: count mean standard deviation minimum maximum 25 percentile …<\/p>\n\n
\n
\n
# Import the pandas module\r\nimport pandas\r\n# Give the list as static input and store it in a variable.\r\ngvn_lst = [9, 5, 8, 2, 1]\r\n# Pass the list to the pandas.Series() constructor to define a Series.\r\nrslt_seris = pandas.Series(gvn_lst)\r\n# Call describe() on the Series and print its summary statistics.\r\nprint(rslt_seris.describe())\r\n<\/pre>\n
count 5.000000\r\nmean 5.000000\r\nstd 3.535534\r\nmin 1.000000\r\n25% 2.000000\r\n50% 5.000000\r\n75% 8.000000\r\nmax 9.000000\r\ndtype: float64<\/pre>\n
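As a rough cross-check of the definitions, the same numbers can be recomputed with the individual Series methods. This is a minimal sketch, not part of the original program; the names `values`, `summary`, and `manual` are illustrative.

```python
import pandas as pd

# Same data as the numerical example.
values = pd.Series([9, 5, 8, 2, 1])
summary = values.describe()

# Recompute every field of describe() with the corresponding Series method.
manual = pd.Series({
    "count": float(values.count()),   # number of non-null entries
    "mean": values.mean(),
    "std": values.std(),              # sample standard deviation (ddof=1)
    "min": float(values.min()),
    "25%": values.quantile(0.25),
    "50%": values.quantile(0.50),     # the median
    "75%": values.quantile(0.75),
    "max": float(values.max()),
})

print(summary)
print(manual)
```

Both printed Series should list the same values, which suggests that describe() is a convenience wrapper around these individual statistics.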
\n
# Import the pandas module\r\nimport pandas\r\n# Give the list of characters as static input and store it in a variable.\r\ngvn_lst = ['p', 'e', 'r', 'g', 'e', 'p', 'e']\r\n# Pass the list to the pandas.Series() constructor to define a Series.\r\nrslt_seris = pandas.Series(gvn_lst)\r\n# Call describe() on the Series and print its summary statistics.\r\nprint(rslt_seris.describe())\r\n<\/pre>\n
count 7\r\nunique 4\r\ntop e\r\nfreq 3\r\ndtype: object<\/pre>\n
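The four object-data fields can likewise be recomputed by hand. The sketch below assumes only standard Series methods (`count`, `nunique`, `value_counts`); the names `letters`, `counts`, and `manual` are illustrative and do not come from the original program.

```python
import pandas as pd

# Same data as the object example.
letters = pd.Series(['p', 'e', 'r', 'g', 'e', 'p', 'e'])

# value_counts() sorts by frequency in descending order,
# so the first row holds the most frequent value and its count.
counts = letters.value_counts()

manual = {
    "count": int(letters.count()),     # non-null entries
    "unique": int(letters.nunique()),  # distinct values
    "top": counts.index[0],            # most frequent value
    "freq": int(counts.iloc[0]),       # how often it occurs
}

print(manual)  # {'count': 7, 'unique': 4, 'top': 'e', 'freq': 3}
```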
Calculation of Summary Statistics for a Larger Dataset:<\/strong><\/h4>\n
\n
# Import pandas module as pd using the import keyword\r\nimport pandas as pd\r\n# Import dataset using read_csv() function by passing the dataset name as\r\n# an argument to it.\r\n# Store it in a variable.\r\ncereal_dataset = pd.read_csv('cereal.csv')\r\n# Apply describe() method to the above-given dataset to get the Summary Statistics\r\n# of the dataset.\r\ncereal_dataset.describe()<\/pre>\n
\n\n
\n \n<\/th>\n calories<\/th>\n protein<\/th>\n fat<\/th>\n sodium<\/th>\n fiber<\/th>\n carbo<\/th>\n sugars<\/th>\n potass<\/th>\n vitamins<\/th>\n shelf<\/th>\n weight<\/th>\n cups<\/th>\n rating<\/th>\n<\/tr>\n<\/thead>\n \n count<\/th>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n 77.000000<\/td>\n<\/tr>\n \n mean<\/th>\n 106.883117<\/td>\n 2.545455<\/td>\n 1.012987<\/td>\n 159.675325<\/td>\n 2.151948<\/td>\n 14.597403<\/td>\n 6.922078<\/td>\n 96.077922<\/td>\n 28.246753<\/td>\n 2.207792<\/td>\n 1.029610<\/td>\n 0.821039<\/td>\n 42.665705<\/td>\n<\/tr>\n \n std<\/th>\n 19.484119<\/td>\n 1.094790<\/td>\n 1.006473<\/td>\n 83.832295<\/td>\n 2.383364<\/td>\n 4.278956<\/td>\n 4.444885<\/td>\n 71.286813<\/td>\n 22.342523<\/td>\n 0.832524<\/td>\n 0.150477<\/td>\n 0.232716<\/td>\n 14.047289<\/td>\n<\/tr>\n \n min<\/th>\n 50.000000<\/td>\n 1.000000<\/td>\n 0.000000<\/td>\n 0.000000<\/td>\n 0.000000<\/td>\n -1.000000<\/td>\n -1.000000<\/td>\n -1.000000<\/td>\n 0.000000<\/td>\n 1.000000<\/td>\n 0.500000<\/td>\n 0.250000<\/td>\n 18.042851<\/td>\n<\/tr>\n \n 25%<\/th>\n 100.000000<\/td>\n 2.000000<\/td>\n 0.000000<\/td>\n 130.000000<\/td>\n 1.000000<\/td>\n 12.000000<\/td>\n 3.000000<\/td>\n 40.000000<\/td>\n 25.000000<\/td>\n 1.000000<\/td>\n 1.000000<\/td>\n 0.670000<\/td>\n 33.174094<\/td>\n<\/tr>\n \n 50%<\/th>\n 110.000000<\/td>\n 3.000000<\/td>\n 1.000000<\/td>\n 180.000000<\/td>\n 2.000000<\/td>\n 14.000000<\/td>\n 7.000000<\/td>\n 90.000000<\/td>\n 25.000000<\/td>\n 2.000000<\/td>\n 1.000000<\/td>\n 0.750000<\/td>\n 40.400208<\/td>\n<\/tr>\n \n 75%<\/th>\n 110.000000<\/td>\n 3.000000<\/td>\n 2.000000<\/td>\n 210.000000<\/td>\n 3.000000<\/td>\n 17.000000<\/td>\n 11.000000<\/td>\n 120.000000<\/td>\n 25.000000<\/td>\n 3.000000<\/td>\n 1.000000<\/td>\n 1.000000<\/td>\n 50.828392<\/td>\n<\/tr>\n \n 
max<\/th>\n 160.000000<\/td>\n 6.000000<\/td>\n 5.000000<\/td>\n 320.000000<\/td>\n 14.000000<\/td>\n 23.000000<\/td>\n 15.000000<\/td>\n 330.000000<\/td>\n 100.000000<\/td>\n 3.000000<\/td>\n 1.500000<\/td>\n 1.500000<\/td>\n 93.704912<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n Calculation of Summary Statistics for <\/strong>timestamp series:<\/h4>\n
\n
# Import the pandas module\r\nimport pandas\r\n# Import the numpy module as np\r\nimport numpy as np\r\n# Give the timestamps as static input using the np.datetime64() function\r\n# and store them in a variable.\r\ngvn_timestmp = [np.datetime64(\"2005-04-03\"), np.datetime64(\r\n \"2008-05-01\"), np.datetime64(\"2008-05-01\"), np.datetime64(\"2003-01-07\")]\r\n# Pass the timestamps to the pandas.Series() constructor to define a Series.\r\nrslt_seris = pandas.Series(gvn_timestmp)\r\n# Call describe() on the Series and print its summary statistics.\r\nprint(rslt_seris.describe())\r\n<\/pre>\n
count 4\r\nunique 3\r\ntop 2008-05-01 00:00:00\r\nfreq 2\r\nfirst 2003-01-07 00:00:00\r\nlast 2008-05-01 00:00:00\r\ndtype: object<\/pre>\n
rslt_seris.describe(datetime_is_numeric=True)<\/pre>\n
# Import the pandas module\r\nimport pandas\r\n# Import the numpy module as np\r\nimport numpy as np\r\n# Give the timestamps as static input using the np.datetime64() function\r\n# and store them in a variable.\r\ngvn_timestmp = [np.datetime64(\"2005-04-03\"), np.datetime64(\r\n \"2008-05-01\"), np.datetime64(\"2008-05-01\"), np.datetime64(\"2003-01-07\")]\r\n# Pass the timestamps to the pandas.Series() constructor to define a Series.\r\nrslt_seris = pandas.Series(gvn_timestmp)\r\n# Call describe() with datetime_is_numeric=True to summarize the timestamps\r\n# numerically. Note: this parameter exists in pandas 1.1 through 1.5 only;\r\n# pandas 2.0 removed it, and a plain describe() gives the numeric-style summary there.\r\nprint(rslt_seris.describe(datetime_is_numeric=True))\r\n<\/pre>\n
count 4\r\nmean 2006-03-26 18:00:00\r\nmin 2003-01-07 00:00:00\r\n25% 2004-09-10 18:00:00\r\n50% 2006-10-17 00:00:00\r\n75% 2008-05-01 00:00:00\r\nmax 2008-05-01 00:00:00\r\ndtype: object<\/pre>\n
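Because the datetime_is_numeric parameter is not available in every pandas release, one version-independent alternative is to compute the same fields directly. This is a minimal sketch under that assumption; `stamps` and `manual` are illustrative names, and pd.to_datetime() is used here instead of np.datetime64() purely for brevity.

```python
import pandas as pd

# Same four timestamps as the example above.
stamps = pd.Series(pd.to_datetime(
    ["2005-04-03", "2008-05-01", "2008-05-01", "2003-01-07"]))

# mean(), min(), max(), and quantile() all work on datetime Series,
# so the numeric-style summary can be assembled by hand.
manual = {
    "count": int(stamps.count()),
    "mean": stamps.mean(),
    "min": stamps.min(),
    "25%": stamps.quantile(0.25),
    "50%": stamps.quantile(0.50),   # the median timestamp
    "75%": stamps.quantile(0.75),
    "max": stamps.max(),
}

for name, value in manual.items():
    print(f"{name:>5}  {value}")
```

The printed values should line up with the describe() output shown above, with the median falling on 2006-10-17.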