Python DataFrames – Quick Overview and Summary

Pandas DataFrames are really powerful. DataFrames in Python make data manipulation very user-friendly.

Pandas allows you to import large datasets and then manipulate them effectively. CSV data can be easily imported into a Pandas DataFrame.

What Are Python DataFrames and How Do You Use Them?

DataFrames are two-dimensional labeled data structures whose columns can each hold a different data type.
DataFrames can be used for a wide range of analyses.
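
To make the idea of labeled rows and typed columns concrete, here is a minimal sketch. The variable name cereals is only illustrative, and the three rows are copied from the cereal data used later in this article.

```python
import pandas as pd

# Three rows copied from the cereal data used later in this article.
# The variable name `cereals` is only illustrative.
cereals = pd.DataFrame(
    {
        "name": ["100% Bran", "100% Natural Bran", "All-Bran"],
        "calories": [70, 120, 70],
        "fiber": [10.0, 2.0, 9.0],
    }
)

# Each column keeps its own data type (text, integer, float).
print(cereals.dtypes)

# Rows and columns are both labeled, so a value can be looked up by label.
print(cereals.loc[0, "name"])  # 100% Bran
print(cereals.shape)           # (3, 3)
```

Here the row labels are the default integers 0, 1, 2 and the column labels are the dictionary keys.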

Often the dataset is too large to examine all at once. Instead, we’d like to see a summary of the DataFrame.
We can get the first five rows of the dataset as well as a quick statistical summary of the data. We can also find out the data type of each column in our dataset.

Let us take a cereal dataset as an example.

1) Importing the Dataset

Import the dataset into a Pandas DataFrame.

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')

This stores the dataset in the variable ‘cereal_dataset’ as a DataFrame.
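
If cereal.csv is not at hand, the same call can be tried on CSV text held in memory. The sketch below assumes a small stand-in for the file: a header and two rows with only a subset of the columns. The names csv_text and sample are illustrative.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a header plus two rows, with only some columns.
csv_text = """name,mfr,type,calories,rating
100% Bran,N,C,70,68.402973
All-Bran,K,C,70,59.425505
"""

# read_csv() accepts a file path or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # DataFrame
print(sample.shape)           # (2, 5)
```

The first line of the text becomes the column labels, and every following line becomes one row.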

2) Getting the First 5 Rows

It is common for data scientists to look at the first five rows of the DataFrame after importing a dataset for the first time. It gives a rough idea of how the data looks and what it is all about.

Apply the head() function to the above dataset to get the first 5 rows.

cereal_dataset.head()

For Example:

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()

Output:

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	C	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	C	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	C	110	2	2	200	1.0	14.0	8	-1	25	3	1.0	0.75	34.384843
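
head() returns five rows by default, but it also accepts the number of rows you want, and tail() does the same from the end of the DataFrame. A minimal sketch, using only the calorie values of the five rows above (the variable names are illustrative):

```python
import pandas as pd

# Calorie values of the five rows shown in the output above.
calories = pd.DataFrame({"calories": [70, 120, 70, 50, 110]})

first_two = calories.head(2)  # first 2 rows instead of the default 5
last_two = calories.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```

Both calls return a new DataFrame and keep the original row labels, so last_two still carries the labels 3 and 4.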

3) To Obtain a Statistical Summary

The describe() method in pandas is used to get a statistical summary of the numeric columns of your DataFrame.

cereal_dataset.describe()

For Example:

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() function to the above dataset to get its
# statistical summary.
cereal_dataset.describe()

Output:

	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
count	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000	77.000000
mean	106.883117	2.545455	1.012987	159.675325	2.151948	14.597403	6.922078	96.077922	28.246753	2.207792	1.029610	0.821039	42.665705
std	19.484119	1.094790	1.006473	83.832295	2.383364	4.278956	4.444885	71.286813	22.342523	0.832524	0.150477	0.232716	14.047289
min	50.000000	1.000000	0.000000	0.000000	0.000000	-1.000000	-1.000000	-1.000000	0.000000	1.000000	0.500000	0.250000	18.042851
25%	100.000000	2.000000	0.000000	130.000000	1.000000	12.000000	3.000000	40.000000	25.000000	1.000000	1.000000	0.670000	33.174094
50%	110.000000	3.000000	1.000000	180.000000	2.000000	14.000000	7.000000	90.000000	25.000000	2.000000	1.000000	0.750000	40.400208
75%	110.000000	3.000000	2.000000	210.000000	3.000000	17.000000	11.000000	120.000000	25.000000	3.000000	1.000000	1.000000	50.828392
max	160.000000	6.000000	5.000000	320.000000	14.000000	23.000000	15.000000	330.000000	100.000000	3.000000	1.500000	1.500000	93.704912
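
Notice that the minimum of carbo, sugars and potass is -1. The sketch below assumes that -1 is a placeholder for a value that was not measured. That is an assumption, not something the dataset states here. It shows how such a placeholder shifts the summary, using the potass values of the five rows printed by head().

```python
import numpy as np
import pandas as pd

# Potassium values of the five rows shown by head(); the last one is -1.
sample = pd.DataFrame({"potass": [280, 135, 320, 330, -1]})

# Assumption: -1 is a placeholder for "not measured", not a real value.
cleaned = sample.replace(-1, np.nan)

raw_summary = sample["potass"].describe()
clean_summary = cleaned["potass"].describe()

print(raw_summary)    # count is 5, min is -1
print(clean_summary)  # count is 4, min is 135
```

describe() ignores NaN, so after the replacement the count drops by one and the mean and minimum are no longer pulled down by the placeholder.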

4) To Obtain a Quick Description of the Dataset

The info() method in pandas is used to get a quick description of the type of data in the table.

cereal_dataset.info()

For Example:

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the info() function to the above dataset to get a quick description
# of the type of data in the table.
cereal_dataset.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       77 non-null     object 
 2   type      77 non-null     object 
 3   calories  77 non-null     int64  
 4   protein   77 non-null     int64  
 5   fat       77 non-null     int64  
 6   sodium    77 non-null     int64  
 7   fiber     77 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    77 non-null     int64  
 10  potass    77 non-null     int64  
 11  vitamins  77 non-null     int64  
 12  shelf     77 non-null     int64  
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(3)
memory usage: 9.8+ KB

Each column of the dataset gets one row in the output. For each column label, the number of non-null entries and the data type of the column are shown.

Knowing the data type of your dataset’s columns allows you to make better decisions when it comes to using the data to train models.
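
One way to act on the column types is to split the columns into numeric and non-numeric groups with select_dtypes(). This is a minimal sketch on two rows taken from the data above. The variable names are illustrative, and how you treat each group depends on your model.

```python
import pandas as pd

# Two rows and four columns taken from the cereal data above.
sample = pd.DataFrame(
    {
        "name": ["100% Bran", "All-Bran"],
        "mfr": ["N", "K"],
        "calories": [70, 70],
        "rating": [68.402973, 59.425505],
    }
)

# Integer and float columns.
numeric = sample.select_dtypes(include="number")
# Everything else, here the text columns.
text = sample.select_dtypes(exclude="number")

print(list(numeric.columns))  # ['calories', 'rating']
print(list(text.columns))     # ['name', 'mfr']
```

Text columns such as mfr usually need to be encoded as numbers before they can be passed to a model, while numeric columns can often be used as they are.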

5) To Obtain a Count for Each Column

In Pandas, you can directly get the count of non-null entries in each column by using the count() method.

cereal_dataset.count()

For Example:

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the count() function to the above dataset to directly get the count
# of non-null entries in each column.
cereal_dataset.count()

Output:

name        77
mfr         77
type        77
calories    77
protein     77
fat         77
sodium      77
fiber       77
carbo       77
sugars      77
potass      77
vitamins    77
shelf       77
weight      77
cups        77
rating      77
dtype: int64

Seeing the count for each column can help you identify any missing entries in your data. Following that, you can plan your data cleaning strategy.
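
In the cereal dataset every column reports 77, so nothing is null. To see what a gap looks like, here is a minimal sketch on a hypothetical three-row frame in which one potass value has been set to NaN on purpose (this missing value is made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN is inserted on purpose for illustration.
sample = pd.DataFrame(
    {
        "name": ["100% Bran", "All-Bran", "Almond Delight"],
        "potass": [280.0, 320.0, np.nan],
    }
)

non_null = sample.count()         # non-null entries per column
missing = len(sample) - non_null  # same numbers as sample.isna().sum()

print(non_null)
print(missing)
```

A column whose count is lower than the number of rows has missing entries, and the difference tells you how many.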

6) To Generate a Histogram for Each Column in the Dataset

Pandas enables you to display histograms for each numeric column with a single line of code. The plotting is done by matplotlib, which must be installed.

cereal_dataset.hist()

For Example:

# Import the pandas module under the alias pd
import pandas as pd
# Read the dataset with the read_csv() function, passing the file name as
# an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the hist() function to the above dataset to generate a histogram
# for each numeric column.
cereal_dataset.hist()

Output: a grid of histograms, one for each numeric column.

Histograms are frequently used by data scientists to gain a better understanding of the data.
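
hist() also accepts options such as bins and figsize. The sketch below is one possible way to use them on the five rows printed earlier. It selects the off-screen Agg backend so it runs without a display, and the variable names are illustrative.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend, so no display is needed

import pandas as pd

# The five rows printed by head(), reduced to three columns.
sample = pd.DataFrame(
    {
        "name": ["100% Bran", "100% Natural Bran", "All-Bran",
                 "All-Bran with Extra Fiber", "Almond Delight"],
        "calories": [70, 120, 70, 50, 110],
        "fiber": [10.0, 2.0, 9.0, 14.0, 1.0],
    }
)

# hist() skips the text column and returns an array of matplotlib Axes,
# one per numeric column.
axes = sample.hist(bins=5, figsize=(8, 4))

titles = [ax.get_title() for ax in axes.flatten()]
print(titles)
```

Each subplot is titled with its column name. In a script you would typically follow this with matplotlib's savefig() or show() to see the figure, while a notebook displays it automatically.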