Python DataFrames – Quick Overview and Summary

Pandas DataFrames are a remarkably convenient tool: they make data manipulation in Python very user-friendly.

Pandas allows you to import large datasets and manipulate them efficiently. For example, CSV data can be loaded into a Pandas DataFrame with a single function call.

What Are Python DataFrames and How Do You Use Them?

DataFrames are two-dimensional labeled data structures whose columns can hold different data types.
DataFrames can be used for a wide range of analyses.
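
To make the idea of labeled columns with different types concrete, here is a minimal sketch that builds a small DataFrame by hand. The three rows are taken from the cereal output shown later in this article; the variable name sample is only for illustration.

```python
import pandas as pd

# A tiny hand-built DataFrame: each dictionary key becomes a column label.
sample = pd.DataFrame({
    "name": ["100% Bran", "All-Bran", "Almond Delight"],  # text
    "calories": [70, 70, 110],                            # integers
    "fiber": [10.0, 9.0, 1.0],                            # floats
})

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # one data type per column
```

Each column keeps its own data type, which is what lets a single table mix text and numbers.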

Often the dataset is too large to examine in full. Instead, we would like to see a summary of the DataFrame.
We can look at the first five rows of the dataset and get a quick statistical summary of the data. We can also find out the data type of each column.

Let us take a cereal dataset as an example.

1) Importing the Dataset

Import the dataset into a Pandas DataFrame.

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')

This stores the dataset in the variable ‘cereal_dataset’ as a DataFrame.
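
If you do not have cereal.csv at hand, the import step can still be tried out, because read_csv() also accepts a file-like object. The sketch below assumes a three-row excerpt with only a subset of the columns; csv_text and cereal_sample are illustrative names, not part of the original example.

```python
import io

import pandas as pd

# A three-row excerpt standing in for cereal.csv (assumed column subset).
csv_text = """name,mfr,type,calories,fiber,rating
100% Bran,N,C,70,10.0,68.402973
All-Bran,K,C,70,9.0,59.425505
Almond Delight,R,C,110,1.0,34.384843
"""

# read_csv() accepts a file path or any file-like object.
cereal_sample = pd.read_csv(io.StringIO(csv_text))

print(type(cereal_sample))             # the result is a pandas DataFrame
print(cereal_sample.columns.tolist())  # column labels come from the header row
```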

2) Getting the First 5 Rows

It is common for data scientists to look at the first five rows of the DataFrame after importing a dataset for the first time. This gives a rough idea of what the data looks like and what it is about.

Apply the head() method to the dataset to get the first 5 rows.

cereal_dataset.head()

For Example:

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the head() method to the dataset to get the first 5 rows.
cereal_dataset.head()

Output:

                        name mfr type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  vitamins  shelf  weight  cups     rating
0                  100% Bran   N    C        70        4    1     130   10.0    5.0       6     280        25      3     1.0  0.33  68.402973
1          100% Natural Bran   Q    C       120        3    5      15    2.0    8.0       8     135         0      3     1.0  1.00  33.983679
2                   All-Bran   K    C        70        4    1     260    9.0    7.0       5     320        25      3     1.0  0.33  59.425505
3  All-Bran with Extra Fiber   K    C        50        4    0     140   14.0    8.0       0     330        25      3     1.0  0.50  93.704912
4             Almond Delight   R    C       110        2    2     200    1.0   14.0       8      -1        25      3     1.0  0.75  34.384843
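
Five rows is only the default. As a small sketch, head() also takes an explicit row count, and tail() does the same from the end of the DataFrame. The DataFrame df below is a hypothetical stand-in so the snippet runs without cereal.csv.

```python
import pandas as pd

# Hypothetical stand-in for cereal_dataset: ten rows, two columns.
df = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

first_five = df.head()    # default: the first 5 rows
first_three = df.head(3)  # explicit row count
last_two = df.tail(2)     # tail() mirrors head() from the end

print(first_three)
print(last_two)
```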

3) Obtaining a Statistical Summary

The describe() method in pandas returns a statistical summary of your DataFrame. By default it covers only the numeric columns.

cereal_dataset.describe()

For Example:

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the describe() method to the dataset to get its statistical summary.
cereal_dataset.describe()

Output:

         calories     protein         fat      sodium       fiber       carbo      sugars      potass    vitamins       shelf      weight        cups      rating
count   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000   77.000000
mean   106.883117    2.545455    1.012987  159.675325    2.151948   14.597403    6.922078   96.077922   28.246753    2.207792    1.029610    0.821039   42.665705
std     19.484119    1.094790    1.006473   83.832295    2.383364    4.278956    4.444885   71.286813   22.342523    0.832524    0.150477    0.232716   14.047289
min     50.000000    1.000000    0.000000    0.000000    0.000000   -1.000000   -1.000000   -1.000000    0.000000    1.000000    0.500000    0.250000   18.042851
25%    100.000000    2.000000    0.000000  130.000000    1.000000   12.000000    3.000000   40.000000   25.000000    1.000000    1.000000    0.670000   33.174094
50%    110.000000    3.000000    1.000000  180.000000    2.000000   14.000000    7.000000   90.000000   25.000000    2.000000    1.000000    0.750000   40.400208
75%    110.000000    3.000000    2.000000  210.000000    3.000000   17.000000   11.000000  120.000000   25.000000    3.000000    1.000000    1.000000   50.828392
max    160.000000    6.000000    5.000000  320.000000   14.000000   23.000000   15.000000  330.000000  100.000000    3.000000    1.500000    1.500000   93.704912
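
The text columns (name, mfr, type) are absent from this output because describe() summarises only numeric columns unless told otherwise. A minimal sketch, using a hypothetical four-row DataFrame in place of the cereal data, shows how include="all" brings the text columns in as well:

```python
import pandas as pd

# Hypothetical miniature of the cereal data: one text and one numeric column.
df = pd.DataFrame({
    "mfr": ["N", "K", "K", "R"],
    "calories": [70, 70, 50, 110],
})

numeric_summary = df.describe()           # numeric columns only
full_summary = df.describe(include="all")  # text columns get count/unique/top/freq

print(numeric_summary)
print(full_summary)
```

For text columns the summary reports the number of distinct values (unique), the most frequent value (top) and how often it occurs (freq); the numeric statistics are left empty for them.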

4) Obtaining a Quick Description of the Dataset

The info() method in pandas gives a quick description of the type of data in the table.

cereal_dataset.info()

For Example:

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the info() method to the dataset to get a quick description of the
# type of data in the table.
cereal_dataset.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       77 non-null     object 
 2   type      77 non-null     object 
 3   calories  77 non-null     int64  
 4   protein   77 non-null     int64  
 5   fat       77 non-null     int64  
 6   sodium    77 non-null     int64  
 7   fiber     77 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    77 non-null     int64  
 10  potass    77 non-null     int64  
 11  vitamins  77 non-null     int64  
 12  shelf     77 non-null     int64  
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(3)
memory usage: 9.8+ KB

Each column of the dataset gets one row in the output. For each column label, the output shows the number of non-null entries and the data type of the column.

Knowing the data type of your dataset’s columns allows you to make better decisions when it comes to using the data to train models.
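
When only the column types are needed, the dtypes attribute returns them directly, and select_dtypes() can filter columns by type. The sketch below uses a hypothetical two-row DataFrame rather than the full cereal data.

```python
import pandas as pd

# Hypothetical two-row excerpt with one text and two numeric columns.
df = pd.DataFrame({
    "name": ["100% Bran", "All-Bran"],
    "calories": [70, 70],
    "fiber": [10.0, 9.0],
})

print(df.dtypes)  # data type of each column

# Keep only the numeric columns, e.g. as model inputs.
numeric_only = df.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```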

5) Obtaining a Count for Each Column

In Pandas, you can directly get the number of non-null entries in each column by using the count() method.

cereal_dataset.count()

For Example:

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the count() method to the dataset to get the number of non-null
# entries in each column.
cereal_dataset.count()

Output:

name        77
mfr         77
type        77
calories    77
protein     77
fat         77
sodium      77
fiber       77
carbo       77
sugars      77
potass      77
vitamins    77
shelf       77
weight      77
cups        77
rating      77
dtype: int64

Seeing the count for each column can help you identify missing entries in your data: a count lower than the number of rows means that column has gaps. Following that, you can plan your data cleaning strategy.
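
The cereal dataset has no missing entries, so every count is 77. To see what a gap looks like, here is a minimal sketch on a hypothetical DataFrame with one value deliberately left out; isna().sum() reports the number of missing entries per column directly.

```python
import pandas as pd

# Hypothetical data: the third potass value is deliberately missing.
df = pd.DataFrame({
    "calories": [70, 120, 70, 50],
    "potass": [280.0, 135.0, float("nan"), 330.0],
})

non_null = df.count()      # non-null entries per column
missing = df.isna().sum()  # missing entries per column

print(non_null)
print(missing)
```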

6) Generating a Histogram for Each Column in the Dataset

Pandas enables you to display a histogram for each numeric column with a single line of code. Plotting relies on Matplotlib being installed.

cereal_dataset.hist()

For Example:

# Import the pandas module under the conventional alias pd
import pandas as pd
# Read the CSV file with the read_csv() function, passing the file name
# as an argument, and store the resulting DataFrame in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply the hist() method to the dataset to generate a histogram for each
# numeric column.
cereal_dataset.hist()

Output: a grid of histograms, one per numeric column.

Histograms are frequently used by data scientists to gain a better understanding of the data.
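
Outside a notebook the histograms do not appear on their own, so one option is to save the figure to a file. The following is a sketch under a few assumptions: Matplotlib is installed, the non-interactive Agg backend is acceptable, and the DataFrame and the file name cereal_histograms.png are hypothetical stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical four-row excerpt: one text column, two numeric columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "calories": [70, 120, 70, 50],
    "fiber": [10.0, 2.0, 9.0, 14.0],
})

# hist() draws one subplot per numeric column and returns the axes.
axes = df.hist(bins=5, figsize=(8, 4))
titles = [ax.get_title() for ax in axes.ravel() if ax.get_title()]
print(titles)  # the text column is skipped

plt.tight_layout()
plt.savefig("cereal_histograms.png")  # hypothetical output file name
```

In an interactive session, plt.show() can be used in place of plt.savefig().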