Pandas DataFrames make data manipulation in Python very user-friendly.
Pandas allows you to import large datasets and then manipulate them effectively. For example, CSV data can be easily imported into a Pandas DataFrame.
What Are Python DataFrames and How Do You Use Them?
DataFrames are two-dimensional, labeled data structures whose columns can hold different data types.
DataFrames can be used for a wide range of analyses.
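As a small illustration of "two-dimensional, labeled, with columns of various types", the sketch below builds a tiny DataFrame by hand. The three rows are a hypothetical subset whose values are copied from the cereal example used later in this article; the variable name `cereals` is only for illustration.

```python
import pandas as pd

# Hypothetical three-row table; values copied from the cereal example in this article.
cereals = pd.DataFrame(
    {
        "name": ["100% Bran", "All-Bran", "Almond Delight"],  # text column
        "calories": [70, 70, 110],                            # integer column
        "fiber": [10.0, 9.0, 1.0],                            # float column
    }
)

print(cereals.shape)           # (3, 3): three rows, three columns
print(list(cereals.columns))   # the column labels
print(cereals.dtypes)          # one data type per column
print(cereals.loc[1, "name"])  # look up a cell by row label and column label
```

Each column keeps its own data type, while the row labels (0, 1, 2 here) and column labels let you address any cell directly.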
Often a dataset is too large to examine in full. Instead, we would like to see a summary of the DataFrame.
We can get the first five rows of the dataset as well as a quick statistical summary of the data. We can also find out the data type of each column in our dataset.
Let us take a cereal dataset as an example.
1) Importing the Dataset
Import the dataset into a Pandas DataFrame.
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')
This will save the dataset in the variable 'cereal_dataset' as a DataFrame.
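If you do not have cereal.csv at hand, the import step can be tried on in-memory text. The sketch below assumes a stand-in sample: a few columns of the first five rows shown in this article, held in the hypothetical variable `sample_text`. It also shows the `nrows` parameter of `read_csv()`, which limits how many rows are read and can be handy for a first look at a very large file.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""

# read_csv() accepts a file path or any file-like object.
cereal_sample = pd.read_csv(io.StringIO(sample_text))
print(type(cereal_sample).__name__)  # DataFrame
print(cereal_sample.shape)           # (5, 6)

# Read only the first two data rows.
first_two = pd.read_csv(io.StringIO(sample_text), nrows=2)
print(first_two)
```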
2) Getting the First 5 Rows
It is common for data scientists to look at the first five rows of the DataFrame after importing a dataset for the first time. This gives a rough idea of how the data looks and what it is all about.
Apply the head() function to the dataset above to get the first 5 rows.
cereal_dataset.head()
For Example:
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')

# Apply the head() function to the dataset to get the first 5 rows
cereal_dataset.head()
Output:
| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
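Five rows is only the default. As a minimal sketch, `head()` also accepts the number of rows to return, and `tail()` does the same from the end of the DataFrame. The data below is an assumed stand-in for cereal.csv, built from a few columns of the rows shown above.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""
cereal_sample = pd.read_csv(io.StringIO(sample_text))

preview = cereal_sample.head(3)    # first three rows instead of the default five
last_rows = cereal_sample.tail(2)  # last two rows

print(preview)
print(last_rows)
```

In a notebook the last expression of a cell is displayed automatically; in a plain script the result has to be passed to `print()` as above.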
3) Obtaining a Statistical Summary
The describe() method in pandas is used to get a statistical summary of your DataFrame. By default it covers only the numeric columns.
cereal_dataset.describe()
For Example:
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')

# Apply the describe() function to the dataset to get its statistical summary
cereal_dataset.describe()
Output:
| | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 | 77.000000 |
mean | 106.883117 | 2.545455 | 1.012987 | 159.675325 | 2.151948 | 14.597403 | 6.922078 | 96.077922 | 28.246753 | 2.207792 | 1.029610 | 0.821039 | 42.665705 |
std | 19.484119 | 1.094790 | 1.006473 | 83.832295 | 2.383364 | 4.278956 | 4.444885 | 71.286813 | 22.342523 | 0.832524 | 0.150477 | 0.232716 | 14.047289 |
min | 50.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | -1.000000 | -1.000000 | 0.000000 | 1.000000 | 0.500000 | 0.250000 | 18.042851 |
25% | 100.000000 | 2.000000 | 0.000000 | 130.000000 | 1.000000 | 12.000000 | 3.000000 | 40.000000 | 25.000000 | 1.000000 | 1.000000 | 0.670000 | 33.174094 |
50% | 110.000000 | 3.000000 | 1.000000 | 180.000000 | 2.000000 | 14.000000 | 7.000000 | 90.000000 | 25.000000 | 2.000000 | 1.000000 | 0.750000 | 40.400208 |
75% | 110.000000 | 3.000000 | 2.000000 | 210.000000 | 3.000000 | 17.000000 | 11.000000 | 120.000000 | 25.000000 | 3.000000 | 1.000000 | 1.000000 | 50.828392 |
max | 160.000000 | 6.000000 | 5.000000 | 320.000000 | 14.000000 | 23.000000 | 15.000000 | 330.000000 | 100.000000 | 3.000000 | 1.500000 | 1.500000 | 93.704912 |
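The summary returned by `describe()` is itself a DataFrame, so individual statistics can be looked up by row and column label. The sketch below assumes a small stand-in sample (a few columns of the first five rows shown earlier), so its numbers differ from the full 77-row output above.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""
cereal_sample = pd.read_csv(io.StringIO(sample_text))

summary = cereal_sample.describe()  # numeric columns only by default

# Pick single statistics out of the summary table.
mean_calories = summary.loc["mean", "calories"]
lowest_potass = summary.loc["min", "potass"]
print(mean_calories, lowest_potass)

# describe() also works on a single column.
print(cereal_sample["calories"].describe())
```

A minimum such as -1 in a column that cannot be negative is worth a second look, since it may be a placeholder rather than a real measurement.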
4) Obtaining a Quick Description of the Dataset
The info() method in pandas is used to get a quick description of the type of data in the table.
cereal_dataset.info()
For Example:
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')

# Apply the info() function to the dataset to get a quick description
# of the type of data in the table
cereal_dataset.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   name      77 non-null     object
 1   mfr       77 non-null     object
 2   type      77 non-null     object
 3   calories  77 non-null     int64
 4   protein   77 non-null     int64
 5   fat       77 non-null     int64
 6   sodium    77 non-null     int64
 7   fiber     77 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    77 non-null     int64
 10  potass    77 non-null     int64
 11  vitamins  77 non-null     int64
 12  shelf     77 non-null     int64
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(3)
memory usage: 9.8+ KB
Each column of the dataset gets one row in the output. For each column label, the number of non-null entries and the data type of the column are reported.
Knowing the data type of your dataset’s columns allows you to make better decisions when it comes to using the data to train models.
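When the column types are needed in code rather than as printed text, the `dtypes` attribute and `select_dtypes()` can be used. The sketch below is one possible way to separate numeric columns from text columns and to mark a text column as categorical; it runs on an assumed stand-in sample, and whether `mfr` should really be treated as a category is a modelling decision the article does not make.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""
cereal_sample = pd.read_csv(io.StringIO(sample_text))

# dtypes holds the data type of every column as a Series.
print(cereal_sample.dtypes)

# Keep only the numeric columns, e.g. as candidate model features.
numeric_columns = cereal_sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)

# Assumption for illustration: treat the manufacturer code as a categorical value.
mfr_category = cereal_sample["mfr"].astype("category")
print(mfr_category.dtype)
```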
5) Obtaining a Count for Each Column
In Pandas, you can directly get the count of non-null entries in each column by using the count() method.
cereal_dataset.count()
For Example:
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')

# Apply the count() function to the dataset to get the count of
# non-null entries in each column
cereal_dataset.count()
Output:
name        77
mfr         77
type        77
calories    77
protein     77
fat         77
sodium      77
fiber       77
carbo       77
sugars      77
potass      77
vitamins    77
shelf       77
weight      77
cups        77
rating      77
dtype: int64
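Seeing the count for each column can help you identify any missing entries in your data: a column whose count is lower than the number of rows contains null values. Here every column has 77 entries, so no column contains nulls. Following that, you can plan your data cleaning strategy.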
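count() only detects real null values. The sketch below assumes, purely for illustration, that the value -1 in the potass column is a placeholder for a missing measurement; the article does not state this. Under that assumption, turning the placeholder into a null makes the gap visible in the counts. The data is a stand-in sample built from a few columns of the first five rows shown earlier.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""
cereal_sample = pd.read_csv(io.StringIO(sample_text))

# Assumption: -1 marks a missing potassium value, so replace it with NaN.
cleaned = cereal_sample.copy()
cleaned["potass"] = cleaned["potass"].mask(cleaned["potass"] == -1)

non_null_counts = cleaned.count()                # non-null entries per column
missing_counts = len(cleaned) - non_null_counts  # missing entries per column

print(non_null_counts)
print(missing_counts)
```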
6) Generating a Histogram for Each Column in the Dataset
Pandas enables you to display a histogram for each numeric column with a single line of code.
cereal_dataset.hist()
For Example:
# Import the pandas module under the alias pd
import pandas as pd

# Read the dataset with the read_csv() function, passing the file name
# as an argument, and store the result in a variable
cereal_dataset = pd.read_csv('cereal.csv')

# Apply the hist() function to the dataset to generate a histogram
# for each numeric column
cereal_dataset.hist()
Output: a grid of histograms, one for each numeric column (figure not reproduced here).
Histograms are frequently used by data scientists to gain a better understanding of the data.
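hist() draws with matplotlib, so that library needs to be installed. As a rough sketch on an assumed stand-in sample, the call can be limited to selected columns and tuned with the `bins` and `figsize` parameters; the column choice and parameter values here are arbitrary.

```python
import io

import pandas as pd

# Stand-in for cereal.csv: a few columns of the first five rows from this article.
sample_text = """name,mfr,calories,fiber,sugars,potass
100% Bran,N,70,10.0,6,280
100% Natural Bran,Q,120,2.0,8,135
All-Bran,K,70,9.0,5,320
All-Bran with Extra Fiber,K,50,14.0,0,330
Almond Delight,R,110,1.0,8,-1
"""
cereal_sample = pd.read_csv(io.StringIO(sample_text))

# Plot only two columns, with five bins each, on a wider figure.
axes = cereal_sample[["calories", "sugars"]].hist(bins=5, figsize=(8, 3))

# hist() returns an array of matplotlib Axes, one per plotted column.
figure = axes.flat[0].figure
figure.tight_layout()
```

In a notebook the figure appears below the cell; in a plain script, calling `matplotlib.pyplot.show()` afterwards opens it in a window.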