In this article, we will go through numerous methods for subsetting a dataframe. If you are importing data into Python then you must be aware of Data Frames. A DataFrame is a two-dimensional data structure in which data is aligned in rows and columns in a tabular form.
We may do various operations on a DataFrame using the Pandas library. We can even construct and access a DataFrame subset in several formats.
subsetting:
The process of picking a set of desired rows and columns from a data frame is known as subsetting.
We have the following options to select:
- All rows and only a few columns(limited columns)
- All columns and only a few rows
- A limited number of rows and columns
Subsetting a data frame is useful since it allows you to access only a portion of the data frame. When you wish to reduce the number of parameters in your data frame, this comes in helpful.
Let us take a cereal dataset as an example.
Importing and Getting first 5 rows of the Dataset
Import the dataset into a Pandas Dataframe.
Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Apply head() function to the above dataset to get the first 5 rows. cereal_dataset.head()
Output:
name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
Using the Indexing Operator, select a subset of a dataframe.
The Indexing Operator is simply another term for square brackets. Using just the square brackets, you can select columns, rows, or a combination of rows and columns.
1)Selection of Only Columns
Use the below line of code to choose a column using the indexing operator.
cereal_dataset['vitamins']
The above line of code selects the column with the label ‘vitamins’ and displays all row values associated with it.
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Apply head() function to the above dataset to get the first 5 rows. cereal_dataset.head() # Get all the rows values corresponding to the 'vitamins' column in the # above given dataset cereal_dataset['vitamins']
Output:
0 25 1 0 2 25 3 25 4 25 .. 72 25 73 25 74 25 75 25 76 25 Name: vitamins, Length: 77, dtype: int64
Selection of Multiple Columns
cereal_dataset[['vitamins', 'fat']]
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Apply head() function to the above dataset to get the first 5 rows. cereal_dataset.head() # Select multiple columns using the index Operator. # Get all the rows values corresponding to the 'vitamins','fat' columns in the # above given dataset cereal_dataset[['vitamins', 'fat']]
Output:
vitamins | fat | |
---|---|---|
0 | 25 | 1 |
1 | 0 | 5 |
2 | 25 | 1 |
3 | 25 | 0 |
4 | 25 | 2 |
… | … | … |
72 | 25 | 1 |
73 | 25 | 1 |
74 | 25 | 1 |
75 | 25 | 1 |
76 | 25 | 1 |
77 rows × 2 columns
It generates a separate data frame that is a subset of the original.
2)Selection of Rows
The indexing operator can be used to pick specific rows depending on specified conditions.
To pick rows with ‘vitamins’ greater than 50, use the code below:
vitmns_grtrthan50= cereal_dataset[cereal_dataset['vitamins']>50] vitmns_grtrthan50
Output:
name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
38 | Just Right Crunchy Nuggets | K | C | 110 | 2 | 1 | 170 | 1.0 | 17.0 | 6 | 60 | 100 | 3 | 1.0 | 1.00 | 36.523683 |
39 | Just Right Fruit & Nut | K | C | 140 | 3 | 1 | 170 | 2.0 | 20.0 | 9 | 95 | 100 | 3 | 1.3 | 0.75 | 36.471512 |
53 | Product 19 | K | C | 100 | 3 | 0 | 320 | 1.0 | 20.0 | 3 | 45 | 100 | 3 | 1.0 | 1.00 | 41.503540 |
69 | Total Corn Flakes | G | C | 110 | 2 | 1 | 200 | 0.0 | 21.0 | 3 | 35 | 100 | 3 | 1.0 | 1.00 | 38.839746 |
70 | Total Raisin Bran | G | C | 140 | 3 | 1 | 190 | 4.0 | 15.0 | 14 | 230 | 100 | 3 | 1.5 | 1.00 | 28.592785 |
71 | Total Whole Grain | G | C | 100 | 3 | 1 | 200 | 3.0 | 16.0 | 3 | 110 | 100 | 3 | 1.0 | 1.00 | 46.658844 |
Using Python.loc(), select a Subset of a dataframe.
The.loc indexer is a powerful tool for selecting rows and columns from a data frame. It can also be used to select both rows and columns at the same time.
Note: It’s vital to understand that.loc() function only works on the labels of rows and columns. Following that, we’ll look at.iloc(), which is based on a row and column index.
 1)Selection of a Row using loc():
Use the following code to choose a single row with.loc().
cereal_dataset.loc[2]
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Get all the 2nd rows values in the above given dataset using the loc[] function cereal_dataset.loc[2]
Output:
name All-Bran mfr K type C calories 70 protein 4 fat 1 sodium 260 fiber 9 carbo 7 sugars 5 potass 320 vitamins 25 shelf 3 weight 1 cups 0.33 rating 59.4255 Name: 2, dtype: object
 Selection of Multiple Rows using loc():
cereal_dataset.loc[[2,4,6]]
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Retrieving Multiple rows data # Get all the 2, 4, 6 rows values in the above given dataset using the loc[] function cereal_dataset.loc[[2, 4, 6]]
Output:
name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
6 | Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.00 | 33.174094 |
Getting Range of Rows:
We can get rows data by providing the range (lower and upper limits) using slicing and loc[] function.
cereal_dataset.loc[3:6]
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Retrieving Multiple rows data # Get 3 to 6 rows data by providing the range (lower and upper limits) # using slicing and loc[] function. cereal_dataset.loc[3:6]
Output:
name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
5 | Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
6 | Apple Jacks | K | C | 110 | 2 | 0 | 125 | 1.0 | 11.0 | 14 | 30 | 25 | 2 | 1.0 | 1.00 | 33.174094 |
2)Selection of rows and columns using loc()
Use the below line of code to choose specific rows and columns from the given data frame:
cereal_dataset.loc[3:6,['vitamins',' fats']]
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Retrieving Multiple rows data # Get vitamins, fat columns from 3 to 6 rows by providing the range (lower and upper limits) # and columns using slicing and loc[] function. cereal_dataset.loc[3:6,['vitamins', 'fat']]
Output:
vitamins | fat | |
---|---|---|
3 | 25 | 0 |
4 | 25 | 2 |
5 | 25 | 2 |
6 | 25 | 0 |
Using Python.iloc(), select a Subset of a dataframe.
The iloc() function stands for integer location. It is completely based on integer indexing for both rows and columns.
Using iloc saves you from having to write down the entire label for each row and column.
Use iloc() to choose a subset of rows and columns as shown below:
cereal_dataset.iloc[[1,4,5], [2, 6]]
The above code selects rows 1, 4, and 5, as well as columns 2 and 6.
# Import pandas module as pd using the import keyword import pandas as pd # Import dataset using read_csv() function by pasing the dataset name as # an argument to it. # Store it in a variable. cereal_dataset = pd.read_csv('cereal.csv') # Retrieving Multiple rows data # Get rows 1, 4, and 5, as well as columns 2 and 6 using iloc[] function. cereal_dataset.iloc[[1,4,5], [2, 6]]
Output:
type | sodium | |
---|---|---|
1 | C | 15 |
4 | C | 200 |
5 | C | 180 |
After replacing the labels with integers, you can use iloc() to pick rows or columns individually, much like loc().