In Python, How do you subset a DataFrame?

In this article, we will go through numerous methods for subsetting a dataframe. If you are importing data into Python then you must be aware of Data Frames. A DataFrame is a two-dimensional data structure in which data is aligned in rows and columns in a tabular form.

We may do various operations on a DataFrame using the Pandas library. We can even construct and access a DataFrame subset in several formats.

subsetting:

The process of picking a set of desired rows and columns from a data frame is known as subsetting.

We have the following options to select:

  • All rows and only a few columns(limited columns)
  • All columns and only a few rows
  • A limited number of rows and columns

Subsetting a data frame is useful since it allows you to access only a portion of the data frame. When you wish to reduce the number of parameters in your data frame, this comes in helpful.

Let us take a cereal dataset as an example.

Importing and Getting first 5 rows of the Dataset

Import the dataset into a Pandas Dataframe.

Apply head() function to the above dataset to get the first 5 rows.

cereal_dataset.head()
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()

Output:

name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran N C 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
1 100% Natural Bran Q C 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
2 All-Bran K C 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R C 110 2 2 200 1.0 14.0 8 -1 25 3 1.0 0.75 34.384843

Using the Indexing Operator, select a subset of a dataframe.

The Indexing Operator is simply another term for square brackets. Using just the square brackets, you can select columns, rows, or a combination of rows and columns.

1)Selection of Only Columns

Use the below line of code to choose a column using the indexing operator.

cereal_dataset['vitamins']

The above line of code selects the column with the label ‘vitamins’ and displays all row values associated with it.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()
# Get all the rows values corresponding to the 'vitamins' column in the 
# above given dataset
cereal_dataset['vitamins']

Output:

0     25
1      0
2     25
3     25
4     25
      ..
72    25
73    25
74    25
75    25
76    25
Name: vitamins, Length: 77, dtype: int64

Selection of Multiple Columns

Select multiple columns using the index Operator.
Get all the rows values corresponding to the ‘vitamins’, ‘fat’ columns in the above-given dataset
cereal_dataset[['vitamins', 'fat']]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()
# Select multiple columns using the index Operator.
# Get all the rows values corresponding to the 'vitamins','fat' columns in the 
# above given dataset
cereal_dataset[['vitamins', 'fat']]

Output:

vitamins fat
0 25 1
1 0 5
2 25 1
3 25 0
4 25 2
72 25 1
73 25 1
74 25 1
75 25 1
76 25 1

77 rows × 2 columns

It generates a separate data frame that is a subset of the original.

2)Selection of Rows

The indexing operator can be used to pick specific rows depending on specified conditions.

To pick rows with ‘vitamins’ greater than 50, use the code below:

vitmns_grtrthan50= cereal_dataset[cereal_dataset['vitamins']>50]
vitmns_grtrthan50

Output:

name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
38 Just Right Crunchy Nuggets K C 110 2 1 170 1.0 17.0 6 60 100 3 1.0 1.00 36.523683
39 Just Right Fruit & Nut K C 140 3 1 170 2.0 20.0 9 95 100 3 1.3 0.75 36.471512
53 Product 19 K C 100 3 0 320 1.0 20.0 3 45 100 3 1.0 1.00 41.503540
69 Total Corn Flakes G C 110 2 1 200 0.0 21.0 3 35 100 3 1.0 1.00 38.839746
70 Total Raisin Bran G C 140 3 1 190 4.0 15.0 14 230 100 3 1.5 1.00 28.592785
71 Total Whole Grain G C 100 3 1 200 3.0 16.0 3 110 100 3 1.0 1.00 46.658844

Using Python.loc(), select a Subset of a dataframe.

The.loc indexer is a powerful tool for selecting rows and columns from a data frame. It can also be used to select both rows and columns at the same time.

Note: It’s vital to understand that.loc() function only works on the labels of rows and columns. Following that, we’ll look at.iloc(), which is based on a row and column index.

 1)Selection of a Row using loc():

Use the following code to choose a single row with.loc().

cereal_dataset.loc[2]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Get all the 2nd rows values in the above given dataset using the loc[] function
cereal_dataset.loc[2]

Output:

name        All-Bran
mfr                K
type               C
calories          70
protein            4
fat                1
sodium           260
fiber              9
carbo              7
sugars             5
potass           320
vitamins          25
shelf              3
weight             1
cups            0.33
rating       59.4255
Name: 2, dtype: object

 Selection of Multiple Rows using loc():

cereal_dataset.loc[[2,4,6]]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get all the 2, 4, 6 rows values in the above given dataset using the loc[] function
cereal_dataset.loc[[2, 4, 6]]

Output:

name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
2 All-Bran K C 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
4 Almond Delight R C 110 2 2 200 1.0 14.0 8 -1 25 3 1.0 0.75 34.384843
6 Apple Jacks K C 110 2 0 125 1.0 11.0 14 30 25 2 1.0 1.00 33.174094

Getting Range of Rows:

We can get rows data by providing the range (lower and upper limits) using slicing and loc[] function.

cereal_dataset.loc[3:6]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get 3 to 6 rows data by providing the range (lower and upper limits) 
# using slicing and loc[] function.
cereal_dataset.loc[3:6]

Output:

name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
3 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R C 110 2 2 200 1.0 14.0 8 -1 25 3 1.0 0.75 34.384843
5 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5 10 70 25 1 1.0 0.75 29.509541
6 Apple Jacks K C 110 2 0 125 1.0 11.0 14 30 25 2 1.0 1.00 33.174094

2)Selection of rows and columns using loc()

Use the below line of code to choose specific rows and columns from the given data frame:

cereal_dataset.loc[3:6,['vitamins',' fats']]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get vitamins, fat columns from 3 to 6 rows by providing the range (lower and upper limits)
# and columns using slicing and loc[] function.
cereal_dataset.loc[3:6,['vitamins', 'fat']]

Output:

vitamins fat
3 25 0
4 25 2
5 25 2
6 25 0

Using Python.iloc(), select a Subset of a dataframe.

The iloc() function stands for integer location. It is completely based on integer indexing for both rows and columns.

Using iloc saves you from having to write down the entire label for each row and column.

Use iloc() to choose a subset of rows and columns as shown below:

cereal_dataset.iloc[[1,4,5], [2, 6]]

The above code selects rows 1, 4, and 5, as well as columns 2 and 6.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get rows 1, 4, and 5, as well as columns 2 and 6 using iloc[] function.
cereal_dataset.iloc[[1,4,5], [2, 6]]

Output:

type sodium
1 C 15
4 C 200
5 C 180

After replacing the labels with integers, you can use iloc() to pick rows or columns individually, much like loc().