In Python, How do you subset a DataFrame?

In this article, we will go through numerous methods for subsetting a dataframe. If you are importing data into Python then you must be aware of Data Frames. A DataFrame is a two-dimensional data structure in which data is aligned in rows and columns in a tabular form.

We may do various operations on a DataFrame using the Pandas library. We can even construct and access a DataFrame subset in several formats.

subsetting:

The process of picking a set of desired rows and columns from a data frame is known as subsetting.

We have the following options to select:

  • All rows and only a few columns(limited columns)
  • All columns and only a few rows
  • A limited number of rows and columns

Subsetting a data frame is useful since it allows you to access only a portion of the data frame. When you wish to reduce the number of parameters in your data frame, this comes in helpful.

Let us take a cereal dataset as an example.

Importing and Getting first 5 rows of the Dataset

Import the dataset into a Pandas Dataframe.

Apply head() function to the above dataset to get the first 5 rows.

cereal_dataset.head()
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()

Output:

namemfrtypecaloriesproteinfatsodiumfibercarbosugarspotassvitaminsshelfweightcupsrating
0100% BranNC704113010.05.062802531.00.3368.402973
1100% Natural BranQC12035152.08.08135031.01.0033.983679
2All-BranKC70412609.07.053202531.00.3359.425505
3All-Bran with Extra FiberKC504014014.08.003302531.00.5093.704912
4Almond DelightRC110222001.014.08-12531.00.7534.384843

Using the Indexing Operator, select a subset of a dataframe.

The Indexing Operator is simply another term for square brackets. Using just the square brackets, you can select columns, rows, or a combination of rows and columns.

1)Selection of Only Columns

Use the below line of code to choose a column using the indexing operator.

cereal_dataset['vitamins']

The above line of code selects the column with the label ‘vitamins’ and displays all row values associated with it.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()
# Get all the rows values corresponding to the 'vitamins' column in the 
# above given dataset
cereal_dataset['vitamins']

Output:

0     25
1      0
2     25
3     25
4     25
      ..
72    25
73    25
74    25
75    25
76    25
Name: vitamins, Length: 77, dtype: int64

Selection of Multiple Columns

Select multiple columns using the index Operator.
Get all the rows values corresponding to the ‘vitamins’, ‘fat’ columns in the above-given dataset
cereal_dataset[['vitamins', 'fat']]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Apply head() function to the above dataset to get the first 5 rows.
cereal_dataset.head()
# Select multiple columns using the index Operator.
# Get all the rows values corresponding to the 'vitamins','fat' columns in the 
# above given dataset
cereal_dataset[['vitamins', 'fat']]

Output:

vitaminsfat
0251
105
2251
3250
4252
72251
73251
74251
75251
76251

77 rows × 2 columns

It generates a separate data frame that is a subset of the original.

2)Selection of Rows

The indexing operator can be used to pick specific rows depending on specified conditions.

To pick rows with ‘vitamins’ greater than 50, use the code below:

vitmns_grtrthan50= cereal_dataset[cereal_dataset['vitamins']>50]
vitmns_grtrthan50

Output:

namemfrtypecaloriesproteinfatsodiumfibercarbosugarspotassvitaminsshelfweightcupsrating
38Just Right Crunchy NuggetsKC110211701.017.066010031.01.0036.523683
39Just Right Fruit & NutKC140311702.020.099510031.30.7536.471512
53Product 19KC100303201.020.034510031.01.0041.503540
69Total Corn FlakesGC110212000.021.033510031.01.0038.839746
70Total Raisin BranGC140311904.015.01423010031.51.0028.592785
71Total Whole GrainGC100312003.016.0311010031.01.0046.658844

Using Python.loc(), select a Subset of a dataframe.

The.loc indexer is a powerful tool for selecting rows and columns from a data frame. It can also be used to select both rows and columns at the same time.

Note: It’s vital to understand that.loc() function only works on the labels of rows and columns. Following that, we’ll look at.iloc(), which is based on a row and column index.

 1)Selection of a Row using loc():

Use the following code to choose a single row with.loc().

cereal_dataset.loc[2]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Get all the 2nd rows values in the above given dataset using the loc[] function
cereal_dataset.loc[2]

Output:

name        All-Bran
mfr                K
type               C
calories          70
protein            4
fat                1
sodium           260
fiber              9
carbo              7
sugars             5
potass           320
vitamins          25
shelf              3
weight             1
cups            0.33
rating       59.4255
Name: 2, dtype: object

 Selection of Multiple Rows using loc():

cereal_dataset.loc[[2,4,6]]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get all the 2, 4, 6 rows values in the above given dataset using the loc[] function
cereal_dataset.loc[[2, 4, 6]]

Output:

namemfrtypecaloriesproteinfatsodiumfibercarbosugarspotassvitaminsshelfweightcupsrating
2All-BranKC70412609.07.053202531.00.3359.425505
4Almond DelightRC110222001.014.08-12531.00.7534.384843
6Apple JacksKC110201251.011.014302521.01.0033.174094

Getting Range of Rows:

We can get rows data by providing the range (lower and upper limits) using slicing and loc[] function.

cereal_dataset.loc[3:6]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get 3 to 6 rows data by providing the range (lower and upper limits) 
# using slicing and loc[] function.
cereal_dataset.loc[3:6]

Output:

namemfrtypecaloriesproteinfatsodiumfibercarbosugarspotassvitaminsshelfweightcupsrating
3All-Bran with Extra FiberKC504014014.08.003302531.00.5093.704912
4Almond DelightRC110222001.014.08-12531.00.7534.384843
5Apple Cinnamon CheeriosGC110221801.510.510702511.00.7529.509541
6Apple JacksKC110201251.011.014302521.01.0033.174094

2)Selection of rows and columns using loc()

Use the below line of code to choose specific rows and columns from the given data frame:

cereal_dataset.loc[3:6,['vitamins',' fats']]
# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get vitamins, fat columns from 3 to 6 rows by providing the range (lower and upper limits)
# and columns using slicing and loc[] function.
cereal_dataset.loc[3:6,['vitamins', 'fat']]

Output:

vitaminsfat
3250
4252
5252
6250

Using Python.iloc(), select a Subset of a dataframe.

The iloc() function stands for integer location. It is completely based on integer indexing for both rows and columns.

Using iloc saves you from having to write down the entire label for each row and column.

Use iloc() to choose a subset of rows and columns as shown below:

cereal_dataset.iloc[[1,4,5], [2, 6]]

The above code selects rows 1, 4, and 5, as well as columns 2 and 6.

# Import pandas module as pd using the import keyword
import pandas as pd
# Import dataset using read_csv() function by pasing the dataset name as
# an argument to it.
# Store it in a variable.
cereal_dataset = pd.read_csv('cereal.csv')
# Retrieving Multiple rows data
# Get rows 1, 4, and 5, as well as columns 2 and 6 using iloc[] function.
cereal_dataset.iloc[[1,4,5], [2, 6]]

Output:

typesodium
1C15
4C200
5C180

After replacing the labels with integers, you can use iloc() to pick rows or columns individually, much like loc().