Interpolation is a technique for estimating unknown data points that lie between known data points. In Python, it is commonly used during data preprocessing to fill in missing values in a dataframe or series.
Interpolation is also used in image processing to estimate pixel values from neighboring pixels when resizing or enlarging an image.
Financial analysts also use interpolation to estimate unknown values from known data points from the past.
Interpolation is commonly employed when working with time-series data, because a missing value in a time series is usually best estimated from the values closest to it in time. For example, if we are talking about temperature, we would prefer to fill today’s missing temperature with the mean of the last two days rather than the mean of the whole month. Interpolation can also be used to calculate moving averages.
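The temperature example above can be sketched in a few lines. This is only an illustration: the dates and readings are made up, and expressing "mean of the last two days" with rolling() and shift() is one assumed way to write the idea, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures; the reading for 3 January is missing.
days = pd.date_range("2024-01-01", periods=4, freq="D")
temps = pd.Series([20.0, 22.0, np.nan, 25.0], index=days)

# Mean of the two preceding days, aligned with the day being filled.
prev_two_mean = temps.rolling(window=2).mean().shift(1)

# Fill only the missing entries; known readings stay untouched.
filled = temps.fillna(prev_two_mean)
print(filled)
```

Here the gap is filled with 21.0, the mean of the two preceding readings (20.0 and 22.0).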
Pandas provides an interpolate() method on both DataFrame and Series objects that can be used to fill in the missing entries in your data.
The dataframe.interpolate() function is mostly used to fill NA values in a dataframe or series. Rather than filling the blanks with a hard-coded value, it estimates each missing value using one of several interpolation techniques.
Interpolation for Missing Values in Series Data
Create a pandas Series with a missing value as shown below:

    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Import numpy module as np using the import keyword
    import numpy as np
    # Pass some random list as an argument to the pd.Series() method
    # and store it in a variable (defining the series).
    k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])
1) Linear Interpolation:
Linear interpolation estimates a missing value by drawing a straight line between the known values on either side of the gap and reading the missing value off that line. With the default method, pandas ignores the index and treats the values as equally spaced. Linear is interpolate()'s default method, so we do not need to specify it.
The value at index 3 (the fourth element) of the series above is NaN. Use the following code to interpolate the data:
k.interpolate()
When no method is specified, linear interpolation is used by default.
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Import numpy module as np using the import keyword
    import numpy as np
    # Pass some random list as an argument to the pd.Series() method
    # and store it in a variable (defining the series).
    k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])
    # Apply the interpolate() function to the above series to fill the
    # missing values (NaN).
    k.interpolate()
Output:
    0    2.0
    1    3.0
    2    1.0
    3    2.5
    4    4.0
    5    5.0
    6    8.0
    dtype: float64
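The 2.5 at index 3 can be checked by hand. The sketch below recomputes it from the two neighbouring values and compares it with the pandas result; the variable names (left, right, manual) are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])

# Neighbours of the gap at index 3: value 1 at index 2 and value 4 at index 4.
left, right = k[2], k[4]

# One missing point between them, so it sits halfway along the straight line.
manual = left + (right - left) * (1 / 2)

filled = k.interpolate()
print(manual, filled[3])  # both are 2.5
```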
2) Polynomial Interpolation:
Polynomial interpolation fits a polynomial curve of a chosen degree through the existing data points and reads the missing values off that curve. With order 2, for example, the curve takes the shape of a parabola instead of a straight line.
In pandas, polynomial interpolation requires you to specify an order, and it relies on SciPy being installed. Here we interpolate with order 2:
k.interpolate(method='polynomial', order=2)
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Import numpy module as np using the import keyword
    import numpy as np
    # Pass some random list as an argument to the pd.Series() method
    # and store it in a variable (defining the series).
    k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])
    # Apply the interpolate() function to the above series with method
    # "polynomial" and order=2 to fill the missing values (NaN).
    k.interpolate(method='polynomial', order=2)
Output:
    0    2.000000
    1    3.000000
    2    1.000000
    3    1.921053
    4    4.000000
    5    5.000000
    6    8.000000
    dtype: float64
When you use polynomial interpolation with order 1, you get the same result as linear interpolation. This is due to the fact that a polynomial of degree 1 is linear.
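The equivalence between a degree-1 polynomial and a straight line can be illustrated without SciPy. As an assumption for this sketch, np.polyfit stands in for method='polynomial': it fits a degree-1 polynomial through the two known neighbours of the gap, and the result is compared with the linear interpolation from pandas.

```python
import numpy as np
import pandas as pd

k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])

# Fit a degree-1 polynomial (a straight line) through the two known
# neighbours of the gap: (index 2, value 1) and (index 4, value 4).
slope, intercept = np.polyfit([2, 4], [1, 4], deg=1)
poly_value = slope * 3 + intercept

linear_value = k.interpolate(method="linear")[3]
print(np.isclose(poly_value, linear_value))  # True
```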
3) Interpolation Via Padding
Interpolation via padding involves copying the value just preceding a missing item.
When using padding interpolation, you can set a limit. The limit is the maximum number of consecutive NaNs that the function will fill.
So, if you’re working on a real-world project and want to fill missing values with previous values, it is a good idea to set a limit, so that a long run of missing rows is not filled entirely with a single stale value.
k.interpolate(method='pad', limit=2)
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Import numpy module as np using the import keyword
    import numpy as np
    # Pass some random list as an argument to the pd.Series() method
    # and store it in a variable (defining the series).
    k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])
    # Apply the interpolate() function to the above series with method
    # "pad" and limit=2 to fill the missing values (NaN).
    # limit=2 is the maximum number of consecutive NaNs that the
    # function can fill.
    k.interpolate(method='pad', limit=2)
Output:
    0    2.0
    1    3.0
    2    1.0
    3    1.0
    4    4.0
    5    5.0
    6    8.0
    dtype: float64
The value of the missing entry is the same as the value of the entry preceding it.
We set the limit to two, so let’s see what happens if three consecutive nans occur.
    k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])
    k.interpolate(method='pad', limit=2)
Output:
    0    0.0
    1    1.0
    2    1.0
    3    1.0
    4    NaN
    5    3.0
    6    4.0
    7    5.0
    8    7.0
    dtype: float64
Here, the third consecutive NaN (at index 4) is left unfilled because the limit of two has been reached.
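Newer pandas releases (2.1 and later, to the best of our knowledge) deprecate passing method='pad' to interpolate() and recommend ffill() instead. Assuming such a version, the same padding behaviour can be sketched as follows; it also shows that the limit is optional.

```python
import numpy as np
import pandas as pd

k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])

# Forward fill, with at most two consecutive NaNs filled per gap.
limited = k.ffill(limit=2)

# Without a limit, every NaN in the gap takes the preceding value.
unlimited = k.ffill()

print(limited.isna().sum(), unlimited.isna().sum())  # 1 0
```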
Pandas DataFrames Interpolation
Interpolation can also be used to fill missing values in a Pandas Dataframe.
Example
Approach:
- Import the pandas module as pd using the import keyword.
- Pass some random data (as a dictionary) to the pd.DataFrame() function to create a dataframe.
- Store it in a variable.
- Print the dataframe.
Below is the implementation:
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Pass some random data (as a dictionary) to the pd.DataFrame() function
    # to create a dataframe.
    # Store it in a variable.
    rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                                  "q": [None, 1, 26, 8, None],
                                  "r": [14, 10, None, 9, 4],
                                  "s": [18, 5, None, None, 2]})
    # Print the above given dataframe
    print(rslt_datafrme)
Output:
          p     q     r     s
    0  11.0   NaN  14.0  18.0
    1   3.0   1.0  10.0   5.0
    2   2.0  26.0   NaN   NaN
    3   NaN   8.0   9.0   NaN
    4   1.0   NaN   4.0   2.0
Pandas Dataframe Linear Interpolation
Call interpolate() with no arguments to apply linear interpolation to every column of the dataframe:
rslt_datafrme.interpolate()
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Pass some random data (as a dictionary) to the pd.DataFrame() function
    # to create a dataframe.
    # Store it in a variable.
    rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                                  "q": [None, 1, 26, 8, None],
                                  "r": [14, 10, None, 9, 4],
                                  "s": [18, 5, None, None, 2]})
    # Apply the interpolate() function to the above dataframe
    rslt_datafrme.interpolate()
Output:
| | p | q | r | s |
|---|---|---|---|---|
| 0 | 11.0 | NaN | 14.0 | 18.0 |
| 1 | 3.0 | 1.0 | 10.0 | 5.0 |
| 2 | 2.0 | 26.0 | 9.5 | 4.0 |
| 3 | 1.5 | 8.0 | 9.0 | 3.0 |
| 4 | 1.0 | 8.0 | 4.0 | 2.0 |
In the above example, the first value in the ‘q’ column is still NaN, as there is no known data point before it to interpolate from. The last value in the ‘q’ column, by contrast, is filled with the last known value (8.0).
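If the leading NaN should be filled as well, interpolate() accepts a limit_direction argument. The sketch below uses the values of the 'q' column as a standalone Series (an assumption made to keep the example short) and compares the default behaviour with limit_direction="both".

```python
import numpy as np
import pandas as pd

q = pd.Series([np.nan, 1, 26, 8, np.nan], name="q")

# Default: fill forward only, so the leading NaN stays.
forward_only = q.interpolate()

# limit_direction="both" also fills NaNs before the first known value,
# using that first value since there is nothing earlier to draw a line from.
both_ways = q.interpolate(limit_direction="both")

print(forward_only.isna().sum(), both_ways.isna().sum())  # 1 0
```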
Individual columns of a dataframe can also be interpolated.
rslt_datafrme['r'].interpolate()
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Pass some random data (as a dictionary) to the pd.DataFrame() function
    # to create a dataframe.
    # Store it in a variable.
    rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                                  "q": [None, 1, 26, 8, None],
                                  "r": [14, 10, None, 9, 4],
                                  "s": [18, 5, None, None, 2]})
    # Apply the interpolate() function to the 'r' column of the above dataframe
    rslt_datafrme['r'].interpolate()
Output:
    0    14.0
    1    10.0
    2     9.5
    3     9.0
    4     4.0
    Name: r, dtype: float64
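Note that interpolate() returns a new object and leaves the dataframe itself unchanged. A minimal sketch of keeping the result, using a two-column subset of the dataframe for brevity, is to assign the interpolated column back:

```python
import pandas as pd

rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                              "r": [14, 10, None, 9, 4]})

# interpolate() returns a new Series, so assign it back to keep the result.
rslt_datafrme["r"] = rslt_datafrme["r"].interpolate()

# Column 'r' is now complete, while column 'p' still has its missing value.
print(rslt_datafrme.isna().sum())
```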
Interpolation Via Padding
    # Apply the interpolate() function to the above dataframe with method
    # "pad" and limit=2 to fill the missing values (NaN).
    rslt_datafrme.interpolate(method='pad', limit=2)
    # Import pandas module as pd using the import keyword
    import pandas as pd
    # Pass some random data (as a dictionary) to the pd.DataFrame() function
    # to create a dataframe.
    # Store it in a variable.
    rslt_datafrme = pd.DataFrame({"p": [11, 3, 2, None, 1],
                                  "q": [None, 1, 26, 8, None],
                                  "r": [14, 10, None, 9, 4],
                                  "s": [18, 5, None, None, 2]})
    # Apply the interpolate() function to the above dataframe with method
    # "pad" and limit=2 to fill the missing values (NaN).
    rslt_datafrme.interpolate(method='pad', limit=2)
Output:
| | p | q | r | s |
|---|---|---|---|---|
| 0 | 11.0 | NaN | 14.0 | 18.0 |
| 1 | 3.0 | 1.0 | 10.0 | 5.0 |
| 2 | 2.0 | 26.0 | 10.0 | 5.0 |
| 3 | 2.0 | 8.0 | 9.0 | 5.0 |
| 4 | 1.0 | 8.0 | 4.0 | 2.0 |