{"id":26376,"date":"2021-12-21T09:28:08","date_gmt":"2021-12-21T03:58:08","guid":{"rendered":"https:\/\/python-programs.com\/?p=26376"},"modified":"2021-12-21T09:28:08","modified_gmt":"2021-12-21T03:58:08","slug":"python-interpolation-to-fill-missing-entries","status":"publish","type":"post","link":"https:\/\/python-programs.com\/python-interpolation-to-fill-missing-entries\/","title":{"rendered":"Python Interpolation To Fill Missing Entries"},"content":{"rendered":"

Interpolation is a technique for estimating unknown data points that lie between known data points. During data preprocessing, it is commonly used to fill in missing values in a pandas DataFrame or Series.<\/p>\n

Interpolation is also used in image processing to estimate pixel values from neighboring pixels when resizing or enlarging an image.<\/p>\n

Financial analysts also use interpolation to forecast future values based on known data points from the past.<\/p>\n

Interpolation is especially common with time-series data, where a missing value is best estimated from the observations closest to it in time. For example, when filling in a missing temperature reading for today, the mean of the last two days is a better estimate than the mean of the whole month. Interpolated series are also often used as input when calculating moving averages.<\/p>\n

A pandas DataFrame has an interpolate() method that can be used to fill in the missing entries in your data.<\/p>\n

The dataframe.interpolate()<\/strong> function in pandas is mostly used to fill NA values in a DataFrame or Series. It is a powerful way to fill in the blanks: rather than hard-coding a replacement value, it offers several interpolation techniques for estimating the missing data.<\/p>\n
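As a small illustration of that difference, the sketch below compares a hard-coded fill with interpolation. The series name temps and its values are made up for this example.<\/p>\n

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing entry in the middle
temps = pd.Series([20.0, np.nan, 24.0])

# Hard-coded fill: every nan becomes the same fixed value
hard_coded = temps.fillna(0)

# Interpolation: the nan is estimated from its neighbours
interpolated = temps.interpolate()

print(hard_coded.tolist())    # [20.0, 0.0, 24.0]
print(interpolated.tolist())  # [20.0, 22.0, 24.0]
```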

Interpolation for Missing Values in Series Data<\/h4>\n

Create a pandas Series with a missing value as shown below:<\/p>\n

# Import pandas as pd and numpy as np\r\nimport pandas as pd\r\nimport numpy as np\r\n# Define a Series that contains one missing value (np.nan)\r\nk = pd.Series([2, 3, 1, np.nan, 4, 5, 8])\r\n<\/pre>\n

1)Linear Interpolation:<\/strong><\/p>\n

Linear interpolation estimates a missing value by drawing a straight line between the nearest known values on either side of it and reading the value off that line. With the default settings, pandas treats the entries as equally spaced and ignores the index. Linear is the default method of interpolate(), so it does not need to be specified.<\/p>\n

The value at index 3 (the fourth entry) in the above code is nan. Use the following code to interpolate the data:<\/p>\n

k.interpolate()<\/pre>\n

In the absence of a method specification, linear interpolation is used as default<\/strong>.<\/p>\n

# Import pandas as pd and numpy as np\r\nimport pandas as pd\r\nimport numpy as np\r\n# Define a Series that contains one missing value (np.nan)\r\nk = pd.Series([2, 3, 1, np.nan, 4, 5, 8])\r\n# Fill the missing value with interpolate(); the default method is linear\r\nk.interpolate()\r\n<\/pre>\n

Output:<\/strong><\/p>\n

0    2.0\r\n1    3.0\r\n2    1.0\r\n3    2.5\r\n4    4.0\r\n5    5.0\r\n6    8.0\r\ndtype: float64<\/pre>\n
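The 2.5 can be checked by hand: the missing entry sits halfway between its neighbours 1 (index 2) and 4 (index 4), so the straight line gives 1 + 0.5 * (4 - 1) = 2.5. The sketch below repeats that check with np.interp; the variable names are chosen for this example only.<\/p>\n

```python
import numpy as np
import pandas as pd

k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])

# Known neighbours of the missing entry: (index 2, value 1) and (index 4, value 4)
manual = 1 + 0.5 * (4 - 1)

# np.interp evaluates the same straight line at position 3
by_numpy = float(np.interp(3, [2, 4], [1, 4]))

# Value that pandas fills in at index 3
by_pandas = float(k.interpolate()[3])

print(manual, by_numpy, by_pandas)  # 2.5 2.5 2.5
```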

2)Polynomial Interpolation:<\/strong><\/p>\n

Polynomial interpolation fits a smooth curve of a chosen degree through the existing data points and reads the missing values off that curve. With order 2 the curve is built from parabolic pieces, so the estimate can differ noticeably from the straight-line result. In pandas, this method relies on SciPy and uses the numerical values of the index.<\/p>\n

You must specify an order for polynomial interpolation. Here we interpolate with order 2:<\/p>\n

k.interpolate(method='polynomial', order=2)<\/pre>\n
# Import pandas as pd and numpy as np\r\nimport pandas as pd\r\nimport numpy as np\r\n# Define a Series that contains one missing value (np.nan)\r\nk = pd.Series([2, 3, 1, np.nan, 4, 5, 8])\r\n# Fill the missing value using method=\"polynomial\" with order=2\r\n# (this method requires SciPy to be installed)\r\nk.interpolate(method='polynomial', order=2)\r\n<\/pre>\n

Output:<\/strong><\/p>\n

0    2.000000\r\n1    3.000000\r\n2    1.000000\r\n3    1.921053\r\n4    4.000000\r\n5    5.000000\r\n6    8.000000\r\ndtype: float64<\/pre>\n

When you use polynomial interpolation with order 1, you get the same result as linear interpolation.\u00a0This is due to the fact that a polynomial of degree 1 is linear.<\/p>\n
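A quick way to confirm this is to compare the two results directly. The sketch below assumes SciPy is installed, since pandas needs it for the polynomial method; if it is missing, the comparison is skipped.<\/p>\n

```python
import numpy as np
import pandas as pd

k = pd.Series([2, 3, 1, np.nan, 4, 5, 8])

linear = k.interpolate()  # default method

try:
    # Order 1 means a straight line between neighbouring points
    poly1 = k.interpolate(method="polynomial", order=1)
except ImportError:
    poly1 = None  # SciPy is not available

if poly1 is not None:
    # Compare the two filled series element by element
    print(np.allclose(linear, poly1))
```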

3)Interpolation Via Padding<\/strong><\/p>\n

Interpolation via padding involves copying the value just\u00a0preceding a missing item.<\/p>\n

With padding interpolation you can optionally set a limit. The limit is the maximum number of consecutive nans that the function will fill.<\/p>\n

So, if you’re working on a real-world project and want to fill missing values with previous values, setting a limit keeps long gaps from being filled with a stale value.<\/p>\n

k.interpolate(method='pad', limit=2)<\/pre>\n
# Import pandas as pd and numpy as np\r\nimport pandas as pd\r\nimport numpy as np\r\n# Define a Series that contains one missing value (np.nan)\r\nk = pd.Series([2, 3, 1, np.nan, 4, 5, 8])\r\n# Fill missing values using method=\"pad\" with limit=2.\r\n# limit=2 is the maximum number of consecutive nans\r\n# that the function will fill.\r\nk.interpolate(method='pad', limit=2)\r\n<\/pre>\n

Output:<\/strong><\/p>\n

0    2.0\r\n1    3.0\r\n2    1.0\r\n3    1.0\r\n4    4.0\r\n5    5.0\r\n6    8.0\r\ndtype: float64<\/pre>\n

The value of the missing entry is the same as the value of the entry preceding it.<\/p>\n

We set the limit to two, so let’s see what happens if three consecutive nans occur.<\/p>\n

\n
k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])\r\nk.interpolate(method='pad', limit=2)<\/pre>\n

Output:<\/strong><\/p>\n<\/div>\n

0    0.0\r\n1    1.0\r\n2    1.0\r\n3    1.0\r\n4    NaN\r\n5    3.0\r\n6    4.0\r\n7    5.0\r\n8    7.0\r\ndtype: float64<\/pre>\n

Here, the third nan is unaltered<\/strong>.<\/p>\n
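Recent pandas releases (2.1 and later) deprecate method='pad' in interpolate() and recommend ffill() instead. A minimal sketch of the same fill with ffill(), assuming such a pandas version:<\/p>\n

```python
import numpy as np
import pandas as pd

k = pd.Series([0, 1, np.nan, np.nan, np.nan, 3, 4, 5, 7])

# Forward-fill: copy the last known value into at most 2 consecutive nans
filled = k.ffill(limit=2)

print(filled.tolist())  # [0.0, 1.0, 1.0, 1.0, nan, 3.0, 4.0, 5.0, 7.0]
```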

Pandas DataFrames Interpolation<\/h4>\n

Interpolation can also be used to fill missing values in a Pandas Dataframe.<\/p>\n

Example<\/strong><\/p>\n

Approach:<\/strong><\/p>\n