{"id":26050,"date":"2021-12-21T09:28:21","date_gmt":"2021-12-21T03:58:21","guid":{"rendered":"https:\/\/python-programs.com\/?p=26050"},"modified":"2021-12-21T09:28:21","modified_gmt":"2021-12-21T03:58:21","slug":"python-program-to-remove-stop-words-with-nltk","status":"publish","type":"post","link":"https:\/\/python-programs.com\/python-program-to-remove-stop-words-with-nltk\/","title":{"rendered":"Python Program to Remove Stop Words with NLTK"},"content":{"rendered":"

Pre-processing is the process of transforming raw data into something a computer can work with. A common pre-processing step is filtering out useless data. In natural language processing, such useless words are called stop words.<\/p>\n

Stop Words:<\/strong><\/p>\n

A stop word is a commonly used word (for example, “the,” “a,” “an,” “is,” or “in”) that a search engine has been configured to ignore, both when indexing entries for searching and when retrieving them as the results of a search query.
\nWe don’t want these words taking up space in our database or consuming valuable processing time, so we can eliminate them easily by storing a list of the words that we consider to be stop words. Python’s NLTK (Natural Language Toolkit) ships a list of stopwords in 16 different languages. You can find them in the nltk_data directory, located at home\/folder\/nltk_data\/corpora\/stopwords.<\/p>\n
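As a minimal sketch of the idea, filtering against a stored list can be a simple membership check. The tiny stopword set below is hand-written for illustration only; NLTK's corpus (used later in this post) provides a far more complete list.

```python
# A small, hand-picked stopword set for illustration only;
# NLTK's 'stopwords' corpus provides the full list.
stop_words = {"the", "a", "an", "is", "in"}

sentence = "the cat is in the garden"
# Keep only the words that are not in the stopword set.
filtered = [word for word in sentence.split() if word not in stop_words]
print(filtered)  # ['cat', 'garden']
```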

Note:<\/strong> Don’t forget to modify the name of your home directory.<\/p>\n

Before going to the coding part, download the corpus including stop words from the NLTK module.<\/p>\n

# Import nltk module using the import keyword.\r\nimport nltk\r\n# Pass the 'stopwords' as an argument to the download() function to download all the\r\n# stop words package\r\nnltk.download('stopwords')<\/pre>\n

Output:<\/strong><\/p>\n

\n
[nltk_data] Downloading package stopwords to \/root\/nltk_data...\r\n[nltk_data]   Unzipping corpora\/stopwords.zip.\r\nTrue<\/pre>\n<\/div>\n

Printing the stop words list from the corpus:<\/strong><\/p>\n

# Import stopwords from nltk.corpus using the import keyword.\r\nfrom nltk.corpus import stopwords\r\n# Print all the stopwords in english language using the words() function in\r\n# stopwords.\r\nprint(stopwords.words('english'))\r\n<\/pre>\n

Output:<\/strong><\/p>\n

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\",\r\n \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he',\r\n 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \r\n\"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',\r\n 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those',\r\n 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', \r\n'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',\r\n 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', \r\n'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\r\n 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',\r\n 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',\r\n 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',\r\n 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', \r\n'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\",\r\n 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',\r\n 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn',\r\n \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn',\r\n \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn',\r\n \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", \r\n'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]<\/pre>\n

You can also select stopwords from other languages, depending on your requirements.<\/p>\n

Get all the Languages list that can be used:<\/strong><\/p>\n

Below are the languages available in the NLTK ‘stopwords’ corpus.<\/p>\n

# Import stopwords from nltk.corpus using the import keyword.\r\nfrom nltk.corpus import stopwords\r\n# Get all the Languages list that can be used using the fileids() function in\r\n# stopwords\r\nprint(stopwords.fileids())\r\n<\/pre>\n

Output:<\/strong><\/p>\n

['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish',\r\n 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', \r\n'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', \r\n'spanish', 'swedish', 'tajik', 'turkish']<\/pre>\n

Adding our own stop words to the corpus:<\/strong><\/p>\n

# Import stopwords from nltk.corpus using the import keyword.\r\nfrom nltk.corpus import stopwords\r\n# Get all the stopwords in english language using the words() function in\r\n# stopwords.\r\n# Store it in a variable\r\nour_stopwords = stopwords.words('english')\r\n# Append some random stop word to the above obtained stopwords list using the\r\n# append() function\r\nour_stopwords.append('forexample')\r\nprint(our_stopwords)\r\n<\/pre>\n

Output:<\/strong><\/p>\n

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\",\r\n \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', \r\n'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\",\r\n 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', \r\n'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', \r\n'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', \r\n'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', \r\n'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',\r\n 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',\r\n 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',\r\n 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', \r\n'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\r\n 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same',\r\n 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\",\r\n 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', \r\n'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\",\r\n 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma',\r\n 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \r\n\"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", \r\n'won', \"won't\", 'wouldn', \"wouldn't\", 'forexample']<\/pre>\n

The user-defined stop word is appended at the end of the list, as shown in the output above.<\/p>\n
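One practical note (a common optimisation, not something the snippet above requires): membership tests against a Python list take linear time, so when filtering long texts it can help to convert the stopword list to a set first. The short stand-in list below is hypothetical, used only to keep the sketch self-contained.

```python
# Stand-in stopword list for illustration; in practice this would be
# stopwords.words('english') plus any custom additions.
our_stopwords = ["i", "me", "my", "the", "a", "forexample"]
# Converting the list to a set makes `word in stopword_set`
# an O(1) average-case lookup instead of O(n).
stopword_set = set(our_stopwords)

words = ["this", "is", "the", "text", "forexample"]
filtered = [w for w in words if w not in stopword_set]
print(filtered)  # ['this', 'is', 'text']
```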

Removal of stop words:<\/strong><\/h4>\n

Below is the code for removing all the stop words from a given string\/sentence.<\/p>\n

Tokenization:<\/strong><\/p>\n

Tokenization is the process of converting a piece of text into smaller parts known as tokens. These tokens are the core of NLP.<\/p>\n

Tokenization is used to convert a sentence into a list of words.<\/p>\n
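NLTK's `word_tokenize` function is typically used for this (it requires the ‘punkt’ tokenizer data to be downloaded). As a rough, self-contained sketch of what tokenization does, a regular expression can split a sentence into word and punctuation tokens:

```python
import re

sentence = "Tokenization converts a sentence into a list of words."
# \w+ matches runs of word characters; [^\w\s] matches single
# punctuation characters, so '.' becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Tokenization', 'converts', 'a', 'sentence', 'into', 'a', 'list', 'of', 'words', '.']
```

This is only an approximation; NLTK's tokenizer handles contractions and other edge cases that a simple regex does not.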

Approach:<\/strong><\/p>\n