Python Program to Remove Stop Words with NLTK

Pre-processing is the process of transforming raw data into a form that a computer can work with. Filtering out uninformative data is a common pre-processing step. In natural language processing, words that carry little meaning on their own are called stop words.

Stop Words:

A stop word is a commonly used word (for example, “the,” “a,” “an,” “is,” or “in”) that a search engine has been configured to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
We don’t want these terms taking up space in our database or using precious processing time. We can easily eliminate them by storing a list of the terms that we consider to be stop words. Python’s NLTK (Natural Language Toolkit) contains stop word lists for a number of languages (the fileids() output later in this article lists 24). You can find them in the nltk_data directory, which is located at home/folder/nltk_data/corpora/stopwords.

Note: Don’t forget to replace the home directory name in this path with your own.
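The idea of storing your own list of terms can be sketched without NLTK at all. The snippet below is a minimal illustration, not the approach NLTK itself uses: the names CUSTOM_STOP_WORDS and drop_stop_words are hypothetical, and the five-word list is only an example.

```python
# Hypothetical, hand-written stop word list; a real project would more
# likely start from a curated list such as the NLTK corpus.
CUSTOM_STOP_WORDS = {"the", "a", "an", "is", "in"}


def drop_stop_words(words, stop_words=CUSTOM_STOP_WORDS):
    """Return the words that are not in stop_words, preserving order."""
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [w for w in words if w.lower() not in stop_words]


sample = "The cat is in a box".split()
filtered = drop_stop_words(sample)
print(filtered)  # ['cat', 'box']
```

A set is used rather than a list because membership tests on a set stay fast as the stop word list grows.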

Before writing any code, download the stop words corpus using the NLTK module.

# Import nltk module using the import keyword.
import nltk
# Pass 'stopwords' as an argument to the download() function to download the
# stop words corpus.
nltk.download('stopwords')

Output:

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True

Printing the stop words list from the corpus:

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Print all the English stop words using the words() function in
# stopwords.
print(stopwords.words('english'))

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', 
"it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 
'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most',
 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",
 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',
 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn',
 "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn',
 "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn',
 "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 
'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

You can also select stop words for other languages, depending on your requirements.

Get the list of available languages:

The languages below are available in the NLTK ‘stopwords’ corpus.

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# List the available languages using the fileids() function in
# stopwords.
print(stopwords.fileids())

Output:

['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish',
 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 
'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 
'spanish', 'swedish', 'tajik', 'turkish']

Adding our own stop words to the list:

# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Get all the English stop words using the words() function in
# stopwords.
# Store the list in a variable.
our_stopwords = stopwords.words('english')
# Append a custom stop word to the list obtained above using the
# append() function.
our_stopwords.append('forexample')
print(our_stopwords)

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
 "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 
'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's",
 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 
'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 
'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before',
 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 
'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same',
 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't",
 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 
'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't",
 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma',
 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', 
"shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 
'won', "won't", 'wouldn', "wouldn't", 'forexample']

The custom stop word is added at the end of the list, as the last entry of the output shows. Note that this changes only our copy of the list, not the corpus on disk.
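To add several words at once, a set union is one option. The sketch below uses a short stand-in list in place of stopwords.words('english') so that it runs without the corpus; the extra words are hypothetical examples.

```python
# Stand-in for stopwords.words('english'), shortened so the snippet runs
# without the corpus. The extra words are hypothetical examples.
base_stop_words = ['i', 'me', 'my', 'the', 'a']
extra_words = ['forexample', 'btw', 'the']

# A set removes duplicates ('the' appears in both lists) and makes
# membership tests fast. The original list is left unchanged.
our_stopwords = set(base_stop_words) | set(extra_words)

print(len(our_stopwords))      # 7
print('btw' in our_stopwords)  # True
```

With the real corpus, base_stop_words would simply be the list returned by stopwords.words('english').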

Removal of stop words:

The code below removes all the stop words from a sample sentence.

Tokenization:

Tokenization is the process of converting a piece of text into smaller parts known as tokens. These tokens are the core of NLP.

Tokenization is used to convert a sentence into a list of words.
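To make the idea concrete, here is a simplified stand-in for a tokenizer built on a regular expression. It is not how word_tokenize() works internally and handles far fewer cases (contractions, for instance); simple_tokenize is a hypothetical name used only for this illustration.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello, this is a test."))
# ['Hello', ',', 'this', 'is', 'a', 'test', '.']
```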

Approach:

Import the nltk module using the import keyword.
Import stopwords from nltk.corpus using the import keyword.
Download ‘stopwords’ and ‘punkt’ from the nltk module using the download() function.
Import word_tokenize from nltk.tokenize using the import keyword.
Give a sample string as static input and store it in a variable.
Pass the given string to the word_tokenize() function to convert it into a list of words.
Remove the stop words from the list of words using a list comprehension and store the result in another variable.
Print the list of words that remain after removing the stop words.
End of the program.

Below is the implementation:

# Import nltk module using the import keyword.
import nltk
# Import stopwords from nltk.corpus using the import keyword.
from nltk.corpus import stopwords
# Download 'stopwords' and 'punkt' from the nltk module using the download() function.
# (Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.)
nltk.download('stopwords')
nltk.download('punkt')
# Import word_tokenize from nltk.tokenize using the import keyword.
from nltk.tokenize import word_tokenize
# Give a sample string as static input and store it in a variable.
gvn_str = "hello this is btechgeeks in is good morning all is a"
# Pass the given string to the word_tokenize() function to convert the given
# string into a list of words.
text_tokens = word_tokenize(gvn_str)
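# Build a set of English stop words once. Calling stopwords.words() with no
# argument would load every language and repeat that work for each token.
stop_words = set(stopwords.words('english'))
# Keep only the tokens that are not stop words and store them in another variable.
stopwrds_removd = [word for word in text_tokens if word not in stop_words]
# Print the list of tokens that remain after removing the stop words.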
print(stopwrds_removd)

Output:

['hello', 'btechgeeks', 'good', 'morning']
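The NLTK stop word lists are lowercase, so a capitalised token such as "This" would slip through the comparison above. A minimal sketch of a case-insensitive variant follows; remove_stop_words is a hypothetical helper, and the small stop_words set is a stand-in for set(stopwords.words('english')) so that the snippet runs without the corpus.

```python
def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


# Short stand-in for set(stopwords.words('english')).
stop_words = {'this', 'is', 'in', 'all', 'a'}
tokens = "Hello This IS btechgeeks in Is good morning All is a".split()

kept = remove_stop_words(tokens, stop_words)
print(kept)            # ['Hello', 'btechgeeks', 'good', 'morning']
print(" ".join(kept))  # Hello btechgeeks good morning
```

Joining the remaining tokens with " ".join() turns the list back into a string when a sentence is needed rather than a list.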