Filtering Text using Enchant in Python

Enchant Module in Python:

Enchant is a Python module that checks a word's spelling and provides suggestions for correcting it. It does this by determining whether or not a word is in the dictionary.
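For instance, here is a minimal spell-checking sketch (assuming PyEnchant is installed and the en_US dictionary is available; the exact suggestion list varies by dictionary):

# Import the enchant module using the import keyword
import enchant

# Create a dictionary object for US English
gvn_dict = enchant.Dict("en_US")

# check() returns True if the word is spelled correctly, False otherwise
print(gvn_dict.check("Hello"))  # True
print(gvn_dict.check("Helo"))   # False

# suggest() returns a list of possible corrections for a misspelled word
print(gvn_dict.suggest("Helo")) # e.g. ['Hello', 'Help', ...]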

To tokenize text, Enchant provides the enchant.tokenize module. Tokenizing is the process of splitting the individual words out of a body of text. However, not every word should be tokenized every time: when spell-checking, it is common practice to ignore email addresses and URLs. This can be accomplished by using filters to alter/modify the tokenization process.
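Before any filtering, here is a minimal sketch of what the tokenizer returns (the sample sentence is made up for illustration). Each token is a (word, position) tuple, where position is the word's starting index in the original string:

from enchant.tokenize import get_tokenizer

# Get the tokenizer for US English
tokenizer = get_tokenizer("en_US")

# Each token is a (word, position) tuple
print([wrds for wrds in tokenizer("Hello there people")])
# [('Hello', 0), ('there', 6), ('people', 12)]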


The following filters are currently available:

1) EmailFilter

Approach:

  • Import get_tokenizer from the enchant.tokenize module using the import keyword.
  • Import EmailFilter from the enchant.tokenize module using the import keyword.
  • Give the text to be tokenized as static input and store it in a variable.
  • Get the tokenizer by passing the language as an argument to the get_tokenizer() function.
  • Print the tokens without filtering.
  • Take a new empty list to store the tokens.
  • Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop.
  • Add/append the token words to the newly created list using the append() function.
  • Print the token list (it prints the tokens without filtering).
  • Pass the language and the type of filter (here EmailFilter) to the get_tokenizer() function to get the tokenizer with the corresponding filter.
  • Print the tokens after applying the filter.
  • Take a new empty list to store the tokens after filtering.
  • Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop.
  • Add/append the token words to the newly created filteredtokenslist using the append() function.
  • Print the token list after filtering (it prints the tokens with filtering).
  • Exit of the program.

Below is the implementation:

# Import get_tokenizer from the enchant.tokenize module using the import keyword
from enchant.tokenize import get_tokenizer
# Import EmailFilter from the enchant.tokenize module using the import keyword
from enchant.tokenize import EmailFilter

# Give the text to be tokenized as static input and store it in a variable
gvn_text = "My email Id is [email protected]"

# Get the tokenizer by passing the language as an argument to the get_tokenizer() function
tokenizer = get_tokenizer("en_US")

# Print the tokens without filtering
print("Printing the tokens without filtering it:")
# Take a new empty list to store the tokens
tokenslist = []
# Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop
for wrds in tokenizer(gvn_text):
    # Add/append the token words to the newly created list using the append() function
    tokenslist.append(wrds)
# Print the token list (it prints the tokens without filtering)
print(tokenslist)

# Pass the language and the type of filter (here EmailFilter) to the get_tokenizer() function
# to get the tokenizer with the corresponding filter
tokenizerfilter = get_tokenizer("en_US", [EmailFilter])
# Print the tokens after applying the filter
print("\nPrinting the tokens after applying filtering to the given tokens:")
# Take a new empty list to store the tokens after filtering
filteredtokenslist = []
# Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop
for wrds in tokenizerfilter(gvn_text):
    # Add/append the token words to the newly created filteredtokenslist using the append() function
    filteredtokenslist.append(wrds)
# Print the token list after filtering (it prints the tokens with filtering)
print(filteredtokenslist)

Output:

Printing the tokens without filtering it:
[('My', 0), ('email', 3), ('Id', 9), ('is', 12), ('pythonprograms', 15), ('gmail', 30), ('com', 36)]

Printing the tokens after applying filtering to the given tokens:
[('My', 0), ('email', 3), ('Id', 9), ('is', 12)]

2) URLFilter

Approach:

  • Import get_tokenizer from the enchant.tokenize module using the import keyword.
  • Import URLFilter from the enchant.tokenize module using the import keyword.
  • Give the text (containing a URL) to be tokenized as static input and store it in a variable.
  • Get the tokenizer by passing the language as an argument to the get_tokenizer() function.
  • Print the tokens without filtering.
  • Take a new empty list to store the tokens.
  • Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop.
  • Add/append the token words to the newly created list using the append() function.
  • Print the token list (it prints the tokens without filtering).
  • Pass the language and the type of filter (here URLFilter) to the get_tokenizer() function to get the tokenizer with the corresponding filter.
  • Print the tokens after applying the filter.
  • Take a new empty list to store the tokens after filtering.
  • Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop.
  • Add/append the token words to the newly created filteredtokenslist using the append() function.
  • Print the token list after filtering (it prints the tokens with filtering).
  • Exit of the program.

Below is the implementation:

# Import get_tokenizer from the enchant.tokenize module using the import keyword
from enchant.tokenize import get_tokenizer
# Import URLFilter from the enchant.tokenize module using the import keyword
from enchant.tokenize import URLFilter

# Give the text (containing a URL) to be tokenized as static input and store it in a variable
gvn_text = "The given URL is = https://python-programs.com/"

# Get the tokenizer by passing the language as an argument to the get_tokenizer() function
tokenizer = get_tokenizer("en_US")

# Print the tokens without filtering
print("Printing the tokens without filtering it.")
# Take a new empty list to store the tokens
tokenslist = []
# Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop
for wrds in tokenizer(gvn_text):
    # Add/append the token words to the newly created list using the append() function
    tokenslist.append(wrds)
# Print the token list (it prints the tokens without filtering)
print(tokenslist)

# Pass the language and the type of filter (here URLFilter) to the get_tokenizer() function
# to get the tokenizer with the corresponding filter
tokenizerfilter = get_tokenizer("en_US", [URLFilter])
# Print the tokens after applying the filter
print("\nPrinting the tokens after applying filtering to the given tokens:")
# Take a new empty list to store the tokens after filtering
filteredtokenslist = []
# Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop
for wrds in tokenizerfilter(gvn_text):
    # Add/append the token words to the newly created filteredtokenslist using the append() function
    filteredtokenslist.append(wrds)
# Print the token list after filtering (it prints the tokens with filtering)
print(filteredtokenslist)

Output:

Printing the tokens without filtering it.
[('The', 0), ('given', 4), ('URL', 10), ('is', 14), ('https', 19), ('python', 27), ('programs', 34), ('com', 43)]

Printing the tokens after applying filtering to the given tokens:
[('The', 0), ('given', 4), ('URL', 10), ('is', 14)]

3) WikiWordFilter

A WikiWord is a word made up of two or more words that start with capital letters and are run together, such as PythonProgramsCoding in the example below.

Approach:

  • Import get_tokenizer from the enchant.tokenize module using the import keyword.
  • Import WikiWordFilter from the enchant.tokenize module using the import keyword.
  • Give the text (containing two or more words with initial capital letters run together) to be tokenized as static input and store it in a variable.
  • Get the tokenizer by passing the language as an argument to the get_tokenizer() function.
  • Print the tokens without filtering.
  • Take a new empty list to store the tokens.
  • Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop.
  • Add/append the token words to the newly created list using the append() function.
  • Print the token list (it prints the tokens without filtering).
  • Pass the language and the type of filter (here WikiWordFilter) to the get_tokenizer() function to get the tokenizer with the corresponding filter.
  • Print the tokens after applying the filter.
  • Take a new empty list to store the tokens after filtering.
  • Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop.
  • Add/append the token words to the newly created filteredtokenslist using the append() function.
  • Print the token list after filtering (it prints the tokens with filtering).
  • Exit of the program.

Below is the implementation:

# Import get_tokenizer from the enchant.tokenize module using the import keyword
from enchant.tokenize import get_tokenizer
# Import WikiWordFilter from the enchant.tokenize module using the import keyword
from enchant.tokenize import WikiWordFilter

# Give the text (containing two or more words with initial capital letters run together)
# to be tokenized as static input and store it in a variable
gvn_text = "PythonProgramsCoding....Hello all"

# Get the tokenizer by passing the language as an argument to the get_tokenizer() function
tokenizer = get_tokenizer("en_US")

# Print the tokens without filtering
print("Printing the tokens without filtering it.")
# Take a new empty list to store the tokens
tokenslist = []
# Loop over each word of the token text by passing the text as an argument to the tokenizer using a for loop
for wrds in tokenizer(gvn_text):
    # Add/append the token words to the newly created list using the append() function
    tokenslist.append(wrds)
# Print the token list (it prints the tokens without filtering)
print(tokenslist)

# Pass the language and the type of filter (here WikiWordFilter) to the get_tokenizer() function
# to get the tokenizer with the corresponding filter
tokenizerfilter = get_tokenizer("en_US", [WikiWordFilter])
# Print the tokens after applying the filter
print("\nPrinting the tokens after applying filtering to the given tokens:")
# Take a new empty list to store the tokens after filtering
filteredtokenslist = []
# Loop over each word of the filtered token text by passing the text as an argument to tokenizerfilter using a for loop
for wrds in tokenizerfilter(gvn_text):
    # Add/append the token words to the newly created filteredtokenslist using the append() function
    filteredtokenslist.append(wrds)
# Print the token list after filtering (it prints the tokens with filtering)
print(filteredtokenslist)

Output:

Printing the tokens without filtering it.
[('PythonProgramsCoding', 0), ('Hello', 24), ('all', 30)]

Printing the tokens after applying filtering to the given tokens:
[('all', 30)]
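
Note that these filters can also be combined by passing more than one of them in the list given to the get_tokenizer() function. Here is a minimal sketch (the sample text is made up for illustration):

# Import the tokenizer and all three filters from the enchant.tokenize module
from enchant.tokenize import get_tokenizer, EmailFilter, URLFilter, WikiWordFilter

# Pass all three filters so that email addresses, URLs and WikiWords are all skipped
tokenizerfilter = get_tokenizer("en_US", [EmailFilter, URLFilter, WikiWordFilter])

gvn_text = "Mail [email protected] or visit https://python-programs.com/ for PythonPrograms"
# Only the ordinary words survive the filtering
print([wrds for wrds in tokenizerfilter(gvn_text)])
# e.g. [('Mail', 0), ('or', ...), ('visit', ...), ('for', ...)]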