Chunking Text using Enchant in Python

Enchant Module in Python:

Enchant is a Python module that checks a word's spelling and provides suggestions for correcting it. In other words, it determines whether or not a word is in the dictionary and, if it is not, offers likely replacements.
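For example, the core spell-checking API can be used like this (a minimal sketch; "helo" is just a sample misspelling, and the exact suggestions depend on the installed dictionary):

import enchant

# Create a dictionary object for US English
d = enchant.Dict("en_US")

# Check whether words are spelled correctly
print(d.check("Hello"))  # True
print(d.check("helo"))   # False

# Get suggestions for correcting a misspelled word
print(d.suggest("helo"))  # e.g. ['hole', 'help', 'hello', ...]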

To tokenize text, Enchant also provides the enchant.tokenize module. Tokenizing is the process of splitting the words out of the body of a text. However, not every part of a text should be tokenized. Suppose we have an HTML file: upon tokenization, all of the tags will be included as well. HTML tags typically do not contribute to the content of the article, so there is a need to tokenize while excluding them.
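For example, the basic tokenizer yields (word, position) tuples, where the position is the word's character offset in the input text (a minimal sketch):

from enchant.tokenize import get_tokenizer

# Get the default English tokenizer
tokenizer = get_tokenizer("en_US")

# Each token is a (word, offset) tuple
print(list(tokenizer("this is a sentence")))
# [('this', 0), ('is', 5), ('a', 8), ('sentence', 10)]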

HTMLChunker is the only chunker that is currently implemented in enchant.tokenize.

In simple words, chunking here means excluding the HTML tags during tokenization.
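For example, with an HTML chunker attached, the tags disappear from the token stream (a small preview of what the full program below demonstrates):

from enchant.tokenize import get_tokenizer, HTMLChunker

tokenizer = get_tokenizer("en_US", chunkers=(HTMLChunker,))
print(list(tokenizer("<b>bold</b> text")))
# [('bold', 3), ('text', 12)]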

Approach:

  • Import the get_tokenizer function from the enchant.tokenize module.
  • Import the HTMLChunker class from the enchant.tokenize module.
  • Give the text with HTML tags to be tokenized as static input and store it in a variable.
  • Pass the language code as an argument to the get_tokenizer() function to get a tokenizer and store it in another variable.
  • Take an empty list to store the tokens.
  • Print a heading for the tokens of the given text without chunking.
  • Loop over each token of the given text by passing the text as an argument to the tokenizer using a for loop.
  • Add/append each token to the newly created list using the append() function.
  • Print the token list (it prints the tokens with their positions, without chunking).
  • Pass the language code and the chunker type (HTMLChunker) as arguments to the get_tokenizer() function to get a tokenizer with HTML chunking and store it in another variable.
  • Take an empty list to store the tokens with chunking.
  • Print a heading for the tokens of the given text after chunking.
  • Loop over each token of the given text with chunking by passing the text as an argument to tokenizer_withchunking using a for loop.
  • Add/append each token to the newly created chunking list using the append() function.
  • Print the token list with chunking (it prints the tokens with their positions, with the HTML tags excluded).
  • The Exit of the Program.

Below is the implementation:

# Import the get_tokenizer function from the enchant.tokenize module
from enchant.tokenize import get_tokenizer
# Import the HTMLChunker class from the enchant.tokenize module
from enchant.tokenize import HTMLChunker

# Give the text with HTML tags to be tokenized as static input and store it in a variable
gvn_txt = "<div> <h2> welcome to Python-programs </h2> <br> </div>"

# Pass the language code as an argument to the get_tokenizer() function to
# get a tokenizer and store it in another variable
tokenizer = get_tokenizer("en_US")

# Take an empty list to store the tokens
tokens_lst = []

# Printing the tokens of the given text without chunking
print("The tokens of the given text without chunking:")
# Loop over each token of the given text by passing the text as an argument to the tokenizer
for wrds in tokenizer(gvn_txt):
    # Add/append the token to the newly created list using the append() function
    tokens_lst.append(wrds)

# Print the token list (it prints the tokens with their positions, without chunking)
print(tokens_lst)

# Pass the language code and the chunker type (HTMLChunker) as arguments to the
# get_tokenizer() function to get a tokenizer with HTML chunking and store it
# in another variable
tokenizer_withchunking = get_tokenizer("en_US", chunkers=(HTMLChunker,))
print()

# Take an empty list to store the tokens with chunking
tokenslist_chunk = []

# Printing the tokens of the given text after chunking
print("The tokens of the given text after chunking:")

# Loop over each token of the given text with chunking by passing the text
# as an argument to tokenizer_withchunking
for wrds in tokenizer_withchunking(gvn_txt):
    # Add/append the token to the newly created chunking list using the append() function
    tokenslist_chunk.append(wrds)

# Print the token list with chunking (it prints the tokens with their positions,
# with the HTML tags excluded)
print(tokenslist_chunk)

Output:

The tokens of the given text without chunking:
[('div', 1), ('h', 7), ('welcome', 11), ('to', 19), ('Python', 22), ('programs', 29), ('h', 40), ('br', 45), ('div', 51)]

The tokens of the given text after chunking:
[('welcome', 11), ('to', 19), ('Python', 22), ('programs', 29)]
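
Since Enchant's main purpose is spell checking, the chunked tokens can be fed straight into a dictionary check. Below is a minimal sketch of that idea; it reuses the same input text as the program above so it can run on its own:

import enchant
from enchant.tokenize import get_tokenizer, HTMLChunker

d = enchant.Dict("en_US")
tokenizer_withchunking = get_tokenizer("en_US", chunkers=(HTMLChunker,))
gvn_txt = "<div> <h2> welcome to Python-programs </h2> <br> </div>"

# Spell-check only the content words, skipping the HTML tags
for word, pos in tokenizer_withchunking(gvn_txt):
    print(word, "->", d.check(word))

This prints True for each of the four content words (welcome, to, Python, programs), while the tag fragments like 'div' and 'br' are never checked at all because the chunker removed them before tokenization.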