{"id":27159,"date":"2022-04-06T22:04:33","date_gmt":"2022-04-06T16:34:33","guid":{"rendered":"https:\/\/python-programs.com\/?p=27159"},"modified":"2022-04-06T22:04:33","modified_gmt":"2022-04-06T16:34:33","slug":"chunking-text-using-enchant-in-python","status":"publish","type":"post","link":"https:\/\/python-programs.com\/chunking-text-using-enchant-in-python\/","title":{"rendered":"Chunking Text using Enchant in Python"},"content":{"rendered":"
<p><strong>Enchant Module in Python:<\/strong><\/p>\n<p>Enchant is a Python module that checks a word\u2019s spelling, that is, whether or not the word is in the dictionary, and provides suggestions for correcting misspelled words.<\/p>\n<p>To tokenize text, Enchant also provides the <strong>enchant.tokenize<\/strong> module. Tokenizing is the process of splitting a body of text into its individual words. However, not every word in the input should always be tokenized. If the input is an HTML file, plain tokenization also returns the tag names as words. HTML tags typically do not contribute to the content of the article, so the text should be tokenized with the tags excluded. Excluding parts of the input in this way is called chunking.<\/p>\n<h2>Chunking Text using Enchant in Python<\/h2>\n<p>HTMLChunker is the only chunker currently implemented in enchant.tokenize.<\/p>\n<p>In simple words, chunking here means excluding the <strong>HTML tags<\/strong> during tokenization.<\/p>\n<p><strong>Approach:<\/strong><\/p>\n<ul>\n<li>Import the get_tokenizer function and the HTMLChunker class from the enchant.tokenize module.<\/li>\n<li>Give the text with HTML tags as static input and store it in a variable.<\/li>\n<li>Pass the language code to the get_tokenizer() function to get a tokenizer and store it in another variable.<\/li>\n<li>Loop over the tokens of the given text, append each one to a list, and print the list. Without chunking, the list also contains the tag names.<\/li>\n<li>Pass the language code and the chunkers (HTMLChunker) to the get_tokenizer() function to get a tokenizer that skips the HTML tags.<\/li>\n<li>Loop over the tokens again, append each one to a second list, and print it. Only the words outside the tags remain.<\/li>\n<\/ul>\n
\n
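<p>To make the idea of chunking concrete without depending on PyEnchant, here is a minimal sketch that uses only the standard library re module. The helper names (tokenize, html_chunks, tokenize_html) and both regular expressions are assumptions made for illustration. This is not how enchant.tokenize or HTMLChunker is implemented, and the simplified word pattern only matches runs of ASCII letters.<\/p>

```python
import re

# Illustrative patterns (assumptions, not PyEnchant internals)
TAG_RE = re.compile(r"<[^>]*>")      # anything between '<' and '>'
WORD_RE = re.compile(r"[A-Za-z]+")   # simplified notion of a word


def tokenize(text, offset=0):
    """Yield (word, position) pairs for each run of letters in text."""
    for match in WORD_RE.finditer(text):
        yield (match.group(), offset + match.start())


def html_chunks(text):
    """Yield (chunk, position) pairs for the stretches of text outside tags."""
    start = 0
    for tag in TAG_RE.finditer(text):
        if tag.start() > start:
            yield (text[start:tag.start()], start)
        start = tag.end()
    if start < len(text):
        yield (text[start:], start)


def tokenize_html(text):
    """Tokenize only the chunks outside HTML tags, keeping original positions."""
    tokens = []
    for chunk, pos in html_chunks(text):
        tokens.extend(tokenize(chunk, pos))
    return tokens


sample = "<div> <h2> welcome to Python-programs </h2> <br> </div>"
plain_tokens = list(tokenize(sample))   # tag names are included
chunked_tokens = tokenize_html(sample)  # tag names are skipped
print(plain_tokens)
print(chunked_tokens)
```

<p>Each chunk carries its starting offset, so the positions reported for the words refer to the original text rather than to the chunk they were found in.<\/p>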
<p><strong>Below is the implementation:<\/strong><\/p>\n<pre># Import the get_tokenizer function and the HTMLChunker class from the enchant.tokenize module\r\nfrom enchant.tokenize import get_tokenizer, HTMLChunker\r\n\r\n# Give the text with HTML tags to be tokenized as static input and store it in a variable\r\ngvn_txt = \"<div> <h2> welcome to Python-programs <\/h2> <br> <\/div>\"\r\n\r\n# Pass the language code as an argument to the get_tokenizer() function to\r\n# get a tokenizer for that language and store it in another variable\r\ntokenizer = get_tokenizer(\"en_US\")\r\n\r\n# Take an empty list to store the tokens\r\ntokens_lst = []\r\n\r\nprint(\"The tokens of the given text without chunking:\")\r\n# Loop over the (word, position) tokens that the tokenizer yields for the given text\r\nfor wrds in tokenizer(gvn_txt):\r\n    # Append each token to the list using the append() function\r\n    tokens_lst.append(wrds)\r\n\r\n# Print the token list; without chunking it also contains the HTML tag names\r\nprint(tokens_lst)\r\n\r\n# Pass the language code and a tuple of chunkers (HTMLChunker) to the get_tokenizer()\r\n# function to get a tokenizer that skips the HTML tags, and store it in another variable\r\ntokenizer_withchunking = get_tokenizer(\"en_US\", chunkers=(HTMLChunker,))\r\nprint()\r\n\r\n# Take an empty list to store the tokens with chunking\r\ntokenslist_chunk = []\r\n\r\nprint(\"The tokens of the given text after chunking:\")\r\n# Loop over the tokens that the chunking tokenizer yields for the given text\r\nfor wrds in tokenizer_withchunking(gvn_txt):\r\n    # Append each token to the chunking list using the append() function\r\n    tokenslist_chunk.append(wrds)\r\n\r\n# Print the token list with chunking; only the words outside the HTML tags remain\r\nprint(tokenslist_chunk)\r\n<\/pre>\n
<p><strong>Output:<\/strong><\/p>\n<pre>The tokens of the given text without chunking:\r\n[('div', 1), ('h', 7), ('welcome', 11), ('to', 19), ('Python', 22), ('programs', 29), ('h', 40), ('br', 45), ('div', 51)]\r\n\r\nThe tokens of the given text after chunking:\r\n[('welcome', 11), ('to', 19), ('Python', 22), ('programs', 29)]<\/pre>\n","protected":false},"excerpt":{"rendered":"