Python nltk.word_tokenize() Function

NLTK in Python:

NLTK is a Python toolkit for natural language processing (NLP). It ships with a large collection of corpora and trained models for text processing and can be used to perform a variety of tasks such as tokenization, parse tree visualization, and so on.

Tokenization

Tokenization is the process of dividing a large amount of text into smaller pieces known as tokens. These tokens are extremely valuable for detecting patterns and are regarded as the first stage of stemming and lemmatization (a sketch of this pipeline follows below). In a data-security context, the term tokenization also refers to replacing sensitive data elements with non-sensitive placeholders.

Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, language translation, and so on. To achieve these goals, it is essential to understand the patterns in the text.

nltk.word_tokenize() Function:

The “nltk.word_tokenize()” function is used to split text into word tokens with NLTK.

NLTK tokenization is a method of dividing a vast amount of textual data into sections so that the characteristics of the text can be analyzed.

Tokenization using NLTK is commonly used for text cleaning and for preparing training data for machine learning models. With NLTK, tokenized words and sentences can be converted into a data frame and vectorized; the cleaned, parsed tokens then feed into later steps such as lemmatization, stemming, and machine learning algorithm training. A sketch of vectorizing NLTK tokens is shown below.

“tokenize” is the tokenization package of the NLTK Python library. There are two kinds of tokenization functions in the NLTK “tokenize” package, illustrated together in the sketch after the list below.

  • word_tokenize –  To tokenize words.
  • sent_tokenize –  To tokenize sentences.

Syntax:

word_tokenize(string)

Advantages of word tokenization with NLTK

NLTK supports many word tokenization schemes, including white space tokenization, dictionary-based tokenization, rule-based tokenization, regular expression tokenization, Penn Treebank tokenization, and subword tokenization, among others. All of these kinds of word tokenization are part of the text normalization procedure, and normalizing the text with stemming and lemmatization improves the accuracy of language understanding algorithms. The advantages and benefits of word tokenization with NLTK are listed below.

  • Easily removing stop words from a corpus once it has been tokenized.
  • Splitting words into sub-words to improve understanding of the text.
  • Using NLTK, resolving textual ambiguity is faster and needs less code.
  • Aside from White Space Tokenization, Dictionary Based and Rule-based Tokenization are also simple to implement.
  • NLTK makes it easy to work with subword schemes such as Byte Pair Encoding, WordPiece Encoding, Unigram Language Model, and SentencePiece Encoding.
  • TweetTokenizer in NLTK is used to tokenize tweets that include emojis and other Twitter standards.
  • PunktSentenceTokenizer in NLTK has a pre-trained model for tokenization in several European languages.
  • NLTK includes a Multi-Word Expression Tokenizer (MWETokenizer) for tokenizing compound expressions like “in spite of”.
  • RegexpTokenizer in NLTK is used to tokenize phrases based on regular expressions. A sketch of these specialized tokenizers follows this list.

nltk.word_tokenize() Function in Python

Method #1: Using word_tokenize() Function (Static Input)

Approach:

  • Import the word_tokenize() function from the nltk.tokenize module using the import keyword
  • Give the string as static input and store it in a variable.
  • Pass the above-given string as an argument to the word_tokenize() function to tokenize into words and print the result.
  • The Exit of the Program.

Below is the implementation:

# Import the word_tokenize() function from the nltk.tokenize module using the import keyword
from nltk.tokenize import word_tokenize

# Give the string as static input and store it in a variable.
gvn_str = "hello this is Python-programs welcome all"

# Pass the above given string as an argument to the word_tokenize() function to 
# tokenize into words and print the result.
print(word_tokenize(gvn_str))

Output:

['hello', 'this', 'is', 'Python-programs', 'welcome', 'all']

Method #2: Using word_tokenize() Function (User Input)

Approach:

  • Import the word_tokenize() function from the nltk.tokenize module using the import keyword
  • Give the string as user input using the input() function and store it in a variable.
  • Pass the above-given string as an argument to the word_tokenize() function to tokenize into words and print the result.
  • The Exit of the Program.

Below is the implementation:

# Import the word_tokenize() function from the nltk.tokenize module using the import keyword
from nltk.tokenize import word_tokenize

# Give the string as user input using the input() function and store it in a variable.
gvn_str = input("Enter some random string = ")

# Pass the above given string as an argument to the word_tokenize() function to 
# tokenize into words and print the result.
print(word_tokenize(gvn_str))

Output:

Enter some random string = good morning python programs
['good', 'morning', 'python', 'programs']