Python nltk.WhitespaceTokenizer() Function

NLTK in Python:

NLTK (Natural Language Toolkit) is a Python toolkit for natural language processing (NLP). It provides a large number of text corpora and sample datasets, along with a range of text processing libraries. NLTK can be used to perform a variety of tasks such as tokenizing, parse tree visualization, and so on.

Tokenization

Tokenization is the process of dividing a large amount of text into smaller pieces known as tokens. These tokens are valuable for detecting patterns and are usually the first step before stemming and lemmatization. (The same word is also used in data security for replacing sensitive data elements with non-sensitive ones, but that is a different technique from the text tokenization discussed here.)
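As a minimal illustration of splitting text into tokens and then spotting a simple pattern (how often each token occurs), the sketch below uses only Python's built-in str.split() and collections.Counter rather than NLTK. The sample sentence is made up for this example.

```python
from collections import Counter

# A made-up sample sentence, used only for illustration
text = "the cat sat on the mat and the dog sat too"

# Split the text into tokens on whitespace
tokens = text.split()

# Count how often each token occurs
counts = Counter(tokens)

print(tokens)
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```

Real pipelines usually normalize case and strip punctuation before counting; that step is left out here to keep the sketch short.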

Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, language translation, and so on. To build such applications, it is essential to understand the patterns in the text.

The Natural Language Toolkit features an important module called nltk.tokenize, whose functionality falls into two broad kinds of tokenization:

  • word tokenization (for example, word_tokenize)
  • sentence tokenization (for example, sent_tokenize)

nltk.tokenize.WhitespaceTokenizer() Function:

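Using the WhitespaceTokenizer class from the nltk.tokenize module, we can extract tokens from a string of words/sentences by splitting it on whitespace (spaces, new lines, and tabs). The whitespace itself is discarded, while punctuation stays attached to the token it touches.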

Syntax:

tokenize.WhitespaceTokenizer()

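Parameters: The WhitespaceTokenizer() constructor doesn’t accept any parameters. The string to be tokenized is passed to the tokenize() method of the resulting object.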

Return Value:

The tokenize() method of WhitespaceTokenizer returns a list of tokens: the pieces of the string that were separated by spaces, new lines, and tabs, with that whitespace removed.
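To make the splitting rule concrete: as far as the NLTK source and docstring indicate, WhitespaceTokenizer is a RegexpTokenizer that treats runs of whitespace (the pattern \s+) as gaps between tokens, and the docstring itself notes that str.split() is usually sufficient. The sketch below reproduces that rule with the standard re module only. The helper name whitespace_tokens is hypothetical and is not part of NLTK.

```python
import re

def whitespace_tokens(text):
    """Split text on runs of whitespace and drop empty pieces.

    Hypothetical helper that mimics the splitting rule described above;
    it is not an NLTK function.
    """
    return [piece for piece in re.split(r"\s+", text) if piece]

sample = "hello python-programs.. @@&* \n welcome\t good morning"

print(whitespace_tokens(sample))
# ['hello', 'python-programs..', '@@&*', 'welcome', 'good', 'morning']

# For this input, the built-in str.split() gives the same list
print(whitespace_tokens(sample) == sample.split())  # True
```

Note that punctuation such as ".." and "@@&*" is kept, because only whitespace is treated as a separator.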

Examples:

Example1:

Input:

Given String = "hello python-programs.. @@&* \n welcome\t good morning"

Output:

['hello', 'python-programs..', '@@&*', 'welcome', 'good', 'morning']

Example2:

Input:

Given String =  "welcome to\t the\n world \nof python"

Output:

['welcome', 'to', 'the', 'world', 'of', 'python']

nltk.tokenize.WhitespaceTokenizer() Function in Python

Method #1: Using WhitespaceTokenizer() Function (Static Input)

Here, we split the input text into tokens wherever whitespace separates the words.

Approach:

  • Import the WhitespaceTokenizer class from the tokenize module of nltk using the import keyword.
  • Create an instance (object) of the WhitespaceTokenizer class.
  • Give the string as static input and store it in a variable.
  • Pass the above-given string to the tokenize() method to split it into tokens on whitespaces, new lines, and tabs, and store the result in another variable.
  • Print the above result.
  • The Exit of the Program.

Below is the implementation:

# Import the WhitespaceTokenizer class from the tokenize module of nltk
from nltk.tokenize import WhitespaceTokenizer

# Create an instance (object) of the WhitespaceTokenizer class
tkn = WhitespaceTokenizer()

# Give the string as static input and store it in a variable.
gvn_str = "hello python-programs..  @@&* \n welcome\t good morning"

# Pass the above given string to tokenize() to split it into tokens on
# whitespaces, new lines, and tabs, and store the result in another variable
rslt = tkn.tokenize(gvn_str)

# Print the above result
print(rslt)

Output:

['hello', 'python-programs..', '@@&*', 'welcome', 'good', 'morning']
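When the position of each token in the original string matters (for example, to highlight matches), WhitespaceTokenizer also offers span_tokenize(), which yields (start, end) offsets instead of the token text. The sketch below tries that method and, as an assumption made only so the example still runs where NLTK is not installed, falls back to an equivalent regular-expression scan with the standard re module.

```python
import re

text = "welcome to\t the\n world \nof python"

try:
    from nltk.tokenize import WhitespaceTokenizer
    # (start, end) character offsets of each token in the original string
    spans = list(WhitespaceTokenizer().span_tokenize(text))
except ImportError:
    # Fallback for environments without NLTK (an assumption of this sketch):
    # every run of non-whitespace characters is one token
    spans = [match.span() for match in re.finditer(r"\S+", text)]

# Slicing the original string with each span recovers the token text
for start, end in spans:
    print(start, end, repr(text[start:end]))
```

Slicing the original string with the offsets gives back the same tokens that tokenize() returns.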

Method #2: Using WhitespaceTokenizer() Function (User Input)

Approach:

  • Import the WhitespaceTokenizer class from the tokenize module of nltk using the import keyword.
  • Create an instance (object) of the WhitespaceTokenizer class.
  • Give the string as user input using the input() function and store it in a variable.
  • Pass the above-given string to the tokenize() method to split it into tokens on whitespaces, new lines, and tabs, and store the result in another variable.
  • Print the above result.
  • The Exit of the Program.

Below is the implementation:

# Import the WhitespaceTokenizer class from the tokenize module of nltk
from nltk.tokenize import WhitespaceTokenizer

# Create an instance (object) of the WhitespaceTokenizer class
tkn = WhitespaceTokenizer()

# Give the string as user input using the input() function and store it in a variable.
gvn_str = input("Enter some random string = ")

# Pass the above given string to tokenize() to split it into tokens on
# whitespaces, new lines, and tabs, and store the result in another variable
rslt = tkn.tokenize(gvn_str)

# Print the above result
print(rslt)

Output:

Enter some random string = good morning python programs
['good', 'morning', 'python', 'programs']
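One detail worth checking with user input: input() returns the characters exactly as typed, so typing \t at the prompt produces a backslash followed by the letter t, not a tab character, and the tokenizer does not split there. The sketch below simulates typed input with raw strings instead of calling input(), so it can run unattended. The fallback to str.split() is an assumption made only so the example runs without NLTK installed.

```python
try:
    from nltk.tokenize import WhitespaceTokenizer
    tokenize = WhitespaceTokenizer().tokenize
except ImportError:
    # Fallback for environments without NLTK (an assumption of this sketch)
    tokenize = str.split

# What input() would return if the user literally typed: welcome\tgood morning
typed = r"welcome\tgood morning"

# The same text with a real tab character, as in the static-input example
real_tab = "welcome\tgood morning"

print(tokenize(typed))     # ['welcome\\tgood', 'morning']
print(tokenize(real_tab))  # ['welcome', 'good', 'morning']

# Input that contains only whitespace produces an empty list
print(tokenize("   "))     # []
```

If the program should do something sensible when the user enters nothing, checking for an empty result list before using it is a simple safeguard.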