NLTK in Python:
NLTK is a Python toolkit for natural language processing (NLP). It ships with a large collection of text corpora and lexical resources, and it can be used to perform a variety of tasks such as tokenization, parse tree visualization, stemming, and so on.
Tokenization
Tokenization is the process of dividing a large body of text into smaller pieces known as tokens. These tokens are extremely valuable for detecting patterns, and tokenization is regarded as the first stage of stemming and lemmatization. Tokenization also aids in replacing sensitive data elements with non-sensitive ones.
Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, language translation, and so on. To achieve these goals, it is essential to understand the patterns in the text.
The Natural Language Toolkit provides an important module, nltk.tokenize, which supports two levels of tokenization:
- word tokenization (word_tokenize)
- sentence tokenization (sent_tokenize)
nltk.TweetTokenizer() Function:
With the help of the nltk.TweetTokenizer() method, we can split a stream of text, such as a tweet, into small tokens for analysis.
Syntax:
nltk.TweetTokenizer()
Return Value:
TweetTokenizer() returns a tokenizer object; calling its tokenize() method on a string returns the list of tokens.
Method #1: Using TweetTokenizer() Function (Static Input)
Here, when we pass text in the form of a string, the tokenize() method of TweetTokenizer converts the large string into small tokens.
Approach:
- Import the TweetTokenizer class from the nltk.tokenize module using the import keyword.
- Create an instance (object) of the TweetTokenizer class and store it in a variable.
- Give the string as static input and store it in a variable.
- Pass the string to the tokenize() method to split the large string into small tokens.
- Store the result in another variable.
- Print the result.
- Exit the program.
Below is the implementation:
# Import the TweetTokenizer class from the nltk.tokenize module
from nltk.tokenize import TweetTokenizer

# Create an instance (object) of the TweetTokenizer class
# and store it in a variable
tkn = TweetTokenizer()

# Give the string as static input and store it in a variable
gvn_str = "Hello this is Python-programs"

# Pass the string to the tokenize() method to split the large
# string into small tokens and store the result in another variable
rslt = tkn.tokenize(gvn_str)

# Print the result
print(rslt)
Output:
['Hello', 'this', 'is', 'Python-programs']
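TweetTokenizer also accepts constructor options for cleaning up noisy tweet text, such as strip_handles and reduce_len. A short sketch (the tweet text is made up for illustration):

```python
from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len caps runs of a
# repeated character at three occurrences
tkn = TweetTokenizer(strip_handles=True, reduce_len=True)

tokens = tkn.tokenize("@someuser this is soooooooo cool!!!")
print(tokens)
```

With these options, the "@someuser" handle is removed entirely and "soooooooo" is shortened to "sooo", which helps normalize exaggerated spellings before further analysis.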
Method #2: Using TweetTokenizer() Function (User Input)
Approach:
- Import the TweetTokenizer class from the nltk.tokenize module using the import keyword.
- Create an instance (object) of the TweetTokenizer class and store it in a variable.
- Give the string as user input using the input() function and store it in a variable.
- Pass the string to the tokenize() method to split the large string into small tokens.
- Store the result in another variable.
- Print the result.
- Exit the program.
Below is the implementation:
# Import the TweetTokenizer class from the nltk.tokenize module
from nltk.tokenize import TweetTokenizer

# Create an instance (object) of the TweetTokenizer class
# and store it in a variable
tkn = TweetTokenizer()

# Give the string as user input using the input() function
# and store it in a variable
gvn_str = input("Enter some random string = ")

# Pass the string to the tokenize() method to split the large
# string into small tokens and store the result in another variable
rslt = tkn.tokenize(gvn_str)

# Print the result
print(rslt)
Output:
Enter some random string = : %:- <> () a {} []
[':', '%', ':', '-', '<', '>', '(', ')', 'a', '{', '}', '[', ']', ':', '-']