In this post, we look at how to use the third-party Python library FuzzyWuzzy to match strings and measure how similar they are, using several examples.
Python has a few methods for comparing two strings. A few of the most common methods are listed below.
- Using Regex
- Simple Compare
- Using difflib
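As a rough illustration of these three approaches, the sketch below uses only the standard library. The sample strings and variable names are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

first = "Hello this is btechgeeks"
second = "hello this is btechgeeks"

# Regex: case-insensitive full match of one string against the other.
# re.escape() guards against regex metacharacters in the pattern string.
regex_match = re.fullmatch(re.escape(first), second, flags=re.IGNORECASE) is not None

# Simple compare: exact, case-sensitive equality.
simple_match = first == second

# difflib: similarity as a float between 0.0 and 1.0.
similarity = SequenceMatcher(None, first, second).ratio()

print(regex_match, simple_match, round(similarity, 2))
```

The regex and the simple comparison both give a yes/no answer, while difflib gives a degree of similarity, which is the idea FuzzyWuzzy builds on.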
FuzzyWuzzy library in Python:
The FuzzyWuzzy library was created to compare two strings. Other modules, such as re (regex) and difflib, can also compare strings, but FuzzyWuzzy differs in what it returns: instead of True, False, or a string, its methods return a score out of 100 based on how closely the strings match.
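Fuzzy string matching is the process of finding strings that approximately, rather than exactly, match a given pattern. To measure the difference between sequences, FuzzyWuzzy uses Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other.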
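To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch of the classic dynamic-programming algorithm. The function name levenshtein is chosen for this example; it is not FuzzyWuzzy's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))          # 3
print(levenshtein("btechgeeks", "btech geeks"))  # 1
```

A smaller distance means the strings are closer; a distance of 0 means they are identical.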
It is an open-source library developed and released by SeatGeek.
This library can assist in mapping databases that lack a common key, such as joining two tables by company name when the name is spelled differently in each table.
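The kind of join described here can be sketched with the standard library's difflib.get_close_matches() standing in for FuzzyWuzzy. The table contents, the function name join_on_similar_name, and the cutoff value are all hypothetical and exist only for illustration.

```python
from difflib import get_close_matches

# Hypothetical sample tables: the same companies, spelled differently in each.
city_by_company = {"Acme Corp.": "New York", "Globex Corporation": "Berlin"}
staff_by_company = {"acme corp": 120, "Globex Corp": 45}


def join_on_similar_name(left, right, cutoff=0.6):
    """Pair each key in left with the most similar key in right, if any."""
    # Map lowercased names back to the original keys of the right-hand table.
    right_names = {name.lower(): name for name in right}
    joined = {}
    for name, left_value in left.items():
        matches = get_close_matches(name.lower(), list(right_names), n=1, cutoff=cutoff)
        if matches:
            joined[name] = (left_value, right[right_names[matches[0]]])
    return joined


joined = join_on_similar_name(city_by_company, staff_by_company)
print(joined)
```

In practice the cutoff needs tuning: too low and unrelated names are paired, too high and genuine matches are missed.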
Example 1 (simple comparison):
# Give the first string as static input and store it in a variable.
gvn_str1 = "Hello this is btechgeeks"
# Give the second string as static input and store it in a variable.
gvn_str2 = "Hello this is btechgeeks"
# Check whether the two strings are equal using the == operator
comparison_rslt = gvn_str1 == gvn_str2
# Print the result of the comparison
print(comparison_rslt)
Output:
True
FuzzyWuzzy Python library
Run the following command to install the fuzzywuzzy library:
pip install fuzzywuzzy
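Optionally, also install python-Levenshtein. fuzzywuzzy uses it to speed up matching and otherwise falls back to the slower pure-Python difflib: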
pip install python-Levenshtein
The following methods are covered:
- Using fuzz.ratio()
- Using fuzz.partial_ratio()
- Using fuzz.token_sort_ratio()
- Using process module
- Using fuzz.WRatio()
Method #1: Using fuzz.ratio()
The fuzz module compares two strings at a time. Each of its functions returns a score out of 100; they differ in how the comparison is made.
fuzz.ratio() function:
It is one of the most important fuzz module methods. It compares the two strings in full and assigns a score based on how closely they match.
Approach:
- Import fuzz from fuzzywuzzy module using the import keyword
- Give the first string as static input and store it in a variable.
- Give the second string as static input and store it in another variable.
- Check how similar the two strings are by passing both of them as arguments to the fuzz.ratio() function.
- It returns a score out of 100 that indicates how similar they are.
- Similarly, check other pairs of strings and print the results.
- Exit the program.
Below is the implementation:
# Import fuzz from fuzzywuzzy module using the import keyword
from fuzzywuzzy import fuzz

# Give the first string as static input and store it in a variable.
gvn_str1 = "Hello this is btechgeeks"
# Give the second string as static input and store it in another variable.
gvn_str2 = "Hello this is btechgeeks"
# Check how similar the two strings are using the fuzz.ratio() function
# by passing both strings as arguments to it.
# It returns a score out of 100 that indicates how similar they are.
print(fuzz.ratio(gvn_str1, gvn_str2))
# Similarly check with other strings and print the result
print(fuzz.ratio('btechgeeks', 'btech geeks'))
print(fuzz.ratio('Hello this is btechgeeks', 'Hello btechgeeks'))
print(fuzz.ratio('Hello this is btechgeeks', 'Welcome'))
Output:
100
95
80
26
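To see roughly where such scores come from, the sketch below scales difflib's SequenceMatcher ratio (2 x matching characters / total characters) to an integer out of 100. This is an approximation for illustration: the name ratio_sketch is made up here, and scores from the real library may differ slightly when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def ratio_sketch(first: str, second: str) -> int:
    """Scale difflib's 0.0-1.0 similarity to an integer score out of 100."""
    return int(round(100 * SequenceMatcher(None, first, second).ratio()))


print(ratio_sketch("Hello this is btechgeeks", "Hello this is btechgeeks"))  # 100
print(ratio_sketch("btechgeeks", "btech geeks"))                             # 95
print(ratio_sketch("Hello this is btechgeeks", "Hello btechgeeks"))          # 80
```

For 'btechgeeks' and 'btech geeks', 10 characters match out of 10 + 11 = 21 in total, so the score is 2 x 10 / 21, or about 95.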
Method #2: Using fuzz.partial_ratio()
Another useful method in the fuzzywuzzy library is partial_ratio(). It handles more complex comparisons such as substring matching: the shorter string is compared against the best-matching part of the longer one.
Below is the implementation:
# Import fuzz from fuzzywuzzy module using the import keyword
from fuzzywuzzy import fuzz

# Give the first string as static input and store it in a variable.
gvn_str1 = "Hello this is btechgeeks"
# Give the second string as static input and store it in a variable.
gvn_str2 = "techgeeks"
# Check how similar the two strings are using the fuzz.partial_ratio() function
# by passing both strings as arguments to it.
# It returns a score out of 100 that indicates how similar they are.
print(fuzz.partial_ratio(gvn_str1, gvn_str2))
# Similarly check with other strings and print the result
# Here the strings differ only by the exclamation mark; the shorter string
# matches part of the longer one exactly, hence it returns 100
print(fuzz.partial_ratio('btechgeeks', 'btechgeeks!'))
print(fuzz.partial_ratio('btechgeeks', 'btech geeks'))
print(fuzz.partial_ratio('Hello this is btechgeeks', 'this is'))
# Only part of the second string matches here, hence it returns 55
print(fuzz.partial_ratio('Hello this is btechgeeks', 'this python'))
Output:
100
100
90
100
55
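One way to picture partial matching is to slide the shorter string along the longer one and keep the best score over all same-length windows. The sketch below does this with difflib. It is a simplified stand-in (the name partial_ratio_sketch is hypothetical), and the library chooses its candidate windows differently, so scores will not always agree.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(first: str, second: str) -> int:
    """Best score of the shorter string against any same-length window of the longer."""
    shorter, longer = sorted((first, second), key=len)
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(partial_ratio_sketch("Hello this is btechgeeks", "techgeeks"))  # 100
print(partial_ratio_sketch("btechgeeks", "btechgeeks!"))              # 100
print(partial_ratio_sketch("btechgeeks", "btech geeks"))              # 90
```

Because only the best window counts, extra text before or after the matching part does not lower the score.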
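Method #3: Using fuzz.token_sort_ratio()
The methods above do not always give an accurate result, because changing the order of the words in a string lowers the score even when the words themselves are the same.
However, the fuzzywuzzy module provides a solution: token_sort_ratio() splits each string into words (tokens), sorts them, and then compares the results.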
Below is the implementation:
# Import fuzz from fuzzywuzzy module using the import keyword
from fuzzywuzzy import fuzz

# Give the first string as static input and store it in a variable.
gvn_str1 = "Hello this is btechgeeks"
# Give the second string as static input and store it in a variable.
gvn_str2 = "this btechgeeks hello is"
# Check how similar the two strings are using the fuzz.token_sort_ratio() function
# by passing both strings as arguments to it.
# Here both strings contain the same words but in a different order.
# Hence it returns 100
print(fuzz.token_sort_ratio(gvn_str1, gvn_str2))
# Similarly check with the other strings and print the result
print(fuzz.token_sort_ratio("Hello this btechgeeks", "Hello this this btechgeeks"))
# Here it returns 100 since the duplicate words are treated as a single word
# by the token_set_ratio() function
print(fuzz.token_set_ratio("Hello this btechgeeks", "Hello this this btechgeeks"))
Output:
100
89
100
Explanation:
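Here, we also used another method called fuzz.token_set_ratio(). It splits both strings into tokens, treats them as sets, extracts the common tokens (the intersection), and then performs pairwise ratio() comparisons between the sorted intersection and the intersection combined with each string's remaining tokens, keeping the best score.
Because the sorted intersection is always the same in both strings, the score rises when the common tokens make up a larger share of each string or when the remaining tokens are close to each other. Duplicate words collapse into a single token, which is why the last example returns 100.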
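The two token-based comparisons can be sketched with difflib as follows. This is a simplified illustration: the function names are made up, and the real library also strips punctuation before tokenizing, which this sketch does not do.

```python
from difflib import SequenceMatcher


def _score(first: str, second: str) -> int:
    """Similarity of two strings as an integer out of 100."""
    return int(round(100 * SequenceMatcher(None, first, second).ratio()))


def token_sort_sketch(first: str, second: str) -> int:
    """Compare the strings after lowercasing and sorting their words."""
    sorted_first = " ".join(sorted(first.lower().split()))
    sorted_second = " ".join(sorted(second.lower().split()))
    return _score(sorted_first, sorted_second)


def token_set_sketch(first: str, second: str) -> int:
    """Compare the shared words against each string's set of unique words."""
    tokens_first = set(first.lower().split())
    tokens_second = set(second.lower().split())
    common = " ".join(sorted(tokens_first & tokens_second))
    rest_first = " ".join(sorted(tokens_first - tokens_second))
    rest_second = " ".join(sorted(tokens_second - tokens_first))
    combined_first = f"{common} {rest_first}".strip()
    combined_second = f"{common} {rest_second}".strip()
    # Keep the best of the three pairwise comparisons.
    return max(
        _score(common, combined_first),
        _score(common, combined_second),
        _score(combined_first, combined_second),
    )


print(token_sort_sketch("Hello this is btechgeeks", "this btechgeeks hello is"))  # 100
print(token_sort_sketch("Hello this btechgeeks", "Hello this this btechgeeks"))   # 89
print(token_set_sketch("Hello this btechgeeks", "Hello this this btechgeeks"))    # 100
```

Sorting removes the effect of word order, while converting to sets additionally removes the effect of repeated words.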
Method #4: Using process module
If we have a list of options and want to find the closest match/matches we can use the process module to do so.
Below is the implementation:
# Import process module from fuzzywuzzy module using the import keyword
from fuzzywuzzy import process

# Give the string as static input and store it in a variable.
gvn_string = "Hello betchgeeks"
# Give the list of options and store it in another variable
gvn_options = ["hello", "Hello python", "betchgeeks", "Python betchgeeks"]
# Extract the ratios by passing the given string and options to extract() function
ratios_rslt = process.extract(gvn_string, gvn_options)
print(ratios_rslt)
# We can choose the string that has highest matching percentage using extractOne() function
most_accurate = process.extractOne(gvn_string, gvn_options)
print('Most Accurate Result:', most_accurate)
Output:
[('hello', 90), ('betchgeeks', 90), ('Python betchgeeks', 79), ('Hello python', 57)]
Most Accurate Result: ('hello', 90)
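Conceptually, the process module scores every option against the query and sorts the results. The sketch below mimics that with difflib. The names rank_choices and best_choice are hypothetical, and the scores will differ from the output above because the library uses WRatio as its default scorer rather than a plain ratio.

```python
from difflib import SequenceMatcher


def rank_choices(query, choices, limit=5):
    """Return up to `limit` (choice, score) pairs, best match first."""
    def score(choice):
        matcher = SequenceMatcher(None, query.lower(), choice.lower())
        return int(round(100 * matcher.ratio()))

    scored = [(choice, score(choice)) for choice in choices]
    # sorted() is stable, so ties keep their original order in `choices`.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


def best_choice(query, choices):
    """Return the single highest-scoring (choice, score) pair, or None."""
    ranked = rank_choices(query, choices, limit=1)
    return ranked[0] if ranked else None


gvn_options = ["hello", "Hello python", "betchgeeks", "Python betchgeeks"]
print(rank_choices("Hello betchgeeks", gvn_options))
print(best_choice("Hello betchgeeks", gvn_options))
```

Swapping in a different scoring function changes the ranking, which is why the choice of scorer matters when matching against a list.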
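Method #5: Using fuzz.WRatio()
The fuzz module also includes WRatio(), which often produces a more useful result than the simple ratio. It ignores differences in letter case and punctuation and weighs the results of several of the other ratio methods. It is also the default scorer used by the process module.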
Below is the implementation:
# Import fuzz from fuzzywuzzy module using the import keyword
from fuzzywuzzy import fuzz

# Pass two strings as arguments to the WRatio() function of fuzz module
# to compare both the strings and return the score
# Here both the strings are the same but differ in their case (lower/upper)
# The WRatio() function ignores case, so it returns 100
print(fuzz.WRatio('hello betchgeeks', 'Hello Betchgeeks'))
# Similarly check for the other strings and print the result
print(fuzz.WRatio('hello betchgeeks', 'hello betchgeeks!!!!'))
# Here we are using the fuzz.ratio() function for the same strings as above
print(fuzz.ratio('hello betchgeeks', 'hello betchgeeks!!!!'))
Output:
100
100
89
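The reason case and trailing punctuation do not affect WRatio() is that the strings are cleaned before they are compared. The sketch below shows a preprocessing step of that kind. The helper name normalize is hypothetical and the regular expression is an assumption modeled on the library's default behaviour, not a copy of its code.

```python
import re


def normalize(text: str) -> str:
    """Replace non-word characters with spaces, lowercase, and trim."""
    # \W matches anything other than letters, digits, and the underscore.
    return re.sub(r"\W", " ", text).lower().strip()


print(normalize("Hello Betchgeeks") == normalize("hello betchgeeks"))  # True
print(normalize("hello betchgeeks!!!!"))                               # hello betchgeeks
```

After this step both pairs of strings from the example are identical, so a score of 100 follows. fuzz.ratio() skips the cleaning, which is why it still sees the four exclamation marks and returns 89.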