[SOLVED] improving fuzzy matching performance


This content is from Stack Overflow; question asked by hilo.

I have two data frames: the first has 200k records and the second has 9k. I need to fuzzy-match the strings in one column of each. I dropped the duplicate values in both data frames, but there may still be similar strings, so I wrote the code below. My plan was to manually review the best two matches stored in the third column and judge whether each is reasonable. The problem is that the code has been running for two days, and I don't know how to speed it up. I would appreciate any suggestions for improving its speed and performance.

The mock data:

   var                             comp
0  google inc-radio automation bu  apple
1  apple inc                       google
2  google inc                      microsoft
3  applebee                        google inc

the code I wrote:

from fuzzywuzzy import process

# empty list for storing the matches later
matches = []
list1 = df['var'].tolist()
list2 = df['comp'].tolist()
# taking the threshold as 90
threshold = 90

# iterating through list1 to extract
# each entry's two closest matches from list2
for i in list1:
    matches.append(process.extractBests(i, list2, score_cutoff=threshold, limit=2))
df['matches'] = matches


There is a package, rapidfuzz by @maxbachmann, which is a much faster reimplementation of fuzzywuzzy:

pip install rapidfuzz

Sample usage:

from rapidfuzz import process
df['matches'] = df['var'].apply(lambda x: process.extract(x, df['comp'], limit=2))

Answered by Aarish Maqsood. Licensed under CC BY-SA 2.5 / 3.0 / 4.0.
