I have two data frames: the first has 200k records and the second has 9k. I need to apply fuzzy matching between two string columns. I dropped the duplicate values in both data frames, but there may still be similar strings, so I wrote the code below. My plan was to manually go through the best two matches stored in the third column and check whether each is a reasonable match. The issue is that the code has been running for the last two days, and I do not know how to speed it up. I would appreciate it if you could share your suggestions for improving the speed and performance.
The mock data:
|   | var                            | comp  |
|---|--------------------------------|-------|
| 0 | google inc-radio automation bu | apple |
The code I wrote:
```python
from fuzzywuzzy import process

# empty lists for storing the matches later
mat1 = []
mat2 = []

list2 = df['comp'].tolist()
list1 = df['var'].tolist()

# taking the threshold as 90
threshold = 90

# iterating through list1 to extract
# its closest match from list2
for i in list1:
    mat1.append(process.extract(i, list2, limit=2))

df['matches'] = mat1
```
```
pip install rapidfuzz
```

```python
from rapidfuzz import process, utils

df['matches'] = df['var'].apply(lambda x: process.extract(x, df['comp'], limit=2))
```