What is the fastest way to filter a large DataFrame based on another DataFrame?


I have a DataFrame df1 with 30 million rows and 10 columns, and a second DataFrame df2 with 5 million rows and 6 columns.

My goal is to filter df1 based on df2, that is, to identify the rows where the combination of two columns, df1[['Col_Mat', 'Col_Prod']], is NOT present in the same columns of df2.

I've tried the methods below, but I'm wondering whether there is a faster way:

Method 1: using pd.merge() with indicator='i', then keeping the rows where df1['i'] == 'left_only', then dropping the 'i' column.
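
A minimal sketch of what Method 1 might look like on toy data. The sample values and the `Value` and `Other` columns are made up for illustration, and deduplicating the df2 keys before merging is an assumption, not necessarily what the original code did:

```python
import pandas as pd

# Toy stand-ins for the real df1 (30M rows) and df2 (5M rows).
df1 = pd.DataFrame({
    "Col_Mat": ["A", "A", "B", "C"],
    "Col_Prod": [1, 2, 1, 3],
    "Value": [10, 20, 30, 40],  # hypothetical extra column
})
df2 = pd.DataFrame({
    "Col_Mat": ["A", "B", "B"],
    "Col_Prod": [1, 1, 1],
    "Other": ["x", "y", "z"],  # hypothetical extra column
})

keys = ["Col_Mat", "Col_Prod"]

# Keep only the key columns of df2 and drop repeated pairs, so the
# left merge cannot duplicate df1 rows when df2 repeats a key pair.
df2_keys = df2[keys].drop_duplicates()

# indicator="i" adds a column named "i" holding "left_only" or "both".
merged = df1.merge(df2_keys, on=keys, how="left", indicator="i")

# Rows marked "left_only" are the df1 rows whose pair is not in df2.
result_merge = merged.loc[merged["i"] == "left_only"].drop(columns="i")
print(result_merge)
```

Merging against only the deduplicated key columns also keeps df2's other columns out of the result.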

Method 2: using df.isin()
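
The post does not show how isin() was applied to two columns at once. One possible way, sketched here as an assumption, is to build a `MultiIndex` from the key columns of each frame and call `MultiIndex.isin()`. The toy data and the `Value` and `Other` columns are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for the real df1 and df2.
df1 = pd.DataFrame({
    "Col_Mat": ["A", "A", "B", "C"],
    "Col_Prod": [1, 2, 1, 3],
    "Value": [10, 20, 30, 40],  # hypothetical extra column
})
df2 = pd.DataFrame({
    "Col_Mat": ["A", "B", "B"],
    "Col_Prod": [1, 1, 1],
    "Other": ["x", "y", "z"],  # hypothetical extra column
})

keys = ["Col_Mat", "Col_Prod"]

# Treat each (Col_Mat, Col_Prod) pair as a single composite key.
df1_index = pd.MultiIndex.from_frame(df1[keys])
df2_index = pd.MultiIndex.from_frame(df2[keys])

# Boolean array: True where the df1 pair also occurs in df2.
mask = df1_index.isin(df2_index)

# Invert the mask to keep the pairs that are NOT in df2.
result_isin = df1[~mask]
print(result_isin)
```

This avoids materializing a merged frame, which may be part of why an isin-style check runs faster than the merge, though that would need to be confirmed by timing on the real data.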

Method 1 takes about 55 seconds and method 2 takes about 25 seconds.

I haven't yet tried writing a function (def) and then using apply(), because I think it would be slower.

I also just thought about creating two sets, comparing them, storing the result in a "to_consider_set", and then using isin() to see whether that is faster.
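
A rough sketch of how the set idea could be written, assuming the sets hold (Col_Mat, Col_Prod) tuples and that "compare" means a set difference. The toy data and the `Value` and `Other` columns are made up for illustration, and whether this beats the other methods would have to be timed on the real data:

```python
import pandas as pd

# Toy stand-ins for the real df1 and df2.
df1 = pd.DataFrame({
    "Col_Mat": ["A", "A", "B", "C"],
    "Col_Prod": [1, 2, 1, 3],
    "Value": [10, 20, 30, 40],  # hypothetical extra column
})
df2 = pd.DataFrame({
    "Col_Mat": ["A", "B", "B"],
    "Col_Prod": [1, 1, 1],
    "Other": ["x", "y", "z"],  # hypothetical extra column
})

# Build one set of (Col_Mat, Col_Prod) tuples per DataFrame.
set1 = set(zip(df1["Col_Mat"], df1["Col_Prod"]))
set2 = set(zip(df2["Col_Mat"], df2["Col_Prod"]))

# Pairs that occur in df1 but not in df2.
to_consider_set = set1 - set2

# One tuple per df1 row, aligned with df1's index, so isin() can be used.
df1_pairs = pd.Series(list(zip(df1["Col_Mat"], df1["Col_Prod"])), index=df1.index)
result_set = df1[df1_pairs.isin(to_consider_set)]
print(result_set)
```

Note that building 30 million Python tuples happens outside pandas' vectorized code paths, so this variant is not guaranteed to be faster.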

Any ideas on which method I should use to filter df1 faster?

