What is the fastest way to filter a large DataFrame based on another DataFrame?

Issue

This Content is from Stack Overflow. Question asked by Jefferson Robinski

I have a DataFrame df1 with 30 million rows and 10 columns, and a second DataFrame df2 with 5 million rows and 6 columns.

My goal is to filter df1 based on df2, meaning: identify the rows where the combination of the two columns df1[['Col_Mat', 'Col_Prod']] is NOT present in the same columns of df2.

I’ve tried the methods below, but I suspect there must be a faster way:

Method 1: using pd.merge() with indicator='i', then querying for rows where df1['i'] == 'left_only', then dropping the 'i' column.
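For reference, Method 1 could look roughly like the sketch below. The question does not show the actual code, so the tiny sample frames, the extra `val` and `other` columns, and the helper name `anti_join_merge` are assumptions made for illustration only.

```python
import pandas as pd

def anti_join_merge(df1, df2, keys):
    """Keep the rows of df1 whose key combination does not occur in df2."""
    # Only the key columns of df2 are needed; dropping duplicate pairs
    # prevents the left merge from multiplying rows of df1.
    right = df2[keys].drop_duplicates()
    # indicator="i" adds a column "i" telling where each row came from.
    merged = df1.merge(right, on=keys, how="left", indicator="i")
    # "left_only" marks rows that found no partner in df2.
    return merged.loc[merged["i"] == "left_only"].drop(columns="i")

# Hypothetical sample data standing in for the real 30M / 5M row frames.
df1 = pd.DataFrame({
    "Col_Mat": [1, 1, 2, 3],
    "Col_Prod": ["a", "b", "a", "c"],
    "val": [10, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Col_Mat": [1, 3],
    "Col_Prod": ["a", "c"],
    "other": [0, 0],
})

result_merge = anti_join_merge(df1, df2, ["Col_Mat", "Col_Prod"])
print(result_merge)
```

Restricting df2 to its key columns before merging keeps the intermediate result small, which is likely to matter at the sizes mentioned above.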

Method 2: using df.isin()
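The question does not say how isin() was applied to two columns at once. One possible reading, sketched here under that assumption, is to turn the column pairs into a MultiIndex and call its isin() method; the sample data and the names `idx1`, `idx2`, and `filtered_isin` are illustrative.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frames.
df1 = pd.DataFrame({
    "Col_Mat": [1, 1, 2, 3],
    "Col_Prod": ["a", "b", "a", "c"],
    "val": [10, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Col_Mat": [1, 3],
    "Col_Prod": ["a", "c"],
    "other": [0, 0],
})

keys = ["Col_Mat", "Col_Prod"]

# Each row of the key columns becomes one (Col_Mat, Col_Prod) tuple.
idx1 = pd.MultiIndex.from_frame(df1[keys])
idx2 = pd.MultiIndex.from_frame(df2[keys])

# isin() returns a boolean array; negate it to keep pairs absent from df2.
mask = ~idx1.isin(idx2)
filtered_isin = df1[mask]
print(filtered_isin)
```

Unlike the merge approach, this keeps the original index of df1, because the mask is applied to df1 directly.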

Method 1 takes about 55 seconds and method 2 takes about 25 seconds.

I haven’t yet tried writing a function (def) to apply, because I think it would be slower.

I also just thought about creating two sets, comparing them and storing the result in a “to_consider_set”, then using isin() to see whether that is faster.
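A minimal sketch of the set-based idea follows, assuming the sets hold (Col_Mat, Col_Prod) tuples. It tests membership row by row instead of going through isin(), and whether it is faster than the other methods at this scale is untested here. The sample data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real frames.
df1 = pd.DataFrame({
    "Col_Mat": [1, 1, 2, 3],
    "Col_Prod": ["a", "b", "a", "c"],
    "val": [10, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Col_Mat": [1, 3],
    "Col_Prod": ["a", "c"],
    "other": [0, 0],
})

# Set of key pairs present in df2; membership tests are O(1) on average.
pairs_in_df2 = set(zip(df2["Col_Mat"], df2["Col_Prod"]))

# One boolean per row of df1: True when the pair is absent from df2.
mask = np.fromiter(
    (pair not in pairs_in_df2 for pair in zip(df1["Col_Mat"], df1["Col_Prod"])),
    dtype=bool,
    count=len(df1),
)
filtered_set = df1[mask]
print(filtered_set)
```

The loop over df1 runs in Python rather than in vectorized code, so it is worth timing against the other methods before adopting it.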

Any ideas on which method I should use to filter df1 faster?

Thank you



Solution

This question has not been answered yet. Be the first to answer in the comments; the confirmed answer will later be published as the solution.

This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, CC BY-SA 4.0.
