Vectorization assign the newest value based on datetime

Issue

This Content is from Stack Overflow. Question asked by LLL

I have two dataframe. The first dataframe have only one column: email, the first dataframe is a complete list of email. The second dataframe is a dataframe with three column: email, subscribe_or_unsubscribe, date. The second dataframe is a history of user subcribing or unsubscribing from the email system. The second dataframe is sorted by date with oldest date at index 0 and newest date at latest index.

If email in first dataframe does not exist in the second dataframe, then the email is not subscribed to email system. If the latest history of the email in the second dataframe is subscribe or unsubcribe then the email is subscribed or not subscribed respectively to the email system.

The second dataframe is a cropped version from the actual dataframe and contain error such that it is possible for user in the history to unsubscribe without previously subscribing.

I want to create a third dataframe with two column: email, is_subscribed. The third dataframe will contain the information whether the email is currently subscribed to the email system or not. Right now i’m using iterrows to iterate each rows in third dataframe and second dataframe, however iterrows is really slow, how do i do it with vectorization ?

For example, consider the following example of first dataframe and second dataframe

import numpy as np
import pandas as pd
first_dataframe = pd.DataFrame(["test1@gmail.com", "test2@gmail.com", "test3@gmail.com", "test4@gmail.com", "test5@gmail.com", "test6@gmail.com", "test7@gmail.com" ,"test8@gmail.com"], columns=['email'])
first_dataframe
# Year-Month-Day
second_dataframe = pd.DataFrame([['test3@gmail.com', 'subscribe', '2020-12-26'], ['test5@gmail.com', 'subscribe', '2021-06-06'], 
                                 ['test7@gmail.com', 'unsubscribe', '2021-02-18'], ['test5@gmail.com', 'unsubscribe', '2020-08-17'], 
                                 ['test9@gmail.com', 'subscribe', '2022-01-08'], ['test9@gmail.com', 'unsubscribe', '2022-03-10'], 
                                 ['test9@gmail.com', 'subscribe', '2022-05-26']], columns=['email', "subscribe_or_unsubscribe", "date"])
second_dataframe['date'] = pd.to_datetime(second_dataframe['date'])
second_dataframe.sort_values(by='date', inplace=True)
second_dataframe

Then i create third dataframe which is a copy of the first dataframe with new column “is_subscribed” added with default value “unsubscribed”

third_dataframe = first_dataframe.copy(deep=True)
third_dataframe['is_subscribed'] = "unsubscribed"
third_dataframe

Then i use iterrows to know whether the email is currently subscribed to the email system or not

# The data is sorted from the oldest to newest
for i, third_dataframe_row in third_dataframe.iterrows():
  for j, second_dataframe_row in second_dataframe.iterrows():
    if third_dataframe_row['email'] == second_dataframe_row['email']:
      third_dataframe.at[i,'is_subscribed'] = second_dataframe_row['subscribe_or_unsubscribe'] + "d"
third_dataframe

How to do the above iterrows with vectorization?



Solution

This question is not yet answered, be the first one who answer using the comment. Later the confirmed answer will be published as the solution.

This Question and Answer are collected from stackoverflow and tested by JTuto community, is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?