[SOLVED] Vectorization assign the newest value based on datetime


This Content is from Stack Overflow. Question asked by LLL

I have two dataframe. The first dataframe have only one column: email, the first dataframe is a complete list of email. The second dataframe is a dataframe with three column: email, subscribe_or_unsubscribe, date. The second dataframe is a history of user subcribing or unsubscribing from the email system. The second dataframe is sorted by date with oldest date at index 0 and newest date at latest index.

If email in first dataframe does not exist in the second dataframe, then the email is not subscribed to email system. If the latest history of the email in the second dataframe is subscribe or unsubcribe then the email is subscribed or not subscribed respectively to the email system.

The second dataframe is a cropped version from the actual dataframe and contain error such that it is possible for user in the history to unsubscribe without previously subscribing.

I want to create a third dataframe with two column: email, is_subscribed. The third dataframe will contain the information whether the email is currently subscribed to the email system or not. Right now i’m using iterrows to iterate each rows in third dataframe and second dataframe, however iterrows is really slow, how do i do it with vectorization ?

For example, consider the following example of first dataframe and second dataframe

import numpy as np
import pandas as pd
first_dataframe = pd.DataFrame(["test1@gmail.com", "test2@gmail.com", "test3@gmail.com", "test4@gmail.com", "test5@gmail.com", "test6@gmail.com", "test7@gmail.com" ,"test8@gmail.com"], columns=['email'])
# Year-Month-Day
second_dataframe = pd.DataFrame([['test3@gmail.com', 'subscribe', '2020-12-26'], ['test5@gmail.com', 'subscribe', '2021-06-06'], 
                                 ['test7@gmail.com', 'unsubscribe', '2021-02-18'], ['test5@gmail.com', 'unsubscribe', '2020-08-17'], 
                                 ['test9@gmail.com', 'subscribe', '2022-01-08'], ['test9@gmail.com', 'unsubscribe', '2022-03-10'], 
                                 ['test9@gmail.com', 'subscribe', '2022-05-26']], columns=['email', "subscribe_or_unsubscribe", "date"])
second_dataframe['date'] = pd.to_datetime(second_dataframe['date'])
second_dataframe.sort_values(by='date', inplace=True)

Then i create third dataframe which is a copy of the first dataframe with new column “is_subscribed” added with default value “unsubscribed”

third_dataframe = first_dataframe.copy(deep=True)
third_dataframe['is_subscribed'] = "unsubscribed"

Then i use iterrows to know whether the email is currently subscribed to the email system or not

# The data is sorted from the oldest to newest
for i, third_dataframe_row in third_dataframe.iterrows():
  for j, second_dataframe_row in second_dataframe.iterrows():
    if third_dataframe_row['email'] == second_dataframe_row['email']:
      third_dataframe.at[i,'is_subscribed'] = second_dataframe_row['subscribe_or_unsubscribe'] + "d"

How to do the above iterrows with vectorization?


Use Series.map by Series created by DataFrame.drop_duplicates for latest dates (because already sorted) and convert email to index, add d and last replace non matched values by default value in Series.fillna:

s = (second_dataframe.drop_duplicates('email', keep='last')
third_dataframe['is_subscribed'] = (third_dataframe['email'].map(s) +  "d")
print (third_dataframe)
             email is_subscribed
0  test1@gmail.com  unsubscribed
1  test2@gmail.com  unsubscribed
2  test3@gmail.com    subscribed
3  test4@gmail.com  unsubscribed
4  test5@gmail.com    subscribed
5  test6@gmail.com  unsubscribed
6  test7@gmail.com  unsubscribed
7  test8@gmail.com  unsubscribed
8  test9@gmail.com    subscribed

This Question was asked in StackOverflow by LLL and Answered by jezrael It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?

Exit mobile version