[SOLVED] pandas, transforming dataframe & combining duplicates with counts

Issue

This Content is from Stack Overflow. Question asked by Ev0

Current dataframe is as follows:

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String        CaseType 

0        1           'hello1'      1.0
1        1           'hello2'      1.0
2        1           'goodbye1'    1.0
3        1           'goodbye2'    1.0
4        2           'hello1'      3.0
5        2           'hello2'      3.0
6        2           'hello3'      3.0
7        3           'goodbye1'    1.0
8        3           'goodbye2'    1.0
9        3           'goodbye3'    1.0
10       4           'hello'       2.0
11       4           'goodbye'     2.0

Expected Output:
(Please note they are grouped based on uniqueID, and the case Type follows the last string of the uniqueID.)

idx   Source      Destination 

0        'hello1'     'hello2'
1        'hello2'     'goodbye1'
3        'goodbye1'   'goodbye2'
4        'goodbye2'   '1.0'
6        'hello1'     'hello2'
7        'hello2'     'hello3' 
8        'hello3'     '3.0'
10       'goodbye1'   'goodbye2'
11       'goodbye2'   'goodbye3'
12       'goodbye3'   '1.0'
13       'hello'      'goodbye'
14       'goodbye'    '2.0'

Question: How do I transform the pandas dataframe in this way?

Currently, I am iterating through every row in a for loop, and for each uniqueId, adding each string+CaseType (at the end) to a list, then splitting up that list and adding it to a new dataframe. It is incredibly slow.

Following this, the next step is to get the total counts/occurences for each row of the output. Essentially, if there are duplicate rows of source:destination (ie, we have 3 rows of ‘hello’ ‘goodbye’, it would result in 1 row with ‘hello’ ‘goodbye’

Example:

Original: 
idx   Source      Destination  

0     'hello1'     'hello2'   
1     'hello2'     'hello3' 
2     'hello1'     'hello2' 
3     'hello4'     'goodbye'  

Expected Output:

idx   Source      Destination  Count

0     'hello1'     'hello2'    2
1     'hello2'     'hello3'    1
2     'hello4'     'goodbye'   1

I presume the first step is slightly more complex with pandas logic, and the next step is essentially just combining duplicates and getting their count, but I am new to pandas and not entirely sure how to do either. Thank you in advance.



Solution

You could try the following:

df_res = df[["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
    df.groupby("uniqueID")["String"].transform("shift", -1)
      .fillna(df["CaseType"])
      .astype("str")
)

Result for your sample is:

      Source Destination
0     hello1      hello2
1     hello2    goodbye1
2   goodbye1    goodbye2
3   goodbye2         1.0
4     hello1      hello2
5     hello2      hello3
6     hello3         3.0
7   goodbye1    goodbye2
8   goodbye2    goodbye3
9   goodbye3         1.0
10     hello     goodbye
11   goodbye         2.0

Regarding the second part: Try

df_counts = (
    df_res.value_counts(["Source", "Destination"])
    .to_frame(name="Counts").reset_index()
)

to get

     Source Destination  Counts
0  goodbye1    goodbye2       2
1    hello1      hello2       2
2   goodbye         2.0       1
3  goodbye2         1.0       1
4  goodbye2    goodbye3       1
5  goodbye3         1.0       1
6     hello     goodbye       1
7    hello2    goodbye1       1
8    hello2      hello3       1
9    hello3         3.0       1

If you want to keep the Source and Destinationcolumns in the index then remove .reset_index() at the end.

Regarding the modification in the comment: You could try more or less the same, but restricted to a filtered df:

approved = ["hello1", "hello"]
m = df.groupby("uniqueID")["String"].transform("first").isin(approved)
df_res = df.loc[m, ["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
    df[m].groupby("uniqueID")["String"].transform("shift", -1)
      .fillna(df.loc[m, "CaseType"])
      .astype("str")
)
df_res = df_res.reset_index(drop=True)

The mask m removes those groups whose first row String is not in the list of approved strings.

Result:

     Source Destination
0    hello1      hello2
1    hello2    goodbye1
2  goodbye1    goodbye2
3  goodbye2         1.0
4    hello1      hello2
5    hello2      hello3
6    hello3         3.0
7     hello     goodbye
8   goodbye         2.0


This Question was asked in StackOverflow by Ev0 and Answered by Timus It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?