[SOLVED] joining data sets while also comparing null values

Issue

This content is from Stack Overflow; the question was asked by bella_pa.

I have two data frames.

DF1: column1         column2             column3              column4
     agree           strongly agree      disagree             null
     null            disagree            strongly disagree    agree
     disagree        null                strongly agree       disagree




DF2: column1         column2             column3              column4     cluster
     disagree        agree               strongly disagree    disagree    1 
     agree           strongly agree      disagree             null        2
     disagree        null                strongly agree       disagree    5
                                      .
                                      .
                                      .
     disagree        agree               strongly agree       disagree    3



Final output:

DF1: column1         column2             column3              column4   Cluster
     agree           strongly agree      disagree             null      2
     null            disagree            strongly disagree    agree     5
                                       .
                                       .
                                       .

I have already done this in R, but I could not implement the same in PySpark. Can anyone please tell me how to implement it in PySpark?



Solution

Since a join won't match null values, here's a small hack: replace all nulls with a sentinel value, perform a left join to keep all of df1's records, then turn the sentinel back into nulls.
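To see why the sentinel is needed: in standard SQL (and Spark SQL), `NULL = NULL` does not evaluate to true, so an equi-join silently drops rows with null keys. A quick sqlite3 sketch of the problem and of the sentinel fix (the table names and the `xxxyyyzzz` sentinel are just for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (k TEXT)")
con.execute("CREATE TABLE b (k TEXT, cluster INT)")
con.execute("INSERT INTO a VALUES ('agree'), (NULL)")
con.execute("INSERT INTO b VALUES ('agree', 2), (NULL, 5)")

# Plain equi-join: NULL = NULL is not true, so the NULL row is lost.
plain = con.execute(
    "SELECT a.k, b.cluster FROM a JOIN b ON a.k = b.k").fetchall()
print(plain)  # [('agree', 2)]

# COALESCE replaces NULL with a sentinel, making the rows comparable.
fixed = con.execute(
    "SELECT a.k, b.cluster FROM a JOIN b "
    "ON COALESCE(a.k, 'xxxyyyzzz') = COALESCE(b.k, 'xxxyyyzzz')").fetchall()
print(fixed)  # [('agree', 2), (None, 5)]
```

The answer below does the same thing at DataFrame level: `fillna` plays the role of `COALESCE`, and `replace` undoes it after the join.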

df1

df1 = spark.read.csv('a.csv', header=True).fillna('xxxyyyzzz')
df1.show()

+---------+--------------+-----------------+---------+
|  column1|       column2|          column3|  column4|
+---------+--------------+-----------------+---------+
|    agree|strongly agree|         disagree|xxxyyyzzz|
|xxxyyyzzz|      disagree|strongly disagree|    agree|
| disagree|     xxxyyyzzz|   strongly agree| disagree|
+---------+--------------+-----------------+---------+

df2

df2 = spark.read.csv('b.csv', header=True).fillna('xxxyyyzzz')
df2.show()

+---------+--------------+-----------------+---------+-------+
|  column1|       column2|          column3|  column4|cluster|
+---------+--------------+-----------------+---------+-------+
| disagree|         agree|strongly disagree| disagree|      1|
|    agree|strongly agree|         disagree|xxxyyyzzz|      2|
| disagree|     xxxyyyzzz|   strongly agree| disagree|      5|
| disagree|         agree|   strongly agree| disagree|      3|
|xxxyyyzzz|      disagree|strongly disagree|    agree|      5|
| disagree|     xxxyyyzzz|   strongly agree| disagree|      6|
+---------+--------------+-----------------+---------+-------+

Join, then turn the sentinel value back into nulls

(df1
    .join(df2, on=df1.columns, how='left')
    .replace('xxxyyyzzz', None)
    .show()
)

+--------+--------------+-----------------+--------+-------+
| column1|       column2|          column3| column4|cluster|
+--------+--------------+-----------------+--------+-------+
|   agree|strongly agree|         disagree|    null|      2|
|    null|      disagree|strongly disagree|   agree|      5|
|disagree|          null|   strongly agree|disagree|      6|
|disagree|          null|   strongly agree|disagree|      5|
+--------+--------------+-----------------+--------+-------+


This question was asked on Stack Overflow by bella_pa and answered by pltc. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
