Issue
This Content is from Stack Overflow. Question asked by bella_pa
I have two data frames.
DF1: column1 column2 column3 column4
agree strongly agree disagree null
null disagree strongly disagree agree
disagree null strongly agree disagree
DF2: column1 column2 column3 column4 cluster
disagree agree strongly disagree disagree 1
agree strongly agree disagree null 2
disagree null strongly agree disagree 5
.
.
.
disagree agree strongly agree disagree 3
Final output:DF1: column1 column2 column3 column4 Cluster
agree strongly agree disagree null 2
null disagree strongly disagree agree 5
.
.
etc
I have already did this in R but I could not implement the same in PySpark. Can anyone please tell me how to implement in PySpark
Solution
Since join won’t take care of null values, I’m thinking about a small hack by replacing all null
s with a specific value, then perform a left join to keep app df1’s records
df1
df1 = spark.read.csv('a.csv', header=True).fillna('xxxyyyzzz')
df1.show()
+---------+--------------+-----------------+---------+
| column1| column2| column3| column4|
+---------+--------------+-----------------+---------+
| agree|strongly agree| disagree|xxxyyyzzz|
|xxxyyyzzz| disagree|strongly disagree| agree|
| disagree| xxxyyyzzz| strongly agree| disagree|
+---------+--------------+-----------------+---------+
df2
df2 = spark.read.csv('b.csv', header=True).fillna('xxxyyyzzz')
df2.show()
+---------+--------------+-----------------+---------+-------+
| column1| column2| column3| column4|cluster|
+---------+--------------+-----------------+---------+-------+
| disagree| agree|strongly disagree| disagree| 1|
| agree|strongly agree| disagree|xxxyyyzzz| 2|
| disagree| xxxyyyzzz| strongly agree| disagree| 5|
| disagree| agree| strongly agree| disagree| 3|
|xxxyyyzzz| disagree|strongly disagree| agree| 5|
| disagree| xxxyyyzzz| strongly agree| disagree| 6|
+---------+--------------+-----------------+---------+-------+
Join and return dummy value back to nulls
(df1
.join(df2, on=df1.columns, how='left')
.replace('xxxyyyzzz', None)
.show()
)
+--------+--------------+-----------------+--------+-------+
| column1| column2| column3| column4|cluster|
+--------+--------------+-----------------+--------+-------+
| agree|strongly agree| disagree| null| 2|
| null| disagree|strongly disagree| agree| 5|
|disagree| null| strongly agree|disagree| 6|
|disagree| null| strongly agree|disagree| 5|
+--------+--------------+-----------------+--------+-------+
This Question was asked in StackOverflow by bella_pa and Answered by pltc It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.