[SOLVED] Is this behavior of join normal?

Issue

This Content is from Stack Overflow. Question asked by user626528

from datetime import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

columns = [ 'id1', 'id2', 'val' ]
ids = [ 'id1', 'id2' ]

vals1 = [
        (1, 'a', 1),
        (2, 'b', 4),
        (4, None, 1),
      ]
df1 = spark.createDataFrame(data=vals1, schema=columns)

vals2 = [
        (1, 'a', 5),
        (3, 'c', 2),
        (4, None, 2),
      ]
df2 = spark.createDataFrame(data=vals2, schema=columns)

df1 = df1.withColumnRenamed('val', 'val1')
df2 = df2.withColumnRenamed('val', 'val2')

res_df = df1.join(df2, ids, 'full')

res_df.sort(ids).show()

And the result is

+---+----+----+----+
|id1| id2|val1|val2|
+---+----+----+----+
|  1|   a|   1|   5|
|  2|   b|   4|null|
|  3|   c|null|   2|
|  4|null|   1|null|
|  4|null|null|   2|
+---+----+----+----+

i.e. there are 2 rows with the key [ 4, null ]
Is this behavior normal?



Solution

This is expected behaviour. NULL does not equal NULL in comparisons as explained in the NULL Semantics section of the documentation:

Apache spark supports the standard comparison operators such as ‘>’, ‘>=’, ‘=’, ‘<’ and ‘<=’. The result of these operators is unknown or NULL when one of the operands or both the operands are unknown or NULL.

(AFAIK, this is standard SQL behaviour and not specific to Spark)

Furthermore:

WHERE, HAVING operators filter rows based on the user specified condition. A JOIN operator is used to combine rows from two tables based on a join condition. For all the three operators, a condition expression is a boolean expression and can return True, False or Unknown (NULL). They are “satisfied” if the result of the condition is True.

Comparing NULL to NULL produces NULL, which is not True, thus the (4, NULL) key in first table is not the same as the (4, NULL) key in the second table and so the join produces two distinct rows, one from joining the left row to a missing right row (| 4|null| 1|null|) and one from joining the right row to a missing left row (| 4|null|null| 2|).


This Question was asked in StackOverflow by user626528 and Answered by Hristo Iliev It is licensed under the terms of CC BY-SA 2.5. - CC BY-SA 3.0. - CC BY-SA 4.0.

people found this article helpful. What about you?