[SOLVED] Spark Dataset using case class


This Content is from Stack Overflow. Question asked by shobhit

When we have to convert a Spark DataFrame to a Dataset, we generally use a case class. This means we are converting untyped Rows into typed objects.

case class MyData (Name: String, Age: Int)
import spark.implicits._
val ds = df.as[MyData] 

Say I have an RDD, map it to a case class, and then convert it to a DataFrame. Why does the resulting DataFrame still show as Dataset[Row]?

import org.apache.spark.sql.SparkSession

case class MyData(Name: String, Age: Int)

object learning extends App {
  val spark = SparkSession.builder()
    .config("spark.master", "local")
    .getOrCreate()
  val data = Seq(("A", 100), ("B", 300), ("C", 400))
  val rdd = spark.sparkContext.parallelize(data)
  val rddNew = rdd.map(x => MyData(x._1, x._2))
  val newDataFrame = spark.createDataFrame(rddNew)
}

[Screenshot: newDataFrame reported with type Dataset[Row]]

Why is this DataFrame not shown as Dataset[MyData]?


This is because the return type of spark.createDataFrame is DataFrame, which in Spark is simply a type alias for Dataset[Row] (`type DataFrame = Dataset[Row]` in the org.apache.spark.sql package object).

If you want the return type to be Dataset[MyData], use spark.createDataset:

import spark.implicits._
val newDataset: Dataset[MyData] = spark.createDataset[MyData](rddNew)
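
There are also other routes to a typed Dataset. As a sketch (assuming an active local SparkSession and the same MyData case class as above), calling .toDS() on an RDD of case-class instances, or .as[MyData] on an existing DataFrame whose column names match the case class fields, both yield Dataset[MyData]:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class MyData(Name: String, Age: Int)

object TypedConversion extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val rddNew = spark.sparkContext.parallelize(
    Seq(MyData("A", 100), MyData("B", 300)))

  // toDS() on an RDD of case-class instances gives a typed Dataset directly
  val dsFromRdd: Dataset[MyData] = rddNew.toDS()

  // .as[MyData] re-types an untyped DataFrame; column names must match the fields
  val dsFromDf: Dataset[MyData] = spark.createDataFrame(rddNew).as[MyData]

  spark.stop()
}
```

Both conversions rely on the implicit Encoder for MyData brought in by `import spark.implicits._`.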

This question was asked on Stack Overflow by shobhit and answered by babkr. It is licensed under the terms of CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
