[SOLVED] Reading zip file into Apache Spark dataframe

Question

This content is from Stack Overflow. Question asked by nam.

Using Apache Spark (or PySpark), I can read a text file into a Spark DataFrame and load that DataFrame into a SQL database, as follows:

df = spark.read.csv("MyFilePath/MyDataFile.txt", sep="|", header="true", inferSchema="true")
df.show()
.............
#load df into an SQL table
df.write(.....)

Question: How can we achieve the same if the data file is inside a zip file? The zip file contains only one text file, 6GB in size.

Solution

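I created a sample dataset, employee.txt, and placed it inside a .zip archive. I used the pandas library to read the compressed text file directly. There may be other approaches, but this one is straightforward. Note that pandas loads the whole file into memory on the driver, so a file as large as the 6GB one in the question needs enough RAM for this to work.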
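With compression='zip', pandas expects the archive to contain exactly one data file, so it can help to inspect the archive before reading it. The following is a minimal sketch using only the standard-library zipfile module; the function name describe_zip is hypothetical and not part of the original answer.

```python
import zipfile


def describe_zip(zip_path):
    """Return (member name, uncompressed size in bytes) for each file in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        ]
```

If the returned list has a single entry, the archive suits the pandas call; the reported size also gives a rough idea of whether the data is likely to fit in driver memory.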

Records: employee.txt

Name;dept;age
Ravi kumar;Data Science;29
Amitesh Kumar;QA;29
Rohit Kumar;Sales;29
Ahimanyu;java;29

# import required modules
import pandas as pd
from pyspark.sql import SparkSession

# read the dataset directly from the zip archive (it must contain a single data file)
pdf = pd.read_csv(r'C:\Users\ravi\Documents\pyspark test\dataset\employee.zip', compression='zip', sep=';')

# create a Spark session and convert the pandas DataFrame to a Spark DataFrame
spark = SparkSession.builder.appName("zip reader").getOrCreate()
sparkDF = spark.createDataFrame(pdf)
sparkDF.show()  # show() prints the table itself and returns None

# MySQL connection details

driver = "com.mysql.jdbc.Driver"  # MySQL Connector/J 8.x and later name this class com.mysql.cj.jdbc.Driver
url = "jdbc:mysql://127.0.0.1:3306/test"
user = "root"
pwd = "India@123"

# write the final output to the RDBMS
sparkDF.write.format("jdbc").option("driver", driver)\
    .option("url", url)\
    .option("dbtable", "employee")\
    .option("user", user)\
    .option("password", pwd)\
    .save()


Final Output:

+-------------+------------+---+
|         Name|        dept|age|
+-------------+------------+---+
|   Ravi kumar|Data Science| 29|
|Amitesh Kumar|          QA| 29|
|  Rohit Kumar|       Sales| 29|
|     Ahimanyu|        java| 29|
+-------------+------------+---+
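
For a file as large as the 6GB one in the question, an alternative that avoids holding the data in pandas is to unpack the archive first and let Spark read the extracted text file. This is a sketch under stated assumptions, not part of the original answer: the helper name extract_single_member and its arguments are hypothetical, and the archive is assumed to hold exactly one file.

```python
import shutil
import zipfile
from pathlib import Path


def extract_single_member(zip_path, dest_dir):
    """Stream the only file in zip_path into dest_dir and return its path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        members = [m for m in archive.infolist() if not m.is_dir()]
        if len(members) != 1:
            raise ValueError(
                f"expected exactly one file in the archive, found {len(members)}"
            )
        # Keep only the base name so a member path cannot point outside dest_dir.
        target = dest_dir / Path(members[0].filename).name
        # copyfileobj copies in fixed-size chunks, so the file is never fully in memory.
        with archive.open(members[0]) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return str(target)
```

The returned path can then be passed to spark.read.csv with the same sep, header and inferSchema options used in the question. On a multi-node cluster the extracted file would need to sit somewhere every executor can read, such as shared storage.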

Answered by Ravi Kumar
This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under the terms of CC BY-SA 4.0.
