DataFrame Operations Using a JSON File

This Python script uses Spark SQL to read employee data from a JSON file into a DataFrame, which it displays and then coalesces into a single partition before writing to a Parquet file. It then reads the Parquet data back, filters for rows where the stream is "JAVA", displays the filtered DataFrame, and writes it to a new Parquet file.
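spark.read.json expects newline-delimited JSON: one object per line, not a single JSON array. A minimal sketch of what emp.json might contain, and of the filter the script applies, is below; only the "stream" field comes from the original code, and the other field names and values are illustrative assumptions:

```python
import json

# Hypothetical sample records -- only the "stream" field is taken from the
# original code; the other field names and values are assumptions.
records = [
    {"name": "Asha", "stream": "JAVA"},
    {"name": "Ravi", "stream": "PYTHON"},
    {"name": "Meena", "stream": "JAVA"},
]

# Write newline-delimited JSON, the format spark.read.json expects:
# one complete JSON object per line.
with open("emp.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# The plain-Python equivalent of the DataFrame filter in the script:
java_emps = [r for r in records if r["stream"] == "JAVA"]
print(java_emps)
```

This keeps the same selection logic as pf.filter(pf.stream == 'JAVA'), just applied to an in-memory list instead of a distributed DataFrame.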

Uploaded by

Arpita Das
Copyright
© All Rights Reserved

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to Spark SQL.
spark = SparkSession \
    .builder \
    .appName("Data Frame EMPLOYEE") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read the employee records from a JSON file into a DataFrame and display it.
df = spark.read.json("emp.json")
df.show()

# Coalesce to a single partition so the output is a single Parquet file,
# then write it to the "Employees" directory.
df.coalesce(1).write.parquet("Employees")

# Read the Parquet data back, keep only rows whose stream is "JAVA",
# display the result, and write it to a new Parquet directory.
pf = spark.read.parquet("Employees")
dfNew = pf.filter(pf.stream == 'JAVA')
dfNew.show()
dfNew.coalesce(1).write.parquet("JavaEmployees")
