The code streams retail data from Kafka into Spark, uses UDFs to compute per-order metrics such as total cost and item count, calculates time-based and country-based KPIs with tumbling window functions, and writes the results as JSON files on HDFS. It imports the required Spark SQL functions and types, defines the data schema, creates a SparkSession, reads the stream from Kafka, registers the UDFs, selects and transforms the data, and computes the various KPIs that are written to files.
Code Explanation
Case Study: Retail Data Analysis
In this project, we go through a real-world use case from the retail sector. Data from a centralised Kafka server is streamed in real time and processed to calculate various key performance indicators (KPIs).
1. Various SQL functions were imported from the pyspark.sql.functions module, including window and udf.
2. Various SQL types were imported from the pyspark.sql.types module, including StringType, ArrayType, TimestampType, IntegerType and DoubleType.
3. SparkSession was imported from the pyspark.sql module.
4. The Spark session was initialized:

       spark = SparkSession \
           .builder \
           .appName("KafkaRead") \
           .getOrCreate()

5. Data was streamed from the Kafka producer:

       orderRaw = spark \
           .readStream \
           .format("kafka") \
           .option("kafka.bootstrap.servers", "ec2-18-211-252-152.compute-1.amazonaws.com:9092") \
           .option("subscribe", "real-time-project") \
           .load()

   Bootstrap server: 18.211.252.152, Port: 9092, Topic: real-time-project.
6. The schema was defined:

       jsonSchema = StructType() \
           .add("invoice_no", StringType()) \
           .add("country", StringType()) \
           .add("timestamp", TimestampType()) \
           .add("type", StringType()) \
           .add("items", ArrayType(StructType([
               StructField("SKU", StringType()),
               StructField("title", StringType()),
               StructField("unit_price", DoubleType()),
               StructField("quantity", IntegerType())
           ])))

7. A Python function was written to compute the total cost of an order: Total cost = Σ (quantity × unit_price), summed over the items in the order.
8. The function above was turned into a UDF (user-defined function): add_total_cost = udf(get_total_cost, DoubleType()).
9. A Python function was written to find the total number of items in an order.
10. The function above was turned into a UDF: add_total_count = udf(get_total_item, IntegerType()).
11. A Python function was written to determine whether a record is a new order.
12. The function above was turned into a UDF: add_is_order_flag = udf(get_is_order, IntegerType()).
13. A Python function was written to determine whether a record is a return.
14. The function above was turned into a UDF: add_is_return_flag = udf(get_is_return, IntegerType()). (A sketch of the four function bodies appears after this list.)
15. The selected columns (invoice_no, country, timestamp, Total_Items, Total_Cost, is_order, is_return) were written to the console (see the parsing and console sketch after this list).
16. Time-based KPIs (Window, OPM, Total Sales Volume, Average Rate of Return, Average Transaction Size) were calculated using a tumbling window of one minute (see the KPI sketches after this list).
17. Time- and country-based KPIs (Window, Country, OPM, Total Sales Volume, Average Rate of Return) were calculated using a tumbling window of one minute.
18. The computed KPIs were written to files and stored on HDFS in JSON form (see the sink sketch after this list).
19. The streaming process was manually killed after 10 minutes.
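The document names the four Python functions of steps 7 to 14 but does not show their bodies. The following is a minimal sketch of what they could look like, assuming each element of the items array exposes unit_price and quantity fields and that the type field takes the values "ORDER" and "RETURN" (both value choices are assumptions, not confirmed by the source):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType, IntegerType

    # Sketch only: the function bodies are not shown in the original document.
    def get_total_cost(items):
        # Total cost = sum of quantity * unit_price over all items in the order
        return sum(item["unit_price"] * item["quantity"] for item in items)

    def get_total_item(items):
        # Total number of units across all items in the order
        return sum(item["quantity"] for item in items)

    def get_is_order(order_type):
        # 1 if the record is a new order, else 0 ("ORDER" is an assumed value)
        return 1 if order_type == "ORDER" else 0

    def get_is_return(order_type):
        # 1 if the record is a return, else 0 ("RETURN" is an assumed value)
        return 1 if order_type == "RETURN" else 0

    # UDF registrations as given in steps 8, 10, 12 and 14
    add_total_cost = udf(get_total_cost, DoubleType())
    add_total_count = udf(get_total_item, IntegerType())
    add_is_order_flag = udf(get_is_order, IntegerType())
    add_is_return_flag = udf(get_is_return, IntegerType())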
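Step 15 writes the selected columns to the console, but the intermediate transformation is not shown. A plausible sketch, assuming the Kafka value is parsed with from_json against jsonSchema and the UDFs are applied with withColumn (the variable names orderStream, expandedOrders and consoleQuery are hypothetical):

    from pyspark.sql.functions import from_json, col

    # Parse the raw Kafka value (bytes) into typed columns using jsonSchema
    orderStream = orderRaw \
        .select(from_json(col("value").cast("string"), jsonSchema).alias("data")) \
        .select("data.*")

    # Apply the UDFs to derive the per-order metrics
    expandedOrders = orderStream \
        .withColumn("Total_Cost", add_total_cost(col("items"))) \
        .withColumn("Total_Items", add_total_count(col("items"))) \
        .withColumn("is_order", add_is_order_flag(col("type"))) \
        .withColumn("is_return", add_is_return_flag(col("type")))

    # Step 15: write the selected columns to the console
    consoleQuery = expandedOrders \
        .select("invoice_no", "country", "timestamp", "Total_Items",
                "Total_Cost", "is_order", "is_return") \
        .writeStream \
        .outputMode("append") \
        .format("console") \
        .option("truncate", "false") \
        .start()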
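Step 16 computes the time-based KPIs over a one-minute tumbling window. The exact KPI formulas are not given in the source; the aggregations below are a sketch assuming OPM counts new orders per window, average rate of return is returns divided by total invoices, and average transaction size is total sales divided by the number of orders (the one-minute watermark is also an assumption):

    from pyspark.sql import functions as F

    # One-minute tumbling window over event time
    timeKPIs = expandedOrders \
        .withWatermark("timestamp", "1 minute") \
        .groupBy(F.window(F.col("timestamp"), "1 minute")) \
        .agg(
            F.sum("is_order").alias("OPM"),
            F.sum("Total_Cost").alias("total_sales_volume"),
            (F.sum("is_return") /
             (F.sum("is_order") + F.sum("is_return"))).alias("average_rate_of_return"),
            (F.sum("Total_Cost") / F.sum("is_order")).alias("average_transaction_size")
        )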
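Step 17 repeats the same aggregation grouped additionally by country; average transaction size is dropped, matching the KPI list in the source:

    from pyspark.sql import functions as F

    # Same one-minute tumbling window, additionally grouped by country
    countryKPIs = expandedOrders \
        .withWatermark("timestamp", "1 minute") \
        .groupBy(F.window(F.col("timestamp"), "1 minute"), F.col("country")) \
        .agg(
            F.sum("is_order").alias("OPM"),
            F.sum("Total_Cost").alias("total_sales_volume"),
            (F.sum("is_return") /
             (F.sum("is_order") + F.sum("is_return"))).alias("average_rate_of_return")
        )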
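Step 18 writes the KPIs to HDFS as JSON. A sketch of one such sink follows; the output and checkpoint paths are hypothetical, and the one-minute processing-time trigger is an assumption chosen to match the one-minute window:

    # Write the time-based KPIs to HDFS as JSON files (paths are hypothetical)
    timeKPIQuery = timeKPIs \
        .writeStream \
        .outputMode("append") \
        .format("json") \
        .option("path", "/user/ec2-user/time_kpi") \
        .option("checkpointLocation", "/user/ec2-user/time_kpi_cp") \
        .trigger(processingTime="1 minute") \
        .start()

    # Step 19: the query runs until the process is manually killed (after ~10 minutes)
    timeKPIQuery.awaitTermination()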