Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Code Explanation

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 3

Code Explanation

Case Study: Retail Data Analysis


In this project, we will go through a real-world use case from the retail sector.
Data from a centralised Kafka server in real-time will be streamed and
processed to calculate various KPIs or key performance indicators.

1. Various sql functions were imported from pyspark.sql.functions module.


The functions include window, udf etc.
2. Various sql types were imported from pyspark.sql.types module. The
types include StringType, ArrayType, TimestampType, IntegerType,
DoubleType etc.
3. SparkSession was imported from pyspark.sql module
4. Initialized the spark session using
spark = SparkSession \
.builder \
.appName("KafkaRead") \
.getOrCreate()
5. Streamed the data from kafka producer using
orderRaw = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers","ec2-18-211-252-152.compute-
1.amazonaws.com:9092") \
.option("subscribe","real-time-project") \
.load()
From
Bootstrap Server - 18.211.252.152
Port - 9092
Topic - real-time-project
6. Schema is defined using
jsonSchema = StructType() \
.add("invoice_no", StringType()) \
.add("country", StringType()) \
.add("timestamp", TimestampType()) \
.add("type", StringType()) \
.add("items", ArrayType(StructType([
StructField("SKU", StringType()),
StructField("title", StringType()),
StructField("unit_price", DoubleType()),
StructField("quantity", IntegerType())
])))
7. Python function is written to compute total cost of an order using
Total cost = ∑(quantity∗unitprice)
8. The above function is transformed into udf (user defined function) using
add_total_cost = udf(get_total_cost, DoubleType())
9. Python function is written to find total items in an order.
10. The above function is transformed into udf using
add_total_count = udf(get_total_item, IntegerType())
11. Python function is written to find if the order is new.
12.The above function is transformed into udf using
add_is_order_flag = udf(get_is_order, IntegerType())
13.Python function is written to find if the order is return.
14.The above function is transformed into udf using
add_is_return_flag = udf(get_is_return, IntegerType())
15. Selected data ("invoice_no", "country", "timestamp", "Total_Items",
"Total_Cost", "is_order", "is_return" ) is written to the console.
16. Time based KPI (“Window”, ”OPM”, ”Total Sales Volume”, ”Average
rate of return”, “Average Transaction Size” ) is calculated using tumbling
window function for every 1 minute
17. Time and country based KPI ((“Window”, ”Country” ,”OPM”, ”Total
Sales Volume”, ”Average rate of return”) is calculated using tumbling
window function for every 1 minute
18. The Computed KPI were written to files and stored on HDFS in json
form
19. The streaming process is manually killed after 10 mins.

You might also like