This document is a cheat sheet for using PySpark for big data analytics. It covers setting up the PySpark environment, loading and saving data in various formats, data processing and transformations, performance optimization, and advanced analytics and machine learning. Key steps include initializing a SparkSession (and its underlying SparkContext), reading and writing CSV and Parquet files, selecting and filtering DataFrame columns, joining DataFrames, caching data, and using MLlib for linear regression, classification, and clustering models.
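As a minimal setup sketch, assuming placeholder file paths and an arbitrary application name (none of these are taken from the cheat sheet itself):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; sc exposes the RDD, broadcast and accumulator APIs
spark = SparkSession.builder.appName("pyspark-cheatsheet-demo").getOrCreate()
sc = spark.sparkContext

# Read a CSV file with a header row, letting Spark infer column types (placeholder path)
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Write the same data back out as Parquet, overwriting any earlier output (placeholder path)
df.write.mode("overwrite").parquet("people_parquet")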
Data Processing and Transformations
● Splitting a column into multiple columns: from pyspark.sql.functions import split; df.withColumn('splitted', split(df['full_name'], ' ')).show()
● Collecting a column as a list: from pyspark.sql.functions import collect_list; df.groupBy("department").agg(collect_list("name").alias("names")).show()
● Converting a DataFrame column to a Python list: names_list = df.select("name").rdd.flatMap(lambda x: x).collect()
● Using when-otherwise for conditional logic: from pyspark.sql.functions import when; df.withColumn("category", when(df["age"] < 30, "Young").otherwise("Old")).show() (see the combined sketch after this list)
● Exploding a list to rows: from pyspark.sql.functions import explode; df.withColumn('name', explode(df['names'])).show()
● Aggregating with custom expressions: from pyspark.sql.functions import expr; df.groupBy("department").agg(expr("avg(salary) as average_salary")).show()
● Calculating correlations: df.stat.corr("column1", "column2")
● Handling dates and timestamps: from pyspark.sql.functions import current_date, current_timestamp; df.withColumn("today", current_date()).withColumn("now", current_timestamp()).show()
● Repartitioning DataFrames: df.repartition(10).rdd.getNumPartitions()
● Caching DataFrames for optimization: df.cache()
● Applying map and reduce operations on RDDs: rdd.map(lambda x: x * x).reduce(lambda x, y: x + y)
● Using broadcast variables for efficiency: broadcastVar = sc.broadcast([1, 2, 3])
● Accumulators for aggregating information across executors: acc = sc.accumulator(0); rdd.foreach(lambda x: acc.add(x))
● DataFrame descriptive statistics: df.describe().show()
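To show how several of these transformations chain together, here is a small end-to-end sketch on a toy DataFrame; the data and column names (full_name, age, department) are illustrative assumptions, not part of the original cheat sheet:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, when, collect_list, explode

spark = SparkSession.builder.appName("transformations-demo").getOrCreate()

# Toy data with a full name, an age and a department (assumed values)
df = spark.createDataFrame(
    [("Ada Lovelace", 28, "Engineering"), ("Alan Turing", 41, "Research")],
    ["full_name", "age", "department"],
)

# Split the full name and derive a conditional category with when/otherwise
df2 = (df
       .withColumn("first_name", split(df["full_name"], " ")[0])
       .withColumn("category", when(df["age"] < 30, "Young").otherwise("Old")))

# Collect first names per department, then explode the list back into one row per name
grouped = df2.groupBy("department").agg(collect_list("first_name").alias("names"))
grouped.withColumn("name", explode(grouped["names"])).show()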
Performance Optimization
● Broadcast join for large and small DataFrames: from pyspark.sql.functions import broadcast; df.join(broadcast(smallDf), "key") (see the sketch after this list)
● Broadcast variables for large lookups: broadcastVar = sc.broadcast(largeLookupTable)
● Partitioning strategies for large datasets: df.repartition(200, "keyColumn")
● Persisting DataFrames in memory: from pyspark import StorageLevel; df.persist(StorageLevel.MEMORY_AND_DISK)
● Optimizing Spark SQL joins: df.join(broadcast(smallDf), "key")
● Minimizing data shuffles: df.coalesce(1)
● Using Kryo serialization for faster processing: set spark.serializer to org.apache.spark.serializer.KryoSerializer when the session is created (e.g. SparkSession.builder.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")); it cannot be changed after the SparkContext has started
● Adjusting Spark executor memory: set spark.executor.memory (e.g. "4g") at submit time via spark-submit --conf or SparkSession.builder.config; it is fixed once executors are launched
● Tuning Spark SQL shuffle partitions: spark.conf.set("spark.sql.shuffle.partitions", "200")
● Leveraging DataFrame caching wisely: df.cache()
● Avoiding unnecessary operations in transformations: avoid complex operations inside loops or iterative transformations
● Monitoring and debugging with the Spark UI: the application UI runs on the driver at http://<driver-host>:4040
● Efficient use of accumulators for global aggregates: acc = sc.accumulator(0)
● Optimizing data locality for faster processing: keep data close to the computation to minimize data transfer
● Utilizing DataFrames and Datasets over RDDs: prefer the DataFrame and Dataset APIs to leverage the Catalyst optimizer and the Tungsten execution engine
● Applying best practices for data skew: use salting or repartitioning to mitigate data skew in joins and aggregations
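Putting a few of these settings together, the following sketch combines a broadcast join, persistence, and shuffle tuning; the DataFrame names, sizes, and key column are assumed for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from pyspark import StorageLevel

spark = SparkSession.builder.appName("performance-demo").getOrCreate()

# The shuffle partition count is a runtime-settable SQL configuration
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Placeholder DataFrames: a large fact table and a small lookup table
large_df = spark.range(1000000).withColumnRenamed("id", "key")
small_df = spark.createDataFrame([(i, "label_" + str(i)) for i in range(10)], ["key", "label"])

# Broadcasting the small side avoids shuffling the large DataFrame across the cluster
joined = large_df.join(broadcast(small_df), "key")

# Keep the joined result in memory (spilling to disk if needed) and materialize it for reuse
joined.persist(StorageLevel.MEMORY_AND_DISK)
joined.count()

Note that the broadcast hint only pays off when the small side fits comfortably in executor memory; otherwise a regular shuffle join with sensible partitioning is the safer choice.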