# Big Data Analytics with PySpark (CheatSheet)

This cheat sheet covers using PySpark for big data analytics: setting up the PySpark environment, loading and saving data in various formats, data processing and transformation, performance optimization, streaming analysis, and advanced analytics and machine learning. Key operations include initializing a SparkContext and SparkSession, reading and writing CSV/Parquet files, selecting and filtering DataFrame columns, joining DataFrames, caching data, and using MLlib for regression, classification, and clustering models.

Setting Up PySpark Environment

● Install PySpark: !pip install pyspark
● Initialize SparkContext: from pyspark import SparkContext; sc =
SparkContext()
● Create SparkSession: from pyspark.sql import SparkSession; spark =
SparkSession.builder.appName('AppName').getOrCreate()
● Check Spark version: spark.version
● Configure Spark properties: spark.conf.set("spark.sql.shuffle.partitions",
"50")
● List configured properties: spark.sparkContext.getConf().getAll()
● Stop SparkContext: sc.stop()
● Set the Python interpreter for PySpark in Jupyter: %env PYSPARK_PYTHON=python3
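
Putting the setup commands above together, here is a minimal sketch of a typical session lifecycle; the app name and the shuffle-partition value are arbitrary placeholders.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a session; the SparkContext is then available as spark.sparkContext.
spark = (
    SparkSession.builder
    .appName("AppName")                               # placeholder app name
    .config("spark.sql.shuffle.partitions", "50")     # properties can also be set at build time
    .getOrCreate()
)

print(spark.version)                                  # check the Spark version
print(spark.sparkContext.getConf().getAll()[:5])      # peek at a few configured properties

spark.stop()                                          # also stops the underlying SparkContext
```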

Data Loading and Saving

● Read CSV file: df = spark.read.csv('path/to/file.csv', header=True, inferSchema=True)
● Read Parquet file: df = spark.read.parquet('path/to/file.parquet')
● Read from database (JDBC): df = spark.read.format("jdbc").option("url",
"jdbc_url").option("dbtable", "table_name").option("user",
"username").option("password", "password").load()
● Write DataFrame to CSV: df.write.csv('path/to/output.csv',
mode='overwrite')
● Write DataFrame to Parquet: df.write.parquet('path/to/output.parquet',
mode='overwrite')
● Load a text file: rdd = sc.textFile('path/to/textfile.txt')
● Save RDD to a text file: rdd.saveAsTextFile('path/to/output')
● Read JSON file: df = spark.read.json('path/to/file.json')
● Write DataFrame to JSON: df.write.json('path/to/output.json',
mode='overwrite')
● DataFrame to RDD conversion: rdd = df.rdd
● RDD to DataFrame conversion: df = rdd.toDF(['column1', 'column2'])
● Read multiple files: df = spark.read.csv(['path/to/file1.csv',
'path/to/file2.csv'])
● Read from HDFS: df =
spark.read.text("hdfs://namenode:port/path/to/file.txt")
● Saving DataFrame in Hive: df.write.saveAsTable("database.tableName")
● Specifying schema explicitly: from pyspark.sql.types import StructType,
StructField, IntegerType, StringType; schema =
StructType([StructField("id", IntegerType(), True), StructField("name",
StringType(), True)]); df =
spark.read.schema(schema).csv('path/to/file.csv')
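
As a hedged, self-contained sketch of the loading and saving commands above: write a tiny CSV, read it back with an explicit schema, and round-trip it through Parquet. The /tmp paths and column names are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("LoadSaveDemo").getOrCreate()

# Write a tiny CSV so the example does not depend on an existing file.
spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"]) \
     .write.csv("/tmp/demo_csv", header=True, mode="overwrite")

# Read it back with an explicit schema (cheaper than inferSchema on large files).
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.read.schema(schema).csv("/tmp/demo_csv", header=True)

# Round-trip through Parquet, a columnar format that preserves the schema.
df.write.parquet("/tmp/demo_parquet", mode="overwrite")
spark.read.parquet("/tmp/demo_parquet").show()
```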

Data Processing and Transformation

● Select columns: df.select("column1", "column2").show()
● Filter rows: df.filter(df["age"] > 30).show()
● GroupBy and aggregate: df.groupBy("department").agg({"salary": "avg",
"age": "max"}).show()
● Join DataFrames: df1.join(df2, df1.id == df2.id).show()
● Sort DataFrame: df.sort(df.age.desc()).show()
● Distinct values: df.select("column").distinct().show()
● Column operations (add, subtract, etc.): df.withColumn("new_column",
df["salary"] * 0.1 + df["bonus"]).show()
● Rename column: df.withColumnRenamed("oldName", "newName").show()
● Drop column: df.drop("column_to_drop").show()
● Handle missing data: df.na.fill({"column1": "value1", "column2":
"value2"}).show()
● User-defined functions (UDF): from pyspark.sql.functions import udf; from pyspark.sql.types import LongType
  def square(x): return x * x
  square_udf = udf(square, LongType()); df.withColumn("squared", square_udf(df["number"])).show()
● Pivot tables: df.groupBy("department").pivot("gender").agg({"salary":
"avg"}).show()
● Window functions: from pyspark.sql.window import Window; from
pyspark.sql.functions import rank; windowSpec =
Window.partitionBy("department").orderBy("salary");
df.withColumn("rank", rank().over(windowSpec)).show()
● Running SQL queries directly on DataFrames:
df.createOrReplaceTempView("table"); spark.sql("SELECT * FROM table
WHERE age > 30").show()
● Sampling DataFrames: df.sample(withReplacement=False,
fraction=0.1).show()
● Concatenating columns: from pyspark.sql.functions import concat_ws; df.withColumn('full_name', concat_ws(' ', df['first_name'], df['last_name'])).show()
● Splitting a column into multiple columns: from pyspark.sql.functions import split; df.withColumn('splitted', split(df['full_name'], ' ')).show()
● Collecting a column as a list: from pyspark.sql.functions import collect_list; df.groupBy("department").agg(collect_list("name").alias("names")).show()
● Converting DataFrame column to Python list: names_list =
df.select("name").rdd.flatMap(lambda x: x).collect()
● Using when-otherwise for conditional logic: from pyspark.sql.functions
import when; df.withColumn("category", when(df["age"] < 30,
"Young").otherwise("Old")).show()
● Exploding a list to rows: from pyspark.sql.functions import explode;
df.withColumn('name', explode(df['names'])).show()
● Aggregating with custom expressions: from pyspark.sql.functions import expr; df.groupBy("department").agg(expr("avg(salary) as average_salary")).show()
● Calculating correlations: df.stat.corr("column1", "column2")
● Handling date and timestamp: from pyspark.sql.functions import
current_date, current_timestamp; df.withColumn("today",
current_date()).withColumn("now", current_timestamp()).show()
● Repartitioning DataFrames: df.repartition(10).rdd.getNumPartitions()
● Caching DataFrames for optimization: df.cache()
● Applying map and reduce operations on RDDs: rdd.map(lambda x: x *
x).reduce(lambda x, y: x + y)
● Using broadcast variables for efficiency: broadcastVar = sc.broadcast([1,
2, 3])
● Accumulators for aggregating information across executors: acc =
sc.accumulator(0); rdd.foreach(lambda x: acc.add(x))
● DataFrame descriptive statistics: df.describe().show()
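
To see how several of these transformations compose, here is a small sketch on a toy DataFrame; the columns (department, name, age, salary) and the thresholds are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, when, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("TransformDemo").getOrCreate()

df = spark.createDataFrame(
    [("IT", "Ann", 28, 4000.0), ("IT", "Bob", 35, 5200.0), ("HR", "Cat", 41, 4500.0)],
    ["department", "name", "age", "salary"],
)

# Filter rows, add a conditional column, and rank salaries within each department.
windowSpec = Window.partitionBy("department").orderBy(df["salary"].desc())
result = (
    df.filter(df["age"] > 25)
      .withColumn("category", when(df["age"] < 30, "Young").otherwise("Old"))
      .withColumn("rank", rank().over(windowSpec))
)
result.show()

# Aggregate with an alias per department.
df.groupBy("department").agg(avg("salary").alias("average_salary")).show()
```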

Performance Optimization

● Broadcast join for large and small DataFrames: from pyspark.sql.functions import broadcast; large_df.join(broadcast(small_df), "key").show()
● Avoiding shuffles with coalesce:
df.coalesce(1).write.csv('path/to/output', mode='overwrite')
● Partition tuning for better parallelism:
df.repartition("column").write.parquet('path/to/output')
● Caching intermediate DataFrames: intermediate_df = df.filter(df["age"] > 30).cache()
● Using columnar storage formats like Parquet:
df.write.parquet('path/to/output.parquet')
● Optimizing Spark SQL with explain plans: df.explain(True)
● Minimizing data serialization cost with Kryo:
spark.conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
● Leveraging off-heap memory storage: spark.conf.set("spark.memory.offHeap.enabled", "true"); spark.conf.set("spark.memory.offHeap.size", "2g")
● Adjusting the size of shuffle partitions:
spark.conf.set("spark.sql.shuffle.partitions", "200")
● Using vectorized operations in PySpark (pandas UDFs): import pandas as pd; from pyspark.sql.functions import pandas_udf
  @pandas_udf("integer")
  def square_udf(s: pd.Series) -> pd.Series: return s * s
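
A short sketch combining a few of these optimizations: a broadcast join of a generated large table against a tiny lookup table, caching the result, and inspecting the plan. Table contents, the join key name, and the partition count are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("OptimizeDemo").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")   # tune shuffle parallelism

large_df = spark.range(1_000_000).withColumnRenamed("id", "key")          # synthetic large side
small_df = spark.createDataFrame([(0, "a"), (1, "b")], ["key", "label"])  # tiny lookup side

joined = large_df.join(broadcast(small_df), "key")   # broadcasting avoids shuffling large_df
joined.cache()                                       # cache if the result is reused
joined.explain(True)                                 # inspect logical and physical plans
print(joined.count())
```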

Advanced Analytics and Machine Learning with PySpark

● Linear Regression Model: from pyspark.ml.regression import LinearRegression; lr = LinearRegression(featuresCol='features', labelCol='label'); lrModel = lr.fit(train_df)
● Classification Model (Logistic Regression): from pyspark.ml.classification
import LogisticRegression; logr =
LogisticRegression(featuresCol='features', labelCol='label'); logrModel =
logr.fit(train_df)
● Decision Tree Classifier: from pyspark.ml.classification import
DecisionTreeClassifier; dt =
DecisionTreeClassifier(featuresCol='features', labelCol='label'); dtModel
= dt.fit(train_df)
● Random Forest Classifier: from pyspark.ml.classification import
RandomForestClassifier; rf =
RandomForestClassifier(featuresCol='features', labelCol='label'); rfModel
= rf.fit(train_df)
● Gradient-Boosted Tree Classifier: from pyspark.ml.classification import
GBTClassifier; gbt = GBTClassifier(featuresCol='features',
labelCol='label'); gbtModel = gbt.fit(train_df)
● Clustering with K-Means: from pyspark.ml.clustering import KMeans; kmeans
= KMeans().setK(3).setSeed(1); model = kmeans.fit(dataset)
● Building a Pipeline: from pyspark.ml import Pipeline; from pyspark.ml.feature import HashingTF, Tokenizer; from pyspark.ml.classification import LogisticRegression; tokenizer = Tokenizer(inputCol="text", outputCol="words"); hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features"); lr = LogisticRegression(maxIter=10, regParam=0.001); pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
● Model Evaluation (Binary Classification): from pyspark.ml.evaluation
import BinaryClassificationEvaluator; evaluator =
BinaryClassificationEvaluator(); print('Area Under ROC',
evaluator.evaluate(predictions))
● Model Evaluation (Multiclass Classification): from pyspark.ml.evaluation
import MulticlassClassificationEvaluator; evaluator =
MulticlassClassificationEvaluator(metricName="accuracy"); accuracy =
evaluator.evaluate(predictions); print("Test Accuracy = %g" % accuracy)
● Hyperparameter Tuning using CrossValidator: from pyspark.ml.tuning
import ParamGridBuilder, CrossValidator; paramGrid =
ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build(); cv =
CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
evaluator=evaluator, numFolds=3); cvModel = cv.fit(train_df)
● Feature Transformation - VectorAssembler: from pyspark.ml.feature import
VectorAssembler; assembler =
VectorAssembler(inputCols=['feature1','feature2'], outputCol="features");
output = assembler.transform(df)
● Feature Scaling - StandardScaler: from pyspark.ml.feature import
StandardScaler; scaler = StandardScaler(inputCol="features",
outputCol="scaledFeatures", withStd=True, withMean=False); scalerModel =
scaler.fit(df); scaledData = scalerModel.transform(df)
● Text Processing - Tokenization: from pyspark.ml.feature import Tokenizer;
tokenizer = Tokenizer(inputCol="document", outputCol="words"); wordsData
= tokenizer.transform(documentDF)
● Text Processing - Stop Words Removal: from pyspark.ml.feature import
StopWordsRemover; remover = StopWordsRemover(inputCol="words",
outputCol="filtered"); filteredData = remover.transform(wordsData)
● Principal Component Analysis (PCA): from pyspark.ml.feature import PCA;
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures"); model =
pca.fit(df); result = model.transform(df).select("pcaFeatures")
● Handling Missing Values: df.na.fill({'column1': 'value1', 'column2':
'value2'}).show()
● Using SQL Functions for Data Manipulation: from pyspark.sql.functions
import col, upper; df.select(col("name"),
upper(col("name")).alias("name_upper")).show()
● Applying Custom Functions with UDF: from pyspark.sql.functions import
udf; from pyspark.sql.types import IntegerType; my_udf = udf(lambda x:
len(x), IntegerType()); df.withColumn("string_length",
my_udf(col("string_column"))).show()
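
Tying the MLlib pieces above into one hedged sketch: assemble features, fit a logistic regression inside a Pipeline, and evaluate with the binary-classification evaluator. The toy data, column names, and hyperparameters are illustrative; in practice you would split into train/test sets instead of evaluating on the training data.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("MLDemo").getOrCreate()

data = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 7.0, 1.0), (6.0, 8.0, 1.0)],
    ["feature1", "feature2", "label"],
)

assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])

model = pipeline.fit(data)               # toy fit; use a train split in real work
predictions = model.transform(data)

evaluator = BinaryClassificationEvaluator()   # defaults to areaUnderROC
print("Area Under ROC:", evaluator.evaluate(predictions))
```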

Streaming Data Analysis with PySpark

● Creating a Streaming DataFrame from a Socket Source: lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
● Writing Streaming Data to Console: query =
lines.writeStream.outputMode("append").format("console").start()
● Using Watermarking to Handle Late Data: from pyspark.sql.functions import window, col; windowedCounts = lines.withWatermark("timestamp", "10 minutes").groupBy(window(col("timestamp"), "5 minutes")).count()
● Aggregating Stream Data: aggregatedStream = streamData.groupBy("column").agg({"value": "sum"})
● Querying Streaming Data in Memory: query =
streamData.writeStream.queryName("aggregated_data").outputMode("complete"
).format("memory").start()
● Reading from a Kafka Source: kafkaStream =
spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"host1:port1,host2:port2").option("subscribe", "topicName").load()
● Writing Stream Data to Kafka: query =
streamData.selectExpr("to_json(struct(*)) AS
value").writeStream.format("kafka").option("kafka.bootstrap.servers",
"host:port").option("topic", "outputTopic").start()
● Triggering Streaming Queries: query =
streamData.writeStream.outputMode("append").trigger(processingTime='5
seconds').format("console").start()
● Managing Streaming Queries: query.status, query.stop()
● Using Foreach and ForeachBatch for Custom Sinks: query =
streamData.writeStream.foreachBatch(customFunction).start()
● Stateful Stream Processing: mapGroupsWithState belongs to the Scala/Java Dataset API; in PySpark (3.4+), use the analogous streamData.groupBy("key").applyInPandasWithState(updateFunction, outputSchema, stateSchema, outputMode, timeoutConf)
● Handling Late Data and Watermarking: lateDataHandledStream =
streamData.withWatermark("timestampColumn", "1
hour").groupBy(window(col("timestampColumn"), "10 minutes"),
"keyColumn").count()
● Streaming Deduplication: streamData.withWatermark("eventTime", "10
minutes").dropDuplicates(["userID", "eventTime"])
● Continuous Processing Mode: query = streamData.writeStream.format("console").trigger(continuous="1 second").start()
● Monitoring Streaming Queries: spark.streams.addListener(MyListener()), where MyListener subclasses pyspark.sql.streaming.StreamingQueryListener (available in PySpark 3.4+)
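
A self-contained streaming sketch, assuming no Kafka broker or socket server is available: it uses the built-in "rate" source (which emits timestamp/value rows), applies a watermark and a windowed count, and writes to the console. The rate, window sizes, and trigger interval are arbitrary choices.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("StreamDemo").getOrCreate()

# The "rate" source generates (timestamp, value) rows, standing in for real events.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

windowedCounts = (
    stream.withWatermark("timestamp", "1 minute")
          .groupBy(window(col("timestamp"), "30 seconds"))
          .count()
)

query = (
    windowedCounts.writeStream
    .outputMode("update")
    .format("console")
    .trigger(processingTime="10 seconds")
    .start()
)

query.awaitTermination(30)   # let the query run briefly for the demo
query.stop()
```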

Spark Performance Tuning and Best Practices

● Broadcast Variables for Large Lookups: broadcastVar = sc.broadcast(largeLookupTable)
● Partitioning Strategies for Large Datasets: df.repartition(200,
"keyColumn")
● Persisting DataFrames in Memory: from pyspark import StorageLevel; df.persist(StorageLevel.MEMORY_AND_DISK)
● Optimizing Spark SQL Joins: df.join(broadcast(smallDf), "key")
● Minimizing Data Shuffles: df.coalesce(1)
● Using Kryo Serialization for Faster Processing:
spark.conf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
● Adjusting Spark Executor Memory (set before the application starts): spark = SparkSession.builder.config("spark.executor.memory", "4g").getOrCreate()
● Tuning Spark SQL Shuffle Partitions:
spark.conf.set("spark.sql.shuffle.partitions", "200")
● Leveraging DataFrame Caching Wisely: df.cache()
● Avoiding Unnecessary Operations in Transformations: Avoid complex
operations inside loops or iterative transformations
● Monitoring and Debugging with Spark UI: access the application UI at http://<driver-host>:4040
● Efficient Use of Accumulators for Global Aggregates: acc =
sc.accumulator(0)
● Optimizing Data Locality for Faster Processing: keep data close to the computation to minimize data transfer
● Utilizing DataFrames and Datasets over RDDs for Optimized Performance: prefer the DataFrame/Dataset APIs to leverage the Catalyst optimizer and the Tungsten execution engine
● Applying Best Practices for Data Skew: Use salting techniques or
repartitioning to mitigate data skew in joins or aggregations
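
As a sketch of the salting technique mentioned in the last item, assuming a single hot key and an arbitrary salt factor of 8: spread the hot key over N sub-keys on the large side, replicate the small side N times, join on the salted key, then drop the helper columns. All names here are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, explode, floor, lit, rand, sequence
from pyspark import StorageLevel

spark = SparkSession.builder.appName("SkewDemo").getOrCreate()
N = 8  # salt factor (assumption)

# Skewed "fact" side: every row shares the same hot key; add a random salt 0..N-1.
fact = spark.range(100_000).withColumn("key", lit("hot_key"))
fact_salted = fact.withColumn(
    "salted_key", concat_ws("_", col("key"), floor(rand() * N).cast("string"))
)

# Small "dimension" side: replicate each key N times so every salted key finds a match.
dim = spark.createDataFrame([("hot_key", "some_value")], ["key", "value"])
dim_salted = (
    dim.withColumn("salt", explode(sequence(lit(0), lit(N - 1))))
       .withColumn("salted_key", concat_ws("_", col("key"), col("salt").cast("string")))
)

joined = fact_salted.join(dim_salted, "salted_key").drop("salted_key", "salt")
joined.persist(StorageLevel.MEMORY_AND_DISK)   # persist if reused downstream
print(joined.count())
```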

By: Waleed Mousa
