Ike Ellis
Moderated By: Rayees Khan
Azure Databricks is Easier than You Think
If you require assistance during the session, type your inquiry into the question pane on the right side.
Maximize your screen with the zoom button at the top of the presentation window.
Please fill in the short evaluation following the session. It will appear in your web browser.
Technical Assistance
PASS’ flagship event takes place in Seattle, Washington
November 5-8, 2019
PASSsummit.com
PASS Marathon: Career Development
October 8, 2019
Upcoming Events
Ike Ellis
Solliance, Crafting Bytes
Azure Data Architect
Whether OLTP or analytics, I architect data
applications, mostly on the cloud. I specialize
in building scalable and resilient data solutions
that are affordable and simple to maintain.
Author
Developing Azure Solutions (2nd Ed)
Power BI MVP Book
Speaker
• PASS Summit Precon (Nov 2019)
• SDTIG
• San Diego Power BI & PowerApps UG
• PASS Summit Learning Pathway Speaker
• Microsoft Data & AI Summit (Dec 2019)
/ikeellis
@ike_ellis
ike@ikeellis.com
Ike Ellis
Moderated By: Rayees Khan
Azure Databricks is Easier than You Think
What is Spark?
Low-Level APIs: RDDs, Distributed Variables
Structured APIs: Datasets, DataFrames, SQL
Structured Streaming | Advanced Analytics | Libraries and Ecosystem
 ETL
 Predictive Analytics and Machine Learning
 SQL Queries and Visualizations
 Text mining and text processing
 Real-time event processing
 Graph applications
 Pattern recognition
 Recommendation engines
What applications use Spark?
 Not an OLTP engine
 Not good for data that is updated in place
 But if it has to do with data engineering and data analytics, Spark is pretty versatile:
 IoT
 Streaming
 Pub/Sub
 Notifications
 Alerting
What can’t you do with Spark?
 Scala
 Python
 Java
 SQL
 R
 C#
 Some Clojure
Programming Interfaces with Spark
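To make that interface list concrete, here is a minimal sketch (not from the deck) of the same query written twice, once with the Python DataFrame API and once with SQL. It assumes a Databricks notebook where spark is already defined and a hypothetical people.csv file with age and country columns.

from pyspark.sql import functions as F

people = spark.read.csv("/data/people.csv", header=True, inferSchema=True)

# Python DataFrame API
adults_py = people.filter(F.col("age") >= 18).groupBy("country").count()

# SQL over the same data, via a temporary view
people.createOrReplaceTempView("people")
adults_sql = spark.sql(
    "SELECT country, COUNT(*) AS count FROM people WHERE age >= 18 GROUP BY country")

adults_py.show()
adults_sql.show()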
History Lesson: HDFS
History Lesson: Hadoop/Resource Negotiation
Spark: Shared Memory
 Local/network file systems
 Azure Blob Storage
 Azure Data Lake Store (gen 1 & gen 2)
 SQL Server or other RDBMSs
 Azure CosmosDB or other NoSQL stores
 Messaging systems (like Kafka)
Input and Output Types
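As a rough illustration of these input and output types, the sketch below reads from Blob Storage and from an RDBMS and writes Parquet back out. It is hedged: the storage account, container, and SQL Server names are placeholders, storage credentials are assumed to be configured on the cluster, and the SQL Server JDBC driver is assumed to be available.

events = spark.read.parquet("wasbs://<container>@<account>.blob.core.windows.net/events/")

sales = (spark.read.format("jdbc")
         .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
         .option("dbtable", "dbo.Sales")
         .option("user", "<user>")
         .option("password", "<password>")
         .load())

# Writing is symmetric: the same connectors work as sinks.
events.write.mode("overwrite").parquet("wasbs://<container>@<account>.blob.core.windows.net/curated/events/")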
Spark application architecture
 All run in JVMs
 JVMs can all be on one machine or separate machines
 Don’t think of them as physical locations
Spark components
 The process that submits applications to Spark
 Plans and coordinates the application execution
 Returns status to the client
 Plans the execution of the application
 Creates the DAG
 Keeps track of available resources to execute tasks
 Schedules tasks to run "close" to the data where possible
 Coordinates the movement of data when necessary
Spark Driver
DAG
Directed Acyclic Graph
Stage
Tasks
SQL Optimizer
 Spark delays execution until an action is called, like collect()
 This allows Spark to delay the creation and execution of the DAG until
it sees how the code is being used. For instance, it might find an
optimal way to execute something if only the MAX or COUNT is being
requested
 The DAG is then managed across executors by the driver
 All statements before the action are just parsed and not executed until
Spark sees the action
Lazy Evaluation
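A small sketch of lazy evaluation in practice, assuming spark is available and a hypothetical flights.csv file with a dep_delay column:

flights = spark.read.csv("/data/flights.csv", header=True, inferSchema=True)

# Transformations only describe the work; nothing executes here.
delayed = flights.filter(flights.dep_delay > 60).select("origin", "dest", "dep_delay")

# You can inspect the plan Spark has built so far...
delayed.explain()

# ...but only an action such as count(), collect(), or show() triggers execution.
print(delayed.count())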
 Transformations
 Data manipulation operations
 Actions
 Requests for output
 .collect()
 .show()
 The driver also serves the Spark UI on port 4040. Subsequent applications use ports 4041, 4042, etc.
Driver creates DAG for Two Items
 Executors
 Host process for Spark tasks
 Can run hundreds of tasks within an application
 Reserve CPU and memory on workers
 Dedicated to a specific Spark application and terminated when the application
is done
 Worker
 Hosts the Executor
 Has a finite number of Executors that can be hosted
Spark Executors and Workers
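For reference, executor sizing can also be requested in code when you build the session yourself. The values below are illustrative only; on Azure Databricks the cluster configuration normally handles this for you.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("executor-sizing-example")
    .config("spark.executor.memory", "4g")   # memory reserved per executor
    .config("spark.executor.cores", "2")     # CPU cores reserved per executor
    .getOrCreate())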
 Allocate the distributed resources
 Can be separate processes, or in stand-alone mode, can be the same
process
 Spark Master
 Requests resources and assigns them to the Spark Driver
 Serves a web interface on 8080
 Is not in charge of the application. That’s the driver. Only used for resource
allocation
Spark Master & The Cluster Manager
 Solves the following issues:
 Difficult to set up
 Getting the right node count in the cluster is not elastic
 Lots of prerequisites
 Lots of decisions to make regarding scheduling, storage, etc.
Azure Databricks
 Azure Blob Storage
 Azure Data Lake Store Gen 1
 Azure Data Lake Store Gen 2
 Azure CosmosDB
 Azure SQL Database
 Azure Database
 Azure SQL Data Warehouse
Azure Databricks Storage Options for Files
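One common way to use these file stores from a Databricks notebook is to mount a container into the Databricks file system. The sketch below is illustrative only: the account, container, and secret scope names are placeholders, and it assumes the storage key is already stored in a Databricks secret scope.

dbutils.fs.mount(
    source="wasbs://<container>@<account>.blob.core.windows.net",
    mount_point="/mnt/<container>",
    extra_configs={
        "fs.azure.account.key.<account>.blob.core.windows.net":
            dbutils.secrets.get(scope="<scope>", key="<storage-key>")})

# Once mounted, the container reads like a local path.
df = spark.read.csv("/mnt/<container>/airports.csv", header=True)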
 Azure Databricks has a secure and reliable production environment in the
cloud, managed and supported by Spark experts. You can:
 Create clusters in seconds.
 Dynamically autoscale clusters up and down, including serverless clusters, and
share them across teams.
 Use clusters programmatically by using the REST APIs.
 Use secure data integration capabilities built on top of Spark that enable you
to unify your data without centralization.
 Get instant access to the latest Apache Spark features with each release.
Azure Databricks over Spark OnPrem
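As one example of "use clusters programmatically by using the REST APIs," the sketch below calls the clusters/create endpoint. The workspace URL, personal access token, runtime version, and node type are all placeholders; check the current Databricks REST API documentation for the exact payload.

import requests

resp = requests.post(
    "https://<workspace>.azuredatabricks.net/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json={
        "cluster_name": "demo-cluster",
        "spark_version": "<runtime-version>",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "autotermination_minutes": 30})
print(resp.json())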
 https://databricks.com/spark/comparing-databricks-to-apache-spark
Spark vs Databricks
 With the Serverless option, Azure Databricks completely abstracts out
the infrastructure complexity
 The workspace is different from the clusters: the workspace (notebook code) can be saved and reused while the clusters are spun down and not billing
Azure Databricks Serverless
 A notebook is:
 a web-based interface to a document that contains runnable code,
visualizations, and narrative text
 one interface for interacting with Azure Databricks
Azure Databricks Notebooks
 Power BI
 Excel
 Azure Data Factory
 SQL Server
 CosmosDB
Microsoft is Using Spark Everywhere
 Resilient
 Distributed
 Dataset
 Stored in memory
 Created in reaction to Hadoop always saving intermediate steps to the disk
 Much, much faster
 RDDs can be explored interactively once loaded
 Because of the speed, RDDs are often used to clean data (pipelines) or for machine learning
Basics of RDDs
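A minimal RDD sketch, assuming spark is available in the notebook:

sc = spark.sparkContext

# Build an RDD from a local collection and keep it in memory across actions.
numbers = sc.parallelize(range(1, 1001), numSlices=8).cache()

# Transformations such as map() are lazy...
squares = numbers.map(lambda x: x * x)

# ...and actions pull results back to the driver.
print(squares.take(5))   # [1, 4, 9, 16, 25]
print(squares.sum())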
 If a node performing an operation in Spark is lost, the dataset can be reconstructed
 Spark can do this because it knows the lineage of each RDD (the sequence of steps needed to create it)
RDDs - Resilient
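You can see the lineage Spark keeps for recovery with toDebugString(); this short sketch continues the squares RDD from the example above.

lineage = squares.toDebugString()   # PySpark may return this as bytes
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)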
 Data is divided into one or many partitions
 Distributed as an in-memory collection of objects across worker nodes
 RDDs are shared memory across executors (processes) in individual
workers (nodes)
RDDs - Distributed
 Records are uniquely identifiable data collections within the dataset
 Partitioned so that each partition can be operated on independently
 Called a Shared Nothing architecture
 Are immutable datasets. We don’t change an existing one as much as
we create a whole new one
 Can be relational (tables with rows and columns)
 Can be semi-structured: JSON, CSV, Parquet files, etc.
RDDs - Datasets
 Files
 SQL Server or other RDBMSs
 Azure CosmosDB or other NoSQL stores
 Stream
 Programmatically
Loading data into RDDs
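A hedged sketch of a couple of these load paths (the paths are placeholders):

sc = spark.sparkContext

# From files: one RDD element per line of text
lines = sc.textFile("/mnt/<container>/logs/*.log")

# Programmatically: from an in-memory collection
pairs = sc.parallelize([("SAN", 1), ("SEA", 2), ("PDX", 3)])

# From a DataFrame source (RDBMS, CosmosDB, etc.), drop down to the underlying RDD
# rows = spark.read.format("jdbc").option(...).load().rdd

print(pairs.count())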
Data locality with RDDs
RDD partitioning
 If you create too many RDD partitions against a cloud service (SQL Server, CosmosDB, Azure Blob Storage), you might get flagged as a DDoS attack
Careful with partitioning!
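A quick way to check and rein in the partition count before hitting an external service (paths are illustrative):

df = spark.read.csv("/mnt/<container>/airports.csv", header=True)

print(df.rdd.getNumPartitions())   # how many concurrent readers/writers you would get

# coalesce() lowers the partition count without a full shuffle;
# repartition() can raise or lower it, but it does shuffle.
df_small = df.coalesce(8)
df_small.write.mode("overwrite").parquet("/mnt/<container>/airports_parquet/")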
DataFrames
 PySpark has an implementation of RDDs that makes it a lot easier for Python developers to work with the data
 DataFrames are tabular RDDs that support many features and functions that Python developers have learned to appreciate in pandas
 They have many methods and properties that present the tabular data closer to how people are used to working with it in an RDBMS or Excel
Programming with PySpark: DataFrames
Syntax for DataFrames
airports = spark.read.csv("wasbs://databricksfiles@ikedatabricks.blob.core.windows.net/airports.csv", header="true")
airports.show()
For spark.read.csv, the header option can be set to "true". This tells Spark that the first row contains the column headers.
 Creates a new DataFrame from the previous one with an additional column that adds 1 to the value of the existing column.
Add a column to a DataFrame
df = df.withColumn("newCol", df.oldCol + 1)
Save a table in the Spark Catalog
airports.createOrReplaceTempView("airports")
• Creates a temporary view that can be referenced in other statements
airports.write.saveAsTable("fullAirports")
• Creates a permanent table
List All of the Tables in the Spark Catalog
print(spark.catalog.listTables())
Query a DataFrame
# Don't change this query
query = "FROM airports SELECT * LIMIT 10"
# Get the first 10 rows of flights
airports10 = spark.sql(query)
# Show the results
airports10.show()
Explain Plan
# Don't change this query
query = "FROM airports SELECT * LIMIT 10"
# Get the first 10 rows of flights
airports10 = spark.sql(query)
airports10.explain()
• Read from bottom to top
• Shows the steps to recreate the data frame
• Shows what the driver will do when an action is called
• Don't worry too much about understanding everything; explain plans can simply be helpful tools for debugging and for improving your knowledge as you progress with Spark.
 Reference a query string
 airportsInOregon = airports.filter("state == 'OR'")
 Reference the DataFrame
 airportsInOregon = airports.filter(airports.state == "OR")
 Notice there is no string for the entire clause
Two of the Ways to Filter
Bonus way to FILTER
# Define first filter
filterA = flights.origin == "SEA"
# Define second filter
filterB = flights.dest == "PDX"
# Filter the data, first by filterA then by filterB
selected2 = flights.filter(filterA).filter(filterB)
 Same as the filter. Use a string
 # Select the first set of columns
 selected1 = airports.select("faa", "long", "lat")
 Or use the DataFrame
 # Select the second set of columns
 temp = airports.select(airports.faa, airports.long, airports.lat)
Two of the Ways to SELECT Columns
Putting it all together
from pyspark.sql.functions import lit, desc

airports = airports.withColumn("count", lit(1))

airportsHiAlt = (airports
    .filter("alt > 1000")
    .groupBy("dst")
    .sum("count")
    .withColumnRenamed("sum(count)", "total")
    .sort(desc("total")))

airportsHiAlt.explain()
# airportsHiAlt.show()
What is going on in the previous slide
CSV File → Read → DataFrame → Group By → DataFrame → Sum → DataFrame → Sort → DataFrame → Collect → Array
 Ike Ellis
 @ike_ellis
 Crafting Bytes
 We're hiring Data Experts!
 Microsoft MVP
 Chairperson of the San Diego TIG
 Book co-author – Developing Azure Solutions
 Upcoming course on Azure Databricks
 www.craftingbytes.com
 www.ikeellis.com
Summary
Questions?
Coming up next…
Evaluating Cloud Products for Data Warehousing
Ginger Grant
Thank you for attending
@sqlpass
#sqlpass
@PASScommunity
Learn more from Ike
@ike_ellis ike@ikeellis.com
Editor's Notes

  1. [Moderator Part] Hello and welcome everyone to this 24 Hours of PASS: Summit Preview 2019! We're excited you could join us today for Ike Ellis's session, Azure Databricks is Easier than you Think. This 24 Hours of PASS consists of 24 consecutive live webinars, delivered by expert speakers from the PASS community. The sessions will be recorded and posted online after the event. You will receive an email letting you know when these become available. My name is Rayees Khan [you can say a bit about yourself here if you'd like] I have a few introductory slides before I hand over the reins to Ike. [move to next slide]
  2. [Moderator Part] If you require technical assistance please type your request into the question pane located on the right side of your screen and someone will assist you. This question pane is also where you may ask any questions throughout the presentation. Feel free to enter your questions at any time and once we get to the Q&A portion of the session, I’ll read your questions aloud to Ike. You are able to zoom in on the presentation content by using the zoom button located on the top of the presentation window. Please note that there will be a short evaluation at the end of the session, which will pop-up after the webinar ends in your web browser. Your feedback is really important to us for future events, so please take a moment to complete this. [Note to Rayees Khan: You need to determine which questions are the most relevant and ask them out loud to the presenter]. [move to next slide]
  3. [Moderator Part] Our next scheduled PASS Marathon event is focused on Career Development and will take place on October 8th. Watch your inbox for an email when registration opens. Also, PASS Summit 2019 will be happening on November 5th in Seattle, Washington. Head over to PASSsummit.com and register today! [move to next slide]
  4. [Moderator Part] This 24 Hours of PASS session is presented by Ike Ellis. Ike is a Senior Data Architect and partner for Crafting Bytes. He has been a Microsoft Data Platform MVP since 2011. He co-authored the book "Developing Azure Solutions," now in its second edition. He is a popular speaker at SQLBits, PASS Summit, and SQL in the City. He has also written courses on SQL Server for developers, Power BI, SSRS, SSIS, and SSAS for Wintellect. [move to next slide]
  5. And without further ado, here is Ike with Azure Databricks is Easier than you Think. {speaker begins} **[Speaker takes over]**
  6. Show T-SQL Notebooks Margo story
  7. Now there are seven steps that take us all the way back to the source data. You can see this in the explain plan on those DataFrames. Figure 2-10 shows the set of steps that we perform in "code." The true execution plan (the one visible in explain) will differ from that shown in Figure 2-10 because of optimizations in the physical execution; however, the illustration is as good a starting point as any. This execution plan is a directed acyclic graph (DAG) of transformations, each resulting in a new immutable DataFrame, on which we call an action to generate a result. Figure 2-10. The entire DataFrame transformation flow. The first step is to read in the data. We defined the DataFrame previously but, as a reminder, Spark does not actually read it in until an action is called on that DataFrame or one derived from the original DataFrame. The second step is our grouping; technically when we call groupBy, we end up with a RelationalGroupedDataset, which is a fancy name for a DataFrame that has a grouping specified but needs the user to specify an aggregation before it can be queried further. We basically specified that we're going to be grouping by a key (or set of keys) and that now we're going to perform an aggregation over each one of those keys. Therefore, the third step is to specify the aggregation. Let's use the sum aggregation method. This takes as input a column expression or, simply, a column name. The result of the sum method call is a new DataFrame. You'll see that it has a new schema but that it does know the type of each column. It's important to reinforce (again!) that no computation has been performed. This is simply another transformation that we've expressed, and Spark is simply able to trace our type information through it. The fourth step is a simple renaming. We use the withColumnRenamed method that takes two arguments, the original column name and the new column name. Of course, this doesn't perform computation: this is just another transformation! The fifth step sorts the data such that if we were to take results off of the top of the DataFrame, they would have the largest values in the destination_total column. You likely noticed that we had to import a function to do this, the desc function. You might also have noticed that desc does not return a string but a Column. In general, many DataFrame methods will accept strings (as column names) or Column types or expressions. Columns and expressions are actually the exact same thing.
  8. [Moderator to Address questions in order of relevance] [move to next slide] Coming up next…
  9. Stay tuned for our next session, Evaluating Cloud Products for Data Warehousing with Ginger Grant. [move to next slide]
  10. Thank you all for attending! [move to next slide]
  11. This is 24 Hours of PASS [END]