Participants will get a deep dive into one of Azure's newest offerings: Azure Databricks, a fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure. In this session, we start with a technical overview of Spark and quickly jump into Azure Databricks' key collaboration features, cluster management, and tight data integration with Azure data sources. Concepts are made concrete via a detailed walk-through of an advanced analytics pipeline built using Spark and Azure Databricks.
Full video of the presentation: https://www.youtube.com/watch?v=14D9VzI152o
Presentation demo: https://github.com/devlace/azure-databricks-anomaly
17. Optimized Databricks Runtime Engine [architecture diagram]: Azure Databricks combines a collaborative workspace (data engineer, data scientist, business analyst) with production jobs & workflows (multi-stage pipelines, job scheduler, notifications & logs), all running on Apache Spark and the optimized Databricks runtime (Databricks I/O, Serverless). It reads from cloud storage, data warehouses, Hadoop storage, IoT/streaming data, and REST APIs, and feeds machine learning models, BI tools, data exports, and data warehouses. Themes: enhance productivity, build on a secure & trusted cloud, scale without limits.
18. Azure Databricks in an Azure data architecture [architecture diagram]: ingest from Event Hubs, Kafka on HDInsight, and Cosmos DB; orchestrate with Data Factory; store in Azure Storage and Azure Data Lake; serve and visualize through SQL DW and Power BI; secure with Azure Active Directory.
38. Collaborative Workspace [diagram]: data engineers, data scientists, and business analysts share one workspace; production jobs & workflows run as multi-stage pipelines with a job scheduler plus notifications & logs.
45. Resources:
- Official Apache Spark website
- Azure Databricks Documentation
- [Book] Spark: The Definitive Guide
47. Big data analytics options on Azure, from most control to most ease of use (reduced administration):
- IaaS clusters: any Hadoop technology, any distribution (Azure Marketplace: HDP | CDH | MapR)
- Managed clusters: workload-optimized, managed clusters (Azure HDInsight)
- Big data as-a-service: data engineering in a job-as-a-service model (Azure Data Lake Analytics); frictionless & optimized Spark clusters (Azure Databricks)
Big data storage underneath all of these: Azure Data Lake Store, Azure Storage.
Talking points:
- Unified: one engine across batch, streaming, SQL, and ML workloads.
- A computing engine, not a storage solution (it interfaces with existing storage).
- Libraries: MLlib, GraphX, Spark SQL, Structured Streaming, plus open source packages.
- Developers can also choose to cache a dataset, useful for jobs that reuse the same dataset over and over (see the sketch below).
- Different layers of API, low-level and high-level; the high-level APIs are built on top of the low-level ones.
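For instance, a minimal PySpark caching sketch (the path and column names are hypothetical, not from the demo):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset; cache() keeps it in memory after the first action.
events = spark.read.json("/mnt/data/events")
events.cache()

events.where("level = 'ERROR'").count()   # first action materializes the cache
events.groupBy("level").count().show()    # subsequent jobs reuse the cached data
```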
Objective: Describe a running Spark application
Talking points:
- SparkSession: the entry point to a Spark application.
- Driver: the heart of the Spark app. It responds to the user's program or input, maintains information about the Spark application, and analyzes, distributes, and schedules work across the executors.
- Executors: do the actual work assigned by the driver and report back to it.
- Cluster manager: Spark can use three different cluster managers (including the built-in standalone cluster manager); cluster managers manage the actual resources.
- CAREFUL with collect(): it brings the entire dataset back to the driver.
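A minimal sketch of that anatomy in PySpark (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# Creating the SparkSession starts the driver, which plans and schedules work.
spark = SparkSession.builder.appName("anatomy-demo").getOrCreate()

df = spark.range(1_000_000)                # planned on the driver...
print(df.selectExpr("sum(id)").first())    # ...executed by the executors

# CAREFUL with collect(): it pulls every row back to the driver and can
# exhaust its memory on large data. Prefer aggregations, take(n), or writes.
# rows = df.collect()
```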
Fun fact: Employees of Databricks have written over 75% of the code in Apache Spark
OBJECTIVE: Show how easy it is to get started
- Create a Databricks workspace
- Create a Spark cluster
- Create a notebook
- Import a notebook: https://databricks.com/resources/type/example-notebooks
Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.
There is also tight Azure integration.
Workspaces
Workspaces allow you to organize all the work that you are doing on Databricks. Like a folder structure on your computer, a workspace lets you save notebooks and libraries and share them with other users. Workspaces are not connected to data and should not be used to store data; they are simply where you store the notebooks and libraries that you use to operate on your data.
Notebooks
Notebooks are a set of any number of cells that allow you to execute commands. Cells hold code in any of the following languages: Scala, Python, R, SQL, or Markdown. Notebooks have a default language, but each cell can override it by including %[language name] at the top of the cell, for instance %python. We'll see this feature shortly.
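For example, a sketch of what this looks like in a notebook whose default language is Python (the table in the SQL cell is hypothetical; display and spark are provided by the notebook environment):

```python
# Cell 1 — default language (Python):
display(spark.range(5))

# Cell 2 — overridden to SQL; the magic must be the first line of the cell:
# %sql
# SELECT level, COUNT(*) FROM logs GROUP BY level
```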
Notebooks need to be connected to a cluster in order to execute commands; however, they are not permanently tied to a cluster. This allows notebooks to be shared via the web or downloaded onto your local machine.
Dashboards
Dashboards can be created from notebooks as a way of displaying the output of cells without the code that generates them.
Notebooks can also be scheduled as jobs in one click either to run a data pipeline, update a machine learning model, or update a dashboard.
Libraries
Libraries are packages or modules that provide additional functionality that you need to solve your business problems. These may be custom-written Scala or Java JARs, Python eggs, or custom-written packages. You can write and upload these manually, or you can install them directly via package management utilities like PyPI or Maven.
Tables
Tables are structured data that you and your team will use for analysis. Tables can exist in several places: stored in cloud storage, stored on the cluster that you're currently using, or cached in memory. For more about tables, see the documentation.
Clusters
Clusters are groups of computers that you treat as a single computer. In Databricks, this means that you can effectively treat 20 computers as you might treat one computer. Clusters allow you to execute code from notebooks or libraries on a set of data. That data may be raw data located in cloud storage or structured data that you uploaded as a table to the cluster you are working on.
It is important to note that clusters have access controls that determine who has access to each cluster.
Jobs
Jobs are the tool by which you can schedule execution to occur either on an already existing cluster or a cluster of its own. These can be notebooks as well as jars or Python scripts. They can be created either manually or via the REST API.
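As a sketch, creating a notebook job through the Jobs REST API (version 2.0) might look like the following; the host, token, cluster settings, and notebook path are all placeholders:

```python
import requests

host = "https://<region>.azuredatabricks.net"   # placeholder workspace URL
token = "<personal-access-token>"               # placeholder token

job = {
    "name": "nightly-anomaly-scoring",
    "new_cluster": {                            # run on a cluster of its own
        "spark_version": "<runtime-version>",
        "node_type_id": "<vm-size>",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Users/me@example.com/score_anomalies"},
}

resp = requests.post(f"{host}/api/2.0/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job)
print(resp.json())  # a job_id on success
```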
Apps
Apps are third party integrations with the Databricks platform. These include applications like Tableau.
If Spark is a computing engine, where does Databricks store the data?
Demo: mount Azure Blob Storage (WITHOUT the Secrets API)
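A sketch of the mount with dbutils (available only inside Databricks notebooks); the container, account, and key are placeholders, and the key is passed inline as in the demo, although a secret scope would be the more secure choice:

```python
dbutils.fs.mount(
    source="wasbs://data@mystorageacct.blob.core.windows.net",  # placeholders
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            "<storage-account-key>"  # inline key, i.e. without the Secrets API
    },
)
display(dbutils.fs.ls("/mnt/data"))
```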
Anomaly detection problem:
- Either treat it as simple classification, or assume the distribution of normal data is known
- An anomaly is anything that falls outside of that distribution
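A toy sketch of the second case (the values and the 3-sigma threshold are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Learn the distribution from "normal" data...
normal = spark.createDataFrame([(float(v),) for v in range(90, 110)], ["value"])
mu, sigma = normal.select(F.mean("value"), F.stddev("value")).first()

# ...then flag new points that fall outside it.
new = spark.createDataFrame([(101.0,), (250.0,)], ["value"])
new.where(F.abs(F.col("value") - mu) > 3 * sigma).show()  # flags 250.0
```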
DEMO ARCHITECTURE IN NEXT SLIDE!
Create a Spark SQL notebook
Use Spark SQL to set up tables
Show the tables created under the Data tab
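A sketch of the table setup (the path and table name are hypothetical; the demo's actual notebooks are in the linked repo):

```python
# Register a table over files in the mounted storage; it then shows up
# under the Data tab and is queryable from any notebook on the cluster.
spark.sql("""
    CREATE TABLE IF NOT EXISTS kdd_raw
    USING CSV
    OPTIONS (path '/mnt/data/kddcup.data', inferSchema 'true')
""")
spark.sql("SELECT COUNT(*) FROM kdd_raw").show()
```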
Word of warning: there are two MLlib libraries; the older one is RDD-based, the newer one is DataFrame-based.
Start with a Spark SQL DataFrame and use the DataFrame-specific API.
High-level abstractions to transform data: Transformers and Estimators.
GOAL: Explain Estimators vs. Transformers
The output of an Estimator is a "trained model", which is a Transformer (it takes in a DataFrame of input records and makes predictions).
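A minimal spark.ml sketch of the distinction (toy data; column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.2, 0.5), (1.0, 3.4, 2.1), (0.0, 0.8, 0.3), (1.0, 2.9, 1.8)],
    ["label", "x1", "x2"])

# A Transformer: transform() deterministically appends a column.
assembled = VectorAssembler(inputCols=["x1", "x2"],
                            outputCol="features").transform(df)

# An Estimator: fit() learns from data and returns a trained model...
model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)

# ...and that model is itself a Transformer that makes predictions.
model.transform(assembled).select("label", "prediction").show()
```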
Objective: MLlib is feature-rich and supports the usual requirements of any ML library:
- Extraction: extracting features from "raw" data
- Transformation: scaling, converting, or modifying features
- Selection: selecting a subset from a larger set of features
- Locality Sensitive Hashing (LSH): a class of algorithms that combines aspects of feature transformation with other algorithms
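For example, a small feature pipeline combining conversion and scaling (the column names are made up, loosely echoing network data):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("tcp", 215.0), ("udp", 105.0), ("tcp", 2327.0)],
    ["protocol", "src_bytes"])

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="protocol", outputCol="protocol_idx"),      # convert
    VectorAssembler(inputCols=["protocol_idx", "src_bytes"], outputCol="raw"),
    StandardScaler(inputCol="raw", outputCol="features"),              # scale
])
pipeline.fit(df).transform(df).select("features").show(truncate=False)
```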
DEMO ARCHITECTURE IN NEXT SLIDE!
- Explore the data
- Train the model multiple times
- Train the model with a custom estimator
- As a data engineer, there's no need to rewrite the model… just load it (see the sketch below)
- Create and schedule a Job
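A sketch of that hand-off (the paths and table name are hypothetical; the data scientist would have persisted the fitted pipeline earlier with model.save(...)):

```python
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the persisted model instead of retraining it...
model = PipelineModel.load("/mnt/models/anomaly_detector")

# ...and score fresh data, e.g. from a scheduled Job.
new_data = spark.read.parquet("/mnt/data/incoming")
model.transform(new_data).write.mode("append").saveAsTable("scored_events")
```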