
Data Engineering with Databricks
Course Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding
Course Agenda
• Module 1: Databricks Workspace and Services
• Module 2: Delta Lake
• Module 3: Relational Entities on Databricks
• Module 4: ETL with Spark SQL
• Module 5: OPTIONAL Python for Spark SQL
• Module 6: Incremental Data Processing
• Module 7: Multi-Hop Architecture
• Module 8: Delta Live Tables
• Module 9: Task Orchestration with Jobs
• Module 10: Running a DBSQL Query
• Module 11: Managing Permissions
• Module 12: Productionalizing Dashboards and Queries in DBSQL
Databricks Certified Data Engineer Associate
Certification helps you gain industry recognition, competitive differentiation, greater productivity, and results.
• This course helps you prepare for the Databricks Certified Data Engineer Associate exam
• Please see the Databricks Academy for additional prep materials
For more information, visit: databricks.com/learn/certification

The Databricks Lakehouse Platform
Using the Databricks Lakehouse Platform
Learning Objectives
• Describe the components of the Databricks Lakehouse
• Complete basic code development tasks using services of the Databricks Data Science and Engineering Workspace
• Perform common table operations using Delta Lake in the Lakehouse
Using the Databricks Lakehouse Platform
Agenda
• Introduction to the Databricks Lakehouse Platform
• Introduction to the Databricks Workspace and Services
• Using clusters, files, notebooks, and repos
• Introduction to Delta Lake
• Manipulating and optimizing data in Delta tables
Customers
7,000+ customers across the globe
Lakehouse: one simple platform to unify all of your data, analytics, and AI workloads
Original creators of open source projects including Apache Spark, Delta Lake, and MLflow
Supporting enterprises in every industry
• Healthcare & Life Sciences
• Public Sector
• Manufacturing & Automotive
• Retail & CPG
• Media & Entertainment
• Energy & Utilities
• Financial Services
• Digital Native

Most enterprises struggle with data
Siloed stacks increase data architecture complexity.
[Diagram: four siloed stacks side by side:
• Data Warehousing: analytics and BI over data marts and a data warehouse, built on structured data
• Data Engineering: extract, load, transform, and data prep into a data lake of structured, semi-structured, and unstructured data
• Streaming: a real-time database and streaming engine fed by streaming data sources
• Data Science and ML: data science and machine learning over a data lake of structured, semi-structured, and unstructured data]
Most enterprises struggle with data
Disconnected systems and proprietary data formats make integration difficult.
[Diagram: the same four siloed stacks, each populated with example products:
• Data Warehousing: Amazon Redshift, Azure Synapse, Snowflake, SAP, Teradata, Google BigQuery, IBM Db2, Oracle Autonomous Data Warehouse
• Data Engineering: Hadoop, Amazon EMR, Google Dataproc, Apache Airflow, Apache Spark, Cloudera
• Streaming: Apache Kafka, Apache Flink, Azure Stream Analytics, Tibco Spotfire, Apache Spark, Amazon Kinesis, Google Dataflow, Confluent
• Data Science and ML: Jupyter, Azure ML Studio, Domino Data Labs, TensorFlow, Amazon SageMaker, MATLAB, SAS, PyTorch]
Most enterprises struggle with data
Siloed data teams decrease productivity.
[Diagram: the same stacks and products, now labeled with the teams that own them: data analysts on data warehousing, data engineers on data engineering and streaming, and data scientists on data science and ML, each team working in its own silo with disconnected systems and proprietary formats.]
Lakehouse
Data Lake + Data Warehouse
One platform to unify all of your data, analytics, and AI workloads

Data Lake + Data Warehouse
An open approach to bringing data management and governance to data lakes:
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists

The Databricks Lakehouse Platform
Simple, Open, Collaborative
• Workloads: Data Engineering, BI and SQL Analytics, Data Science and ML, and Real-Time Data Applications
• Platform layers: Data Management and Governance, Open Data Lake, and Platform Security & Administration
• Data: unstructured, semi-structured, structured, and streaming

The Databricks Lakehouse Platform
Simple
Unify your data, analytics, and AI on one common platform for all data use cases: data engineering, BI and SQL analytics, data science and ML, and real-time data applications.

The Databricks Lakehouse Platform
Open
Unify your data ecosystem with open source standards and formats. Built on the innovation of some of the most successful open source data projects in the world, with 30 million+ monthly downloads.

The Databricks Lakehouse Platform
Open: 450+ partners across the data landscape
[Diagram: the Lakehouse Platform with centralized governance at the center, surrounded by partner categories:
• Visual ETL & Data Ingestion: Azure Data Factory, AWS Glue
• Business Intelligence
• Data warehouses: Azure Synapse, Google BigQuery, Amazon Redshift
• Machine Learning: Amazon SageMaker, Azure Machine Learning, Google AI Platform
• Data Providers
• Top Consulting & SI Partners]

The Databricks Lakehouse Platform
Collaborative
Unify your data teams to collaborate across the entire data and AI workflow: data analysts, data engineers, and data scientists sharing datasets, notebooks, dashboards, and models.

Databricks Architecture and Services
Databricks Architecture
Control Plane (runs in the Databricks cloud account):
• Web Application
• Repos / Notebooks
• Job Scheduling
• Cluster Management
Data Plane (runs in the customer cloud account):
• Data processing with Apache Spark clusters
• Databricks File System (DBFS)
• Connections to data sources
A small DBFS example follows.
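As a small, hedged illustration of interacting with DBFS from a Databricks notebook (where dbutils and display are predefined), the sketch below lists the DBFS root, writes a file, and reads it back; the paths are hypothetical.

# Browse the DBFS root, write a small file, and read it back.
display(dbutils.fs.ls("/"))
dbutils.fs.put("/tmp/hello.txt", "hello, lakehouse", overwrite=True)
print(dbutils.fs.head("/tmp/hello.txt"))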
Databricks Services
The Control Plane in Databricks manages customer accounts, datasets, and clusters through services such as the Databricks web application and cluster management.
Clusters
(The architecture diagram above, repeated here to highlight the Apache Spark clusters running in the Data Plane of the customer cloud account.)
Clusters
Overview
• Clusters are made up of one or more virtual machine (VM) instances
• The driver coordinates the activities of the executors
• Executors run the tasks that compose a Spark job
[Diagram: a cluster with one driver and multiple executors; each executor has cores, memory, and local storage.]
Clusters
Types
All-purpose clusters
• Analyze data collaboratively using interactive notebooks
• Created from the Workspace or API
• Retains up to 70 clusters for up to 30 days
Job clusters
• Run automated jobs
• The Databricks job scheduler creates job clusters when running jobs
• Retains up to 30 clusters

Git Versioning with Databricks Repos
Databricks Repos
Overview
Git versioning
• Native integration with GitHub, GitLab, Bitbucket, and Azure DevOps
• UI-based workflows
CI/CD integration
• API surface to integrate with automation
• Simplifies the dev/staging/prod multi-workspace story
Enterprise ready
• Allow lists to avoid exfiltration
• Secret detection to avoid leaking keys

Databricks Repos
CI/CD Integration
[Diagram: the Control Plane in Databricks (Repos, Jobs, Notebooks, and the Repos Service) connects to external Git and CI/CD systems for versioning, review, and testing.]
Databricks Repos
Best practices for CI/CD workflows
Admin workflow:
• Set up top-level Repos folders (example: Production)
• Set up Git automation to update Repos on merge
User workflow in Databricks:
• Clone the remote repository to a user folder
• Create a new branch based on the main branch
• Create and edit code
• Commit and push to a feature branch
Merge workflow in the Git provider:
• Pull request and review process
• Merge into the main branch
• Git automation calls the Databricks Repos API
Production job workflow in Databricks:
• An API call brings the Repo in the Production folder to the latest version
• Run the Databricks job based on the Repo in the Production folder
A sketch of the Repos API call appears below.

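As a hedged illustration of the "Git automation calls the Databricks Repos API" step, this sketch updates a production Repo to the head of main via the Repos REST API; the workspace URL, token, and repo ID are placeholders, and error handling is minimal.

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder
REPO_ID = "<repo-id>"   # ID of the Repo under the Production folder

# PATCH /api/2.0/repos/{id} checks out the given branch at its latest commit,
# bringing the production Repo up to date after a merge.
resp = requests.patch(
    f"{WORKSPACE_URL}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": "main"},
)
resp.raise_for_status()
print(resp.json())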
What is Delta Lake?
Delta Lake is an open-source project that enables building a data lakehouse on top of existing storage systems.
Delta Lake is not...
• Proprietary technology
• A storage format
• A storage medium
• A database service or data warehouse
Delta Lake is...
• Open source
• Built upon standard data formats
• Optimized for cloud object storage
• Built for scalable metadata handling
Delta Lake brings ACID to object storage
• Atomicity
• Consistency
• Isolation
• Durability
Problems solved by ACID
1. Hard to append data
2. Modification of existing data is difficult
3. Jobs failing midway
4. Real-time operations are hard
5. Costly to keep historical data versions
A short sketch of these guarantees in action follows.

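To make these guarantees concrete, here is a minimal sketch of transactional table operations run via spark.sql in a Databricks notebook, where spark is predefined; the students table and its columns are hypothetical.

# Create a table; on Databricks it is a Delta table by default.
spark.sql("CREATE TABLE IF NOT EXISTS students (id INT, name STRING, value DOUBLE)")

# Appends and modifications are ACID transactions: they fully commit or not at all.
spark.sql("INSERT INTO students VALUES (1, 'Yve', 1.0), (2, 'Omar', 2.5)")
spark.sql("UPDATE students SET value = value + 1 WHERE name LIKE 'O%'")

# The transaction log keeps table history cheaply, enabling audits and time travel.
spark.sql("DESCRIBE HISTORY students").show(truncate=False)
spark.sql("SELECT * FROM students VERSION AS OF 0").show()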
Delta Lake is the default for all tables created in Databricks.
ETL with Spark SQL and Python
ETL with Spark SQL and Python
Learning Objectives
• Leverage Spark SQL DDL to create and manipulate relational entities on Databricks
• Use Spark SQL to extract, transform, and load data to support production workloads and analytics in the Lakehouse
• Leverage Python for advanced code functionality needed in production applications
ETL with Spark SQL and Python
Agenda
• Working with Relational Entities on Databricks
  • Managing databases, tables, and views
• ETL with Spark SQL
  • Extracting data from external sources, loading and updating data in the lakehouse, and common transformations
• Just Enough Python for Spark SQL
  • Building extensible functions with Python-wrapped SQL (see the sketch after this list)
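A hedged sketch of these patterns from a notebook: extracting external JSON with a CTAS statement, upserting with MERGE, and wrapping SQL in a reusable Python function. The paths, tables, and columns are hypothetical.

# Extract: create a Delta table directly from files with a CTAS statement.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bronze AS
    SELECT * FROM json.`/mnt/raw/sales/`
""")

# Load and update: MERGE upserts incoming changes into the target table.
spark.sql("""
    MERGE INTO sales_bronze t
    USING sales_updates s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# "Python-wrapped SQL": a small helper that makes a query pattern reusable.
def summarize(table: str, group_col: str):
    """Per-group row counts for any table and column."""
    return spark.sql(f"SELECT {group_col}, COUNT(*) AS n FROM {table} GROUP BY {group_col}")

summarize("sales_bronze", "region").show()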
Incremental Data and Delta Live Tables
Incremental Data and Delta Live Tables
Learning Objectives
• Incrementally process data to power analytic insights with Spark Structured Streaming and Auto Loader
• Propagate new data through multiple tables in the data lakehouse
• Leverage Delta Live Tables to simplify productionalizing SQL data pipelines with Databricks
Incremental Data and Delta Live Tables
Agenda
• Incremental Data Processing with Structured Streaming and Auto Loader
  • Processing and aggregating data incrementally in near real time (see the sketch after this list)
• Multi-Hop in the Lakehouse
  • Propagating changes through a series of tables to drive production systems
• Using Delta Live Tables
  • Simplifying deployment of production pipelines and infrastructure using SQL
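As a hedged illustration of Auto Loader, the sketch below incrementally ingests newly arriving JSON files into a bronze table from a notebook where spark is predefined; the source path, checkpoint paths, and table name are hypothetical.

# Auto Loader ("cloudFiles") discovers and ingests only files it has not seen yet.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_schema")
    .load("/mnt/raw/orders/")
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_bronze")
    .outputMode("append")
    .trigger(availableNow=True)   # process everything available, then stop
    .toTable("orders_bronze"))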
Multi-Hop Architecture
Multi-Hop in the Lakehouse
[Diagram: raw CSV, JSON, and TXT files are ingested by Databricks Auto Loader into a Bronze table, refined into Silver and then Gold, with data quality increasing at each hop; the Gold layer feeds streaming analytics, AI, and reporting.]

Multi-Hop in the Lakehouse
Bronze layer
• Typically just a raw copy of ingested data
• Replaces the traditional data lake
• Provides efficient storage and querying of the full, unprocessed history of the data

Multi-Hop in the Lakehouse
Silver layer
• Reduces data storage complexity, latency, and redundancy
• Optimizes ETL throughput and analytic query performance
• Preserves the grain of the original data (without aggregations)
• Eliminates duplicate records
• Production schema enforced
• Data quality checks; corrupt data quarantined

Multi-Hop in the Lakehouse
Gold layer
• Powers ML applications, reporting, dashboards, and ad hoc analytics
• Refined views of data, typically with aggregations
• Reduces strain on production systems
• Optimizes query performance for business-critical data
An end-to-end sketch of these hops follows.

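A hedged end-to-end sketch of the silver and gold hops with Structured Streaming, assuming the orders_bronze table from the earlier Auto Loader sketch; all paths and names are hypothetical, and the transformations stand in for real cleansing and aggregation logic.

from pyspark.sql import functions as F

# Bronze -> Silver: stream from bronze, apply quality checks, deduplicate.
(spark.readStream.table("orders_bronze")
    .filter(F.col("order_id").isNotNull())   # enforce the production schema's key
    .dropDuplicates(["order_id"])            # eliminate duplicate records
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders_silver")
    .trigger(availableNow=True)
    .toTable("orders_silver"))

# Silver -> Gold: business-level aggregates for dashboards and reporting.
(spark.readStream.table("orders_silver")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_sales"))
    .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/sales_gold")
    .outputMode("complete")                  # streaming aggregations use complete mode here
    .trigger(availableNow=True)
    .toTable("sales_gold"))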
Introducing Delta Live Tables
Multi-Hop in the Lakehouse
[Diagram: CSV, JSON, and TXT files flow through Databricks Auto Loader into Bronze (raw ingestion and history), then Silver (filtered, cleaned, augmented), then Gold (business-level aggregates), which powers streaming analytics, AI, and reporting; data quality is enforced along the way.]


The Reality is
Not so Simple
Bronze Silver Gold
©􏰀􏰁􏰀􏰀 Databricks Inc. — All rights reserved

Large-scale ETL is complex and brittle
Complex pipeline development
• Hard to build and maintain table dependencies
• Difficult to switch between batch and stream processing
Data quality and governance
• Difficult to monitor and enforce data quality
• Impossible to trace data lineage
Difficult pipeline operations
• Poor observability at the granular, data level
• Error handling and recovery is laborious
Introducing Delta Live Tables
Make reliable ETL easy on Delta Lake
Operate with agility
• Declarative tools to build batch and streaming data pipelines
Trust your data
• DLT has built-in declarative quality controls
• Declare quality expectations and the actions to take
Scale with reliability
• Easily scale infrastructure alongside your data
A sketch of a DLT pipeline follows.
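A minimal sketch of a DLT pipeline using the Python API (the course itself emphasizes the equivalent SQL syntax); the source path, table names, and expectation are hypothetical. DLT code is declared in a notebook and executed by the DLT runtime, not run interactively.

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader.")
def orders_bronze():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/orders/"))

# A declarative quality control: rows violating the expectation are dropped,
# and violation counts surface in the pipeline's event log.
@dlt.table(comment="Cleaned orders.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read_stream("orders_bronze").select("order_id", "region", "amount")

@dlt.table(comment="Business-level aggregates.")
def sales_gold():
    return dlt.read("orders_silver").groupBy("region").agg(F.sum("amount").alias("total_sales"))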
Managing Data Access and Production Pipelines
Managing Data Access and Production Pipelines
Learning Objectives
• Orchestrate tasks with Databricks Jobs (see the sketch after this list)
• Use Databricks SQL for on-demand queries
• Configure Databricks Access Control Lists to provide groups with secure access to production and development databases
• Configure and schedule dashboards and alerts to reflect updates to production data pipelines
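As a hedged illustration of multi-task orchestration, this sketch creates a two-task job through the Jobs 2.1 REST API, where the second task depends on the first. The workspace URL, token, notebook paths, and cluster ID are placeholders; the same job can also be built in the Jobs UI.

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/Production/etl/ingest"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "notebook_task": {"notebook_path": "/Repos/Production/etl/transform"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())   # returns the new job_id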
Managing Data Access and Production Pipelines
Agenda
• Task Orchestration with Databricks Jobs
  • Scheduling notebooks and DLT pipelines with dependencies
• Running Your First Databricks SQL Query
  • Navigating, configuring, and executing queries in Databricks SQL
• Managing Permissions in the Lakehouse
  • Configuring permissions for databases, tables, and views in the data lakehouse (see the sketch after this list)
• Productionalizing Dashboards and Queries in DBSQL
  • Scheduling queries, dashboards, and alerts for end-to-end analytic pipelines
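A hedged sketch of granting access with SQL run through spark.sql; the database, table, and group names are hypothetical, and the exact privilege names differ slightly between the legacy table-ACL model and Unity Catalog.

# Read-only access to production data for an analyst group.
spark.sql("GRANT USAGE ON DATABASE prod_sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE prod_sales.orders_gold TO `analysts`")

# Full privileges on the development database for engineers.
spark.sql("GRANT ALL PRIVILEGES ON DATABASE dev_sales TO `data-engineers`")

# Review what has been granted (SHOW GRANTS in Unity Catalog; SHOW GRANT in the legacy model).
spark.sql("SHOW GRANTS ON TABLE prod_sales.orders_gold").show(truncate=False)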
Introducing Unity Catalog
Data Governance Overview
Four key functional areas
• Data access control: control who has access to which data
• Data access audit: capture and record all access to data
• Data lineage: capture upstream sources and downstream consumers
• Data discovery: ability to search for and discover authorized assets
Data Governance Overview
Challenges
[Diagram: structured, semi-structured, unstructured, and streaming data spread across Cloud 1, Cloud 2, and Cloud 3, consumed by data analysts, data scientists, data engineers, and machine learning workloads; each combination is a separate governance problem.]

Databricks Unity Catalog
Overview
• Unify governance across clouds: fine-grained governance for data lakes across clouds, based on open standard ANSI SQL
• Unify data and AI assets: centrally share, audit, secure, and manage all data types with one simple interface
• Unify existing catalogs: works in concert with existing data, storage, and catalogs; no hard migration required

Databricks Unity Catalog
Three-level namespace
Traditional two-level namespace:
SELECT * FROM schema.table
Three-level namespace with Unity Catalog:
SELECT * FROM catalog.schema.table
A short usage sketch follows.

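A brief hedged sketch of the two addressing styles from a notebook, assuming hypothetical main, sales, and orders names.

# Fully qualified three-level name: catalog.schema.table.
df = spark.table("main.sales.orders")

# Or set defaults once, then use shorter names.
spark.sql("USE CATALOG main")
spark.sql("USE SCHEMA sales")
df = spark.table("orders")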
Databricks Unity Catalog
Security Model
Traditional query lifecycle (SELECT * FROM table):
1. The user submits a query from the workspace to a cluster or SQL endpoint
2. The cluster checks grants against the table ACLs in the Hive metastore
3. The cluster looks up the table location in the metastore
4. The metastore returns the path to the table
5. The cluster reads cloud storage using cloud-specific credentials
6. The cluster filters out unauthorized data before returning results

Databricks Unity Catalog
Security Model
Query lifecycle with Unity Catalog (SELECT * FROM table):
1. The user submits a query from the workspace to a cluster or SQL endpoint
2. The cluster checks the namespace against Unity Catalog
3. Unity Catalog writes the access to the audit log
4. Unity Catalog checks grants and Delta metadata
5. Unity Catalog obtains cloud-specific credentials
6. Unity Catalog returns URLs and short-lived tokens to the cluster
7. The cluster ingests the data from the array of URLs
8. The cluster filters out unauthorized data before returning results

Course Recap
Course Objectives
• Leverage the Databricks Lakehouse Platform to perform core responsibilities for data pipeline development
• Use SQL and Python to write production data pipelines to extract, transform, and load data into tables and views in the lakehouse
• Simplify data ingestion and incremental change propagation using Databricks-native features and syntax
• Orchestrate production pipelines to deliver fresh results for ad-hoc analytics and dashboarding
