©2021 Databricks Inc. — All rights reserved
Deep Dive:
Delta Live Tables
©2021 Databricks Inc. — All rights reserved
What’s the problem with
Data Engineering?
©2021 Databricks Inc. — All rights reserved
We know data is critical to business outcomes
[Diagram] Business objectives: customer experience, product/service innovation, operational efficiency, revenue growth. Analytics & AI: self-service analytics, predictive analytics, data-driven decisions. Governance & compliance: GDPR, CCPA, BCBS 239, HIPAA. Digital modernization: data warehouse / lake convergence, data migrations. All of these depend on DATA.
©2021 Databricks Inc. — All rights reserved
[Diagram] Data sources (streaming sources, cloud object stores, SaaS applications, NoSQL, relational databases, on-premises systems) feed structured, semi-structured, and unstructured data through many ETL tools and task flows (Azure Synapse, AWS Glue, Azure Data Factory, home-grown ETL, code-generated ETL, AWS EMR) into a cloud data lake, which in turn serves business insights, analytics, machine learning, streaming analytics, and data sharing.
But there is complexity in the data delivery.
©2021 Databricks Inc. — All rights reserved
How does Databricks Help?
©2021 Databricks Inc. — All rights reserved
Databricks Lakehouse Platform is the foundation for Data Engineering
[Diagram] The Lakehouse Platform spans Data Warehousing, Data Engineering, Data Science and ML, and Data Streaming over all structured and unstructured data in the cloud data lake. Unity Catalog provides fine-grained governance for data and AI, and Delta Lake provides data reliability and performance.
©2021 Databricks Inc. — All rights reserved
Delta Live Tables
CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files ("/raw_data", "json")
CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data
The best way to do ETL on the lakehouse
• Accelerate ETL development: declare your pipeline in SQL or Python (see the Python sketch below) and DLT automatically orchestrates the DAG, handles retries, and adapts to changing data
• Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
• Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
• Unify batch and streaming: get the simplicity of SQL with the freshness of streaming through one unified API
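For comparison, here is a minimal Python sketch of the same two-table pipeline using the dlt module in a DLT notebook (where spark is predefined); the names and path simply mirror the SQL above and are illustrative.

import dlt

@dlt.table(comment="Raw JSON files ingested incrementally with Auto Loader")
def raw_data():
    # Incrementally pick up new files as they arrive in cloud storage
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw_data"))

@dlt.table(comment="Cleaned records derived from raw_data")
def clean_data():
    # Referencing raw_data lets DLT infer the dependency and build the DAG
    return dlt.read("raw_data")  # add transformations here, mirroring the elided SELECT above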
©2021 Databricks Inc. — All rights reserved
Bronze
Zone
Silver
Zone
Gold
Zone
Analytics
Machine Learning
Business Insights
Data Quality
Data Quality
Data Transformation
Continuous Batch
or Stream
Processing
Error Handling and
Automatic
Recovery
Data Pipeline
Observability
Automatic
Deployments &
Operations
Orchestrate
Data Pipelines
Databricks Lakehouse Platform
Continuous or
Scheduled Ingest
Business Level
Aggregates
Operational Apps
Photon
UNITY CATALOG
Build Production ETL Pipelines with DLT
©2021 Databricks Inc. — All rights reserved
Key Differentiators
©2021 Databricks Inc. — All rights reserved
Continuous or scheduled data ingestion
● Incrementally and efficiently process new data files as they arrive in cloud storage using Auto Loader
● Automatically infer the schema of incoming files, or superimpose what you know with schema hints
● Automatic schema evolution
● Rescued data column: never lose data again
Schema evolution is supported for JSON, CSV, Avro, and Parquet sources.
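As a rough illustration of these options together (the path, table name, and schema hint below are hypothetical):

import dlt

@dlt.table(comment="Raw ingest with Auto Loader schema inference, hints, and rescued data")
def orders_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            # Superimpose what you know on top of the inferred schema
            .option("cloudFiles.schemaHints", "order_id BIGINT, order_ts TIMESTAMP")
            # Evolve the schema automatically as new columns appear
            .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
            # Records that do not match the schema are captured here instead of being lost
            .option("cloudFiles.rescuedDataColumn", "_rescued_data")
            .load("/landing/orders"))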
©2021 Databricks Inc. — All rights reserved
● Use intent-driven, declarative development to abstract away the “how” and define the “what” to solve
● Automatically generate lineage based on table dependencies across the data pipeline
● Automatically check for errors such as missing dependencies and syntax errors
/* Create a temporary view over the incoming account files */
CREATE STREAMING LIVE VIEW account_raw AS
SELECT * FROM cloud_files("/data", "csv");
/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING LIVE TABLE account_bronze
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM STREAM(LIVE.account_raw) ...
/* Stage 2: Send rows to Silver, run validation rules */
CREATE STREAMING LIVE TABLE account_silver
COMMENT "Silver accounts table with validation checks"
AS SELECT * FROM STREAM(LIVE.account_bronze) ...
[Pipeline diagram: Source → Bronze → Silver → Gold]
Declarative SQL & Python APIs
©2021 Databricks Inc. — All rights reserved
[Diagram] Change records from data sources (streaming sources, cloud object stores, structured, semi-structured, and unstructured data, data migration services) are upserted via CDC from Bronze into Silver tables.
● Stream change records (inserts, updates, deletes) from any data source supported by the Databricks Runtime (DBR), cloud storage, or DBFS
● Simple, declarative "APPLY CHANGES INTO" API for SQL or Python (a Python sketch follows below)
● Handles out-of-order events
● Schema evolution
● SCD Type 2 support
Change data capture (CDC)
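A minimal Python sketch of this CDC flow, assuming a hypothetical users_cdc_raw source with userId, sequenceNum, and operation columns (helper names can vary between DLT releases):

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table that APPLY CHANGES will maintain
dlt.create_streaming_table("users_silver")

dlt.apply_changes(
    target="users_silver",
    source="users_cdc_raw",            # hypothetical stream of raw change records
    keys=["userId"],                   # key used to match incoming changes to rows
    sequence_by=col("sequenceNum"),    # orders out-of-order events correctly
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "sequenceNum"],
    stored_as_scd_type=2               # keep full history as SCD Type 2
)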
©2021 Databricks Inc. — All rights reserved
Data quality validation and monitoring
● Define data quality and integrity
controls within the pipeline with data
expectations
● Address data quality errors with flexible policies: fail, drop, alert, quarantine (future)
● All data pipeline runs and quality metrics are captured, tracked, and reported
/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING LIVE TABLE fire_account_bronze (
  CONSTRAINT valid_account_open_dt EXPECT (account_open_dt IS NOT NULL
    AND (account_close_dt > account_open_dt)) ON VIOLATION DROP ROW
)
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM STREAM(LIVE.fire_account_raw) ...
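The same constraint expressed with the Python expectations API, as an illustrative sketch (table and column names mirror the SQL above):

import dlt

@dlt.table(comment="Bronze table with valid account ids")
@dlt.expect_or_drop(
    "valid_account_open_dt",
    "account_open_dt IS NOT NULL AND account_close_dt > account_open_dt")
def fire_account_bronze():
    # Rows that violate the expectation are dropped; metrics land in the event log
    return dlt.read_stream("fire_account_raw")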
©2021 Databricks Inc. — All rights reserved
Data pipeline observability
• High-quality, high-fidelity lineage diagram
that provides visibility into how data flows
for impact analysis
• Granular logging of operational, governance, quality, and status information for the data pipeline, down to the row level
• Continuously monitor data pipeline jobs to
ensure continued operation
• Notifications using Databricks SQL
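These metrics land in the pipeline's event log. As a hedged sketch for a pipeline configured with a storage location (the path below is hypothetical), the event log is a Delta table that can be queried directly from a notebook:

# Read the DLT event log stored under the pipeline's storage location
pipeline_storage = "/pipelines/my_dlt_pipeline"   # hypothetical storage location
events = spark.read.format("delta").load(f"{pipeline_storage}/system/events")

# Inspect flow-progress events, which carry the data quality metrics
(events
 .where("event_type = 'flow_progress'")
 .select("timestamp", "origin.flow_name", "details")
 .orderBy("timestamp", ascending=False)
 .show(truncate=False))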
©2021 Databricks Inc. — All rights reserved
• Develop in environments separate from production, with the ability to easily test pipelines before deploying, entirely in SQL
• Deploy and manage environments using parameterization (see the sketch below)
• Unit testing and documentation
• Enables a metadata-driven ability to programmatically scale to 100s of tables/pipelines dynamically
[Diagram] Lineage information is captured and used to keep data fresh anywhere: the same raw → clean → scored pipeline is promoted from Development to Staging to Production.
Automated ETL development lifecycle
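Parameterization typically means setting key-value pairs in the pipeline configuration and reading them from pipeline code, so the same source can be promoted unchanged across environments; a minimal sketch (the configuration key mypipeline.source_path is hypothetical):

import dlt

@dlt.table
def raw_events():
    # Only the pipeline configuration value for "mypipeline.source_path"
    # differs between development, staging, and production
    source_path = spark.conf.get("mypipeline.source_path")
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(source_path))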
©2021 Databricks Inc. — All rights reserved
• Reduce downtime with automatic error handling and easy replay
• Eliminate maintenance with automatic optimizations of all Delta Live Tables
• Auto-scaling adds more resources automatically when needed
Automated ETL operations
©2021 Databricks Inc. — All rights reserved
▪ Easily orchestrate DLT pipelines and tasks in the same DAG
▪ Fully integrated with the Databricks platform, making it faster to inspect results and debug
▪ Orchestrate and manage workloads in multi-cloud environments
▪ You can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks Jobs, Apache Airflow, or Azure Data Factory (see the Airflow sketch below)
[Diagram] A multi-task job DAG that mixes regular tasks with DLT pipeline tasks.
Multi-Task Jobs Orchestration
Simplify orchestration and management of data pipelines
Workflow Orchestration
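As a rough sketch of the Airflow route (assuming the apache-airflow-providers-databricks package, a configured Databricks connection, and a placeholder pipeline ID; the exact payload shape may differ by provider version):

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="dlt_pipeline_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Trigger an update of an existing DLT pipeline as a one-time run
    run_dlt = DatabricksSubmitRunOperator(
        task_id="run_dlt_pipeline",
        databricks_conn_id="databricks_default",
        json={"pipeline_task": {"pipeline_id": "<your-pipeline-id>"}},
    )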
©2021 Databricks Inc. — All rights reserved
• Built to handle streaming workloads, which are spiky and unpredictable
• Shuts down nodes when utilization is low while guaranteeing task execution
• Only scales up to the needed number of nodes
Enhanced Autoscaling
Save infrastructure costs while maintaining end-to-end latency SLAs for streaming workloads
[Diagram] Backlog monitoring of the streaming source and utilization monitoring of the Spark executors drive a scale-down when the backlog is small or empty and utilization is low.
Availability: AWS: Generally Available; Azure: Generally Available; GCP: Public Preview (GA coming soon)
Problem: optimize infrastructure spend when making scaling decisions for streaming workloads
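Enhanced autoscaling is enabled per pipeline through its cluster settings; a hedged sketch of the relevant fragment of a pipeline's JSON settings, written here as a Python dict (worker counts are illustrative):

# Fragment of DLT pipeline settings enabling enhanced autoscaling
pipeline_settings = {
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,     # lower bound of the cluster
                "max_workers": 5,     # only scales up to the needed number of nodes
                "mode": "ENHANCED",   # opt in to DLT enhanced autoscaling
            },
        }
    ]
}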
©2021 Databricks Inc. — All rights reserved
Customers
©2021 Databricks Inc. — All rights reserved
1.3 trillion rows of sensor data processed efficiently
86% reduction in time to production
Saved immense data management time and effort
Enabled data analysts to build their own data pipelines with SQL
Enabled the NextGen self-service data quality platform
Supports a 100+ table pipeline in one managed job: time and money savings
Customers Save Time with Delta Live Tables
©2021 Databricks Inc. — All rights reserved
Rivian uses Delta Live Tables to ingest and analyze data from car service stations, using this data to gain insights into issue types, what parts are being replaced, regulatory reporting, and part replacement forecasting.
Service health and
vehicle reliability
“It's so intuitive that even somebody with only moderate Python
skills can create efficient, powerful data pipelines with relative ease”
- Tom Renish, Principal Data Architect, Rivian
©2021 Databricks Inc. — All rights reserved
“At ADP, we are migrating our human resource
management data to an integrated data store on
the Lakehouse. Delta Live Tables has helped our
team build in quality controls, and because of
the declarative APIs, support for batch and
real-time using only SQL, it has enabled our
team to save time and effort in managing our
data."
Jack Berkowitz, CDO, ADP
©2021 Databricks Inc. — All rights reserved
Use Case + Challenge
• 70+ use cases impacting supply chain, operations, product development, marketing, and customer experience
• Large volumes of IoT data
from millions of sensors
difficult to harness for
actionable insights and ML
due to operational load
created by complex data
pipelines
Why Databricks + DLT?
• Lakehouse for unified data
warehousing, BI, & ML —
enabling new use cases not
possible before
• DLT enables Shell to build
reliable and scalable data
pipelines - automatic job
maintenance and deep
pipeline visibility saves time
and resources
Impact of DLT
• Process 1.3 trillion rows of
sensor data with ease
• Simplifying ETL development
and management for faster
insights and ML innovation
“Delta Live Tables has helped our teams save time and effort in managing data at this scale. With this capability augmenting the
existing lakehouse architecture, Databricks is disrupting the ETL and data warehouse markets, which is important for companies
like ours. We are excited to continue to work with Databricks as an innovation partner.” - Dan Jeavons, GM Data Science
©2021 Databricks Inc. — All rights reserved
Shell Developers share their thoughts
“New gold standard for data pipelines”
“Delta Live Tables makes it easier for us to build
intelligence into our data ingestion process”
“Delta maintenance tasks are no longer an
afterthought for developers”
“Expectations allows us to trust the data”
©2021 Databricks Inc. — All rights reserved
Use Case + Challenge
• Real-time insights for real
estate investors
• Holistic view of real estate
insights for informed real
estate buying and selling
decisions
• Processing hundreds of millions of records on an increasingly complex and bogged-down architecture
Why Databricks + DLT?
• Lakehouse architecture and
DLT frees up Audantic’s
data teams from focusing
on infrastructure so they
can innovate more easily
• DLT allows them to build
and manage more reliable
data pipelines that deliver
high-quality data in a much
more streamlined way
Impact of DLT
• 86% reduction in
time-to-market for new ML
solutions due to shorter
development time
• 33% fewer lines of code
required
• Productivity value: $300k
“Delta Live Tables is enabling us to do some things on the scale and performance side that we haven’t been able to do before,”
explained Lowery. “We now run our pipelines on a daily basis compared to a weekly or even monthly basis before — that's an order
of magnitude improvement.” - Joel Lowery, Chief Information Officer at Audantic
©2021 Databricks Inc. — All rights reserved
Thank you