©2021 Databricks Inc. — All rights reserved
Deep Dive:
Delta Live Tables
©2021 Databricks Inc. — All rights reserved
What’s the problem with
Data Engineering?
©2021 Databricks Inc. — All rights reserved
We know data is critical to business outcomes
[Diagram] Business objectives: customer experience, product/service innovation, operational efficiency, revenue growth. Analytics & AI: self-service analytics, predictive analytics, data-driven decisions. Governance & compliance: GDPR, CCPA, BCBS 239, HIPAA. Digital modernization: data warehouse / lake convergence, data migrations. All of these depend on DATA.
©2021 Databricks Inc. — All rights reserved
[Diagram] Data sources (streaming sources, cloud object stores, SaaS applications, NoSQL, relational databases, on-premises systems) feed structured, semi-structured, and unstructured data through many ETL tools and task flows (Azure Synapse, AWS Glue, Azure Data Factory, home-grown ETL, code-generated ETL, AWS EMR) into a cloud data lake, which in turn serves business insights, analytics, machine learning, streaming analytics, and data sharing.
But there is complexity in the data delivery.
©2021 Databricks Inc. — All rights reserved
How does Databricks Help?
©2021 Databricks Inc. — All rights reserved
Databricks Lakehouse Platform is the foundation for Data Engineering
[Diagram] The Lakehouse Platform spans Data Warehousing, Data Engineering, Data Science and ML, and Data Streaming over all structured and unstructured data in the cloud data lake. Unity Catalog provides fine-grained governance for data and AI, and Delta Lake provides data reliability and performance.
©2021 Databricks Inc. — All rights reserved
Delta Live Tables
CREATE STREAMING LIVE TABLE raw_data
AS SELECT *
FROM cloud_files ("/raw_data", "json")
CREATE LIVE TABLE clean_data
AS SELECT …
FROM LIVE.raw_data
The best way to do ETL on the lakehouse
• Accelerate ETL development: declare your pipeline in SQL or Python (see the Python sketch below) and DLT automatically orchestrates the DAG, handles retries, and adapts to changing data
• Automatically manage your infrastructure: automates complex, tedious activities like recovery, auto-scaling, and performance optimization
• Ensure high data quality: deliver reliable data with built-in quality controls, testing, monitoring, and enforcement
• Unify batch and streaming: get the simplicity of SQL with the freshness of streaming through one unified API
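For comparison, here is a minimal Python sketch of the same two-table pipeline using the dlt module in a DLT notebook (where spark is predefined); the names and path simply mirror the SQL above and are illustrative.

import dlt

@dlt.table(comment="Raw JSON files ingested incrementally with Auto Loader")
def raw_data():
    # Incrementally pick up new files as they arrive in cloud storage
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw_data"))

@dlt.table(comment="Cleaned records derived from raw_data")
def clean_data():
    # Referencing raw_data lets DLT infer the dependency and build the DAG
    return dlt.read("raw_data")  # add transformations here, mirroring the elided SELECT above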
©2021 Databricks Inc. — All rights reserved
Bronze
Zone
Silver
Zone
Gold
Zone
Analytics
Machine Learning
Business Insights
Data Quality
Data Quality
Data Transformation
Continuous Batch
or Stream
Processing
Error Handling and
Automatic
Recovery
Data Pipeline
Observability
Automatic
Deployments &
Operations
Orchestrate
Data Pipelines
Databricks Lakehouse Platform
Continuous or
Scheduled Ingest
Business Level
Aggregates
Operational Apps
Photon
UNITY CATALOG
Build Production ETL Pipelines with DLT
©2021 Databricks Inc. — All rights reserved
Key Differentiators
©2021 Databricks Inc. — All rights reserved
Continuous or scheduled data ingestion
● Incrementally and efficiently process new data files as they arrive in cloud storage using Auto Loader
● Automatically infer the schema of incoming files, or superimpose what you know with schema hints
● Automatic schema evolution
● Rescued data column: never lose data again
Schema evolution is supported for JSON, CSV, Avro, and Parquet sources.
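As a rough illustration of these options together (the path, table name, and schema hint below are hypothetical):

import dlt

@dlt.table(comment="Raw ingest with Auto Loader schema inference, hints, and rescued data")
def orders_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            # Superimpose what you know on top of the inferred schema
            .option("cloudFiles.schemaHints", "order_id BIGINT, order_ts TIMESTAMP")
            # Evolve the schema automatically as new columns appear
            .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
            # Records that do not match the schema are captured here instead of being lost
            .option("cloudFiles.rescuedDataColumn", "_rescued_data")
            .load("/landing/orders"))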
©2021 Databricks Inc. — All rights reserved
● Use intent-driven, declarative development to abstract away the “how” and define the “what” to solve
● Automatically generate lineage based on table dependencies across the data pipeline
● Automatically check for errors such as missing dependencies and syntax errors
/* Create a temporary view over the incoming account files */
CREATE STREAMING LIVE VIEW account_raw AS
SELECT * FROM cloud_files("/data", "csv");
/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING LIVE TABLE account_bronze
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM STREAM(LIVE.account_raw) ...
/* Stage 2: Send rows to Silver, run validation rules */
CREATE STREAMING LIVE TABLE account_silver
COMMENT "Silver accounts table with validation checks"
AS SELECT * FROM STREAM(LIVE.account_bronze) ...
[Pipeline diagram: Source → Bronze → Silver → Gold]
Declarative SQL & Python APIs
©2021 Databricks Inc. — All rights reserved
[Diagram] Change records from data sources (streaming sources, cloud object stores, structured, semi-structured, and unstructured data, data migration services) are upserted via CDC from Bronze into Silver tables.
● Stream change records (inserts, updates, deletes) from any data source supported by the Databricks Runtime (DBR), cloud storage, or DBFS
● Simple, declarative "APPLY CHANGES INTO" API for SQL or Python (a Python sketch follows below)
● Handles out-of-order events
● Schema evolution
● SCD Type 2 support
Change data capture (CDC)
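A minimal Python sketch of this CDC flow, assuming a hypothetical users_cdc_raw source with userId, sequenceNum, and operation columns (helper names can vary between DLT releases):

import dlt
from pyspark.sql.functions import col, expr

# Declare the target streaming table that APPLY CHANGES will maintain
dlt.create_streaming_table("users_silver")

dlt.apply_changes(
    target="users_silver",
    source="users_cdc_raw",            # hypothetical stream of raw change records
    keys=["userId"],                   # key used to match incoming changes to rows
    sequence_by=col("sequenceNum"),    # orders out-of-order events correctly
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "sequenceNum"],
    stored_as_scd_type=2               # keep full history as SCD Type 2
)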
©2021 Databricks Inc. — All rights reserved
Data quality validation and monitoring
● Define data quality and integrity
controls within the pipeline with data
expectations
● Address data quality errors with flexible policies: fail, drop, alert, quarantine (future)
● All data pipeline runs and quality metrics are captured, tracked, and reported
/* Stage 1: Bronze table, drop invalid rows */
CREATE STREAMING LIVE TABLE fire_account_bronze (
  CONSTRAINT valid_account_open_dt EXPECT (account_open_dt IS NOT NULL
    AND (account_close_dt > account_open_dt)) ON VIOLATION DROP ROW
)
COMMENT "Bronze table with valid account ids"
AS SELECT * FROM STREAM(LIVE.fire_account_raw) ...
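The same constraint expressed with the Python expectations API, as an illustrative sketch (table and column names mirror the SQL above):

import dlt

@dlt.table(comment="Bronze table with valid account ids")
@dlt.expect_or_drop(
    "valid_account_open_dt",
    "account_open_dt IS NOT NULL AND account_close_dt > account_open_dt")
def fire_account_bronze():
    # Rows that violate the expectation are dropped; metrics land in the event log
    return dlt.read_stream("fire_account_raw")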
©2021 Databricks Inc. — All rights reserved
Data pipeline observability
• High-quality, high-fidelity lineage diagram
that provides visibility into how data flows
for impact analysis
• Granular logging of operational, governance, quality, and status information for the data pipeline, down to the row level
• Continuously monitor data pipeline jobs to
ensure continued operation
• Notifications using Databricks SQL
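These metrics land in the pipeline's event log. As a hedged sketch for a pipeline configured with a storage location (the path below is hypothetical), the event log is a Delta table that can be queried directly from a notebook:

# Read the DLT event log stored under the pipeline's storage location
pipeline_storage = "/pipelines/my_dlt_pipeline"   # hypothetical storage location
events = spark.read.format("delta").load(f"{pipeline_storage}/system/events")

# Inspect flow-progress events, which carry the data quality metrics
(events
 .where("event_type = 'flow_progress'")
 .select("timestamp", "origin.flow_name", "details")
 .orderBy("timestamp", ascending=False)
 .show(truncate=False))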
©2021 Databricks Inc. — All rights reserved
• Develop in environments separate from production, with the ability to easily test pipelines before deploying, entirely in SQL
• Deploy and manage environments using parameterization (see the sketch below)
• Unit testing and documentation
• Enables a metadata-driven ability to programmatically scale to 100s of tables/pipelines dynamically
[Diagram] Lineage information is captured and used to keep data fresh anywhere: the same raw → clean → scored pipeline is promoted from Development to Staging to Production.
Automated ETL development lifecycle
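Parameterization typically means setting key-value pairs in the pipeline configuration and reading them from pipeline code, so the same source can be promoted unchanged across environments; a minimal sketch (the configuration key mypipeline.source_path is hypothetical):

import dlt

@dlt.table
def raw_events():
    # Only the pipeline configuration value for "mypipeline.source_path"
    # differs between development, staging, and production
    source_path = spark.conf.get("mypipeline.source_path")
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(source_path))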
©2021 Databricks Inc. — All rights reserved
• Reduce downtime with automatic error handling and easy replay
• Eliminate maintenance with automatic optimizations of all Delta Live Tables
• Auto-scaling adds more resources automatically when needed
Automated ETL operations
©2021 Databricks Inc. — All rights reserved
▪ Easily orchestrate DLT pipelines and tasks in the same DAG
▪ Fully integrated with the Databricks platform, making it faster to inspect results and debug
▪ Orchestrate and manage workloads in multi-cloud environments
▪ You can run a Delta Live Tables pipeline as part of a data processing workflow with Databricks Jobs, Apache Airflow, or Azure Data Factory (see the Airflow sketch below)
[Diagram] A multi-task job DAG that mixes regular tasks with DLT pipeline tasks.
Multi-Task Jobs Orchestration
Simplify orchestration and management of data pipelines
Workflow Orchestration
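As a rough sketch of the Airflow route (assuming the apache-airflow-providers-databricks package, a configured Databricks connection, and a placeholder pipeline ID; the exact payload shape may differ by provider version):

from datetime import datetime
from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="dlt_pipeline_example",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Trigger an update of an existing DLT pipeline as a one-time run
    run_dlt = DatabricksSubmitRunOperator(
        task_id="run_dlt_pipeline",
        databricks_conn_id="databricks_default",
        json={"pipeline_task": {"pipeline_id": "<your-pipeline-id>"}},
    )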
©2021 Databricks Inc. — All rights reserved
• Built to handle streaming workloads, which are spiky and unpredictable
• Shuts down nodes when utilization is low while guaranteeing task execution
• Only scales up to the needed number of nodes
Enhanced Autoscaling
Save infrastructure costs while maintaining end-to-end latency SLAs for streaming workloads
[Diagram] Backlog monitoring of the streaming source and utilization monitoring of the Spark executors drive a scale-down when the backlog is small or empty and utilization is low.
Availability: AWS: Generally Available; Azure: Generally Available; GCP: Public Preview (GA coming soon)
Problem: optimize infrastructure spend when making scaling decisions for streaming workloads
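Enhanced autoscaling is enabled per pipeline through its cluster settings; a hedged sketch of the relevant fragment of a pipeline's JSON settings, written here as a Python dict (worker counts are illustrative):

# Fragment of DLT pipeline settings enabling enhanced autoscaling
pipeline_settings = {
    "clusters": [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,     # lower bound of the cluster
                "max_workers": 5,     # only scales up to the needed number of nodes
                "mode": "ENHANCED",   # opt in to DLT enhanced autoscaling
            },
        }
    ]
}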
©2021 Databricks Inc. — All rights reserved
Customers
©2021 Databricks Inc. — All rights reserved
1.3 trillion rows of sensor data processed efficiently
86% reduction in time to production
Saved immense data management time and effort
Enabled data analysts to build their own data pipelines with SQL
Enabled the NextGen self-service data quality platform
Supports a 100+ table pipeline in one managed job: time and money savings
Customers Save Time with Delta Live Tables
©2021 Databricks Inc. — All rights reserved
Rivian uses Delta Live Tables to ingest and analyze data from car service stations, using this data to gain insights into issue types, what parts are being replaced, regulatory reporting, and part replacement forecasting.
Service health and
vehicle reliability
“It's so intuitive that even somebody with only moderate Python
skills can create efficient, powerful data pipelines with relative ease”
- Tom Renish, Principal Data Architect, Rivian
©2021 Databricks Inc. — All rights reserved
“At ADP, we are migrating our human resource
management data to an integrated data store on
the Lakehouse. Delta Live Tables has helped our
team build in quality controls, and because of
the declarative APIs, support for batch and
real-time using only SQL, it has enabled our
team to save time and effort in managing our
data."
Jack Berkowitz, CDO, ADP
©2021 Databricks Inc. — All rights reserved
Use Case + Challenge
• 70+ use cases impacting supply chain, operations, product development, marketing, and customer experience
• Large volumes of IoT data
from millions of sensors
difficult to harness for
actionable insights and ML
due to operational load
created by complex data
pipelines
Why Databricks + DLT?
• Lakehouse for unified data
warehousing, BI, & ML —
enabling new use cases not
possible before
• DLT enables Shell to build
reliable and scalable data
pipelines - automatic job
maintenance and deep
pipeline visibility saves time
and resources
Impact of DLT
• Process 1.3 trillion rows of
sensor data with ease
• Simplifying ETL development
and management for faster
insights and ML innovation
“Delta Live Tables has helped our teams save time and effort in managing data at this scale. With this capability augmenting the
existing lakehouse architecture, Databricks is disrupting the ETL and data warehouse markets, which is important for companies
like ours. We are excited to continue to work with Databricks as an innovation partner.” - Dan Jeavons, GM Data Science
©2021 Databricks Inc. — All rights reserved
Shell Developers share their thoughts
“New gold standard for data pipelines”
“Delta Live Tables makes it easier for us to build
intelligence into our data ingestion process”
“Delta maintenance tasks are no longer an
afterthought for developers”
“Expectations allows us to trust the data”
©2021 Databricks Inc. — All rights reserved
Use Case + Challenge
• Real-time insights for real
estate investors
• Holistic view of real estate
insights for informed real
estate buying and selling
decisions
• Processing hundreds of millions of records on an increasingly complex and bogged-down architecture
Why Databricks + DLT?
• Lakehouse architecture and
DLT frees up Audantic’s
data teams from focusing
on infrastructure so they
can innovate more easily
• DLT allows them to build
and manage more reliable
data pipelines that deliver
high-quality data in a much
more streamlined way
Impact of DLT
• 86% reduction in
time-to-market for new ML
solutions due to shorter
development time
• 33% fewer lines of code
required
• Productivity value: $300k
“Delta Live Tables is enabling us to do some things on the scale and performance side that we haven’t been able to do before,”
explained Lowery. “We now run our pipelines on a daily basis compared to a weekly or even monthly basis before — that's an order
of magnitude improvement.” - Joel Lowery, Chief Information Officer at Audantic
©2021 Databricks Inc. — All rights reserved
Thank you