Delta Change Data Feed
Joseph Torres
Rahul Mahadev
Itai Weiss
Agenda
▪ Data capture challenges
▪ Change Data Feed on the Lakehouse
▪ Capture Changes
▪ Process Changes
▪ Demo
Who are we? - Jose Torres
● Software Engineer - Databricks
● Committer - Delta Lake, Apache Spark
● Database and cooking enthusiast
Who are we? - Rahul Mahadev
● Software Engineer - Databricks
● Delta Lake Committer
● MS Computer Science, University of Illinois Urbana-Champaign
Who are we? - Itai Weiss
• Lead Solution Architect - Databricks
• Working with Apache Spark since v1.6
• Consulted for numerous firms across Financial, Insurance, Tech, Pharma, Manufacturing, Retail and Transportation
Delta Lake
An open, reliable, performant, and secure data storage and management layer for your data
■ Fresh & reliable data with a single source of truth
■ Data warehouse performance with data lake economics
■ Advanced security and standards to meet compliance needs
■ Open format
Current Challenges
■ Big data increases the complexity of change capture:
• Lots of data
• Changes infrequently
• Just want to process the newest changes
[Diagram: without CDF, Delta table v-1 and Delta table v are compared with a full outer join, and the resulting Spark inserts, updates, and deletes are unioned back together]
Solve data challenges with CDF
■ Read only the changed data
■ Avoid full table scans
■ Reduce compute and memory
For all your use cases:
• Improve ETL pipelines
• Unify batch and streaming
• BI on your data lake
• Meet regulatory needs
[Diagram: a Delta table with CDF feeds the Delta Change Data Feed directly, in place of the Spark union of separately computed changes]
Where Delta Change Data Feed Applies
[Diagram: Raw Ingestion and History (BRONZE) → Filtered, Cleaned, Augmented (SILVER) → Business-level Aggregates (GOLD), with CDF between the stages; external feeds, other CDC output, and extracts enter the pipeline]
How Does Delta Change Data Feed Work?
Original Table (v1)
  PK | B
  A1 | B1
  A2 | B2
  A3 | B3

Change data (merged as v2)
  PK | B
  A2 | Z2
  A3 | B3
  A4 | B4

Change Data Feed Output
  PK | B  | Change Type | Time     | Version
  A2 | B2 | Preimage    | 12:00:00 | 2
  A2 | Z2 | Postimage   | 12:00:00 | 2
  A3 | B3 | Delete      | 12:00:00 | 2
  A4 | B4 | Insert      | 12:00:00 | 2

The A1 record did not receive an update or delete, so it will not be output by CDF.
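To make the slide's example concrete, here is a minimal sketch in PySpark. It assumes a Databricks or Delta Lake environment where the table_changes SQL function from the later slides is available; the table name "target" and the MERGE source are illustrative, not from the original deck.

# Minimal sketch of the slide's example. Assumes table_changes is available;
# the table name and MERGE source are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# v0: create the table with CDF enabled; v1: the original rows A1..A3
spark.sql("""
    CREATE OR REPLACE TABLE target (PK STRING, B STRING)
    USING delta
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")
spark.sql("INSERT INTO target VALUES ('A1', 'B1'), ('A2', 'B2'), ('A3', 'B3')")

# v2: merge the change data -- update A2, delete A3, insert A4
spark.sql("""
    MERGE INTO target t
    USING (
      SELECT * FROM VALUES
        ('A2', 'Z2', 'update'),
        ('A3', 'B3', 'delete'),
        ('A4', 'B4', 'insert') AS v(PK, B, op)
    ) s
    ON t.PK = s.PK
    WHEN MATCHED AND s.op = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET t.B = s.B
    WHEN NOT MATCHED THEN INSERT (PK, B) VALUES (s.PK, s.B)
""")

# Read the change feed for version 2: pre/post image for A2, delete for A3,
# insert for A4. A1 was untouched, so it does not appear.
spark.sql("SELECT * FROM table_changes('target', 2, 2)").show()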
Consuming the Delta Change Data Feed
Stream-based Consumption
● The Delta Change Feed is processed as each source commit completes
● Rapid source commits can result in multiple Delta versions being included in a single micro-batch
Batch Consumption
● Batches are constructed based on time-bound windows, which may contain multiple Delta versions
[Timeline: the same change rows consumed two ways]
  Version 2 @ 12:00:00 — A2 B2 Preimage, A2 Z2 Postimage, A3 B3 Delete, A4 B4 Insert
  Version 3 @ 12:08:00 — A5 B5 Insert
  Version 4 @ 12:09:00 — A6 B6 Insert
  Version 5 @ 12:10:05 — A6 B6 Preimage, A6 Z6 Postimage
Stream - micro batches: each commit is picked up as it completes.
Batch - every 10 mins: the 12:00–12:10 window covers versions 2–4; the 12:10–12:20 window covers version 5.
Both modes are sketched in code below.
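A minimal sketch of the two consumption modes, assuming Delta Lake's CDF reader options (readChangeFeed, startingVersion, startingTimestamp); the table name, checkpoint path, sink table, and timestamps are illustrative.

# Minimal sketch of stream vs. batch consumption of the change feed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream-based: each micro-batch may span one or more source commits.
stream_query = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .table("target")
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/cdf_checkpoint")   # illustrative path
    .toTable("target_changes_stream")
)

# Batch: a time-bound window that may cover several Delta versions,
# e.g. the 12:00-12:10 window from the timeline above (illustrative timestamps).
batch_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2021-05-27 12:00:00")
    .option("endingTimestamp", "2021-05-27 12:10:00")
    .table("target")
)
batch_changes.show()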
Storing the Delta Change Data Feed
Changes to the table are stored as ordered, atomic units called commits:
  000000.json: Add 1.parquet, Add 2.parquet
  000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
  …
Storing the Delta Change Data Feed
● When Change Data Feed is enabled, commits will reference an additional set of files containing the change data events
● These files contain the updated and deleted records
  000000.json: Add 1.parquet, Add 2.parquet
  000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet, Cdf 1.parquet
  …
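As a minimal sketch of this layout (assuming a table stored at a local, illustrative path), you can look at the commit log and the change-data files directly on disk; in Delta Lake the CDF files live in a _change_data folder alongside the regular data files.

# Minimal sketch: inspect the commit log and change-data files of a Delta table.
# The path is illustrative; _change_data appears only once CDF is enabled and an
# update/delete/merge has actually produced change rows.
import os

table_path = "/tmp/delta/target"

# Ordered, atomic commits: one JSON file per commit
log_dir = os.path.join(table_path, "_delta_log")
print(sorted(f for f in os.listdir(log_dir) if f.endswith(".json")))

# Change data files referenced by those commits
cdf_dir = os.path.join(table_path, "_change_data")
if os.path.isdir(cdf_dir):
    print(sorted(os.listdir(cdf_dir)))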
Storing the Delta Change Data Feed
● Silver & Gold Tables
○ Improve Delta performance by processing only data changes, and simplify ETL/ELT operations (sketched after this list)
● Materialized Views
○ Create up-to-date, aggregated views of information based on the data changes
● Transmit Changes
○ Send only the data changes to downstream systems
● Audit Trail Table
○ Capture and store all the data changes over time, including inserts, updates, and deletes
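A minimal sketch of the Silver & Gold use case, with illustrative table and column names (bronze_orders, silver_orders, order_id, amount): only the rows that changed in the Bronze table are applied to the Silver table.

# Minimal sketch of maintaining a Silver table from a Bronze table's change feed.
# Table/column names and the starting version are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)                           # next unprocessed Bronze version
    .table("bronze_orders")
    .filter(F.col("_change_type") != "update_preimage")     # keep final row images only
)
# In practice you would also deduplicate to the latest change per key when the
# window spans several commits that touch the same row.
changes.createOrReplaceTempView("bronze_changes")

spark.sql("""
    MERGE INTO silver_orders s
    USING bronze_changes c
    ON s.order_id = c.order_id
    WHEN MATCHED AND c._change_type = 'delete' THEN DELETE
    WHEN MATCHED THEN UPDATE SET s.amount = c.amount
    WHEN NOT MATCHED AND c._change_type != 'delete'
      THEN INSERT (order_id, amount) VALUES (c.order_id, c.amount)
""")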
When to Use Delta Change Data Feed
Use CDF when:
● Delta changes include updates and/or deletes
● A small fraction of records is updated in each batch
● Data received from external sources is in CDC format
● You send data changes to downstream applications

Don't use CDF when:
● Delta changes are append only
● Most records in the table are updated in each batch
● Data received comprises destructive loads
● You find and ingest data outside of the Lakehouse
Getting started with Delta Change Data Feed
Enable CDF on a TABLE (in SQL, Python, or Scala):
  ALTER TABLE ...
  SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

… or establish CDF as the default on a CLUSTER (in Python or Scala):
  spark.conf.set('spark.databricks.delta.properties.defaults.enableChangeDataFeed', True)
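A third option, not shown on the slide, is enabling CDF when the table is created. A minimal sketch with an illustrative table name, issued through spark.sql:

# Minimal sketch: enable CDF at table-creation time, then verify the property.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
    USING delta
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

spark.sql("SHOW TBLPROPERTIES events").show(truncate=False)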
Using Delta Change Data Feed
Query changes (in SQL, Python, or Scala):
  SELECT … FROM table_changes('tableName', startingVersion [, endingVersion])
  or
  SELECT … FROM table_changes('tableName', 'startingTimestamp' [, 'endingTimestamp'])

… and store them:
  INSERT INTO TABLE ...
  USING delta ...
  AS SELECT … FROM table_changes(...)
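The same queries can be expressed with the DataFrame reader; a minimal sketch with illustrative table names and version range. The result carries the table's own columns plus the CDF metadata columns _change_type, _commit_version, and _commit_timestamp, and can be appended to a downstream table to store the changes.

# Minimal sketch: DataFrame-reader equivalent of the table_changes queries above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)
    .option("endingVersion", 5)
    .table("target")
)
changes.show()

# ... and store them, e.g. as an audit trail of raw change rows
changes.write.format("delta").mode("append").saveAsTable("target_change_audit")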
Getting started with Delta Change Data Feed
Let’s look at some notebooks
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.