Presented By:
Kundan Kumar
Software Consultant
Spark With Delta Lake
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality: Respect KnolX session timings; please do not join a session more than 5 minutes after its start time.
Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode: Keep your mobile devices in silent mode, and feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance: Avoid unwanted chit-chat during the session.
Agenda
01 What & Why Delta Lake
02 Features of Delta Lake
03 Delta Lake Transaction Log
04 Demo
Brings Data Reliability and Performance to Data Lakes
What is a Data Lake?
A data lake is a centralized repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It offers large volumes of data to improve analytic performance and enable native integration.
Why Delta Lake?
1. Data reliability challenges with data lakes: failed production jobs, orphaned data, and no schema enforcement.
2. ACID transactions: a critical feature missing from Spark.
What is Delta Lake?
Delta Lake is an open-source storage layer that sits on top of your existing data lake and is fully compatible with Apache Spark APIs. It brings ACID transactions to Apache Spark and big data workloads.
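As a minimal sketch of what this looks like with the Scala API (the table path, column name, and the SparkSession configuration below are illustrative and assume a recent Delta Lake release):

import org.apache.spark.sql.SparkSession

// A SparkSession with the Delta Lake extensions enabled.
// The table path /tmp/delta/events is a placeholder.
val spark = SparkSession.builder()
  .appName("delta-quickstart")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Write a DataFrame in the Delta format instead of plain Parquet.
spark.range(0, 5).toDF("id")
  .write.format("delta").mode("overwrite").save("/tmp/delta/events")

// Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/events").show()

Under the hood this creates ordinary Parquet data files plus a _delta_log directory that records each commit.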
Features of Delta Lake
1. ACID Transactions
2. Scalable Metadata Handling
3. Open Format
4. Time Travel
5. Schema Enforcement & Evolution
6. Updates and Deletes
7. Unified Batch and Streaming
ACID Transactions
Delta Lake brings Atomicity, Consistency, Isolation and Durability (ACID) transactions to your data lakes. It provides serializability, the strongest isolation level, and ensures that readers never see inconsistent data.
Transaction Log
The Delta Lake transaction log (also known as the DeltaLog) is an ordered
record of every transaction that has ever been performed on a Delta Lake
table since its inception. It is a single source of truth.
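As a hedged sketch, the log can be inspected from the Scala API, reusing the placeholder table and session from the earlier example; each commit is stored as a JSON file under <table-path>/_delta_log/:

import io.delta.tables.DeltaTable

// history() returns one row per commit: version, timestamp, operation, ...
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
deltaTable.history()
  .select("version", "timestamp", "operation")
  .show(truncate = false)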
Optimistic Concurrency Control
Optimistic concurrency control is a method of dealing with concurrent
transactions that assumes that transactions (changes) made to a table by
different users can complete without conflicting with one another.
Time Travel (Data Versioning)
Delta Lake time travel allows us to query an older snapshot of a Delta Lake table. This time travel can be achieved using two approaches (see the sketch after the list below):
1. Using a version number
2. Using a timestamp
Time travel has many use cases, including:
● It makes it easy to roll back bad writes, playing an important role in fixing mistakes in our data.
● It helps in re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This can be useful for debugging or auditing, especially in regulated industries.
● It also simplifies time-series analytics, for instance finding out how many new customers were added over the last week.
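Both approaches map to read options on the Delta data source. A sketch against the placeholder table, where the version number and timestamp are illustrative:

// 1. Query an older snapshot by version number.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")

// 2. Query the snapshot that was current at a given timestamp.
val older = spark.read.format("delta")
  .option("timestampAsOf", "2019-01-01 00:00:00")
  .load("/tmp/delta/events")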
Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all of its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
Open Format: All data in Delta Lake is stored in the Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. Apache Parquet is column-oriented and designed for efficient columnar storage of data, compared to row-based formats like CSV.
Example table:
id   name   age
123  xyz    21
321  abc    20
Row-oriented layout (e.g. CSV):     123 xyz 21 | 321 abc 20
Column-oriented layout (Parquet):   123 321 | xyz abc | 21 20
Schema Enforcement: Schema enforcement, also known as schema validation,
is a safeguard in Delta Lake that ensures data quality by rejecting writes to a
table that do not match the table’s schema. To determine whether a write to a
table is compatible, Delta Lake uses the following rules:
1. Cannot contain any additional columns that are not present in the target table’s schema.
2. Cannot have column data types that differ from the column data types in the target table.
3. Cannot contain column names that differ only by case.
Schema Evolution: Delta Lake also lets you make changes to a table's schema that are applied automatically, for example when new columns appear in incoming data.
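A small sketch of both behaviours on the placeholder table; the extra country column is hypothetical:

import org.apache.spark.sql.functions.lit

// The incoming DataFrame has a column that the target table does not.
val withNewColumn = spark.range(5, 10).toDF("id").withColumn("country", lit("IN"))

// Schema enforcement: this append is rejected with an AnalysisException,
// because the write contains a column missing from the table's schema.
// withNewColumn.write.format("delta").mode("append").save("/tmp/delta/events")

// Schema evolution: opting in with mergeSchema adds the new column instead.
withNewColumn.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/events")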
Updates, Deletes and Merges
Delta Lake supports Scala / Java APIs to merge, update and delete datasets.
This allows you to easily comply with GDPR and CCPA.
Updates: We can update data that matches a predicate in a Delta Lake table.
Deletes: We can remove data that matches a predicate from a Delta Lake
table.
Merges: We can upsert data from a Spark DataFrame into a Delta Lake table using the merge operation.
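A sketch of these operations with the Scala DeltaTable API; the table path, predicates, and column values are illustrative:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, lit}

val table = DeltaTable.forPath(spark, "/tmp/delta/events")

// Update rows that match a predicate.
table.update(col("id") === 1, Map("id" -> lit(100)))

// Delete rows that match a predicate.
table.delete(col("id") > 1000)

// Merge (upsert) a DataFrame of changes into the table.
val updates = spark.range(0, 3).toDF("id")
table.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()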
Unified Batch and Streaming Source and Sink
A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
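As a sketch, the same placeholder table can be written to by a streaming query and read as a streaming source; the checkpoint path and the built-in rate test source are illustrative:

// Stream into the Delta table.
spark.readStream.format("rate").load()      // test source with columns: timestamp, value
  .selectExpr("value AS id")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
  .start("/tmp/delta/events")

// Stream out of the same table while batch queries keep working against it.
spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
  .writeStream
  .format("console")
  .start()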
DEMO
Q/A
References
1. Welcome to the Delta Lake documentation — Delta Lake Documentation
2. Spark: ACID compliant or not
3. Spark: ACID Transaction with Delta Lake
4. Time Travel: Data versioning in Delta Lake
Thank You!
