Presented By:
Kundan Kumar
Software Consultant
Spark With Delta Lake
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality: Respect KnolX session timings; please do not join a session more than 5 minutes after its start time.
Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Silent Mode: Keep your mobile devices in silent mode, and feel free to step out of the session if you need to attend an urgent call.
Avoid Disturbance: Avoid unwanted chit-chat during the session.
Agenda
01 What & Why Delta Lake
02 Features of Delta Lake
03 Delta Lake Transaction Log
04 Demo
Brings Data Reliability and Performance to Data Lakes
What is a Data Lake?
A data lake is a centralized repository that can store large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limits on account size or file size. It offers large volumes of data to improve analytic performance and enable native integration.
Why Delta Lake?
1. Data reliability challenges with data lakes: failed production jobs, orphaned data, and no schema enforcement.
2. ACID transactions: a critical feature missing from Spark.
What is Delta Lake?
Delta Lake is an open-source storage layer that sits on top of your existing data lake and is fully compatible with Apache Spark APIs. It brings ACID transactions to Apache Spark and big data workloads.
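As a minimal sketch of what this looks like with the Scala API (the table path, column name, and the SparkSession configuration below are illustrative and assume a recent Delta Lake release):

import org.apache.spark.sql.SparkSession

// A SparkSession with the Delta Lake extensions enabled.
// The table path /tmp/delta/events is a placeholder.
val spark = SparkSession.builder()
  .appName("delta-quickstart")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Write a DataFrame in the Delta format instead of plain Parquet.
spark.range(0, 5).toDF("id")
  .write.format("delta").mode("overwrite").save("/tmp/delta/events")

// Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/events").show()

Under the hood this creates ordinary Parquet data files plus a _delta_log directory that records each commit.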
Features of Delta Lake
1. ACID Transactions
2. Scalable Metadata Handling
3. Open Format
4. Time Travel
5. Schema Enforcement & Evolution
6. Updates and Deletes
7. Unified Batch and Streaming
ACID Transactions
Delta Lake brings Atomicity, Consistency, Isolation and Durability (ACID) transactions to your data lakes. It provides serializability, the strongest isolation level, and ensures that readers never see inconsistent data.
Transaction Log
The Delta Lake transaction log (also known as the DeltaLog) is an ordered
record of every transaction that has ever been performed on a Delta Lake
table since its inception. It is a single source of truth.
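As a hedged sketch, the log can be inspected from the Scala API, reusing the placeholder table and session from the earlier example; each commit is stored as a JSON file under <table-path>/_delta_log/:

import io.delta.tables.DeltaTable

// history() returns one row per commit: version, timestamp, operation, ...
val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
deltaTable.history()
  .select("version", "timestamp", "operation")
  .show(truncate = false)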
Optimistic Concurrency Control
Optimistic concurrency control is a method of dealing with concurrent
transactions that assumes that transactions (changes) made to a table by
different users can complete without conflicting with one another.
Time Travel (Data Versioning)
Delta Lake time travel allows us to query an older snapshot of a Delta Lake table. This time travel can be achieved using two approaches (see the sketch after the list below):
1. Using a version number
2. Using a timestamp
Time travel has many use cases, including:
● It makes it easy to roll back bad writes, playing an important role in fixing mistakes in our data.
● It helps in re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This can be useful for debugging or auditing, especially in regulated industries.
● It also simplifies time-series analytics, for instance finding out how many new customers were added over the last week.
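Both approaches map to read options on the Delta data source. A sketch against the placeholder table, where the version number and timestamp are illustrative:

// 1. Query an older snapshot by version number.
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")

// 2. Query the snapshot that was current at a given timestamp.
val older = spark.read.format("delta")
  .option("timestampAsOf", "2019-01-01 00:00:00")
  .load("/tmp/delta/events")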
Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all of its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
Open Format: All data in Delta Lake is stored in the Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. Apache Parquet is column-oriented and designed for efficient columnar storage of data, compared to row-based formats like CSV.
Example table:
id   name   age
123  xyz    21
321  abc    20
Row-oriented layout (e.g. CSV):     123 xyz 21 | 321 abc 20
Column-oriented layout (Parquet):   123 321 | xyz abc | 21 20
Schema Enforcement: Schema enforcement, also known as schema validation,
is a safeguard in Delta Lake that ensures data quality by rejecting writes to a
table that do not match the table’s schema. To determine whether a write to a
table is compatible, Delta Lake uses the following rules:
1. Cannot contain any additional columns that are not present in the target table’s schema.
2. Cannot have column data types that differ from the column data types in the target table.
3. Cannot contain column names that differ only by case.
Schema Evolution: Delta Lake also lets you make changes to a table's schema that are applied automatically, for example when new columns appear in incoming data.
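A small sketch of both behaviours on the placeholder table; the extra country column is hypothetical:

import org.apache.spark.sql.functions.lit

// The incoming DataFrame has a column that the target table does not.
val withNewColumn = spark.range(5, 10).toDF("id").withColumn("country", lit("IN"))

// Schema enforcement: this append is rejected with an AnalysisException,
// because the write contains a column missing from the table's schema.
// withNewColumn.write.format("delta").mode("append").save("/tmp/delta/events")

// Schema evolution: opting in with mergeSchema adds the new column instead.
withNewColumn.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("/tmp/delta/events")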
Updates, Deletes and Merges
Delta Lake supports Scala / Java APIs to merge, update and delete datasets.
This allows you to easily comply with GDPR and CCPA.
Updates: We can update data that matches a predicate in a Delta Lake table.
Deletes: We can remove data that matches a predicate from a Delta Lake
table.
Merges: We can upsert data from a Spark DataFrame into a Delta Lake table using the merge operation.
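A sketch of these operations with the Scala DeltaTable API; the table path, predicates, and column values are illustrative:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, lit}

val table = DeltaTable.forPath(spark, "/tmp/delta/events")

// Update rows that match a predicate.
table.update(col("id") === 1, Map("id" -> lit(100)))

// Delete rows that match a predicate.
table.delete(col("id") > 1000)

// Merge (upsert) a DataFrame of changes into the table.
val updates = spark.range(0, 3).toDF("id")
table.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()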
Unified Batch and Streaming Source and Sink
A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all work out of the box.
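As a sketch, the same placeholder table can be written to by a streaming query and read as a streaming source; the checkpoint path and the built-in rate test source are illustrative:

// Stream into the Delta table.
spark.readStream.format("rate").load()      // test source with columns: timestamp, value
  .selectExpr("value AS id")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
  .start("/tmp/delta/events")

// Stream out of the same table while batch queries keep working against it.
spark.readStream
  .format("delta")
  .load("/tmp/delta/events")
  .writeStream
  .format("console")
  .start()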
DEMO
Q/A
References
1. Welcome to the Delta Lake documentation — Delta Lake Documentation
2. Spark: ACID compliant or not
3. Spark: ACID Transaction with Delta Lake
4. Time Travel: Data versioning in Delta Lake
Thank You!
