Databricks Associate Data Engineer Notes
Key Features
In the lakehouse, you can work on:
Data engineering
Analytics
AI
Architecture
The architecture of Databricks Lakehouse is divided into three important layers:
1. Cloud Service: Multi-cloud, available on Microsoft Azure, Amazon Web Services, and Google Cloud.
2. Runtime: Consists of core components like Apache Spark, Delta Lake, and other system libraries. Databricks uses the infrastructure of your cloud provider to provision virtual machines or nodes of a cluster, which come pre-installed with Databricks Runtime.
3. Workspace: Allows you to interactively implement and run your data engineering, analytics, and AI workloads.
Deployment
Databricks resources are deployed in two high-level components:
Control Plane: Resides in the Databricks account and includes services like Databricks UI, Cluster
Manager, workflow service, and notebooks.
Data Plane: Resides in your own cloud subscription and includes a storage account used for
Databricks File System (DBFS) and cluster virtual machines when setting up a Spark cluster.
Language Support
Databricks was founded by the engineers who developed Spark. It supports multiple languages:
Scala
Python
SQL
R
Java
Processing Capabilities
Databricks supports:
Batch processing
Stream processing in Spark
Processing structured, semi-structured, and unstructured data, including images and videos
Storage
Databricks offers native support for DBFS, which is an abstraction layer using underlying cloud storage.
When a file is created and stored in DBFS, it is actually persisted in the underlying cloud storage, such as
Azure storage or S3 buckets. Even after the cluster is terminated, the data remains safe in the cloud
storage.
Cluster Creation
Creating a cluster is an essential step in setting up your environment for data processing
and analysis. Here are the key concepts and steps involved in creating a cluster:
Cluster
A cluster is a set of nodes or computers that work together as a single entity. It consists of a master node
(driver) and worker nodes that perform parallel tasks.
Notebooks Fundamentals
Overview
Databricks notebooks are interactive coding environments for developing and running code. They are
similar to Jupyter notebooks but offer additional features and capabilities.
Supported Languages
Python
SQL
Scala
R
Collaboration
Notebooks can be shared among team members for collaboration.
Cell Execution
Cells in notebooks allow for cell-by-cell execution of code.
Cells can be run individually or in groups using various options:
Run cell
Run all above cells
Run all below cells
Shortcuts like Shift + Enter
Default Language
The default language for a notebook can be changed at any time.
Magic Commands
Magic commands in notebooks are built-in commands that provide specialized functionality:
Changing the language of a specific cell
Adding formatted text using markdown
Execution of code in a language other than the notebook's default
The %run magic command allows running another notebook from the current notebook, enabling
code reuse and a modular approach.
The %fs magic command and dbutils provide utilities for file system operations and interaction with
the Databricks environment (examples below).
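A few hedged examples of how these commands are typically used; each magic command goes at the top of its own cell, and the notebook path and directory below are placeholders:
%md
### This cell renders formatted markdown text

%sql
SELECT current_date();

%run ./Includes/setup-notebook

%fs ls dbfs:/databricks-datasets/

# dbutils offers the same file system utilities from Python
dbutils.fs.ls("dbfs:/databricks-datasets/")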
Auto Completion
Auto completion using the Tab key is supported in notebooks for easier coding.
Display Function
The display function helps render output in a tabular format, allowing for better visualization of data.
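For example, in a Databricks notebook where spark and display are available by default:
# Render a DataFrame as an interactive, tabular result
df = spark.range(5)
display(df)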
Exporting Notebooks
Notebooks can be exported as IPython notebooks or DBC archives for sharing or moving to other
workspaces.
Revision History
Revision history in Databricks notebooks allows for tracking changes and reverting to previous versions
if needed.
Setting Up a Repo
1. Create a repository in GitHub.
2. Copy the repository URL.
3. Add the repo in the Databricks workspace.
Adding Notebooks
Create folders.
Import existing notebooks.
Clone notebooks from the workspace.
Pulling Changes
Pull changes from branches into the main branch by creating pull requests in GitHub and
confirming the merge.
Regularly pull changes to avoid conflicts, especially when multiple developers are working on the
same branch in Databricks Repos.
Delta Lake
Delta Lake is an open-source storage framework for data lakes. It brings reliability to data lakes by
addressing data inconsistency and performance issues. Delta Lake is not a storage format or medium, but
a storage layer that enables building a lakehouse architecture. A lakehouse platform unifies data
warehouse and advanced analytics capabilities.
Deployment
Delta Lake is deployed on the cluster as part of the Databricks runtime.
Table Creation
When creating a Delta Lake table, data is stored in data files in parquet format along with a
transaction log.
The transaction log, also known as Delta Log, records every transaction performed on the table
since its creation, serving as a single source of truth.
Each transaction is recorded in a JSON file, containing details of the operation, predicates, and
affected files.
Key Features
Delta Lake ensures that read operations always retrieve the most recent version of the data
without conflicts.
The transaction log enables Delta Lake to perform ACID transactions on data lakes and handle
scalable metadata.
Delta Lake maintains an audit trail of all changes to the table through its underlying Parquet data
files and JSON transaction log.
Benefits
Reliability: Ensures data consistency and integrity.
Scalability: Handles large-scale data efficiently.
Consistency: Provides a single source of truth through its transaction log and storage framework.
Delta Tables
Delta Lake tables are a type of data storage format that provides ACID transactions, scalable metadata
handling, and data versioning capabilities. Delta Lake simplifies data management by storing data in
Parquet files and writing transaction logs to maintain table versions. Delta tables allow for efficient
querying, updating, and managing of big data sets in a distributed computing environment.
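For reference, a table matching the examples below could be created like this (a minimal sketch; the column list is assumed from the INSERT statements that follow):
-- Create a Delta table (Delta is the default table format in Databricks)
CREATE TABLE my_delta_table (
  id INT,
  name STRING,
  age INT
);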
Inserting Records
-- Insert records into the Delta table
INSERT INTO my_delta_table (id, name, age) VALUES
(1, 'Alice', 30),
(2, 'Bob', 25),
(3, 'Charlie', 35);
Querying Records
-- Query the Delta table to validate inserted records
SELECT * FROM my_delta_table;
Table Metadata
-- Get detailed metadata information about the Delta table
DESCRIBE DETAIL my_delta_table;
Update Operations
-- Update records in the Delta table
UPDATE my_delta_table
SET age = 31
WHERE id = 1;
Table History
-- View the history of all operations performed on the Delta table
DESCRIBE HISTORY my_delta_table;
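Versions listed by DESCRIBE HISTORY can be queried directly with time travel; the version number and timestamp here are illustrative:
-- Query an earlier version of the Delta table
SELECT * FROM my_delta_table VERSION AS OF 1;
-- Or by timestamp
SELECT * FROM my_delta_table TIMESTAMP AS OF '2024-01-01';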
Transaction Log
-- List the contents of the transaction log directory
ls /path/to/delta/table/_delta_log/
Delta tables in Delta Lake offer a robust solution for managing big data sets with features like ACID
transactions and versioning. Understanding Delta table operations, metadata, and transaction logs is
essential for efficient data processing and analysis.
Z-Order Indexing
Co-locates and reorganizes column information in the same set of files
Improves data skipping algorithm by reducing data that needs to be read
Applied by adding ZORDER BY keyword to the OPTIMIZE command followed by column name
-- Apply Z-Order indexing on a specific column
OPTIMIZE my_table ZORDER BY (column_name);
-- Vacuum command to delete unused data files older than a specified retention threshold
-- (here 3 hours; anything below the 7-day default requires disabling the retention duration check)
VACUUM my_table RETAIN 3 HOURS;
Restoring Data:
-- Restore a Delta table to a previous version
RESTORE TABLE my_delta_table TO VERSION AS OF 3;
Optimize Command:
-- Optimize a Delta table to combine small files
OPTIMIZE my_delta_table;
Vacuum Command:
-- Vacuum a Delta table to remove old data files with the default retention period (7 days)
VACUUM my_delta_table;
Configuration:
-- Disable retention duration check (use with caution)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
Best Practices:
Avoid turning off retention duration check in production
Use Delta features judiciously to optimize data storage and querying
Tables in Databricks:
Can be managed tables (stored under the database directory) or external tables (stored
outside the database directory).
Managed tables have both metadata and table data managed by Hive, while external tables
only have metadata managed by Hive.
When dropping a managed table, the underlying data files are deleted; however, dropping
an external table does not delete the data files.
Tables can be created in custom locations using the LOCATION keyword when defining the
table, specifying the path for data storage.
External tables can be created in the default database or any custom database by specifying
the database name with the USE keyword.
It is possible to create external tables in databases located in custom locations outside the
Hive default directory.
-- Create a managed table
CREATE TABLE my_table (
id INT,
name STRING
);
Overall, understanding relational entities in Databricks involves grasping the concept of databases,
tables, and their storage locations, whether managed or external. Proper management of databases and
tables is crucial for organizing and accessing data efficiently within the Databricks workspace.
Table Constraints
Databricks supports two types of table constraints: NOT NULL and CHECK constraints.
Constraints must be defined after ensuring that no data in the table violates the constraint.
New data violating a constraint will result in a write failure.
-- Add NOT NULL constraint
ALTER TABLE delta_table
ALTER COLUMN age SET NOT NULL;
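The CHECK constraint mentioned above is added in a similar way; the constraint name and condition here are illustrative:
-- Add CHECK constraint
ALTER TABLE delta_table
ADD CONSTRAINT valid_age CHECK (age > 0);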
Partitioning
Data tables can be partitioned in subfolders by the value of one or more columns.
Partitioning can improve performance for huge Delta tables but may not benefit small to medium-
sized tables.
-- Create a partitioned Delta table
CREATE TABLE partitioned_delta_table
USING DELTA
PARTITIONED BY (country)
AS SELECT * FROM source_table;
External Tables
Delta tables created with CTAS statements can be specified as external tables, allowing data
storage in an external location.
-- Create an external Delta table
CREATE TABLE external_delta_table
USING DELTA
LOCATION '/mnt/delta/external_delta_table'
AS SELECT * FROM source_table;
Views
Views in Databricks are virtual tables that do not hold physical data, but rather store SQL
queries against actual tables. There are three types of views in Databricks: stored views,
temporary views, and global temporary views.
Stored Views
Stored views, also known as classical views, are persisted in the database and can be
accessed across multiple sessions. To create a stored view, the CREATE VIEW statement is
used with the AS keyword followed by the SQL query.
-- Creating a stored view
CREATE VIEW my_stored_view AS
SELECT * FROM my_table WHERE column_a > 100;
Temporary Views
Temporary views are tied to a Spark session and are dropped when the session ends.
They can be created by adding the TEMP or TEMPORARY keyword to the CREATE VIEW
command.
-- Creating a temporary view
CREATE TEMP VIEW my_temp_view AS
SELECT * FROM my_table WHERE column_b < 50;
Querying Views
To query a view in a SELECT statement, the appropriate database qualifier should be used
(e.g., global_temp for global temporary views).
-- Querying a stored view
SELECT * FROM my_stored_view;
Showing Views
To list the views in the current database, the SHOW TABLES command can be used (views appear
alongside tables, with temporary views flagged); the SHOW VIEWS command lists views only.
-- Showing all views in the current database
SHOW TABLES;
Dropping Views
Stored views can be dropped using the DROP VIEW command.
Temporary views are automatically dropped when the session ends.
Global temporary views are dropped upon cluster restart.
-- Dropping a stored view
DROP VIEW my_stored_view;
Summary
Stored Views: Persisted in the database, accessible across multiple sessions.
Temporary Views: Accessible only in the current session, dropped when the
session ends.
Global Temporary Views: Accessible across multiple sessions within the same
cluster, dropped upon cluster restart.
The CREATE VIEW statements for each type of view differ in that TEMP is used for temporary
views and GLOBAL TEMP is used for global temporary views.
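For completeness, a global temporary view is created and queried like this; the view and table names are illustrative:
-- Creating a global temporary view
CREATE GLOBAL TEMP VIEW my_global_temp_view AS
SELECT * FROM my_table WHERE column_c IS NOT NULL;

-- Querying it requires the global_temp database qualifier
SELECT * FROM global_temp.my_global_temp_view;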
Querying Files Directly
-- Multiple files (a format qualifier such as csv or json is required)
SELECT * FROM csv.`dbfs:/path/to/your/files*.csv`;
-- Entire directory
SELECT * FROM csv.`dbfs:/path/to/your/directory/`;
-- Using the input_file_name function to track the source file for each record
SELECT input_file_name(), * FROM json.`/path/to/directory/*.json`;
Writing to Tables
Delta technology provides ACID compliant updates to Delta tables, ensuring data
integrity.
# The Delta Lake Python API can also be used to work with Delta tables programmatically
from delta.tables import DeltaTable
Using CREATE TABLE AS SELECT (CTAS) statement to create a new table from
existing data in Parquet files.
-- SQL example of CTAS
CREATE TABLE new_table
USING delta
AS SELECT * FROM parquet.`/path/to/parquet/files`;
CREATE OR REPLACE TABLE statement fully replaces the content of a table each
time it executes.
-- SQL example of CREATE OR REPLACE TABLE
CREATE OR REPLACE TABLE delta_table
USING delta
AS SELECT * FROM source_table;
INSERT OVERWRITE statement can only overwrite an existing table and cannot change its schema,
making it a safer technique for preserving data integrity.
-- SQL example of INSERT OVERWRITE
INSERT OVERWRITE TABLE delta_table
SELECT * FROM source_table;
INSERT INTO statement is used for appending new records to tables but may result
in duplicate records if not managed properly.
-- SQL example of INSERT INTO
INSERT INTO delta_table
SELECT * FROM new_records;
MERGE INTO statement allows for upserting (insert, update, delete) data in the
target table based on conditions.
-- SQL example of MERGE INTO
MERGE INTO delta_table AS target
USING updates_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age);
# The same MERGE executed from PySpark, using a temporary view as the source
spark.sql("""
MERGE INTO delta_table AS target
USING updates_view AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age)
""")
Ensuring uniqueness and avoiding duplicates when inserting records using MERGE
INTO statement.
-- SQL example ensuring uniqueness
MERGE INTO delta_table AS target
USING (
SELECT DISTINCT id, name, age FROM updates_table
) AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age);
By understanding these key concepts, users can efficiently manage and update data in
Delta tables while maintaining data integrity and consistency.
Advanced Transformations
JSON Data Interaction
Colon Syntax: Traverse JSON data stored as strings in Databricks SQL (see the sketch below).
Example: Accessing nested values such as first name or country; the PySpark examples that follow use dot syntax on parsed structs.
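A minimal sketch of the colon syntax in Databricks SQL, assuming a hypothetical table customers with a profile column that stores JSON strings:
-- Colon syntax traverses JSON stored as strings; dots address nested fields
SELECT profile:first_name, profile:address.country
FROM customers;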
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("JSONExample").getOrCreate()
json_df = spark.read.json("path/to/json/file")
json_df.select(col("name.first"), col("address.country")).show()
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import from_json

schema = StructType([
    StructField("name", StructType([
        StructField("first", StringType(), True),
        StructField("last", StringType(), True)
    ]), True),
    StructField("address", StructType([
        StructField("country", StringType(), True)
    ]), True)
])

# from_json parses a JSON string column (here assumed to be named json_column) using the schema
json_df = spark.read.json("path/to/json/file")
json_df.withColumn("parsed_json", from_json(col("json_column"), schema)) \
    .select("parsed_json.*").show()
Struct Types
Period/Dot Syntax: Interact with nested objects.
from pyspark.sql import Row

# Nested Row objects become struct columns with named fields
struct_df = spark.createDataFrame([
    Row(name=Row(first="John", last="Doe"), country="USA"),
    Row(name=Row(first="Jane", last="Doe"), country="Canada")
])
struct_df.select(col("name.first"), col("country")).show()
Array Handling
explode Function: Puts each element of an array on its own row.
from pyspark.sql.functions import explode
array_df = spark.createDataFrame([
(1, ["a", "b", "c"]),
(2, ["d", "e", "f"])
], ["id", "letters"])
array_df.select("id", explode("letters")).show()
collect_set Function: Collects unique values for a field, including within arrays.
Combine with flatten and array_distinct to keep only distinct values.
from pyspark.sql.functions import collect_set, flatten, array_distinct
array_df = spark.createDataFrame([
(1, ["a", "b", "a"]),
(2, ["d", "e", "d"])
], ["id", "letters"])
array_df.select(array_distinct(flatten(collect_set("letters")))).show()
Join Operations
Types of Joins: Inner, outer, left, right, anti, cross, and semi joins.
Usage: Join tables based on specific keys and store results in new views.
df1 = spark.createDataFrame([
    (1, "John"),
    (2, "Jane")
], ["id", "name"])
df2 = spark.createDataFrame([
    (1, "USA"),
    (3, "Canada")
], ["id", "country"])

# Inner join on the shared "id" key; other join types are passed via how=
joined_df = df1.join(df2, on="id", how="inner")
joined_df.show()
Set Operations
Union: Combines datasets.
df1 = spark.createDataFrame([
(1, "John"),
(2, "Jane")
], ["id", "name"])
df2 = spark.createDataFrame([
(3, "Doe"),
(4, "Smith")
], ["id", "name"])
union_df = df1.union(df2)
union_df.show()
Intersect: Returns only the rows present in both datasets.
df2 = spark.createDataFrame([
    (2, "Jane"),
    (3, "Doe")
], ["id", "name"])
intersect_df = df1.intersect(df2)
intersect_df.show()
Pivot Clause
Usage: Aggregates values based on specific column values and turns them into
multiple columns.
Useful for creating flattened data formats for dashboarding or machine
learning.
pivot_df = spark.createDataFrame([
("A", "Math", 85),
("A", "Science", 90),
("B", "Math", 75),
("B", "Science", 80)
], ["student", "subject", "score"])
pivot_df.groupBy("student").pivot("subject").avg("score").show()
Summary
Advanced transformations in Spark SQL enable:
Manipulation of complex data structures.
Efficient aggregations.
Joining multiple datasets for analytical and machine learning tasks.
Transform Function
The transform function applies a transformation to all items in an array and extracts the
transformed values.
# Example of transform function in Python using map
numbers = [1, 2, 3, 4, 5, 6]
squared_numbers = list(map(lambda x: x ** 2, numbers))
print(squared_numbers) # Output: [1, 4, 9, 16, 25, 36]
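In Spark SQL itself, the same idea is expressed with the transform higher-order function; a minimal sketch on a literal array:
-- Square every element of an array using a lambda expression
SELECT transform(array(1, 2, 3, 4, 5, 6), x -> x * x) AS squared_numbers;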
Describe Function
The Describe Function command provides information about the registered function,
including expected inputs and return type. Describe Function Extended offers more
detailed information, including the SQL logic used in the function.
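The square function referenced below is assumed to have been registered earlier as a SQL UDF, roughly like this sketch:
-- Register a simple SQL UDF (assumed definition)
CREATE OR REPLACE FUNCTION square(x INT)
RETURNS INT
RETURN x * x;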
-- Example of Describe Function in Spark SQL
DESCRIBE FUNCTION square;
DESCRIBE FUNCTION EXTENDED square;
Application of UDFs
UDFs can be applied to columns in tables to perform custom operations. They are
powerful tools for manipulating data within Spark.
# Example of applying a UDF to a DataFrame column
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
square_udf = udf(lambda x: x * x, IntegerType())  # register a Python UDF that squares a number
df = spark.createDataFrame([(1,), (2,), (3,)], ["number"])
df = df.withColumn("squared", square_udf(df["number"]))
df.show()
Optimization
UDF functions in Spark are optimized for parallel execution, enhancing performance for
large datasets.
Dropping UDFs
UDFs can be dropped when they are no longer needed, freeing up resources in the
database.
-- Example of dropping a UDF in Spark SQL
DROP FUNCTION IF EXISTS square;
Data Stream
A data stream is any data source that grows over time, such as new JSON log files, CDC
feeds, or events from messaging systems like Kafka.
Processing Approaches
Traditional processing involves reprocessing the entire dataset each time new data
arrives, whereas custom logic can be written to capture only new data since the last
update. Spark structured streaming simplifies this process.
# Apply transformations to a streaming DataFrame (streaming_df is created with spark.readStream, shown below)
transformed_df = streaming_df.select("column1", "column2").where("column3 > 100")
Limitations
Some operations, like sorting and deduplication, are not supported in streaming data
frames. Advanced methods like windowing and watermarking can help achieve these
operations in structured streaming.
Example:
# Example of windowing and watermarking
from pyspark.sql.functions import window
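Continuing from the import above, a minimal sketch of a windowed aggregation with a watermark, assuming a streaming DataFrame events_df with an event_time column (names are illustrative):
# Count events in 10-minute tumbling windows, tolerating 5 minutes of late data
windowed_counts = (events_df
    .withWatermark("event_time", "5 minutes")
    .groupBy(window("event_time", "10 minutes"))
    .count())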
Structured streaming in Databricks offers a robust solution for processing streaming data
efficiently and reliably. It provides a simplified approach to handling infinite data sources
and ensures fault tolerance and exactly-once processing in streaming jobs.
Structured Streaming
Structured Streaming is a feature in Apache Spark that allows for real-time processing of
data in a consistent and fault-tolerant manner. It enables incremental processing of data
streams by treating them as a continuous series of small batch jobs.
Key Concepts:
Data Streaming: Structured Streaming enables the processing of data streams in
real-time, making it suitable for applications that require low-latency processing of
continuous data streams.
PySpark API: In order to work with data streaming in SQL, the spark.readStream
method provided by the PySpark API is used to query Delta tables as a stream
source.
# Example: Reading a stream from a Delta table
streaming_df = spark.readStream.format("delta").table("delta_table_name")
Trigger Intervals: The trigger intervals can be configured to control the frequency
of processing streaming data, with options such as every 4 seconds or availableNow
for batch mode execution.
# Example: Setting trigger intervals
query = (result_df.writeStream
    .trigger(processingTime='4 seconds')
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta_table"))
Output Modes: Output modes such as append or complete can be set to determine
how the results of streaming queries are written to durable storage, with complete
mode used for aggregation queries.
# Example: Using complete output mode for aggregations
query = agg_df.writeStream.outputMode("complete").format("console").start()
Overall, Structured Streaming in Apache Spark provides a powerful framework for real-
time data processing, enabling users to efficiently handle continuous streams of data and
perform complex analysis and transformations in a fault-tolerant and scalable manner.
Auto Loader
The Auto Loader in Databricks uses Spark structured streaming to efficiently process new
data files as they arrive in a storage location. It can scale to support real-time ingestion of
millions of files per hour.
Checkpointing: Uses checkpointing to track the ingestion process and store
metadata of discovered files, ensuring they are processed exactly once and can
resume from failure points.
StreamReader Format: The format of StreamReader used in Auto Loader is
cloudFiles, where users specify the format of the source files through options.
Schema Detection: Can automatically configure the schema of the data, detecting
any updates to fields in the source dataset.
It is recommended to use Auto Loader for ingesting data from cloud object storage,
especially when dealing with large volumes of data.
Example Code
from pyspark.sql.functions import *
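A minimal Auto Loader sketch using the cloudFiles format; the paths, the target table name, and the JSON source format are placeholders:
# Incrementally ingest new files from cloud storage into a Delta table
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .load("/path/to/data/source/directory"))

(df.writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .outputMode("append")
    .toTable("bronze_table"))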
Conclusion
Incremental data ingestion methods like Copy Into and Auto Loader in Databricks provide
efficient ways to process new data from files without reprocessing existing data.
Understanding the differences between these methods and when to use each can help
optimize data pipelines and improve overall data processing efficiency.
Code Examples
Setting Up Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("AutoLoaderExample") \
    .getOrCreate()

# simulate_data_arrival is assumed to be a demo helper that copies a new file
# into the monitored source directory
simulate_data_arrival("/path/to/simulated/data", "/path/to/data/source/directory")
Summary
Overall, Auto Loader in combination with Apache Spark Structured Streaming provides a
powerful tool for handling real-time data ingestion and processing scenarios, enabling
efficient and scalable data analytics solutions.
Multi-hop (Medallion) Architecture
Multi-hop architecture, also known as Medallion architecture, is a data design pattern
used to logically organize data in a multilayered approach. The goal of multi-hop
architecture is to incrementally improve the structure and quality of data as it flows
through each layer of the architecture. The architecture typically consists of three layers:
bronze, silver, and gold.
Layers
1. Bronze Table
Contains raw data ingested from various sources such as JSON files, operational databases, or a Kafka stream.
2. Silver Table
Provides a more refined view of the data by cleaning, filtering, and enriching the records with fields from various tables.
3. Gold Table
Provides business-level aggregations and is often used for reporting, dashboarding, or machine learning purposes.
Benefits
Simple and easy-to-understand data model
Incremental ETL capabilities
Ability to combine streaming and batch workloads
Ability to recreate tables from raw data at any time
Delta Lake Multi-hop Pipeline
Creating a pipeline using Delta Lake to process data from a bookstore dataset containing
customers, orders, and books tables.
Initial Steps
1. Running the Copy-Datasets script and checking the source directory.
2. Configuring an Auto Loader for stream reading Parquet files with schema inference (see the sketch after this list).
3. Registering a streaming temporary view (orders_raw_tmp) for data transformation in Spark SQL.
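A hedged sketch of steps 2-3, assuming the orders data arrives as Parquet files in a source directory (paths are placeholders, not the exact demo code):
# Auto Loader stream over Parquet files with schema inference,
# exposed as a temporary view for Spark SQL transformations
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/path/to/orders_schema")
    .load("/path/to/orders-raw")
    .createOrReplaceTempView("orders_raw_tmp"))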
Enriching Raw Data
Adding metadata such as source file information and ingestion time for
troubleshooting purposes.
Visualizing enriched data before further processing.
Processing Raw Data
Passing enriched data to PySpark API for incremental write to a Delta Lake table
(orders_bronze).
Monitoring and checking records written into the bronze table.
Working on the Silver Layer
Creating a static lookup table (customers) for joining with the bronze table.
Performing enrichments and checks (adding customer names, parsing timestamps,
filtering out orders with no items).
Writing enriched data into a silver table.
Moving to the Gold Layer
Creating an aggregated gold table (daily_customer_books) for daily book counts per
customer.
Configuring streams with trigger options and output modes for data processing (see the sketch after this list).
Querying the gold table to view aggregated data.
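A hedged sketch of such a gold-layer stream, combining an availableNow trigger with complete output mode; the table names, columns, and paths are placeholders rather than the exact demo code:
from pyspark.sql.functions import count

# Aggregate daily book counts per customer and write the result as a gold table
(spark.readStream.table("orders_silver")
    .groupBy("customer_id", "order_date")
    .agg(count("*").alias("books_count"))
    .writeStream
    .trigger(availableNow=True)
    .outputMode("complete")
    .option("checkpointLocation", "/path/to/gold_checkpoint")
    .toTable("daily_customer_books"))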
Processing Remaining Data Files
Updating the gold table by rerunning the final query as a batch job.
Wrapping Up
Stopping all active streams to conclude the pipeline.
In this multi-hop architecture, we observe a step-by-step process involving data
enrichment, transformation, and aggregation using Delta Lake and Apache Spark. The
pipeline demonstrates the integration of streaming and batch processing for efficient data
processing and analytics tasks.
Delta Live Tables (DLT)
Delta Live Tables (DLT) is a framework designed to simplify the creation of reliable and
maintainable data processing pipelines.
Multi-hop Pipeline:
DLT allows for the creation of multi-hop pipelines, where data flows through different
layers (bronze, silver, gold) to undergo various transformations and enrichment
processes.
Bronze Tables:
The initial layer in a DLT pipeline is the bronze layer, where raw data is ingested. Tables
in this layer capture data in its rawest form, before any processing or transformations
have been applied.
Silver Tables:
The silver layer is where data is refined through operations such as data cleansing and
enrichment. Quality control measures are implemented at this level using constraint
keywords to ensure data integrity.
Gold Tables:
The gold layer represents the final refined data, often used for reporting or analysis
purposes. DLT pipelines can be configured to continuously ingest new data using
triggered or continuous pipeline modes.
Databricks Notebooks:
DLT pipelines are implemented using Databricks notebooks, which allow for the easy
creation and management of data processing logic.
Error Handling:
DLT pipelines include error handling mechanisms, such as reporting constraint violations
and providing options for handling records that do not meet specified criteria.
Metadata Management:
DLT pipelines store metadata about tables, events, and configurations in a designated
storage location, allowing for easy monitoring and tracking of pipeline activities.
Cluster Management:
DLT pipelines can be run on designated clusters, which can be configured based on
performance requirements and resource needs. Clusters can be created, managed, and
terminated as needed.
Summary:
Delta Live Tables provides a powerful framework for building and managing data
processing pipelines, offering features for data transformation, quality control, error
handling, and cluster management. By utilizing DLT, organizations can streamline the
process of building and maintaining data pipelines while ensuring data integrity and
reliability.
Optional Features
Optional features of the Apply Changes Into command include:
Specifying handling for delete events
Specifying primary keys for the table
Ignoring specified columns
Choosing between storing records as slowly changing dimension type 1 or type 2
Type 1 slowly changing dimension tables maintain only one record for each unique key,
with updates overwriting existing information.
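For reference, the SQL form of the command looks roughly like this; the table, key, and column names are placeholders that mirror the Python example later in these notes:
-- Upsert CDC events from the bronze stream into the silver table
APPLY CHANGES INTO LIVE.silver_table
FROM STREAM(LIVE.bronze_table)
KEYS (primary_key)
APPLY AS DELETE WHEN row_status = "delete"
SEQUENCE BY row_time
COLUMNS * EXCEPT (row_status, row_time)
STORED AS SCD TYPE 1;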
Disadvantages
One disadvantage of Apply Changes Into is that it breaks the append-only requirements
for streaming table sources. This means that tables updated using this command cannot
be used as streaming sources in subsequent layers.
Conclusion
In conclusion, Change Data Capture is a crucial process for identifying, capturing, and
delivering data changes to target destinations. Utilizing the Apply Changes Into command
in Delta Live Tables facilitates effective processing of CDC events and ensures accurate
and up-to-date data management.
Bronze Table
import dlt
from pyspark.sql.functions import col

@dlt.table
def bronze_table():
    # Raw CDC events ingested as a stream (in practice the JSON source needs a
    # schema or Auto Loader's cloudFiles format)
    return (
        spark.readStream.format("json")
        .load("/path/to/cdc/json/files")
    )
Silver Table
# apply_changes targets a streaming table that is declared separately,
# rather than being returned from an @dlt.table function
dlt.create_streaming_table("silver_table")

dlt.apply_changes(
    target = "silver_table",
    source = "bronze_table",
    keys = ["primary_key"],
    sequence_by = col("row_time"),
    apply_as_deletes = col("row_status") == "delete",
    except_column_list = ["row_status", "row_time"]
)
Gold Layer
@dlt.table
def gold_layer():
return (
dlt.read("silver_table")
.groupBy("some_column")
.agg({"another_column": "sum"})
)
Benefits of DLT
Efficient processing of CDC feed with incremental changes.
Flexibility to create views and reference tables across notebooks in the DLT
pipeline.
Scalability for handling multiple data sources and complex data processing
requirements.
Conclusion
Processing CDC feed with DLT involves capturing, ingesting, and applying incremental
changes to data tables. The DLT pipeline components, Apply Changes Into command, and
DLT views provide a structured approach to handling CDC data in a scalable and efficient
manner. By following the steps outlined in this demo, users can manage and process CDC
feed effectively using DLT in their data pipelines.
Databricks SQL
SQL Warehouse:
A SQL warehouse in Databricks SQL is the compute power, or SQL engine,
based on a Spark cluster.
It can be created by configuring a new SQL engine with options like cluster
size and name.
Dashboards:
Dashboards in Databricks SQL allow users to create visualizations and
analyze data.
Users can import pre-built dashboards or create new ones from scratch.
Visualization:
Users can create various visualizations like pie charts and graphs by selecting
columns and settings for X and Y axes.
Visualizations can be added to dashboards for easy data analysis.
Querying:
Custom SQL queries can be written in the SQL editor by selecting the schema,
database, and table.
Queries can be saved, added to dashboards, and scheduled to refresh
automatically.
Alerts:
Alerts in Databricks SQL notify users when a certain threshold in a query
result is met.
Users can set up alerts with specific conditions and receive notifications
through email or other platforms like Slack and Microsoft Teams.
Summary:
Databricks SQL provides a comprehensive platform for data analysis through SQL queries,
visualizations, dashboards, and alerting capabilities. Users can leverage the power of
Spark clusters to run complex analytics and gain insights from their data.
Privilege Management
In order to grant privileges on an object, the user must be a Databricks administrator or
the owner of the object. Different ownership levels include:
Catalog owner: can grant privileges for all objects in the catalog.
Database owner: can grant privileges for objects only within that specific database.
Table owner: can grant privileges solely for the table.
Similar rules apply for views and named functions.
Operations
Apart from granting privileges, users can also perform deny and revoke operations to
manage object privileges. The SHOW GRANTS operation allows users to view the granted
permissions on objects.
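A few hedged examples of these operations; the object and group names are illustrative:
-- Grant a privilege on a table to a group
GRANT SELECT ON TABLE my_table TO `data_analysts`;
-- Revoke a previously granted privilege
REVOKE SELECT ON TABLE my_table FROM `data_analysts`;
-- Explicitly deny a privilege
DENY SELECT ON TABLE my_table TO `interns`;
-- Inspect the privileges granted on an object
SHOW GRANTS ON TABLE my_table;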
By understanding and effectively managing data object privileges in Databricks, users can
ensure secure and controlled access to their data assets within the platform.
Unity Catalog
Unity Catalog is a centralized governance solution provided by the Databricks platform
that unifies governance for all data and AI assets in your Lakehouse across multiple
workspaces and clouds. It simplifies data access rules by allowing users to define them
once and apply them across various workspaces.
Architecture of Unity Catalog
Location: Unity Catalog sits outside workspaces and is accessed via the Account
Console.
Management: It manages users, groups, and metastores which can be assigned to
multiple workspaces.
Security: It improves security and functionality compared to the traditional Hive
metastore.
Three-Level Namespace
Unity Catalog introduces a three-level namespace hierarchy consisting of:
1. Metastores: Top-level container.
2. Catalogs: Containers for data objects.
3. Schemas: Contain data assets like tables, views, and functions (referenced with the three-level name shown below).
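Data objects are then referenced with a three-level name of the form catalog.schema.table, for example (names are illustrative):
-- Three-level namespace reference
SELECT * FROM my_catalog.my_schema.my_table;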
Security Model
Access Control: Unity Catalog offers fine-grained access control with privileges
such as CREATE, USAGE, SELECT, MODIFY, READ FILES, WRITE FILES, and
EXECUTE.
Identities: Supports different types of identities including users, service principals,
and groups.
Identity Federation: Simplifies identity management across workspaces.
Data Search and Lineage
Data Search: Includes built-in data search and discovery features.
Lineage Tracking: Provides automated lineage tracking to identify data origin and
usage across different assets like tables, notebooks, workflows, and dashboards.
Legacy Compatibility
Additive: Unity Catalog is additive and works alongside existing legacy catalogs.
Hive Metastore: Allows access to the Hive metastore local to a workspace through
the catalog named hive_metastore.
Unified Governance
Unity Catalog unifies existing legacy catalogs without the need for hard migration. It offers
a centralized solution for managing data assets, access control, and lineage tracking.
Conclusion
Unity Catalog enhances data governance by providing a centralized solution for managing
data assets, access control, and lineage tracking across multiple workspaces and clouds. It
simplifies data access rules, improves security, and offers advanced features for data
management in a Lakehouse environment.