Databricks Associate Data Engineer Notes

Notes By : Zeeshan Khan

Databricks: A Multi-Cloud Lakehouse Platform


Overview
Databricks is a multi-cloud lakehouse platform based on Apache Spark. A data lakehouse is a unified
analytics platform that combines the best elements of data lakes and data warehouses. This platform
provides:
 The openness, flexibility, and machine learning support of data lakes
 The reliability, strong governance, and performance of data warehouses

Key Features
In the lakehouse, you can work on:
 Data engineering
 Analytics
 AI

Architecture
The architecture of Databricks Lakehouse is divided into three important layers:
1. Cloud Service: Multi-cloud, available on Microsoft Azure, Amazon Web Services, and Google
Cloud.
2. Runtime: Consists of core components like Apache Spark, Delta Lake, and other system libraries.
Databricks uses the infrastructure of your cloud provider to provision virtual machines or nodes
of a cluster, which come pre-installed with Databricks Runtime.
3. Workspace: Allows you to interactively implement and run your data engineering, analytics, and
AI workloads.

Deployment
Databricks resources are deployed in two high-level components:
 Control Plane: Resides in the Databricks account and includes services like Databricks UI, Cluster
Manager, workflow service, and notebooks.
 Data Plane: Resides in your own cloud subscription and includes a storage account used for
Databricks File System (DBFS) and cluster virtual machines when setting up a Spark cluster.

Language Support
Databricks was founded by the engineers who developed Spark. It supports multiple languages:
 Scala
 Python
 SQL
 R
 Java
Processing Capabilities
Databricks supports:
 Batch processing
 Stream processing in Spark
 Processing structured, semi-structured, and unstructured data, including images and videos

Storage
Databricks offers native support for DBFS, which is an abstraction layer using underlying cloud storage.
When a file is created and stored in DBFS, it is actually persisted in the underlying cloud storage, such as
Azure storage or S3 buckets. Even after the cluster is terminated, the data remains safe in the cloud
storage.
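
As a small illustration (runnable only inside a Databricks notebook; the paths below are placeholders), files written through dbutils land in DBFS and survive cluster termination:

# Write a small file to DBFS; it is persisted in the underlying cloud storage
dbutils.fs.put("/tmp/demo/hello.txt", "Hello, DBFS!", overwrite=True)

# List the directory and read the file back, even after a cluster restart
display(dbutils.fs.ls("/tmp/demo/"))
print(dbutils.fs.head("/tmp/demo/hello.txt"))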

Creating a cluster is an essential step in setting up your environment for data processing
and analysis. Here are the key concepts and steps involved in creating a cluster:

Cluster
A cluster is a set of nodes or computers that work together as a single entity. It consists of a master node
(driver) and worker nodes that perform parallel tasks.

Setting up the Cluster


1. Navigate to the Compute section in the left sidebar of your environment.
2. Click on "Create compute" to start the process.
3. Name your cluster (e.g., Demo Cluster) and choose the policy (e.g., Unrestricted for a
fully configurable cluster).
4. Decide on the cluster type: single-node or multi-node, based on your
requirements.
5. Choose the Databricks Runtime version (e.g., 11.3 LTS) and optionally enable
features like Photon for enhanced performance.
Worker Node Configuration
1. Select the virtual machine size and number of worker nodes based on your
memory, core, and disk requirements.
2. Enable or disable auto scaling for the number of workers.
Driver Node Configuration
1. Configure the driver node settings or keep them the same as the worker nodes.
Auto Termination
1. Set a time limit for auto termination of the cluster if there is no activity for a
specified period (e.g., 30 minutes).
Confirm and Create
1. Review the cluster configuration summary, including the number of Databricks
Units (DBUs) consumed.
2. Click "Confirm" and then "Create" to provision the cluster.
Managing the Cluster
1. Access your cluster anytime from the Compute section.
2. Monitor the cluster status (running or terminated) and manage it by starting,
terminating, deleting, or editing permissions.
3. Check the Event Log for cluster activity and the Driver Log for logs generated
within the cluster notebooks and libraries.
Remember, changing the cluster configuration may require a restart.

Notebooks Fundamentals
Overview
Databricks notebooks are coding environments that allow interactive development and execution of
code. Similar to Jupyter Notebook, Databricks notebooks offer additional features and capabilities.

Supported Languages
 Python
 SQL
 Scala
 R

Collaboration
Notebooks can be shared among team members for collaboration.

Cell Execution
 Cells in notebooks allow for cell-by-cell execution of code.
 Cells can be run individually or in groups using various options:
 Run cell
 Run all above cells
 Run all below cells
 Shortcuts like Shift + Enter

Default Language
The default language for a notebook can be changed at any time.

Magic Commands
Magic commands in notebooks are built-in commands that provide specialized functionality:
 Changing the language of a specific cell
 Adding formatted text using markdown
 Execution of code in a language other than the notebook's default
 The run magic command allows running another notebook from the current notebook, enabling
code reuse and a modular approach.
 FS magic command and dbutils provide utilities for file system operations and interaction with
the Databricks environment.
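
For illustration, here are a few separate notebook cells showing typical magic command usage; the notebook path and dataset directory below are placeholders, not from the original notes.

%md
### Section title written as formatted text

%sql
SELECT current_date()

%run ./Includes/Setup-Notebook

%fs ls /databricks-datasets

# dbutils equivalent of the %fs magic command (in a Python cell)
files = dbutils.fs.ls("/databricks-datasets")
display(files)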
Auto Completion
Auto completion using the Tab key is supported in notebooks for easier coding.

Display Function
The display function helps render output in a tabular format, allowing for better visualization of data.
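For example, in a Python cell (using a small DataFrame created on the fly):

# display renders the DataFrame as an interactive, sortable table in the notebook output
df = spark.range(5).toDF("id")
display(df)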

Exporting Notebooks
Notebooks can be exported as IPython notebooks or DBC archives for sharing or moving to other
workspaces.

Revision History
Revision history in Databricks notebooks allows for tracking changes and reverting to previous versions
if needed.

Databricks Repos (Git Folders)


Databricks Repos provide source control for data projects by integrating with Git providers like GitHub
and Azure DevOps.

Configuring Git Integration


1. Go to user settings in Databricks workspace.
2. Navigate to the Git Integration tab.
3. Fill in your Git provider username and personal access token.

Setting Up a Repo
1. Create a repository in GitHub.
2. Copy the repository URL.
3. Add the repo in Databricks workspace.

Working with Branches


 Create branches in Databricks Repos to work on different versions of your project.

Adding Notebooks
 Create folders.
 Import existing notebooks.
 Clone notebooks from the workspace.

Committing and Pushing Changes


 Commit changes to the repo with a commit message.
 Push changes to the remote repository.

Pulling Changes
 Pull changes from branches into the main branch by creating pull requests in GitHub and
confirming the merge.
 Regularly pull changes to avoid conflicts, especially when multiple developers are working on the
same branch in Databricks Repos.
Delta Lake is an open-source storage framework for data lakes. It brings reliability to data lakes by
addressing data inconsistency and performance issues. Delta Lake is not a storage format or medium, but
a storage layer that enables building a lakehouse architecture. A lakehouse platform unifies data
warehouse and advanced analytics capabilities.
Deployment
 Delta Lake is deployed on the cluster as part of the Databricks runtime.

Table Creation
 When creating a Delta Lake table, data is stored in data files in parquet format along with a
transaction log.
 The transaction log, also known as Delta Log, records every transaction performed on the table
since its creation, serving as a single source of truth.
 Each transaction is recorded in a JSON file, containing details of the operation, predicates, and
affected files.

Key Features
 Delta Lake ensures that read operations always retrieve the most recent version of the data
without conflicts.
 The transaction log enables Delta Lake to perform ACID transactions on data lakes and handle
scalable metadata.
 Delta Lake maintains an audit trail of all changes on the table using the underlying file format of
parquet and JSON.

Benefits
 Reliability: Ensures data consistency and integrity.
 Scalability: Handles large-scale data efficiently.
 Consistency: Provides a single source of truth through its transaction log and storage framework.

Delta Tables
Delta Lake tables are a type of data storage format that provides ACID transactions, scalable metadata
handling, and data versioning capabilities. Delta Lake simplifies data management by storing data in
Parquet files and writing transaction logs to maintain table versions. Delta tables allow for efficient
querying, updating, and managing of big data sets in a distributed computing environment.

Creating Delta Tables


-- Create a Delta table
CREATE TABLE my_delta_table (
id INT,
name STRING,
age INT
);

Inserting Records
-- Insert records into the Delta table
INSERT INTO my_delta_table (id, name, age) VALUES
(1, 'Alice', 30),
(2, 'Bob', 25),
(3, 'Charlie', 35);

Querying Records
-- Query the Delta table to validate inserted records
SELECT * FROM my_delta_table;

Table Metadata
-- Get detailed metadata information about the Delta table
DESCRIBE DETAIL my_delta_table;

Update Operations
-- Update records in the Delta table
UPDATE my_delta_table
SET age = 31
WHERE id = 1;

Table History
-- View the history of all operations performed on the Delta table
DESCRIBE HISTORY my_delta_table;

Transaction Log
-- List the contents of the transaction log directory
ls /path/to/delta/table/_delta_log/

Delta tables in Delta Lake offer a robust solution for managing big data sets with features like ACID
transactions and versioning. Understanding Delta table operations, metadata, and transaction logs is
essential for efficient data processing and analysis.

Advanced Delta Lake Features


Time Travel Feature
 All operations on the table are automatically versioned
 Full audit trail of changes
 History can be viewed using DESCRIBE HISTORY command
 Query older versions using timestamps or version numbers
 Enables rollbacks in case of bad writes using RESTORE TABLE command
-- View history of a Delta table
DESCRIBE HISTORY my_table;

-- Query an older version of the table using a timestamp


SELECT * FROM my_table TIMESTAMP AS OF '2023-01-01T00:00:00Z';

-- Query an older version of the table using a version number


SELECT * FROM my_table VERSION AS OF 5;

-- Rollback to a previous version of the table


RESTORE TABLE my_table TO VERSION AS OF 5;

Compacting Small Files


 Improves read query speed
 Compact small files into larger ones using OPTIMIZE command
-- Compact small files into larger ones
OPTIMIZE my_table;

Z-Order Indexing
 Co-locates and reorganizes column information in the same set of files
 Improves data skipping algorithm by reducing data that needs to be read
 Applied by adding ZORDER BY keyword to the OPTIMIZE command followed by column name
-- Apply Z-Order indexing on a specific column
OPTIMIZE my_table ZORDER BY (column_name);

Vacuum Command for Garbage Collection


 Deletes unused data files
 Specify retention period for files to be deleted
 Default threshold is 7 days
 Files newer than the threshold are not deleted, ensuring that long-running operations are not
still referencing them
-- Vacuum command to delete unused data files older than the default threshold
VACUUM my_table;

-- Vacuum command to delete unused data files older than a specified threshold (e.g., 72 hours)
-- Note: retaining less than the default 7 days requires disabling the retention duration check
VACUUM my_table RETAIN 72 HOURS;

Apply Advanced Delta Features: Time Travel, Optimize, and Vacuum


Time Travel Feature:
Querying Previous Versions:
-- Query a previous version of a Delta table using version number
SELECT * FROM my_delta_table VERSION AS OF 3;

-- Query a previous version of a Delta table using timestamp


SELECT * FROM my_delta_table TIMESTAMP AS OF '2023-01-01 00:00:00';

Restoring Data:
-- Restore a Delta table to a previous version
RESTORE TABLE my_delta_table TO VERSION AS OF 3;

-- Restore a Delta table to a specific timestamp


RESTORE TABLE my_delta_table TO TIMESTAMP AS OF '2023-01-01 00:00:00';

Optimize Command:
-- Optimize a Delta table to combine small files
OPTIMIZE my_delta_table;

-- Optimize a Delta table with Z-order indexing on specific columns


OPTIMIZE my_delta_table ZORDER BY (column1, column2);

Vacuum Command:
-- Vacuum a Delta table to remove old data files with the default retention period (7
days)
VACUUM my_delta_table;

Configuration:
-- Disable retention duration check (use with caution)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

-- Vacuum a Delta table with a specified retention period (e.g., 1 hour)


VACUUM my_delta_table RETAIN 1 HOURS;

Best Practices:
 Avoid turning off retention duration check in production
 Use Delta features judiciously to optimize data storage and querying

Relational Entities on Databricks: Understanding Databases and Tables


Key Concepts:
 Databases in Databricks:
 Schemas in the Hive metastore, used for storing metadata about data structures.
 The Hive metastore is a repository of metadata that stores information about databases,
tables, and partitions.
 The default database in Databricks is called "default" and can be used to create tables
without specifying a database name.
 Databases can be created using CREATE DATABASE or CREATE SCHEMA syntax, with the
database folder stored in the Hive default directory.
 -- Create a new database
CREATE DATABASE my_database;

-- Create a new schema (alias for database)


CREATE SCHEMA my_schema;

 Tables in Databricks:
 Can be managed tables (stored under the database directory) or external tables (stored
outside the database directory).
 Managed tables have both metadata and table data managed by Hive, while external tables
only have metadata managed by Hive.
 When dropping a managed table, the underlying data files are deleted; however, dropping
an external table does not delete the data files.
 Tables can be created in custom locations using the LOCATION keyword when defining the
table, specifying the path for data storage.
 External tables can be created in the default database or any custom database by specifying
the database name with the USE keyword.
 It is possible to create external tables in databases located in custom locations outside the
Hive default directory.
 -- Create a managed table
CREATE TABLE my_table (
id INT,
name STRING
);

-- Create an external table


CREATE TABLE my_external_table (
id INT,
name STRING
)
LOCATION '/mnt/data/my_external_table';

-- Create a table in a specific database


USE my_database;
CREATE TABLE my_database_table (
id INT,
name STRING
);

-- Create an external table in a specific database


USE my_database;
CREATE TABLE my_database_external_table (
id INT,
name STRING
)
LOCATION '/mnt/data/my_database_external_table';

Overall, understanding relational entities in Databricks involves grasping the concept of databases,
tables, and their storage locations, whether managed or external. Proper management of databases and
tables is crucial for organizing and accessing data efficiently within the Databricks workspace.

Databases and Tables on Databricks


In this notebook, we will explore working with databases and tables on Databricks. Let's begin by looking
at the Data Explorer, where we can view the Hive metastore for the Databricks workspace. By default,
there is a database called "default". We will create tables within this default database.

Creating a Managed Table


We will start by creating a table called "managed_default" and inserting some data. This table is a
managed table since we did not specify the LOCATION keyword during its creation.
-- Create a managed table
CREATE TABLE managed_default (
id INT,
name STRING
);

-- Insert data into the managed table


INSERT INTO managed_default (id, name) VALUES (1, 'Alice'), (2, 'Bob');

Viewing Metadata Information


Viewing the metadata information using the DESCRIBE EXTENDED command, we can see details such as
the table location and type (Managed).
-- Describe extended information about the managed table
DESCRIBE EXTENDED managed_default;

Dropping a Managed Table


Dropping the managed table results in the deletion of the table directory and its data files.
-- Drop the managed table
DROP TABLE managed_default;

Creating an External Table


Next, we create an external table under the default database by specifying the LOCATION keyword with
the desired path.
-- Create an external table
CREATE EXTERNAL TABLE external_default (
id INT,
name STRING
)
LOCATION '/path/to/external/data';

-- Insert data into the external table


INSERT INTO external_default (id, name) VALUES (3, 'Charlie'), (4, 'Diana');

Observing External Table Behavior


After creating and inserting data into the external table, we observe that dropping the external table does
not delete the underlying data files, as they are stored outside the database directory.
-- Drop the external table
DROP TABLE external_default;
Creating Additional Databases
Creating a New Database
We demonstrate the creation of a new database using the CREATE DATABASE syntax.
-- Create a new database
CREATE DATABASE IF NOT EXISTS my_custom_database;

Creating Tables in a New Database


Tables can be created in new databases by specifying the database to be used through the USE keyword.
-- Use the new database
USE my_custom_database;

-- Create a managed table in the new database


CREATE TABLE managed_table (
id INT,
name STRING
);

-- Create an external table in the new database


CREATE EXTERNAL TABLE external_table (
id INT,
name STRING
)
LOCATION '/path/to/external/data';

Dropping Tables in a New Database


Dropping tables within a new database highlights the difference in behavior between managed and
external tables in terms of data file retention upon deletion.
-- Drop the managed table in the new database
DROP TABLE managed_table;

-- Drop the external table in the new database


DROP TABLE external_table;

Creating a Database in a Custom Location


We create a database in a custom location outside the default hive directory, showcasing the flexibility of
defining database locations.
-- Create a database in a custom location
CREATE DATABASE IF NOT EXISTS custom_location_db
LOCATION '/custom/path/to/database';

Creating Tables in a Custom Database


Managed and external tables can be created within this custom database, with managed tables having
their data files managed within the custom location.
-- Use the custom location database
USE custom_location_db;

-- Create a managed table in the custom location database


CREATE TABLE managed_custom_table (
id INT,
name STRING
);

-- Create an external table in the custom location database


CREATE EXTERNAL TABLE external_custom_table (
id INT,
name STRING
)
LOCATION '/custom/path/to/external/data';

Set Up Delta Tables


Creation of Delta Tables
 Delta Lake tables can be created using CTAS (Create Table As Select) statements.
 CTAS statements automatically infer schema information from query results.
 CTAS statements support simple transformations such as column renaming or omitting columns
during table creation.
-- Create a Delta table using CTAS
CREATE TABLE delta_table AS
SELECT id, name, age
FROM source_table;

Table Constraints
 Databricks supports two types of table constraints: NOT NULL and CHECK constraints.
 Constraints must be defined after ensuring that no data in the table violates the constraint.
 New data violating a constraint will result in a write failure.
-- Add NOT NULL constraint
ALTER TABLE delta_table
ALTER COLUMN age SET NOT NULL;

-- Add CHECK constraint


ALTER TABLE delta_table
ADD CONSTRAINT age_check CHECK (age > 0);

Partitioning
 Data tables can be partitioned in subfolders by the value of one or more columns.
 Partitioning can improve performance for huge Delta tables but may not benefit small to medium-
sized tables.
-- Create a partitioned Delta table
CREATE TABLE partitioned_delta_table
USING DELTA
PARTITIONED BY (country)
AS SELECT * FROM source_table;

External Tables
 Delta tables created with CTAS statements can be specified as external tables, allowing data
storage in an external location.
-- Create an external Delta table
CREATE TABLE external_delta_table
USING DELTA
LOCATION '/mnt/delta/external_delta_table'
AS SELECT * FROM source_table;

Cloning Delta Tables


 Delta Lake provides options for efficiently copying tables: deep clone and shallow clone.
 Deep clone copies both data and metadata from a source table to a target.
 Shallow clone only copies Delta transaction logs.
 Cloning is useful for creating backups or testing code changes without affecting the source table.
-- Deep clone a Delta table
CREATE TABLE deep_cloned_table
DEEP CLONE source_table;

-- Shallow clone a Delta table


CREATE TABLE shallow_cloned_table
SHALLOW CLONE source_table;

Views
Views in Databricks are virtual tables that do not hold physical data, but rather store SQL
queries against actual tables. There are three types of views in Databricks: stored views,
temporary views, and global temporary views.
Stored Views
Stored views, also known as classical views, are persisted in the database and can be
accessed across multiple sessions. To create a stored view, the CREATE VIEW statement is
used with the AS keyword followed by the SQL query.
-- Creating a stored view
CREATE VIEW my_stored_view AS
SELECT * FROM my_table WHERE column_a > 100;

Temporary Views
Temporary views are tied to a Spark session and are dropped when the session ends.
They can be created by adding the TEMP or TEMPORARY keyword to the CREATE VIEW
command.
-- Creating a temporary view
CREATE TEMP VIEW my_temp_view AS
SELECT * FROM my_table WHERE column_b < 50;

Global Temporary Views


Global temporary views behave similarly to other views but are tied to the cluster. They
can be accessed by any notebook attached to the cluster while it is running. To create a
global temporary view, the GLOBAL TEMP keyword is added to the command.
-- Creating a global temporary view
CREATE GLOBAL TEMP VIEW my_global_temp_view AS
SELECT * FROM my_table WHERE column_c = 'example';

Querying Views
To query a view in a SELECT statement, the appropriate database qualifier should be used
(e.g., global_temp for global temporary views).
-- Querying a stored view
SELECT * FROM my_stored_view;

-- Querying a temporary view


SELECT * FROM my_temp_view;
-- Querying a global temporary view
SELECT * FROM global_temp.my_global_temp_view;

Showing Views
To list all views in the current database, the SHOW VIEWS command can be used; SHOW TABLES
also lists temporary views alongside tables, flagged by the isTemporary column.
-- Showing all views in the current database
SHOW VIEWS;

-- Showing tables and temporary views in the current database
SHOW TABLES;

Dropping Views
 Stored views can be dropped using the DROP VIEW command.
 Temporary views are automatically dropped when the session ends.
 Global temporary views are dropped upon cluster restart.
-- Dropping a stored view
DROP VIEW my_stored_view;

-- Dropping a global temporary view


DROP VIEW global_temp.my_global_temp_view;

Summary
 Stored Views: Persisted in the database, accessible across multiple sessions.
 Temporary Views: Accessible only in the current session, dropped when the
session ends.
 Global Temporary Views: Accessible across multiple sessions within the same
cluster, dropped upon cluster restart.
The CREATE VIEW statements for each type of view differ in that TEMP is used for temporary
views and GLOBAL TEMP is used for global temporary views.

Querying Files in Databricks


Extracting data directly from files using Spark SQL
In Databricks, you can extract data directly from files using Spark SQL. This allows you to
access the contents of various file formats such as JSON, CSV, TSV, TXT, and more.
Using select statement for querying file content
To query a file's content, you use a SELECT statement of the form SELECT * FROM
format.`file_path`, where a format prefix (such as json, csv, or text) tells Spark how to read
the file. It's important to note the use of backticks around the file path instead of single quotes.
SELECT * FROM csv.`dbfs:/path/to/your/file.csv`

Different methods for reading files


 Single file: You can read a single file.
 Multiple files: Use a wildcard character to read multiple files.
 Entire directory: Read an entire directory, ensuring all files have the same format
and schema.
-- Single file
SELECT * FROM csv.`dbfs:/path/to/your/file.csv`

-- Multiple files
SELECT * FROM csv.`dbfs:/path/to/your/files*.csv`

-- Entire directory
SELECT * FROM csv.`dbfs:/path/to/your/directory/`

Working with text-based files


The "text" format can be used to extract data from text-based files as raw strings. This is
helpful when dealing with corrupted input data, as you can apply custom text parsing
functions to extract values.
SELECT * FROM text.`dbfs:/path/to/your/file.txt`

Extracting binary representation of file content


When working with images and unstructured data, you can use the "binaryFile" format to
extract the binary representation of files' content.
SELECT * FROM binaryFile.`dbfs:/path/to/your/image.png`

Loading data into Delta tables


To load data from files into Delta tables, you can use CTAS (Create Table As Select)
statements. This allows you to extract data directly from external sources with well-
defined schema, such as parquet files and tables.
CREATE TABLE delta_table
USING delta
AS SELECT * FROM parquet.`dbfs:/path/to/your/file.parquet`

Creating external tables with additional options


For file formats that require additional options, you can use the regular CREATE TABLE
statement with the USING keyword to specify the external data source type (e.g., CSV
format) and any additional options. This creates an external table that references the files
without moving the data during table creation.
CREATE TABLE external_table
USING csv
OPTIONS (path 'dbfs:/path/to/your/file.csv', header 'true', inferSchema 'true')

Limitations of external tables


Tables defined directly against external data sources (for example, with USING CSV or JDBC
connections to external databases) are not Delta tables. This means they do not have the
performance and features guaranteed by Delta Lake, such as time travel and always reading the
most recent data.
Overcoming limitations with temporary views
To overcome limitations of external tables, you can create a temporary view referring to
the external data source and then query this temporary view to create a table using CTAS
statements. This allows you to extract data from external sources and load it into Delta
tables, ensuring the benefits of Delta Lake are fully leveraged.
-- Create a temporary view against the external source, supplying the needed options
CREATE OR REPLACE TEMP VIEW temp_view
USING csv
OPTIONS (path 'dbfs:/path/to/your/file.csv', header 'true', inferSchema 'true')

-- Create a Delta table from the temporary view


CREATE TABLE delta_table
USING delta
AS SELECT * FROM temp_view

Querying JSON Files


-- Reading a single JSON file
SELECT * FROM json.`/path/to/file.json`;

-- Listing files in a directory and querying multiple files using wildcard characters
SELECT * FROM json.`/path/to/directory/*.json`;

-- Using input_file_name function to track the source file for each record
SELECT input_file_name(), * FROM json.`/path/to/directory/*.json`;

Querying CSV Files


-- Issues with parsing CSV files using simple SELECT statement
SELECT * FROM csv.`/path/to/file.csv`;

-- Creating a table with the USING keyword to provide schema declaration and configuration
CREATE TABLE customers
USING csv
OPTIONS (path '/path/to/customers.csv', header 'true', inferSchema 'true');

-- Ensuring column order consistency when working with CSV files


SELECT * FROM customers;

-- Describing table extended for metadata information


DESCRIBE EXTENDED customers;

Delta Tables vs. External Tables


-- Understanding the difference between Delta tables and external tables
-- Example of creating an external table
CREATE TABLE external_customers
USING csv
OPTIONS (path '/path/to/customers.csv', header 'true', inferSchema 'true');

-- Refreshing table cache to reflect changes made to the external source


REFRESH TABLE external_customers;

Creating Delta Tables


-- Using CTAS statements to create Delta tables with schema inferred from query results
CREATE TABLE delta_customers AS
SELECT * FROM csv.`/path/to/customers.csv`;

-- Correcting data parsing issues by specifying file options in temporary views


CREATE OR REPLACE TEMP VIEW temp_customers
USING csv
OPTIONS (path '/path/to/customers.csv', header 'true', inferSchema 'true');

CREATE TABLE delta_customers AS


SELECT * FROM temp_customers;

Writing to Tables
 Delta technology provides ACID compliant updates to Delta tables, ensuring data
integrity.
# Example of creating a Delta table using the DeltaTable builder API
from delta.tables import DeltaTable

DeltaTable.create(spark) \
    .tableName("delta_table") \
    .addColumn("id", "INT") \
    .addColumn("name", "STRING") \
    .addColumn("age", "INT") \
    .location("/path/to/delta_table") \
    .execute()

 Using CREATE TABLE AS SELECT (CTAS) statement to create a new table from
existing data in Parquet files.
-- SQL example of CTAS
CREATE TABLE new_table
USING delta
AS SELECT * FROM parquet.`/path/to/parquet/files`;

 Benefits of overwriting tables instead of deleting and recreating them, such as


faster execution and Time Travel feature.
-- SQL example of overwriting a table
CREATE OR REPLACE TABLE delta_table
USING delta
AS SELECT * FROM source_table;

 CREATE OR REPLACE TABLE statement fully replaces the content of a table each
time it executes.
-- SQL example of CREATE OR REPLACE TABLE
CREATE OR REPLACE TABLE delta_table
USING delta
AS SELECT * FROM source_table;

 INSERT OVERWRITE statement can only overwrite an existing table and is a safer
technique for data integrity.
-- SQL example of INSERT OVERWRITE
INSERT OVERWRITE TABLE delta_table
SELECT * FROM source_table;

 INSERT INTO statement is used for appending new records to tables but may result
in duplicate records if not managed properly.
-- SQL example of INSERT INTO
INSERT INTO delta_table
SELECT * FROM new_records;

 MERGE INTO statement allows for upserting (insert, update, delete) data in the
target table based on conditions.
-- SQL example of MERGE INTO
MERGE INTO delta_table AS target
USING updates_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age);

 Atomic transactions in merge operations ensure data consistency.


 Use of temporary views for specifying data to be merged into the target table.
# Example of using temporary views in Python
updates_df.createOrReplaceTempView("updates_view")

spark.sql("""
MERGE INTO delta_table AS target
USING updates_view AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age)
""")

 Ensuring uniqueness and avoiding duplicates when inserting records using MERGE
INTO statement.
-- SQL example ensuring uniqueness
MERGE INTO delta_table AS target
USING (
SELECT DISTINCT id, name, age FROM updates_table
) AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
INSERT (id, name, age) VALUES (source.id, source.name, source.age);

By understanding these key concepts, users can efficiently manage and update data in
Delta tables while maintaining data integrity and consistency.

Advanced Transformations
JSON Data Interaction
 Colon Syntax: Traverse nested JSON data.
 Example: Accessing nested values like first name or country.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("JSONExample").getOrCreate()
json_df = spark.read.json("path/to/json/file")
json_df.select(col("name.first"), col("address.country")).show()
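
The colon syntax mentioned above applies to JSON stored as strings; a minimal, Databricks-specific sketch, assuming a hypothetical table customers_raw with a STRING column profile holding serialized JSON:

# Databricks SQL colon operator for extracting fields from a JSON string column
result = spark.sql("""
    SELECT profile:name.first AS first_name,
           profile:address.country AS country
    FROM customers_raw
""")
result.show()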

 from_json Function: Parses JSON objects into struct types.


 Requires schema of the JSON object.
 from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
StructField("name", StructType([
StructField("first", StringType(), True),
StructField("last", StringType(), True)
]), True),
StructField("address", StructType([
StructField("country", StringType(), True)
]), True)
])

# Assume json_df has a STRING column "json_column" containing serialized JSON objects
json_df.withColumn("parsed_json", from_json(col("json_column"), schema)) \
    .select("parsed_json.*").show()

Struct Types
 Period/Dot Syntax: Interact with nested objects.
 from pyspark.sql import Row

struct_df = spark.createDataFrame([
    Row(name=Row(first="John", last="Doe"), country="USA"),
    Row(name=Row(first="Jane", last="Doe"), country="Canada")
])

struct_df.select(col("name.first"), col("country")).show()

 Star Operation: Flatten fields into separate columns.


 struct_df.select("name.*", "country").show()

Array Handling
 explode Function: Puts each element of an array on its own row.
 from pyspark.sql.functions import explode

array_df = spark.createDataFrame([
(1, ["a", "b", "c"]),
(2, ["d", "e", "f"])
], ["id", "letters"])

array_df.select("id", explode("letters")).show()

 collect_set Function: Collects unique values for a field, including within arrays.
 Combine with flatten and array_distinct to keep only distinct values.
 from pyspark.sql.functions import collect_set, flatten, array_distinct

array_df = spark.createDataFrame([
(1, ["a", "b", "a"]),
(2, ["d", "e", "d"])
], ["id", "letters"])

array_df.select(array_distinct(flatten(collect_set("letters")))).show()

Join Operations
 Types of Joins: Inner, outer, left, right, anti, cross, and semi joins.
 Usage: Join tables based on specific keys and store results in new views.
 df1 = spark.createDataFrame([
(1, "John"),
(2, "Jane")
], ["id", "name"])

df2 = spark.createDataFrame([
(1, "USA"),
(3, "Canada")
], ["id", "country"])

inner_join_df = df1.join(df2, df1.id == df2.id, "inner")


inner_join_df.show()

Set Operations
 Union: Combines datasets.
 df1 = spark.createDataFrame([
(1, "John"),
(2, "Jane")
], ["id", "name"])

df2 = spark.createDataFrame([
(3, "Doe"),
(4, "Smith")
], ["id", "name"])

union_df = df1.union(df2)
union_df.show()

 Intersect: Finds common elements between datasets.


 df1 = spark.createDataFrame([
(1, "John"),
(2, "Jane")
], ["id", "name"])

df2 = spark.createDataFrame([
(2, "Jane"),
(3, "Doe")
], ["id", "name"])

intersect_df = df1.intersect(df2)
intersect_df.show()

Pivot Clause
 Usage: Aggregates values based on specific column values and turns them into
multiple columns.
 Useful for creating flattened data formats for dashboarding or machine
learning.
 pivot_df = spark.createDataFrame([
("A", "Math", 85),
("A", "Science", 90),
("B", "Math", 75),
("B", "Science", 80)
], ["student", "subject", "score"])

pivot_df.groupBy("student").pivot("subject").avg("score").show()

Summary
Advanced transformations in Spark SQL enable:
 Manipulation of complex data structures.
 Efficient aggregations.
 Joining multiple datasets for analytical and machine learning tasks.

Higher Order Functions


Higher order functions allow for direct manipulation of complex data types like arrays
and map type objects. Examples of higher order functions include filter and transform
functions.
Filter Function
The filter function is used to filter an array using a given lambda function. It helps extract
specific elements based on a defined condition.
# Conceptual example using Python's built-in filter (Spark's FILTER higher-order function applies the same idea to array columns)
numbers = [1, 2, 3, 4, 5, 6]
filtered_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(filtered_numbers) # Output: [2, 4, 6]

Transform Function
The transform function applies a transformation to all items in an array and extracts the
transformed values.
# Conceptual example using Python's built-in map (Spark's TRANSFORM higher-order function applies the same idea to array columns)
numbers = [1, 2, 3, 4, 5, 6]
squared_numbers = list(map(lambda x: x ** 2, numbers))
print(squared_numbers) # Output: [1, 4, 9, 16, 25, 36]
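
The same idea applies to array columns in Spark itself; a minimal sketch using the FILTER and TRANSFORM higher-order functions (the DataFrame and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("HigherOrderFunctions").getOrCreate()

# Illustrative DataFrame with an array column
df = spark.createDataFrame([(1, [1, 2, 3, 4, 5, 6])], ["id", "numbers"])

df.select(
    F.expr("filter(numbers, x -> x % 2 = 0)").alias("even_numbers"),    # keep even values
    F.expr("transform(numbers, x -> x * x)").alias("squared_numbers")   # square each value
).show(truncate=False)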

User Defined Functions (UDFs)


UDFs allow for the registration of custom SQL logic as functions in a database. They are
reusable and leverage Spark SQL for optimization.
Creating UDFs
To create a UDF, specify a function name, optional parameters, return type, and custom
logic. Persistent SQL UDFs are stored in the metastore and can be reused across different Spark
sessions and notebooks.
# Example of creating a UDF in PySpark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDF Example").getOrCreate()

# Define a Python function


def square(x):
    return x * x

# Register the function as a UDF


square_udf = udf(square, IntegerType())

# Use the UDF in a DataFrame


df = spark.createDataFrame([(1,), (2,), (3,)], ["number"])
df = df.withColumn("squared", square_udf(df["number"]))
df.show()

Describe Function
The Describe Function command provides information about the registered function,
including expected inputs and return type. Describe Function Extended offers more
detailed information, including the SQL logic used in the function.
-- Example of Describe Function in Spark SQL
DESCRIBE FUNCTION square;
DESCRIBE FUNCTION EXTENDED square;
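
DESCRIBE FUNCTION is most useful with SQL UDFs persisted in the metastore; a hedged sketch of creating one on Databricks (the function name and logic are illustrative, not from the original notes):

# Create a persistent SQL UDF so it can be described and reused across sessions
spark.sql("""
    CREATE OR REPLACE FUNCTION square_sql(x INT)
    RETURNS INT
    RETURN x * x
""")

spark.sql("DESCRIBE FUNCTION EXTENDED square_sql").show(truncate=False)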

Application of UDFs
UDFs can be applied to columns in tables to perform custom operations. They are
powerful tools for manipulating data within Spark.
# Example of applying a UDF to a DataFrame column
df = spark.createDataFrame([(1,), (2,), (3,)], ["number"])
df = df.withColumn("squared", square_udf(df["number"]))
df.show()

Optimization
UDF functions in Spark are optimized for parallel execution, enhancing performance for
large datasets.
Dropping UDFs
UDFs can be dropped when they are no longer needed, freeing up resources in the
database.
-- Example of dropping a UDF in Spark SQL
DROP FUNCTION IF EXISTS square;

Structured Streaming in Databricks


Structured streaming in Databricks is a powerful tool for processing streaming data using
Spark.

Data Stream
A data stream is any data source that grows over time, such as new JSON log files, CDC
feeds, or events from messaging systems like Kafka.

Processing Approaches
Traditional processing involves reprocessing the entire dataset each time new data
arrives, whereas custom logic can be written to capture only new data since the last
update. Spark structured streaming simplifies this process.

Scalable Processing Engine


Spark structured streaming is a scalable streaming processing engine that queries infinite
data sources, automatically detects new data, and persists results incrementally into a
data sink.

Interacting with Infinite Data Source


Treat infinite data sources as tables, where new data is treated as new rows appended to a
table. Delta tables are well integrated with Spark structured streaming for seamless
processing.

Configuring Streaming Queries


Use spark.readStream() to query Delta tables as stream sources and apply
transformations on streaming data frames. Write streaming queries to durable storage
using dataframe.writeStream method.
Example:
# Read from a Delta table as a stream source
streaming_df = spark.readStream.format("delta").table("source_table")

# Apply transformations
transformed_df = streaming_df.select("column1", "column2").where("column3 > 100")

# Write the streaming data to a Delta table


query = (transformed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .toTable("target_table"))

# Start the streaming query
query.awaitTermination()

Streaming Job Configurations


Define trigger intervals to process data batches, set output modes (append or complete),
and create checkpoints to track progress. Checkpoints cannot be shared between streams
for processing guarantees.
Example:
# Define a streaming query with trigger interval and output mode
query = (transformed_df.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(processingTime='10 seconds')
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .toTable("target_table"))

# Start the streaming query


query.awaitTermination()

Guarantees and Fault Tolerance


Structured streaming ensures fault tolerance with checkpointing and write-ahead logs for
resuming from failures. Exactly-once data processing is ensured with idempotent sinks
and repeatable data sources.
Example:
# Define a streaming query with exactly-once processing guarantees
query = (transformed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .option("failOnDataLoss", "false")
    .toTable("target_table"))

# Start the streaming query


query.awaitTermination()

Limitations
Some operations, like sorting and deduplication, are not supported in streaming data
frames. Advanced methods like windowing and watermarking can help achieve these
operations in structured streaming.
Example:
# Example of windowing and watermarking
from pyspark.sql.functions import window

# Define a streaming query with windowing and watermarking


windowed_df = (streaming_df
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window("timestamp", "5 minutes"), "column1")
    .count())

# Write the windowed data to a Delta table


query = (windowed_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .toTable("target_table"))

# Start the streaming query
query.awaitTermination()

Structured streaming in Databricks offers a robust solution for processing streaming data
efficiently and reliably. It provides a simplified approach to handling infinite data sources
and ensures fault tolerance and exactly-once processing in streaming jobs.
Structured Streaming is a feature in Apache Spark that allows for real-time processing of
data in a consistent and fault-tolerant manner. It enables incremental processing of data
streams by treating them as a continuous series of small batch jobs.
Key Concepts:
 Data Streaming: Structured Streaming enables the processing of data streams in
real-time, making it suitable for applications that require low-latency processing of
continuous data streams.
 PySpark API: In order to work with data streaming in SQL, the spark.readStream
method provided by the PySpark API is used to query Delta tables as a stream
source.
# Example: Reading a stream from a Delta table
streaming_df = spark.readStream.format("delta").table("delta_table_name")

 Temporary Views: Streaming temporary views can be created against stream


sources to apply SQL transformations and operations on the streaming data in the
same way as with static data.
# Example: Creating a temporary view
streaming_df.createOrReplaceTempView("streaming_view")
result_df = spark.sql("SELECT * FROM streaming_view WHERE value > 100")

 Aggregations: Streaming temporary views can be used to apply aggregations on


the streaming data, allowing for the computation of summary statistics and metrics
in real-time.
# Example: Applying aggregations
agg_df = spark.sql("""
SELECT
window(timestamp, '1 minute') as time_window,
count(*) as count
FROM streaming_view
GROUP BY time_window
""")

 Incremental Processing: To persist incremental results of streaming queries, the


logic needs to be passed back to the PySpark DataFrame API, where streaming
DataFrames can be created and processed.
# Example: Writing the results to a Delta table
query = (result_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta_table"))

 Checkpointing: Checkpoint locations are used to track the progress of the


streaming processing, ensuring fault tolerance and efficient recovery in case of
failures.
# Example: Specifying checkpoint location
query = (result_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta_table"))

 Trigger Intervals: The trigger intervals can be configured to control the frequency
of processing streaming data, with options such as every 4 seconds or availableNow
for batch mode execution.
# Example: Setting trigger intervals
query = (result_df.writeStream
    .trigger(processingTime='4 seconds')
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta_table"))

 Output Modes: Output modes such as append or complete can be set to determine
how the results of streaming queries are written to durable storage, with complete
mode used for aggregation queries.
# Example: Using complete output mode for aggregations
query = agg_df.writeStream.outputMode("complete").format("console").start()

 Interactive Dashboard: An interactive dashboard can be used to monitor the


performance and progress of streaming queries, allowing for real-time visualization
of streaming data processing.
# Example: Starting a streaming query with a console sink for monitoring
query = result_df.writeStream.format("console").start()

 Incremental Batch Processing: By using trigger options like availableNow,


streaming queries can be executed in incremental batch mode to process all new
available data and stop automatically after execution.
# Example: Using availableNow trigger for incremental batch processing
query = (result_df.writeStream
    .trigger(availableNow=True)
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .start("/path/to/delta_table"))

Overall, Structured Streaming in Apache Spark provides a powerful framework for real-
time data processing, enabling users to efficiently handle continuous streams of data and
perform complex analysis and transformations in a fault-tolerant and scalable manner.

Incremental Data Ingestion


Incremental data ingestion refers to the process of loading new data from files into a
storage location without the need to reprocess previously processed files. This allows for
efficient and streamlined data processing, particularly in data pipelines. In Databricks,
there are two main methods for incrementally ingesting data from files: the Copy Into SQL
command and the Auto Loader.

Copy Into SQL Command


The Copy Into command in Databricks allows users to load data from a file location into a
Delta table. This command operates idempotently and incrementally, meaning it will only
load new files from the source location each time it is run.
 Target Table: Users specify the target table.
 Source Location: Users specify the source location.
 File Format: Users specify the format of the source file (e.g., CSV, Parquet) along
with any other relevant format options.
The Copy Into command is efficient for ingesting thousands of files.
Example Code
COPY INTO target_table
FROM 's3://your-bucket/path/'
FILEFORMAT = PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')

Auto Loader
The Auto Loader in Databricks uses Spark structured streaming to efficiently process new
data files as they arrive in a storage location. It can scale to support real-time ingestion of
millions of files per hour.
 Checkpointing: Uses checkpointing to track the ingestion process and store
metadata of discovered files, ensuring they are processed exactly once and can
resume from failure points.
 StreamReader Format: The format of StreamReader used in Auto Loader is
cloudFiles, where users specify the format of the source files through options.
 Schema Detection: Can automatically configure the schema of the data, detecting
any updates to fields in the source dataset.
It is recommended to use Auto Loader for ingesting data from cloud object storage,
especially when dealing with large volumes of data.
Example Code
from pyspark.sql.functions import *

# Define the schema if known, or use schema inference


schema = "id INT, name STRING, age INT"

# Read data using Auto Loader


df = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.schemaLocation", "/path/to/checkpoint")
.load("s3://your-bucket/path/"))

# Write data to a Delta table


(df.writeStream
.format("delta")
.option("checkpointLocation", "/path/to/checkpoint")
.start("/path/to/delta-table"))

Choosing Between Copy Into and Auto Loader


 Copy Into Command: Use for ingesting a smaller number of files, typically in
thousands.
 Auto Loader: Use for ingesting a larger number of files, especially in the order of
millions or more over time. Auto Loader can split processing into multiple batches,
making it more efficient at scale and suitable for cloud object storage ingestion.

Conclusion
Incremental data ingestion methods like Copy Into and Auto Loader in Databricks provide
efficient ways to process new data from files without reprocessing existing data.
Understanding the differences between these methods and when to use each can help
optimize data pipelines and improve overall data processing efficiency.

Auto Loader - Incremental Data Ingestion with Apache Spark Structured


Streaming
Auto Loader
Auto Loader is a feature in Apache Spark that allows for incremental data ingestion from
files. It automatically detects new data files in a specified directory and processes them for
ingestion into a target table.
Spark Structured Streaming
Spark Structured Streaming is a scalable and fault-tolerant stream processing engine built
on Spark SQL. It enables real-time data processing by treating data streams as continuous
tables.
Data Source Directory
The directory where the data files are stored for Auto Loader to read and ingest. It can
contain various file formats, such as Parquet files in this example.
ReadStream and WriteStream
Methods in Spark structured streaming API used to read and write streaming data,
respectively. In the context of Auto Loader, these methods are used to define the data
ingestion pipeline.
Checkpointing
The process of saving the progress and state of a streaming query in a checkpoint location.
This helps in fault tolerance and resuming the processing from the last checkpoint in case
of failures.
Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark
and big data workloads. It provides features for data versioning and management,
ensuring data reliability and consistency.
Streaming Update
Each batch of new data ingested by Auto Loader results in a new table version in Delta
Lake. This versioning allows tracking the history of changes and updates to the table.
Data Simulation
The process of simulating the arrival of new data files in the data source directory for
testing and demonstration purposes. This can be done using helper functions or scripts to
mimic real-time data ingestion scenarios.
Data Ingestion
The process of loading data from external sources into a target table for analysis and
querying. Auto Loader simplifies this process by automatically detecting and processing
new data files.
Data Quality
Ensuring the accuracy, completeness, and consistency of data being ingested and
processed by Auto Loader is essential for making informed decisions based on the
analyzed data.

Code Examples
Setting Up Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("AutoLoaderExample") \
.getOrCreate()

Reading Data with Auto Loader


df = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "parquet") \
.load("/path/to/data/source/directory")

Writing Data to Delta Lake


df.writeStream \
.format("delta") \
.option("checkpointLocation", "/path/to/checkpoint/dir") \
.start("/path/to/delta/table")

Simulating Data Arrival


import os
import shutil
import time

def simulate_data_arrival(source_dir, target_dir, interval=5):
    # Copy Parquet files one at a time to mimic new files arriving in the source directory
    files = [f for f in os.listdir(source_dir) if f.endswith('.parquet')]
    for file in files:
        shutil.copy(os.path.join(source_dir, file), target_dir)
        time.sleep(interval)

simulate_data_arrival("/path/to/simulated/data", "/path/to/data/source/directory")

Summary
Overall, Auto Loader in combination with Apache Spark Structured Streaming provides a
powerful tool for handling real-time data ingestion and processing scenarios, enabling
efficient and scalable data analytics solutions.
Multi-hop architecture, also known as Medallion architecture, is a data design pattern
used to logically organize data in a multilayered approach. The goal of multi-hop
architecture is to incrementally improve the structure and quality of data as it flows
through each layer of the architecture. The architecture typically consists of three layers:
bronze, silver, and gold.
Layers
1. Bronze Table
 Contains raw data ingested from various sources such as JSON files,
operational databases, or Kafka Stream.
2. Silver Table
 Provides a more refined view of the data by cleaning, filtering, and enriching
the records with fields from various tables.
3. Gold Table
 Provides business-level aggregations and is often used for reporting,
dashboarding, or machine learning purposes.
Benefits
 Simple and easy-to-understand data model
 Incremental ETL capabilities
 Ability to combine streaming and batch workloads
 Ability to recreate tables from raw data at any time
Delta Lake Multi-hop Pipeline
Creating a pipeline using Delta Lake to process data from a bookstore dataset containing
customers, orders, and books tables.
Initial Steps
1. Running the Copy-Datasets script and checking the source directory.
2. Configuring an Auto Loader for stream reading Parquet files with schema inference.
3. Registering a streaming temporary view (orders_raw_tmp) for data transformation
in Spark SQL.
Enriching Raw Data
 Adding metadata such as source file information and ingestion time for
troubleshooting purposes.
 Visualizing enriched data before further processing.
Processing Raw Data
 Passing enriched data to PySpark API for incremental write to a Delta Lake table
(orders_bronze).
 Monitoring and checking records written into the bronze table.
Working on the Silver Layer
 Creating a static lookup table (customers) for joining with the bronze table.
 Performing enrichments and checks (adding customer names, parsing timestamps,
filtering out orders with no items).
 Writing enriched data into a silver table.
Moving to the Gold Layer
 Creating an aggregated gold table (daily_customer_books) for daily book counts per
customer.
 Configuring streams with trigger options and output modes for data processing.
 Querying the gold table to view aggregated data.
Processing Remaining Data Files
 Updating the gold table by rerunning the final query as a batch job.
Wrapping Up
 Stopping all active streams to conclude the pipeline.
In this multi-hop architecture, we observe a step-by-step process involving data
enrichment, transformation, and aggregation using Delta Lake and Apache Spark. The
pipeline demonstrates the integration of streaming and batch processing for efficient data
processing and analytics tasks.
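
A condensed sketch of such a pipeline in PySpark, under assumed table, path, and column names (orders_bronze, orders_silver, daily_customer_books, customer_id, quantity, order_timestamp, and the checkpoint paths are illustrative):

from pyspark.sql import functions as F

# Bronze: incrementally ingest raw Parquet files with Auto Loader, adding ingestion metadata
bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/path/to/checkpoints/orders_raw")
    .load("/path/to/orders-raw")
    .withColumn("arrival_time", F.current_timestamp())
    .withColumn("source_file", F.input_file_name()))

(bronze.writeStream
    .option("checkpointLocation", "/path/to/checkpoints/orders_bronze")
    .toTable("orders_bronze"))

# Silver: join the bronze stream with a static customers lookup and filter bad records
customers = spark.table("customers")
silver = (spark.readStream.table("orders_bronze")
    .join(customers, "customer_id")
    .filter("quantity > 0"))

(silver.writeStream
    .option("checkpointLocation", "/path/to/checkpoints/orders_silver")
    .toTable("orders_silver"))

# Gold: daily book counts per customer, refreshed as an incremental batch
gold = (spark.readStream.table("orders_silver")
    .groupBy("customer_id", F.to_date("order_timestamp").alias("order_date"))
    .agg(F.sum("quantity").alias("books_count")))

(gold.writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/path/to/checkpoints/daily_customer_books")
    .trigger(availableNow=True)
    .toTable("daily_customer_books"))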
Delta Live Tables (DLT) is a framework designed to simplify the creation of reliable and
maintainable data processing pipelines.
Multi-hop Pipeline:
DLT allows for the creation of multi-hop pipelines, where data flows through different
layers (bronze, silver, gold) to undergo various transformations and enrichment
processes.
Bronze Tables:
The initial layer in a DLT pipeline is the bronze layer, where raw data is ingested. Tables
in this layer capture data in its rawest form, before any processing or transformations
have been applied.
Silver Tables:
The silver layer is where data is refined through operations such as data cleansing and
enrichment. Quality control measures are implemented at this level using constraint
keywords to ensure data integrity.
Gold Tables:
The gold layer represents the final refined data, often used for reporting or analysis purposes. DLT pipelines can run in triggered mode, which processes the data available when the update starts and then stops, or in continuous mode, which keeps ingesting new data as it arrives.
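As a minimal illustration of these three layers in DLT's Python API (the table names, paths, columns, and the quality constraint are hypothetical, not taken from the course material):

import dlt

@dlt.table(comment = "Raw orders ingested incrementally with Auto Loader")
def orders_bronze():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/to/raw/orders"))

@dlt.table(comment = "Cleaned and enriched orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # quality constraint: drop bad records
def orders_silver():
    return dlt.read_stream("orders_bronze").select("order_id", "customer_id", "quantity")

@dlt.table(comment = "Business-level aggregate for reporting")
def orders_gold():
    return dlt.read("orders_silver").groupBy("customer_id").count()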
Databricks Notebooks:
DLT pipelines are implemented using Databricks notebooks, which allow for the easy
creation and management of data processing logic.
Error Handling:
DLT pipelines include error handling mechanisms, such as reporting constraint violations
and providing options for handling records that do not meet specified criteria.
Metadata Management:
DLT pipelines store metadata about tables, events, and configurations in a designated
storage location, allowing for easy monitoring and tracking of pipeline activities.
Cluster Management:
DLT pipelines can be run on designated clusters, which can be configured based on
performance requirements and resource needs. Clusters can be created, managed, and
terminated as needed.
Summary:
Delta Live Tables provides a powerful framework for building and managing data
processing pipelines, offering features for data transformation, quality control, error
handling, and cluster management. By utilizing DLT, organizations can streamline the
process of building and maintaining data pipelines while ensuring data integrity and
reliability.

Change Data Capture (CDC)


Change Data Capture (CDC) is a process that involves identifying and capturing changes
made to data in a data source, and then delivering those changes to a target destination.
These changes can include:
 New records to be inserted
 Updated records that need to be reflected
 Deleted records that must be removed in the target
When changes are made, they are logged as events containing both data records and
metadata information. This metadata indicates:
 Whether the record was inserted, updated, or deleted
 A version number or timestamp to indicate the order in which changes occurred

Applying CDC Events


CDC events need to be applied to the target table, ensuring that only the most recent
changes are applied. This can involve:
 Inserting new records
 Updating existing records
 Deleting records as needed

Delta Live Tables and CDC


Delta Live Tables support CDC feed processing using the Apply Changes Into command.
This command allows for changes to be applied to a target table from a CDC feed table.
The command includes:
 Specifying primary key fields
 Updating or inserting records based on key existence
 Deleting records based on specified conditions
 Sequencing the operations
 Specifying fields to be added to the target table
The Apply Changes Into command automatically orders late-arriving records, ensuring
that records are properly updated even if they arrive out of order. It also ensures that
deleted records from the source table are no longer reflected in downstream tables. The
command's default behavior is to upsert CDC events into the target table, updating
matching records or inserting new ones when no match is found.

Optional Features
Optional features of the Apply Changes Into command include:
 Specifying handling for delete events
 Specifying primary keys for the table
 Ignoring specified columns
 Choosing between storing records as slowly changing dimension type 1 or type 2
Type 1 slowly changing dimension tables maintain only one record for each unique key,
with updates overwriting existing information.

Disadvantages
One disadvantage of Apply Changes Into is that it breaks the append-only requirements
for streaming table sources. This means that tables updated using this command cannot
be used as streaming sources in subsequent layers.

Conclusion
In conclusion, Change Data Capture is a crucial process for identifying, capturing, and
delivering data changes to target destinations. Utilizing the Apply Changes Into command
in Delta Live Tables facilitates effective processing of CDC events and ensures accurate
and up-to-date data management.

Change Data Capture (CDC) Feed with Delta Live Tables


Processing CDC feed involves capturing changes made to data in source systems and
applying them to a target system. Delta live tables (DLT) provide a mechanism for
processing CDC feed by ingesting and applying changes incrementally.
CDC Data
 CDC data is delivered in JSON files containing information about insert, update, and
delete operations.
 Operational columns like row_status and row_time indicate the type of operation
and the timestamp of the change.
DLT Pipeline Components
 Bronze table: Ingests the CDC feed using auto loader to load JSON files
incrementally.
 Silver table: Target table where changes from the CDC feed are applied using the
Apply Changes Into command.
 Gold layer: Aggregate query to create a live table from the data in the silver table.
Code Examples
Bronze Table
import dlt
from pyspark.sql.functions import col

@dlt.table
def bronze_table():
    # Incrementally ingest the CDC JSON files with Auto Loader (cloudFiles)
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/to/cdc/json/files")
    )

Silver Table
# Declare the streaming target table, then apply the CDC feed into it
dlt.create_streaming_table("silver_table")

dlt.apply_changes(
    target = "silver_table",
    source = "bronze_table",
    keys = ["primary_key"],
    sequence_by = col("row_time"),
    apply_as_deletes = col("row_status") == "delete",
    except_column_list = ["row_status", "row_time"]
)

Gold Layer
@dlt.table
def gold_layer():
    # Live aggregate table built from the silver table
    return (
        dlt.read("silver_table")
        .groupBy("some_column")
        .agg({"another_column": "sum"})
    )

Apply Changes Into Command


 Specifies the target table, streaming source, primary key, update/insert operations,
delete conditions, and ordering of operations.
 Excludes operational columns from being added to the target table.
DLT Views
 Temporary views scoped to the DLT pipeline, useful for enforcing data quality checks (a sketch follows below).
 Views can reference tables across notebooks in the DLT pipeline configuration.
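A minimal sketch of a DLT view, assuming hypothetical table names; the left-anti join surfaces orders whose customer record is missing, which a downstream expectation could then flag:

import dlt

@dlt.view(comment = "Orders that have no matching customer record")
def orders_without_customers():
    return (dlt.read("orders_silver")
        .join(dlt.read("customers"), "customer_id", "left_anti"))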
Updating DLT Pipeline
 Adding a new notebook to the existing DLT pipeline by configuring the pipeline
settings and adding the notebook library.
 Running the updated pipeline and ensuring data refresh by starting the pipeline
from scratch if needed.

Benefits of DLT
 Efficient processing of CDC feed with incremental changes.
 Flexibility to create views and reference tables across notebooks in the DLT
pipeline.
 Scalability for handling multiple data sources and complex data processing
requirements.

Conclusion
Processing CDC feed with DLT involves capturing, ingesting, and applying incremental
changes to data tables. The DLT pipeline components, Apply Changes Into command, and
DLT views provide a structured approach to handling CDC data in a scalable and efficient
manner. By following the steps outlined in this demo, users can manage and process CDC
feed effectively using DLT in their data pipelines.

Orchestrating Jobs with Databricks


Jobs in Databricks
Jobs in Databricks allow users to schedule one or multiple tasks as part of a job. These
tasks can include executing notebooks, running pipelines, or other data processing
operations.
Creating a Multitask Job
To create a multitask job in Databricks, users can navigate to the workflow tabs on the
sidebar and click on the "Create Job" button. Tasks can be added to the job, each with its
own configuration such as:
 Task name
 Type (notebook or pipeline)
 Path to the notebook or pipeline
 Cluster selection
 Dependencies on other tasks
Scheduling Options
Databricks provides users with scheduling options for jobs, including:
 Setting trigger types (e.g., scheduled triggers)
 Configuring cron syntax for job scheduling (see the example after this list)
 Setting up email notifications for job status updates (start, success, failure)
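Databricks job schedules use Quartz cron syntax (seconds, minutes, hours, day-of-month, month, day-of-week). The snippet below is only an illustrative sketch of the schedule portion of a Jobs API payload; the exact field names should be verified against the Jobs API documentation:

# Quartz cron examples:
#   "0 0 6 * * ?"     -> every day at 06:00
#   "0 30 2 ? * MON"  -> every Monday at 02:30
schedule = {
    "quartz_cron_expression": "0 0 6 * * ?",   # assumed Jobs API field name
    "timezone_id": "UTC",
    "pause_status": "UNPAUSED",
}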
Monitoring Job Runs
Users can monitor the status of their job runs in Databricks, including active runs and
completed runs. The job runs tab displays information about each run, such as:
 Start time
 Tasks executed
 Results

Handling Job Failures


In case of job failures, users can investigate the root cause of the failure by examining the
run output. Tasks that have failed can be rerun using the "Repair Run" option, which
allows users to re-execute only the failed tasks to fix the job.
Overall, orchestrating jobs in Databricks involves creating, scheduling, monitoring, and
handling job runs efficiently to automate data processing workflows and ensure
successful execution of tasks.
Databricks SQL
Databricks SQL is a data warehouse that allows users to run SQL and BI applications at
scale with a unified governance model. It is accessed by switching to the SQL persona on
the sidebar in the Databricks workspace.

 SQL Warehouse:
 A SQL warehouse in Databricks SQL is the compute power, or SQL engine,
based on a Spark cluster.
 It can be created by configuring a new SQL engine with options like cluster
size and name.

 Dashboards:
 Dashboards in Databricks SQL allow users to create visualizations and
analyze data.
 Users can import pre-built dashboards or create new ones from scratch.
 Visualization:
 Users can create various visualizations like pie charts and graphs by selecting
columns and settings for X and Y axes.
 Visualizations can be added to dashboards for easy data analysis.
 Querying:
 Custom SQL queries can be written in the SQL editor by selecting the schema,
database, and table.
 Queries can be saved, added to dashboards, and scheduled to refresh
automatically.
 Alerts:
 Alerts in Databricks SQL notify users when a certain threshold in a query
result is met.
 Users can set up alerts with specific conditions and receive notifications
through email or other platforms like Slack and Microsoft Teams.
Summary:
Databricks SQL provides a comprehensive platform for data analysis through SQL queries,
visualizations, dashboards, and alerting capabilities. Users can leverage the power of
Spark clusters to run complex analytics and gain insights from their data.

Data Object Privileges in Databricks


Data Governance Model
Databricks provides a data governance model that allows users to programmatically
grant, deny, and revoke access to data objects from Spark SQL. This model helps in
managing permissions for different data objects such as databases, tables, and views.
Object Types
Databricks allows configuration of permissions for various object types including:
 Catalog: access to the entire data catalog
 Schema: access to a database
 Table: managed and external tables
 SQL View
 Named Function
 Any File: access to the underlying file system
Privileges
Different privileges can be configured on data objects in Databricks. These include:
 SELECT: provides read access to an object.
 MODIFY: allows adding, deleting, and modifying data in an object.
 CREATE: grants the ability to create new objects (e.g., tables).
 READ_METADATA: enables viewing of an object and its metadata.
 USAGE: grants no specific ability on its own, but is a prerequisite for performing any action on a database object.
 ALL PRIVILEGES: combines all of the above privileges into a single permission.

Privilege Management
In order to grant privileges on an object, the user must be a Databricks administrator or
the owner of the object. Different ownership levels include:
 Catalog owner: can grant privileges on all objects in the catalog.
 Database owner: can grant privileges on objects within that specific database.
 Table owner: can grant privileges only on that table.
Similar rules apply for views and named functions.
Operations
Apart from granting privileges, users can also perform deny and revoke operations to
manage object privileges. The SHOW GRANTS operation allows users to view the granted
permissions on objects.
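A short sketch of these operations issued from Spark SQL; the database, table, and group names are placeholders:

# Grant, inspect, deny, and revoke privileges on data objects
spark.sql("GRANT USAGE ON DATABASE sales_db TO `analysts`")
spark.sql("GRANT SELECT ON TABLE sales_db.orders TO `analysts`")
spark.sql("SHOW GRANTS ON TABLE sales_db.orders").show()
spark.sql("DENY MODIFY ON TABLE sales_db.orders TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE sales_db.orders FROM `analysts`")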
By understanding and effectively managing data object privileges in Databricks, users can
ensure secure and controlled access to their data assets within the platform.

Managing Permissions in Databricks SQL


Permissions
Permissions refer to the level of access and privileges that users or groups have over
databases, tables, and views in Databricks SQL. These permissions determine what
actions can be performed on these objects.
Database Creation
In Databricks SQL, users can create databases to organize their data. Within a database,
tables and views can be created to store and analyze data. It is important to manage
permissions at the database level to control access to all objects within it.
Granting Privileges
Privileges can be granted to users or groups to allow them to perform specific actions on
databases, tables, and views. Privileges include read access, write access, metadata access,
and the ability to create new objects.
Usage Privilege
The USAGE privilege is essential for users to be able to perform any action on a database
object. Without this privilege, objects within the database cannot be used effectively.
Granting Privileges to Users
Privileges can be granted to individual users or groups. By assigning specific privileges,
users can have tailored access to different objects within the database.
Show Grants Command
The SHOW GRANTS command allows users to review the permissions that have been
assigned to different users or groups. This command provides transparency on who has
access to specific objects.
Data Explorer
The Data Explorer in Databricks SQL is a tool that allows users to navigate through
databases, tables, and views. It also enables users to manage permissions, change owners,
and review metadata information.
Revoking Privileges
Users with administrative rights can revoke privileges from users or groups if needed.
This helps in controlling access to sensitive data and objects.
Changing Owner
The owner of a database or object can be changed to an individual or group. Owners have
special rights and responsibilities over the object, such as managing permissions and
making important decisions.
Query History
The Query History feature in Databricks SQL displays all the SQL queries that have been
run, including those executed through the Data Explorer. This allows users to track and
analyze query activity.
By understanding and effectively managing permissions in Databricks SQL, users can
ensure data security, control access to sensitive information, and optimize data
management processes.

Unity Catalog
Unity Catalog is a centralized governance solution provided by the Databricks platform
that unifies governance for all data and AI assets in your Lakehouse across multiple
workspaces and clouds. It simplifies data access rules by allowing users to define them
once and apply them across various workspaces.
Architecture of Unity Catalog
 Location: Unity Catalog sits outside workspaces and is accessed via the Account
Console.
 Management: It manages users, groups, and metastores which can be assigned to
multiple workspaces.
 Security: It improves security and functionality compared to the traditional Hive
metastore.
Three-Level Namespace
Unity Catalog introduces a three-level namespace hierarchy consisting of:
1. Metastores: Top-level container.
2. Catalogs: Container for data objects.
3. Schemas: Contain data assets like tables, views, and functions (see the example after this list).
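A brief sketch of the resulting three-level naming, using placeholder catalog, schema, and table names:

# Create and reference objects with the catalog.schema.table hierarchy
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema")
spark.sql("CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.customers (id INT, name STRING)")
spark.sql("SELECT * FROM demo_catalog.demo_schema.customers").show()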
Security Model
 Access Control: Unity Catalog offers fine-grained access control with privileges
such as CREATE, USAGE, SELECT, MODIFY, READ FILES, WRITE FILES, and
EXECUTE.
 Identities: Supports different types of identities including users, service principals,
and groups.
 Identity Federation: Simplifies identity management across workspaces.
Data Search and Lineage
 Data Search: Includes built-in data search and discovery features.
 Lineage Tracking: Provides automated lineage tracking to identify data origin and
usage across different assets like tables, notebooks, workflows, and dashboards.
Legacy Compatibility
 Additive: Unity Catalog is additive and works alongside existing legacy catalogs.
 Hive Metastore: Allows access to the Hive metastore local to a workspace through
the catalog named hive_metastore.
Unified Governance
Unity Catalog unifies existing legacy catalogs without the need for hard migration. It offers
a centralized solution for managing data assets, access control, and lineage tracking.

Conclusion
Unity Catalog enhances data governance by providing a centralized solution for managing
data assets, access control, and lineage tracking across multiple workspaces and clouds. It
simplifies data access rules, improves security, and offers advanced features for data
management in a Lakehouse environment.
