
Course notes

202407051132

Status: #reference
Tags:

End-to-End Databricks Project


Section 1: Introduction

1. Course Introduction

2. Project architecture and concepts

3. Course Prerequisites and benefits

4. Project Complete Code

5. Importing project code into Databricks workspace

Section 2: Environment Setup

6. Section Introduction

7. Creating Budget for project

8. Creating an Azure Databricks Workspace

9. Creating an Azure Datalake Storage Gen2

10. Walkthrough on Databricks Workspace UI

Section 3: Azure Databricks - An introduction

11. Section Introduction

12. Introduction to Distributed Data Processing

13. What is Azure Databricks

14. Azure Databricks Architecture

15. Cluster Types and Configuration

16. Behind the scenes when creating cluster

17. Signing up for Databricks Community Edition

Section 1: Introduction
1. Course Introduction
Provides an introduction to what is covered in this course.

2. Project architecture and concepts

This is the project architecture we will follow:

Files are loaded into the /landing container (via ADF or manually), and batch processing then moves the
data through the medallion architecture.
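
A minimal sketch of the landing-to-bronze step under this layout, assuming JSON input files, placeholder container paths, and the `spark` session that Databricks notebooks provide:

```python
# Hedged sketch of the landing -> bronze batch step in a medallion layout.
# The storage account name, container paths, and file format are assumptions.
landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/orders/"
bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/orders/"

# Bronze keeps the raw data as-is, stored as a Delta table for ACID guarantees.
raw = spark.read.format("json").load(landing_path)
raw.write.format("delta").mode("append").save(bronze_path)
```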

things we will learn while we create this project:

Batch processing

Delta Lake

Databricks Unity Catalog

Spark Streaming

CI/CD

3. Course Prerequisites and benefits


No prior experience with Databricks needed

Basic knowledge of Python and SQL

Basic knowledge of the Azure cloud environment

An Azure account for hands-on practice


4. Project Complete Code

The code and PDF are already downloaded in this directory.

5. Importing project code into Databricks workspace

Shows how to import the .dbc file into the Databricks workspace.
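
The lecture uses the workspace UI, but as a hedged sketch the same import can be scripted with the Databricks SDK for Python; the archive name and target path below are assumptions:

```python
# Hypothetical programmatic import of the course .dbc archive using the
# Databricks SDK for Python (databricks-sdk); the UI import is what the
# lecture actually demonstrates.
import base64
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

w = WorkspaceClient()  # reads auth from the environment or a config profile

with open("project_code.dbc", "rb") as f:  # assumed local file name
    payload = base64.b64encode(f.read()).decode("utf-8")

w.workspace.import_(
    path="/Workspace/Users/<your-user>/end-to-end-project",  # assumed target
    format=ImportFormat.DBC,
    content=payload,
)
```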

Section 2: Environment Setup

6. Section Introduction
We will see how to create the following:

Databricks workspace

Data lake

A billing budget

7. Creating Budget for project


We create a budget in the Azure portal.

8. Creating an Azure Databricks Workspace


We create a Databricks workspace.

9. Creating an Azure Datalake Storage Gen2


We create an ADLS Gen2 storage account.

10. Walkthrough on Databricks Workspace UI


A quick walkthrough of the workspace UI.

Section 3: Azure Databricks - An introduction

11. Section Introduction


Provides an overview of what will be covered in this section, including:

Databricks architecture

Cluster types

Notebooks and their usage

dbutils
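
As a small taste of dbutils before the later lectures cover it, here is an illustrative call; dbutils is provided automatically in Databricks notebooks, and /databricks-datasets is a built-in sample directory:

```python
# List a few entries from the built-in sample datasets using dbutils.
for f in dbutils.fs.ls("/databricks-datasets")[:5]:
    print(f.path, f.size)
```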

12. Introduction to Distributed Data Processing

Before Databricks:

Early data processing used single supercomputers for both storage and compute.

Vertical scaling: adding more hardware to a single machine, which isn't efficient long-term.

Distributed Computing:

Big data introduced distributed storage and computing.

Horizontal scaling: adding more computers to work in parallel, forming clusters.

Cluster: a group of computers working together.

Cluster Manager: coordinates the computers in a cluster.

Evolution from Hadoop to Spark:

Hadoop:

Introduced MapReduce for distributed computing and HDFS for distributed storage.

MapReduce involved high disk I/O, as each operation required reading from and writing to the disk.

Complex coding required for map and reduce functions.


Apache Spark:

Processes data in-memory (RAM) rather than on disk, making it 10 to 100 times faster.

Decouples storage from compute.

Supports multiple programming languages: Java, Python, SQL, R.

Acts as a compute engine, independent of any storage system.

Unified platform for real-time processing, streaming, machine learning, and graph processing.

Spark Ecosystem:

Spark Core:

Foundation for parallel and distributed processing.

Manages input/output, job scheduling, memory management, fault recovery, and interaction with
clusters and storage systems.

Uses RDDs (Resilient Distributed Datasets) for parallel processing.

Higher Level APIs and Libraries:

DataFrames and Datasets APIs for easier and optimized operations.

Libraries: Spark SQL, Spark Streaming, MLlib, GraphX.

Use DataFrame APIs for interacting with Spark in various languages.
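
To make the contrast concrete, here is a hedged sketch (not from the course) of the same word count written against the low-level RDD API and the higher-level DataFrame API:

```python
# Word count two ways: the RDD API spells out each map/reduce step, while
# the DataFrame API is declarative and optimized by Spark's Catalyst planner.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
lines = ["spark is fast", "spark is unified"]

# RDD API: explicit transformations.
rdd_counts = (spark.sparkContext.parallelize(lines)
              .flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
print(rdd_counts.collect())

# DataFrame API: declarative; Spark plans the execution.
df_counts = (spark.createDataFrame([(l,) for l in lines], ["text"])
             .select(explode(split(col("text"), " ")).alias("word"))
             .groupBy("word").count())
df_counts.show()
```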

Cluster Managers:

Options: YARN Resource Manager, Mesos, Standalone, Kubernetes.

Databricks uses Spark's standalone cluster manager.


Spark doesn't include storage or a cluster manager; it's solely a data processing framework.

Databricks:

Commercial Product:

Developed by the creators of Apache Spark.

Provides a cloud platform to work with Spark efficiently.

Available on Azure, GCP, and AWS.

13. What is Azure Databricks


Introduction to Azure Databricks:

Overview:

Spark is an in-memory processing engine, but using it requires setting up clusters, managing security, and
providing notebooks in which to write code.

Databricks simplifies these tasks by providing an interactive platform to work with Spark.

Features of Databricks:

Unified Interface:

Manages data engineering, data science, and data analyst workloads.

Integrates with open-source technologies like Delta Lake for transactional support.

Compute Management:

Databricks helps spin up clusters to run code.

Provides notebooks (web-based interface) to write and execute code in various programming
languages.

Cloud Integration:

Integrates with AWS, Azure, and GCP storage layers.

Decouples storage from compute.

MLflow:

Supports creating, tracking, and running machine learning models.

Provides Git support for CI/CD setup.

SQL Warehouses:
Offers a platform for data analysts to run SQL queries and analyze data.

Additional Features:

Integrates with various tools and services to streamline data processing and analytics.

Databricks with Azure:

First-party Service:

Directly offered by Azure as a first-party service.

Simplifies setup and integration with Azure services.

Unified Billing:

Single bill for Databricks and other Azure services.

Integration with Azure Data Services:

Compatible with Azure Blob Storage, Data Lake Storage, SQL Database, Cosmos DB, etc.

Security managed by Microsoft Entra ID (formerly Azure Active Directory).

Supports single sign-on for authentication.

ETL and BI Integration:

Seamless integration with Azure Data Factory for ETL processes.

Integration with Power BI for data visualization and reporting.

Connects Power BI to Databricks clusters and SQL warehouses for enhanced data visualization.

Azure DevOps Integration:

Source control and CI/CD setup through Azure DevOps.

14. Azure Databricks Architecture


Azure Databricks Architecture:

Designed for core data engineering, data science, or data analyst tasks.

Most configurations managed by Azure Databricks service.

User Access and Authorization:

Users registered in Microsoft Entra ID (formerly Azure Active Directory) can access Azure Databricks.

Single sign-on feature for easy access.

Control Plane and Compute Plane:

Control Plane: Managed by Databricks.

Includes backend services, Databricks web application, notebooks, jobs, queries, and cluster
manager.

Contains metadata information (e.g., notebook names, commands, job configurations).

Provides web UI for user interaction.

Metadata encrypted at rest.

Compute Plane (Data Plane): Managed by Azure.

Holds actual data and compute resources.

Azure Storage account created during Databricks setup.

The customer maintains control and ownership of the data.

Resource Management:

Control Plane:
Manages metadata like notebooks, commands, jobs, cluster configurations.

Controls Databricks resources.

Compute Plane:

Manages actual compute resources (e.g., virtual machines).

Data and compute resources are decoupled.

Contains Azure Storage, Network Security Group (NSG), and Virtual Network (VNet).

Cluster Creation:

Cluster manager in control plane sends configuration to compute plane.

Compute plane launches virtual machines based on configuration.

Data Sharing and Execution:

Data sharing between storage and compute resources.

Log sharing between control and compute planes.

Job results stored in Azure Storage.

Interactive cluster results stored in both control and compute planes.

External Data Access:

Azure Databricks provides connectors for accessing data outside the Azure subscription.
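
As a hedged illustration of such access from a notebook (the storage account, container, secret scope, and file name are placeholders; `spark` and `dbutils` are the objects Databricks notebooks provide):

```python
# Illustrative read from ADLS Gen2 using an account key held in a secret
# scope; Unity Catalog external locations or service principals are the
# more common production options.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="demo-scope", key="storage-account-key"))

df = (spark.read.format("csv")
      .option("header", "true")
      .load("abfss://landing@<storage-account>.dfs.core.windows.net/orders.csv"))
display(df)
```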

Summary:

Control plane manages configurations and resources.

Compute plane manages actual compute resources.

Databricks UI used for notebook commands, cluster configurations, and job definitions.

Parallel virtual machines enable distributed compute.

Data accessed from Azure Storage in compute plane.

15. Cluster Types and Configuration


Databricks Compute and Clusters

1. Introduction to Databricks Compute

Compute resources are essential for running any workload in Azure Databricks.

To utilize compute resources, clusters are created in Databricks.

2. Understanding Azure Databricks Clusters


A cluster is a set of computational resources and configurations to run your workload.

Workloads can include ETL pipelines, streaming analytics, ad hoc analytics, and machine learning
tasks.

3. Types of Workloads

Notebook Commands: Writing and running commands interactively in a notebook.

Automated Jobs: Running predefined workflows on a scheduled basis.

4. Cluster Types

All Purpose Clusters: Used for interactive work with notebooks. Multiple users can share these
clusters for collaborative analysis.

Job Clusters: Designed for automated workflows. These clusters terminate automatically after the job
is done.

5. Detailed View of Cluster Types

All Purpose Clusters:

Enable interactive execution of commands in notebooks.

Can be terminated, restarted, and attached to or detached from multiple notebooks.

Two node types: Multi-node (separate driver and executor nodes) and Single-node (a single driver
node).

Job Clusters:

Used for running automated workflows.

Terminate automatically after completing the workload.

Cannot be restarted manually.

6. Creating Clusters in Azure Databricks Portal

Navigate to the Azure Databricks portal and click on "Compute."

Choose between all-purpose and job clusters.

Configure the cluster by selecting options such as node type (multi-node or single-node) and access
mode (single user, shared, or no isolation shared).

7. Access Modes

Single User: Exclusive use by a single user.

Shared: Used by multiple users with data isolation.

No Isolation Shared: Multiple users without data isolation.


8. Databricks Runtime Versions

Databricks Runtime is a set of core components managed by Databricks.

Types of Runtimes:

Standard: Used for general purposes.

ML: Preconfigured for machine learning tasks.

LTS: Long-term support versions recommended for job compute to ensure compatibility.

9. Configuring Cluster Nodes

Select virtual machine configurations for driver and worker nodes.

Define the minimum and maximum number of executors.

Enable auto-scaling to adjust the number of executors based on workload.

Set an auto-termination time for inactivity to avoid unnecessary charges.
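
As a hedged sketch, the same node configuration can be expressed in code with the Databricks SDK for Python; the cluster name, VM size, and runtime version below are illustrative choices, not values from the course:

```python
# Create an autoscaling all-purpose cluster via the Databricks SDK
# (databricks-sdk); all concrete values here are assumptions.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # reads auth from the environment or a config profile

cluster = w.clusters.create(
    cluster_name="demo-all-purpose",
    spark_version="15.4.x-scala2.12",       # an LTS runtime version
    node_type_id="Standard_DS3_v2",         # Azure VM size for the nodes
    autoscale=AutoScale(min_workers=1, max_workers=3),
    autotermination_minutes=30,             # terminate after 30 idle minutes
).result()                                  # wait until the cluster is running
print(cluster.cluster_id)
```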

10. Advanced Options

Configure initialization (init) scripts for the cluster if needed.

This overview provides a foundational understanding of Databricks compute resources, cluster types,
and configurations, enabling efficient management and execution of various workloads in Azure
Databricks.

16. Behind the scenes when creating cluster


Provides a demo showing how resources such as the virtual network (VNet) and virtual machines are spun up when we create a cluster.

17. Signing up for Databricks Community Edition


Shows how to sign up for the Community Edition, but since we have an Azure account we won't be using it.

References

See also

Course: Azure Databricks end to end project with Unity Catalog CICD | Udemy

Azure Databricks architecture overview - Azure Databricks | Microsoft Learn
