
Course notes

202407051132

Status: #reference
Tags:

End-to-End Databricks Project


Section 1: Introduction

1. Course Introduction

2. Project architecture and concepts

3. Course Prerequisites and benefits

4. Project Complete Code

5. Importing project code into Databricks workspace

Section 2: Environment Setup

6. Section Introduction

7. Creating Budget for project

8. Creating an Azure Databricks Workspace

9. Creating an Azure Datalake Storage Gen2

10. Walkthrough on Databricks Workspace UI

Section 3: Azure Databricks - An introduction

11. Section Introduction

12. Introduction to Distributed Data Processing

13. What is Azure Databricks

14. Azure Databricks Architecture

15. Cluster Types and Configuration

16. Behind the scenes when creating cluster

17. Signing up for Databricks Community Edition

Section 1: Introduction
1. Course Introduction
Provides an introduction to what is covered in this course.

2. Project architecture and concepts

This is the project architecture we will follow:

Files are loaded into the /landing container (via ADF or manually), and batch processing then moves the
data through the medallion architecture.
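
A minimal sketch of the landing-to-bronze step under this layout, assuming JSON input files, placeholder container paths, and the `spark` session that Databricks notebooks provide:

```python
# Hedged sketch of the landing -> bronze batch step in a medallion layout.
# The storage account name, container paths, and file format are assumptions.
landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/orders/"
bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/orders/"

# Bronze keeps the raw data as-is, stored as a Delta table for ACID guarantees.
raw = spark.read.format("json").load(landing_path)
raw.write.format("delta").mode("append").save(bronze_path)
```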

things we will learn while we create this project:

Batch processing

Delta Lake

Databricks Unity Catalog

Spark Streaming

CI/CD

3. Course Prerequisites and benefits


No prior experience with Databricks needed

Basic knowledge of Python and SQL

Basic knowledge of the Azure cloud environment

An Azure account for hands-on practice


4. Project Complete Code

The code and PDF are already downloaded in this directory.

5. Importing project code into Databricks workspace

Shows how to import the .dbc file into the Databricks workspace.
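
The lecture uses the workspace UI, but as a hedged sketch the same import can be scripted with the Databricks SDK for Python; the archive name and target path below are assumptions:

```python
# Hypothetical programmatic import of the course .dbc archive using the
# Databricks SDK for Python (databricks-sdk); the UI import is what the
# lecture actually demonstrates.
import base64
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ImportFormat

w = WorkspaceClient()  # reads auth from the environment or a config profile

with open("project_code.dbc", "rb") as f:  # assumed local file name
    payload = base64.b64encode(f.read()).decode("utf-8")

w.workspace.import_(
    path="/Workspace/Users/<your-user>/end-to-end-project",  # assumed target
    format=ImportFormat.DBC,
    content=payload,
)
```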

Section 2: Environment Setup

6. Section Introduction
We will see how to create the following:

Databricks workspace

Data lake

A billing budget

7. Creating Budget for project


We create a budget in the Azure portal.

8. Creating an Azure Databricks Workspace


We create a Databricks workspace.

9. Creating an Azure Datalake Storage Gen2


We create an ADLS Gen2 storage account.

10. Walkthrough on Databricks Workspace UI


A quick walkthrough of the workspace UI.

Section 3: Azure Databricks - An introduction

11. Section Introduction


Provides an overview of what will be covered in this section, including:

Databricks architecture

Cluster types

Notebooks and their usage

dbutils
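
As a small taste of dbutils before the later lectures cover it, here is an illustrative call; dbutils is provided automatically in Databricks notebooks, and /databricks-datasets is a built-in sample directory:

```python
# List a few entries from the built-in sample datasets using dbutils.
for f in dbutils.fs.ls("/databricks-datasets")[:5]:
    print(f.path, f.size)
```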

12. Introduction to Distributed Data Processing

Before Databricks:

Early data processing used single supercomputers for both storage and compute.

Vertical scaling: adding more hardware to a single machine, which isn't efficient long-term.

Distributed Computing:

Big data introduced distributed storage and computing.

Horizontal scaling: adding more computers to work in parallel, forming clusters.

Cluster: a group of computers working together.

Cluster Manager: coordinates the computers in a cluster.

Evolution from Hadoop to Spark:

Hadoop:

Introduced MapReduce for distributed computing and HDFS for distributed storage.

MapReduce involved high disk I/O, as each operation required reading from and writing to the disk.

Complex coding required for map and reduce functions.


Apache Spark:

Processes data in-memory (RAM) rather than on disk, making it 10 to 100 times faster.

Decouples storage from compute.

Supports multiple programming languages: Java, Python, SQL, R.

Acts as a compute engine, independent of any storage system.

Unified platform for real-time processing, streaming, machine learning, and graph processing.

Spark Ecosystem:

Spark Core:

Foundation for parallel and distributed processing.

Manages input/output, job scheduling, memory management, fault recovery, and interaction with
clusters and storage systems.

Uses RDDs (Resilient Distributed Datasets) for parallel processing.

Higher Level APIs and Libraries:

DataFrames and Datasets APIs for easier and optimized operations.

Libraries: Spark SQL, Spark Streaming, MLlib, GraphX.

Use DataFrame APIs for interacting with Spark in various languages.
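
To make the contrast concrete, here is a hedged sketch (not from the course) of the same word count written against the low-level RDD API and the higher-level DataFrame API:

```python
# Word count two ways: the RDD API spells out each map/reduce step, while
# the DataFrame API is declarative and optimized by Spark's Catalyst planner.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
lines = ["spark is fast", "spark is unified"]

# RDD API: explicit transformations.
rdd_counts = (spark.sparkContext.parallelize(lines)
              .flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
print(rdd_counts.collect())

# DataFrame API: declarative; Spark plans the execution.
df_counts = (spark.createDataFrame([(l,) for l in lines], ["text"])
             .select(explode(split(col("text"), " ")).alias("word"))
             .groupBy("word").count())
df_counts.show()
```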

Cluster Managers:

Options: YARN Resource Manager, Mesos, Standalone, Kubernetes.

Databricks uses Spark's standalone cluster manager.


Spark doesn't include storage or a cluster manager; it's solely a data processing framework.

Databricks:

Commercial Product:

Developed by the creators of Apache Spark.

Provides a cloud platform to work with Spark efficiently.

Available on Azure, GCP, and AWS.

13. What is Azure Databricks


Introduction to Azure Databricks:

Overview:

Spark is an in-memory processing engine, but using it requires setting up clusters, managing security, and
providing notebooks in which to write code.

Databricks simplifies these tasks by providing an interactive platform to work with Spark.

Features of Databricks:

Unified Interface:

Manages data engineering, data science, and data analyst workloads.

Integrates with open-source technologies like Delta Lake for transactional support.

Compute Management:

Databricks helps spin up clusters to run code.

Provides notebooks (web-based interface) to write and execute code in various programming
languages.

Cloud Integration:

Integrates with AWS, Azure, and GCP storage layers.

Decouples storage from compute.

MLflow:

Supports creating, tracking, and running machine learning models.

Provides Git support for CI/CD setup.

SQL Warehouses:
Offers a platform for data analysts to run SQL queries and analyze data.

Additional Features:

Integrates with various tools and services to streamline data processing and analytics.

Databricks with Azure:

First-party Service:

Directly offered by Azure as a first-party service.

Simplifies setup and integration with Azure services.

Unified Billing:

Single bill for Databricks and other Azure services.

Integration with Azure Data Services:

Compatible with Azure Blob Storage, Data Lake Storage, SQL Database, Cosmos DB, etc.

Security managed by Microsoft Entra ID (formerly Azure Active Directory).

Supports single sign-on for authentication.

ETL and BI Integration:

Seamless integration with Azure Data Factory for ETL processes.

Integration with Power BI for data visualization and reporting.

Connects Power BI to Databricks clusters and SQL warehouses for enhanced data visualization.

Azure DevOps Integration:

Source control and CI/CD setup through Azure DevOps.

14. Azure Databricks Architecture


Azure Databricks Architecture:

Designed for core data engineering, data science, or data analyst tasks.

Most configurations managed by Azure Databricks service.

User Access and Authorization:

Users registered in Microsoft Entra ID (formerly Azure Active Directory) can access Azure Databricks.

Single sign-on feature for easy access.

Control Plane and Compute Plane:

Control Plane: Managed by Databricks.

Includes backend services, Databricks web application, notebooks, jobs, queries, and cluster
manager.

Contains metadata information (e.g., notebook names, commands, job configurations).

Provides web UI for user interaction.

Metadata encrypted at rest.

Compute Plane (Data Plane): Managed by Azure.

Holds actual data and compute resources.

Azure Storage account created during Databricks setup.

The customer maintains control and ownership of the data.

Resource Management:

Control Plane:
Manages metadata like notebooks, commands, jobs, cluster configurations.

Controls Databricks resources.

Compute Plane:

Manages actual compute resources (e.g., virtual machines).

Data and compute resources are decoupled.

Contains Azure Storage, Network Security Group (NSG), and Virtual Network (VNet).

Cluster Creation:

Cluster manager in control plane sends configuration to compute plane.

Compute plane launches virtual machines based on configuration.

Data Sharing and Execution:

Data sharing between storage and compute resources.

Log sharing between control and compute planes.

Job results stored in Azure Storage.

Interactive cluster results stored in both control and compute planes.

External Data Access:

Azure Databricks provides connectors for accessing data outside the Azure subscription.
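
As a hedged illustration of such access from a notebook (the storage account, container, secret scope, and file name are placeholders; `spark` and `dbutils` are the objects Databricks notebooks provide):

```python
# Illustrative read from ADLS Gen2 using an account key held in a secret
# scope; Unity Catalog external locations or service principals are the
# more common production options.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="demo-scope", key="storage-account-key"))

df = (spark.read.format("csv")
      .option("header", "true")
      .load("abfss://landing@<storage-account>.dfs.core.windows.net/orders.csv"))
display(df)
```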

Summary:

Control plane manages configurations and resources.

Compute plane manages actual compute resources.

Databricks UI used for notebook commands, cluster configurations, and job definitions.

Parallel virtual machines enable distributed compute.

Data accessed from Azure Storage in compute plane.

15. Cluster Types and Configuration


Databricks Compute and Clusters

1. Introduction to Databricks Compute

Compute resources are essential for running any workload in Azure Databricks.

To utilize compute resources, clusters are created in Databricks.

2. Understanding Azure Databricks Clusters


A cluster is a set of computational resources and configurations to run your workload.

Workloads can include ETL pipelines, streaming analytics, ad hoc analytics, and machine learning
tasks.

3. Types of Workloads

Notebook Commands: Writing and running commands interactively in a notebook.

Automated Jobs: Running predefined workflows on a scheduled basis.

4. Cluster Types

All Purpose Clusters: Used for interactive work with notebooks. Multiple users can share these
clusters for collaborative analysis.

Job Clusters: Designed for automated workflows. These clusters terminate automatically after the job
is done.

5. Detailed View of Cluster Types

All Purpose Clusters:

Enable interactive execution of commands in notebooks.

Can be terminated, restarted, and attached to or detached from multiple notebooks.

Two node types: Multi-node (separate driver and executor nodes) and Single-node (a single driver
node).

Job Clusters:

Used for running automated workflows.

Terminate automatically after completing the workload.

Cannot be restarted manually.

6. Creating Clusters in Azure Databricks Portal

Navigate to the Azure Databricks portal and click on "Compute."

Choose between all-purpose and job clusters.

Configure the cluster by selecting options such as node type (multi-node or single-node) and access
mode (single user, shared, or no isolation shared).

7. Access Modes

Single User: Exclusive use by a single user.

Shared: Used by multiple users with data isolation.

No Isolation Shared: Multiple users without data isolation.


8. Databricks Runtime Versions

Databricks Runtime is a set of core components managed by Databricks.

Types of Runtimes:

Standard: Used for general purposes.

ML: Preconfigured for machine learning tasks.

LTS: Long-term support versions recommended for job compute to ensure compatibility.

9. Configuring Cluster Nodes

Select virtual machine configurations for driver and worker nodes.

Define the minimum and maximum number of executors.

Enable auto-scaling to adjust the number of executors based on workload.

Set an auto-termination time for inactivity to avoid unnecessary charges.
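
As a hedged sketch, the same node configuration can be expressed in code with the Databricks SDK for Python; the cluster name, VM size, and runtime version below are illustrative choices, not values from the course:

```python
# Create an autoscaling all-purpose cluster via the Databricks SDK
# (databricks-sdk); all concrete values here are assumptions.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import AutoScale

w = WorkspaceClient()  # reads auth from the environment or a config profile

cluster = w.clusters.create(
    cluster_name="demo-all-purpose",
    spark_version="15.4.x-scala2.12",       # an LTS runtime version
    node_type_id="Standard_DS3_v2",         # Azure VM size for the nodes
    autoscale=AutoScale(min_workers=1, max_workers=3),
    autotermination_minutes=30,             # terminate after 30 idle minutes
).result()                                  # wait until the cluster is running
print(cluster.cluster_id)
```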

10. Advanced Options

Configure initialization (init) scripts for the cluster if needed.

This overview provides a foundational understanding of Databricks compute resources, cluster types,
and configurations, enabling efficient management and execution of various workloads in Azure
Databricks.

16. Behind the scenes when creating cluster


Provides a demo showing how resources such as the virtual network (VNet) and virtual machines are spun up when we create a cluster.

17. Signing up for Databricks Community Edition


Shows how to sign up for the Community Edition, but since we have an Azure account we won't be using it.

References

See also

Course: Azure Databricks end to end project with Unity Catalog CICD | Udemy

Azure Databricks architecture overview - Azure Databricks | Microsoft Learn
