Course Notes
202407051132
Status: #reference
Tags:
1. Course Introduction
6. Section Introduction
Section 1: Introduction
1. Course Introduction
Provides an overview of what is covered in this course:
Batch processing
Delta Lake
Spark Streaming
CI/CD
6. Section Introduction
We will see how to create and work with the following:
Databricks workspace
Data lake
Databricks architecture
Cluster types
dbutils
Before Databricks:
Early data processing relied on a single powerful machine for both storage and compute.
Vertical scaling: adding more hardware to a single machine, which isn't efficient long-term.
Distributed Computing:
Hadoop:
Introduced MapReduce for distributed computing and HDFS for distributed storage.
MapReduce involved high disk I/O, as each operation required reading from and writing to the disk.
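To make the MapReduce model concrete, here is a minimal pure-Python sketch of a word count in three phases (this is an illustration, not actual Hadoop code). In real Hadoop, the intermediate map output is written to disk and shuffled across nodes between phases, which is the I/O cost noted above.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key. In real Hadoop this intermediate
    # data is written to disk and transferred between nodes -- the
    # main source of MapReduce's I/O overhead.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark is fast", "hadoop is reliable", "spark is in memory"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"])  # 2
print(counts["is"])     # 3
```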
Spark:
Processes data in-memory (RAM) rather than on disk, making it 10 to 100 times faster than MapReduce.
Unified platform for batch processing, real-time streaming, machine learning, and graph processing.
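As a rough analogy for Spark's in-memory pipelining (this is plain Python, not Spark itself): chained generators keep each record in memory and pass it through every step lazily, instead of materializing intermediate results to disk between stages the way MapReduce does.

```python
def transform_filter(records, predicate):
    # Lazy "transformation": nothing runs until results are consumed.
    return (r for r in records if predicate(r))

def transform_map(records, fn):
    # Another lazy transformation, chained onto the previous one.
    return (fn(r) for r in records)

numbers = range(10)
pipeline = transform_map(
    transform_filter(numbers, lambda x: x % 2 == 0),
    lambda x: x * x,
)

# The "action" that finally triggers execution, like Spark's collect():
result = list(pipeline)
print(result)  # [0, 4, 16, 36, 64]
```

Spark's actual engine does much more (partitioning, fault recovery, cluster scheduling), but the lazy transformation-then-action pattern is the same shape.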
Spark Ecosystem:
Spark Core:
Manages input/output, job scheduling, memory management, fault recovery, and interaction with
clusters and storage systems.
Cluster Managers:
Databricks:
Commercial Product:
Overview:
Spark is an in-memory processing engine, but requires setting up clusters, managing security, and
using notebooks to write code.
Databricks simplifies these tasks by providing an interactive platform to work with Spark.
Features of Databricks:
Unified Interface:
Integrates with open-source technologies like Delta Lake for transactional support.
Compute Management:
Provides notebooks (web-based interface) to write and execute code in various programming
languages.
Cloud Integration:
MLflow Modeling:
SQL Warehouses:
Offers a platform for data analysts to run SQL queries and analyze data.
Additional Features:
Integrates with various tools and services to streamline data processing and analytics.
First-party Service:
Unified Billing:
Compatible with Azure Blob Storage, Data Lake Storage, SQL Database, Cosmos DB, etc.
Connects Power BI to Databricks clusters and SQL warehouses for enhanced data visualization.
Designed for core data engineering, data science, or data analyst tasks.
Users registered in Microsoft Entra ID (formerly Azure Active Directory) can access Azure Databricks.
Includes backend services, Databricks web application, notebooks, jobs, queries, and cluster
manager.
Resource Management:
Control Plane:
Manages metadata like notebooks, commands, jobs, cluster configurations.
Compute Plane:
Contains Azure Storage, Network Security Group (NSG), and Virtual Network (VNet).
Cluster Creation:
Summary:
The Databricks UI is used for notebook commands, cluster configurations, and job definitions.
Compute resources are essential for running any workload in Azure Databricks.
Workloads can include ETL pipelines, streaming analytics, ad hoc analytics, and machine learning
tasks.
3. Types of Workloads
4. Cluster Types
All Purpose Clusters: Used for interactive work with notebooks. Multiple users can share these
clusters for collaborative analysis.
Job Clusters: Designed for automated workflows. These clusters terminate automatically after the job
is done.
Two node types: Multi-node (separate driver and executor nodes) and Single-node (a single driver
node).
Job Clusters:
Configure the cluster by selecting options such as node type (multi-node or single-node) and access
mode (single user, shared, or no isolation shared).
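The options above can also be expressed programmatically. Below is a hedged sketch of a cluster definition as it might be sent to the Databricks Clusters REST API; the field names follow the public API docs, but the node type, runtime version, and other values are illustrative assumptions, so check the documentation before using them.

```python
import json

# Illustrative payload for creating a multi-node cluster via the
# Databricks Clusters API (field values are examples, not recommendations).
cluster_spec = {
    "cluster_name": "demo-multi-node",
    "spark_version": "13.3.x-scala2.12",   # an LTS runtime (example value)
    "node_type_id": "Standard_DS3_v2",     # Azure VM size for driver/executors
    "num_workers": 2,                      # multi-node; single-node uses 0 workers
    "autotermination_minutes": 30,         # terminate when idle to save cost
    "data_security_mode": "SINGLE_USER",   # access mode: single user
}

print(json.dumps(cluster_spec, indent=2))
```

Job clusters use the same kind of specification embedded in a job definition, and they terminate automatically once the job finishes.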
7. Access Modes
Types of Runtimes:
LTS: Long-term support versions recommended for job compute to ensure compatibility.
This overview provides a foundational understanding of Databricks compute resources, cluster types,
and configurations, enabling efficient management and execution of various workloads in Azure
Databricks.
References
See also
Course: Azure Databricks end to end project with Unity Catalog CICD | Udemy