
Chapter 1


Introduction to Databricks Lakehouse
INTRODUCTION TO DATABRICKS

Kevin Barlow
Data Analytics Practitioner
The Data Warehouse
Data Warehouse
Pros

Great for structured data

Highly performant

Easy to keep data clean

Cons

Very expensive

Cannot support modern applications

Not built for Machine Learning

1 https://www.databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html

The Data Lake
Data Lake
Pros

Support for all use cases

Very flexible

Cost effective

Cons

Data can become messy

Historically not very performant

Birth of the Lakehouse

The Databricks Lakehouse
The Databricks Lakehouse Platform

Single platform for all data workloads

Built on open source technology


Collaborative environment

Simplified architecture

Databricks Architecture Benefits
Unification

Every use case from AI to BI

Benefits of data warehouse and data lake

Multi-Cloud

Bring powerful platform to your data

No lock-in to a specific cloud platform

Databricks Development Benefits
Collaborative

Every data persona

Ability to work in same platform in real-time

Open-Source

Underpinned by Apache Spark

Support for most popular languages (Python, R, Scala, SQL)

Let's practice!
Core features of the Databricks Lakehouse Platform

Kevin Barlow
Data Practitioner
Apache Spark
Apache Spark is an open-source data processing framework and is the engine underneath
Databricks.

DataCamp Courses

Introduction to Pyspark

Big Data Fundamentals with Pyspark

Cleaning Data with Pyspark

Machine Learning with Pyspark

Introduction to Spark SQL in Python

Benefits of Spark
Key Benefits:

1. Extensible, flexible open-source framework

2. Large developer community


3. High performing

4. Databricks optimizations

1 https://spark.apache.org/docs/latest/cluster-overview.html

Cloud computing basics

Databricks Compute
Clusters

Collection of computational resources

All workloads, any use case


All-purpose vs. Jobs

SQL Warehouses

SQL only

BI use cases

Photon
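
To make the all-purpose vs. jobs distinction a little more concrete, here is a minimal sketch that builds two cluster specifications as plain Python dictionaries. The field names (cluster_name, spark_version, node_type_id, num_workers, autotermination_minutes, runtime_engine) follow the Databricks Clusters API as understood here and should be checked against the current documentation. The helper cluster_spec and all values are hypothetical placeholders, and nothing is sent to Databricks.

```python
import json


def cluster_spec(name, workers, photon=False, auto_terminate_minutes=None):
    """Build a cluster specification dictionary (hypothetical helper)."""
    spec = {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder (AWS) node type
        "num_workers": workers,
        # Assumption: Photon is selected through the runtime_engine field.
        "runtime_engine": "PHOTON" if photon else "STANDARD",
    }
    # Interactive clusters are usually given an idle timeout so they
    # do not keep running (and billing) when nobody is using them.
    if auto_terminate_minutes is not None:
        spec["autotermination_minutes"] = auto_terminate_minutes
    return spec


# All-purpose: shared, interactive, shuts down after 30 idle minutes.
all_purpose = cluster_spec("shared-exploration", workers=2, auto_terminate_minutes=30)

# Jobs: created for a scheduled run, so no idle timeout is set here.
job_cluster = cluster_spec("nightly-etl", workers=8, photon=True)

print(json.dumps(all_purpose, indent=2))
print(json.dumps(job_cluster, indent=2))
```

The two specifications differ mainly in lifecycle settings, which is the practical difference between the two cluster types: one stays up for interactive work, the other exists only for the duration of a job.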

Cloud data storage

Delta
Delta is an open-source data storage file
format that provides:

ACID transactions

Unified batch and streaming

Schema evolution

Table history

Time-travel

1 delta.io
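
The ideas of table history and time-travel can be sketched with a toy in-memory table. This is only an illustration of the concept, not how Delta is implemented; the class ToyVersionedTable and its methods are hypothetical names invented for this example. Each commit publishes a complete new snapshot, so readers see either the old version or the new one, never a half-written state.

```python
import copy


class ToyVersionedTable:
    """Toy table where every commit creates a new, numbered snapshot."""

    def __init__(self):
        self._versions = [[]]       # version 0 is the empty table
        self._history = ["CREATE"]  # one operation name per version

    def commit(self, operation, rows):
        # Build the new snapshot completely before publishing it,
        # so a commit is all-or-nothing from a reader's point of view.
        snapshot = copy.deepcopy(self._versions[-1])
        snapshot.extend(copy.deepcopy(rows))
        self._versions.append(snapshot)
        self._history.append(operation)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        # With no version given, read the latest snapshot;
        # otherwise "time-travel" to the requested version.
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])

    def history(self):
        return list(enumerate(self._history))


table = ToyVersionedTable()
v1 = table.commit("APPEND", [{"id": 1, "city": "Oslo"}])
v2 = table.commit("APPEND", [{"id": 2, "city": "Lima"}])

print(table.read())            # latest version, both rows
print(table.read(version=v1))  # table as of version 1, first row only
print(table.history())
```

On Databricks the comparable operations are SQL statements such as DESCRIBE HISTORY and SELECT ... VERSION AS OF, which operate on the Delta table's own transaction log.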

Unity Catalog
Unity Catalog is an open data governance
strategy that controls access to all data
assets in the Databricks Lakehouse platform.

SQL GRANT, REVOKE statements to control access

Simple interface for governance
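
As a small sketch of what the GRANT and REVOKE statements mentioned above look like, the helpers below compose them as strings. The functions grant and revoke, the table name main.sales.orders, and the group analysts are all hypothetical; the three-part catalog.schema.table naming is assumed. The helpers do no input validation, so they should not be used with untrusted values.

```python
def grant(privilege, securable_type, securable_name, principal):
    """Compose a GRANT statement (hypothetical helper, no validation)."""
    return f"GRANT {privilege} ON {securable_type} {securable_name} TO `{principal}`;"


def revoke(privilege, securable_type, securable_name, principal):
    """Compose the matching REVOKE statement (hypothetical helper)."""
    return f"REVOKE {privilege} ON {securable_type} {securable_name} FROM `{principal}`;"


# Give a group read access to one table, then take it away again.
give = grant("SELECT", "TABLE", "main.sales.orders", "analysts")
take = revoke("SELECT", "TABLE", "main.sales.orders", "analysts")

print(give)
print(take)
```

The resulting statements would be run in a Databricks SQL editor or notebook by a user who is allowed to manage privileges on that table.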

Databricks UI
Designed for easier access to capabilities
based on your data workload.

All users have access to data and compute

SQL users get a familiar interface for queries and reports

Data engineers leverage Delta Live Tables

Machine Learning workloads use models, features, and more

Let's review!
Administering a Databricks workspace

Kevin Barlow
Data Practitioner
Account Admin
Key Responsibilities:

Creating and managing workspaces

Enabling Unity Catalog


Managing identities

Managing the account subscription

Account Console

https://accounts.cloud.databricks.com/

Account Console - Workspaces

Account Console - Data

Account Console - Users & Groups

Account Console - Settings

Workspace Admin
Key Responsibilities:

Managing identities in your workspace

Creating and managing compute resources


Managing workspace features and settings

Data Plane
Contains all of the customer's assets needed for computation with Databricks.

Data is stored in the customer's cloud environment

Clusters / SQL Warehouses run in the customer's cloud tenant

Control Plane
The portion of the platform that is managed and hosted by Databricks.

Orchestrates various background tasks in Databricks

Sends requests to Data Plane to create clusters, run jobs, etc.

Databricks Platform Architecture
Each cloud will have the same general
options to create a workspace:

Cloud Service Provider marketplace

Account Console

Using the Accounts API with Databricks

Programmatic deployment (e.g., Terraform)

1 https://docs.databricks.com/getting-started/overview.html
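
For the Accounts API option, a rough sketch of preparing a workspace-creation request might look like the following. It only builds the HTTP request and never sends it. The endpoint path and the payload fields (workspace_name, aws_region, credentials_id, storage_configuration_id) reflect the AWS flavor of the Accounts API as understood here and are assumptions to verify against the official API reference. The function name, account ID, token, and all values are hypothetical placeholders.

```python
import json
import urllib.request

ACCOUNTS_HOST = "https://accounts.cloud.databricks.com"


def build_create_workspace_request(account_id, token, workspace_name,
                                   aws_region, credentials_id,
                                   storage_configuration_id):
    """Prepare (but do not send) a workspace-creation request."""
    # Assumed endpoint path; check it against the current API reference.
    url = f"{ACCOUNTS_HOST}/api/2.0/accounts/{account_id}/workspaces"
    payload = {
        "workspace_name": workspace_name,
        "aws_region": aws_region,
        "credentials_id": credentials_id,
        "storage_configuration_id": storage_configuration_id,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only; none of these identify a real account.
req = build_create_workspace_request(
    account_id="00000000-0000-0000-0000-000000000000",
    token="example-token",
    workspace_name="demo-workspace",
    aws_region="us-west-2",
    credentials_id="example-credentials-id",
    storage_configuration_id="example-storage-id",
)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending the request would be a separate step (for example with urllib.request.urlopen) and requires account admin credentials. Tools such as Terraform wrap the same kind of call in a declarative configuration.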

Let's review!
Setting up a Databricks workspace example

Kevin Barlow
Data Practitioner
Let's practice!