
1

Moving to Databricks
Carsten Herbe
Technical Lead Data Products at wetter.com GmbH
19.11.2020

2

About wetter.com & METEONOMIQS
What do we need Spark for?

3

About us
METEONOMIQS - B2B Data Solutions by wetter.com - https://www.meteonomiqs.com
CONSUMER | DATA | MEDIA
• #1 B2C weather portal in DACH with up to 20 mil. monthly unique users (AGOF)
• Full service, cross-media: >50 daily TV/video productions
• Weather and geo-based solutions for advertisers, FMCG & retailers …
METEONOMIQS solutions: Weather Data Impact, Weather Based Advertising, Weather Driven Forecasting, Footfall & Location Insights, Weather API

4

Example product: GfK Retail Crowd Monitor (in cooperation with GfK)
Comparison of visit frequencies in major German city centers
Carsten Herbe - Technical Lead Data Products
» Architecture, Dev & Ops for Infrastructure & Data Pipelines
» Today: #aws #spark #databricks
» Past: #hadoop #kafka #dwh #bi

5

Our world before Databricks …
Architecture & Motivation

6

High level architecture & tech stack
How our world looked before Databricks
[Architecture diagram] The Ingestion API writes RAW data to S3; EMR jobs cleanse (RAW → CLEAN data), calc (CLEAN → CALC data), calc & aggregate, and analyze; for serving, a REST API (Lambda) delivers data to customers while Athena serves ad-hoc queries, reports and partner access; AWS Glue Data Catalog, Step Functions and CloudWatch provide metadata and orchestration; all data lives on S3.

7

Motivation for moving to Databricks & Delta
It’s not about performance but productivity
Delta
• GDPR deletes in minutes without downtime instead of hours/days with downtime
• Data corrections / enhancements without downtime instead of whole days of downtime
Databricks
• Improved usability & productivity using Databricks Notebooks
• Cluster sharing (80% of EMR costs were from dedicated dev & analytics clusters)
MLflow
• Improve collaboration between Data Scientists and Data Engineers / Dev Team
• Replace our custom solution (which had already been started)

8

Welcome to the world of Databricks & Delta
Infrastructure view

9

High level architecture & tech stack
How our world looked before Databricks
[Architecture diagram, repeated] The pre-Databricks pipeline again: Ingestion API → S3 RAW data → EMR cleanse → CLEAN data → EMR calc → CALC data → EMR calc & aggregate; EMR analyze and Athena for ad-hoc queries, reports and partner access; REST API delivery to customers via Lambda; AWS Glue Data Catalog, Step Functions and CloudWatch for metadata and orchestration; all data on S3.

10

Architecture: How Databricks fits in
Just plug & play?
[Architecture diagram] The same pipeline with each EMR step swapped for Databricks: Databricks cleanse (RAW → CLEAN data), Databricks calc (CLEAN → CALC data), Databricks calc & aggregate, and Databricks analyze; the Ingestion API, S3, REST API delivery via Lambda, Athena for ad-hoc & reports and partner access, AWS Glue Data Catalog, Step Functions and CloudWatch all stay unchanged.

11

Databricks Workspaces
We use one Workspace as all our projects already share one AWS account and are logically separated
Single classic Workspace
• one workspace for all stages/projects
• must be created manually by Databricks
Multiple classic Workspaces
• dedicated workspaces for stages/projects
• must all be created manually by Databricks
• no SSO means user management per workspace
Multiple Workspaces with Account API
• dedicated workspaces for stages/projects
• use the recently published Account API
• you can create workspaces yourself (check out the Databricks Terraform provider)

12

Workspace setup: what Databricks requires & creates
We could reuse our IAM policies and roles with slight modifications.
Requires
• S3 bucket for the Databricks workspace (e.g. notebooks)
• IAM role to allow deployment from the Databricks account into your account
• EC2 instance role(s) for deploying/starting a Databricks cluster
Creates
• VPC & related AWS resources
• you cannot reuse an existing VPC, so you must set up new VPC peerings
• NOTE: this changes with the new Account API
• you can provide your specific IP range to Databricks before workspace creation

13

Security: Combining Databricks & AWS IAM
We can now share one cluster per project - and later, with SSO & IAM passthrough, just one cluster in total
Users, groups, permissions
• Each user must have a valid e-mail address → same for technical users!
• You can create tokens for users → API access
• You can restrict access to clusters based on user or group
• launch (both analytical & job) clusters only with a specific EC2 instance role
IAM passthrough
• use a Databricks cluster with your AWS IAM role
• requires SSO
• allows you to share one cluster between projects

14

Cluster configurations
init scripts, instance types, spot instances
• Init scripts (pip install …) are quite similar to EMR init scripts
• one EC2 instance type for workers; no instance fleets as with EMR
• mix of on-demand and spot instances (with fallback to on-demand)
• for analytics & dev we use one on-demand instance for the driver and spot instances otherwise
• autoscaling for shared (dev) clusters
(A cluster spec sketch follows below.)
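The bullets above translate into a cluster spec roughly like the following. This is only a sketch with placeholder values (runtime version, instance type, instance profile ARN, init script path), as one would pass it to the Databricks Clusters/Jobs API, not our actual configuration:

  # Hypothetical cluster spec (placeholder values) for the Databricks Clusters/Jobs API.
  cluster_spec = {
      "spark_version": "7.3.x-scala2.12",                  # example Databricks runtime
      "node_type_id": "i3.xlarge",                         # one EC2 instance type for workers
      "autoscale": {"min_workers": 2, "max_workers": 8},   # autoscaling for shared dev clusters
      "aws_attributes": {
          "first_on_demand": 1,                            # driver on an on-demand instance ...
          "availability": "SPOT_WITH_FALLBACK",            # ... spot for workers, fallback to on-demand
          "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/databricks-cluster",
      },
      "init_scripts": [{"s3": {"destination": "s3://my-bucket/init/install_packages.sh"}}],
  }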

15

Migrating our Spark Pipelines
Data & Application

16

From Parquet tables to Delta tables
Conversion was easy. Some regular housekeeping is required (see the sketch after this list).
Parquet to Delta
• CONVERT TO DELTA mytab PARTITIONED BY (...)
• requires some compute resources to analyze the existing data
OPTIMIZE
• UPDATE/DELETE generates a lot of small files
• → configure as a table property → done as part of DML
• or → periodic housekeeping
ZORDER
• improves selective queries with WHERE on one or more columns
• cannot be configured as a table property → housekeeping
• we keep PARTITION BY substr(zip,1,1) + sortWithinPartitions('h3_index')
VACUUM
• deleted files stay on S3 but are referenced as old versions (→ Delta Time Travel)
• after the retention period they should be deleted using VACUUM
• → periodic housekeeping required
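A minimal housekeeping sketch in Spark SQL, assuming a hypothetical partition column and a 7-day retention; table and column names are placeholders, not our actual schema:

  # Hypothetical Delta conversion & housekeeping job (placeholder table/column names).
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # one-time conversion of the existing partitioned Parquet table
  spark.sql("CONVERT TO DELTA mytab PARTITIONED BY (zip_prefix STRING)")  # hypothetical partition column

  # periodic compaction of small files; ZORDER shown for illustration
  spark.sql("OPTIMIZE mytab ZORDER BY (h3_index)")

  # remove unreferenced files older than the retention period (here 7 days)
  spark.sql("VACUUM mytab RETAIN 168 HOURS")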

17

AWS Glue Data Catalog vs Databricks Hive Metastore
We continued with Glue. Now we have two external tables: one for Delta, one for Athena … (see the sketch after this comparison)

Glue Data Catalog
• can be accessed by multiple Databricks workspaces
• can be accessed by other tools like Athena
• officially not supported with Databricks Connect
Databricks Metastore
• can only be accessed by the Databricks workspace that owns it
• cannot be used by Athena
• supported with Databricks Connect

Native Delta Tables
• CREATE TABLE … USING DELTA [LOCATION …]
• full Delta features: DML, Time Travel
• used by our Spark applications and interactive analytics
"Symlink" Tables
• GENERATE symlink_format_manifest FOR TABLE …, then CREATE EXTERNAL TABLE … LOCATION '…/symlink_format_manifest/'
• SELECT only
• used for reporting with Tableau & Athena
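A sketch of the two table flavours with placeholder table and bucket names. The first two statements run on Databricks; the external table DDL (shown only as a string) would be executed in Athena and follows the documented symlink manifest pattern:

  # Hypothetical setup: native Delta table for Spark, symlink manifest table for Athena.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # native Delta table, used by our Spark applications and interactive analytics
  spark.sql("CREATE TABLE IF NOT EXISTS mytab USING DELTA LOCATION 's3://my-bucket/clean/mytab'")

  # generate/refresh the manifest files that Athena needs (rerun after writes)
  spark.sql("GENERATE symlink_format_manifest FOR TABLE mytab")

  # DDL to execute in Athena (schema here is a placeholder)
  athena_ddl = """
  CREATE EXTERNAL TABLE mytab_athena (zip string, h3_index string)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://my-bucket/clean/mytab/_symlink_format_manifest/'
  """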

18

Workflows (1/2): From single steps on EMR …
Detailed graphical overview of the workflow
StepFunctions before Databricks
Create a new or use an existing EMR cluster (the latter is quite handy for dev)
Lambda for adding a step to the EMR cluster (a sketch follows below)
Access control by IAM role
One step = one Spark job
Good (graphical) overview of the pipeline
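For illustration, a minimal sketch of such a step-adding Lambda with boto3; cluster id, script path and arguments are placeholders rather than our real pipeline code:

  # Hypothetical Lambda handler that adds one Spark step to a running EMR cluster.
  import boto3

  emr = boto3.client("emr")

  def handler(event, context):
      response = emr.add_job_flow_steps(
          JobFlowId=event["cluster_id"],          # new or existing EMR cluster (handy for dev)
          Steps=[{
              "Name": event.get("step_name", "spark-step"),
              "ActionOnFailure": "CONTINUE",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["spark-submit", "--deploy-mode", "cluster",
                           "s3://my-bucket/jobs/cleanse.py"] + event.get("args", []),
              },
          }],
      )
      # Step Functions can poll this step id to track completion
      return {"step_id": response["StepIds"][0]}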

19

Workflows (2/2): … to bulk steps with Databricks Job clusters
cheaper Job clusters vs more expensive All-Purpose Clusters
StepFunctions with Databricks
You can run just one job on a Job cluster; using an All-Purpose cluster instead is more expensive.
One Job cluster per step means a lot of time spent on cluster creation → painful for dev
One step = one sequence of Spark jobs → workflow code on the driver is required!
Lambda for launching the Spark job on the Databricks cluster (a sketch follows below)
Databricks token for authentication
Step Functions now only shows a partial picture of the pipeline
(but the Python workflow code is easier to implement)
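Again only a sketch: a Lambda that submits a one-time run to a Job cluster via the Jobs API runs/submit endpoint, authenticating with a Databricks token. Host, token, cluster spec and script path are placeholders (the cluster_spec could be the dict sketched earlier):

  # Hypothetical Lambda handler submitting a one-time run to a Databricks Job cluster.
  import json
  import os
  import urllib.request   # avoids bundling 'requests' into the Lambda package

  DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
  DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]  # token of a technical user

  def handler(event, context):
      payload = {
          "run_name": event.get("run_name", "pipeline-step"),
          "new_cluster": event["cluster_spec"],            # Job cluster created just for this run
          "spark_python_task": {
              "python_file": "dbfs:/jobs/calc.py",         # workflow code that runs on the driver
              "parameters": event.get("parameters", []),
          },
      }
      req = urllib.request.Request(
          f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
          data=json.dumps(payload).encode("utf-8"),
          headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}",
                   "Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          run = json.loads(resp.read())
      return {"run_id": run["run_id"]}   # Step Functions polls runs/get with this id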

20

Adapting our (Spark) applications
Hardly any work required in existing Spark applications
Athena steps
• Athena cannot write to Delta
• migrate these steps to Spark
S3 object deletions
• replace with DML (DELETE/MERGE/UPDATE)
• instead of deleting objects in mytab/year=2020/month=10/day=1:
  DELETE FROM mytab WHERE year=2020 AND month=10 AND day=1
• works even faster than S3 deletes!
df.write
• replace df.write.mode("append").format("parquet") with df.write.mode("append").format("delta")
• no more explicit adding of partitions after writing to an S3 folder
(A before/after sketch follows below.)
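A small before/after sketch of what this meant in application code; table, path and column names are illustrative only:

  # Hypothetical before/after for one write path (illustrative names).
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.table("clean_events")   # some intermediate result

  # before: write Parquet, delete a day's S3 objects by hand, add partitions explicitly
  # df.write.mode("append").format("parquet").save("s3://my-bucket/calc/mytab")

  # after: write Delta and replace S3 object deletions with DML
  df.write.mode("append").format("delta").save("s3://my-bucket/calc/mytab")
  spark.sql("DELETE FROM mytab WHERE year = 2020 AND month = 10 AND day = 1")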

21

Analyzing, developing & collaborating with Databricks
Developer & Data Scientist view

22

Databricks Notebooks
Here we highlight the differences compared to EMR Notebooks
Python packages
• notebooks run on the same binaries as the cluster
• i.e. all packages installed by the init script are also available on the notebook host
• this way we could use visualization packages like folium directly in the notebook
Accessibility
• you can access notebooks without a running cluster
• attaching a notebook to another cluster is just a click
• no more restarts as with EMR Notebooks
Usability
• code completion
• quick visualization of SQL results
Databricks Connect
• if you prefer to write Python programs instead of notebooks
• submit a local script from your laptop to a running All-Purpose cluster (sketch below)
• beware of different time zones on your machine and the cloud instance!
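A minimal Databricks Connect sketch, assuming the client has been installed and set up via databricks-connect configure (host, token and cluster id come from that configuration; the table name is a placeholder):

  # Hypothetical local script run against a remote All-Purpose cluster via Databricks Connect.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  # with databricks-connect configured, this session points at the remote cluster
  spark = SparkSession.builder.getOrCreate()

  visits = (spark.table("clean_events")              # placeholder table name
                 .groupBy("zip")
                 .agg(F.count("*").alias("visits")))
  visits.show(10)
  # note: timestamps are rendered in the cluster's time zone, not your laptop's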

23

Collaboration & Git/Bitbucket
Our team members tried different ways, but we all ended up using batch workspace export/import
Built-in
• collaborative notebook editing (like Office 365) at the same time
Direct Git integration
• you cannot "sync" a repository or a folder from Git/Bitbucket to your Databricks workspace
• you must manually link an existing notebook in your workspace to Git/Bitbucket
workspace export/import
• install the Databricks CLI (you want to do this anyway)
• import into the workspace: databricks workspace import_dir ...
• export to the local machine: databricks workspace export_dir ...
• git add / commit / push

24

Summary

25

Summary
We had already investigated performance & costs during the PoC, so we felt comfortable with the migration.
Performance
• was not a driver for us
• we observed a factor of 2 for dedicated steps
• the functionality of our pipeline keeps increasing, so it is hard to compare
Costs
• per-instance costs for Databricks + EC2 are higher than for EMR + EC2
• we save resources by sharing autoscaling clusters
• DML capabilities reduce ops costs
Migration effort
• mainly for changing workflows to use Job clusters
• having automated integration tests in place helped a lot
• hardly any work for notebooks

26

Conclusion & outlook
We had already investigated performance & costs during the PoC, so we felt pretty comfortable with the migration.
Performance
• was not a driver for us
• we observed a factor of 2 for dedicated steps
• the complexity of our pipeline keeps increasing, so it is hard to compare
Costs
• per-instance costs for Databricks + EC2 are higher than for EMR + EC2
• we save resources by sharing autoscaling clusters
• DML capabilities reduce ops costs
Migration effort
• mainly for changing workflows to use Job clusters
• having automated integration tests in place helped a lot
• hardly any work for notebooks
More features to explore
• MLflow
• Autoloader
• Account API

www.meteonomiqs.com
Carsten Herbe
carsten.herbe@meteonomiqs.com
T +49 89 412 007-289 M +49 151 4416 5763

27

Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
