Lasse Benninga
26-10-22
GoDataFest
Deploying a Modern
Data Stack
Lasse Benninga
• Analytics Engineer @ GDD since
2021
• Studied Informatics in Groningen,
background in SE and DE
• Love OSS and Data <3
• Live in Utrecht
• Enjoy podcasts, running
GODATADRIVEN
Chapters
• The Modern Data Stack (MDS)
• Infrastructure-as-code
• Deploying an MDS
• Demo
• Pros and Cons
• Conclusions & Future
The Modern Data Stack
What’s a Data
Warehouse?
• The “Relational Model” was designed in
the 1970s at IBM by Edgar F. Codd.
• Online Transactional Processing
(OLTP) databases were built on this
relational model in the 70s for
day-to-day business needs
• Businesses poured growing amounts
of data into OLTP systems, and the
analytical queries run on top of them
bogged those systems down
• Separate Online Analytical Processing
(OLAP) systems were designed from
the 80s onward to handle the need
for data-driven insights
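To make the OLTP/OLAP split concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table and its rows are made up for illustration; a real setup would run the two workloads on separate systems rather than one in-memory database.

```python
import sqlite3

# Hypothetical example table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style work: many small writes and point lookups by key.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.0), ("alice", 15.0)],
)
conn.commit()
single_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style work: scan and aggregate many rows to answer a business question.
revenue_per_customer = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(single_order)
print(revenue_per_customer)
```

The point lookup touches one row, while the aggregate has to read the whole table. At scale, that second pattern is what slows down a transactional system.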
https://future.com/emerging-architectures-modern-data-infrastructure/
What's a Modern
Data Stack
• Cloud Native
• Structured and Unstructured data
• Pay-as-you-go
• SQL-first
• Freedom of choice
• Managed
• Self-hosted
Infrastructure-as-code
Terraform
• Declarative: describe what you want to achieve
instead of how to achieve this
• Modular / DRY
• Open source
• Industry standard
• Cloud agnostic
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
• Keeps a state of a project and compares changes to know
what to add/update/delete
• Terraform officially supports around 130 providers
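The "keep a state and compare changes" idea can be sketched in a few lines of Python. This is not Terraform's implementation, only an illustration of diffing a desired configuration against recorded state. The `plan` function and the resource names are hypothetical.

```python
def plan(desired: dict, state: dict) -> dict:
    """Compare the desired resources with the recorded state.

    Returns the resource names to add, update and delete.
    """
    to_add = sorted(name for name in desired if name not in state)
    to_update = sorted(
        name for name in desired if name in state and desired[name] != state[name]
    )
    to_delete = sorted(name for name in state if name not in desired)
    return {"add": to_add, "update": to_update, "delete": to_delete}


# What was deployed last time (the "state"); hypothetical resources.
state = {
    "vm.superset": {"machine_type": "small"},
    "bucket.raw": {"location": "EU"},
}

# What the configuration now declares (the "what", not the "how").
desired = {
    "vm.superset": {"machine_type": "medium"},
    "dataset.analytics": {"location": "EU"},
}

changes = plan(desired, state)
print(changes)
```

Because the configuration only declares the end result, the tool derives the individual create, update and delete actions itself.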
Deploying a Modern Data Stack
Steps for deploying
an MDS
1. Pick a cloud.
2. Choose the tools.
3. Deploy the tools.
4. Ingest, Transform, Analyze &
Visualize
Pick a cloud
Google Cloud Platform (GCP)
• Launched in 2008 as Google’s
cloud computing platform
• Hosts the infrastructure that runs
Google Search,
Google Mail, Google Drive and
YouTube
• It’s easy to onboard and start
using GCP
Google BigQuery
Launched in 2011, BigQuery is Google’s
Data Warehouse of choice:
• Fully-managed by Google
• Serverless architecture
• Scalable to query petabytes of data
• Supports SQL as the standard
• Pay-as-you-query
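Pay-as-you-query means the bill follows the amount of data a query scans. A rough sketch of that arithmetic is below. The function name, the TiB billing unit and the price are assumptions for illustration, not official figures. Check the current pricing page before relying on any number.

```python
def estimate_query_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimate the on-demand cost of one query.

    price_per_tib is a caller-supplied assumption, not an official price.
    """
    tib_scanned = bytes_scanned / 2**40  # bytes -> TiB
    return tib_scanned * price_per_tib


# Hypothetical example: a query scanning half a TiB at an assumed price of 6.0.
cost = estimate_query_cost(bytes_scanned=2**39, price_per_tib=6.0)
print(cost)
```

Selecting only the columns you need and filtering on partitions reduces the bytes scanned, and with it the cost.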
Choose the tools
https://www.datafold.com/blog/the-modern-data-stack-open-source-edition
August 17, 2021
Airbyte
Founded in 2020, Airbyte is an Extract & Load
tool:
• Open-source software with a SaaS
offering (Airbyte Cloud)
• 100+ connectors and growing
• Has a scheduler
• Support for staged, incremental, normalized
loading
• SDK for creating custom connectors
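Incremental loading means each sync copies only the records that changed since the previous run, usually tracked with a cursor field. The sketch below shows the general idea in plain Python. It is not Airbyte's API; `incremental_sync`, the field names and the rows are hypothetical.

```python
def incremental_sync(source_rows, destination, cursor_field, last_cursor=None):
    """Append rows newer than last_cursor to destination; return the new cursor."""
    new_rows = [
        row
        for row in source_rows
        if last_cursor is None or row[cursor_field] > last_cursor
    ]
    new_rows.sort(key=lambda row: row[cursor_field])
    destination.extend(new_rows)
    if new_rows:
        last_cursor = new_rows[-1][cursor_field]
    return last_cursor


source = [
    {"id": 1, "updated_at": "2022-10-01"},
    {"id": 2, "updated_at": "2022-10-02"},
]
warehouse = []

# First sync: no cursor yet, so everything is loaded.
cursor = incremental_sync(source, warehouse, "updated_at")

# A new record arrives; the second sync only picks up that one.
source.append({"id": 3, "updated_at": "2022-10-03"})
cursor = incremental_sync(source, warehouse, "updated_at", cursor)

print([row["id"] for row in warehouse], cursor)
```

A full refresh would reload all three rows on the second run. The cursor keeps the load proportional to what actually changed.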
dbt
Released in 2016, data build tool (dbt) is a workflow tool
for transforming data inside a data warehouse:
• Open-source software with a SaaS offering (dbt Cloud)
• Connects to most major cloud DWHs
• Runs SQL statements on the DWH
• Creates a DAG of SQL “models”
• Supports documentation in code
• Cloud offering contains Scheduler, IDE, Documentation hosting
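In dbt, a model refers to another model with `{{ ref('...') }}`, and those references are what form the DAG. The sketch below mimics that idea with the standard library. It is not how dbt is implemented; `build_order` and the three model names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def build_order(models: dict) -> list:
    """Return model names in an order where dependencies run first."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical models: two staging models and one model built on top of them.
models = {
    "orders_per_customer": """
        select customer_id, count(*) as n_orders
        from {{ ref('stg_orders') }}
        join {{ ref('stg_customers') }} using (customer_id)
        group by customer_id
    """,
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
}

run_order = build_order(models)
print(run_order)
```

The two staging models have no dependencies and can run in either order. The model that references them always comes last.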
Superset
Starting out as an Apache Incubator project in 2017, Superset is an
open-source data exploration & data visualization tool:
• Open-source software with a SaaS offering (Preset)
• Connects to most major cloud DWHs
• Supports creating many different visualizations
• Data can be explored using SQL Lab
• Supports enterprise authentication (OpenID, LDAP, OAuth)
Installing the components
Superset on Google Compute (simple)
1. Create a single VM
2. Deploy Superset using its Docker Compose images
with a startup script and expose the port
3. Connect to the VM from the browser
Airbyte on Kubernetes (complex)
• Terraform module containing Kubernetes Airbyte integration
for GCP. Contains helm chart, networking, IAM,
Storage components.
• Publicly available at https://github.com/thomas-vl/airbyte-terraform
• Created by GDD Data Engineer Thomas van Latum
dbt Cloud
1. Create a dbt Cloud account
2. Create a GitHub repository
3. Connect to BigQuery
4. Create “models” for the dataset
5. Create and run a periodic job in the Scheduler
Demo: Ingest, Transform,
Analyze & Visualize
Demo architecture
Pros and cons of the “DIY”
Modern Data Stack
Pros
+ Quick to get started
+ Freedom to pick components
+ Pay-as-you-go cost model
+ Stand on the shoulders of OSS
Cons
- Can become very complex to maintain
- Cost model can be deceiving
- Dependent on OSS community
- Dangers of free architecture
Conclusions and future (?)
Conclusions
• Think deeply about buy vs build
• Consider going Cloud Native
• Consider using OSS
• Look at the competition around
you
• Monitor costs!
Future
• Battle of the Giants:
  • Snowflake vs Databricks
  • Azure vs GCP vs AWS
• Data Lineage
  • Tracking your data end-to-end
  • Column-level lineage
• Semantic Layer
  • Querying using Natural Language
  • Cross-platform integration
• Data Governance
  • Cataloging
  • Permissions
  • Self-service
Links
• https://www.moderndatastack.xyz/
• https://future.com/emerging-architectures-modern-data-infrastructure/
• https://www.datafold.com/blog/the-modern-data-stack-open-source-edition
GDD Solutions
Let us deploy an MDS for you and land your first analytics use case!
Check out https://godatadriven.com/what-we-do/solutions/
WWW.GODATADRIVEN.COM
lassebenninga@godatadriven.com
