Designing a Modern Data Warehouse in Azure
Antonios
Chatzipavlis
Data Solutions
Consultant & Trainer
1988 Beginning of my professional career
1996 I started working with SQL Server 6.0
1998 Certified as MCSD (3rd in Greece)
1999 Became an MCT
2010 Microsoft MVP on Data Platform
Created www.sqlschool.gr
2012 Became MCT Regional Lead by Microsoft Learning
2013 Certified as MCSE : Data Platform and
MCSE : Business Intelligence
2016 Certified as MCSE: Data Management & Analytics
2018 Certified as MCSA : Machine Learning
Recertified as MCSE: Data Management & Analytics
What we are doing
• Articles
• SQL Server in Greek
• SQL Nights
• Webcasts
• SQL Server News
• Downloads
• Resources
Follow us
fb/sqlschoolgr
fb/groups/sqlschool
@antoniosch
@sqlschool
yt/c/SqlschoolGr
SQLschool.gr Group
A community for
Greek professionals
who use the
Microsoft
Data Platform
Ask your question at help@sqlschool.gr
Explore
everything
PASS has
to offer
Free Online Resources
Newsletters
PASS.org
Get involved
Free online
webinar
events
Local user groups
around the world
Free 1-day local
training events
Online special
interest user
groups
Business analytics
training
bit.ly/AAB2019Evaluation
WHAT IS A DATA WAREHOUSE?
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.
TRADITIONAL DATA WAREHOUSE
SELF-SERVICE DATA WAREHOUSE
TRADITIONAL DW LIMITATIONS
Data
sources
User
Competition
Scaling up
Data
platforms
CURRENT DW CHALLENGES
Timeliness Flexibility Quality Findability
RECENT RESEARCH SURVEYS
Over 50% of respondents report that they will replace their primary DW platform and analytics tools within 3 years.
The data tsunami
WHY YOU NEED A
MODERN DATA
WAREHOUSE
Customer
experience
Quality
assurance
Operational
efficiency
Innovation
THE CRITERIA
FOR SELECTING A
MODERN DW
Meets Current
and Future
Needs
ON-PREMISES
VS.
CLOUD DW
• Evaluating Time to Value
• Accounting for Storage and Computing Costs
• Sizing, Balancing and Tuning
• Considering Data Preparation and ETL Costs
• Cost of Specialized Business Analytic Tools
• Scaling and Elasticity
• Delays and Downtime
• Cost of Security Breaches
• Data Protection and Recovery
STEPS TO
GETTING STARTED
WITH CLOUD DW
• Evaluate your data warehousing needs.
• Migrate or start fresh.
• Establish success criteria.
• Evaluate cloud data warehouse solutions.
• Calculate your total cost of ownership.
• Set up a proof of concept (POC).
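The total-cost-of-ownership step above can be sketched as a small calculation. This is only an illustration: the function names, the cost categories and every figure passed in are hypothetical placeholders, not real Azure or hardware prices. It assumes a cloud DW billed for compute only while running, with storage billed separately.

```python
def on_prem_tco(hardware_cost, annual_ops_cost, years):
    """Up-front hardware plus yearly operations (power, licences, staff)."""
    return hardware_cost + annual_ops_cost * years


def cloud_tco(compute_per_hour, active_hours_per_month,
              storage_tb, storage_per_tb_month, years):
    """Compute is billed only for the hours the warehouse is running;
    storage is billed for every month regardless."""
    months = years * 12
    monthly_compute = compute_per_hour * active_hours_per_month
    monthly_storage = storage_tb * storage_per_tb_month
    return months * (monthly_compute + monthly_storage)


if __name__ == "__main__":
    # All numbers below are made-up placeholders for illustration only.
    print("on-prem:", on_prem_tco(100_000, 20_000, years=3))
    print("cloud:  ", cloud_tco(10, 200, storage_tb=5,
                                storage_per_tb_month=100, years=3))
```

Replacing the placeholders with quotes from your own proof of concept makes the comparison meaningful; the shape of the formula matters more than the sample values.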
Azure Modern Data Warehouse
MODERN DATA WAREHOUSE
MODERN DATA WAREHOUSE IN AZURE
ADVANCED ANALYTICS ON BIG DATA
REAL-TIME ANALYTICS
SQL SERVER 2019 BIG DATA CLUSTERS
INGEST DATA
ADF
• PaaS
• Mapping Data Flow transforms data (ETL)
• Copy Data tool easily copies from source
to destination
• Templates
• Any new project
• Converting SSIS packages
• Row by row ETL can be slower
• Data needs to be moved to Databricks –
limited by compute size
• Mapping Data flow takes time to startup
SSIS
• SSDT – Visual Studio
• Very popular product
• Used for on-prem ETL for many years
• Too big of an effort to migrate existing
packages
• Skillset staying on-prem
• Change to IR in ADF
• Row by row ETL can be slower
• Data needs to be moved to IR
• Limited by node size/number of SSIS IR
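The "row by row ETL can be slower" point can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a real source and target (an assumption made for portability; neither ADF nor SSIS is involved), and the table names and tax-rate transformation are hypothetical.

```python
import sqlite3
import time


def build_source(conn, n_rows):
    """Create a source table and two empty targets in an in-memory database."""
    conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO src (id, amount) VALUES (?, ?)",
                     ((i, float(i)) for i in range(n_rows)))
    conn.execute("CREATE TABLE dst_row (id INTEGER PRIMARY KEY, amount_with_tax REAL)")
    conn.execute("CREATE TABLE dst_set (id INTEGER PRIMARY KEY, amount_with_tax REAL)")


def load_row_by_row(conn, tax_rate):
    """Pull every row into the pipeline, transform it, and insert it individually."""
    rows = conn.execute("SELECT id, amount FROM src").fetchall()
    for row_id, amount in rows:
        conn.execute("INSERT INTO dst_row VALUES (?, ?)",
                     (row_id, amount * (1 + tax_rate)))


def load_set_based(conn, tax_rate):
    """Let the engine transform and load the whole set in a single statement."""
    conn.execute("INSERT INTO dst_set SELECT id, amount * (1 + ?) FROM src",
                 (tax_rate,))


def timed(fn, *args):
    """Return the wall-clock seconds taken by fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_source(conn, 200_000)
    print("row by row:", timed(load_row_by_row, conn, 0.24))
    print("set based: ", timed(load_set_based, conn, 0.24))
```

Both loaders produce the same target rows; the difference is where the work happens. Actual timings depend on the engine and the network, so treat the printed numbers as indicative only.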
STORE DATA
ADLS Gen 2
• PaaS
• Best features of blob
storage
• Not all features are
available yet
• Some products do not support it yet
• 5TB file size limit
Blob Storage
• PaaS
• Original storage
• Most popular
• Don’t use for new projects
• Account limit 2 PB for US
and Europe
• 4.75 TB file size limit
SQL Server 2019 Big Data
Cluster
• IaaS
• Combines SQL Server
database engine, Spark,
HDFS (ADLS Gen2) into a
unified data platform
• Deployed as containers on
Kubernetes
• Polybase
• Hybrid cloud
• Data virtualization
• AI Platform
PREP DATA
Azure Databricks
• PaaS
• Processing massive
amounts of data
• Train & deploy
models
• Manage workflows
• Spark & notebooks
• Integration with
ADLS, SQL DW, PBI
• Writing Code
• High learning curve
Azure HDInsight
• PaaS
• Deploys &
provisions Apache
Hadoop clusters
• No integration with
SQL DW
• Always running and
incurring cost
• Hortonworks
merged with
Cloudera
Polybase & Stored
Procedures in SQL
DW
• IaaS
• T-SQL queries via
external tables
• Tuning queries
• Increase storage
space
PowerBI Dataflow
• PowerBI service
• Power Query
• Self-service data
prep
• Individual solution
• Small workloads
• Don’t use this to
replace a DW or
ADF
MODEL & SERVE DATA
Azure SQL DW
• PaaS
• Fully managed
petabyte scale
cloud DW
• Can scale compute
and storage
independently
• Can be paused
• MPP
Azure Analysis
Services
• PaaS
• Tabular model
• Fast queries
• High concurrency
• Semantic layer
• Vertical scale-out
• High availability
• Advanced time-
calculations
• Time to process
the cube
Azure SQL
Database
• PaaS
• Suitable for small
DW
• Size limits/tier
• Optimized for
OLTP
SQL Server in
VM
• IaaS
• MDX models
Cosmos DB
• PaaS
• Globally
distributed
• Multi-model
database service
• Spark to Cosmos
DB connector for
DW aggregations
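The MPP bullet for Azure SQL DW can be made more concrete with a sketch of hash distribution. Azure SQL DW spreads each table across 60 distributions; everything else here is an assumption for illustration, in particular the MD5-based hash, which is not the service's actual hash function, and the helper names.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # number of distributions in Azure SQL DW


def distribution_for(key, n=DISTRIBUTIONS):
    """Map a distribution-column value to a distribution number.
    MD5 is a stand-in hash chosen only because it is deterministic."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def rows_per_distribution(keys, n=DISTRIBUTIONS):
    """Count how many rows land on each distribution."""
    counts = Counter(distribution_for(k, n) for k in keys)
    return [counts.get(d, 0) for d in range(n)]


def skew(counts):
    """Largest distribution relative to the average; 1.0 means perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


if __name__ == "__main__":
    # High-cardinality key (e.g. an order id) spreads the rows evenly.
    order_ids = range(60_000)
    print("order_id skew:", round(skew(rows_per_distribution(order_ids)), 2))

    # Low-cardinality key (e.g. a region code) piles rows onto a few distributions.
    regions = ["EU", "US", "APAC"] * 20_000
    print("region skew:  ", round(skew(rows_per_distribution(regions)), 2))
```

A skewed distribution column leaves most compute nodes idle while a few do all the work, which is why the choice of distribution column matters for MPP query speed.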
ETL vs ELT
Time – Load
  ETL: Uses a staging area and system; extra time to load data
  ELT: All in one system; load only once
Time – Transformation
  ETL: Need to wait, especially for big data sizes; as data grows, transformation time increases
  ELT: All in one system; speed is not dependent on data size
Time – Maintenance
  ETL: High maintenance; you choose which data to load and transform, and must do it again if data is deleted or you want to enhance the main data repository
  ELT: Low maintenance; all data is always available
Implementation complexity
  ETL: At an early stage, requires less space and the result is clean
  ELT: Requires in-depth knowledge of tools and expert design of the main large repository
Analysis & Processing style
  ETL: Based on multiple scripts to create the views; deleting a view means deleting data
  ELT: Creating ad hoc views; low cost for building and maintaining
Data limitation or restriction
  ETL: By presuming and choosing data a priori
  ELT: By HW (none) and data retention policy
DW Support
  ETL: Prevalent legacy model used for on-premises, relational, structured data
  ELT: Tailored to scalable cloud infrastructure that supports structured and unstructured big data sources
Data Lake Support
  ETL: Not part of the approach
  ELT: Enables use of a data lake, with unstructured data supported
Usability
  ETL: Fixed tables, fixed timeline, used mainly by IT
  ELT: Ad hoc, agility, flexibility; usable by everyone from developer to citizen integrator
Cost-effective
  ETL: Not cost-effective for small and medium businesses
  ELT: Scalable and available to all business sizes using online SaaS solutions
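The ETL and ELT approaches compared above can be sketched side by side. This is a minimal illustration using Python's built-in sqlite3 module as a stand-in warehouse; the sample orders, table names and cleaning rules are all hypothetical.

```python
import sqlite3

# Hypothetical raw feed: untidy city names and one unparseable amount.
RAW_ORDERS = [
    ("2019-01-05", " Athens ", "120.50"),
    ("2019-01-05", "athens", "79.50"),
    ("2019-01-06", "Patras", "not-a-number"),
    ("2019-01-06", "PATRAS", "40.00"),
]


def run_etl(conn):
    """ETL: transform in the pipeline, then load only the finished rows."""
    conn.execute("CREATE TABLE etl_sales (order_date TEXT, city TEXT, amount REAL)")
    cleaned = []
    for order_date, city, amount in RAW_ORDERS:
        try:
            value = float(amount)
        except ValueError:
            continue  # rejected rows never reach the warehouse
        cleaned.append((order_date, city.strip().title(), value))
    conn.executemany("INSERT INTO etl_sales VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load everything as-is, then transform inside the engine with a view."""
    conn.execute("CREATE TABLE raw_orders (order_date TEXT, city TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute("""
        CREATE VIEW elt_sales AS
        SELECT order_date,
               UPPER(SUBSTR(TRIM(city), 1, 1)) || LOWER(SUBSTR(TRIM(city), 2)) AS city,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*'   -- keep rows whose amount starts with a digit
    """)


def totals(conn, relation):
    """Total sales per city from either the ETL table or the ELT view."""
    query = (f"SELECT city, ROUND(SUM(amount), 2) FROM {relation} "
             "GROUP BY city ORDER BY city")
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    run_elt(conn)
    print("ETL:", totals(conn, "etl_sales"))
    print("ELT:", totals(conn, "elt_sales"))
```

Both paths give the same totals per city, but only the ELT path still holds the rejected raw row, so a later change to the cleaning rule means redefining a view rather than reloading from the source.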
LAMBDA ARCHITECTURE
LAMBDA
ARCHITECTURE IN
AZURE
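As a rough sketch of the Lambda idea (a batch layer that recomputes views from the full history, a speed layer that covers recent events, and a serving step that merges both), the toy class below keeps everything in memory. The class and method names are hypothetical, and in Azure each layer would be a separate service rather than a Python dictionary.

```python
from collections import defaultdict


class LambdaPipeline:
    """Toy Lambda architecture that sums values per key."""

    def __init__(self):
        self.master_log = []                # append-only record of every event
        self.batch_view = {}                # precomputed by the batch layer
        self.speed_view = defaultdict(int)  # events seen since the last batch run

    def ingest(self, key, value):
        """New events go to both the master log and the speed layer."""
        self.master_log.append((key, value))
        self.speed_view[key] += value

    def run_batch(self):
        """Batch layer: recompute the view from the entire log,
        then reset the speed layer because the batch view now covers it."""
        view = defaultdict(int)
        for key, value in self.master_log:
            view[key] += value
        self.batch_view = dict(view)
        self.speed_view.clear()

    def query(self, key):
        """Serving step: merge the batch view with the real-time increment."""
        return self.batch_view.get(key, 0) + self.speed_view.get(key, 0)


if __name__ == "__main__":
    pipeline = LambdaPipeline()
    pipeline.ingest("sensor-1", 5)
    pipeline.ingest("sensor-1", 7)
    pipeline.run_batch()             # nightly recomputation
    pipeline.ingest("sensor-1", 3)   # arrives after the batch run
    print(pipeline.query("sensor-1"))
```

The batch view can be slow to build but is always rebuilt from the complete log, while the speed view stays small and is discarded after each batch run.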
COMMON
DATA MODEL
Thank you!
Copyright © 2018 SQLschool.gr. All rights reserved. PRESENTER MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION
