Building a modern data warehouse in the cloud can be overwhelming because of the sheer number of products available, especially when the use cases of many products overlap. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions on when to use which products and the pros and cons of each.
Building a modern data warehouse
2. About Me
Microsoft, Big Data Evangelist
In IT for 30 years, worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
Been perm employee, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of book “Reporting with Microsoft SQL Server 2012”
3. I tried to understand the modern data warehouse on my own…
And felt like I was body slammed by Randy Savage.
Let’s prevent that from happening…
4. [Diagram: modern data warehouse architecture]
Sources: Social, LOB, Graph, IoT, Image, CRM
INGEST (data orchestration and monitoring): Azure Data Factory, SSIS
STORE (big data store): Azure Data Lake Storage Gen2, Blob Storage, Azure Data Lake Storage Gen1, SQL Server 2019 Big Data Cluster
PREP (transform & clean): Azure Databricks, Azure HDInsight, PolyBase & Stored Procedures, Power BI Dataflow, Azure Data Lake Analytics
MODEL & SERVE (& store) (data warehouse): Azure SQL Data Warehouse, Azure Analysis Services, SQL Database (Single, MI, HyperScale, Serverless), SQL Server in a VM, Cosmos DB, Power BI Aggregations
Consumers: Advanced Analytics, AI, BI + Reporting
9. Questions to ask customer
• Can you use the cloud?
• Is this a new solution or a migration?
• What is the skillset of the developers?
• Will you use non-relational data (variety)?
• How much data do you need to store (volume)?
• Is this an OLTP or OLAP/DW solution?
• Will you have streaming data (velocity)?
• Will you use dashboards and/or ad-hoc queries?
• Will you use batch and/or interactive queries?
• How fast do the operational reports need to run?
• Will you do predictive analytics?
• Do you want to use Microsoft tools or open source?
• What are your high availability and/or disaster recovery requirements?
• Do you need to master the data (MDM)?
• Are there any security limitations with storing data in the cloud?
• Does this solution require 24/7 client access?
• How many concurrent users will be accessing the solution at peak-time and on average?
• What is the skill level of the end users?
• What is your budget and timeline?
• Is the source data cloud-born and/or on-prem born?
• How much daily data needs to be imported into the solution?
• What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc.)?
• Are you OK with using products that are in preview?
18. LRS
• Multiple replicas across a datacenter
• Protect against disk, node, rack failures
• Write is ack’d when all replicas are committed
• Superior to dual-parity RAID
• 11 9s of durability
• SLA: 99.9%
ZRS
• Replicas across 3 Zones
• Protect against disk, node, rack and zone failures
• Synchronous writes to all 3 zones
• 12 9s of durability
• Available in 8 regions
• SLA: 99.9%
GRS
• Multiple replicas across each of 2 regions
• Protects against major regional disasters
• Asynchronous to secondary
• 16 9s of durability
• SLA: 99.9%
RA-GRS
• GRS + Read access to secondary
• Separate secondary endpoint
• RPO delay to secondary can be queried
• SLA: 99.99% (read), 99.9% (write)
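To make the "9s" and SLA figures above more concrete, here is a minimal Python sketch of the arithmetic. It assumes the durability figures are annual per-object numbers (the slide does not say so explicitly) and that a year has 8,760 hours; the function names are made up for illustration.

```python
# Convert "N nines" of durability and an SLA percentage into concrete numbers.
# Assumption: durability is quoted per object, per year.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_loss_probability(nines: int) -> float:
    """Chance that a single object is lost in a year, given N nines of durability."""
    return 10.0 ** -nines


def expected_objects_lost(nines: int, object_count: int) -> float:
    """Expected number of objects lost per year out of object_count stored objects."""
    return object_count * annual_loss_probability(nines)


def max_downtime_hours(sla_percent: float) -> float:
    """Hours per year a service may be unavailable while still meeting its SLA."""
    return HOURS_PER_YEAR * (1.0 - sla_percent / 100.0)


if __name__ == "__main__":
    # Durability figures as listed on the slide.
    for name, nines in [("LRS", 11), ("ZRS", 12), ("GRS", 16)]:
        lost = expected_objects_lost(nines, 1_000_000_000)
        print(f"{name}: about {lost:.1e} objects lost per year per billion stored")
    for sla in (99.9, 99.99):
        print(f"{sla}% SLA: up to {max_downtime_hours(sla):.2f} hours of downtime per year")
```

Under these assumptions, 11 nines works out to roughly one lost object per hundred years for every billion objects stored, while a 99.9% SLA still allows a little under nine hours of unavailability per year. Durability and availability are separate guarantees.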
20. [...] updateable distributed tables and replicated dimensional tables). We now have HDFS on-prem version. Both SQL and Spark can access same data. Great if you are already a SQL shop.
24. Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established, wants to be 100% open source, wants to use other Hadoop tools that are available 24/7 at a lower cost, or wants to use other tools like Kafka/Storm/HBase/R Server/LLAP/Hive/Pig.
[...] always running and incurring costs (no pausing or auto scale). Hortonworks merged with Cloudera.
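The rule of thumb on this slide can be written down as a small decision helper. This is only a sketch of the slide's stated exceptions, not an official selection tool; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CustomerProfile:
    # All field names are hypothetical; they paraphrase the exceptions on the slide.
    has_mature_hadoop_ecosystem: bool = False
    requires_fully_open_source: bool = False
    needs_always_on_hadoop_tools: bool = False  # tools available 24/7 at a lower cost
    needs_hdi_only_tools: bool = False  # e.g. Kafka, Storm, HBase, R Server, LLAP, Hive, Pig


def recommend_prep_platform(profile: CustomerProfile) -> str:
    """Return Azure HDInsight if any listed exception applies, else Azure Databricks."""
    exceptions = (
        profile.has_mature_hadoop_ecosystem,
        profile.requires_fully_open_source,
        profile.needs_always_on_hadoop_tools,
        profile.needs_hdi_only_tools,
    )
    return "Azure HDInsight" if any(exceptions) else "Azure Databricks"


if __name__ == "__main__":
    print(recommend_prep_platform(CustomerProfile()))
    print(recommend_prep_platform(CustomerProfile(needs_hdi_only_tools=True)))
```

Databricks is the default and each exception is an explicit flag, so the helper mirrors the "preferred unless..." wording rather than scoring the two products against each other.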
25. Use when you want to stick with T-SQL and don’t want to deal with Spark, Hive, or other more difficult technologies.
26. Integrates data lake and data prep technology (Power Query) directly into Power BI Service, independent of PBI reports. Self-service data prep.
Individual solution or for small workloads. Data Analysts and Business Analysts. Can transform data that lands in the data lake and can then be used as part of an enterprise solution.
27. [...] transforming large amounts of data in a data lake or replacing long-running monthly batch processing with shorter running distributed processes. Predictable performance with no startup time.
Does not support interactive queries, persistence, or indexing.
30. SQL-based, fully managed, petabyte-scale cloud data warehouse. Can scale compute and storage independently, allowing you to burst compute, and c[...]
MPP technology that shines when used for ad-hoc queries and operational reports in relational format.
Requires data to be copied from ADLS into SQL DW, but this can be done quickly using PolyBase.
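A common way to do the ADLS-to-SQL DW copy mentioned above is a CREATE TABLE AS SELECT (CTAS) from a PolyBase external table. The Python sketch below only builds that T-SQL text. It assumes the external table over the ADLS files (with its external data source and file format) already exists. The table and column names are hypothetical, and the hash distribution and columnstore index are illustrative choices, not something the slide prescribes.

```python
import re

# Accept only simple, optionally schema-qualified identifiers such as dbo.FactSales.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_ctas_load(target_table: str, external_table: str, distribution_column: str) -> str:
    """Build a CTAS statement that copies rows from an external table into SQL DW."""
    for name in (target_table, external_table, distribution_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (DISTRIBUTION = HASH({distribution_column}), CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external_table};"
    )


if __name__ == "__main__":
    # Hypothetical names: ext.FactSales is an external table over files in ADLS.
    print(build_ctas_load("dbo.FactSales", "ext.FactSales", "CustomerKey"))
```

The generated statement would then be run against the warehouse with whatever SQL client you already use. The identifier check is there because the statement is assembled by string formatting.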
33. Use cases: need control over / access to the operating system, have to run the app or agents side-by-side with the DB, need to use older version of SQL Server, SSRS, DW in the 4TB-50TB range, 3rd-party app not certified for PaaS, DBA afraid of losing his job, control over backups and maintenance window, want to avoid risk.
How to use: IaaS. Provision [...]
34. A globally distributed, multi-model (key-value, graph, and document) database service. It fits into the NoSQL camp by having a non-relational model (supporting schema-on-read and JSON documents).
Works really well for large-scale OLTP solutions.
[...] for DW aggregations. Use for data lake to have one datastore for both operational and analytical queries.
40. Microsoft data platform solutions
Product | Category | Description | More Info
SQL Server 2017 | RDBMS | Earned top spot in Gartner’s Operational Database magic quadrant. JSON support. Linux support | https://www.microsoft.com/en-us/server-cloud/products/sql-server-2017/
SQL Database | RDBMS/DBaaS | Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support. Managed Instance option | https://azure.microsoft.com/en-us/services/sql-database/
SQL Data Warehouse | MPP RDBMS/DBaaS | Cloud-based service that handles relational big data. Provision and scale quickly. Can pause service to reduce cost | https://azure.microsoft.com/en-us/services/sql-data-warehouse/
Azure Data Lake Store | Hadoop storage | Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics | https://azure.microsoft.com/en-us/services/data-lake-store/
HDInsight | PaaS Hadoop compute/Hadoop clusters-as-a-service | A managed Apache Hadoop, Spark, R Server, HBase, Kafka, Interactive Query (Hive LLAP) and Storm cloud service made easy | https://azure.microsoft.com/en-us/services/hdinsight/
Azure Databricks | PaaS Spark clusters | A fast, easy, and collaborative Apache Spark based analytics platform optimized for Azure | https://databricks.com/azure
Azure Data Lake Analytics | On-demand analytics job service/Big Data-as-a-service | Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language | https://azure.microsoft.com/en-us/services/data-lake-analytics/
Azure Cosmos DB | PaaS NoSQL: Key-value, Column-family, Document, Graph | Globally distributed, massively scalable, multi-model, multi-API, low latency data service – which can be used as an operational database or a hot data lake | https://azure.microsoft.com/en-us/services/cosmos-db/
Azure Database for PostgreSQL, MySQL, and MariaDB | RDBMS/DBaaS | A fully managed database service for app developers | https://azure.microsoft.com/en-us/services/postgresql
41. Azure Data Lake Storage Gen2
A “no-compromises” Data Lake: secure, performant, massively-scalable Data Lake storage that brings the cost and scale profile of object storage together with the performance and analytics feature set of data lake storage.
Pillars: Manageable, Scalable, Fast, Secure, Cost effective, Integration ready
• No limits on data store size
• Global footprint (50 regions)
• Optimized for Spark and Hadoop Analytic Engines
• Tightly integrated with Azure end to end analytics solutions
• Automated Lifecycle Policy Management
• Object Level tiering
• Support for fine-grained ACLs, protecting data at the file and folder level
• Multi-layered protection via at-rest Storage Service encryption and Azure Active Directory integration
• Atomic file operations means jobs complete faster
• Object store pricing levels
• File system operations minimize transactions required for job completion
42. Intelligence over all data
[Diagram with three panels: Data virtualization; Managed data lake with SQL Server and Spark; Complete AI platform. Labels: SQL Server, T-SQL, Analytics, Apps, Open database connectivity, NoSQL, Relational databases, HDFS, SQL Server External Tables, Compute pools and data pools, Spark, Scalable shared storage (HDFS), External data sources, Admin portal and management services, Integrated AD-based security, SQL Server ML Services, Spark & Spark ML, REST API containers for models]
Managing all data
• Store high volume data in a data lake and access it easily using either SQL or Spark
• Management services, admin portal, and integrated security make it all easy to manage
Integrating all data
• Combine data from many sources without moving or replicating it
• Scale out compute and caching to boost performance
AI over all data
• Easily feed integrated data from many sources to your model training
• Ingest and prep data and then train, store, and operationalize your models all in one system
43. Increase analytics and apps performance
[Diagram labels: Analytics, Custom apps, BI; SQL; SQL Server master instance; Compute pools (SQL Compute Nodes); Data pool (SQL Data Nodes); Storage pool (Kubernetes pods, each with SQL Server, Spark, and an HDFS Data Node); IoT data; Directly read from HDFS; Persistent storage; Nodes; Storage]
Intelligence over all data
46. [Diagram]
Entities: Contact, Lead, Opportunity, Account, Product Profile, People Profile, Customer Profile
Products: Power BI, Azure Databricks, Azure Data Factory, Azure SQL DW
Labels: Self-service data prep, Dataflows, AI consumption, Enterprise BI, Semantic models, Self-service BI, Data ingestion & orchestration, Enterprise data prep, Curated data
48. CLOUD DATA WAREHOUSE
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
49. MODERN DATA WAREHOUSE
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Logs (unstructured), Media (unstructured), Files (unstructured), Business/custom apps (structured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.
50. ADVANCED ANALYTICS ON BIG DATA
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Business/custom apps (structured), Files (unstructured), Media (unstructured), Logs (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, SparkR, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Cosmos DB
Consumers: Real-time apps
Microsoft Azure also supports other Big Data services like Azure HDInsight and Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
51. REAL TIME ANALYTICS
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Sensors and IoT (unstructured), Files (unstructured), Media (unstructured), Logs (unstructured), Business/custom apps (structured)
Services: Apache Kafka for HDInsight, Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Cosmos DB
Consumers: Real-time apps
Microsoft Azure also supports other Big Data services like Azure IoT Hub, Azure Event Hubs, and Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.
52. DATA MART CONSOLIDATION
Stages: INGEST, STORE, MODEL & SERVE
Sources: RDBMS data marts, Hadoop
Services: Azure Data Factory, Azure Data Lake Store Gen2, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the architecture to meet their unique needs.
53. HUB & SPOKE ARCHITECTURE FOR BI
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Business/custom apps (structured), Logs (unstructured), Media (unstructured), Files (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Multiple Azure SQL Database instances, Multiple Azure Analysis Services instances, Power BI
Other labels: SQL, Data Marts, Data Cubes
Microsoft Azure supports other services like Azure HDInsight to allow customers a truly customized solution.
54. AUTOSCALING DATA WAREHOUSE
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Business/custom apps (structured), Logs (unstructured), Media (unstructured), Files (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Functions (Auto-scaling), Azure Analysis Services, Power BI
Microsoft Azure supports other services like Azure HDInsight to allow customers a truly customized solution.
55. DATA WAREHOUSE MIGRATION
Stages: INGEST, STORE, PREP & TRAIN, MODEL & SERVE
Sources: Business/custom apps (structured), Logs (unstructured), Media (unstructured), Files (unstructured)
Services: Azure Data Factory, Azure Data Lake Store Gen2, Azure Databricks, PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Other labels: Business/custom apps
Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the architecture to meet their unique needs.
https://azure.microsoft.com/en-us/blog/json-functionalities-in-azure-sql-database-public-preview/ “If you need a specialized JSON database in order to take advantage of automatic indexing of JSON fields, tunable consistency levels for globally distributed data, and JavaScript integration, you may want to choose Azure DocumentDB as a storage engine.”
https://blogs.msdn.microsoft.com/jocapc/2015/05/16/json-support-in-sql-server-2016/
https://msdn.microsoft.com/en-us/library/dn921897.aspx “If you have pure JSON workloads where you want to use some query language that is customized and dedicated for processing of JSON documents, you might consider Microsoft Azure DocumentDB.”
http://demo.sqlmag.com/scaling-success-sql-server-2016/integrating-big-data-and-sql-server-2016
https://www.simple-talk.com/sql/learn-sql-server/json-support-in-sql-server-2016/
Integrating all data
Combine data from many sources without moving or replicating it – eliminate ETL, access current data, maintain security
Scale-out data marts cache data to boost performance
Managing all data
SQL Server can now read and write to HDFS
Store high volume data in a data lake and analyze it easily using either T-SQL or Spark
Management services, admin portal, and integrated security make it all easy to manage
Analyzing all data
Perform analytics over structured and unstructured data in real time
Easily feed integrated data from many sources to your model training
Ingest and prep data and then train, store, and operationalize your models all in one system
Increase analytics and apps performance with scale out data pools
Microsoft Azure supports other services like Azure HDInsight, Azure Data Lake, Azure IoT Hub, and Azure Event Hubs in various layers of the architecture above to allow customers a truly customized solution.
1) Copy source data into the Azure Data Lake Store (twitter data example)
2) Massage/filter the data using Hadoop (or skip using Hadoop and use stored procedures in SQL DW/DB to massage data after step #5)
3) Pass data into Azure ML to build models using Hive query (or pass in directly from Azure Data Lake Store)
4) Azure ML feeds prediction results into the data warehouse
5) Non-relational data in Azure Data Lake Store copied to data warehouse in relational format (optionally use PolyBase with external tables to avoid copying data)
6) Power BI pulls data from data warehouse to build dashboards and reports
7) Azure Data Catalog captures metadata from Azure Data Lake Store and SQL DW/DB
8) Power BI and Excel can pull data from the Azure Data Lake Store via HDInsight
9) To support high concurrency if using SQL DW, or for easier end-user data layer, create an SSAS cube
Individual/Personal BI vs Departmental/Team BI vs Enterprise/Corporate BI