The Microsoft Analytics Platform System (APS) is a turnkey appliance that provides a modern data warehouse with the ability to handle both relational and non-relational data. It uses a massively parallel processing (MPP) architecture with multiple CPUs running queries in parallel. The APS includes an integrated Hadoop distribution called HDInsight that allows users to query Hadoop data using T-SQL with PolyBase. This provides a single query interface and allows users to leverage existing SQL skills. The APS appliance is pre-configured with software and hardware optimized to deliver high performance at scale for data warehousing workloads.
Modernizing to a Cloud Data Architecture - Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Big data architectures and the data lake - James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Data Architecture Best Practices for Advanced Analytics - DATAVERSITY
Many organizations are immature when it comes to data and analytics use. The answer lies in delivering a greater level of insight from data, straight to the point of need.
There are so many Data Architecture best practices today, accumulated from years of practice. In this webinar, William will look at some Data Architecture best practices that he believes have emerged in the past two years and have not yet been worked into many enterprise data programs. These are keepers that organizations will need to adopt by one means or another, so it's best to mindfully work them into the environment.
Data Architecture Strategies: Data Architecture for Digital Transformation - DATAVERSITY
Digital transformation rests on foundational data management approaches: MDM, data quality, data architecture, and more. At the same time, combining these foundational approaches with other innovative techniques can help drive organizational change as well as technological transformation. This webinar will provide practical steps for creating a data foundation for effective digital transformation.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Data Catalog for Better Data Discovery and Governance - Denodo
Watch full webinar here: https://buff.ly/2Vq9FR0
Data catalogs are in vogue, answering critical data governance questions like "Where does all my data reside?" "What other entities are associated with my data?" "What are the definitions of the data fields?" and "Who accesses the data?" Data catalogs maintain the necessary business metadata to answer these questions and many more. But that's not enough. To be useful, data catalogs need to deliver these answers to business users right within the applications they use.
In this session, you will learn:
*How data catalogs enable enterprise-wide data governance regimes
*What key capability requirements should you expect in data catalogs
*How data virtualization combines dynamic data catalogs with data delivery
This document outlines an agenda for a 90-minute workshop on Snowflake. The agenda includes introductions, an overview of Snowflake and data warehousing, demonstrations of how users utilize Snowflake, hands-on exercises loading sample data and running queries, and discussions of Snowflake architecture and capabilities. Real-world customer examples are also presented, such as a pharmacy building new applications on Snowflake and an education company using it to unify their data sources and achieve a 16x performance improvement.
Master the Multi-Clustered Data Warehouse - Snowflake - Matillion
Snowflake is one of the most powerful, efficient data warehouses on the market today—and we joined forces with the Snowflake team to show you how it works!
In this webinar:
- Learn how to optimize Snowflake
- Hear insider tips and tricks on how to improve performance
- Get expert insights from Craig Collier, Technical Architect from Snowflake, and Kalyan Arangam, Solution Architect from Matillion
- Find out how leading brands like Converse, Duo Security, and Pets at Home use Snowflake and Matillion ETL to make data-driven decisions
- Discover how Matillion ETL and Snowflake work together to modernize your data world
- Learn how to utilize the impressive scalability of Snowflake and Matillion
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Product-thinking is making a big impact in the data world with the rise of Data Products, Data Product Managers, data mesh, and treating “Data as a Product.” But Honest, No-BS: What is a Data Product? And what key questions should we ask ourselves while developing them? Tim Gasper (VP of Product, data.world), will walk through the Data Product ABCs as a way to make treating data as a product way simpler: Accountability, Boundaries, Contracts and Expectations, Downstream Consumers, and Explicit Knowledge.
Presentation on Data Mesh: the paradigm shift to a new type of ecosystem architecture, a shift left toward a modern distributed architecture that allows domain-specific ownership of data, views "data-as-a-product," and enables each domain to handle its own data pipelines.
Snowflake's Kent Graziano talks about what makes a data warehouse as a service and some of the key features of Snowflake's data warehouse as a service.
Intuit's Data Mesh - Data Mesh Learning Community meetup 5.13.2021 - Tristan Baker
Past, present, and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
Introducing Snowflake, an elastic data warehouse delivered as a service in the cloud. It aims to simplify data warehousing by removing the need for customers to manage infrastructure, scaling, and tuning. Snowflake uses a multi-cluster architecture to provide elastic scaling of storage, compute, and concurrency. It can bring together structured and semi-structured data for analysis without requiring data transformation. Customers have seen significant improvements in performance, cost savings, and the ability to add new workloads compared to traditional on-premises data warehousing solutions.
Cortana Analytics Suite is a fully managed big data and advanced analytics suite that transforms your data into intelligent action. It is comprised of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription. This presentation will cover all the products involved, how they work together, and use cases.
Azure Stream Analytics (ASA) is an Azure Service that enables real-time insights over streaming data from devices, sensors, infrastructure, and applications. In this presentation, we provide introduction to the service, common use cases, example customer scenarios, business benefits, and demo how to get started. We will quickly build a simple real time analytic application that uses an IoT device to ingest data (Event Hubs), process and analyze data (Stream Analytics) and visualize data (PowerBI).
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa... - MSAdvAnalytics
Lance Olson. Cortana Analytics is a fully managed big data and advanced analytics suite that helps you transform your data into intelligent action. Come to this two-part session to learn how you can do "big data" processing and storage in Cortana Analytics. In the first part, we will provide an overview of the processing and storage services. We will then talk about the patterns and use cases which make up most big data solutions. In the second part, we will go hands-on, showing you how to get started today with writing batch/interactive queries, real-time stream processing, or NoSQL transactions all over the same repository of data. Crunch petabytes of data by scaling out your computation power to any sized cluster. Store any amount of unstructured data in its native format with no limits to file or account size. All of this can be done with no hardware to acquire or maintain and minimal time to setup giving you the value of "big data" within minutes. Go to https://channel9.msdn.com/ to find the recording of this session.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
Building an Effective Data Warehouse Architecture - James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
So you got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, storing it, to visualizing it, I will show you Microsoft’s solutions for every step of the way.
As a follow-on to the presentation "Building an Effective Data Warehouse Architecture", this presentation will explain exactly what Big Data is and its benefits, including use cases. We will discuss how Hadoop, the cloud and massively parallel processing (MPP) are changing the way data warehouses are being built. We will talk about hybrid architectures that combine on-premises data with data in the cloud as well as relational data and non-relational (unstructured) data. We will look at the benefits of MPP over SMP and how to integrate data from Internet of Things (IoT) devices. You will learn what a modern data warehouse should look like and how the role of a Data Lake and Hadoop fit in. In the end you will have guidance on the best solution for your data warehouse going forward.
Should I move my database to the cloud? - James Serra
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits by moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I've got you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability and ending the days of upgrading hardware, this is the session for you!
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
I often hear from clients: “We don’t know much about Big Data – can you tell us what it is and how it can help our business?” Yes! The first step is this vendor-free presentation, where I start with a business level discussion, not a technical one. Big Data is an opportunity to re-imagine our world, to track new signals that were once impossible, to change the way we experience our communities, our places of work and our personal lives. I will help you to identify the business value opportunity from Big Data and how to operationalize it. Yes, we will cover the buzz words: modern data warehouse, Hadoop, cloud, MPP, Internet of Things, and Data Lake, but I will show use cases to better understand them. In the end, I will give you the ammo to go to your manager and say “We need Big Data and here is why!” Because if you are not utilizing Big Data to help you make better business decisions, you can bet your competitors are.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data warehouse-as-a-service and is a Massively Parallel Processing (MPP) solution for "big data" with true enterprise class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with truly unique features like disaggregated compute and storage allowing customers to utilize the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase allowing for a true SQL experience across structured and unstructured data.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
The document summarizes new features in SQL Server 2016 SP1, organized into three categories: performance enhancements, security improvements, and hybrid data capabilities. It highlights key features such as in-memory technologies for faster queries, always encrypted for data security, and PolyBase for querying relational and non-relational data. New editions like Express and Standard provide more built-in capabilities. The document also reviews SQL Server 2016 SP1 features by edition, showing advanced features are now more accessible across more editions.
The cloud is all the rage. Does it live up to its hype? What are the benefits of the cloud? Join me as I discuss the reasons so many companies are moving to the cloud and demo how to get up and running with a VM (IaaS) and a database (PaaS) in Azure. See why the ability to scale easily, the quickness that you can create a VM, and the built-in redundancy are just some of the reasons that make moving to the cloud a “no brainer”. And if you have an on-prem datacenter, learn how to get out of the air-conditioning business!
Relational databases vs Non-relational databases - James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Choosing technologies for a big data solution in the cloud - James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
James has worked at Microsoft for the past year. Before that, he was an independent consultant as well as having worked as a permanent employee and contractor at numerous companies. What is different about Microsoft? What is it like to see how things work “behind the curtain”? How does it compare to what he anticipated it to be like? Come join this session to find out more about working for Microsoft: benefits, compensation, training, career advancement, work-life balance, travel, types of jobs, etc. We will leave plenty of time to ask questions!
This presentation is for those of you who are interested in moving your on-prem SQL Server databases and servers to Azure virtual machines (VM’s) in the cloud so you can take advantage of all the benefits of being in the cloud. This is commonly referred to as a “lift and shift” as part of an Infrastructure-as-a-service (IaaS) solution. I will discuss the various Azure VM sizes and options, migration strategies, storage options, high availability (HA) and disaster recovery (DR) solutions, and best practices.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Prague data management meetup 2018-03-27 - Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
Data Warehouse Modernization: Accelerating Time-To-Action - MapR Technologies
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Microsoft Data Platform - What's included - James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
Data Analytics Meetup: Introduction to Azure Data Lake Storage - CCG
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen1 with Microsoft Data Platform Specialist, Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud - DataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
Streaming Real-time Data to Azure Data Lake Storage Gen 2 - Carole Gunst
Check out this presentation to learn the basics of using Attunity Replicate to stream real-time data to Azure Data Lake Storage Gen2 for analytics projects.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics on AWS - Amazon Web Services LATAM
Data lakes allow organizations to store all types of data in a centralized repository at scale. AWS Lake Formation makes it easy to build secure data lakes by automatically registering and cleaning data, enforcing access permissions, and enabling analytics. Data stored in data lakes can be analyzed using services like Amazon Athena, Redshift, and EMR depending on the type of analysis and latency required.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
The document discusses using Attunity Replicate to accelerate loading and integrating big data into Microsoft's Analytics Platform System (APS). Attunity Replicate provides real-time change data capture and high-performance data loading from various sources into APS. It offers a simplified and automated process for getting data into APS to enable analytics and business intelligence. Case studies are presented showing how major companies have used APS and Attunity Replicate to improve analytics and gain business insights from their data.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... - Pentaho
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
Testing Big Data: Automated Testing of Hadoop with QuerySurge - RTTS
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Cardinality provides innovative analytics using big data techniques. It analyzes data from multiple sources, including at the customer, service, and network-wide levels. Cardinality's analytics platform includes components for data collection, storage, analysis, and visualization. It utilizes open-source technologies like Hadoop, Spark, and open APIs to provide customizable and scalable solutions to customers.
Similar to Modern Data Warehousing with the Microsoft Analytics Platform System
Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs for every enterprise and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys an end-to-end, highly integrated, single offering that is easy to understand, onboard, create and operate.
This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.
Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources
Data Warehousing Trends, Best Practices, and Future Outlook - James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But, that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and discussion on commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Power BI Overview, Deployment and Governance - James Serra
This document provides an overview of external sharing in Power BI using Azure Active Directory Business-to-Business (Azure B2B) collaboration. Azure B2B allows Power BI content to be securely distributed to guest users outside the organization while maintaining control over internal data. There are three main approaches for sharing - assigning Pro licenses manually, using guest's own licenses, or sharing to guests via Power BI Premium capacity. Azure B2B handles invitations, authentication, and governance policies to control external sharing. All guest actions are audited. Conditional access policies can also be enforced for guests.
Power BI has become a product with a ton of exciting features. This presentation will give an overview of some of them, including Power BI Desktop, Power BI service, what’s new, integration with other services, Power BI premium, and administration.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
This document provides an overview and summary of the author's background and expertise. It states that the author has over 30 years of experience in IT working on many BI and data warehouse projects. It also lists that the author has experience as a developer, DBA, architect, and consultant. It provides certifications held and publications authored as well as noting previous recognition as an SQL Server MVP.
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag... - James Serra
Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content.
Power BI for Big Data and the New Look of Big Data Solutions - James Serra
New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution. In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution. We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized.
In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work.
I'll cover tips I learned over my career on everything from interviewing to building your personal brand. I'll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.
Your first step to enhancing your career will be to attend this session! Let me be your career coach!
Is the traditional data warehouse dead? - James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer. It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance). Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc. So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement from the current Singleton or Elastic Pool flavors, which can require substantial changes.
Learning to present and becoming good at it - James Serra
Have you been thinking about presenting at a user group? Are you being asked to present at your work? Is learning to present one of the keys to advancing your career? Or do you just think it would be fun to present but you are too nervous to try it? Well take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it. It’s easier than you think! I am an introvert and was deathly afraid to speak in public. Now I love to present and it’s actually my main function in my job at Microsoft. I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear. You can do it!
DocumentDB is a powerful NoSQL solution. It provides elastic scale, high performance, global distribution, a flexible data model, and is fully managed. If you are looking for a scaled OLTP solution that is too much for SQL Server to handle (i.e. millions of transactions per second) and/or will be using JSON documents, DocumentDB is the answer.
Introduction to Microsoft’s Hadoop solution (HDInsight) - James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
How we implemented "Exactly Once" semantics in our database ... - javier ramirez
Distributed systems are hard. High-performance distributed systems, even more so. Network latencies, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers that the system on the other side tolerates duplicates.
QuestDB is an open source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, enabling deduplication as well as upserts on real-time data while adding only 8% processing overhead, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. Of course, all of this comes with demos, so you can see how it works in practice.
Airline Satisfaction Project using Azure
This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, extending relational database workloads in Aurora beyond the limits of a single Aurora writer instance without having to create custom application logic or manage multiple databases.
An LLM-powered contract compliance application that uses the advanced RAG method Self-RAG together with a Knowledge Graph for the first time.
It provides the highest accuracy for contract compliance recorded so far in the oil and gas industry.
Amazon DocumentDB (with MongoDB compatibility) is a fast, reliable, fully managed database service. With Amazon DocumentDB, you can easily set up, operate, and scale MongoDB-compatible databases in the cloud. In this hands-on session, you will run the same application code used with MongoDB and use the same drivers and tools with Amazon DocumentDB.
6. Are you using or going to use "Big Data" and/or "Hadoop"?
• No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions.
• Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting.
• MOLAP cube processing and data refresh take too long.
• Slow query performance with need for constant tuning, especially with SAN storage.
• High cost of SAN storage chargeback.
7. Roadblocks to evolving to a modern data warehouse
Each solution and the issue with that solution:
• Keep legacy investment: limited scalability & ability to handle new data types
• Buy a new tier-one hardware appliance: high acquisition/migration costs & no Hadoop
• Acquire a big data solution (Hadoop): significant training & still siloed
• Acquire a business intelligence solution: complex with low adoption
8. Introducing the Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on premises or data stored in the cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI
• Near real-time performance with In-Memory
• Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance is needed
• No performance tuning required
• Concurrency that fuels rapid adoption
• Industry's lowest DW price/TB
• Value through a single-appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on SAN (cost averages 10k per TB)
10. Hardware and software engineered together
The ease of an appliance:
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Pre-configured, built, and tuned software and hardware
• Integrated support plan with a single Microsoft contact
Components: PDW, HDInsight, PolyBase
11. APS History
• DATAllegro started in 2003
• Microsoft acquired DATAllegro in September 2008
• PDW released in December 2010 (version 1)
• Version 2 made available in March 2013 (PolyBase introduced)
• AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region
• AU2 released in July 2014
• AU3 released in October 2014
There will be AU updates every 3-4 months.
NOTE: This is a data warehouse solution and not an OLTP (online transaction processing) solution.
Case studies: go to https://customers.microsoft.com and enter "parallel data warehouse" (old name) in the keyword box and search the results, then enter "analytics platform system" (new name).
12. Parallelism
MPP - Massively Parallel Processing:
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
SMP - Symmetric Multiprocessing:
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
13. APS Logical Architecture (overview)
[Diagram: a "Control" node (SQL + DMS) in front of multiple "Compute" nodes, each running SQL with DMS and balanced storage]
Compute node - the "worker bee" of APS:
• Runs SQL Server 2014 APS
• Contains a "slice" of each database
• CPU is saturated by storage
Control node - the "brains" of the APS:
• Also runs SQL Server 2014 APS
• Holds a "shell" copy of each database (metadata, statistics, etc.)
• The "public face" of the appliance
Data Movement Services (DMS):
• Part of the "secret sauce" of APS
• Moves data around as needed
• Enables parallel operations among the compute nodes (queries, loads, etc.)
14. APS Logical Architecture (overview)
[Diagram: the control node fans a query out to the compute nodes, each with SQL, DMS, and balanced storage]
1) User connects to the appliance (control node) and submits a query
2) The control node query processor determines the best *parallel* query plan
3) DMS distributes sub-queries to each compute node
4) Each compute node executes its sub-query on its subset of data
5) Each compute node returns a subset of the response to the control node
6) If necessary, the control node does any final aggregation/computation
7) The control node returns results to the user
Queries run in parallel on subsets of the data, using separate pipes, effectively making the pipe larger.
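To make the flow concrete, here is a minimal sketch of step 1 as the user experiences it: one ordinary T-SQL query submitted to the control node. The table and column names are adapted (spaces removed) from the star schema on the next slide; everything in steps 2-7 happens transparently.

-- Submitted to the control node like any SQL Server query (step 1).
-- The engine builds a parallel plan (step 2), DMS fans sub-queries out
-- to the compute nodes (step 3), each node scans its slice of the fact
-- table (step 4), and the partial results are merged on the control
-- node before being returned (steps 5-7).
SELECT st.StoreName,
       SUM(f.DollarsSold) AS TotalDollarsSold
FROM   SalesFact AS f
JOIN   StoreDim  AS st ON f.StoreDimID = st.StoreDimID
GROUP BY st.StoreName;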
15. APS Data Layout Options
[Diagram: a star schema with Time Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day), Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size), Product Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc), and Customer Dim (Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email) surrounding a Sales Fact table (Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold); every compute node holds a full copy of each dimension and a slice of the fact table]
• Replicated: table copied to each compute node
• Distributed: table spread across compute nodes based on a "hash"
• Star schema: replicate the dimensions, distribute the fact table
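A hedged sketch of the two layouts in APS/PDW-style DDL, using simplified names from the star schema above. DISTRIBUTION = REPLICATE and DISTRIBUTION = HASH are the PDW table options; the choice of hash column here is illustrative, not a recommendation from the deck.

-- Small dimension: a full copy lives on every compute node, so joins
-- against it never require data movement.
CREATE TABLE StoreDim
( StoreDimID  int NOT NULL
, StoreName   varchar(50)
, StoreMgr    varchar(50)
, StoreSize   int
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact table: rows are spread across the compute nodes by a hash
-- of the chosen column.
CREATE TABLE SalesFact
( DateDimID   int NOT NULL
, StoreDimID  int NOT NULL
, ProdDimID   int NOT NULL
, CustDimID   int NOT NULL
, QtySold     int
, DollarsSold money
)
WITH (DISTRIBUTION = HASH(StoreDimID));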
17. APS - Balanced across servers and within
• Largest table: 600,000,000,000 rows
• Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node
• In each server, randomly distributed across 8 tables (so 320 total tables): 1,875,000,000 rows per table
• Each table holds 2 years of data partitioned by week, roughly 104 partitions (benefiting queries by date): 18,028,846 rows per partition
As an end user or DBA you think about 1 table: LineItem.
“Select * from LineItem” is split into 320 queries running in parallel against 320 (1.875b row) tables.
“Select * from LineItem where OrderDate = ‘1/1/2014’” is 320 queries against 320 (18m row) tables.
You don’t care or need to know that there are actually 320 tables representing your 1 logical table.
CCI (clustered columnstore indexes) can add further performance via segment elimination.
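A hedged sketch of the DDL behind such a layout: a hash-distributed clustered columnstore table partitioned by date, so a WHERE clause on OrderDate touches only the matching weekly partitions and CCI segment elimination can skip the rest. The columns and boundary values are illustrative, not from the deck.

-- Illustrative only: a weekly-partitioned, hash-distributed CCI table.
CREATE TABLE LineItem
( OrderDate   date   NOT NULL
, OrderID     bigint NOT NULL
, ExtendedAmt money
)
WITH ( DISTRIBUTION = HASH(OrderID)
     , CLUSTERED COLUMNSTORE INDEX
     , PARTITION ( OrderDate RANGE RIGHT FOR VALUES
         ('2013-01-07', '2013-01-14' /* ...one boundary per week... */ ) )
     );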
22. What is Hadoop?
• Distributed, scalable system on commodity hardware
• Composed of a few parts:
  - HDFS: distributed file system
  - MapReduce: programming model
  - Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
• Main players are Hortonworks, Cloudera, and MapR
• WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
[Diagram: Hadoop core services split into operational services (Ambari, Oozie, Falcon) and data services (HDFS, WebHDFS, NFS, Sqoop, and Flume for load & extract; YARN, MapReduce, Hive & HCatalog, Pig, HBase); Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware]
23. Complex query and analysis with big data today: steep learning curve, slow and inefficient
• Either move HDFS data into the warehouse before analysis (ETL), so existing T-SQL skills still apply
• Or learn new skills and build, integrate, manage, maintain, and support a separate Hadoop ecosystem for the “new” data sources
24. APS delivers enterprise-ready Hadoop with HDInsight
Manageable, secured, and highly available Hadoop integrated into the appliance:
• 100% Apache Hadoop
• High performance, tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• PolyBase connects SQL Server Parallel Data Warehouse and Microsoft HDInsight, so you can leverage your existing T-SQL skills
Additional features over a separate Hadoop cluster, plus still one support contact!
25. APS appliance overview
[Diagram: a Parallel Data Warehouse region and an HDInsight region sitting on the shared fabric and hardware of the appliance]
A region is a logical container within an appliance. Each workload region has the following boundaries:
• Security
• Metering
• Servicing
26. Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT:
• Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop, with the rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
• Uses your existing SQL skill set, with no IT intervention
[Diagram: SQL Server Parallel Data Warehouse and Microsoft HDInsight (HDP 2.0) connected through PolyBase to Cloudera CDH 5.1 (Linux), Hortonworks HDP 2.2 (Windows, Linux), and Windows Azure HDInsight (HDP 2.2, WASB); others (SQL Server, DB2, Oracle)? A true federated query engine]
A hedged sketch of the PolyBase DDL follows.
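The sketch below shows the PolyBase objects involved, using DDL of the form found in later APS releases and SQL Server PolyBase; the Hadoop address, paths, and table names are all illustrative assumptions:

-- Point the appliance at a Hadoop cluster.
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

-- Describe the file layout of the data in HDFS.
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Expose an HDFS directory as a queryable external table.
CREATE EXTERNAL TABLE dbo.WebClicks
( ClickDate date, CustDimID int, Url varchar(500) )
WITH (LOCATION = '/logs/clicks/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited);

-- Join Hadoop data to warehouse data in plain T-SQL, no ETL.
SELECT c.CustName, COUNT(*) AS Clicks
FROM dbo.WebClicks AS w
JOIN dbo.CustomerDim AS c ON w.CustDimID = c.CustDimID
GROUP BY c.CustName;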
27. Use cases where PolyBase simplifies using Hadoop data
• Bringing islands of Hadoop data together
• High-performance queries against Hadoop data (predicate pushdown)
• Archiving data warehouse data to Hadoop (move; Hadoop as cold storage)
• Exporting relational data to Hadoop (copy; Hadoop as backup/DR, analysis, cloud use)
• Importing Hadoop data into the data warehouse (copy; Hadoop as staging area, sandbox, data lake)
Sketches of the archive and import cases follow this list.
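Hedged sketches of the archive and import cases, reusing the hypothetical external data source and file format defined above:

-- Archive (move): write warehouse rows out to Hadoop with CETAS
-- (CREATE EXTERNAL TABLE AS SELECT), then drop them from the warehouse.
CREATE EXTERNAL TABLE dbo.SalesFact_2012
WITH (LOCATION = '/archive/sales/2012/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited)
AS SELECT * FROM dbo.SalesFact WHERE DateDimID < 20130101;

-- Import (copy): pull Hadoop data into a distributed warehouse table
-- with CTAS (CREATE TABLE AS SELECT).
CREATE TABLE dbo.WebClicksStaged
WITH (DISTRIBUTION = HASH(CustDimID))
AS SELECT * FROM dbo.WebClicks;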
28. Big data insights for anyone
Native Microsoft BI integration to create new insights with familiar tools:
• Power users: tools like Power BI minimize IT intervention for discovering data
• DBAs and power users: T-SQL to join relational and Hadoop data
• Data scientists: Hadoop tools like MapReduce, Hive, and Pig
• Everyone else: Microsoft BI tools, leveraging the high adoption of Excel, Power View, PowerPivot, and SSAS
30. Scale-out technologies in the Analytics Platform System
• Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
• Multiple nodes with dedicated CPU, memory, and storage (“shared nothing”)
• Incrementally add hardware for near-linear scale to multiple petabytes (no need to delete older data or stage it)
• Handles query complexity and concurrency at scale
• No “forklift” replacement of the prior warehouse to increase capacity; start small with a few-terabyte warehouse
• Mixed workload support: query while you load (250 GB/hour per node), with no need for a maintenance window
[Diagram: capacity grows from 0 TB toward 6 PB by adding PDW or HDInsight scale units]
31. Updatable clustered columnstore vs. table with customary indexing
• Stores data in columnar format for massive compression (up to 15x more compression)
• Loads data into or out of memory for next-generation performance (up to 100x faster queries)
• Updatable and clustered, for real-time trickle loading
• No secondary indexes required
[Diagram: columnstore index representation and parallel query execution]
A sketch of the DDL follows.
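A minimal sketch of both ways to get a clustered columnstore in APS / SQL Server 2014 (the table and index names are illustrative):

-- Declare the clustered columnstore when the table is created...
CREATE TABLE dbo.SalesFact_CCI
( DateDimID int NOT NULL, ProdDimID int NOT NULL, DollarsSold money )
WITH (DISTRIBUTION = HASH(ProdDimID), CLUSTERED COLUMNSTORE INDEX);

-- ...or convert an existing rowstore table in place.
CREATE CLUSTERED COLUMNSTORE INDEX cci_SalesFact ON dbo.SalesFact;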
32. Investment firm before/after results: HP SMP vs. APS
• 21x improvement loading data (7:30 minutes vs. 21 seconds)
• 62x improvement staging to landing (30 minutes vs. 29 seconds)
• 17x, 166x, and 169x query performance improvements (e.g., 1:05 hours vs. 23 seconds)
• 46x improvement creating a datamart (70 minutes vs. 1:31 minutes)
• 1.1 TB/hr loading rate and 8.8x compression on 2 billion rows (472 GB to 53 GB)
• Microsoft BI tools work unchanged
33. Example hub-and-spoke architecture
[Diagram: sources flowing into the Analytics Platform System hub, with SQL Server SMP spokes feeding BI tools]
• Sources: ERP, CRM, and LOB apps arrive via ETL/ELT with SSIS, DQS, and MDS, or via ETL/ELT with DWLoader; Hadoop/big data reaches PDW through PolyBase and HDInsight
• The APS hub (PDW) serves ad hoc queries: intra-day, near real-time, fast ad hoc via columnstore and PolyBase
• CRTAS (“link table”) copies data to SQL Server SMP spokes for reporting and cubes: real-time ROLAP/MOLAP, DirectQuery, SNAC
• The spokes add concurrency that fuels rapid adoption and great performance with mixed workloads
A sketch of CRTAS follows.
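CRTAS (CREATE REMOTE TABLE AS SELECT) materializes a query result from the appliance onto an SMP SQL Server spoke. A hedged sketch; the spoke address, credentials, and names are placeholders:

-- Push an aggregate out to a spoke SQL Server for reporting.
CREATE REMOTE TABLE ReportDB.dbo.DailySales
AT ('Data Source = spoke-sql01, 1433; User ID = loader; Password = <placeholder>;')
AS
SELECT DateDimID, SUM(DollarsSold) AS DollarsSold
FROM dbo.SalesFact
GROUP BY DateDimID;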
34. Example overall data flow and architecture
• Event and data producers: web logs; IoT, mobile devices, etc.; social data
• Ingest: Event Hubs, Azure Blob Storage
• Transform: Stream Analytics, HDInsight, Azure Data Factory
• DW / long-term storage: Analytics Platform System, Azure SQL DB
• Predictive analytics: Azure Machine Learning (fraud detection, etc.)
• Present & decide: Power BI, web dashboards, mobile devices
36. APS provides the industry’s lowest DW appliance price/TB
Reshaped hardware specs through software innovation:
• Significantly lower price per TB than the closest competitor
• Lower storage costs with Windows Server 2012 Storage Spaces
• Small cost gap between multiple clustered HP DL980s with SAN vs. an APS 1/4 rack
[Chart: TCO per TB (uncompressed), in thousands of dollars, for leading vendors (Sept 2014): Oracle, Pivotal, IBM, Teradata, and Microsoft, on a scale of $0 to $140K]
37. Virtualized architecture overview
[Diagram: a base unit of four hosts linked by InfiniBand and Ethernet to economical, direct-attached SAS disk storage; fabric VMs (CTL, MAD, AD, VMM) run alongside the Compute 1 and Compute 2 workload VMs]
Software details:
• The control VM runs the APS engine, the DMS Manager, and SQL Server 2012 Enterprise Edition (APS build; AU3: SQL Server 2014)
• All hosts run Windows Server 2012 Standard (AU3: 2012 R2) and Windows Azure Virtual Machines
• Fabric and workload run in Hyper-V virtual machines
• The fabric virtual machine, management server (MAD01), and control server (CTL) share one physical server
• An APS agent runs on all hosts and all virtual machines
• DWConfig and the Admin Console provide configuration and monitoring
• Windows Storage Spaces and Azure Storage blobs
• Does not require expertise in Hyper-V or Windows
38. APS High Availability
[Diagram: when a compute host or the control host fails, its virtual machines (FAB, AD, VMM, MAD, CTL, Compute 1, Compute 2) restart on a dedicated failover host; the InfiniBand and Ethernet networks are each duplicated]
• No single point of failure
• Redundant InfiniBand and Ethernet networks
• No need for SQL Server clustering
39. Less DBA Maintenance/Monitoring
• No index creation
• No deleting/archiving data to save space
• Management simplicity (System Center, Admin console, DMVs)
• No blocking
• No logs
• No query hints
• No wait states
• No IO tuning
• No query optimization/tuning
• No index reorgs/rebuilds
• No partitioning
• No managing filegroups
• No shrinking/expanding databases
• No managing physical servers
• No patching servers and software
RESULT: DBAs spend more of their time as architects, not babysitters!
40. Analytics Platform System: the no-compromise modern data warehouse solution
Microsoft’s turnkey modern data warehouse appliance. Summary of benefits:
• Improved query performance
• Faster data loading
• Improved concurrency
• Less DBA maintenance
• Limited training needed
• Use familiar BI tools
• Ease of appliance deployment
• Mixed workload support
• Improved data compression
• Scalability
• High availability
• PolyBase
• Integration with cloud-born data
• HDInsight/Hadoop integration
• Data warehouse consolidation
• Easy support model
(Bold items on the original slide mark benefits of APS over merely upgrading to SQL Server 2014, with no worry about future hardware roadblocks.)
42. Appliance update highlights
Enterprise-ready big data, cloud enabled:
• Improved PolyBase support: Cloudera 5.1 support, partial aggregate pushdowns
• Expanding big data capacity: grow an HDInsight region on an appliance with an existing region
Next-gen performance, engineered for optimal value:
• 1.5x data return rate for SELECT * queries
• Streaming large data sets to external apps (e.g., SSAS, SAS, R, etc.)
T-SQL compatibility (a scalar UDF sketch follows this list):
• Scalar UDFs (CREATE FUNCTION)
• SQL Server SMP to APS (SQL Server MPP) migration utility
• Bulk load / BCP through SQL Server command-line tools
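A minimal sketch of a scalar UDF of the kind this update enables; the function name and logic are illustrative:

-- Scalar UDFs arrive with CREATE FUNCTION support in this update.
CREATE FUNCTION dbo.ufn_Margin (@Revenue money, @Cost money)
RETURNS money
AS
BEGIN
    RETURN CASE WHEN @Revenue = 0 THEN 0
                ELSE (@Revenue - @Cost) / @Revenue END;
END;

-- Then usable inline in distributed queries, e.g.:
-- SELECT dbo.ufn_Margin(DollarsSold, DollarsCost) FROM dbo.SalesFact;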
Appliance hardware:
• OEM hardware refresh (HP Gen9): HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors and 256 GB (16x16 GB) 2133 MHz memory
• HP 5900 series switches (HA improvements)
Symmetry between DW on-premises and Azure
Editor's Notes
Key goal of slide: To convey what every IT person knows: the data warehouse and what it’s for. Then we set up the Gartner quote to say that there is a tipping point. End the slide with a question: Why is it at a tipping point?
Slide talk track:
What is the “traditional” data warehouse?
IT professionals know this well. A data warehouse, or an enterprise data warehouse, is a database designed specifically for data analysis. It is the single source of truth, the central repository for all data in the company. This means disparate data coming from your transactional systems and your ERP, CRM, or line-of-business applications is extracted, transformed, cleansed, and put into the warehouse. It was built so that people accessing the warehouse with BI tools see data that has been provisioned by IT and represents accurate, company-sanctioned data.
However, the traditional data warehouse is reaching an inflection point. Gartner, in its analysis of the state of data warehousing, noted that it is reaching the most significant tipping point since its inception. The question is why? What is going on?
Key goal of slide: To convey that the traditional data warehouse is going to break in one of four different ways. These ways should also not be a surprise to the IT professionals. At the end of the slide, IT should be asking, what can I do to prevent my warehouse from breaking?
Slide talk track:
There are many reasons why data warehouses are at a tipping point where something needs to change.
The first trend that will break my traditional data warehouse is data growth. Data volumes are expected to grow 10X over the next five years and traditional data warehouses cannot keep up with this explosion of data.
In addition to growing data, end users expect to get query results back in near real time. End users are no longer apt to wait minutes to hours for their results, which is something traditional data warehouses cannot deliver. They also want real-time data, not dated data pulled in during a maintenance window each night.
The third trend is new types of data captured that are “non-relational.” 85% of data growth is coming from “non-relational” data in the form of things like web logs, sensor data, social sentiment and devices. You’ve probably heard the term “Big Data” and “Hadoop” quite a bit. This is where these technologies come into play. More on that later….
The final trend is cloud-born data. This is data that might be coming from infrastructure that IT is starting to host in the cloud (e.g., CRM, ERP) or that is not stored by any corporate-owned system at all. How do you incorporate both on-premises and cloud data into your data warehouse? This is the last trend that is breaking the traditional data warehouse.
Key goal of slide: To convey that the modern data warehouse is something that the traditional data warehouse must evolve to. To have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layer).
Slide talk track:
To encompass these four trends, we need to evolve our traditional data warehouse to ensure that it does not break. It needs to become the “modern data warehouse.” What is the “modern data warehouse?” This is the new warehouse that is able to excel with these new trends and can be your warehouse now and into the future.
The modern data warehouse has the ability to:
Handle all types of data. Whether it be your structured, relational data sources or your non-relational data sources, the Modern data warehouse will incorporate Hadoop. It can handle real-time data by using complex event processor technologies.
Provide a way to enrich your data with Extract, Transform, Load (ETL) capabilities as well as Master Data Management (MDM) and data quality
Provide a way for any BI tool or query mechanism to interface with all these different types of data with a single query model that leverages a single query language that users already know (example: SQL).
Questions drive BI, Analytics drive questions
Top: solution choice, Bottom: problem if do
Key goal of slide: To convey the limitations of current modern data warehouse options in the market.
Slide talk track:
Organizations now face the challenge of turning to two platforms for managing their data: relational database management systems (RDBMS) for traditional data, and Apache Hadoop, the most widely used open-source Big Data platform, for large, non-relational data.
Many brand-new tier-one appliances are expensive. Major vendors offer tier-one RDBMS appliances. However, many of these come with a high price tag, averaging millions of dollars, and in-company politics may result in long struggles to approve and implement them. Further, most of these appliances focus on point solutions rather than general-purpose ones and do not include a Hadoop solution, requiring a separate, additional appliance and ecosystem.
Hadoop solutions are complex. Vendors can provide a Hadoop solution as their own distribution of Hadoop or as an appliance that comes pre-installed with Hadoop. The problem is that the Hadoop ecosystem requires a significant training investment, and a major effort is needed to integrate it. There is a steep learning curve and ongoing operational cost when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, and HBase rather than T-SQL and a standard RDBMS design. The result is often increased cost at a time when IT is expected to streamline.
BI tools are unfamiliar. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints of the complexity of the tools and the cost of the solution. Users want tools they already know and can consume, but no vendor can deliver on all the solutions you need at a reasonable cost or in a natively-integrated manner.
Troubleshooting, support, and maintenance. Keeping up with configuration changes, support, maintenance, and troubleshooting is not trivial.
Today’s world of data is changing rapidly, and organizations need a modern data warehouse to adapt successfully to these changes. However, companies want the smoothest path to this transformation- a path where costs, downtime, and training are minimal, and where performance and accessibility to data insights are vastly improved.
Key goal of slide: To convey that the major pillars of the Analytics Platform System with key points.
To help organizations make a simple, smooth, and seamless transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS): the only no-compromise modern data warehouse solution that brings both Hadoop and the RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry.
Enterprise-ready Big Data: Microsoft APS combines Microsoft’s industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft’s Hadoop distribution, HDInsight, for non-relational data, to offer an all-in-one Big Data analytics appliance.
Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS.
Your Modern Data Warehouse in One Turnkey Appliance
APS integrates PDW and HDInsight to operate seamlessly together in a single appliance
Integrated Querying across All Data Types Using T-SQL
PolyBase allows Hadoop data to be queried using rich, full-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training.
Enterprise-Ready Hadoop
HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center
Big Data Insights to Any User
Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel
Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both the RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time and rapid insights requirements.
Scale-Out to accommodate your Growing Data
APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out
Remove DW bottlenecks with MPP SQL Server
Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.
Real-Time Performance with In-Memory
Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore
Concurrency that Supports High Adoption
Scales with simultaneous user access. APS has high concurrency, allowing for multiple workloads.
Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
APS Provides the Industry’s Lowest DW Price/TB
Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces
Save up to 70% of APS storage with up to 15x compression via updateable in-memory columnstore
Value through Single Appliance Solution
Reduce hardware footprint by having PDW and HDInsight within a single appliance
Remove the need for costly integration efforts
Value through Flexible Hardware Options
Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
The Analytics Platform System is a pre-built appliance that ships to your door. As an appliance, all of the hardware has been pre-built: servers, storage arrays, switches, power, racks, and more. Also, all the software has been installed, configured, and tuned.
Customers are delivered a fully packaged appliance solution that just works. All they have to do is plug the appliance in and start integrating their specific data into the solution.
KEY POINT
Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TALK TRACK
We have a flexible choice of hardware vendors – there’s no lock-in to hardware that may not fit your exact needs and may also require unnecessarily expensive hardware due to lack of choice.
Operating a Big Data analytics platform can be as simple as this. Avoid the proprietary hardware lock-ins others try to sell you, and rely on basic industry-standard components instead. The Microsoft Analytics Platform System gives customers the flexibility to choose their preferred hardware from Dell, HP, or Quanta, and each hardware choice has been designed, engineered, and tuned to perform optimally.
8 tables (8 filegroups, since there is 1 filegroup per table). Each filegroup is made up of 2 physical files. Each scale unit has two compute nodes, so 16 filegroups and therefore 32 files. Each unit has 32 CPU cores, so there is 1 core for each file.
You want high cardinality for the distribution key.
PDW distributes a single large logical table across 8 physical tables on each server.
The distribution is performed by selecting a column in each table and applying a hash function to it.
Partitioning 2 years by day instead = 2,568,493 rows per partition.
40 servers * 8 tables = 320 tables.
This horizontal partitioning breaks the table up into 8 distributions per compute node. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in APS. There are 8 internal disks per compute node.
1 TB drives: 15 TB uncompressed per unit (2 nodes), 60 TB uncompressed per rack (4 units, 8 nodes), 420 TB uncompressed for 7 racks (28 units, 56 nodes).
3 TB drives: 45 TB uncompressed per unit (2 nodes), 180 TB uncompressed per rack (4 units, 8 nodes), 1,260 TB uncompressed for 7 racks (28 units, 56 nodes). [See slide 125.]
tempdb, log, and the overhead of formatting the drives, Storage Spaces, etc. have already been subtracted (about 47%) from the 70 1 TB drives (4 hot spares, 2 for fabric storage, 32 for RAID 1, so 32 drives with unique data, giving 32 TB per scale unit). That leaves 15 TB of usable space on a 1/4 rack; apply a 5:1 compression ratio and you get 75 TB.
HP ProLiant DL360p Gen8 server, 256 GB RAM, 1U. Each server has 2 processors (E5-2690 “Sandy Bridge”, 2.90 GHz, 20 MB cache) with 8 cores each, so 16 cores per server. Sixteen (16) HP 16 GB (2R x 4) PC3-12800R (DDR3-1333) memory modules. Two (2) internal HP 600 GB 6G SAS 10K 2.5in drives. Paired with 1 HP D6000 high-density storage enclosure (70 HDDs (7.2K) of either 1, 2, or 3 TB capacity) connected to each server through an H221 SAS HBA, 5U, 6 Gb/s.
I usually use the word “conservative” when I’m talking about a 5:1 ratio. I also generally mention that most others in the industry use the same number.
Key goal of slide: Communicate what Big Data is.
Slide talk track:
ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private. More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.
Key goal of slide: Communicate what Hadoop is.
Slide talk track:
Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source solution framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts:
HDFS: the Hadoop Distributed File System is Hadoop’s file system, which stores large files (from gigabytes to terabytes) across multiple machines.
MapReduce: a programming model that performs filtering, sorting, and other data retrieval commands across a parallel, distributed algorithm.
Other parts of Hadoop include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, all parts of the Hadoop ecosystem that perform other, supplementary functions.
Key goal of slide: Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into an existing data warehouse.
Slide talk track:
Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including the steep learning curve of MapReduce and other Hadoop ecosystem tools, and the cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models. Many Hadoop solutions do not integrate into enterprise or other data warehouse systems, creating complexity and cost and slowing time to insights. Some Hadoop solutions feature vendor lock-in, creating long-term obligations. Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.
Key goal of slide: Communicate what HDInsight is.
Slide talk track:
HDInsight is an enterprise-ready, Hadoop-based distribution from Microsoft that brings a 100% Apache Hadoop solution to the data warehouse. APS gives customers Hadoop with the simplicity of a single appliance, and Microsoft integrates Hadoop data processing directly into the architecture of the appliance for optimum performance. Each HDInsight node has “shared nothing” access to CPU, memory, and storage. HDInsight for APS is the most enterprise-ready Hadoop distribution in the market, offering enterprise-class security, scalability, and manageability. Thanks to a dedicated secure node, HDInsight helps you secure your Hadoop cluster. HDInsight also simplifies management through System Center, and organizations can give multiple users simultaneous access to HDInsight within the appliance via Active Directory.
This diagram illustrates the basic layout of the direct-to-fabric Hadoop region alongside a data warehouse region, designed for the APS appliance and Windows Azure. Each region provides a boundary for workload, security, metering, and servicing.
HDInsight is a Hadoop region that sits over the fabric of the appliance alongside the PDW region for processing. Both regions take advantage of PolyBase as a shared query and processing model, which results in exceptional performance improvements across every node. Based on the Hortonworks 1.0 HDFS, the new HDI (HDInsight) region within APS is a dedicated Hadoop region that sits directly on top of the fabric layer of the appliance to share metered resources with the APS engine and process Hadoop cluster data. In some respects this transforms APS into a concurrent relational and Hadoop engine, resulting in much better performance. An appliance can be configured to support relational queries only (excluding the HDI region), to provide a Hadoop-only node, or to support both relational and Hadoop workloads from a single appliance. In addition, HDInsight enables the processing of Hadoop data in place, without the need for expensive ETL (extract, transform, and load). By taking advantage of Azure Storage Vault blobs, HDInsight can even extend the storage of the traditional data warehouse into the cloud.
Technically, adding one or more scale units of HDI to an all-APS rack is “add region,” which is supported. Adding one or more scale units of HDI to a rack that already contains HDI is “add capacity/unit” and is not supported for AU1.
Key goal of slide: PolyBase is available only within the Microsoft Analytics Platform System.
Slide talk track:
PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source, then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
No difficult learning curve: standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query.
Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution for the data warehouse.
The ability to query all of your company’s data, independent of where it resides and what format it is stored in, and in a performant way, is crucial in today’s data-centric world of massive, increasing data volume. Today, with AU1, one can query various Hadoop distributions plus data stored in Azure. For example, with one single T-SQL statement a user can query over data stored in multiple HDP 2.0 clusters, combine it with data in PDW, and combine it with data stored in Azure. No one in the industry (as far as I’m aware) can do this in such a simple fashion. Bringing all Microsoft assets together, on-premises and specifically through our Azure play, including various services that will be brought online in the future, we can clearly differentiate through our unique and complete end-to-end data management story. No doubt there are several pieces missing in our ‘Poly’ vision, including supporting other data stores, enabling push-down computation for our cloud story, more user-definable options language-wise, better automation/policies, and many more ideas we’d like to go after in the weeks and months ahead.
HDInsight benefits: cheap, quick to procure.
Key goal of slide: Highlight the four main use cases for PolyBase.
Slide talk track:
There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop.
PolyBase leverages the APS MPP architecture, along with optimizations like push-down computation, to query data using Transact-SQL faster than other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster.
Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
Big Data adds value to the business when it is accessible to BI users with tools that are easy to use and consume for IT and business users alike. While some Hadoop solutions provide BI tools, or require customers to find third-party BI solutions, these often result in a low adoption rate due to learning curves. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints about the complexity of the tools and the cost of the solution. The BI solution must be provided to users in tools they already know and can consume.
APS is the only data warehouse and Hadoop solution that has native end-to-end Microsoft BI integration with PolyBase, allowing users to create new insights themselves using tools they already know; every Microsoft BI client (SSAS, SSRS, PowerPivot, and Power View) has native integration with APS and ubiquitous connectivity across the entire SQL Server ecosystem. With native BI integration, Microsoft is unique in offering an end-to-end Big Data solution where there are no barriers in the journey from acquiring raw data of all types to displaying high-value insights to all users. By providing the customer with an HDInsight region in APS, with PolyBase for querying and joining any type of data in T-SQL, and by democratizing access to data insight through familiar BI tools, Microsoft is prepared to provide Big Data insights to any user.
Today, if you are not using an MPP scale-out appliance, your data warehouse is most likely built on the traditional scale-up SMP architecture and organized as row stores. A scale-up solution runs queries sequentially on a shared-everything architecture. This essentially means that everything is processed on a single box that shares memory, disk, I/O operations, and more. To get more scale in a scale-up solution, you need to acquire a more powerful hardware box every time; you cannot simply add more hardware to the existing rack. A scale-up solution also has diminishing returns after a certain scale.
A rowstore stores data in traditional tables as rows. The values comprising one row are stored contiguously on a page. Rowstores are sometimes not optimal for many queries issued to the data warehouse, because a query returns the entire row of data, including fields that might not be needed as part of the query.
The combination of scale-up SMP and rowstores is a common limitation of existing warehouses that affects performance.
Key goal of slide: Communicate that the Microsoft modern data warehouse can scale out to petabytes of relational data.
Slide talk track:
SQL Server 2012 APS is a scale-out, Massively Parallel Processing (MPP) architecture that represents the most powerful distributed computing and scale. This type of technology powers supercomputers to achieve raw computing horsepower. As more scale is needed, more resources can be added to scale out to the largest data warehousing projects. APS uses a shared-nothing architecture where there are multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously over each physical node. The benefit is the highest performance at scale through parallel execution. You need only add new resources to continually scale out this implementation.
This means that if you also have high concurrency and complex queries at scale, APS can handle these queries with ease. It also means that APS can be optimized for “mixed workload” and “near real-time” data analysis. Enjoy faster data loading at more than two terabytes per hour.
Other benefits of scale-out technologies:
• Start small and scale out to petabytes of data
• Optimized for “mixed workload” and “near real-time” data analysis
• Support for high concurrency
• Query while you load
• No hardware bottlenecks
• No “forklifting” when you want to scale your system
• Scale not only for data size but for faster queries
Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TODO: for parallel query execution, explain the difference from SMP.
Slide talk track:
The biggest issue with traditional data warehouses is that data is stored in rows. The values comprising one row are stored contiguously on a page. Rowstores are not optimal for many queries issued to the data warehouse, because a query returns the entire row of data, including fields that might not be needed as part of the query.
By changing the primary storage engine to a new, updateable version of the in-memory columnstore, data is grouped and stored one column at a time. The benefits are as follows:
• Only the columns needed must be read. Therefore, less data is read from disk to memory and later moved from memory to processor cache.
• Columns are heavily compressed, which reduces the number of bytes that must be read and moved.
• Most queries do not touch all columns of the table, so many columns will never be brought into memory. This, combined with excellent compression, improves buffer pool usage, which reduces total I/O.
The result is massive compression (sometimes as much as 10x), as well as massive performance gains (as much as 100x). Use of columnstore also leverages your existing hardware instead of requiring you to purchase a new appliance.
New in SQL Server 2012 PDW/APS and SQL Server 2014: the updateable and clustered columnstore. Updates and direct bulk load are fully supported, which simplifies and accelerates data loading and enables real-time data warehousing and trickle loading. Using columnstore can also save roughly 70% of overall storage space if you choose to eliminate the rowstore copy of the data entirely.
Key goal of slide: Explain the limitations of the serial-processing SMP architecture compared to high-concurrency MPP.
• High-performance ad hoc analytic queries
• Pull insights simultaneously throughout the day
• Run multiple types of queries simultaneously
• Run multiple types of workloads together with no tuning required
• High concurrency means high availability, which means higher adoption
Slide talk track:
With the explosion of data and the growth of end users demanding real-time insights, data warehouses are growing not only in resources but also in the number of users frequently accessing the data warehouse. A modern data warehouse needs to be able both to scale out to return query results quickly and to run mixed workloads all at the same time.
Mixed workloads refer to concurrency. Under concurrency, multiple types of queries are submitted, along with data loads and ELT processing. Under mixed workload scenarios, which organizations are certain to face, APS runs concurrent queries with little or no tuning. Organizations no longer have to worry about the types of workloads being run at any given time, and Microsoft APS can handle many users pulling insights simultaneously throughout the day.
EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza.
Key goal of slide: Use an interesting story to show how the new modern data warehouse delivers optimal value.
Slide talk track:
Value through software innovation / hardware commoditization: More than just a converged system, APS has reshaped the very hardware specifications required, through software innovations, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
Through Storage Spaces, APS has the performance, reliability, and scale for storage built into the software, allowing it to replace the SAN with a more economical high-density disk option. This results in large capacity at low cost with no reduction in performance. Hyper-V virtualization and hardware design minimize the hardware footprint and cost of the appliance, enabling high availability as simply as possible. Microsoft lowers the cost by reducing the hardware footprint through virtualization, providing Storage Spaces to replace expensive SAN storage, and compressing up to 15x to lower storage usage. These features give APS the lowest relational data warehouse price/terabyte of any company by a significant margin (~2x lower than market). The overall market’s comparable price/terabyte ranges from $8-13K/TB. For example, Oracle announced Exadata in a 1/8th-rack form factor that costs $200K. However, this is only the hardware cost and does not include software prices, which can cost significantly more: hundreds of thousands to a million dollars. Even accounting for Oracle’s 10x compression, APS has a price/terabyte that is about half Oracle’s list price for their normal drive sizes (non-high-capacity).
IBM PureData pricing: < $500,000 for a quarter rack (8 TB uncompressed), http://www.theregister.co.uk/2012/10/10/ibm_puredata_database_appliances/, at 4x compression (= $12-15K/TB).
Oracle Exadata pricing: HW pricing ($1.1M), http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf; SW pricing ($7.2M), http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf, at 100 TB uncompressed and 10x compression (= $8K/TB).
EMC Greenplum pricing: $1,000,000 for a half rack (18 TB uncompressed), http://www.informationweek.com/software/information-management/emc-intros-backup-savvy-greenplum-applia/227701321, at 4x compression (= $13.8K/TB).
Pricing analysis was done on the last-known publicly accessible information available, and represents the current view of Microsoft Corporation as of the date of this presentation. Because companies respond to changing market conditions, it should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided outside of the sources cited or after the date of this presentation. Source: Value Prism Consulting, “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value”; website: http://www.valueprism.com/resources/resources/ResourceDetails.aspx?ID=100
Windows Server 2012 and Windows Azure Virtual Machines offer full virtualization services for both on-premises and on-demand installations.
General details:
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric or workload runs in Hyper-V virtual machines
• The fabric virtual machine, MAD01, and CTL share one server, giving lower overhead costs, especially for small topologies
• The APS agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
• DWConfig and the Admin Console continue to exist, with minor extensions to expose host-level information
• Windows Storage Spaces and Azure Storage blobs enable use of lower-cost DAS (JBODs)
APS workload details:
• SQL Server 2012 Enterprise Edition (APS build)
• Control node and compute nodes for the APS workload
Storage details:
• More files per filegroup
• Uses a larger number of spindles in parallel
Key goal of slide: APS was built to scale to handle the highest data requirements and the newest data types stored in Hadoop, and to deliver the performance that meets today’s near real-time requirements.
Slide talk track:
A modern data warehouse is progressive, meeting broad needs and requirements:
• Hadoop integrates and operates seamlessly with your relational data warehouses
• Data is easily queried by SQL users without additional skills or training
• Enterprise-ready, meaning it is secure and easily managed by IT
• Insights accessible to everyone
The Microsoft Analytics Platform System (APS) is the only no-compromise modern data warehouse solution that brings both Hadoop and the RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry. Microsoft APS combines Microsoft’s industry-leading RDBMS platform with Microsoft’s Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and non-relational data is PolyBase, Microsoft’s integrated query tool available only in APS.
• Data capacity: variable from smallest (15 terabytes) to largest (6 petabytes) at 5:1 compression (1.2 petabytes uncompressed); from a 1/4 rack up to 7 racks
• Data loading speed: ideally 175 GB/hour per node (8 nodes would give over 1 TB/hour); 250 GB/hr has been seen; 10-20x faster
• Data compression: 3x-15x, but 5x is a conservative number; unique compression because of distribution across compute nodes
• Query performance: 10x-100x, with a reasonably linear increase with more racks