Jan Pospíšil
janpos@microsoft.com
@pospanet
Prague data management meetup 2018-03-27
Considering Data Types
• Unstructured – Audio, video, images. Meaningless without adding some structure.
• Semi-structured – JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure.
• Structured – CSV, columnar storage (Parquet, ORC). Strict data model structure.
Relational databases (RDBMS) work with structured data. Non-relational databases (NoSQL) work with semi-structured data.
Relational data and non-relational data are data models, describing how data is organized. Structured, semi-structured, and unstructured data are data types.
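The distinction above can be seen in a few lines of Python (the file contents and field names are invented for illustration): structured CSV forces every row into one schema, while semi-structured JSON records may vary in shape.

```python
import csv
import io
import json

# Structured: every CSV row fits one strict schema (same columns, same order).
csv_text = "id,name\n1,Anna\n2,Petr\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON records can differ in shape; the data model is flexible.
events = [
    json.loads('{"device": "sensor-1", "temp": 21.5}'),
    json.loads('{"device": "sensor-2", "temp": 19.0, "battery": 0.83}'),
]

print(rows[1]["name"])          # structured access by a fixed column
print("battery" in events[0])   # fields vary record to record
```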
Big Data = All Data!
What is Big Data?
• Variety: It can be structured, semi-structured, or unstructured
• Velocity: It can be streaming, near real-time or batch
• Volume: It can be 1GB or 1PB
• Big data is the new currency
The Data Management Platform for Analytics
• Any BI tool: Dashboards | Reporting | Mobile BI | Cubes
• Advanced Analytics: Machine Learning | Stream analytics | Cognitive | AI
• Any language: .NET | Java | R | Python | Ruby | PHP | Scala
• Big Data processing (non-relational data) and data warehousing (relational data)
• Data virtualization
• Sources: OLTP | ERP | CRM | LOB | Social media | Devices | Web | Media
• On-premises and cloud
SMP vs MPP
SMP - Symmetric Multiprocessing
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
MPP - Massively Parallel Processing
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
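A toy sketch of the MPP shared-nothing pattern described above (node count, row shape, and function names are invented; a real MPP engine distributes across machines, not a dict): rows are hash-partitioned by a distribution key, each "node" aggregates only its own slice, and the partial results are merged.

```python
from collections import defaultdict

NODES = 4  # invented node count for illustration

def partition(rows, key):
    """Distribute rows across NODES by hashing the distribution key."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key]) % NODES].append(row)
    return shards

def mpp_sum(rows, key, value):
    shards = partition(rows, key)
    # Each "node" computes a partial aggregate over only its own data...
    partials = [sum(r[value] for r in shard) for shard in shards.values()]
    # ...and a control node merges the partial results.
    return sum(partials)

sales = [{"cust": f"c{i}", "amount": i} for i in range(10)]
print(mpp_sum(sales, "cust", "amount"))  # same answer as a single-node sum: 45
```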
Big Data Solutions Decision Tree
Thanks to Ivan Kosyakov: https://biz-excellence.com/2016/08/30/big-data-dt/
Velocity

| Volume Per Day | Real-world Transactions Per Day | Real-world Transactions Per Second | Relational DB | Document Store | Key Value or Wide Column |
|---|---|---|---|---|---|
| 8 GB | 8.64B | 100,000 | As Is | | |
| 86 GB | 86.4B | 1M | Tuned* | As Is | |
| 432 GB | 432B | 5M | Appliance | Tuned* | As Is |
| 864 GB | 864B | 10M | Clustered Appliance | Clustered Servers | Tuned* |
| 8,640 GB | 8.64T | 100M | Many Clustered Servers | Clustered Servers | |
| 43,200 GB | 43.2T | 500M | Many Clustered Servers | | |

* Tuned means tuning the model, queries, and/or hardware (more CPU, RAM, and Flash)
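The velocity and per-day columns in the table above are related simply by the number of seconds in a day; a quick sanity check:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def tx_per_day(tx_per_second):
    """Sustained transactions/second -> transactions/day, as in the table rows."""
    return tx_per_second * SECONDS_PER_DAY

print(tx_per_day(100_000))      # 8640000000 -> the "8.64B" row
print(tx_per_day(500_000_000))  # 43200000000000 -> the "43.2T" row
```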
Microsoft data platform solutions
• SQL Server 2016 – RDBMS. Earned the top spot in Gartner's Operational Database Magic Quadrant. JSON support. Linux TBD. https://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/
• SQL Database – RDBMS/DBaaS. Cloud-based service that is provisioned and scaled quickly. Has built-in high availability and disaster recovery. JSON support. https://azure.microsoft.com/en-us/services/sql-database/
• SQL Data Warehouse – MPP RDBMS/DBaaS. Cloud-based service that handles relational big data. Provision and scale quickly. Can pause the service to reduce cost. https://azure.microsoft.com/en-us/services/sql-data-warehouse/
• Analytics Platform System (APS) – MPP RDBMS. Big data analytics appliance for high performance and seamless integration of all your data. https://www.microsoft.com/en-us/server-cloud/products/analytics-platform-system/
• Azure Data Lake Store – Hadoop storage. Removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. https://azure.microsoft.com/en-us/services/data-lake-store/
• Azure Data Lake Analytics – On-demand analytics job service / Big-Data-as-a-service. Cloud-based service that dynamically provisions resources so you can run queries on exabytes of data. Includes U-SQL, a new big data query language. https://azure.microsoft.com/en-us/services/data-lake-analytics/
• HDInsight – PaaS Hadoop compute / Hadoop-clusters-as-a-service. A managed Apache Hadoop, Spark, R, HBase, Kafka, and Storm cloud service made easy. https://azure.microsoft.com/en-us/services/hdinsight/
• Azure Cosmos DB – PaaS NoSQL: document store. Get your apps up and running in hours with a fully managed NoSQL database service that indexes, stores, and queries data using familiar SQL syntax. https://azure.microsoft.com/en-us/services/documentdb/
• Azure Table Storage – PaaS NoSQL: key-value store. Store large amounts of semi-structured data in the cloud. https://azure.microsoft.com/en-us/services/storage/tables/
Microsoft Big Data Portfolio
[Quadrant diagram – axes: scale up vs. sequential scale out + across; relational vs. non-relational; on-premises vs. cloud. Products: SQL Server 2016, SQL Server 2016 Fast Track, SQL Server Stretch, Analytics Platform System, Azure SQL Database, Azure SQL DW, ADLS & ADLA, Cosmos DB, HDInsight, Hadoop. Overlays: business intelligence, machine learning analytics, insights.]
Microsoft has solutions covering and connecting all four quadrants – that's why SQL Server is one of the most utilized databases in the world.
Azure SQL Data Warehouse
A relational data-warehouse-as-a-service, fully managed by Microsoft.
The industry's first elastic cloud data warehouse with enterprise-grade capabilities.
Supports your smallest to your largest data storage needs while handling queries up to 100x faster.
Azure Data Lake Store
A hyper-scale repository for Big Data analytics workloads
• Hadoop File System (HDFS) for the cloud
• No limits to scale
• Store any data in its native format
• Enterprise-grade access control, encryption at rest
• Optimized for analytic workload performance
Data lake is the center of a big data solution
A storage repository, usually Hadoop, that holds a vast amount of raw data in its native
format until it is needed.
• Inexpensively store unlimited data
• Collect all data “just in case”
• Store data with no modeling – “Schema on read”
• Complements EDW
• Frees up expensive EDW resources
• Quick user access to data
• ETL Hadoop tools
• Easily scalable
• With Hadoop, high availability built in
Data Analysis Paradigm Shift
Data Lake Transformation (ELT not ETL)
New Approaches
• All data sources are considered
• Leverages the power of on-prem technologies and the cloud for storage and capture
• Native formats, streaming data, big data
• Extract and load, no/minimal transform
• Storage of data in near-native format
• Orchestration becomes possible
• Streaming data accommodation becomes possible
• Refineries transform data on read
• Produce curated data sets to integrate with traditional warehouses
• Users discover published data sets/services using familiar tools
[Pipeline diagram: DATA SOURCES (OLTP, ERP, CRM, LOB, non-relational data, future data sources) → EXTRACT AND LOAD → DATA LAKE → DATA REFINERY PROCESS (transform on read): transform relevant data into data sets → DATA WAREHOUSE (star schemas, views, other read-optimized structures) → BI AND ANALYTICS: discover and consume predictive analytics, data sets, and other reports]
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This solves the two biggest reasons why many EDW projects fail:
• Too much time spent modeling when you don’t know all of the questions your data needs to answer
• Wasted time spent on ETL where the net effect is a star schema that doesn’t actually show value
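A minimal sketch of the new way, assuming hypothetical web-log events: raw JSON is ingested untouched, and a schema is imposed only when the data is read.

```python
import json

# Ingest first: raw events land untouched, no up-front modeling.
raw_lines = [
    '{"user": "a", "page": "/home", "ms": 12}',
    '{"user": "b", "page": "/cart"}',   # a missing field is fine at ingest time
]

def read_with_schema(lines):
    """Schema on read: structure is imposed only when the data is consumed."""
    for line in lines:
        event = json.loads(line)
        yield {"user": event["user"],
               "page": event["page"],
               "ms": event.get("ms", 0)}  # schema decision deferred to read time

views = list(read_with_schema(raw_lines))
print(views[1])  # second event acquires the full shape only on read
```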
Data Lake layers
• Raw data layer – Raw events are stored for historical reference. Also called the staging layer or landing area.
• Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types, and content (e.g. strings). Also called the conformed layer.
• Application data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a DW application, an advanced analysis process). This layer also goes by many other names: workspace, trusted, gold, secure, production-ready, governed.
• Sandbox data layer – Optional layer to "play" in. Also called the exploration layer or data science workspace.
Still need data governance so your data lake does not turn into a data swamp!
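One possible folder convention for these layers, sketched with Python's pathlib (the layer and dataset names are illustrative, not a fixed standard): the raw layer stays immutable, and cleansing writes a new file into the next layer.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical layer names for illustration only.
lake = Path(tempfile.mkdtemp())
for layer in ("raw", "cleansed", "application", "sandbox"):
    (lake / layer / "sales" / "2018" / "03").mkdir(parents=True)

# Raw layer stays immutable; a raw event lands as-is.
raw_file = lake / "raw" / "sales" / "2018" / "03" / "events.json"
raw_file.write_text('{"amount": " 42 "}\n')

# Cleansing standardizes types/content and writes to the cleansed layer.
record = json.loads(raw_file.read_text())
cleansed = {"amount": int(record["amount"])}
out = lake / "cleansed" / "sales" / "2018" / "03" / "events.json"
out.write_text(json.dumps(cleansed))
print(out.read_text())
```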
Azure HDInsight
Hadoop and Spark as a Service on Azure
• Fully managed Hadoop and Spark for the cloud
• 100% open-source Hortonworks Data Platform
• Clusters up and running in minutes
• Managed, monitored, and supported by Microsoft with the industry's best SLA
• Familiar BI tools for analysis, or open-source notebooks for interactive data science
• 63% lower TCO than deploying your own Hadoop on-premises*
*IDC study "The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight"
Azure Data Lake Analytics
A new distributed analytics service
• Distributed analytics service built on Apache YARN
• Elastic scale per query lets users focus on business goals, not configuring hardware
• Includes U-SQL, a language that unifies the benefits of SQL with the expressive power of C#
• Integrates with Visual Studio to develop, debug, and tune code faster
• Federated query across Azure data sources
• Enterprise-grade role-based access control
Query data where it lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
• Avoid moving large amounts of data across the network between stores (federated query / logical data warehouse)
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
• Push SQL expressions to remote SQL sources (filters, joins)
• SELECT * FROM EXTERNAL MyDataSource EXECUTE @"Select CustName from Customers WHERE ID=1"; (remote query)
• SELECT CustName FROM EXTERNAL MyDataSource LOCATION "dbo.Customers" WHERE ID=1 (federated query)
[Diagram: U-SQL queries issued from Azure Data Lake Analytics run against Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB, Azure SQL Data Warehouse, and Azure Data Lake Storage]
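A toy Python sketch of the pushdown idea behind federated query (the stores, rows, and function names are invented; ADLA actually pushes the SQL expression to the remote engine): the filter is evaluated at each source, so only matching rows cross the network and nothing is copied centrally.

```python
# Two invented "remote stores" standing in for separate Azure data sources.
store_a = [{"id": 1, "cust": "Anna"}, {"id": 2, "cust": "Petr"}]
store_b = [{"id": 1, "cust": "Jana"}, {"id": 3, "cust": "Eva"}]

def remote_query(store, predicate):
    """Stand-in for pushing a WHERE clause down to the remote source."""
    return [row for row in store if predicate(row)]

# One logical view over both stores; the filter runs at each source.
result = [row["cust"]
          for store in (store_a, store_b)
          for row in remote_query(store, lambda r: r["id"] == 1)]
print(result)  # ['Anna', 'Jana']
```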
PolyBase
Query relational and non-relational data with T-SQL
PolyBase extends T-SQL onto external data via views
PolyBase use cases
ADL Store vs Blob Store

| | Azure Data Lake Store | Azure Blob Storage |
|---|---|---|
| Purpose | Optimized for big data analytics | General-purpose bulk storage |
| Use Cases | Batch, interactive, streaming | App backend, backup data, media storage for streaming |
| Units of Storage | Accounts / folders / files | Accounts / containers / blobs |
| Structure | Hierarchical file system | Flat namespace |
| WebHDFS | Implements WebHDFS | No (WASB) |
| Security | AD | SAS keys |
| Storage | Auto-sharded / files chunked | Manually manage expansion / files intact |
| Size Limits | No limits on account size, file size, or number of files | 500 TB account, 4.75 TB file |
| Service State | Generally available | Generally available |
| Billing | Pay for data stored and for I/O | Pay for data stored and for I/O |
| Region Availability | Two US regions (East, Central) & North Europe (at the moment) | All Azure regions |
Want Hadoop?
• If "exactly the same as on-prem" is mandatory → Azure Marketplace (IaaS): need all workloads exactly like on-premises; need 100% Hortonworks/Cloudera/MapR
• If there is no strong opinion and you need the core projects or interactive/streaming → Azure HDInsight: most Hadoop workloads; fully managed by Microsoft; sell HDI + ADLS; stickier to Microsoft than VMs; can do interactive (Spark) and streaming (Storm/Spark)
• If batch is OK → Azure Data Lake Analytics: easiest experience for admins (no sense of clusters, instant scale per job); easiest experience for developers (Visual Studio / U-SQL, C# + SQL); sell ADLA + ADLS; batch workloads only
Always present ADLA if a .NET or Visual Studio shop.
| | Azure SQL DW | HDInsight Hive | HDInsight Spark | ADLS/ADLA | SQL Server (IaaS) |
|---|---|---|---|---|---|
| Volume | Petabytes | Petabytes | Petabytes | Petabytes | Terabytes |
| Security | Encryption, TD, Audit | ADLS / Apache Ranger | ADLS | AAD security groups (data) | Encryption, TD, Audit |
| Languages | T-SQL (subset) | HiveQL | SparkSQL, HiveQL, Scala, Java, Python, R | U-SQL | T-SQL |
| Extensibility | No | Yes, .NET/SerDe | Yes, packages | Yes, .NET | Yes, .NET CLR |
| External File Types | ORC, TXT, Parquet, RCFile | ORC, CSV, Parquet + others | Parquet, JSON, Hive + others | Many | ORC, TXT, Parquet, RCFile |
| Admin | Low-Medium | Medium-High | Medium-High | Low | High |
| Cost Model | DWU | Nodes & VM | Nodes & VM | Units/Jobs | VM |
| Schema Definition | Schema on write / PolyBase | Schema on read | Schema on read | Schema on read | Schema on write / PolyBase |
| Max DB Size | 240 TB compressed (5X = 1 PB) | Unlimited | Unlimited | Unlimited | 256 TB (64 × 4 TB drives) |
| Data Lake | Data Warehouse |
|---|---|
| Complementary to the DW | Can be sourced from the Data Lake |
| Schema-on-read | Schema-on-write |
| Physical collection of uncurated data | Data of common meaning |
| System of Insight: unknown data for experimentation / data discovery | System of Record: well-understood data for operational reporting |
| Any type of data | Limited set of data types (i.e. relational) |
| Skills are limited | Skills mostly available |
| All workloads: batch, interactive, streaming, machine learning | Optimized for interactive querying |
Roles when using both Data Lake and DW
Data Lake/Hadoop (staging and processing environment)
• Batch reporting
• Data refinement/cleaning
• ETL workloads
• Store historical data
• Sandbox for data exploration
• One-time reports
• Data scientist workloads
• Quick results
Data Warehouse/RDBMS (serving and compliance environment)
• Low latency
• High number of users
• Additional security
• Large support for tools
• Easily create reports (Self-service BI)
• A data lake is just a glorified file folder with data files in it – how many end-users can accurately create reports from it?
Lambda Architecture : Interactive Analytics Pipeline
| Layer | Description | Azure Capabilities |
|---|---|---|
| Batch Layer | Stores the master dataset; high latency; horizontally scalable. Data is appended and stored (batch view) | Azure HDInsight, Azure Blob storage |
| Speed Layer | Stream processing of data; stores limited data; dynamic computation. Processed in real time and stored for both read & write operations (real-time view) | Azure Stream Analytics, Azure HDInsight Spark |
| Serving Layer | Queries batch & real-time views and merges the results. Indexes batch views / out-of-date results | Power BI |
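The serving layer's merge of batch and real-time views can be sketched in a few lines (the keys and counts are invented, e.g. a page-view tally): the precomputed batch view is slightly stale, and the speed layer supplies the deltas since the last batch run.

```python
batch_view = {"/home": 100, "/cart": 40}   # precomputed up to the last batch run
speed_view = {"/home": 3, "/search": 1}    # events that arrived since that run

def serve(key):
    """Serving layer: batch result + real-time delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("/home"))    # 103
print(serve("/search"))  # 1
```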
Microsoft Products vs Hadoop/OSS Products

| Microsoft Product | Hadoop/Open Source Software Product |
|---|---|
| Office365/Excel | OpenOffice/Calc |
| DocumentDB | MongoDB, HBase, Cassandra |
| SQL Database | SQLite, MySQL, PostgreSQL, MariaDB |
| Azure Data Lake Analytics/YARN | None |
| Azure VM/IaaS | OpenStack |
| Blob Storage | HDFS, Ceph (Note: these are distributed file systems; Blob storage is not distributed) |
| Azure HBase | Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion |
| Event Hub | Apache Kafka |
| Azure Stream Analytics | Apache Storm, Apache Spark, Twitter Heron |
| Power BI | Apache Zeppelin, Jupyter, Airbnb Caravel, Kibana |
| HDInsight | Hortonworks (pay), Cloudera (pay), MapR (pay) |
| Azure ML | Apache Mahout, Apache Spark MLlib |
| Microsoft R Open | R |
| SQL Data Warehouse | Apache Hive, Apache Drill, Presto |
| IoT Hub | Apache NiFi |
| Azure Data Factory | Apache Falcon, Apache Oozie, Airbnb Airflow |
| Azure Data Lake Storage/WebHDFS | HDFS Ozone |
| Azure Analysis Services/SSAS | Apache Kylin, Apache Lens, AtScale (pay) |
| SQL Server Reporting Services | None |
| Hadoop Indexes | Jethro Data (pay) |
| Azure Data Catalog | Apache Atlas |
| PolyBase | Apache Drill |
| Azure Search | Apache Solr, Elasticsearch (Azure Search is built on ES) |
| Others | Apache Flink, Apache Ambari, Apache Ranger, Apache Knox |

Note: Many of the Hadoop/OSS products are available in Azure
[Diagram: events from business apps, custom apps, and sensors and devices flow into event processing (Azure Event Hubs, Kafka) and then into stream processing (Azure Stream Analytics, Spark Streaming)]
Choosing an Ingestion Technology

| | Kafka | Azure Event Hubs |
|---|---|---|
| Managed | No | Yes |
| Ordering | Yes | Yes |
| Delivery | At-least-once | At-least-once |
| Lifetime | Configurable | 1-30 days |
| Replication | Configurable within region | Yes |
| Throughput | Scales with the number of nodes | 20 throughput units |
| Parallel Clients | Yes | No |
| MapReduce | Yes | No |
| Record Size | Configurable | 256 KB |
| Cost | Low + admin | Low |
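Because both systems in the table deliver at-least-once, a consumer must tolerate redelivered events; an illustrative dedup-by-sequence-number sketch (the event shape and function name are invented):

```python
def consume(events, seen=None):
    """Process (sequence, payload) pairs idempotently under at-least-once delivery."""
    seen = set() if seen is None else seen
    out = []
    for seq, payload in events:
        if seq in seen:        # redelivery after a retry: drop the duplicate
            continue
        seen.add(seq)
        out.append(payload)
    return out

# Event 2 is delivered twice, but it is processed only once.
print(consume([(1, "a"), (2, "b"), (2, "b"), (3, "c")]))  # ['a', 'b', 'c']
```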
Choosing a Stream Processing Technology

| | Azure Stream Analytics | Storm | Spark Streaming |
|---|---|---|---|
| Managed | Yes | Yes | Yes |
| Temporal Operators | Windowed aggregates and temporal joins are supported out of the box | Temporal operators must be implemented | Temporal operators must be implemented |
| Development Experience | Interactive authoring and debugging experience through the Azure Portal on sample data | Visual Studio, etc. | Visual Studio, etc. |
| Data Encoding Formats | Requires the UTF-8 data format | Any data encoding format may be implemented via custom code | Any data encoding format may be implemented via custom code |
| Scalability | Number of Streaming Units per job. Each Streaming Unit processes up to 1 MB/s. Max of 50 units by default; call to increase the limit | Number of nodes in the HDI Storm cluster. No limit on the number of nodes (top limit defined by your Azure quota); call to increase the limit | Number of nodes in the HDI Spark cluster. No limit on the number of nodes (top limit defined by your Azure quota); call to increase the limit |
| Data Processing Limits | Users can scale the number of Streaming Units up or down to increase data processing or optimize costs. Scale up to 1 GB/s | Users can scale the cluster size up or down to meet needs | Users can scale the cluster size up or down to meet needs |
| Late Arrival and Out-of-Order Event Handling | Built-in configurable policies to reorder or drop events, or adjust event time | Users must implement logic to handle this scenario | Users must implement logic to handle this scenario |
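A tumbling-window aggregate, the kind of temporal operator Stream Analytics provides out of the box and Storm/Spark code must implement itself, can be sketched as follows (timestamps and values invented):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # illustrative window size

events = [  # (epoch_seconds, value)
    (10, 1), (45, 2),    # fall in the window starting at 0
    (70, 5), (119, 1),   # window starting at 60
    (130, 4),            # window starting at 120
]

windows = defaultdict(int)
for ts, value in events:
    # Assign each event to the start of its non-overlapping (tumbling) window.
    windows[ts - ts % WINDOW_SECONDS] += value

print(dict(windows))  # {0: 3, 60: 6, 120: 4}
```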
Jan Pospíšil
janpos@microsoft.com
@pospanet
