The document discusses Microsoft's use of a data lake approach to better leverage large amounts of data from various sources using tools like Azure Data Lake Store, Azure Data Lake Analytics, HDInsight, and Spark. It provides an overview of how Microsoft built its own data lake to handle exabytes of data from different parts of the company and to support analytics, machine learning, and real-time streaming. Common patterns for using Azure Data Lake tools for ingesting, storing, analyzing, and visualizing data are also presented.
Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
3. The Data Lake Approach
Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
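Storing data in native format and deferring the schema to analysis time is often called schema-on-read. The following is a minimal sketch of that idea, not anything shown in the deck: the sample records and the `read_with_schema` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2017-10-11T08:00:00"}',
    '{"user": "b2", "amount": "5", "extra": "ignored by this reader"}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: cast the rows that fit, skip the rest."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows stay in storage; this reader just skips them
        if not isinstance(record, dict):
            continue
        try:
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # the row does not fit this particular schema


# One consumer's view of the data; another consumer could apply a different schema
# to the same raw lines without rewriting anything in storage.
purchases = list(read_with_schema(RAW_LINES, {"user": str, "amount": float}))
```

The design point is that the raw lines are never modified: each analysis engine or consumer decides which fields it needs and how to interpret them.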
4. Microsoft’s Big Data Journey
We needed to better leverage data and analytics to do more experimentation
So, we built a Data Lake for Microsoft:
• A data lake for everyone to put their data
• Tools approachable by any developer
• Batch, Interactive, Streaming, ML
By the numbers
• Exabytes of data under management
• 100Ks of Physical Servers
• 100Ks of Batch Jobs, Millions of Interactive Queries
• Huge Streaming Pipelines
• 10K+ Developers running diverse workloads and scenarios
[Chart: Data Stored, 2010 to 2016. Labels: Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, Yammer]
5. Building a Cloud Strategy
Where are you coming from?
• Partner investments
• Infrastructure investments
• Today’s business requirements
Where are you going?
• New cloud data, new sources, new scale
• New business opportunities
Business guiding principles
• Lift and shift?
• Cloud first?
• PaaS first?
• Multi-cloud?
Continuing evolution
• Subset of existing business/customers
• Development of new business
• Azure roadmap, partner roadmap, business roadmap
• New guiding principles
6. Why Azure Data Lake?
An on-demand, real-time stream processing service with a no-limits data lake built to support massively parallel analytics
HDFS Compatible REST API
ADL Store
.NET, SQL, Python, R scaled out by U-SQL
ADL Analytics
Open Source Apache Hadoop ADL Client
HDInsight
Hive
• Performance at scale
• Optimized for analytics
• Multiple analytics engines
• Single repository sharing
7. HDFS Compatible REST API
ADL Store
Storage
Azure Data Lake Store
• Architected and built for very high throughput at scale for Big Data workloads
• No limits to file size, account size or number of files
• Single repository for sharing
• Cloud-scale distributed filesystem with file/folder ACLs and RBAC
• Encryption-at-rest by default with Azure Key Vault
• Authenticated access with Azure Active Directory integration
• The Big Data platform for Microsoft
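An HDFS-compatible REST API means that file operations follow the WebHDFS convention of a path plus an `op` query parameter. The sketch below only builds such a URL and an Azure Active Directory bearer-token header; it sends no request. The host pattern `<account>.azuredatalakestore.net`, the account name, and the file path are assumptions for illustration, and the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a store account (host pattern is an assumption)."""
    host = f"{account}.azuredatalakestore.net"  # assumed endpoint pattern
    query = urlencode({"op": op, **params})
    # quote() leaves "/" intact, so the folder structure is preserved in the URL.
    return f"https://{host}/webhdfs/v1/{quote(path.lstrip('/'))}?{query}"


def auth_headers(token):
    """AAD-issued OAuth 2.0 tokens are sent as a standard bearer header."""
    return {"Authorization": f"Bearer {token}"}


# Hypothetical account and path; GETFILESTATUS is a standard WebHDFS operation.
url = webhdfs_url("myaccount", "/landing/clicks/2017-10-11.json", "GETFILESTATUS")
headers = auth_headers("<token obtained from Azure Active Directory>")
```

Any HTTP client could issue the resulting request; obtaining the token itself is outside the scope of this sketch.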
8. HDFS Compatible REST API
ADL Store
Analytics
Storage
Azure Data Lake cloud models
Cloudera CDH
Hortonworks HDP
Qubole QDS (soon)
• Open Source Apache® ADL client for commercial and custom Hadoop
• Cloud IaaS and Hybrid
9. HDFS Compatible REST API
HDInsight
ADL Store
Hive
Analytics
Storage
Azure Data Lake cloud models
• 63% lower TCO than on-premises*
• SLA-managed, monitored and supported by Microsoft
• Fully managed Hadoop, Spark and R
• Clusters deployed in minutes
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
10. HDFS Compatible REST API
ADL Store
Analytics
Storage
Azure Data Lake cloud models
.NET, SQL, Python, R scaled out by U-SQL
ADL Analytics
• Serverless. Pay per job. Starts in seconds. Scales instantly.
• Develop massively parallel programs with simplicity
• Federated query from multiple data sources
11. Why Azure Data Lake?
An on-demand, real-time stream processing service with a no-limits data lake built to support massively parallel analytics
Performance at scale
• Optimized for high throughput I/O for streaming and batch
• Petabyte size files and trillions of objects
Optimized for analytics
• HDFS-compatible REST API
• Optimized Java, Python, .NET SDKs and R
• Tools for debugging and optimizing your Big Data programs with ease
Multiple analytics engines
• Open Source Apache ADL client for commercial Hadoop distributions and RYOH
• Fully integrated with HDInsight managed Hadoop
• Pay-per-job R, Python and U-SQL execution with ADLA
Single repository sharing
• Azure Active Directory (AAD) integration
• POSIX-style ACLs at file and folder level
• Role-based access control (RBAC)
• Encryption of data at rest with Azure Key Vault
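To make "POSIX-style ACLs at file and folder level" concrete, here is a simplified sketch of how a POSIX-style ACL is commonly evaluated: owner first, then named users, then groups (both limited by the mask), then other. It is an illustration of the general POSIX model, not the service's implementation; the ACL string, user names, and group names are hypothetical.

```python
R, W, X = 4, 2, 1  # permission bits


def parse_perms(text):
    """Turn a string such as 'r-x' into a bitmask."""
    return sum(bit for ch, bit in zip(text, (R, W, X)) if ch != "-")


def is_allowed(acl, owner, owning_group, user, groups, wanted):
    """Simplified POSIX-style check; 'wanted' is a bitmask such as R or R | X."""
    entries = {}
    for entry in acl.split(","):
        kind, name, perms = entry.split(":")
        entries[(kind, name)] = parse_perms(perms)
    mask = entries.get(("mask", ""), R | W | X)

    if user == owner:  # the owner entry is not limited by the mask
        return (entries.get(("user", ""), 0) & wanted) == wanted
    if ("user", user) in entries:  # named user, limited by the mask
        return (entries[("user", user)] & mask & wanted) == wanted

    group_perms = []
    if owning_group in groups:
        group_perms.append(entries.get(("group", ""), 0))
    group_perms += [entries[("group", g)] for g in groups if ("group", g) in entries]
    if group_perms:  # any matching group may grant access; none granting means deny
        return any((p & mask & wanted) == wanted for p in group_perms)

    return (entries.get(("other", ""), 0) & wanted) == wanted


# Hypothetical ACL on a folder owned by "bob" with owning group "analysts".
ACL = "user::rwx,user:alice:rw-,group::r-x,mask::r--,other::---"
alice_can_write = is_allowed(ACL, "bob", "analysts", "alice", [], W)  # mask removes w
```

The mask entry is what makes this model convenient for shared repositories: tightening one entry caps every named user and group at once without editing each entry.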
14. DATA ORIGINS ANALYTICS and INSIGHTS
Customer Behavior
Move via Data Factory
Data Lake
SQL DW
CONSUMPTION
Web Portals
Mobile Apps
Power BI
Data Lake Scenario
Data Science Notebooks
ETL & Analytics: Cleanup, Normalize, Basic Stats
Experimentation: A/B testing at scale. Drive changes based on actual customer behavior
Machine Learning: Do ML at scale (Customer Segmentation & Fraud Detection)
Store
Analytics
Move via Data Factory
HDInsight
SQL DB
Clickstream
DBs
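The experimentation step in this scenario (A/B testing on customer behavior) usually ends with a significance check on two conversion rates. The sketch below uses a standard two-proportion z-test as one possible way to do that; the deck does not say which test is used, and the function name and sample counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the "no difference" hypothesis
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Hypothetical clickstream counts: variant A converts 100 of 1000, variant B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
significant = p_value < 0.05
```

At data lake scale the counts would come from an aggregation job over the clickstream data; only the four totals are needed for the test itself.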
15. ON-PREMISES CLOUD
Massive Archive
On-Prem HDFS
Initial one-time import
Active Incoming Data
Incremental updates
“Landing Zone”
Data Lake Store
via Azure Data Factory
Curated Data
DW (many instances)
Data is partitioned into multiple SQL DWs (one per data consumer; several hundred consumers)
CONSUMPTION
Machine Learning: Do ML at scale (Customer Segmentation & Fraud Detection)
Web Portals
Mobile Apps
Power BI
Experimentation: A/B testing at scale. Drive changes based on actual customer behavior
Retail Scenario
Jupyter Notebooks
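The retail scenario combines an initial one-time import with later incremental updates into the landing zone. One common way to express that is a high-watermark on modification time, sketched below. This illustrates the selection logic only and is not Azure Data Factory's API; `plan_copy`, the file listing, and the timestamps are hypothetical.

```python
def plan_copy(source_files, watermark=None):
    """Pick files to copy; watermark=None means the initial one-time import."""
    pending = sorted(
        (f for f in source_files if watermark is None or f["modified"] > watermark),
        key=lambda f: f["modified"],
    )
    # Advance the watermark only when something was selected.
    new_watermark = pending[-1]["modified"] if pending else watermark
    return [f["path"] for f in pending], new_watermark


# Hypothetical listing of an on-premises archive (ISO timestamps sort correctly as text).
listing = [
    {"path": "/archive/2016.parquet", "modified": "2017-01-02T00:00:00"},
    {"path": "/archive/2017-q1.parquet", "modified": "2017-04-01T00:00:00"},
]
initial_batch, mark = plan_copy(listing)  # first run copies everything

listing.append({"path": "/incoming/2017-10-11.json", "modified": "2017-10-11T09:30:00"})
incremental_batch, mark = plan_copy(listing, mark)  # later runs copy only newer files
```

Persisting the watermark between runs is what turns the one-time import and the incremental updates into the same pipeline with different starting states.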
16. Event data
Data Lake Store
HDI, R, Jupyter
Enriching
Event data
Kafka, EventHubs
Power BI
Data Lake Analytics
CLOUD CONSUMPTION
Real Time Data Analytics
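The "enriching event data" step in this pipeline typically joins each incoming event with reference data before it is stored or visualized. The following is a minimal in-memory sketch of that join; it does not use Kafka or Event Hubs client libraries, and the event fields, the lookup table, and the `enrich` function are hypothetical.

```python
def enrich(events, device_lookup):
    """Lazily attach reference data to each event as it streams through."""
    for event in events:
        extra = device_lookup.get(event["device_id"], {})
        # Keep the original event fields and flag whether reference data was found.
        yield {**event, **extra, "enriched": bool(extra)}


# Hypothetical reference data, for example loaded from the data lake.
DEVICES = {"d-1": {"region": "EU", "model": "X200"}}

# Hypothetical incoming events; in a real pipeline this would be a stream consumer.
incoming = [
    {"device_id": "d-1", "temp": 21.5},
    {"device_id": "d-9", "temp": 19.0},
]
enriched = list(enrich(incoming, DEVICES))
```

Because `enrich` is a generator, it processes one event at a time and can sit between any event source and any sink without buffering the whole stream.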
17. Customer Stories
(customers.microsoft.com)
• “HDInsight”
• “Azure Data Lake Analytics”
• “Azure Data Lake Store”
ADL Partners
• Cloudera now supports Azure Data Lake Store
• Run Hortonworks clusters and easily access Azure Data Lake
• ImanisData: cloud migration, backup for your big data applications on Azure HDInsight
• Ingest data into Azure Data Lake Store with StreamSets Data Collector
(azure.microsoft.com)
(msdn.microsoft.com/azuredatalake)
(azuremarketplace.microsoft.com)