Building Big Data Solutions with Azure Data Lake.10.11.17.pptx
Data sources
Non-relational data
The Data Lake Approach
Ingest all data
regardless of
requirements
Store all data
in native format
without schema
definition
Do analysis
Hadoop, Spark, R,
Azure Data Lake
Analytics (ADLA)
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
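The "store in native format without schema definition" idea is often called schema-on-read. The following is a minimal sketch in plain Python, not an Azure Data Lake API: the sample records, field names, and helper functions are all hypothetical, chosen only to show a schema being applied at analysis time rather than at ingestion.

```python
import csv
import io
import json

# Raw data lands exactly as produced; nothing is validated or cast on write.
# (Sample payloads are made up for illustration.)
RAW_JSON_LINES = (
    '{"device": "d1", "temp": "21.5"}\n'
    '{"device": "d2", "temp": "19.0", "fw": "1.2"}\n'
)
RAW_CSV = "device,temp\nd3,22.1\n"


def load_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def load_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_with_schema(records, schema):
    """Apply a schema at read time: keep only the named fields and cast them."""
    for rec in records:
        yield {name: cast(rec[name]) for name, cast in schema.items() if name in rec}


# The schema belongs to the analysis, not to the stored files.
schema = {"device": str, "temp": float}
rows = list(read_with_schema(load_json_lines(RAW_JSON_LINES) + load_csv(RAW_CSV), schema))
```

A different analysis could read the same raw files with a different schema (for example, one that includes the `fw` field) without rewriting anything in storage.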
Microsoft’s Big Data Journey
We needed to better leverage data and analytics to
do more experimentation
So, we built a Data Lake for Microsoft:
• A data lake for everyone to put their data
• Tools approachable by any developer
• Batch, Interactive, Streaming, ML
By the numbers
• Exabytes of data under management
• 100Ks of Physical Servers
• 100Ks of Batch Jobs, Millions of Interactive Queries
• Huge Streaming Pipelines
• 10K+ Developers running diverse workloads and scenarios
2010 2011 2012 2013 2014 2015 2016
Windows
SMSG
Live
Bing
CRM/Dynamics
Xbox Live
Office365
Malware Protection Microsoft Stores
Commerce Risk
Skype
LCA
Exchange
Yammer
Data Stored
Building a Cloud Strategy
Where are you
coming from?
• Partner investments
• Infrastructure
investments
• Today’s business
requirements
Where are you
going?
• New cloud data, new
sources, new scale
• New business
opportunities
Business guiding
principles
• Lift and shift?
• Cloud first?
• PaaS first?
• Multi-cloud?
Continuing
evolution
• Subset of existing
business/customers
• Development of new
business
• Azure roadmap,
partner roadmap,
business roadmap
• New Guiding
principles
Why Azure Data Lake?
An on-demand, real-time stream processing service with a no-limits data lake, built to support massively
parallel analytics
HDFS Compatible REST API
ADL Store
.NET, SQL, Python, R
scaled out by U-SQL
ADL Analytics
Open Source Apache
Hadoop ADL Client
HDInsight
Hive
• Performance at
scale
• Optimized for
analytics
• Multiple
analytics engines
• Single repository
sharing
HDFS Compatible REST API
ADL Store
Storage
Azure Data Lake Store
• Architected and built for very high throughput at scale for Big Data workloads
• No limits to file size, account size or number of files
• Single-repository for sharing
• Cloud-scale distributed filesystem with file/folder ACLs and RBAC
• Encryption-at-rest by default with Azure Key Vault
• Authenticated access with Azure Active Directory integration
• The Big Data platform for Microsoft
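To make "HDFS Compatible REST API" concrete, here is a small sketch that only builds WebHDFS-style request URLs and makes no network calls. The host pattern `<account>.azuredatalakestore.net/webhdfs/v1/` is an assumption based on the usual Azure Data Lake Store convention, and the account name and paths are hypothetical. A real request would also need an Azure Active Directory bearer token, which is omitted here.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account, path, op, **params):
    """Build a WebHDFS-style URL for a store account (assumed host pattern)."""
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    query = urlencode({"op": op, **params})
    # quote() leaves "/" intact and percent-encodes spaces and other characters.
    return f"{base}{quote(path.lstrip('/'))}?{query}"


# Hypothetical account and paths, for illustration only.
list_url = webhdfs_url("contosolake", "/landing/clickstream", "LISTSTATUS")
open_url = webhdfs_url("contosolake", "/landing/day 1.csv", "OPEN", offset=0)
```

`LISTSTATUS` and `OPEN` are standard WebHDFS operation names, which is why existing Hadoop tooling can talk to a store exposing this interface.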
HDFS Compatible REST API
ADL Store
Analytics
Storage
Azure Data Lake cloud models
Cloudera CDH
Hortonworks HDP
Qubole QDS (soon)
• Open Source Apache® ADL client
for commercial and custom Hadoop
• Cloud IaaS and Hybrid
HDFS Compatible REST API
HDInsight
ADL Store
Hive
Analytics
Storage
Azure Data Lake cloud models
• 63% lower TCO
than on-premises*
• SLA-managed,
monitored and
supported by
Microsoft
• Fully managed
Hadoop, Spark
and R
• Clusters
deployed in
minutes
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
HDFS Compatible REST API
ADL Store
Analytics
Storage
Azure Data Lake cloud models
.NET, SQL, Python, R
scaled out by U-SQL
ADL Analytics
• Serverless. Pay per job. Starts in
seconds. Scales instantly.
• Develop massively parallel
programs with simplicity
• Federated query from multiple data
sources
Why Azure Data Lake?
An on-demand, real-time stream processing service with a no-limits data lake, built to support
massively parallel analytics
•Optimized for high
throughput I/O for
streaming and batch
•Petabyte size files and
trillions of objects
Performance at
scale
•HDFS-compatible REST
API
•Optimized Java, Python,
.NET SDKs and R
• Tools for Debugging and
optimizing your Big Data
programs with ease
•Open Source Apache
ADL client for
commercial Hadoop
distributions and RYOH
•Fully integrated with
HDInsight managed
Hadoop
• Pay-per-job R, Python
and U-SQL execution
with ADLA
•Azure Active Directory
(AAD) integration
•POSIX-style ACLs at file
and folder level
•Role-based access
control (RBAC)
•Encryption of data at rest
with Azure Key Vault
Single
repository
sharing
Optimized for
analytics
Multiple
analytics
engines
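The "POSIX-style ACLs at file and folder level" bullet can be illustrated with a simplified permission check. This is a sketch of the general POSIX ACL evaluation order (owner, named user, groups, other, with a mask on named and group entries), written in plain Python. It is not the Azure Data Lake Store implementation, and the `Acl` class, user names, and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    owner: str
    group: str
    owner_perms: str  # e.g. "rwx" or "r-x"
    group_perms: str
    other_perms: str
    named_users: dict = field(default_factory=dict)   # user name -> perms
    named_groups: dict = field(default_factory=dict)  # group name -> perms
    mask: str = "rwx"  # caps named-user and group entries


def allowed(acl, user, groups, want):
    """Return True if `user` (member of `groups`) may do `want` ('r', 'w' or 'x')."""
    def has(perms, masked=False):
        effective = set(perms) - {"-"}
        if masked:
            effective &= set(acl.mask)
        return want in effective

    if user == acl.owner:
        return has(acl.owner_perms)
    if user in acl.named_users:
        return has(acl.named_users[user], masked=True)
    matching = []
    if acl.group in groups:
        matching.append(acl.group_perms)
    matching += [p for g, p in acl.named_groups.items() if g in groups]
    if matching:
        # Once a group entry matches, "other" is never consulted.
        return any(has(p, masked=True) for p in matching)
    return has(acl.other_perms)


# Hypothetical folder ACL: bob is granted rw-, but the mask r-x removes write.
acl = Acl("alice", "analysts", "rwx", "r-x", "---",
          named_users={"bob": "rw-"}, mask="r-x")
```

Role-based access control (RBAC) is a separate layer that governs management operations on the account; the sketch covers only data-level ACL checks.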
Patterns for Azure Data Lake
DATA ORIGINS ANALYTICS and INSIGHTS
Customer
Behavior
Move via Data Factory
Data Lake
SQL DW
CONSUMPTION
Web Portals
Mobile Apps
Power BI
Data Lake Scenario
Data Science
Notebooks
ETL & Analytics
Cleanup, Normalize, Basic Stats
Experimentation
A/B testing at scale. Drive changes
based on actual
Customer behavior
Machine Learning
Do ML at Scale (Customer
Segmentation & Fraud Detection)
Store
Analytics
Move via Data Factory
HDInsight
SQL DB
Clickstream
DBs
ON-PREMISES CLOUD
Massive Archive
On-Prem HDFS
Initial one-time
import
Active Incoming
Data
Incremental
updates
“Landing Zone”
Data Lake Store
via Azure
Data
Factory Curated Data
DW (many
instances)
Data is partitioned into
multiple SQL DWs (one
per data consumer;
several hundred
consumers)
CONSUMPTION
Machine Learning
Do ML at Scale
(Customer Segmentation
& Fraud Detection)
Web Portals
Mobile Apps
Power BI
Experimentation
A/B testing at scale. Drive
changes based on actual
Customer behavior
Retail Scenario
Jupyter
Notebooks
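The retail scenario pairs an initial one-time import with ongoing incremental updates into the landing zone. The slide names Azure Data Factory as the mover; the sketch below does not use Data Factory at all. It is a plain-Python illustration of the watermark pattern commonly used for incremental copies, with in-memory dictionaries standing in for the on-premises archive and the landing zone, and with hypothetical file names and timestamps.

```python
def initial_import(source):
    """One-time bulk copy of everything currently in the archive."""
    return dict(source)


def incremental_update(source, landing_zone, watermark):
    """Copy only entries modified after `watermark`; return the new watermark."""
    newest = watermark
    for path, (modified, payload) in source.items():
        if modified > watermark:
            landing_zone[path] = (modified, payload)
            newest = max(newest, modified)
    return newest


# path -> (last-modified timestamp, payload); values are made up.
archive = {"a.csv": (1, "x"), "b.csv": (2, "y")}

landing = initial_import(archive)
watermark = max(modified for modified, _ in landing.values())

# Later, one new file arrives and one existing file changes.
archive["c.csv"] = (5, "z")
archive["a.csv"] = (4, "x2")
watermark = incremental_update(archive, landing, watermark)
```

Persisting the watermark between runs is what lets each incremental pass skip data that has already been copied.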
Event data
Data Lake
Store
HDI R Jupyter
Enriching
Event data
Event data
Kafka
EventHubs Power BI
Data Lake
Analytics
CLOUD CONSUMPTION
Real Time Data Analytics
Customer Stories
(customers.microsoft.com)
• “HDInsight”
• “Azure Data Lake Analytics”
• “Azure Data Lake Store”
• Cloudera now supports Azure Data Lake
Store
• Run Hortonworks clusters and easily access
Azure Data Lake
• Imanis Data – Cloud migration, backup for
your big data applications on Azure
HDInsight
• Ingest data into Azure Data Lake Store with
StreamSets Data Collector
ADL Partners
(azure.microsoft.com)
(msdn.microsoft.com/azuredatalake)
(azuremarketplace.microsoft.com)
aka.ms/AzureDataLake
A webinar using these slides
can be viewed at Data
Platforms Online 2017