Cortana Analytics Workshop: Azure Data Lake
What is Azure Data Lake?
A hyper scale repository for any data, optimized for big data analytic workloads.
Why do we need Data Lakes?
Observation, Pattern, Theory, Hypothesis
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
Top-Down: Confirmation, Theory, Hypothesis, Observation
Ingest: regardless of requirements
Store: in native format without schema definition
Analyze: using analytic engines like Hadoop
• Interactive queries
• Batch queries
• Machine Learning
• Data warehouse
• Real-time analytics
Devices
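The "store in native format without schema definition" idea is often called schema-on-read: data is written as it arrives, and each reader applies its own schema later. The following is a minimal sketch of that idea, not code from the workshop. The sample records, the field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2015-09-10T12:00:00Z"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: yield only records that parse and fit it."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in the store; this reader skips it
        if not isinstance(record, dict):
            continue
        try:
            # Keep only the requested fields, cast to the requested types.
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # record does not fit this particular schema


# Two readers, two schemas, one unchanged set of raw data.
temperatures = list(read_with_schema(RAW_LINES, {"device": str, "temp_c": float}))
timestamps = list(read_with_schema(RAW_LINES, {"device": str, "ts": str}))
print(temperatures)
print(timestamps)
```

The design point is that a record that fails one reader's schema is still available to another reader, because nothing was discarded or reshaped at ingest time.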
Introducing Azure Data Lake
A hyper scale repository for big data analytics workloads
• Store any data in its native format
• Hadoop File System (HDFS) for the cloud
• Enterprise grade
• No limits to scale
• Optimized for analytic workload performance
Diagram labels: Hadoop Cluster (HDFS/WebHDFS API), Azure HDInsight, WebHDFS API, Azure Data Lake
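WebHDFS is the REST interface defined by Apache Hadoop, with URLs of the form `<scheme>://<host>/webhdfs/v1/<path>?op=<OPERATION>`. As a rough illustration of what a client behind the WebHDFS API label would send, the sketch below only builds such URLs and makes no network calls. The host name is a placeholder, not a real endpoint, and `webhdfs_url` is a hypothetical helper rather than part of any SDK.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, scheme: str = "https", **params: str) -> str:
    """Build a WebHDFS REST URL: scheme://host/webhdfs/v1/<path>?op=<OP>&..."""
    clean_path = quote(path.lstrip("/"))  # percent-encode, but keep "/" separators
    query = urlencode({"op": op, **params})
    return f"{scheme}://{host}/webhdfs/v1/{clean_path}?{query}"


# Placeholder host; substitute the endpoint of an actual store.
HOST = "example-account.example.net"

list_url = webhdfs_url(HOST, "/raw/events", "LISTSTATUS")
open_url = webhdfs_url(HOST, "/raw/events/day 1.json", "OPEN")
print(list_url)
print(open_url)
```

`LISTSTATUS` and `OPEN` are standard WebHDFS operations. Authentication is left out here, since it depends on the service in front of the API.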
Devices
• Unstructured
• Semi-structured
• Structured
Ingress
• Sources: Azure Storage Blobs, Client Machines, Azure SQL DB, Azure SQL DW, Azure Tables
• Tools: Azure Web Portal via Browser, Azure PowerShell, .NET SDK, JavaScript CLI, ADL built-in Copy Service, Azure Data Factory, Sqoop, Third-party tools, ASA
Egress
• Destinations: Azure Storage Blobs, Client Machines, Azure SQL DB, Azure SQL DW, Azure Tables
• Tools: Azure Web Portal via Browser, Azure PowerShell, .NET SDK, JavaScript CLI, ADL built-in Copy Service, Azure Data Factory, Sqoop
• Built from the ground up as a Hadoop File System
• Support for file/folder objects and operations
• Integrated with HDInsight, Hortonworks, Cloudera
• Accessible to all HDFS-compliant projects (Spark, Storm, Flume, Sqoop, Kafka, R, etc.)
HDInsight
• Azure Active Directory integration
• File and folder level access control
• Audit data access
• Encryption of data-at-rest
Access Control
• Secure files and folders
• POSIX-compliant ACLs
• Minimal (octet) and enhanced ACLs
• Based on Azure AD principals
Auditing
• Audit logs for all operations
• Consumable via big data analytics
Encryption at Rest
• Transparent server-side encryption
• Azure-managed and customer-managed keys
• Azure Key Vault integration
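In POSIX terms, a minimal ACL is just the three classic permission classes (owning user, owning group, other), usually written as a three-digit octal mode. The sketch below expands such a mode into its three entries. It illustrates the general POSIX convention only, under the assumption that the slide's "minimal" ACLs follow it. `minimal_acl` is a hypothetical helper, and enhanced ACLs (named users and groups, mask) are not modelled.

```python
def minimal_acl(mode: int) -> dict[str, str]:
    """Expand a permission mode such as 0o750 into the three minimal ACL entries."""

    def rwx(bits: int) -> str:
        # Map the three permission bits (read=4, write=2, execute=1) to "rwx" notation.
        return "".join(
            flag if bits & mask else "-"
            for flag, mask in (("r", 4), ("w", 2), ("x", 1))
        )

    return {
        "user::": rwx((mode >> 6) & 0o7),   # owning user
        "group::": rwx((mode >> 3) & 0o7),  # owning group
        "other::": rwx(mode & 0o7),         # everyone else
    }


acl = minimal_acl(0o750)
print(acl)
```

An enhanced ACL would add further entries for specific principals alongside these three.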
• Automatically replicates your data
• 3 copies within a single region
• Highly available
• Unlimited account sizes
• Individual file sizes from GBs to PBs
• No limits to scale
• Built for running large analytic systems that require massive throughput
• Optimized for parallel computation over PBs of data
• Automatically optimizes for any throughput
Diagram labels: Hortonworks (HDP), Cloudera (CDH), MapR, WebHDFS API, Azure ML, ASA, Streaming, Azure RRE, Relational Data Warehouse
Data sources shown: Websites, Sensors, Social, Relational
• Can store structured, semi-structured, unstructured data
• Can support all Hadoop applications
• Is built for the enterprise
• Can meet performance needs of big data applications