Azure Lowlands: An intro to Azure Data Lake

An intro to
Azure Data Lake
Rick van den Bosch
M +31 (0)6 52 34 89 30
r.van.den.bosch@betabit.nl

Agenda
Data Lakes
About Azure Data Lake
Azure Data Lake Store
- DEMO
Azure Data Lake HDInsight
- DEMO
Azure Data Lake Analytics
- DEMO
Power BI
- DEMO
Resources

Rick van den Bosch
Cloud Solutions Architect
@rickvdbosch
rickvandenbosch.net
r.van.den.bosch@betabit.nl

The Traditional Data Warehouse
Data sources
Non-relational data

The Data Lake Approach
• Ingest all data regardless of requirements
• Store all data in native format without schema definition
• Do analysis: Hadoop, Spark, R, Azure Data Lake Analytics (ADLA)
[Diagram labels: Devices; Interactive queries; Batch queries; Machine Learning; Data warehouse; Real-time analytics]
Designed for the questions you don’t yet know!

Azure Data Lake
• Store and analyze petabyte-size files and trillions of
objects
• Develop massively parallel programs with simplicity
• Debug and optimize your big data programs with ease
• Enterprise-grade security, auditing, and support
• Start in seconds, scale instantly, pay per job
• Built on YARN, designed for the cloud

Why Azure Data Lake?
• Performance at scale
• Optimized for analytics
• Multiple analytics engines
• Single repository sharing
[Diagram labels: ADL Store with HDFS-compatible REST API; ADL Analytics with .NET, SQL, Python, R scaled out by U-SQL; open source Apache Hadoop ADL client; Azure Databricks; HDInsight; Hive]
An on-demand, real-time stream processing service with a no-limits data lake, built to support massively parallel analytics

Store
• Enterprise-wide hyper-scale repository
• Data of any size, type and ingestion speed
• Operational and exploratory analytics
• WebHDFS-compatible API
• Specifically designed to enable analytics
• Tuned for (data analytics scenario) performance
• Out of the box:
security, manageability, scalability, reliability, and
availability
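As a rough illustration of what "WebHDFS-compatible API" means in practice, the sketch below only builds the request URL and the authorization header; it does not call the service. It assumes the Gen1 endpoint pattern `https://<account>.azuredatalakestore.net/webhdfs/v1/<path>?op=<OPERATION>`. The account name `mydatalake` and the helper names `webhdfs_url` and `auth_headers` are hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(account: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style URL for a Data Lake Store (Gen1) account."""
    # Assumed endpoint layout: the file system is rooted under /webhdfs/v1/.
    base = f"https://{account}.azuredatalakestore.net/webhdfs/v1/"
    # The operation (LISTSTATUS, OPEN, MKDIRS, ...) goes in the query string.
    query = urlencode({"op": op, **params})
    return f"{base}{quote(path.lstrip('/'))}?{query}"


def auth_headers(access_token: str) -> dict:
    """Wrap an OAuth 2.0 access token (obtained from Azure AD) as a bearer header."""
    return {"Authorization": f"Bearer {access_token}"}


if __name__ == "__main__":
    # "mydatalake" is a hypothetical account name.
    print(webhdfs_url("mydatalake", "/Samples/Data", "LISTSTATUS"))
```

Any HTTP client can then send the request with that header; acquiring the token is left out here on purpose.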

Store
Architected and built for very high throughput at scale for Big Data workloads
- No limits to file size, account size or number of files
Single repository for sharing
- Cloud-scale distributed filesystem with file/folder ACLs and RBAC
- Encryption at rest by default with Azure Key Vault
- Authenticated access with Azure Active Directory integration
The Big Data platform for Microsoft

Key capabilities
Built for Hadoop
Unlimited storage, petabyte files
Performance-tuned for big data analytics
Enterprise-ready: Highly-available and secure
All data

Security
Authentication
• Azure Active Directory integration
• OAuth 2.0 support for REST interface
Access control
• Supports POSIX-style permissions (exposed by
WebHDFS)
• ACLs on root, subfolders and individual files
Encryption
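To make "POSIX-style permissions" concrete, here is a minimal sketch of how a classic owner/group/other mode is evaluated. It is a simplification under stated assumptions: it ignores named ACL entries and the mask, and the function name `is_allowed` and the user and group names in the usage note are hypothetical.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(mode: int, owner: str, group: str,
               user: str, user_groups: set, wanted: int) -> bool:
    """Check the `wanted` permission bits against a POSIX mode such as 0o750."""
    if user == owner:
        bits = (mode >> 6) & 0o7   # owning-user triplet
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # owning-group triplet
    else:
        bits = mode & 0o7          # everyone else
    # Every requested bit must be present in the selected triplet.
    return (bits & wanted) == wanted
```

With mode `0o750`, owner `alice` and owning group `analysts`, a member of `analysts` can read and traverse the folder but not write to it, and a user outside the group gets no access.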

Store
ADL Store

Ingest data – Ad hoc
Local computer
• Azure Portal
• Azure PowerShell
• Azure CLI
• Data Lake Tools for Visual Studio
Azure Storage Blob
• Azure Data Factory
• AdlCopy tool
• DistCp running on HDInsight cluster

Ingest data
Streamed
• Azure Stream Analytics
• Azure HDInsight Storm
• EventProcessorHost
Relational
• Apache Sqoop
Web server
Upload using custom applications
• Azure CLI
• Azure PowerShell
• Azure Data Lake Storage Gen1 .NET SDK

ADLS Gen 2
Takes core capabilities from Azure Data Lake Storage Gen1 such as
- a Hadoop compatible file system
- Azure Active Directory
- POSIX based ACLs
and integrates them into Azure Blob Storage

Additional benefits
Unlimited scale and performance
Performance improvements reading/writing individual objects (higher throughput and concurrency)
Removes the need to decide a priori, at data ingestion time, whether to run analytics
Data protection capabilities: encryption at rest
Integrated network Firewall capabilities
Durability options (Zone and Geo-Redundant Storage: high-availability and disaster recovery)
Linux integration – BlobFUSE
- mount Blob Storage from Linux VMs
- interact using standard Linux shell commands.

Data Lake Storage Gen2
“In Data Lake Storage Gen2, all the qualities of object storage remain while adding the advantages of a file system interface optimized for analytics workloads.”

Known issues
Blob Storage APIs and Azure Data Lake Gen2 APIs aren't interoperable
Blob storage APIs not available
Azure Storage Explorer: requires version 1.6.0 or later
AzCopy: requires v10 or later
Event Grid doesn't receive events
Soft Delete and Snapshots not available
Object level storage tiers not available
Diagnostic logs not available

HDInsight
Cloud distribution of the (Hortonworks) Hadoop
components
Supports multiple Hadoop cluster versions (can be
deployed any time)
Hadoop
• YARN for job scheduling & resource management
• MapReduce for parallel processing
• HDFS

HDInsight
Open Source Apache Hadoop ADL Client
Azure Databricks
HDInsight
Hive

Analytics
Dynamic scaling
Develop faster, debug and optimize smarter using familiar
tools
U-SQL: simple and familiar, powerful, and extensible
Integrates seamlessly with your IT investments
Affordable and cost effective
Works with all your Azure data

Analytics
On-demand analytics job service to simplify big data
analytics
Can handle jobs of any scale instantly
Azure Active Directory integration
U-SQL

Azure Data Lake Analytics
• Serverless. Pay per job. Starts in seconds. Scales instantly.
• Develop massively parallel programs with simplicity
• Federated query from multiple data sources
[Diagram labels: Analytics: ADL Analytics (.NET, SQL, Python, R scaled out by U-SQL); Storage: ADL Store]

U-SQL
Language that combines declarative SQL with imperative C#

U-SQL – Key concepts
Rowset variables
• Each query expression that produces a rowset can be
assigned to a variable.
EXTRACT
• Reads data from a file & defines the schema on read *
OUTPUT
• Writes data from a rowset to a file *

U-SQL – Scalar variables
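// Scalar variables hold the input and output paths.
DECLARE @in string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";
// Schema on read. The column list is abbreviated for the slide; the built-in
// extractors expect EXTRACT to declare every column of the input file, in order.
@searchlog =
    EXTRACT UserId int,
            ClickedUrls string
    FROM @in
    USING Extractors.Tsv();
// Write the rowset to a CSV file.
OUTPUT @searchlog
    TO @out
    USING Outputters.Csv();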

U-SQL – Transform rowsets
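// Column list abbreviated for the slide; EXTRACT must declare every column of the file.
@searchlog =
    EXTRACT UserId int,
            Region string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();
// The WHERE clause is a C# expression, hence == rather than SQL's =.
@rs1 =
    SELECT UserId, Region
    FROM @searchlog
    WHERE Region == "en-gb";
OUTPUT @rs1
    TO "/output/SearchLog-transform-rowsets.csv"
    USING Outputters.Csv();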
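For readers who do not know U-SQL, the same EXTRACT, SELECT ... WHERE, OUTPUT flow can be sketched in plain Python. This is only an analogy, not how Data Lake Analytics executes a job. The two-column sample data and the names `extract`, `output` and `run` are made up for illustration, and the CSV written here is not guaranteed to match the quoting that `Outputters.Csv()` applies.

```python
import csv
import io

# Hypothetical two-column sample; the real SearchLog.tsv has more columns.
SAMPLE_TSV = "399266\ten-us\n382045\ten-gb\n106479\ten-gb\n"

# Schema on read: names and types are attached when the file is read.
SCHEMA = [("UserId", int), ("Region", str)]


def extract(stream, schema, delimiter="\t"):
    """Yield one dict per row, applying column names and types at read time."""
    for raw in csv.reader(stream, delimiter=delimiter):
        yield {name: cast(value) for (name, cast), value in zip(schema, raw)}


def output(rows, stream, columns):
    """Write rows as CSV in the given column order."""
    writer = csv.writer(stream, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in columns])


def run(tsv_text):
    searchlog = extract(io.StringIO(tsv_text), SCHEMA)       # EXTRACT
    rs1 = (r for r in searchlog if r["Region"] == "en-gb")   # SELECT ... WHERE
    out = io.StringIO()
    output(rs1, out, ["UserId", "Region"])                   # OUTPUT
    return out.getvalue()
```

Each step produces a "rowset" that feeds the next one, which mirrors how rowset variables such as `@searchlog` and `@rs1` are chained in the script.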

U-SQL – Extractor parameters
delimiter
encoding
escapeCharacter
nullEscape
quoting
rowDelimiter
silent
skipFirstNRows
charFormat

U-SQL – Outputter parameters
delimiter
dateTimeFormat
encoding
escapeCharacter
nullEscape
quoting
rowDelimiter
charFormat
outputHeader

U-SQL
Built-in extractors and outputters:
Text
Csv
Tsv
A CSV Extractor or Outputter, for instance, is EXACTLY THAT

Data sources
Options in the Azure Portal:
• Data Lake Storage Gen1
• Azure Storage

Resources
Basic example
Advanced example
Create Database (U-SQL) & Create Data Source (U-SQL)
This example
HDInsight quickstart
Azure blog
Azure roadmap

Track 1
15:35 – 16:20
Skynet Is Talking - Microsoft Bot Framework
Kris van der Mast
Track 2
15:35 – 16:20
Enter The Matrix: Securing Azure's Assets
Mike Martin
