Modern ETL: Azure Data Factory, Data Lake, and SQL Database

Eric Bragas

Local User Groups
Los Angeles User Group
3rd Thursday of each odd month
sqlla.pass.org
Malibu User Group
3rd Wednesday of each month
sqlmalibu.pass.org
San Diego User Group
1st & 3rd Thursday of each month
meetup.com/sdsqlug
meetup.com/sdsqlbig
Los Angeles - Korean
Every Other Tuesday
sqlangeles.pass.org
Orange County User Group
2nd Thursday of each month
bigpass.pass.org
SQLSaturday Los Angeles
June 9th
SQLSaturday San Diego
September 15th

SQL Summit
Annual International Conference
November 6-9 | Seattle, WA
2 Days of Pre-Cons
200+ sessions over 3 days
Over 5,000 SQL Professionals
Evening Networking Activities
Discount Code: SSDISODNS

About Me
• Senior Business Intelligence Consultant with
DesignMind
• Undergoing a metamorphosis (somewhat
Kafkaesque) into a Cloud Data Engineer
• Always had a passion for art, design, and clean
engineering (I own a Dyson vacuum) and those
passions have stuck with me over the years
• I returned from a trip to Dresden, Germany,
Prague, and Venice this week
• Undergoing my Accelerated Freefall training to
become a certified skydiver
https://www.linkedin.com/in/ericbragas93/ @ericbragas
eric@designmind.com

Overview
This session IS
• A discussion of the awesome tools available in Azure for batch processing
data
• A comparison of ETL and ELT (or LETS)
• PaaS first!
This session IS NOT
• A technical deep dive
• A discussion about migrations
• For the faint of heart ;)

Overview (cont’d)
• Background in architecting and implementing SQL Server Data
Warehouses
• Experience with lift-and-shift, hybrid IaaS and PaaS warehouses, and
brand new implementations using just PaaS

PaaS vs. IaaS
Benefits of PaaS
• No server to maintain!
• Literally just data and configurations
• A lot less room for user error
• Ridiculous reliability
• Developers, develop
• Elasticity of all services, including on an as-needed basis
• U-SQL AUs
• Data Factory parallelism
• SQL Database scaling (note: a scale operation kills open connections)

PaaS vs. IaaS (cont’d)
Benefits of PaaS Development Process
• Wide variety of tools, both visual and via the API
• Azure Portal makes the dev-test cycle very fast
• Also web-based, which makes working from anywhere really easy
• Visual Studio and VS Code extensions for development and tuning
• Excellent for integrating with source control
• And a bunch more!
The effectiveness of a solution is largely influenced by the effectiveness
of the team

Azure Data Factory
• "[Azure Data Factory] is a cloud-based data integration service that
allows you to create data-driven workflows in the cloud that orchestrate
and automate data movement and data transformation."
• Version 1 – service for batch
processing of time series data
• Version 2 – a general purpose
data processing and workflow
orchestration tool

What if we need more?
Loading directly to SQL and transforming using SQL can be a good
option for smaller datasets where you don’t expect much evolution
What if you want more flexibility to add larger or more varying data
sets? Or you need a warehouse, but the business doesn’t know what
exactly they need until they see it?
Enter, the Data Lake!

Azure Data Lake
Two components:
• Data Lake Store – a distributed file store that
enables massively parallel read/write on data by a
number of services, e.g. ADF, ADLA, HDInsight, ADW,
etc.
• Data Lake Analytics – a data processing engine that
leverages the hybrid SQL and C# language called
U-SQL to perform massively parallel processing of data.
Pay only for what you use.
Note: ADLA is not an ad hoc query engine. It is a batch
processing engine that takes file inputs and produces
file outputs.
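
To make the "file inputs, file outputs" point concrete, here is a minimal sketch in plain Python. It is only a stand-in for a U-SQL job, not U-SQL itself, and the column names (region, amount) and the function name run_batch_job are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def run_batch_job(input_file, output_file):
    """Read raw CSV rows, aggregate them, and write a CSV result (file in, file out)."""
    totals = defaultdict(float)
    for row in csv.DictReader(input_file):
        # Hypothetical columns: "region" and "amount"
        totals[row["region"]] += float(row["amount"])

    writer = csv.writer(output_file, lineterminator="\n")
    writer.writerow(["region", "total_amount"])
    for region in sorted(totals):
        writer.writerow([region, totals[region]])


# In-memory files stand in for files in the Data Lake Store
raw = io.StringIO("region,amount\nwest,10.5\neast,4\nwest,2.5\n")
out = io.StringIO()
run_batch_job(raw, out)
result = out.getvalue()
print(result)
```

Nothing is queried interactively here: the job runs once over its input and leaves a new file behind, which is the shape of work ADLA is built for.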

What is a Data Lake?
• Place to load all your raw data into a folder framework
• Important to maintain order
• Schema-on-read queries to process data as needed
• Unstructured, semi-structured, and structured data
• Batch data processing at scale to feed your data marts
• Extensible query language
• Utilize as hub for analytics
• ADW, ADLA, ML, etc.
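
A rough sketch of the two ideas above, the folder framework and schema-on-read, in plain Python. The /raw/source/entity/yyyy/mm/dd layout, the helper names, and the sample columns are assumptions for illustration, not a prescribed convention.

```python
import csv
import io
from datetime import date


def raw_path(source, entity, load_date):
    """Build a predictable, date-partitioned path for a raw file (assumed layout)."""
    return f"/raw/{source}/{entity}/{load_date:%Y/%m/%d}/{entity}.csv"


def read_with_schema(text, schema):
    """Apply column names and types at read time; the stored file stays untyped text."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


path = raw_path("sales", "orders", date(2018, 6, 9))

# The same raw text could be read later with a different schema if needs change
orders_schema = [("order_id", int), ("customer", str), ("amount", float)]
rows = read_with_schema("1,alice,19.99\n2,bob,5\n", orders_schema)
print(path)
print(rows)
```

The file is loaded as-is and the schema lives with the query, so changing your mind about types or column names does not require reloading the data.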

What are the Benefits?
• Load data without first defining or being locked into a particular
schema
• Explore the data before deciding what schema to impose and
processing for your downstream analytics
• Alleviates a major challenge with starting a DW project
• Faster time-to-value (less time deliberating, more time iterating)
• Feed multiple downstream systems from the same system
• Enable a variety of user types to interact with data at the level they
need
• Data Scientists on raw data; Analysts on Data Marts

Demo
Azure Data Lake Store and Analytics

SQL Database
• Cloud managed database
service; similar but not the same
as SQL Server
• Use as the presentation or
semantic layer for your data
warehouse
• Fast ad hoc queries and many
concurrent connections
• Supports clustered indexes,
memory optimized tables, etc.

Creating Data Marts
• Pre-process data incrementally using Data Lake Analytics, and stage in
Data Lake Store
• Copy to SQL Database table using pre-copy script and the copy
activity
• More advanced requirements can be serviced by the stored procedure option
("sqlWriterStoredProcedureName") on the copy activity's SQL sink
• Maintain metadata for incremental loading within the same database
• Track what was loaded last, then load the difference using lookup activity
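
The "track what was loaded last, then load the difference" pattern can be sketched as follows. This uses Python's built-in sqlite3 as a local stand-in for SQL Database, and the table names (fact_sales, load_watermark) and the function incremental_load are hypothetical. In Data Factory the first query would roughly correspond to the lookup activity and the insert to the copy activity.

```python
import sqlite3


def incremental_load(conn, staged_rows):
    """Load only rows newer than the stored watermark, then advance the watermark."""
    cur = conn.cursor()

    # "Lookup" step: what was the last id we loaded?
    cur.execute(
        "SELECT last_loaded_id FROM load_watermark WHERE table_name = ?",
        ("fact_sales",),
    )
    last_id = cur.fetchone()[0]

    # "Copy" step: load only the difference
    new_rows = [row for row in staged_rows if row[0] > last_id]
    if new_rows:
        cur.executemany("INSERT INTO fact_sales (id, amount) VALUES (?, ?)", new_rows)
        cur.execute(
            "UPDATE load_watermark SET last_loaded_id = ? WHERE table_name = ?",
            (max(row[0] for row in new_rows), "fact_sales"),
        )
    conn.commit()
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute(
    "CREATE TABLE load_watermark (table_name TEXT PRIMARY KEY, last_loaded_id INTEGER)"
)
conn.execute("INSERT INTO load_watermark VALUES ('fact_sales', 0)")

first = incremental_load(conn, [(1, 10.0), (2, 20.0)])
second = incremental_load(conn, [(1, 10.0), (2, 20.0), (3, 30.0)])
print(first, second)
```

Keeping the watermark in the same database as the data mart means the load and the metadata update can be committed together, so a failed run does not advance the watermark.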

Demo
Run Pipeline and Query SQL Database
