Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SlideShare a Scribd company logo
Modern ETL: Azure Data Factory,
Data Lake, and SQL Database
Eric Bragas
Please Support Our Sponsors
Local User Groups
Los Angeles User Group
3rd Thursday of each odd month
sqlla.pass.org
Malibu User Group
3rd Wednesday of each month
sqlmalibu.pass.org
San Diego User Group
1st & 3rd Thursday of each month
meetup.com/sdsqlug
meetup.com/sdsqlbig
Los Angeles - Korean
Every Other Tuesday
sqlangeles.pass.org
Orange County User Group
2rd Thursday of each month
bigpass.pass.org
SQLSaturday Los Angeles
June 9th
SQLSaturday San Diego
September 15th
SQL Summit
Annual International Conference
November 6 -9 | Seattle, WA
2 Days of Pre-Cons
200+ sessions over 3 days
Over 5,000 SQL Professionals
Evening Networking Activities
Discount Code: SSDISODNS
About Me
• Senior Business Intelligence Consultant with
DesignMind
• Undergoing a metamorphosis (somewhat
Kafkaesque) into a Cloud Data Engineer
• Always had a passion for art, design, and clean
engineering (I own a Dyson vacuum) and those
passions have stuck with me over the years
• I returned from a trip to Dresden, Germany,
Prague, and Venice this week
• Undergoing my Accelerated Freefall training to
become a certified skydiver
https://www.linkedin.com/in/ericbragas93/ @ericbragas
eric@designmind.com
Overview
This session IS
• A discussion of the awesome tools available in Azure for batch processing
data
• A comparison of ETL and ELT (or LETS)
• PaaS first!
This session IS NOT
• A technical deep dive
• A discussion about migrations
• For the faint of heart ;)
Overview (cont’d)
• Background in architecting and implementing SQL Server Data
Warehouses
• Experience with lift-and-shift, hybrid IaaS and Paas warehouses, and
brand new implementations using just PaaS
PaaS vs. IaaS
Benefits of PaaS
• No server to maintain!
• Literally just data and configurations
• A lot less room for user error
• Ridiculous reliability
• Developers, develop
• Elasticity of all services, including on as needed basis
• U-SQL AUs
• Data Factory parallelism
• SQL Database scaling (kills connections)
PaaS vs. IaaS (cont’d)
Benefits of PaaS Development Process
• Wide variety of tools, both visual and via the API
• Azure Portal makes the dev-test cycle very fast
• Also web based which makes working from anywhere really easy
• Visual Studio and VS Code extensions for development and tuning
• Excellent for integrating with source control
• And a bunch more!
The effectiveness of a solution is largely influenced by the effectiveness
of the team
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Azure Data Factory
• "[Azure Data Factory] is a cloud-
based data integration service
that allows you to create data-
driven workflows in the cloud
that orchestrate and automate
data movement and data
transformation.“
• Version 1 – service for batch
processing of time series data
• Version 2 – a general purpose
data processing and workflow
orchestration tool
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
Demo
Azure Data Factory
What if we need more?
Loading directly to SQL and transforming using SQL can be a good
option for smaller datasets where you don’t expect much evolution
What if you want more flexibility to add larger or more varying data
sets? Or you need a warehouse, but the business doesn’t know what
exactly they need until they see it?
Enter, the Data Lake!
Azure Data Lake
Two components:
• Data Lake Store – a distributed file store that
enables massively parallel read/write on data by a
number of services i.e. ADF, ADLA, HDInsight, ADW,
etc.
• Data Lake Analytics – a data processing engine that
leverages the hybrid SQL and C# language called U-
SQL to perform massively parallel processing of data.
Pay only for what you use.
Note: ADLA is not an ad hoc query engine. It is a batch
processing engine that takes file inputs and produces
file outputs.
What is a Data Lake?
• Place to load all your raw data into a folder framework
• Important to maintain order
• Schema-on-read queries to process data as needed
• Unstructured, semi-structured, and structured data
• Batch data processing at scale to feed your data marts
• Extensible query language
• Utilize as hub for analytics
• ADW, ADLA, ML, etc.
Modern ETL: Azure Data Factory, Data Lake, and SQL Database
What are the Benefits?
• Load data without first defining or being locked into a particular
schema
• Explore the data before deciding what schema to impose and
processing for your downstream analytics
• Alleviates a major challenge with starting a DW project
• Faster time-to-value (less time deliberating, more time iterating)
• Feed multiple downstream systems from the same system
• Enable a variety of user types to interact with data at the level they
need
• Data Scientists on raw data; Analysts on Data Marts
Data Lake as a Hub
Demo
Azure Data Lake Store and Analytics
SQL Database
• Cloud managed database
service; similar but not the same
as SQL Server
• Use as the presentation or
semantic layer for your data
warehouse
• Fast ad hoc queries and many
concurrent connections
• Supports clustered indexes,
memory optimized tables, etc.
Creating Data Marts
• Pre-process data incrementally using Data Lake Analytics, and stage in
Data Lake Store
• Copy to SQL Database table using pre-copy script and the copy
activity
• More advanced requirements can be serviced by the “writeStoredProcedure”
in the copy activity
• Maintain metadata for incremental loading within the same database
• Track what was loaded last, then load the difference using lookup activity
Demo
Run Pipeline and Query SQL Database
Q&A
Thanks!

More Related Content

Modern ETL: Azure Data Factory, Data Lake, and SQL Database

  • 1. Modern ETL: Azure Data Factory, Data Lake, and SQL Database Eric Bragas
  • 3. Local User Groups Los Angeles User Group 3rd Thursday of each odd month sqlla.pass.org Malibu User Group 3rd Wednesday of each month sqlmalibu.pass.org San Diego User Group 1st & 3rd Thursday of each month meetup.com/sdsqlug meetup.com/sdsqlbig Los Angeles - Korean Every Other Tuesday sqlangeles.pass.org Orange County User Group 2rd Thursday of each month bigpass.pass.org SQLSaturday Los Angeles June 9th SQLSaturday San Diego September 15th
  • 4. SQL Summit Annual International Conference November 6 -9 | Seattle, WA 2 Days of Pre-Cons 200+ sessions over 3 days Over 5,000 SQL Professionals Evening Networking Activities Discount Code: SSDISODNS
  • 5. About Me • Senior Business Intelligence Consultant with DesignMind • Undergoing a metamorphosis (somewhat Kafkaesque) into a Cloud Data Engineer • Always had a passion for art, design, and clean engineering (I own a Dyson vacuum) and those passions have stuck with me over the years • I returned from a trip to Dresden, Germany, Prague, and Venice this week • Undergoing my Accelerated Freefall training to become a certified skydiver https://www.linkedin.com/in/ericbragas93/ @ericbragas eric@designmind.com
  • 6. Overview This session IS • A discussion of the awesome tools available in Azure for batch processing data • A comparison of ETL and ELT (or LETS) • PaaS first! This session IS NOT • A technical deep dive • A discussion about migrations • For the faint of heart ;)
  • 7. Overview (cont’d) • Background in architecting and implementing SQL Server Data Warehouses • Experience with lift-and-shift, hybrid IaaS and Paas warehouses, and brand new implementations using just PaaS
  • 8. PaaS vs. IaaS Benefits of PaaS • No server to maintain! • Literally just data and configurations • A lot less room for user error • Ridiculous reliability • Developers, develop • Elasticity of all services, including on as needed basis • U-SQL AUs • Data Factory parallelism • SQL Database scaling (kills connections)
  • 9. PaaS vs. IaaS (cont’d) Benefits of PaaS Development Process • Wide variety of tools, both visual and via the API • Azure Portal makes the dev-test cycle very fast • Also web based which makes working from anywhere really easy • Visual Studio and VS Code extensions for development and tuning • Excellent for integrating with source control • And a bunch more! The effectiveness of a solution is largely influenced by the effectiveness of the team
  • 12. Azure Data Factory • "[Azure Data Factory] is a cloud- based data integration service that allows you to create data- driven workflows in the cloud that orchestrate and automate data movement and data transformation.“ • Version 1 – service for batch processing of time series data • Version 2 – a general purpose data processing and workflow orchestration tool
  • 16. What if we need more? Loading directly to SQL and transforming using SQL can be a good option for smaller datasets where you don’t expect much evolution What if you want more flexibility to add larger or more varying data sets? Or you need a warehouse, but the business doesn’t know what exactly they need until they see it? Enter, the Data Lake!
  • 17. Azure Data Lake Two components: • Data Lake Store – a distributed file store that enables massively parallel read/write on data by a number of services i.e. ADF, ADLA, HDInsight, ADW, etc. • Data Lake Analytics – a data processing engine that leverages the hybrid SQL and C# language called U- SQL to perform massively parallel processing of data. Pay only for what you use. Note: ADLA is not an ad hoc query engine. It is a batch processing engine that takes file inputs and produces file outputs.
  • 18. What is a Data Lake? • Place to load all your raw data into a folder framework • Important to maintain order • Schema-on-read queries to process data as needed • Unstructured, semi-structured, and structured data • Batch data processing at scale to feed your data marts • Extensible query language • Utilize as hub for analytics • ADW, ADLA, ML, etc.
  • 20. What are the Benefits? • Load data without first defining or being locked into a particular schema • Explore the data before deciding what schema to impose and processing for your downstream analytics • Alleviates a major challenge with starting a DW project • Faster time-to-value (less time deliberating, more time iterating) • Feed multiple downstream systems from the same system • Enable a variety of user types to interact with data at the level they need • Data Scientists on raw data; Analysts on Data Marts
  • 21. Data Lake as a Hub
  • 22. Demo Azure Data Lake Store and Analytics
  • 23. SQL Database • Cloud managed database service; similar but not the same as SQL Server • Use as the presentation or semantic layer for your data warehouse • Fast ad hoc queries and many concurrent connections • Supports clustered indexes, memory optimized tables, etc.
  • 24. Creating Data Marts • Pre-process data incrementally using Data Lake Analytics, and stage in Data Lake Store • Copy to SQL Database table using pre-copy script and the copy activity • More advanced requirements can be serviced by the “writeStoredProcedure” in the copy activity • Maintain metadata for incremental loading within the same database • Track what was loaded last, then load the difference using lookup activity
  • 25. Demo Run Pipeline and Query SQL Database