Azure Data Factory
This section covers how to copy data from Azure SQL to Azure Data Lake, how to visualize the data by loading it into Power BI, and how to create an ETL process by building a pipeline with Azure Data Factory. You will also be introduced to Azure Data Lake Analytics and to USQL for data processing.
Data generated by applications and products is increasing exponentially day by day. Because this data comes from many different sources, it is very difficult to manage.
To analyze and store all this data, we can use Data Factory, which:
· Transforms the data with the help of pipelines (a logical grouping of activities that together perform a task)
· Publishes the output data to data stores such as Azure Data Lake, so that Business Intelligence (BI) applications can perform visualization or analytics on it
For better business decisions, we can organize the raw data into meaningful data stores.
The Data Factory service allows us to create pipelines that move and transform data, and then run those pipelines on a specified schedule, which can be daily, hourly, or weekly. The data consumed and produced by these workflows is time-sliced, and we can specify the pipeline mode as scheduled or one-time.
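As a rough illustration of the scheduled mode, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that attaches a daily schedule trigger to an existing pipeline. The resource group, factory, credential values, and the pipeline name "CopySqlToLakePipeline" are placeholders, not names from this walkthrough.

```python
from datetime import datetime, timedelta

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, TriggerResource,
)

# Placeholder identifiers - replace with values from your own subscription.
adf_client = DataFactoryManagementClient(
    ClientSecretCredential("<tenant-id>", "<app-id>", "<client-secret>"),
    "<subscription-id>")

# Run the pipeline once a day, starting tomorrow (scheduled mode).
recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime.utcnow() + timedelta(days=1),
    time_zone="UTC")

trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference",
            reference_name="CopySqlToLakePipeline"))])  # placeholder pipeline name

adf_client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "DailyTrigger",
    TriggerResource(properties=trigger))
```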
Input dataset: It is the data we have within our data store, which needs to be processed and
then passed through a pipeline.
Pipeline: A pipeline operates on the data, which may mean transforming it or simply moving it from one store to another. Data transformation is possible with the help of USQL, stored procedures, or Hive.
Output dataset: It contains data in a structured form, because the data has already been transformed and structured by the pipeline. It is then delivered to data stores such as Azure Data Lake, Blob storage, or SQL through linked services.
Linked services: These store the connection information needed to reach an external data store. For example, to connect to a SQL server we need a connection string, and we have to specify both the source and the destination of our data.
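For instance, a linked service for an Azure SQL database can be registered with a sketch like the following (again assuming the azure-mgmt-datafactory Python SDK; every identifier and the connection string are placeholders):

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlDatabaseLinkedService, LinkedServiceResource, SecureString,
)

adf_client = DataFactoryManagementClient(
    ClientSecretCredential("<tenant-id>", "<app-id>", "<client-secret>"),
    "<subscription-id>")

# The linked service is just stored connection information; the connection
# string below is a placeholder for a real Azure SQL connection string.
sql_ls = AzureSqlDatabaseLinkedService(
    connection_string=SecureString(
        value="Server=tcp:<server>.database.windows.net;Database=<db>;"
              "User ID=<user>;Password=<password>"))

adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "AzureSqlLinkedService",
    LinkedServiceResource(properties=sql_ls))
```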
Gateway: The Gateway connects our on-premises data to the cloud. We need a client installed
on our on-premises system so that we can connect to the Azure cloud.
Cloud: In the cloud, our data can be analyzed and visualized with many different analytics tools such as Apache Spark, R, Hadoop, and so on.
In short, we use Data Factory to automate the movement and transformation of data by creating linked services, datasets, and pipelines, and by scheduling those pipelines.
Azure Data Lake is a highly scalable, distributed data store and file system. It lives in the cloud and works with multiple external analytics frameworks such as Hadoop, Apache Spark, and so on.
Data can come from web, mobile, or social media sources. It is landed in Azure Data Lake Store and then made available to external frameworks like Apache Spark, Hive, etc.
There are two main concepts when it comes to Azure Data Lake: storage and analytics. The storage is effectively unlimited in size; it can hold gigabytes, terabytes, and much more. Azure Data Lake Store holds a wide variety of data, both unstructured and structured, and it can store very large individual files.
Copying Data from Azure SQL to Azure Data Lake
After installing SSMS, open the Microsoft Azure dashboard. We first need to create a data warehouse.
· Deploy the data warehouse
· Connect to the server created for the Azure data warehouse using SQL authentication
Note: We can find the server name in the overview of the SQL data warehouse.
· Go to the Microsoft Azure dashboard and create a new storage account (Azure Data Lake)
· Specify SQL Server as the source data store and create a new linked service (SQL Server)
· Select the dataset
· Specify Azure Data Lake as the destination data store and create a new linked service
To get the service principal ID and the service principal key, do the following (a sketch of plugging these values into a Data Lake linked service follows this list):
· Go to Certificates & secrets and create a new client secret; the generated password is the service principal key
· Go to the Azure Data Lake Store and grant full access to the application that was created for generating the service principal key
· Copying or moving data also works the other way around, i.e., the source and the destination can be interchanged
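Here is a minimal sketch, again assuming the azure-mgmt-datafactory Python SDK, of where the service principal ID and key end up: they authenticate the Azure Data Lake Store linked service. All identifiers are placeholders.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreLinkedService, LinkedServiceResource, SecureString,
)

adf_client = DataFactoryManagementClient(
    ClientSecretCredential("<tenant-id>", "<app-id>", "<client-secret>"),
    "<subscription-id>")

# service_principal_id = the application (client) ID of the app registration;
# service_principal_key = the client secret created under Certificates & secrets.
adls_ls = AzureDataLakeStoreLinkedService(
    data_lake_store_uri="adl://<store-name>.azuredatalakestore.net/webhdfs/v1",
    service_principal_id="<application-id>",
    service_principal_key=SecureString(value="<service-principal-key>"),
    tenant="<tenant-id>")

adf_client.linked_services.create_or_update(
    "<resource-group>", "<factory-name>", "AzureDataLakeStoreLinkedService",
    LinkedServiceResource(properties=adls_ls))
```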
Microsoft Power BI is a cloud-based business analytics service for analyzing and visualizing data. The steps for bringing Data Lake data into Power BI are as follows:
· Go to the Microsoft Azure dashboard and create a new Azure Data Lake Store
· Upload the dataset to the Azure Data Lake Store (a sketch of doing this programmatically follows this list)
· Connect to the Azure Data Lake Store in Power BI using the URL shown by Data Explorer in the Azure Data Lake Store
· Go to the Azure dashboard and open the Data Lake Store we have created
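The upload step can also be scripted. The sketch below assumes the azure-datalake-store Python package (for Data Lake Storage Gen1) and reuses the service principal from earlier; the store name and file paths are placeholders.

```python
from azure.datalake.store import core, lib, multithread

# Authenticate with the service principal created earlier (placeholder values).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<application-id>",
                 client_secret="<service-principal-key>")

# Connect to the Data Lake Store account and upload a local dataset file.
adls = core.AzureDLFileSystem(token, store_name="<store-name>")
multithread.ADLUploader(adls, lpath="sales.csv", rpath="/data/sales.csv",
                        nthreads=4, overwrite=True)

# Quick check that the file landed in the store.
print(adls.ls("/data"))
```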
Creating a Pipeline Using Data Factory – ETL (Extract, Transform, Load) Solution
We have a SQL database on our Azure SQL server, and we want to extract some data from it. While extracting, any processing that is required is applied, and the result is then stored in the Data Lake Store.
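A minimal sketch of such a pipeline, assuming the azure-mgmt-datafactory Python SDK, is shown below: one copy activity reads from Azure SQL and writes to Azure Data Lake Store. The dataset names ("SqlInputDataset", "LakeOutputDataset") and all other identifiers are placeholders and would have to point at the linked services created earlier.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDataLakeStoreSink, AzureSqlSource, CopyActivity, DatasetReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(
    ClientSecretCredential("<tenant-id>", "<app-id>", "<client-secret>"),
    "<subscription-id>")

# The two datasets are placeholders: one points at the Azure SQL table,
# the other at a folder in the Data Lake Store.
copy_activity = CopyActivity(
    name="CopySqlToLake",
    inputs=[DatasetReference(type="DatasetReference",
                             reference_name="SqlInputDataset")],
    outputs=[DatasetReference(type="DatasetReference",
                              reference_name="LakeOutputDataset")],
    source=AzureSqlSource(),            # extract from Azure SQL
    sink=AzureDataLakeStoreSink())      # load into Azure Data Lake Store

adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "CopySqlToLakePipeline",
    PipelineResource(activities=[copy_activity]))

# Kick off a one-time run; the schedule trigger shown earlier covers recurring runs.
run = adf_client.pipelines.create_run(
    "<resource-group>", "<factory-name>", "CopySqlToLakePipeline")
print("pipeline run id:", run.run_id)
```

Heavier transformation steps (for example a USQL or stored procedure activity) would be added as further activities in the same pipeline.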
Power BI also has self-service ETL built in. Azure Data Factory Data Flow is a preview feature in Azure Data Factory for visually creating ETL flows.
Azure Data Lake Analytics
For analytics, we have HDInsight and Azure Data Lake Analytics. Azure Data Lake Analytics is a distributed analytics service built on Apache YARN, which makes it similar to Hadoop, because Hadoop also uses YARN for distributed processing.
Distributed processing: Imagine we have a large amount of data and we want to process it in a distributed manner to speed things up. We want to split the data into slices, process every slice with the same instruction set, and finally combine the partial results. This is distributed processing, and that is what we get from Azure Data Lake Analytics.
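The split/process/combine idea can be sketched in plain Python; the word-count job and the four-way split below are only an illustration of the pattern, not how ADLA is implemented.

```python
from multiprocessing import Pool

def count_words(chunk):
    """Process one slice of the data with the same instruction set."""
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["row %d with some text" % i for i in range(1_000_000)]

    # 1. Separate the data into slices.
    n = 4
    slices = [lines[i::n] for i in range(n)]

    # 2. Process every slice in parallel.
    with Pool(n) as pool:
        partials = pool.map(count_words, slices)

    # 3. Combine the partial results.
    print(sum(partials))
```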
With Hadoop, we have to spend considerable time on provisioning, yet open-source Hadoop, or HDInsight in Azure, has mainly been what we use for processing data. Azure Data Lake Analytics simplifies this: we do not need to worry about provisioning clusters. We simply create a job for processing our dataset and submit it. There is no installation, configuration, or management of a big data cluster to worry about. Moreover, we are charged per job, i.e., for how many nodes we assign to the job and for how long the job runs.
Although we compare it with Hadoop to some extent, that does not mean it completely replaces the Hadoop ecosystem. In a distributed environment, scalability is a must; there should not be a limit on scaling out.
Azure Data Lake Analytics is an on-demand analytics job service that simplifies big data analytics. It gives us the ability to increase or decrease the processing power on a per-job basis. Say we are handling terabytes of data and need answers fast: we can assign more nodes to that particular job and get our insights quickly.
We make use of USQL for processing data. This language builds on languages Microsoft has been using for years: T-SQL and C# (.NET). We can express the processing instructions in USQL, and there are other ways too. Let's first discuss using USQL.
USQL can be written in Visual Studio, a familiar integrated development environment, and once written it can be submitted as a job to Data Lake Analytics. USQL can be used against many data sources.
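Jobs can also be submitted programmatically. The sketch below assumes the (older) azure-mgmt-datalake-analytics Python package, which still takes the classic ServicePrincipalCredentials; the ADLA account name, the credential values, and the embedded USQL script (a simple filter over a CSV file) are placeholders.

```python
import uuid

from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

# A small USQL script: define a schema over a raw CSV in the Data Lake Store
# at read time and write the filtered rows back out (paths are placeholders).
USQL_SCRIPT = r"""
@sales =
    EXTRACT Region string, Amount decimal
    FROM "/data/sales.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@big =
    SELECT Region, Amount
    FROM @sales
    WHERE Amount > 1000;

OUTPUT @big
    TO "/output/big_sales.csv"
    USING Outputters.Csv();
"""

credentials = ServicePrincipalCredentials(client_id="<application-id>",
                                          secret="<service-principal-key>",
                                          tenant="<tenant-id>")
job_client = DataLakeAnalyticsJobManagementClient(credentials,
                                                  "azuredatalakeanalytics.net")

# Each submission gets its own job id; degree_of_parallelism is the per-job
# number of nodes we pay for while the job runs.
job_id = str(uuid.uuid4())
job_client.job.create("<adla-account-name>", job_id,
                      JobInformation(name="FilterSales",
                                     type="USql",
                                     degree_of_parallelism=2,
                                     properties=USqlJobProperties(script=USQL_SCRIPT)))
print("submitted USQL job", job_id)
```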
Data Lake Analytics supports many enterprise features like integration, security, compliance,
role-based access control, and auditing. So, we can simply say that it is enterprise grade. If data
is in Azure Data Lake Store, we can expect good performance because Azure Data Lake Analytics
is optimized for working with Azure Data Lake Store.
It is also important to understand the difference between HDInsight and Azure Data Lake Analytics (ADLA): with HDInsight we provision, manage, and pay for a cluster for as long as it runs, whereas ADLA is a job service where we simply submit jobs, let the service allocate the nodes, and pay per job.
USQL works with both structured and unstructured data. We can process images stored in Azure
Data Lake Store with the help of image processing libraries. That’s how we process unstructured
data using USQL.
When a dataset has no predefined structure, we can define the structure we want and read the data through it. In other words, USQL supports schema-on-read: the schema is applied when the data is read, without the dataset ever being held in a predefined structure.
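As a language-neutral illustration of schema-on-read (plain Python, with made-up sample data), the raw text below has no schema of its own; column names and types are imposed only at read time, much like the EXTRACT clause in the earlier USQL sketch.

```python
import csv
import io

# Raw, schema-less text as it might sit in the lake (made-up sample data).
RAW = "EMEA,1500.50\nAPAC,900.00\nAMER,2700.25\n"

# The schema is defined at read time, not when the data was stored.
SCHEMA = [("region", str), ("amount", float)]

def read_with_schema(raw_text, schema):
    """Apply column names and types to untyped rows while reading them."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

for record in read_with_schema(RAW, SCHEMA):
    print(record)   # e.g. {'region': 'EMEA', 'amount': 1500.5}
```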