Lab 3 - Enabling Team Based Data Science With Azure Databricks
Pre-requisites: It is assumed that you have read the case study for this lab and have completed the content and lab for Module 1: Azure for the Data Engineer.
Lab files: The files for this lab are located in the Allfiles\Labfiles\Starter\DP-200.3 folder.
Lab overview
By the end of this lab, the student will be able to explain why Azure Databricks can be used to help in data science projects. The student will provision an Azure Databricks instance, create a workspace, and use it to perform simple data preparation tasks against a Data Lake Storage Gen2 store. Finally, the student will perform a walk-through of transformations using Azure Databricks.
Lab objectives
After completing this lab, you will be able to:
Scenario
In response to a request from the Information Services (IS) department, you will start the process of building a predictive analytics platform by listing the benefits of using the technology. Data scientists will be joining the department, and IS wants to ensure that a predictive analytics environment is available to the new team members.
You will stand up and provision an Azure Databricks environment, and then test that the environment works by performing a simple data preparation routine that ingests data from a pre-existing Data Lake Storage Gen2 account. As a data engineer, you have been told that you may be required to help the data scientists perform data preparation exercises. To that end, it has been recommended that you walk through a notebook that can help you perform basic transformations.
IMPORTANT: As you go through this lab, make a note of any issue(s) that you encounter in any provisioning or configuration tasks and log them in the table in the document located at \Labfiles\DP-200-Issues-Doc.docx. Document the lab number, note the technology, describe the issue, and record the resolution. Save this document, as you will refer back to it in a later module.
Individual exercise
1. From the content you have learned in this course so far, identify the digital
transformation requirement that Azure Databricks will meet and a candidate data source
for Azure Databricks.
Result: After you complete this exercise, you will have created a Microsoft Word document that identifies the digital transformation requirement that Azure Databricks will meet and a candidate data source.
Individual exercise
3. In the New screen, click in the Search the Marketplace text box, and type the
word databricks. Click Azure Databricks in the list that appears.
5. In the Azure Databricks Service blade, create an Azure Databricks Workspace with the
following settings:
o Subscription: the name of the subscription you are using in this lab
o Location: the name of the Azure region which is closest to the lab location and
where you can provision Azure VMs.
Note: The provisioning will take approximately 3 minutes. The Databricks Runtime is built on top of Apache Spark and is natively built for the Azure cloud. Azure Databricks completely abstracts away the infrastructure complexity and the need for specialized expertise to set up and configure your data infrastructure. For data engineers who care about the performance of production jobs, Azure Databricks provides a Spark engine that is faster and more performant through various optimizations at the I/O and processing layers (Databricks I/O).
3. In the Resource groups screen, click on the awrgstudxx resource group, where xx are your initials.
4. In the awrgstudxx screen, click awdbwsstudxx, where xx are your initials, to open your Azure Databricks service.
Task 3: Launch a Databricks Workspace and create a Spark Cluster.
1. In the Azure portal, in the awdbwsstudxx screen, click on the button Launch
Workspace.
Note: You will be signed into the Azure Databricks Workspace in a separate tab in
Microsoft Edge.
3. In the Create Cluster screen, under New Cluster, create a Databricks Cluster with the
following settings, and then click on Create Cluster:
o Pool: None
o Make sure you select the Terminate after 60 minutes of inactivity check box, so that the cluster is terminated automatically if it is not being used.
Note: The creation of the Azure Databricks cluster will take approximately 10 minutes; the creation of a Spark cluster is simplified through the graphical user interface. You will note that the State shows Pending while the cluster is being created. This will change to Running when the cluster is created.
Note: While the cluster is being created, go back and perform Exercise 1.
Individual exercise
1. Return to Microsoft Edge, and under Interactive Clusters confirm that the State column is set to Running for the cluster named awdbclstudxx, where xx are your initials.
Task 2: Collect the Azure Data Lake Store Gen2 account name
1. In Microsoft Edge, click on the Azure portal tab, click Resource groups, and then
click awrgstudxx, and then click on awdlsstudxx, where xx are your initials.
2. In the awdlsstudxx screen, under settings, click on Access keys, and then click on the
copy icon next to the Storage account name and paste it into Notepad.
Task 3: Enable your Databricks instance to access the Data Lake Gen2
Store.
1. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon.
3. In the Microsoft - App registrations screen, click on the + New registration button.
4. In the Register an application screen, provide the name DLAccess, and under the Redirect URI (optional) section, ensure Web is selected and type https://adventure-works.com/exampleapp for the application value. After setting the values, click Register.
6. In the DLAccess registered app screen, copy the Application (client) ID and Directory
(tenant) ID and paste both into Notepad.
7. In the DLAccess registered app screen, click on Certificates and Secrets, and then click + New Client Secret.
8. In the Add a client secret screen, type a description of DL Access Key and a duration of In 1 year for the key. When done, click Add.
Important: When you click on Add, the key value will appear. You only have one opportunity to copy this key value into Notepad.
10. Assign the Storage Blob Data Contributor permission to your resource group. In the
Azure portal, click on the Home hyperlink, and then the Resource groups icon, click on
the resource group awrgstudxx, where xx are your initials.
15. In the Add role assignment blade, under Select, select DLAccess, and then click Save.
16. In the Azure portal, click the Home hyperlink, and then click the Azure Active Directory icon. Note your role. If you have the User role, you must make sure that non-administrators can register applications.
17. Click Users, and then click User settings in the Users - All users blade. Check the App registrations setting. This value can only be set by an administrator. If set to Yes, any user in the Azure AD tenant can register an app.
20. Click on the Copy icon next to the Directory ID to get your tenant ID and paste this into Notepad.
2. In the Azure Databricks blade on the left of Microsoft Edge, under Workspace, click on the drop-down next to Workspace, point to Create, and then click on Notebook.
5. Ensure that Cluster shows the name of the cluster that you created earlier, and then click on Create.
Note: This will open up a Notebook with the title My Notebook (Scala).
6. In the Notebook, in the cell Cmd 1, copy the following code and paste it into the cell:
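The cell's contents follow the standard pattern for configuring OAuth 2.0 service-principal access to a Data Lake Storage Gen2 account from Spark. A minimal sketch of the kind of code this cell contains, assuming that pattern (the placeholder names match those referenced in the next step; the lab's exact cell may differ):

```scala
// Configure Spark to authenticate to the Data Lake Storage Gen2 account
// using the DLAccess service principal registered earlier. Replace the
// <...> placeholders with the values saved in Notepad.
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<authentication-id>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
  "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
```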
13. In this code block, replace the application-id, authentication-id, tenant-id, file-system-name and storage-account-name placeholder values with the values that you collected earlier and saved in Notepad.
14. In the Notebook, in the cell under Cmd 1, click on the Run icon and click on Run Cell.
Note: A message will be returned at the bottom of the cell that states "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx"
Task 5: Read data in Azure Databricks.
1. In the Notebook, hover your mouse at the top right of cell Cmd 1, and click on the Add Cell Below icon. A new cell will appear named Cmd 2.
2. In the Notebook, in the cell Cmd 2, copy the following code and paste it into the cell:
//Read JSON data in Azure Data Lake Storage Gen2 file system

val df = spark.read.json("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/preferences.json")
5. In this code block, replace the file-system-name placeholder with the word logs, and the storage-account-name placeholder with the value that you collected earlier and saved in Notepad.
6. In the Notebook, in the cell under Cmd 2, click on the Run icon and click on Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx"
7. In the Notebook, hover your mouse at the top right of cell Cmd 2, and click on the Add Cell Below icon. A new cell will appear named Cmd 3.
8. In the Notebook, in the cell Cmd 3, copy the following code and paste it into the cell:
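The exercise Result below states that a table output showing people's preferences is created, so this cell presumably displays the DataFrame that was read in Cmd 2. A minimal sketch under that assumption (the lab's actual cell may differ):

```scala
// Display the contents of the DataFrame read from preferences.json
// as a table of results.
df.show()
```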
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx"
Result: In this exercise, you performed the steps needed to set up the permissions for Azure Databricks to access data in an Azure Data Lake Storage Gen2 account. You then used Scala to connect to the Data Lake Store, read data, and created a table output showing the preferences of people.
3. Add an Annotation
2. In the Notebook, in the cell Cmd 4, copy the following code and paste it into the cell:
//Retrieve specific columns from a JSON dataset in Azure Data Lake Storage Gen2 file system

val specificColumnsDf = df.select("firstname", "lastname", "gender", "location", "page")
specificColumnsDf.show()
6. In the Notebook, in the cell under Cmd 4, click on the Run icon and click on Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx"
Task 2: Performing a column rename on a Dataset
1. In the Notebook, hover your mouse at the top right of cell Cmd 4, and click on the Add Cell Below icon. A new cell will appear named Cmd 5.
2. In the Notebook, in the cell Cmd 5, copy the following code and paste it into the cell:
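The annotation in Task 3 states that the column "page" is renamed to "bike_preference", so the cell likely uses Spark's withColumnRenamed transformation. A sketch under that assumption (the variable name renamedColumnsDf is illustrative, not taken from the lab):

```scala
// Rename the "page" column to "bike_preference" and display the result.
// renamedColumnsDf is an assumed, illustrative variable name.
val renamedColumnsDf = specificColumnsDf.withColumnRenamed("page", "bike_preference")
renamedColumnsDf.show()
```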
6. In the Notebook, in the cell under Cmd 5, click on the Run icon and click on Run Cell.
Note: A message will be returned at the bottom of the cell that states that a Spark job has executed, a table of results is returned, and "Command took 0.0X seconds -- by person at 4/4/2019, 2:46:48 PM on awdbclstudxx"
Task 3: Adding Annotations
1. In the Notebook, hover your mouse at the top right of cell Cmd 5, and click on the Add Cell Below icon. A new cell will appear named Cmd 6.
2. In the Notebook, in the cell Cmd 6, copy the following code and paste it into the cell:
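In an Azure Databricks notebook, an annotation cell begins with the %md magic command, which renders the rest of the cell as Markdown rather than running it as Scala. A sketch of such a cell, reusing the description given in the next step (the exact wording of the lab's cell is an assumption):

```
%md
This code connects to the Data Lake Storage filesystem named "Data" and reads
data in the preferences.json file stored in that data lake. Then a simple query
has been created to retrieve data, and the column "page" has been renamed to
"bike_preference".
```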
3. This code connects to the Data Lake Storage filesystem named "Data" and reads data in
the preferences.json file stored in that data lake. Then a simple query has been
created to retrieve data and the column "page" has been renamed to "bike_preference".
4. In the Notebook, in the cell under Cmd 6, click on the down pointing arrow icon and
click on Move up. Repeat until the cell appears at the top of the Notebook.
Note: A future lab will explore how this data can be exported to another data platform technology.
Result: After you complete this exercise, you will have created an annotation within a notebook.
If the URLs are inaccessible, there is a copy of the notebooks in the Allfiles\Labfiles\Starter\DP-200.3\Post Course Review folder.
Basic transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL below and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-basic-data-transformation-in-azure-
databricks/blob/master/DBC/05.1-Basic-ETL.dbc?raw=true
1. Select Import.
2. After the import, a folder named 05.1-Basic-ETL should appear. Select that folder.
3. The folder contains one or more notebooks that you can use to learn basic transformations using Scala or Python.
Follow the instructions within the notebook, until you've completed the entire notebook. Then
continue with the remaining notebooks in order:
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.
Advanced transformations
1. Within the Workspace, using the command bar on the left, select Workspace, then Users, and select your username (the entry with the house icon).
2. In the blade that appears, select the downwards pointing chevron next to your name,
and select Import.
3. On the Import Notebooks dialog, select URL below and paste in the following URL:
https://github.com/MicrosoftDocs/mslearn-perform-advanced-data-transformation-in-azure-
databricks/blob/master/DBC/05.2-Advanced-ETL.dbc?raw=true
1. Select Import.
2. After the import, a folder named 05.2-Advanced-ETL should appear. Select that folder.
3. The folder contains one or more notebooks that you can use to learn advanced transformations using Scala or Python.
Follow the instructions within the notebook, until you've completed the entire notebook. Then
continue with the remaining notebooks in order:
[Note] You'll find corresponding notebooks within the Solutions subfolder. These contain
completed cells for exercises that ask you to complete one or more challenges. Refer to these if
you get stuck or simply want to see the solution.