Databricks
Lab 1 - Getting Started with Spark
Overview
In this lab, you will provision a Databricks workspace and a Spark cluster. You will then use the Spark
cluster to explore data interactively.
Note: To set up the required environment for the lab, follow the instructions in the Setup document for
this course. Specifically, you must have signed up for an Azure subscription.
3. In the Spark Config settings for your new cluster, add the following lines, replacing your_storage_account with the name of your Azure storage account and your_key1_value with one of its access keys:
fs.azure.account.key.your_storage_account.blob.core.windows.net your_key1_value
spark.hadoop.fs.azure.account.key.your_storage_account.blob.core.windows.net your_key1_value
Note: The first setting enables code that uses the newer Dataframe-based API to access your
storage account. The second setting is used by the older RDD-based API. A runtime alternative to
these cluster-level settings is sketched after the next step.
4. Wait for the cluster to be created.
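As an alternative to setting these values in the cluster configuration, you can set the storage key at runtime from a notebook. The following minimal Python sketch is not part of the lab steps; it assumes the same placeholder account name and key shown above, and it uses PySpark's internal _jsc handle to reach the Hadoop configuration:
Python
# Equivalent of the first setting: read by the Dataframe-based API
spark.conf.set(
    "fs.azure.account.key.your_storage_account.blob.core.windows.net",
    "your_key1_value")

# Equivalent of the second setting: the RDD-based API reads the Hadoop
# configuration rather than the Spark session configuration
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.your_storage_account.blob.core.windows.net",
    "your_key1_value")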
Exploring Data Interactively with Spark RDDs
Now that you have provisioned a Spark cluster, you can use it to analyze data. In this exercise, you will
use Spark Resilient Distributed Datasets (RDDs) to load and explore data. The RDD-based API is an
original component of Spark, and has largely been superseded by a newer Dataframe-based API;
however, there are many production systems (and code examples on the Web) that use RDDs, so it’s
worth starting your exploration of Spark there.
1. In the folder where you extracted the lab files for this course on your local computer, in the data
folder, verify that the KennedyInaugural.txt file exists. This file contains the data you will explore
in this exercise.
2. Start Azure Storage Explorer, and if you are not already signed in, sign into your Azure
subscription.
3. Expand your storage account and the Blob Containers folder, and then double-click the spark
blob container you created previously.
4. In the Upload drop-down list, click Upload Files. Then upload KennedyInaugural.txt as a block
blob to a new folder named data in the root of the spark container.
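If you prefer to script the upload rather than use Azure Storage Explorer, the following minimal Python sketch shows one way to do it with the azure-storage-blob package (version 12 or later). It is not part of the lab steps, and <connection_string> is a placeholder for the connection string from your storage account's Access keys page:
Python
from azure.storage.blob import BlobServiceClient

# <connection_string> is a placeholder; copy the real value from the Azure portal
service = BlobServiceClient.from_connection_string("<connection_string>")
blob = service.get_blob_client(container="spark", blob="data/KennedyInaugural.txt")

# Upload the local file as a block blob, overwriting any existing blob
with open("data/KennedyInaugural.txt", "rb") as f:
    blob.upload_blob(f, overwrite=True)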
Create a Notebook
Most interactive data analysis in Databricks is conducted using notebooks. These browser-based
interactive documents enable you to combine notes in Markdown format with code that you can run
right in the notebook – with no requirement to install a local code editing environment. In this exercise,
you can choose to write your code using Python or Scala.
1. In the Databricks workspace, click Workspace. Then click Users, click your user name, and in the
drop-down menu for your user name, point to Create and click Notebook.
2. Create a new notebook with the following settings:
• Name: RDDs
• Language: Choose Python or Scala as preferred.
• Cluster: Your cluster
3. In the new notebook, in the first cell, enter the following code to add some Markdown text:
%md
# Kennedy Inauguration
This notebook contains code to analyze President Kennedy’s inauguration speech.
4. Click anywhere in the notebook outside of the first cell to see the formatted markdown, which
should look like this:
Kennedy Inauguration
This notebook contains code to analyze President Kennedy’s inauguration speech.
5. Hold the mouse pointer under the center of the bottom edge of the cell until a (+) symbol is
displayed, and then click it to insert a new cell.
6. In the new cell, type the following code, replacing <account> with the fully qualified name of
your Azure Storage account (account_name.blob.core.windows.net):
Python
txt = sc.textFile("wasbs://spark@<account>/data/KennedyInaugural.txt")
txt.count()
Scala
val txt = sc.textFile("wasbs://spark@<account>/data/KennedyInaugural.txt")
txt.count()
In this code, the variable sc is the Spark context for your cluster; it is created automatically
within the notebook.
7. With the code cell selected, at the top left of the cell, click the run menu button, and then click
Run Cell to run the cell. After a few seconds, the code will run and display the number of lines of
text in the text file as Out[1].
8. Add a new cell and enter the following command to view the first line in the text file.
Python
txt.first()
Scala
txt.first()
9. Run the new cell and note that the first line of the speech is displayed as Out[2].
10. Add a new cell and enter the following command to create a new RDD named filtTxt that filters
the txt RDD so that only lines containing the word “freedom” are included, and counts the
filtered lines.
Python
filtTxt = txt.filter(lambda line: "freedom" in line)
filtTxt.count()
Scala
val filtTxt = txt.filter(line => line.contains("freedom"))
filtTxt.count()
11. Run the new cell and note that the number of lines containing “freedom” is returned as Out[3].
12. Add a new cell and enter the following command to display the contents of the filtTxt RDD.
Python
filtTxt.collect()
Scala
filtTxt.collect()
13. Run the new cell and note that the lines containing “freedom” are returned as Out[4].
14. Add a new cell and enter the following command to split the full speech into words, count the
number of times each word occurs, and display the counted words in descending order of
frequency.
Python
words = txt.flatMap(lambda line: line.split(" "))
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.sortBy(lambda pair: pair[1], ascending=False).collect()
Scala
val words = txt.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.sortBy(_._2, ascending = false).collect().foreach(println)
15. Run the new cell and review the output, which shows the frequency of each word in the speech
in descending order.
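Note that these counts treat differently capitalized and punctuated forms (for example, "Freedom" and "freedom,") as distinct words. The following Python sketch, which is not part of the lab steps, shows one possible refinement that normalizes case and strips punctuation before counting, assuming the same txt RDD:
Python
import re

# Lowercase each line and extract alphabetic word tokens, ignoring punctuation
words = txt.flatMap(lambda line: re.findall(r"[a-z']+", line.lower()))

# Count occurrences of each normalized word and return the ten most frequent
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.sortBy(lambda pair: pair[1], ascending=False).take(10)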
Exploring Data Interactively with Spark Dataframes
Now that you have explored data using the RDD-based API, in this exercise you will use the newer
Dataframe-based API to query road traffic accident data in comma-delimited text files.
1. In the folder where you extracted the lab files for this course on your local computer, in the data
folder, verify that the Accidents.csv and Vehicles.csv files exist. These files contain the data you
will explore in this exercise.
2. Start Azure Storage Explorer, and if you are not already signed in, sign into your Azure
subscription.
3. Expand your storage account and the Blob Containers folder, and then double-click the spark
blob container you created previously in this lab.
4. In the Upload drop-down list, click Upload Files. Then upload Accidents.csv and Vehicles.csv as
block blobs to the data folder in the root of the spark container that you created previously in this
lab.
Work with Dataframes
In this procedure, you will use your choice of Python or Scala to query the road traffic accident data
in the comma-delimited text files you have uploaded. Notebooks containing the necessary steps to
explore the data have been provided; a representative sketch of the kind of code they contain
appears after the steps below.
1. In the Databricks workspace, click Workspace. Then click Users, click your user name, and in the
drop-down menu for your user name, click Import.
2. Browse to the folder where you extracted the lab files. Then select either Dataframes.ipynb or
Dataframes.scala, depending on your preferred choice of language (Python or Scala), and
upload it.
3. Open the notebook you uploaded, and in the Detached drop-down menu, attach the notebook
to your Spark cluster.
4. Read the notes and run the code cells to explore the data.
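The provided notebooks contain all of the required code. As a representative illustration only, the following Python sketch shows the kind of Dataframe operations they perform; the file paths match the data folder you created earlier, but the Accident_Index join column is an assumption about the file layout rather than a confirmed detail, and <account> is your storage account's fully qualified name:
Python
# Read the uploaded CSV files into Dataframes, inferring column names and types
accidents = spark.read.csv("wasbs://spark@<account>/data/Accidents.csv",
                           header=True, inferSchema=True)
vehicles = spark.read.csv("wasbs://spark@<account>/data/Vehicles.csv",
                          header=True, inferSchema=True)

# Review the inferred schema and a sample of rows
accidents.printSchema()
accidents.show(5)

# Join each accident to the vehicles involved in it; Accident_Index is a
# hypothetical shared key column
accidents.join(vehicles, "Accident_Index").count()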
Clean Up
Note: If you intend to proceed straight to the next lab, skip this section. Otherwise, follow the steps
below to delete your Azure resources and avoid being charged for them when you are not using them.