Lab 1 - Getting Started With Azure ML
Essentials
Lab 1 – Getting Started with Azure Machine Learning
Overview
In this lab, you will learn how to open and navigate the Microsoft Azure Machine Learning (Azure ML)
Studio. You will also learn how to create and run experiments in Azure ML.
Note: The goal of this lab is to familiarize yourself with the Azure ML environment and some of the
modules and techniques you will use in subsequent labs. Details of how to visualize and manipulate data,
and how to build and evaluate machine learning models will be discussed in more depth later in the
course.
Note: To set up the required environment for the lab, follow the instructions in the Setup Guide for this
course.
Tip: To organize the labs for this course you can click Projects and create a new project. You can
add your datasets and experiments to this project so they are easy to find in the future.
Create an Experiment
1. In the studio, at the bottom left, click NEW. Then in the Experiment category, in the collection of
Microsoft samples, select Blank Experiment. This creates a blank experiment, which looks similar
to the following image.
2. Change the title of your experiment from “Experiment created on today’s date” to “Bank Credit”.
1. From the folder where you extracted the lab files for this module (for example,
C:\DAT203.1x\Mod1), open the Credit-Scoring-Clean.csv file, using either a spreadsheet
application such as Microsoft Excel, or a text editor such as Microsoft Windows Notepad.
2. View the contents of the Credit-Scoring-Clean.csv file, noting that it contains data on 950
customer cases. You can see the column headers for 20 features (data columns which can be
used to train a machine learning model) and the label (the column indicating the actual credit
status of the customers). Your data file should appear as shown here:
Note: The information in some of these features (columns) is in a coded format, for example A14 or A11. You
can see the meaning of these codes on the UCI Machine Learning repository at
https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).
3. Close the text file and return to your browser where your experiment is displayed. At the bottom
left, click NEW. Then in the NEW dialog box, click the DATASET tab as shown in the following
image.
4. Click FROM LOCAL FILE. Then in the Upload a new dataset dialog box, browse to select the
Credit-Scoring-Clean.csv file from the folder where you extracted the lab files on your local
computer. Enter the following details as shown in the image below, and then click the check mark (✓) icon.
This is a new version of an existing dataset: Unselected
Enter a name for the new dataset: Credit-Scoring-Clean
Select a type for the new dataset: Generic CSV file with a header (.csv)
Provide an optional description: Bank credit scoring data.
5. Wait for the upload of the dataset to complete, then click OK on the status bar at the bottom of
the Azure ML Studio screen.
6. On the experiment items pane, expand Saved Datasets > My Datasets to verify that the
Credit-Scoring-Clean dataset is listed.
7. Drag the Credit-Scoring-Clean dataset to the canvas for the Bank Credit experiment.
8. Verify that the Azure ML screen, which shows your experiment, now looks like the figure shown
here:
9. Click the output port for the Credit-Scoring-Clean dataset on the canvas and click Visualize to
view the data in the dataset as shown in the figure:
Note: The output port can be accessed by clicking on the small circle on the bottom of the
module boxes, pointed to by the red arrow in the figure.
10. Click on the second column labeled Duration, which will display some properties of that feature
(data column) on the right side of the display. These properties include summary statistics and the
data type, as shown here:
11. Verify that the dataset contains the data you viewed in the source file.
12. Using the scroll bar on the right side of the display, scroll down until you can see the histogram of
the Duration feature as shown here:
13. On the data display, scroll to the right and click CreditStatus. Scroll down in the pane on the right
and observe the histogram, which should appear as shown below. Note that CreditStatus has
two values, {0,1}, and that the numbers of cases with each value are approximately balanced.
14. Click the ‘x’ in the upper right corner to close the visualization.
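The summary statistics and label counts shown in the Visualize pane can also be approximated outside Azure ML. The following is a minimal sketch using pandas, not part of the lab files. The helper name summarize and the four-row sample are hypothetical; only the column names Duration and CreditStatus come from the lab, and the values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for Credit-Scoring-Clean.csv.
# Only the column names come from the lab; the values are invented.
SAMPLE_CSV = """Duration,CreditStatus
6,0
48,1
12,0
24,1
"""


def summarize(frame, feature, label):
    """Return summary statistics for one feature and the case count per label value."""
    stats = frame[feature].describe()
    counts = frame[label].value_counts().sort_index()
    return stats, counts


credit = pd.read_csv(io.StringIO(SAMPLE_CSV))
duration_stats, status_counts = summarize(credit, "Duration", "CreditStatus")

print(duration_stats)   # count, mean, std, min, quartiles, max
print(status_counts)    # number of cases for each CreditStatus value
```

To try this on the real data, replace io.StringIO(SAMPLE_CSV) with the path to your local copy of Credit-Scoring-Clean.csv.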
2. Connect the Results Dataset output of the Select Columns in Dataset module to the Dataset1
(leftmost) input of the Execute Python Script module as shown here:
3. Select the Execute Python Script module, set the Python Version to the latest available version
of Python, and then replace the existing code in the code editor pane with the following code,
which drops the SexAndStatus and OtherDetorsGuarantors columns. You can copy and paste
this code from dropcols.py in the lab files folder for this lab.
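def azureml_main(creditframe):
    # Columns to remove from the input data frame
    drop_cols = ['SexAndStatus',
                 'OtherDetorsGuarantors']
    # Drop the columns in place and return the modified frame
    creditframe.drop(drop_cols, axis = 1, inplace = True)
    return creditframe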
Tip: To paste code from the clipboard into the code editor in the Azure ML Properties pane,
press CTRL+A to select the existing code, and then press CTRL+V to paste the code from the
clipboard, to replace the existing code.
4. Save and Run the experiment, and when it has finished running, visualize the Results Dataset
(left-hand) output of the Execute Python Script module. Note that there are now 18 columns, as
another two have been removed.
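Before pasting a script into the Execute Python Script module, it can be useful to check it locally. The following is a minimal sketch that calls the same azureml_main logic on a small pandas data frame. The two-row sample frame is hypothetical; only the column names come from the lab, and the values are invented.

```python
import pandas as pd


def azureml_main(creditframe):
    # Same logic as the lab's dropcols.py
    drop_cols = ['SexAndStatus',
                 'OtherDetorsGuarantors']
    creditframe.drop(drop_cols, axis=1, inplace=True)
    return creditframe


# Hypothetical two-row sample; the values are invented for illustration.
sample = pd.DataFrame({
    'Duration': [6, 48],
    'SexAndStatus': ['A93', 'A92'],
    'OtherDetorsGuarantors': ['A101', 'A101'],
    'CreditStatus': [0, 1],
})

# Pass a copy, because the function modifies its argument in place.
result = azureml_main(sample.copy())
print(list(result.columns))  # ['Duration', 'CreditStatus']
```

Because the drop is performed in place, passing a copy keeps the original sample frame available for comparison.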
2. Connect the Results Dataset (left) output of the Execute Python Script module to the
Dataset1 (leftmost) input of the Execute R Script module.
3. Select the Execute R Script module, set the R Version to the latest available version of Microsoft
R Open, and then replace the existing R code in the code editor window of the Execute R Script
module with the following code. You can copy and paste this code from dropcols.R in the lab
files folder for this lab.
4. Save and Run the experiment. Then, when it has finished running, visualize the Results Dataset
(left-hand) output of the Execute R Script module. Note that there are now 16 columns.
2. Connect the Results Dataset (left) output of the Execute R Script module to the Table1
(leftmost) input of the Apply SQL Transform module.
3. Replace the existing SQL code in the code editor window of the Apply SQL Transform module
with the following code. You can copy and paste this code from selectcols.sql in the lab files
folder for this lab.
select
CheckingAcctStat,
Duration,
CreditHistory,
Purpose,
Savings,
Employment,
InstallmentRatePecnt,
PresentResidenceTime,
Property,
Age,
Telephone,
CreditStatus
from t1;
4. Save and Run the experiment. Then, when it has finished running, visualize the Results Dataset
output of the Apply SQL Transform module. Note that it contains only the 12 columns named in the
SQL select statement.
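The effect of the select statement can be checked locally with Python's built-in sqlite3 module. This is a sketch under the assumption that a SQLite-style dialect is close enough to what the Apply SQL Transform module accepts for a simple column selection. The table t1 is built in memory with the 12 selected columns plus one hypothetical column, ExtraColumn, and a single row of invented values.

```python
import sqlite3

# The 12 columns named in the lab's select statement.
SELECTED = [
    "CheckingAcctStat", "Duration", "CreditHistory", "Purpose", "Savings",
    "Employment", "InstallmentRatePecnt", "PresentResidenceTime",
    "Property", "Age", "Telephone", "CreditStatus",
]
# Hypothetical column standing in for the ones the query leaves out.
EXTRA = ["ExtraColumn"]

all_cols = SELECTED + EXTRA
conn = sqlite3.connect(":memory:")

# Build an in-memory table named t1, matching the name used in the query.
conn.execute("create table t1 ({})".format(", ".join(all_cols)))
placeholders = ", ".join("?" for _ in all_cols)
conn.execute(
    "insert into t1 values ({})".format(placeholders),
    list(range(len(all_cols))),  # invented values, one per column
)

# Same shape as the lab's query: an explicit column list selected from t1.
query = "select\n" + ",\n".join(SELECTED) + "\nfrom t1;"
cursor = conn.execute(query)
result_columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

print(len(result_columns))  # 12
```

Listing the columns explicitly, rather than using select *, is what removes every column not named in the statement.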
Note: The purpose of these exercises is to give you a feel for working with machine learning models in
Azure Machine Learning. In subsequent chapters and in the next course we will explore the theory of
operation and evaluation for machine learning models.
2. Connect the output of the Apply SQL Transform module to the input of the Split Data
module.
3. Select the Split Data module, and in the Properties pane, view the default split settings, which
split the data randomly into two datasets. Set these properties as follows:
Splitting mode: Split Rows
Fraction of rows in the first output dataset: 0.7
Randomized split: checked
Random seed: 876
Stratified split: False
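The settings above describe a seeded, non-stratified random 70/30 split of the rows. The following is a rough local analogue using pandas, offered as a sketch only: the helper name split_rows and the row_id frame are hypothetical, and the same seed will not select the same rows as the Split Data module, because the two use different random number generators.

```python
import pandas as pd


def split_rows(frame, fraction=0.7, seed=876):
    """Randomly split rows into two frames; assumes the index values are unique."""
    first = frame.sample(frac=fraction, random_state=seed)
    # Everything not sampled into the first frame goes to the second.
    second = frame.drop(first.index)
    return first, second


# Hypothetical frame with 950 rows, matching the number of customer cases.
data = pd.DataFrame({"row_id": range(950)})
train_rows, test_rows = split_rows(data)

print(len(train_rows), len(test_rows))  # 665 285
```

Fixing the seed makes the split repeatable, so that later runs of the experiment are trained and evaluated on the same rows.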
Summary
This lab has familiarized you with the essentials of using the Azure ML Studio environment. In this lab you
have used built-in Azure ML functionality, Python, R and SQL to select the features used for training a
machine learning model. You then created, trained, and evaluated a first machine learning model to
classify bank customers as good or bad credit risks.
In the rest of this course, you will learn how to employ a range of techniques to prepare data for
modeling, to build effective models, and to evaluate model performance to create a suitably accurate
predictive solution.