Lesson 09 - Introduction To Model Building
If the data generated in a single day were stored on Blu-ray discs and stacked up, the pile would be as tall as
four Eiffel Towers. Machine learning helps analyze data at this scale easily and quickly.
Machine Learning
Purpose of Machine Learning
Machine learning is a great tool to analyze data, find hidden data patterns and relationships, and extract
information to enable information-driven decisions and provide insights.
[Figure: data is analyzed to gain insights into unknown data and to make information-driven decisions.]
Machine Learning Terminologies
These are some machine learning terminologies that you will come across in this lesson:
Terms listed together are used interchangeably:
• Features, inputs, attributes
• Label, response, outcome, target
• Observations, records, examples, samples
Machine Learning Approach
The machine learning approach starts with either a problem that you need to solve or a given dataset that
you need to analyze.
1. Understand the problem/dataset
2. Extract the features from the dataset
3. Identify the problem type
4. Choose the right model
5. Train and test the model
6. Strive for accuracy
Steps 1 and 2: Understand the Dataset and Extract Its Features
Let us look at a dataset and understand its features in terms of machine learning.
Features (attributes, predictors)                              Response (label)
Education (yrs)    Professional training (1 = yes, 0 = no)    Hourly rate (USD)
16                 1                                           90
15                 0                                           65
12                 1                                           70
18                 1                                           130
16                 0                                           110
16                 1                                           100
15                 1                                           105
31                 0                                           70
Each row is one observation (record).
Steps 3 and 4: Identify the Problem Type and Learning Model
Machine learning can either be supervised or unsupervised. The problem type should be selected
based on the type of learning model.
Supervised learning:
• In supervised learning, the dataset used to train a model should have observations, features, and
responses. The model is trained to predict the right response for a given set of data points.
• Supervised learning models are used to predict an outcome.
• The goal of this model is to generalize a dataset so that the general rule can be applied to new data as
well.
Unsupervised learning:
• In unsupervised learning, the response or the outcome of the data is not known.
• Unsupervised learning models are used to identify and visualize patterns in data by grouping similar
types of data.
• The goal of this model is to represent data in a way that meaningful information can be extracted.
Data can either be continuous or categorical. Based on whether it is supervised or unsupervised learning, the
problem type will differ.
Example of a supervised problem: assigning news to categories based on the topics.
Example of an unsupervised problem: grouping similar stories on different news networks.
Working of Supervised Learning Model
In supervised learning, a known dataset with observations, features, and response is used to create and
train a machine learning algorithm. A predictive model, built on top of this algorithm, is then used to predict
the response for a new dataset that has the same features.
[Figure: known data (observations/records, features/attributes, and response/label) is used to train a machine
learning algorithm. The resulting predictive model takes new or unseen data (observations/records and
features/attributes) and returns a predicted response/label.]
Working of Unsupervised Learning Model
In unsupervised learning, a known dataset has a set of observations with features, but the response is not
known. The predictive model uses these features to identify how to classify and represent the data points of
new or unseen data.
[Figure: known data (observations/records and features/attributes) is passed to a machine learning algorithm.
The resulting predictive model takes new or unseen data (observations/records and features/attributes) and
returns a data representation.]
Steps 5 and 6: Train, Test, and Optimize the Model
To train supervised learning models, data analysts usually divide a known dataset into
training and testing sets.
[Figure: the known data (observations/records, features/attributes, and response/label) is split into a training
set (60%-80%) and a testing set (20%-40%). The training set is passed to the machine learning algorithm.]
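Model Training
Observation    Education (yrs)    Professional training (1 = yes, 0 = no)    Response: hourly rate (USD)
10             16                 1                                           90
45             15                 0                                           65
83             12                 1                                           70
45             18                 1                                           130
54             16                 0                                           110
67             16                 1                                           100
71             15                 1                                           105
31             15                 0                                           70
The observations are divided between the train set and the test set.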
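The split described above can be sketched with scikit-learn's train_test_split. This is a minimal illustration, not the lesson's lab code: the 25% test size and the random_state value are assumptions chosen for the example, and the rows are copied from the table above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Features from the table above: [education (yrs), professional training (1/0)]
X = np.array([[16, 1], [15, 0], [12, 1], [18, 1],
              [16, 0], [16, 1], [15, 1], [15, 0]])
# Response: hourly rate (USD), one value per observation
y = np.array([90, 65, 70, 130, 110, 100, 105, 70])

# Hold out 25% of the observations for testing (assumed value, within 20%-40%).
# random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)
```

With eight observations, six go to the training set and two to the testing set.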
Supervised Learning Model Considerations
A supervised learning model involves these considerations:
• Accuracy
• Performance optimization
• Generalization
• Features
• Response
Scikit-Learn
Scikit-learn is a powerful and modern machine learning library for Python, used for fully and
semi-automated data analysis and information extraction.
Scikit-learn helps data scientists organize their work through its problem-solution approach.
While working with a Scikit-Learn dataset or loading your own data to Scikit-Learn, consider
these points:
✔ Features and responses are stored as arrays, so they have shapes and sizes
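As a minimal sketch of the point about shapes, assuming the data is held in NumPy arrays (the variable names X and y are a common convention, not something the lesson prescribes), the first four rows of the lesson's sample table could be stored like this:

```python
import numpy as np

# Features: one row per observation, one column per feature
# [education (yrs), professional training (1 = yes, 0 = no)]
X = np.array([
    [16, 1],
    [15, 0],
    [12, 1],
    [18, 1],
])

# Response: hourly rate (USD), one value per observation
y = np.array([90, 65, 70, 130])

print(X.shape)  # (number of observations, number of features)
print(y.shape)  # (number of observations,)
```

The number of rows in X must match the number of entries in y.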
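The linear regression equation is based on the formula for a simple linear equation:
$y = \beta_0 + \beta_1 x$
where $y$ is the response, $x$ is the predictor variable, $\beta_1$ is the coefficient of $x$, and $\beta_0$ is the
intercept.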
Supervised Learning Models: Linear Regression
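[Figure: data points plotted with x (predictor variable) against y (response), together with the least squares
line. The line has slope/gradient dy/dx and meets the y-axis at the intercept (0, y). A residual is the difference
between an actual value and the value predicted by the line.]
Note: The model's attributes (the coefficient and the intercept) are usually fitted using the least squares approach.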
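A least squares fit can be sketched with scikit-learn's LinearRegression. The data here is synthetic and lies exactly on the line y = 2x + 1, an assumption made so the fitted slope and intercept are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points on the line y = 2x + 1 (illustrative data only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # predictor, one column
y = np.array([1.0, 3.0, 5.0, 7.0])          # response

model = LinearRegression()
model.fit(X, y)  # ordinary least squares fit

slope = model.coef_[0]        # coefficient of x
intercept = model.intercept_  # value of y at x = 0
predicted = model.predict(np.array([[4.0]]))

print(slope, intercept, predicted)
```

Because the points are perfectly linear, the fitted slope is 2 and the intercept is 1, up to floating-point rounding.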
[Figure: data points and the least squares line plotted with x (predictor variable) against y (response), with
SSE and SSR marked.]
SSE is the sum of squared errors. SSR is used here for the sum of squared residuals.
Note: The smaller the value of SSR or SSE, the more accurate the prediction will be, which makes the model a
better fit.
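To make the note concrete, the sketch below computes the sum of squared errors for two sets of predictions. The helper name sum_of_squared_errors and the predicted values are hypothetical, chosen only to show that a closer fit gives a smaller sum.

```python
def sum_of_squared_errors(actual, predicted):
    """Return the sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

actual = [90, 65, 70]

# Hypothetical predictions from two different models
close_fit = [88, 66, 71]
poor_fit = [80, 75, 60]

sse_close = sum_of_squared_errors(actual, close_fit)  # 4 + 1 + 1
sse_poor = sum_of_squared_errors(actual, poor_fit)    # 100 + 100 + 100

print(sse_close, sse_poor)
```

The first set of predictions has the smaller sum, so it is the better fit of the two.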
Problem Statement: Demonstrate how to create and train a linear regression model.
Access: To execute the practice, follow these steps:
• Go to the PRACTICE LABS tab on your LMS
• Click the START LAB button
• Click the LAUNCH LAB button to start the lab
Supervised Learning Models: Logistic Regression
Logistic regression is a generalization of the linear regression model used for classification problems.
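$\Pr(y = 1 \mid x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$
where $\Pr(y = 1 \mid x)$ is the probability of y = 1, given x, and $\beta_1$ is the change in the log-odds for a unit
change in x.
The above equation is the simplest logistic function used for performing logistic regression.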
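The logistic function can be sketched in a few lines of plain Python. The function name logistic and the parameter names beta0 and beta1 are assumptions for this example. The code uses the equivalent form 1 / (1 + e^(-z)) with z = beta0 + beta1 * x.

```python
import math

def logistic(x, beta0, beta1):
    """Return the probability of y = 1 given x for a one-feature logistic model."""
    z = beta0 + beta1 * x  # the log-odds, linear in x
    return 1.0 / (1.0 + math.exp(-z))

# With beta0 = 0 and beta1 = 1, the log-odds at x = 0 are 0,
# so both classes are equally likely.
p_mid = logistic(0.0, 0.0, 1.0)
p_high = logistic(5.0, 0.0, 1.0)
p_low = logistic(-5.0, 0.0, 1.0)

print(p_mid, p_high, p_low)
```

The output always lies between 0 and 1, which is why it can be read as a probability.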
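Class sklearn.linear_model.LogisticRegression takes, among others, these parameters:
• penalty: specifies the norm used in penalization
• dual: selects the dual formulation, which is implemented only for the L2 penalty
• C: the inverse of regularization strength
• fit_intercept: specifies whether the intercept is calculated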
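A minimal sketch of training sklearn.linear_model.LogisticRegression follows. The one-feature dataset is synthetic, and the parameter values (C=1.0, fit_intercept=True, random_state=0) are assumptions chosen for illustration. The penalty is left at the library default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: small values are class 0, large values are class 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# C is the inverse of regularization strength; fit_intercept adds an intercept term
clf = LogisticRegression(C=1.0, fit_intercept=True, random_state=0)
clf.fit(X, y)

new_points = np.array([[1.0], [8.0]])
labels = clf.predict(new_points)               # predicted class per row
probabilities = clf.predict_proba(new_points)  # one column per class

print(labels)
print(probabilities)
```

Each row of the probability array sums to 1, and the predicted label is the class with the higher probability.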
Supervised Learning Models: K-Nearest Neighbors
K-Nearest Neighbors, or K-NN, is one of the simplest machine learning algorithms used for both
classification and regression problem types.
[Figure: the neighborhood of a new data point for K=3 and for K=6.]
If you use this method for binary classification, choose an odd number for K to avoid a tied vote
between the two classes.
K-NN compares the features of a new or unseen data point with the features of the observations in the training
dataset. It classifies the new data point based on how similar it is to those observations.
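The idea can be sketched with scikit-learn's KNeighborsClassifier. The two-feature data is synthetic, and K=3 is an assumed value, chosen odd because the example is a binary classification.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training data: two well-separated groups of points
X_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Each new point takes the majority label of its 3 nearest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

new_points = np.array([[1.5, 1.5], [8.5, 8.5]])
predicted = knn.predict(new_points)

print(predicted)
```

The first new point sits among the class 0 observations and the second among the class 1 observations, so they receive those labels.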
K-NN and Logistic Regression Models
Problem Statement: Demonstrate the use of K-NN and logistic regression models.
Unsupervised Learning Models: Clustering
• Assign: Find the number of clusters and assign a mean to each cluster.
• Optimize: Iterate and optimize the mean of each cluster for its respective data points.
Unsupervised Learning Models: K-Means Clustering
K-means starts by assigning random centroids to a dataset. It then alternates between assigning each
data point to its nearest centroid and taking the mean of the data points in each resulting cluster as
the new centroid. It continues this process iteratively until the model is optimized, that is, until the
centroids stop changing.
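A minimal sketch of this process with scikit-learn's KMeans is shown below. The data is synthetic, and the settings n_clusters=2, n_init=10, and random_state=0 are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with two obvious groups and no response/label
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

# n_clusters must be specified up front; n_init repeats the run from
# several random starting centroids and keeps the best result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # cluster index for each data point
centroids = kmeans.cluster_centers_   # mean of the points in each cluster

print(labels)
print(centroids)
```

The cluster indices themselves are arbitrary. What matters is that points in the same group share an index.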
Problem Statement: Demonstrate how to use K-means clustering to classify data points.
Unsupervised Learning Models: Dimensionality Reduction
Dimensionality reduction transforms a high-dimensional dataset into a dataset with fewer dimensions. This makes it
easier and faster for the algorithm to analyze the data.
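As a sketch of dimensionality reduction, the example below uses sklearn.decomposition.PCA, the model named in this lesson's practice lab. The three-feature data is synthetic and nearly collinear, and n_components=2 is an assumed target dimension.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 observations, 3 features that vary almost together
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.1, 6.0],
              [3.0, 6.0, 9.2],
              [4.0, 8.1, 12.0],
              [5.0, 9.9, 15.1]])

# Keep the 2 directions that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The number of observations is unchanged. Only the number of columns drops, from three to two.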
Problem Statement: Demonstrate how to use the PCA model to reduce the dimensions of a dataset.
Pipeline
You can save your model for future use. This avoids the need to retrain the model.
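One way to save a trained model is Python's standard pickle module, sketched below. This is an assumption about the mechanism, since the lesson does not name one. The example serializes to bytes in memory; pickle.dump and pickle.load do the same with a file.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on synthetic data
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X, y)

# Serialize the trained model, then restore it without retraining
saved_bytes = pickle.dumps(model)
restored = pickle.loads(saved_bytes)

# The restored model gives the same predictions as the original
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Only load pickled models from sources you trust, and restore them with the same library version that saved them where possible.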
You can use the functions in the metrics module to evaluate the accuracy of your model’s predictions.
Classification: metrics.accuracy_score, metrics.average_precision_score
Clustering: metrics.adjusted_rand_score
Regression: metrics.mean_absolute_error, metrics.mean_squared_error, metrics.median_absolute_error
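A short sketch of three of these functions follows. The true and predicted values are made up for the example so that the results are easy to verify by hand.

```python
from sklearn import metrics

# Classification: 3 of the 4 predicted labels match the true labels
y_true_class = [0, 1, 1, 0]
y_pred_class = [0, 1, 0, 0]
accuracy = metrics.accuracy_score(y_true_class, y_pred_class)

# Regression: the errors are 1 and -2
y_true_reg = [3.0, 5.0]
y_pred_reg = [2.0, 7.0]
mae = metrics.mean_absolute_error(y_true_reg, y_pred_reg)  # (1 + 2) / 2
mse = metrics.mean_squared_error(y_true_reg, y_pred_reg)   # (1 + 4) / 2

print(accuracy, mae, mse)
```

Higher accuracy is better for classification, while lower error values are better for regression.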
Project 1: Create a Model to Predict the Sales Outcome
Problem Statement:
The given dataset contains ad budgets for different media channels and
the corresponding ad sales of the firm. Evaluate the dataset to:
• Find the features or media channels used by the firm
• Find the sales figures for each channel
• Create a model to predict the sales outcome
• Split the data into training and testing datasets for the model
• Calculate the Mean Squared Error (MSE)
Instructions to perform the assignment:
Download the FAA dataset from the “Resource” tab. Upload the dataset
to JupyterLab to view and evaluate it.
Project 2: List the Glucose Level Readings
Problem Statement:
The given dataset lists the glucose level readings of several pregnant
women taken either during a survey examination or routine medical care.
It specifies whether the two-hour post-load plasma glucose was at least 200
mg/dl. Analyze the dataset to:
• Find the features of the dataset
• Find the response label of the dataset
• Create a model to predict the diabetes outcome
• Use training and testing datasets to train and test the model
• Check the accuracy of the model
• Open the .NAMES file with a notepad application to view its text. Use this
file to view the features of the dataset and add them manually in your
code.
Key Takeaways
Knowledge Check 1
In machine learning, which one of the following is an observation?
a. Features
b. Attributes
c. Records
d. Labels
The regression algorithm belonging to the supervised learning model is best suited to analyze continuous data.
Knowledge Check 3
Identify the goal of unsupervised learning. Select all that apply.
The goal of unsupervised learning is to understand the structure of the data and represent it. There is no right or
certain answer in unsupervised learning.
Knowledge Check 4
The estimator instance in scikit-learn is a _____.
a. Model
b. Feature
c. Dataset
d. Response
b. Split the known dataset into separate training and testing sets
The best way to train a model is to split the known dataset into training and testing sets. The testing set varies from
20% to 40%.
Knowledge Check 6
Which of the following is true with a greater value of SSR or SSE? Select all that apply.
a. The prediction will be more accurate, making it the best fit model.
d. The model will not be the best fit for the attributes.
With higher SSR or SSE, the prediction will be less accurate and the model will not be the best fit for the attributes.
Knowledge Check 7
In the class sklearn.linear_model.LogisticRegression, the random_state parameter _____.
a. Indicates the seed of the pseudo random number generator used to shuffle data
In the class sklearn.linear_model.LogisticRegression, the random_state parameter indicates the seed of the
pseudo-random number generator used to shuffle data.
Knowledge Check 8
What are the requirements of the K-means algorithm? Select all that apply.
The K-means algorithm requires the number of clusters to be specified and the centroids to minimize inertia. It
requires several iterations to fine tune itself and meet the required criteria to become the best fit model.
Knowledge Check 9
In Class sklearn.decomposition.PCA, the transform(X) method, where X is multi-dimensional, _____.