Module I Complete Notes


Advanced Machine Learning Code: 18AI72

Module I - Advanced Machine Learning

Dr. Varalatchoumy M
Prof. & Head – Dept. of AIML
Head – CHOSS,
Cambridge Institute of Technology, Bangalore
6.1 | OVERVIEW
Machine learning algorithms are a subset of artificial intelligence (AI) techniques that imitate the human learning process.
Humans learn how to perform a task through multiple experiences.
Similarly, machine learning algorithms develop multiple models (usually using multiple datasets), and each model is analogous to an experience.
Mitchell (2006) defined machine learning as follows:
A machine learns with respect to a particular task T, performance metric P, and experience E, if the system reliably improves its performance P at task T following experience E.
• Let the task T be a classification problem
• Performance P can be measured through several metrics such as overall accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC)
• Experience E is analogous to different classifiers generated in machine learning
algorithms
The major difference between statistical learning and machine learning is that statistical learning depends heavily on validation of model assumptions and hypothesis testing, whereas the objective of machine learning is to improve prediction accuracy.

For example, while developing a regression model, we check for assumptions such as normality of residuals, significance of regression parameters, and so on. However, in the case of a random forest built from classification trees, the most important objective is the accuracy/performance of the model.

ML algorithms fall into two broad categories:
1. Supervised Learning: In supervised learning, the datasets have the values of input
variables (feature values) and the corresponding outcome variable. The algorithms learn
from the training dataset and predict the outcome variable for a new record with values
of input variables. Linear regression and logistic regression are examples of supervised
learning algorithms.
2. Unsupervised Learning: In this case, the datasets will have only input variable
values, but not the output. The algorithm learns the structure in the inputs. Clustering
and factor analysis are examples of unsupervised learning and will be discussed in
Chapter 7.
6.1.1 | How Machines Learn?
In supervised learning, the algorithm learns using a loss function (also called a cost function or error function), which is a function of the predicted output and the desired output. If h(Xi) is the predicted output for record i and yi is the desired output, then the loss function is

SSE = Σ (yi − h(Xi))²

where the sum runs over i = 1 to n, and n is the total number of records for which the predictions are made.
The function defined above is the sum of squared errors (SSE).
SSE is the loss function for a regression model.
The objective is to learn the values of parameters (aka feature weights) that minimize the
cost function.
Machine learning uses optimization algorithms to minimize the loss function.
The most widely used optimization technique is gradient descent.
In the next section, we will discuss a regression problem and understand how the gradient descent algorithm minimizes the loss function and learns the model parameters.

6.2 | GRADIENT DESCENT ALGORITHM


In this section, we will discuss how the gradient descent (GD) algorithm can be used for estimating the values of regression parameters, given a dataset with inputs and outputs. For a simple linear regression with bias b and weight w, the predicted value is Ŷi = b + wXi and the error to be minimized is given by

SSE = Σ (Yi − (b + wXi))²

where the sum runs over all n records.
6.2.1 | Developing a Gradient Descent Algorithm for Linear Regression Model

• To better understand the GD algorithm, we will implement it using the dataset Advertising.csv.
• The dataset contains examples of advertisement spend across multiple channels such as radio, TV, and newspaper, and the corresponding sales revenue generated at different time periods.

The dataset has the following elements:


1. TV – Spend on TV advertisements
2. Radio – Spend on radio advertisements
3. Newspaper – Spend on newspaper advertisements
4. Sales – Sales revenue generated

For predicting future sales using spends on different advertisement channels, we can build a regression
model.
6.2.1.1 Loading the Dataset
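A minimal sketch of this step, assuming the file Advertising.csv is in the working directory (the DataFrame name sales_df is illustrative):

import pandas as pd

# Load the advertising dataset (file name from the description above)
sales_df = pd.read_csv("Advertising.csv")
sales_df.head(5)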
6.2.1.2 Set X and Y Variables
For building a regression model, the inputs TV, Radio, and Newspaper are taken as the X features and Sales is taken as the outcome variable Y.
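A sketch of this step, continuing with the illustrative DataFrame name from the loading sketch above:

# Feature matrix X and outcome variable Y
X = sales_df[["TV", "Radio", "Newspaper"]].values
Y = sales_df["Sales"].values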
6.2.1.3 Standardize X and Y
It is important to convert all variables onto one scale. This can be done by subtracting the mean from each value of the variable and dividing by the corresponding standard deviation of the variable.
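A minimal sketch of this standardization with numpy; the original code may instead use sklearn's StandardScaler, which applies the same transformation:

import numpy as np

# Standardize: subtract the column mean and divide by the column standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
Y_scaled = (Y - Y.mean()) / Y.std()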
Method 1: Initialize the Bias and Weights

import numpy as np
import random

def initialize( dim ):
    # dim - the number of weights to be initialized besides the bias
    np.random.seed(seed=42)
    random.seed(42)
    # Initialize the bias.
    b = random.random()
    # Initialize the weights.
    w = np.random.rand( dim )
    return b, w
To initialize the bias and 3 weights, as we have three input variables TV, Radio and
Newspaper, we can invoke the initialize() method as follows:
b, w = initialize( 3 )
print( "Bias: ", b, "Weights: ", w )
Method 2: Predict Y Values from the Bias and Weights
Calculate the Y values for all the inputs, given the bias and weights. We will use matrix multiplication of the weights with the input variable values. The matmul() method in the numpy library can be used for matrix multiplication. Each row of X is multiplied with the weights column to produce the predicted outcome variable.
def predict_Y( b, w, X ):
    # Inputs:
    # b - bias
    # w - weights
    # X - the input matrix
    return b + np.matmul( X, w )
6.2.1.5 Finding the Optimal Bias and Weights
The updates to the bias and weights need to be done iteratively, until the cost is minimum. It can take several
iterations and is time-consuming. There are two approaches to stop the iterations:
1. Run a fixed number of iterations and use the bias and weights obtained at the end of these iterations as the optimal values.
2. Run iterations until the change in cost is small, that is, less than a predefined value (e.g., 0.001).

We will define a method run_gradient_descent(), which takes alpha and num_iterations as parameters
and invokes methods like initialize(), predict_Y(), get_cost(), and update_beta().

Also, inside the method:

1. the variable gd_iterations_df keeps track of the cost every 10 iterations.
2. a default value of 0.01 for the learning parameter (alpha) and 100 for the number of iterations will be used.

A sketch of these methods is given below.
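The following is a minimal sketch of what these methods might look like; the helper names follow the text (initialize(), predict_Y(), get_cost(), update_beta()), but the exact bodies are illustrative rather than the book's implementation:

import numpy as np
import pandas as pd

def get_cost(Y, Y_hat):
    # Mean squared error between actual and predicted values
    residuals = Y - Y_hat
    return np.sum(residuals ** 2) / len(residuals)

def update_beta(X, Y, Y_hat, b_0, w_0, learning_rate):
    # Gradients of the cost with respect to the bias and the weights
    db = (np.sum(Y_hat - Y) * 2) / len(Y)
    dw = (np.dot(Y_hat - Y, X) * 2) / len(Y)
    # Move the parameters in the direction opposite to the gradient
    b_1 = b_0 - learning_rate * db
    w_1 = w_0 - learning_rate * dw
    return b_1, w_1

def run_gradient_descent(X, Y, alpha=0.01, num_iterations=100):
    # Initialize the bias and one weight per input feature
    b, w = initialize(X.shape[1])
    gd_iterations_df = pd.DataFrame(columns=["iteration", "cost"])
    result_idx = 0
    for iter_num in range(num_iterations):
        Y_hat = predict_Y(b, w, X)
        this_cost = get_cost(Y, Y_hat)
        b, w = update_beta(X, Y, Y_hat, b, w, alpha)
        # Record the cost every 10 iterations
        if iter_num % 10 == 0:
            gd_iterations_df.loc[result_idx] = [iter_num, this_cost]
            result_idx += 1
    print("Final estimates of bias and weights:", b, w)
    return gd_iterations_df, b, w

Invoking run_gradient_descent(X_scaled, Y_scaled) with the defaults runs 100 iterations with a learning rate of 0.01 and returns the recorded costs along with the final bias and weights.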
6.3.1 | Steps for Building Machine Learning Models

The steps to be followed for building and validating a machine learning model and measuring its accuracy are as follows:
1. Identify the features and outcome variable in the dataset.
2. Split the dataset into training and test sets.
3. Build the model using the training set.
4. Predict the outcome variable for the test set.
5. Compare the predicted and actual values of the outcome variable in the test set and
measure accuracy using measures such as mean absolute percentage error (MAPE) or
root mean square error (RMSE).
6.3.1.2 Building Linear Regression Model with Train Dataset
Linear models are included in the sklearn.linear_model module. We will use the LinearRegression method for building the model and compare it with the results we obtained through our own implementation of the gradient descent algorithm.
https://scikit-learn.org/stable/modules/linear_model.html
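A sketch of fitting the model on a training split; the split proportion and variable names here are illustrative, continuing from the standardized advertising data above:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the standardized data into train and test sets (70:30 split is illustrative)
train_X, test_X, train_y, test_y = train_test_split(X_scaled, Y_scaled, test_size=0.3, random_state=42)

# Fit an ordinary least squares model on the training set
linreg = LinearRegression()
linreg.fit(train_X, train_y)
print("Intercept:", linreg.intercept_)
print("Coefficients:", linreg.coef_)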
6.3.1.4 Measuring Accuracy
Root Mean Square Error (RMSE) and R-squared are two key accuracy measures
for Linear Regression Models.
The sklearn.metrics package provides methods to compute various metrics.
For regression models, mean_squared_error and r2_score can be used to calculate
MSE and R-squared values, respectively.

## Importing metrics from sklearn


from sklearn import metrics
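A sketch of computing RMSE and R-squared on the test set, continuing with the illustrative variable names from the previous block:

import numpy as np

# Predict on the test set and compare with the actual values
pred_y = linreg.predict(test_X)
mse = metrics.mean_squared_error(test_y, pred_y)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(test_y, pred_y)
print("RMSE:", rmse, "R-squared:", r2)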
6.3.2 | Bias-Variance Trade-off

Model errors can be decomposed into two components: bias and variance.
Understanding these two components is key to diagnosing model accuracies
and avoiding model overfitting or underfitting.
High bias can lead to an underfitting model, whereas high variance can lead to an overfitting model.

The term "variance" refers to the degree of change that may be expected in the estimate of the target function as a result of using different sets of training data. "Bias" refers to the systematic disparity between the predicted values and the values that are actually observed.
6.4 | ADVANCED REGRESSION MODELS

In this section, we will use the IPL auction dataset.

First, we will build a linear regression model to understand the shortcomings and then proceed to advanced
regression models.

6.4.1.1 Loading IPL Dataset

Load the dataset and display information about the dataset using the following commands:

ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )


ipl_auction_df.info()
6.4.1.3 Split the Dataset into Train and Test

Split the dataset into train and test sets with an 80:20 split. The random_state (seed value) is set to 42 for reproducibility of results.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled,
    Y,
    test_size=0.2,
    random_state=42)
## Sorting the features by coefficient values in descending order
sorted_coef_vals = columns_coef_df.sort_values( 'coef', ascending=False )
A few observations from Figure 6.10 are as follows:
1. AVE, ODI-RUNS-S, and SIXERS are the top three most influential features in determining a player's SOLD PRICE.
2. Higher ECON, SR-B, and AGE have a negative effect on SOLD PRICE.
3. Interestingly, higher test runs (T-Runs) and highest score (HS) have a negative effect on SOLD PRICE. Note that a few of these counter-intuitive coefficient signs could be due to multicollinearity. For example, we expect SR-B (batting strike rate) to have a positive effect on SOLD PRICE.
6.4.2 | Applying Regularization
One way to deal with overfitting is regularization. It is observed that overfitting is typically caused by inflation
of the coefficients. To avoid overfitting, the coefficients should be regulated by penalizing potential inflation
of coefficients. Regularization applies penalties on parameters if they inflate to large values and keeps them
from being weighted too heavily.
6.4.2.2 LASSO Regression

sklearn.linear_model provides LASSO regression for building linear models by applying an L1 penalty. Two key parameters for LASSO regression are:

1. alpha – float – multiplies the L1 term. The default value is set to 1.0.
2. max_iter – int – the maximum number of iterations for the solver.
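A minimal sketch of fitting a LASSO model on the IPL training split created above; the alpha value here is illustrative, not necessarily the one used in the book:

from sklearn.linear_model import Lasso

# L1-regularized linear regression (alpha value is illustrative)
lasso = Lasso(alpha=1.0, max_iter=500)
lasso.fit(X_train, y_train)
print("LASSO coefficients:", lasso.coef_)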
6.4.2.3 Elastic Net Regression

ElasticNet regression combines both L1 and L2 regularization to build a regression model. The corresponding cost function adds both penalty terms to the sum of squared errors:

cost = Σ (yi − ŷi)² + λ1 Σ |βj| + λ2 Σ βj²

where λ1 and λ2 control the strengths of the L1 and L2 penalties, respectively.
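A sketch of fitting an ElasticNet model on the same training split; the alpha and l1_ratio values are illustrative:

from sklearn.linear_model import ElasticNet

# Combines L1 and L2 penalties; l1_ratio controls the mix between them
enet = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=500)
enet.fit(X_train, y_train)
print("ElasticNet coefficients:", enet.coef_)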
6.5 | ADVANCED MACHINE LEARNING ALGORITHMS

In this section, we will take a binary classification problem and explore it through
machine learning algorithms such as K-Nearest Neighbors (KNN), Random Forest,
and Boosting.
Bank marketing dataset available at the University of California, Irvine machine
learning repository is used in this section for the demonstration of various techniques.
The dataset is based on a telemarketing campaign carried out by a Portuguese bank for
subscription of a term deposit.
The data has several features related to the potential customers and whether or not they subscribed to the term deposit (the outcome).
The objective, in this case, is to predict which customers are likely to respond to the marketing campaign and open a term deposit with the bank.
The response variable Y = 1 implies that the customer subscribed to a term deposit after the campaign, and 0 otherwise. The marketing campaign is based on phone calls.
We can use the following commands for reading
the dataset and printing a few records.

bank_df = pd.read_csv('bank.csv')
bank_df.head(5)
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.3, random_state=42)
6.5.2.2 Confusion Matrix
We develop a custom method draw_cm() to draw the confusion matrix. This method will be used to draw the confusion matrix for the models we discuss in the subsequent sections of the chapter. It takes the actual and predicted class labels to draw the confusion matrix; a sketch of such a method is given below. Usage and interpretation of a confusion matrix are already explained in detail in Chapter 5.
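A minimal sketch of such a method, assuming seaborn and matplotlib are available; the exact labels and styling in the book's version may differ:

import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import metrics

def draw_cm(actual, predicted):
    # Build the confusion matrix from actual and predicted class labels
    cm = metrics.confusion_matrix(actual, predicted)
    # Plot it as an annotated heatmap (rows: actual, columns: predicted)
    sn.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted label")
    plt.ylabel("Actual label")
    plt.show()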
In the confusion matrix in Figure 5.2, the columns represent the predicted label (class), while the rows
represent the actual label (class).

1. The top-left quadrant represents actual bad credit that is correctly classified as bad credit. These are called True Positives (TP).
2. The bottom-left quadrant represents actual good credit that is incorrectly classified as bad credit. These are called False Positives (FP).
3. The top-right quadrant represents actual bad credit that is incorrectly classified as good credit. These are called False Negatives (FN).
4. The bottom-right quadrant represents actual good credit that is correctly classified as good credit. These are called True Negatives (TN).
5.3.9 | Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC)

The receiver operating characteristic (ROC) curve can be used to understand the overall performance (worth) of a logistic regression model (and, in general, of classification models) and can be used for model selection.

The term has its origin in electrical engineering, where electrical signals were used for detecting enemy objects (such as submarines and aircraft) during World War II.

Given a random pair of positive and negative class records, the AUC gives the proportion of such pairs that will be correctly ranked by the model.
ROC curve is a plot between sensitivity (true positive rate) on the vertical axis and
1 – specificity (false positive rate) on the horizontal axis.

We will write a method draw_roc() which takes the actual classes and predicted
probability values and then draws the ROC curve (Figure 5.4).

metrics.roc_curve() returns different threshold (cut-off) values and their corresponding false positive and true positive rates. These values can then be plotted to create the ROC curve. metrics.roc_auc_score() returns the area under the curve (AUC).
Plotting ROC Curve
To visualize the ROC curve, a utility method draw_roc_curve() is implemented, which takes the model, the test set, and the actual labels of the test set to draw the ROC curve. It returns the auc_score, false positive rate (FPR), and true positive rate (TPR) values for different thresholds (cut-off probabilities) ranging from 0.0 to 1.0. This method will be used for all future ML models that we will be discussing in subsequent sections. The ROC AUC curve is discussed in detail in Section 5.2.10 (ROC and AUC), Chapter 5.
The following custom method is created for plotting the ROC curve and calculating the area under it.
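A sketch of such a method, assuming a fitted sklearn classifier that exposes predict_proba(); the book's exact implementation may differ in details:

import matplotlib.pyplot as plt
from sklearn import metrics

def draw_roc_curve(model, test_X, test_y):
    # Predicted probabilities of the positive class
    pred_prob = model.predict_proba(test_X)[:, 1]
    # FPR and TPR at different probability cut-offs
    fpr, tpr, thresholds = metrics.roc_curve(test_y, pred_prob)
    auc_score = metrics.roc_auc_score(test_y, pred_prob)
    # Plot the ROC curve along with the diagonal reference line
    plt.plot(fpr, tpr, label="ROC curve (AUC = %0.2f)" % auc_score)
    plt.plot([0, 1], [0, 1], "k--")
    plt.xlabel("False Positive Rate (1 - Specificity)")
    plt.ylabel("True Positive Rate (Sensitivity)")
    plt.legend(loc="lower right")
    plt.show()
    return auc_score, fpr, tpr, thresholds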
sklearn has a grid search mechanism (GridSearchCV) in which one or multiple hyperparameters can be searched for the optimal values, that is, the values for which the model gives the highest accuracy.

The search mechanism is a brute-force approach: it evaluates all the possible values and finds the most optimal ones.
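A sketch of using GridSearchCV to tune, for example, the number of neighbors in a KNN classifier on the bank dataset split created earlier; the parameter grid and scoring choice here are illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate hyperparameter values to evaluate exhaustively
tuned_parameters = {"n_neighbors": [5, 6, 7, 8, 9]}

# Evaluate each candidate with 10-fold cross-validation and ROC AUC as the score
clf = GridSearchCV(KNeighborsClassifier(), tuned_parameters, cv=10, scoring="roc_auc")
clf.fit(train_X, train_y)
print("Best parameters:", clf.best_params_)
print("Best score:", clf.best_score_)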
Advanced Machine Learning Code: 18AI72

Module I - Forecasting

Dr. Varalatchoumy M
Prof. & Head – Dept. of AIML
Head – CHOSS,
Cambridge Institute of Technology, Bangalore
8.1 | FORECASTING OVERVIEW

Forecasting is by far the most important and frequently used application of predictive
analytics because it has significant impact on both the top line and the bottom line of
an organization.
Every organization prepares long-range and short-range plans, and forecasting demand for products and services is an important input for both long-range and short-range planning.
Capacity planning problems such as manpower planning, machine capacity, warehouse capacity, and materials requirements planning (MRP) depend on the forecasted demand for the product/service.
Budget allocations for marketing promotions and advertisements are usually made based on the forecasted demand for the product.
8.2 | COMPONENTS OF TIME-SERIES DATA

The time-series data Yt is a random variable, usually collected at regular time intervals and in chronological order.

If the time-series data contains observations of just a single variable (such as demand of a product at time t), then it is termed univariate time-series data.

If the data consists of more than one variable, for example, demand for a product at time t, price at time t, amount of money spent by the company on promotion at time t, competitors' price at time t, etc., then it is called multivariate time-series data.
There are several forecasting techniques such as moving average, exponential
smoothing, and Auto-Regressive Integrated Moving Average (ARIMA) that are
used across various industries.

Moving average and exponential smoothing predict the future value of a time series as a function of past observations.
8.3.2 | Forecasting Using Moving Average
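A minimal sketch of computing the 12-period moving average used in the plotting code below; the DataFrame and column names follow that code, and the series is assumed to have already been loaded into wsb_df:

# 12-period simple moving average of the sales series (assumed already loaded)
wsb_df['mavg_12'] = wsb_df['Sale Quantity'].rolling(window=12).mean()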
We use the following code to plot the actual versus the predicted values from moving average forecasting:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.xlabel("Months")
plt.ylabel("Quantity")
plt.plot(wsb_df['Sale Quantity'][12:], label='Actual');
plt.plot(wsb_df['mavg_12'][12:], '.', label='Moving average forecast');
plt.legend();
8.5 | AUTO-REGRESSIVE INTEGRATED MOVING AVERAGE MODELS

Auto-regressive (AR) and moving average (MA) models are popular models that are frequently used for forecasting.

AR and MA models are combined to create models such as auto-regressive moving average (ARMA) and auto-regressive integrated moving average (ARIMA) models.

ARMA models are basically regression models; auto-regression simply means regression of a variable on itself measured at different time periods.
8.5.1 | Auto-Regressive (AR) Models
T     Y    ACF
1     5     0
2     8     3
3     7    -1
4     4    -3
5     3    -1
6     9     6
7     8    -1
8     6    -2
9     6     0
10    5    -2

T     Y    ACF   PACF
1     5     0      0
2     8     3
3     7    -1
4     4    -3     -1
5     3    -1
6     9     6
7     8    -1      4
8     6    -2
9     6     0
10    5    -2     -3
The model summary indicates that the AR term with lag 1 is a significant variable in the model; the corresponding p-value (0.0056) is less than 0.05.
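A sketch of fitting such an AR(1) model with statsmodels; the book may use an older ARIMA API, and the series name here follows the moving-average example above:

from statsmodels.tsa.arima.model import ARIMA

# AR model with one lag: ARIMA(p=1, d=0, q=0)
ar_model = ARIMA(wsb_df['Sale Quantity'], order=(1, 0, 0))
ar_result = ar_model.fit()
print(ar_result.summary())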
8.5.4 | ARIMA Model

ARMA models can be used only when the time-series data is stationary. ARIMA models are used when
the time-series data is non-stationary. Time-series data is called stationary if the mean, variance, and
covariance are constant over time. ARIMA model was proposed by Box and Jenkins (1970) and thus
is also known as Box−Jenkins methodology. ARIMA has the following three components and is represented
as ARIMA (p, d, q):
1. AR component with p lags AR(p).
2. Integration component (d).
3. MA with q lags, MA(q).

The main objective of the integration component is to convert a non-stationary time-series process to a
stationary process so that the AR and MA processes can be used for forecasting.
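A sketch of fitting an ARIMA(p, d, q) model on the same series, where d = 1 differences the series once to make it stationary; the order used here is illustrative:

from statsmodels.tsa.arima.model import ARIMA

# ARIMA with one AR lag, first-order differencing, and one MA lag (illustrative order)
arima_model = ARIMA(wsb_df['Sale Quantity'], order=(1, 1, 1))
arima_result = arima_model.fit()
print(arima_result.summary())

# Forecast the next three periods
print(arima_result.forecast(steps=3))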
8.5.4.1 What is Stationary Data?
Time-series data should satisfy the following conditions to be stationary:
1. The mean values of Yt at different values of t are constant.
2. The variances of Yt at different time periods are constant (homoscedasticity).
3. The covariance of Yt and Yt−k for different lags k depends only on k and not on time t.
