Notes - Machine Learning
machine learning. The Machine Learning algorithm's operation is depicted in the following
block diagram:
Following are some key points that show the importance of Machine Learning:
• Rapid increase in the production of data
• Solving complex problems that are difficult for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1) Supervised Learning
In supervised learning, sample labeled data are provided to the machine learning system for
training, and the system then predicts the output based on the training data.
The system uses labeled data to build a model that understands the datasets and learns about
each one. After the training and processing are done, we test the model with sample data to
see if it can accurately predict the output.
The objective of supervised learning is to map the input data to the output data. Supervised learning relies on supervision, much as a student learns under the guidance of a teacher. Spam filtering is an example of supervised learning.
Supervised learning can be grouped further in two categories of algorithms:
• Classification
• Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any
supervision.
The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision. The goal of unsupervised learning is to restructure the input data into new
features or a group of objects with similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
• Clustering
• Association
3) Reinforcement Learning
• 1943: A human neural network was modeled with an electrical circuit. In 1950, scientists started putting this idea to work and analyzed how human neurons might function.
In this topic, we will provide a detailed description of the types of Machine Learning along
with their respective algorithms:
1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on that training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with the input and corresponding output, and then we ask the machine to predict the output on a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), and so on. After training, we input the picture of a cat and ask the machine to identify the object and predict the output. Since the machine is now well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, find that it is a cat, and put it in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.
• Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
• Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
accordingly so that it can generate maximum profit. This algorithm is mainly applied
in Market Basket analysis, Web usage mining, continuous production, etc.
Some popular Association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-Growth algorithm.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground
between Supervised (With Labelled training data) and Unsupervised learning (with no
labelled training data) algorithms and uses the combination of labelled and unlabeled datasets
during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning and operates on data that contains a few labels, it mostly consists of unlabeled data. Labels are costly to obtain, so in practice an organisation may have only a few of them. This distinguishes it from supervised and unsupervised learning, which are defined by the presence or absence of labels respectively.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent is rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning, there is no labelled data like supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in the form of rewards and punishments.
Due to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized as a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
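To make this agent-environment loop concrete, below is a small, self-contained sketch of MDP-style interaction; the toy environment, states, and rewards are invented purely for illustration:
import random

# A toy MDP: states 0..4, actions move left (-1) or right (+1);
# reaching state 4 gives a reward of +1 and ends the episode.
def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1 if next_state == 4 else 0
    done = next_state == 4
    return next_state, reward, done

state = 0
total_reward = 0
while True:
    action = random.choice([-1, 1])            # the agent picks an action
    state, reward, done = step(state, action)  # the environment responds with a new state
    total_reward += reward                     # the agent accumulates rewards
    if done:
        break
print("Total reward collected:", total_reward)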
• Too much reinforcement learning can lead to an overload of states which can
weaken the results.
4. Data Transformation
In this stage, the data is prepared and developed into a form appropriate for Data Mining. Techniques here include dimension reduction (for example, feature selection, feature extraction, and record sampling) and attribute transformation (for example, discretization of numerical attributes and functional transformations). This step can be essential for the success of the entire KDD project, and it is typically very project-specific. For example, in medical assessments, the ratio of attributes may often be the most significant factor rather than each attribute by itself. In business, we may need to consider effects beyond our control, as well as efforts and transient issues, for example, studying the cumulative impact of advertising. However, even if we do not use the right transformation at the start, we may obtain a surprising effect that hints at the transformation required in the next iteration. Thus, the KDD process feeds back on itself and prompts an understanding of the transformation required.
5. Prediction and description
We are now ready to decide which kind of Data Mining to use, for example, classification, regression, or clustering. This mainly depends on the KDD objectives and on the previous steps. There are two significant objectives in Data Mining: the first is prediction, and the second is description. Prediction is usually referred to as supervised Data Mining, while descriptive Data Mining incorporates the unsupervised and visualization aspects of Data Mining. Most Data Mining techniques depend on inductive learning, where a model is built explicitly or implicitly by generalizing from an adequate number of training examples. The fundamental assumption of the inductive approach is that the trained model applies to future cases. The technique also takes into account the level of meta-learning for the specific set of accessible data.
6. Selecting the Data Mining algorithm
Having chosen the technique, we now decide on the strategy. This stage involves selecting a particular method to be used for searching for patterns, possibly involving multiple inducers. For example, when weighing precision against understandability, the former is better served by neural networks, while the latter is better served by decision trees. For each strategy of meta-learning, there are several possibilities for how it can be applied. Meta-learning focuses on explaining what causes a Data Mining algorithm to succeed or fail on a specific problem. Thus, this methodology attempts to understand the conditions under which a Data Mining algorithm is most suitable. Each algorithm has parameters and learning strategies, such as ten-fold cross-validation or another split for training and testing.
7. Utilizing the Data Mining algorithm
At last, we reach the implementation of the Data Mining algorithm. In this stage, we may need to run the algorithm several times until a satisfying outcome is obtained, for example, by tuning the algorithm's control parameters, such as the minimum number of instances in a single leaf of a decision tree.
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives defined in the first step. Here we consider the preprocessing steps in terms of their impact on the Data Mining algorithm's results, for example, adding a feature in step 4 and repeating from there. This step focuses on the comprehensibility and utility of the induced model. The identified knowledge is also documented for further use. The last step is the application of, and overall feedback on, the discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now we are ready to incorporate the knowledge into another system for further action. The knowledge becomes active in the sense that we may make changes to the system and measure the effects. The success of this step determines the effectiveness of the whole KDD process. There are numerous challenges in this step, such as losing the "laboratory conditions" under which we have worked. For example, the knowledge was discovered from a certain static snapshot (usually a sample of the data), but now the data becomes dynamic: data structures may change, certain quantities may become unavailable, and the data domain may be modified, for instance an attribute may take a value that was not expected previously.
entire dataset originally collected, which means it should contain sufficient information to be representative. The data is also divided into training and validation sets.
Explore: In this phase, activities are carried out to understand the gaps in the data and the relationships between variables. Two key activities are univariate and multivariate analysis. In univariate analysis, each variable is examined individually to understand its distribution, whereas in multivariate analysis the relationships between variables are explored. Data visualization is used heavily to help understand the data better. In this step, we analyse all the factors that influence our outcome.
Modify: In this phase, variables are cleaned where required. New derived features are created by applying business logic to existing features based on the requirements, and variables are transformed if necessary. The outcome of this phase is a clean dataset that can be passed to the machine learning algorithm to build the model. In this step, we check whether the data has been fully transformed; if further transformation is needed, we can use a label encoder or label binarizer, as sketched below.
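As a sketch of that transformation, scikit-learn's LabelEncoder and LabelBinarizer can be used as follows; the category values here are made up purely for illustration:
from sklearn.preprocessing import LabelEncoder, LabelBinarizer

cities = ["Delhi", "Mumbai", "Chennai", "Delhi"]   # hypothetical categorical feature

le = LabelEncoder()
print(le.fit_transform(cities))    # [1 2 0 1] - one integer per category (alphabetical order)

lb = LabelBinarizer()
print(lb.fit_transform(cities))    # one binary column per category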
Model: In this phase, various modelling or data mining techniques are applied to the pre-processed data to benchmark their performance against the desired outcomes. In this step, we perform all the mathematical modelling that makes our outcome more precise and accurate.
Assess: This is the last phase. Here, model performance is evaluated against the test data (not used in model training) to ensure reliability and business usefulness. Finally, in this step, we perform the evaluation and interpretation of the results: we compare our model's outcome with the actual outcome, analyse our model's limitations, and try to overcome those limitations.
===000===
From a machine learning perspective, data is the lifeblood of the entire process. Machine
learning is all about developing algorithms and models that can learn patterns, make
predictions, and automate decision-making tasks based on data. Here's how data fits into the
machine learning pipeline:
1. Data Collection: This is the starting point of any machine learning project. You
gather data from various sources, which could include sensors, databases, web
scraping, user inputs, and more. The quality and quantity of data play a crucial role
in the success of a machine learning model.
2. Data Preprocessing: Raw data often needs to be cleaned and preprocessed. This
includes handling missing values, normalizing data, encoding categorical variables,
and removing outliers. Proper preprocessing is essential to ensure that the data is in
a format that can be used by machine learning algorithms.
3. Feature Engineering: This is the process of selecting or creating relevant features
from the data. Feature engineering can significantly impact the model's
performance. It involves domain knowledge, creativity, and data analysis to decide
which features are most informative for the task at hand.
4. Data Splitting: The data is typically split into training, validation, and testing sets.
The training set is used to train the model, the validation set is used to fine-tune
hyperparameters, and the testing set is used to evaluate the model's performance.
5. Model Training: Machine learning algorithms learn from the training data to build
models. These models can be classifiers, regressors, clustering models, or more
advanced deep learning networks. During training, the model optimizes its
parameters to minimize the difference between its predictions and the actual target
values.
6. Model Evaluation: After training, the model is evaluated using the validation and
testing datasets. Evaluation metrics such as accuracy, precision, recall, F1 score, and
mean squared error are used to assess how well the model performs.
7. Model Fine-Tuning: Based on the evaluation results, hyperparameters may be
adjusted to optimize the model's performance. This process may involve iterations
of training and evaluation until a satisfactory model is obtained.
8. Model Deployment: Once a model is developed and validated, it can be deployed
in a real-world application. This can involve integrating the model into a software
system, a website, or any other relevant platform.
9. Monitoring and Maintenance: Machine learning models require ongoing
monitoring and maintenance. As new data becomes available, the model may need
to be retrained or updated to ensure it continues to make accurate predictions.
10. Feedback Loop: In some cases, machine learning models can benefit from a
feedback loop. This involves collecting data on the model's predictions in real-world
scenarios and using this feedback to further improve the model.
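To make steps 4 to 6 above concrete, here is a minimal scikit-learn sketch of splitting, training, and evaluating a model; the dataset and classifier are placeholders, not a prescription:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small example dataset (placeholder for your own data)
X, y = load_iris(return_X_y=True)

# Step 4: split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 5: train a model on the training data
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Step 6: evaluate the model on the held-out test data
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))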
In summary, data is the foundation of machine learning. The quality of data, along with the
effectiveness of data preprocessing and feature engineering, greatly influences the success of
a machine learning project. The ultimate goal is to develop a model that can learn from data
and make accurate predictions or automate decision-making based on that data.
Nominal Scale
A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with the non-
numeric variables or the numbers that do not have any value.
Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this
measurement mechanism, the answer should fall into either of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of
numbers in the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender?
M- Male
F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.
Ordinal Scale
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking of
data without establishing the degree of variation between them. Ordinal represents the
“order.” Ordinal data is known as qualitative data or categorical data. It can be grouped,
named and also ranked.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the
rankings of those variables
• The interval properties are not known
• The surveyors can quickly analyse the degree of agreement concerning the identified
order of variables
Example:
• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ratings in restaurants
• Evaluating the frequency of occurrences
o Very often
o Often
o Not often
o Not at all
• Assessing the degree of agreement
o Totally agree
o Agree
o Neutral
o Disagree
o Totally disagree
Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative
measurement scale in which the difference between the two variables is meaningful. In other
words, the variables are measured in an exact manner, not as in a relative way in which the
presence of zero is arbitrary.
Characteristics of Interval Scale:
• The interval scale is quantitative as it can quantify the difference between the values
Ratio Scale
The ratio scale is the 4th level of measurement scale, which is quantitative. It is a type of
variable measurement scale. It allows researchers to compare the differences or intervals.
The ratio scale has a unique feature. It possesses the character of the origin or zero points.
Characteristics of Ratio Scale:
• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The variables can be orderly
added, subtracted, multiplied, divided. Mean, median, and mode can be calculated
using the ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it allows unit
conversions like kilogram – calories, gram – calories, etc.
Example:
An example of a ratio scale is:
What is your weight in Kgs?
• Less than 55 kgs
• 55 – 75 kgs
• 76 – 85 kgs
• 86 – 95 kgs
• More than 95 kgs
can afford to discard those instances or features without significantly impacting the
quality of your dataset. However, this approach may result in loss of valuable
information.
3. Imputation: Imputation is the process of filling in missing values with estimated or
predicted values. There are several methods for imputing missing data:
a. Mean, Median, or Mode Imputation: Replace missing values with the
mean, median, or mode of the observed values in that column. This is a
simple and quick method, but it can introduce bias if the missing data is not
missing at random.
b. Constant Value Imputation: Replace missing values with a predefined
constant value. For example, you might replace missing values with zero or
a specific value that makes sense in the context of your data.
c. Regression Imputation: Use regression models to predict missing values
based on the relationships between the missing feature and other features.
This is a more sophisticated approach, and it can capture complex
relationships in the data.
d. K-Nearest Neighbors (KNN) Imputation: For each missing value, find
the K-nearest data points (based on other features) and impute the missing
value as a weighted average of the K-nearest neighbors' values.
e. Multiple Imputation: This method involves generating multiple imputed
datasets with different imputed values and then aggregating the results.
Multiple imputation accounts for the uncertainty associated with imputing
missing data.
4. Missing Data Indicators: Create binary indicator variables to represent
missingness in the dataset. This allows the model to learn from the fact that data
was missing, which can be informative. However, it increases the dimensionality of
the data.
5. Advanced Techniques: There are advanced imputation methods, such as using
machine learning models (e.g., decision trees, random forests, or deep learning) to
predict missing values based on the available data. These methods may be more
accurate when there is a complex relationship between features.
6. Domain Knowledge: In some cases, domain knowledge can guide you in
determining the best approach for handling missing data. For example, you may
know that certain missing values are meaningful and should be treated differently.
7. Time-Series Interpolation: When dealing with time-series data, you can use time-
based interpolation techniques to estimate missing values based on the values before
and after the missing point.
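As a brief illustration of the mean and KNN imputation methods described above, here is a minimal scikit-learn sketch on an invented array:
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation: replace NaN with the column mean
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: replace NaN with an average over the nearest rows
print(KNNImputer(n_neighbors=2).fit_transform(X))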
The choice of method for dealing with missing data depends on the nature and amount of
missing data, the specific problem you're working on, and the goals of your machine learning
project. It's important to carefully consider the implications of your chosen method on the
quality and fairness of your model. Additionally, cross-validation and model evaluation
should be performed to ensure that the chosen approach does not introduce bias or
adversely affect the model's performance.
Note: We will be using Python libraries such as NumPy, Pandas, and scikit-learn to handle these values.
Let us get started. To understand the various methods, we will work on the Titanic dataset:
1. Deleting Rows
This method is commonly used to handle null values. Here, we delete a particular row if it has a null value for a particular feature, and we delete a particular column if more than 70-75% of its values are missing. This method is advised only when there are enough samples in the dataset. One has to make sure that no bias is added after the data has been deleted. Removing data leads to loss of information, which may not give the expected results while predicting the output.
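A short sketch of this deletion strategy, assuming the Titanic data is loaded from a hypothetical file name into a pandas DataFrame, is:
import pandas as pd

df = pd.read_csv("titanic.csv")          # assumed file name for the Titanic dataset

# Drop rows that contain a null value in any column
df_rows_dropped = df.dropna(axis=0)

# Drop columns in which more than 75% of the values are missing
threshold = 0.75
df_cols_dropped = df.loc[:, df.isnull().mean() <= threshold]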
Pros:
• Complete removal of data with missing values can result in a robust and highly accurate model
• Deleting a particular row or column with no specific information is better, since it does not carry a high weightage
Cons:
• Loss of information and data
• Works poorly if the percentage of missing values is high (say 30%), compared to the
whole dataset
To replace missing values with the median or mode, we can use the following:
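A minimal sketch, assuming the Titanic data is already loaded in a DataFrame df and that 'Age' is the numeric column with missing values, might look like this:
# Assumed: df is the Titanic DataFrame and 'Age' contains nulls
median_value = df['Age'].median()
mode_value = df['Age'].mode()[0]

df['Age_median'] = df['Age'].fillna(median_value)   # median imputation
df['Age_mode'] = df['Age'].fillna(mode_value)       # mode imputation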
Pros:
• This is a better approach when the data size is small
• It can prevent data loss which results in removal of the rows and columns
Cons:
• Imputing approximations adds variance and bias
• Works poorly compared to multiple-imputation methods
A categorical feature has a definite number of possible values, such as gender, for example. Since there is a definite number of classes, we can assign another class for the missing values. Here, the features Cabin and Embarked have missing values, which can be replaced with a new category, say, U for 'unknown'. This strategy adds more information to the dataset, which results in a change of variance. Since the features are categorical, we need to apply one-hot encoding to convert them to a numeric form for the algorithm to understand. Let us look at how it can be done in Python:
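A sketch of this approach, assuming the Titanic data is in a DataFrame df with 'Cabin' and 'Embarked' columns, might look like:
import pandas as pd

# Assumed: df is the Titanic DataFrame
df['Cabin'] = df['Cabin'].fillna('U')          # U for 'unknown'
df['Embarked'] = df['Embarked'].fillna('U')

# One-hot encode the categorical columns so the algorithm can use them
df = pd.get_dummies(df, columns=['Cabin', 'Embarked'])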
Pros:
• Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since the feature is categorical
• Negates the loss of data by adding a unique category
Cons:
• Adds less variance
• Adds another feature to the model while encoding, which may result in poor performance
Pros:
• Imputing the missing variable is an improvement as long as the bias from the same
is smaller than the omitted variable bias
• Yields unbiased estimates of the model parameters
Cons:
• Bias also arises when an incomplete conditioning set is used for a categorical
variable
• Considered only as a proxy for the true values
Unfortunately, the scikit-learn implementation of the K-Nearest Neighbours algorithm in Python does not accept missing values in its input (although recent versions do provide a dedicated KNNImputer for imputation).
Another algorithm that can be used here is Random Forest. This model produces a robust result because it works well on non-linear and categorical data. It adapts to the data structure, taking into consideration the high variance or bias, and produces better results on large datasets.
Pros:
• Does not require creation of a predictive model for each attribute with missing data
in the dataset
• Correlation of the data is neglected
Cons:
• It is a very time-consuming process, which can be critical in data mining where large databases are being processed
• The choice of distance function (Euclidean, Manhattan, etc.) does not always yield a robust result
Conclusion
Almost every dataset we come across will have some missing values that need to be dealt with. Handling them intelligently while still producing robust models is a challenging task. We have gone through a number of ways in which nulls can be replaced. It is not necessary to handle a particular dataset in one single manner; one can use different methods on different features depending on what the data is about. Having some domain knowledge about the data is important, as it can give you an insight into how to approach the problem.
Setup
pip install pandas
pip install scikit-learn
pip install category_encoders
Categorical data is often represented as text labels, and many machine learning algorithms
require numerical input data. Customer demographics, product classifications, and
geographic areas are just a few examples of real-world datasets that include categorical data
which must be converted into numerical representation before being used in machine
learning algorithms. Therefore, it is important to convert categorical data into a numerical
format before feeding it to a machine learning algorithm. This process is known as encoding.
There are various techniques for encoding categorical data, including one-hot encoding,
ordinal encoding, and target encoding.
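The code that produced the output below is not reproduced here; a minimal reconstruction, assuming a small colour column encoded by order of first appearance, is:
import pandas as pd

# Assumed sample data, consistent with the printed output below
df = pd.DataFrame({'category': ['red', 'green', 'blue', 'red', 'green']})

# Assign an integer to each category in order of first appearance (1-based)
df['category_encoded'] = pd.factorize(df['category'])[0] + 1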
print(df)
Output
category category_encoded
0 red 1
1 green 2
2 blue 3
3 red 1
4 green 2
As you can see, the red category has been given the value 1, green has been given the value
2, and blue has been given the value 3. The sequence in which the categories occurred in the
original dataset served as the basis for this encoding.
Example 3: Target Encoding using Category Encoders
Target Encoding is another technique used for encoding categorical data, particularly when
dealing with high cardinality features. It replaces each category with the average target value
for that category. Target Encoding is useful when there is a strong relationship between the
categorical feature and the target variable.
import pandas as pd
import category_encoders as ce
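# (The rest of this example is not reproduced here; a plausible continuation,
#  using invented sample data, is sketched below.)
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green', 'blue'],
    'target': [1, 0, 1, 0, 1, 1]
})

# The column to encode is specified with the cols option
encoder = ce.TargetEncoder(cols=['color'])

# fit_transform takes the column to encode and the target variable
df['color_encoded'] = encoder.fit_transform(df['color'], df['target'])['color']
print(df)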
As can be seen in the output, the color column was successfully encoded by category_encoders using target encoding. The column to be encoded is specified using the cols option, and the encoding is done by TargetEncoder. The fit_transform function takes two arguments: the column to be encoded and the target variable.
Conclusion
This section covered the significance of properly managing categorical data in machine learning applications. It explored one-hot encoding, ordinal encoding, and target encoding as three distinct methods for encoding categorical data in Python. One-hot encoding is a quick and efficient method, but it can result in many more features. When the order of the categories is known, ordinal encoding is a reasonable option, but it misses the connection between the categories and the target variable.
Hence, managing categorical data is a crucial component of machine learning systems, and
selecting the proper encoding method is key for producing accurate and trustworthy results.
Standardization rescales each feature value x as x' = (x − µ) / σ, where µ represents the mean of the feature values and σ represents their standard deviation.
However, unlike Min-Max scaling technique, feature values are not restricted to a specific
range in the standardization technique.
This technique is helpful for various machine learning algorithms that use distance measures
such as KNN, K-means clustering, and Principal component analysis, etc. Further, it is
also important that the model is built on assumptions and data is normally distributed.
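A minimal scikit-learn sketch of this standardization, using invented sample values, is:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()                # subtracts the mean, divides by the standard deviation
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))   # approximately 0 and 1 per column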
Disadvantages of Normalization
There are various drawbacks to normalizing a database. A few disadvantages are as follows:
• When information is dispersed over many tables, it becomes necessary to join them together, extending the work. Additionally, the database becomes harder to follow.
• Tables contain codes rather than actual data, since repeated data is stored as references rather than values. As a result, the lookup tables must constantly be consulted.
• Being designed for programs rather than ad hoc querying, the data model proves to be exceedingly difficult to query. It is made up of SQL accumulated over time, and the task is often carried out by query tools friendly to the operating environment. As a result, it can be difficult to demonstrate knowledge and understanding without first comprehending the client's needs.
• Query performance gradually slows down compared to the non-normalized structure.
Need of Normalization
Normalization is generally required when we are dealing with attributes on different scales; otherwise, it may dilute the effectiveness of an equally important attribute (on a lower scale), because other attributes have values on a larger scale. In simple words, when there are multiple attributes but their values lie on different scales, this may lead to poor data models while performing data mining operations, so the attributes are normalized to bring them all onto the same scale.
v’, v is new and old of each entry in data respectively. σA, A is the standard deviation and
mean of A respectively.
Measuring correlation
For two variables, a statistical correlation is measured by the use of a Correlation Coefficient,
represented by the symbol (r), which is a single number that describes the degree of
relationship between two variables.
The coefficient's numerical value ranges from +1.0 to –1.0, which provides an indication of
the strength and direction of the relationship.
If the correlation coefficient has a negative value (below 0) it indicates a negative relationship
between the variables. This means that the variables move in opposite directions (ie when
one increases the other decreases, or when one decreases the other increases).
If the correlation coefficient has a positive value (above 0) it indicates a positive relationship
between the variables meaning that both variables move in tandem, i.e. as one variable
decreases the other also decreases, or when one variable increases the other also increases.
Where the correlation coefficient is 0 this indicates there is no relationship between the
variables (one variable can remain constant while the other increases or decreases).
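A short sketch of computing r in Python, using invented values that rise together perfectly, is:
import numpy as np

hours_worked = np.array([1, 2, 3, 4, 5, 6])
income = np.array([50, 100, 150, 200, 250, 300])

r = np.corrcoef(hours_worked, income)[0, 1]
print(r)   # 1.0 here, since income rises by a constant amount for each extra hour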
While the correlation coefficient is a useful measure, it has its limitations: Correlation
coefficients are usually associated with measuring a linear relationship.
For example, if you compare hours worked and income earned for a tradesperson who
charges an hourly rate for their work, there is a linear (or straight line) relationship since with
each additional hour worked the income will increase by a consistent amount.
If, however, the tradesperson charges based on an initial call out fee and an hourly fee which
progressively decreases the longer the job goes for, the relationship between hours worked
and income would be non-linear, where the correlation coefficient may be closer to 0.
Care is needed when interpreting the value of 'r'. It is possible to find correlations between
many variables, however the relationships can be due to other factors and have nothing to
do with the two variables being considered.
For example, sales of ice creams and the sales of sunscreen can increase and decrease across
a year in a systematic manner, but it would be a relationship that would be due to the effects
of the season (ie hotter weather sees an increase in people wearing sunscreen as well as
eating ice cream) rather than due to any direct relationship between sales of sunscreen and
ice cream.
The correlation coefficient should not be used to say anything about cause and effect
relationship. By examining the value of 'r', we may conclude that two variables are related,
but that 'r' value does not tell us if one variable was the cause of the change in the other.
Establishing causation
Causality is an area of statistics that is commonly misunderstood and misused, in the mistaken belief that because data shows a correlation there is necessarily an underlying causal relationship.
The use of a controlled study is the most effective way of establishing causality between
variables. In a controlled study, the sample or population is split in two, with both groups
being comparable in almost every way. The two groups then receive different treatments,
and the outcomes of each group are assessed.
For example, in medical research, one group may receive a placebo while the other group is
given a new type of medication. If the two groups have noticeably different outcomes, the
different experiences may have caused the different outcomes.
Due to ethical reasons, there are limits to the use of controlled studies; it would not be
appropriate to use two comparable groups and have one of them undergo a harmful activity
while the other does not. To overcome this situation, observational studies are often used to
investigate correlation and causation for the population of interest. The studies can look at
the groups' behaviours and outcomes and observe any changes over time.
The objective of these studies is to provide statistical information to add to the other sources
of information that would be required for the process of establishing whether or not
causality exists between two variables.
• In the above image, we have taken a dataset which is arranged non-linearly. If we try to cover it with a linear model, we can clearly see that it hardly covers any data point. On the other hand, a curve is suitable to cover most of the data points, which is what the Polynomial model provides.
• Hence, if a dataset is arranged in a non-linear fashion, we should use the Polynomial Regression model instead of Simple Linear Regression.
check whether he is telling the truth or bluffing. To identify this, they only have a dataset from his previous company in which the salaries of the top 10 positions are listed with their levels. By checking the available dataset, we have found that there is a non-linear relationship between the position levels and the salaries. Our goal is to build a Bluffing detector regression model, so HR can hire an honest candidate. Below are the steps to build such a model.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('Position_Salaries.csv')
#Extracting Independent and dependent Variable
x= data_set.iloc[:, 1:2].values
y= data_set.iloc[:, 2].values
Explanation:
• In the above lines of code, we have imported the essential Python libraries to import the dataset and operate on it.
• Next, we have imported the dataset 'Position_Salaries.csv', which contains three columns (Position, Level, and Salary), but we will consider only two columns (Level and Salary).
• After that, we have extracted the dependent variable (y) and independent variable (x) from the dataset. For the x variable, we have taken the slice [:, 1:2], because we want index 1 (Levels) and have included :2 to keep it as a matrix.
Output:
By executing the above code, we can read our dataset as:
As we can see in the above output, there are three columns (Position, Level, and Salary). We consider only two of them, because the Position column is equivalent to the Level column; the levels can be seen as the encoded form of the positions.
Here we will predict the output for level 6.5 because the candidate has 4+ years' experience
as a regional manager, so he must be somewhere between levels 7 and 6.
Building the Linear regression model:
Now, we will build and fit the Linear regression model to the dataset. In building polynomial
regression, we will take the Linear regression model as reference and compare both the
results. The code is given below:
#Fitting the Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_regs= LinearRegression()
lin_regs.fit(x,y)
In the above code, we have created the Simple Linear model using lin_regs object
of LinearRegression class and fitted it to the dataset variables (x and y).
Output:
Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Building the Polynomial regression model:
Now we will build the Polynomial Regression model, but it will be a little different from the
Simple Linear model. Because here we will use PolynomialFeatures class
of preprocessing library. We are using this class to add some extra features to our dataset.
#Fitting the Polynomial regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_regs= PolynomialFeatures(degree= 2)
x_poly= poly_regs.fit_transform(x)
lin_reg_2 =LinearRegression()
lin_reg_2.fit(x_poly, y)
In the above lines of code, we have used poly_regs.fit_transform(x), because first we are
converting our feature matrix into polynomial feature matrix, and then fitting it to the
Polynomial regression model. The parameter value(degree= 2) depends on our choice. We
can choose it according to our Polynomial features.
After executing the code, we will get another matrix x_poly, which can be seen under the
variable explorer option:
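The code that produces the Linear Regression plot discussed below is not reproduced here; it presumably follows the usual matplotlib pattern, sketched (with assumed titles and labels) as:
#Visualizing the result for Linear Regression
mtp.scatter(x, y, color="blue")
mtp.plot(x, lin_regs.predict(x), color="red")
mtp.title("Bluff detection model (Linear Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()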
In the above output image, we can clearly see that the regression line is far from the data points. Predictions are shown as a red straight line, and the blue points are the actual values. If we use this output to predict the salary of the CEO, it gives a salary of approximately $600,000, which is far from the real value.
So we need a curved model to fit the dataset, rather than a straight line.
Visualizing the result for Polynomial Regression
Here we will visualize the result of the Polynomial Regression model, the code for which is a little different from the above model.
The code is given below:
#Visulaizing the result for Polynomial Regression
mtp.scatter(x,y,color="blue")
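# (The remaining plotting lines are not reproduced here; presumably the block
#  continues in the same way as the linear plot, roughly as follows.)
mtp.plot(x, lin_reg_2.predict(poly_regs.fit_transform(x)), color="red")
mtp.title("Bluff detection model (Polynomial Regression)")
mtp.xlabel("Position Levels")
mtp.ylabel("Salary")
mtp.show()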
As we can see in the above output image, the predictions are close to the real values. The plot will vary as we change the degree.
For degree = 3:
If we change the degree to 3, we get a more accurate plot, as shown in the image below.
As we can see in the above output image, the predicted salary for level 6.5 is near $170K-$190K, which suggests that the future employee is telling the truth about his salary.
Degree = 4: Let's change the degree to 4; now we get the most accurate plot. Hence we can get more accurate results by increasing the degree of the polynomial.
• In Logistic Regression, y can only be between 0 and 1, so let's divide the above equation by (1 − y):
  y / (1 − y); 0 for y = 0, and infinity for y = 1
• But we need a range between −infinity and +infinity; taking the logarithm of the equation, it becomes:
  log[ y / (1 − y) ] = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
Steps in Logistic Regression: To implement Logistic Regression using Python, we will use the same steps as in the previous regression topics. Below are the steps:
o Data pre-processing step
o Fitting Logistic Regression to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result.
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we
can use it in our code efficiently. It will be the same as we have done in Data pre-processing
topic. The code for this is given below:
#Data Pre-procesing Step
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
By executing the above lines of code, we will get the dataset as the output. Consider the
given image:
Now, we will extract the dependent and independent variables from the given dataset. Below
is the code for it:
#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values
In the above code, we have taken [2, 3] for x because our independent variables are age and
salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
The output for this is given below:
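The feature-scaling lines that precede the statement below are not reproduced here; presumably they are:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)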
x_test= st_x.transform(x_test)
The scaled output is given below:
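The step that fits the classifier to the training set is not reproduced here; presumably it looks like:
#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)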
Our model is well trained on the training set, so we will now predict the result by using test
set data. Below is the code for it:
#Predicting the test set result
y_pred= classifier.predict(x_test)
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows, for each user, the prediction of whether they will purchase the car or not.
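The code that creates the confusion matrix is not reproduced here; presumably it is:
#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)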
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above output, we can see that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect.
Linear Classifier:
As we can see from the graph, the classifier is a straight line, i.e. linear in nature, as we have used a linear model for Logistic Regression. In further topics, we will learn about non-linear classifiers.
The above graph shows the test set result. As we can see, the graph is divided into two
regions (Purple and Green). And Green observations are in the green region, and Purple
observations are in the purple region. So we can say it is a good prediction and model. Some
of the green and purple data points are in different regions, which can be ignored as we have
already calculated this error using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification
problem.
AUC is short for "Area Under the ROC Curve," which measures the whole two-
dimensional area located underneath the entire ROC curve from (0,0) to (1,1). The
AUC measures the classifier's ability to distinguish between classes. It is used as a
summary of the ROC curve. The higher the AUC, the better the model can
differentiate between positive and negative classes. AUC supplies an aggregate
measure of the model's performance across all possible classification thresholds.
Model creators want AUC for two chief reasons:
• AUC is scale-invariant. The AUC measures how well the predictions were ranked
instead of measuring their absolute values.
• AUC is classification-threshold-invariant, meaning it measures the quality of the
model's predictions regardless of the classification threshold.
However, AUC has its downsides, which manifest in certain situations:
• Scale invariance is not always wanted. For instance, sometimes, the situation calls for
well-calibrated probability outputs, and AUC doesn’t deliver that.
• Classification-threshold invariance isn't always wanted, especially in cases that show
wide disparities in the cost of false negatives compared to false positives. Instead, it
may be essential to minimize only one type of classification error. For instance, when
designing a model that performs email spam detection, you probably want to
prioritize minimizing false positives, despite resulting in a notable increase of false
negatives. Unfortunately, AUC isn't a good metric for this kind of optimization.
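A minimal sketch of computing the ROC curve and AUC with scikit-learn, using invented labels and scores, is:
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0]                  # actual classes
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]   # predicted probabilities for the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))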
TP stands for True Positive, and TN means True Negative. FP stands for False Positive, and FN stands for False Negative.
How to Use the AUC - ROC Curve for the Multi-Class Model
We can use the One vs. All methodology to plot N AUC-ROC curves for N classes when using a multi-class model. One vs. All gives us a way to leverage binary classification: if you have a classification problem with N possible outcomes, One vs. All provides one binary classifier for each possible outcome.
So, for example, you have three classes named 0, 1, and 2. You will have one ROC for 0
that’s classified against 1 and 2, another ROC for 1, which is classified against 0 and 2, and
finally, the third one of 2 classified against 0 and 1.
We should take a moment and explain the One vs. ALL methodology to better answer the
question “what is a ROC curve?”. This methodology is made up of N separate binary
classifiers. The model runs through the binary classifier sequence during training, training
each to answer a classification question. For instance, if you have a cat picture, you can train
four different recognizers, one seeing the image as a positive example (the cat) and the other
three seeing a negative example (not the cat). It would look like this:
• Is this image a rutabaga? No
• Is this image a cat? Yes
• Is this image a dog? No
• Is this image a hammer? No
This methodology works well with a small number of total classes. However, as the number
of classes rises, the model becomes increasingly inefficient.
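With scikit-learn, this One vs. All strategy for AUC can be requested directly; a small sketch with invented class probabilities:
from sklearn.metrics import roc_auc_score

y_true = [0, 1, 2, 2, 1, 0]
# One column of predicted probabilities per class (each row sums to 1)
y_proba = [[0.7, 0.2, 0.1],
           [0.2, 0.6, 0.2],
           [0.1, 0.2, 0.7],
           [0.2, 0.2, 0.6],
           [0.3, 0.5, 0.2],
           [0.6, 0.3, 0.1]]

print(roc_auc_score(y_true, y_proba, multi_class="ovr"))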
===000===
Machine learning algorithms are computational techniques that enable computers to learn
and make predictions or decisions based on data. They are a fundamental part of the field of
artificial intelligence and are used in a wide range of applications, from image and speech
recognition to recommendation systems and autonomous vehicles. Machine learning
algorithms can be categorized into several main types, including supervised learning,
unsupervised learning, and reinforcement learning. Here's a brief introduction to these
categories:
1. Supervised Learning:
a. Supervised learning is one of the most common types of machine learning.
It involves training a model on a labeled dataset, where the input data is
paired with corresponding output labels. The goal is to learn a mapping
from inputs to outputs.
b. Common algorithms in supervised learning include:
i. Linear Regression: Used for predicting continuous numeric
values (e.g., predicting house prices).
ii. Logistic Regression: Used for binary classification tasks (e.g.,
spam detection).
iii. Decision Trees and Random Forests: Effective for both
classification and regression tasks.
iv. Support Vector Machines (SVM): Used for classification and
regression, with a focus on maximizing the margin between classes.
v. Neural Networks: Deep learning models with multiple layers of
neurons, suitable for a wide range of tasks, from image recognition
to natural language processing.
vi. Naive Bayes: A probabilistic algorithm often used for text
classification.
vii. K-Nearest Neighbors (K-NN): Used for classification and
regression based on the nearest data points in the training set.
2. Unsupervised Learning:
a. Unsupervised learning involves working with unlabeled data. The goal is to
discover hidden patterns or structures within the data, such as clustering
similar data points or reducing the dimensionality of the data.
b. Common algorithms in unsupervised learning include:
i. K-Means Clustering: Used for grouping similar data points into
clusters.
ii. Hierarchical Clustering: Builds a hierarchy of clusters.
iii. Principal Component Analysis (PCA): Reduces the
dimensionality of data while preserving as much variance as
possible.
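To make the contrast between the two families concrete, here is a tiny sketch using two of the algorithms named above on invented data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y = np.array([0, 0, 0, 1, 1, 1])          # labels, used only by the supervised model

# Supervised: learn a mapping from X to the given labels y
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0, 1], [11, 3]]))

# Unsupervised: reduce the dimensionality of the same data without using y at all
X_reduced = PCA(n_components=1).fit_transform(X)
print(X_reduced)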
(real dataset) attribute and, based on the comparison, follows the corresponding branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes further; call these final nodes leaf nodes.
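In scikit-learn, these steps are handled internally by DecisionTreeClassifier; a minimal sketch, using the Iris dataset as a stand-in, is:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# criterion plays the role of the Attribute Selection Measure: "gini" or "entropy"
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)
print(tree.predict(X[:5]))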
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the
root node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
Example: SVM can be understood with the example that we used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created
by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. Since the SVM creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors), it will look at the extreme cases of cat and dog and, on the basis of the support vectors, classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since this is 2-D space, we can easily separate these two classes with a straight line. But there can be multiple lines that separate these classes. Consider the below image:
The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
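A minimal scikit-learn sketch of a linear SVM and its support vectors, on invented points, is:
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")
svm.fit(X, y)

print(svm.support_vectors_)          # the points closest to the hyperplane
print(svm.predict([[2, 2], [7, 7]]))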
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
Euclidean Distance
This is nothing but the cartesian distance between the two points which are in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight
line that joins the two points which are into consideration. This metric helps us calculate
the net displacement done between the two states of an object.
Manhattan Distance
This distance metric is generally used when we are interested in the total distance traveled
by the object instead of the displacement. This metric is calculated by summing the
absolute difference between the coordinates of the points in n-dimensions.
Minkowski Distance
We can say that the Euclidean, as well as the Manhattan distance, are special cases of the
Minkowski distance.
D(x, y) = ( Σ |xi − yi|^p )^(1/p)
From the formula above we can say that when p = 2 then it is the same as the formula for
the Euclidean distance and when p = 1 then we obtain the formula for the Manhattan
distance.
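A short sketch of these three metrics in Python, using SciPy and invented points, is:
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

print(distance.euclidean(a, b))        # Euclidean (Minkowski with p = 2)
print(distance.cityblock(a, b))        # Manhattan (Minkowski with p = 1)
print(distance.minkowski(a, b, p=3))   # general Minkowski distance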
The above-discussed metrics are most common while dealing with a Machine
Learning problem but there are other distance metrics as well like Hamming
Distance which come in handy while dealing with problems that require overlapping
comparisons between two vectors whose contents can be boolean as well as string values.
lots of computing power as well as data storage. This makes this algorithm both
time-consuming and resource exhausting.
• Curse of Dimensionality - There is a term known as the peaking phenomenon; according to this, the KNN algorithm is affected by the curse of dimensionality, which means the algorithm has a hard time classifying data points properly when the dimensionality is too high.
• Prone to Overfitting – As the algorithm is affected due to the curse of
dimensionality it is prone to the problem of overfitting as well. Hence
generally feature selection as well as dimensionality reduction techniques are
applied to deal with this problem.
Example Program:
Assume 0 and 1 as the two classifiers (groups).
import math

def classifyAPoint(points, p, k=3):
    '''
    Finds the classification of point p using the k nearest
    neighbour algorithm. It assumes only two groups and returns
    0 if p belongs to group 0, else 1 (belongs to group 1).
    Parameters -
    points: dictionary of training points having two keys - 0 and 1,
            each key holding a list of training points of that group
    p: the (x, y) point to classify
    k: number of nearest neighbours to consider
    '''
    distance = []
    for group in points:
        for feature in points[group]:
            # Euclidean distance between the training point and p
            euclidean_distance = math.sqrt((feature[0] - p[0]) ** 2
                                           + (feature[1] - p[1]) ** 2)
            distance.append((euclidean_distance, group))
    # sort by distance and keep only the k nearest neighbours
    distance = sorted(distance)[:k]
    freq1 = 0  # frequency of group 0 among the neighbours
    freq2 = 0  # frequency of group 1 among the neighbours
    for d in distance:
        if d[1] == 0:
            freq1 += 1
        elif d[1] == 1:
            freq2 += 1
    return 0 if freq1 > freq2 else 1

# driver function
def main():
    points = {0: [(1, 12), (2, 5), (3, 6), (3, 10), (3.5, 8), (2, 11), (2, 9), (1, 7)],
              1: [(5, 3), (3, 2), (1.5, 9), (7, 2), (6, 1), (3.8, 1), (5.6, 4), (4, 2), (2, 5)]}
    # unknown point to classify; the original listing omitted this value,
    # so (2.5, 7) is an illustrative choice
    p = (2.5, 7)
    # Number of neighbours
    k = 3
    print("The value classified as the unknown point is: {}".format(
        classifyAPoint(points, p, k)))

if __name__ == '__main__':
    main()
Output:
The value classified as the unknown point is: 0
Time Complexity: O(N log N), dominated by sorting the N computed distances.
Auxiliary Space: O(N) for the list of computed distances.
2. Exponential Smoothing (ETS):
a. ETS models capture the error, trend, and seasonality components of time
series data. These models are particularly useful when the data exhibits
exponential growth or decay.
3. Prophet by Facebook:
a. Prophet is an open-source forecasting tool developed by Facebook. It is
designed to handle time series data with daily observations that display
patterns on different time scales. Prophet allows users to incorporate
holidays and special events.
4. Long Short-Term Memory (LSTM) Networks:
a. LSTM networks, a type of recurrent neural network (RNN), are effective in
capturing long-term dependencies in sequential data. LSTMs are well-suited
for time series forecasting tasks, especially when dealing with complex and
non-linear patterns.
5. Attention Mechanisms:
a. Attention mechanisms, often used in sequence-to-sequence models, allow
the model to focus on different parts of the input sequence when making
predictions. This can be beneficial in capturing relevant temporal patterns.
6. Ensemble Methods:
a. Ensemble methods, such as combining multiple models or predictions, can
enhance forecasting accuracy. Techniques like bagging (Bootstrap
Aggregating) or stacking can be applied to time series forecasting models.
7. Hyperparameter Optimization:
a. Grid search or randomized search can be employed for hyperparameter
tuning to find the optimal configuration for the forecasting model.
8. Probabilistic Forecasting:
a. Instead of providing a single point estimate, probabilistic forecasting
models offer a distribution of possible outcomes. This approach is valuable
in capturing uncertainty and providing more informative predictions.
9. Backtesting:
a. Backtesting involves assessing the performance of a forecasting model on
historical data. This helps validate the model's effectiveness and
generalization to unseen data.
10. Online Learning:
a. For scenarios where data arrives sequentially, online learning techniques
allow the model to continuously update and adapt to new information.
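As an illustration of one of the tools listed above, the following is a minimal forecasting sketch, assuming the prophet package is installed and that the input DataFrame uses Prophet's expected 'ds' (date) and 'y' (value) columns; the file name and forecast horizon are illustrative.

import pandas as pd
from prophet import Prophet

# Historical daily observations with columns 'ds' (date) and 'y' (value)
df = pd.read_csv("daily_sales.csv")   # illustrative file name

model = Prophet()                     # holidays and seasonality can be configured here
model.fit(df)

# Forecast 30 days beyond the end of the history
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)

# yhat is the point forecast; yhat_lower / yhat_upper give an uncertainty interval
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())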
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known
as the centroid-based method. The most common example of partitioning clustering is
the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. The cluster centres are created in such a way that the distance between
the data points of a cluster and its own centroid is minimal compared with the distance to
the other cluster centroids.
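A minimal K-Means sketch with scikit-learn follows (not from the original text; the synthetic blob data and the choice of k = 3 are illustrative).

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be chosen in advance for partitioning clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(labels[:10])               # cluster index assigned to the first ten points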
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped distributions are formed as long as the dense regions can be connected.
The algorithm does this by identifying different clusters in the dataset and connecting the
areas of high density into clusters. The dense areas in the data space are separated from each
other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has varying
densities and high dimensionality.
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is
no requirement to pre-specify the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, also called a dendrogram. The
observations, or any number of clusters, can be selected by cutting the tree at the
appropriate level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each data point has a set of membership coefficients, which depend on
its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of
this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. Many
different clustering algorithms have been published, but only a few are commonly used. The
choice of clustering algorithm depends on the kind of data we are using: for example, some
algorithms need to guess the number of clusters in the given dataset, whereas others need to
find the minimum distance between the observations of the dataset.
Here we discuss some of the most popular clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset into clusters by minimizing the within-cluster
variance. The number of clusters must be specified in this algorithm. It is fast,
requiring relatively few computations, with linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a
smooth density of data points. It is an example of a centroid-based model that
works by updating the candidate centroids to be the centres of the points within
a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model, similar to
mean-shift, but with some remarkable advantages. In this algorithm, the areas of
high density are separated by areas of low density. Because of this, clusters can be
found with any arbitrary shape (see the sketch after this list).
4. Expectation-Maximization Clustering using GMM: This algorithm can be used
as an alternative to the k-means algorithm, or for cases where k-means may fail.
In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. In this, each data point is treated as
a single cluster at the outset and clusters are then successively merged. The cluster
hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not
require the number of clusters to be specified. In this algorithm, each pair of data
points exchanges messages until convergence. Its O(N²T) time complexity is the
main drawback of this algorithm.
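A minimal DBSCAN sketch with scikit-learn, shown for comparison with K-Means (not from the original text; the moon-shaped data and the eps/min_samples values are illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the density threshold
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks points that DBSCAN considers noise
print("Clusters found:", len(set(labels) - {-1}))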
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
• In Identification of Cancer Cells: Clustering algorithms are widely used for the
identification of cancerous cells. They divide cancerous and non-cancerous data
sets into different groups.
• In Search Engines: Search engines also work on the clustering technique. The
search results appear based on the objects closest to the search query; this is done
by grouping similar data objects into one group, far from the dissimilar objects.
The accuracy of the results for a query depends on the quality of the clustering
algorithm used.
• Customer Segmentation: It is used in market research to segment customers
based on their choices and preferences.
the Z. In the resultant matrix Z*, each observation is a linear combination of the original
features, and each column of the Z* matrix is independent of the others.
8. Remove less important features from the new dataset.
Now that the new feature set has been obtained, we decide what to keep and what to
remove: we keep only the relevant or important features in the new dataset, and the
unimportant features are removed.
data points in the dataset. Principal Component Analysis can identify these
outliers by looking for data points that are far from the other points in the
principal component space.
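A minimal PCA sketch with scikit-learn illustrating the projection described above (not from the original text; the Iris dataset and the choice of two components are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # standardize features before PCA

# Keep the two principal components with the largest variance
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)   # share of variance each component explains
print(Z[:5])                           # observations as combinations of original features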
===000===
Model diagnostics and tuning are crucial steps in the machine learning pipeline to ensure that
your model is performing at its best. These steps involve evaluating the model's
performance, identifying issues, and optimizing its hyperparameters. Here's a breakdown of
the processes involved:
Model Diagnostics:
1. Model Evaluation Metrics:
a. Choose appropriate evaluation metrics based on the problem type. For
classification tasks, metrics like accuracy, precision, recall, F1 score, and
ROC AUC are commonly used. For regression tasks, metrics like mean
squared error (MSE), mean absolute error (MAE), and R-squared are
common.
2. Cross-Validation:
a. Implement k-fold cross-validation to assess the model's performance.
Cross-validation helps estimate the model's generalization performance and
detect issues like overfitting.
3. Confusion Matrix and ROC Curve:
a. For classification tasks, create a confusion matrix and ROC curve to
understand the model's performance in more detail. This can help identify
issues like class imbalance or misclassification errors.
4. Bias-Variance Trade-off:
a. Analyze the bias-variance trade-off to find the right balance. High bias
(underfitting) occurs when the model is too simple, and high variance
(overfitting) occurs when the model is too complex. Adjust the model's
complexity accordingly.
5. Learning Curve:
a. Plot learning curves to visualize how the model's performance changes with
increasing training data. Learning curves help identify issues related to data
size and model convergence.
6. Residual Analysis:
a. In regression tasks, analyze the residuals (the differences between predicted
and actual values) to check for patterns, heteroscedasticity, or nonlinearity.
7. Feature Importance:
a. Evaluate the importance of features to determine which variables have the
most impact on the model's predictions. This can help you understand the
model's decision-making process.
8. Visualization:
a. Use visualization techniques to inspect the model's performance, feature
relationships, and data distributions. Visualization can help detect anomalies
and potential issues.
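To illustrate a few of the diagnostics above, here is a minimal sketch using scikit-learn (not from the original text; the breast-cancer dataset, logistic regression model, and 5 folds are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)

# k-fold cross-validation to estimate generalization performance
print("CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())

# Confusion matrix and ROC AUC on a held-out test set
model.fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))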
Hyperparameter Tuning:
1. Grid Search and Random Search:
a. Grid search and random search are techniques to find the best
hyperparameters for your model. Grid search exhaustively explores
predefined hyperparameter combinations, while random search randomly
samples from a predefined range of hyperparameters. These methods help
optimize the model's performance.
2. Cross-Validation for Hyperparameter Tuning:
a. Apply cross-validation during hyperparameter tuning to ensure that the
selected hyperparameters generalize well. Use k-fold cross-validation to
estimate the performance of different hyperparameter combinations.
3. Hyperparameter Optimization Libraries:
a. Employ specialized libraries like scikit-learn's GridSearchCV and
RandomizedSearchCV, or more advanced libraries like Optuna or
Hyperopt, to automate the hyperparameter search process.
4. Learning Rate Schedules (for Neural Networks):
a. When working with neural networks, learning rate schedules can be used to
adapt the learning rate during training. Techniques like learning rate
annealing or cyclic learning rates can improve convergence.
5. Regularization Techniques:
a. Utilize regularization techniques such as L1 (Lasso) or L2 (Ridge)
regularization to control overfitting. The choice of regularization strength
should be part of the tuning process.
6. Ensemble Models:
a. Experiment with ensemble techniques like bagging (e.g., Random Forests)
and boosting (e.g., Gradient Boosting) to combine multiple models for
improved performance.
7. Feature Engineering:
a. Consider modifying, engineering, or transforming features to improve
model performance. Feature engineering can involve creating interactions,
encoding categorical data, and dimensionality reduction.
8. Feature Scaling:
a. Ensure that feature scaling is appropriate for the model. Some algorithms,
like k-nearest neighbors and support vector machines, are sensitive to
feature scales.
9. Early Stopping (for Neural Networks):
a. Implement early stopping to halt training when the model starts to overfit.
Early stopping can prevent unnecessary training epochs and save time.
10. Validation Set for Hyperparameter Tuning:
a. Reserve a separate validation set for hyperparameter tuning to avoid data
leakage from the test set and obtain an unbiased evaluation of the model.
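A minimal hyperparameter-search sketch with scikit-learn's GridSearchCV (not from the original text; the SVC model and the parameter grid are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter combinations to explore exhaustively
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}

# 5-fold cross-validation is applied to every combination
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)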
Model diagnostics and hyperparameter tuning are iterative processes that require careful
consideration of the problem, data, and model characteristics. By diagnosing model issues
and optimizing hyperparameters, you can fine-tune your machine learning model to achieve
the best possible performance.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in it, and makes
predictions. During training, the model learns these patterns in the dataset and applies them
to test data for prediction. When making predictions, a difference occurs between the values
predicted by the model and the actual or expected values, and this difference is known as
the bias error, or error due to bias. It can be defined as the inability of machine learning
algorithms such as linear regression to capture the true relationship between the data points.
Every algorithm begins with some amount of bias, because bias arises from assumptions in
the model that make the target function simpler to learn. A model has either:
• Low Bias: A low bias model will make fewer assumptions about the form of
the target function.
• High Bias: A model with a high bias makes more assumptions, and the model
becomes unable to capture the important features of our dataset. A high bias
model also cannot perform well on new data.
Generally, a linear algorithm has high bias, which is what makes it learn fast; the simpler the
algorithm, the more bias it is likely to introduce, whereas a nonlinear algorithm often has
low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest
Neighbours and Support Vector Machines, while algorithms with high bias include Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs when the model is too simple. Below are some ways to reduce
high bias:
• Increase the number of input features, as the model is underfitted.
• Decrease the regularization term.
• Use more complex models, such as including some polynomial features.
Since, with high variance, the model learns too much from the dataset, high variance leads
to overfitting of the model. A model with high variance has the following problems:
o A high variance model leads to overfitting.
o It increases model complexity.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance.
Some examples of machine learning algorithms with low variance are Linear Regression,
Logistic Regression, and Linear Discriminant Analysis, while algorithms with high variance
include Decision Trees, Support Vector Machines, and k-Nearest Neighbours.
1. Low-Bias, Low-Variance:
The combination of low bias and low variance corresponds to an ideal machine learning
model. However, it is rarely achievable in practice.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are
inconsistent but accurate on average. This case occurs when the model learns a large
number of parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are
consistent but inaccurate on average. This case occurs when a model does not learn
well from the training dataset or uses too few parameters. It leads to underfitting
problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are
inconsistent and also inaccurate on average.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the model
has a large number of parameters, it will have high variance and low bias. So, it is required to
make a balance between bias and variance errors, and this balance between the bias error and
variance error is known as the Bias-Variance trade-off.
For accurate predictions, an algorithm needs low variance and low bias. But this is not
entirely possible, because bias and variance are related to each other:
• If we decrease the variance, the bias will increase.
• If we decrease the bias, the variance will increase.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we need a model
that accurately captures the regularities in the training data and simultaneously generalizes
well to unseen data. Unfortunately, doing both at the same time is not possible: a high
variance algorithm may perform well on training data but may end up overfitting noisy data,
whereas a high bias algorithm produces a much simpler model that may not even capture
important regularities in the data. So, we need to find a sweet spot between bias and
variance to build an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance
between bias and variance errors.
This technique is similar to k-fold cross-validation with a few small changes. It works on the
concept of stratification, which is the process of rearranging the data to ensure that each
fold or group is a good representative of the complete dataset. It is one of the best
approaches for dealing with bias and variance.
It can be understood with an example of housing prices: the prices of some houses can be
much higher than those of others. To handle such situations, a stratified k-fold cross-
validation technique is useful.
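A minimal stratified cross-validation sketch with scikit-learn (not from the original text; the dataset and classifier are illustrative). Stratification keeps the class proportions roughly equal in every fold:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the 5 folds preserves the overall class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=skf)

print(scores)
print("Mean accuracy:", scores.mean())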
Holdout Method
This method is the simplest cross-validation technique of all. In this method, we set aside a
subset of the data and obtain prediction results for it after training the model on the
remaining part of the dataset.
The error observed in this process tells us how well our model will perform on an unknown
dataset. Although this approach is simple to perform, it still suffers from high variance, and
it sometimes produces misleading results.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
• Under ideal conditions, it provides optimal results, but with inconsistent data it
may produce drastically misleading results. This is one of the big disadvantages of
cross-validation, as there is no certainty about the kind of data encountered in
machine learning.
• In predictive modeling, the data evolves over time, which can create differences
between the training and validation sets. For example, if we create a model to
predict stock market values and train it on the previous 5 years of stock prices, the
realistic values for the next 5 years may be drastically different, so it is difficult to
expect correct output in such situations.
Applications of Cross-Validation
• This technique can be used to compare the performance of different predictive
modeling methods.
• It has great scope in the medical research field.
• It can also be used for meta-analysis, as it is already being used by data scientists
in the field of medical statistics.
Ensemble learning is frequently illustrated using decision trees, as this algorithm can be
prone to overfitting (high variance and low bias) when it has not been pruned. It can also
lend itself to underfitting (low variance and high bias) when it is very small, like a decision
stump, a decision tree with only one split. When an algorithm overfits or underfits its
training set, it cannot generalize well to new datasets, so ensemble methods are used to
counteract this behaviour and allow the model to generalize to new datasets. While decision
trees can exhibit high variance or high bias, it is worth noting that they are not the only
modelling approach that leverages ensemble learning to find the "sweet spot" in the
bias-variance trade-off.
Bagging vs Boosting
• Bagging is the simplest way of combining predictions that belong to the same type,
whereas boosting combines predictions that belong to different types.
• The main task of bagging is to decrease variance, not bias; the main task of boosting
is to decrease bias, not variance.
• In bagging, every model receives equal weight; in boosting, models are weighted
according to their performance.
• In bagging, each model is built independently; in boosting, each new model is
influenced by the previously built ones.
• In bagging, the training subsets are drawn from the whole training dataset using row
sampling with replacement (random sampling); in boosting, each new subset contains
the elements that were misclassified by the previous models.
• If the classifier is unstable (high variance), apply bagging; if the classifier is stable
and simple (high bias), apply boosting.
• In bagging, the base classifiers work in parallel; in boosting, the base classifiers work
sequentially.
• The random forest model is an example that uses bagging; AdaBoost is an example
that uses the boosting technique.
2. Loss of interpretability:
It is difficult to draw precise business insights through bagging because of the averaging
involved across predictions. While the output is more precise than that of any individual
data point, a more accurate or complete dataset may also yield greater precision within a
single classification or regression model.
3. Expensive for computation:
Bagging slows down and becomes more computationally intensive as the number of
iterations grows; accordingly, it is not well suited to real-time applications. Clustered systems
or a large number of processing cores are needed to build bagged ensembles quickly on
large datasets.
plt.show()
Output:
By iterating through different values for the number of estimators, we can see an increase
in model performance from 82.2% to 95.5%. After 14 estimators the accuracy begins to
drop, and again, if you set a different random_state, the values you see will vary. This is why
cross-validation is good practice to ensure stable results. In this case, we see a 13.3%
increase in accuracy with respect to identifying the type of wine. Compiling and running the
above program produces the output shown below.
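The full listing for this experiment is not reproduced here; the following is a minimal sketch of the kind of sweep described above, assuming scikit-learn's wine dataset, a BaggingClassifier over decision trees, and an illustrative range of estimator counts (the exact accuracies will differ).

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

estimator_range = [2, 4, 6, 8, 10, 12, 14, 16]
scores = []
for n in estimator_range:
    clf = BaggingClassifier(n_estimators=n, random_state=22)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, clf.predict(X_test)))

plt.plot(estimator_range, scores)
plt.xlabel("n_estimators")
plt.ylabel("accuracy")
plt.show()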
print(oob_model.oob_score_)
Output:
Compiling and running the above program produces the output shown below:
0.8951612903225806
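The listing that produces this out-of-bag score is not shown in full above; a minimal sketch of the idea, assuming the same wine dataset and illustrative parameters, is:

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier

X, y = load_wine(return_X_y=True)

# With oob_score=True, each tree is evaluated on the samples it did not see
# during its bootstrap draw, giving a built-in validation estimate.
oob_model = BaggingClassifier(n_estimators=12, oob_score=True, random_state=22)
oob_model.fit(X, y)

print(oob_model.oob_score_)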
• The predictions from each tree must have very low correlations.
4.6 STACKING
Stacking is one of the popular ensemble modeling techniques in machine learning.
Various weak learners are ensembled in a parallel manner in such a way that, by combining
them with a meta-learner, we can produce better predictions for the future.
This ensemble technique works by feeding the combined predictions of multiple weak
learners to a meta-learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how
to best combine the input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the Model
Averaging Ensemble technique, in which all sub-models participate according to their
performance weights to build a new model with better predictions. This new model is
stacked on top of the others, which is why it is named stacking.
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or
more base (learner) models and a meta-model that combines the predictions of the base
models. The base models are called level-0 models, and the meta-model is known as the
level-1 model. So, the stacking ensemble method includes the original (training) data,
primary-level models, primary-level predictions, a secondary-level model, and the final
prediction. The basic architecture of stacking can be represented as shown in the image below.
• Original data: This data is divided into n folds and serves as the training and test
data.
• Base models: These models are also referred to as level-0 models. They use the
training data and provide their predictions (level-0) as output.
• Level-0 Predictions: Each base model is trained on some of the training data and
provides different predictions, which are known as level-0 predictions.
• Meta Model: The architecture of the stacking model includes one meta-model,
which helps to best combine the predictions of the base models. The meta-model is
also known as the level-1 model.
• Level-1 Prediction: The meta-model learns how to best combine the predictions of
the base models. It is trained on the predictions that the individual base models make
on data not used to train them; these predictions, along with the expected outputs,
provide the input and output pairs of the training dataset used to fit the meta-model.
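A minimal stacking sketch with scikit-learn's StackingClassifier (not from the original text; the base learners, meta-learner, and dataset are illustrative choices):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Level-0 (base) models
base_models = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# Level-1 (meta) model that learns how to combine the base predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=5000),
                           cv=5)

print(cross_val_score(stack, X, y, cv=5).mean())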
Voting ensembles:
This is one of the simplest stacking ensemble methods, which uses different algorithms to
prepare all members individually. Unlike the stacking method, the voting ensemble uses
simple statistics instead of learning how to best combine predictions from base models
separately.
It is useful for solving regression problems, where we predict the mean or median of the
predictions from the base models. It is also helpful in various classification problems, where
the prediction is based on the total votes received. Predicting the label with the largest
number of votes is referred to as hard voting, whereas predicting the label with the largest
summed predicted probability is referred to as soft voting.
The voting ensemble differs from the stacking ensemble in that it does not weight models
based on each member's performance: here, all models are considered to have the same
skill level.
Member Assessment: In the voting ensemble, all members are assumed to have the same
skill sets.
Combine with Model: Instead of using a learned combination of each member's
predictions, it uses simple statistics, such as the mean or median, to obtain the final prediction.
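A minimal voting-ensemble sketch with scikit-learn's VotingClassifier (not from the original text; the members and the soft-voting choice are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each member is trained independently; soft voting averages their predicted probabilities
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    voting="soft")

print(cross_val_score(vote, X, y, cv=5).mean())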
Blending Ensemble:
Blending is an approach similar to stacking, but with a specific configuration: instead of
k-fold cross-validation, it uses a single holdout (validation) set to prepare the out-of-sample
predictions for the meta-model. In this method, the training dataset is first split into a
training set and a validation set, and the learner models are trained on the training set.
Predictions are then made on the validation set and on the test set; the validation predictions
are used as features to build a new model, which is later used to make the final predictions
on the test set using the prediction values as features.
Member Predictions: The blending ensemble uses out-of-sample predictions on a
validation set.
Combine With Model: A linear model (e.g., linear regression or logistic regression).
===000===
This Artificial Neural Network tutorial provides basic and advanced concepts of ANNs and
is developed for beginners as well as professionals.
The term "artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. An artificial neural network is a computational
network based on the biological neural networks that form the structure of the human
brain. Just as the human brain has neurons interconnected with one another, artificial neural
networks also have neurons that are linked to each other in the various layers of the network.
These neurons are known as nodes.
This tutorial covers all aspects related to artificial neural networks. In it, we will discuss
ANNs, adaptive resonance theory, Kohonen self-organizing maps, building blocks,
unsupervised learning, genetic algorithms, etc.
The given figure illustrates the typical diagram of Biological Neural Network.
The typical Artificial Neural Network looks something like the given figure.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks,
cell nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
Relationship between the biological neural network and the artificial neural network:

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output
An artificial neural network is an attempt, in the field of artificial intelligence, to mimic the
network of neurons that makes up the human brain, so that computers can understand
things and make decisions in a human-like manner. The artificial neural network is created
by programming computers to behave simply like interconnected brain cells.
The human brain contains on the order of 100 billion neurons, and each neuron has
somewhere between 1,000 and 100,000 connection points. In the human brain, data is
stored in a distributed manner, and we can extract more than one piece of this data from
our memory in parallel when necessary. We can say that the human brain is made up of
incredibly powerful parallel processors.
We can understand the artificial neural network with an example. Consider a digital logic
gate that takes an input and gives an output, such as an "OR" gate, which takes two inputs.
If one or both inputs are "On," the output is "On." If both inputs are "Off," the output is
"Off." Here the output depends only on the input. Our brain does not perform the same
task: the output-to-input relationship keeps changing because the neurons in our brain are
"learning."
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations
needed to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
The artificial neural network takes the inputs, computes the weighted sum of the inputs, and
adds a bias. This computation is represented in the form of a transfer function.
There is no particular guideline for determining the structure of artificial neural networks.
The appropriate network structure is accomplished through experience, trial, and error.
Unrecognized behavior of the network:
This is the most significant issue with ANNs. When an ANN produces a solution, it does
not provide insight into why and how it was reached, which decreases trust in the network.
Hardware dependence:
Artificial neural networks need processors with parallel processing power, in keeping with
their structure; the realization of the network therefore depends on suitable hardware.
Difficulty of showing the issue to the network:
ANNs can only work with numerical data, so problems must be converted into numerical
values before being introduced to the ANN. The representation chosen here directly
impacts the performance of the network and relies on the user's abilities.
The duration of the network is unknown:
Training is stopped when the error is reduced to a specific value, but this value does not
guarantee optimal results.
Artificial neural networks, which stepped into the world in the mid-20th century, are developing exponentially.
Here we have reviewed the advantages of artificial neural networks and the issues encountered in the course of
their use. It should not be overlooked that the drawbacks of ANNs, a flourishing branch of science, are being
eliminated one by one, while their advantages grow day by day. This means that artificial neural networks will
progressively become an irreplaceable part of our lives.
Afterward, each input is multiplied by its corresponding weight (these weights are the details
the artificial neural network uses to solve a specific problem). In general terms, these weights
represent the strength of the interconnection between neurons inside the artificial neural
network. All the weighted inputs are summed inside the computing unit.
If the weighted sum is zero, a bias is added to make the output non-zero, or to otherwise
scale up the system's response; the bias can be viewed as an extra input fixed at 1 with its
own weight. Here the total of the weighted inputs can range from 0 to positive infinity.
To keep the response within the limits of the desired value, a certain maximum value is
benchmarked, and the total of the weighted inputs is passed through the activation function.
The activation function refers to the set of transfer functions used to achieve the desired
output. There are different kinds of activation functions, primarily either linear or non-linear
sets of functions. Some of the commonly used activation functions are the binary, linear, and
tan hyperbolic sigmoidal activation functions. Let us take a look at each of them in detail:
Binary:
In a binary activation function, the output is either a one or a zero. To accomplish this,
a threshold value is set up. If the net weighted input of the neuron exceeds the threshold
(here, 1), the final output of the activation function is returned as one; otherwise the output
is returned as zero.
Sigmoidal Hyperbolic:
The sigmoidal hyperbolic function is generally seen as an "S"-shaped curve. Here the tan
hyperbolic function is used to approximate output from the actual net input. The function is
defined as:
F(x) = 1 / (1 + exp(−βx))
where β is considered the steepness parameter.
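A small sketch of the two activation functions just described (not from the original text; the threshold and steepness values are illustrative):

import numpy as np

def binary_step(x, threshold=1.0):
    # Returns 1 when the net weighted input exceeds the threshold, else 0
    return np.where(x > threshold, 1, 0)

def sigmoid(x, beta=1.0):
    # F(x) = 1 / (1 + exp(-beta * x)), where beta is the steepness parameter
    return 1.0 / (1.0 + np.exp(-beta * x))

net_inputs = np.array([-2.0, 0.5, 1.5, 3.0])
print(binary_step(net_inputs))
print(sigmoid(net_inputs, beta=2.0))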
input, the intensity of the network can be noticed based on group behavior of the associated
neurons, and the output is decided. The primary advantage of this network is that it figures
out how to evaluate and recognize input patterns.
A perceptron uses a step function that returns +1 if the weighted sum of its inputs is greater
than or equal to 0, and −1 otherwise. The activation function is used to map the input to a
required range such as (0, 1) or (−1, 1).
A regular neural network looks like this:
b. In this step, add all the multiplied values and call the result the weighted sum.
c. In our last step, apply the weighted sum to the correct activation function.
For example:
A unit step activation function (see the sketch below).
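A minimal perceptron forward-pass sketch following the steps above (not from the original text; the weights, bias, and inputs are illustrative):

import numpy as np

def unit_step(x):
    # Unit step activation: 1 if the weighted sum is >= 0, else 0
    return 1 if x >= 0 else 0

def perceptron_predict(inputs, weights, bias):
    # multiply every input by its weight, sum them up with the bias,
    # then pass the weighted sum through the activation function
    weighted_sum = np.dot(inputs, weights) + bias
    return unit_step(weighted_sum)

# Illustrative perceptron implementing a logical AND of two binary inputs
weights = np.array([1.0, 1.0])
bias = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_predict(np.array(x), weights, bias))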
There are two types of architecture. These types focus on the functionality of artificial neural
networks as follows-
o Single Layer Perceptron
o Multi-Layer Perceptron
Logistic regression is considered a form of predictive analysis. It is mainly used to describe
data and to explain the relationship between one dependent binary variable and one or more
nominal or independent variables.
MLP networks are used in a supervised learning setting. A typical learning algorithm for
MLP networks is the backpropagation algorithm.
A multilayer perceptron (MLP) is a feed-forward artificial neural network that generates a set
of outputs from a set of inputs. An MLP is characterized by several layers of input nodes
connected as a directed graph between the input and output layers. MLP uses
backpropagation for training the network. MLP is a deep learning method.
Now we focus on an MLP implementation for an image classification problem.
# Import MNIST data (this listing uses the TensorFlow 1.x API)
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

import tensorflow as tf
import matplotlib.pyplot as plt

# Parameters
learning_rate = 0.001
training_epochs = 20
batch_size = 100
display_step = 1

# Network Parameters
n_hidden_1 = 256   # neurons in the first hidden layer
n_hidden_2 = 256   # neurons in the second hidden layer
n_input = 784      # MNIST images are 28 x 28 pixels
n_classes = 10     # digits 0-9

# tf Graph input
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])

# weights and bias, layer 1
h = tf.Variable(tf.random_normal([n_input, n_hidden_1]))
bias_layer_1 = tf.Variable(tf.random_normal([n_hidden_1]))
# layer 1
layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, h), bias_layer_1))

# weights and bias, layer 2
w = tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2]))
bias_layer_2 = tf.Variable(tf.random_normal([n_hidden_2]))
# layer 2
layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, w), bias_layer_2))

# weights and bias, output layer
output = tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
bias_output = tf.Variable(tf.random_normal([n_classes]))
output_layer = tf.add(tf.matmul(layer_2, output), bias_output)

# cost function
cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    logits=output_layer, labels=y))
# optimizer
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=learning_rate).minimize(cost)

# Plot settings
avg_set = []
epoch_set = []

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples / batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run([optimizer, cost],
                            feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch + 1),
                  "cost =", "{:.9f}".format(avg_cost))
        avg_set.append(avg_cost)
        epoch_set.append(epoch + 1)

    # Test model
    correct_prediction = tf.equal(tf.argmax(output_layer, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print("Model Accuracy:",
          accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))

# Plot the average training cost per epoch
plt.plot(epoch_set, avg_set, 'o', label='MLP Training phase')
plt.ylabel('cost')
plt.xlabel('epoch')
plt.legend()
plt.show()
updates the weight of every model in every single layer. We will talk more about
optimization algorithms and backpropagation later.
It is important to understand how our neural network is subsequently trained: training
amounts to separating our data samples with some decision boundary.
"The process of receiving an input and producing some kind of output to make some kind
of prediction is known as feed forward." The feed-forward neural network is the core of many
other important neural networks, such as the convolutional neural network.
In a feed-forward neural network, there are no feedback loops or connections in the
network; there is simply an input layer, a hidden layer, and an output layer.
There can be multiple hidden layers, depending on what kind of data you are dealing
with. The number of hidden layers is known as the depth of the neural network, and a deeper
neural network can learn more complex functions. The input layer first provides the neural
network with data, and the output layer then makes predictions on that data based on a series
of functions. The ReLU function is the most commonly used activation function in deep
neural networks.
To gain a solid understanding of the feed-forward process, let's see this mathematically.
1) The first input is fed to the network, represented as the vector (x1, x2, 1), where 1 is the
bias value.
2) Each input is multiplied by a weight with respect to the first and second model to obtain
its probability of being in the positive region in each model. So, we multiply our inputs by a
matrix of weights using matrix multiplication.
3) After that, we take the sigmoid of our scores, which gives us the probability of the point
being in the positive region in both models.
4) We multiply the probabilities obtained in the previous step by the second set of weights.
We always include a bias of one whenever taking a combination of inputs.
As we know, to obtain the probability of the point being in the positive region of this
model, we take the sigmoid, thus producing our final output in a feed-forward process.
Let us take the neural network we had previously, with the linear models in the hidden layer
that combine to form the non-linear model in the output layer.
We use this non-linear model to produce an output that describes the probability of the
point being in the positive region. The point is represented by (2, 2); along with the bias,
we represent the input as (2, 2, 1).
Recall the equation defining the first linear model in the hidden layer: in the first layer, to
obtain the linear combination, the inputs are multiplied by −4 and −1, and the bias value is
multiplied by 12.
In the second model, the weights of the inputs are multiplied by −1/5 and 1, and the bias is
multiplied by 3 to obtain the linear combination of the same point.
Now, to obtain the probability that the point is in the positive region relative to both models,
we apply the sigmoid to both scores.
The second layer contains the weights which dictate how the linear models of the first layer
are combined to obtain the non-linear model of the second layer. The weights are 1.5 and 1,
with a bias value of 0.5.
Now, we multiply our probabilities from the first layer by the second set of weights, as
worked through in the sketch below.
This completes the math behind the feed-forward process, in which the inputs from the
input layer traverse the entire depth of the neural network. In this example there is only one
hidden layer, but whether there is one hidden layer or twenty, the computational process is
the same for all hidden layers.
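To make the arithmetic above concrete, here is a small sketch (not part of the original text) that reproduces the feed-forward computation just described: input point (2, 2) with a bias of 1, first-layer weights (−4, −1, 12) and (−1/5, 1, 3), and second-layer weights (1.5, 1, 0.5).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input point (2, 2) with the bias input fixed at 1
x = np.array([2.0, 2.0, 1.0])

# First layer: two linear models, one row of weights per model
W1 = np.array([[-4.0, -1.0, 12.0],
               [-0.2,  1.0,  3.0]])
hidden_scores = W1 @ x            # linear combinations of the inputs
hidden_probs = sigmoid(hidden_scores)

# Second layer: combine the two hidden probabilities (plus a bias input of 1)
w2 = np.array([1.5, 1.0, 0.5])
output_score = w2 @ np.append(hidden_probs, 1.0)
output_prob = sigmoid(output_score)

print(hidden_probs)   # probabilities from the two linear models
print(output_prob)    # probability of the point being in the positive region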
The RBM is called “restricted” because the connections between the neurons in the same
layer are not allowed. In other words, each neuron in the visible layer is only connected to
neurons in the hidden layer, and vice versa. This allows the RBM to learn a compressed
representation of the input data by reducing the dimensionality of the input.
The RBM is trained using a process called contrastive divergence, which is a variant of the
stochastic gradient descent algorithm. During training, the network adjusts the weights of
the connections between the neurons in order to maximize the likelihood of the training
data. Once the RBM is trained, it can be used to generate new samples from the learned
probability distribution.
RBM has found applications in a wide range of fields, including computer vision, natural
language processing, and speech recognition. It has also been used in combination with
other neural network architectures, such as deep belief networks and deep neural networks,
to improve their performance.
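A minimal RBM sketch using scikit-learn's BernoulliRBM, which is trained with (persistent) contrastive divergence as described above (not from the original text; the digits dataset and the hyperparameter values are illustrative):

from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# Scale pixel values to [0, 1]; BernoulliRBM expects binary-like visible units
X, _ = load_digits(return_X_y=True)
X = X / 16.0

# 64 visible units (pixels) compressed into 32 hidden units
rbm = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

hidden = rbm.transform(X)      # hidden-unit activation probabilities
print(hidden.shape)            # (n_samples, 32): a compressed representation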
Since this machine has no output layer, the question arises of how we are going to identify
and adjust the weights, and how to measure whether our prediction is accurate or not. All of
these questions have one answer: the Restricted Boltzmann Machine.
The RBM algorithm was proposed by Geoffrey Hinton; it learns a probability distribution
over its sample training data inputs. It has seen wide application in different areas of
supervised and unsupervised machine learning such as feature learning, dimensionality
reduction, classification, collaborative filtering, and topic modeling.
Consider the example of movie ratings discussed in the recommender system section.
Movies like Avengers, Avatar, and Interstellar have strong associations with the latest fantasy
and science fiction factor. Based on the users' ratings, the RBM will discover latent factors
that can explain the activation of these movie choices. In short, an RBM describes the
variability among the correlated variables of the input dataset in terms of a potentially lower
number of unobserved variables.
The energy function is given by
E(v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_i Σ_j v_i w_ij h_j
where v and h are the states of the visible and hidden units, a and b are their biases, and W is
the matrix of weights connecting them.
1. Binary RBM: In a binary RBM, the input and hidden units are binary variables.
Binary RBMs are often used in modeling binary data such as images or text.
2. Gaussian RBM: In a Gaussian RBM, the input and hidden units are continuous
variables that follow a Gaussian distribution. Gaussian RBMs are often used in
modeling continuous data such as audio signals or sensor data.
Apart from these two types, there are also variations of RBMs such as:
1. Deep Belief Network (DBN): A DBN is a type of generative model that consists
of multiple layers of RBMs. DBNs are often used in modeling high-dimensional
data such as images or videos.
2. Convolutional RBM (CRBM): A CRBM is a type of RBM that is designed
specifically for processing images or other grid-like structures. In a CRBM, the
connections between the input and hidden units are local and shared, which makes
it possible to capture spatial relationships between the input units.
3. Temporal RBM (TRBM): A TRBM is a type of RBM that is designed for
processing temporal data such as time series or video frames. In a TRBM, the
hidden units are connected across time steps, which allows the network to model
temporal dependencies in the data.
===000===