Machine Learning

UNIT 1

HISTORY
Machine Learning is a subset of artificial intelligence in which computer algorithms autonomously
learn from data and information. In machine learning, computers don't have to be explicitly programmed;
they can change and improve their algorithms by themselves.

1950 — Alan Turing creates the “Turing Test” to determine if a computer has real intelligence. To pass the
test, a computer must be able to fool a human into believing it is also human.

1952 — Arthur Samuel wrote the first computer learning program. The program played the game of checkers,
and the IBM computer improved at the game the more it played, studying which moves made up winning
strategies and incorporating those moves into its program.

1957 — Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates
the thought processes of the human brain.

1967 — The “nearest neighbor” algorithm was written, allowing computers to begin using very basic pattern
recognition. This could be used to map a route for traveling salesmen, starting at a random city but ensuring
they visit all cities during a short tour.

1979 — Students at Stanford University invent the “Stanford Cart” which can navigate obstacles in a room
on its own.

1981 — Gerald Dejong introduces the concept of Explanation Based Learning (EBL), in which a computer
analyses training data and creates a general rule it can follow by discarding unimportant data.

1985 — Terry Sejnowski invents NetTalk, which learns to pronounce words the same way a baby does.

1990s — Work on machine learning shifts from a knowledge-driven approach to a data-driven approach.
Scientists begin creating programs for computers to analyze large amounts of data and draw conclusions —
or “learn” — from the results.

1997 — IBM’s Deep Blue beats the world champion at chess.

2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers “see”
and distinguish objects and text in images and videos.

2010 — The Microsoft Kinect can track 20 human features at a rate of 30 times per second, allowing people
to interact with the computer via movements and gestures.

2011 — IBM’s Watson beats its human competitors at Jeopardy.

2011 — Google Brain is developed, and its deep neural network can learn to discover and categorize objects
much the way a cat does.

2012 – Google’s X Lab develops a machine learning algorithm that is able to autonomously browse
YouTube videos to identify the videos that contain cats.

2014 – Facebook develops DeepFace, a software algorithm that is able to recognize or verify individuals in
photos at the same level as humans can.

2015 – Amazon launches its own machine learning platform.


2015 – Microsoft creates the Distributed Machine Learning Toolkit, which enables the efficient distribution
of machine learning problems across multiple computers.

2015 – Over 3,000 AI and Robotics researchers, endorsed by Stephen Hawking, Elon Musk and Steve
Wozniak (among many others), sign an open letter warning of the danger of autonomous weapons which
select and engage targets without human intervention.

2016 – Google’s artificial intelligence algorithm beats a professional player at the Chinese board game Go,
which is considered the world’s most complex board game and is many times harder than chess. The
AlphaGo algorithm developed by Google DeepMind managed to win five games out of five in the Go
competition.

EVOLUTION
Machine Learning technology has been in existence since 1952. It has evolved drastically over the last
decade and saw several transition periods in the mid-90s. The data-driven approach to Machine Learning
came into existence during the 1990s. From 1995-2005, there was a lot of focus on natural language, search,
and information retrieval. In those days, Machine Learning tools were more straightforward than the tools
being used currently. Neural networks, which were popular in the 80s, are a subset of Machine Learning that
are computer systems modeled on the human brain and nervous system. Neural networks started making a
comeback around 2005 and have become one of the trending technologies of the current decade. According
to Gartner’s 2016 Hype Cycle for Emerging Technologies, Machine Learning is among the technologies at
the peak of inflated expectations and is expected to reach mainstream adoption in the next 2–5 years.

MACHINE LEARNING & ITS CATEGORIES:


Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set of
algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the
basis of training, they build the model & perform a specific task.

These ML algorithms help to solve different business problems such as regression, classification, forecasting,
clustering, and association.

Based on the methods and way of learning, machine learning is divided into mainly three types, which are:

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Learning

1) Supervised Machine Learning

As its name suggests, supervised machine learning is based on supervision. In the supervised learning
technique, we train the machines using a "labelled" dataset, and based on that training, the machine predicts
the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More
precisely, we first train the machine with the input and corresponding output, and then we ask the machine
to predict the output using the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog
images. First, we provide training to the machine to understand the images: the shape and size of the tail of
a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc. After training,
we input a picture of a cat and ask the machine to identify the object and predict the output. Since the
machine is well trained, it checks all the features of the object, such as height, shape, colour, eyes, ears,
tail, etc., finds that it's a cat, and puts it in the Cat category. This is how the machine identifies objects in
supervised learning.

The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud Detection,
Spam filtering, etc.

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems, which are given below:

o Classification
o Regression

a) Classification

Classification algorithms are used to solve classification problems, in which the output variable is
categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. Classification algorithms predict the
categories present in the dataset. Some real-world examples of classification are spam detection, email
filtering, etc.

Some popular classification algorithms are given below:

o Random Forest Algorithm


o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
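
As a concrete illustration, here is a minimal classification sketch using scikit-learn's Random Forest on its built-in breast-cancer dataset; the dataset and parameter choices are illustrative, not part of the original text.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)              # a labelled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)                                # learn from labelled data
print(accuracy_score(y_test, clf.predict(X_test)))       # evaluate on unseen data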

b) Regression

Regression algorithms are used to solve regression problems, in which the output variable is continuous and
there is a relationship between the input and output variables. They are used to predict continuous output
variables, such as market trends, weather, etc.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression
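
A minimal regression sketch follows (assumes scikit-learn); the synthetic data points are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # input variable (x)
y = np.array([2.1, 4.2, 5.9, 8.1])           # continuous output variable (y)

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)             # fitted slope and intercept
print(reg.predict([[5.0]]))                  # predict a continuous value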
Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.

o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning:

Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image classification
is performed on different image data with pre-defined labels.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition. The
algorithm is trained with voice data, which can then be used for identification tasks such as
voice-activated passwords, voice commands, etc.

2) Unsupervised Machine Learning

Unsupervised learning differs from the supervised learning technique; as its name suggests, there is no
need for supervision. In unsupervised machine learning, the machine is trained using an unlabeled dataset
and predicts the output without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor labelled, and the
model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according
to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the
input dataset.

Let's take an example to understand it more precisely: suppose there is a basket of fruit images, and we
input it into the machine learning model. The images are totally unknown to the model, and the task of the
machine is to find the patterns and categories of the objects.

The machine will then discover patterns and differences on its own, such as colour and shape differences,
and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning

Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association

1) Clustering

The clustering technique is used when we want to find the inherent groups from the data. It is a way to group
the objects into a cluster such that the objects with the most similarities remain in one group and have fewer
or no similarities with the objects of other groups. An example of the clustering algorithm is grouping the
customers by their purchasing behaviour.

Some popular clustering algorithms (along with two related dimensionality-reduction techniques) are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
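
A minimal clustering sketch follows (assumes scikit-learn); the synthetic blob data is illustrative.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabelled points
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)
print(labels[:10])    # cluster assignments discovered without any labels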

2) Association

Association rule learning is an unsupervised learning technique, which finds interesting relations among
variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one
data item on another data item and map those variables accordingly so that it can generate maximum profit.
This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth algorithm.
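
A minimal market-basket sketch follows; it assumes the third-party mlxtend library (pip install mlxtend), and the transactions are made-up examples.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread"], ["milk", "diapers", "beer"],
                ["bread", "diapers"], ["milk", "bread", "diapers"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)      # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "confidence"]])      # mined rules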

Advantages:

o These algorithms can be used for more complicated tasks than supervised algorithms, because they
work on unlabeled datasets.
o Unsupervised algorithms are preferable for various tasks, because obtaining an unlabeled dataset is
easier than obtaining a labelled one.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the
algorithm is not trained on the exact output in advance.
o Working with unsupervised learning is more difficult, because it uses an unlabelled dataset that
does not map to a known output.

Applications of Unsupervised Learning:

o Network Analysis: Unsupervised learning is used in document network analysis of text data for
scholarly articles, for example to identify plagiarism and copyright issues.
o Recommendation Systems: Recommendation systems widely use unsupervised learning techniques
for building recommendation applications for different web applications and e-commerce websites.

3) Reinforcement Machine Learning

Reinforcement learning works on a feedback-based process, in which an AI agent (a software component)
automatically explores its surroundings by hit and trial: taking actions, learning from experience, and
improving its performance. The agent gets rewarded for each good action and punished for each bad action;
hence the goal of a reinforcement learning agent is to maximize the rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents learn from their
experiences only.

The reinforcement learning process is similar to that of a human being; for example, a child learns various
things through experience in day-to-day life. Playing a game is an example of reinforcement learning, where
the game is the environment, the agent's moves at each step define states, and the goal of the agent is to get
a high score. The agent receives feedback in terms of punishments and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game theory,
operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP,
the agent constantly interacts with the environment and performs actions; at each action, the environment
responds and generates a new state.
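
To make the reward-and-update loop concrete, here is a minimal tabular Q-learning sketch on a made-up 5-state corridor where the agent starts at state 0 and is rewarded for reaching state 4. The environment and all hyperparameters are illustrative assumptions, not from the original text.

import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # state-action value table
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    nxt = max(state - 1, 0) if action == 0 else min(state + 1, n_states - 1)
    reward = 1.0 if nxt == n_states - 1 else 0.0
    return nxt, reward, nxt == n_states - 1   # new state, reward, done

rng = np.random.default_rng(0)
for _ in range(500):                           # episodes
    state, done = 0, False
    while not done:
        if rng.random() < epsilon:             # explore
            action = int(rng.integers(n_actions))
        else:                                  # exploit current knowledge
            action = int(np.argmax(Q[state]))
        nxt, reward, done = step(state, action)
        # Q-learning update: nudge Q(s, a) toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
        state = nxt

print(Q.argmax(axis=1))   # learned policy; "right" (1) should dominate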

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: 

Positive reinforcement learning specifies increasing the tendency that the required behaviour would
occur again by adding something. It enhances the strength of the behaviour of the agent and
positively impacts it.

o Negative Reinforcement Learning: 

Negative reinforcement learning works exactly opposite to the positive RL. It increases the tendency
that the specific behaviour would occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning

o Video Games:
RL algorithms are very popular in gaming applications and are used to achieve super-human
performance. Popular game-playing systems built with RL include AlphaGO and AlphaGO Zero.
o Robotics:
RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas,
and these robots are made more powerful with reinforcement learning. Different industries have
their own vision of building intelligent robots using AI and machine learning technology.

Advantages

o It helps in solving complex real-world problems which are difficult to be solved by general
techniques.
o The learning model of RL is similar to human learning; hence, highly accurate results can be
obtained.
o Helps in achieving long-term results.

Disadvantage

o RL algorithms are not preferred for simple problems.


o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken the results.

SIDE-BY-SIDE COMPARISON

| Criteria | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
|---|---|---|---|
| Input Data | Input data is labelled. | Input data is not labelled. | Input data is not predefined. |
| Problem | Learn the pattern of inputs and their labels. | Divide data into classes. | Find the best reward between a start and an end state. |
| Solution | Finds a mapping equation on input data and its labels. | Finds similar features in input data to classify it into classes. | Maximizes reward by assessing the results of state-action pairs. |
| Model Building | Model is built and trained prior to testing. | Model is built and trained prior to testing. | The model is trained and tested simultaneously. |
| Applications | Deals with regression and classification problems. | Deals with clustering and associative rule mining problems. | Deals with exploration and exploitation problems. |
| Algorithms Used | Decision trees, linear regression, k-nearest neighbors | K-means clustering, k-medoids clustering, agglomerative clustering | Q-learning, SARSA, Deep Q Network |
| Examples | Image detection, population growth prediction, etc. | Customer segmentation, feature elicitation, targeted marketing, etc. | Driverless cars, self-navigating vacuum cleaners, etc. |
KNOWLEDGE DISCOVERY IN DATABASES
What is the KDD Process?

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding
knowledge in data, and emphasizes the "high-level" application of particular data mining methods. It is of
interest to researchers in machine learning, pattern recognition, databases, statistics, artificial intelligence,
knowledge acquisition for expert systems, and data visualization.

The unifying goal of the KDD process is to extract knowledge from data in the context of large databases.

It does this by using data mining methods (algorithms) to extract (identify) what is deemed knowledge,
according to the specifications of measures and thresholds, using a database along with any required
preprocessing, subsampling, and transformations of that database.

An Outline of the Steps of the KDD Process

The overall process of finding and interpreting patterns from data involves the repeated application of the
following steps:

1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of variables, or data samples,
on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the task.
o Using dimensionality reduction or transformation methods to reduce the effective number of
variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s).
o Selecting method(s) to be used for searching for patterns in the data.
o Deciding which models and parameters may be appropriate.
o Matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining.
o Searching for patterns of interest in a particular representational form or a set of such
representations as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns.
9. Consolidating discovered knowledge.

SEMMA

What is SEMMA?
The SAS Institute developed SEMMA as a process for data mining. It has five steps
(Sample, Explore, Modify, Model, and Assess), which give the method its acronym. The data mining method
can be used to solve a wide range of business problems, including fraud identification, customer retention
and turnover, database marketing, customer loyalty, bankruptcy forecasting, market segmentation, as well as
risk, affinity, and portfolio analysis.

Why SEMMA?
Data is used by businesses to achieve a competitive advantage, improve performance, and deliver more
useful services to customers. The data we collect about our surroundings serve as the foundation for
hypotheses and models of the world we live in.

Ultimately, data is accumulated to help in collecting knowledge. That means the data is not worth much
until it is studied and analyzed. But hoarding vast volumes of data is not equivalent to gathering valuable
knowledge. It is only when data is sorted and evaluated that we learn anything from it.

Thus, SEMMA is designed as a data science methodology to help practitioners convert data into knowledge.

The 5 Stages Of SEMMA


SAS presents SEMMA as an organized, functional toolset associated with its SAS Enterprise Miner
initiative. While the SEMMA process may seem more ambiguous to those not using that tool, most regard it
as a general data mining methodology rather than a specific tool.

The process breaks down into its own set of stages. These include:

 Sample: This step entails choosing a subset of appropriate volume from the vast dataset that has
been provided for constructing the model. The goal of this initial stage of the process is to
identify the variables or factors (both dependent and independent) influencing the process. The collected
information is then sorted into preparation and validation categories.
 Explore: During this step, univariate and multivariate analyses are conducted in order to study
interconnected relationships between data elements and to identify gaps in the data. While
multivariate analysis studies the relationships between variables, univariate analysis looks at each
factor individually to understand its part in the overall scheme. All of the factors that
may influence the study’s outcome are analyzed, with heavy reliance on data visualization.
 Modify: In this step, the lessons learned in the exploration phase from the data collected in the
sample phase are applied using business logic. In other words, the data is parsed and cleaned,
refined and transformed where required, and then passed on to the modeling stage.
 Model: With the variables refined and data cleaned, the modeling step applies a variety of data
mining techniques in order to produce a projected model of how this data achieves the final, desired
outcome of the process.
 Assess: In this final SEMMA stage, the model is evaluated for how useful and reliable it is for the
studied topic. The data can now be tested and used to estimate the efficacy of its performance.

UNIT 2

SCALES OF MEASUREMENT
 
Data can be classified as being on one of four scales: nominal, ordinal, interval or ratio. Each level of
measurement has some important properties that are useful to know.
Properties of Measurement Scales:
 Identity – Each value on the measurement scale has a unique meaning.
 Magnitude – Values on the measurement scale have an ordered relationship to one another. That is,
some values are larger and some are smaller.
 Equal intervals – Scale units along the scale are equal to one another. For example, the difference
between 1 and 2 is equal to the difference between 11 and 12.
 A minimum value of zero – The scale has a true zero point, below which no values exist.
1. Nominal Scale –
Nominal variables can be placed into categories. These don’t have a numeric value and so cannot be added,
subtracted, divided or multiplied. They also have no order; the nominal scale of measurement satisfies only
the identity property of measurement.
For example, gender is an example of a variable that is measured on a nominal scale. Individuals may be
classified as “male” or “female”, but neither value represents more or less “gender” than the other.
2. Ordinal Scale –
The ordinal scale contains things that you can place in order. It measures a variable in terms of magnitude,
or rank. Ordinal scales tell us relative order, but give us no information regarding differences between the
categories. The ordinal scale has the property of both identity and magnitude.
For example, in a race, if Ram takes first place and Vidur takes second, we know their order but not by how
many seconds the competition was decided.
3. Interval Scale –
An interval scale has ordered numbers with meaningful divisions; the magnitude between consecutive
intervals is equal. Interval scales do not have a true zero, i.e., 0 degrees Celsius does not mean the absence
of heat.
Interval scales have the properties of:
 Identity
 Magnitude
 Equal distance
For example, temperature on a Fahrenheit/Celsius thermometer: 90° is hotter than 45°, and the difference
between 10° and 30° is the same as the difference between 60° and 80°.
4. Ratio Scale –
The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has
equality of units with one major difference: zero is meaningful (no numbers exist below the zero). The true
zero allows us to know how many times greater one case is than another. Ratio scales have all of the
characteristics of the nominal, ordinal and interval scales. The simplest example of a ratio scale is the
measurement of length. Having zero length or zero money means that there is no length and no money,
whereas zero temperature (in Celsius or Fahrenheit) is not an absolute zero.
Properties of Ratio Scale:
 Identity
 Magnitude
 Equal distance
 Absolute/true zero
For example, in distance, 10 miles is twice as long as 5 miles.

WAYS TO HANDLE MISSING VALUES


Popular strategies to handle missing values in the dataset
The real-world data often has a lot of missing values. The cause of missing values can be data corruption or
failure to record data. The handling of missing data is very important during the preprocessing of the dataset
as many machine learning algorithms do not support missing values.
This article covers 7 ways to handle missing values in the dataset:

1. Deleting Rows with missing values


2. Impute missing values for continuous variable
3. Impute missing values for categorical variable
4. Other Imputation Methods
5. Using Algorithms that support missing values
6. Prediction of missing values
7. Imputation using Deep Learning Library — Datawig

Delete Rows with Missing Values:


Missing values can be handled by deleting the rows or columns having null values. If a column has more
than half of its rows as null, the entire column can be dropped. Rows with one or more null column values
can also be dropped.

Pros:
 Training on data with all missing values removed can create a robust model.
Cons:
 Loss of a lot of information.
 Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.
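
A minimal pandas sketch of this deletion strategy follows; the toy data frame is illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31], "salary": [50000, 60000, np.nan]})

df_rows_dropped = df.dropna()    # drop rows that have any null value
# drop columns with more than half of their rows null (thresh = required non-nulls)
df_cols_dropped = df.dropna(axis=1, thresh=len(df) // 2 + 1)
print(df_rows_dropped)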
Impute missing values with Mean/Median:
Columns in the dataset which are having numeric continuous values can be replaced with the mean, median,
or mode of remaining values in the column. This method can prevent the loss of data compared to the earlier
method. Replacing the above two approximations (mean, median) is a statistical approach to handle the
missing values.

The missing values are replaced by the mean value; in the same way, they can be replaced by the median
value, as in the sketch below.
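
A minimal sketch of mean/median imputation with pandas; the data is illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 42.0]})

df["age_mean"] = df["age"].fillna(df["age"].mean())      # mean imputation
df["age_median"] = df["age"].fillna(df["age"].median())  # median imputation
print(df)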

Pros:

 Prevents the data loss that results from deleting rows or columns


 Works well with a small dataset and is easy to implement.

Cons:

 Works only with numerical continuous variables.
 Can cause data leakage.
 Does not factor in the covariance between features.

Imputation method for categorical columns:


When the missing values are in categorical columns (string or numerical), they can be replaced with the
most frequent category. If the number of missing values is very large, they can be replaced with a new
category.
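
A minimal sketch of both options with pandas; the data is illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"colour": ["red", "blue", np.nan, "red"]})

df["colour_mode"] = df["colour"].fillna(df["colour"].mode()[0])  # most frequent category
df["colour_new"] = df["colour"].fillna("Missing")                # a new category
print(df)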

Pros:
 Prevents the data loss that results from deleting rows or columns
 Works well with a small dataset and is easy to implement.
 Negates the loss of data by adding a unique category
Cons:
 Works only with categorical variables.
 Addition of new features to the model while encoding, which may result in poor performance

Other Imputation Methods:


Depending on the nature of the data or data type, some other imputation methods may be more appropriate.
For example, for a data variable with longitudinal behavior, it might make sense to use the last valid
observation to fill the missing value; this is known as the Last Observation Carried Forward (LOCF)
method. For a time-series variable, it makes sense to interpolate between the values before and after the
missing timestamp.
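
A minimal sketch of LOCF and interpolation with pandas; the series values are illustrative.

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 16.0])

print(s.ffill())         # LOCF: carry the last valid observation forward
print(s.interpolate())   # linear interpolation between neighbouring values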
Using Algorithms that support missing values:
Not all machine learning algorithms support missing values, but some are robust to missing values in the
dataset. The k-NN algorithm can ignore a column from a distance measure when a value is missing. Naive
Bayes can also support missing values when making a prediction. These algorithms can be used when the
dataset contains null or missing values. Note, however, that the scikit-learn implementations of naive Bayes
and k-nearest neighbors in Python do not support missing values.
Another algorithm that can be used here is RandomForest, which works well on non-linear and categorical
data. It adapts to the data structure, taking into consideration the high variance or the bias, producing better
results on large datasets.
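
One scikit-learn estimator that does accept missing values natively is HistGradientBoostingClassifier; the sketch below demonstrates this, with made-up toy data.

import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [3.0, 5.0]] * 10)
y = np.array([0, 0, 1, 1] * 10)

clf = HistGradientBoostingClassifier().fit(X, y)   # NaNs handled, no imputation needed
print(clf.predict([[2.5, np.nan]]))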

Pros:
 No need to handle missing values in each column as ML algorithms will handle them efficiently.
Cons:
 The k-NN and naive Bayes implementations in the scikit-learn library do not handle missing values.

Prediction of missing values:


The earlier methods do not take advantage of the correlation between the variable containing the missing
values and the other variables. The features that don't have nulls can be used to predict the missing values.
A regression or classification model can be used for this prediction, depending on the nature (continuous or
categorical) of the feature with missing values.
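
A sketch of model-based imputation using scikit-learn's IterativeImputer, which predicts each column with missing values from the other columns; the experimental import is required by scikit-learn, and the data is illustrative.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, np.nan], [4.0, 8.0]])
print(IterativeImputer(random_state=0).fit_transform(X))   # fills row 3 from column 1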

Pros:
 Gives a better result than earlier methods
 Takes into account the covariance between the missing value column and other columns.
Cons:
 Considered only as a proxy for the true values

Imputation using Deep Learning Library — Datawig


This method works very well with categorical, continuous, and non-numerical features. Datawig is a library
that learns ML models using deep neural networks to impute missing values in a data frame.
Datawig can take a data frame and fit an imputation model for each column with missing values, with all
other columns as inputs.
Below is a sketch of imputing missing values in the Age column.
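
The sketch below is based on Datawig's documented SimpleImputer interface (pip install datawig); the column names, toy data, and output path are illustrative assumptions.

import pandas as pd
import numpy as np
import datawig

df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare":   [80.0, 7.9, 21.0, 8.1, 95.0, 20.5],
    "Age":    [38.0, 22.0, np.nan, 26.0, np.nan, 30.0],
})
df_train = df[df["Age"].notna()]   # rows where Age is known
df_test = df[df["Age"].isna()]     # rows to impute

imputer = datawig.SimpleImputer(
    input_columns=["Pclass", "Fare"],   # columns used to predict Age
    output_column="Age",                # column to impute
    output_path="imputer_model"         # where the fitted model is stored
)
imputer.fit(train_df=df_train, num_epochs=5)
print(imputer.predict(df_test))         # adds an Age_imputed column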
Pros:
 Quite accurate compared to other methods.
 It supports CPUs and GPUs.
Cons:
 Can be quite slow with large datasets.

Conclusion:
Most real-world datasets have missing values that need to be handled intelligently to create a robust model.
In this article, I have discussed 7 ways to handle missing values that cover every type of column. There is
no rule of thumb for handling missing values in a particular manner; choose the method that yields a robust
model with the best performance. One can use various methods on different features depending on how and
what the data is about. Having domain knowledge about the dataset is important, as it can give insight into
how to pre-process the data and handle missing values.

HANDLING CATEGORICAL DATA IN MACHINE LEARNING


Not all machine learning algorithms can handle categorical data, so it is very important to convert the
categorical features of a dataset into numeric values. The scikit-learn library in Python provides many
methods for handling categorical data. Some of the best techniques for handling categorical data are:

1. LabelEncoder
2. LabelBinarizer
To use these two methods to handle categorical data, we first need to have a dataset with categorical
features. So let’s create one:

import numpy as np

x = np.random.uniform(0.0, 1.0, size=(10, 2))
y = np.random.choice(("Male", "Female"), size=(10))

print(x[0])
print(y[0])

Output:

[0.03345401 0.48645195]
Female
So, as you can see, I created a very small dataset consisting of 10 categorical samples labelled Male or
Female. In the section below, I'll show you how to handle these categorical features in machine learning
using LabelEncoder and LabelBinarizer.
LabelEncoder:
The LabelEncoder class of the scikit-learn library in Python takes a dictionary-oriented approach to
associate each categorical value with a progressive integer value. Below is how to use LabelEncoder for
handling categorical data in machine learning:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y1 = le.fit_transform(y)

print(y1)

Output:

[1 0 0 1 1 1 1 1 1 1]
This is how we can use LabelEncoder to handle categorical features; you can also decode these transformed
values back to the original categorical labels as shown below:

output = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
output1 = [le.classes_[int(i)] for i in output]

print(output1)

Output:

['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male']
LabelBinarizer:
The LabelEncoder method works in many cases when transforming categorical data into numeric values,
but it has the disadvantage that all the labels are transformed into sequential numbers. For this reason, it is
often better to use one-hot encoding, which binarizes categorical data. Here's how to use the LabelBinarizer
class in scikit-learn to handle categorical data:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
y2 = lb.fit_transform(y)

print(y2)

Output:

[[0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]]
Here is how you can decode these transformed values back to the original categorical labels:

output2 = lb.inverse_transform(y2)
print(output2)

Output:

['Female' 'Male' 'Female' 'Female' 'Female' 'Male' 'Male' 'Male' 'Female'
 'Female']
Summary
When solving problems based on classification with machine learning, we mostly find datasets made up of
categorical labels that cannot be processed by all machine learning algorithms. This is why we need to
convert the categorical features into numerical values.
NORMALIZING DATA

Normalization is a data preparation technique that is frequently used in machine learning. Normalization is
the process of transforming the columns in a dataset to the same scale. Not every dataset needs to be
normalized for machine learning; it is only required when the ranges of features differ.
What is normalization?
Let’s first define what exactly is normalization.
Let’s say we have a dataset containing two variables: time traveled and distance covered. Time is measured
in hours (e.g. 5, 10, 25 hours ) and distance in miles (e.g. 500, 800, 1200 miles). Do you see the problem?
One obvious problem of course is that these two variables are measured in two different units — one in
hours and the other in miles. The other problem — which is not obvious but if you take a closer look you'll
find it — is the distribution of data, which is quite different in these two variables (both within and between
variables).
The purpose of normalization is to transform data in a way that they are either dimensionless and/or have
similar distributions. This process of normalization is known by other names such as standardization, feature
scaling etc. Normalization is an essential step in data pre-processing in any machine learning application and
model fitting.
Does normalization help?
Now the question is how (on earth) exactly does this transformation help?
The short answer is: it can dramatically improve model accuracy.
Normalization gives equal weights/importance to each variable so that no single variable steers model
performance in one direction just because they are bigger numbers.
As an example, clustering algorithms use distance measures to determine if an observation should belong to
a certain cluster. “Euclidean distance” is often used to measure those distances. If a variable has significantly
higher values, it can dominate distance measures, suppressing other variables with small values.
 
What tools and techniques are used?
Several methods are applied for normalization, three popular and widely used techniques are as follows:

 Rescaling: also known as “min-max normalization”, it is the simplest of all methods and is calculated as:

x' = (x - min(x)) / (max(x) - min(x))

 Mean normalization: this method uses the mean of the observations in the transformation process:

x' = (x - mean(x)) / (max(x) - min(x))

 Z-score normalization: also known as standardization, this technique uses the Z-score or “standard score”.
It is widely used in machine learning algorithms such as SVM and logistic regression:

z = (x - µ) / σ

Here, z is the standard score, µ is the population mean and σ is the population standard deviation.
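
A minimal numpy sketch of the three formulas above; the sample values are illustrative.

import numpy as np

x = np.array([5.0, 10.0, 25.0])                    # e.g. hours travelled

rescaled = (x - x.min()) / (x.max() - x.min())     # min-max normalization
mean_norm = (x - x.mean()) / (x.max() - x.min())   # mean normalization
z_score = (x - x.mean()) / x.std()                 # z-score standardization

print(rescaled, mean_norm, z_score)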
Show an example
Let’s do an experiment, it’s always good to see an algorithm in action. The example is not going to be
dramatic, I’m just showing it as an illustration to give an intuition.
Let’s import some libraries — pandas for data wrangling, matplotlib for visualization
and preprocessing and KMeans from the sklearn library.
Let’s also import data from a GitHub repo as a csv file. That’s the Iris dataset, already cleaned, so you can
import and follow along right away.
I’m going to use two features — petal_length and sepal_length — for clustering of data points.
# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.cluster import KMeans

# importing **cleaned** data
df = pd.read_csv("iris.csv")

# feature selection
df = df[["petal_length", "sepal_length"]]
After importing libraries and data, first we’ll implement the KMeans clustering algorithm without
normalization. I’ve annotated each line of code so you know what’s going on.
# inputs (NOT normalized)
X_not_norm = df.values

# instantiate model
model = KMeans(n_clusters=3)

# fit predict
y_model = model.fit_predict(X_not_norm)

# visualizing clusters
plt.scatter(X_not_norm[:, 0], X_not_norm[:, 1], c=model.labels_, cmap='viridis')

# counts per cluster
print("Value Counts")
print(pd.value_counts(y_model))
The outputs are the count of data points in each cluster and the visualization of those clusters.
Now let’s re-run the model, this time after normalizing the inputs using preprocessing from
the sklearn library.
# normalizing inputs
X_norm = preprocessing.scale(df)

# instantiate model
model = KMeans(n_clusters=3)

# fit predict
y_model = model.fit_predict(X_norm)

# counts per cluster
print("Value Counts")
print(pd.value_counts(y_model))

# visualize clusters
plt.scatter(X_norm[:, 0], X_norm[:, 1], c=model.labels_, cmap='viridis')
In the following are the outputs before and after the normalization of data. First, if you compare the value
counts there are some changes — for example, the count of cluster 0 is reduced by 4 members.
If you closely examine the data points in the left and the right figures you might be able to see which data
points shifted from pre-normalized to post-normalized model. These changes are often at the boundaries
rather than at either end of the spectrum in the distribution. Again, as I said, it’s not too dramatic, but you
get the point.
[Figure: Differences of clustering before and after normalization (source: Author)]
The other side of the coin…
So far we've got the impression that normalization is absolutely a great thing in data science! Not
really; it does some good things but creates some bad side effects along the way.
Normalization compresses data within a certain range, reduces the variance and applies equal weights to all
features. You can lose a lot of important information in the process.
One example is what happens to outliers: normalization leaves almost no trace of them. We perceive
outliers as bad guys and want to get rid of them ASAP. But remember, outliers are real data points; once
you lose them just to get a better model, you lose information.
In the process of normalization, the variables lose their units of measurement too. So at the end of
modeling, you can't really tell what the key differences between the variables are.

FEATURE CONSTRUCTION OR GENERATION


Feature engineering is the pre-processing step of machine learning, which is used to transform raw data into
features that can be used for creating a predictive model using Machine learning or statistical Modelling.
Feature engineering in machine learning aims to improve the performance of models.

Generally, all machine learning algorithms take input data to generate the output. The input data remains in
a tabular form consisting of rows (instances or observations) and columns (variable or attributes), and these
attributes are often known as features.

A feature (or column) represents a measurable piece of data like name, age or gender. It is the basic building
block of a dataset. The quality of a feature can vary significantly and has an immense effect on model
performance. We can improve the quality of a dataset’s features in the pre-processing stage using processes
like Feature Generation and Feature Selection.

Feature Generation (also known as feature construction, feature extraction or feature engineering) is the
process of transforming features into new features that better relate to the target. This can involve mapping a
feature into a new feature using a function like log, or creating a new feature from one or multiple features
using multiplication or addition.

Feature Generation can improve model performance when there is a feature interaction. Two or more
features interact if their combined effect is greater or less than the sum of their individual effects. It is
possible to make interactions with three or more features, but this tends to result in diminishing returns.
Feature Generation is often overlooked as it is assumed that the model will learn any relevant relationships
between features to predict the target variable. However, the generation of new flexible features is important
as it allows us to use less complex models that are faster to run and easier to understand and maintain.

CORRELATION AND CAUSATION 


Correlation :
It is a statistical term which depicts the degree of association between two random variables. In data
analysis it is often used to determine the extent to which they relate to one another.
Three types of correlation-
1. Positive correlation –
If with increase in random variable A, random variable B increases too, or vice versa.
2. Negative correlation –
If increase in random variable A leads to a decrease in B, or vice versa.
3. No correlation –
When both the variables are completely unrelated and change in one leads to no change in
other.
Causation :
Causation between random variables A and B implies that A and B have a cause-and-effect relationship
with one another. We can say the existence of one gives rise to the other; A causes B, or vice versa.
Causation is also termed causality.
Correlation does not imply causation, although correlation and causation can exist at the same time. The
example below shows this difference more clearly:

A dead battery causes the computer to shut down, and it also causes the video player to stop; this shows the
causal effect of the battery on the laptop and the video player. The fact that the video player stops the
moment the computer shuts down shows that the two are correlated, more specifically positively correlated.
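
A minimal sketch of measuring correlation with numpy; the data is illustrative.

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 4, 5, 8, 10])     # tends to rise with a: positive correlation

print(np.corrcoef(a, b)[0, 1])     # close to +1; note this proves nothing causal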

POLYNOMIAL REGRESSION
o Polynomial Regression is a regression algorithm that models the relationship between a dependent (y)
and independent variable (x) as an nth-degree polynomial. The Polynomial Regression equation is given
below:

y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n

o It is also called a special case of Multiple Linear Regression in ML, because we add some
polynomial terms to the Multiple Linear Regression equation to convert it into Polynomial
Regression.
o It is a linear model with some modifications made to increase the accuracy.
o The dataset used in Polynomial Regression for training is of a non-linear nature.
o It makes use of a linear regression model to fit complicated, non-linear functions and datasets.
Steps for Polynomial Regression:

The main steps involved in Polynomial Regression are given below:

o Data Pre-processing
o Build a Linear Regression model and fit it to the dataset
o Build a Polynomial Regression model and fit it to the dataset
o Visualize the result for Linear Regression and Polynomial Regression model.
o Predicting the output.
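
A minimal sketch of these steps follows (assumes scikit-learn); the roughly quadratic data is illustrative.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = np.array([1.0, 4.2, 8.9, 16.1, 25.2])               # roughly quadratic target

X_poly = PolynomialFeatures(degree=2).fit_transform(X)  # add bias and x^2 terms
model = LinearRegression().fit(X_poly, y)               # still a linear model
print(model.predict(PolynomialFeatures(degree=2).fit_transform([[6.0]])))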

LINEAR REGRESSION
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical
method that is used for predictive analysis. Linear regression makes predictions for continuous/real or
numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or
more independent (x) variables, hence the name. Since linear regression shows a linear relationship, it finds
how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the
variables.
Mathematically, we can represent a linear regression as:

y = a0 + a1x + ε

Here,

Y= Dependent Variable (Target Variable)


X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error

The values for x and y variables are training datasets for Linear Regression model representation.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Multiple Linear Regression.
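
A minimal simple-linear-regression sketch of y = a0 + a1x follows (assumes scikit-learn); the salary-vs-experience numbers are illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]], dtype=float)   # years of experience
y = np.array([30000, 35000, 41000, 45000])        # salary

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])           # a0 and a1
print(model.predict([[5.0]]))                     # predicted salary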

LOGISTIC REGRESSION

o Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
Supervised Learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
o Logistic Regression is very similar to Linear Regression except in how it is used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data and can
easily determine the most effective variables for the classification. The logistic (sigmoid) function
is described below.

Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to probabilities.
o It maps any real value into another value within a range of 0 and 1:

f(x) = 1 / (1 + e^(-x))

o The value of the logistic regression must be between 0 and 1 and cannot go beyond this limit, so it
forms an "S"-shaped curve. The S-form curve is called the sigmoid function or the logistic
function.

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation by taking the
logarithm of the odds:

log( y / (1 - y) ) = b0 + b1x1 + b2x2 + ... + bnxn

This is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the dependent
variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of
the dependent variable, such as "cat", "dogs", or "sheep"
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent
variables, such as "low", "Medium", or "High".

ROC CURVE
ROC or Receiver Operating Characteristic curve represents a probability graph to show the performance of a
classification model at different threshold levels. The curve is plotted between two parameters, which are:

o True Positive Rate or TPR


o False Positive Rate or FPR

In the curve, TPR is plotted on Y-axis, whereas FPR is on the X-axis.

TPR:

TPR or True Positive Rate is a synonym for Recall, and can be calculated as:

TPR = TP / (TP + FN)

FPR or False Positive Rate can be calculated as:

FPR = FP / (FP + TN)

Here, TP: True Positive

FP: False Positive

TN: True Negative

FN: False Negative

Now, to summarize performance across all threshold levels efficiently, we use a method called AUC.

AUC: Area Under the ROC curve

AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional
area under the entire ROC curve, from (0,0) to (1,1). AUC computes the performance of the binary
classifier across different thresholds and provides an aggregate measure. The value of AUC ranges from 0
to 1; an excellent model will have an AUC near 1, showing a good measure of separability.
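
A minimal sketch of computing the ROC curve and AUC with scikit-learn; the labels and scores are illustrative.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr)                              # points of the ROC curve
print(roc_auc_score(y_true, y_score))        # area under the curve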

Applications of AUC-ROC Curve

Although the AUC-ROC curve is used to evaluate a classification model, it is widely used for various
applications. Some of the important applications of AUC-ROC are given below:

1. Classification of 3D models
The curve can be used to classify a 3D model and separate it from normal models. With the specified
threshold level, the curve separates the 3D models from the non-3D models.
2. Healthcare
The curve has various applications in the healthcare sector. It can be used to detect cancer disease in
patients. It does this by using false positive and false negative rates, and accuracy depends on the
threshold value used for the curve.
3. Binary Classification
AUC-ROC curve is mainly used for binary classification problems to evaluate their performance.

UNIT 3

DECISION TREE
Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes
are used to make any decision and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into
subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset and
problem is the main point to remember while creating a machine learning model. Below are the two reasons
for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like structure.

Decision Tree Terminologies

o Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
o Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
o Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
o Branch/Sub Tree: A tree formed by splitting the tree.
o Pruning: Pruning is the process of removing the unwanted branches from the tree.
o Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?

In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the
tree. The algorithm compares the values of the root attribute with the record's (real dataset's) attribute and,
based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves
further. It continues the process until it reaches a leaf node of the tree. The complete process can be better
understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3.
Continue this process until a stage is reached where you cannot classify the nodes further; the final
node is called a leaf node.
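
A minimal code sketch of building and querying a decision tree follows (assumes scikit-learn); the iris dataset and parameters are illustrative.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

print(export_text(tree))     # the learned root/decision/leaf structure as text
print(tree.predict(X[:1]))   # follow the branches for one record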

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept
the offer or Not. So, to solve this problem, the decision tree starts with the root node (Salary attribute by
ASM). The root node splits further into the next decision node (distance from the office) and one leaf node
based on the corresponding labels. The next decision node further gets split into one decision node (Cab
facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root
node and for sub-nodes. To solve such problems there is a technique called the Attribute Selection
Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the
tree. Two popular techniques for ASM are:

o Information Gain
o Gini Index
Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making a
decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o A decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o As the number of class labels grows, the computational complexity of the decision tree may increase.

SUPPORT VECTOR MACHINE ALGORITHM


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used
for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the
future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example we used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it learns their different features, and then we test it with this strange creature. The SVM draws a decision boundary between the two classes (cat and dog) using the extreme cases (the support vectors), so it considers the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood using an example. Suppose we have a dataset that has two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate these two classes with a straight line. But there can be multiple lines that separate the classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
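A minimal sketch of a linear SVM in scikit-learn; the (x1, x2) points and their tags are invented for illustration:

from sklearn.svm import SVC

# Two features (x1, x2) with two tags: 0 = blue, 1 = green
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")   # fits the maximum-margin hyperplane
clf.fit(X, y)
print(clf.support_vectors_)  # the extreme points that define the margin
print(clf.predict([[3, 2]])) # classify a new (x1, x2) pair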

Non-Linear SVM:

If data is linearly arranged, we can separate it using a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:

So, to separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z, calculated as:

z = x^2 + y^2

By adding the third dimension, the sample space will look like the below image:

Now SVM will divide the dataset into classes in the following way. Consider the below image:

Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space at z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
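The same idea can be sketched in code: adding z = x^2 + y^2 by hand makes ring-shaped data linearly separable, which is what a non-linear kernel does implicitly. The points below are invented:

from sklearn.svm import SVC

# An inner ring (class 0) and an outer ring (class 1)
X = [[0.5, 0.0], [0.0, 0.6], [-0.5, 0.1], [2.0, 0.0], [0.0, 2.2], [-2.1, 0.3]]
y = [0, 0, 0, 1, 1, 1]

# Add the explicit third dimension z = x^2 + y^2
X3d = [[x, yv, x**2 + yv**2] for x, yv in X]
clf = SVC(kernel="linear").fit(X3d, y)  # now linearly separable in 3-D
print(clf.predict([[0.3, 0.3, 0.18]]))

# In practice, a kernel (e.g. RBF) performs such a mapping implicitly:
clf2 = SVC(kernel="rbf").fit(X, y)
print(clf2.predict([[0.3, 0.3]]))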

K-NEAREST NEIGHBOR (KNN)
K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o K-NN can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image most similar to those of the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to the other data points.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
o Step-4: Among these K neighbors, count the number of data points in each category.
o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category. Consider the below image:

o Firstly, we will choose the number of neighbors; here we choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as: d = sqrt((x2 - x1)^2 + (y2 - y1)^2)
o By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B. Consider the below image:

o As the 3 nearest neighbors are from Category A, this new data point must belong to Category A. A code sketch of this procedure follows.
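The same walkthrough in code, as a minimal sketch with invented points for Category A and Category B:

from sklearn.neighbors import KNeighborsClassifier

# Invented points: Category A = 0, Category B = 1
X = [[1, 1], [1, 2], [2, 1], [2, 2], [6, 6], [7, 7], [6, 7]]
y = [0, 0, 0, 0, 1, 1, 1]

# k = 5 neighbours; the default metric is the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)  # the "lazy learner" just stores the data here
print(knn.predict([[3, 3]]))  # majority vote among the 5 nearest points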

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try several values to find the best among them. The most commonly preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and expose the model to the effects of outliers.
o Larger values for K are generally more stable, but too large a value may include points from other categories.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:


o It always needs a value of K to be determined, which can be complex at times.
o The computation cost is high because the distance from the new point to all the training samples must be calculated.

TIME-SERIES FORECASTING

Time series forecasting is one of the most applied data science techniques in business, finance, supply chain
management, production and inventory planning. Many prediction problems involve a time component and
thus require extrapolation of time series data, or time series forecasting. Time series forecasting is also an
important area of machine learning (ML) and can be cast as a supervised learning problem. ML methods
such as Regression, Neural Networks, Support Vector Machines, Random Forests and XGBoost can be
applied to it. Forecasting involves taking models fit on historical data and using them to predict future
observations.

Time series forecasting means to forecast or to predict the future value over a period of time. It entails
developing models based on previous data and applying them to make observations and guide future
strategic decisions.

The future is forecast or estimated based on what has already happened. Time series adds a time order
dependence between observations. This dependence is both a constraint and a structure that provides a
source of additional information. Before we discuss time series forecasting methods, let’s define time series
forecasting more closely.

Time series forecasting is a technique for the prediction of events through a sequence of time. It predicts
future events by analyzing the trends of the past, on the assumption that future trends will hold similar to
historical trends. It is used across many fields of study in various applications including:

• Astronomy
• Business planning
• Control engineering
• Earthquake prediction
• Econometrics
• Mathematical finance
• Pattern recognition
• Resources allocation
• Signal processing
• Statistics
• Weather forecasting

Time series models

Time series models are used to forecast events based on verified historical data. Common types include ARIMA, smoothing-based models, and moving averages. Not all models will yield the same results for the same dataset, so it is critical to determine which one works best for the individual time series.
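As an illustrative sketch only, an ARIMA model can be fitted with the statsmodels library (assuming statsmodels is installed; the monthly series below is synthetic):

from statsmodels.tsa.arima.model import ARIMA

# A synthetic series; a real application would load historical observations
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

model = ARIMA(series, order=(1, 1, 1))  # (p, d, q): AR, differencing, MA terms
fitted = model.fit()
print(fitted.forecast(steps=3))  # predict the next three periods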

When forecasting, it is important to understand your goal. To narrow down the specifics of your predictive
modeling problem, ask questions about:

1. Volume of data available — more data is often more helpful, offering greater opportunity for
exploratory data analysis, model testing and tuning, and model fidelity.
2. Required time horizon of predictions — shorter time horizons are often easier to predict — with
higher confidence — than longer ones.
3. Forecast update frequency — Forecasts might need to be updated frequently over time or might need
to be made once and remain static (updating forecasts as new information becomes available often
results in more accurate predictions).
4. Forecast temporal frequency — Often forecasts can be made at lower or higher frequencies, which
allows harnessing downsampling and up-sampling of data (this in turn can offer benefits while
modeling).

CLUSTERING

A cluster refers to a group of similar objects. Clustering is grouping those objects into clusters. In order to learn clustering, it is important to understand the scenarios that lead to clustering different objects.
What is Clustering?
• Clustering is dividing data points into homogeneous classes or clusters:
• Points in the same group are as similar as possible
• Points in different groups are as dissimilar as possible
• When a collection of objects is given, we put the objects into groups based on similarity.
Clustering Algorithms
• A clustering algorithm tries to analyze natural groups of data on the basis of some similarity. It locates the centroid of each group of data points. To carry out effective clustering, the algorithm evaluates the distance of each point from the centroid of its cluster.
• The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.
The below diagram explains the working of the clustering algorithm. We can see the different fruits are divided into several groups with similar properties.

Types of Clustering Methods

Clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (data points can belong to more than one group). Other clustering approaches also exist. Below are the main clustering methods used in Machine Learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-
based method. The most common example of partitioning clustering is the K-Means Clustering
algorithm.

In this type, the dataset is divided into a set of k groups, where K is used to define the number of pre-defined
groups. The cluster center is created in such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid.

Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm identifies different clusters in the dataset by connecting areas of high density into clusters. The dense areas in the data space are separated from each other by sparser areas.

These algorithms can face difficulty in clustering the data points if the dataset has varying densities and high dimensionality.

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability that a dataset belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization Clustering algorithm, which uses Gaussian Mixture Models (GMM).

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional clustering, as it does not require pre-specifying the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. The observations, or any number of clusters, can be selected by cutting the tree at the correct level. The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that express its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.

Clustering Algorithms

Clustering algorithms can be divided based on the models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of algorithm depends on the kind of data we are using. For example, some algorithms need the number of clusters in the given dataset to be guessed, whereas others need to find the minimum distance between the observations of the dataset.

Here we are discussing mainly popular Clustering algorithms that are widely used in machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into different clusters of equal variance. The number of clusters must be specified in this algorithm. It is fast, requiring fewer computations, with a linear complexity of O(n); see the K-Means sketch after this list.
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth density of
data points. It is an example of a centroid-based model, that works on updating the candidates for
centroid to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise.
It is an example of a density-based model similar to the mean-shift, but with some remarkable
advantages. In this algorithm, the areas of high density are separated by the areas of low density.
Because of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative
for the k-means algorithm or for those cases where K-means can be failed. In GMM, it is assumed
that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm performs the
bottom-up hierarchical clustering. In this, each data point is treated as a single cluster at the outset
and then successively merged. The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms in that it does not require the number of clusters to be specified. Here, each pair of data points exchanges messages until convergence. Its O(N^2 T) time complexity is the main drawback of this algorithm.
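As a minimal K-Means sketch with scikit-learn (the 2-D points are invented), note that K must be specified up front:

from sklearn.cluster import KMeans

# Invented points forming two loose groups
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [8, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # the two centroids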

Applications of Clustering

Below are some commonly known applications of clustering technique in Machine Learning:

o In Identification of Cancer Cells: The clustering algorithms are widely used for the identification
of cancerous cells. It divides the cancerous and non-cancerous data sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search result appears
based on the closest object to the search query. It does it by grouping similar data objects in one
group that is far from the other dissimilar objects. The accurate result of a query depends on the
quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers based on their
choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and animals using
the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular area of land is most suitable.

PCA

Principal Component Analysis (PCA): Principal Component Analysis (PCA) is an unsupervised, non-
parametric statistical technique primarily used for dimensionality reduction in machine learning.

• Principal Component Analysis is an unsupervised learning algorithm that is used for the dimensionality
reduction in machine learning.
• It is a statistical process that converts the observations of correlated features into a set of linearly
uncorrelated features with the help of orthogonal transformation. These new transformed features are called
the Principal Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by reducing the
variances.
• PCA generally tries to find the lower-dimensional surface to project the high-dimensional data.
• High dimensionality means that the dataset has a large number of features. The primary problem associated
with high-dimensionality in the machine learning field is model overfitting, which reduces the ability to
generalize beyond the examples in the training set.
• Richard Bellman described this phenomenon in 1961 as the Curse of Dimensionality: “Many algorithms that work fine in low dimensions become intractable when the input is high-dimensional.”
• The ability to generalize correctly becomes exponentially harder as the dimensionality of the training
dataset grows, as the training set covers a dwindling fraction of the input space. Models also become more
efficient as the reduced feature set boosts learning rates and diminishes computation costs by removing
redundant features.
• PCA can also be used to filter noisy datasets, as in image compression. The first principal component expresses the largest amount of variance. Each additional component expresses less variance and more noise, so representing the data with a smaller subset of principal components preserves the signal and discards the noise.
The PCA algorithm is based on some mathematical concepts such as:

• Variance and Covariance


• Eigenvalues and Eigenvectors

Principal Components in PCA

As described, the transformed new features, or the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:

1. A principal component must be a linear combination of the original features.
2. These components are orthogonal, i.e., the correlation between any pair of them is zero.
3. The importance of each component decreases from 1 to n: the 1st PC has the most importance, and the nth PC has the least importance.

Applications of Principal Component Analysis:

• PCA is mainly used as the dimensionality reduction technique in various AI applications such as computer
vision, image compression, etc.

• It can also be used for finding hidden patterns if data has high dimensions. Some fields where PCA is used
are Finance, data mining, Psychology, etc.

Steps for PCA algorithm

1. Getting the dataset: Firstly, we take the input dataset and divide it into two subparts, X and Y, where X is the training set and Y is the validation set.
2. Representing data into a structure: Now we represent our dataset as a structure, namely a two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data: In this step, we standardize our dataset. In a given column, features with high variance are considered more important than features with lower variance. If the importance of features should be independent of their variance, we divide each data item in a column by the standard deviation of that column. We name the resulting matrix Z.
4. Calculating the covariance of Z: To calculate the covariance of Z, we take the matrix Z and transpose it. After transposing, we multiply it by Z. The output matrix is the covariance matrix of Z.
5. Calculating the eigenvalues and eigenvectors: Now we calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The eigenvectors of the covariance matrix are the directions of the axes with the most information, and the corresponding eigenvalues are their coefficients.
6. Sorting the eigenvectors: In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resulting matrix is named P*.
7. Calculating the new features, or principal components: Here we calculate the new features. To do this, we multiply the P* matrix by Z. In the resulting matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of the others.
8. Removing less important features from the new dataset: The new feature set has been obtained, so we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
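The steps above can be sketched directly with NumPy; the data matrix here is invented, and in practice a library routine such as scikit-learn's PCA would normally be used instead:

import numpy as np

# Invented data: rows are data items, columns are features (Step 2)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 3: standardize each column
cov = np.cov(Z, rowvar=False)              # Step 4: covariance matrix of Z
eigvals, eigvecs = np.linalg.eigh(cov)     # Step 5: eigenvalues/eigenvectors

order = np.argsort(eigvals)[::-1]          # Step 6: sort by decreasing eigenvalue
P_star = eigvecs[:, order]

Z_star = Z @ P_star                        # Step 7: project onto the PCs
print(Z_star[:, :1])                       # Step 8: keep only the first PC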

UNIT 4
BIAS AND VARIANCE
Machine learning is a branch of Artificial Intelligence which allows machines to perform data analysis and make predictions. However, if the machine learning model is not accurate, it can make prediction errors, and these prediction errors are usually known as Bias and Variance.

In machine learning, these errors will always be present as there is always a slight difference between the
model predictions and actual predictions. The main aim of ML/data science analysts is to reduce these errors
in order to get more accurate results.

What is Bias?

In general, a machine learning model analyses the data, finds patterns in it, and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the values predicted by the model and the actual/expected values, and this difference is known as bias error, or error due to bias. It can be defined as the inability of machine learning algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from the assumptions in the model that make the target function simpler to learn. A model has either:

o Low Bias: A low bias model will make fewer assumptions about the form of the target function.
o High Bias: A model with a high bias makes more assumptions, and the model becomes unable to
capture the important features of our dataset. A high bias model also cannot perform well on new
data.

Generally, a linear algorithm has high bias, as its strong assumptions make it learn fast. The simpler the algorithm, the more bias is likely to be introduced, whereas a nonlinear algorithm often has low bias.

Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. Algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Ways to reduce High Bias:

High bias mainly occurs due to an overly simple model. Below are some ways to reduce high bias:

o Increase the input features as the model is underfitted.


o Decrease the regularization term.
o Use more complex models, such as including some polynomial features.

What is a Variance Error?

Variance specifies how much the prediction would vary if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between the input and output variables. Variance errors are either of low variance or high variance.

Low variance means there is a small variation in the prediction of the target function with changes in the
training data set. At the same time, High variance shows a large variation in the prediction of the target
function with changes in the training dataset.

A model with high variance learns a lot from the training dataset and performs well on it, but does not generalize well to unseen data. As a result, such a model gives good results with the training dataset but shows high error rates on the test dataset.

Since, with high variance, the model learns too much from the dataset, it leads to overfitting of the model. A
model with high variance has the below problems:

o A high variance model leads to overfitting.
o It increases model complexity.

Usually, nonlinear algorithms, which have a lot of flexibility to fit the data, have high variance.
Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. Algorithms with high variance include Decision Trees, Support Vector Machines, and K-Nearest Neighbours.

Ways to Reduce High Variance:

o Reduce the input features or the number of parameters, as the model is overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.

K-FOLD CROSS VALIDATION


The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size, called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than that of other methods.

The steps for k-fold cross-validation are:

o Split the input dataset into K groups


o For each group:
o Take one group as the reserve or test data set.
o Use the remaining groups as the training dataset.
o Fit the model on the training set and evaluate the performance of the model using the test set.

Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.

Consider the below diagram:
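A minimal 5-fold cross-validation sketch with scikit-learn, using its bundled Iris dataset and a logistic regression model purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: the data is split into 5 folds; each fold serves once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged performance estimate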


Stratified k-fold cross-validation

This technique is similar to k-fold cross-validation with a few small changes. This approach works on the stratification concept: a process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.

It can be understood with an example of housing prices: the price of some houses can be much higher than that of other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.

Holdout Method

This method is the simplest cross-validation technique of all. In this method, we remove a subset of the training data and use it to get prediction results from a model trained on the rest of the dataset.

The error that occurs in this process tells how well our model will perform on unknown data. Although this approach is simple to perform, it still faces the issue of high variance, and it can also produce misleading results.

Comparison of Cross-validation to train/test split in Machine Learning


o Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio of 70:30, 80:20, etc. It produces high variance, which is one of its biggest disadvantages.
o Training Data: The training data is used to train the model, and the dependent variable is known.
o Test Data: The test data is used to make predictions from the model that has already been trained on the training data. It has the same features as the training data but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split by splitting the dataset into groups of train/test splits and averaging the results. It can be used if we want to optimize a model trained on the training dataset for the best performance. It is more efficient than a single train/test split because every observation is used for both training and testing.

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:

o Under ideal conditions, it provides optimal output. But for inconsistent data, it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which may produce differences between the training and validation sets. For example, if we create a model to predict stock market values and train it on the previous 5 years of stock values, the realistic values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.

Applications of Cross-Validation
o This technique can be used to compare the performance of different predictive modeling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.

BAGGING

Machine learning uses several techniques to build models and improve their performance. Ensemble learning methods help improve the accuracy of classification and regression models. This section discusses one of the most popular ensemble learning algorithms: Bagging.

Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to improve the
performance and accuracy of machine learning algorithms. It is used to deal with bias-variance trade-offs
and reduces the variance of a prediction model. Bagging avoids overfitting of data and is used for both
regression and classification models, specifically for decision tree algorithms.

What Is Bootstrapping?

Bootstrapping is the method of randomly creating samples of data out of a population with replacement to
estimate a population parameter.

Steps to Perform Bagging


 Consider that there are n observations and m features in the training set. Select a random sample from the training dataset with replacement (a bootstrap sample).

 A subset of the m features is chosen randomly to create a model using the sample observations.

 The feature offering the best split out of the lot is used to split the nodes.

 The tree is grown so that you have the best root nodes.

 The above steps are repeated n times. The outputs of the individual decision trees are aggregated to give the best prediction; a minimal sketch follows this list.
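A minimal bagging sketch with scikit-learn (the estimator parameter name assumes scikit-learn 1.2 or later; older versions call it base_estimator):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 50 decision trees, each trained on a bootstrap sample of the data
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50,
                            bootstrap=True,  # sample with replacement
                            random_state=0)
bagging.fit(X, y)
print(bagging.predict(X[:3]))  # aggregated (majority-vote) predictions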

Advantages of Bagging in Machine Learning

 Bagging minimizes the overfitting of data

 It improves the model’s accuracy

 It deals with higher dimensional data efficiently

GRADIENT BOOSTING
Machine learning is one of the most popular technologies to build predictive models for various complex
regression and classification tasks. Gradient Boosting Machine (GBM) is considered one of the most
powerful boosting algorithms.

Boosting is one of the popular ensemble modeling techniques used to build strong classifiers from various weak classifiers. It starts by building a primary model from the available training data, then identifies the errors present in that base model. After identifying the errors, a secondary model is built, then a third model is introduced, and so on. This process of introducing more models continues until the ensemble predicts the complete training dataset correctly.

AdaBoost (Adaptive Boosting) was the first boosting algorithm in the history of machine learning to combine various weak classifiers into a single strong classifier. It primarily focuses on solving classification tasks such as binary classification.

Steps in Boosting Algorithms:

There are a few important steps in a boosting algorithm, as follows:

o Consider a dataset having different data points and initialize it.
o Now, give equal weight to each of the data points.
o Provide these weights as input to the model.
o Identify the data points that are incorrectly classified.
o Increase the weights of the incorrectly classified data points.
o If you get the appropriate output, terminate the process; otherwise, repeat the re-weighting and training steps.

Example:
Let's suppose we have three different models, each making its own predictions and working in completely different ways. For example, the linear regression model shows a linear relationship in the data, while the decision tree model attempts to capture the non-linearity in the data, as shown in the below image.

Further, instead of using these models separately to predict the outcome, if we use them in the form of a series or combination, we get a resulting model that captures more correct information than any of the base models. In other words, instead of using each model's individual prediction, if we use the average prediction from these models, we can capture more information from the data. This is referred to as ensemble learning, and boosting is also based on ensemble methods in machine learning.

Boosting Algorithms in Machine Learning

There are primarily 4 boosting algorithms in machine learning. These are as follows:

o Gradient Boosting Machine (GBM)


o Extreme Gradient Boosting Machine (XGBM)
o Light GBM
o CatBoost

What is GBM in Machine Learning?

Gradient Boosting Machine (GBM) is one of the most popular forward learning ensemble methods in
machine learning. It is a powerful technique for building predictive models for regression and classification
tasks.

GBM helps us to get a predictive model in the form of an ensemble of weak prediction models, such as decision trees. Whenever a decision tree performs as the weak learner, the resulting algorithm is called gradient-boosted trees.

It enables us to combine the predictions from various learner models and build a final predictive model with more correct predictions.

But here one question may arise: if we apply the same algorithm, how can multiple decision trees give better predictions than a single decision tree? Moreover, how does each decision tree capture different information from the same data?
The answer is that a different subset of features is taken by the nodes of each decision tree to select the best split. This means that each tree behaves differently, and hence captures different signals from the same data.
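A minimal GBM sketch with scikit-learn, using its bundled breast-cancer dataset purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fitted to the errors (gradients) of the ensemble so far
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))  # accuracy on held-out data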

STACKING
There are many ways to ensemble models in machine learning, such as Bagging, Boosting, and Stacking. Stacking is one of the most popular ensemble machine learning techniques; it combines the predictions of multiple models to build a new model and improve performance. Stacking enables us to train multiple models to solve similar problems and, based on their combined output, builds a new model with improved performance.

Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in a parallel manner in such a way that, by combining them with a meta learner, we can make better predictions for the future.

This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta learner so that a better output prediction model can be achieved.

In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine
the input predictions to make a better output prediction.

Stacking is also known as stacked generalization and is an extended form of the Model Averaging Ensemble technique, in which all sub-models participate as per their performance weights to build a new model with better predictions. This new model is stacked on top of the others, which is why it is named stacking.

Architecture of Stacking

The architecture of the stacking model is designed in such a way that it consists of two or more base/learner models and a meta-model that combines the predictions of the base models. The base models are called level-0 models, and the meta-model is known as the level-1 model. So, the stacking ensemble method includes original (training) data, primary-level models, primary-level predictions, a secondary-level model, and the final prediction. The basic architecture of stacking can be represented as shown in the below image.
o Original data: This data is divided into n folds and is considered the test data or training data.
o Base models: These models are also referred to as level-0 models. They use the training data and provide their predictions (level-0 predictions) as output.
o Level-0 Predictions: Each base model is triggered on some training data and provides different predictions, which are known as level-0 predictions.
o Meta Model: The architecture of the stacking model consists of one meta-model, which helps to best combine the predictions of the base models. The meta-model is also known as the level-1 model.
o Level-1 Prediction: The meta-model learns how to best combine the predictions of the base models. It is trained on the predictions made by the individual base models: data not used to train the base models is fed to them, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.
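A minimal sketch of this architecture with scikit-learn's StackingClassifier; the choice of base models and meta-model here is arbitrary:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Level-0 (base) models
base_models = [("knn", KNeighborsClassifier()),
               ("tree", DecisionTreeClassifier())]

# The level-1 meta-model learns to combine the base predictions;
# cv=5 trains it on out-of-fold predictions, as described above
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X, y)
print(stack.predict(X[:3]))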

EXTRA
What is Ensemble learning in Machine Learning?

Ensemble learning is one of the most powerful machine learning techniques; it uses the combined output of two or more models/weak learners to solve a particular computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of various decision trees combined.

Ensemble learning is primarily used to improve the model performance, such as classification, prediction,
function approximation, etc. In simple words, we can summarise the ensemble learning as follows:

"An ensembled model is a machine learning model that combines the predictions from two or more
models.”

There are 3 most common ensemble learning methods in machine learning. These are as follows:

o Bagging
o Boosting
o Stacking
RANDOM FOREST ALGORITHM

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It
can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble
learning, which is a process of combining multiple classifiers to solve a complex problem and to improve
the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions with each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select K random data points from the training set.

Step-2: Build the decision trees associated with the selected data points (subsets).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 & 2 until N trees are built.

Step-5: For a new data point, find the prediction of each decision tree, and assign the new data point to the category that wins the majority of the votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier, divided into subsets, and each subset is given to a decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of the results. Consider the below image:
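A minimal Random Forest sketch with scikit-learn, again using its bundled Iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# N = 100 trees, each grown on a bootstrap sample with random feature subsets;
# the final class is decided by majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))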
Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

UNIT 5

ARTIFICIAL NEURAL NETWORK

What is Artificial Neuron


An artificial neuron is a mathematical function based on a model of biological neurons: each neuron takes inputs, weighs them separately, sums them up, and passes this sum through a nonlinear function to produce an output.

Perceptron in Machine Learning


The Perceptron is one of the most commonly used terms in Artificial Intelligence and Machine Learning (AIML). It is the beginning step of learning Deep Learning technologies and consists of input values, weights and a threshold, and can implement logic gates. The Perceptron is the nurturing step of an Artificial Neural Network. In 1957, Frank Rosenblatt invented the Perceptron to perform specific high-level calculations to detect input data capabilities or business intelligence. However, now it is used for various other purposes.

Types of Perceptron:
1. Single layer: Single layer perceptron can learn only linearly separable patterns.
2. Multilayer: Multilayer perceptrons, with two or more layers, have greater processing power.
The Perceptron algorithm learns the weights for the input signals in order to draw a linear decision
boundary.

Single Layer Perceptron Model:

This is one of the easiest types of artificial neural network (ANN). A single-layered perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm does not contain any recorded data, so it begins with randomly allocated weight parameters. It then sums up all the weighted inputs. If the total sum of all the inputs is more than a pre-determined value, the model is activated and shows the output value as +1.

If the outcome matches the pre-determined or threshold value, the performance of the model is stated as satisfied, and the weights are not changed. However, this model has a few discrepancies, triggered when multiple weighted input values are fed into it. Hence, to find the desired output and minimize errors, some changes to the input weights may be necessary.

"Single-layer perceptron can learn only linearly separable patterns."

MULTI-LAYERED PERCEPTRON MODEL


Like a single-layer perceptron model, a multi-layer perceptron model also has the same model structure but
has a greater number of hidden layers.

The multi-layer perceptron model is trained with the Backpropagation algorithm, which executes in two stages as follows:
o Forward Stage: Activation functions are applied from the input layer through to the output layer in the forward stage.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The error between the actual and the desired output is propagated backward, starting at the output layer and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as multiple artificial neural networks having various layers, in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead of a linear function, the activation function can be a sigmoid, TanH, ReLU, etc.

A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR and NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear problems.


o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to determine how much each independent variable affects the dependent variable.
o The model's functioning depends on the quality of the training.

FEEDFORWARD NEURAL NETWORK.

Feedforward Neural Networks are artificial neural networks where the node connections do not form a cycle. They are biologically inspired algorithms that have several neuron-like units arranged in layers. The units in neural networks are connected and are called nodes. Data enters the network at the input and passes through every layer before reaching the output.

Feedforward Neural Networks are also known as multi-layered networks of neurons (MLN). The network is called feedforward because information flows only in the forward direction through the network from the input nodes.

These networks are depicted as a combination of simple models known as sigmoid neurons. The sigmoid neuron is the foundation of a feedforward neural network.

Components of Feedforward Neural Networks


The feedforward neural networks comprise the following components:

o Input layer
o Output layer
o Hidden layer
o Neuron weights
o Neurons
o Activation function

Input layer: This layer comprises neurons that receive the input and transfer them to the different layers in
the network.

Output layer: This layer is the forecasted feature that depends on the type of model being built. 

Hidden layer: The hidden layers are positioned between the input and the output layer. The number of
hidden layers depends on the type of model.

Neuron weights: The strength or the magnitude of connection between two neurons is called weights. The
value of the weights is usually small and falls within the range of 0 to 1. 

Neurons: The feedforward network has artificial neurons, which are an adaptation of biological neurons.
The neurons work in two ways: first, they determine the sum of the weighted inputs, and, second, they
initiate an activation process to normalize the sum. 

Activation Function: This is the decision-making center at the neuron output. The neurons finalize linear or non-linear decisions based on the activation function. The three most important activation functions are:

o Sigmoid: It maps the input values within the range of 0 to 1.
o Tanh: It maps the input values between -1 and 1.
o Rectified Linear Unit (ReLU): This function allows only positive values to flow through; negative values are mapped to 0.

How Does a Feedforward Neural Network Function?


Data travels through the neural network’s mesh. Each layer of the network acts as a filter and filters outliers
and other known components, following which it generates the final output.

o Step 1: A set of inputs enters the network through the input layer and is multiplied by the weights.
o Step 2: The values are added to give a summation of the weighted inputs. If the sum exceeds a specified limit (usually 0), the output usually settles at 1. If the value falls short of the threshold (the specified limit), the result will be -1.
o Step 3: A single-layer perceptron uses the concepts of machine learning for classification. It is a crucial model of a feedforward neural network.
o Step 4: The outputs of the neural network can then be compared with their predicted values with the help of the delta rule, enabling the network to optimize its weights through training to obtain output values with better accuracy. This process of training and learning produces gradient descent.
o Step 5: In multi-layered networks, the weight-update process is analogous and is more specifically known as backpropagation. Here, each hidden layer is modified to stay in tune with the output value generated by the final layer.
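A toy sketch of one forward pass through a 2-input, 3-hidden-unit, 1-output network with sigmoid activations; all weights below are invented (a trained network would learn them via backpropagation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_hidden = np.array([[0.2, -0.4, 0.1],    # invented weights, input -> hidden
                     [0.5,  0.3, -0.2]])
b_hidden = np.array([0.1, 0.0, -0.1])
W_out = np.array([[0.3], [-0.6], [0.8]])  # invented weights, hidden -> output
b_out = np.array([0.05])

x = np.array([1.0, 0.5])              # the input layer
h = sigmoid(x @ W_hidden + b_hidden)  # hidden layer: weighted sum + activation
y = sigmoid(h @ W_out + b_out)        # output layer
print(y)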
RESTRICTED BOLTZMANN MACHINES
What are Boltzmann Machines?

A Boltzmann machine is a network of neurons in which all the neurons are connected to each other. The machine has two layers, named the visible (or input) layer and the hidden layer. The visible layer is denoted as v and the hidden layer is denoted as h. In a Boltzmann machine, there is no output layer. Boltzmann machines are stochastic and generative neural networks capable of learning internal representations, and are able to represent and (given enough time) solve tough combinatoric problems.
They are named after the Boltzmann distribution (also known as the Gibbs distribution), which is an integral part of statistical mechanics and explains the impact of parameters like entropy and temperature on quantum states in thermodynamics. For this reason, they are also known as Energy-Based Models (EBM). The Boltzmann machine was invented in 1985 by Geoffrey Hinton, then a Professor at Carnegie Mellon University, and Terry Sejnowski, then a Professor at Johns Hopkins University.

What are Restricted Boltzmann Machines (RBM)?

The term "restricted" refers to the fact that we are not allowed to connect layers of the same type to each other. In other words, two neurons of the input layer or of the hidden layer cannot connect to each other, although the hidden layer and the visible layer can be connected to each other.
As there is no output layer in this machine, the question arises of how we are going to identify and adjust the weights, and how we measure whether our prediction is accurate or not. All these questions have one answer: the Restricted Boltzmann Machine.
The RBM algorithm was proposed by Geoffrey Hinton (2007); it learns a probability distribution over its training inputs. It has seen wide application in different areas of supervised/unsupervised machine learning such as feature learning, dimensionality reduction, and classification.
Consider the movie rating example discussed in the recommender system section.
Movies like Avengers, Avatar, and Interstellar have strong associations with the fantasy and science fiction factor. Based on the user ratings, the RBM will discover latent factors that can explain the activation of movie choices.
How do Restricted Boltzmann Machines work?
In an RBM there are two phases through which the entire machine works:
1st Phase: In this phase, we take the input layer and, using the weights and biases, activate the hidden layer. This process is called the Feed Forward Pass. In the Feed Forward Pass we identify the positive and the negative associations.
Feed Forward Equation:
 Positive Association: when the association between the visible unit and the hidden unit is positive.
 Negative Association: when the association between the visible unit and the hidden unit is negative.
2nd Phase: As we don't have any output layer, instead of calculating an output we reconstruct the input layer through the activated hidden state. This process is called the Feed Backward Pass. We are just backtracking to the input layer through the activated hidden neurons. After performing this, we have reconstructed the input through the activated hidden state, so we can calculate the error and adjust the weights in this way:
Feed Backward Equation:
 Error = Reconstructed Input Layer - Actual Input Layer
 Adjusted Weight = Input x Error x Learning Rate (e.g., 0.1)
After doing all the steps, we get the pattern that is responsible for activating the hidden neurons. To understand how it works:
Let us consider an example in which we assume that visible unit V1 activates hidden units h1 and h2, and visible unit V2 activates hidden units h2 and h3. Now, when a new visible unit V5 enters the machine and also activates the h1 and h2 units, we can easily backtrack through the hidden units and identify that the characteristics of the new V5 unit match those of V1, because V1 also activated the same hidden units earlier.
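A toy sketch of the two phases using the simplified equations above; real RBM training uses stochastic sampling and contrastive divergence, so this is only an interpretation, with invented numbers:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(4, 3))  # weights: 4 visible x 3 hidden units
v = np.array([1.0, 0.0, 1.0, 0.0])   # one visible (input) vector

# 1st phase: feed forward pass activates the hidden units
h = 1.0 / (1.0 + np.exp(-(v @ W)))   # sigmoid activations

# 2nd phase: feed backward pass reconstructs the input from the hidden state
v_reconstructed = 1.0 / (1.0 + np.exp(-(h @ W.T)))

# Error and weight adjustment, loosely following the simplified equations
error = v_reconstructed - v
W -= 0.1 * np.outer(error, h)        # learning rate 0.1
print(np.abs(error).mean())          # reconstruction error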
