Module 1 - ML
Introduction:
Terminologies in machine learning, Applications, Types of machine learning: supervised,
unsupervised, semi-supervised learning, Reinforcement Learning.
Features: Types of Data (Qualitative and Quantitative), Scales of Measurement (Nominal, Ordinal,
Interval, Ratio), Concept of Feature, Feature construction, Feature Selection and Transformation,
Curse of Dimensionality. Linear discriminant Analysis (LDA).
Labels
A label is the thing we're predicting—the y variable in simple linear regression. The label could
be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip,
or just about anything.
Features
A feature is an input variable (the x variable in simple linear regression). A simple machine
learning project might use a single feature, while a more sophisticated project could use millions
of features, written as: x1, x2, ... xN.
In the spam detector example, the features could include the words in the email text, the sender's
address, and the time of day the email was sent.
Examples
An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.)
We break examples into two categories:
labeled examples
unlabeled examples
A labeled example includes both feature(s) and the label. That is: labeled example: {features, label}: (x, y).
Use labeled examples to train the model. In our spam detector example, the labeled examples
would be individual emails that users have explicitly marked as "spam" or "not spam."
An unlabeled example contains features but not the label. That is: unlabeled example: {features, ?}: (x, ?).
Once we've trained our model with labeled examples, we use that model to predict the label on
unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't
yet labeled.
Models
A model defines the relationship between features and label. For example, a spam detection model
might associate certain features strongly with "spam". Let's highlight two phases of a model's life:
Training means creating or learning the model. That is, you show the model labeled examples
and enable the model to gradually learn the relationships between features and label.
Inference means applying the trained model to unlabeled examples. That is, you use the trained
model to make useful predictions (y'). For example, during inference, you can
predict medianHouseValue for new unlabeled examples.
A regression model predicts continuous values. For example, regression models make predictions
that answer questions like the following:
What is the value of a house in California?
What is the probability that a user will click on this ad?
A classification model predicts discrete values. For example, classification models make
predictions that answer questions like the following:
Is a given email message spam or not spam?
Is this an image of a dog, a cat, or a hamster?
Applications of Machine Learning
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:
Facebook provides us with a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with names, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face recognition
and person identification in the picture.
2. Speech Recognition
While using Google, we get an option of "Search by voice"; this comes under speech recognition,
a popular application of machine learning.
Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition." At present, machine learning algorithms
are widely used by various speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.
It predicts the traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping this app get better. It takes information from
the user and sends it back to its database to improve the performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some
product on Amazon, we start getting advertisements for the same product while surfing the
internet in the same browser, and this is because of machine learning.
Google understands the user's interest using various machine learning algorithms and suggests
products as per customer interest.
Similarly, when we use Netflix, we see recommendations for entertainment series, movies,
etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, a popular car manufacturer, is working on
self-driving cars and uses machine learning methods to train its car models to detect people
and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We
always receive important mail in our inbox with the important symbol and spam emails in our
spam box, and the technology behind this is machine learning. Below are some spam filters used
by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us in finding the information using our voice instruction. These
assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the cloud, decode them
using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways that a fraudulent
transaction can take place, such as fake accounts, fake IDs, and money stolen in the middle of a
transaction. To detect this, a feed-forward neural network helps us by checking whether it is
a genuine transaction or a fraudulent one.
For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. For each genuine transaction, there is a specific pattern that
changes for a fraudulent transaction; hence, the network detects it and makes our online
transactions more secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a risk
of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural
network is used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical technology
is growing very fast and is able to build 3D models that can predict the exact position of lesions in
the brain.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all;
machine learning helps us here too by converting the text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural
machine translation system that translates text into our familiar language, and this is called
automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which translates text from one language to another.
Types of Machine Learning
Some types of learning describe whole subfields of study composed of many different types of
algorithms in themselves, such as “supervised learning.”
There are mainly 4 types of learning that you must be familiar with as a machine learning
practitioner, namely:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning
Now that we broadly know the types of Machine Learning Algorithms, let us try and understand
them better one after the other.
1. Supervised Machine Learning
As you must have understood from the name, supervised machine learning is based on supervision
of the learning process of the machines. It indicates that in the supervised learning technique, we
train the machines using a labeled dataset, and on the basis of this training, the machine predicts
the output. In this context, labeled data specifies that some of the inputs that we feed to the
algorithm are already mapped to the output.
To put it simply, in this type of machine learning, we try to teach the machines using the training
data and then expect them to predict the outcomes on the test data.
Let’s understand the working of supervised learning with an example. Consider that we have an
input dataset of cat and dog images. As the first step, we will provide training to the machine to
understand the images. It means that we will try to teach the machine to classify the images based
on features such as the shape and size of the tail, the shape of the eyes, color, height, and so on.
After successful completion of training, we input the picture of a cat and ask the machine to
identify the object and predict the output on the basis of its training. As a result, the machine is
well trained, so it will check all the classifying features of the object, such as height, shape, color,
eyes, ears, tail, and ascertain that it’s a cat. So, it will classify the image in the Cat category.
Through this process, the machine identifies the objects in Supervised Learning.
Categories of Supervised Machine Learning
On the basis of the problem that we have to encounter, we can classify Supervised Learning into
two categories:
a. Classification:
We use Classification algorithms to solve the classification problems in which the output variable
is categorical. These categories can be of many types such as Yes or No, Male or Female, Red or
Blue, and so on. The classification algorithms predict the categories present in the dataset on the
basis of training data.
Some popular classification algorithms in practice are the Random Forest Algorithm, Decision Tree
Algorithm, Logistic Regression Algorithm, and Support Vector Machine Algorithm.
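To make this concrete, here is a minimal sketch of training and evaluating a classifier with
scikit-learn; the built-in breast cancer dataset is an arbitrary choice for illustration:

# Minimal classification sketch: predict a binary category (0/1 labels).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # features and class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)                            # learn from labeled examples
print("test accuracy:", clf.score(X_test, y_test))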
b. Regression:
We use Regression algorithms to solve regression problems in which there exists a relationship
between the input and output variables. We can use these algorithms to predict continuous
output variables, such as market trends, weather, and so on.
Some of the popular Regression algorithms are: Simple Linear Regression Algorithm, Multivariate
Regression Algorithm, Decision Tree Algorithm, Lasso Regression
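A corresponding regression sketch, on synthetic data generated purely for illustration:

# Minimal regression sketch: predict a continuous value with linear regression.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)   # fit on the training split
print("R^2 on test data:", reg.score(X_test, y_test))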
Advantages of Supervised Learning
Since supervised learning works with the labeled dataset, we can have an exact idea about the
classification of objects and other variables that we feed in as input.
These algorithms are helpful in predicting the output on the basis of prior experience and
trained data.
Disadvantages of Supervised Learning
These algorithms are not able to solve complex tasks due to a lack of data.
It may predict the wrong output if the test data is different from the training data or the training
data has some noise.
It requires a lot of computational time to train the algorithm, as well as a large amount of training data.
Applications of Supervised Learning
a. Image Segmentation:
We make use of Supervised Learning algorithms in image segmentation. In this process, we
perform image classification on different image data with predefined labels in the dataset.
b. Medical Diagnosis:
We can also observe the usage of supervised algorithms in the medical field for diagnosis
purposes. This is achieved by using medical images and past data labeled for disease
conditions to accurately predict ailments. With such a process, the machine can identify a
disease for new patients with the help of the previous data history of other patients.
c. Fraud Detection:
We use Supervised Learning classification algorithms for identifying fraud transactions, fraud
customers, and other financial frauds. We achieve this by using historic data to identify the patterns
that can lead to possible frauds before they can occur.
d. Spam detection:
For spam detection and filtering, classification algorithms are quite effective and reliable.
These algorithms are able to classify an email as spam or not spam. It then sends the spam emails
to the spam folder.
e. Speech Recognition
Supervised learning algorithms are also used in speech recognition. We can train the
algorithm with voice data and perform various identifications using it, such as
voice-activated passwords, voice commands, and so on.
2. Unsupervised Machine Learning
Unsupervised machine learning is far different from the supervised learning technique in many
ways. As you would have understood from the name, there is no need for supervision in this type
of machine learning. It indicates that, in unsupervised machine learning, we train the machine
using an unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, we train the models with the data that is neither classified nor labeled,
and the model acts on that data without performing any supervision methods.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset
according to the similarities, patterns, and differences that it recognizes from the dataset. We
instruct the Machines to recognize the hidden patterns from the input dataset and analyze the
results.
Categories of Unsupervised Machine Learning
a. Clustering:
We make use of the clustering technique when we want to look for the inherent groups from the
data. It is a technique to group the objects into a cluster. We group the objects such that the objects
with the most similarities are classified in one group and have fewer or no similarities with the
objects of other groups.
Some of the popular clustering algorithms are given below: K-Means clustering algorithm, Mean-
shift algorithm, DBSCAN algorithm. (Principal Component Analysis and Independent Component
Analysis are also widely used unsupervised techniques, although they perform dimensionality
reduction rather than clustering.)
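As a small illustration, the sketch below groups unlabeled points with K-Means; the blob data is
synthetic and chosen purely for demonstration:

# Minimal clustering sketch: group unlabeled points with K-Means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))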
b. Association
Association rule learning is a type of unsupervised learning technique, which looks for interesting
relations among variables within a large dataset that we feed in as input. This algorithm aims to
find the dependency of one data item on another data item and map those variables accordingly so
that it can generate maximum profit from the unsorted large dataset.
Some of the popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-
growth algorithm.
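As an illustrative sketch, assuming the third-party mlxtend package is available, Apriori can be
applied to a toy transaction list as follows:

# Minimal association-rule sketch with Apriori (assumes mlxtend is installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["milk", "bread"], ["milk", "diapers"],
                ["milk", "bread", "diapers"], ["bread", "butter"]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])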
Advantages of Unsupervised Learning
We can use these algorithms for more complicated tasks as compared to the supervised ones,
because these algorithms work on unlabeled datasets and do not require labeled data.
Unsupervised algorithms are preferable for various tasks, as getting an unlabeled dataset is
easier than getting a labeled dataset for training.
Disadvantages of Unsupervised Learning
An unsupervised algorithm may result in less accurate outputs as the dataset is not labeled,
and we have not trained the algorithms with the exact output beforehand.
Working with unsupervised learning is more difficult compared to the other types, as it works
with unlabeled datasets that do not map to the output precisely.
Applications of Unsupervised Learning
a. Network Analysis:
We use unsupervised learning for document network analysis of text data, for example to
identify plagiarism in scholarly articles and prevent copyright fraud.
b. Recommendation Systems:
Recommendation systems make extensive use of unsupervised learning techniques for building
recommendation applications for different web applications and e-commerce websites.
c. Anomaly Detection:
Anomaly detection is a widely used application of unsupervised learning, which can identify
unusual data points within a large input dataset. We extensively use it to discover fraudulent
transactions.
d. Singular Value Decomposition:
We use Singular Value Decomposition, commonly known as SVD, to extract a certain type of
information from a database; for example, extracting information about each user located in a
certain geographical location.
3. Semi-Supervised Learning
Semi-Supervised learning is yet another type of Machine Learning algorithm that lies between or
is a hybrid of Supervised and Unsupervised machine learning. It demonstrates the intermediate
ground between Supervised and Unsupervised learning algorithms and makes use of the
combination of labeled and unlabeled data sets during the training period for the machines.
Semi-supervised learning is the middle ground between supervised and unsupervised learning:
the data we use it on contains a few labeled examples but mostly comprises unlabeled data.
Labeled datasets are quite expensive, but for corporate purposes a few labeled examples may be
all that is required. It differs from both supervised and unsupervised learning, which are defined
by the presence or absence of labels.
In order to overcome the drawbacks of supervised learning and unsupervised learning algorithms,
we came up with the concept of Semi-supervised learning. Semi-supervised learning aims to
effectively use all the available data, rather than only labeled data like in supervised learning.
Initially, the model clusters similar data together using an unsupervised learning algorithm. The
clusters then help to put labels on the unlabeled data, turning it into labeled data. We perform
this step because labeled data is quite expensive compared to unlabeled data.
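A minimal sketch of this idea with scikit-learn's self-training wrapper; marking unlabeled points
with -1 is the library's convention, and hiding 70% of the labels is an arbitrary choice for
illustration:

# Minimal semi-supervised sketch: unlabeled examples are marked with -1,
# and a self-training wrapper gradually assigns labels to them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1    # hide ~70% of the labels

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)                    # trains on labeled + unlabeled data
print("accuracy vs. the true labels:", model.score(X, y))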
Advantages of semi-supervised learning
The algorithm is simple and easy to understand.
This is highly efficient in predicting the output on the basis of input data.
It overcomes the drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages of semi-supervised learning
Iteration results may not be stable and outputs may vary significantly.
We cannot apply these algorithms to network-level data due to its complexities.
The accuracy rate for this type of Learning is low.
4. Reinforcement Learning
In reinforcement learning, unlike supervised learning, the algorithm does not require any labeled
data; agents learn solely from their experiences.
The reinforcement learning process is similar to the learning process of a human being; consider,
for example, a child who learns various things through day-to-day life experiences. A simple
example for understanding reinforcement learning is playing a game, where the game mimics the
environment, the agent's moves at each step define the states of the game, and the goal of the
agent is to get a high score at the end of the game. The agent receives feedback in terms of the
rewards and punishments that it collects along the way.
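A tiny tabular Q-learning sketch of this game idea: a hypothetical one-dimensional corridor where
the agent earns a reward only at the rightmost state, with all constants chosen purely for
illustration:

# Tabular Q-learning on a toy corridor: states 0..4, reward at state 4.
import numpy as np

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration
rng = np.random.RandomState(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward only at the goal
        # Q-learning update: move Q[s, a] toward reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy policy per state (0=left, 1=right):", Q[:-1].argmax(axis=1))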
Classification of Reinforcement Learning
Positive reinforcement learning increases the tendency that the required behavior will occur
again by adding something. It strengthens the behavior of the agent and leaves a positive impact.
Negative reinforcement learning works in exactly the opposite way to positive reinforcement
learning. It tends to increase the tendency that the specific behavior will occur again by avoiding
the negative condition that would lead to punishment.
Applications of Reinforcement Learning
a. Video Games:
Reinforcement learning algorithms are popular in gaming applications, where they can reach
super-human performance; well-known examples are AlphaGo and AlphaGo Zero.
b. Resource Management:
The “Resource Management with Deep Reinforcement Learning” paper demonstrates how to use
this type of learning to automatically learn resource scheduling in computers, i.e., how to allocate
and queue resources for different jobs in order to minimize average job slowdown.
c. Robotics:
This type of learning is extensively being used in Robotics applications. We make use of Robots
in the industrial and manufacturing area, and we can make these robots more powerful with
reinforcement learning. There are different industries that have a keen eye for building intelligent
robots using AI and Machine learning technology.
d. Text Mining:
Text mining, to date, is one of the greatest applications of NLP. We now tend to implement NLP
with the help of Reinforcement Learning by Salesforce company.
Advantages of Reinforcement Learning
This type of learning assists us in solving complex real-world problems that are difficult to
solve by the general techniques we use conventionally.
The learning model of reinforcement learning is similar to the learning process of human
beings; therefore, we can expect highly accurate results.
This type of learning helps us in achieving long-term results.
Disadvantages of Reinforcement Learning
Reinforcement learning is not preferred for simple problems.
It requires huge amounts of data and computation.
Too much reinforcement can lead to an overload of states, which can weaken the results.
Features:
Types of Data
There are two forms of data types:
1. Qualitative
2. Quantitative
As we go deeper into these two types, we will encounter more classifications.
When we mention qualitative data, you should be able to tell that qualitative means “quality” just
by looking at the word. It’s an attribute because it includes values that express a quality or state.
This type of data is impossible to count or quantify. Gender, feelings, and so on are some examples.
Quantitative data, on the other hand, is concerned with quantity. This type of data is numerical
in nature, which means you can count or measure it. Income, age, and so on are a few examples.
Let us clarify this with a more general example. Consider feelings; you cannot quantify someone’s
emotions. Because it is impossible to quantify how sad or happy someone is, we consider feelings
to be qualitative data because they express quality. However, if you were to count how many
people are sad or how many people are happy, you would be able to call it quantitative data derived
from qualitative data.
When we delve deeper into qualitative data, we can further divide it into two types.
Nominal Variable
Nominal variables are values that cannot be ranked or ordered into hierarchies. Flowers, for
example, cannot have a hierarchical ranking: you cannot say that a lily is superior to a rose.
The same is true for colors, gender, race, country, and so on. Such variables are termed nominal.
On such variables, you can apply four types of statistical tests:
McNemar
Fisher’s Exact
Cochran Q
Chi-Square
Ordinal Variable
Ordinal variables, on the other hand, can be ordered and scaled. An ordinal variable also
possesses all of the properties of a nominal variable; the ability to measure it and rank it in
hierarchical order is a plus.
However, we should not assume that it can be numbered because it is still qualitative data!
So, if I were to create a Google form to collect feedback on, say, a webinar I hosted on Zoom
application, I would insert the following question: “How informative did you find the webinar?”
I’d also offer them the following marking options:
A little informative
Exceptionally Informative
Not at all informative.
These options, which I just listed, are examples of ordinal variables.
On such variables, you can apply four types of statistical tests:
Kruskal-Wallis 1-way
Wilcoxon signed-rank
Wilcoxon rank-sum
Friedman 2-way ANOVA
It is important to note that both nominal and ordinal variables are non-parametric variables. The
only difference between them is their ability to rank information based on their position.
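The distinction can be made explicit in pandas; a small sketch, reusing the webinar feedback
options from above:

# Nominal vs. ordinal variables in pandas.
import pandas as pd

# Nominal: categories with no natural order (min/max would be meaningless).
colors = pd.Categorical(["red", "blue", "red"], ordered=False)

# Ordinal: categories with a meaningful ranking.
feedback = pd.Categorical(
    ["A little informative", "Exceptionally informative", "Not at all informative"],
    categories=["Not at all informative", "A little informative",
                "Exceptionally informative"],
    ordered=True)
print(feedback.min(), "<", feedback.max())   # ranking works only when ordered=True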
Interval
We know both the order and the exact difference between the values in interval scales. For
example, the difference between 20 and 10 degrees is at the same magnitude as the difference
between 30 and 20 degrees.
Ratio
In contrast, ratio data is an interval scale with a natural 0 point. This basically means that negative
values cannot exist in ratio data. Height measurements in centimeters, meters, inches, and so on
are an example of it.
Similarly, quantitative data can be further divided into two types:
Discrete: You can easily count discrete variables within a limited amount of time. You
can, for example, count the money in your bank account or the number of times you drank
a glass of water in a day. Essentially, if a variable is discrete, you will be able to count
them, though it may take longer at times to count completely. They can only accept certain
values that do not include decimal form.
Continuous: Continuous variables, on the other hand, are “continuous” in nature, as the
name implies. However, you can never completely count them because they are an on-
going process. For example, if I asked you to tell me how many times you drank a glass of
water in this century, you would be unable to do so because a century has not passed, so
you would end up counting until you died. Continuous variables can have any value within
a certain range of values.
What is a feature?
Generally, all machine learning algorithms take input data to generate output. The input data is
usually in a tabular form consisting of rows (instances or observations) and columns (variables or
attributes), and these attributes are often known as features. For example, an image is an instance
in computer vision, but a line in the image could be the feature. Similarly, in NLP, a document can
be an observation, and the word count could be the feature. So, we can say a feature is an attribute
that impacts a problem or is useful for the problem.
Feature engineering is the pre-processing step of machine learning, which extracts features
from raw data. It helps to represent an underlying problem to predictive models in a better way,
which, as a result, improves the accuracy of the model for unseen data. The predictive model
contains predictor variables and an outcome variable, and the feature engineering process
selects the most useful predictor variables for the model.
Since 2016, automated feature engineering is also used in different machine learning software that
helps in automatically extracting features from raw data. Feature engineering in ML contains
mainly four processes: Feature Creation, Transformations, Feature Extraction, and Feature
Selection.
1. Feature Creation: Feature creation is finding the most useful variables to be used in a
predictive model. The process is subjective, and it requires human creativity and
intervention. The new features are created by mixing existing features using operations such as
addition, subtraction, and ratios, and these new features have great flexibility.
2. Transformations: The transformation step of feature engineering involves adjusting the
predictor variable to improve the accuracy and performance of the model. For example, it
ensures that the model is flexible to take input of the variety of data; it ensures that all the
variables are on the same scale, making the model easier to understand. It improves the
model's accuracy and ensures that all the features are within the acceptable range to avoid
any computational error.
3. Feature Extraction: Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data. The main aim of this step is
to reduce the volume of data so that it can be easily used and managed for data modelling.
Feature extraction methods include cluster analysis, text analytics, edge detection
algorithms, and principal components analysis (PCA).
4. Feature Selection: While developing the machine learning model, only a few variables in
the dataset are useful for building the model, and the rest features are either redundant or
irrelevant. If we input the dataset with all these redundant and irrelevant features, it may
negatively impact and reduce the overall performance and accuracy of the model. Hence it
is very important to identify and select the most appropriate features from the data and
remove the irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the subset of the
most relevant features from the original features set by removing the redundant,
irrelevant, or noisy features."
In machine learning, the performance of the model depends on data pre-processing and data
handling. But if we create a model without pre-processing or data handling, then it may not give
good accuracy. Whereas, if we apply feature engineering on the same model, then the accuracy of
the model is enhanced. Hence, feature engineering in machine learning improves the model's
performance. Below are some points that explain the need for feature engineering:
o Better features mean flexibility: good features can produce good results even with simpler models.
o Better features mean simpler models: well-engineered features make models easier to build and understand.
o Better features mean better results: the quality of the features largely determines the accuracy of the model.
Steps in Feature Engineering
The steps of feature engineering may vary as per different data scientists and ML engineers.
However, there are some common steps that are involved in most machine learning algorithms,
and these steps are as follows:
o Data Preparation: The first step is data preparation. In this step, raw data acquired from
different resources are prepared to make it in a suitable format so that it can be used in the
ML model. The data preparation may contain cleaning of data, delivery, data augmentation,
fusion, ingestion, or loading.
o Exploratory Analysis: Exploratory analysis or Exploratory data analysis (EDA) is an
important step of feature engineering, which is mainly used by data scientists. This step
involves analyzing and investigating the data set and summarizing the main characteristics of the data.
Different data visualization techniques are used to better understand the manipulation of
data sources, to find the most appropriate statistical technique for data analysis, and to
select the best features for the data.
o Benchmark: Benchmarking is a process of setting a standard baseline for accuracy to
compare all the variables from this baseline. The benchmarking process is used to improve
the predictability of the model and reduce the error rate.
Feature Engineering Techniques
Some popular feature engineering techniques are described below.
1. Imputation
Feature engineering deals with inappropriate data, missing values, human interruption, general
errors, insufficient data sources, etc. Missing values within the dataset highly affect the
performance of the algorithm, and the "Imputation" technique is used to deal with them.
Imputation is responsible for handling irregularities within the dataset.
For example, rows or columns that contain a huge percentage of missing values can be removed
entirely. But at the same time, to maintain the data size, it is often required to impute the
missing data instead, which can be done as follows:
o For numerical data imputation, a default value can be imputed in a column, and missing
values can be filled with means or medians of the columns.
o For categorical data imputation, missing values can be interchanged with the maximum
occurred value in a column.
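A minimal sketch of both strategies with scikit-learn's SimpleImputer; the toy arrays are
illustrative:

# Numerical imputation: fill missing values with the column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0], [np.nan], [3.0]])
print(SimpleImputer(strategy="mean").fit_transform(X_num).ravel())   # [1. 2. 3.]

# Categorical imputation: fill missing values with the most frequent value.
X_cat = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)
print(SimpleImputer(strategy="most_frequent").fit_transform(X_cat).ravel())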
2. Handling Outliers
Outliers are deviated values or data points that are observed too far away from the other data
points, in such a way that they badly affect the performance of the model. Outliers can be handled
with this feature engineering technique: it first identifies the outliers and then removes them.
Standard deviation can be used to identify outliers. For example, each value within a dataset lies
at some distance from the average; if a value is farther away than a chosen threshold, it can be
considered an outlier. Z-scores can also be used to detect outliers.
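A minimal z-score sketch; the threshold of 2 is a common but arbitrary choice:

# Flag points whose z-score (distance from the mean in standard deviations) is large.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])        # 95 sits far from the rest
z = (data - data.mean()) / data.std()
print("outliers:", data[np.abs(z) > 2])          # -> [95]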
3. Log transform
Logarithm transformation or log transform is one of the commonly used mathematical techniques
in machine learning. Log transform helps in handling skewed data, and it makes the distribution
closer to normal after transformation. It also reduces the effect of outliers on the data: because
magnitude differences are normalized, the model becomes more robust.
Note: Log transformation is only applicable to positive values; otherwise, it will give an error.
To avoid this, we can add 1 to the data before transformation, which ensures the values are positive.
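A minimal sketch using NumPy's log1p, which computes log(1 + x) and therefore matches the "+1"
note above:

# Log transform of a right-skewed feature.
import numpy as np

skewed = np.array([0, 1, 10, 100, 1000, 10000], dtype=float)
print(np.log1p(skewed))   # values are compressed onto a far less skewed scale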
4. Binning
In machine learning, overfitting is one of the main issues that degrade the performance of the
model and which occurs due to a greater number of parameters and noisy data. However, one of
the popular techniques of feature engineering, "binning", can be used to normalize the noisy data.
This process involves segmenting different features into bins.
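A minimal binning sketch with pandas; the age ranges and labels are made up for illustration:

# Bucket a numeric feature into labeled bins.
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 120],
              labels=["child", "young adult", "adult", "senior"])
print(bins.value_counts())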
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or more parts to
make new features. This technique helps the algorithms to better understand and learn the
patterns in the dataset.
The feature splitting process enables the new features to be clustered and binned, which results in
extracting useful information and improving the performance of the data models.
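A minimal feature-split sketch; the column names are made up for illustration:

# Split one string column into two new feature columns.
import pandas as pd

df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)
print(df)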
6. One Hot Encoding
One hot encoding is a popular encoding technique in machine learning. It converts categorical
data into a form that can be easily understood by machine learning algorithms, so that they can
make good predictions. It enables the grouping of categorical data without losing any information.
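A minimal one-hot-encoding sketch with pandas:

# One binary column per category, with no information lost.
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
print(pd.get_dummies(df, columns=["color"]))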
Curse of Dimensionality
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional
data. The dimension of a dataset corresponds to the number of attributes/features
that exist in a dataset. A dataset with a large number of attributes, generally of the order of a
hundred or more, is referred to as high dimensional data. Some of the difficulties that come
with high dimensional data manifest during analyzing or visualizing the data to identify
patterns, and some manifest while training machine learning models. The difficulties related
to training machine learning models due to high dimensional data are referred to as the ‘Curse
of Dimensionality’.
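One symptom of the curse can be seen in a small experiment: as the number of dimensions grows,
the nearest and farthest neighbors of a point become almost equally far away, which undermines
distance-based methods (a sketch on uniform random data):

# Distance concentration in high dimensions.
import numpy as np

rng = np.random.RandomState(0)
for d in (2, 10, 100, 1000):
    X = rng.rand(500, d)                           # 500 random points in d dimensions
    dist = np.linalg.norm(X - X[0], axis=1)[1:]    # distances from the first point
    print(f"d={d:4d}  nearest/farthest distance ratio = {dist.min() / dist.max():.2f}")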
Linear Discriminant Analysis (LDA)
Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. When the data points are plotted on a 2D plane, there may be no straight line that can
separate the two classes of data points completely. Hence, in this case, LDA (Linear Discriminant
Analysis) is used, which reduces the 2D graph to a 1D graph in order to maximize the separability
between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both classes linearly separable.
In such cases, we use non-linear discriminant analysis.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
Implementation
In this implementation, we perform linear discriminant analysis on the Iris dataset using the
Scikit-learn library: the sketch below standardizes the four Iris features, reduces them to two
linear discriminants, and classifies the transformed data with a random forest.
# necessary imports
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# load the Iris data and standardize the features
X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# reduce the 4 features to 2 linear discriminants (at most n_classes - 1 = 2)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)

# classify the reduced data with a random forest and evaluate
X_train, X_test, y_train, y_test = train_test_split(X_lda, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred), "\n", confusion_matrix(y_test, y_pred))