ML Lecture Notes Unit-1
GIDA
Gorakhpur
Machine Learning
(LECTURE NOTE)
CSE/IT (VII Sem)
UNIT- I
Topic Covered:
INTRODUCTION – Well defined learning problems, Designing a
Learning System, Issues in Machine Learning; THE CONCEPT
LEARNING TASK - General-to-specific ordering of hypotheses, Find-
S, List then eliminate algorithm, Candidate elimination algorithm,
Inductive bias.
Submitted by:
Ranjeet Singh
(Asstt. Prof)
CSE
Introduction to Machine Learning:
The field of machine learning is concerned with the question of how to construct computer
programs that automatically improve with experience.
In recent years many successful machine learning applications have been developed,
ranging from data-mining programs that learn to detect fraudulent credit card
transactions, to information-filtering systems that learn users' reading preferences, to
autonomous vehicles that learn to drive on public highways.
At the same time, there have been important advances in the theory and algorithms that form the
foundations of Machine Learning.
Machine learning draws on concepts and results from many fields, including statistics,
artificial intelligence, philosophy, information theory, biology, cognitive science,
computational complexity, and control theory. The best way to learn about machine learning is
to view it from all of these perspectives and to understand the problem settings, algorithms, and
assumptions that underlie each.
Ever since computers were invented, we have wondered whether they might be made to learn. If
we could understand how to program them to learn-to improve automatically with experience-
the impact would be dramatic. Imagine computers learning from medical records which
treatments are most effective for new diseases, houses learning from experience to optimize
energy costs based on the particular usage patterns of their occupants, or personal software
assistants learning the evolving interests of their users in order to highlight especially relevant
stories from the online morning newspaper.
We do not yet know how to make computers learn nearly as well as people learn. However,
algorithms have been invented that are effective for certain types of learning tasks, and a
theoretical understanding of learning is beginning to emerge. Many practical computer programs
have been developed to exhibit useful types of learning, and significant commercial applications
have begun to appear.
For problems such as speech recognition, algorithms based on machine learning outperform all
other approaches that have been attempted to date. In the field known as data mining, machine
learning algorithms are being used routinely to discover valuable knowledge from large
commercial databases containing equipment maintenance records, loan applications, financial
transactions, medical records, and the like. As our understanding of computers continues to
mature, it seems inevitable that machine learning will play an increasingly central role in
computer science and computer technology.
Key Point of Machine Learning Dataset:
A machine learning dataset is a structured collection of data that has been carefully selected,
prepared, and organized for training, evaluating, and testing machine learning
algorithms and models. These datasets serve as the foundation for teaching machines to learn and
make predictions or decisions based on patterns and information within the data.
Here are some key characteristics and components of a typical machine learning dataset:
1- Data Points, Samples, or Instances: Each row or observation in the dataset represents
a single data point or sample. These data points can pertain to anything from images and
text documents to numerical measurements and categorical information.
2- Features or Attributes: Columns within the dataset are known as features or attributes.
These features provide the relevant information about each data point that the machine
learning algorithm uses to make predictions. Features can be numerical, categorical, or
even text-based.
3- Target Variable: In supervised learning, a dataset often includes a target variable (or
label) that the machine learning model aims to predict based on the input features. This is
the variable that the model learns to map from the input data.
4- Training, Validation, and Test Sets: A typical machine learning dataset is divided into
at least two or three subsets: a training set, a validation set (optional), and a test set. The
training set is used to train the model, the validation set helps tune model
hyperparameters and assess its performance during training, and the test set is used to
evaluate the model's performance after training.
5- Data Preprocessing: Before using the dataset for training, it often requires preprocessing
steps such as the following to ensure that the data is suitable for machine learning
algorithms (a short code sketch after this list illustrates a few of these steps).
Data cleaning: Data cleaning is the process of identifying and correcting errors,
inconsistencies, and inaccuracies in a dataset to ensure its quality and reliability
for analysis or machine learning, often involving tasks like handling missing
values, removing duplicates, and addressing outliers.
Data normalization: Data normalization is rescaling data to fit within a specific
range (usually between 0 and 1 or -1 and 1) to remove scale differences.
Example: Consider two features in a dataset, "Age" (ranging from 0 to 100) and
"Income" (ranging from $20,000 to $100,000). Normalizing these features would
transform "Age" values to, say, [0.2, 0.7] and "Income" values to [0.2, 1.0],
making them comparable on the same scale for machine learning algorithms.
Feature scaling: Feature scaling is the process of standardizing the range of
independent variables (features) in a dataset, making them comparable and
preventing certain features from dominating others.
Handling missing values: Handling missing values involves dealing with data
points where certain values are absent or undefined in a dataset to prevent their
negative impact on analysis or modeling.
Example: In a dataset of patient records, some entries may lack values for
"Cholesterol Level." Handling missing values could involve methods like
imputing the missing cholesterol levels with the mean value of the available data
or using advanced techniques like predictive modeling to estimate the missing
values based on other patient characteristics.
6- Label Encoding or One-Hot Encoding: For categorical features, encoding methods like
label encoding or one-hot encoding are applied to convert categorical data into a
numerical format that can be processed by machine learning algorithms.
7- Dataset Size: The size of a dataset can vary widely, from small datasets with a few
hundred samples to large datasets with millions or even billions of data points. The
dataset size often impacts the complexity and training time of machine learning models.
8- Data Sources: Datasets can be sourced from various places, including research
institutions, government agencies, open data repositories, or generated through
simulations or data collection efforts.
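A minimal sketch of a few of the preparation steps above (missing values, normalization, one-hot encoding, and a train/test split), assuming pandas and scikit-learn and a small invented table with "Age", "Income", and "City" columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset (column names and values are illustrative assumptions)
df = pd.DataFrame({
    "Age": [25, 40, 60, 35],
    "Income": [25000, 48000, None, 90000],   # one missing value
    "City": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# Handling missing values: fill the missing Income with the column mean
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Normalization / feature scaling: rescale Age and Income to the [0, 1] range
df[["Age", "Income"]] = MinMaxScaler().fit_transform(df[["Age", "Income"]])

# One-hot encoding: turn the categorical City column into 0/1 indicator columns
df = pd.concat([df.drop(columns="City"), pd.get_dummies(df["City"], prefix="City")], axis=1)

# Training / test split: hold out 25% of the rows for evaluation
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
print(train_df)
print(test_df)
```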
Machine learning datasets play a crucial role in the development and evaluation of machine
learning models. The choice of dataset is critical, as it can significantly impact the model's
performance, generalization, and applicability to real-world tasks. Researchers and practitioners
select or create datasets that align with the specific problem they aim to solve and the type of
machine learning algorithm they plan to use.
What is machine learning (ML)?
• ML relies on:
– Statistics: making inferences from sample data.
– Numerical algorithms (linear algebra, optimization): optimize criteria,
manipulate models.
– Computer science: data structures and programs that solve a ML
problem efficiently.
• A model:
– is a compressed version of a database;
– extracts knowledge from it;
– does not have perfect performance but is a useful approximation to the
data.
Definition of ML:
Arthur Samuel described it as: “the field of study that gives computers the ability
to learn without being explicitly programmed.” This is an older, informal
definition.
OR
Tom Mitchell gives a more modern definition: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E."
Statistical Techniques
The development of today’s AI applications started with the age-old
traditional statistical techniques. You must have used straight-line
interpolation in schools to predict a future value. There are several other such
statistical techniques which are successfully applied in developing so-called AI
programs. We say “so-called” because the AI programs that we have today are
much more complex and use techniques far beyond the statistical techniques used
by the early AI programs.
Some examples of statistical techniques that were used for developing AI
applications in those days, and are still in practice, are listed here:
Regression
Classification
Clustering
Probability Theories
Decision Trees
Machine learning evolved from left to right as shown in the above diagram.
Supervised Learning
Supervised learning is a type of machine learning where an algorithm learns from a labeled dataset,
which means it's provided with input data along with corresponding output labels. The algorithm's goal
is to learn a mapping or relationship between the inputs and the desired outputs so that it can make
predictions or classify new, unseen data points accurately.
Here's an explanation of supervised learning with an example: Imagine you want to build a system that
can automatically recognize handwritten digits (0 through 9). This is a common task in optical character
recognition (OCR) and is a classic example of supervised learning.
There are mainly two types of supervised learning: Regression and Classification.
1- Regression
Regression algorithms are a class of supervised learning techniques used for predicting continuous values
or numerical outcomes. They establish a relationship between input features and a target variable. Here
are some common types of regression algorithms along with examples:
Simple Linear Regression: Simple linear regression allows you to model the relationship between
a dependent variable (target) and a single independent variable (feature) by fitting a linear equation to the
observed data.
The fitted line is ŷ = b0 + b1·x. The regression coefficient b1 is estimated from the data as

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

Where: x̄ is the mean of X and ȳ is the mean of Y.
Here b1 is known as the regression coefficient.
Now, you can calculate the intercept b0 using the mean values:

b0 = ȳ − b1·x̄
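A minimal NumPy sketch of these formulas on invented sample data (the X and Y values below are illustrative only):

```python
import numpy as np

# Illustrative data (assumed values): years of experience (X) vs. salary in lakhs (Y)
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([3.0, 4.5, 5.0, 6.5, 7.0])

x_mean, y_mean = X.mean(), Y.mean()

# Regression coefficient b1 and intercept b0 from the formulas above
b1 = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

print(f"y = {b0:.3f} + {b1:.3f} * x")
print("prediction for x = 6:", b0 + b1 * 6)
```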
Multiple Linear Regression: Multiple linear regression is an extension of simple linear regression
that allows you to model the relationship between a dependent variable (target) and two or more
independent variables (features) by fitting a linear equation to the observed data. Instead of a straight
line, it finds the best-fitting hyperplane in higher-dimensional space.
Polynomial Regression: Polynomial Regression is a type of regression analysis used when the
relationship between the independent variable (feature) and the dependent variable (target) is
nonlinear. It models this relationship using a polynomial equation rather than a straight line. The
polynomial equation can have various degrees (e.g., quadratic, cubic) to capture different levels of
complexity in the data.
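A minimal sketch of polynomial regression, assuming NumPy's polyfit and invented sample data with a roughly quadratic trend:

```python
import numpy as np

# Assumed sample data with a clearly non-linear (quadratic-looking) trend
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.1, 1.9, 4.2, 9.1, 16.3, 24.8])

# Fit a degree-2 polynomial: y ≈ c2*x^2 + c1*x + c0
coeffs = np.polyfit(x, y, deg=2)
model = np.poly1d(coeffs)

print("coefficients (highest degree first):", coeffs)
print("prediction at x = 6:", model(6.0))
```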
Ridge Regression: Ridge regression is a linear regression technique that adds an L2 penalty (the sum of
the squared coefficients) to the least-squares objective. The penalty shrinks the coefficients toward zero,
which reduces overfitting and helps when the independent variables are highly correlated.
Example: Predicting a person's income based on various features like education, experience, and
location while avoiding overfitting the model by using ridge regression to balance feature
importance.
Lasso Regression: Lasso Regression, short for Least Absolute Shrinkage and Selection Operator, is
a linear regression technique used to address multicollinearity and perform feature selection by
adding a penalty term to the linear regression equation. It can be used for feature selection as it
encourages some of the coefficients of independent variables to become exactly zero, effectively
selecting a subset of the most important features while also reducing the complexity of the model.
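A minimal sketch contrasting the L2 penalty of ridge with the L1 penalty of lasso, assuming scikit-learn and synthetic data in which only the first two of five features actually matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Assumed synthetic data: 5 features, but only the first two influence the target
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives some coefficients to exactly zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant features close to 0
```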
Support Vector Regression (SVR): Support Vector Regression (SVR) is a type of regression
algorithm that extends the concepts of Support Vector Machines (SVMs) from classification to
regression problems. It is used for modeling and predicting continuous or numerical values,
making it suitable for tasks where the relationship between input features and the target variable
is non-linear or complex.
In SVR, the goal is to find a function that fits the data as closely as possible while keeping deviations
within an acceptable margin. SVR works by defining an epsilon-insensitive tube (the "epsilon tube") around
the regression line. Data points within this tube are considered to have zero error, while those outside the
tube incur a penalty proportional to the distance they deviate from the tube.
Example: Predicting the price of a stock based on historical data and other market indicators,
using SVR to capture complex non-linear relationships.
Here are some key components and concepts of Support Vector Regression:
Kernel Trick: Similar to SVMs, SVR can use a kernel function to map the input features into a higher-
dimensional space, making it easier to find non-linear relationships in the data.
Epsilon (ϵ): The width of the tube around the regression line. It defines the acceptable margin of error for
the predictions. Data points within this margin do not contribute to the loss function, while those outside
the margin are penalized.
Support Vectors: These are the data points that lie on the margin boundary or within the tube. They are
the most critical data points for defining the SVR model.
Loss Function: The loss function in SVR aims to minimize the deviation of data points from the
regression line while considering the width of the epsilon tube.
Regularization Parameter (C): Similar to SVMs, SVR has a regularization parameter C that controls the
trade-off between maximizing the margin and minimizing the loss. A smaller C allows for a wider margin
but permits more margin violations, while a larger C enforces a narrower margin with fewer violations.
Hyperparameters: SVR has hyperparameters that need to be tuned, such as the choice of kernel function
(e.g., linear, polynomial, radial basis function), kernel-specific parameters (e.g., degree in polynomial
kernel, gamma in RBF kernel), and the value of C.
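A minimal sketch of SVR with an RBF kernel, assuming scikit-learn and invented noisy sine-shaped data; the values of C, epsilon, and gamma below are illustrative settings, not recommended ones:

```python
import numpy as np
from sklearn.svm import SVR

# Assumed non-linear toy data: y = sin(x) plus a little noise
rng = np.random.default_rng(42)
X = np.linspace(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# RBF kernel, epsilon tube of width 0.1, regularization parameter C = 10
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x = 1.5:", model.predict([[1.5]]))
```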
Advantages of SVR:
Can capture complex non-linear relationships between features and the target variable.
Effective in high-dimensional spaces.
Handles outliers well by limiting their impact through the epsilon tube.
Provides flexibility through kernel functions.
Disadvantages of SVR:
Requires careful tuning of hyperparameters.
Can be computationally expensive, especially with large datasets.
Interpretability may be limited when using complex kernels.
Decision Tree Regression: Decision tree regression splits the data into subsets based on the
values of input features and then predicts the average or median of the target variable for each
subset.
Example: Predicting the expected delivery time of a pizza based on factors like distance, traffic,
and weather conditions using a decision tree.
Random Forest Regression: Random forest regression is an ensemble method that combines
multiple decision trees to improve predictive accuracy and reduce overfitting.
Example: Predicting housing prices by aggregating predictions from multiple decision trees trained
on various subsets of data and features.
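A minimal sketch contrasting a single decision tree regressor with a random forest, assuming scikit-learn and an invented pizza-delivery toy dataset (distance in km and a coded traffic level):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Assumed toy data: [distance_km, traffic_level] -> delivery time in minutes
X = np.array([[1, 1], [2, 1], [3, 2], [5, 3], [7, 3], [8, 2], [10, 3]])
y = np.array([15, 18, 25, 35, 45, 42, 55])

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)                        # a single tree
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)  # an ensemble of trees

new_order = [[4, 2]]  # 4 km away, moderate traffic
print("decision tree prediction (min):", tree.predict(new_order))
print("random forest prediction (min):", forest.predict(new_order))
```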
The choice among the above regression algorithms depends on the nature of the data and the problem you are
trying to solve. Experimentation and model evaluation are essential to determine which regression
technique works best for a specific task.
Application | Type of Dataset | Type of Algorithm | Example
Image Classification | Labeled Image Data | Convolutional Neural Network (CNN) | Handwriting digit recognition (MNIST)
Spam Detection | Text Data | Naive Bayes, Logistic Regression | Email spam filtering
Speech Recognition | Audio Data | Recurrent Neural Network (RNN), Hidden Markov Model (HMM) | Voice-to-text systems (Siri, Google Assistant)
Recommendation Systems | User-Item Interaction Data | Collaborative Filtering, Matrix Factorization | Product recommendations (Amazon, Netflix)
Fraud Detection | Transaction Data | Random Forest, Support Vector Machine (SVM) | Credit card fraud detection
Sentiment Analysis | Text Data | LSTM, Naive Bayes | Analyzing reviews on social media (positive/negative)
Autonomous Vehicles | Image, Sensor Data | CNN, Reinforcement Learning | Self-driving cars (Tesla, Waymo)
Predictive Maintenance | Time Series Data | LSTM, ARIMA | Predicting machine failures in factories
Medical Diagnosis | Medical Imaging, Tabular Data | Random Forest, CNN | Detecting cancer from MRI scans
Market Forecasting | Financial Time Series | LSTM, ARIMA | Stock price prediction
Chatbots/Conversational AI | Text Data | Transformer (e.g., GPT) | Customer service chatbots (OpenAI, Google BERT)
Face Recognition | Image Data | CNN, FaceNet | Facebook facial recognition, iPhone Face ID
Object Detection | Labeled Image Data | CNN (YOLO, SSD) | Detecting objects in images or videos
Language Translation | Text Data | Transformer (e.g., BERT, GPT) | Google Translate, DeepL
Anomaly Detection | Sensor, Transaction Data | Isolation Forest, SVM | Detecting unusual patterns in network traffic
Drug Discovery | Chemical, Biological Data | Random Forest, Neural Networks | Predicting the efficacy of drug molecules
Components of Learning
The learning process, whether by a human or a machine, can be divided into four basic components:
data storage, abstraction, generalization, and evaluation. Figure 1.1 illustrates the various
components and the steps involved in the learning process.
1. Data storage: Facilities for storing and retrieving huge amounts of data are an important component of
the learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to
store data, and use cables and other technology to retrieve it.
2. Abstraction: The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts about the
data as a whole. The creation of knowledge involves application of known models and creation of new
models. The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
3. Generalization: The third component of the learning process is known as generalization. The term
generalization describes the process of turning the knowledge about stored data into a form that can
be utilized for future action. These actions are to be carried out on tasks that are similar, but not
identical, to those that have been seen before. In generalization, the goal is to discover those properties
of the data that will be most relevant to future tasks.
4. Evaluation: Evaluation is the last component of the learning process. It is the process of giving
feedback to the user to measure the utility of the learned knowledge. This feedback is then utilised to
effect improvements in the whole learning process.
For the checkers-playing problem, given this setting where we must learn to choose among the legal moves,
the most obvious choice for the type of information to be learned is a program, or function, that chooses
the best move for any given board state. Let us call this function ChooseMove and use the notation
ChooseMove : B → M to indicate that this function accepts as input any board from the set of legal
board states B and produces as output some move from the set of legal moves M.
What exactly should be the value of the target function V for any given board state?
The target concept is "days on which my friend Aldo enjoys his favorite water sport."
Solution:
The attribute EnjoySport indicates whether or not Aldo enjoys his favorite water sport on this
day. The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the
values of its other attributes.
What hypothesis representation shall we provide to the learner in this case? Let us begin by
considering a simple representation in which each hypothesis consists of a conjunction of
constraints on the instance attributes. In particular, let each hypothesis be a vector of six
constraints, specifying the values of the six attributes Sky, AirTemp, Humidity, Wind, Water,
and Forecast. For each attribute, the hypothesis will either
• indicate by a "?" that any value is acceptable for this attribute,
• specify a single required value (e.g., Warm) for the attribute, or
• indicate by a "ϕ" that no value is acceptable.
If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive
example (h(x) = 1).
TABLE-2.1: Positive and negative training examples for the target concept EnjoySport.
Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport
1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes
2 | Sunny | Warm | High | Strong | Warm | Same | Yes
3 | Rainy | Cold | High | Strong | Warm | Change | No
4 | Sunny | Warm | High | Strong | Cool | Change | Yes
When learning the target concept, the learner is presented a set of training examples, each
consisting of an instance x from X, along with its target concept value c(x) (e.g., the training
examples in Table-2.1). Instances for which c(x) = 1 are called positive examples, or members of
the target concept. Instances for which c(x) = 0 are called negative examples, or nonmembers of
the target concept. We will often write the ordered pair <x, c(x)> to describe the training
example consisting of the instance x and its target concept value c(x). We use the symbol D to
denote the set of available training examples.
Given a set of training examples of the target concept c, the problem faced by the learner is to
hypothesize, or estimate, c. We use the symbol H to denote the set of all possible hypotheses that
the learner may consider regarding the identity of the target concept. Usually H is determined by
the human designer's choice of hypothesis representation. In general, each hypothesis h in H
represents a boolean-valued function defined over X; that is, h : X→ {0,1}. The goal of the
learner is to find a hypothesis h such that h(x) = c(x) for all x in X.
NOTE- Hypothesis: A hypothesis consists of a conjunction of constraints on the instance
attributes.
A hypothesis h is described by a conjunction of constraints on the attributes; each constraint
may be "?" (any value is acceptable), "ϕ" (no value is acceptable), or a single specific required
value.
Version Space: The version space, denoted VS(H,D) (with respect to hypothesis space H and
training examples D), is the subset of hypotheses from H consistent with the training examples in D.
Now consider the sets of instances that are classified positive by h1 and by h2. Because h2
imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any
instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2
is more general than h1.
This intuitive "more general than" relationship between hypotheses can be defined more
precisely as follows. First, for any instance x in X and hypothesis h in H, we say that x satisfies
h if and only if h(x) = 1. We now define the more-general-than-or-equal-to relation in terms of
the sets of instances that satisfy the two hypotheses: given hypotheses hj and hk, hj is more-
general-than-or-equal-to hk if and only if any instance that satisfies hk also satisfies hj.
Instances, hypotheses, and the more-general-than relation. The box on the left represents the set
X of all instances, the box on the right the set H of all hypotheses. Each hypothesis corresponds
to some subset of X-the subset of instances that it classifies positive. The arrows connecting
hypotheses represent the more-general-than relation, with the arrow pointing toward
the less general hypothesis.
The Question is How can we use the more-general-than partial ordering to organize the search
for a hypothesis consistent with the observed training examples?
One way is to begin with the most specific possible hypothesis in H, then generalize this
hypothesis each time it fails to cover an observed positive training example. (We say that a
hypothesis "covers" a positive example if it correctly classifies the example as positive.) To be
more precise about how the partial ordering is used, consider the FIND-S algorithm.
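A minimal Python sketch of FIND-S for conjunctive hypotheses. The encoding ("ϕ" for the empty constraint, "?" for any value) follows the representation above; the training tuples below are illustrative and consistent with the classic EnjoySport example:

```python
def find_s(examples):
    """FIND-S: return the maximally specific conjunctive hypothesis
    consistent with the positive training examples."""
    n_attrs = len(examples[0][0])
    h = ["ϕ"] * n_attrs                    # most specific hypothesis in H
    for x, label in examples:
        if label != "Yes":                 # FIND-S simply ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] == "ϕ":                # first positive example: copy its attribute values
                h[i] = value
            elif h[i] != value:            # attribute value disagrees: generalize to '?'
                h[i] = "?"
    return h

# Illustrative training examples (attribute vector, EnjoySport label)
data = [
    (["Sunny", "Warm", "Normal", "Strong", "Warm", "Same"], "Yes"),
    (["Sunny", "Warm", "High",   "Strong", "Warm", "Same"], "Yes"),
    (["Rainy", "Cold", "High",   "Strong", "Warm", "Change"], "No"),
    (["Sunny", "Warm", "High",   "Strong", "Cool", "Change"], "Yes"),
]

print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```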
To illustrate FIND-S algorithm, assume the learner is given the sequence of training examples
given in below Table for the Enjoy-Sport task.
The first step of the FIND-S algorithm is to initialize h to the most specific hypothesis in H:
h ← <ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
Upon observing the first training example from above given Table, which happens to be a
positive example, it becomes clear that our hypothesis is too specific. In particular, none of the
"ϕ" constraints in h are satisfied by this example, so each is replaced by the next more general
constraint that fits the example; namely, the attribute values for this training example.
This h is still very specific; it asserts that all instances are negative except for the single positive
training example we have observed. Next, the second training example (also positive in this
case) forces the algorithm to further generalize h, this time substituting a "?" in place of any
attribute value in h that is not satisfied by the new example. The refined hypothesis in this case
is h ← <Sunny, Warm, ?, Strong, Warm, Same>
Upon encountering the third training example-in this case a negative example- the algorithm
makes no change to h. In fact, the FIND-S algorithm simply ignores every negative example!
While this may at first seem strange, notice that in the current case our hypothesis h is already
consistent with the new negative example (i.e., h correctly classifies this example as negative),
and hence no revision is needed. In the general case, as long as we assume that the hypothesis
space H contains a hypothesis that describes the true target concept c and that the training data
contains no errors, then the current hypothesis h can never require a revision in response to a
negative example. To see why, recall that the current hypothesis h is the most specific hypothesis
in H consistent with the observed positive examples.
Because the target concept c is also assumed to be in H and to be consistent with the positive
training examples, c must be more-general-than-or-equal-to h. But the target concept c will
never cover a negative example, thus neither will h (by the definition of more-general-than).
Therefore, no revision to h will be required in response to any negative example.
To complete our trace of FIND-S, the fourth (positive) example leads to a further generalization
of h.
The final maximally specific hypothesis for the given data set is:
h = <Sunny, Warm, ?, Strong, ?, ?>
The FIND-S algorithm illustrates one way in which the more-general-than partial ordering can
be used to organize the search for an acceptable hypothesis.
The search moves from hypothesis to hypothesis, searching from the most specific to
progressively more general hypotheses along one chain of the partial ordering. As given in below
fig illustrates this search in terms of the instance and hypothesis spaces.
The key property of the FIND-S algorithm is that for hypothesis spaces described by
conjunctions of attribute constraints, FIND-S is guaranteed to output the most specific
hypothesis within H that is consistent with the positive training examples. Its final hypothesis
will also be consistent with the negative examples provided the correct target concept is
contained in H, and provided the training examples are correct.
Consistent Hypothesis-
A hypothesis h is consistent with a set of training examples D if and only if h(x) = c(x) for each
training example <x, c(x)> in D.
Notice the key difference between this definition of consistent and our earlier definition of
satisfies. An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is
a positive or negative example of the target concept.
However, whether such an example is consistent with h depends on the target concept, and in
particular, whether h(x) = c(x).
The LIST-THEN-ELIMINATE algorithm
The LIST-THEN-ELIMINATE algorithm first initializes the version space to contain every
hypothesis in H, and then eliminates any hypothesis found inconsistent with any training example:
1. VersionSpace ← a list containing every hypothesis in H.
2. For each training example <x, c(x)>, remove from VersionSpace any hypothesis h for which
h(x) ≠ c(x).
3. Output the list of hypotheses in VersionSpace.
In principle this algorithm can be applied whenever the hypothesis space H is finite, but it requires
exhaustively enumerating all hypotheses in H, which is unrealistic for all but the most trivial
hypothesis spaces.
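A minimal Python sketch of LIST-THEN-ELIMINATE, assuming a deliberately tiny two-attribute hypothesis space so that H can actually be enumerated (the attribute names, values, and examples below are invented for illustration):

```python
from itertools import product

def consistent(h, x, label):
    """A conjunctive hypothesis h classifies x as positive iff every
    constraint is '?' or matches the attribute value."""
    predicted_positive = all(c == "?" or c == v for c, v in zip(h, x))
    return predicted_positive == (label == "Yes")

def list_then_eliminate(hypothesis_space, examples):
    version_space = list(hypothesis_space)        # 1. start with every hypothesis in H
    for x, label in examples:                     # 2. drop hypotheses inconsistent with any example
        version_space = [h for h in version_space if consistent(h, x, label)]
    return version_space                          # 3. output the surviving hypotheses

# Tiny illustrative problem with only two attributes ('ϕ' hypotheses omitted for brevity)
sky_values = ["Sunny", "Rainy", "?"]
temp_values = ["Warm", "Cold", "?"]
H = list(product(sky_values, temp_values))

data = [(("Sunny", "Warm"), "Yes"), (("Rainy", "Cold"), "No")]
print(list_then_eliminate(H, data))   # [('Sunny', 'Warm'), ('Sunny', '?'), ('?', 'Warm')]
```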
The figure above shows a version space with its general and specific boundary sets. The version
space includes all six hypotheses shown there, but can be represented more simply by S and G.
Arrows indicate instances of the more-general-than relation. This is the version space for the
EnjoySport concept learning problem and the training examples of Table-2.1.
In fact, this is just one of six different hypotheses from H that are consistent with these training
examples. All six hypotheses are shown in above Figure. They constitute the version space
relative to this set of data and this hypothesis representation.
The arrows among these six hypotheses in the above figure indicate instances of the more-general-
than relation. The CANDIDATE-ELIMINATION algorithm represents the version space by
storing only its most general members (labeled G in Figure) and its most specific (labeled S in
the figure).
CANDIDATE-ELIMINATION Algorithm- The CANDIDATE-ELIMINATION
algorithm is a second approach to concept learning that addresses several of the
limitations of the FIND-S algorithm. Notice that although FIND-S outputs a hypothesis from H
that is consistent with the training examples, this is just one of many hypotheses from H that
might fit the training data equally well.
The CANDIDATE-ELIMINATION algorithm finds the set of all hypotheses (called the version space)
that are consistent with the training examples.
Surprisingly, the CANDIDATE-ELIMINATION algorithm computes the description of this set
without explicitly enumerating all of its members.
This is accomplished by again using the more-general-than partial ordering, this time to
maintain a compact representation of the set of consistent hypotheses and to incrementally refine
this representation as each new training example is encountered.
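A simplified Python sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses, maintaining only the S and G boundary sets; it omits some of the within-boundary pruning of the full algorithm, and the training tuples are the same illustrative EnjoySport-style data used earlier:

```python
def matches(h, x):
    """h classifies x as positive iff every constraint is '?' or equals the value ('0' stands for ϕ)."""
    return all(c != "0" and (c == "?" or c == v) for c, v in zip(h, x))

def more_general(h1, h2):
    """True if h1 is more-general-than-or-equal-to h2."""
    return all(a == "?" or b == "0" or a == b for a, b in zip(h1, h2))

def min_generalize(s, x):
    """Minimal generalization of an S-boundary hypothesis so that it covers positive example x."""
    return tuple(v if c == "0" else (c if c == v else "?") for c, v in zip(s, x))

def min_specializations(g, x, domains):
    """Minimal specializations of a G-boundary hypothesis that exclude negative example x."""
    result = []
    for i, c in enumerate(g):
        if c == "?":
            for value in domains[i]:
                if value != x[i]:
                    result.append(g[:i] + (value,) + g[i + 1:])
    return result

def candidate_elimination(examples, domains):
    n = len(domains)
    S = {("0",) * n}              # most specific boundary
    G = {("?",) * n}              # most general boundary
    for x, label in examples:
        if label == "Yes":        # positive example: prune G, minimally generalize S
            G = {g for g in G if matches(g, x)}
            S = {min_generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:                     # negative example: prune S, minimally specialize G
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if matches(g, x):
                    new_G.update(h for h in min_specializations(g, x, domains)
                                 if any(more_general(h, s) for s in S))
                else:
                    new_G.add(g)
            G = new_G
    return S, G

# Illustrative training data, consistent with the classic EnjoySport example
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),  "Yes"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),  "Yes"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
]
domains = [sorted({x[i] for x, _ in data}) for i in range(6)]

S, G = candidate_elimination(data, domains)
print("S boundary:", S)   # {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
print("G boundary:", G)   # {('Sunny', '?', ...), ('?', 'Warm', ...)}
```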
CANDIDATE-ELIMINATION Trace 3: The positive training example generalizes the S
boundary from S3 to S4. One member of G3 must also be deleted, because it is no longer more
general than the S4 boundary.
Concept Learning - Summary
• Concept learning can be seen as a problem of searching through a large predefined space of
potential hypotheses.
• The general-to-specific partial ordering of hypotheses provides a useful structure for organizing
the search through the hypothesis space.
• The FIND-S algorithm utilizes this general-to-specific ordering, performing a specific-to-
general search through the hypothesis space along one branch of the partial ordering, to find the
most specific hypothesis consistent with the training examples.
• The CANDIDATE-ELIMINATION algorithm utilizes this general-to-specific ordering to
compute the version space (the set of all hypotheses consistent with the training data) by
incrementally computing the sets of maximally specific (S) and maximally general (G)
hypotheses.
• Because the S and G sets delimit the entire set of hypotheses consistent with the data, they
provide the learner with a description of its uncertainty regarding the exact identity of the target
concept. This version space of alternative hypotheses can be examined
– to determine whether the learner has converged to the target concept,
– to determine when the training data are inconsistent,
– to generate informative queries to further refine the version space, and
– to determine which unseen instances can be unambiguously classified based on the
partially learned concept.
• The CANDIDATE-ELIMINATION algorithm is not robust to noisy data or to situations in
which the unknown target concept is not expressible in the provided hypothesis space.
• Inductive learning algorithms are able to classify unseen examples only because of their
implicit inductive bias for selecting one consistent hypothesis over another.
• If the hypothesis space is enriched to the point where there is a hypothesis corresponding to
every possible subset of instances (the power set of the instances), this will remove any inductive
bias from the CANDIDATE-ELIMINATION algorithm.
– Unfortunately, this also removes the ability to classify any instance beyond the
observed training examples.
– An unbiased learner cannot make inductive leaps to classify unseen examples.
INDUCTIVE BIAS