
Unit1 ML


Discuss any four examples of machine learning applications.

1. Recommendation systems: Machine learning algorithms are used to analyze user preferences and
behavior to make personalized recommendations for products, movies, music, and more. For
example, companies like Netflix and Amazon use machine learning to suggest content or products
to their users based on their past interactions.

2. Fraud detection: Machine learning is used in the financial industry to detect fraudulent activities
such as credit card fraud, identity theft, and money laundering. By analyzing patterns and anomalies
in large datasets, machine learning algorithms can identify potential fraudulent transactions and
alert the appropriate authorities.

3. Medical diagnosis: Machine learning is being used in healthcare to assist with medical diagnosis
and treatment planning. For example, machine learning algorithms can analyze medical images
such as X-rays, MRIs, and CT scans to help doctors detect diseases or abnormalities at an early
stage.

4. Natural language processing: Machine learning is used in natural language processing (NLP)
applications such as chatbots, virtual assistants, and language translation services. These
applications use machine learning algorithms to understand and respond to human language in a
way that simulates human conversation.
Differentiate between Supervised, Unsupervised and Reinforcement Learning
Supervised Learning: Supervised learning is a type of machine learning where the algorithm is
trained on a labeled dataset, which means the input data is paired with corresponding output labels.
The algorithm learns to map the input data to the correct output by generalizing from the labeled
examples. During training, the model makes predictions, and the error between the predicted output
and the actual label is used to adjust the model parameters. The goal is to minimize this error and
enable the model to make accurate predictions on new, unseen data.

Unsupervised Learning: Unsupervised learning involves training a model on an unlabeled dataset, where the algorithm explores the inherent structure or patterns in the data without explicit guidance. Unlike supervised learning, there are no predefined output labels for the input data.
Common tasks in unsupervised learning include clustering, where the algorithm groups similar data
points together, and dimensionality reduction, which aims to reduce the number of features while
preserving essential information. Unsupervised learning is particularly useful for exploring and
discovering hidden patterns within datasets.

Reinforcement Learning: Reinforcement learning is a paradigm where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards
or punishments based on the actions it takes. The objective is for the agent to learn a policy or
strategy that maximizes the cumulative reward over time. Reinforcement learning involves a
sequence of actions, states, and rewards. The agent explores different actions and learns to
associate them with positive or negative outcomes. It learns by trial and error, adjusting its strategy
to achieve the most favorable long-term outcomes.

Examples:

• Supervised Learning: Image classification, regression.
• Unsupervised Learning: Clustering, dimensionality reduction.
• Reinforcement Learning: Game playing, robotic control.
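To make the first two paradigms concrete, here is a minimal sketch (in Python with scikit-learn, an illustrative choice not mandated by these notes) that fits a supervised classifier and an unsupervised clusterer on the same toy data; reinforcement learning is omitted because it requires an interactive environment loop:

```python
# Contrast supervised vs. unsupervised learning on the same toy data.
# The dataset and models are arbitrary illustrative choices.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 2-D points drawn from three groups.
X, y = make_blobs(n_samples=150, centers=3, random_state=42)

# Supervised learning: labels y are available, so we fit a classifier.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy on training data:", clf.score(X, y))

# Unsupervised learning: we ignore y and let the algorithm find structure.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster assignments for first 5 points:", km.labels_[:5])
```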
Briefly describe the concepts of model selection and generalisation.
Model selection is a critical step in the machine learning process where a data scientist or machine
learning practitioner chooses the best algorithm or model from a set of candidate models. The goal
is to pick the model that is most likely to generalize well to new, unseen data. The process involves
evaluating and comparing the performance of different models on a validation set or using cross-
validation techniques.
Key steps in model selection include:

1. Candidate Model Selection: Choose a set of candidate models with different architectures,
hyperparameters, or algorithms.

2. Training Models: Train each candidate model on a portion of the dataset.

3. Validation: Evaluate the models on a separate validation set to assess their performance.

4. Performance Metrics: Use appropriate metrics (e.g., accuracy, precision, recall, F1 score)
to quantify the performance of each model.

5. Selection Criteria: Select the model that performs best according to the chosen evaluation
metrics.

Model selection helps prevent overfitting (fitting the training data too closely) and underfitting
(failing to capture the underlying patterns), ensuring the chosen model has good predictive power
on new, unseen data.

Generalization:

Generalization refers to the ability of a machine learning model to perform well on new, previously
unseen data. A model that generalizes well has learned the underlying patterns in the training data
and can make accurate predictions on examples it has never encountered before.
Key considerations for achieving good generalization include:

1. Training Data Quality: Ensure the training data is representative of the broader population
to which the model will be applied.

2. Model Complexity: Avoid overly complex models that may fit the training data too closely
(overfitting). Simpler models are often more robust and generalize better.

3. Regularization: Apply regularization techniques to penalize overly complex models and prevent them from capturing noise in the training data.

4. Validation and Testing: Use separate validation and test datasets to evaluate the model's
performance on data it has not seen during training.

5. Cross-Validation: Employ cross-validation methods to assess how well the model generalizes across different subsets of the data. A minimal sketch of this selection loop follows below.
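A minimal sketch of the selection loop referenced above, assuming scikit-learn and its built-in iris dataset purely for illustration:

```python
# Model selection via cross-validation: compare candidate models and
# pick the one with the best validated score.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: candidate models with different algorithms.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Steps 2-5: train and validate each candidate with 5-fold
# cross-validation, then select the best mean accuracy.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)
print(scores)
print("Selected model:", best)
```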
Compare classification with regression using an example

Classification:

Definition: Classification is a type of supervised learning where the goal is to categorize input
data into predefined classes or labels. It is suitable for problems where the output is discrete and
falls into distinct categories.

Output Type: The output of a classification task is categorical, representing different classes or
labels. Examples include binary classification (spam or not spam) or multiclass classification
(image recognition - cat, dog, bird).

Example: Spam Email Detection (Binary Classification): For instance, in spam email detection,
the algorithm is trained to classify emails as either spam (1) or not spam (0). The logistic regression
formula for binary classification is given by:

P(Y=1∣X) = 1 / (1 + e^(−(β0 + β1X1 + β2X2 + ... + βnXn)))

This logistic function maps the input features (Xi) to a probability between 0 and 1, facilitating classification into the respective classes.

Evaluation Metrics: Common evaluation metrics for classification include accuracy, precision,
recall (sensitivity), and the F1 score. These metrics assess the model's performance in terms of
correct predictions and the balance between false positives and false negatives.
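As a hedged illustration of this pipeline, the sketch below trains a logistic regression on synthetic binary data (the dataset, features, and settings are made-up stand-ins for a real spam corpus) and reports the metrics just listed:

```python
# Binary classification with logistic regression on synthetic data,
# standing in for the spam-detection example above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic "spam vs. not spam" data: label 1 = spam, 0 = not spam.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# P(Y=1 | X) from the logistic function described above.
print("P(spam) for first test email:", clf.predict_proba(X_test)[0, 1])
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
```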

Regression:

Definition: Regression, like classification, is a form of supervised learning, but it aims to predict
continuous numeric values rather than discrete classes. It is used when the output is not confined to
distinct categories but represents a quantity that can vary within a range.

Output Type: The output of a regression task is continuous, representing a numeric value.
Examples include predicting house prices, temperature, or sales.

Example: House Price Prediction (Linear Regression): In the context of predicting house
prices, the linear regression formula is utilized:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

This formula expresses the relationship between the input features (Xi), the coefficients (βi), and the predicted output (Y).

Evaluation Metrics: Common evaluation metrics for regression include mean squared error
(MSE), mean absolute error (MAE), and R-squared. These metrics provide insights into the
accuracy and precision of the model's predictions.
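A comparable sketch for regression, using synthetic house-price-style data (the features, coefficients, and noise level are invented for illustration):

```python
# Linear regression on made-up house-price data, mirroring the
# formula Y = β0 + β1X1 + β2X2 + ε above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(0)
# Two invented features: size (sq. ft) and number of rooms.
X = np.column_stack([rng.uniform(500, 3500, 200), rng.integers(1, 6, 200)])
# Price = β0 + β1*size + β2*rooms + noise (ε).
y = 50_000 + 120 * X[:, 0] + 15_000 * X[:, 1] + rng.normal(0, 20_000, 200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
print("coefficients (β1, β2):", model.coef_)
print("intercept (β0):", model.intercept_)
print("MSE:", mean_squared_error(y, y_pred))
print("MAE:", mean_absolute_error(y, y_pred))
print("R-squared:", r2_score(y, y_pred))
```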
Distinguish between overfitting and underfitting. How can they affect model generalization?

Overfitting:

Overfitting occurs when a machine learning model learns the training data too well, capturing
noise and random fluctuations that may not represent the true underlying patterns of the data. In
essence, the model becomes too complex, fitting the training data perfectly but struggling to
generalize well to new, unseen data. Signs of overfitting include excessively low training error but
high validation or test error.

Causes of Overfitting:

1. Model Complexity: Using a highly complex model with too many parameters.
2. Insufficient Data: Training on a small dataset that doesn't capture the true variability of the
underlying distribution.
3. Overemphasis on Outliers: Model may fit outliers in the training data instead of the
general trend.

Effects on Model Generalization:

• The model may perform exceptionally well on the training data but poorly on new, unseen
data.
• Generalization to real-world scenarios is compromised, and the model might fail to make
accurate predictions outside the training dataset.
• Overfit models are sensitive to noise and variations, making them less robust.

Underfitting:

Underfitting occurs when a model is too simple to capture the underlying patterns in the training
data. The model fails to learn the relationships and trends in the data, resulting in poor
performance on both the training and new data. Signs of underfitting include high training error
and high validation or test error.

Causes of Underfitting:

1. Model Too Simple: Using a model with insufficient complexity to represent the underlying
patterns.
2. Inadequate Features: Not including enough relevant features in the model.
3. Insufficient Training: The model has not been trained long enough to capture the
complexities of the data.

Effects on Model Generalization:


• Poor performance on both the training and new data.
• The model fails to capture the underlying structure of the data, leading to inaccurate
predictions.
• Underfit models are overly simplistic and lack the capacity to generalize to diverse
scenarios.

Balancing Overfitting and Underfitting:

Achieving a balance between overfitting and underfitting is crucial for model generalization.
Regularization techniques, such as dropout and L1/L2 regularization, can help control overfitting.
Increasing model complexity, adding more relevant features, and training on larger datasets can
mitigate underfitting. Cross-validation is a valuable tool to assess a model's generalization
performance, helping to identify the optimal trade-off between fitting the training data and
generalizing to new data.
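One way to see this balance, sketched below under illustrative assumptions (a sine-shaped ground truth, arbitrary polynomial degrees, and an L2 penalty via scikit-learn's Ridge): an overly simple model underfits, an unpenalized complex model overfits, and regularization pulls the complex model back toward better generalization:

```python
# Underfitting vs. overfitting vs. regularized fit on noisy data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (120, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 120)  # noisy underlying pattern
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

models = {
    "degree 1, no penalty (underfits)": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "degree 15, no penalty (overfits)": make_pipeline(PolynomialFeatures(15), LinearRegression()),
    "degree 15, L2 penalty (balanced)": make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # A large gap between train and validation R2 signals overfitting;
    # low scores on both signal underfitting.
    print(f"{name}: train R2={model.score(X_train, y_train):.2f}, "
          f"val R2={model.score(X_val, y_val):.2f}")
```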

Model Selection in Detail

Model selection is a crucial aspect of the machine learning workflow that involves choosing the
best model from a set of candidate models for a particular task. The goal is to identify a model that
not only performs well on the training data but also generalizes effectively to new, unseen data.
Model selection is essential to ensure the chosen model has the right complexity, avoids overfitting
or underfitting, and provides robust predictions.

Key Steps in Model Selection:

1. Candidate Model Selection:

• Begin by selecting a set of candidate models that are suitable for the given task. This
may include various algorithms, architectures, or hyperparameter configurations.
2. Training Models:

• Train each candidate model on a portion of the dataset. The training data is used to
teach the model to capture patterns and relationships within the data.
3. Validation:

• Evaluate the performance of each model on a separate validation set that the model
has not seen during training. This set is crucial for assessing how well the model
generalizes to new data.
4. Performance Metrics:

• Use appropriate performance metrics (e.g., accuracy, precision, recall, F1 score for
classification; MSE, MAE, R-squared for regression) to quantify the performance of
each model on the validation set.
5. Selection Criteria:
• Choose the model that performs best according to the chosen evaluation metrics.
The selection criteria may vary depending on the specific goals of the task, such as
maximizing accuracy, minimizing error, or optimizing for a trade-off between
precision and recall.
6. Hyperparameter Tuning:

• Fine-tune the hyperparameters of the selected model to achieve optimal performance. Hyperparameters are configuration settings that are not learned from the data but impact the model's learning process.
7. Testing:

• Validate the final model on a separate test set that it has never encountered before.
This provides a final assessment of the model's generalization performance.

Considerations in Model Selection:

1. Bias-Variance Trade-off:

• Striking a balance between bias and variance is crucial. A model with high bias may
underfit, while a model with high variance may overfit. The goal is to find the sweet
spot that minimizes both bias and variance.
2. Complexity of the Model:

• The complexity of the model should be appropriate for the complexity of the
underlying data. Too simple a model may underfit, while too complex a model may
overfit.
3. Data Quality:

• The quality and representativeness of the training, validation, and test data play a
vital role in model selection. Ensuring a diverse and representative dataset helps in
better generalization.
4. Cross-Validation:

• Utilize cross-validation techniques to assess how well the model generalizes across
different subsets of the data. This helps in obtaining a more robust estimate of the
model's performance.
5. Domain Knowledge:

• Consider domain knowledge and task-specific requirements when selecting a model. Certain models may be more suitable for specific types of data or tasks. A sketch of the full selection workflow follows below.
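The sketch below strings steps 1-7 together with scikit-learn's GridSearchCV; the dataset, model family, and hyperparameter grid are illustrative assumptions, not prescriptions:

```python
# End-to-end model selection: hold out a test set, cross-validate a
# hyperparameter grid, then report final generalization performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# Hold out a test set the search never sees (step 7).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate configurations (steps 1 and 6), each validated with
# 5-fold cross-validation (steps 2-5).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X_train, y_train)

print("best configuration:", grid.best_params_)
print("cross-validated accuracy:", grid.best_score_)
print("final test accuracy:", grid.score(X_test, y_test))
```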
What are the different methods for measuring classifier performance?

Confusion Matrix
It probably got its name from the state of confusion it deals with. If you remember hypothesis testing, you may recall the two errors we defined as type-I and type-II. As depicted in Fig.1, a type-I error occurs when the null hypothesis is rejected even though it is actually true, and a type-II error occurs when the alternate hypothesis is true but you fail to reject the null hypothesis.

Fig.1: Type-I and Type-II errors

Figure 1 also depicts that the choice of confidence interval affects the probabilities of these errors occurring. The catch is that if you try to reduce either of these errors, the other one will increase.

So, what is confusion matrix?


Fig.2: Confusion Matrix

The confusion matrix is the image given above: a matrix representation of the results of any binary test. For example, let us take the case of predicting a disease. You have done some medical testing, and with the help of those results you are going to predict whether a person has the disease; in effect, you are validating whether the hypothesis of declaring a person as having the disease is acceptable. Say, among 100 people you predict 20 to have the disease. In reality only 15 people have the disease, and among those 15 you have diagnosed 12 correctly. Put into a confusion matrix, the result looks like the following —

Fig.3: Confusion Matrix of prediction a disease


So, if we compare fig.3 with fig.2 we will find —

1. True Positive: 12 (You have predicted the positive case correctly!)
2. True Negative: 77 (You have predicted the negative case correctly!)
3. False Positive: 8 (You have predicted these people as having the disease, but in reality they do not. Do not worry, this can be rectified in further medical analysis, so it is a low-risk error. This is the type-I error in this case.)
4. False Negative: 3 (You have predicted these three poor fellows as fit, but they actually have the disease. This is dangerous! Be careful! This is the type-II error in this case.)

Now, if I ask what the accuracy of the prediction model behind these results is, the answer should be the ratio of the correctly predicted cases to the total number of people, which is (12+77)/100 = 0.89. If you study the confusion matrix thoroughly, you will find the following —

1. The top row depicts the total number of people you predicted as having the disease. Among these predictions, 12 people actually have the disease. So the ratio 12/(12+8) = 0.6 measures how accurate your model is when it flags a person as having the disease. This is called the Precision of the model.
2. Now take the first column. It represents the total number of people who actually have the disease, and you have predicted correctly for 12 of them. So the ratio 12/(12+3) = 0.8 measures how well your model detects a person with the disease out of all the people who actually have it. This is termed Recall. (These calculations are spelled out in the sketch below.)
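The arithmetic above, spelled out in a few lines of Python (the four counts come straight from Fig.3):

```python
# Accuracy, precision, and recall from the disease example.
TP, TN, FP, FN = 12, 77, 8, 3

accuracy = (TP + TN) / (TP + TN + FP + FN)   # (12+77)/100 = 0.89
precision = TP / (TP + FP)                   # 12/(12+8)   = 0.60
recall = TP / (TP + FN)                      # 12/(12+3)   = 0.80

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```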

Now, you may ask: why do we need to measure precision or recall to evaluate the model?

The answer is that they matter most when a particular outcome is highly sensitive. For example, suppose you are building a model for a bank to predict fraudulent transactions. Fraudulent transactions are not very common: in 1000 transactions there may be only 1 fraud. So, undoubtedly, your model will predict non-fraudulent transactions very accurately, and overall accuracy will always be very high regardless of how well the model catches fraud, since fraud makes up a very low percentage of the whole population. But predicting a fraudulent transaction as non-fraudulent is not acceptable. In this case, the measurement of recall plays a vital role in evaluating the model: it tells you, out of all the actual fraudulent transactions, how many the model catches. If recall is low, the model is not acceptable even if the overall accuracy is high.

Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds (see fig.2). In our disease detection example, TPR is the ratio between the number of accurate predictions of people having the disease and the total number of people who actually have it. FPR is the ratio between the number of people incorrectly predicted to have the disease and the total number of people who do not actually have it. So, if we plot the curve, it comes out like this —
Fig.4: ROC curve (source: https://www.medcalc.org/manual/roc-curves.php)

The blue line denotes the change of TPR with FPR for a model. The larger the ratio of the area under the curve to the total area (100 x 100 in this case), the better the model discriminates. An AUC of 1 means perfect separation on the evaluated data (and, when measured on the training data, is often a sign of overfitting), while an AUC at or below 0.5 (the curve lying along the dotted diagonal line) means the model is no better than random guessing.
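If you have predicted scores in hand, the curve and its area can be computed as below; the labels and scores here are made up for illustration:

```python
# ROC curve and AUC from predicted probabilities.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.2, 0.6, 0.7, 0.8, 0.4, 0.9, 0.5, 0.65]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR vs. FPR per threshold
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, y_score))  # 0.5 = chance, 1.0 = perfect
```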

For classification models there are many other evaluation methods, such as Gain and Lift charts and the Gini coefficient, but an in-depth knowledge of the confusion matrix can help you evaluate any classification model very effectively. So, in this article I have tried to demystify the confusion around the confusion matrix.

Holdout
The purpose of holdout evaluation is to test a model on
different data than it was trained on. This provides an
unbiased estimate of learning performance.

In this method, the dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build
predictive models.
2. Validation set is a subset of the dataset used to assess
the performance of the model built in the training
phase. It provides a test platform for fine-tuning a
model’s parameters and selecting the best performing
model. Not all modeling algorithms need a validation
set.
3. Test set, or unseen data, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.

The holdout approach is useful because of its speed, simplicity, and flexibility. However, this technique is often associated with high variability, since differences between the training and test datasets can result in meaningful differences in the estimate of accuracy.
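A minimal sketch of such a three-way split, assuming scikit-learn and an illustrative 60/20/20 partition:

```python
# Train/validation/test holdout split via two calls to train_test_split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 20% as the final test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```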

Cross-Validation
Cross-validation is a technique that involves partitioning
the original observation dataset into a training set, used to
train the model, and an independent set used to evaluate
the analysis.

The most common cross-validation technique is k-fold cross-validation, where the original dataset is partitioned into k equal-size subsamples, called folds. Here k is a user-specified number, usually 5 or 10. The procedure is repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to give the total effectiveness of the model.

For instance, when performing five-fold cross-validation, the data is first partitioned into 5 parts of (approximately) equal size. A sequence of models is then trained: the first model is trained using the first fold as the test set and the remaining folds as the training set. This is repeated for each of the 5 splits of the data, and the accuracy estimate is averaged over all 5 trials to give the total effectiveness of the model.

As can be seen, every data point gets to be in a test set exactly once and in a training set k-1 times. This significantly reduces bias, as most of the data is used for fitting, and it also reduces the variance of the performance estimate, as every data point is eventually tested and the results are averaged over the k trials. Rotating the training and test sets in this way is what makes the method effective.
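The five-fold procedure can be made explicit with scikit-learn's KFold; the dataset and classifier below are illustrative choices:

```python
# k-fold cross-validation, written out fold by fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Each fold serves as the test set exactly once; the other k-1
    # folds form the training set.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("per-fold accuracy:", scores)
print("averaged estimate:", sum(scores) / len(scores))
```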

Model Evaluation Metrics

Model evaluation metrics are required to quantify model performance. The choice of evaluation metrics depends on the machine learning task at hand (such as classification, regression, ranking, clustering, or topic modeling). Some metrics, such as precision and recall, are useful for multiple tasks. Supervised learning tasks such as classification and regression constitute the majority of machine learning applications, so here we focus on metrics for these two kinds of supervised learning models.

Classification Metrics
In this section we will review some of the metrics used in
classification problems, namely:

● Classification Accuracy

● Confusion matrix

● Logarithmic Loss

● Area under curve (AUC)

● F-Measure
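A combined sketch computing each metric listed above with scikit-learn; the labels and predicted probabilities are made up for illustration:

```python
# Classification accuracy, confusion matrix, log loss, AUC, and F1
# from one set of toy predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                             roc_auc_score, f1_score)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_prob = [0.2, 0.9, 0.6, 0.4, 0.8, 0.1, 0.3, 0.7]   # predicted P(class 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]      # hard labels at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("log loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))
print("F1:", f1_score(y_true, y_pred))
```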

Bias-Variance Trade-off

Bias:

Bias indicates how far off, on average, a model's predictions are likely to be on future unseen data; simply put, it is the error introduced by the model's simplifying assumptions. Complex models, assuming there is enough training data available, can make predictions accurately, whereas models that are too naive are very likely to perform badly. Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand but, in general, less flexible, implying lower predictive performance on complex problems.
Variance:

Variance is the sensitivity of the model to the training data; it quantifies how much the model will change when the training data is changed. Ideally, the model shouldn't change too much from one training dataset to the next, which means the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variables. Accordingly, a model should have low variance, meaning it doesn't change drastically after the training data changes (it is generalizable). Higher variance will make a model change drastically even with a small change in the training dataset.
Let’s understand what the bias-variance tradeoff is.

Bias-Variance Tradeoff

The aim of any supervised machine learning algorithm is to achieve low bias and low variance, making it more robust so that it achieves better performance.

There is no escape from the relationship between bias and variance in machine learning.

There is an inverse relationship between bias and variance:

● An increase in bias will decrease the variance.
● An increase in the variance will decrease the bias.

There is a trade-off at play between these two concepts, and algorithms must find a balance between bias and variance.

As a matter of fact, one cannot calculate the real bias and variance
error terms because we do not know the actual underlying target
function.
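Even so, bias and variance can be estimated empirically. The sketch below (with an assumed sine-shaped target and arbitrary polynomial degrees) refits two models on many resampled training sets and measures how much each one's prediction at a fixed point drifts, making high bias and high variance visible:

```python
# Empirical bias/variance probe: refit on fresh training samples and
# watch the spread of predictions at a fixed query point.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x0 = np.array([[1.5]])  # fixed query point; true value is sin(1.5)

preds = {"degree 1 (high bias)": [], "degree 12 (high variance)": []}
for _ in range(200):
    # Draw a fresh noisy training sample from the same underlying function.
    X = rng.uniform(-3, 3, (30, 1))
    y = np.sin(X).ravel() + rng.normal(0, 0.3, 30)
    for name, degree in [("degree 1 (high bias)", 1), ("degree 12 (high variance)", 12)]:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[name].append(model.fit(X, y).predict(x0)[0])

for name, p in preds.items():
    # Mean offset from sin(1.5) reflects bias; spread across training
    # sets reflects variance.
    print(f"{name}: mean={np.mean(p):.2f} (true={np.sin(1.5):.2f}), std={np.std(p):.2f}")
```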
