ML Unit 2
Let's say you're tasked with building a machine learning model to predict
housing prices based on various features such as size, location, number of
bedrooms, etc.
1. **Overfitting Scenario**:
- You have a small dataset of 100 houses with various features and
corresponding prices.
- You decide to train a complex neural network with many layers and
parameters to learn intricate relationships in the data.
- As you train the model, it starts to fit the training data extremely well,
achieving very low error.
- However, when you evaluate the model on a separate test set (houses the
model hasn't seen before), you notice that the model's performance is poor. It
fails to generalize to new data and predicts prices that are far from the actual
prices.
- This is a classic case of overfitting. The model has learned to memorize the
noise in the training data rather than capturing the true underlying patterns. It
performs poorly on unseen data because it's essentially making predictions
based on random fluctuations in the training data.
2. **Avoiding Overfitting**:
- **Use Sufficient Data**: Collect more data if possible. A larger dataset can
help the model learn more representative patterns and reduce the risk of
overfitting.
- **Cross-Validation**: Split the data into training and validation sets and use
techniques like k-fold cross-validation to evaluate the model's performance on
multiple validation folds. This provides a more robust estimate of the model's
generalization ability (a short sketch follows below).
By implementing these strategies, you can build machine learning models that
generalize well to new data and avoid the pitfalls of overfitting.
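As one illustration of the cross-validation idea, here is a minimal scikit-learn sketch; the synthetic data and the choice of a linear model are assumptions made purely for demonstration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic "housing-like" data: 100 samples with a few numeric features.
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# 5-fold cross-validation: each fold serves once as the validation set.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean(), "std:", scores.std())
```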
2. What is a ‘training set’ and a ‘test set’ in a machine learning
model?
In this topic, we discuss the training and test datasets and the difference between
them, starting with an introduction to the training dataset and the test dataset in
Machine Learning.
For example, to train a sentiment analysis model, the training data could be as
below:
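A tiny illustrative sketch of such labeled training data (the sentences and labels below are made up for illustration, not taken from a real dataset):

```python
# Each input text is tagged with its corresponding output label.
training_data = [
    ("The product quality is excellent", "positive"),
    ("Delivery was late and the box arrived damaged", "negative"),
    ("Works exactly as described", "positive"),
]
```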
The training data varies depending on whether we are using Supervised Learning or
Unsupervised Learning Algorithms.
For Unsupervised learning, the training data contains unlabeled data points, i.e.,
inputs are not tagged with the corresponding outputs. Models are required to find
the patterns from the given training datasets in order to make predictions.
On the other hand, for supervised learning, the training data contains labels in order
to train the model and make predictions.
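For the unsupervised case, a minimal sketch (the points and the choice of KMeans are assumptions for illustration) shows the model receiving inputs only, with no output labels:

```python
from sklearn.cluster import KMeans

# Unlabeled training data: inputs only, no corresponding outputs.
X_unlabeled = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# The model must discover structure (here, two clusters) on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
print(kmeans.labels_)  # cluster assignments found by the model
```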
The type of training data we provide to the model largely determines its accuracy
and predictive ability: the better the quality of the training data, the better the
model's performance. The training set typically accounts for roughly 60% or more of
the total data in an ML project.
At this stage, we can also compare the testing accuracy with the training accuracy,
i.e., how accurate the model is on the test dataset versus the training dataset. If the
model's accuracy on the training data is noticeably higher than on the test data, the
model is said to be overfitting.
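A minimal sketch of this comparison, assuming scikit-learn's built-in iris dataset and a decision tree purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# An unconstrained tree can memorize the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
# A training accuracy well above the test accuracy suggests overfitting.
print(f"train={train_accuracy:.2f}, test={test_accuracy:.2f}")
```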
To see why the split matters, imagine training the model on one dataset and then
testing it on a completely different dataset: the model would not be able to capture
the correlations between the features, and its performance would drop. Hence it is
important to take a single dataset and split it into two parts, a training set and a
test set.
In this way, we can easily evaluate the performance of our model: if it performs well
on the training data but poorly on the test dataset, the model is likely overfitted.
For splitting the dataset, we can use the train_test_split function of scikit-learn.
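A minimal sketch of such a split; the small NumPy arrays here are only placeholders for a real feature matrix and label vector:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Placeholder data: 10 samples with 2 features each, plus 10 labels.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the samples as the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```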
Explanation:
In the first line of the above code, we have imported the train_test_split function from
the sklearn library.
A model is said to be overfitted when it performs quite well on the training dataset
but does not generalize well to new or unseen data. Overfitting occurs when the
model tries to cover every data point and hence starts capturing the noise present in
the data. Because of this, it cannot generalize well to a new dataset, and the accuracy
and efficiency of the model degrade. Generally, complex models have a higher
chance of overfitting. There are various ways to avoid overfitting, such as using
cross-validation, stopping training early, or applying regularization.
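As a minimal sketch of one of these options, L2 regularization via Ridge regression (the synthetic data and alpha value are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=20, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Ridge adds an L2 penalty on the coefficients; larger alpha means stronger regularization.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("train R^2:", ridge.score(X_train, y_train))
print("test R^2:", ridge.score(X_test, y_test))
```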
On the other hand, a model is said to be under-fitted when it is not able to capture
the underlying trend of the data, meaning it shows poor performance even on the
training dataset. In most cases, underfitting occurs when the model is not well suited
to the problem we are trying to solve. To avoid the underfitting issue, we can increase
the training time of the model or increase the number of features in the dataset.
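A minimal sketch of fixing underfitting by adding features, here polynomial terms; the synthetic quadratic data is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A straight line underfits data with a curved (quadratic) trend.
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.RandomState(0).normal(scale=0.5, size=50)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("linear R^2:", linear.score(X, y))  # low: the model underfits
print("poly   R^2:", poly.score(X, y))    # higher after adding polynomial features
```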
Once the model is trained enough with the relevant training data, it is tested with the
test data. We can understand the whole process of training and testing in three
steps, which are as follows:
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in
Supervised Learning), and the model transforms the training data into text
vectors or a number of data features.
3. Test: In the last step, we test the model by feeding it with the test
data/unseen dataset. This step ensures that the model is trained efficiently
and can generalize well.
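A minimal sketch of these three steps on a toy sentiment task (all texts and labels are made up, and the pipeline choice is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Feed: made-up labeled training examples.
train_texts = ["great product", "terrible service", "loved it", "very disappointing"]
train_labels = ["positive", "negative", "positive", "negative"]

# 2. Define: the vectorizer turns each text into a numeric feature vector.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# 3. Test: feed unseen data and check that the model generalizes.
print(model.predict(["great service", "terrible product"]))
```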
Good training data typically has the following qualities:
1. Relevant:
The first quality of training data is that it should be relevant to the problem you are
going to solve. For example, if you are building a model to analyze social media
data, then the data should be taken from different social sites such as Twitter, Facebook,
Instagram, etc.
2. Uniform:
There should always be uniformity among the features of a dataset. It means all data
for a particular problem should be taken from the same source with the same
attributes.
3. Consistent:
In the dataset, similar attributes must always correspond to similar labels in order to
keep the dataset consistent.
Classification and regression, the two main types of supervised learning, differ in the
following ways:
1. **Nature of Output**:
- **Classification**: The output variable in classification is categorical,
meaning it falls into one of a discrete set of classes or categories. Examples
include binary classification (two classes) like spam or not spam, and multiclass
classification (more than two classes) like different types of flowers.
- **Regression**: The output variable in regression is continuous, meaning it
can take on any value within a range. Examples include predicting house prices,
stock prices, temperature, etc.
2. **Objective**:
- **Classification**: The main objective of classification is to classify data
points into predefined classes or categories based on input features.
- **Regression**: The main objective of regression is to predict a continuous
numeric value based on input features.
3. **Model Output**:
- **Classification**: The output of a classification model is a probability score
or a class label indicating the likelihood of each class for a given input. For
binary classification, it may output probabilities for each class and assign the
observation to the class with the highest probability.
- **Regression**: The output of a regression model is a numeric value that
represents the predicted target variable.
4. **Evaluation Metrics**:
- **Classification**: Common evaluation metrics for classification tasks
include accuracy, precision, recall, F1-score, ROC-AUC (Receiver Operating
Characteristic - Area Under the Curve), etc.
- **Regression**: Common evaluation metrics for regression tasks include
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute
Error (MAE), R-squared (coefficient of determination), etc.
5. **Examples**:
- **Classification**: Spam email detection, sentiment analysis, image
classification, medical diagnosis, customer churn prediction (yes/no), etc.
- **Regression**: Predicting house prices, stock prices, temperature
forecasting, sales forecasting, predicting the age of a person based on health
parameters, etc.
6. **Algorithms**:
- **Classification**: Algorithms commonly used for classification include
logistic regression, decision trees, random forests, support vector machines
(SVM), k-nearest neighbors (KNN), neural networks, etc.
- **Regression**: Algorithms commonly used for regression include linear
regression, polynomial regression, decision trees, random forests, gradient
boosting algorithms (e.g., XGBoost, LightGBM), neural networks, etc.
Understanding these differences helps in selecting the appropriate machine
learning approach for a given problem based on the nature of the target
variable and the objective of the task.
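A minimal sketch contrasting the two task types on synthetic data; the particular models and metrics here are assumptions chosen only for illustration:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Classification: categorical target, evaluated here with accuracy.
Xc, yc = make_classification(n_samples=200, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("accuracy:", accuracy_score(yc_te, clf.predict(Xc_te)))

# Regression: continuous target, evaluated here with mean squared error.
Xr, yr = make_regression(n_samples=200, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("MSE:", mean_squared_error(yr_te, reg.predict(Xr_te)))
```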
# Making predictions
y_pred = model.predict(X_test)
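A minimal self-contained sketch of the kind of program being described, using synthetic data as an illustrative assumption:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(scale=2.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting the model
model = LinearRegression().fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Visualizing the fitted line against the actual test points
order = X_test.ravel().argsort()
plt.scatter(X_test, y_test, label="actual")
plt.plot(X_test[order], y_pred[order], color="red", label="predicted")
plt.xlabel("feature")
plt.ylabel("target")
plt.legend()
plt.show()
```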
You can run this program in a Python environment with scikit-learn and
matplotlib libraries installed. It demonstrates a simple example of linear
regression, but you can apply similar principles to more complex datasets and
regression problems.