ML UNIT-2

1. What is overfitting and how can you avoid it?


Ans- Overfitting occurs when a machine learning model captures noise or
random fluctuations in the training data, rather than learning the underlying
patterns or relationships. This leads to a model that performs well on the
training data but fails to generalize to unseen data.

Here's an example to illustrate overfitting and how to avoid it:

**Example: Predicting Housing Prices**

Let's say you're tasked with building a machine learning model to predict
housing prices based on various features such as size, location, number of
bedrooms, etc.

1. **Overfitting Scenario**:

- You have a small dataset of 100 houses with various features and
corresponding prices.
- You decide to train a complex neural network with many layers and
parameters to learn intricate relationships in the data.
- As you train the model, it starts to fit the training data extremely well,
achieving very low error.
- However, when you evaluate the model on a separate test set (houses the
model hasn't seen before), you notice that the model's performance is poor. It
fails to generalize to new data and predicts prices that are far from the actual
prices.
- This is a classic case of overfitting. The model has learned to memorize the
noise in the training data rather than capturing the true underlying patterns. It
performs poorly on unseen data because it's essentially making predictions
based on random fluctuations in the training data.

2. **Avoiding Overfitting**:

- **Use Sufficient Data**: Collect more data if possible. A larger dataset can
help the model learn more representative patterns and reduce the risk of
overfitting.

- **Simplify the Model**: Instead of using a complex neural network, consider
using a simpler model like linear regression or decision trees. These models
are less prone to overfitting and can still capture important relationships in
the data.

- **Regularization**: Apply regularization techniques such as L1 or L2
regularization to penalize large parameter values and prevent the model from
becoming too complex.

- **Cross-Validation**: Split the data into training and validation sets and use
techniques like k-fold cross-validation to evaluate the model's performance on
multiple validation sets. This provides a more robust estimate of the model's
generalization ability.

- **Feature Selection/Engineering**: Select only the most relevant features,
or create new features that are more informative. This helps the model
focus on the most important information and avoid fitting noise.

By implementing these strategies, you can build machine learning models that
generalize well to new data and avoid the pitfalls of overfitting.
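As a minimal sketch of two of these strategies (assuming scikit-learn and NumPy are installed; the five-feature dataset below is made up to stand in for the housing features above), Ridge applies L2 regularization through its alpha parameter and can reduce overfitting relative to a plain linear model:

# Minimal sketch: L2 regularization (Ridge) vs. an unregularized linear model.
# The synthetic 5-feature dataset is hypothetical, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 houses, 5 features
true_w = np.array([3.0, 0.0, 0.0, 1.5, 0.0])  # only two features matter
y = X @ true_w + rng.normal(scale=2.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("Plain", LinearRegression()), ("Ridge", Ridge(alpha=10.0))]:
    model.fit(X_train, y_train)
    print(name, "test MSE:", mean_squared_error(y_test, model.predict(X_test)))

A higher alpha shrinks the coefficients more aggressively; in practice it is tuned with cross-validation (e.g., scikit-learn's RidgeCV).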
2. What is a ‘training set’ and a ‘test set’ in a machine learning model?

Ans- Train and Test Datasets in Machine Learning
Machine Learning is one of the fastest-growing technologies in the world, enabling
computers/machines to turn huge amounts of data into predictions. However, these
predictions depend heavily on the quality of the data; if we do not use the right
data for our model, it will not generate the expected results. In machine learning
projects, we generally divide the original dataset into training data and test data.
We train our model on a subset of the original dataset, i.e., the training dataset,
and then evaluate whether it can generalize well to new or unseen data, i.e., the
test set. Therefore, train and test datasets are two key concepts of machine
learning: the training dataset is used to fit the model, and the test dataset
is used to evaluate the model.

In this topic, we discuss train and test datasets along with the differences
between them. So, let's start with the introduction of the training dataset and
test dataset in Machine Learning.

What is a Training Dataset?


The training data is the biggest (in size) subset of the original dataset, which is
used to train or fit the machine learning model. First, the training data is fed to
the ML algorithm, which lets it learn how to make predictions for the given task.

For example, for training a sentiment analysis model, the training data could be as
below:


Input                      Output (Label)
"The New UI is Great"      Positive
"Update is really Slow"    Negative

The training data varies depending on whether we are using Supervised Learning or
Unsupervised Learning Algorithms.
For Unsupervised learning, the training data contains unlabeled data points, i.e.,
inputs are not tagged with the corresponding outputs. Models are required to find
the patterns from the given training datasets in order to make predictions.

On the other hand, for supervised learning, the training data contains labels in order
to train the model and make predictions.

The training data we provide to the model largely determines the model's
accuracy and prediction ability: the better the quality of the training data, the
better the performance of the model. Training data typically makes up 60% or
more of the total data for an ML project.

What is a Test Dataset?


Once we train the model with the training dataset, it's time to test it with the
test dataset. This dataset evaluates the performance of the model and checks
whether it can generalize well to new or unseen data. The test dataset is
another subset of the original data, independent of the training dataset.
However, it shares the same types of features and a similar class probability
distribution, and it serves as a benchmark for model evaluation once training is
completed. Ideally, the test data covers each type of scenario the model would
face when used in the real world. Usually, the test dataset is approximately
20-25% of the total original data for an ML project.

At this stage, we can also compare the testing accuracy with the training
accuracy, i.e., how accurate our model is on the test dataset versus the
training dataset. If the accuracy on the training data is much greater than on
the testing data, the model is said to be overfitting.


The testing data should:

o Be representative of the original dataset.

o Be large enough to yield meaningful evaluation results.

Need for Splitting the Dataset into Train and Test Sets

Splitting the dataset into train and test sets is an important part of data pre-
processing, as by doing so we can reliably estimate the performance of our model
and hence its predictive ability.

We can understand it this way: suppose we train our model on one dataset and
then test it on data drawn from a completely different dataset. The correlations
between features that the model learned may not hold in the new data, so its
measured performance will drop. Hence, it is important to split a single dataset
into two parts, a train set and a test set, so that both come from the same
underlying distribution.

In this way, we can fairly evaluate the performance of our model. For example,
if it performs well on the training data but does not perform well on the test
dataset, the model is probably overfitted.

For splitting the dataset, we can use the train_test_split function of scikit-learn.

The below lines of code can be used to split the dataset:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Explanation:

In the first line of the above code, we import the train_test_split function from
the sklearn library.

In the second line, we use four variables:

o x_train: the features for the training data

o x_test: the features for the testing data
o y_train: the dependent variable (labels) for the training data
o y_test: the dependent variable (labels) for the testing data
o In the train_test_split() function, we pass four arguments. The first two are
the arrays of data (features x and labels y), and test_size specifies the fraction
of the data held out for testing. A test_size of 0.5, 0.3, or 0.2 gives a 50:50,
70:30, or 80:20 split between training and testing sets.
o The last parameter, random_state, sets a seed for the random number
generator so that you always get the same split; the most commonly used value
is 42.

Overfitting and Underfitting Issues

Overfitting and underfitting are the most common problems that occur in a
machine learning model.

A model is said to be overfitted when it performs very well on the training
dataset but does not generalize well to new or unseen data. Overfitting occurs
when the model tries to cover every data point and hence starts capturing the
noise present in the data. Because of this, it cannot generalize well to new
data, and the accuracy and efficiency of the model degrade. Generally, complex
models have a higher chance of overfitting. There are various ways to avoid
overfitting, such as using cross-validation, stopping the training early, or
applying regularization; a small cross-validation sketch is given below.
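The sketch below (assuming scikit-learn is installed; the make_regression dataset is synthetic) shows 5-fold cross-validation: the model is trained and scored on five different train/validation splits, giving a more robust estimate of generalization than a single split:

# Minimal 5-fold cross-validation sketch with a synthetic dataset.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores.round(3))
print("Mean R^2:", scores.mean().round(3))

A large gap between training accuracy and the cross-validated scores is a typical sign of overfitting.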


On the other hand, a model is said to be under-fitted when it is not able to
capture the underlying trend of the data, i.e., it shows poor performance even
on the training dataset. In most cases, underfitting occurs when the model is
not suitable for the problem we are trying to solve. To avoid the underfitting
issue, we can increase the training time of the model or increase the number of
features in the dataset.

Training data vs. Testing Data


o The main difference between training data and testing data is that training
data is the subset of original data that is used to train the machine learning
model, whereas testing data is used to check the accuracy of the model.
o The training dataset is generally larger in size compared to the testing dataset.
The general ratios of splitting train and test datasets are 80:20, 70:30, or
90:10.
o Training data is well known to the model as it is used to train the model,
whereas testing data is like unseen/new data to the model.

How do training and testing data work in Machine Learning?
Machine Learning algorithms enable machines to make predictions and solve
problems on the basis of past observations or experiences. An algorithm takes
these experiences or observations from the training data that is fed to it.
Further, one of the great things about ML algorithms is that they can learn and
improve over time as they are trained with relevant training data.

Once the model is trained enough with the relevant training data, it is tested with the
test data. We can understand the whole process of training and testing in three
steps, which are as follows:

1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in
Supervised Learning), and the model transforms the training data into text
vectors or a number of data features.
3. Test: In the last step, we test the model by feeding it with the test
data/unseen dataset. This step ensures that the model is trained efficiently
and can generalize well.



Traits of Quality Training Data
As the predictive ability of an ML model depends heavily on how it has been
trained, it is important to train the model with quality data. Further, ML
follows the principle of "Garbage In, Garbage Out": whatever data we feed into
our model determines the quality of its predictions. Quality training data
should satisfy the following points:

1. Relevant

The very first requirement is that the training data be relevant to the problem
you are going to solve. For example, if you are building a model to analyze
social media data, then the data should be taken from different social sites
such as Twitter, Facebook, Instagram, etc.
2. Uniform:

There should always be uniformity among the features of a dataset. It means all data
for a particular problem should be taken from the same source with the same
attributes.

3. Consistency: In the dataset, similar attributes must always correspond to
similar labels in order to ensure consistency in the dataset.

4. Comprehensive: The training data must be large enough to cover all the
feature variations needed to train the model well. With a comprehensive
dataset, the model is able to learn the edge cases.

3. When will you use classification over regression?


Ans- Use classification when the target variable is categorical, and regression
when it is continuous. Here's a comparison highlighting the differences between
classification and regression in machine learning:

1. **Nature of Output**:
- **Classification**: The output variable in classification is categorical,
meaning it falls into one of a discrete set of classes or categories. Examples
include binary classification (two classes) like spam or not spam, and multiclass
classification (more than two classes) like different types of flowers.
- **Regression**: The output variable in regression is continuous, meaning it
can take on any value within a range. Examples include predicting house prices,
stock prices, temperature, etc.

2. **Objective**:
- **Classification**: The main objective of classification is to classify data
points into predefined classes or categories based on input features.
- **Regression**: The main objective of regression is to predict a continuous
numeric value based on input features.

3. **Model Output**:
- **Classification**: The output of a classification model is a probability score
or a class label indicating the likelihood of each class for a given input. For
binary classification, it may output probabilities for each class and assign the
observation to the class with the highest probability.
- **Regression**: The output of a regression model is a numeric value that
represents the predicted target variable.

4. **Evaluation Metrics**:
- **Classification**: Common evaluation metrics for classification tasks
include accuracy, precision, recall, F1-score, ROC-AUC (Receiver Operating
Characteristic - Area Under the Curve), etc.
- **Regression**: Common evaluation metrics for regression tasks include
Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute
Error (MAE), R-squared (coefficient of determination), etc.

5. **Examples**:
- **Classification**: Spam email detection, sentiment analysis, image
classification, medical diagnosis, customer churn prediction (yes/no), etc.
- **Regression**: Predicting house prices, stock prices, temperature
forecasting, sales forecasting, predicting the age of a person based on health
parameters, etc.

6. **Algorithms**:
- **Classification**: Algorithms commonly used for classification include
logistic regression, decision trees, random forests, support vector machines
(SVM), k-nearest neighbors (KNN), neural networks, etc.
- **Regression**: Algorithms commonly used for regression include linear
regression, polynomial regression, decision trees, random forests, gradient
boosting algorithms (e.g., XGBoost, LightGBM), neural networks, etc.

Understanding these differences helps in selecting the appropriate machine
learning approach for a given problem, based on the nature of the target
variable and the objective of the task.
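As a minimal sketch of this choice (assuming scikit-learn; both datasets below are synthetic), note how the classifier outputs a discrete label and class probabilities, while the regressor outputs a continuous number:

# Classification: the target is a class label (e.g., spam / not spam).
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression

Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print("Predicted class:", clf.predict(Xc[:1]))           # e.g., [0] or [1]
print("Class probabilities:", clf.predict_proba(Xc[:1]))

# Regression: the target is a continuous value (e.g., a house price).
Xr, yr = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print("Predicted value:", reg.predict(Xr[:1]))           # a real number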

4. What are the assumptions you need to make before starting with linear regression?
Ans- Before applying linear regression in machine learning, it's important to
consider several assumptions to ensure the validity and reliability of the model.
These assumptions are crucial for the interpretation of the regression
coefficients and the accuracy of predictions. Here are the key assumptions
associated with linear regression:

1. **Linearity**: The relationship between the independent variables (features)
and the dependent variable (target) should be linear. This means that the
change in the dependent variable is proportional to the change in the
independent variables. You can check this assumption by plotting the variables
and assessing whether the relationship appears to be linear.

2. **Independence**: The observations in the dataset should be independent of
each other. In other words, there should be no correlation between the
residuals (errors) of the model. This assumption ensures that each observation
contributes unique information to the model. You can check for independence
by examining the autocorrelation plot of the residuals.

3. **Homoscedasticity**: The variance of the residuals should be constant
across all levels of the independent variables. In simpler terms, the spread of
the residuals should remain consistent as the values of the independent
variables change. You can assess homoscedasticity by plotting the residuals
against the predicted values and checking for a consistent spread of points.

4. **Normality of Residuals**: The residuals (errors) should be normally
distributed, i.e., their distribution should resemble a bell-shaped curve when
plotted. Normality of residuals is important for making valid statistical
inferences and for accurate confidence intervals and hypothesis testing. You
can check for normality by creating a histogram or a Q-Q plot of the residuals.

5. **No Multicollinearity**: There should be no multicollinearity among the
independent variables, meaning that they should not be highly correlated with
each other. Multicollinearity can lead to unstable estimates of the regression
coefficients and affect the interpretation of the model. You can assess
multicollinearity using correlation matrices or variance inflation factors (VIFs).

6. **No Outliers or Influential Points**: Outliers or influential points can
disproportionately affect the regression results and may violate the
assumptions of linearity, independence, and homoscedasticity. It's important to
detect and handle outliers appropriately to ensure the integrity of the model.

Before fitting a linear regression model, it's advisable to check these
assumptions to determine whether linear regression is an appropriate modeling
technique for your dataset. If any of these assumptions are violated, alternative
regression methods or data transformations may be necessary. Additionally,
diagnostic plots and statistical tests can be used to assess the validity of the
assumptions and the overall goodness of fit of the model.
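As a minimal illustration of a few of these checks (a sketch assuming scikit-learn, SciPy, statsmodels, and matplotlib are installed; the data is synthetic), the snippet below plots residuals against fitted values for homoscedasticity, draws a Q-Q plot for normality, and computes variance inflation factors for multicollinearity:

# Minimal sketch of linear regression assumption checks on synthetic data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Homoscedasticity: residuals vs. fitted values should show no clear pattern.
plt.scatter(fitted, residuals)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Normality of residuals: points on the Q-Q plot should follow a straight line.
stats.probplot(residuals, plot=plt)
plt.show()

# Multicollinearity: VIF values above roughly 5-10 suggest a problem.
for i in range(X.shape[1]):
    print("VIF for feature", i, ":", variance_inflation_factor(X, i))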

5. Write a program for linear regression.


Ans-
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generating a synthetic dataset
np.random.seed(0)
X = 2 * np.random.rand(100, 1)  # Independent variable
y = 4 + 3 * X + np.random.randn(100, 1)  # Dependent variable with some noise

# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating and fitting the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Plotting the training data, test data, and the linear regression line
plt.scatter(X_train, y_train, color='blue', label='Training Data')
plt.scatter(X_test, y_test, color='green', label='Test Data')
plt.plot(X_test, y_pred, color='red', linewidth=3, label='Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()
plt.show()

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

This program performs the following steps:

1. Generates a synthetic dataset with one independent variable X and one
dependent variable y. The relationship between X and y follows the
equation y = 4 + 3*X + noise.
2. Splits the dataset into training and test sets using the train_test_split
function from scikit-learn.
3. Creates a linear regression model using the LinearRegression class from
scikit-learn and fits it to the training data.
4. Makes predictions on the test data using the trained model.
5. Plots the training data points, test data points, and the linear regression
line.
6. Evaluates the model's performance using the Mean Squared Error (MSE)
metric.

You can run this program in a Python environment with scikit-learn and
matplotlib libraries installed. It demonstrates a simple example of linear
regression, but you can apply similar principles to more complex datasets and
regression problems.
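As a quick follow-up check (using the model variable from the program above), the fitted parameters should be close to the true values 4 and 3 used to generate the data:

# Inspect the learned parameters; they should approximate the true values.
print("Intercept:", model.intercept_)  # close to 4
print("Slope:", model.coef_)           # close to 3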

6. Explain the estimation of regression coefficients.

Ans- In linear regression, the coefficients are usually estimated by the method
of least squares, which picks the coefficient values that minimize the sum of
squared residuals (the squared differences between the observed and predicted
values of y). For simple linear regression, y = b0 + b1*x, the least squares
estimates are:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1 * x̄

where x̄ and ȳ are the sample means of x and y. For multiple regression in
matrix form, the estimate is b = (XᵀX)⁻¹ Xᵀ y.
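A minimal NumPy sketch of this closed-form estimate (the synthetic data matches the earlier program's y = 4 + 3*X + noise):

# Estimate [b0, b1] by least squares: b = (X^T X)^(-1) X^T y,
# computed with the numerically stable np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(0)
x = 2 * rng.random(100)
y = 4 + 3 * x + rng.normal(size=100)

X = np.column_stack([np.ones_like(x), x])  # prepend an intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated [b0, b1]:", b)            # roughly [4, 3]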
7. Explain the properties of least squares estimators.

Ans- Under the standard linear regression assumptions (linearity, independent
errors with zero mean, and constant variance), the least squares estimators
have the following properties:

o Linear: each estimator is a linear function of the observed values of y.
o Unbiased: the expected value of each estimator equals the true coefficient.
o Minimum variance: by the Gauss-Markov theorem, among all linear unbiased
estimators, the least squares estimators have the smallest variance; they are
BLUE (Best Linear Unbiased Estimators).
o Consistent: as the sample size grows, the estimates converge to the true
coefficients.
