Regression
Testing Data:
used to evaluate the model's performance and assess how well it generalizes to
new, unseen data.
It's like a final exam for the model.
Comparison of training data and testing data:

Aspect     Training Data                   Testing Data
Purpose    Used for teaching the model.    Used for evaluating the model.
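The split itself is easy to sketch in code. A minimal example of holding out testing data with a shuffled index split (the hours/scores numbers and the 80/20 ratio are illustrative assumptions, not from the text):

```python
import numpy as np

# Toy dataset: hours studied (X) and test scores (y). Made-up numbers.
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([35, 40, 50, 55, 62, 68, 74, 78, 85, 90], dtype=float)

# Shuffle the indices, then hold out 20% of the students as testing data.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
n_test = int(0.2 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]

X_train, y_train = X[train_idx], y[train_idx]   # for teaching the model
X_test, y_test = X[test_idx], y[test_idx]       # for the "final exam"

print(len(X_train), len(X_test))
```

Shuffling before splitting matters: if the data is sorted (say, by hours studied), a naive "last 20%" split would give the model an exam drawn only from the heaviest studiers.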
Overfitting:
Occurs when a statistical model, such as a machine learning algorithm, is too
complex relative to the data it is trying to fit. In other words, the model is
excessively flexible and captures the noise and random fluctuations in the
training data, rather than just the underlying patterns or relationships. This
leads to the following characteristics:
High Training Accuracy: An overfit model often performs exceptionally well on the training data because it
has essentially memorized it, including the noise.
Poor Generalization: an overfit model doesn't generalize well to new, unseen data.
When you apply it to new data, it tends to perform poorly because its predictions
are based on the noise it learned during training.
Complexity: Overfit models are usually complex, with many parameters or features, and they may exhibit
intricate patterns that don't reflect the true underlying relationships in the data.
Overfitting Example:
Imagine you have a group of students, and you want to predict their test scores based on the
number of hours they spent studying. You collect data from 50 students and build a complex
model that considers not only the number of hours they studied but also factors like the
exact minute they started studying, the type of music they listened to, and even the color of
their pens.
This model fits the training data incredibly well. It predicts each student's test score precisely
based on all these variables. However, when you use this model to predict the scores of new
students who weren't part of your original dataset, it performs poorly. It's almost like your
model has memorized the previous students' scores and study habits but can't generalize to
new students.
In simple terms, overfitting is like having an overly complicated recipe that only works for
specific ingredients and measurements. It doesn't adapt well to new ingredients.
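The study-hours story can be reproduced numerically. In this sketch (the data, the noise level, and the degree-14 polynomial are all illustrative assumptions, not from the text), an extremely flexible model threads through every noisy training point, so its training error is tiny while its testing error is much larger:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed underlying relationship: score rises linearly with study hours,
# plus random noise (the "color of the pen" effects).
def true_score(hours):
    return 40 + 5 * hours

hours_train = np.linspace(1, 10, 15)
scores_train = true_score(hours_train) + rng.normal(0, 3, size=hours_train.size)

hours_test = np.linspace(1.3, 9.7, 15)   # new students, interior hours
scores_test = true_score(hours_test) + rng.normal(0, 3, size=hours_test.size)

# A degree-14 polynomial through 15 points can chase every noisy wiggle.
overfit = np.polynomial.Polynomial.fit(hours_train, scores_train, deg=14)

def rmse(model, x, y):
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

train_err = rmse(overfit, hours_train, scores_train)
test_err = rmse(overfit, hours_test, scores_test)
print(f"train RMSE: {train_err:.3f}, test RMSE: {test_err:.3f}")
```

The exact numbers depend on the random seed, but the gap between a near-zero training error and a much larger testing error is the signature of overfitting.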
Underfitting:
Occurs when a model is too simplistic to capture the underlying patterns in the
data. Essentially, the model is too rigid and fails to represent the complexities in
the dataset. This leads to the following characteristics:
Low Training and Test Accuracy: An underfit model performs poorly on both the
training data and new data. It doesn't fit the training data well and, as a result,
can't make accurate predictions.
Oversimplified: underfit models are too simple for the task at hand, such as
using a linear model to represent a nonlinear relationship in the data.
Bias: Underfit models are biased because they make strong assumptions about
the data that may not hold true.
Underfitting Example:
Now, consider the opposite scenario. You build a very simplistic model that predicts test scores
solely based on the average number of hours all students in your dataset spent studying. This
model doesn't pay attention to individual study habits, time of day, or other factors. It's just a
straight line, representing the average.
This simple model doesn't fit your training data very well. It consistently underestimates or
overestimates the actual test scores. When you use this model to predict scores for the same
students it was trained on, it doesn't do a great job either because it's too basic.
In simple terms, underfitting is like trying to guess the doneness of a steak by only looking at
the color of the outside. You're missing all the important details inside.
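The average-only model above can be sketched in a few lines (the hours and scores below are made-up numbers for illustration). It ignores every individual feature and predicts the same flat value for everyone, so it misses badly even on its own training data:

```python
import numpy as np

# Made-up students: hours studied and the scores they earned.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
scores = np.array([35, 40, 50, 55, 62, 68, 74, 78, 85, 90], dtype=float)

# Underfit model: ignore hours entirely and predict the class average.
mean_score = scores.mean()
predictions = np.full_like(scores, mean_score)

rmse = float(np.sqrt(np.mean((predictions - scores) ** 2)))
print(f"always-predict-the-mean RMSE on the training data: {rmse:.2f}")
# The flat line underestimates the top students and overestimates the rest.
```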
The key is to find a balance between these extremes. We want a model that captures the essential patterns in the
data without being too simple or too complex. It's like having a recipe that's just right, adaptable to different
ingredients but still able to make a delicious dish. This balanced model will perform well on both the data it was
trained on and new, unseen data.
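One common way to find that balance is to fit candidate models of different complexity and keep the one with the lowest error on held-out data. A minimal sketch, assuming a linear true relationship, made-up noise, and polynomial candidates (degree 0 is the flat "average" model; very high degrees play the role of the over-complicated recipe):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up students: score rises roughly linearly with hours studied.
hours = np.sort(rng.uniform(1, 10, size=40))
scores = 40 + 5 * hours + rng.normal(0, 3, size=hours.size)

# Hold out every fourth student (interior points) as "new, unseen data".
test_mask = np.arange(hours.size) % 4 == 2
h_tr, s_tr = hours[~test_mask], scores[~test_mask]
h_te, s_te = hours[test_mask], scores[test_mask]

def heldout_rmse(deg):
    """Fit a polynomial of the given degree, score it on held-out students."""
    model = np.polynomial.Polynomial.fit(h_tr, s_tr, deg)
    return float(np.sqrt(np.mean((model(h_te) - s_te) ** 2)))

# Too simple underfits, too flexible overfits; held-out error picks the middle.
errors = {deg: heldout_rmse(deg) for deg in [0, 1, 3, 9, 15]}
best = min(errors, key=errors.get)
print(errors)
print("degree with lowest held-out error:", best)
```

The flat degree-0 model loses clearly to the modest ones; which moderate degree wins varies with the noise, which is exactly why the evaluation is done on held-out data rather than on the training set.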