L5 Subject Review
• Introduction to Forecasting
Linear Regression
Ŷ = 2 + 1.25X
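As a small illustration, the fitted line above can be evaluated for new values of X. The function name `predict` and the input values below are assumptions made for this sketch; only the intercept 2 and slope 1.25 come from the equation.

```python
def predict(x, intercept=2.0, slope=1.25):
    """Return the fitted value Y-hat = intercept + slope * x."""
    return intercept + slope * x


# Input values chosen purely for illustration.
for x in (0, 4, 10):
    print(f"X = {x:>2}  ->  predicted Y = {predict(x)}")
```

The slope says that each one-unit increase in X raises the predicted Y by 1.25; the intercept is the prediction when X is 0.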
Decision Tree
[Figure: decision tree diagram labelling the root node, a branch / sub-tree, a decision node, and terminal (leaf) nodes]
• supervised machine learning algorithm used for both classification and prediction tasks
• graphical representation of a decision-making process
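The vocabulary above (root node, decision node, terminal node) can be made concrete with a hand-built tree. This is only a sketch: the feature names, thresholds, and class labels are invented for illustration, and a real tree would be learned from training data rather than written by hand.

```python
# Hypothetical tree: every feature name, threshold, and label is made up.
tree = {
    "feature": "income", "threshold": 50.0,   # root node
    "left": {"label": "no"},                  # terminal node (leaf)
    "right": {                                # decision node, root of a sub-tree
        "feature": "age", "threshold": 30.0,
        "left": {"label": "no"},              # terminal node (leaf)
        "right": {"label": "yes"},            # terminal node (leaf)
    },
}


def classify(node, record):
    """Follow branches from the root until a terminal node is reached."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(tree, {"income": 80.0, "age": 45.0}))
```

Each record follows exactly one path from the root to a leaf, which is why the tree doubles as a readable picture of the decision process.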
Logistic Regression
[Figure: plot with axes X1 and X2]
Clustering
• unsupervised learning task
• segment the data into a set of homogeneous clusters of records for the purpose of
generating insights, discovering patterns, data reduction, etc.
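One way to see the segmentation described above in action is a tiny one-dimensional k-means sketch. k-means is an assumption here (the text above does not name a clustering algorithm), and the data values, starting centres, and the function name `kmeans_1d` are invented for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest centre and
    moving each centre to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Made-up records with two visibly separate groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No labels are supplied at any point, which is what makes the task unsupervised: the groups come only from how close the records are to one another.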
Time Series Forecasting
How well will our prediction or classification model perform when we apply it to new data?
• The chosen model should be able to generalise beyond the dataset that we have at hand.
• To support generalisation, we partition the data and try to avoid overfitting.
Overfitting
Overfitting in data mining (and in machine learning more broadly) occurs when a model
learns the training data too well, capturing not just the underlying patterns but also the
noise and random fluctuations present in the training set.
As a result, an overfitted model may perform exceptionally well on the training data but fail
to generalise to new, unseen data.
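The effect can be illustrated with a deliberately extreme case: a model that simply memorises the training records, compared with a straight-line fit. This is a sketch under stated assumptions: the data are made up (roughly y = 2x plus alternating noise), and the helper names `fit_memoriser`, `fit_line`, and `mean_abs_error` are hypothetical.

```python
def fit_memoriser(xs, ys):
    """Model that repeats the training y of the nearest training x."""
    pairs = list(zip(xs, ys))

    def model(x):
        nearest = min(pairs, key=lambda p: abs(p[0] - x))
        return nearest[1]

    return model


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: roughly y = 2x with alternating noise.
train_x, train_y = [0, 3, 6, 9, 12, 15], [2, 4, 14, 16, 26, 28]
new_x, new_y = [1, 4, 7, 10, 13], [3, 7, 15, 19, 27]

memoriser = fit_memoriser(train_x, train_y)
line = fit_line(train_x, train_y)

for name, model in (("memoriser", memoriser), ("line", line)):
    print(name,
          "training error:", round(mean_abs_error(model, train_x, train_y), 3),
          "new-data error:", round(mean_abs_error(model, new_x, new_y), 3))
```

On this data the memoriser has zero training error because it reproduces the noise exactly, yet its error on the new records is larger than that of the simpler line.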
Data Partitioning
Training Data
• Typically, the largest partition
• contains the data used to build the models we are
examining
Validation Data
• used to assess the predictive performance of each
model so that you can compare models and choose
the best one
• In some algorithms, the validation partition may be
used to tune and improve the model
Test Data
• Sometimes called the evaluation partition; used to assess
the performance of the chosen model on new data
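A random split into the three partitions described above might look like the following sketch. The 60/20/20 proportions, the seed, and the function name `partition` are assumptions for illustration; the text only says that the training partition is typically the largest.

```python
import random


def partition(records, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle the records, then cut them into training, validation,
    and test partitions. The test partition takes whatever remains."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Record IDs 0..99 stand in for real rows.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Models are built on `train`, compared on `valid`, and the single chosen model is then scored once on `test`, so the final performance estimate comes from data that played no part in building or selecting it.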
Data Partitioning for Time Series
• In time series, a random partition would create two time series with “holes”
• Standard forecasting methods cannot handle time series with missing values
• The solution is to partition the series into two periods:
  • the earlier period is set as the training data
  • the later period is set as the validation data
• Methods are trained on the earlier training period, and their predictive
performance is assessed on the later validation period
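The two-period split above can be sketched as a simple slice of a chronologically ordered series. The function name `split_time_series`, the 24 made-up monthly values, and the choice of a 6-period validation window are assumptions for illustration.

```python
def split_time_series(series, n_valid):
    """Split a chronologically ordered series into an earlier training
    period and a later validation period, with no shuffling."""
    if not 0 < n_valid < len(series):
        raise ValueError("n_valid must be between 1 and len(series) - 1")
    return series[:-n_valid], series[-n_valid:]


# Made-up monthly values, oldest first.
monthly = [100 + 2 * t for t in range(24)]
train_period, valid_period = split_time_series(monthly, n_valid=6)
print(len(train_period), len(valid_period))
```

Because the order is preserved, both periods are gap-free series, and the validation period mimics the real task of forecasting values that come after the data the method was trained on.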
Time Series Forecasting