
Subject Review

• Data preparation and exploration

• Data analysis using charts

• Introduction to Decision Trees

• Introduction to Cluster Analysis

• Introduction to Linear Regression

• Introduction to Logistic Regression

• Introduction to Forecasting

Linear Regression

• supervised machine learning algorithm used for prediction tasks
• models the relationship between a numerical response and one or more explanatory variables, e.g.:

Ŷ = 2 + 1.25 X
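The fitted line above can be recovered by ordinary least squares. A minimal sketch, using made-up data points generated exactly from Ŷ = 2 + 1.25 X so that OLS returns those coefficients:

```python
# Simple linear regression by ordinary least squares (illustration only;
# the data points below are made up to match the slide's equation).

def fit_simple_ols(xs, ys):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Points lying exactly on Y = 2 + 1.25 X, so the fit recovers b0=2, b1=1.25
xs = [0, 2, 4, 6, 8]
ys = [2 + 1.25 * x for x in xs]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)  # 2.0 1.25
```

With real data the points would not sit exactly on the line, and b0, b1 would be the best-fitting estimates rather than exact values.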
Decision Tree

[Tree diagram: the Root Node splits into Decision Nodes along branches (sub-trees); splitting continues until Terminal Nodes (leaves) are reached.]

• supervised machine learning algorithm used for both classification and prediction tasks
• graphical representation of a decision-making process
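A decision tree's decision-making process reduces to nested if/else splits. A hand-built toy sketch (thresholds and labels are made up, not learned from data) showing a root node, a decision node, and terminal nodes:

```python
# Toy decision tree for a house-price class; the splits are invented
# for illustration, not produced by a tree-learning algorithm.

def classify(square_footage, age):
    # Root node: first split on square footage
    if square_footage >= 2000:
        # Decision node: further split on age
        if age <= 20:
            return "High price"    # terminal node (leaf)
        return "Medium price"      # terminal node (leaf)
    return "Low price"             # terminal node (leaf)

print(classify(2323, 17))  # High price
print(classify(1720, 30))  # Low price
```

A learned tree would choose each split (variable and threshold) to best separate the training records, but the traversal logic is exactly this cascade of tests.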
Logistic Regression

• supervised machine learning algorithm used for binary classification
• predicts the probability that a given observation belongs to a particular class, outputting probabilities between 0 and 1
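The probability output comes from passing a linear score through the sigmoid (logistic) function. A minimal sketch with made-up coefficients:

```python
import math

# Logistic regression prediction step; the coefficients b0, b1 are
# invented for illustration (in practice they are estimated from data).

def predict_proba(x, b0=-4.0, b1=2.0):
    score = b0 + b1 * x                     # linear predictor
    return 1.0 / (1.0 + math.exp(-score))   # sigmoid maps score to (0, 1)

p = predict_proba(3.0)
label = 1 if p >= 0.5 else 0   # classify with a 0.5 probability cutoff
print(p, label)
```

Whatever the value of the linear score, the sigmoid squeezes it into the open interval (0, 1), which is what lets the output be read as a class probability.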
Cluster Analysis

[Scatter plot: records grouped into clusters on axes X1 and X2]
• unsupervised learning task
• segment the data into a set of homogeneous clusters of records for the purpose of
generating insights, discovering patterns, data reduction, etc
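A common way to segment records into homogeneous clusters is k-means. A minimal sketch on made-up 2-D points (features X1 and X2), alternating the assignment and update steps:

```python
# Minimal k-means sketch; the points and starting centroids are made up.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)  # the three low points and the three high points separate
```

Because there are no labels, the algorithm discovers the two groups purely from the distances between records, which is what makes it an unsupervised task.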
Timeseries Forecasting

• predicting future trends and values in time-series data
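One of the simplest forecasting methods is a moving average: predict the next value as the mean of the last k observations. A sketch on a made-up monthly series:

```python
# Naive moving-average forecast; the sales series is invented for illustration.

def moving_average_forecast(series, k=3):
    """Forecast the next point as the mean of the last k values."""
    window = series[-k:]
    return sum(window) / len(window)

sales = [100, 104, 108, 112, 116, 120]   # upward-trending series
print(moving_average_forecast(sales))    # (112 + 116 + 120) / 3 = 116.0
```

Note that a plain moving average lags behind a trend (here it forecasts 116 even though the series is clearly heading past 120); trend-aware methods address this.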


Typical Steps in Data Mining

1. Define/understand the business requirements


2. Obtain data
3. Explore, clean, pre-process data
4. Reduce the data dimension
5. Determine DM task
6. Partition the data (for supervised tasks)
7. Choose the method
8. Iterative implementation and tuning
9. Assess results
10. Deploy best model
Data Partitioning and Overfitting

How well will our prediction or classification model perform when we apply it to new data?

• The chosen model should be able to generalise beyond the dataset that we have at hand.
• To assure generalisation, we use the concept of data partitioning and try to avoid overfitting.
Overfitting

Overfitting in data mining (and in machine learning more broadly) occurs when a model
learns the training data too well, capturing not just the underlying patterns but also the
noise and random fluctuations present in the training set.

As a result, an overfitted model may perform exceptionally well on the training data but fail to generalise to new, unseen data.
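The extreme case makes the idea concrete: a "model" that simply memorises the training set has zero training error yet cannot answer for unseen inputs, while a simpler model that captures the underlying pattern generalises. The data and both toy models below are made up for illustration:

```python
# Toy contrast between a memorising model and a generalising one.

train = {1: 10, 2: 22, 3: 29, 4: 41}   # x -> y, roughly y ≈ 10x plus noise

def memoriser(x):
    # Perfect on the training set, undefined off it (returns None)
    return train.get(x)

def simple_model(x):
    # Captures the underlying pattern (y ≈ 10x) instead of the noise
    return 10 * x

train_error = sum(abs(memoriser(x) - y) for x, y in train.items())
print(train_error)      # 0: the memoriser fits the training data exactly
print(memoriser(5))     # None: it fails on a new, unseen input
print(simple_model(5))  # 50: the simpler model generalises
```

Real overfitting is less stark than a lookup table, but the symptom is the same: training performance that flatters the model while validation performance tells the truth.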
Data Partitioning

Training Data
• Typically, the largest partition
• contains the data used to build the models we are
examining

Validation Data
• used to assess the predictive performance of each
model so that you can compare models and choose
the best one
• In some algorithms, the validation partition may be
used to tune and improve the model

Test Data
• sometimes called the evaluation partition; used to assess the performance of the chosen model on new data

Common partition percentages:

• Train: 70% - 80%
• Validation: 10% - 15%
• Test: 10% - 15%

→ a balance between having enough data to train a robust model and having sufficient validation and test data to reliably evaluate its performance
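A 70/15/15 random partition can be sketched as follows; the record indices and the fixed seed are made up for illustration:

```python
import random

# Random 70/15/15 train/validation/test partition of a record list.

def partition(records, train_frac=0.70, val_frac=0.15, seed=42):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder becomes the test set
    return train, validation, test

records = list(range(100))
train, validation, test = partition(records)
print(len(train), len(validation), len(test))  # 70 15 15
```

Shuffling before slicing ensures each partition is a random sample; the three slices are disjoint and together cover every record exactly once.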

Selling Price   Square Footage   Age   Condition   Partition
9500            1926             30    Good        Training (80%)
11900           2069             40    Excellent   Training
124800          1720             30    Good        Training
135000          1396             15    Mint        Training
142000          1706             32    Excellent   Validation (10%)
145000          1847             38    Mint        Validation
169000          1950             27    Mint        Validation
182000          2323             30    Good        Validation
200000          2285             25    Good        Testing (10%)
210000          3752             17    Good        Testing
…               …                …     …           …
Timeseries Forecasting

• in time series, a random partition would create two time series with “holes”
• standard forecasting methods cannot handle time series with missing values
• Solution is partition into two periods:
• the earlier period is set as the training data
• the later period is set as the validation data
• Methods are trained on the earlier training period, and their predictive
performance assessed on the later validation period
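The chronological split described above can be sketched as a single cut point, with the earlier period for training and the later period for validation; the series values below are made up:

```python
# Chronological train/validation split for a time series: no random
# shuffling, so neither partition has "holes" in its time sequence.

def time_split(series, train_frac=0.8):
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]   # earlier period, later period

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
train, validation = time_split(series)
print(train)       # first 8 observations (earlier period)
print(validation)  # last 2 observations (later period)
```

Because the order of observations is preserved, standard forecasting methods can be trained on the earlier period and evaluated on the later one without missing values.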