
Subject Review

• Data preparation and exploration

• Data analysis using charts

• Introduction to Decision Trees

• Introduction to Cluster Analysis

• Introduction to Linear Regression

• Introduction to Logistic Regression

• Introduction to Forecasting

Linear Regression

• supervised machine learning algorithm used for prediction tasks
• models the relationship between a numerical response and one or more explanatory variables, e.g.:

Ŷ = 2 + 1.25 X
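The fitted line above can be recovered by ordinary least squares. A minimal sketch, using made-up data points generated exactly from Ŷ = 2 + 1.25 X so that OLS returns those coefficients:

```python
# Simple linear regression by ordinary least squares (illustration only;
# the data points below are made up to match the slide's equation).

def fit_simple_ols(xs, ys):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Points lying exactly on Y = 2 + 1.25 X, so the fit recovers b0=2, b1=1.25
xs = [0, 2, 4, 6, 8]
ys = [2 + 1.25 * x for x in xs]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)  # 2.0 1.25
```

With real data the points would not sit exactly on the line, and b0, b1 would be the best-fitting estimates rather than exact values.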
Decision Tree

[Tree diagram: the Root Node splits into Decision Nodes along branches (sub-trees); splitting continues until Terminal Nodes (leaves) are reached.]

• supervised machine learning algorithm used for both classification and prediction tasks
• graphical representation of a decision-making process
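A decision tree's decision-making process reduces to nested if/else splits. A hand-built toy sketch (thresholds and labels are made up, not learned from data) showing a root node, a decision node, and terminal nodes:

```python
# Toy decision tree for a house-price class; the splits are invented
# for illustration, not produced by a tree-learning algorithm.

def classify(square_footage, age):
    # Root node: first split on square footage
    if square_footage >= 2000:
        # Decision node: further split on age
        if age <= 20:
            return "High price"    # terminal node (leaf)
        return "Medium price"      # terminal node (leaf)
    return "Low price"             # terminal node (leaf)

print(classify(2323, 17))  # High price
print(classify(1720, 30))  # Low price
```

A learned tree would choose each split (variable and threshold) to best separate the training records, but the traversal logic is exactly this cascade of tests.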
Logistic Regression

• supervised machine learning algorithm used for binary classification
• predicts the probability that a given observation belongs to a particular class, outputting probabilities between 0 and 1
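The probability output comes from passing a linear score through the sigmoid (logistic) function. A minimal sketch with made-up coefficients:

```python
import math

# Logistic regression prediction step; the coefficients b0, b1 are
# invented for illustration (in practice they are estimated from data).

def predict_proba(x, b0=-4.0, b1=2.0):
    score = b0 + b1 * x                     # linear predictor
    return 1.0 / (1.0 + math.exp(-score))   # sigmoid maps score to (0, 1)

p = predict_proba(3.0)
label = 1 if p >= 0.5 else 0   # classify with a 0.5 probability cutoff
print(p, label)
```

Whatever the value of the linear score, the sigmoid squeezes it into the open interval (0, 1), which is what lets the output be read as a class probability.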
Cluster Analysis

[Scatter plot: records grouped into clusters on axes X1 and X2]
• unsupervised learning task
• segment the data into a set of homogeneous clusters of records for the purpose of
generating insights, discovering patterns, data reduction, etc
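A common way to segment records into homogeneous clusters is k-means. A minimal sketch on made-up 2-D points (features X1 and X2), alternating the assignment and update steps:

```python
# Minimal k-means sketch; the points and starting centroids are made up.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)  # the three low points and the three high points separate
```

Because there are no labels, the algorithm discovers the two groups purely from the distances between records, which is what makes it an unsupervised task.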
Timeseries Forecasting

• predicting future trends and values in time-series data
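One of the simplest forecasting methods is a moving average: predict the next value as the mean of the last k observations. A sketch on a made-up monthly series:

```python
# Naive moving-average forecast; the sales series is invented for illustration.

def moving_average_forecast(series, k=3):
    """Forecast the next point as the mean of the last k values."""
    window = series[-k:]
    return sum(window) / len(window)

sales = [100, 104, 108, 112, 116, 120]   # upward-trending series
print(moving_average_forecast(sales))    # (112 + 116 + 120) / 3 = 116.0
```

Note that a plain moving average lags behind a trend (here it forecasts 116 even though the series is clearly heading past 120); trend-aware methods address this.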


Typical Steps in Data Mining

1. Define/understand the business requirements


2. Obtain data
3. Explore, clean, pre-process data
4. Reduce the data dimension
5. Determine DM task
6. Partition the data (for supervised tasks)
7. Choose the method
8. Iterative implementation and tuning
9. Assess results
10. Deploy best model
Data Partitioning and Overfitting

How well will our prediction or classification model perform when we apply it to new data?

• The chosen model should be able to generalise beyond the dataset that we have at hand.
• To assure generalisation, we use the concept of data partitioning and try to avoid overfitting.
Overfitting

Overfitting in data mining (and in machine learning more broadly) occurs when a model
learns the training data too well, capturing not just the underlying patterns but also the
noise and random fluctuations present in the training set.

As a result, an overfitted model may perform exceptionally well on the training data but fail to generalise to new, unseen data.
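The extreme case makes the idea concrete: a "model" that simply memorises the training set has zero training error yet cannot answer for unseen inputs, while a simpler model that captures the underlying pattern generalises. The data and both toy models below are made up for illustration:

```python
# Toy contrast between a memorising model and a generalising one.

train = {1: 10, 2: 22, 3: 29, 4: 41}   # x -> y, roughly y ≈ 10x plus noise

def memoriser(x):
    # Perfect on the training set, undefined off it (returns None)
    return train.get(x)

def simple_model(x):
    # Captures the underlying pattern (y ≈ 10x) instead of the noise
    return 10 * x

train_error = sum(abs(memoriser(x) - y) for x, y in train.items())
print(train_error)      # 0: the memoriser fits the training data exactly
print(memoriser(5))     # None: it fails on a new, unseen input
print(simple_model(5))  # 50: the simpler model generalises
```

Real overfitting is less stark than a lookup table, but the symptom is the same: training performance that flatters the model while validation performance tells the truth.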
Data Partitioning

Training Data
• Typically, the largest partition
• contains the data used to build the models we are
examining

Validation Data
• used to assess the predictive performance of each
model so that you can compare models and choose
the best one
• In some algorithms, the validation partition may be
used to tune and improve the model

Test Data
• sometimes called the evaluation partition; used to assess the performance of the chosen model on new data

Common partition percentages:

• Train: 70% - 80%
• Validation: 10% - 15%
• Test: 10% - 15%

→ a balance between having enough data to train a robust model and having sufficient validation and test data to reliably evaluate its performance
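A 70/15/15 random partition can be sketched as follows; the record indices and the fixed seed are made up for illustration:

```python
import random

# Random 70/15/15 train/validation/test partition of a record list.

def partition(records, train_frac=0.70, val_frac=0.15, seed=42):
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder becomes the test set
    return train, validation, test

records = list(range(100))
train, validation, test = partition(records)
print(len(train), len(validation), len(test))  # 70 15 15
```

Shuffling before slicing ensures each partition is a random sample; the three slices are disjoint and together cover every record exactly once.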

Selling Price   Square Footage   Age   Condition   Partition
9500            1926             30    Good        Training (80%)
11900           2069             40    Excellent   Training
124800          1720             30    Good        Training
135000          1396             15    Mint        Training
142000          1706             32    Excellent   Validation (10%)
145000          1847             38    Mint        Validation
169000          1950             27    Mint        Validation
182000          2323             30    Good        Validation
200000          2285             25    Good        Testing (10%)
210000          3752             17    Good        Testing
…               …                …     …           …
Timeseries Forecasting

• in time series, a random partition would create two time series with “holes”
• standard forecasting methods cannot handle time series with missing values
• Solution is partition into two periods:
• the earlier period is set as the training data
• the later period is set as the validation data
• Methods are trained on the earlier training period, and their predictive
performance assessed on the later validation period
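The chronological split described above can be sketched as a single cut point, with the earlier period for training and the later period for validation; the series values below are made up:

```python
# Chronological train/validation split for a time series: no random
# shuffling, so neither partition has "holes" in its time sequence.

def time_split(series, train_frac=0.8):
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]   # earlier period, later period

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
train, validation = time_split(series)
print(train)       # first 8 observations (earlier period)
print(validation)  # last 2 observations (later period)
```

Because the order of observations is preserved, standard forecasting methods can be trained on the earlier period and evaluated on the later one without missing values.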