Lect2 - Data Preprocessing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• An attribute is also known as a field or feature
– Examples: eye color or age of a person
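As a minimal sketch of the definitions above, a data set can be modeled as a list of objects whose fields are the attributes. The `Person` class, its field names, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    # Each field is one attribute (also called a field or feature).
    name: str
    eye_color: str
    age: int


# A data set is a collection of data objects (hypothetical values).
people = [
    Person("Abebe", "brown", 34),
    Person("Sara", "green", 27),
]

# Attribute names shared by every object in the collection.
attribute_names = [f.name for f in fields(Person)]

# The values of a single attribute across all objects.
ages = [p.age for p in people]
```

Here each `Person` instance is one data object, and `ages` collects one attribute's values over the whole collection.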
5/26/2020
It is not the points, but what is in between the points that matters... Edward German
HOW?
• Identifying/extracting a good feature set is the most challenging part of data mining.
Feature vector: 10.2, 0.23, 0.34, 0.34, 20, …
• Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
• Sampling with replacement
– Objects are not removed from the population as they are selected for the sample.
– In sampling with replacement, the same object can be picked more than once
• Stratified sampling
– Split the data into several partitions; then draw random samples from each partition
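The two sampling schemes described above can be sketched with Python's standard `random` module. The function names, the `fraction` parameter, and the toy population are assumptions made for illustration. Drawing without replacement inside each partition is also an assumption, since the slide only says to draw random samples from each partition.

```python
import random
from collections import defaultdict


def sample_with_replacement(population, k, rng):
    """Draw k objects; the population is never reduced, so repeats can occur."""
    return [rng.choice(population) for _ in range(k)]


def stratified_sample(population, key, fraction, rng):
    """Split the data into partitions by key(obj), then sample each partition."""
    partitions = defaultdict(list)
    for obj in population:
        partitions[key(obj)].append(obj)

    sample = []
    for members in partitions.values():
        # Take at least one object per partition (an illustrative choice).
        n = max(1, round(fraction * len(members)))
        # Assumption: sample without replacement inside each partition.
        sample.extend(rng.sample(members, n))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of (id, class_label) pairs: 90 of class "a", 10 of class "b".
population = [(i, "a") for i in range(90)] + [(i, "b") for i in range(90, 100)]

with_repl = sample_with_replacement(population, 20, rng)
strat = stratified_sample(population, key=lambda obj: obj[1], fraction=0.2, rng=rng)
```

With `fraction=0.2`, the stratified sample keeps the 90/10 class ratio of the population (18 objects of class "a" and 2 of class "b"), which is one way a sample can stay representative with respect to the class label.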
DM Assignment-I
• Compare and contrast DM and RDBMS
– Describe the basic differences and similarities;
– Describe the pros and cons (merits and demerits).
• A summarized report of about two pages (font: Times New Roman 12, 1.5 line spacing) should be submitted by May 28, 2020.
• Send your assignment to aastukk@gmail.com before the due date.