Data Preprocessing

The document provides an overview of data preprocessing techniques used in data mining. It discusses why preprocessing is important due to issues with real-world data being incomplete, noisy, and inconsistent. The key techniques covered are data cleaning, which involves filling in missing values, smoothing noise, and resolving inconsistencies; data integration, which combines data from multiple sources; data transformation such as normalization and attribute construction; and data reduction methods like aggregation and dimension reduction to reduce the overall data volume.



3. Data Preprocessing

Prodi Informatika 2021

Anna Baita, M. Kom.

Fakultas Ilmu Komputer


Outline

SCPMK 1683903: Students can apply pre-processing techniques [CPMK39].

• What & why preprocess the data?
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Data Preprocessing

Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format.
Why Preprocess the Data?

Data in the real world is:

✓ incomplete: lacking values or certain attributes of interest
✓ noisy: containing errors or outliers
✓ inconsistent: lacking compatibility or similarity between two or more facts

No quality data, no quality mining results!

✓ Quality decisions must be based on quality data
✓ A data warehouse needs consistent integration of quality data
Measures of Data Quality

❑ Accuracy
❑ Completeness
❑ Consistency
❑ Timeliness
❑ Believability
❑ Value Added
❑ Interpretability
❑ Accessibility
Data Preprocessing Techniques

1. Data Cleaning
2. Data Integration
3. Data Transformation
4. Data Reduction
Data Cleaning

Data cleaning attempts to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies
in real-world data.
Fill in Missing Values
Data Cleaning - Missing Values

1. Ignore the tuple

Data Cleaning - Missing Values

2. Fill in the missing value manually (feasible only when few values are missing)

3. Use a global constant,
e.g. “-” or “unknown”
Data Cleaning - Missing Values

4. Use the attribute mean or median
[Example table: each missing value is replaced by its attribute's mean:
Mean X2 = 66.1, Mean X4 = 0.22, Mean Y = 69.44]
Data Cleaning - Missing Values

5. Use the most probable value:
predict it using KNN, regression, a decision tree, etc.
(see the sketch below)
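A minimal sketch of strategies 4 and 5 in Python, assuming numeric data in a pandas DataFrame; the column names and values below are illustrative, not the slide's table:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    "X1": [1.0, 2.0, 3.0, 4.0],
    "X2": [60.0, np.nan, 70.0, 68.0],
    "Y":  [65.0, 72.0, np.nan, 71.0],
})

# Strategy 4: replace each missing value with its attribute (column) mean.
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# Strategy 5: estimate the most probable value from the k nearest rows.
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(mean_filled)
print(knn_filled)
```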
Smooth Out Noise
Data Cleaning - Noisy Data

Noise: small random errors in the data.

Causes:
1. Faulty data-collection instruments
2. Data entry problems
3. Data transmission problems
4. Technology limitations
5. Inconsistent naming, e.g. “yogya” vs “jogja”

Noise is handled by smoothing (taking the
neighbouring values into account).
Data Cleaning - Noisy Data

✓ Binning
✓ Clustering
✓ Combined computer and human inspection:
suspicious values are detected automatically, then checked by a human
✓ Regression
Data Cleaning - Noisy Data

Binning
Binning is the process of grouping data into smaller
parts, called bins, based on certain criteria.

Steps:
1. Sort the data
2. Partition the data into bins
3. Choose a smoothing technique:
- by means
- by boundaries
Data Cleaning - Noisy Data

1. Sort the data:
70, 100, 150, 200, 250, 270, 300, 380, 400

2. Partition into, say, 3 bins:

Bin 1: 70, 100, 150
Bin 2: 200, 250, 270
Bin 3: 300, 380, 400
Data Cleaning - Noisy Data

Smoothing by means:
Bin 1: 70, 100, 150 → 107, 107, 107
Bin 2: 200, 250, 270 → 240, 240, 240
Bin 3: 300, 380, 400 → 360, 360, 360

In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
Data Cleaning - Noisy Data

Smoothing by boundaries:
Bin 1: 70, 100, 150 → 70, 70, 150
Bin 2: 200, 250, 270 → 200, 270, 270
Bin 3: 300, 380, 400 → 300, 400, 400

In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the
bin boundaries. Each bin value is then replaced by
the closest boundary value.
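The two smoothing techniques can be sketched in a few lines of Python, using the slide's numbers (the mean of bin 1 is rounded, 320/3 ≈ 107):

```python
data = sorted([70, 100, 150, 200, 250, 270, 300, 380, 400])
n_bins = 3
size = len(data) // n_bins
bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins
]

print(by_means)       # [[107, 107, 107], [240, 240, 240], [360, 360, 360]]
print(by_boundaries)  # [[70, 70, 150], [200, 270, 270], [300, 400, 400]]
```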
Data Cleaning - Noisy Data

Clustering
Clustering groups similar values together; points that fall
outside every cluster can be treated as outliers (see the
sketch below).

Deviating points are data that stray from the rest of the
data; in statistics such points are called “outliers”.

Outliers may be discarded or ignored; they are usually
few, only around 2% of the data.
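A minimal sketch of clustering-based outlier detection using DBSCAN from scikit-learn; the slides name no specific clustering method, and the data and eps/min_samples settings here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

values = np.array([70, 100, 150, 200, 250, 270, 300, 380, 400, 5000])
X = values.reshape(-1, 1)

# Points that DBSCAN cannot assign to any cluster get the label -1.
labels = DBSCAN(eps=120, min_samples=2).fit(X).labels_
outliers = values[labels == -1]
print(outliers)  # [5000]: the value far from every cluster
```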
Data Cleaning - Noisy Data

Regression
Data can also be smoothed by fitting it to a function such
as a regression line and replacing the noisy values with
the fitted ones (see the sketch below).
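A minimal sketch of smoothing by regression, assuming a simple linear fit over illustrative data:

```python
import numpy as np

x = np.arange(9)
y = np.array([70, 100, 150, 200, 250, 270, 300, 380, 400], dtype=float)

# Fit a least-squares line y = a*x + b, then replace y with fitted values.
a, b = np.polyfit(x, y, deg=1)
smoothed = a * x + b
print(np.round(smoothed))
```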
Correct Inconsistencies
Data Cleaning - Inconsistent Data

• Manually, using external references
• Knowledge engineering tools
Data Integration

Data integration means combining data from
multiple sources into a coherent data store
(a data warehouse).
Data Integration - Issues

• Entity identification problem
• Redundancy
• Tuple duplication
• Detecting data value conflicts
Handling Redundant Data in Data Integration

• Redundant data often occur when multiple databases are integrated:
- the same attribute may have different names in different databases
- one attribute may be a "derived" attribute in another table

• Redundant attributes may be detected by correlation analysis (see the sketch below)

• Careful integration of multiple sources helps reduce/avoid redundancies
and inconsistencies and improves mining speed and quality
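A minimal sketch of correlation analysis for redundancy detection; the columns and values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [160, 170, 180, 175, 165],
    "height_in": [63.0, 66.9, 70.9, 68.9, 65.0],  # same fact in another unit
    "weight_kg": [55, 70, 82, 74, 60],
})

# A Pearson correlation close to ±1 suggests one attribute is redundant.
print(df.corr(method="pearson").round(2))
```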
Data Integration

[Figure: two example tables, Data Source 1 and Data Source 2. What are the differences, and can the data be combined into one database?]
Data Transformation

Transforming or consolidating data into a form suitable
for mining is known as data transformation:

• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute construction
Data Transformation

Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
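Normalization is also listed on the previous slide; a minimal sketch of two standard variants, min-max and z-score (the values are illustrative):

```python
import numpy as np

values = np.array([70, 100, 150, 200, 250, 270, 300, 380, 400], dtype=float)

# Min-max normalization: rescale the values into [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean, unit standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max.round(2))
print(z_score.round(2))
```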
Data Reduction

Data reduction techniques are applied to obtain a
reduced representation of the dataset that is much
smaller in volume, yet closely maintains the integrity
of the original data.
Data Reduction - Strategies

• Data cube aggregation
• Dimension reduction (e.g. PCA, as sketched below)
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
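A minimal sketch of dimension reduction with PCA from scikit-learn; the random 4-attribute data and the choice of 2 components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 tuples, 4 attributes

pca = PCA(n_components=2)              # keep the 2 strongest components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2): same tuples, fewer attributes
print(pca.explained_variance_ratio_)   # variance retained per component
```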
Text Preprocessing?

Image Preprocessing?
Any Questions?
