
BIG DATA & ANALYTICS (ELECTIVE)

Unit - II
Data Pre-processing in Data Mining

Introduction to Data Pre-processing


Data pre-processing is a crucial step in the data mining process that involves transforming
raw data into a format suitable for analysis. In real-world scenarios, data often comes with
inconsistencies, missing values, and noise that can significantly impact the quality of mining
results.
Why Pre-process Data?
Data pre-processing is essential for several reasons:
1. Real-world data is often incomplete, containing missing values or lacking certain
attributes
2. Data may contain noise and inconsistencies
3. Different data sources might use different formats and representations
4. Raw data might be too large or complex for effective analysis
The goal is to produce high-quality data that leads to high-quality mining results.
Data Cleaning
Data cleaning addresses issues with data quality, including missing values, noise, and
inconsistencies. This process is vital for ensuring accurate analysis results.
Handling Missing Values
Missing values are a common issue in real-world datasets that occur when data is not
recorded, lost during collection, or intentionally not provided. Let's examine each handling
method with detailed examples:
1. Ignore the Record (Tuple Deletion)
This method involves removing entire records that contain missing values.
Example: given a customer purchase table in which some records contain missing values, every record with at least one missing entry is removed, and only the complete records are kept for analysis.
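A minimal pandas sketch of this step, using a small hypothetical purchase table (the values are illustrative only):

import pandas as pd
import numpy as np

# Hypothetical customer purchase data with missing entries.
purchases = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product":    ["Laptop", np.nan, "Headphones", "Monitor"],
    "Amount":     [1200.0, 450.0, np.nan, 300.0],
})

# Tuple deletion: drop every record that contains at least one missing value.
complete_cases = purchases.dropna()
print(complete_cases)   # only CustomerID 1 and 4 remain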

When to Use:
 Dataset is large enough that removing records won't significantly impact analysis
 Missing values appear randomly (Missing Completely at Random - MCAR)
 Complete case analysis is required
Advantages:
 Simple to implement
 Maintains data consistency
 No need for value estimation
Disadvantages:
 Loss of potentially important data
 Can introduce bias if missing values aren't random
 Reduces dataset size
2. Manual Fill
This involves domain experts examining each missing value and filling it based on context
and expertise.
Example (a customer purchase table in which the product name for CustomerID 2 and the category for CustomerID 3 are missing):

Expert Analysis:
 For CustomerID 2: Based on price and category, expert determines it's a "Tablet"
 For CustomerID 3: Based on product and price, expert assigns "Mobile Devices"
When to Use:
 Small datasets
 Critical data where accuracy is paramount
 Domain expertise is available
 Historical records can be referenced
Advantages:
 Highest accuracy
 Incorporates domain knowledge
 Maintains data relationships
Disadvantages:
 Time-consuming
 Not scalable
 Requires expert availability
3. Statistical Methods
A. Mean/Median Imputation
Example for Numerical Values:
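A short pandas sketch of mean and median imputation on a hypothetical income column:

import pandas as pd
import numpy as np

income = pd.Series([35000, 42000, np.nan, 58000, np.nan, 61000])

mean_filled   = income.fillna(income.mean())     # observed mean   = 49000
median_filled = income.fillna(income.median())   # observed median = 50000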

B. Mode Imputation for Categorical Values


Example:
Category: [Electronics, Clothing, NULL, Electronics, NULL, Clothing, Electronics]
Mode = Electronics
After Imputation: [Electronics, Clothing, Electronics, Electronics, Electronics, Clothing,
Electronics]
C. Prediction-Based Imputation
Using regression for missing income values:
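A minimal scikit-learn sketch of this idea, assuming (hypothetically) that age and years of experience are available to predict the missing income values:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Age":        [25, 32, 40, 29, 51],
    "Experience": [2, 8, 15, 5, 25],
    "Income":     [30000, 48000, np.nan, 40000, np.nan],
})

known   = df[df["Income"].notna()]
missing = df[df["Income"].isna()]

# Fit a regression model on the complete records ...
model = LinearRegression().fit(known[["Age", "Experience"]], known["Income"])

# ... and use its predictions to fill in the missing incomes.
df.loc[df["Income"].isna(), "Income"] = model.predict(missing[["Age", "Experience"]])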
4. Advanced Methods
A. Multiple Imputation
Creates multiple versions of the dataset with different imputed values to account for
uncertainty.
Example:
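One way to sketch this with scikit-learn is to run its IterativeImputer several times with posterior sampling enabled; each run yields one plausible completed dataset (the values below are hypothetical):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: Age, Income (two income values are missing).
X = np.array([
    [25, 30000],
    [32, np.nan],
    [40, 52000],
    [29, np.nan],
    [51, 75000],
])

# Each random seed produces one imputed version of the dataset;
# analysing all versions together reflects the uncertainty in the missing values.
imputed_versions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]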

B. K-Nearest Neighbors (KNN) Imputation


Finds similar records based on other attributes to estimate missing values.
Example:

For Record 2: KNN would use similar records (like Record 1 and 3) to estimate income.
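A minimal scikit-learn sketch, with hypothetical values chosen so that Records 1 and 3 are the nearest neighbours of Record 2:

import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Experience, Income (Record 2 has a missing income).
X = np.array([
    [25,  2, 30000],
    [27,  3, np.nan],
    [26,  2, 32000],
    [45, 20, 80000],
])

# The missing income is estimated from the two most similar records.
filled = KNNImputer(n_neighbors=2).fit_transform(X)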
Implementation Considerations:
1. Data Type: Choose method based on whether data is numerical or categorical
2. Missing Pattern: Identify if values are Missing Completely at Random (MCAR),
Missing at Random (MAR), or Missing Not at Random (MNAR)
3. Dataset Size: Consider computational resources for large datasets
4. Required Accuracy: Balance between accuracy needs and implementation complexity
Real-world Application Example (a healthcare dataset with missing Glucose, Age, and Blood Pressure entries):

Combined Approach:
1. Use mean imputation for Glucose in Record 1 (less critical)
2. Use manual fill for Age in Record 2 (from medical history)
3. Use regression for Blood Pressure in Record 3 (based on other vitals)
Handling Noisy Data
Noise refers to random errors or variations in measured values. Methods to handle noisy data
include:
1. Binning Methods:
 Sort data and partition into equal-sized bins
 Smooth by bin means, median, or boundaries
2. Regression:
 Fit data to a regression function
 Can be linear or multiple regression
3. Clustering:
 Detect and remove outliers
 Group similar data points together
Real-world Example: In sensor data collection, temperature readings might contain noise due to equipment malfunction. Binning can help smooth these variations, as sketched below.
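A small numpy sketch of equal-frequency binning with smoothing by bin means, applied to hypothetical temperature readings:

import numpy as np

# Hypothetical sensor readings; 35.0 is a noisy outlier.
readings = np.array([21.1, 35.0, 21.4, 21.3, 22.0, 21.8, 22.3, 21.9, 22.1])

# Sort the readings and partition them into three equal-sized bins.
bins = np.sort(readings).reshape(3, 3)

# Smoothing by bin means: every value in a bin is replaced by its bin mean,
# which dampens the effect of the outlying reading.
smoothed = np.repeat(bins.mean(axis=1), 3)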
Data Integration
Data integration is the process of combining data from multiple sources into a coherent and
unified view. In real-world scenarios, organizations often need to merge data from various
systems, databases, and files to perform comprehensive analysis.
Key challenges include:
1. Schema Integration:
• Matching entities from different sources
• Resolving naming conflicts
• Handling different attribute representations
2. Redundancy Detection:
• Identifying duplicate records
• Managing derived attributes
• Resolving data value conflicts
Advantages:
 Creates a unified view of data
 Improves data consistency
 Enables comprehensive analysis
Disadvantages:
 Complex process
 Time and resource-intensive
 May introduce data quality issues

1. Schema Integration
Schema integration involves reconciling different database schemas to create a unified view.
A real-world example from a retail company:
Database 1 (Store Sales) keeps customers in Customer_Table and transactions in Sales_Table; Database 2 (Online Sales) keeps the same information in Buyers and Orders, using different column names.
Integration Process:
1. Identify corresponding entities:
• Customer_Table ↔ Buyers
• Sales_Table ↔ Orders
2. Resolve naming conflicts:
• customer_id/buyer_number → customer_identifier
• cust_name/buyer_name → customer_name
• amount/total_amount → transaction_amount
3. Create unified schema:
Customers: customer_identifier | customer_name | address | contact
Transactions: transaction_id | customer_identifier | transaction_amount | transaction_date
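A minimal pandas sketch of this mapping; the two source tables below are hypothetical stand-ins for the store and online customer tables:

import pandas as pd

# Hypothetical extracts from the two source systems.
store_customers = pd.DataFrame({"customer_id": [101], "cust_name": ["Alice"]})
online_buyers   = pd.DataFrame({"buyer_number": [501], "buyer_name": ["Bob"]})

# Resolve naming conflicts by mapping both schemas onto the unified one.
store_customers = store_customers.rename(
    columns={"customer_id": "customer_identifier", "cust_name": "customer_name"})
online_buyers = online_buyers.rename(
    columns={"buyer_number": "customer_identifier", "buyer_name": "customer_name"})

# Combine both sources into a single Customers table.
customers = pd.concat([store_customers, online_buyers], ignore_index=True)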

2. Redundancy Detection
Example from a healthcare system integration:
The hospital system and the lab system each maintain their own patient records, so the same patient (for example, John Doe) can appear in both with differently formatted information.
Redundancy Detection Process:


1. Identify matching records using:
• Name similarity (John Doe)
• Date of Birth match
• Compatible blood type information
2. Resolve conflicts:
• Standardize blood type format (O+ vs O Positive)
• Create unified patient identifier
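A small pandas sketch of these two steps, using hypothetical hospital and lab records for the same patient:

import pandas as pd

hospital = pd.DataFrame({"name": ["John Doe"], "dob": ["1980-05-12"], "blood_type": ["O+"]})
lab      = pd.DataFrame({"name": ["John Doe"], "dob": ["1980-05-12"], "blood_type": ["O Positive"]})

patients = pd.concat([hospital, lab], ignore_index=True)

# Standardize the blood type format so equivalent values match.
patients["blood_type"] = patients["blood_type"].replace({"O Positive": "O+"})

# Records that agree on name and date of birth are treated as duplicates.
patients = patients.drop_duplicates(subset=["name", "dob"])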
Data Transformation
Data transformation converts data into forms suitable for mining. Methods include:
1. Smoothing:
 Removes noise from data
 Uses techniques like regression and binning
2. Aggregation:
 Summarizes data
 Example: Daily sales to monthly totals
3. Normalization:
 Scales data to specific ranges
 Common methods: Min-max, z-score, decimal scaling
4. Feature Construction:
 Creates new attributes from existing ones
 Helps in improving mining accuracy

1. Smoothing
Smoothing helps remove noise from data. Consider sensor data from a manufacturing plant:
Example: raw temperature readings (°C) are smoothed with a moving average of window size 3, as sketched below.
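A short pandas sketch with hypothetical readings; the spike at 95.0 is damped by the moving average:

import pandas as pd

# Hypothetical raw temperature readings (°C) from a plant sensor.
raw = pd.Series([72.1, 72.4, 95.0, 72.3, 72.6, 72.2])

# Centred moving average with a window of three readings.
smoothed = raw.rolling(window=3, center=True).mean()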

2. Aggregation
Real-world example from retail sales data:
Example: daily sales figures are aggregated (summed) into monthly totals, as sketched below.
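A short pandas sketch with hypothetical daily figures:

import pandas as pd

# Hypothetical daily sales figures.
daily = pd.DataFrame(
    {"sales": [120, 150, 90, 200]},
    index=pd.to_datetime(["2024-01-02", "2024-01-15", "2024-02-03", "2024-02-20"]),
)

# Aggregate (sum) the daily values into monthly totals.
monthly = daily.groupby(daily.index.to_period("M")).sum()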

3. Normalization
Example using customer transaction data:
Example: the original transaction amounts are rescaled with min-max normalization (to a 0-1 range) and with z-score normalization, as sketched below.
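A short pandas sketch of both methods on hypothetical transaction amounts:

import pandas as pd

amount = pd.Series([200.0, 500.0, 800.0, 1500.0])

# Min-max normalization to the [0, 1] range: (v - min) / (max - min)
min_max = (amount - amount.min()) / (amount.max() - amount.min())

# Z-score normalization: (v - mean) / standard deviation
z_score = (amount - amount.mean()) / amount.std()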
Data Reduction
Data reduction techniques reduce data volume while maintaining integrity of results.
Data Cube Aggregation
 Aggregates data by using data cube operations
 Example: Converting daily sales data to monthly summaries
 Reduces data volume while preserving trends
Dimensionality Reduction
Reduces number of random variables under consideration:
1. Principal Component Analysis (PCA):
 Transforms data into lower dimensional space
 Preserves maximum variance
2. Feature Selection:
 Identifies most relevant attributes
 Removes redundant features
Advantages:
 Reduces computational complexity
 Improves model performance
 Reduces storage requirements
Disadvantages:
 Potential loss of information
 May reduce model interpretability
 Requires careful selection of reduction method
Data Compression
Two main approaches:
1. Lossless Compression:
 Original data can be reconstructed exactly
 Examples: Run-length encoding, Huffman coding
2. Lossy Compression:
 Reconstructed data approximates original
 Examples: Wavelets, PCA
A. Principal Component Analysis (PCA)
Example using customer behavior data:
The original features are Monthly_Spend, Store_Visits, Website_Visits, Email_Opens, and Click_Rate. After PCA, they are reduced to two components:

PC1: Shopping_Engagement (combines Monthly_Spend, Store_Visits)
PC2: Digital_Engagement (combines Website_Visits, Email_Opens, Click_Rate)
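A minimal scikit-learn sketch of this reduction; the behaviour matrix below is hypothetical:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical rows with the five original features:
# Monthly_Spend, Store_Visits, Website_Visits, Email_Opens, Click_Rate.
X = np.array([
    [250, 4, 12,  6, 0.10],
    [900, 9, 30, 15, 0.25],
    [400, 5, 18,  8, 0.12],
    [700, 8, 25, 12, 0.20],
])

# Standardize the features, then project onto two principal components.
X_scaled  = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # shape: (4, 2)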
B. Feature Selection
Example from credit risk assessment:
Original Features:
- Income
- Age
- Credit_Score
- Employment_Years
- Debt_Ratio
- Number_of_Cards
- Payment_History
- Recent_Inquiries
Selected Features (based on correlation with default risk):
- Credit_Score
- Debt_Ratio
- Payment_History
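A minimal sketch of correlation-based selection; the data here is randomly generated purely to show the mechanics, so the chosen columns will differ from the list above:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical applicant data; "default" is the target label (1 = defaulted).
df = pd.DataFrame({
    "Credit_Score":     rng.integers(300, 850, 100),
    "Debt_Ratio":       rng.random(100),
    "Employment_Years": rng.integers(0, 30, 100),
    "default":          rng.integers(0, 2, 100),
})

# Rank features by absolute correlation with the default label and keep the strongest ones.
correlations = df.drop(columns="default").corrwith(df["default"]).abs()
selected = correlations.sort_values(ascending=False).head(2).index.tolist()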
5. Data Compression
A. Lossless Compression:
Run-Length Encoding Example:
Original: AAABBBCCCC
Compressed: 3A3B4C
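A tiny Python sketch of run-length encoding that reproduces this result:

from itertools import groupby

def run_length_encode(text: str) -> str:
    # Encode each run of repeated characters as <count><character>.
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

print(run_length_encode("AAABBBCCCC"))   # -> 3A3B4C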
B. Lossy Compression:
Image Resolution Reduction:
Original: 1024x1024 pixels
Compressed: 512x512 pixels with color averaging
Numerosity Reduction
Replaces original data with smaller representations (a brief sketch follows the list below):
1. Parametric Methods:
 Regression models
 Log-linear models
2. Non-parametric Methods:
 Histograms
 Clustering
 Sampling
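A brief numpy sketch of two non-parametric reductions (sampling and a histogram) on hypothetical measurements:

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=10_000)   # hypothetical measurements

# Sampling: keep a 1% random sample instead of all 10,000 values.
sample = rng.choice(values, size=100, replace=False)

# Histogram: represent the data by ten bin counts and their bin edges.
counts, edges = np.histogram(values, bins=10)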
Discretization and Concept Hierarchy
Discretization
Converts continuous data into discrete intervals:
1. Top-down splitting:
 Starts with full range
 Recursively splits into intervals
2. Bottom-up merging:
 Starts with individual values
 Progressively merges intervals
Real-world Example: Age values (continuous) converted to categories (young, middle-aged, senior), as sketched below.
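A short pandas sketch of this discretization; the cut points are illustrative:

import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

# Split the continuous age range into three labelled intervals.
categories = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle-aged", "senior"])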
Concept Hierarchy
Organizes data at different abstraction levels:
1. Schema Hierarchy:
 Based on database schema relationships
 Example: Street → City → State → Country
2. Set-grouping Hierarchy:
 Based on grouping of values
 Example: Product categories and subcategories
Significance:
 Enables multi-level analysis
 Supports roll-up and drill-down operations
 Facilitates data understanding at different granularities
Data Mining Task
Data mining primitives define a data mining task, which can be specified in the form of a data
mining query.
• Task Relevant Data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measure
• Presentation and visualization of discovered patterns
1. Task Relevant Data
Task-relevant data refers to the specific data portions needed for a particular data mining
task. It encompasses the selection of appropriate data sets, attributes, and time periods that
are pertinent to the analysis objectives.
DMQL Syntax:
USE DATABASE database_name;
MINE FROM table_name
[USING attribute_list]
[WHERE condition]
[DURING time_period];
Real-world Example: In a retail database analysis, the task-relevant specification might restrict mining to the sales table, select only attributes such as customer, item, and purchase amount, and limit the records to a particular branch and time period, following the DMQL template above.

2. Kinds of Knowledge to be Mined


This specifies the type of patterns or knowledge we want to discover from the data. Common
types include characterization, discrimination, association, classification, clustering, and
trend analysis.
A. Association Rules

B. Classification Rules

C. Clustering
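Illustrative (hypothetical) examples of each kind of pattern:
• Association rule: buys(X, "laptop") ⇒ buys(X, "mouse") [support = 2%, confidence = 60%]
• Classification rule: IF Credit_Score > 700 AND Debt_Ratio < 0.3 THEN risk = "low"
• Clustering: customers grouped into segments (e.g., budget, regular, and premium shoppers) by similarity of purchasing behaviour, without predefined class labels.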

3. Background Knowledge
Background knowledge represents domain-specific information that can guide the mining
process. This includes concept hierarchies, expert knowledge, and domain constraints.

A. Concept Hierarchies
B. Domain Constraints

4. Interestingness Measures
Interestingness measures help evaluate the utility and significance of discovered patterns.
These include statistical measures, pattern-based measures, and domain-specific criteria.
A. Statistical Measures
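For an association rule A ⇒ B, support and confidence can be computed as in the short sketch below (the counts are hypothetical):

# Hypothetical transaction counts, for illustration only.
total_transactions = 1000
contains_a         = 200    # transactions containing itemset A
contains_a_and_b   = 120    # transactions containing both A and B

support    = contains_a_and_b / total_transactions   # 0.12 -> rule covers 12% of all transactions
confidence = contains_a_and_b / contains_a           # 0.60 -> 60% of A-transactions also contain B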

B. Domain-Specific Measures

5. Presentation and Visualization of Discovered Patterns


This component defines how the discovered patterns should be presented to users. It includes
various visualization techniques and presentation formats.
A. Basic Visualization
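A minimal matplotlib sketch that plots the confidence of a few hypothetical discovered rules as a bar chart:

import matplotlib.pyplot as plt

# Hypothetical discovered rules and their confidence values.
rules      = ["milk => bread", "laptop => mouse", "tea => sugar"]
confidence = [0.72, 0.61, 0.55]

plt.bar(rules, confidence)
plt.ylabel("Confidence")
plt.title("Discovered association rules")
plt.tight_layout()
plt.show()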

B. Custom Reports
