Module 2: Data Preprocessing
Unit-II
Data Pre-processing in Data Mining
1. Ignore the Tuple (Record Deletion)
Tuples (records) that contain missing values are simply discarded from the analysis.
When to Use:
Dataset is large enough that removing records won't significantly impact analysis
Missing values appear randomly (Missing Completely at Random - MCAR)
Complete case analysis is required
Advantages:
Simple to implement
Maintains data consistency
No need for value estimation
Disadvantages:
Loss of potentially important data
Can introduce bias if missing values aren't random
Reduces dataset size
2. Manual Fill
This involves domain experts examining each missing value and filling it based on context
and expertise.
Example:
Expert Analysis:
For CustomerID 2: Based on price and category, expert determines it's a "Tablet"
For CustomerID 3: Based on product and price, expert assigns "Mobile Devices"
When to Use:
Small datasets
Critical data where accuracy is paramount
Domain expertise is available
Historical records can be referenced
Advantages:
Highest accuracy
Incorporates domain knowledge
Maintains data relationships
Disadvantages:
Time-consuming
Not scalable
Requires expert availability
3. Statistical Methods
A. Mean/Median Imputation
Replaces a missing numerical value with the mean (or median) of the observed values of that attribute.
B. KNN Imputation
Example for Numerical Values:
For Record 2: KNN would use the most similar records (such as Records 1 and 3) to estimate the missing income.
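As a rough illustration (the course dataset is not reproduced here, so the Age and Income columns and their values are assumed), the following Python sketch shows both mean imputation and KNN-based imputation of a missing income using scikit-learn:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical records; the income of record 2 is missing (np.nan).
df = pd.DataFrame({
    "Age":    [25, 27, 26, 45],
    "Income": [50000, np.nan, 52000, 90000],
})

# Mean imputation: replace the missing income with the column mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

# KNN imputation: estimate the missing income from the k most similar
# records (here, the records with the closest ages).
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)

print(mean_imputed)
print(knn_imputed)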
Implementation Considerations:
1. Data Type: Choose method based on whether data is numerical or categorical
2. Missing Pattern: Identify if values are Missing Completely at Random (MCAR),
Missing at Random (MAR), or Missing Not at Random (MNAR)
3. Dataset Size: Consider computational resources for large datasets
4. Required Accuracy: Balance between accuracy needs and implementation complexity
Real-world Application Example: Healthcare Dataset:
Combined Approach:
1. Use mean imputation for Glucose in Record 1 (less critical)
2. Use manual fill for Age in Record 2 (from medical history)
3. Use regression for Blood Pressure in Record 3 (based on other vitals)
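The regression step above can be sketched in Python as follows; the vital-sign columns and their values are hypothetical, since the healthcare records themselves are not shown here:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical vital signs; the blood pressure of one record is missing.
df = pd.DataFrame({
    "Age":           [34, 51, 47, 62, 29],
    "HeartRate":     [72, 80, 76, 88, 70],
    "BloodPressure": [118, 135, np.nan, 142, 115],
})

known   = df[df["BloodPressure"].notna()]
missing = df[df["BloodPressure"].isna()]

# Fit a regression model on the complete records, then predict the
# missing value from the other vitals (Age and HeartRate).
model = LinearRegression().fit(known[["Age", "HeartRate"]], known["BloodPressure"])
df.loc[missing.index, "BloodPressure"] = model.predict(missing[["Age", "HeartRate"]])

print(df)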
Handling Noisy Data
Noise refers to random errors or variations in measured values. Methods to handle noisy data
include:
1. Binning Methods:
Sort data and partition into equal-sized bins
Smooth by bin means, bin medians, or bin boundaries
2. Regression:
Fit data to a regression function
Can be linear or multiple regression
3. Clustering:
Detect and remove outliers
Group similar data points together
Real-time Example: In sensor data collection, temperature readings might contain noise due
to equipment malfunction. Binning can help smooth these variations.
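A minimal Python sketch of smoothing by bin means, using made-up temperature readings, might look like this:

import numpy as np

# Hypothetical noisy sensor readings (°C).
readings = np.array([20.1, 20.4, 20.9, 21.5, 24.8, 21.9, 22.3, 22.6, 29.9])
readings.sort()                     # binning starts from sorted data
smoothed = readings.copy()

# Smoothing by bin means: partition the sorted values into equal-frequency
# bins and replace every value in a bin by that bin's mean.
bin_size = 3
for start in range(0, len(readings), bin_size):
    b = readings[start:start + bin_size]
    smoothed[start:start + bin_size] = b.mean()

print(smoothed)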
Data Integration
Data integration is the process of combining data from multiple sources into a coherent and
unified view. In real-world scenarios, organizations often need to merge data from various
systems, databases, and files to perform comprehensive analysis.
Key challenges include:
1. Schema Integration:
• Matching entities from different sources
• Resolving naming conflicts
• Handling different attribute representations
2. Redundancy Detection:
• Identifying duplicate records
• Managing derived attributes
• Resolving data value conflicts
Advantages:
Creates a unified view of data
Improves data consistency
Enables comprehensive analysis
Disadvantages:
Complex process
Time and resource-intensive
May introduce data quality issues
1. Schema Integration
Schema integration involves reconciling different database schemas to create a unified view.
A real-world example from a retail company:
Database 1 (Store Sales):
Integration Process:
1. Identify corresponding entities:
• Customer_Table ↔ Buyers
• Sales_Table ↔ Orders
2. Resolve naming conflicts:
• customer_id/buyer_number → customer_identifier
• cust_name/buyer_name → customer_name
• amount/total_amount → transaction_amount
3. Create unified schema:
Customers: customer_identifier | customer_name | address | contact
Transactions: transaction_id | customer_identifier | transaction_amount | transaction_date
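A small pandas sketch of steps 2 and 3 above; the source tables are assumed, since only the column mappings are listed here:

import pandas as pd

# Hypothetical source tables, reduced to the columns named in the mapping.
store_customers = pd.DataFrame({"customer_id": [1], "cust_name": ["Asha"]})
online_buyers   = pd.DataFrame({"buyer_number": [2], "buyer_name": ["Ravi"]})

# Resolve naming conflicts by renaming to the unified schema
# (the same renaming would be applied to Sales_Table / Orders).
store_customers = store_customers.rename(
    columns={"customer_id": "customer_identifier", "cust_name": "customer_name"})
online_buyers = online_buyers.rename(
    columns={"buyer_number": "customer_identifier", "buyer_name": "customer_name"})

# Combine both sources into the unified Customers table.
customers = pd.concat([store_customers, online_buyers], ignore_index=True)
print(customers)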
2. Redundancy Detection
Example from a healthcare system integration:
Hospital System:
Lab System:
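Because the hospital and lab tables are not reproduced here, the following sketch uses hypothetical patient records to show how duplicates spanning the two systems might be flagged and removed with pandas:

import pandas as pd

# Hypothetical merged patient records from the two systems.
patients = pd.DataFrame({
    "source": ["hospital", "lab", "hospital"],
    "name":   ["A. Kumar", "A. Kumar", "B. Rao"],
    "dob":    ["1980-02-11", "1980-02-11", "1975-07-30"],
})

# Flag records that describe the same patient in both systems.
dupes = patients[patients.duplicated(subset=["name", "dob"], keep=False)]
print(dupes)

# Keep one record per patient.
deduplicated = patients.drop_duplicates(subset=["name", "dob"], keep="first")
print(deduplicated)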
Data Transformation
1. Smoothing
Smoothing helps remove noise from data. Consider sensor data from a manufacturing plant:
Raw Temperature Readings (°C):
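The raw readings are not listed here, so the sketch below uses made-up values; it applies a 3-point moving average, one common smoothing choice (the notes do not fix a specific method for this example):

import pandas as pd

# Hypothetical raw temperature readings (°C) with an obvious spike.
raw = pd.Series([68.2, 68.5, 68.4, 74.9, 68.6, 68.3, 68.7])

# Smooth with a 3-point centred moving average.
smoothed = raw.rolling(window=3, center=True, min_periods=1).mean()

print(pd.DataFrame({"raw": raw, "smoothed": smoothed.round(2)}))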
2. Aggregation
Real-world example from retail sales data:
Daily Sales Data:
Monthly Aggregation:
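A minimal pandas sketch of the daily-to-monthly aggregation, with assumed dates and sales figures:

import pandas as pd

# Hypothetical daily sales figures.
daily = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "sales": [1200, 1500, 900, 1100],
})

# Aggregate daily records into monthly totals.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)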
3. Normalization
Example using customer transaction data:
Original Data:
Z-Score Normalization:
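Z-score normalization rescales each value as z = (x - mean) / standard deviation. A short sketch with assumed transaction amounts:

import pandas as pd

# Hypothetical transaction amounts.
amounts = pd.Series([200, 400, 600, 800, 1000], name="amount")

# Z-score normalization; pandas uses the sample standard deviation (ddof=1).
z_scores = (amounts - amounts.mean()) / amounts.std()

print(pd.DataFrame({"amount": amounts, "z_score": z_scores.round(2)}))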
Data Reduction
Data reduction techniques reduce data volume while maintaining integrity of results.
Data Cube Aggregation
Aggregates data by using data cube operations
Example: Converting daily sales data to monthly summaries
Reduces data volume while preserving trends
Dimensionality Reduction
Reduces number of random variables under consideration:
1. Principal Component Analysis (PCA):
Transforms data into lower dimensional space
Preserves maximum variance
2. Feature Selection:
Identifies most relevant attributes
Removes redundant features (see the sketch after the advantages and disadvantages below)
Advantages:
Reduces computational complexity
Improves model performance
Reduces storage requirements
Disadvantages:
Potential loss of information
May reduce model interpretability
Requires careful selection of reduction method
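As referenced above, here is a minimal sketch of correlation-based feature selection, one possible approach; the customer attributes below are invented, and one feature is deliberately a near-copy of another so the redundancy is detected:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical customer attributes; "monthly_spend" is nearly a rescaled
# copy of "annual_spend", i.e. a redundant feature.
df = pd.DataFrame({
    "annual_spend": rng.normal(1000, 200, 100),
    "visits":       rng.integers(1, 30, 100),
})
df["monthly_spend"] = df["annual_spend"] / 12 + rng.normal(0, 1, 100)

# Drop one feature from every pair whose absolute correlation exceeds 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]

reduced = df.drop(columns=redundant)
print("dropped:", redundant)          # expected: ['monthly_spend']
print("kept:", list(reduced.columns))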
Data Compression
Two main approaches:
1. Lossless Compression:
Original data can be reconstructed exactly
Examples: Run-length encoding, Huffman coding (run-length encoding is sketched after this list)
2. Lossy Compression:
Reconstructed data approximates original
Examples: Wavelets, PCA
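As noted in the list above, run-length encoding can be sketched in a few lines; the data values are made up, and the decoder reconstructs the original sequence exactly, which is what makes the scheme lossless:

def run_length_encode(values):
    # Lossless compression: store each run of repeated values as (value, count).
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

def run_length_decode(encoded):
    # Reconstruct the original sequence exactly from the (value, count) pairs.
    return [v for v, count in encoded for _ in range(count)]

data = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
compressed = run_length_encode(data)
print(compressed)                               # [('A', 3), ('B', 2), ('C', 4)]
assert run_length_decode(compressed) == data    # lossless: exact reconstruction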
A. Principal Component Analysis (PCA)
Example using customer behavior data:
Original Features:
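Since the original feature table is not reproduced here, the sketch below invents four behavioural features and uses scikit-learn's PCA to project them onto the two components that preserve the most variance:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical customer-behaviour features (the original table is not shown).
X = pd.DataFrame({
    "visits_per_month": rng.poisson(8, 200),
    "avg_basket_value": rng.normal(45, 12, 200),
    "pages_per_visit":  rng.normal(6, 2, 200),
    "time_on_site_min": rng.normal(15, 5, 200),
})

# Standardize, then project onto the two components with maximum variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)     # share of variance kept per component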
B. Classification Rules
C. Clustering
3. Background Knowledge
Background knowledge represents domain-specific information that can guide the mining
process. This includes concept hierarchies, expert knowledge, and domain constraints.
A. Concept Hierarchies
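A concept hierarchy maps low-level attribute values to higher-level concepts. A minimal sketch, assuming a location hierarchy of city → state → country (the hierarchy used in the course example is not shown):

# Hypothetical concept hierarchy for a location attribute:
# city -> state -> country (low-level values generalize to higher-level concepts).
city_to_state = {"Hyderabad": "Telangana", "Mumbai": "Maharashtra", "Pune": "Maharashtra"}
state_to_country = {"Telangana": "India", "Maharashtra": "India"}

def generalize(city, level):
    # Climb the hierarchy to the requested level: 'city', 'state' or 'country'.
    if level == "city":
        return city
    state = city_to_state[city]
    return state if level == "state" else state_to_country[state]

print(generalize("Pune", "state"))    # Maharashtra
print(generalize("Pune", "country"))  # India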
B. Domain Constraints
4. Interestingness Measures
Interestingness measures help evaluate the utility and significance of discovered patterns.
These include statistical measures, pattern-based measures, and domain-specific criteria.
A. Statistical Measures
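For association patterns, the statistical measures most commonly used are support, confidence, and lift. A short sketch that computes them for a hypothetical rule over made-up transactions:

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

# Measures for the rule {bread} -> {milk}.
sup_rule   = support({"bread", "milk"})
confidence = sup_rule / support({"bread"})
lift       = confidence / support({"milk"})

print(f"support={sup_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")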
B. Domain-Specific Measures
C. Custom Reports