
BIG DATA & ANALYTICS (ELECTIVE)

Unit - II
Data Pre-processing in Data Mining

Introduction to Data Pre-processing


Data pre-processing is a crucial step in the data mining process that involves transforming
raw data into a format suitable for analysis. In real-world scenarios, data often comes with
inconsistencies, missing values, and noise that can significantly impact the quality of mining
results.
Why Pre-process Data?
Data pre-processing is essential for several reasons:
1. Real-world data is often incomplete, containing missing values or lacking certain
attributes
2. Data may contain noise and inconsistencies
3. Different data sources might use different formats and representations
4. Raw data might be too large or complex for effective analysis
The goal is to produce high-quality data that leads to high-quality mining results.
Data Cleaning
Data cleaning addresses issues with data quality, including missing values, noise, and
inconsistencies. This process is vital for ensuring accurate analysis results.
Handling Missing Values
Missing values are a common issue in real-world datasets that occur when data is not
recorded, lost during collection, or intentionally not provided. Let's examine each handling
method with detailed examples:
1. Ignore the Record (Tuple Deletion)
This method involves removing entire records that contain missing values.
Example: given a customer purchase table in which some records contain missing values, every record with at least one missing entry is removed, and only the complete records are kept for analysis.
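A minimal pandas sketch of this step, using a small hypothetical purchase table (the values are illustrative only):

import pandas as pd
import numpy as np

# Hypothetical customer purchase data with missing entries.
purchases = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Product":    ["Laptop", np.nan, "Headphones", "Monitor"],
    "Amount":     [1200.0, 450.0, np.nan, 300.0],
})

# Tuple deletion: drop every record that contains at least one missing value.
complete_cases = purchases.dropna()
print(complete_cases)   # only CustomerID 1 and 4 remain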

When to Use:
 Dataset is large enough that removing records won't significantly impact analysis
 Missing values appear randomly (Missing Completely at Random - MCAR)
 Complete case analysis is required
Advantages:
 Simple to implement
 Maintains data consistency
 No need for value estimation
Disadvantages:
 Loss of potentially important data
 Can introduce bias if missing values aren't random
 Reduces dataset size
2. Manual Fill
This involves domain experts examining each missing value and filling it based on context
and expertise.
Example (a customer purchase table in which the product name for CustomerID 2 and the category for CustomerID 3 are missing):

Expert Analysis:
 For CustomerID 2: Based on price and category, expert determines it's a "Tablet"
 For CustomerID 3: Based on product and price, expert assigns "Mobile Devices"
When to Use:
 Small datasets
 Critical data where accuracy is paramount
 Domain expertise is available
 Historical records can be referenced
Advantages:
 Highest accuracy
 Incorporates domain knowledge
 Maintains data relationships
Disadvantages:
 Time-consuming
 Not scalable
 Requires expert availability
3. Statistical Methods
A. Mean/Median Imputation
Example for Numerical Values:
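A short pandas sketch of mean and median imputation on a hypothetical income column:

import pandas as pd
import numpy as np

income = pd.Series([35000, 42000, np.nan, 58000, np.nan, 61000])

mean_filled   = income.fillna(income.mean())     # observed mean   = 49000
median_filled = income.fillna(income.median())   # observed median = 50000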

B. Mode Imputation for Categorical Values


Example:
Category: [Electronics, Clothing, NULL, Electronics, NULL, Clothing, Electronics]
Mode = Electronics
After Imputation: [Electronics, Clothing, Electronics, Electronics, Electronics, Clothing,
Electronics]
C. Prediction-Based Imputation
Using regression for missing income values:
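A minimal scikit-learn sketch of this idea, assuming (hypothetically) that age and years of experience are available to predict the missing income values:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Age":        [25, 32, 40, 29, 51],
    "Experience": [2, 8, 15, 5, 25],
    "Income":     [30000, 48000, np.nan, 40000, np.nan],
})

known   = df[df["Income"].notna()]
missing = df[df["Income"].isna()]

# Fit a regression model on the complete records ...
model = LinearRegression().fit(known[["Age", "Experience"]], known["Income"])

# ... and use its predictions to fill in the missing incomes.
df.loc[df["Income"].isna(), "Income"] = model.predict(missing[["Age", "Experience"]])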
4. Advanced Methods
A. Multiple Imputation
Creates multiple versions of the dataset with different imputed values to account for
uncertainty.
Example:
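One way to sketch this with scikit-learn is to run its IterativeImputer several times with posterior sampling enabled; each run yields one plausible completed dataset (the values below are hypothetical):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: Age, Income (two income values are missing).
X = np.array([
    [25, 30000],
    [32, np.nan],
    [40, 52000],
    [29, np.nan],
    [51, 75000],
])

# Each random seed produces one imputed version of the dataset;
# analysing all versions together reflects the uncertainty in the missing values.
imputed_versions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]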

B. K-Nearest Neighbors (KNN) Imputation


Finds similar records based on other attributes to estimate missing values.
Example:

For Record 2: KNN would use similar records (like Record 1 and 3) to estimate income.
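A minimal scikit-learn sketch, with hypothetical values chosen so that Records 1 and 3 are the nearest neighbours of Record 2:

import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Experience, Income (Record 2 has a missing income).
X = np.array([
    [25,  2, 30000],
    [27,  3, np.nan],
    [26,  2, 32000],
    [45, 20, 80000],
])

# The missing income is estimated from the two most similar records.
filled = KNNImputer(n_neighbors=2).fit_transform(X)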
Implementation Considerations:
1. Data Type: Choose method based on whether data is numerical or categorical
2. Missing Pattern: Identify if values are Missing Completely at Random (MCAR),
Missing at Random (MAR), or Missing Not at Random (MNAR)
3. Dataset Size: Consider computational resources for large datasets
4. Required Accuracy: Balance between accuracy needs and implementation complexity
Real-world Application Example (a healthcare dataset with missing Glucose, Age, and Blood Pressure entries):

Combined Approach:
1. Use mean imputation for Glucose in Record 1 (less critical)
2. Use manual fill for Age in Record 2 (from medical history)
3. Use regression for Blood Pressure in Record 3 (based on other vitals)
Handling Noisy Data
Noise refers to random errors or variations in measured values. Methods to handle noisy data
include:
1. Binning Methods:
 Sort data and partition into equal-sized bins
 Smooth by bin means, median, or boundaries
2. Regression:
 Fit data to a regression function
 Can be linear or multiple regression
3. Clustering:
 Detect and remove outliers
 Group similar data points together
Real-world Example: In sensor data collection, temperature readings might contain noise due to equipment malfunction. Binning can help smooth these variations, as sketched below.
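A small numpy sketch of equal-frequency binning with smoothing by bin means, applied to hypothetical temperature readings:

import numpy as np

# Hypothetical sensor readings; 35.0 is a noisy outlier.
readings = np.array([21.1, 35.0, 21.4, 21.3, 22.0, 21.8, 22.3, 21.9, 22.1])

# Sort the readings and partition them into three equal-sized bins.
bins = np.sort(readings).reshape(3, 3)

# Smoothing by bin means: every value in a bin is replaced by its bin mean,
# which dampens the effect of the outlying reading.
smoothed = np.repeat(bins.mean(axis=1), 3)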
Data Integration
Data integration is the process of combining data from multiple sources into a coherent and
unified view. In real-world scenarios, organizations often need to merge data from various
systems, databases, and files to perform comprehensive analysis.
Key challenges include:
1. Schema Integration:
• Matching entities from different sources
• Resolving naming conflicts
• Handling different attribute representations
2. Redundancy Detection:
• Identifying duplicate records
• Managing derived attributes
• Resolving data value conflicts
Advantages:
 Creates a unified view of data
 Improves data consistency
 Enables comprehensive analysis
Disadvantages:
 Complex process
 Time and resource-intensive
 May introduce data quality issues

1. Schema Integration
Schema integration involves reconciling different database schemas to create a unified view.
A real-world example from a retail company:
Database 1 (Store Sales) keeps customers in Customer_Table and transactions in Sales_Table; Database 2 (Online Sales) keeps the same information in Buyers and Orders, using different column names.
Integration Process:
1. Identify corresponding entities:
• Customer_Table ↔ Buyers
• Sales_Table ↔ Orders
2. Resolve naming conflicts:
• customer_id/buyer_number → customer_identifier
• cust_name/buyer_name → customer_name
• amount/total_amount → transaction_amount
3. Create unified schema:
Customers: customer_identifier | customer_name | address | contact
Transactions: transaction_id | customer_identifier | transaction_amount | transaction_date
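A minimal pandas sketch of this mapping; the two source tables below are hypothetical stand-ins for the store and online customer tables:

import pandas as pd

# Hypothetical extracts from the two source systems.
store_customers = pd.DataFrame({"customer_id": [101], "cust_name": ["Alice"]})
online_buyers   = pd.DataFrame({"buyer_number": [501], "buyer_name": ["Bob"]})

# Resolve naming conflicts by mapping both schemas onto the unified one.
store_customers = store_customers.rename(
    columns={"customer_id": "customer_identifier", "cust_name": "customer_name"})
online_buyers = online_buyers.rename(
    columns={"buyer_number": "customer_identifier", "buyer_name": "customer_name"})

# Combine both sources into a single Customers table.
customers = pd.concat([store_customers, online_buyers], ignore_index=True)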

2. Redundancy Detection
Example from a healthcare system integration:
The hospital system and the lab system each maintain their own patient records, so the same patient (for example, John Doe) can appear in both with differently formatted information.
Redundancy Detection Process:


1. Identify matching records using:
• Name similarity (John Doe)
• Date of Birth match
• Compatible blood type information
2. Resolve conflicts:
• Standardize blood type format (O+ vs O Positive)
• Create unified patient identifier
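A small pandas sketch of these two steps, using hypothetical hospital and lab records for the same patient:

import pandas as pd

hospital = pd.DataFrame({"name": ["John Doe"], "dob": ["1980-05-12"], "blood_type": ["O+"]})
lab      = pd.DataFrame({"name": ["John Doe"], "dob": ["1980-05-12"], "blood_type": ["O Positive"]})

patients = pd.concat([hospital, lab], ignore_index=True)

# Standardize the blood type format so equivalent values match.
patients["blood_type"] = patients["blood_type"].replace({"O Positive": "O+"})

# Records that agree on name and date of birth are treated as duplicates.
patients = patients.drop_duplicates(subset=["name", "dob"])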
Data Transformation
Data transformation converts data into forms suitable for mining. Methods include:
1. Smoothing:
 Removes noise from data
 Uses techniques like regression and binning
2. Aggregation:
 Summarizes data
 Example: Daily sales to monthly totals
3. Normalization:
 Scales data to specific ranges
 Common methods: Min-max, z-score, decimal scaling
4. Feature Construction:
 Creates new attributes from existing ones
 Helps in improving mining accuracy

1. Smoothing
Smoothing helps remove noise from data. Consider sensor data from a manufacturing plant:
Example: raw temperature readings (°C) are smoothed with a moving average of window size 3, as sketched below.
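A short pandas sketch with hypothetical readings; the spike at 95.0 is damped by the moving average:

import pandas as pd

# Hypothetical raw temperature readings (°C) from a plant sensor.
raw = pd.Series([72.1, 72.4, 95.0, 72.3, 72.6, 72.2])

# Centred moving average with a window of three readings.
smoothed = raw.rolling(window=3, center=True).mean()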

2. Aggregation
Real-world example from retail sales data:
Example: daily sales figures are aggregated (summed) into monthly totals, as sketched below.
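A short pandas sketch with hypothetical daily figures:

import pandas as pd

# Hypothetical daily sales figures.
daily = pd.DataFrame(
    {"sales": [120, 150, 90, 200]},
    index=pd.to_datetime(["2024-01-02", "2024-01-15", "2024-02-03", "2024-02-20"]),
)

# Aggregate (sum) the daily values into monthly totals.
monthly = daily.groupby(daily.index.to_period("M")).sum()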

3. Normalization
Example using customer transaction data:
Example: the original transaction amounts are rescaled with min-max normalization (to a 0-1 range) and with z-score normalization, as sketched below.
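A short pandas sketch of both methods on hypothetical transaction amounts:

import pandas as pd

amount = pd.Series([200.0, 500.0, 800.0, 1500.0])

# Min-max normalization to the [0, 1] range: (v - min) / (max - min)
min_max = (amount - amount.min()) / (amount.max() - amount.min())

# Z-score normalization: (v - mean) / standard deviation
z_score = (amount - amount.mean()) / amount.std()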
Data Reduction
Data reduction techniques reduce data volume while maintaining integrity of results.
Data Cube Aggregation
 Aggregates data by using data cube operations
 Example: Converting daily sales data to monthly summaries
 Reduces data volume while preserving trends
Dimensionality Reduction
Reduces number of random variables under consideration:
1. Principal Component Analysis (PCA):
 Transforms data into lower dimensional space
 Preserves maximum variance
2. Feature Selection:
 Identifies most relevant attributes
 Removes redundant features
Advantages:
 Reduces computational complexity
 Improves model performance
 Reduces storage requirements
Disadvantages:
 Potential loss of information
 May reduce model interpretability
 Requires careful selection of reduction method
Data Compression
Two main approaches:
1. Lossless Compression:
 Original data can be reconstructed exactly
 Examples: Run-length encoding, Huffman coding
2. Lossy Compression:
 Reconstructed data approximates original
 Examples: Wavelets, PCA
A. Principal Component Analysis (PCA)
Example using customer behavior data:
The original features are Monthly_Spend, Store_Visits, Website_Visits, Email_Opens, and Click_Rate. After PCA, they are reduced to two components:

PC1: Shopping_Engagement (combines Monthly_Spend, Store_Visits)
PC2: Digital_Engagement (combines Website_Visits, Email_Opens, Click_Rate)
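A minimal scikit-learn sketch of this reduction; the behaviour matrix below is hypothetical:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical rows with the five original features:
# Monthly_Spend, Store_Visits, Website_Visits, Email_Opens, Click_Rate.
X = np.array([
    [250, 4, 12,  6, 0.10],
    [900, 9, 30, 15, 0.25],
    [400, 5, 18,  8, 0.12],
    [700, 8, 25, 12, 0.20],
])

# Standardize the features, then project onto two principal components.
X_scaled  = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)   # shape: (4, 2)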
B. Feature Selection
Example from credit risk assessment:
Original Features:
- Income
- Age
- Credit_Score
- Employment_Years
- Debt_Ratio
- Number_of_Cards
- Payment_History
- Recent_Inquiries
Selected Features (based on correlation with default risk):
- Credit_Score
- Debt_Ratio
- Payment_History
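A minimal sketch of correlation-based selection; the data here is randomly generated purely to show the mechanics, so the chosen columns will differ from the list above:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical applicant data; "default" is the target label (1 = defaulted).
df = pd.DataFrame({
    "Credit_Score":     rng.integers(300, 850, 100),
    "Debt_Ratio":       rng.random(100),
    "Employment_Years": rng.integers(0, 30, 100),
    "default":          rng.integers(0, 2, 100),
})

# Rank features by absolute correlation with the default label and keep the strongest ones.
correlations = df.drop(columns="default").corrwith(df["default"]).abs()
selected = correlations.sort_values(ascending=False).head(2).index.tolist()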
5. Data Compression
A. Lossless Compression:
Run-Length Encoding Example:
Original: AAABBBCCCC
Compressed: 3A3B4C
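A tiny Python sketch of run-length encoding that reproduces this result:

from itertools import groupby

def run_length_encode(text: str) -> str:
    # Encode each run of repeated characters as <count><character>.
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))

print(run_length_encode("AAABBBCCCC"))   # -> 3A3B4C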
B. Lossy Compression:
Image Resolution Reduction:
Original: 1024x1024 pixels
Compressed: 512x512 pixels with color averaging
Numerosity Reduction
Replaces original data with smaller representations (a brief sketch follows the list below):
1. Parametric Methods:
 Regression models
 Log-linear models
2. Non-parametric Methods:
 Histograms
 Clustering
 Sampling
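A brief numpy sketch of two non-parametric reductions (sampling and a histogram) on hypothetical measurements:

import numpy as np

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=10_000)   # hypothetical measurements

# Sampling: keep a 1% random sample instead of all 10,000 values.
sample = rng.choice(values, size=100, replace=False)

# Histogram: represent the data by ten bin counts and their bin edges.
counts, edges = np.histogram(values, bins=10)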
Discretization and Concept Hierarchy
Discretization
Converts continuous data into discrete intervals:
1. Top-down splitting:
 Starts with full range
 Recursively splits into intervals
2. Bottom-up merging:
 Starts with individual values
 Progressively merges intervals
Real-world Example: Age values (continuous) converted to categories (young, middle-aged, senior), as sketched below.
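A short pandas sketch of this discretization; the cut points are illustrative:

import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

# Split the continuous age range into three labelled intervals.
categories = pd.cut(ages, bins=[0, 30, 60, 120], labels=["young", "middle-aged", "senior"])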
Concept Hierarchy
Organizes data at different abstraction levels:
1. Schema Hierarchy:
 Based on database schema relationships
 Example: Street → City → State → Country
2. Set-grouping Hierarchy:
 Based on grouping of values
 Example: Product categories and subcategories
Significance:
 Enables multi-level analysis
 Supports roll-up and drill-down operations
 Facilitates data understanding at different granularities
Data Mining Task
Data mining primitives define a data mining task, which can be specified in the form of a data
mining query.
• Task Relevant Data
• Kinds of knowledge to be mined
• Background knowledge
• Interestingness measure
• Presentation and visualization of discovered patterns
1. Task Relevant Data
Task-relevant data refers to the specific data portions needed for a particular data mining
task. It encompasses the selection of appropriate data sets, attributes, and time periods that
are pertinent to the analysis objectives.
DMQL Syntax:
USE DATABASE database_name;
MINE FROM table_name
[USING attribute_list]
[WHERE condition]
[DURING time_period];
Real-world Example: In a retail database analysis, the task-relevant specification might restrict mining to the sales table, select only attributes such as customer, item, and purchase amount, and limit the records to a particular branch and time period, following the DMQL template above.

2. Kinds of Knowledge to be Mined


This specifies the type of patterns or knowledge we want to discover from the data. Common
types include characterization, discrimination, association, classification, clustering, and
trend analysis.
A. Association Rules

B. Classification Rules

C. Clustering
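Illustrative (hypothetical) examples of each kind of pattern:
• Association rule: buys(X, "laptop") ⇒ buys(X, "mouse") [support = 2%, confidence = 60%]
• Classification rule: IF Credit_Score > 700 AND Debt_Ratio < 0.3 THEN risk = "low"
• Clustering: customers grouped into segments (e.g., budget, regular, and premium shoppers) by similarity of purchasing behaviour, without predefined class labels.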

3. Background Knowledge
Background knowledge represents domain-specific information that can guide the mining
process. This includes concept hierarchies, expert knowledge, and domain constraints.

A. Concept Hierarchies
B. Domain Constraints

4. Interestingness Measures
Interestingness measures help evaluate the utility and significance of discovered patterns.
These include statistical measures, pattern-based measures, and domain-specific criteria.
A. Statistical Measures
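For an association rule A ⇒ B, support and confidence can be computed as in the short sketch below (the counts are hypothetical):

# Hypothetical transaction counts, for illustration only.
total_transactions = 1000
contains_a         = 200    # transactions containing itemset A
contains_a_and_b   = 120    # transactions containing both A and B

support    = contains_a_and_b / total_transactions   # 0.12 -> rule covers 12% of all transactions
confidence = contains_a_and_b / contains_a           # 0.60 -> 60% of A-transactions also contain B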

B. Domain-Specific Measures

5. Presentation and Visualization of Discovered Patterns


This component defines how the discovered patterns should be presented to users. It includes
various visualization techniques and presentation formats.
A. Basic Visualization
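A minimal matplotlib sketch that plots the confidence of a few hypothetical discovered rules as a bar chart:

import matplotlib.pyplot as plt

# Hypothetical discovered rules and their confidence values.
rules      = ["milk => bread", "laptop => mouse", "tea => sugar"]
confidence = [0.72, 0.61, 0.55]

plt.bar(rules, confidence)
plt.ylabel("Confidence")
plt.title("Discovered association rules")
plt.tight_layout()
plt.show()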

B. Custom Reports
