Data Warehousing & Data Mining Unit-3 Notes

Unit: 3

Data Mining
Overview of Data Mining:

Data mining is the process of discovering patterns and extracting valuable insights from large
sets of data. It combines techniques from statistics, machine learning, and database systems to
analyze and interpret complex data structures. Here's an overview of its key components and
applications:

Key Components

1. Data Preprocessing:
 Cleaning: Removing noise and inconsistencies from the data.
 Transformation: Normalizing or aggregating data to improve analysis.
 Reduction: Simplifying data without losing significant information, often through
techniques like dimensionality reduction.
2. Data Exploration:
 Using statistical methods to understand the data distribution, trends, and
relationships.
3. Modeling:
 Applying algorithms to identify patterns and make predictions. Common
techniques include:
 Classification: Assigning labels to data points based on input features.
 Regression: Predicting continuous values.
 Clustering: Grouping similar data points together.
 Association Rule Learning: Discovering relationships between variables
in large databases.
4. Evaluation:
 Assessing the accuracy and effectiveness of the models using metrics like
precision, recall, and F1-score.
5. Deployment:
 Integrating the model into production systems for real-time analysis or decision-
making.
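
To make the modeling and evaluation steps concrete, here is a minimal sketch using scikit-learn. The file name customers.csv and the column churned are hypothetical placeholders; the sketch assumes all feature columns are numeric and the target is a 0/1 label.

    # Minimal sketch of the modeling and evaluation steps (scikit-learn).
    # "customers.csv" and the "churned" column are hypothetical; features are
    # assumed numeric and the target a 0/1 label.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, f1_score

    data = pd.read_csv("customers.csv")
    X = data.drop(columns=["churned"])            # input features
    y = data["churned"]                           # target label (0/1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Transformation: scale features to a common range
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Modeling: fit a classifier
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Evaluation: precision, recall, and F1-score on held-out data
    y_pred = model.predict(X_test)
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-score :", f1_score(y_test, y_pred))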

Applications

 Marketing: Analyzing customer behavior for targeted advertising and sales strategies.
 Finance: Fraud detection, credit scoring, and risk management.
 Healthcare: Predicting disease outbreaks, patient diagnosis, and treatment effectiveness.
 Retail: Inventory management, customer segmentation, and personalized
recommendations.
 Manufacturing: Predictive maintenance and quality control.
Challenges

 Data Quality: Ensuring the data used is accurate and representative.


 Privacy Concerns: Balancing data analysis with the need to protect personal
information.
 Scalability: Managing large datasets and ensuring efficient processing.
 Interpretability: Making complex models understandable to stakeholders.

Tools and Technologies

 Programming Languages: Python, R, SQL.


 Libraries and Frameworks: Scikit-learn, TensorFlow, Apache Spark, Hadoop.
 Visualization Tools: Tableau, Power BI, Matplotlib.

Motivation of Data Mining:

The motivation behind data mining stems from the increasing volume of data generated in
various fields and the need to extract meaningful insights from that data. Here are some key
reasons driving the interest in data mining:

1. Data Explosion

 Volume of Data: With the rise of digital technologies, organizations are inundated with
vast amounts of data from sources like social media, sensors, transactions, and more.
 Variety of Data: Data comes in different formats (structured, unstructured, semi-
structured), requiring sophisticated techniques to analyze.

2. Competitive Advantage

 Informed Decision-Making: Companies use data mining to gain insights that guide
strategic decisions, helping them stay ahead of competitors.
 Personalization: Tailoring products and services to individual customer preferences can
enhance customer loyalty and satisfaction.

3. Improving Efficiency

 Process Optimization: Data mining helps identify inefficiencies in operations, leading to cost savings and better resource allocation.
 Predictive Maintenance: In industries like manufacturing, predicting equipment failures
can minimize downtime and maintenance costs.

4. Enhanced Customer Insights


 Behavior Analysis: Understanding customer behavior allows businesses to create
targeted marketing campaigns and improve customer engagement.
 Segmentation: Identifying distinct customer segments helps in tailoring strategies to
meet diverse needs.

5. Risk Management

 Fraud Detection: Financial institutions employ data mining to detect fraudulent activities by identifying unusual patterns.
 Credit Scoring: Analyzing historical data helps assess the creditworthiness of
individuals and businesses.

6. Scientific Research

 Pattern Discovery: Researchers use data mining to uncover new patterns in complex
datasets, aiding advancements in fields like genomics, climate science, and social
sciences.

7. Real-Time Decision Making

 Timely Insights: With tools that support real-time data analysis, organizations can make
quick, informed decisions that respond to changing circumstances.

8. Predictive Analytics

 Forecasting Trends: Businesses can anticipate future trends and behaviors, allowing
them to proactively adjust their strategies.

9. Innovative Solutions

 New Product Development: Insights from data mining can lead to innovative product
features or entirely new products that meet emerging customer needs.

10. Social Impact

 Public Health: Data mining is used to analyze health trends, predict outbreaks, and
improve patient care.
 Policy Making: Governments can use data insights to develop policies that better serve
their populations.

Definition of Data Mining

Data Mining is the process of discovering patterns, correlations, and insights from large sets of
data using statistical and computational techniques. It involves the extraction of meaningful
information from a vast amount of raw data, enabling organizations to make data-driven
decisions and predictions.

Functionalities of Data Mining

Data mining encompasses several key functionalities, each aimed at extracting different types of
knowledge from data. Here are the primary functionalities:

1. Classification:
 Assigning items in a dataset to target categories or classes based on input features.
 Common algorithms: Decision Trees, Random Forest, Support Vector Machines,
Neural Networks.
 Example: Email filtering (spam vs. non-spam), credit scoring.
2. Regression:
 Predicting a continuous numeric value based on input variables.
 Common algorithms: Linear Regression, Polynomial Regression, Regression
Trees.
 Example: Forecasting sales revenue, predicting housing prices.
3. Clustering:
 Grouping a set of objects in such a way that objects in the same group (cluster)
are more similar to each other than to those in other groups.
 Common algorithms: K-means, Hierarchical clustering, DBSCAN.
 Example: Customer segmentation, grouping similar documents.
4. Association Rule Learning:
 Discovering interesting relationships or associations among a set of items in large
datasets.
 Common algorithms: Apriori, Eclat.
 Example: Market basket analysis (finding products that are frequently purchased
together).
5. Anomaly Detection (Outlier Detection):
 Identifying rare items, events, or observations which raise suspicions by differing
significantly from the majority of the data.
 Common techniques: Statistical tests, Isolation Forest, One-Class SVM.
 Example: Fraud detection in transactions, network security breaches.
6. Sequential Pattern Mining:
 Identifying regular sequences or trends in data that occur over time.
 Common algorithms: GSP, SPADE.
 Example: Analyzing purchase patterns over time, web page traversal paths.
7. Dependency Modeling:
 Modeling the dependencies between variables in a dataset to understand how
changes in one variable affect others.
 Techniques include Bayesian networks and graphical models.
 Example: Understanding the relationship between marketing spend and sales.
8. Text Mining:
 Extracting meaningful information from unstructured text data using natural
language processing (NLP) techniques.
 Example: Sentiment analysis, topic modeling, document classification.
9. Time Series Analysis:
 Analyzing time-ordered data points to extract meaningful statistics and
characteristics.
 Common techniques: ARIMA, Seasonal decomposition, Exponential smoothing.
 Example: Stock price forecasting, demand forecasting.
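
As a small illustration of the clustering functionality listed above, the sketch below segments synthetic customer records with K-means; the feature values are invented purely for demonstration.

    # K-means clustering for customer segmentation (synthetic, illustrative data).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # columns: annual income (k$), purchases per year
    customers = np.array([
        [15, 2],  [18, 3],   [22, 2],    # low income, infrequent buyers
        [60, 25], [65, 30],  [58, 28],   # mid income, frequent buyers
        [110, 8], [120, 10], [105, 7],   # high income, occasional buyers
    ])

    X = StandardScaler().fit_transform(customers)   # put features on a common scale
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("Cluster label per customer:", kmeans.labels_)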

Data Processing:

Data processing in data mining is a crucial step that involves preparing and transforming raw
data into a format suitable for analysis. This process typically consists of several key stages:

1. Data Collection

Gathering data from various sources, which can include databases, data warehouses, web
scraping, or sensors.

2. Data Cleaning

Identifying and correcting errors or inconsistencies in the data. This may involve:

 Removing duplicates
 Filling in missing values
 Correcting inaccuracies

3. Data Integration

Combining data from different sources to create a cohesive dataset. This may include:

 Merging data from different databases


 Ensuring consistency in formats and values

4. Data Transformation

Modifying the data to make it suitable for analysis. This can involve:

 Normalization: Scaling values to a common range


 Aggregation: Summarizing data points
 Encoding categorical variables: Converting categories into numerical formats

5. Data Reduction
Reducing the dataset's size while preserving its integrity, which may include:

 Dimensionality reduction: Using techniques like PCA (Principal Component Analysis)


 Sampling: Selecting a representative subset of the data

6. Data Discretization

Transforming continuous data into discrete categories, which can make certain types of analysis
easier.

7. Data Exploration

Using descriptive statistics and visualization techniques to understand the data’s characteristics
and identify patterns or anomalies.

8. Data Mining Techniques

Applying various algorithms and methods (e.g., clustering, classification, association rules) to
extract meaningful insights and patterns from the prepared data.

Form of Data Pre-processing:

Data preprocessing is a crucial step in data mining to prepare raw data for analysis and model
development. The main goal of preprocessing is to improve the quality of the data and make it
more suitable for mining processes. Below are common forms of data preprocessing:

1. Data Cleaning

 Handling Missing Data: Filling in missing values (e.g., mean, median, mode
imputation), removing incomplete records, or using interpolation methods.
 Outlier Detection and Removal: Identifying and either removing or adjusting data
points that are far outside the norm.
 Handling Noisy Data: Smoothing noisy data using techniques like binning, clustering, or
regression.
 Correcting Data Inconsistencies: Fixing problems like incorrect formats, redundancies,
or duplicate entries.

2. Data Integration

 Combining Data from Multiple Sources: Merging data from different databases or files
to create a cohesive dataset.
 Resolving Schema Differences: Harmonizing different schemas to integrate multiple
data sources.
 Data Transformation: Making sure that the structure and format of data across sources
are consistent.

3. Data Transformation

 Normalization: Scaling data to fit within a particular range (e.g., 0 to 1) to improve algorithm performance.
 Standardization: Adjusting data to have a mean of 0 and a standard deviation of 1.
 Discretization: Converting continuous data into categorical or ordinal data.
 Feature Construction: Creating new features from existing ones (e.g., deriving age from
a birth date).

4. Data Reduction

 Dimensionality Reduction: Reducing the number of features while preserving important information, often using techniques like PCA (Principal Component Analysis).
 Feature Selection: Identifying the most important features to include in the model.
 Sampling: Reducing the dataset size by selecting a representative subset of the data (e.g.,
random sampling, stratified sampling).
 Data Compression: Using methods like clustering or encoding to compress data for
efficient storage and processing.

5. Data Discretization

 Binning: Dividing continuous values into intervals (bins) to reduce granularity.


 Decision Tree-Based Discretization: Using decision tree algorithms to categorize
continuous data.

6. Data Aggregation

 Summarizing Data: Aggregating multiple data points into meaningful summaries, such
as averaging, summing, or computing the median for large datasets.
 Time-Based Aggregation: For time-series data, aggregating by specific time intervals
(e.g., daily, monthly).

7. Data Encoding

 Label Encoding: Converting categorical labels into integers.


 One-Hot Encoding: Converting categorical variables into binary columns (dummy
variables) for each category.
 Binary Encoding: A more compact encoding of categorical values into binary.
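
A short sketch of label encoding and one-hot encoding with pandas and scikit-learn; the colour values are arbitrary examples.

    # Label encoding vs. one-hot encoding of a categorical column (toy values).
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

    # Label encoding: each category becomes an integer (blue=0, green=1, red=2)
    df["colour_label"] = LabelEncoder().fit_transform(df["colour"])

    # One-hot encoding: one binary (dummy) column per category
    one_hot = pd.get_dummies(df["colour"], prefix="colour")
    print(pd.concat([df, one_hot], axis=1))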

8. Data Balancing
 Handling Imbalanced Data: Using techniques like over-sampling (e.g., SMOTE),
under-sampling, or generating synthetic samples to balance classes in classification
problems.
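
The sketch below balances a skewed two-class dataset with SMOTE. It assumes the third-party imbalanced-learn package is installed, and the data is randomly generated for illustration only.

    # Over-sampling the minority class with SMOTE (requires imbalanced-learn).
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # Randomly generated two-class data, roughly 95% / 5% split
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
    print("Before:", Counter(y))

    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("After :", Counter(y_res))   # classes roughly balanced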

Each of these steps is critical in transforming raw data into a form suitable for analysis, helping
improve model accuracy and reliability in data mining.

Data Cleaning
Data cleaning is an essential process in data mining, ensuring the quality and consistency of the
dataset. It deals with correcting or removing inaccurate records, handling missing values, and
managing noisy data. Here are some of the key methods for handling missing values and noisy
data:

1. Handling Missing Values

Missing data is a common problem in datasets. Several methods are used to deal with missing
values, depending on the nature of the data and the extent of the missing entries:

Methods for Handling Missing Data:

1. Ignore the Tuple:


 If the missing values are few and the dataset is large, the simplest method is to
ignore or delete the records with missing data.
 This method is only suitable when missing values are not frequent, as it could
otherwise result in significant data loss.
2. Fill in the Missing Values (Imputation):
 Mean/Median/Mode Imputation: Fill missing values with the mean, median, or
mode of the column.
 Example: For numerical data, replace a missing value with the mean or
median of the available data.
 For categorical data, the mode (most frequent category) can be used.
 Regression Imputation: Use regression techniques to predict the missing value
based on other variables.
 For example, if one feature is strongly correlated with another, the missing
values can be predicted using a regression model.
 K-Nearest Neighbors (KNN) Imputation: Replacing the missing value based on
the values from the nearest neighbors in the data.
 KNN finds the k nearest points to the missing data point and estimates the
missing value by averaging the values of the neighbors.
 Interpolation: Filling missing values by interpolating between known data
points, especially for time-series data.
3. Use Algorithms that Handle Missing Data:
 Some machine learning algorithms, such as decision trees, can handle missing
data inherently and do not require preprocessing.
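
The following sketch shows mean imputation and KNN imputation side by side using pandas and scikit-learn; the small table is invented for illustration.

    # Mean imputation and KNN imputation for missing values (invented data).
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer, KNNImputer

    df = pd.DataFrame({"age":    [25, np.nan, 47, 51, np.nan],
                       "income": [30000, 42000, np.nan, 80000, 55000]})

    # Mean imputation: replace each missing value with the column mean
    mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                               columns=df.columns)

    # KNN imputation: estimate each missing value from the 2 most similar rows
    knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                              columns=df.columns)

    print(mean_filled)
    print(knn_filled)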

2. Handling Noisy Data

Noisy data refers to data that contains errors, outliers, or irrelevant values. It can distort data
mining processes, leading to inaccurate results. Several techniques are used to reduce or
eliminate noise from the data.

Methods for Handling Noisy Data:

a. Binning

 Binning is a simple and effective technique for smoothing noisy data by grouping values
into bins or intervals and then replacing the values with representative values for each
bin.
 Equal-width binning: Divides the range of data into equal-sized intervals.
 Equal-frequency binning: Divides the data into intervals containing an equal number of
data points.
 Smoothing by bin means: The values in a bin are replaced by the mean of the bin.
 Smoothing by bin median: The values in a bin are replaced by the median value.
 Smoothing by bin boundaries: The values near a boundary are replaced by the
boundary values.
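
A minimal pandas sketch of equal-width binning, equal-frequency binning, and smoothing by bin means; the twelve values are arbitrary.

    # Equal-width binning, equal-frequency binning, and smoothing by bin means.
    import pandas as pd

    values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    equal_width = pd.cut(values, bins=3)    # 3 intervals of equal width
    equal_freq  = pd.qcut(values, q=3)      # 3 bins with ~equal numbers of points

    # Smoothing by bin means: replace each value with the mean of its bin
    smoothed = values.groupby(equal_freq).transform("mean")
    print(pd.DataFrame({"value": values, "bin": equal_freq, "smoothed": smoothed}))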

b. Clustering

 Clustering groups data into clusters (groups of similar objects). Outliers and noise are
often located in sparse regions that are far from dense clusters.
 Detecting Outliers: Data points that do not belong to any cluster can be identified as
outliers.
 Removing Noise: The clusters of noise or points far from the core clusters can be
removed or examined for correction.

c. Regression

 Regression is a statistical technique that can be used to detect noise in data and smooth it
out by fitting a function to the data points.
 Linear Regression: A linear model is fitted to two variables (x and y), and noise is
detected as deviations from this linear relationship.
 Multiple Regression: Extends this concept to multiple variables, where a more complex
model is fitted, and deviations from the model indicate noise.
d. Computer and Human Inspection

 Human Inspection: In some cases, expert domain knowledge is required to identify noisy or incorrect data. Manual inspection and correction may be necessary for complex datasets.
 Example: In medical data, human experts may review outlier patient records to
verify their validity.
 Computer Inspection: Automated methods for detecting anomalies or errors, such as
rule-based detection systems, statistical methods, or machine learning algorithms.
 Rule-Based Systems: Define rules for acceptable ranges or formats and flag any
data that violates these rules as potential noise.
 Statistical Methods: Use statistical tests or thresholds to identify data points that
are far from the expected range.

Inconsistent Data in Data Mining:

Inconsistent data refers to data that contains discrepancies or contradictions, making it unreliable
or difficult to use for analysis. Inconsistencies can arise due to a variety of reasons such as data
entry errors, integration of multiple data sources, or system updates. Inconsistent data can lead to
incorrect conclusions in data mining, so identifying and rectifying these inconsistencies is
critical.

Causes of Inconsistent Data

1. Multiple Data Sources: When data is gathered from different sources, there may be
different formats, units of measure, or naming conventions. For example, one source may
use "Male/Female" for gender, while another uses "M/F."
2. Data Redundancy: The same data may be recorded multiple times in different ways,
leading to discrepancies. For example, a customer may have multiple accounts with slight
variations in their address.
3. Time-Based Inconsistencies: Data that changes over time may lead to inconsistencies
when older records are not updated correctly. For example, if a customer changes their
address, and only some records reflect this change.
4. Data Entry Errors: Manual input can introduce errors such as misspellings, incorrect
formats, or misplaced decimal points.
5. System Upgrades: When databases are updated or systems are migrated, data
inconsistencies can emerge due to different formats or missing data mappings.

Types of Inconsistencies

1. Contradictory Data:
 Data that provides conflicting information. For example, if one part of the dataset
says "John Smith" lives in New York, but another part says he lives in Los
Angeles.
2. Different Units of Measurement:
 Inconsistent use of units can lead to errors. For example, weight may be recorded
in kilograms in one part of the dataset and in pounds in another.
3. Duplicate Records:
 Multiple entries for the same entity with slight differences. For example, one
record may list "Jane Doe" and another "J. Doe" with the same address and phone
number.
4. Inconsistent Formats:
 Data that is formatted differently in various parts of the dataset. For example, one
column might store dates as "MM/DD/YYYY" while another uses "DD-MM-
YYYY."
5. Violation of Business Rules:
 Data that contradicts predefined business rules or constraints. For instance, a
customer may have a date of birth that makes them 150 years old, which violates
the business logic.

Methods to Handle Inconsistent Data

1. Data Standardization:
 Consistency in Units: Ensure all data follows the same units of measurement. For
example, converting all weights to kilograms or all temperatures to Celsius.
 Consistent Formats: Ensure that formats (like date formats, currency, etc.) are
consistent across the dataset.
2. Data Deduplication:
 Identifying and removing duplicate records. Tools can be used to detect near-
duplicates based on similarity measures such as edit distance or phonetic
similarity.
3. Schema Matching:
 Aligning fields from different data sources. For example, if one source labels a
field as "Customer ID" and another as "Client Number," schema matching would
ensure both are recognized as the same attribute.
4. Business Rule Enforcement:
 Implementing rules and constraints that enforce consistency. For example,
ensuring that age values are within a reasonable range or that state abbreviations
conform to a standardized list.
5. Data Auditing:
 Performing automated or manual checks to detect inconsistencies. Anomalies can
be flagged for further investigation or correction.
6. Data Cleansing Tools:
 Many tools, such as ETL (Extract, Transform, Load) systems, offer capabilities to
detect and resolve inconsistencies automatically, such as:
 Identifying duplicate records.
 Matching different formats and values.

 Correcting misspellings or invalid entries.
7. Human Inspection:
 In cases where automatic methods cannot resolve inconsistencies, human
intervention might be required. This involves subject-matter experts reviewing
and correcting inconsistent records manually.
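
As a small illustration of standardization and deduplication (methods 1 and 2 above), the sketch below normalizes formats with pandas and then drops the resulting exact duplicates; the customer records are fabricated.

    # Standardizing inconsistent formats, then removing duplicate records.
    import pandas as pd

    df = pd.DataFrame({
        "name":  ["Jane Doe", "jane doe ", "John Smith"],
        "phone": ["123-456-7890", "(123) 456 7890", "987-654-3210"],
    })

    # Standardize: trim whitespace, lower-case names, keep only digits in phones
    df["name"]  = df["name"].str.strip().str.lower()
    df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)

    # After standardization the two "Jane Doe" rows are exact duplicates
    print(df.drop_duplicates())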

Examples of Data Inconsistencies:

 Contradictory Data: Two records showing different values for the same attribute.
Example: "Revenue for Q1 2023" is listed as $10,000 in one source but $15,000 in
another.
 Formatting Differences: Phone numbers stored as "123-456-7890" in one record and
"(123) 456 7890" in another, which could create confusion.
 Outdated Information: An address that is no longer valid because the customer moved
but the older records weren't updated.

Impact of Inconsistent Data

Inconsistent data can severely impact the results of data mining processes:

 Incorrect Analysis: Conflicting or incorrect data can lead to wrong conclusions or predictions.
 Reduced Model Accuracy: Machine learning models trained on inconsistent data are
less accurate, as the noise in the data affects the learning process.
 Poor Decision Making: Businesses relying on data mining for decision-making may
suffer financial losses, operational inefficiencies, or customer dissatisfaction due to
inaccurate insights.

Data Integration and Transformation:

Data Integration and Transformation are two essential steps in the data preprocessing stage of
data mining. These processes ensure that data from various sources are combined, standardized,
and transformed into a format suitable for analysis and mining.

1. Data Integration

Data integration refers to the process of combining data from different sources into a unified
dataset. The goal is to ensure consistency and coherence so that the integrated data can be
analyzed as a whole. Data integration can be challenging when dealing with heterogeneous data
from multiple systems, such as databases, flat files, spreadsheets, or different applications.

Steps in Data Integration:


a. Schema Integration

 Schema integration involves merging different data sources with varying schemas
(database structures) into a unified view.
 Challenges: Resolving issues like synonymity (where different terms mean the same
thing, e.g., "Client ID" vs. "Customer ID") and homonymity (where the same term means
different things, e.g., "State" referring to location vs. status).

b. Entity Identification Problem

 Different sources may use different naming conventions for the same entities. For
example, "SSN" in one system might correspond to "Social Security Number" in another.
 Methods such as record linkage and fuzzy matching are often used to identify the same
entity across different datasets.

c. Data Redundancy

 Data redundancy occurs when the same data is represented in multiple sources. This can
happen when integrating multiple databases, where the same customer or record is stored
in different ways.
 Solution: Use data integration techniques such as duplicate elimination or correlation
analysis to detect and remove redundant data.

d. Data Conflict Resolution

 Conflicts arise when different sources contain contradictory data. For instance, the
"Salary" field for a particular employee may have two different values in different
datasets.
 Approach:
 Rule-based approaches: Set predefined rules (e.g., trust a specific data source
over others).
 Statistical methods: Use statistical analysis to resolve conflicting data (e.g.,
choosing the most frequent value).

2. Data Transformation

Data transformation is the process of converting data into a suitable format for mining and
analysis. Once data is integrated, it often needs to be cleaned, formatted, and transformed to meet
the specific needs of the mining algorithms.

Steps in Data Transformation:

a. Data Smoothing
 Smoothing removes noise and inconsistencies from the data, making it more regular and
less prone to outliers.
 Methods:
 Binning (grouping data into bins and replacing values with the bin
mean/median).
 Clustering (grouping similar data points and replacing outliers with cluster
values).
 Regression-based methods.

b. Data Aggregation

 Aggregation involves summarizing or combining data to reduce the volume but retain
essential information. This is especially useful for data mining in cases where granularity
needs to be reduced.
 Example: Aggregating sales data from individual transactions into monthly or
yearly totals.

c. Data Generalization

 Generalization converts low-level data into higher-level concepts by grouping data based
on higher-level attributes or concepts.
 Example: Converting age values into age ranges (e.g., “18-24,” “25-34”) for
analysis.

d. Data Normalization

 Normalization involves scaling data to a standard range or format. This step is important
when data attributes have different scales, as many data mining algorithms require
normalized data.
 Techniques:
 Min-max normalization: Rescales values between a specified range (e.g.,
[0, 1]).
 Z-score normalization (Standardization): Centers the data around the
mean (mean = 0) with a standard deviation of 1.
 Decimal scaling: Moves the decimal point of values to standardize them.
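
The three normalization techniques can be written out directly, as in the sketch below; the attribute values are arbitrary.

    # Min-max, z-score, and decimal-scaling normalization of one attribute.
    import numpy as np

    x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max: (x - min) / (max - min), rescales into [0, 1]
    min_max = (x - x.min()) / (x.max() - x.min())

    # Z-score: (x - mean) / standard deviation, centres on 0 with std 1
    z_score = (x - x.mean()) / x.std()

    # Decimal scaling: divide by 10^j, with j the smallest power making max|x| < 1
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    decimal_scaled = x / (10 ** j)

    print(min_max, z_score, decimal_scaled, sep="\n")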

e. Data Discretization

 Discretization converts continuous data into categorical or interval-based data, which is sometimes required for specific algorithms (like decision trees).
 Methods:
 Equal-width binning: Divides the data range into equal-width intervals.
 Equal-frequency binning: Divides data so each bin contains
approximately the same number of records.

f. Feature Construction
 Feature construction involves creating new features from existing ones to make patterns
more visible and improve model performance.
 Example: Creating a "BMI" (Body Mass Index) variable from height and weight
data.

g. Data Encoding

 Categorical data may need to be converted into numerical forms for use in certain
machine learning algorithms.
 Label encoding: Assigns a unique integer to each category.
 One-hot encoding: Converts categories into binary vectors for each distinct
category.

Challenges in Data Integration and Transformation

 Heterogeneous Sources: Data may come from various systems with different formats,
types, and structures, making integration difficult.
 Data Inconsistencies: Conflicting data may exist in different sources, requiring conflict
resolution.
 Scalability: Handling large amounts of data in an efficient manner without
compromising data quality is challenging.
 Time-Variant Data: Data changes over time, and integrating time-varying datasets can
be complex.

Importance of Data Integration and Transformation in Data Mining

 Consistency and Accuracy: Proper integration and transformation ensure that the data
used for mining is consistent and accurate, reducing errors and improving model
performance.
 Efficiency: Well-processed data leads to more efficient and effective mining algorithms,
speeding up the analysis process.
 Improved Insights: Clean, standardized, and well-transformed data provides more
meaningful insights and allows for more reliable decision-making.

Data Reduction in Data Mining

Data reduction is a crucial step in data preprocessing that involves reducing the volume of data
while preserving its integrity and ensuring that the essential information for analysis is retained.
This is important because mining very large datasets can be computationally expensive and time-
consuming. Data reduction techniques help in improving both the efficiency and effectiveness of
data mining algorithms.
1. Data Cube Aggregation

Data cube aggregation is a form of data summarization that allows for the representation of
data at various levels of granularity. It’s commonly used in OLAP (Online Analytical
Processing) systems, where large multidimensional datasets are organized into cubes to facilitate
query responses and analysis.

Key Concepts of Data Cube Aggregation:

 Data Cubes: A data cube is a multidimensional structure that stores aggregated data
across multiple dimensions. Each cell in the cube corresponds to aggregated data (e.g.,
total sales) for specific combinations of dimensions (e.g., time, location, and product).
 Aggregation: This involves summarizing data along different dimensions. For example,
aggregating daily sales into monthly or yearly sales figures.
 Roll-up: Moving from lower-level (more detailed) data to higher-level (more
summarized) data. For example, aggregating individual transactions into quarterly sales
data.
 Drill-down: The reverse of roll-up, where higher-level data is disaggregated into more
detailed levels. For example, breaking down yearly sales data into monthly sales data.
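
A brief pandas sketch of roll-up on a tiny two-dimensional cube (time and region); the sales figures are invented.

    # Roll-up: daily sales aggregated to monthly and then quarterly totals.
    import pandas as pd

    sales = pd.DataFrame({
        "date":    pd.to_datetime(["2023-01-10", "2023-02-14",
                                   "2023-04-02", "2023-05-20"]),
        "region":  ["East", "East", "West", "West"],
        "revenue": [500.0, 700.0, 650.0, 800.0],
    })

    monthly   = sales.groupby([sales["date"].dt.to_period("M"), "region"])["revenue"].sum()
    quarterly = sales.groupby([sales["date"].dt.to_period("Q"), "region"])["revenue"].sum()

    print(monthly)     # finer granularity (drill-down view)
    print(quarterly)   # rolled up one level higher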

Benefits:

 Reduces the size of the dataset by summarizing data, thus improving query performance.
 Provides flexibility for analyzing data at various levels of granularity.

2. Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of attributes (features or dimensions) in the dataset while retaining as much information as possible. It is important for high-dimensional data where many features may be redundant, irrelevant, or correlated.

Methods of Dimensionality Reduction:

1. Principal Component Analysis (PCA):


 PCA is a statistical technique that transforms the original set of attributes into a
new set of orthogonal attributes (principal components) that capture the maximum
variance in the data. The first few principal components capture the most
significant information, allowing the less important components to be discarded.
 Application: It is commonly used for compressing data and visualizing high-
dimensional data in 2D or 3D.
2. Singular Value Decomposition (SVD):
 SVD is a matrix factorization technique that decomposes a data matrix into three
matrices. By retaining only the largest singular values, the dimensionality of the
data can be reduced while keeping most of the important information.
 Application: Often used in recommendation systems and text mining (Latent
Semantic Analysis).
3. Linear Discriminant Analysis (LDA):
 LDA is a supervised learning technique used to reduce dimensionality by finding
a linear combination of features that best separates different classes.
 Application: Typically used in classification problems to improve model
performance and reduce overfitting.
4. Feature Selection:
 Involves selecting a subset of relevant features from the original dataset by
removing irrelevant or redundant features. Common techniques include:
 Filter Methods: Using statistical tests (e.g., correlation, Chi-square test)
to select features.
 Wrapper Methods: Using a specific learning algorithm to evaluate
feature subsets.
 Embedded Methods: Feature selection is integrated with the model
training process (e.g., decision trees).
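
A short PCA sketch with scikit-learn, reducing the 4 features of the built-in iris dataset to 2 principal components; the dataset choice is purely for convenience.

    # Dimensionality reduction with PCA: 4 features -> 2 principal components.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = load_iris().data                           # 150 samples x 4 features
    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                 # (150, 2)
    print(pca.explained_variance_ratio_)   # variance captured by each component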

Benefits:

 Reduces the complexity of models.


 Improves the performance of machine learning algorithms.
 Reduces overfitting, especially in high-dimensional datasets.

3. Data Compression

Data compression aims to reduce the size of the dataset by encoding the data in a more efficient
manner without losing significant information. It helps to save storage space and improves data
transfer efficiency.

Types of Data Compression:

1. Lossless Compression:
 In lossless compression, the original data can be perfectly reconstructed from the
compressed data without any loss of information.
 Methods: Huffman coding, Run-length encoding, Lempel-Ziv-Welch (LZW)
compression.
 Application: Used in text and data transmission where data integrity is crucial.
2. Lossy Compression:
 In lossy compression, some of the data is lost during compression, but the
resulting compressed data is still a close approximation of the original.
 Methods: Discrete Cosine Transform (DCT), Wavelet Transform.
 Application: Commonly used for images, audio, and video where some loss of
quality is acceptable in exchange for greater compression.
3. Vector Quantization:
 This technique reduces the size of the dataset by clustering similar data points into
groups and representing each group by a single prototype (centroid).
 Application: Common in image compression.

Benefits:

 Saves storage and computational resources.


 Speeds up the processing of large datasets.

4. Numerosity Reduction

Numerosity reduction is a technique that reduces the volume of data by representing the dataset
in a more compact form, either by approximation or by generating a reduced set of data.

Methods of Numerosity Reduction:

1. Parametric Methods:
 In parametric methods, the data is modeled using a known distribution, and the
model parameters (rather than the data itself) are stored.
 Example: Regression models (e.g., linear regression, logistic regression), which
represent the data using a mathematical function rather than the original data
points.
 Application: Instead of storing all data points, only the coefficients of the
regression equation are retained, allowing predictions on new data points.
2. Non-Parametric Methods:
 Non-parametric methods do not assume an underlying distribution but instead
reduce the dataset size by using data clustering, histograms, or other
approximation methods.
 Clustering: Groups similar data points into clusters, and each cluster is
represented by its centroid or representative data point.
 Sampling: A smaller, representative subset of the data is selected to approximate
the whole dataset. Sampling can be random, stratified, or systematic.
 Histograms: Data is partitioned into bins, and the frequency of data points in
each bin is stored, reducing the dataset size.

Benefits:

 Parametric methods allow for data reduction by modeling relationships between data
points, requiring only a few parameters.
 Non-parametric methods reduce data volume through sampling, clustering, or
summarization, which makes data processing more manageable.
Discretization and Concept Hierarchy Generation in Data Mining
In data mining, discretization and concept hierarchy generation are techniques used for
transforming continuous data into a form that can be more easily and effectively analyzed. These
methods simplify data representation, which is particularly important when dealing with complex
and large datasets, especially for classification, clustering, and decision tree algorithms.

1. Discretization

Discretization is the process of converting continuous attributes or variables into discrete intervals or categories. This transformation helps in simplifying complex data, reducing its granularity, and making it suitable for certain data mining algorithms (e.g., decision trees, association rule mining), which work better with categorical data than with continuous data.

Why Discretization is Important:

 Simplifies Analysis: By reducing continuous data into categories or intervals, the data
becomes easier to analyze.
 Improves Algorithm Performance: Many algorithms, like decision trees, perform better
with categorical data.
 Facilitates Interpretability: Discretized data can be more easily understood by non-
experts compared to raw continuous data.

Types of Discretization:

a. Supervised Discretization

 In supervised discretization, the class label (target variable) is used to guide the process
of dividing the continuous values into intervals.
 Methods:
1. Entropy-Based Discretization:
 Uses information gain (entropy) to determine the boundaries of the
intervals. The goal is to split the data in a way that maximizes the
separation between different class labels.
2. ChiMerge:
 A statistical method based on the Chi-square test that merges adjacent
intervals until the merged intervals no longer improve the class label
distribution significantly.

b. Unsupervised Discretization

 In unsupervised discretization, the process of creating intervals or categories does not rely on the class labels.
 Methods:
1. Equal-Width Binning:
 The range of the continuous attribute is divided into intervals of equal
width. For example, if the range of a continuous attribute is [0, 100], it can
be split into 5 equal-width bins of size 20 (e.g., [0-20], [21-40], etc.).
 Advantages: Simple and easy to implement.
 Disadvantages: Can produce poor results if the data is not uniformly
distributed.
2. Equal-Frequency Binning:
 The data is divided into bins such that each bin contains approximately the
same number of data points. For example, if there are 100 data points and
you want 5 bins, each bin will contain 20 data points.
 Advantages: Ensures balanced bins in terms of data points.
 Disadvantages: Bins can have varying widths, which may be harder to
interpret.

c. Recursive Discretization (Top-Down Approach)

 This is a top-down, hierarchical method where the continuous values are recursively split
into smaller intervals. The process stops when a stopping criterion, like a maximum
number of bins or a minimum information gain, is reached.

d. Cluster-Based Discretization

 Uses clustering techniques like k-means or hierarchical clustering to group continuous values into clusters (which act as bins), effectively discretizing the data.

2. Concept Hierarchy Generation

Concept hierarchy generation refers to the process of organizing data into multiple levels of
abstraction, where higher-level concepts represent generalizations of lower-level concepts.
Concept hierarchies play a crucial role in data summarization and data mining, especially for
tasks like roll-up and drill-down operations in OLAP systems and for knowledge discovery.

Types of Concept Hierarchies:

a. Manual Concept Hierarchy Generation

 This involves human experts defining the hierarchy based on domain knowledge. For
example, in a retail dataset, a hierarchy for product categories may be manually defined
as follows:
 Level 1: All products
 Level 2: Electronics, Clothing, Food, etc.
 Level 3: Smartphones, Laptops, TVs (under Electronics)
b. Automatic Concept Hierarchy Generation

 Automatic methods build the hierarchy based on the data itself, often using attribute-
value relationships, clustering, or generalization rules.
 Methods:
1. Binning-Based Approach: Concept hierarchies can be generated using binning
(as described in discretization). For example, a numeric attribute "age" can be
grouped into different age ranges (e.g., 0-18 = children, 19-35 = adults, etc.).
2. Clustering-Based Approach: A clustering algorithm can be applied to group
similar values or entities into clusters. These clusters can then represent higher-
level concepts in the hierarchy.
3. Data Aggregation: Aggregating detailed data into summary levels based on
common attributes. For example, daily sales can be aggregated into monthly,
quarterly, or yearly sales, creating a hierarchy for time-based data.

Methods of Concept Hierarchy Generation:

a. Specification of a Set of Attributes

 Concept hierarchies can be generated by defining a set of attributes and ordering them
from general to specific.
 Example:
 Consider a dataset containing information about geographic locations.
 Hierarchy:
 Country → State → City → Zip Code

b. Specification of Partial Ordering of Attributes

 A hierarchy can be generated by defining a partial order on a set of attributes based on a generalization relationship.
 Example:
 For employee data, a hierarchy can be created as follows:
 Employee → Department → Division → Company

c. Discretization of Numeric Attributes

 Numeric attributes (like age or income) can be discretized into intervals, forming a
hierarchy.
 Example:
 Age: (0-18: children, 19-35: adults, 36-60: middle-aged, 61+ : seniors).
Advantages of Discretization and Concept Hierarchy Generation:

For Discretization:

1. Improves Data Mining Efficiency: Discretized data is easier to process, particularly for
algorithms that require categorical data, such as decision trees.
2. Simplifies Model Interpretation: Categories or intervals are often easier for humans to
interpret compared to continuous data.
3. Reduces Model Complexity: Discretization can reduce noise and help prevent
overfitting by reducing the precision of the data.

For Concept Hierarchy Generation:

1. Supports Roll-Up and Drill-Down: Concept hierarchies allow for flexible analysis of
data at various levels of abstraction.
2. Improves Data Summarization: Hierarchies enable summarization of data, helping in
data generalization tasks.
3. Enhances Knowledge Discovery: By working at higher levels of abstraction, it becomes
easier to identify patterns and relationships that may not be apparent at lower levels.

Examples:

Discretization Example:

 Continuous Attribute: Income (range: $0 - $100,000)


 Equal-Width Binning:
 Bin 1: $0 - $25,000
 Bin 2: $25,001 - $50,000
 Bin 3: $50,001 - $75,000
 Bin 4: $75,001 - $100,000
 Equal-Frequency Binning:
 Bin 1: First 25% of income earners
 Bin 2: Second 25% of income earners
 Bin 3: Third 25% of income earners
 Bin 4: Last 25% of income earners
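
The income example above can be reproduced with pandas, as sketched below; the income values are randomly generated.

    # Equal-width and equal-frequency binning of a synthetic income column.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    income = pd.Series(rng.uniform(0, 100_000, size=20).round(2))

    # Equal-width: four $25,000-wide intervals over [0, 100,000]
    equal_width = pd.cut(income, bins=[0, 25_000, 50_000, 75_000, 100_000],
                         include_lowest=True)

    # Equal-frequency: four bins, each holding ~25% of the earners
    equal_freq = pd.qcut(income, q=4)

    print(pd.DataFrame({"income": income,
                        "width_bin": equal_width,
                        "freq_bin": equal_freq}))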

Concept Hierarchy Example:

 Geographical Data:
 Level 1: World
 Level 2: Continents (e.g., North America, Europe)
 Level 3: Countries (e.g., USA, Canada, Germany)
 Level 4: States/Provinces (e.g., California, Ontario)
 Level 5: Cities (e.g., Los Angeles, Toronto)
Decision Tree in Data Mining
A decision tree is a popular supervised learning algorithm used for both classification and
regression tasks. In data mining, it is primarily used for classification, where it helps in decision-
making by breaking down complex data into smaller, manageable chunks through a hierarchical
structure resembling a tree.

The goal of a decision tree is to create a model that predicts the value of a target variable based
on several input features. Each internal node of the tree represents a decision based on a feature,
and each leaf node represents the final classification or output. Decision trees are intuitive, easy
to understand, and interpret, making them a popular choice for many data mining tasks.

Key Concepts of Decision Tree

1. Root Node

 The root node is the top-most node in the tree, where the data starts being split based on
the feature that provides the highest information gain or the most significant decision
factor.

2. Decision Node

 A decision node represents a feature or attribute based on which the data is split further.
At each decision node, a condition (usually a comparison or threshold) is used to branch
the data.

3. Leaf Node (Terminal Node)

 Leaf nodes represent the final outcome or classification. After a series of decisions, the
data reaches a leaf node, which provides the prediction or class label.

4. Splitting

 Splitting is the process of dividing a dataset into subsets based on a particular feature or
attribute. The goal is to split the data in such a way that each subset is as homogeneous as
possible with respect to the target variable.

5. Pruning

 Pruning is the process of removing unnecessary branches from the tree to reduce
complexity and prevent overfitting. Pruned trees are simpler and often generalize better
on unseen data.
6. Branch

 A branch is a sub-tree within the decision tree. It represents the path from a decision node
to another node (either another decision node or a leaf node).

7. Impurity Measures

 Impurity measures how well a split separates the classes. The more homogenous the
resulting subsets, the lower the impurity.
 Common impurity measures include:
 Gini Index: Measures the probability that a randomly chosen element would be misclassified if it were labelled according to the class distribution.
 Entropy: Used in information gain to decide splits in decision trees. It measures
the uncertainty or impurity in a dataset.
 Variance: Used for regression tasks to measure the spread of the target values.
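
The two classification measures can be computed directly from a list of class labels, as in the small sketch below; the label counts are arbitrary.

    # Gini index and entropy of a set of class labels (illustrative helpers).
    from collections import Counter
    from math import log2

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    labels = ["Yes", "Yes", "No", "No", "No"]   # 2 Yes, 3 No
    print(gini(labels))      # 1 - (0.4^2 + 0.6^2) = 0.48
    print(entropy(labels))   # -(0.4*log2(0.4) + 0.6*log2(0.6)) ≈ 0.971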

Types of Decision Trees

1. Classification Trees

 In classification trees, the target variable is categorical, and the tree assigns each data
point to a specific category or class. The output at the leaf node represents the class label.

2. Regression Trees

 In regression trees, the target variable is continuous. The tree predicts a numeric value at
the leaf nodes. These trees are used for problems where the goal is to predict a continuous
output, such as predicting house prices or stock values.

Steps to Build a Decision Tree

1. Select the Best Attribute/Feature for Splitting


 At each step, the algorithm selects the feature that best separates the data into
distinct classes.
 Techniques like Information Gain (based on entropy) or Gini Index are used to
determine the best feature.
2. Create Decision Nodes and Branches
 Based on the selected attribute, the data is split into subsets, and decision nodes
are created for each subset.
3. Repeat for Subsets
 Recursively repeat the process for each subset, selecting the best feature for
splitting and creating more decision nodes or leaf nodes until:
 All the data in a subset belongs to a single class.
 There are no more features to split on.
 A predefined stopping criterion is reached (e.g., maximum depth of the
tree or minimum number of samples per leaf).
4. Pruning (Optional)
 Once the tree is built, pruning can be applied to remove branches that provide
little to no additional value in terms of accuracy or performance.

Advantages of Decision Trees

 Easy to Understand and Interpret: Decision trees provide a clear and visual
representation of the decision-making process.
 Non-Parametric: Decision trees do not make assumptions about the underlying data
distribution.
 Handles Both Categorical and Continuous Data: Decision trees can handle various
types of input features (e.g., numeric, categorical).
 Feature Importance: Decision trees automatically rank features by importance, helping
in feature selection.

Disadvantages of Decision Trees

 Overfitting: Decision trees can easily become too complex, especially if they are not
pruned. Overfitting can lead to poor generalization on unseen data.
 Instability: Small changes in the data can lead to significant changes in the tree structure.
 Bias Toward Features with More Levels: Decision trees can be biased towards features
with more categories or levels, which might lead to overestimating their importance.

Decision Tree Algorithms

There are several algorithms to construct decision trees, and each uses different criteria for
selecting the best split:

1. ID3 (Iterative Dichotomiser 3)

 The ID3 algorithm uses information gain (based on entropy) to select the attribute that
best separates the data at each step.
2. C4.5

 An extension of ID3, C4.5 handles both categorical and continuous data, deals with missing values, and can prune the tree after it is built.

3. CART (Classification and Regression Trees)

 CART constructs binary trees (each node has only two branches) and uses Gini Index
for classification tasks and mean squared error for regression tasks.

Example of a Decision Tree

Predicting Whether a Person Buys a Sports Car

We want to predict whether a person will buy a sports car based on Age and Income.

Dataset:

Person   Age           Income   Buys Sports Car?
1        Young         High     Yes
2        Young         Medium   No
3        Middle-aged   Medium   Yes
4        Senior        Low      No
5        Senior        Medium   No
6        Senior        High     No
7        Middle-aged   High     Yes
8        Young         Low      No
9        Young         Medium   No
10       Middle-aged   Low      No
Step-by-Step:

1. Choose the Best Attribute for the Root Node:


 Calculate Information Gain or Gini Index for both attributes (Age and Income).
 Suppose Age has the highest information gain, so we use it as the root node.
2. Split by Age:
 Young: 1 Yes, 3 No → Further split by Income.
 Middle-aged: 2 Yes, 1 No → Most are Yes, so this branch predicts Yes.
 Senior: 0 Yes, 3 No → All are No, so this branch predicts No.
3. Further Split for Young Group:
 Income = High: Buys sports car (Yes).
 Income = Low or Medium: Does not buy a sports car (No).

Final Tree Structure:

Age
├── Young → check Income
│     ├── High → Yes
│     └── Medium/Low → No
├── Middle-aged → Yes
└── Senior → No

This tree is simple and interpretable, providing clear decision rules:

 If the person is Middle-aged, they are likely to buy a sports car.


 If the person is Senior, they will not buy a sports car.
 If the person is Young and has a High Income, they will buy a sports car; otherwise,
they won’t.
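
The same toy dataset can be fitted with scikit-learn's DecisionTreeClassifier, as sketched below. The categorical features are one-hot encoded first; note that the learned tree may split the Middle-aged branch further (on Income), since scikit-learn keeps splitting until nodes are pure.

    # Fitting the toy "buys sports car" dataset with a decision tree (entropy criterion).
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = pd.DataFrame({
        "Age":    ["Young", "Young", "Middle-aged", "Senior", "Senior", "Senior",
                   "Middle-aged", "Young", "Young", "Middle-aged"],
        "Income": ["High", "Medium", "Medium", "Low", "Medium", "High",
                   "High", "Low", "Medium", "Low"],
        "Buys":   ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No"],
    })

    X = pd.get_dummies(data[["Age", "Income"]])   # one-hot encode categorical features
    y = data["Buys"]

    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))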

Practical Use of Decision Trees in Data Mining

1. Customer Segmentation: Decision trees can be used to classify customers based on purchasing behavior, demographic information, and other features.
2. Churn Prediction: Telecom or banking industries use decision trees to predict which
customers are likely to leave based on their usage patterns and behaviors.
3. Credit Scoring: Banks use decision trees to assess the likelihood of a borrower
defaulting on a loan.
4. Fraud Detection: In finance and insurance, decision trees can help detect fraudulent
transactions by identifying patterns in past fraudulent activities.
