Data Warehousing & Data Mining Unit-3 Notes
Data Mining
Overview of Data Mining:
Data mining is the process of discovering patterns and extracting valuable insights from large
sets of data. It combines techniques from statistics, machine learning, and database systems to
analyze and interpret complex data structures. Here's an overview of its key components and
applications:
Key Components
1. Data Preprocessing:
Cleaning: Removing noise and inconsistencies from the data.
Transformation: Normalizing or aggregating data to improve analysis.
Reduction: Simplifying data without losing significant information, often through
techniques like dimensionality reduction.
2. Data Exploration:
Using statistical methods to understand the data distribution, trends, and
relationships.
3. Modeling:
Applying algorithms to identify patterns and make predictions. Common
techniques include:
Classification: Assigning labels to data points based on input features.
Regression: Predicting continuous values.
Clustering: Grouping similar data points together.
Association Rule Learning: Discovering relationships between variables
in large databases.
4. Evaluation:
Assessing the accuracy and effectiveness of the models using metrics like
precision, recall, and F1-score.
5. Deployment:
Integrating the model into production systems for real-time analysis or decision-making.
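The workflow above can be illustrated end to end with a minimal scikit-learn sketch. The tiny in-memory dataset, the two feature columns, and the choice of logistic regression are illustrative assumptions, not part of these notes.

```python
# Minimal preprocess -> model -> evaluate sketch with scikit-learn.
# The dataset, feature meanings, and model choice are hypothetical.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Hypothetical features (e.g., purchase amount, visits per month) and a binary label.
X = [[120.0, 3], [15.5, 1], [300.0, 8], [42.0, 2], [260.0, 7], [18.0, 1]]
y = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),    # data transformation: normalization
    ("clf", LogisticRegression()),  # modeling: classification
])
model.fit(X_train, y_train)

# Evaluation with precision, recall, and F1-score.
print(classification_report(y_test, model.predict(X_test), zero_division=0))
```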
Applications
Marketing: Analyzing customer behavior for targeted advertising and sales strategies.
Finance: Fraud detection, credit scoring, and risk management.
Healthcare: Predicting disease outbreaks, patient diagnosis, and treatment effectiveness.
Retail: Inventory management, customer segmentation, and personalized
recommendations.
Manufacturing: Predictive maintenance and quality control.
Challenges
Common challenges include handling very large and high-dimensional datasets, ensuring data quality, protecting privacy, and interpreting the resulting models.
Motivation for Data Mining
The motivation behind data mining stems from the increasing volume of data generated in
various fields and the need to extract meaningful insights from that data. Here are some key
reasons driving the interest in data mining:
1. Data Explosion
Volume of Data: With the rise of digital technologies, organizations are inundated with
vast amounts of data from sources like social media, sensors, transactions, and more.
Variety of Data: Data comes in different formats (structured, unstructured, semi-structured), requiring sophisticated techniques to analyze.
2. Competitive Advantage
Informed Decision-Making: Companies use data mining to gain insights that guide
strategic decisions, helping them stay ahead of competitors.
Personalization: Tailoring products and services to individual customer preferences can
enhance customer loyalty and satisfaction.
3. Improving Efficiency
Process Optimization: Analyzing operational data helps organizations identify bottlenecks and inefficiencies, streamline processes, and reduce costs.
4. Risk Management
Risk Identification: Data mining supports fraud detection, credit risk assessment, and early identification of potential problems.
5. Scientific Research
Pattern Discovery: Researchers use data mining to uncover new patterns in complex datasets, aiding advancements in fields like genomics, climate science, and social sciences.
6. Real-Time Analysis
Timely Insights: With tools that support real-time data analysis, organizations can make quick, informed decisions that respond to changing circumstances.
7. Predictive Analytics
Forecasting Trends: Businesses can anticipate future trends and behaviors, allowing
them to proactively adjust their strategies.
8. Innovative Solutions
New Product Development: Insights from data mining can lead to innovative product
features or entirely new products that meet emerging customer needs.
9. Societal Benefits
Public Health: Data mining is used to analyze health trends, predict outbreaks, and improve patient care.
Policy Making: Governments can use data insights to develop policies that better serve their populations.
Data Mining Functionalities
Data mining is the process of discovering patterns, correlations, and insights from large sets of data using statistical and computational techniques. It extracts meaningful information from vast amounts of raw data, enabling organizations to make data-driven decisions and predictions.
Data mining encompasses several key functionalities, each aimed at extracting a different type of knowledge from data. The primary functionalities are:
1. Classification:
Assigning items in a dataset to target categories or classes based on input features.
Common algorithms: Decision Trees, Random Forest, Support Vector Machines,
Neural Networks.
Example: Email filtering (spam vs. non-spam), credit scoring.
2. Regression:
Predicting a continuous numeric value based on input variables.
Common algorithms: Linear Regression, Polynomial Regression, Regression
Trees.
Example: Forecasting sales revenue, predicting housing prices.
3. Clustering:
Grouping a set of objects in such a way that objects in the same group (cluster)
are more similar to each other than to those in other groups.
Common algorithms: K-means, Hierarchical clustering, DBSCAN.
Example: Customer segmentation, grouping similar documents (see the clustering sketch after this list).
4. Association Rule Learning:
Discovering interesting relationships or associations among a set of items in large
datasets.
Common algorithms: Apriori, Eclat.
Example: Market basket analysis (finding products that are frequently purchased
together).
5. Anomaly Detection (Outlier Detection):
Identifying rare items, events, or observations which raise suspicions by differing
significantly from the majority of the data.
Common techniques: Statistical tests, Isolation Forest, One-Class SVM.
Example: Fraud detection in transactions, network security breaches.
6. Sequential Pattern Mining:
Identifying regular sequences or trends in data that occur over time.
Common algorithms: GSP, SPADE.
Example: Analyzing purchase patterns over time, web page traversal paths.
7. Dependency Modeling:
Modeling the dependencies between variables in a dataset to understand how
changes in one variable affect others.
Techniques include Bayesian networks and graphical models.
Example: Understanding the relationship between marketing spend and sales.
8. Text Mining:
Extracting meaningful information from unstructured text data using natural
language processing (NLP) techniques.
Example: Sentiment analysis, topic modeling, document classification.
9. Time Series Analysis:
Analyzing time-ordered data points to extract meaningful statistics and
characteristics.
Common techniques: ARIMA, Seasonal decomposition, Exponential smoothing.
Example: Stock price forecasting, demand forecasting.
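As a small illustration of the clustering functionality listed above, the sketch below segments customers with K-means. The two features (annual spend, visits per month) and their values are made up.

```python
# Customer-segmentation sketch with K-means; the feature values are made up.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200,  2], [220,  3], [250,  2],   # low-spend, infrequent visitors
    [900, 10], [950, 12], [880, 11],   # high-spend, frequent visitors
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centroid (prototype) of each segment
```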
Data Processing:
Data processing in data mining is a crucial step that involves preparing and transforming raw
data into a format suitable for analysis. This process typically consists of several key stages:
1. Data Collection
Gathering data from various sources, which can include databases, data warehouses, web
scraping, or sensors.
2. Data Cleaning
Identifying and correcting errors or inconsistencies in the data. This may involve:
Removing duplicates
Filling in missing values
Correcting inaccuracies
3. Data Integration
Combining data from different sources to create a cohesive dataset. This may include resolving schema differences and merging records that refer to the same entity.
4. Data Transformation
Modifying the data to make it suitable for analysis. This can involve normalization, aggregation, and generalization.
5. Data Reduction
Reducing the dataset's size while preserving its integrity, which may include dimensionality reduction, data compression, and numerosity reduction.
6. Data Discretization
Transforming continuous data into discrete categories, which can make certain types of analysis
easier.
7. Data Exploration
Using descriptive statistics and visualization techniques to understand the data’s characteristics
and identify patterns or anomalies.
8. Data Mining (Modeling)
Applying various algorithms and methods (e.g., clustering, classification, association rules) to extract meaningful insights and patterns from the prepared data.
Data preprocessing is a crucial step in data mining to prepare raw data for analysis and model
development. The main goal of preprocessing is to improve the quality of the data and make it
more suitable for mining processes. Below are common forms of data preprocessing:
1. Data Cleaning
Handling Missing Data: Filling in missing values (e.g., mean, median, mode
imputation), removing incomplete records, or using interpolation methods.
Outlier Detection and Removal: Identifying and either removing or adjusting data
points that are far outside the norm.
Handling Noisy Data: Smoothing noisy data using techniques like binning, clustering, or
regression.
Correcting Data Inconsistencies: Fixing problems like incorrect formats, redundancies,
or duplicate entries.
2. Data Integration
Combining Data from Multiple Sources: Merging data from different databases or files
to create a cohesive dataset.
Resolving Schema Differences: Harmonizing different schemas to integrate multiple
data sources.
Data Transformation: Making sure that the structure and format of data across sources
are consistent.
3. Data Transformation
Converting data into suitable formats for mining, for example through normalization, smoothing, aggregation, and generalization.
4. Data Reduction
Reducing the volume of data while preserving essential information, for example through dimensionality reduction, numerosity reduction, and data compression.
5. Data Discretization
Transforming continuous attributes into discrete intervals or categories, for example by binning or concept hierarchy generation.
6. Data Aggregation
Summarizing Data: Aggregating multiple data points into meaningful summaries, such
as averaging, summing, or computing the median for large datasets.
Time-Based Aggregation: For time-series data, aggregating by specific time intervals
(e.g., daily, monthly).
7. Data Encoding
Converting categorical values into numerical form, for example through label encoding or one-hot encoding.
8. Data Balancing
Handling Imbalanced Data: Using techniques like over-sampling (e.g., SMOTE),
under-sampling, or generating synthetic samples to balance classes in classification
problems.
Each of these steps is critical in transforming raw data into a form suitable for analysis, helping
improve model accuracy and reliability in data mining.
Data Cleaning
Data cleaning is an essential process in data mining, ensuring the quality and consistency of the
dataset. It deals with correcting or removing inaccurate records, handling missing values, and
managing noisy data. Here are some of the key methods for handling missing values and noisy
data:
1. Handling Missing Values
Missing data is a common problem in datasets. Several methods are used to deal with missing values, depending on the nature of the data and the extent of the missing entries: removing incomplete records, filling in values with the mean, median, or mode (imputation), or estimating values through interpolation.
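A minimal pandas sketch of these options is shown below; the small DataFrame and its column names are hypothetical.

```python
# Handling missing values; the DataFrame and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 34, 29, None],
    "income": [40000, 52000, None, 61000, 45000],
})

removed = df.dropna()                                   # option 1: drop incomplete records
imputed = df.fillna({"age": df["age"].mean(),           # option 2: mean imputation
                     "income": df["income"].median()})  #           median imputation
print(removed, imputed, sep="\n\n")
```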
2. Handling Noisy Data
Noisy data refers to data that contains errors, outliers, or irrelevant values. It can distort data mining processes, leading to inaccurate results. Several techniques are used to reduce or eliminate noise from the data.
a. Binning
Binning is a simple and effective technique for smoothing noisy data by grouping values
into bins or intervals and then replacing the values with representative values for each
bin.
Equal-width binning: Divides the range of data into equal-sized intervals.
Equal-frequency binning: Divides the data into intervals containing an equal number of
data points.
Smoothing by bin means: The values in a bin are replaced by the mean of the bin.
Smoothing by bin median: The values in a bin are replaced by the median value.
Smoothing by bin boundaries: The values near a boundary are replaced by the
boundary values.
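A minimal pandas sketch of equal-width binning and smoothing by bin means follows; the price values are made up.

```python
# Equal-width binning and smoothing by bin means; the price values are made up.
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.cut(prices, bins=3)                       # equal-width binning into 3 intervals
smoothed = prices.groupby(bins).transform("mean")   # replace each value with its bin mean
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```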
b. Clustering
Clustering groups data into clusters (groups of similar objects). Outliers and noise are
often located in sparse regions that are far from dense clusters.
Detecting Outliers: Data points that do not belong to any cluster can be identified as
outliers.
Removing Noise: The clusters of noise or points far from the core clusters can be
removed or examined for correction.
c. Regression
Regression is a statistical technique that can be used to detect noise in data and smooth it
out by fitting a function to the data points.
Linear Regression: A linear model is fitted to two variables (x and y), and noise is
detected as deviations from this linear relationship.
Multiple Regression: Extends this concept to multiple variables, where a more complex
model is fitted, and deviations from the model indicate noise.
d. Computer and Human Inspection
Suspicious or outlying values can be flagged automatically by the computer and then reviewed and corrected manually by a human analyst.
Inconsistent Data
Inconsistent data refers to data that contains discrepancies or contradictions, making it unreliable or difficult to use for analysis. Inconsistencies can arise for a variety of reasons, such as data entry errors, integration of multiple data sources, or system updates. Inconsistent data can lead to incorrect conclusions in data mining, so identifying and rectifying these inconsistencies is critical.
Causes of Inconsistent Data
1. Multiple Data Sources: When data is gathered from different sources, there may be different formats, units of measure, or naming conventions. For example, one source may use "Male/Female" for gender, while another uses "M/F."
2. Data Redundancy: The same data may be recorded multiple times in different ways,
leading to discrepancies. For example, a customer may have multiple accounts with slight
variations in their address.
3. Time-Based Inconsistencies: Data that changes over time may lead to inconsistencies
when older records are not updated correctly. For example, if a customer changes their
address, and only some records reflect this change.
4. Data Entry Errors: Manual input can introduce errors such as misspellings, incorrect
formats, or misplaced decimal points.
5. System Upgrades: When databases are updated or systems are migrated, data
inconsistencies can emerge due to different formats or missing data mappings.
Types of Inconsistencies
1. Contradictory Data:
Data that provides conflicting information. For example, if one part of the dataset
says "John Smith" lives in New York, but another part says he lives in Los
Angeles.
2. Different Units of Measurement:
Inconsistent use of units can lead to errors. For example, weight may be recorded
in kilograms in one part of the dataset and in pounds in another.
3. Duplicate Records:
Multiple entries for the same entity with slight differences. For example, one
record may list "Jane Doe" and another "J. Doe" with the same address and phone
number.
4. Inconsistent Formats:
Data that is formatted differently in various parts of the dataset. For example, one column might store dates as "MM/DD/YYYY" while another uses "DD-MM-YYYY."
5. Violation of Business Rules:
Data that contradicts predefined business rules or constraints. For instance, a
customer may have a date of birth that makes them 150 years old, which violates
the business logic.
Techniques for Handling Inconsistent Data
1. Data Standardization:
Consistency in Units: Ensure all data follows the same units of measurement. For
example, converting all weights to kilograms or all temperatures to Celsius.
Consistent Formats: Ensure that formats (like date formats, currency, etc.) are
consistent across the dataset.
2. Data Deduplication:
Identifying and removing duplicate records. Tools can be used to detect near-duplicates based on similarity measures such as edit distance or phonetic similarity (see the sketch after this list).
3. Schema Matching:
Aligning fields from different data sources. For example, if one source labels a
field as "Customer ID" and another as "Client Number," schema matching would
ensure both are recognized as the same attribute.
4. Business Rule Enforcement:
Implementing rules and constraints that enforce consistency. For example,
ensuring that age values are within a reasonable range or that state abbreviations
conform to a standardized list.
5. Data Auditing:
Performing automated or manual checks to detect inconsistencies. Anomalies can
be flagged for further investigation or correction.
6. Data Cleansing Tools:
Many tools, such as ETL (Extract, Transform, Load) systems, offer capabilities to
detect and resolve inconsistencies automatically, such as:
Identifying duplicate records.
Matching different formats and values.
Correcting misspellings or invalid entries.
7. Human Inspection:
In cases where automatic methods cannot resolve inconsistencies, human
intervention might be required. This involves subject-matter experts reviewing
and correcting inconsistent records manually.
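A minimal sketch of near-duplicate detection (as referenced under Data Deduplication above) using the Python standard library is shown below; the sample records and the 0.8 similarity threshold are illustrative assumptions.

```python
# Near-duplicate detection with a simple string-similarity measure.
# The records and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

records = ["Jane Doe, 12 Main St", "J. Doe, 12 Main Street", "John Smith, 45 Oak Ave"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score >= 0.8:  # flag likely duplicates for review or merging
            print(f"Possible duplicate ({score:.2f}): {records[i]!r} <-> {records[j]!r}")
```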
Examples of Inconsistent Data
Contradictory Data: Two records showing different values for the same attribute. Example: "Revenue for Q1 2023" is listed as $10,000 in one source but $15,000 in another.
Formatting Differences: Phone numbers stored as "123-456-7890" in one record and
"(123) 456 7890" in another, which could create confusion.
Outdated Information: An address that is no longer valid because the customer moved
but the older records weren't updated.
Inconsistent data can severely impact the results of data mining processes, leading to unreliable models and incorrect conclusions.
Data Integration and Transformation
Data Integration and Transformation are two essential steps in the data preprocessing stage of data mining. These processes ensure that data from various sources are combined, standardized, and transformed into a format suitable for analysis and mining.
1. Data Integration
Data integration refers to the process of combining data from different sources into a unified
dataset. The goal is to ensure consistency and coherence so that the integrated data can be
analyzed as a whole. Data integration can be challenging when dealing with heterogeneous data
from multiple systems, such as databases, flat files, spreadsheets, or different applications.
a. Schema Integration
Schema integration involves merging different data sources with varying schemas (database structures) into a unified view.
Challenges: Resolving issues like synonymity (where different terms mean the same
thing, e.g., "Client ID" vs. "Customer ID") and homonymity (where the same term means
different things, e.g., "State" referring to location vs. status).
b. Entity Identification Problem
Different sources may use different naming conventions for the same entities. For example, "SSN" in one system might correspond to "Social Security Number" in another.
Methods such as record linkage and fuzzy matching are often used to identify the same
entity across different datasets.
c. Data Redundancy
Data redundancy occurs when the same data is represented in multiple sources. This can
happen when integrating multiple databases, where the same customer or record is stored
in different ways.
Solution: Use data integration techniques such as duplicate elimination or correlation
analysis to detect and remove redundant data.
d. Detecting and Resolving Data Value Conflicts
Conflicts arise when different sources contain contradictory data. For instance, the "Salary" field for a particular employee may have two different values in different datasets.
Approach:
Rule-based approaches: Set predefined rules (e.g., trust a specific data source
over others).
Statistical methods: Use statistical analysis to resolve conflicting data (e.g.,
choosing the most frequent value).
2. Data Transformation
Data transformation is the process of converting data into a suitable format for mining and
analysis. Once data is integrated, it often needs to be cleaned, formatted, and transformed to meet
the specific needs of the mining algorithms.
a. Data Smoothing
Smoothing removes noise and inconsistencies from the data, making it more regular and
less prone to outliers.
Methods:
Binning (grouping data into bins and replacing values with the bin
mean/median).
Clustering (grouping similar data points and replacing outliers with cluster
values).
Regression-based methods.
b. Data Aggregation
Aggregation involves summarizing or combining data to reduce the volume but retain
essential information. This is especially useful for data mining in cases where granularity
needs to be reduced.
Example: Aggregating sales data from individual transactions into monthly or
yearly totals.
c. Data Generalization
Generalization converts low-level data into higher-level concepts by grouping data based
on higher-level attributes or concepts.
Example: Converting age values into age ranges (e.g., “18-24,” “25-34”) for
analysis.
d. Data Normalization
Normalization involves scaling data to a standard range or format. This step is important
when data attributes have different scales, as many data mining algorithms require
normalized data.
Techniques:
Min-max normalization: Rescales values between a specified range (e.g.,
[0, 1]).
Z-score normalization (Standardization): Centers the data around the
mean (mean = 0) with a standard deviation of 1.
Decimal scaling: Moves the decimal point of values to standardize them.
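The three techniques can be sketched in a few lines of NumPy; the sample values below are made up.

```python
# Min-max, z-score, and decimal-scaling normalization; the sample values are made up.
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

min_max = (x - x.min()) / (x.max() - x.min())     # rescaled to [0, 1]
z_score = (x - x.mean()) / x.std()                # mean 0, standard deviation 1

j = int(np.floor(np.log10(np.abs(x).max()))) + 1  # smallest j with max(|x|) / 10**j < 1
decimal = x / 10 ** j                             # decimal scaling

print(min_max, z_score, decimal, sep="\n")
```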
e. Data Discretization
Continuous attributes are converted into discrete intervals or categories (discussed in detail later in this unit).
f. Feature Construction
Feature construction involves creating new features from existing ones to make patterns
more visible and improve model performance.
Example: Creating a "BMI" (Body Mass Index) variable from height and weight
data.
g. Data Encoding
Categorical data may need to be converted into numerical forms for use in certain
machine learning algorithms.
Label encoding: Assigns a unique integer to each category.
One-hot encoding: Converts categories into binary vectors for each distinct
category.
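A minimal pandas sketch of both encodings follows; the category values are made up.

```python
# Label encoding and one-hot encoding of a categorical column; values are made up.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

df["color_label"] = df["color"].astype("category").cat.codes  # label encoding: one integer per category
one_hot = pd.get_dummies(df["color"], prefix="color")          # one-hot: one binary column per category
print(pd.concat([df, one_hot], axis=1))
```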
Challenges of Data Integration and Transformation
Heterogeneous Sources: Data may come from various systems with different formats, types, and structures, making integration difficult.
Data Inconsistencies: Conflicting data may exist in different sources, requiring conflict
resolution.
Scalability: Handling large amounts of data in an efficient manner without
compromising data quality is challenging.
Time-Variant Data: Data changes over time, and integrating time-varying datasets can
be complex.
Importance of Data Integration and Transformation
Consistency and Accuracy: Proper integration and transformation ensure that the data used for mining is consistent and accurate, reducing errors and improving model performance.
Efficiency: Well-processed data leads to more efficient and effective mining algorithms,
speeding up the analysis process.
Improved Insights: Clean, standardized, and well-transformed data provides more
meaningful insights and allows for more reliable decision-making.
Data Reduction
Data reduction is a crucial step in data preprocessing that involves reducing the volume of data while preserving its integrity and ensuring that the essential information for analysis is retained. This is important because mining very large datasets can be computationally expensive and time-consuming. Data reduction techniques help in improving both the efficiency and effectiveness of data mining algorithms.
1. Data Cube Aggregation
Data cube aggregation is a form of data summarization that allows for the representation of
data at various levels of granularity. It’s commonly used in OLAP (Online Analytical
Processing) systems, where large multidimensional datasets are organized into cubes to facilitate
query responses and analysis.
Data Cubes: A data cube is a multidimensional structure that stores aggregated data
across multiple dimensions. Each cell in the cube corresponds to aggregated data (e.g.,
total sales) for specific combinations of dimensions (e.g., time, location, and product).
Aggregation: This involves summarizing data along different dimensions. For example,
aggregating daily sales into monthly or yearly sales figures.
Roll-up: Moving from lower-level (more detailed) data to higher-level (more
summarized) data. For example, aggregating individual transactions into quarterly sales
data.
Drill-down: The reverse of roll-up, where higher-level data is disaggregated into more
detailed levels. For example, breaking down yearly sales data into monthly sales data.
Benefits:
Reduces the size of the dataset by summarizing data, thus improving query performance.
Provides flexibility for analyzing data at various levels of granularity.
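A minimal pandas sketch of a roll-up (aggregating daily records into monthly totals per region) is shown below; the sales figures are made up.

```python
# Roll-up sketch: aggregate daily sales into monthly totals; the figures are made up.
import pandas as pd

daily = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-25"]),
    "region": ["North", "North", "South", "South"],
    "sales":  [120.0, 80.0, 200.0, 150.0],
})

# Roll-up along the time dimension (day -> month), keeping the region dimension.
monthly = daily.groupby([daily["date"].dt.to_period("M"), "region"])["sales"].sum()
print(monthly)
```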
2. Dimensionality Reduction
Dimensionality reduction decreases the number of attributes (features) in the dataset while retaining as much of the relevant information as possible. Common techniques include Principal Component Analysis (PCA), wavelet transforms, and attribute (feature) subset selection.
Benefits:
Fewer attributes reduce storage requirements and speed up mining algorithms.
Removing redundant or irrelevant attributes can reduce noise and improve model accuracy.
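A minimal PCA sketch with scikit-learn is shown below; the three-dimensional data values are made up.

```python
# PCA sketch: project 3-dimensional data onto its 2 principal components.
# The data values are made up for illustration.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4, 1.0],
    [0.5, 0.7, 0.2],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.4],
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # 5 rows, now only 2 columns
print(X_reduced.shape)                  # (5, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component
```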
3. Data Compression
Data compression aims to reduce the size of the dataset by encoding the data in a more efficient
manner without losing significant information. It helps to save storage space and improves data
transfer efficiency.
1. Lossless Compression:
In lossless compression, the original data can be perfectly reconstructed from the
compressed data without any loss of information.
Methods: Huffman coding, Run-length encoding, Lempel-Ziv-Welch (LZW)
compression.
Application: Used in text and data transmission where data integrity is crucial.
2. Lossy Compression:
In lossy compression, some of the data is lost during compression, but the
resulting compressed data is still a close approximation of the original.
Methods: Discrete Cosine Transform (DCT), Wavelet Transform.
Application: Commonly used for images, audio, and video where some loss of
quality is acceptable in exchange for greater compression.
3. Vector Quantization:
This technique reduces the size of the dataset by clustering similar data points into
groups and representing each group by a single prototype (centroid).
Application: Common in image compression.
Benefits:
Compression reduces storage requirements and speeds up data transfer, while lossless methods preserve the original data exactly.
4. Numerosity Reduction
Numerosity reduction is a technique that reduces the volume of data by representing the dataset
in a more compact form, either by approximation or by generating a reduced set of data.
1. Parametric Methods:
In parametric methods, the data is modeled using a known distribution, and the
model parameters (rather than the data itself) are stored.
Example: Regression models (e.g., linear regression, logistic regression), which
represent the data using a mathematical function rather than the original data
points.
Application: Instead of storing all data points, only the coefficients of the
regression equation are retained, allowing predictions on new data points.
2. Non-Parametric Methods:
Non-parametric methods do not assume an underlying distribution but instead
reduce the dataset size by using data clustering, histograms, or other
approximation methods.
Clustering: Groups similar data points into clusters, and each cluster is
represented by its centroid or representative data point.
Sampling: A smaller, representative subset of the data is selected to approximate
the whole dataset. Sampling can be random, stratified, or systematic.
Histograms: Data is partitioned into bins, and the frequency of data points in
each bin is stored, reducing the dataset size.
Benefits:
Parametric methods allow for data reduction by modeling relationships between data
points, requiring only a few parameters.
Non-parametric methods reduce data volume through sampling, clustering, or
summarization, which makes data processing more manageable.
Discretization and Concept Hierarchy Generation in Data Mining
In data mining, discretization and concept hierarchy generation are techniques used for
transforming continuous data into a form that can be more easily and effectively analyzed. These
methods simplify data representation, which is particularly important when dealing with complex
and large datasets, especially for classification, clustering, and decision tree algorithms.
1. Discretization
Discretization is the process of converting continuous attributes into a finite set of intervals or categories. Its main benefits are:
Simplifies Analysis: By reducing continuous data into categories or intervals, the data
becomes easier to analyze.
Improves Algorithm Performance: Many algorithms, like decision trees, perform better
with categorical data.
Facilitates Interpretability: Discretized data can be more easily understood by non-experts compared to raw continuous data.
Types of Discretization:
a. Supervised Discretization
In supervised discretization, the class label (target variable) is used to guide the process
of dividing the continuous values into intervals.
Methods:
1. Entropy-Based Discretization:
Uses information gain (entropy) to determine the boundaries of the
intervals. The goal is to split the data in a way that maximizes the
separation between different class labels.
2. ChiMerge:
A statistical method based on the Chi-square test that merges adjacent
intervals until the merged intervals no longer improve the class label
distribution significantly.
b. Unsupervised Discretization
In unsupervised discretization, intervals are formed without using class labels, for example by equal-width or equal-frequency binning.
c. Top-Down (Hierarchical) Discretization
This is a top-down, hierarchical method where the continuous values are recursively split into smaller intervals. The process stops when a stopping criterion, like a maximum number of bins or a minimum information gain, is reached.
d. Cluster-Based Discretization
A clustering algorithm groups similar values together, and each resulting cluster is treated as one discrete interval.
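A minimal pandas sketch of unsupervised discretization follows: equal-width intervals with pd.cut, equal-frequency intervals with pd.qcut, and labelled intervals matching the age ranges used later in these notes. The age values themselves are made up.

```python
# Equal-width (pd.cut) and equal-frequency (pd.qcut) discretization; ages are made up.
import pandas as pd

ages = pd.Series([3, 15, 22, 27, 31, 38, 45, 52, 60, 70])

equal_width = pd.cut(ages, bins=4)    # 4 intervals of equal width
equal_freq = pd.qcut(ages, q=4)       # 4 intervals holding roughly equal counts
labelled = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                  labels=["child", "adult", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "category": labelled}))
```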
2. Concept Hierarchy Generation
Concept hierarchy generation refers to the process of organizing data into multiple levels of abstraction, where higher-level concepts represent generalizations of lower-level concepts.
Concept hierarchies play a crucial role in data summarization and data mining, especially for
tasks like roll-up and drill-down operations in OLAP systems and for knowledge discovery.
a. Manual Concept Hierarchy Generation
This involves human experts defining the hierarchy based on domain knowledge. For example, in a retail dataset, a hierarchy for product categories may be manually defined as follows:
Level 1: All products
Level 2: Electronics, Clothing, Food, etc.
Level 3: Smartphones, Laptops, TVs (under Electronics)
b. Automatic Concept Hierarchy Generation
Automatic methods build the hierarchy based on the data itself, often using attribute-value relationships, clustering, or generalization rules.
Methods:
1. Binning-Based Approach: Concept hierarchies can be generated using binning
(as described in discretization). For example, a numeric attribute "age" can be
grouped into different age ranges (e.g., 0-18 = children, 19-35 = adults, etc.).
2. Clustering-Based Approach: A clustering algorithm can be applied to group similar values or entities into clusters. These clusters can then represent higher-level concepts in the hierarchy.
3. Data Aggregation: Aggregating detailed data into summary levels based on
common attributes. For example, daily sales can be aggregated into monthly,
quarterly, or yearly sales, creating a hierarchy for time-based data.
4. Schema-Based Approach: Concept hierarchies can be generated by defining a set of attributes and ordering them from general to specific.
Example:
Consider a dataset containing information about geographic locations.
Hierarchy:
Country → State → City → Zip Code
5. Discretization-Based Approach: Numeric attributes (like age or income) can be discretized into intervals, forming a hierarchy.
Example:
Age: (0-18: children, 19-35: adults, 36-60: middle-aged, 61+ : seniors).
Advantages of Discretization and Concept Hierarchy Generation:
For Discretization:
1. Improves Data Mining Efficiency: Discretized data is easier to process, particularly for
algorithms that require categorical data, such as decision trees.
2. Simplifies Model Interpretation: Categories or intervals are often easier for humans to
interpret compared to continuous data.
3. Reduces Model Complexity: Discretization can reduce noise and help prevent
overfitting by reducing the precision of the data.
For Concept Hierarchy Generation:
1. Supports Roll-Up and Drill-Down: Concept hierarchies allow for flexible analysis of data at various levels of abstraction.
2. Improves Data Summarization: Hierarchies enable summarization of data, helping in
data generalization tasks.
3. Enhances Knowledge Discovery: By working at higher levels of abstraction, it becomes
easier to identify patterns and relationships that may not be apparent at lower levels.
Examples:
Discretization Example:
Age values discretized into intervals (0-18: children, 19-35: adults, 36-60: middle-aged, 61+: seniors).
Concept Hierarchy Example:
Geographical Data:
Level 1: World
Level 2: Continents (e.g., North America, Europe)
Level 3: Countries (e.g., USA, Canada, Germany)
Level 4: States/Provinces (e.g., California, Ontario)
Level 5: Cities (e.g., Los Angeles, Toronto)
Decision Tree in Data Mining
A decision tree is a popular supervised learning algorithm used for both classification and
regression tasks. In data mining, it is primarily used for classification, where it helps in decision-making by breaking down complex data into smaller, manageable chunks through a hierarchical
structure resembling a tree.
The goal of a decision tree is to create a model that predicts the value of a target variable based
on several input features. Each internal node of the tree represents a decision based on a feature,
and each leaf node represents the final classification or output. Decision trees are intuitive, easy
to understand, and interpret, making them a popular choice for many data mining tasks.
Key Components of a Decision Tree
1. Root Node
The root node is the top-most node in the tree, where the data starts being split based on
the feature that provides the highest information gain or the most significant decision
factor.
2. Decision Node
A decision node represents a feature or attribute based on which the data is split further.
At each decision node, a condition (usually a comparison or threshold) is used to branch
the data.
3. Leaf Node
Leaf nodes represent the final outcome or classification. After a series of decisions, the data reaches a leaf node, which provides the prediction or class label.
4. Splitting
Splitting is the process of dividing a dataset into subsets based on a particular feature or
attribute. The goal is to split the data in such a way that each subset is as homogeneous as
possible with respect to the target variable.
5. Pruning
Pruning is the process of removing unnecessary branches from the tree to reduce
complexity and prevent overfitting. Pruned trees are simpler and often generalize better
on unseen data.
6. Branch
A branch is a sub-tree within the decision tree. It represents the path from a decision node
to another node (either another decision node or a leaf node).
7. Impurity Measures
Impurity measures how well a split separates the classes. The more homogenous the
resulting subsets, the lower the impurity.
Common impurity measures include:
Gini Index: Measures the probability of misclassifying a random element.
Entropy: Used in information gain to decide splits in decision trees. It measures
the uncertainty or impurity in a dataset.
Variance: Used for regression tasks to measure the spread of the target values.
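The two most common impurity measures can be computed directly from class proportions, as in the short sketch below; the example labels are illustrative.

```python
# Gini index and entropy for a list of class labels; the labels are illustrative.
# Gini = 1 - sum(p_i^2); Entropy = -sum(p_i * log2(p_i)), over class proportions p_i.
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

labels = ["yes", "yes", "yes", "no", "no"]   # 3 "yes", 2 "no"
print(gini(labels))      # 1 - (0.6**2 + 0.4**2) = 0.48
print(entropy(labels))   # -(0.6*log2(0.6) + 0.4*log2(0.4)) ≈ 0.971
```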
Types of Decision Trees
1. Classification Trees
In classification trees, the target variable is categorical, and the tree assigns each data
point to a specific category or class. The output at the leaf node represents the class label.
2. Regression Trees
In regression trees, the target variable is continuous. The tree predicts a numeric value at
the leaf nodes. These trees are used for problems where the goal is to predict a continuous
output, such as predicting house prices or stock values.
Advantages of Decision Trees
Easy to Understand and Interpret: Decision trees provide a clear and visual representation of the decision-making process.
Non-Parametric: Decision trees do not make assumptions about the underlying data
distribution.
Handles Both Categorical and Continuous Data: Decision trees can handle various
types of input features (e.g., numeric, categorical).
Feature Importance: Decision trees automatically rank features by importance, helping
in feature selection.
Disadvantages of Decision Trees
Overfitting: Decision trees can easily become too complex, especially if they are not pruned. Overfitting can lead to poor generalization on unseen data.
Instability: Small changes in the data can lead to significant changes in the tree structure.
Bias Toward Features with More Levels: Decision trees can be biased towards features
with more categories or levels, which might lead to overestimating their importance.
Decision Tree Algorithms
There are several algorithms to construct decision trees, and each uses different criteria for selecting the best split:
1. ID3 (Iterative Dichotomiser 3)
The ID3 algorithm uses information gain (based on entropy) to select the attribute that best separates the data at each step.
2. C4.5
An extension of ID3, C4.5 handles both categorical and continuous data, and it deals with
missing values and can prune the tree after creation.
3. CART (Classification and Regression Trees)
CART constructs binary trees (each node has only two branches) and uses the Gini Index for classification tasks and mean squared error for regression tasks.
Example: Building a Decision Tree
We want to predict whether a person will buy a sports car based on Age and Income.
Dataset:
Person Age Income Buys Sports Car?
2 Young Medium No
4 Senior Low No
5 Senior Medium No
6 Senior High No
8 Young Low No
9 Young Medium No
10 Middle-aged Low No
Step-by-Step:
1. Age provides the highest information gain, so it is chosen as the root node and the data is split into Young, Middle-aged, and Senior branches.
2. The Middle-aged and Senior branches lead directly to leaf nodes (Buys: Yes and Buys: No, respectively).
3. The Young branch is split further on Income: High leads to Yes, while Medium/Low leads to No.
Resulting decision tree:

Age
  Young -> Income
     High -> Buys: Yes
     Medium/Low -> Buys: No
  Middle-aged -> Buys: Yes
  Senior -> Buys: No
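A decision tree like the one above can be trained with scikit-learn. The rows below are hypothetical stand-ins shaped like the example (the original table is not reproduced here), and the categorical features are one-hot encoded first.

```python
# Decision tree sketch for the "buys sports car" example; the rows are hypothetical
# stand-ins consistent with the tree above, not the original dataset.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Age":    ["Young", "Young", "Middle-aged", "Senior", "Senior", "Middle-aged"],
    "Income": ["High",  "Low",   "Medium",      "High",   "Low",    "High"],
    "Buys":   ["Yes",   "No",    "Yes",         "No",     "No",     "Yes"],
})

X = pd.get_dummies(data[["Age", "Income"]])   # one-hot encode the categorical features
y = data["Buys"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```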