Data Discretization

This document discusses data discretization techniques for preprocessing numeric data. It begins by defining discretization as replacing raw numeric values with interval labels or conceptual labels. The main discretization techniques discussed are binning, histogram analysis, and cluster analysis. Binning partitions values into equal-frequency or equal-width bins, histogram analysis uses partitions of equal-width or equal-frequency, and cluster analysis groups similar values together. Discretization reduces data size and simplifies data for further analysis. The document also categorizes discretization as supervised or unsupervised, and as top-down splitting or bottom-up merging.

Uploaded by

Dominador J. Santos Jr.

One of the Major Tasks in Data Preprocessing

Data discretization – Part of data reduction but with particular importance, especially for numerical data

Discretization

In discretization, the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).

Data discretization transforms numeric data by mapping values to interval or concept labels.
Discretization techniques include binning, histogram analysis, cluster analysis, and so on.

Data discretization and concept hierarchy generation are also forms of data reduction. The raw data are
replaced by a smaller number of interval or concept labels. This simplifies the original data and makes
the mining more efficient. The resulting patterns mined are typically easier to understand.

Three types of attributes:

(An attribute is a property or characteristic of an object)

Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field,
characteristic, or feature.

Nominal: values from an unordered set, e.g., color, profession

Ordinal: values from an ordered set, e.g., military or academic rank

Continuous: numeric values, e.g., integers or real numbers (interval and ratio attributes are
aggregated here into continuous)

Discretization:
– Divide the range of a continuous attribute into intervals, which reduces the number of values for
that attribute
– Interval labels can then be used to replace the actual data values
– Reduce data size by discretization
– Prepare for further analysis
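
The replacement of raw values by labels can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, the interval labels follow the 0–10, 11–20 pattern used above (written with ASCII hyphens), and the cut points 18 and 65 for youth/adult/senior are assumptions that do not come from these notes.

```python
from bisect import bisect_right

def interval_label(age, width=10):
    """Return an interval label such as '11-20' for a non-negative integer age."""
    if age <= width:
        return f"0-{width}"          # first interval is 0-10, as in the text
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

# Assumed cut points: below 18 is "youth", 18 to 64 is "adult", 65 and up is "senior".
CONCEPT_CUTS = [18, 65]
CONCEPT_LABELS = ["youth", "adult", "senior"]

def concept_label(age):
    """Return the concept label whose range contains the given age."""
    return CONCEPT_LABELS[bisect_right(CONCEPT_CUTS, age)]

ages = [7, 15, 34, 70]
interval_labels = [interval_label(a) for a in ages]
concept_labels = [concept_label(a) for a in ages]
print(interval_labels)
print(concept_labels)
```

Four distinct ages collapse into a handful of labels, which is the data-reduction effect described above.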

************************

Discretization techniques can be categorized based on how the discretization is performed, such as
whether it uses class information or which direction it proceeds (i.e., top-down vs. bottom-up). If the
discretization process uses class information, then we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called
top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which
starts by considering all of the continuous values as potential split-points, removes some by merging
neighborhood values to form intervals, and then recursively applies this process to the resulting
intervals.
– Supervised vs. unsupervised (use class or don’t use class variable)
– Split (top-down) vs. merge (bottom-up)
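
To make the two directions concrete, here is a minimal unsupervised sketch in Python. The split and merge rules (split at the largest gap between neighboring sorted values, merge the two closest neighboring intervals) and the parameter names `min_gap` and `n_intervals` are assumptions chosen for illustration; real methods use other criteria.

```python
def top_down(sorted_vals, min_gap):
    """Splitting: cut the range at its largest gap, then recurse on both parts."""
    if len(sorted_vals) < 2:
        return [sorted_vals]
    gaps = [b - a for a, b in zip(sorted_vals, sorted_vals[1:])]
    widest = max(gaps)
    if widest < min_gap:             # no gap is large enough to justify a split
        return [sorted_vals]
    cut = gaps.index(widest) + 1     # position of the split point
    return top_down(sorted_vals[:cut], min_gap) + top_down(sorted_vals[cut:], min_gap)

def bottom_up(sorted_vals, n_intervals):
    """Merging: start with one interval per value, merge the closest neighbors."""
    intervals = [[v] for v in sorted_vals]
    while len(intervals) > n_intervals:
        gaps = [right[0] - left[-1] for left, right in zip(intervals, intervals[1:])]
        i = gaps.index(min(gaps))
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
    return intervals

data = [1, 2, 3, 10, 11, 12, 30, 31]
split_result = top_down(data, min_gap=5)
merge_result = bottom_up(data, n_intervals=3)
print(split_result)
print(merge_result)
```

On this small data set both directions arrive at the same three intervals. Neither function looks at a class attribute, so both are unsupervised.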

Discretization for Numeric Data

Typical methods (all of them can be applied recursively):

– Binning (covered earlier)
Top-down split, unsupervised
First sort the data and partition it into (equal-frequency) bins
Then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2
illustrates some binning techniques. In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin
means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values
4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly,
smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In
smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the
width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the
interval range of values in each bin is constant. Binning is also used as a discretization technique.
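
The three smoothing variants can be sketched in Python as follows. The function names are hypothetical. These notes quote only the first bin (4, 8, 15), so the rest of the price list is an assumed example. Sending ties to the lower boundary is also an assumption.

```python
from statistics import mean, median

def equal_frequency_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` values."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin by the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]      # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # assumed data; Bin 1 matches the text
bins = equal_frequency_bins(prices, size=3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print(by_means)
print(by_medians)
print(by_boundaries)
```

For Bin 1 the mean is (4 + 8 + 15) / 3 = 9, so all three values become 9, which agrees with the worked example above.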

Discretization by Binning: Binning is a top-down splitting technique based on a specified number
of bins. Section 3.2.2 discussed binning methods for data smoothing. These methods are also used as
discretization methods for data reduction and concept hierarchy generation. For example, attribute
values can be discretized by applying equal-width or equal-frequency binning, and then replacing each
bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians,
respectively. These techniques can be applied recursively to the resulting partitions to generate concept
hierarchies. Binning does not use class information and is therefore an unsupervised discretization
technique. It is sensitive to the user-specified number of bins, as well as the presence of outliers.
– Histogram analysis (covered earlier)
Top-down split, unsupervised

Discretization by Histogram Analysis: Like binning, histogram analysis is an unsupervised
discretization technique because it does not use class information. Histograms were introduced in
Section 2.2.3. A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or
bins. Various partitioning rules can be used to define histograms (Section 3.4.6). In an equal-width
histogram, for example, the values are partitioned into equal-size partitions or ranges (e.g., earlier in
Figure 3.8 for price, where each bucket has a width of $10). With an equal-frequency histogram, the
values are partitioned so that, ideally, each partition contains the same number of data tuples. The
histogram analysis algorithm can be applied recursively to each partition in order to automatically
generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of
concept levels has been reached. A minimum interval size can also be used per level to control the
recursive procedure. This specifies the minimum width of a partition, or the minimum number of values
for each partition at each level. Histograms can also be partitioned based on cluster analysis of the data
distribution, as described next.
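
The recursive procedure can be sketched as follows, assuming equal-width partitioning at every level. The function name, the nested-dictionary output, and the parameters `n_buckets`, `levels`, and `min_values` are illustrative assumptions. The stopping rule here (do not split a partition holding fewer than `min_values` values) is a simplified stand-in for the minimum interval size mentioned above, and the price list is assumed example data.

```python
def equal_width_hierarchy(values, low, high, n_buckets, levels, min_values=2):
    """Recursively split the range from low to high into equal-width buckets.

    Returns a nested dict: {"range": (low, high), "count": int, "children": [...]}.
    """
    node = {"range": (low, high), "count": len(values), "children": []}
    # Stop at the requested number of concept levels or when too few values remain.
    if levels == 0 or len(values) < min_values:
        return node
    width = (high - low) / n_buckets
    for i in range(n_buckets):
        lo = low + i * width
        hi = low + (i + 1) * width
        last = i == n_buckets - 1    # the last bucket also includes its upper edge
        inside = [v for v in values if lo <= v < hi or (last and v == hi)]
        if inside:                   # empty buckets are left out of the hierarchy
            child = equal_width_hierarchy(inside, lo, hi, n_buckets, levels - 1, min_values)
            node["children"].append(child)
    return node

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # assumed example data
root = equal_width_hierarchy(prices, low=0, high=40, n_buckets=2, levels=2)
for top in root["children"]:
    print(top["range"], top["count"])
    for sub in top["children"]:
        print("   ", sub["range"], sub["count"])
```

With two levels, the top level has buckets of width 20 and the second level has buckets of width 10, giving a two-level concept hierarchy over the assumed range 0 to 40.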
– Cluster analysis
Either top-down split or bottom-up merge, unsupervised
Detect and remove outliers

Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to
discretize a numeric attribute, A, by partitioning the values of A into clusters or groups. Clustering takes
the distribution of A into consideration, as well as the closeness of data points, and therefore is able to
produce high-quality discretization results. Clustering can be used to generate a concept hierarchy for A
by following either a top-down splitting strategy or a bottom-up merging strategy, where each cluster
forms a node of the concept hierarchy. In the former, each initial cluster or partition may be further
decomposed into several subclusters, forming a lower level of the hierarchy. In the latter, clusters are
formed by repeatedly grouping neighboring clusters in order to form higher-level concepts. Clustering
methods for data mining are studied in Chapters 10 and 11.
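
As a rough illustration of clustering-based discretization, the sketch below runs a small one-dimensional k-means (Lloyd's algorithm) and turns the clusters into cut points. The notes do not name a specific clustering algorithm, so k-means is an assumption, as are the initialization rule, the function names, and the sample ages.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers with Lloyd's algorithm; returns non-empty clusters in order."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed initialization: k starting centers spread evenly over the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:   # assignments have stabilized
            break
        centers = new_centers
    return [c for c in clusters if c]

def cut_points(clusters):
    """Place a cut point halfway between each pair of neighboring clusters."""
    return [(left[-1] + right[0]) / 2 for left, right in zip(clusters, clusters[1:])]

ages = [22, 25, 27, 41, 43, 45, 68, 70, 71]   # assumed example data
clusters = kmeans_1d(ages, k=3)
cuts = cut_points(clusters)
print(clusters)
print(cuts)
```

The cut points follow the gaps in the data rather than a fixed grid, which is the sense in which clustering takes the distribution of A into account.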

Simple Discretization Methods: Binning

Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– If A and B are the lowest and highest values of the attribute, the width of the intervals will be
W = (B – A) / N
– The most straightforward, but outliers may dominate the presentation
– Skewed data is not handled well

Equal-depth (frequency) partitioning

– Divides the range into N intervals, each containing approximately the same number of data points
– Good data scaling
– Managing categorical attributes can be tricky
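
The two partitioning schemes can be compared with a short Python sketch. The function names and the data, including the outlier value 200, are assumptions chosen to show how one extreme value affects equal-width intervals.

```python
def equal_width_edges(values, n):
    """Interval edges for N equal-width intervals, using W = (B - A) / N."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_width_index(value, edges):
    """Index of the interval containing value; the last interval includes B."""
    n = len(edges) - 1
    width = edges[1] - edges[0]
    return min(int((value - edges[0]) / width), n - 1)

def equal_depth_bins(values, n):
    """Split the sorted values into N bins whose sizes differ by at most one."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

data = [4, 8, 15, 21, 21, 24, 25, 28, 34, 200]   # assumed data with one outlier
edges = equal_width_edges(data, 4)               # W = (200 - 4) / 4 = 49
width_counts = [0] * 4
for v in data:
    width_counts[equal_width_index(v, edges)] += 1
depth_bins = equal_depth_bins(data, 4)
print(edges)
print(width_counts)
print(depth_bins)
```

With equal width, the outlier stretches the range so that nine of the ten values land in the first interval and two intervals stay empty. With equal depth, every bin holds two or three values regardless of the outlier.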
