Data Discretization
Data discretization – A form of data reduction that is particularly important for numerical data
Discretization
In discretization, the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
Data discretization transforms numeric data by mapping values to interval or concept labels.
Discretization techniques include binning, histogram analysis, cluster analysis, and so on.
Data discretization and concept hierarchy generation are also forms of data reduction. The raw data are
replaced by a smaller number of interval or concept labels. This simplifies the original data and makes
the mining more efficient. The resulting patterns mined are typically easier to understand.
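The mapping from raw values to interval or concept labels can be sketched in a few lines of Python. This is only an illustration: the 10-year interval width follows the labels in the text, but the cut points for youth, adult, and senior (18 and 65), the helper names, and the sample ages are assumptions, not values given in the source.

```python
from bisect import bisect_right

# Hypothetical cut points for the concept labels; the text does not define them.
CONCEPT_CUTS = [18, 65]
CONCEPT_LABELS = ["youth", "adult", "senior"]

def interval_label(age, width=10):
    """Return a label in the style 0-10, 11-20, 21-30, ... for an integer age >= 0."""
    if age <= width:
        return f"0-{width}"
    low = ((age - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

def concept_label(age):
    """Map an age to a concept label using the assumed cut points."""
    return CONCEPT_LABELS[bisect_right(CONCEPT_CUTS, age)]

ages = [3, 15, 19, 27, 42, 66, 71]  # made-up sample values
by_interval = [interval_label(a) for a in ages]
by_concept = [concept_label(a) for a in ages]

print(by_interval)
print(by_concept)
print(len(set(ages)), "distinct raw values ->", len(set(by_concept)), "concept labels")
```

The last line shows the data-reduction effect: many distinct raw values collapse into a handful of labels.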
Discretization:
– Divide the range of a continuous attribute into intervals
– Reduce data size by replacing many distinct values with a few interval labels
– Prepare for further analysis
Discretization
– Reduce the number of values for a given continuous attribute by dividing the range of the attribute
into intervals
– Interval labels can then be used to replace actual data
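As a rough sketch of dividing an attribute's range into intervals and replacing the data with interval labels, the following Python function splits the range into k equal-width intervals. The function name, the choice of equal-width intervals, the label format, and the sample values are all assumptions made for illustration.

```python
def equal_width_labels(values, k):
    """Replace each value with the label of one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: every value falls into a single interval.
        return [f"[{lo:g}, {hi:g}]"] * len(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        # The maximum value would index interval k, so clamp it into the last one.
        idx = min(int((v - lo) / width), k - 1)
        left = lo + idx * width
        right = lo + (idx + 1) * width
        closing = "]" if idx == k - 1 else ")"
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels

ages = [2, 15, 22, 37, 41, 58, 62]  # made-up sample values
print(equal_width_labels(ages, 3))
```

Only the last interval is closed on the right, so that the maximum value has a label; other conventions are equally reasonable.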
************************
Discretization techniques can be categorized based on how the discretization is performed, such as
whether it uses class information or in which direction it proceeds (i.e., top-down vs. bottom-up). If the
discretization process uses class information, then we say it is supervised discretization. Otherwise, it is
unsupervised. If the process starts by first finding one or a few points (called split points or cut points) to
split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called
top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which
starts by considering all of the continuous values as potential split points, removes some by merging
neighboring values to form intervals, and then recursively applies this process to the resulting
intervals.
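To make the supervised, top-down case concrete, here is a minimal Python sketch that picks split points by minimizing the class entropy of the resulting intervals and then recurses on each side. Entropy is one commonly used criterion, but the text does not name a specific one, so treat the criterion, the function names, the depth limit, and the sample data as assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(pairs):
    """pairs: (value, class) tuples sorted by value. Return (cut, score) or None."""
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

def split_top_down(pairs, max_depth=2):
    """Return the sorted cut points found by recursive (top-down) splitting."""
    pairs = sorted(pairs)
    if max_depth == 0 or len({c for _, c in pairs}) <= 1:
        return []  # stop on the depth limit or when the interval is pure
    found = best_split(pairs)
    if found is None:
        return []
    cut = found[0]
    left = [p for p in pairs if p[0] <= cut]
    right = [p for p in pairs if p[0] > cut]
    return split_top_down(left, max_depth - 1) + [cut] + split_top_down(right, max_depth - 1)

# Made-up (value, class) data for illustration.
data = [(1, "a"), (2, "a"), (10, "b"), (11, "b"), (20, "a"), (21, "a")]
print(split_top_down(data))
```

Because the class labels drive the choice of cut points, this sketch is supervised; dropping the labels and splitting on, say, equal widths would make it unsupervised.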
– Supervised vs. unsupervised (uses the class variable vs. does not use it)
– Split (top-down) vs. merge (bottom-up)
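The merge (bottom-up) direction can be sketched as follows: every distinct value starts as its own interval, and the two closest adjacent intervals are merged until only k remain. Merging on the smallest gap is an assumed, unsupervised criterion chosen for simplicity; supervised bottom-up methods would instead compare the class distributions of adjacent intervals. The function name and sample data are hypothetical.

```python
def merge_bottom_up(values, k):
    """Merge adjacent intervals (smallest gap first) until k intervals remain."""
    k = max(k, 1)
    # Start with one degenerate interval [v, v] per distinct value.
    intervals = [[v, v] for v in sorted(set(values))]
    while len(intervals) > k:
        # Gap between the end of interval i and the start of interval i + 1.
        gaps = [intervals[i + 1][0] - intervals[i][1]
                for i in range(len(intervals) - 1)]
        i = gaps.index(min(gaps))
        intervals[i] = [intervals[i][0], intervals[i + 1][1]]
        del intervals[i + 1]
    return [tuple(iv) for iv in intervals]

values = [1, 2, 3, 10, 11, 12, 30, 31]  # made-up sample values
print(merge_bottom_up(values, 3))
```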
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing. Figure 3.2
illustrates some binning techniques. In this example, the data for price are first sorted and then
partitioned into equal-frequency bins of size 3 (i.e., each bin contains three values). In smoothing by bin
means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values
4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the value 9. Similarly,
smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In
smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the
bin width, the greater the effect of the smoothing. Alternatively, bins may be equal width, where the
interval range of values in each bin is constant. Binning is also used as a discretization technique.
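The three smoothing variants described above can be sketched in Python as follows. Only the first bin (4, 8, 15) comes from the text; the remaining six price values, the function names, and the rule that ties go to the lower boundary are assumptions made for illustration.

```python
from statistics import mean, median

def equal_frequency_bins(values, size):
    """Sort the values and partition them into bins of `size` values each."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary (an assumed convention).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# First three values are from the text; the rest are assumed.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

For Bin 1 the mean smoothing yields 9 for every value, matching the worked example in the text.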