DM - 02 - 02 - Descriptive Data Summarization
Fall 2008
Motivation
– To better understand the data
– To highlight which data values should be treated as noise or
outliers.
Data characteristics
– Measures of central tendency
Mean, median, mode, and midrange
– Measures of data dispersion
Range, quartiles, interquartile range (IQR), and variance
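The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch, not a reference implementation: the function name `summarize` and the sample values are made up for illustration, and the quartiles use the module's default "exclusive" interpolation, which is only one of several common conventions.

```python
import statistics

def summarize(values):
    """Return basic central-tendency and dispersion measures (illustrative helper)."""
    data = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.multimode(data),       # list, since data may be multimodal
        "midrange": (data[0] + data[-1]) / 2,     # average of min and max
        "range": data[-1] - data[0],
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "variance": statistics.pvariance(data),   # population variance (assumption)
    }

# Made-up sample values, for illustration only.
summary = summarize([2, 4, 4, 5, 7, 9, 11])
print(summary)
```

Swapping `pvariance` for `statistics.variance` would give the sample variance instead, if that is the definition in use.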
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
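The formula above is the weighted arithmetic mean: each value x_i is multiplied by its weight w_i, and the sum is divided by the total weight. A minimal sketch follows; the function name `weighted_mean` and the sample numbers are illustrative assumptions, not part of the slides.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Made-up example: the third value counts twice as much as the others.
print(weighted_mean([10, 20, 30], [1, 1, 2]))  # 22.5
```

With all weights equal, the result reduces to the ordinary arithmetic mean.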
Trimmed mean
– A major problem with the mean is its sensitivity to extreme
(e.g., outlier) values.
– Even a small number of extreme values can corrupt the
mean.
– The trimmed mean is the mean obtained after cutting off
values at the high and low extremes.
– For example, we can sort the values and remove the top and
bottom 2% before computing the mean.
– We should avoid trimming too large a portion (such as
20%) at both ends as this can result in the loss of valuable
information.
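The trimming procedure described above can be sketched as follows. The function name `trimmed_mean`, the rounding-down rule for the number of values to cut, and the sample data are assumptions made for illustration; other conventions for fractional counts exist.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end.

    `proportion` is a fraction per end (e.g. 0.02 for 2%).
    """
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)
    k = int(len(data) * proportion)      # values to cut at each end, rounded down
    trimmed = data[k:len(data) - k]
    return sum(trimmed) / len(trimmed)

# Made-up data with one extreme value (500).
prices = [10, 12, 11, 13, 12, 11, 10, 12, 13, 500]
print(sum(prices) / len(prices))   # plain mean: 60.4
print(trimmed_mean(prices, 0.10))  # trimmed mean: 11.75
```

Note that with a small data set a 2% trim removes nothing under this rounding rule (for 10 values, k is 0), so the example uses 10% to make the effect visible.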
Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.