1
Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.
Business Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. They can also help retailers answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" Some of these questions motivated the creation of association analysis (Chapters 6 and 7), a new data analysis technique.
Medicine, Science, and Engineering Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes and global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"
As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of the data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a future observation, such as predicting whether a newly arrived customer will spend more than $100 at a department store.
Not all information discovery tasks are considered to be data mining. For example, looking up individual records using a database management system or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.
Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.
[Figure 1.1. The process of knowledge discovery in databases (KDD): Input Data → Data Preprocessing (feature selection, dimensionality reduction, normalization, data subsetting) → Data Mining → Postprocessing (filtering patterns, visualization, pattern interpretation) → Information.]
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
"Closing the loop" is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.
Scalability Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
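To make the out-of-core idea concrete, here is a minimal sketch (our own construction, not an algorithm from this book) of a one-pass computation over a file too large to fit in memory; the file name and chunk size are hypothetical.

```python
def chunked_mean(path, chunk_hint=100_000):
    """Compute the mean of one numeric value per line, reading the file
    in bounded chunks so memory use stays constant (an out-of-core pass)."""
    total, count = 0.0, 0
    with open(path) as f:
        while True:
            lines = f.readlines(chunk_hint)  # read roughly chunk_hint bytes
            if not lines:
                break
            for line in lines:
                total += float(line)
                count += 1
    return total / count if count else float("nan")

# Hypothetical usage: a file with one measurement per line.
# print(chunked_mean("measurements.txt"))
```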
High Dimensionality It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical.
1.3 The Origins of Data Mining

Brought together by the goal of meeting the challenges of the previous section, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data. This work, which culminated in the field of data mining, built upon the methodology and algorithms that researchers had previously used. In particular, data mining draws upon ideas, such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.
A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location.
Figure 1.2 shows the relationship of data mining to other areas.
[Figure 1.2. Data mining as a confluence of many disciplines: statistics; AI, machine learning, and pattern recognition; and database technology, parallel computing, and distributed computing.]
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.
[Figure 1.3. Four of the core data mining tasks: predictive modeling, anomaly detection, cluster analysis, and association analysis, illustrated around a central data set of borrower records (ID, home owner, marital status, annual income, defaulted borrower).]
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting a species of flower based on the characteristics of the flower. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of the flowers. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.
[Figure 1.4. Petal width versus petal length for 150 Iris flowers, with the Setosa, Versicolour, and Virginica species distinguished.]
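As a minimal sketch of how these rules could be applied (our own construction; the thresholds are the ones stated in Example 1.1, and the sample measurement is hypothetical):

```python
def classify_iris(petal_width, petal_length):
    """Apply the rules of Example 1.1 to petal measurements in centimeters."""
    def bucket(value, medium_start, high_start):
        if value < medium_start:
            return "low"
        if value < high_start:
            return "medium"
        return "high"

    width = bucket(petal_width, 0.75, 1.75)   # [0, 0.75), [0.75, 1.75), [1.75, inf)
    length = bucket(petal_length, 2.5, 5.0)   # [0, 2.5), [2.5, 5), [5, inf)

    if width == "low" and length == "low":
        return "Setosa"
    if width == "medium" and length == "medium":
        return "Versicolour"
    if width == "high" and length == "high":
        return "Virginica"
    return None  # the rules do not cover every flower

print(classify_iris(0.2, 1.4))  # -> Setosa
```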
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying Web pages that are accessed together, or understanding the relationships between different elements of Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.
Table 1.1. Market basket data.
Transaction ID Items
1 {Bread, Butter, Diapers, Milk}
2 {Coffee, Sugar, Cookies, Salmon}
3 {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4 {Bread, Butter, Salmon, Chicken}
5 {Eggs, Bread, Butter}
6 {Salmon, Diapers, Milk}
7 {Bread, Tea, Sugar, Eggs}
8 {Coffee, Sugar, Chicken, Eggs}
9 {Bread, Diapers, Milk, Salt}
10 {Tea, Eggs, Cookies, Diapers, Milk}
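A minimal sketch (our own, not the book's algorithm) of quantifying the strength of the rule {Diapers} → {Milk} over the transactions in Table 1.1, using the support and confidence measures developed in Chapter 6:

```python
# The ten transactions of Table 1.1.
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

lhs, rhs = {"Diapers"}, {"Milk"}
print(support(lhs | rhs))                 # 0.5: half the baskets contain both
print(support(lhs | rhs) / support(lhs))  # 1.0: every diaper basket contains milk
```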
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.

Article  Words
1  dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2  machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3  job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4  domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5  patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6  pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7  death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8  medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
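The book does not prescribe a particular profiling method here; one minimal sketch, under our own assumption that the profile is simply the distribution of past transaction amounts, is a z-score test:

```python
import statistics

def is_anomalous(history, new_amount, threshold=3.0):
    """Flag new_amount if it lies more than threshold standard deviations
    from the mean of the user's past transaction amounts."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_amount != mean
    return abs(new_amount - mean) / stdev > threshold

past = [23.50, 41.00, 18.75, 35.20, 27.60, 30.10]  # hypothetical history
print(is_anomalous(past, 29.00))   # False: a typical amount
print(is_anomalous(past, 950.00))  # True: far outside the profile
```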
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.
We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.
Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.
Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).
Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.
The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.
The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave.
2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques such as clustering, classification, association rule mining, and anomaly detection can be applied.
3. For each of the following data sets, explain whether or not data privacy is an important issue.

(a) Census data collected from 1900–1950.
(b) IP addresses and visit times of Web users who visit your Website.
(c) Images from Earth-orbiting satellites.
(d) Names and addresses of people from the telephone book.
(e) Names and email addresses collected from the Web.
2
Data
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.

Statistician: What? Field 1 is just an identification number.

Data Miner: Nonetheless, my results speak for themselves.

Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.
Although this scenario represents an extreme situation, it emphasizes the importance of knowing your data. To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.
2.1 Types of Data
A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are described by a number of attributes that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.
Example 2.2 (Student Information). Often, a data set is a file, in which the objects are records (or rows) in the file and each field (or column) corresponds to an attribute. For example, Table 2.1 shows a data set that consists of student information. Each row corresponds to a student and each column is an attribute that describes some aspect of a student, such as grade point average (GPA) or identification number (ID).
Table 2.1. A sample data set containing student information.

Student ID   Year        Grade Point Average (GPA)   ...
...
1034262      Senior      3.24   ...
1052663      Sophomore   3.51   ...
1082246      Freshman    3.62   ...
...
Although record-based data sets are common, either in flat files or relational database systems, there are other important types of data sets and systems for storing data. In Section 2.1.2, we will discuss some of the types of data sets that are commonly encountered in data mining. However, we first consider attributes.
2.1.1 Attributes and Measurement
In this section we address the issue of describing data by considering what types of attributes are used to describe data objects. We first define an attribute, then consider what we mean by the type of an attribute, and finally describe the types of attributes that are commonly encountered.
What Is an Attribute?

We start with a more detailed definition of an attribute.

Definition 2.1. An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values {brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a potentially unlimited number of values.
At the most basic level, attributes are not about numbers or symbols. However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.
Definition 2.2. A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

Formally, the process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object. While this may seem a bit abstract, we engage in the process of measurement all the time. For instance, we step on a bathroom scale to determine our weight, we classify someone as male or female, or we count the number of chairs in a room to see if there will be enough to seat all the people coming to a meeting. In all these cases, the "physical value" of an attribute of an object is mapped to a numerical or symbolic value.

With this background, we can now discuss the type of an attribute, a concept that is important in determining if a particular data analysis technique is consistent with a specific type of attribute.
The Type of an Attribute
It should be apparent from the previous discussion that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.
Example 2.3 (Employee Age and ID Number). Two attributes that might be associated with an employee are ID and age (in years). Both of these attributes can be represented as integers. However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID.
[Figure 2.1. The measurement of the length of line segments on two different scales of measurement: one mapping of lengths to numbers captures only the order properties of length, while another captures both the order and additivity properties of length.]
The Different Types of Attributes
A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: × and /
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative.
Table 2.2. Different attribute types.

Categorical (Qualitative)

Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender.
  Operations: mode, entropy, contingency correlation, χ² test.

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −)
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio: For ratio variables, both differences and ratios are meaningful. (×, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
  Operations: geometric mean, harmonic mean, percent variation.
However, this does not mean that the statistical operations appropriate for one attribute type are appropriate for the attribute types above it.

Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. Smith Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations.
Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)

Nominal: Any one-to-one mapping, e.g., a permutation of values. (If all employee ID numbers are reassigned, it will not make any difference.)

Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Numeric (Quantitative)

Interval: new_value = a × old_value + b, where a and b are constants. (The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree, i.e., unit.)

Ratio: new_value = a × old_value. (Length can be measured in meters or feet.)
For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.

The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the permissible (meaning-preserving) transformations for the four attribute types of Table 2.2.
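As a quick check of this idea (our own sketch), the mean of a set of temperatures transforms exactly as the individual values do under the interval-scale mapping new_value = a × old_value + b:

```python
import math

celsius = [10.0, 20.0, 30.0]

def to_fahrenheit(c):
    # Interval-scale transformation: new = a * old + b, with a = 9/5, b = 32.
    return 9.0 / 5.0 * c + 32.0

mean_c = sum(celsius) / len(celsius)
mean_f = sum(to_fahrenheit(c) for c in celsius) / len(celsius)

# Both means describe the same temperature, just on different scales.
print(math.isclose(mean_f, to_fahrenheit(mean_c)))  # True
```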
Example 2.5 (Temperature Scales). Temperature provides a good illustration of some of the concepts that have been described. First, temperature can be either an interval or a ratio attribute, depending on its measurement scale. When measured on the Kelvin scale, a temperature of 2° is, in a physically meaningful way, twice that of a temperature of 1°.
Some combinations of attribute types occur only infrequently or do not make much sense. For instance, it is difficult to think of a realistic data set that contains a continuous binary attribute. Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are continuous. However, count attributes, which are discrete, are also ratio attributes.
Asymmetric Attributes

For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. Consider a data set where each object is a student and each attribute records whether or not a student took a particular course at a university. For a specific student, an attribute has a value of 1 if the student took the course associated with that attribute and a value of 0 otherwise. Because students take only a small fraction of all available courses, most of the values in such a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-zero values. To illustrate, if students are compared on the basis of the courses they don't take, then most students would seem very similar, at least if the number of courses is large. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis, which is discussed in Chapter 6. It is also possible to have discrete or continuous asymmetric features. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.
2.1.2 Types of Data Sets
There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets become available for analysis. In this section, we describe some of the most common types. For convenience, we have grouped the types of data sets into three groups: record data, graph-based data, and ordered data. These categories do not cover all possibilities and other groupings are certainly possible.
General Characteristics of Data Sets

Before providing details of specific kinds of data sets, we discuss three characteristics that apply to many data sets and have a significant impact on the data mining techniques that are used: dimensionality, sparsity, and resolution.

Dimensionality The dimensionality of a data set is the number of attributes that the objects in the data set possess. Data with a small number of dimensions tends to be qualitatively different than moderate or high-dimensional data.
[Figure 2.2. Different variations of record data: (a) record data with home owner, marital status, annual income, and defaulted-borrower attributes; (b) transaction data, e.g., TID 1: {Bread, Soda, Milk}, TID 2: {Beer, Bread}, TID 3: {Beer, Soda, Diaper, Milk}, TID 4: {Beer, Bread, Diaper, Milk}, TID 5: {Soda, Diaper, Milk}; (c) a data matrix of numeric measurements; (d) a document-term matrix of word counts.]
The Data Matrix If the data objects in a collection all have the same fixed set of numeric attributes, then the data objects can be thought of as points (vectors) in a multidimensional space, and the data set can be interpreted as an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute. (A representation that has data objects as columns and attributes as rows is also fine.) This matrix is called a data matrix or a pattern matrix. A data matrix is a variation of record data, but because it consists of numeric attributes, standard matrix operations can be applied to transform and manipulate the data. Therefore, the data matrix is the standard data format for most statistical data. Figure 2.2(c) shows a sample data matrix.
The Sparse Data Matrix A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0–1 entries. Another common example is document data. In particular, if the order of the terms (words) in a document is ignored, then a document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document. This representation of a collection of documents is often called a document-term matrix. Figure 2.2(d) shows a sample document-term matrix. The documents are the rows of this matrix, while the terms are the columns. In practice, only the non-zero entries of sparse data matrices are stored.
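A minimal sketch (our own) of building such a representation, storing only the non-zero entries of each term vector:

```python
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector: term -> count, non-zero entries only."""
    return Counter(text.lower().split())

documents = [                      # hypothetical documents
    "the market rose as the industry grew",
    "the patient saw the doctor",
]

matrix = [term_vector(d) for d in documents]  # one sparse row per document
print(matrix[0]["market"])  # 1
print(matrix[0]["doctor"])  # 0 (absent terms default to zero)
```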
Graph-Based Data

A graph can sometimes be a convenient and powerful representation for data. We consider two specific cases: (1) the graph captures relationships among data objects and (2) the data objects themselves are represented as graphs.

Data with Relationships among Objects The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to nodes of the graph, while the relationships among objects are captured by the links between objects and link properties, such as direction and weight. Consider Web pages on the World Wide Web, which contain both text and links to other pages. In order to process search queries, Web search engines collect and process Web pages to extract their contents. It is well known, however, that the links to and from each page provide a great deal of information about the relevance of a Web page to a query, and thus, must also be taken into consideration. Figure 2.3(a) shows a set of linked Web pages.

Data with Objects That Are Graphs If objects have structure, that is, the objects contain subobjects that have relationships, then such objects are frequently represented as graphs. For example, the structure of chemical compounds can be represented by a graph, where the nodes are atoms and the links between nodes are chemical bonds. Figure 2.3(b) shows a ball-and-stick diagram of the chemical compound benzene, which contains atoms of carbon (black) and hydrogen (gray). A graph representation makes it possible to determine which substructures occur frequently in a set of compounds and to ascertain whether the presence of any of these substructures is associated with the presence or absence of certain chemical properties, such as melting point or heat of formation. Substructure mining, which is a branch of data mining that analyzes such data, is considered in Section 7.5.
[Figure 2.3. Two examples of graph data: (a) linked Web pages and (b) the benzene molecule.]

Ordered Data

For some types of data, the attributes have relationships that involve order in time or space. Different types of ordered data are described next and are shown in Figure 2.4.
Sequential Data Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as "candy sales peak before Halloween." A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a listing of items purchased at different times. Using this information, it is possible to find patterns such as "people who buy DVD players tend to buy DVDs in the period immediately following the purchase."
Figure 2.4(a) shows an example of sequential transaction data. There are five different times (t1, t2, t3, t4, and t5); three different customers (C1, C2, and C3); and five different items (A, B, C, D, and E).
Time  Customer  Items Purchased
t1    C1        A, B
t2    C3        A, C
t2    C1        C, D
t3    C2        A, D
t4    C2        E
t5    C1        A, E

Customer  Time and Items Purchased
C1        (t1: A, B) (t2: C, D) (t5: A, E)
C2        (t3: A, D) (t4: E)
C3        (t2: A, C)

[Figure 2.4. Different variations of ordered data: (a) sequential transaction data (the two tables above); (b) genomic sequence data, e.g., GGTTCCGCCTTCAGCCCCGCGCC...; (c) the Minneapolis average monthly temperature time series (1982–1993); (d) spatial temperature data plotted by latitude and longitude.]
In the top table, each row corresponds to the items purchased at a particular time by each customer. For instance, at time t3, customer C2 purchased items A and D. In the bottom table, the same information is displayed, but each row corresponds to a particular customer. Each row contains information on each transaction involving the customer, where a transaction is considered to be a set of items and the time at which those items were purchased. For example, customer C3 bought items A and C at time t2.
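The conversion between the two layouts of Figure 2.4(a) is a simple grouping step; a minimal sketch (our own) is:

```python
from collections import defaultdict

# The top table: (time, customer, items) records.
records = [
    ("t1", "C1", {"A", "B"}),
    ("t2", "C3", {"A", "C"}),
    ("t2", "C1", {"C", "D"}),
    ("t3", "C2", {"A", "D"}),
    ("t4", "C2", {"E"}),
    ("t5", "C1", {"A", "E"}),
]

# Regroup by customer to get the bottom table: each customer maps to a
# time-ordered list of (time, items) transactions.
by_customer = defaultdict(list)
for time, customer, items in records:
    by_customer[customer].append((time, items))

print(by_customer["C3"])  # [('t2', {'A', 'C'})]
```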
Sequence Data Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters.

2.2 Data Quality

Data quality problems can arise due to human error, limitations of measuring devices, or flaws in the data collection process. Values or even entire data objects may be missing. In other cases, there may be spurious or duplicate objects; i.e., multiple data objects that all correspond to a single real object. For example, there might be two different records for a person who has recently lived at two different addresses. Even if all the data is present and looks fine, there may be inconsistencies: a person has a height of 2 meters, but weighs only 2 kilograms.
In the next few sections, we focus on aspects of data quality that are related to data measurement and collection. We begin with a definition of measurement and data collection errors and then consider a variety of problems that involve measurement error: noise, artifacts, bias, precision, and accuracy. We conclude by discussing data quality issues that may involve both measurement and data collection problems: outliers, missing and inconsistent values, and duplicate data.
Measurement and Data Collection Errors

The term measurement error refers to any problem resulting from the measurement process. A common problem is that the value recorded differs from the true value to some extent. For continuous attributes, the numerical difference of the measured and true value is called the error. The term data collection error refers to errors such as omitting data objects or attribute values, or inappropriately including a data object. For example, a study of animals of a certain species might include animals of a related species that are similar in appearance to the species of interest. Both measurement errors and data collection errors can be either systematic or random.

We will only consider general types of errors. Within particular domains, there are certain types of data errors that are commonplace, and there often exist well-developed techniques for detecting and/or correcting these errors. For example, keyboard errors are common when data is entered manually, and as a result, many data entry programs have techniques for detecting and, with human intervention, correcting such errors.
Noise and Artifacts

Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. Figure 2.5 shows a time series before and after it has been disrupted by random noise. If a bit more noise were added to the time series, its shape would be lost.

[Figure 2.5. Noise in a time series context: (a) a time series and (b) the same time series with noise.]

Figure 2.6 shows a set of data points before and after some noise points (indicated by '+'s) have been added. Notice that some of the noise points are intermixed with the non-noise points.

[Figure 2.6. Noise in a spatial context: (a) three groups of points and (b) the same points with noise points (+) added.]
The term noise is often used in connection with data that has a spatial or temporal component. In such cases, techniques from signal or image processing can frequently be used to reduce noise and thus, help to discover patterns (signals) that might be lost in the noise. Nonetheless, the elimination of noise is frequently difficult, and much work in data mining focuses on devising robust algorithms that produce acceptable results even when noise is present.
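To make this concrete, here is a minimal sketch (our own) of disrupting a clean time series with additive Gaussian noise, in the spirit of Figure 2.5:

```python
import math
import random

random.seed(0)

# A clean periodic time series sampled at 100 points.
clean = [math.sin(2 * math.pi * t / 25) for t in range(100)]

# The same series disrupted by zero-mean Gaussian noise; a larger sigma
# would eventually obscure the underlying shape entirely.
sigma = 0.2
noisy = [x + random.gauss(0, sigma) for x in clean]
```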
Data errors may be the result of a more deterministic phenomenon, such as a streak in the same place on a set of photographs. Such deterministic distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy

In statistics and experimental science, the quality of the measurement process and the resulting data are measured by precision and bias. We provide the standard definitions, followed by a brief discussion. For the following definitions, we assume that we make repeated measurements of the same underlying quantity and use this set of values to calculate a mean (average) value that serves as our estimate of the true value.

Definition 2.3 (Precision). The closeness of repeated measurements (of the same quantity) to one another.

Definition 2.4 (Bias). A systematic variation of measurements from the quantity being measured.

Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the difference between the mean of the set of values and the known value of the quantity being measured. Bias can only be determined for objects whose measured quantity is known by means external to the current situation. Suppose that we have a standard laboratory weight with a mass of 1g and want to assess the precision and bias of our new laboratory scale. We weigh the mass five times, and obtain the following five values: 1.015, 0.990, 1.013, 1.001, 0.986. The mean of these values is 1.001, and hence, the bias is 0.001. The precision, as measured by the standard deviation, is 0.013.
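A minimal sketch (our own) reproducing this calculation:

```python
import statistics

measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
true_value = 1.0  # the known mass of the standard weight, in grams

mean = statistics.mean(measurements)
bias = mean - true_value                     # 0.001
precision = statistics.stdev(measurements)   # about 0.013

print(round(mean, 3), round(bias, 3), round(precision, 3))
```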
It is common to use the more general term, accuracy, to refer to the degree of measurement error in data.

Definition 2.5 (Accuracy). The closeness of measurements to the true value of the quantity being measured.
Missing Values

It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected; e.g., some people decline to give their age or weight. In other cases, some attributes are not applicable to all objects; e.g., often, forms have conditional parts that are filled out only when a person answers a previous question in a certain way, but for simplicity, all fields are stored. Regardless, missing values should be taken into account during the data analysis.
There are several strategies (and variations on these strategies) for dealing with missing data, each of which may be appropriate in certain circumstances. These strategies are listed next, along with an indication of their advantages and disadvantages.
Eliminate Data Objects or Attributes A simple and effective strategy is to eliminate objects with missing values. However, even a partially specified data object contains some information, and if many objects have missing values, then a reliable analysis can be difficult or impossible. Nonetheless, if a data set has only a few objects that have missing values, then it may be expedient to omit them. A related strategy is to eliminate attributes that have missing values. This should be done with caution, however, since the eliminated attributes may be the ones that are critical to the analysis.
Estimate Missing Values Sometimes missing data can be reliably estimated. For example, consider a time series that changes in a reasonably smooth fashion, but has a few, widely scattered missing values. In such cases, the missing values can be estimated (interpolated) by using the remaining values. As another example, consider a data set that has many similar data points. In this situation, the attribute values of the points closest to the point with the missing value are often used to estimate the missing value. If the attribute is continuous, then the average attribute value of the nearest neighbors is used; if the attribute is categorical, then the most commonly occurring attribute value can be taken. For a concrete illustration, consider precipitation measurements that are recorded by ground stations. For areas not containing a ground station, the precipitation can be estimated using values observed at nearby ground stations.
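A minimal sketch (our own) of the first strategy, linearly interpolating isolated gaps in a time series, where None marks a missing value:

```python
def interpolate(series):
    """Fill isolated missing values (None) with the average of the two
    neighboring observations; endpoints and longer gaps are left alone."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        if series[i] is None and series[i - 1] is not None and series[i + 1] is not None:
            filled[i] = (series[i - 1] + series[i + 1]) / 2
    return filled

print(interpolate([2.0, None, 4.0, 5.0, None, 7.0]))
# [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```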
Ignore the Missing Value during Analysis Many data mining approaches can be modified to ignore missing values. For example, suppose that objects are being clustered and the similarity between pairs of data objects needs to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only the attributes that do not have missing values. It is true that the similarity will only be approximate, but unless the total number of attributes is small or the number of missing values is high, this degree of inaccuracy may not matter much. Likewise, many classification schemes can be modified to work with missing values.
Inconsistent Values

Data can contain inconsistent values. Consider an address field, where both a zip code and city are listed, but the specified zip code area is not contained in that city. It may be that the individual entering this information transposed two digits, or perhaps a digit was misread when the information was scanned from a handwritten form. Regardless of the cause of the inconsistent values, it is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person's height should not be negative. In other cases, it can be necessary to consult an external source of information. For example, when an insurance company processes claims for reimbursement, it checks the names and addresses on the reimbursement forms against a database of its customers.

Once an inconsistency has been detected, it is sometimes possible to correct the data. A product code may have check digits, or it may be possible to double-check a product code against a list of known product codes, and then correct the code if it is incorrect, but close to a known code. The correction of an inconsistency requires additional or redundant information.
Example 2.6 (Inconsistent Sea Surface Temperature). This example illustrates an inconsistency in actual time series data that measures the sea surface temperature (SST) at various points on the ocean. SST data was originally collected using ocean-based measurements from ships or buoys, but more recently, satellites have been used to gather the data. To create a long-term data set, both sources of data must be used. However, because the data comes from different sources, the two parts of the data are subtly different. This discrepancy is visually displayed in Figure 2.7, which shows the correlation of SST values between pairs of years. If a pair of years has a positive correlation, then the location corresponding to the pair of years is colored white; otherwise it is colored black. (Seasonal variations were removed from the data since, otherwise, all the years would be highly correlated.) There is a distinct change in behavior where the data has been put together in 1983. Years within each of the two groups, 1958–1982 and 1983–1999, tend to have a positive correlation with one another, but a negative correlation with years in the other group. This does not mean that this data should not be used, only that the analyst should consider the potential impact of such discrepancies on the data mining analysis.

Figure 2.7. Correlation of SST data between pairs of years. White areas indicate positive correlation. Black areas indicate negative correlation.
Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another. Many people receive duplicate mailings because they appear in a database multiple times under slightly different names. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates, such as two distinct people with identical names. The term deduplication is often used to refer to the process of dealing with these issues.

In some cases, two or more objects are identical with respect to the attributes measured by the database, but they still represent different objects. Here, the duplicates are legitimate, but may still cause problems for some algorithms if the possibility of identical objects is not specifically accounted for in their design. An example of this is given in Exercise 13 on page 91.
2.2.2 Issues Related to Applications

Data quality issues can also be considered from an application viewpoint as expressed by the statement "data is of high quality if it is suitable for its intended use." This approach to data quality has proven quite useful, particularly in business and industry. A similar viewpoint is also present in statistics and the experimental sciences, with their emphasis on the careful design of experiments to collect the data relevant to a specific hypothesis. As with quality issues at the measurement and data collection level, there are many issues that are specific to particular applications and fields. Again, we consider only a few of the general issues.
Timeliness Some data starts to age as soon as it has been collected. In particular, if the data provides a snapshot of some ongoing phenomenon or process, such as the purchasing behavior of customers or Web browsing patterns, then this snapshot represents reality for only a limited time. If the data is out of date, then so are the models and patterns that are based on it.
Relevance The available data must contain the information necessary for the application. Consider the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information is indirectly available through other attributes.

Making sure that the objects in a data set are relevant is also challenging. A common problem is sampling bias, which occurs when a sample does not contain different types of objects in proportion to their actual occurrence in the population. For example, survey data describes only those who respond to the survey. (Other aspects of sampling are discussed further in Section 2.3.2.) Because the results of a data analysis can reflect only the data that is present, sampling bias will typically result in an erroneous analysis.
Knowledge about the Data Ideally, data sets are accompanied by documentation that describes different aspects of the data; the quality of this documentation can either aid or hinder the subsequent analysis. For example, if the documentation identifies several attributes as being strongly related, these attributes are likely to provide highly redundant information, and we may decide to keep just one. (Consider sales tax and purchase price.) If the documentation is poor, however, and fails to tell us, for example, that the missing values for a particular field are indicated with a 9999, then our analysis of the data may be faulty. Other important characteristics are the precision of the data, the type of features (nominal, ordinal, interval, ratio), the scale of measurement (e.g., meters or feet for length), and the origin of the data.
2.3 Data Preprocessing

In this section, we address the issue of which preprocessing steps should be applied to make the data more suitable for data mining. Data preprocessing is a broad area and consists of a number of different strategies and techniques that are interrelated in complex ways.

2.3.1 Aggregation

Table 2.4. Sample transaction data.

Transaction ID   Item      Store Location   Date       Price    ...
...
101123           Watch     Chicago          09/06/04   $25.99   ...
101123           Battery   Chicago          09/06/04   $5.99    ...
101124           Shoes     Minneapolis      09/06/04   $75.00   ...
...

Aggregation, the combining of two or more objects into a single object, can be viewed as the process of eliminating attributes, such as the type of item, or reducing the number of values for a particular attribute; e.g., reducing the possible values for date from 365 days to 12 months. This type of aggregation is commonly used in Online Analytical Processing (OLAP), which is discussed further in Chapter 3.
There are several motivations for aggregation. First, the smaller data sets resulting from data reduction require less memory and processing time, and hence, aggregation may permit the use of more expensive data mining algorithms. Second, aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view. In the previous example, aggregating over store locations and months gives us a monthly, per-store view of the data instead of a daily, per-item view. Finally, the behavior of groups of objects or attributes is often more stable than that of individual objects or attributes. This statement reflects the statistical fact that aggregate quantities, such as averages or totals, have less variability than the individual objects being aggregated.
Figure 2.8. Histograms of standard deviation for monthly and yearly precipitation in Australia for the period 1982 to 1993: (a) histogram of standard deviation of average monthly precipitation; (b) histogram of standard deviation of average yearly precipitation. (Axes: standard deviation, 0 to 6, versus number of land locations, 0 to 150.)
2.3.2 Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. In statistics, it has long been used for both the preliminary investigation of the data and the final data analysis. Sampling can also be very useful in data mining. However, the motivations for sampling in statistics and data mining are often different. Statisticians use sampling because obtaining the entire set of data of interest is too expensive or time consuming, while data miners sample because it is too expensive or time consuming to process all the data. In some cases, using a sampling algorithm can reduce the data size to the point where a better, but more expensive, algorithm can be used.

The key principle for effective sampling is the following: Using a sample will work almost as well as using the entire data set if the sample is representative. In turn, a sample is representative if it has approximately the same property (of interest) as the original set of data. If the mean (average) of the data objects is the property of interest, then a sample is representative if it has a mean that is close to that of the original data. Because sampling is a statistical process, the representativeness of any particular sample will vary, and the best that we can do is choose a sampling scheme that guarantees a high probability of getting a representative sample. As discussed next, this involves choosing the appropriate sample size and sampling techniques.
Sampling Approaches

There are many sampling techniques, but only a few of the most basic ones and their variations will be covered here. The simplest type of sampling is simple random sampling. For this type of sampling, there is an equal probability of selecting any particular item. There are two variations on random sampling (and other sampling techniques as well): (1) sampling without replacement, in which each item, as it is selected, is removed from the set of all objects that together constitute the population, and (2) sampling with replacement, in which objects are not removed from the population as they are selected for the sample. In sampling with replacement, the same object can be picked more than once. The samples produced by the two methods are not much different when samples are relatively small compared to the data set size, but sampling with replacement is simpler to analyze since the probability of selecting any object remains constant during the sampling process.

When the population consists of different types of objects, with widely different numbers of objects, simple random sampling can fail to adequately represent those types of objects that are less frequent. This can cause problems when the analysis requires proper representation of all object types. For example, when building classification models for rare classes, it is critical that the rare classes be adequately represented in the sample. Hence, a sampling scheme that can accommodate differing frequencies for the items of interest is needed. Stratified sampling, which starts with prespecified groups of objects, is such an approach. In the simplest version, equal numbers of objects are drawn from each group even though the groups are of different sizes. In another variation, the number of objects drawn from each group is proportional to the size of that group.
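The schemes just described can be sketched in a few lines of NumPy. This is illustrative only; the population, group labels, and 10% sampling fraction are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)                       # the population of data objects

# Simple random sampling, without and with replacement.
without_repl = rng.choice(data, size=100, replace=False)
with_repl = rng.choice(data, size=100, replace=True)

# Stratified sampling: draw from each group in proportion to its size.
groups = rng.integers(0, 4, size=data.size)  # hypothetical group label per object
parts = []
for g in np.unique(groups):
    members = data[groups == g]
    n_g = max(1, int(round(0.1 * members.size)))   # 10% of each group
    parts.append(rng.choice(members, size=n_g, replace=False))
stratified = np.concatenate(parts)
```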
Figure 2.10. Finding representative points from 10 groups. Panel (b) plots the probability that a sample contains points from each of the 10 groups as a function of sample size.
Progressive Sampling

The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample, and then increase the sample size until a sample of sufficient size has been obtained. While this technique eliminates the need to determine the correct sample size initially, it requires that there be a way to evaluate the sample to judge if it is large enough.

Suppose, for instance, that progressive sampling is used to learn a predictive model. Although the accuracy of predictive models increases as the sample size increases, at some point the increase in accuracy levels off. We want to stop increasing the sample size at this leveling-off point. By keeping track of the change in accuracy of the model as we take progressively larger samples, and by taking other samples close to the size of the current one, we can get an estimate as to how close we are to this leveling-off point, and thus, stop sampling.
2.3.3 Dimensionality Reduction

Data sets can have a large number of features. Consider a set of documents, where each document is represented by a vector whose components are the frequencies with which each word occurs in the document. In such cases, there are typically thousands or tens of thousands of attributes (components), one for each word in the vocabulary. As another example, consider a set of time series consisting of the daily closing price of various stocks over a period of 30 years. In this case, the attributes, which are the prices on specific days, again number in the thousands.
There are a variety of benefits to dimensionality reduction. A key benefit is that many data mining algorithms work better if the dimensionality, the number of attributes in the data, is lower. Some of the most common approaches to dimensionality reduction, particularly for continuous data, use techniques from linear algebra to project the data from a high-dimensional space into a lower-dimensional space. Principal Components Analysis (PCA) is a linear algebra technique for continuous attributes that finds new attributes (principal components) that (1) are linear combinations of the original attributes, (2) are orthogonal (perpendicular) to each other, and (3) capture the maximum amount of variation in the data. For example, the first two principal components capture as much of the variation in the data as is possible with two orthogonal attributes that are linear combinations of the original attributes. Singular Value Decomposition (SVD) is a linear algebra technique that is related to PCA and is also commonly used for dimensionality reduction. For additional details, see Appendices A and B.
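The following sketch shows one standard way to carry out such a projection: center the data and use the SVD to obtain the leading principal components. It is a bare-bones illustration under those assumptions, not a full PCA implementation, and the function name is our own:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)          # PCA requires centered data
    # Rows of Vt are orthonormal directions of maximum variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # lower-dimensional coordinates

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))               # 100 objects, 10 attributes
Z = pca_project(X, n_components=2)           # shape (100, 2)
```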
2.3.4 Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the features. While it might seem that such an approach would lose information, this is not the case if redundant and irrelevant features are present. Redundant features duplicate much or all of the information contained in one or more other attributes. For example, the purchase price of a product and the amount of sales tax paid contain much of the same information. Irrelevant features contain almost no useful information for the data mining task at hand. For instance, students' ID numbers are irrelevant to the task of predicting students' grade point averages. Redundant and irrelevant features can reduce classification accuracy and the quality of the clusters that are found.

While some irrelevant and redundant attributes can be eliminated immediately by using common sense or domain knowledge, selecting the best subset of features frequently requires a systematic approach. The ideal approach to feature selection is to try all possible subsets of features as input to the data mining algorithm of interest, and then take the subset that produces the best results. This method has the advantage of reflecting the objective and bias of the data mining algorithm that will eventually be used. Unfortunately, since the number of subsets involving $n$ attributes is $2^n$, such an approach is impractical in most situations and alternative strategies are needed. There are three standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches Feature selection occurs naturally as part of the data mining algorithm. Specifically, during the operation of the data mining algorithm, the algorithm itself decides which attributes to use and which to ignore. Algorithms for building decision tree classifiers, which are discussed in Chapter 4, often operate in this manner.

Filter approaches Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task. For example, we might select sets of attributes whose pairwise correlation is as low as possible.

Wrapper approaches These methods use the target data mining algorithm as a black box to find the best subset of attributes, in a way similar to that of the ideal algorithm described above, but typically without enumerating all possible subsets.

Since the embedded approaches are algorithm-specific, only the filter and wrapper approaches will be discussed further here.
An Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a common architecture. The feature selection process is viewed as consisting of four parts: a measure for evaluating a subset, a search strategy that controls the generation of a new subset of features, a stopping criterion, and a validation procedure. Filter methods and wrapper methods differ only in the way in which they evaluate a subset of features. For a wrapper method, subset evaluation uses the target data mining algorithm, while for a filter approach, the evaluation technique is distinct from the target data mining algorithm. The following discussion provides some details of this approach, which is summarized in Figure 2.11.

Conceptually, feature subset selection is a search over all possible subsets of features. Many different types of search strategies can be used, but the search strategy should be computationally inexpensive and should find optimal or near-optimal sets of features. It is usually not possible to satisfy both requirements, and thus, tradeoffs are necessary.

An integral part of the search is an evaluation step to judge how the current subset of features compares to others that have been considered. This requires an evaluation measure that attempts to determine the goodness of a subset of attributes with respect to a particular data mining task, such as classification or clustering.
Figure 2.11. Flowchart of a feature subset selection process.

2.3.6 Discretization and Binarization

Some data mining algorithms require that the data be in the form of categorical or binary attributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization). In practice, discretization or binarization is performed in a way that satisfies a criterion that is thought to have a relationship to good performance for the data mining task being considered.

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

Categorical Value   Integer Value   x1   x2   x3   x4   x5
awful               0               1    0    0    0    0
poor                1               0    1    0    0    0
OK                  2               0    0    1    0    0
good                3               0    0    0    1    0
great               4               0    0    0    0    1
Binarization

A simple technique to binarize a categorical attribute is the following: If there are $m$ categorical values, then uniquely assign each original value to an integer in the interval $[0, m-1]$. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval $[0, m-1]$.) Next, convert each of these $m$ integers to a binary number. Since $n = \lceil \log_2(m) \rceil$ binary digits are required to represent these integers, represent these binary numbers using $n$ binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables $x_1$, $x_2$, and $x_3$. The conversion is shown in Table 2.5.
Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes $x_2$ and $x_3$ are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one binary attribute for each categorical value, as in Table 2.6. If the number of resulting attributes is too large, then the techniques described below can be used to reduce the number of categorical values before binarization.

Likewise, for association problems, it may be necessary to replace a single binary attribute with two asymmetric binary attributes. Consider a binary attribute that records a person's gender, male or female. For traditional association rule algorithms, this information needs to be transformed into two asymmetric binary attributes, one that is a 1 only when the person is male and one that is a 1 only when the person is female. (For asymmetric binary attributes, the information representation is somewhat inefficient in that two bits of storage are required to represent each bit of information.)
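Both binarization schemes are easy to express in code. The sketch below is illustrative only (the value ordering follows Tables 2.5 and 2.6); it produces the $n = \lceil \log_2 m \rceil$-bit encoding and the one-attribute-per-value asymmetric encoding:

```python
import math

values = ["awful", "poor", "OK", "good", "great"]   # ordinal: order must be kept
to_int = {v: i for i, v in enumerate(values)}        # map each value to 0..m-1

m = len(values)
n = math.ceil(math.log2(m))                          # bits needed: 3 for m = 5

def binarize(v):
    """n binary attributes encoding the integer assigned to value v."""
    i = to_int[v]
    return [(i >> b) & 1 for b in reversed(range(n))]

def one_hot(v):
    """One asymmetric binary attribute per categorical value (Table 2.6)."""
    return [1 if i == to_int[v] else 0 for i in range(m)]

print(binarize("good"))   # [0, 1, 1]
print(one_hot("good"))    # [0, 0, 0, 1, 0]
```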
Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. In general, the best discretization depends on the algorithm being used, as well as the other attributes being considered. Typically, however, the discretization of an attribute is considered in isolation.

Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into $n$ intervals by specifying $n-1$ split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals $(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n]$, where $x_0$ and $x_n$ may be $-\infty$ or $+\infty$, respectively, or equivalently, as a series of inequalities $x_0 < x \le x_1, \ldots, x_{n-1} < x \le x_n$.
Unsupervised Discretization A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 8), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.
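The equal width and equal frequency approaches can be sketched in a few lines of NumPy. Split-point conventions vary; this is one simple, illustrative version with made-up data:

```python
import numpy as np

def equal_width(x, n_intervals):
    # Split points evenly spaced between the min and max of the attribute.
    edges = np.linspace(x.min(), x.max(), n_intervals + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_intervals - 1)

def equal_frequency(x, n_intervals):
    # Split points at quantiles, so each interval holds roughly the same count.
    edges = np.quantile(x, np.linspace(0, 1, n_intervals + 1))
    return np.clip(np.digitize(x, edges[1:-1]), 0, n_intervals - 1)

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(mu, 0.5, 50) for mu in (2, 6, 10, 14)])
labels_w = equal_width(x, 4)        # sensitive to outliers
labels_f = equal_frequency(x, 4)    # balanced interval counts
```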
Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers, the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines. If we measure the performance of a discretization technique by the extent to which different objects in different groups are assigned the same categorical value, then K-means performs best, followed by equal frequency, and finally, equal width.

Figure 2.13. Different discretization techniques: (a) original data, (b) equal width discretization, (c) equal frequency discretization, (d) K-means discretization.
Supervised Discretization The discretization methods described above are usually better than no discretization, but keeping the end purpose in mind and using additional information (class labels) often produces better results. This should not be surprising, since an interval constructed with no knowledge of class labels often contains a mixture of class labels. A conceptually simple approach is to place the splits in a way that maximizes the purity of the intervals. In practice, however, such an approach requires potentially arbitrary decisions about the purity of an interval and the minimum size of an interval. To overcome such concerns, some statistically based approaches start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. Entropy-based approaches are one of the most promising approaches to discretization, and a simple approach based on entropy will be presented.
First, it is necessary to define entropy. Let $k$ be the number of different class labels, $m_i$ be the number of values in the $i$th interval of a partition, and $m_{ij}$ be the number of values of class $j$ in interval $i$. Then the entropy $e_i$ of the $i$th interval is given by the equation

$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$$

where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class $j$ in the $i$th interval. The total entropy, $e$, of the partition is the weighted average of the individual interval entropies, i.e.,

$$e = \sum_{i=1}^{n} w_i e_i,$$

where $m$ is the number of values, $w_i = m_i/m$ is the fraction of values in the $i$th interval, and $n$ is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.
A simple approach for partitioning a continuous attribute starts by bisecting the initial values so that the resulting two intervals give minimum entropy. This technique only needs to consider each value as a possible split point, because it is assumed that intervals contain ordered sets of values. The splitting process is then repeated with another interval, typically choosing the interval with the worst (highest) entropy, until a user-specified number of intervals is reached, or a stopping criterion is satisfied.
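A minimal, unoptimized sketch of the entropy computation and one bisection step follows; the function names are our own:

```python
import numpy as np

def entropy(labels):
    """Entropy of one interval: -sum_j p_ij * log2(p_ij)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, labels):
    """Split point minimizing the weighted (total) entropy of the two halves."""
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    best_e, best_split_point = np.inf, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                      # identical values cannot be separated
        w = i / len(x)
        e = w * entropy(labels[:i]) + (1 - w) * entropy(labels[i:])
        if e < best_e:
            best_e, best_split_point = e, (x[i - 1] + x[i]) / 2
    return best_e, best_split_point
```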
Example 2.13 (Discretization of Two Attributes). This method was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals.

This simple example illustrates two aspects of discretization. First, in two dimensions, the classes of points are well separated, but in one dimension, this is not so. In general, discretizing each attribute separately often guarantees suboptimal results. Second, five intervals work better than three, but six intervals do not improve the discretization much, at least in terms of entropy. (Entropy values and results for six intervals are not shown.) Consequently, it is desirable to have a stopping criterion that automatically finds the right number of partitions.
Categorical Attributes with Too Many Values

Categorical attributes can sometimes have too many values. If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous attributes can be used to reduce the number of categories. If the categorical attribute is nominal, however, then other approaches are needed. Consider a university that has a large number of departments. In such cases, domain knowledge can often be used to combine values into larger groups, e.g., grouping departments into colleges.

2.3.7 Variable Transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. Simple functions such as $x^k$, $\log x$, $e^x$, $\sqrt{|x|}$, $1/x$, or $\sin x$ are commonly used for this purpose. An exercise on page 92 explores other aspects of variable transformation.
Normalization or Standardization

Another common type of variable transformation is the standardization or normalization of a variable. (In the data mining community the terms are often used interchangeably. In statistics, however, the term normalization can be confused with the transformations used for making a variable normal, i.e., Gaussian.) The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of standardizing a variable in statistics. If $\bar{x}$ is the mean (average) of the attribute values and $s_x$ is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be combined in some way, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the calculation. To illustrate, consider comparing people based on two variables: age and income. For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). If the differences in the range of values of age and income are not taken into account, then the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.
The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if $x$ is a variable, then the absolute standard deviation of $x$ is given by

$$\sigma_A = \sum_{i=1}^{m} |x_i - \mu|,$$

where $x_i$ is the $i$th value of the variable, $m$ is the number of objects, and $\mu$ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to define a standardization transformation.
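Both the classical and the outlier-resistant transformations are one-liners in NumPy. This is a sketch; the robust version below centers on the median and scales by the absolute deviation just described, and the income figures are invented:

```python
import numpy as np

def standardize(x):
    """Classical z-score: mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std(ddof=1)

def robust_standardize(x):
    """Outlier-resistant variant: center on the median, scale by the
    absolute deviation about the median."""
    mu = np.median(x)
    abs_dev = np.abs(x - mu).sum()
    return (x - mu) / abs_dev

income = np.array([30_000, 45_000, 52_000, 61_000, 2_000_000], dtype=float)
print(standardize(income))         # dominated by the outlier
print(robust_standardize(income))  # far less sensitive to it
```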
2.4 Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many cases, the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis.

We begin with a discussion of the basics: high-level definitions of similarity and dissimilarity, and a discussion of how they are related. For convenience, the term proximity is used to refer to either similarity or dissimilarity. Since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects, we first describe how to measure the proximity between objects having only one simple attribute, and then consider proximity measures for objects with multiple attributes. This includes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data like documents. Next, we consider several important issues concerning proximity measures. The section concludes with a brief discussion of how to select the right proximity measure.
2.4.1 Basics

Definitions

Informally, the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).

The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall see, distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to $\infty$.
Transformations

Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1]. For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to work only with dissimilarities, or it may work only with similarities in the interval [0, 1]. We discuss these issues here because we will employ such transformations later in our discussion of proximity. In addition, these issues are relatively independent of the details of specific proximity measures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation $s' = (s-1)/9$, where $s$ and $s'$ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression $s' = (s - \min s)/(\max s - \min s)$, where $\max s$ and $\min s$ are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula $d' = (d - \min d)/(\max d - \min d)$.

There can be complications in mapping proximity measures to the interval [0, 1], however. If, for example, the proximity measure originally takes values in the interval $[0, \infty)$, then a non-linear transformation is needed, and values will not have the same relationship to one another on the new scale. Consider the transformation $d' = d/(1+d)$ for a dissimilarity measure that ranges from 0 to $\infty$; the dissimilarities 0, 0.5, 2, 10, 100, and 1000 are transformed into 0, 0.33, 0.67, 0.91, 0.99, and 0.999, respectively, so larger dissimilarities on the original scale are compressed into a narrow range on the new scale. Conversion in the opposite direction is considered in Exercise 23 on page 94.

In general, any monotonic decreasing function can be used to convert dissimilarities to similarities, or vice versa. Of course, other factors also must be considered when transforming similarities to dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new scale. We have mentioned issues related to preserving meaning, distortion of scale, and requirements of data analysis tools, but this list is certainly not exhaustive.
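The transformations above are simple enough to state directly in code. A short sketch, with function names of our own choosing:

```python
def sim_to_unit_interval(s, s_min, s_max):
    """Map a similarity with a known finite range onto [0, 1]."""
    return (s - s_min) / (s_max - s_min)

def dissim_to_unit_interval(d):
    """Map a dissimilarity in [0, inf) onto [0, 1); non-linear, so it
    compresses large dissimilarities."""
    return d / (1.0 + d)

def dissim_to_sim(d):
    """One common monotonic decreasing conversion."""
    return 1.0 / (1.0 + d)

for d in (0, 0.5, 2, 10, 100, 1000):
    print(d, round(dissim_to_unit_interval(d), 3))
```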
2.4.2 Similarity and Dissimilarity between Simple Attributes

The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss proximity between objects having a single attribute. Consider objects described by one nominal attribute. What would it mean for two such objects to be similar? Since nominal attributes only convey information about the distinctness of objects, all we can say is that two objects either have the same value or they do not. Hence, in this case similarity is traditionally defined as 1 if attribute values match, and as 0 otherwise. A dissimilarity would be defined in the opposite way: 0 if the attribute values match, and 1 if they do not.
For objects with a single ordinal attribute, the situation is more complicated because information about order should be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, $d(P1, P2) = 3 - 2 = 1$ or, if we want the dissimilarity to fall between 0 and 1, $d(P1, P2) = \frac{3-2}{4} = 0.25$. A similarity for ordinal attributes can then be defined as $s = 1 - d$.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since it assumes equal intervals, and this is not so. (Otherwise, we would have an interval or ratio attribute.) Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in the absence of more information, this is the standard approach, as noted earlier.
2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, $d$, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}, \qquad (2.1)$$

where $n$ is the number of dimensions and $x_k$ and $y_k$ are, respectively, the $k$th attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.
The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}, \qquad (2.2)$$

where $r$ is a parameter. The following are the three most common examples of Minkowski distances.
r = 1. City block (Manhattan, taxicab, $L_1$ norm) distance. A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance ($L_2$ norm).

r = $\infty$. Supremum ($L_{\max}$ or $L_\infty$ norm) distance. This is the maximum difference between any attribute of the two objects.

Tables 2.10 and 2.11, respectively, give the $L_1$ and $L_\infty$ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the $ij$th entry is the same as the $ji$th entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Figure 2.15. Four two-dimensional points.

Table 2.8. x and y coordinates of four points.

point   x coordinate   y coordinate
p1      0              2
p2      2              0
p3      3              1
p4      5              1

Table 2.9. Euclidean distance matrix for Table 2.8.

     p1    p2    p3    p4
p1   0.0   2.8   3.2   5.1
p2   2.8   0.0   1.4   3.2
p3   3.2   1.4   0.0   2.0
p4   5.1   3.2   2.0   0.0

Table 2.10. L1 distance matrix for Table 2.8.

     p1    p2    p3    p4
p1   0.0   4.0   4.0   6.0
p2   4.0   0.0   2.0   4.0
p3   4.0   2.0   0.0   2.0
p4   6.0   4.0   2.0   0.0

Table 2.11. L-infinity distance matrix for Table 2.8.

     p1    p2    p3    p4
p1   0.0   2.0   3.0   5.0
p2   2.0   0.0   1.0   3.0
p3   3.0   1.0   0.0   2.0
p4   5.0   3.0   2.0   0.0
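The distance matrices in Tables 2.9 through 2.11 can be reproduced with a short NumPy sketch of the Minkowski distance (the function name is our own):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # Table 2.8

def minkowski_matrix(X, r):
    diff = np.abs(X[:, None, :] - X[None, :, :])   # pairwise |x_k - y_k|
    if np.isinf(r):
        return diff.max(axis=2)                     # supremum (L_inf) distance
    return (diff ** r).sum(axis=2) ** (1.0 / r)

print(minkowski_matrix(points, 2).round(1))   # Euclidean, Table 2.9
print(minkowski_matrix(points, 1).round(1))   # city block, Table 2.10
print(minkowski_matrix(points, np.inf))       # supremum, Table 2.11
```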
Distances, such as the Euclidean distance, have some well-known properties. If $d(\mathbf{x}, \mathbf{y})$ is the distance between two points, x and y, then the following properties hold.

1. Positivity
   (a) $d(\mathbf{x}, \mathbf{y}) \ge 0$ for all x and y,
   (b) $d(\mathbf{x}, \mathbf{y}) = 0$ only if $\mathbf{x} = \mathbf{y}$.
2. Symmetry
   $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$ for all x and y.

3. Triangle Inequality
   $d(\mathbf{x}, \mathbf{z}) \le d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$ for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people use the term distance only for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. We give two examples of such measures.
Example 2.14 (Non-metric Dissimilarities: Set Difference). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A − B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 94.
Example 2.15 (Non-metric Dissimilarities: Time). This example gives a more everyday example of a dissimilarity measure that is not a metric, but that is still useful. Define a measure of the distance between times of the day as follows:

$$d(t_1, t_2) = \begin{cases} t_2 - t_1 & \text{if } t_1 \le t_2 \\ 24 + (t_2 - t_1) & \text{if } t_1 \ge t_2 \end{cases} \qquad (2.4)$$

To illustrate, d(1PM, 2PM) = 1 hour, while d(2PM, 1PM) = 23 hours. Such a definition would make sense, for example, when answering the question: If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?
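Equation 2.4 translates directly into code. A sketch with times expressed in hours on a 24-hour clock:

```python
def time_distance(t1, t2):
    """Hours until the next occurrence of t2, starting from t1 (Equation 2.4)."""
    return t2 - t1 if t1 <= t2 else 24 + (t2 - t1)

print(time_distance(13, 14))  # d(1PM, 2PM) = 1
print(time_distance(14, 13))  # d(2PM, 1PM) = 23; not symmetric, so not a metric
```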
2.4.4 Similarities between Data Objects

For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if $s(\mathbf{x}, \mathbf{y})$ is the similarity between points x and y, then the typical properties of similarities are the following:

1. $s(\mathbf{x}, \mathbf{y}) = 1$ only if $\mathbf{x} = \mathbf{y}$. $(0 \le s \le 1)$

2. $s(\mathbf{x}, \mathbf{y}) = s(\mathbf{y}, \mathbf{x})$ for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity measures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific similarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.
Example 2.16 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. For instance, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. If we take these counts as a measure of the similarity between two characters, then we have a similarity measure, but one that is not symmetric. In such situations, the similarity measure is often made symmetric by setting $s'(\mathbf{x}, \mathbf{y}) = s'(\mathbf{y}, \mathbf{x}) = (s(\mathbf{x}, \mathbf{y}) + s(\mathbf{y}, \mathbf{x}))/2$.

2.4.5 Examples of Proximity Measures

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. The comparison of two objects with n binary attributes, x and y, leads to the following four quantities (frequencies):

$f_{00}$ = the number of attributes where x is 0 and y is 0
$f_{01}$ = the number of attributes where x is 0 and y is 1
$f_{10}$ = the number of attributes where x is 1 and y is 0
$f_{11}$ = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

$$\text{SMC} = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}. \qquad (2.5)$$

This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.
Jaccard Coefficient Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Since the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}. \qquad (2.6)$$

Example 2.17 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

$f_{01}$ = 2, the number of attributes where x was 0 and y was 1
$f_{10}$ = 1, the number of attributes where x was 1 and y was 0
$f_{00}$ = 7, the number of attributes where x was 0 and y was 0
$f_{11}$ = 0, the number of attributes where x was 1 and y was 1

$$\text{SMC} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}} = \frac{0 + 7}{2 + 1 + 0 + 7} = 0.7$$

$$J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}} = \frac{0}{2 + 1 + 0} = 0$$
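A sketch of both coefficients, reproducing the numbers in Example 2.17:

```python
def smc_and_jaccard(x, y):
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x, y))   # (0.7, 0.0)
```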
Cosine Similarity

Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. It is more complicated than this, of course, since certain common words are ignored and various processing techniques are used to account for different forms of the same word, differing document lengths, and different word frequencies.

Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few non-zero attributes. (The normalizations used for documents do not create a non-zero entry where there was a zero entry; i.e., they preserve sparsity.) Thus, as with transaction data, similarity should not depend on the number of shared 0 values since any two documents are likely to "not contain" many of the same words, and therefore, if 00 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 00 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}, \qquad (2.7)$$

where $\cdot$ indicates the vector dot product, $\mathbf{x} \cdot \mathbf{y} = \sum_{k=1}^{n} x_k y_k$, and $\|\mathbf{x}\|$ is the length of vector x, $\|\mathbf{x}\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{\mathbf{x} \cdot \mathbf{x}}$.
Example 2.18 (Cosine Similarity of Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

$\mathbf{x} \cdot \mathbf{y} = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$
$\|\mathbf{x}\| = \sqrt{3 \cdot 3 + 2 \cdot 2 + 0 \cdot 0 + 5 \cdot 5 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 2 \cdot 2 + 0 \cdot 0 + 0 \cdot 0} = 6.48$
$\|\mathbf{y}\| = \sqrt{1 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 2 \cdot 2} = 2.45$
$\cos(\mathbf{x}, \mathbf{y}) = 0.31$

As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).
Equation 2.7 can also be written as Equation 2.8:

$$\cos(\mathbf{x}, \mathbf{y}) = \mathbf{x}' \cdot \mathbf{y}', \qquad (2.8)$$

where $\mathbf{x}' = \mathbf{x}/\|\mathbf{x}\|$ and $\mathbf{y}' = \mathbf{y}/\|\mathbf{y}\|$. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the magnitude of the two data objects into account when computing similarity. (Euclidean distance might be a better choice when magnitude is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.
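A sketch of Equations 2.7 and 2.8 in NumPy, checked against Example 2.18:

```python
import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))   # Equation 2.7

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
print(round(cosine(x, y), 2))        # 0.31

# Equation 2.8: normalize once, then similarity is a plain dot product.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
print(round(xn @ yn, 2))             # 0.31
```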
Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto coefficient. (However, there is another coefficient that is also known as the Tanimoto coefficient.) This coefficient, which we shall represent as EJ, is defined by the following equation:

$$EJ(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2 - \mathbf{x} \cdot \mathbf{y}}. \qquad (2.9)$$
Correlation

The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects. (The calculation of correlation between attributes, which is more common, can be defined similarly.) More precisely, Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:

$$\text{corr}(\mathbf{x}, \mathbf{y}) = \frac{\text{covariance}(\mathbf{x}, \mathbf{y})}{\text{standard deviation}(\mathbf{x}) \times \text{standard deviation}(\mathbf{y})} = \frac{s_{xy}}{s_x s_y}, \qquad (2.10)$$

where we are using the following standard statistical notation and definitions:

$$\text{covariance}(\mathbf{x}, \mathbf{y}) = s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) \qquad (2.11)$$

$$\text{standard deviation}(\mathbf{x}) = s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$

$$\text{standard deviation}(\mathbf{y}) = s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$$

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \ \text{is the mean of } \mathbf{x}, \qquad \bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k \ \text{is the mean of } \mathbf{y}.$$
Example 2.19 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, $x_k = a y_k + b$, where $a$ and $b$ are constants. The following two sets of values for x and y indicate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

x = (−3, 6, 0, 3, −6)
y = (1, −2, 0, −1, 2)

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
78 Chapter 2 Data
1.00 0.90 0.80 0.70 0.60 0.50 0.40
0.30 0.20 0.10
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
Figure 2.17. Scatter plot illutrating correlation from 1 to 1.
Example 2.20 (Non-linear Relationships). If the correlation is 0, then there is no linear relationship between the attributes of the two data objects. However, non-linear relationships may still exist. In the following example, $y_k = x_k^2$, but their correlation is 0.

x = (−3, −2, −1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
Example 2.21 (Visualizing Correlation). It is also easy to judge the correlation between two data objects x and y by plotting pairs of corresponding attribute values. Figure 2.17 shows a number of these plots when x and y have 30 attributes and the values of these attributes are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 attributes; its x coordinate is the value of one of the attributes for x, while its y coordinate is the value of the same attribute for y.

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Notice that this is not the same as the standardization used in other contexts, where we make the transformations $x'_k = (x_k - \bar{x})/s_x$ and $y'_k = (y_k - \bar{y})/s_y$.
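A sketch of Equation 2.10, checked against Examples 2.19 and 2.20:

```python
import numpy as np

def corr(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    sxy = (xc * yc).sum() / (len(x) - 1)          # covariance, Equation 2.11
    return sxy / (x.std(ddof=1) * y.std(ddof=1))  # Equation 2.10

x = np.array([-3.0, 6, 0, 3, -6]); y = np.array([1.0, -2, 0, -1, 2])
print(corr(x, y))                                  # -1.0, perfect negative

x = np.array([-3.0, -2, -1, 0, 1, 2, 3]); y = x ** 2
print(corr(x, y))                                  # 0.0, despite y = x^2
```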
Bregman Divergence This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 8.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.
More formally, we have the following definition.

Definition 2.6 (Bregman Divergence). Given a strictly convex function $\phi$ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) $D(\mathbf{x}, \mathbf{y})$ generated by that function is given by the following equation:

$$D(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle, \qquad (2.12)$$

where $\nabla \phi(\mathbf{y})$ is the gradient of $\phi$ evaluated at y, $\mathbf{x} - \mathbf{y}$ is the vector difference between x and y, and $\langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle$ is the inner product between $\nabla \phi(\mathbf{y})$ and $(\mathbf{x} - \mathbf{y})$. For points in Euclidean space, the inner product is just the dot product.

$D(\mathbf{x}, \mathbf{y})$ can be written as $D(\mathbf{x}, \mathbf{y}) = \phi(\mathbf{x}) - L(\mathbf{x})$, where $L(\mathbf{x}) = \phi(\mathbf{y}) + \langle \nabla \phi(\mathbf{y}), (\mathbf{x} - \mathbf{y}) \rangle$ represents the equation of a plane that is tangent to the function $\phi$ at y. Using calculus terminology, $L(\mathbf{x})$ is the linearization of $\phi$ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for $\phi$.
Example 2.22. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let $x$ and $y$ be real numbers and $\phi(t)$ be the real-valued function, $\phi(t) = t^2$. In that case, the gradient reduces to the derivative and the dot product reduces to multiplication. Specifically, Equation 2.12 becomes Equation 2.13.

$$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2 \qquad (2.13)$$

The graph for this example, with $y = 1$, is shown in Figure 2.18. The Bregman divergence is shown for two values of $x$: $x = 2$ and $x = 3$.
Figure 2.18. Illustration of Bregman divergence. (The graph shows $\phi(x) = x^2$ together with the tangent line $y = 2x - 1$ at $y = 1$; the divergences D(2, 1) and D(3, 1) appear as vertical gaps between the curve and the tangent line.)
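The one-dimensional example translates directly into code. A sketch, where `phi` and `grad` encode $\phi(t) = t^2$ and its derivative:

```python
def bregman(x, y, phi, grad):
    """D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>   (Equation 2.12)."""
    return phi(x) - phi(y) - grad(y) * (x - y)

phi = lambda t: t ** 2
grad = lambda t: 2 * t

print(bregman(2, 1, phi, grad))   # 1 = (2 - 1)^2
print(bregman(3, 1, phi, grad))   # 4 = (3 - 1)^2
```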
2.4.6 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculation when attributes have different weights; i.e., when not all attributes contribute equally to the proximity of objects.
When objects contain different types of attributes, a general approach is to compute the similarity between each attribute separately and then combine these similarities:

1: For the $k$th attribute, compute a similarity, $s_k(\mathbf{x}, \mathbf{y})$, in the range [0, 1].

2: Define an indicator variable, $\delta_k$, for the $k$th attribute as follows:

$$\delta_k = \begin{cases} 0 & \text{if the } k\text{th attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the } k\text{th attribute} \\ 1 & \text{otherwise} \end{cases}$$

3: Compute the overall similarity between the two objects using the following formula:

$$\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k} \qquad (2.15)$$

If not all attributes are equally important, the formulas for proximity can be modified by weighting the contribution of each attribute. If the weights $w_k$ sum to 1, then (2.15) becomes

$$\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k}. \qquad (2.16)$$

The definition of the Minkowski distance can also be modified as follows:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}. \qquad (2.17)$$
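A sketch of Equations 2.15 and 2.16 for objects with mixed attribute types follows. The per-attribute similarity functions, example objects, and helper names are placeholders of our own; substitute whatever is appropriate to each attribute:

```python
def combined_similarity(x, y, sim_funcs, asymmetric, weights=None):
    """Weighted similarity over heterogeneous attributes (Eqs. 2.15/2.16)."""
    n = len(x)
    weights = weights or [1.0 / n] * n
    num = den = 0.0
    for k in range(n):
        if x[k] is None or y[k] is None:               # missing value: delta_k = 0
            continue
        if asymmetric[k] and x[k] == 0 and y[k] == 0:  # 00 match: delta_k = 0
            continue
        num += weights[k] * sim_funcs[k](x[k], y[k])
        den += 1.0                                     # delta_k = 1
    return num / den if den else 0.0

match = lambda a, b: 1.0 if a == b else 0.0            # nominal attribute
x = ("red", 1, 0); y = ("red", 0, 0)
print(combined_similarity(x, y, [match] * 3, [False, True, True]))
```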
2.4.7 Selecting the Right Proximity Measure

The following are a few general observations that may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier.

For sparse data, which often consists of asymmetric attributes, we typically employ similarity measures that ignore 00 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of characteristics they both lack. More specifically, for sparse, asymmetric data, most objects have only a few of the characteristics described by the attributes, and thus, are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.

There are other characteristics of data vectors that may need to be considered. Suppose, for example, that we are interested in comparing time series. If the magnitude of the time series is important (for example, each time series represents total sales of the same organization for a different year), then we could use Euclidean distance. If the time series represent different quantities (for example, blood pressure and oxygen consumption), then we usually want to determine if the time series have the same shape, not the same magnitude. Correlation, which uses a built-in normalization that accounts for differences in magnitude and level, would be more appropriate.
2.5 Bibliographic Notes

Discretization is a topic that has been well investigated in data mining. Some classification algorithms only work with categorical data, and association analysis requires binary data, and thus, there is a significant motivation to investigate how to best binarize or discretize continuous attributes. For association analysis, we refer the reader to work by Srikant and Agrawal [78], while some useful references for discretization in the area of classification include work by Dougherty et al. [51], Elomaa and Rousu [52], Fayyad and Irani [53], and Hussain et al. [56].

Feature selection is another topic well investigated in data mining. A broad coverage of this topic is provided in a survey by Molina et al. [71] and two books by Liu and Motada [66, 67]. Other useful papers include those by Blum and Langley [46], Kohavi and John [62], and Liu et al. [68].

It is difficult to provide references for the subject of feature transformations because practices vary from one discipline to another. Many statistics books have a discussion of transformations, but typically the discussion is restricted to a particular purpose, such as ensuring the normality of a variable or making sure that variables have equal variance. We offer two references: Osborne [73] and Tukey [83].

While we have covered some of the most commonly used distance and similarity measures, there are hundreds of such measures and more are being created all the time. As with so many other topics in this chapter, many of these measures are specific to particular fields; e.g., in the area of time series see papers by Kalpakis et al. [59] and Keogh and Pazzani [61]. Clustering books provide the best general discussions. In particular, see the books by Anderberg [45], Jain and Dubes [57], Kaufman and Rousseeuw [60], and Sneath and Sokal [77].
Bibliography

[45] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, December 1973.
[46] A. Blum and P. Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1–2):245–271, 1997.
[47] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (Studies in Classification, Data Analysis, and Knowledge Organization). Springer-Verlag Telos, January 2000.
[48] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, February 1997.
[49] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd edition, July 1977.
[50] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial & Applied Mathematics, September 1997.
[51] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 194–202, 1995.
[52] T. Elomaa and J. Rousu. General and Efficient Multisplitting of Numerical Attributes. Machine Learning, 36(3):201–244, 1999.
[53] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027. Morgan Kaufmann, 1993.
[54] F. H. Gaohua Gu and H. Liu. Sampling and Its Application in Data Mining: A Survey. Technical Report TRA6/00, National University of Singapore, Singapore, 2000.
[55] D. J. Hand. Statistics and the Theory of Measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):445–492, 1996.
[56] F. Hussain, H. Liu, C. L. Tan, and M. Dash. Discretization: an enabling technique. Technical Report TRC6/99, National University of Singapore, Singapore, 1999.
[57] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988. Book available online at http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf.
[58] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 2nd edition, October 2002.
[59] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for Effective Clustering of ARIMA Time-Series. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages 273–280. IEEE Computer Society, 2001.
[60] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York, November 1990.
[61] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD, pages 285–289, 2000.
[62] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[63] D. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of Measurements: Volume 1: Additive and polynomial representations. Academic Press, New York, 1971.
[64] J. B. Kruskal and E. M. Uslaner. Multidimensional Scaling. Sage Publications, August 1978.
[65] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[66] H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer International Series in Engineering and Computer Science, 453. Kluwer Academic Publishers, July 1998.
…Mathematical Statistics, 28(3):602–632, September 1957.
[84] R. Y. Wang, M. Ziad, Y. W. Lee, and Y. R. Wang. Data Quality. The Kluwer International Series on Advances in Database Systems, Volume 23. Kluwer Academic Publishers, January 2001.
[85] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of Sampling for Data Mining of Association Rules. Technical Report TR617, Rensselaer Polytechnic Institute, 1996.
2.6 Exercises

1. In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people's judgments.
(d) Angles as measured in degrees between 0 and 360.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
(j) Military rank.
(k) Distance from the center of campus.
(l) Density of a substance in grams per cubic centimeter.
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.)
3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
(b) What can you say about the attribute type of the original product satisfaction attribute?
4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"

(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.
(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?
(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?
5. Can you think of a situation in which identification numbers would be useful for prediction?

…phants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
15. You are given a set of m objects that is divided into K groups, where the ith group is of size $m_i$. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

(a) We randomly select $n \times m_i / m$ elements from each group.
(b) We randomly select n elements from the data set, without regard for the group to which an object belongs.
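As a starting point for this exercise, the two schemes can be simulated directly; the group sizes and data below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# K = 3 groups with sizes m_i; labels identify group membership
labels = np.repeat([0, 1, 2], [1000, 100, 10])
m, n = len(labels), 50

# (a) draw n * m_i / m elements from each group (with replacement)
strat = np.concatenate([
    rng.choice(np.where(labels == g)[0],
               size=round(n * np.sum(labels == g) / m), replace=True)
    for g in np.unique(labels)
])

# (b) simple random sampling from the whole data set
simple = rng.choice(m, size=n, replace=True)

print("scheme (a) group counts:", np.bincount(labels[strat], minlength=3))
print("scheme (b) group counts:", np.bincount(labels[simple], minlength=3))
```

Comparing the per-group counts over repeated runs hints at how the two schemes treat small groups differently.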
16. Consider a document-term matrix, where $tf_{ij}$ is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

$$tf'_{ij} = tf_{ij} \times \log \frac{m}{df_i}, \qquad (2.18)$$

where $df_i$ is the number of documents in which the ith term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document? In every document?
(b) What might be the purpose of this transformation?
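Equation (2.18) can be applied to a whole document-term matrix at once; the tiny matrix below is made up for illustration, and the natural logarithm is an assumption.

```python
import numpy as np

# rows = terms, columns = documents (a made-up 3-term, 4-document example)
tf = np.array([[2, 0, 3, 1],    # term appears in 3 of 4 documents
               [5, 5, 5, 5],    # term appears in every document
               [0, 0, 7, 0]])   # term appears in exactly one document
m = tf.shape[1]                     # number of documents
df = np.count_nonzero(tf, axis=1)   # document frequency of each term

tf_idf = tf * np.log(m / df)[:, None]  # Eq. (2.18)
print(tf_idf)
# Second row (everywhere): log(4/4) = 0, so the term is zeroed out.
# Third row (one document): scaled by log(4/1), the largest factor.
```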
17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute $x^* = \sqrt{x}$. As part of your analysis, you identify an interval (a, b) in which $x^*$ has a linear relationship to another attribute y.
Figure 2.20. Graphs for Exercise 20: (b) relationship between Euclidean distance and correlation (y-axis: Euclidean distance).
21. Show that the set difference metric given by

$$d(A, B) = \mathrm{size}(A - B) + \mathrm{size}(B - A) \qquad (2.19)$$

satisfies the metric axioms given on page 70. A and B are sets and A − B is the set difference.
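One way to build intuition (not a proof) is to spot-check the axioms on random sets; the helper below is a direct transcription of (2.19).

```python
import random

def d(a, b):
    """Set-difference distance of Eq. (2.19)."""
    return len(a - b) + len(b - a)   # size(A - B) + size(B - A)

random.seed(0)
universe = range(10)
sets = [frozenset(random.sample(universe, random.randint(0, 10)))
        for _ in range(50)]

for a in sets:
    for b in sets:
        assert d(a, b) >= 0                      # non-negativity
        assert (d(a, b) == 0) == (a == b)        # identity of indiscernibles
        assert d(a, b) == d(b, a)                # symmetry
        for c in sets:
            assert d(a, c) <= d(a, b) + d(b, c)  # triangle inequality
print("all metric axioms hold on the sample")
```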
22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.
23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].
24. Proximity is typically defined between a pair of objects.

(a) Define two ways in which you might define the proximity among a group of objects.
(b) How might you define the distance between two sets of points in Euclidean space?
(c) How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)
25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

(a) If the goal is to find all points within a specified distance of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations.
For a set of m ordered values $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(m)}$,

$$\operatorname{median}(x) = \begin{cases} x_{(r+1)} & \text{if } m \text{ is odd, i.e., } m = 2r + 1 \\ \frac{1}{2}\bigl(x_{(r)} + x_{(r+1)}\bigr) & \text{if } m \text{ is even, i.e., } m = 2r \end{cases} \qquad (3.3)$$

To summarize, the median is the middle value if there are an odd number of values, and the average of the two middle values if the number of values is even. Thus, for seven values, the median is $x_{(4)}$, while for ten values, the median is $\frac{1}{2}(x_{(5)} + x_{(6)})$.
Although the mean is sometimes interpreted as the middle of a set of values, this is only correct if the values are distributed in a symmetric manner. If the distribution of values is skewed, then the median is a better indicator of the middle. Also, the mean is sensitive to the presence of outliers. For data with outliers, the median again provides a more robust estimate of the middle of a set of values.

To overcome problems with the traditional definition of a mean, the notion of a trimmed mean is sometimes used. A percentage p between 0 and 100 is specified, the top and bottom (p/2)% of the data is thrown out, and the mean is then calculated in the normal way. The median is a trimmed mean with p = 100%, while the standard mean corresponds to p = 0%.

Example 3.3. Consider the set of values {1, 2, 3, 4, 5, 90}. The mean of these values is 17.5, while the median is 3.5. The trimmed mean with p = 40% is also 3.5.
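The trimmed mean of Example 3.3 can be reproduced with a few lines of Python (a direct implementation of the definition above).

```python
def trimmed_mean(values, p):
    """Drop the top and bottom (p/2)% of the sorted values, then average."""
    v = sorted(values)
    k = int(len(v) * p / 200)          # number of values cut from each end
    v = v[k:len(v) - k] if k else v
    return sum(v) / len(v)

data = [1, 2, 3, 4, 5, 90]
print(trimmed_mean(data, 0))    # 17.5  (ordinary mean, p = 0%)
print(trimmed_mean(data, 40))   # 3.5   (drops 1 and 90)
```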
Example 3.4. The means, medians, and trimmed means (p = 20%) of the four quantitative attributes of the Iris data are given in Table 3.3. The three measures of location have similar values except for the attribute petal length.

Table 3.3. Means and medians for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Measure              Sepal Length  Sepal Width  Petal Length  Petal Width
mean                     5.84          3.05         3.76          1.20
median                   5.80          3.00         4.35          1.30
trimmed mean (20%)       5.79          3.02         3.72          1.12
3.2.4 Measures of Spread: Range and Variance

Another set of commonly used summary statistics for continuous data are those that measure the dispersion or spread of a set of values. Such measures indicate if the attribute values are widely spread out or if they are relatively concentrated around a single point, such as the mean. Because measures of spread based on the mean, such as the variance, are sensitive to outliers, more robust measures of the spread of a set of values are often used. Following are the definitions of three such measures: the absolute average deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR). Table 3.4 shows these measures for the Iris data set.

$$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}| \qquad (3.6)$$

$$\mathrm{MAD}(x) = \operatorname{median}\bigl(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\bigr) \qquad (3.7)$$

$$\operatorname{interquartile\ range}(x) = x_{75\%} - x_{25\%} \qquad (3.8)$$
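Definitions (3.6) through (3.8) translate directly into NumPy; the sample values below are arbitrary and include a deliberate outlier.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 90.0])  # note the outlier
mean = x.mean()

aad = np.mean(np.abs(x - mean))                     # Eq. (3.6)
mad = np.median(np.abs(x - mean))                   # Eq. (3.7)
iqr = np.percentile(x, 75) - np.percentile(x, 25)   # Eq. (3.8)

print(f"AAD = {aad:.2f}, MAD = {mad:.2f}, IQR = {iqr:.2f}")
# MAD and IQR barely react to the outlier 90; AAD is pulled up more.
```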
3.2.5 Multivariate Summary Statistics

Measures of location for data that consists of several attributes (multivariate data) can be obtained by computing the mean or median separately for each attribute. Thus, given a data set, the mean of the data objects, $\bar{\mathbf{x}}$, is given by

$$\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n), \qquad (3.9)$$

where $\bar{x}_i$ is the mean of the ith attribute $x_i$.

For multivariate data, the spread of each attribute can be computed independently of the other attributes using any of the approaches described in Section 3.2.4. However, for data with continuous variables, the spread of the data is most commonly captured by the covariance matrix S, whose ijth entry $s_{ij}$ is the covariance of the ith and jth attributes of the data. Thus, if $x_i$ and $x_j$ are the ith and jth attributes, then

$$s_{ij} = \operatorname{covariance}(x_i, x_j). \qquad (3.10)$$

In turn, $\operatorname{covariance}(x_i, x_j)$ is given by

$$\operatorname{covariance}(x_i, x_j) = \frac{1}{m - 1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad (3.11)$$

where $x_{ki}$ and $x_{kj}$ are the values of the ith and jth attributes for the kth object. Notice that $\operatorname{covariance}(x_i, x_i) = \operatorname{variance}(x_i)$. Thus, the covariance matrix has the variances of the attributes along its diagonal.

Because covariance values depend on the scales of the attributes, it is common to consider instead the correlation matrix, whose ijth entry is the correlation of the ith and jth attributes. Its diagonal entries are $\operatorname{correlation}(x_i, x_i) = 1$, while the other entries are between −1 and 1. It is also useful to consider correlation matrices that contain the pairwise correlations of objects instead of attributes.
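Equations (3.9) through (3.11) correspond directly to standard NumPy operations, as this short sketch on random stand-in data shows.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(150, 4))       # m = 150 objects, n = 4 attributes

mean_vector = data.mean(axis=0)        # Eq. (3.9): per-attribute means

# Eq. (3.11), written out: S has shape (n, n), with divisor m - 1
centered = data - mean_vector
S = centered.T @ centered / (data.shape[0] - 1)

# np.cov computes the same matrix (rowvar=False treats columns as attributes)
assert np.allclose(S, np.cov(data, rowvar=False))

R = np.corrcoef(data, rowvar=False)    # correlation matrix: diagonal is 1
print(np.round(R, 2))
```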
3.2.6 Other Ways to Summarize the Data

There are, of course, other types of summary statistics. For instance, the skewness of a set of values measures the degree to which the values are symmetrically distributed around the mean. There are also other characteristics of the data that are not easy to measure quantitatively, such as whether the distribution of values is multimodal; i.e., the data has multiple "bumps" where most of the values are concentrated. In many cases, however, the most effective approach to understanding the more complicated or subtle aspects of how the values of an attribute are distributed, is to view the values graphically in the form of a histogram. (Histograms are discussed in the next section.)
3.3 Visualization

Data visualization is the display of information in a graphic or tabular format. Successful visualization requires that the data (information) be converted into a visual format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. The goal of visualization is the interpretation of the visualized information by a person and the formation of a mental model of the information.

In everyday life, visual techniques such as graphs and tables are often the preferred approach used to explain the weather, the economy, and the results of political elections. Likewise, while algorithmic or mathematical approaches are often emphasized in most technical disciplines (data mining included), visual techniques can play a key role in data analysis. In fact, sometimes the use of visualization techniques in data mining is referred to as visual data mining.
3.3.1 Motivations for Visualization

The overriding motivation for using visualization is that people can quickly absorb large amounts of visual information and find patterns in it. Consider Figure 3.2, which shows the Sea Surface Temperature (SST) in degrees Celsius for July, 1982. This picture summarizes the information from approximately 250,000 numbers and is readily interpreted in a few seconds. For example, it is easy to see that the ocean temperature is highest at the equator and lowest at the poles.

Figure 3.2. Sea Surface Temperature (SST) for July, 1982.
Another general motivation for visualization is to make use of the domain knowledge that is locked up in people's heads. While the use of domain knowledge is an important task in data mining, it is often difficult or impossible to fully utilize such knowledge in statistical or algorithmic tools. In some cases, an analysis can be performed using non-visual tools, and then the results presented visually for evaluation by the domain expert. In other cases, having a domain specialist examine visualizations of the data may be the best way of finding patterns of interest since, by using domain knowledge, a person can often quickly eliminate many uninteresting patterns and direct the focus to the patterns that are important.
3.3.2 General Concepts

This section explores some of the general concepts related to visualization, in particular, general approaches for visualizing the data and its attributes. A number of visualization techniques are mentioned briefly and will be described in more detail when we discuss specific approaches later on. We assume that the reader is familiar with line graphs, bar charts, and scatter plots.
Representation: Mapping Data to Graphical Elements

The first step in visualization is the mapping of information to a visual format; i.e., mapping the objects, attributes, and relationships in a set of information to visual objects, attributes, and relationships. That is, data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

Objects are usually represented in one of three ways. First, if only a single categorical attribute of the object is being considered, then objects are often lumped into categories based on the value of that attribute, and these categories are displayed as an entry in a table or an area on a screen. (Examples shown later in this chapter are a cross-tabulation table and a bar chart.) Second, if an object has multiple attributes, then the object can be displayed as a row (or column) of a table or as a line on a graph. Finally, an object is often interpreted as a point in two- or three-dimensional space, where graphically, the point might be represented by a geometric figure, such as a circle, cross, or box.

For attributes, the representation depends on the type of attribute, i.e., nominal, ordinal, or continuous (interval or ratio). Ordinal and continuous attributes can be mapped to continuous, ordered graphical features such as location along the x, y, or z axes; intensity; color; or size (diameter, width, height, etc.). For categorical attributes, each category can be mapped to a distinct position, color, shape, orientation, embellishment, or column in a table. However, for nominal attributes, whose values are unordered, care should be taken when using graphical features, such as color and position, that have an inherent ordering associated with their values. In other words, the graphical elements used to represent the nominal values often have an order, but nominal values do not.
The representation of relationships via graphical elements occurs either explicitly or implicitly. For graph data, the standard graph representation (a set of nodes with links between the nodes) is normally used. If the nodes (data objects) or links (relationships) have attributes or characteristics of their own, then this is represented graphically. To illustrate, if the nodes are cities and the links are highways, then the diameter of the nodes might represent population, while the width of the links might represent the volume of traffic.

In most cases, though, mapping objects and attributes to graphical elements implicitly maps the relationships in the data to relationships among graphical elements. To illustrate, if the data object represents a physical object that has a location, such as a city, then the relative positions of the graphical objects corresponding to the data objects tend to naturally preserve the actual relative positions of the objects. Likewise, if there are two or three continuous attributes that are taken as the coordinates of the data points, then the resulting plot often gives considerable insight into the relationships of the attributes and the data points because data points that are visually close to each other have similar values for their attributes.
In general, it is difficult to ensure that a mapping of objects and attributes will result in the relationships being mapped to easily observed relationships among graphical elements. Indeed, this is one of the most challenging aspects of visualization. In any given set of data, there are many implicit relationships, and hence, a key challenge of visualization is to choose a technique that makes the relationships of interest easily observable.
Arrangement

As discussed earlier, the proper choice of visual representation of objects and attributes is essential for good visualization. The arrangement of items within the visual display is also crucial. We illustrate this with two examples.

Example 3.5. This example illustrates the importance of rearranging a table of data. In Table 3.5, which shows nine objects with six binary attributes, there is no clear relationship between objects and attributes, at least at first glance. If the rows and columns of this table are permuted, however, as shown in Table 3.6, then it is clear that there are really only two types of objects in the table: one that has all ones for the first three attributes and one that has only ones for the last three attributes.
Table 3.5. A table of nine objects (rows) with six binary attributes (columns).

      1  2  3  4  5  6
  1   0  1  0  1  1  0
  2   1  0  1  0  0  1
  3   0  1  0  1  1  0
  4   1  0  1  0  0  1
  5   0  1  0  1  1  0
  6   1  0  1  0  0  1
  7   0  1  0  1  1  0
  8   1  0  1  0  0  1
  9   0  1  0  1  1  0

Table 3.6. A table of nine objects (rows) with six binary attributes (columns) permuted so that the relationships of the rows and columns are clear.

      6  1  3  2  5  4
  4   1  1  1  0  0  0
  2   1  1  1  0  0  0
  6   1  1  1  0  0  0
  8   1  1  1  0  0  0
  5   0  0  0  1  1  1
  3   0  0  0  1  1  1
  9   0  0  0  1  1  1
  1   0  0  0  1  1  1
  7   0  0  0  1  1  1
Example 3.6. Consider Figure 3.3(a), which shows a visualization of a graph. If the connected components of the graph are separated, as in Figure 3.3(b), then the relationships between nodes and graphs become much simpler to understand.

Figure 3.3. Two visualizations of a graph: (a) original view of a graph; (b) uncoupled view of connected components of the graph.
Selection

Another key concept in visualization is selection, which is the elimination or the de-emphasis of certain objects and attributes. Specifically, while data objects that only have a few dimensions can often be mapped to a two- or three-dimensional graphical representation in a straightforward way, there is no completely satisfactory and general approach to represent data with many attributes. Likewise, if there are many data objects, then visualizing all the objects can result in a display that is too crowded. If there are many attributes and many objects, then the situation is even more challenging.

The most common approach to handling many attributes is to choose a subset of attributes (usually two) for display. If the dimensionality is not too high, a matrix of bivariate (two-attribute) plots can be constructed for simultaneous viewing. (Figure 3.16 shows a matrix of scatter plots for the pairs of attributes of the Iris data set.) Alternatively, a visualization program can automatically show a series of two-dimensional plots, in which the sequence is user-directed or based on some predefined strategy. The hope is that visualizing a collection of two-dimensional plots will provide a more complete view of the data.
The technique of selecting a pair (or small number) of attributes is a type of dimensionality reduction, and there are many more sophisticated dimensionality reduction techniques that can be employed, e.g., principal components analysis (PCA). Consult Appendices A (Linear Algebra) and B (Dimensionality Reduction) for more information.

When the number of data points is high, e.g., more than a few hundred, or if the range of the data is large, it is difficult to display enough information about each object. Some data points can obscure other data points, or a data object may not occupy enough pixels to allow its features to be clearly displayed. For example, the shape of an object cannot be used to encode a characteristic of that object if there is only one pixel available to display it. In these situations, it is useful to be able to eliminate some of the objects, either by zooming in on a particular region of the data or by taking a sample of the data points.
3.3.3 Techniques

Visualization techniques are often specialized to the type of data being analyzed. Indeed, new visualization techniques and approaches, as well as specialized variations of existing approaches, are being continuously created, typically in response to new kinds of data and visualization tasks.

Despite this specialization and the ad hoc nature of visualization, there are some generic ways to classify visualization techniques. One such classification is based on the number of attributes involved (1, 2, 3, or many) or whether the data has some special characteristic, such as a hierarchical or graph structure. Visualization methods can also be classified according to the type of attributes involved. Yet another classification is based on the type of application: scientific, statistical, or information visualization. The following discussion will use three categories: visualization of a small number of attributes, visualization of data with spatial and/or temporal attributes, and visualization of data with many attributes.

Most of the visualization techniques discussed here can be found in a wide variety of mathematical and statistical packages, some of which are freely available. There are also a number of data sets that are freely available on the World Wide Web. Readers are encouraged to try these visualization techniques as they proceed through the following sections.
Visualizing Small Numbers of Attributes

This section examines techniques for visualizing data with respect to a small number of attributes. Some of these techniques, such as histograms, give insight into the distribution of the observed values for a single attribute. Other techniques, such as scatter plots, are intended to display the relationships between the values of two attributes.
Stem and Leaf Plots Stem and leaf plots can be used to provide insight into the distribution of one-dimensional integer or continuous data. (We will assume integer data initially, and then explain how stem and leaf plots can be applied to continuous data.) For the simplest type of stem and leaf plot, we split the values into groups, where each group contains those values that are the same except for the last digit. Each group becomes a stem, while the last digits of a group are the leaves. Hence, if the values are two-digit integers, e.g., 35, 36, 42, and 51, then the stems will be the high-order digits, e.g., 3, 4, and 5, while the leaves are the low-order digits, e.g., 1, 2, 5, and 6. By plotting the stems vertically and leaves horizontally, we can provide a visual representation of the distribution of the data.

Example 3.7. The set of integers shown in Figure 3.4 is the sepal length in centimeters (multiplied by 10 to make the values integers) taken from the Iris data set. For convenience, the values have also been sorted.
The stem and leaf plot for this data is shown in Figure 3.5. Each number in Figure 3.4 is first put into one of the vertical groups (4, 5, 6, or 7) according to its tens digit. Its last digit is then placed to the right of the colon. Often, especially if the amount of data is larger, it is desirable to split the stems. For example, instead of placing all values whose tens digit is 4 in the same bucket, the stem 4 is repeated twice; all values 40–44 are put in the bucket corresponding to the first stem and all values 45–49 are put in the bucket corresponding to the second stem. This approach is shown in the stem and leaf plot of Figure 3.6. Other variations are also possible.
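A split-stem plot like Figure 3.6 takes only a few lines of Python; the values below are a small subset of the sepal lengths, chosen for brevity.

```python
from collections import defaultdict

values = [43, 44, 44, 45, 50, 51, 51, 57, 58, 60, 63, 67, 70, 72, 77]

# Split stems: one bucket for leaves 0-4, another for leaves 5-9.
stems = defaultdict(list)
for v in sorted(values):
    stems[(v // 10, v % 10 >= 5)].append(v % 10)

for stem, upper in sorted(stems):
    leaves = "".join(str(d) for d in stems[(stem, upper)])
    print(f"{stem} : {leaves}")
```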
Histograms Stem and leaf plots are a type of histogram, a plot that displays the distribution of values for attributes by dividing the possible values into bins and showing the number of objects that fall into each bin. For categorical data, each value is a bin. If this results in too many values, then values are combined in some way. For continuous attributes, the range of values is divided into bins (typically, but not necessarily, of equal width) and the values in each bin are counted.
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48
49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51
51 51 52 52 52 52 53
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56
56 56 56 57 57 57 57
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60
60 60 60 60 61 61 61
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63
64 64 64 64 64 64 64
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68
68 68 69 69 69 69 70
71 72 72 72 73 74 76 77 77 77 77 79
Figure 3.4. Sepal length data from the Iris data set.
4 : 34444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779
Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.
4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779
Figure 3.6. Stem and leaf plot for the sepal length from the Iris data set when buckets corresponding to digits are split.
Once the counts are available for each bin, a bar plot is constructed such that each bin is represented by one bar and the area of each bar is proportional to the number of values (objects) that fall into the corresponding range. If all intervals are of equal width, then all bars are the same width and the height of a bar is proportional to the number of values in the corresponding bin.
Example 3.8. Figure 3.7 shows histograms (with 10 bins) for sepal length, sepal width, petal length, and petal width. Since the shape of a histogram can depend on the number of bins, histograms for the same data, but with 20 bins, are shown in Figure 3.8.
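The binning that underlies Figures 3.7 and 3.8 is easy to reproduce. Here is a sketch using NumPy on synthetic sepal lengths (a stand-in assumption; the real figures use the Iris data itself).

```python
import numpy as np

rng = np.random.default_rng(7)
sepal_length = rng.normal(5.8, 0.8, size=150)   # stand-in for the Iris column

for bins in (10, 20):
    counts, edges = np.histogram(sepal_length, bins=bins)
    print(f"{bins} bins:")
    for c, lo, hi in zip(counts, edges, edges[1:]):
        print(f"  [{lo:4.2f}, {hi:4.2f}): {'*' * c}")
```

Running it with 10 and then 20 bins shows how the apparent shape of the distribution changes with the bin count, which is the point of Example 3.8.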
There are variations of the histogram plot. A relative (frequency) histogram replaces the count by the relative frequency. However, this is just a change in scale of the y-axis, and the shape of the histogram does not change.
Figure 3.7. Histograms of four Iris attributes (10 bins): (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.

Figure 3.8. Histograms of four Iris attributes (20 bins): (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.
Another common variation, especially for unordered categorical data, is the Pareto histogram, which is the same as a normal histogram except that the categories are sorted by count so that the count is decreasing from left to right.
Two-Dimensional Histograms Two-dimensional histograms are also possible. Each attribute is divided into intervals and the two sets of intervals define two-dimensional rectangles of values.

Example 3.9. Figure 3.9 shows a two-dimensional histogram of petal length and petal width. Because each attribute is split into three bins, there are nine rectangular two-dimensional bins. The height of each rectangular bar indicates the number of objects (flowers in this case) that fall into each bin. Most of the flowers fall into only three of the bins, those along the diagonal. It is not possible to see this by looking at the one-dimensional distributions.
Figure 3.9. Two-dimensional histogram of petal length and width in the Iris data set.
While two-dimensional histograms can be used to discover interesting facts about how the values of two attributes co-occur, they are visually more complicated. For instance, it is easy to imagine a situation in which some of the columns are hidden by others.
Box Plots Box plots are another method for showing the distribution of the values of a single numerical attribute. Figure 3.10 shows a labeled box plot for sepal length. The lower and upper ends of the box indicate the 25th and 75th percentiles, respectively, while the line inside the box indicates the value of the 50th percentile. The top and bottom lines of the tails indicate the 10th and 90th percentiles. Outliers are shown by "+" marks. Box plots are relatively compact, and thus, many of them can be shown on the same plot. Simplified versions of the box plot, which take less space, can also be used.
Example 3.10. The box plots for the first four attributes of the Iris data set are shown in Figure 3.11. Box plots can also be used to compare how attributes vary between different classes of objects, as shown in Figure 3.12.
Pie Chart A pie chart is similar to a histogram, but is typically used with categorical attributes that have a relatively small number of values. Instead of showing the relative frequency of different values with the area or height of a bar, as in a histogram, a pie chart uses the relative area of a circle to indicate relative frequency.
Figure 3.10. Description of box plot for sepal length, with the outlier and the 10th, 25th, 50th, 75th, and 90th percentiles labeled.

Figure 3.11. Box plots for the first four attributes of the Iris data set (values in centimeters).
Figure 3.12. Box plots of attributes by Iris species: (a) Setosa, (b) Versicolour, (c) Virginica.
Although pie charts are common in popular articles, they are used less frequently in technical publications because the size of relative areas can be hard to judge. Histograms are preferred for technical work.
Example 3.11. Figure 3.13 displays a pie chart that shows the distribution of Iris species in the Iris data set. In this case, all three flower types have the same frequency.
Percentile Plots and Empirical Cumulative Distribution Functions

A type of diagram that shows the distribution of the data more quantitatively is the plot of an empirical cumulative distribution function. While this type of plot may sound complicated, the concept is straightforward. For each value of a statistical distribution, a cumulative distribution function (CDF) shows the probability that a point is less than that value. For each observed value, an empirical cumulative distribution function (ECDF) shows the fraction of points that are less than this value. Since the number of points is finite, the empirical cumulative distribution function is a step function.

Figure 3.13. Distribution of the types of Iris flowers.
Example 3.12. Figure 3.14 shows the ECDFs of the Iris attributes. The percentiles of an attribute provide similar information. Figure 3.15 shows the percentile plots of the four continuous attributes of the Iris data set from Table 3.2. The reader should compare these figures with the histograms given in Figures 3.7 and 3.8.
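An ECDF is a one-liner once the data is sorted; this sketch evaluates the step function on an arbitrary made-up sample.

```python
import numpy as np

x = np.sort(np.array([5.1, 4.9, 6.2, 5.8, 5.0, 6.7, 5.5]))  # arbitrary sample

def ecdf(v):
    """Fraction of observed points that are less than v (a step function)."""
    return np.searchsorted(x, v, side="left") / len(x)

for v in (4.0, 5.5, 7.0):
    print(f"F({v}) = {ecdf(v):.2f}")   # 0.00, 0.43, 1.00
```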
Scatter Plots Most people are familiar with scatter plots to some extent, and they were used in Section 2.4.5 to illustrate linear correlation. Each data object is plotted as a point in the plane using the values of the two attributes as x and y coordinates. It is assumed that the attributes are either integer- or real-valued.
Example 3.13. Figure 3.16 shows a scatter plot for each pair of attributes of the Iris data set. The different species of Iris are indicated by different markers. The arrangement of the scatter plots of pairs of attributes in this type of tabular format, which is known as a scatter plot matrix, provides an organized way to examine a number of scatter plots simultaneously.
Figure 3.14. Empirical CDFs of four Iris attributes: (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.

Figure 3.15. Percentile plots for sepal length, sepal width, petal length, and petal width.
Figure 3.16. Matrix of scatter plots for the Iris data set.
There are two main uses for scatter plots. First, they graphically show the relationship between two attributes. In Section 2.4.5, we saw how scatter plots could be used to judge the degree of linear correlation. (See Figure 2.17.) Scatter plots can also be used to detect non-linear relationships, either directly or by using a scatter plot of the transformed attributes.

Second, when class labels are available, they can be used to investigate the degree to which two attributes separate the classes. If it is possible to draw a line (or a more complicated curve) that divides the plane defined by the two attributes into separate regions that contain mostly objects of one class, then it is possible to construct an accurate classifier based on the specified pair of attributes. If not, then more attributes or more sophisticated methods are needed to build a classifier. In Figure 3.16, many of the pairs of attributes (for example, petal width and petal length) provide a moderate separation of the Iris species.
Example 3.14. There are two separate approaches for displaying three attributes of a data set with a scatter plot. First, each object can be displayed according to the values of three, instead of two, attributes. Figure 3.17 shows a three-dimensional scatter plot for three attributes in the Iris data set. Second, one of the attributes can be associated with some characteristic of the marker, such as its size, color, or shape. Figure 3.18 shows a plot of three attributes of the Iris data set, where one of the attributes, sepal width, is mapped to the size of the marker.
Extending Two- and Three-Dimensional Plots As illustrated by Figure 3.18, two- or three-dimensional plots can be extended to represent a few additional attributes. For example, scatter plots can display up to three additional attributes using color or shading, size, and shape, allowing five or six dimensions to be represented. There is a need for caution, however. As the complexity of a visual representation of the data increases, it becomes harder for the intended audience to interpret the information. There is no benefit in packing six dimensions' worth of information into a two- or three-dimensional plot, if doing so makes it impossible to understand.
Visualizing Spatio-temporal Data

Data often has spatial or temporal attributes. For instance, the data may consist of a set of observations on a spatial grid, such as observations of pressure on the surface of the Earth or the modeled temperature at various grid points in the simulation of a physical object. These observations can also be made at various points in time. In addition, data may have only a temporal component, such as time series data that gives the daily prices of stocks.
Figure 3.17. Three-dimensional scatter plot of sepal width, sepal length, and petal width.

Figure 3.18. Scatter plot of petal length versus petal width, with the size of the marker indicating sepal width.

Figure 3.19. Contour plot of SST for December 1998.
Contour Plots For some three-dimensional data, two attributes specify a position in a plane, while the third has a continuous value, such as temperature or elevation. A useful visualization for such data is a contour plot, which breaks the plane into separate regions where the values of the third attribute (temperature, elevation) are roughly the same. A common example of a contour plot is a contour map that shows the elevation of land locations.
Example 3.15. Figure 3.19 shows a contour plot of the average sea surface temperature (SST) for December 1998. The land is arbitrarily set to have a temperature of 0°C, while over the ocean the SST changes in a relatively smooth manner.
Example 3.16. Figure 3.20 shows a surface plot of the density around a set of 12 points. This example is further discussed in Section 9.3.3.

Vector Field Plots In some data, a characteristic may have both a magnitude and a direction associated with it. For example, consider the flow of a substance or the change of density with location. In these situations, it can be useful to have a plot that displays both direction and magnitude. This type of plot is known as a vector plot.

Example 3.17. Figure 3.21 shows a contour plot of the density of the two smaller density peaks from Figure 3.20(b), annotated with the density gradient vectors.
Figure 3.21. Vector plot of the gradient (change) in density for the bottom two density peaks of Figure 3.20.

Lower-Dimensional Slices Consider a spatio-temporal data set that records some quantity, such as temperature or pressure, at various locations over time. Such a data set has four dimensions and cannot be easily displayed by the types of plots that we have described so far. However, separate slices of the data can be displayed by showing a set of plots, one for each month. By examining the change in a particular area from one month to another, it is possible to notice changes that occur, including those that may be due to seasonal factors.
Example 3.18. The underlying data set for this example consists of the average monthly sea level pressure (SLP) from 1982 to 1999 on a 2.5° by 2.5° latitude-longitude grid. The twelve monthly plots of pressure for one year are shown in Figure 3.22. In this example, we are interested in slices for a particular month in the year 1982. More generally, we can consider slices of the data along any arbitrary dimension.
Animation Another approach to dealing with slices of data, whether or not time is involved, is to employ animation. The idea is to display successive two-dimensional slices of the data. The human visual system is well suited to detecting visual changes and can often notice changes that might be difficult to detect in another manner. Despite the visual appeal of animation, a set of still plots, such as those of Figure 3.22, can be more useful since this type of visualization allows the information to be studied in arbitrary order and for arbitrary amounts of time.
Figure 3.22. Monthly plots of sea level pressure for one year (panels labeled by month: January, February, March, April, May, June, ...).
Figure 3.23. Plot of the standardized Iris data matrix, with rows ordered by species (Setosa, Versicolour, Virginica).

Figure 3.24. Plot of the Iris correlation matrix.
There are some important practical considerations when visualizing a data matrix. If class labels are known, then it is useful to reorder the data matrix so that all objects of a class are together. This makes it easier, for example, to detect if all objects in a class have similar attribute values for some attributes. If different attributes have different ranges, then the attributes are often standardized to have a mean of zero and a standard deviation of 1. This prevents the attribute with the largest magnitude values from visually dominating the plot.
Example 3.19. Figure 3.23 shows the standardized data matrix for the Iris data set. The first 50 rows represent Iris flowers of the species Setosa, the next 50 Versicolour, and the last 50 Virginica. The Setosa flowers have petal width and length well below the average, while the Versicolour flowers have petal width and length around average. The Virginica flowers have petal width and length above average.

It can also be useful to look for structure in the plot of a proximity matrix for a set of data objects. Again, it is useful to sort the rows and columns of the similarity matrix (when class labels are known) so that all the objects of a class are together. This allows a visual evaluation of the cohesiveness of each class and its separation from other classes.
Example 3.20. Figure 3.24 shows the correlation matrix for the Iris data set. Again, the rows and columns are organized so that all the flowers of a particular species are together. The flowers in each group are most similar to each other, but Versicolour and Virginica are more similar to one another than to Setosa.
If class labels are not known, various techniques (matrix reordering and seriation) can be used to rearrange the rows and columns of the similarity matrix so that groups of highly similar objects and attributes are together and can be visually identified. Effectively, this is a simple kind of clustering. See Section 8.5.3 for a discussion of how a proximity matrix can be used to investigate the cluster structure of data.
Parallel Coordinates Parallel coordinates have one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular, as is traditional. Furthermore, an object is represented as a line instead of as a point. Specifically, the value of each attribute of an object is mapped to a point on the coordinate axis associated with that attribute, and these points are then connected to form the line that represents the object. It might be feared that this would yield quite a mess. However, in many cases, objects tend to fall into a small number of groups, where the points in each group have similar values for their attributes. If so, and if the number of data objects is not too large, then the resulting parallel coordinates plot can reveal interesting patterns.
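A basic parallel coordinates plot might look like the following sketch. It uses matplotlib for plotting and scikit-learn as a convenient source of the Iris data; both libraries are assumptions here, and any plotting package with a copy of the data would do.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target           # 150 x 4 matrix, 3 classes
styles = ["-", ":", "--"]               # one line style per species

fig, ax = plt.subplots()
for row, label in zip(X, y):
    # each object becomes a line across the four parallel axes
    ax.plot(range(X.shape[1]), row, styles[label], alpha=0.4)
ax.set_xticks(range(X.shape[1]))
ax.set_xticklabels(iris.feature_names, rotation=20)
ax.set_ylabel("centimeters")
plt.tight_layout()
plt.show()
```

Reordering the columns of X before plotting is the experiment suggested by Figures 3.25 and 3.26.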
Example 3.21. Figure 3.25 shows a parallel coordinates plot of the four numerical attributes of the Iris data set. The lines representing objects of different classes are distinguished by their shading and the use of three different line styles: solid, dotted, and dashed. The parallel coordinates plot shows that the classes are reasonably well separated for petal width and petal length, but less well separated for sepal length and sepal width. Figure 3.26 is another parallel coordinates plot of the same data, but with a different ordering of the axes.

One of the drawbacks of parallel coordinates is that the detection of patterns in such a plot may depend on the order. For instance, if lines cross a lot, the picture can become confusing, and thus, it can be desirable to order the attributes so that the groups of interest are emphasized, as in Figure 3.26.
Figure 3.25. A parallel coordinates plot of the four Iris attributes.

Figure 3.26. A parallel coordinates plot of the four Iris attributes with the attributes reordered to emphasize similarities and dissimilarities of groups.
Specifically, each attribute of an object is mapped to a particular feature of a glyph, so that the value of the attribute determines the exact nature of the feature. Thus, at a glance, we can distinguish how two objects differ.
Star coordinates are one example of this approach. This technique uses one axis for each attribute. These axes all radiate from a center point, like the spokes of a wheel, and are evenly spaced. Typically, all the attribute values are mapped to the range [0, 1].

An object is mapped onto this star-shaped set of axes using the following process: Each attribute value of the object is converted to a fraction that represents its distance between the minimum and maximum values of the attribute. This fraction is mapped to a point on the axis corresponding to this attribute. Each point is connected with a line segment to the point on the axis preceding or following its own axis; this forms a polygon. The size and shape of this polygon gives a visual description of the attribute values of the object. For ease of interpretation, a separate set of axes is used for each object. In other words, each object is mapped to a polygon. An example of a star coordinates plot of flower 150 is given in Figure 3.27(a).
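The mapping from an object to its star-coordinates polygon is just min-max scaling followed by a polar-to-Cartesian conversion; here is a minimal sketch, assuming evenly spaced axes and illustrative (not actual Iris) minima and maxima.

```python
import numpy as np

def star_polygon(obj, mins, maxs):
    """Return the (x, y) vertices of the star-coordinates polygon."""
    frac = (np.asarray(obj) - mins) / (maxs - mins)      # scale to [0, 1]
    angles = 2 * np.pi * np.arange(len(obj)) / len(obj)  # evenly spaced axes
    return np.column_stack([frac * np.cos(angles), frac * np.sin(angles)])

# Attribute values of one flower, with per-attribute minima and maxima
# (the numbers are illustrative assumptions):
obj  = np.array([5.9, 3.0, 5.1, 1.8])
mins = np.array([4.3, 2.0, 1.0, 0.1])
maxs = np.array([7.9, 4.4, 6.9, 2.5])
print(star_polygon(obj, mins, maxs).round(2))
```

Connecting the printed vertices in order (and closing the loop) produces the polygon for that object.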
It is also possible to map the values of features to those of more familiar objects, such as faces. This technique is named Chernoff faces for its creator, Herman Chernoff. In this technique, each attribute is associated with a specific feature of a face, and the attribute value is used to determine the way that the facial feature is expressed. Thus, the shape of the face may become more elongated as the value of the corresponding data feature increases. An example of a Chernoff face for flower 150 is given in Figure 3.27(b).

The program that we used to make this face mapped the features to the four features listed below. Other features of the face, such as width between the eyes and length of the mouth, are given default values.

Data Feature     Facial Feature
sepal length     size of face
sepal width      forehead/jaw relative arc length
petal length     shape of forehead
petal width      shape of jaw
Example 3.22. A more extensive illustration of these two approaches to viewing multidimensional data is provided by Figures 3.28 and 3.29, which show the star and face plots, respectively, of 15 flowers from the Iris data set. The first 5 flowers are of species Setosa, the second 5 are Versicolour, and the last 5 are Virginica.
Figure 3.27. Star coordinates graph and Chernoff face of the 150th flower of the Iris data set: (a) star graph of Iris 150; (b) Chernoff face of Iris 150.
Figure 3.28. Plot of 15 Iris flowers (numbers 1–5, 51–55, and 101–105) using star coordinates.

Figure 3.29. A plot of 15 Iris flowers (numbers 1–5, 51–55, and 101–105) using Chernoff faces.
Despite the visual appeal of these sorts of diagrams, they do not scale well, and thus, they are of limited use for many data mining problems. Nonetheless, they may still be of use as a means to quickly compare small sets of objects that have been selected by other techniques.
OLAP systems support the interactive analysis of data and typically provide extensive capabilities for visualizing the data and generating summary statistics. For these reasons, our approach to multidimensional data analysis will be based on the terminology and concepts common to OLAP systems.
3.4.1 Representing Iris Data as a Multidimensional Array

Most data sets can be represented as a table, where each row is an object and each column is an attribute. In many cases, it is also possible to view the data as a multidimensional array. We illustrate this approach by representing the Iris data set as a multidimensional array.

Table 3.7 was created by discretizing the petal length and petal width attributes to have values of low, medium, and high and then counting the number of flowers from the Iris data set that have particular combinations of petal width, petal length, and species type. (For petal width, the categories low, medium, and high correspond to the intervals [0, 0.75), [0.75, 1.75), [1.75, ∞), respectively. For petal length, the categories low, medium, and high correspond to the intervals [0, 2.5), [2.5, 5), [5, ∞), respectively.)
Table 3.7. Number of flowers having a particular combination of petal width, petal length, and species type.

Petal Length  Petal Width  Species Type  Count
low           low          Setosa           46
low           medium       Setosa            2
medium        low          Setosa            2
medium        medium       Versicolour      43
medium        high         Versicolour       3
medium        high         Virginica         3
high          medium       Versicolour       2
high          medium       Virginica         3
high          high         Versicolour       2
high          high         Virginica        44
Figure 3.30. A multidimensional data representation for the Iris data set.
Table 3.8. Cross-tabulation of flowers according to petal length and width for flowers of the Setosa species.

              Width
Length    low  medium  high
low        46       2     0
medium      2       0     0
high        0       0     0

Table 3.9. Cross-tabulation of flowers according to petal length and width for flowers of the Versicolour species.

              Width
Length    low  medium  high
low         0       0     0
medium      0      43     3
high        0       2     2

Table 3.10. Cross-tabulation of flowers according to petal length and width for flowers of the Virginica species.

              Width
Length    low  medium  high
low         0       0     0
medium      0       0     3
high        0       3    44
Empty combinations (those combinations that do not correspond to at least one flower) are not shown.

The data can be organized as a multidimensional array with three dimensions corresponding to petal width, petal length, and species type, as illustrated in Figure 3.30. For clarity, slices of this array are shown as a set of three two-dimensional tables, one for each species; see Tables 3.8, 3.9, and 3.10. The information contained in both Table 3.7 and Figure 3.30 is the same. However, in the multidimensional representation shown in Figure 3.30 (and Tables 3.8, 3.9, and 3.10), the values of the attributes (petal width, petal length, and species type) are array indices.

What is important are the insights that can be gained by looking at data from a multidimensional viewpoint. Tables 3.8, 3.9, and 3.10 show that each species of Iris is characterized by a different combination of values of petal length and width. Setosa flowers have low width and length, Versicolour flowers have medium width and length, and Virginica flowers have high width and length.
3.4.2 Multidimensional Data: The General Case

The previous section gave a specific example of using a multidimensional approach to represent and analyze a familiar data set. Here we describe the general approach in more detail.

The starting point is usually a tabular representation of the data, such as that of Table 3.7, which is called a fact table. Two steps are necessary in order to represent data as a multidimensional array: identification of the dimensions and identification of an attribute that is the focus of the analysis. The dimensions are categorical attributes or, as in the previous example, continuous attributes that have been converted to categorical attributes. The values of an attribute serve as indices into the array for the dimension corresponding to the attribute, and the number of attribute values is the size of that dimension. In the previous example, each attribute had three possible values, and thus, each dimension was of size three and could be indexed by three values. This produced a 3 × 3 × 3 multidimensional array.
Each combination of attribute values (one value for each different attribute) defines a cell of the multidimensional array. To illustrate using the previous example, if petal length = low, petal width = medium, and species = Setosa, a specific cell containing the value 2 is identified. That is, there are only two flowers in the data set that have the specified attribute values. Notice that each row (object) of the data set in Table 3.7 corresponds to a cell in the multidimensional array.
The contents of each cell represent the value of a target quantity (target variable or attribute) that we are interested in analyzing. In the Iris example, the target quantity is the number of flowers whose petal width and length fall within certain limits. The target attribute is quantitative because a key goal of multidimensional data analysis is to look at aggregate quantities, such as totals or averages.
The following summarizes the procedure for creating a multidimensional data representation from a data set represented in tabular form. First, identify the categorical attributes to be used as the dimensions and a quantitative attribute to be used as the target of the analysis. Each row (object) in the table is mapped to a cell of the multidimensional array. The indices of the cell are specified by the values of the attributes that were selected as dimensions, while the value of the cell is the value of the target attribute. Cells not defined by the data are assumed to have a value of 0.
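As a concrete, purely illustrative rendering of this procedure, the following Python sketch maps a fact table stored as a pandas DataFrame to a dense NumPy array; the function name and its arguments are hypothetical, not taken from any standard library.

import numpy as np
import pandas as pd

def to_multidim_array(fact, dims, target):
    """Map a fact table to a dense array; undefined cells default to 0."""
    # Each dimension's distinct values become the indices along that axis.
    values = [list(pd.unique(fact[d])) for d in dims]
    lookups = [{v: i for i, v in enumerate(vals)} for vals in values]
    arr = np.zeros(tuple(len(vals) for vals in values))
    for _, row in fact.iterrows():
        cell = tuple(lk[row[d]] for lk, d in zip(lookups, dims))
        arr[cell] = row[target]      # each row (object) fills one cell
    return arr, values

For the Iris example, to_multidim_array(fact, ["petal_length_cat", "petal_width_cat", "species"], "count") would return the 3 × 3 × 3 array depicted in Figure 3.30.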
Example 3.23. To further illustrate the ideas just discussed, we present a more traditional example involving the sale of products. The fact table for this example is given by Table 3.11. The dimensions of the multidimensional representation are the product ID, location, and date attributes, while the target attribute is the revenue. Figure 3.31 shows the multidimensional representation of this data set. This larger and more complicated data set will be used to illustrate additional concepts of multidimensional data analysis.
3.4.3 Analyzing Multidimensional Data
In this section, we describe different multidimensional analysis techniques. In particular, we discuss the creation of data cubes, and related operations, such as slicing, dicing, dimensionality reduction, roll-up, and drill-down.
Data Cubes: Computing Aggregate Quantities
A key motivation for taking a multidimensional viewpoint of data is the importance of aggregating data in various ways. In the sales example, we might wish to find the total sales revenue for a specific year and a specific product. Or we might wish to see the yearly sales revenue for each location across all products. Computing aggregate totals involves fixing specific values for some of the attributes that are being used as dimensions and then summing over all possible values for the attributes that make up the remaining dimensions. There are other types of aggregate quantities that are also of interest, but for simplicity, this discussion will use totals (sums).
Table 3.12 shows the result of summing over all locations for various combinations of date and product. For simplicity, assume that all the dates are within one year. If there are 365 days in a year and 1000 products, then Table 3.12 has 365,000 entries (totals), one for each product-date pair. We could also specify the store location and date and sum over products, or specify the location and product and sum over all dates.
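In array terms, each of these aggregations is simply a sum over one axis. A minimal NumPy sketch, assuming (as an illustration) that the sales cube is stored with axis order (product, location, date) and filled with made-up values:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sales cube: 1000 products x 50 locations x 365 dates.
sales = rng.integers(0, 1000, size=(1000, 50, 365))

by_product_date = sales.sum(axis=1)      # sum over locations: Table 3.12
by_location_date = sales.sum(axis=0)     # sum over products
by_product_location = sales.sum(axis=2)  # sum over dates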
Table 3.13 shows the marginal totals of Table 3.12. These totals are the result of further summing over either dates or products. In Table 3.13, the total sales revenue due to product 1, which is obtained by summing across row 1 (over all dates), is $370,000. The total sales revenue on January 1, 2004, which is obtained by summing down column 1 (over all products), is $527,362. The total sales revenue, which is obtained by summing over all rows and columns (all times and products), is $227,352,127. All of these totals are for all locations because the entries of Table 3.13 include all locations.
A key point of this example is that there are a number of different totals (aggregates) that can be computed for a multidimensional array, depending on how many attributes we sum over. Assume that there are $n$ dimensions and that the $i$th dimension (attribute) has $s_i$ possible values. There are $n$ different ways to sum only over a single attribute. If we sum over dimension $j$, then we obtain $s_1 \cdots s_{j-1} s_{j+1} \cdots s_n$ totals, one for each possible combination of attribute values of the $n-1$ other attributes (dimensions). The totals that result from summing over one attribute form a multidimensional array of $n-1$ dimensions, and there are $n$ such arrays of totals.
Table 3.11. Sales revenue of products (in dollars) for various locations and times.

Product ID   Location      Date            Revenue
    ...         ...            ...             ...
     1       Minneapolis   Oct. 18, 2004      $250
     1       Chicago       Oct. 18, 2004       $79
    ...         ...            ...             ...
     1       Paris         Oct. 18, 2004      $301
    ...         ...            ...             ...
    27       Minneapolis   Oct. 18, 2004    $2,321
    27       Chicago       Oct. 18, 2004    $3,278
    ...         ...            ...             ...
    27       Paris         Oct. 18, 2004    $1,325
    ...         ...            ...             ...
Figure 3.31. Multidimensional data representation for sales data.
Table 3.12. Totals that result from summing over all locations for a fixed time and product.

                             date
product ID   Jan 1, 2004   Jan 2, 2004   . . .   Dec 31, 2004
     1          $1,001          $987     . . .          $891
    ...            ...           ...      ...            ...
    27         $10,265       $10,225     . . .        $9,325
Table 3.13. Table 3.12 with marginal totals.

                             date
product ID   Jan 1, 2004   Jan 2, 2004   . . .   Dec 31, 2004         total
     1          $1,001          $987     . . .          $891       $370,000
    ...            ...           ...      ...            ...            ...
    27         $10,265       $10,225     . . .        $9,325     $3,800,020
  total       $527,362      $532,953     . . .       $631,221  $227,352,127
In the sales example, there are three sets of totals that result from summing over only one dimension, and each set of totals can be displayed as a two-dimensional table.
If we sum over two dimensions (perhaps starting with one of the arrays of totals obtained by summing over one dimension), then we will obtain a multidimensional array of totals with $n-2$ dimensions. There will be $\binom{n}{2}$ distinct arrays of such totals. In the sales example, there will be $\binom{3}{2} = 3$ arrays of totals that result from summing over location and product, location and time, or product and time. In general, summing over $k$ dimensions yields $\binom{n}{k}$ arrays of totals, each with dimension $n-k$.
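This enumeration is easy to verify programmatically. The following sketch, using a small cube with made-up shape and values, sums a NumPy array over every choice of $k$ axes and confirms that there are $\binom{n}{k}$ such aggregates:

from itertools import combinations
from math import comb
import numpy as np

rng = np.random.default_rng(0)
cube = rng.integers(0, 100, size=(3, 4, 5))   # e.g., product x location x date
dims = ("product", "location", "date")

n = cube.ndim
for k in range(1, n + 1):
    axes_choices = list(combinations(range(n), k))
    assert len(axes_choices) == comb(n, k)    # C(n, k) arrays of totals
    for axes in axes_choices:
        totals = cube.sum(axis=axes)          # an (n - k)-dimensional array
        print("summed over", [dims[i] for i in axes],
              "-> result shape", totals.shape)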
A multidimensional representation of the data, together with all possible totals (aggregates), is known as a data cube. Despite the name, the size of each dimension (the number of attribute values) does not need to be equal. Also, a data cube may have either more or fewer than three dimensions. More importantly, a data cube is a generalization of what is known in statistical terminology as a cross tabulation. If marginal totals were added, Tables 3.8, 3.9, or 3.10 would be typical examples of cross tabulations.
Dimensionality Reduction and Pivoting
The aggregation described in the last section can be viewed as a form of dimensionality reduction. Specifically, the $j$th dimension is eliminated by summing over it. Conceptually, this collapses each column of cells in the $j$th dimension into a single cell. For both the sales and Iris examples, aggregating over one dimension reduces the dimensionality of the data from 3 to 2. If $s_j$ is the number of possible values of the $j$th dimension, the number of cells is reduced by a factor of $s_j$. Exercise 17 on page 143 asks the reader to explore the difference between this type of dimensionality reduction and that of PCA.
Pivoting refers to aggregating over all dimensions except two. The result is a two-dimensional cross tabulation with the two specified dimensions as the only remaining dimensions. Table 3.13 is an example of pivoting on date and product.
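In practice, pivoting is a one-line operation in most data analysis tools. The following sketch uses the pandas pivot_table function with a toy fact table standing in for Table 3.11 (only the four rows shown there, so the printed totals will not match the text):

import pandas as pd

fact = pd.DataFrame({
    "product_id": [1, 1, 27, 27],
    "location":   ["Minneapolis", "Chicago", "Minneapolis", "Chicago"],
    "date":       ["Oct 18, 2004"] * 4,
    "revenue":    [250, 79, 2321, 3278],
})

# Aggregate over every dimension except product ID and date;
# margins=True appends the marginal totals, as in Table 3.13.
pivot = pd.pivot_table(fact, values="revenue", index="product_id",
                       columns="date", aggfunc="sum",
                       margins=True, margins_name="total")
print(pivot)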
Slicing and Dicing
These two colorful names refer to rather straightforward operations. Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Tables 3.8, 3.9, and 3.10 are three slices from the Iris set that were obtained by specifying three separate values for the species dimension. Dicing involves selecting a subset of cells by specifying a range of attribute values. This is equivalent to defining a subarray from the complete array. In practice, both operations can also be accompanied by aggregation over some dimensions.
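Both operations correspond directly to array indexing. A minimal NumPy sketch for the Iris cube, assuming (for illustration) axis order (petal length, petal width, species) with Setosa at species index 0:

import numpy as np

cube = np.zeros((3, 3, 3))       # length x width x species; indices low/medium/high
cube[0, 0, 0] = 46               # e.g., Setosa flowers with low length and width

setosa_slice = cube[:, :, 0]     # slice: fix species = Setosa (Table 3.8)
diced = cube[0:2, 0:2, :]        # dice: restrict length and width to a range
aggregated = diced.sum(axis=2)   # optionally aggregate over species as well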
3.5 Bibliographic Notes
... distribution. Also, there are plots that display whether the observed values are statistically significant in some sense. We have not covered any of these techniques here and refer the reader to the previously mentioned statistical and mathematical packages.
Multidimensional analysis has been around in a variety of forms for some time. One of the original papers was a white paper by Codd [88], the father of relational databases. The data cube was introduced by Gray et al. [91], who described various operations for creating and manipulating data cubes within a relational database framework. A comparison of statistical databases and OLAP is given by Shoshani [100]. Specific information on OLAP can be found in documentation from database vendors and many popular books. Many database textbooks also have general discussions of OLAP, often in the context of data warehousing. For example, see the text by Ramakrishnan and Gehrke [97].
Bibliography
[86] D. A. Burn. Designing Effective Statistical Graphs. In C. R. Rao, editor, Handbook of Statistics 9. Elsevier/North-Holland, Amsterdam, The Netherlands, September 1993.
[87] S. K. Card, J. D. MacKinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco, CA, January 1999.
[88] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. White Paper, E. F. Codd and Associates, 1993.
[89] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco, CA, September 2001.
[90] M. Friendly. Gallery of Data Visualization. http://www.math.yorku.ca/SCS/Gallery/, 2005.
[91] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. J. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.
[92] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[93] Mathematica 5.1. Wolfram Research, Inc. http://www.wolfram.com/, 2005.
[94] MATLAB 7.0. The MathWorks, Inc. http://www.mathworks.com, 2005.
[95] Microsoft Excel 2003. Microsoft, Inc. http://www.microsoft.com/, 2003.
[96] R: A language and environment for statistical computing and graphics. The R Project for Statistical Computing. http://www.r-project.org/, 2005.
[97] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd edition, August 2002.
[98] S-PLUS. Insightful Corporation. http://www.insightful.com, 2005.
[99] SAS: Statistical Analysis System. SAS Institute Inc. http://www.sas.com/, 2005.
[100] A. Shoshani. OLAP and statistical databases: similarities and differences. In Proc. of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 185-196. ACM Press, 1997.
[101] R. Spence. Information Visualization. ACM Press, New York, December 2000.
[102] SPSS: Statistical Package for the Social Sciences. SPSS, Inc. http://www.spss.com/, 2005.
[103] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, March 1986.
[104] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977.
[105] P. Velleman and D. Hoaglin. The ABCs of EDA: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury, 1981.
3.6 Exercises
1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.
2. Identify at least two advantages and two disadvantages of using color to visually represent information.
3. What are the arrangement issues that arise with respect to three-dimensional plots?
4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
5. Describe how you would create visualizations to display information that describes the following types of systems.
(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.
(b) The distribution of specific plant and animal species around the world for a specific moment in time.
(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.
(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person and relation
17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.