Data Mining

1 Introduction
Rapid advances in data collection and storage technology have enabled organizations to accumulate vast amounts of data. However, extracting useful information has proven extremely challenging. Often, traditional data analysis tools and techniques cannot be used because of the massive size of a data set. Sometimes, the non-traditional nature of the data means that traditional approaches cannot be applied even if the data set is relatively small. In other situations, the questions that need to be answered cannot be addressed using existing data analysis techniques, and thus, new methods need to be developed.

Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It has also opened up exciting opportunities for exploring and analyzing new types of data and for analyzing old types of data in new ways. In this introductory chapter, we present an overview of data mining and outline the key topics to be covered in this book. We start with a description of some well-known applications that require new techniques for data analysis.
Business Point-of-sale data collection (bar code scanners, radio frequency identification (RFID), and smart card technology) has allowed retailers to collect up-to-the-minute data about customer purchases at the checkout counters of their stores. Retailers can utilize this information, along with other business-critical data such as Web logs from e-commerce Web sites and customer service records from call centers, to help them better understand the needs of their customers and make more informed business decisions.

Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. They can also help retailers answer important business questions such as "Who are the most profitable customers?" "What products can be cross-sold or up-sold?" and "What is the revenue outlook of the company for next year?" Some of these questions motivated the creation of association analysis (Chapters 6 and 7), a new data analysis technique.
Medicine, Science, and Engineering Researchers in medicine, science, and engineering are rapidly accumulating data that is key to important new discoveries. For example, as an important step toward improving our understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously generate global observations of the land surface, oceans, and atmosphere. However, because of the size and spatio-temporal nature of the data, traditional methods are often not suitable for analyzing these data sets. Techniques developed in data mining can aid Earth scientists in answering questions such as "What is the relationship between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes and global warming?" "How is land surface precipitation and temperature affected by ocean surface temperature?" and "How well can we predict the beginning and end of the growing season for a region?"

As another example, researchers in molecular biology hope to use the large amounts of genomic data currently being gathered to better understand the structure and function of genes. In the past, traditional methods in molecular biology allowed scientists to study only a few genes at a time in a given experiment. Recent breakthroughs in microarray technology have enabled scientists to compare the behavior of thousands of genes under various situations. Such comparisons can help determine the function of each gene and perhaps isolate the genes responsible for certain diseases. However, the noisy and high-dimensional nature of the data requires new types of data analysis. In addition to analyzing gene array data, data mining can also be used to address other important biological challenges such as protein structure prediction, multiple sequence alignment, the modeling of biochemical pathways, and phylogenetics.
1.1 What Is Data Mining?
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of a future observation, such as predicting whether a newly arrived customer will spend more than $100 at a department store.

Not all information discovery tasks are considered to be data mining. For example, looking up individual records using a database management system or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval. Although such tasks are important and may involve the use of sophisticated algorithms and data structures, they rely on traditional computer science techniques and obvious features of the data to create index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques have been used to enhance information retrieval systems.
Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information, as shown in Figure 1.1. This process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.
Figure 1.1. The process of knowledge discovery in databases (KDD): input data flows through data preprocessing (feature selection, dimensionality reduction, normalization, data subsetting) and data mining to postprocessing (filtering patterns, visualization, pattern interpretation), which yields information.
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables) and may reside in a centralized data repository or be distributed across multiple sites. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis. The steps involved in data preprocessing include fusing data from multiple sources, cleaning data to remove noise and duplicate observations, and selecting records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing is perhaps the most laborious and time-consuming step in the overall knowledge discovery process.
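
To make these steps concrete, here is a minimal sketch of such a preprocessing pass, assuming records fused from multiple sources arrive as Python dictionaries; the field names and values are illustrative, not from the book:

```python
# Hypothetical fused records: note the duplicate and the missing value.
raw = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 1, "age": 34, "income": 52000},   # duplicate observation
    {"id": 2, "age": None, "income": 61000}, # missing value
    {"id": 3, "age": 29, "income": 47000},
]

# Cleaning: drop duplicates and records with missing values.
seen, clean = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen and None not in rec.values():
        seen.add(key)
        clean.append(rec)

# Feature selection and normalization: keep one relevant attribute
# and rescale it to [0, 1] for the subsequent mining step.
incomes = [r["income"] for r in clean]
lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]
print(normalized)  # [1.0, 0.0]
```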
"Closing the loop" is the phrase often used to refer to the process of integrating data mining results into decision support systems. For example, in business applications, the insights offered by data mining results can be integrated with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step that ensures that only valid and useful results are incorporated into the decision support system. An example of postprocessing is visualization (see Chapter 3), which allows analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis testing methods can also be applied during postprocessing to eliminate spurious data mining results.
1.2 Motivating Challenges
As mentioned earlier, traditional data analysis techniques have often encountered practical difficulties in meeting the challenges posed by new data sets. The following are some of the specific challenges that motivated the development of data mining.
Scalability Because of advances in data generation and collection, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, then they must be scalable. Many data mining algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when processing data sets that cannot fit into main memory. Scalability can also be improved by using sampling or developing parallel and distributed algorithms.
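
One of the strategies just mentioned, sampling, has a classic memory-bounded form. The following is a standard reservoir-sampling sketch (our illustration, not an algorithm from the book) that keeps a uniform random sample of k records from a stream far too large to fit in main memory:

```python
import random

def reservoir_sample(stream, k):
    """Return k records chosen uniformly at random from an iterable of any size."""
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # evicting a uniformly chosen resident.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir

print(reservoir_sample(range(10**6), 5))  # five uniformly chosen records
```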
High Dimensionality It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. In bioinformatics, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have high dimensionality. For example, consider a data set that contains measurements of temperature at various locations. If the temperature measurements are taken repeatedly for an extended period, the number of dimensions (features) increases in proportion to the number of measurements taken. Traditional data analysis techniques that were developed for low-dimensional data often do not work well for such high-dimensional data. Also, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and Complex Data Traditional data analysis methods often deal with data sets containing attributes of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields has grown, so has the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional types of data include collections of Web pages containing semi-structured text and hyperlinks; DNA data with sequential and three-dimensional structure; and climate data that consists of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.
Data Ownership and Distribution Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include the following: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate the data mining results obtained from multiple sources, and (3) how to address data security issues.
Non-traditional Analysis The traditional statistical approach is based on a hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is designed to gather the data, and then the data is analyzed with respect to the hypothesis. Unfortunately, this process is extremely labor intensive. Current data analysis tasks often require the generation and evaluation of thousands of hypotheses, and consequently, the development of some data mining techniques has been motivated by the desire to automate the process of hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are typically not the result of a carefully designed experiment and often represent opportunistic samples of the data, rather than random samples. Also, the data sets frequently involve non-traditional types of data and data distributions.
1.3 The Origins of Data Mining
Brought together by the goal of meeting the challenges of the previous section, researchers from different disciplines began to focus on developing more efficient and scalable tools that could handle diverse types of data. This work, which culminated in the field of data mining, built upon the methodology and algorithms that researchers had previously used. In particular, data mining draws upon ideas such as (1) sampling, estimation, and hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also been quick to adopt ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.

A number of other areas also play key supporting roles. In particular, database systems are needed to provide support for efficient storage, indexing, and query processing. Techniques from high performance (parallel) computing are often important in addressing the massive size of some data sets. Distributed techniques can also help address the issue of size and are essential when the data cannot be gathered in one location.
Figure 1.2 shows the relationship of data mining to other areas.
Figure 1.2. Data mining as a confluence of many disciplines: statistics; AI, machine learning, and pattern recognition; and database technology, parallel computing, and distributed computing.
1.4 Data Mining Tasks
Data mining tasks are generally divided into two major categories:

Predictive tasks. The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target or dependent variable, while the attributes used for making the prediction are known as the explanatory or independent variables.

Descriptive tasks. Here, the objective is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in data. Descriptive data mining tasks are often exploratory in nature and frequently require postprocessing techniques to validate and explain the results.

Figure 1.3 illustrates four of the core data mining tasks that are described in the remainder of this book.
Figure 1.3. Four of the core data mining tasks (predictive modeling, association analysis, cluster analysis, and anomaly detection), each shown operating on a sample data set of ten borrowers with attributes for home ownership, marital status, annual income, and whether the borrower defaulted.
Predictive modeling refers to the task of building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will make a purchase at an online bookstore is a classification task because the target variable is binary-valued. On the other hand, forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and true values of the target variable. Predictive modeling can be used to identify customers that will respond to a marketing campaign, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
Example 1.1 (Predicting the Type of a Flower). Consider the task of predicting the species of a flower based on its characteristics. In particular, consider classifying an Iris flower as to whether it belongs to one of the following three Iris species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of various flowers of these three species. A data set with this type of information is the well-known Iris data set from the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn. In addition to the species of a flower, this data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (The Iris data set and its attributes are described further in Section 3.1.) Figure 1.4 shows a plot of petal width versus petal length for the 150 flowers in the Iris data set. Petal width is broken into the categories low, medium, and high, which correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. Also, petal length is broken into the categories low, medium, and high, which correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively. Based on these categories of petal width and length, the following rules can be derived:
Petal width low and petal length low implies Setosa.
Petal width medium and petal length medium implies Versicolour.
Petal width high and petal length high implies Virginica.
While these rules do not classify all the flowers, they do a good (but not perfect) job of classifying most of them. Note that flowers from the Setosa species are well separated from the Versicolour and Virginica species with respect to petal width and length, but the latter two species overlap somewhat with respect to these attributes.
Figure 1.4. Petal width versus petal length for 150 Iris flowers.
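
The three rules translate directly into a small amount of code. The following sketch applies them to the Iris data, assuming scikit-learn is available to load the data set; the helper functions are our own illustration, not from the book:

```python
from sklearn.datasets import load_iris

iris = load_iris()                 # species names: setosa, versicolor, virginica
petal_length = iris.data[:, 2]     # cm
petal_width = iris.data[:, 3]      # cm

def categorize(value, low_cut, high_cut):
    """Map a measurement to 'low', 'medium', or 'high'."""
    if value < low_cut:
        return "low"
    elif value < high_cut:
        return "medium"
    return "high"

def classify(pl, pw):
    """Apply the three rules from Example 1.1; return None when no rule fires."""
    w = categorize(pw, 0.75, 1.75)  # petal width: [0,0.75), [0.75,1.75), [1.75,inf)
    l = categorize(pl, 2.5, 5.0)    # petal length: [0,2.5), [2.5,5), [5,inf)
    rules = {("low", "low"): "setosa",
             ("medium", "medium"): "versicolor",
             ("high", "high"): "virginica"}
    return rules.get((w, l))

correct = sum(
    classify(pl, pw) == iris.target_names[t]
    for pl, pw, t in zip(petal_length, petal_width, iris.target)
)
print(f"{correct} of {len(iris.target)} flowers classified correctly")
```

Flowers whose petal width and length fall into different categories match no rule and are left unclassified, which is exactly the imperfection noted above.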
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets. Because of the exponential size of its search space, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes that have related functionality, identifying Web pages that are accessed together, and understanding the relationships between different elements of the Earth's climate system.
Example 1.2 (Market Basket Analysis). The transactions shown in Table 1.1 illustrate point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {Diapers} → {Milk}, which suggests that customers who buy diapers also tend to buy milk. This type of rule can be used to identify potential cross-selling opportunities among related items.
Table 1.1. Market basket data.

Transaction ID   Items
1    {Bread, Butter, Diapers, Milk}
2    {Coffee, Sugar, Cookies, Salmon}
3    {Bread, Butter, Coffee, Diapers, Milk, Eggs}
4    {Bread, Butter, Salmon, Chicken}
5    {Eggs, Bread, Butter}
6    {Salmon, Diapers, Milk}
7    {Bread, Tea, Sugar, Eggs}
8    {Coffee, Sugar, Chicken, Eggs}
9    {Bread, Diapers, Milk, Salt}
10   {Tea, Eggs, Cookies, Diapers, Milk}
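
As an illustration of how such a rule is quantified, the minimal sketch below computes the support and confidence of {Diapers} → {Milk} directly from the transactions of Table 1.1 (these measures are defined formally in Chapter 6):

```python
transactions = [
    {"Bread", "Butter", "Diapers", "Milk"},
    {"Coffee", "Sugar", "Cookies", "Salmon"},
    {"Bread", "Butter", "Coffee", "Diapers", "Milk", "Eggs"},
    {"Bread", "Butter", "Salmon", "Chicken"},
    {"Eggs", "Bread", "Butter"},
    {"Salmon", "Diapers", "Milk"},
    {"Bread", "Tea", "Sugar", "Eggs"},
    {"Coffee", "Sugar", "Chicken", "Eggs"},
    {"Bread", "Diapers", "Milk", "Salt"},
    {"Tea", "Eggs", "Cookies", "Diapers", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Diapers"}, {"Milk"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
# Diapers appear in 5 of the 10 transactions, always together with Milk:
# support = 0.50, confidence = 1.00
```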
Cluster analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document Clustering). The collection of news articles shown in Table 1.2 can be grouped based on their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. There are two natural clusters in the data set. The first cluster consists of the first four articles, which correspond to news about the economy, while the second cluster contains the last four articles, which correspond to news about health care. A good clustering algorithm should be able to identify these two clusters based on the similarity between words that appear in the articles.
Table 1.2. Collection of news articles.

Article  Words
1   dollar: 1, industry: 4, country: 2, loan: 3, deal: 2, government: 2
2   machinery: 2, labor: 3, market: 4, industry: 2, work: 3, country: 1
3   job: 5, inflation: 3, rise: 2, jobless: 2, market: 3, country: 2, index: 3
4   domestic: 3, forecast: 2, gain: 1, market: 2, sale: 3, price: 2
5   patient: 4, symptom: 2, drug: 3, health: 2, clinic: 2, doctor: 2
6   pharmaceutical: 2, company: 3, drug: 2, vaccine: 1, flu: 3
7   death: 2, cancer: 4, drug: 3, public: 4, health: 3, director: 2
8   medical: 2, cost: 3, increase: 2, patient: 2, health: 3, care: 1
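
One common way to measure the similarity between such articles is cosine similarity on their word-count vectors. The sketch below (our illustration, using the counts from Table 1.2) shows that articles within a topic are far more similar to each other than to articles from the other topic:

```python
import math

articles = {
    1: {"dollar": 1, "industry": 4, "country": 2, "loan": 3, "deal": 2, "government": 2},
    2: {"machinery": 2, "labor": 3, "market": 4, "industry": 2, "work": 3, "country": 1},
    5: {"patient": 4, "symptom": 2, "drug": 3, "health": 2, "clinic": 2, "doctor": 2},
}

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

print(round(cosine(articles[1], articles[2]), 2))  # 0.25: shared economy vocabulary
print(round(cosine(articles[1], articles[5]), 2))  # 0.0: no words in common
```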
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include the detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.
Example 1.4 (Credit Card Fraud Detection). A credit card company records the transactions made by every credit card holder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared to the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of legitimate transactions for the users. When a new transaction arrives, it is compared against the profile of the user. If the characteristics of the transaction are very different from the previously created profile, then the transaction is flagged as potentially fraudulent.
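
A minimal sketch of this profile-based idea, with made-up transaction amounts: build a simple statistical profile (mean and standard deviation) from a user's history and flag new amounts that deviate too far from it:

```python
import statistics

history = [23.5, 41.0, 18.2, 37.9, 25.4, 30.1, 22.8, 35.6]  # past amounts ($)
mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(amount, threshold=3.0):
    """Flag amounts more than `threshold` standard deviations from the mean."""
    return abs(amount - mean) / std > threshold

print(is_anomalous(28.0))    # False: consistent with the user's profile
print(is_anomalous(2500.0))  # True: flagged as potentially fraudulent
```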
1.5 Scope and Organization of the Book
This book introduces the major principles and techniques used in data mining from an algorithmic perspective. A study of these principles and techniques is essential for developing a better understanding of how data mining technology can be applied to various kinds of data. This book also serves as a starting point for readers who are interested in doing research in this field.

We begin the technical discussion of this book with a chapter on data (Chapter 2), which discusses the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity. Although this material can be covered quickly, it provides an essential foundation for data analysis. Chapter 3, on data exploration, discusses summary statistics, visualization techniques, and On-Line Analytical Processing (OLAP). These techniques provide the means for quickly gaining insight into a data set.

Chapters 4 and 5 cover classification. Chapter 4 provides a foundation by discussing decision tree classifiers and several issues that are important to all classification: overfitting, performance evaluation, and the comparison of different classification models. Using this foundation, Chapter 5 describes a number of other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers, which are collections of classifiers. The multiclass and imbalanced class problems are also discussed. These topics can be covered independently.

Association analysis is explored in Chapters 6 and 7. Chapter 6 describes the basics of association analysis: frequent itemsets, association rules, and some of the algorithms used to generate them. Specific types of frequent itemsets (maximal, closed, and hyperclique) that are important for data mining are also discussed, and the chapter concludes with a discussion of evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how association analysis can be applied to categorical and continuous data or to data that has a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., store items, clothing, shoes, sneakers.) This chapter also describes how association analysis can be extended to find sequential patterns (patterns involving order), patterns in graphs, and negative relationships (if one item is present, then the other is not).

Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. This is followed by a discussion of techniques for validating the results of a clustering algorithm. Additional clustering concepts and techniques are explored in Chapter 9, including fuzzy and probabilistic clustering, Self-Organizing Maps (SOM), graph-based clustering, and density-based clustering. There is also a discussion of scalability issues and factors to consider when selecting a clustering algorithm.

The last chapter, Chapter 10, is on anomaly detection. After some basic definitions, several different types of anomaly detection are considered: statistical, distance-based, density-based, and clustering-based. Appendices A through E give a brief review of important topics that are used in portions of the book: linear algebra, dimensionality reduction, statistics, regression, and optimization.

The subject of data mining, while relatively young compared to statistics or machine learning, is already too large to cover in a single book. Selected references to topics that are only briefly covered, such as data quality, are provided in the bibliographic notes of the appropriate chapter. References to topics not covered in this book, such as data mining for streams and privacy-preserving data mining, are provided in the bibliographic notes of this chapter.
1.6 Bibliographic Notes

The topic of data mining has inspired many textbooks. Introductory textbooks include those by Dunham [10], Han and Kamber [21], Hand et al. [23], and Roiger and Geatz [36]. Data mining books with a stronger emphasis on business applications include the works by Berry and Linoff [2], Pyle [34], and Parr Rud [33]. Books with an emphasis on statistical learning include those by Cherkassky and Mulier [6], and Hastie et al. [24]. Some books with an emphasis on machine learning or pattern recognition are those by Duda et al. [9], Kantardzic [25], Mitchell [31], Webb [41], and Witten and Frank [42]. There are also some more specialized books: Chakrabarti [4] (web mining), Fayyad et al. [13] (collection of early articles on data mining), Fayyad et al. [11] (visualization), Grossman et al. [18] (science and engineering), Kargupta and Chan [26] (distributed data mining), Wang et al. [40] (bioinformatics), and Zaki and Ho [44] (parallel data mining).

There are several conferences related to data mining. Some of the main conferences dedicated to this field include the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), the IEEE International Conference on Data Mining (ICDM), the SIAM International Conference on Data Mining (SDM), the European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), and the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD). Data mining papers can also be found in other major conferences such as the ACM SIGMOD/PODS conference, the International Conference on Very Large Data Bases (VLDB), the Conference on Information and Knowledge Management (CIKM), the International Conference on Data Engineering (ICDE), the International Conference on Machine Learning (ICML), and the National Conference on Artificial Intelligence (AAAI).

Journal publications on data mining include IEEE Transactions on Knowledge and Data Engineering, Data Mining and Knowledge Discovery, Knowledge and Information Systems, Intelligent Data Analysis, Information Systems, and the Journal of Intelligent Information Systems.

There have been a number of general articles on data mining that define the field or its relationship to other fields, particularly statistics. Fayyad et al. [12] describe data mining and how it fits into the total knowledge discovery process. Chen et al. [5] give a database perspective on data mining. Ramakrishnan and Grama [35] provide a general discussion of data mining and present several viewpoints. Hand [22] describes how data mining differs from statistics, as does Friedman [14]. Lambert [29] explores the use of statistics for large data sets and provides some comments on the respective roles of data mining and statistics. Glymour et al. [16] consider the lessons that statistics may have for data mining. Smyth et al. [38] describe how the evolution of data mining is being driven by new types of data and applications, such as those involving streams, graphs, and text. Emerging applications in data mining are considered by Han et al. [20], and Smyth [37] describes some research challenges in data mining. A discussion of how developments in data mining research can be turned into practical tools is given by Wu et al. [43]. Data mining standards are the subject of a paper by Grossman et al. [17]. Bradley [3] discusses how data mining algorithms can be scaled to large data sets.

With the emergence of new data mining applications have come new challenges that need to be addressed. For instance, concerns about privacy breaches as a result of data mining have escalated in recent years, particularly in application domains such as Web commerce and health care. As a result, there is growing interest in developing data mining algorithms that maintain user privacy. Developing techniques for mining encrypted or randomized data is known as privacy-preserving data mining. Some general references in this area include papers by Agrawal and Srikant [1], Clifton et al. [7], and Kargupta et al. [27]. Verykios et al. [39] provide a survey.

Recent years have witnessed a growing number of applications that rapidly generate continuous streams of data. Examples of stream data include network traffic, multimedia streams, and stock prices. Several issues must be considered when mining data streams, such as the limited amount of memory available, the need for online analysis, and the change of the data over time. Data mining for stream data has become an important area in data mining. Some selected publications are Domingos and Hulten [8] (classification), Giannella et al. [15] (association analysis), Guha et al. [19] (clustering), Kifer et al. [28] (change detection), Papadimitriou et al. [32] (time series), and Law et al. [30] (dimensionality reduction).
Bibliography

[1] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 439-450, Dallas, Texas, 2000. ACM Press.
[2] M. J. A. Berry and G. Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley Computer Publishing, 2nd edition, 2004.
[3] P. S. Bradley, J. Gehrke, R. Ramakrishnan, and R. Srikant. Scaling mining algorithms to large databases. Communications of the ACM, 45(8):38-43, 2002.
[4] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco, CA, 2003.
[5] M. S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[6] V. Cherkassky and F. Mulier. Learning from Data: Concepts, Theory, and Methods. Wiley Interscience, 1998.
[7] C. Clifton, M. Kantarcioglu, and J. Vaidya. Defining privacy for data mining. In National Science Foundation Workshop on Next Generation Data Mining, pages 126-133, Baltimore, MD, November 2002.
[8] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 6th Intl. Conf. on Knowledge Discovery and Data Mining, pages 71-80, Boston, Massachusetts, 2000. ACM Press.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, Inc., New York, 2nd edition, 2001.
[10] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2002.
[11] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco, CA, September 2001.
[12] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, pages 1-34. AAAI Press, 1996.
[13] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[14] J. H. Friedman. Data Mining and Statistics: What's the Connection? Unpublished. www-stat.stanford.edu/~jhf/ftp/dm-stat.ps, 1997.
[15] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, editors, Next Generation Data Mining, pages 191-212. AAAI/MIT, 2003.
[16] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical Themes and Lessons for Data Mining. Data Mining and Knowledge Discovery, 1(1):11-28, 1997.
[17] R. L. Grossman, M. F. Hornick, and G. Meyer. Data mining standards initiatives. Communications of the ACM, 45(8):59-61, 2002.
[18] R. L. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu, editors. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
[19] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, May/June 2003.
[20] J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. Emerging scientific applications in data mining. Communications of the ACM, 45(8):54-58, 2002.
[21] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, 2001.
[22] D. J. Hand. Data Mining: Statistics and More? The American Statistician, 52(2):112-118, 1998.
[23] D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
[24] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, New York, 2001.
[25] M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. Wiley-IEEE Press, Piscataway, NJ, 2003.
[26] H. Kargupta and P. K. Chan, editors. Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, September 2002.
[27] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the Privacy Preserving Properties of Random Data Perturbation Techniques. In Proc. of the 2003 IEEE Intl. Conf. on Data Mining, pages 99-106, Melbourne, Florida, December 2003. IEEE Computer Society.
[28] D. Kifer, S. Ben-David, and J. Gehrke. Detecting Change in Data Streams. In Proc. of the 30th VLDB Conf., pages 180-191, Toronto, Canada, 2004. Morgan Kaufmann.
[29] D. Lambert. What Use is Statistics for Massive Data? In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 54-62, 2000.
[30] M. H. C. Law, N. Zhang, and A. K. Jain. Nonlinear Manifold Learning for Data Streams. In Proc. of the SIAM Intl. Conf. on Data Mining, Lake Buena Vista, Florida, April 2004. SIAM.
[31] T. Mitchell. Machine Learning. McGraw-Hill, Boston, MA, 1997.
[32] S. Papadimitriou, A. Brockwell, and C. Faloutsos. Adaptive, unsupervised stream mining. VLDB Journal, 13(3):222-239, 2004.
[33] O. Parr Rud. Data Mining Cookbook: Modeling Data for Marketing, Risk and Customer Relationship Management. John Wiley & Sons, New York, NY, 2001.
[34] D. Pyle. Business Modeling and Data Mining. Morgan Kaufmann, San Francisco, CA, 2003.
[35] N. Ramakrishnan and A. Grama. Data Mining: From Serendipity to Science (Guest Editors' Introduction). IEEE Computer, 32(8):34-37, 1999.
[36] R. Roiger and M. Geatz. Data Mining: A Tutorial-Based Primer. Addison-Wesley, 2002.
[37] P. Smyth. Breaking out of the Black-Box: Research Challenges in Data Mining. In Proc. of the 2001 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2001.
[38] P. Smyth, D. Pregibon, and C. Faloutsos. Data-driven evolution of data mining algorithms. Communications of the ACM, 45(8):33-37, 2002.
[39] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis. State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1):50-57, 2004.
[40] J. T. L. Wang, M. J. Zaki, H. Toivonen, and D. E. Shasha, editors. Data Mining in Bioinformatics. Springer, September 2004.
[41] A. R. Webb. Statistical Pattern Recognition. John Wiley & Sons, 2nd edition, 2002.
[42] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[43] X. Wu, P. S. Yu, and G. Piatetsky-Shapiro. Data Mining: How Research Meets Practical Development? Knowledge and Information Systems, 5(2):248-261, 2003.
[44] M. J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining. Springer, September 2002.
1.7 Exercises
1. Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
(b) Dividing the customers of a company according to their profitability.
(c) Computing the total sales of a company.
(d) Sorting a student database based on student identification numbers.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
(f) Predicting the future stock price of a company using historical records.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Monitoring seismic waves for earthquake activities.
(i) Extracting the frequencies of a sound wave.
2. Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques such as clustering, classification, association rule mining, and anomaly detection can be applied.
3. For each of the following data sets, explain whether or not data privacy is an important issue.
(a) Census data collected from 1900-1950.
(b) IP addresses and visit times of Web users who visit your Website.
(c) Images from Earth-orbiting satellites.
(d) Names and addresses of people from the telephone book.
(e) Names and email addresses collected from the Web.
2 Data

This chapter discusses several data-related issues that are important for successful data mining:
The Type of Data Data sets differ in a number of ways. For example, the attributes used to describe data objects can be of different types (quantitative or qualitative), and data sets may have special characteristics; e.g., some data sets contain time series or objects with explicit relationships to one another. Not surprisingly, the type of data determines which tools and techniques can be used to analyze the data. Furthermore, new research in data mining is often driven by the need to accommodate new application areas and their new types of data.
The Quality of the Data Data is often far from perfect. While most data mining techniques can tolerate some level of imperfection in the data, a focus on understanding and improving data quality typically improves the quality of the resulting analysis. Data quality issues that often need to be addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and data that is biased or, in some other way, unrepresentative of the phenomenon or population that the data is supposed to describe.
Preprocessing Steps to Make the Data More Suitable for Data Mining Often, the raw data must be processed in order to make it suitable for analysis. While one objective may be to improve data quality, other goals focus on modifying the data so that it better fits a specified data mining technique or tool. For example, a continuous attribute, e.g., length, may need to be transformed into an attribute with discrete categories, e.g., short, medium, or long, in order to apply a particular technique. As another example, the number of attributes in a data set is often reduced because many techniques are more effective when the data has a relatively small number of attributes.
Analyzing Data in Terms of Its Relationships One approach to data analysis is to find relationships among the data objects and then perform the remaining analysis using these relationships rather than the data objects themselves. For instance, we can compute the similarity or distance between pairs of objects and then perform the analysis (clustering, classification, or anomaly detection) based on these similarities or distances. There are many such similarity or distance measures, and the proper choice depends on the type of data and the particular application.
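
As a minimal sketch of this relationship-based view (the two-dimensional points are illustrative): once pairwise distances are computed, the subsequent analysis can operate on the distances alone.

```python
import math

objects = {"a": (1.0, 2.0), "b": (1.5, 1.8), "c": (8.0, 9.0)}

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

for n1, p1 in objects.items():
    for n2, p2 in objects.items():
        if n1 < n2:  # each unordered pair once
            print(n1, n2, round(euclidean(p1, p2), 2))
# a and b are close (candidates for the same cluster); c is far from both.
```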
Example 2.1 (An Illustration of Data-Related Issues). To further illustrate the importance of these issues, consider the following hypothetical situation. You receive an email from a medical researcher concerning a project that you are eager to work on.

Hi,
I've attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.
Thanks and see you in a couple of days.
Despite some misgivings, you proceed to analyze the data. The first few rows of the file are as follows:

012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
...

A brief look at the data reveals nothing strange. You put your doubts aside and start the analysis. There are only 1000 lines, a smaller data file than you had hoped for, but two days later, you feel that you have made some progress. You arrive for the meeting, and while waiting for others to arrive, you strike up a conversation with a statistician who is working on the project. When she learns that you have also been analyzing the data from the project, she asks if you would mind giving her a brief overview of your results.
Statistician: So, you got the data for all the patients?
Data Miner: Yes. I haven't had much time for analysis, but I do have a few interesting results.
Statistician: Amazing. There were so many data issues with this set of patients that I couldn't do much.
Data Miner: Oh? I didn't hear about any possible problems.
Statistician: Well, first there is field 5, the variable we want to predict. It's common knowledge among people who analyze this type of data that results are better if you work with the log of the values, but I didn't discover this until later. Was it mentioned to you?
Data Miner: No.
Statistician: But surely you heard about what happened to field 4? It's supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but because of a data entry error, all 10s were changed into 0s. Unfortunately, since some of the patients have missing values for this field, it's impossible to say whether a 0 in this field is a real 0 or a 10. Quite a few of the records have that problem.
Data Miner: Interesting. Were there any other problems?
Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.
Data Miner: Yes, but these fields were only weak predictors of field 5.
Statistician: Anyway, given all those problems, I'm surprised you were able to accomplish anything.
Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
Statistician: What? Field 1 is just an identification number.
Data Miner: Nonetheless, my results speak for themselves.
Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.
Although this scenario represents an extreme situation, it emphasizes the importance of knowing your data. To that end, this chapter will address each of the four issues mentioned above, outlining some of the basic challenges and standard approaches.
2.1 Types of Data
A data set can often be viewed as a collection of data objects. Other names for a data object are record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects are described by a number of attributes that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred. Other names for an attribute are variable, characteristic, field, feature, or dimension.
Example 2.2 (Student Information). Often, a data set is a file, in which the objects are records (or rows) in the file and each field (or column) corresponds to an attribute. For example, Table 2.1 shows a data set that consists of student information. Each row corresponds to a student and each column is an attribute that describes some aspect of a student, such as grade point average (GPA) or identification number (ID).
Table 2.1. A sample data set containing student information.

Student ID   Year        Grade Point Average (GPA)   ...
1034262      Senior      3.24   ...
1052663      Sophomore   3.51   ...
1082246      Freshman    3.62   ...
...
Although record-based data sets are common, either in flat files or relational database systems, there are other important types of data sets and systems for storing data. In Section 2.1.2, we will discuss some of the types of data sets that are commonly encountered in data mining. However, we first consider attributes.
2.1.1 Attributes and Measurement
In this section we address the issue of describing data by considering what types of attributes are used to describe data objects. We first define an attribute, then consider what we mean by the type of an attribute, and finally describe the types of attributes that are commonly encountered.
What Is an attribute?

We start with a more detailed definition of an attribute.

Definition 2.1. An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

For example, eye color varies from person to person, while the temperature of an object varies over time. Note that eye color is a symbolic attribute with a small number of possible values (brown, black, blue, green, hazel, etc.), while temperature is a numerical attribute with a potentially unlimited number of values.

At the most basic level, attributes are not about numbers or symbols. However, to discuss and more precisely analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.
Definition 2.2. A measurement scale is a rule (function) that associates a numerical or symbolic value with an attribute of an object.

Formally, the process of measurement is the application of a measurement scale to associate a value with a particular attribute of a specific object. While this may seem a bit abstract, we engage in the process of measurement all the time. For instance, we step on a bathroom scale to determine our weight, we classify someone as male or female, or we count the number of chairs in a room to see if there will be enough to seat all the people coming to a meeting. In all these cases, the "physical value" of an attribute of an object is mapped to a numerical or symbolic value.

With this background, we can now discuss the type of an attribute, a concept that is important in determining if a particular data analysis technique is consistent with a specific type of attribute.
The Type of an Attribute

It should be apparent from the previous discussion that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute may have properties that are not properties of the attribute itself, and vice versa. This is illustrated with two examples.
Example 2.3 (Employee Age and ID Number). Two attributes that might be associated with an employee are ID and age (in years). Both of these attributes can be represented as integers. However, while it is reasonable to talk about the average age of an employee, it makes no sense to talk about the average employee ID. Indeed, the only aspect of employees that we want to capture with the ID attribute is that they are distinct. Consequently, the only valid operation for employee IDs is to test whether they are equal. There is no hint of this limitation, however, when integers are used to represent the employee ID attribute. For the age attribute, the properties of the integers used to represent age are very much the properties of the attribute. Even so, the correspondence is not complete since, for example, ages have a maximum, while integers do not.
Example 2.4 (Length of Line Segments). Consider Figure 2.1, which shows some objects (line segments) and how the length attribute of these objects can be mapped to numbers in two different ways. Each successive line segment, going from the top to the bottom, is formed by appending the topmost line segment to itself. Thus, the second line segment from the top is formed by appending the topmost line segment to itself twice, the third line segment from the top is formed by appending the topmost line segment to itself three times, and so forth. In a very real (physical) sense, all the line segments are multiples of the first. This fact is captured by the measurements on the right-hand side of the figure, but not by those on the left-hand side. More specifically, the measurement scale on the left-hand side captures only the ordering of the length attribute, while the scale on the right-hand side captures both the ordering and additivity properties. Thus, an attribute can be measured in a way that does not capture all the properties of the attribute.
The type of an attribute should tell us what properties of the attribute are reflected in the values used to measure it. Knowing the type of an attribute is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, and therefore, it allows us to avoid foolish actions, such as computing the average employee ID. Note that it is common to refer to the type of an attribute as the type of a measurement scale.
Figure 2.1. The measurement of the length of line segments on two different scales of measurement: a mapping of lengths to numbers that captures only the order properties of length (left), and a mapping that captures both the order and additivity properties of length (right).
The Different Types of Attributes

A useful (and simple) way to specify the type of an attribute is to identify the properties of numbers that correspond to underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers. It makes sense to compare and order objects by length, as well as to talk about the differences and ratios of length. The following properties (operations) of numbers are typically used to describe attributes.

1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and -
4. Multiplication: × and /
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio. Table 2.2 gives the definitions of these types, along with information about the statistical operations that are valid for each type. Each attribute type possesses all of the properties and operations of the attribute types above it. Consequently, any property or operation that is valid for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the definition of the attribute types is cumulative. However, this does not mean that the statistical operations appropriate for one attribute type are appropriate for the attribute types above it.
Table 2.2. Different attribute types.

Categorical (Qualitative)

Nominal (=, ≠): The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another.
  Examples: zip codes, employee ID numbers, eye color, gender.
  Operations: mode, entropy, contingency correlation, chi-square test.

Ordinal (<, >): The values of an ordinal attribute provide enough information to order objects.
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative)

Interval (+, -): For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Operations: mean, standard deviation, Pearson's correlation, t and F tests.

Ratio (×, /): For ratio variables, both differences and ratios are meaningful.
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
  Operations: geometric mean, harmonic mean, percent variation.
Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the properties of numbers. Even if they are represented by numbers, i.e., integers, they should be treated more like symbols. The remaining two types of attributes, interval and ratio, are collectively referred to as quantitative or numeric attributes. Quantitative attributes are represented by numbers and have most of the properties of numbers. Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the meaning of an attribute. Indeed, S. S. Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2, defined them in terms of these permissible transformations. For example, the meaning of a length attribute is unchanged if it is measured in meters instead of feet.
Table 2.3. Transformations that define attribute levels.

Categorical (Qualitative)

Nominal: Any one-to-one mapping, e.g., a permutation of values. (If all employee ID numbers are reassigned, it will not make any difference.)

Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Numeric (Quantitative)

Interval: new_value = a × old_value + b, where a and b are constants. (The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).)

Ratio: new_value = a × old_value. (Length can be measured in meters or feet.)
The statistical operations that make sense for a particular type of attribute are those that will yield the same results when the attribute is transformed using a transformation that preserves the attribute's meaning. To illustrate, the average length of a set of objects is different when measured in meters rather than in feet, but both averages represent the same length. Table 2.3 shows the permissible (meaning-preserving) transformations for the four attribute types of Table 2.2.
Example 2.5 (Temperature Scales). Temperature provides a good illustration
of some of the concepts that have been described. First, temperature
can be either an interval or a ratio attribute, depending on its measurement
scale. When measured on the Kelvin scale, a temperature of 2° is, in a
physically meaningful way, twice that of a temperature of 1°. This is not true
when temperature is measured on either the Celsius or Fahrenheit scales,
because, physically, a temperature of 1° Fahrenheit (Celsius) is not much
different from a temperature of 2° Fahrenheit (Celsius). The problem is that
the zero points of the Fahrenheit and Celsius scales are, in a physical sense,
arbitrary, and therefore, the ratio of two Celsius or Fahrenheit temperatures
is not physically meaningful.
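To make the notion of a permissible transformation concrete, here is a small
Python sketch (our own illustration; the function names are not from the text).
It applies the interval transformation new_value = a * old_value + b that
converts Celsius to Fahrenheit, and shows that differences survive the
transformation (up to the scale factor a) while ratios do not, whereas ratios
of Kelvin temperatures are meaningful.

    def celsius_to_fahrenheit(c):
        # Interval transformation: new = a * old + b, with a = 9/5, b = 32.
        return 9.0 / 5.0 * c + 32.0

    def celsius_to_kelvin(c):
        # Kelvin differs from Celsius only in the location of zero.
        return c + 273.15

    t1, t2 = 10.0, 20.0   # two temperatures in Celsius

    # Ratios are not preserved by the interval transformation ...
    print(t2 / t1)                                                # 2.0
    print(celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1))  # 1.36

    # ... but differences are preserved up to the scale factor a, and
    # ratios of Kelvin (ratio-scale) temperatures are physically meaningful.
    print(t2 - t1, celsius_to_fahrenheit(t2) - celsius_to_fahrenheit(t1))  # 10.0 18.0
    print(celsius_to_kelvin(t2) / celsius_to_kelvin(t1))          # about 1.035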
Describing Attributes by the Number of Values

An independent way of distinguishing between attributes is by the number of
values they can take.

Discrete A discrete attribute has a finite or countably infinite set of values.
Such attributes can be categorical, such as zip codes or ID numbers, or
numeric, such as counts. Discrete attributes are often represented using
integer variables. Binary attributes are a special case of discrete attributes
and assume only two values, e.g., true/false, yes/no, male/female, or 0/1.
Binary attributes are often represented as Boolean variables, or as integer
variables that only take the values 0 or 1.

Continuous A continuous attribute is one whose values are real numbers.
Examples include attributes such as temperature, height, or weight. Continuous
attributes are typically represented as floating-point variables. Practically,
real values can only be measured and represented with limited precision.

In theory, any of the measurement scale types (nominal, ordinal, interval, and
ratio) could be combined with any of the types based on the number of
attribute values (binary, discrete, and continuous). However, some combinations
occur only infrequently or do not make much sense. For instance, it is difficult
to think of a realistic data set that contains a continuous binary attribute.
Typically, nominal and ordinal attributes are binary or discrete, while interval
and ratio attributes are continuous. However, count attributes, which are
discrete, are also ratio attributes.
Asymmetric Attributes

For asymmetric attributes, only presence (a non-zero attribute value) is
regarded as important. Consider a data set where each object is a student and
each attribute records whether or not a student took a particular course at
a university. For a specific student, an attribute has a value of 1 if the
student took the course associated with that attribute and a value of 0 otherwise.
Because students take only a small fraction of all available courses, most of
the values in such a data set would be 0. Therefore, it is more meaningful
and more efficient to focus on the non-zero values. To illustrate, if students
are compared on the basis of the courses they don't take, then most students
would seem very similar, at least if the number of courses is large. Binary
attributes where only non-zero values are important are called asymmetric
binary attributes. This type of attribute is particularly important for
association analysis, which is discussed in Chapter 6. It is also possible to have
discrete or continuous asymmetric features. For instance, if the number of
credits associated with each course is recorded, then the resulting data set will
consist of asymmetric discrete or continuous attributes.
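To see why focusing on the non-zero values matters, consider this sketch
(hypothetical students and courses, invented for illustration), which stores
only the courses actually taken and compares students using the Jaccard
coefficient over those non-zero entries:

    # Hypothetical student-course data stored sparsely: for each student,
    # only the courses actually taken (the non-zero entries) are recorded.
    courses_taken = {
        "s1": {"CS101", "CS201", "MATH1"},
        "s2": {"CS101", "BIO1"},
        "s3": {"HIST1"},
    }

    def jaccard(a, b):
        # Similarity based only on presence; the many courses that
        # neither student took are ignored entirely.
        return len(a & b) / len(a | b) if a | b else 0.0

    print(jaccard(courses_taken["s1"], courses_taken["s2"]))  # 0.25
    print(jaccard(courses_taken["s1"], courses_taken["s3"]))  # 0.0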
2.1.2 Types of Data Sets

There are many types of data sets, and as the field of data mining develops
and matures, a greater variety of data sets become available for analysis. In
this section, we describe some of the most common types. For convenience,
we have grouped the types of data sets into three groups: record data,
graph-based data, and ordered data. These categories do not cover all
possibilities and other groupings are certainly possible.
General Characteristics of Data Sets

Before providing details of specific kinds of data sets, we discuss three
characteristics that apply to many data sets and have a significant impact on the
data mining techniques that are used: dimensionality, sparsity, and resolution.

Dimensionality The dimensionality of a data set is the number of attributes
that the objects in the data set possess. Data with a small number of dimensions
tends to be qualitatively different from moderate or high-dimensional
data. Indeed, the difficulties associated with analyzing high-dimensional data
are sometimes referred to as the curse of dimensionality. Because of this,
an important motivation in preprocessing the data is dimensionality reduction.
These issues are discussed in more depth later in this chapter and in
Appendix B.

Sparsity For some data sets, such as those with asymmetric features, most
attributes of an object have values of 0; in many cases, fewer than 1% of
the entries are non-zero. In practical terms, sparsity is an advantage because
usually only the non-zero values need to be stored and manipulated. This
results in significant savings with respect to computation time and storage.
Furthermore, some data mining algorithms work well only for sparse data.

Resolution It is frequently possible to obtain data at different levels of
resolution, and often the properties of the data are different at different
resolutions. For instance, the surface of the Earth seems very uneven at a
resolution of a few meters, but is relatively smooth at a resolution of tens of
kilometers. The patterns in the data also depend on the level of resolution. If
the resolution is too fine, a pattern may not be visible or may be buried in
noise; if the resolution is too coarse, the pattern may disappear. For example,
variations in atmospheric pressure on a scale of hours reflect the movement of
storms and other weather systems. On a scale of months, such phenomena are
not detectable.
Record Data

Much data mining work assumes that the data set is a collection of records
(data objects), each of which consists of a fixed set of data fields (attributes).
See Figure 2.2(a). For the most basic form of record data, there is no explicit
relationship among records or data fields, and every record (object) has the
same set of attributes. Record data is usually stored either in flat files or in
relational databases. Relational databases are certainly more than a collection
of records, but data mining often does not use any of the additional information
available in a relational database. Rather, the database serves as a convenient
place to find records. Different types of record data are described below and
are illustrated in Figure 2.2.

Transaction or Market Basket Data Transaction data is a special type
of record data, where each record (transaction) involves a set of items. Consider
a grocery store. The set of products purchased by a customer during one
shopping trip constitutes a transaction, while the individual products that
were purchased are the items. This type of data is called market basket
data because the items in each record are the products in a person's market
basket. Transaction data is a collection of sets of items, but it can be
viewed as a set of records whose fields are asymmetric attributes. Most often,
the attributes are binary, indicating whether or not an item was purchased,
but more generally, the attributes can be discrete or continuous, such as the
number of items purchased or the amount spent on those items. Figure 2.2(b)
shows a sample transaction data set. Each row represents the purchases of a
particular customer at a particular time.
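The two views of transaction data (sets of items versus records of asymmetric
binary attributes) can be illustrated with data like that of Figure 2.2(b); a
short sketch:

    # Transactions as sets of items (market basket view).
    transactions = [
        {"Bread", "Soda", "Milk"},
        {"Beer", "Bread"},
        {"Beer", "Soda", "Diaper", "Milk"},
    ]

    # The same data as records with asymmetric binary attributes:
    # one 0/1 field per item, where only the 1s are meaningful.
    items = sorted(set().union(*transactions))
    binary_records = [[int(item in t) for item in items] for t in transactions]

    print(items)        # ['Beer', 'Bread', 'Diaper', 'Milk', 'Soda']
    for row in binary_records:
        print(row)      # e.g., [0, 1, 0, 1, 1] for the first transaction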
The Data Matrix If the data objects in a collection of data all have the
same fixed set of numeric attributes, then the data objects can be thought of as
points (vectors) in a multidimensional space, where each dimension represents
a distinct attribute describing the object. A set of such data objects can be
interpreted as an m by n matrix, where there are m rows, one for each object,
and n columns, one for each attribute.
Figure 2.2. Different variations of record data:
(a) Record data: each record has the fields Tid, Refund, Marital Status,
    Taxable Income, and Defaulted Borrower.
(b) Transaction data:
    TID   ITEMS
    1     Bread, Soda, Milk
    2     Beer, Bread
    3     Beer, Soda, Diaper, Milk
    4     Beer, Bread, Diaper, Milk
    5     Soda, Diaper, Milk
(c) Data matrix: rows of numeric fields (Projection of x Load, Projection of
    y Load, Distance, Load, Thickness).
(d) Document-term matrix: rows are documents, columns are terms (team, coach,
    play, score, game, win, lost, timeout, season, ball), entries are counts:
    Document 1:  3 0 5 0 2 6 0 2 0 2
    Document 2:  0 7 0 2 1 0 0 3 0 0
    Document 3:  0 1 0 0 1 2 2 0 3 0
(A representation that has data objects as columns and attributes as rows is
also fine.) This matrix is called a data matrix or a pattern matrix. A data
matrix is a variation of record data, but because it consists of numeric
attributes, standard matrix operations can be applied to transform and
manipulate the data. Therefore, the data matrix is the standard data format for
most statistical data. Figure 2.2(c) shows a sample data matrix.
The Sparse Data Matrix A sparse data matrix is a special case of a data
matrix in which the attributes are of the same type and are asymmetric; i.e.,
only non-zero values are important. Transaction data is an example of a sparse
data matrix that has only 0-1 entries. Another common example is document
data. In particular, if the order of the terms (words) in a document is ignored,
then a document can be represented as a term vector, where each term is
a component (attribute) of the vector and the value of each component is
the number of times the corresponding term occurs in the document. This
representation of a collection of documents is often called a document-term
matrix. Figure 2.2(d) shows a sample document-term matrix. The documents
are the rows of this matrix, while the terms are the columns. In practice, only
the non-zero entries of sparse data matrices are stored.
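A sketch of building such a sparse representation (toy documents; a real system
would tokenize and normalize the text first):

    from collections import Counter

    # Three toy documents; in practice these would be real text.
    docs = ["team coach play ball ball",
            "game score game win",
            "coach season ball"]

    # Sparse document-term matrix: for each document, a map from term to
    # the number of times the term occurs. Zero entries are never stored.
    sparse_dtm = [Counter(doc.split()) for doc in docs]

    print(sparse_dtm[0]["ball"])   # 2
    print(sparse_dtm[1]["ball"])   # 0 -- absent terms default to zero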
Graph-Based Data

A graph can sometimes be a convenient and powerful representation for data.
We consider two specific cases: (1) the graph captures relationships among
data objects and (2) the data objects themselves are represented as graphs.

Data with Relationships among Objects The relationships among objects
frequently convey important information. In such cases, the data is often
represented as a graph. In particular, the data objects are mapped to nodes
of the graph, while the relationships among objects are captured by the links
between objects and link properties, such as direction and weight. Consider
Web pages on the World Wide Web, which contain both text and links to
other pages. In order to process search queries, Web search engines collect
and process Web pages to extract their contents. It is well known, however,
that the links to and from each page provide a great deal of information about
the relevance of a Web page to a query, and thus, must also be taken into
consideration. Figure 2.3(a) shows a set of linked Web pages.

Data with Objects That Are Graphs If objects have structure, that
is, the objects contain subobjects that have relationships, then such objects
are frequently represented as graphs. For example, the structure of chemical
compounds can be represented by a graph, where the nodes are atoms and the
links between nodes are chemical bonds. Figure 2.3(b) shows a ball-and-stick
diagram of the chemical compound benzene, which contains atoms of carbon
(black) and hydrogen (gray). A graph representation makes it possible to
determine which substructures occur frequently in a set of compounds and to
ascertain whether the presence of any of these substructures is associated with
the presence or absence of certain chemical properties, such as melting point
or heat of formation. Substructure mining, which is a branch of data mining
that analyzes such data, is considered in Section 7.5.
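Both cases can be sketched with ordinary adjacency structures. The sketch
below is our own illustration; in particular, the benzene graph is simplified
and ignores bond types:

    # (1) Relationships among data objects: linked Web pages as a
    # directed graph, stored as an adjacency list.
    links = {
        "pageA": ["pageB", "pageC"],
        "pageB": ["pageC"],
        "pageC": [],
    }

    # (2) A data object that is itself a graph: benzene's carbon ring.
    # Nodes are atoms, edges are bonds (a simplified sketch).
    atoms = ["C1", "C2", "C3", "C4", "C5", "C6"]
    bonds = [(atoms[i], atoms[(i + 1) % 6]) for i in range(6)]  # the ring
    bonds += [(c, c.replace("C", "H")) for c in atoms]          # one H per C

    print(len(bonds))  # 12 bonds: 6 ring bonds + 6 C-H bonds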
Figure 2.3. Different variations of graph data: (a) a set of linked Web pages
(a page of data mining book references and the pages it links to), (b) the
benzene molecule.
Ordered Data

For some types of data, the attributes have relationships that involve order
in time or space. Different types of ordered data are described next and are
shown in Figure 2.4.

Sequential Data Sequential data, also referred to as temporal data, can
be thought of as an extension of record data, where each record has a time
associated with it. Consider a retail transaction data set that also stores the
time at which the transaction took place. This time information makes it
possible to find patterns such as "candy sales peak before Halloween." A time
can also be associated with each attribute. For example, each record could
be the purchase history of a customer, with a listing of items purchased at
different times. Using this information, it is possible to find patterns such as
"people who buy DVD players tend to buy DVDs in the period immediately
following the purchase." Figure 2.4(a) shows an example of sequential
transaction data.
Figure 2.4. Different variations of ordered data:
(a) Sequential transaction data:
    Time  Customer  Items Purchased        Customer  Time and Items Purchased
    t1    C1        A, B                   C1        (t1: A,B) (t2: C,D) (t5: A,E)
    t2    C3        A, C                   C2        (t3: A,D) (t4: E)
    t2    C1        C, D                   C3        (t2: A,C)
    t3    C2        A, D
    t4    C2        E
    t5    C1        A, E
(b) Genomic sequence data, e.g., GGTTCCGCCTTCAGCCCCGCGCC ...
(c) Temperature time series: Minneapolis average monthly temperature
    (degrees Celsius), 1982-1993.
(d) Spatial temperature data: average temperature plotted by longitude
    and latitude.
There are five different times (t1, t2, t3, t4, and t5); three different
customers (C1, C2, and C3); and five different items (A, B, C, D, and E). In the
top table, each row corresponds to the items purchased at a particular time by
each customer. For instance, at time t3, customer C2 purchased items A and D. In
the bottom table, the same information is displayed, but each row corresponds
to a particular customer. Each row contains information on each transaction
involving the customer, where a transaction is considered to be a set of items
and the time at which those items were purchased. For example, customer C3
bought items A and C at time t2.
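The two views in Figure 2.4(a) are easy to convert between; a short sketch
using the data from the figure:

    from collections import defaultdict

    # Sequential transaction data: (time, customer, items), as in the
    # top table of Figure 2.4(a).
    events = [("t1", "C1", {"A", "B"}), ("t2", "C3", {"A", "C"}),
              ("t2", "C1", {"C", "D"}), ("t3", "C2", {"A", "D"}),
              ("t4", "C2", {"E"}),      ("t5", "C1", {"A", "E"})]

    # Regroup by customer to get the bottom table: each row is one
    # customer's full purchase history, ordered by time.
    history = defaultdict(list)
    for time, customer, items in sorted(events):
        history[customer].append((time, items))

    print(history["C1"])  # [('t1', {'A','B'}), ('t2', {'C','D'}), ('t5', {'A','E'})]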
Sequence Data Sequence data consists of a data set that is a sequence of
individual entities, such as a sequence of words or letters. It is quite similar to
sequential data, except that there are no time stamps; instead, there are
positions in an ordered sequence. For example, the genetic information of plants
and animals can be represented in the form of sequences of nucleotides that
are known as genes. Many of the problems associated with genetic sequence
data involve predicting similarities in the structure and function of genes from
similarities in nucleotide sequences. Figure 2.4(b) shows a section of the
human genetic code expressed using the four nucleotides from which all DNA is
constructed: A, T, G, and C.

Time Series Data Time series data is a special type of sequential data
in which each record is a time series, i.e., a series of measurements taken
over time. For example, a financial data set might contain objects that are
time series of the daily prices of various stocks. As another example, consider
Figure 2.4(c), which shows a time series of the average monthly temperature
for Minneapolis during the years 1982 to 1994. When working with temporal
data, it is important to consider temporal autocorrelation; i.e., if two
measurements are close in time, then the values of those measurements are
often very similar.
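Temporal autocorrelation can be quantified directly. The sketch below (our
illustration; this is one common definition among several) computes a lag-1
autocorrelation, which is high when neighboring values are similar:

    def lag1_autocorrelation(x):
        # Correlation between the series and itself shifted by one step.
        n = len(x)
        mean = sum(x) / n
        num = sum((x[t] - mean) * (x[t + 1] - mean) for t in range(n - 1))
        den = sum((v - mean) ** 2 for v in x)
        return num / den

    smooth = [1, 2, 3, 4, 5, 6, 7, 8]     # neighboring values similar
    jumpy  = [1, 8, 2, 7, 3, 6, 4, 5]     # neighboring values dissimilar
    print(lag1_autocorrelation(smooth))   # 0.625 -- clearly positive
    print(lag1_autocorrelation(jumpy))    # about -0.815 -- negative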
Spatial Data Some objects have spatial attributes, such as positions or areas,
as well as other types of attributes. An example of spatial data is weather
data (precipitation, temperature, pressure) that is collected for a variety of
geographical locations. An important aspect of spatial data is spatial
autocorrelation; i.e., objects that are physically close tend to be similar in other
ways as well. Thus, two points on the Earth that are close to each other
usually have similar values for temperature and rainfall.

Important examples of spatial data are the science and engineering data
sets that are the result of measurements or model output taken at regularly
or irregularly distributed points on a two- or three-dimensional grid or mesh.
For instance, Earth science data sets record the temperature or pressure
measured at points (grid cells) on latitude-longitude spherical grids of various
resolutions, e.g., 1° by 1°. (See Figure 2.4(d).) As another example, in the
simulation of the flow of a gas, the speed and direction of flow can be recorded
for each grid point in the simulation.
Handling Non-Record Data

Most data mining algorithms are designed for record data or its variations,
such as transaction data and data matrices. Record-oriented techniques can
be applied to non-record data by extracting features from data objects and
using these features to create a record corresponding to each object. Consider
the chemical structure data that was described earlier. Given a set of common
substructures, each compound can be represented as a record with binary
attributes that indicate whether a compound contains a specific substructure.
Such a representation is actually a transaction data set, where the transactions
are the compounds and the items are the substructures.

In some cases, it is easy to represent the data in a record format, but
this type of representation does not capture all the information in the data.
Consider spatio-temporal data consisting of a time series from each point on
a spatial grid. This data is often stored in a data matrix, where each row
represents a location and each column represents a particular point in time.
However, such a representation does not explicitly capture the time
relationships that are present among attributes and the spatial relationships that
exist among objects. This does not mean that such a representation is
inappropriate, but rather that these relationships must be taken into consideration
during the analysis. For example, it would not be a good idea to use a data
mining technique that assumes the attributes are statistically independent of
one another.
2.2 Data Quality

Data mining applications are often applied to data that was collected for
another purpose, or for future, but unspecified applications. For that reason,
data mining cannot usually take advantage of the significant benefits of
addressing quality issues at the source. In contrast, much of statistics deals
with the design of experiments or surveys that achieve a prespecified level of
data quality. Because preventing data quality problems is typically not an
option, data mining focuses on (1) the detection and correction of data quality
problems and (2) the use of algorithms that can tolerate poor data quality.
The first step, detection and correction, is often called data cleaning.

The following sections discuss specific aspects of data quality. The focus is
on measurement and data collection issues, although some application-related
issues are also discussed.
2.2.1 Measurement and Data Collection Issues

It is unrealistic to expect that data will be perfect. There may be problems due
to human error, limitations of measuring devices, or flaws in the data collection
process. Values or even entire data objects may be missing. In other cases,
there may be spurious or duplicate objects; i.e., multiple data objects that all
correspond to a single real object. For example, there might be two different
records for a person who has recently lived at two different addresses. Even if
all the data is present and looks fine, there may be inconsistencies; for example,
a person has a height of 2 meters, but weighs only 2 kilograms.

In the next few sections, we focus on aspects of data quality that are related
to data measurement and collection. We begin with a definition of measurement
and data collection errors and then consider a variety of problems that
involve measurement error: noise, artifacts, bias, precision, and accuracy. We
conclude by discussing data quality issues that may involve both measurement
and data collection problems: outliers, missing and inconsistent values, and
duplicate data.
Measurement and Data Collection Errors

The term measurement error refers to any problem resulting from the
measurement process. A common problem is that the value recorded differs from
the true value to some extent. For continuous attributes, the numerical
difference of the measured and true value is called the error. The term data
collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object. For example, a study of
animals of a certain species might include animals of a related species that are
similar in appearance to the species of interest. Both measurement errors and
data collection errors can be either systematic or random.

We will only consider general types of errors. Within particular domains,
there are certain types of data errors that are commonplace, and there often
exist well-developed techniques for detecting and/or correcting these errors.
For example, keyboard errors are common when data is entered manually, and
as a result, many data entry programs have techniques for detecting and, with
human intervention, correcting such errors.
Noise and Artifacts

Noise is the random component of a measurement error. It may involve the
distortion of a value or the addition of spurious objects. Figure 2.5 shows a
time series before and after it has been disrupted by random noise. If a bit
more noise were added to the time series, its shape would be lost. Figure 2.6
shows a set of data points before and after some noise points (indicated by
+'s) have been added. Notice that some of the noise points are intermixed
with the non-noise points.

Figure 2.5. Noise in a time series context: (a) time series, (b) time series
with noise.

Figure 2.6. Noise in a spatial context: (a) three groups of points, (b) with
noise points (+) added.

The term noise is often used in connection with data that has a spatial or
temporal component. In such cases, techniques from signal or image processing
can frequently be used to reduce noise and thus, help to discover patterns
(signals) that might be lost in the noise. Nonetheless, the elimination of
noise is frequently difficult, and much work in data mining focuses on devising
robust algorithms that produce acceptable results even when noise is present.

Data errors may be the result of a more deterministic phenomenon, such
as a streak in the same place on a set of photographs. Such deterministic
distortions of the data are often referred to as artifacts.
Precision, Bias, and Accuracy

In statistics and experimental science, the quality of the measurement process
and the resulting data are measured by precision and bias. We provide the
standard definitions, followed by a brief discussion. For the following
definitions, we assume that we make repeated measurements of the same underlying
quantity and use this set of values to calculate a mean (average) value that
serves as our estimate of the true value.

Definition 2.3 (Precision). The closeness of repeated measurements (of the
same quantity) to one another.

Definition 2.4 (Bias). A systematic variation of measurements from the
quantity being measured.

Precision is often measured by the standard deviation of a set of values,
while bias is measured by taking the difference between the mean of the set
of values and the known value of the quantity being measured. Bias can
only be determined for objects whose measured quantity is known by means
external to the current situation. Suppose that we have a standard laboratory
weight with a mass of 1g and want to assess the precision and bias of our new
laboratory scale. We weigh the mass five times, and obtain the following five
values: 1.015, 0.990, 1.013, 1.001, 0.986. The mean of these values is 1.001,
and hence, the bias is 0.001. The precision, as measured by the standard
deviation, is 0.013.
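The computation in this example is easy to reproduce. The following sketch
uses the sample standard deviation (with n - 1 in the denominator), which
matches the value reported above:

    # The five measurements of the 1g standard weight from the example.
    measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
    true_value = 1.0

    n = len(measurements)
    mean = sum(measurements) / n
    bias = mean - true_value          # systematic deviation from the truth

    # Precision as the sample standard deviation of the measurements.
    variance = sum((m - mean) ** 2 for m in measurements) / (n - 1)
    precision = variance ** 0.5

    print(round(mean, 3), round(bias, 3), round(precision, 3))  # 1.001 0.001 0.013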
It is common to use the more general term, accuracy, to refer to the
degree of measurement error in data.

Definition 2.5 (Accuracy). The closeness of measurements to the true value
of the quantity being measured.

Accuracy depends on precision and bias, but since it is a general concept,
there is no specific formula for accuracy in terms of these two quantities.

One important aspect of accuracy is the use of significant digits. The
goal is to use only as many digits to represent the result of a measurement or
calculation as are justified by the precision of the data. For example, if the
length of an object is measured with a meter stick whose smallest markings are
millimeters, then we should only record the length of data to the nearest
millimeter. The precision of such a measurement would be ±0.5mm. We do not
review the details of working with significant digits, as most readers will have
encountered them in previous courses, and they are covered in considerable
depth in science, engineering, and statistics textbooks.

Issues such as significant digits, precision, bias, and accuracy are sometimes
overlooked, but they are important for data mining as well as statistics and
science. Many times, data sets do not come with information on the precision
of the data, and furthermore, the programs used for analysis return results
without any such information. Nonetheless, without some understanding of
the accuracy of the data and the results, an analyst runs the risk of committing
serious data analysis blunders.
Outliers

Outliers are either (1) data objects that, in some sense, have characteristics
that are different from most of the other data objects in the data set, or
(2) values of an attribute that are unusual with respect to the typical values
for that attribute. Alternatively, we can speak of anomalous objects or
values. There is considerable leeway in the definition of an outlier, and many
different definitions have been proposed by the statistics and data mining
communities. Furthermore, it is important to distinguish between the notions
of noise and outliers. Outliers can be legitimate data objects or values. Thus,
unlike noise, outliers may sometimes be of interest. In fraud and network
intrusion detection, for example, the goal is to find unusual objects or events
from among a large number of normal ones. Chapter 10 discusses anomaly
detection in more detail.
Missing Values

It is not unusual for an object to be missing one or more attribute values.
In some cases, the information was not collected; e.g., some people decline to
give their age or weight. In other cases, some attributes are not applicable
to all objects; e.g., often, forms have conditional parts that are filled out only
when a person answers a previous question in a certain way, but for simplicity,
all fields are stored. Regardless, missing values should be taken into account
during the data analysis.

There are several strategies (and variations on these strategies) for dealing
with missing data, each of which may be appropriate in certain circumstances.
These strategies are listed next, along with an indication of their advantages
and disadvantages.
Eliminate Data Objects or Attributes A simple and effective strategy
is to eliminate objects with missing values. However, even a partially specified
data object contains some information, and if many objects have missing
values, then a reliable analysis can be difficult or impossible. Nonetheless, if
a data set has only a few objects that have missing values, then it may be
expedient to omit them. A related strategy is to eliminate attributes that
have missing values. This should be done with caution, however, since the
eliminated attributes may be the ones that are critical to the analysis.
Estimate Missing Values Sometimes missing data can be reliably estimated.
For example, consider a time series that changes in a reasonably
smooth fashion, but has a few, widely scattered missing values. In such cases,
the missing values can be estimated (interpolated) by using the remaining
values. As another example, consider a data set that has many similar data
points. In this situation, the attribute values of the points closest to the point
with the missing value are often used to estimate the missing value. If the
attribute is continuous, then the average attribute value of the nearest
neighbors is used; if the attribute is categorical, then the most commonly occurring
attribute value can be taken. For a concrete illustration, consider precipitation
measurements that are recorded by ground stations. For areas not containing
a ground station, the precipitation can be estimated using values observed at
nearby ground stations.
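Two of these estimation strategies can be sketched in a few lines of Python
(toy data; a real implementation would restrict attention to the nearest
neighbors rather than all observed values):

    # A smooth series with one missing value (None), estimated by linear
    # interpolation from its two neighbors.
    series = [2.0, 2.5, None, 3.5, 4.0]
    i = series.index(None)
    series[i] = (series[i - 1] + series[i + 1]) / 2.0
    print(series)  # [2.0, 2.5, 3.0, 3.5, 4.0]

    # For a categorical attribute, the most commonly occurring value
    # among the observed entries can be substituted.
    colors = ["red", "red", None, "blue"]
    observed = [c for c in colors if c is not None]
    colors[colors.index(None)] = max(set(observed), key=observed.count)
    print(colors)  # ['red', 'red', 'red', 'blue']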
Ignore the Missing Value during Analysis Many data mining approaches
can be modified to ignore missing values. For example, suppose that objects
are being clustered and the similarity between pairs of data objects needs to
be calculated. If one or both objects of a pair have missing values for some
attributes, then the similarity can be calculated by using only the attributes
that do not have missing values. It is true that the similarity will only be
approximate, but unless the total number of attributes is small or the number
of missing values is high, this degree of inaccuracy may not matter much.
Likewise, many classification schemes can be modified to work with missing
values.
Inconsistent Values

Data can contain inconsistent values. Consider an address field, where both a
zip code and city are listed, but the specified zip code area is not contained in
that city. It may be that the individual entering this information transposed
two digits, or perhaps a digit was misread when the information was scanned
from a handwritten form. Regardless of the cause of the inconsistent values,
it is important to detect and, if possible, correct such problems.

Some types of inconsistencies are easy to detect. For instance, a person's
height should not be negative. In other cases, it can be necessary to consult
an external source of information. For example, when an insurance company
processes claims for reimbursement, it checks the names and addresses on the
reimbursement forms against a database of its customers.

Once an inconsistency has been detected, it is sometimes possible to correct
the data. A product code may have check digits, or it may be possible to
double-check a product code against a list of known product codes, and then
correct the code if it is incorrect, but close to a known code. The correction
of an inconsistency requires additional or redundant information.
Example 2.6 (Inconsistent Sea Surface Temperature). This example
illustrates an inconsistency in actual time series data that measures the sea
surface temperature (SST) at various points on the ocean. SST data was
originally collected using ocean-based measurements from ships or buoys, but more
recently, satellites have been used to gather the data. To create a long-term
data set, both sources of data must be used. However, because the data comes
from different sources, the two parts of the data are subtly different. This
discrepancy is visually displayed in Figure 2.7, which shows the correlation of
SST values between pairs of years. If a pair of years has a positive correlation,
then the location corresponding to the pair of years is colored white; otherwise
it is colored black. (Seasonal variations were removed from the data since,
otherwise, all the years would be highly correlated.) There is a distinct change in
behavior where the data has been put together in 1983. Years within each of
the two groups, 1958-1982 and 1983-1999, tend to have a positive correlation
with one another, but a negative correlation with years in the other group.
This does not mean that this data should not be used, only that the analyst
should consider the potential impact of such discrepancies on the data mining
analysis.

Figure 2.7. Correlation of SST data between pairs of years. White areas
indicate positive correlation. Black areas indicate negative correlation.
Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates,
of one another. Many people receive duplicate mailings because they appear
in a database multiple times under slightly different names. To detect and
eliminate such duplicates, two main issues must be addressed. First, if there
are two objects that actually represent a single object, then the values of
corresponding attributes may differ, and these inconsistent values must be
resolved.
Second, care needs to be taken to avoid accidentally combining data
objects that are similar, but not duplicates, such as two distinct people with
identical names. The term deduplication is often used to refer to the process
of dealing with these issues.

In some cases, two or more objects are identical with respect to the
attributes measured by the database, but they still represent different objects.
Here, the duplicates are legitimate, but may still cause problems for some
algorithms if the possibility of identical objects is not specifically accounted for
in their design. An example of this is given in Exercise 13 on page 91.
2.2.2 Issues Related to Applications

Data quality issues can also be considered from an application viewpoint as
expressed by the statement "data is of high quality if it is suitable for its
intended use." This approach to data quality has proven quite useful,
particularly in business and industry. A similar viewpoint is also present in statistics
and the experimental sciences, with their emphasis on the careful design of
experiments to collect the data relevant to a specific hypothesis. As with quality
issues at the measurement and data collection level, there are many issues that
are specific to particular applications and fields. Again, we consider only a few
of the general issues.

Timeliness Some data starts to age as soon as it has been collected. In
particular, if the data provides a snapshot of some ongoing phenomenon or
process, such as the purchasing behavior of customers or Web browsing
patterns, then this snapshot represents reality for only a limited time. If the data
is out of date, then so are the models and patterns that are based on it.

Relevance The available data must contain the information necessary for
the application. Consider the task of building a model that predicts the
accident rate for drivers. If information about the age and gender of the driver is
omitted, then it is likely that the model will have limited accuracy unless this
information is indirectly available through other attributes.

Making sure that the objects in a data set are relevant is also challenging.
A common problem is sampling bias, which occurs when a sample does not
contain different types of objects in proportion to their actual occurrence in
the population. For example, survey data describes only those who respond to
the survey. (Other aspects of sampling are discussed further in Section 2.3.2.)
Because the results of a data analysis can reflect only the data that is present,
sampling bias will typically result in an erroneous analysis.
Knowledge about the Data Ideally, data sets are accompanied by
documentation that describes different aspects of the data; the quality of this
documentation can either aid or hinder the subsequent analysis. For example,
if the documentation identifies several attributes as being strongly related,
these attributes are likely to provide highly redundant information, and we
may decide to keep just one. (Consider sales tax and purchase price.) If the
documentation is poor, however, and fails to tell us, for example, that the
missing values for a particular field are indicated with a 9999, then our
analysis of the data may be faulty. Other important characteristics are the precision
of the data, the type of features (nominal, ordinal, interval, ratio), the scale
of measurement (e.g., meters or feet for length), and the origin of the data.
2.3 Data Preprocessing

In this section, we address the issue of which preprocessing steps should be
applied to make the data more suitable for data mining. Data preprocessing
is a broad area and consists of a number of different strategies and techniques
that are interrelated in complex ways. We will present some of the most
important ideas and approaches, and try to point out the interrelationships
among them. Specifically, we will discuss the following topics:

   Aggregation
   Sampling
   Dimensionality reduction
   Feature subset selection
   Feature creation
   Discretization and binarization
   Variable transformation

Roughly speaking, these items fall into two categories: selecting data objects
and attributes for the analysis or creating/changing the attributes. In
both cases the goal is to improve the data mining analysis with respect to
time, cost, and quality. Details are provided in the following sections.

A quick note on terminology: In the following, we sometimes use synonyms
for attribute, such as feature or variable, in order to follow common usage.
2.3.1 Aggregation

Sometimes "less is more" and this is the case with aggregation, the combining
of two or more objects into a single object. Consider a data set consisting of
transactions (data objects) recording the daily sales of products in various
store locations (Minneapolis, Chicago, Paris, ...) for different days over the
course of a year. See Table 2.4. One way to aggregate transactions for this data
set is to replace all the transactions of a single store with a single storewide
transaction. This reduces the hundreds or thousands of transactions that occur
daily at a specific store to a single daily transaction, and the number of data
objects is reduced to the number of stores.

An obvious issue is how an aggregate transaction is created; i.e., how the
values of each attribute are combined across all the records corresponding to a
particular location to create the aggregate transaction that represents the sales
of a single store or date. Quantitative attributes, such as price, are typically
aggregated by taking a sum or an average. A qualitative attribute, such as
item, can either be omitted or summarized as the set of all the items that were
sold at that location.
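A minimal sketch of this kind of aggregation (records invented to resemble
Table 2.4):

    from collections import defaultdict

    # Hypothetical transaction records: (store, date, item, price).
    transactions = [
        ("Chicago", "09/06/04", "Watch", 25.99),
        ("Chicago", "09/06/04", "Battery", 5.99),
        ("Minneapolis", "09/06/04", "Shoes", 75.00),
    ]

    # Aggregate to one record per (store, date): price is summed, and the
    # qualitative 'item' attribute is summarized as the set of items sold.
    totals, items = defaultdict(float), defaultdict(set)
    for store, date, item, price in transactions:
        totals[(store, date)] += price
        items[(store, date)].add(item)

    for key in totals:
        print(key, round(totals[key], 2), items[key])
    # ('Chicago', '09/06/04') 31.98 {'Watch', 'Battery'}
    # ('Minneapolis', '09/06/04') 75.0 {'Shoes'}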
The data in Table 2.4 can also be viewed as a multidimensional array,
where each attribute is a dimension.

Table 2.4. Data set containing information about customer purchases.

   Transaction ID   Item      Store Location   Date       Price    ...
   ...              ...       ...              ...        ...
   101123           Watch     Chicago          09/06/04   $25.99   ...
   101123           Battery   Chicago          09/06/04   $5.99    ...
   101124           Shoes     Minneapolis      09/06/04   $75.00   ...
   ...              ...       ...              ...        ...

From this viewpoint, aggregation is the process of eliminating attributes, such
as the type of item, or reducing the number of values for a particular attribute;
e.g., reducing the possible values for date from 365 days to 12 months. This
type of aggregation is commonly used in Online Analytical Processing (OLAP),
which is discussed further in Chapter 3.
There are several motivations for aggregation. First, the smaller data sets
resulting from data reduction require less memory and processing time, and
hence, aggregation may permit the use of more expensive data mining
algorithms. Second, aggregation can act as a change of scope or scale by providing
a high-level view of the data instead of a low-level view. In the previous
example, aggregating over store locations and months gives us a monthly, per-store
view of the data instead of a daily, per-item view. Finally, the behavior
of groups of objects or attributes is often more stable than that of individual
objects or attributes. This statement reflects the statistical fact that aggregate
quantities, such as averages or totals, have less variability than the individual
objects being aggregated. For totals, the actual amount of variation is
larger than that of individual objects (on average), but the percentage of the
variation is smaller, while for means, the actual amount of variation is less
than that of individual objects (on average). A disadvantage of aggregation is
the potential loss of interesting details. In the store example, aggregating over
months loses information about which day of the week has the highest sales.
Example 2.7 (Australian Precipitation). This example is based on
precipitation in Australia from the period 1982 to 1993. Figure 2.8(a) shows
a histogram for the standard deviation of average monthly precipitation for
3,030 0.5° by 0.5° grid cells in Australia, while Figure 2.8(b) shows a histogram
for the standard deviation of the average yearly precipitation for the same
locations. The average yearly precipitation has less variability than the average
monthly precipitation. All precipitation measurements (and their standard
deviations) are in centimeters.
Figure 2.8. Histograms of standard deviation for monthly and yearly
precipitation in Australia for the period 1982 to 1993: (a) standard deviation of
average monthly precipitation, (b) standard deviation of average yearly
precipitation. (Both histograms plot standard deviation against the number of
land locations.)
2.3.2 Sampling

Sampling is a commonly used approach for selecting a subset of the data
objects to be analyzed. In statistics, it has long been used for both the
preliminary investigation of the data and the final data analysis. Sampling can
also be very useful in data mining. However, the motivations for sampling
in statistics and data mining are often different. Statisticians use sampling
because obtaining the entire set of data of interest is too expensive or time
consuming, while data miners sample because it is too expensive or time
consuming to process all the data. In some cases, using a sampling algorithm can
reduce the data size to the point where a better, but more expensive algorithm
can be used.

The key principle for effective sampling is the following: Using a sample
will work almost as well as using the entire data set if the sample is
representative. In turn, a sample is representative if it has approximately the
same property (of interest) as the original set of data. If the mean (average)
of the data objects is the property of interest, then a sample is representative
if it has a mean that is close to that of the original data. Because sampling is
a statistical process, the representativeness of any particular sample will vary,
and the best that we can do is choose a sampling scheme that guarantees a
high probability of getting a representative sample. As discussed next, this
involves choosing the appropriate sample size and sampling techniques.
Sampling Approaches

There are many sampling techniques, but only a few of the most basic ones
and their variations will be covered here. The simplest type of sampling is
simple random sampling. For this type of sampling, there is an equal
probability of selecting any particular item. There are two variations on random
sampling (and other sampling techniques as well): (1) sampling without
replacement, where, as each item is selected, it is removed from the set of all
objects that together constitute the population, and (2) sampling with
replacement, where objects are not removed from the population as they are selected
for the sample. In sampling with replacement, the same object can be picked more
than once. The samples produced by the two methods are not much different
when samples are relatively small compared to the data set size, but sampling
with replacement is simpler to analyze since the probability of selecting any
object remains constant during the sampling process.

When the population consists of different types of objects, with widely
different numbers of objects, simple random sampling can fail to adequately
represent those types of objects that are less frequent. This can cause
problems when the analysis requires proper representation of all object types. For
example, when building classification models for rare classes, it is critical that
the rare classes be adequately represented in the sample. Hence, a sampling
scheme that can accommodate differing frequencies for the items of interest is
needed. Stratified sampling, which starts with prespecified groups of
objects, is such an approach. In the simplest version, equal numbers of objects
are drawn from each group even though the groups are of different sizes. In
another variation, the number of objects drawn from each group is proportional
to the size of that group.
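The sketch below (a toy population; the 10% sampling rate is an arbitrary
choice of ours) illustrates simple random sampling with and without
replacement, and proportional stratified sampling:

    import random

    population = list(range(100))            # toy population of 100 objects

    # Simple random sampling without and with replacement.
    without = random.sample(population, 10)  # each object picked at most once
    with_repl = [random.choice(population) for _ in range(10)]  # repeats possible

    # Stratified sampling: draw from each prespecified group separately,
    # here proportionally to group size.
    groups = {"rare": list(range(10)), "common": list(range(10, 100))}
    sample = []
    for name, members in groups.items():
        k = max(1, round(0.1 * len(members)))   # 10% from each group
        sample.extend(random.sample(members, k))
    print(len(sample))   # 1 + 9 = 10, with the rare group guaranteed present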

Example 2.8 (Sampling and Loss of Information). One a sampling


tehnique has been seleted, it is still neessary to hoose the
sample size.
Larger sample sizes inrease the probability that a sample will be re
presenta
tive, but they also eliminate muh of the advantage of sampling. Con
versely,
with smaller sample sizes, patterns may be missed or erroneous patterns an be
deteted. Figure 2.9(a) shows a data set that ontains 8000 two dimen
sional
points, while Figures 2.9(b) and 2.9() show samples from this data set of size
2000 and 500, respetively. Although most of the struture of this d
ata set is
present in the sample of 2000 points, muh of the struture is miss
ing in the
sample of 500 points.
Figure 2.9. Example of the loss of structure with sampling: (a) 8000 points,
(b) 2000 points, (c) 500 points.
Example 2.9 (Determining the Proper Sample Size). To illustrate that
determining the proper sample size requires a methodical approach, consider
the following task.

   Given a set of data that consists of a small number of almost
   equal-sized groups, find at least one representative point for each of the
   groups. Assume that the objects in each group are highly similar
   to each other, but not very similar to objects in different groups.
   Also assume that there are a relatively small number of groups,
   e.g., 10. Figure 2.10(a) shows an idealized set of clusters (groups)
   from which these points might be drawn.

This problem can be efficiently solved using sampling. One approach is to
take a small sample of data points, compute the pairwise similarities between
points, and then form groups of points that are highly similar. The desired
set of representative points is then obtained by taking one point from each of
these groups. To follow this approach, however, we need to determine a sample
size that would guarantee, with a high probability, the desired outcome; that
is, that at least one point will be obtained from each cluster. Figure 2.10(b)
shows the probability of getting one object from each of the 10 groups as the
sample size runs from 10 to 60. Interestingly, with a sample size of 20, there is
little chance (20%) of getting a sample that includes all 10 clusters. Even with
a sample size of 30, there is still a moderate chance (almost 40%) of getting a
sample that doesn't contain objects from all 10 clusters. This issue is further
explored in the context of clustering by Exercise 4 on page 559.
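Probabilities like those in Figure 2.10(b) can be approximated by a short
simulation. The sketch below is our own and makes a simplifying assumption:
each sampled point independently falls into one of 10 equally likely groups,
which approximates sampling from a large population of equal-sized clusters.

    import random

    def prob_all_groups(sample_size, n_groups=10, trials=10_000):
        # Estimate the probability that a random sample hits every group.
        hits = 0
        for _ in range(trials):
            seen = {random.randrange(n_groups) for _ in range(sample_size)}
            if len(seen) == n_groups:
                hits += 1
        return hits / trials

    for s in (10, 20, 30, 60):
        print(s, prob_all_groups(s))
    # Roughly 0.0004, 0.21, 0.63, 0.98: small samples often miss a group.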
Figure 2.10. Finding representative points from 10 groups: (a) ten groups of
points, (b) probability that a sample contains points from each of 10 groups,
plotted against sample size (10 to 70).
Progressive Sampling

The proper sample size can be difficult to determine, so adaptive or
progressive sampling schemes are sometimes used. These approaches start with a
small sample, and then increase the sample size until a sample of sufficient
size has been obtained. While this technique eliminates the need to determine
the correct sample size initially, it requires that there be a way to evaluate the
sample to judge if it is large enough.

Suppose, for instance, that progressive sampling is used to learn a
predictive model. Although the accuracy of predictive models increases as the
sample size increases, at some point the increase in accuracy levels off. We
want to stop increasing the sample size at this leveling-off point. By keeping
track of the change in accuracy of the model as we take progressively larger
samples, and by taking other samples close to the size of the current one, we
can get an estimate as to how close we are to this leveling-off point, and thus,
stop sampling.
2.3.3 Dimensionality Reduction

Data sets can have a large number of features. Consider a set of documents,
where each document is represented by a vector whose components are the
frequencies with which each word occurs in the document. In such cases,
there are typically thousands or tens of thousands of attributes (components),
one for each word in the vocabulary. As another example, consider a set of
time series consisting of the daily closing price of various stocks over a period
of 30 years. In this case, the attributes, which are the prices on specific days,
again number in the thousands.
There are a variety of benefits to dimensionality reduction. A key benefit
is that many data mining algorithms work better if the dimensionality (the
number of attributes in the data) is lower. This is partly because
dimensionality reduction can eliminate irrelevant features and reduce noise and partly
because of the curse of dimensionality, which is explained below. Another
benefit is that a reduction of dimensionality can lead to a more understandable
model because the model may involve fewer attributes. Also, dimensionality
reduction may allow the data to be more easily visualized. Even if
dimensionality reduction doesn't reduce the data to two or three dimensions, data
is often visualized by looking at pairs or triplets of attributes, and the
number of such combinations is greatly reduced. Finally, the amount of time and
memory required by the data mining algorithm is reduced with a reduction in
dimensionality.
The term dimensionality reduction is often reserved for those techniques
that reduce the dimensionality of a data set by creating new attributes that
are a combination of the old attributes. The reduction of dimensionality by
selecting new attributes that are a subset of the old is known as feature subset
selection or feature selection. It will be discussed in Section 2.3.4.

In the remainder of this section, we briefly introduce two important topics:
the curse of dimensionality and dimensionality reduction techniques based on
linear algebra approaches such as principal components analysis (PCA). More
details on dimensionality reduction can be found in Appendix B.
The Curse of Dimensionality

The curse of dimensionality refers to the phenomenon that many types of
data analysis become significantly harder as the dimensionality of the data
increases. Specifically, as dimensionality increases, the data becomes
increasingly sparse in the space that it occupies. For classification, this can mean
that there are not enough data objects to allow the creation of a model that
reliably assigns a class to all possible objects. For clustering, the definitions
of density and the distance between points, which are critical for clustering,
become less meaningful. (This is discussed further in Sections 9.1.2, 9.4.5, and
9.4.7.) As a result, many clustering and classification algorithms (and other
data analysis algorithms) have trouble with high-dimensional data, e.g., reduced
classification accuracy and poor-quality clusters.
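The increasing sparsity can be observed directly. The following simulation
(our own illustration, not from the text) measures how the relative spread
between the smallest and largest distances from a reference point collapses as
the dimensionality grows:

    import random, math

    def relative_contrast(dim, n=1000):
        # Distance from the origin to n random points in [0,1]^dim;
        # as dim grows, the max and min distances become nearly equal,
        # making "nearest" and "farthest" less meaningful.
        dists = [math.sqrt(sum(random.random() ** 2 for _ in range(dim)))
                 for _ in range(n)]
        return (max(dists) - min(dists)) / min(dists)

    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
    # The printed contrast shrinks steadily as the dimensionality grows.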
Linear Algebra Techniques for Dimensionality Reduction

Some of the most common approaches for dimensionality reduction,
particularly for continuous data, use techniques from linear algebra to project the
data from a high-dimensional space into a lower-dimensional space. Principal
Components Analysis (PCA) is a linear algebra technique for continuous
attributes that finds new attributes (principal components) that (1) are linear
combinations of the original attributes, (2) are orthogonal (perpendicular) to
each other, and (3) capture the maximum amount of variation in the data. For
example, the first two principal components capture as much of the variation
in the data as is possible with two orthogonal attributes that are linear
combinations of the original attributes. Singular Value Decomposition (SVD)
is a linear algebra technique that is related to PCA and is also commonly used
for dimensionality reduction. For additional details, see Appendices A
and B.
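A minimal PCA sketch using NumPy (our illustration, on random toy data):
center the data, then take the top right singular vectors of the centered
matrix as the principal components.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))    # toy data: 100 objects, 5 attributes

    # Center the data, then compute its singular value decomposition.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    k = 2
    components = Vt[:k]              # k orthogonal linear combinations
    X_reduced = Xc @ components.T    # project into the k-dimensional space
    print(X_reduced.shape)           # (100, 2)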
2.3.4 Feature Subset Selection

Another way to reduce the dimensionality is to use only a subset of the
features. While it might seem that such an approach would lose information, this
is not the case if redundant and irrelevant features are present. Redundant
features duplicate much or all of the information contained in one or more
other attributes. For example, the purchase price of a product and the amount
of sales tax paid contain much of the same information. Irrelevant features
contain almost no useful information for the data mining task at hand. For
instance, students' ID numbers are irrelevant to the task of predicting
students' grade point averages. Redundant and irrelevant features can reduce
classification accuracy and the quality of the clusters that are found.

While some irrelevant and redundant attributes can be eliminated
immediately by using common sense or domain knowledge, selecting the best subset
of features frequently requires a systematic approach. The ideal approach to
feature selection is to try all possible subsets of features as input to the data
mining algorithm of interest, and then take the subset that produces the best
results. This method has the advantage of reflecting the objective and bias of
the data mining algorithm that will eventually be used. Unfortunately, since
the number of subsets involving n attributes is 2^n, such an approach is
impractical in most situations and alternative strategies are needed. There are three
standard approaches to feature selection: embedded, filter, and wrapper.
Embedded approaches Feature selection occurs naturally as part of the
data mining algorithm. Specifically, during the operation of the data mining
algorithm, the algorithm itself decides which attributes to use and which to
ignore. Algorithms for building decision tree classifiers, which are discussed in
Chapter 4, often operate in this manner.

Filter approaches Features are selected before the data mining algorithm
is run, using some approach that is independent of the data mining task. For
example, we might select sets of attributes whose pairwise correlation is as low
as possible; a small sketch of this idea follows these three approaches.

Wrapper approaches These methods use the target data mining algorithm
as a black box to find the best subset of attributes, in a way similar to that
of the ideal algorithm described above, but typically without enumerating all
possible subsets.

Since the embedded approaches are algorithm-specific, only the filter and
wrapper approaches will be discussed further here.
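As an illustration of the filter idea mentioned above, the sketch below (our
own; the threshold and data are invented) greedily keeps an attribute only if
it is not highly correlated with any attribute already kept:

    import numpy as np

    def low_correlation_filter(X, threshold=0.7):
        # Filter-style selection sketch: keep attribute j only if its
        # absolute correlation with every already-kept attribute is below
        # the threshold. Independent of any particular mining algorithm.
        corr = np.abs(np.corrcoef(X, rowvar=False))
        kept = []
        for j in range(X.shape[1]):
            if all(corr[j, k] < threshold for k in kept):
                kept.append(j)
        return kept

    rng = np.random.default_rng(1)
    a = rng.normal(size=200)
    X = np.column_stack([a,
                         a + 0.01 * rng.normal(size=200),  # near-duplicate of a
                         rng.normal(size=200)])            # unrelated attribute
    print(low_correlation_filter(X))  # [0, 2] -- the redundant copy is dropped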
An Architecture for Feature Subset Selection

It is possible to encompass both the filter and wrapper approaches within a
common architecture. The feature selection process is viewed as consisting of
four parts: a measure for evaluating a subset, a search strategy that controls
the generation of a new subset of features, a stopping criterion, and a
validation procedure. Filter methods and wrapper methods differ only in the way
in which they evaluate a subset of features. For a wrapper method, subset
evaluation uses the target data mining algorithm, while for a filter approach,
the evaluation technique is distinct from the target data mining algorithm.
The following discussion provides some details of this approach, which is
summarized in Figure 2.11.

Conceptually, feature subset selection is a search over all possible subsets
of features. Many different types of search strategies can be used, but the
search strategy should be computationally inexpensive and should find optimal
or near-optimal sets of features. It is usually not possible to satisfy both
requirements, and thus, tradeoffs are necessary.

An integral part of the search is an evaluation step to judge how the current
subset of features compares to others that have been considered. This requires
an evaluation measure that attempts to determine the goodness of a subset of
attributes with respect to a particular data mining task, such as classification
Figure 2.11. Flowchart of a feature subset selection process: a search strategy
generates subsets of attributes, an evaluation step scores each subset, a
stopping criterion decides when the search is done, and the selected attributes
are passed to a validation procedure.
or lustering. For the lter approah, suh measures attempt to predi


t how
well the atual data mining algorithm will perform on a given set of attributes.
For the wrapper approah, where evaluation onsists of atually runnin
g the
target data mining appliation, the subset evaluation funtion is simp
ly the
riterion normally used to measure the result of the data mining.
Because the number of subsets can be enormous and it is impractical to examine them all, some sort of stopping criterion is necessary. This strategy is usually based on one or more conditions involving the following: the number of iterations, whether the value of the subset evaluation measure is optimal or exceeds a certain threshold, whether a subset of a certain size has been obtained, whether simultaneous size and evaluation criteria have been achieved, and whether any improvement can be achieved by the options available to the search strategy.

Finally, once a subset of features has been selected, the results of the target data mining algorithm on the selected subset should be validated. A straightforward evaluation approach is to run the algorithm with the full set of features and compare the full results to results obtained using the subset of features. Hopefully, the subset of features will produce results that are better than or almost as good as those produced when using all features. Another validation approach is to use a number of different feature selection algorithms to obtain subsets of features and then compare the results of running the data mining algorithm on each subset.
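To make the wrapper idea concrete, here is a minimal sketch in Python (not from the text; the greedy forward-search strategy and the evaluate function are illustrative assumptions) of a wrapper that adds one feature at a time and stops when no single addition improves the score:

def wrapper_forward_selection(candidates, evaluate, max_features=None):
    """Greedy forward selection. `evaluate` runs the target data mining
    algorithm as a black box and returns a score (higher is better),
    e.g., cross-validated classification accuracy."""
    selected, best_score = [], float("-inf")
    while candidates and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        scores = {f: evaluate(selected + [f]) for f in candidates}
        best_f = max(scores, key=scores.get)
        if scores[best_f] <= best_score:
            break  # stopping criterion: no single addition helps
        best_score = scores[best_f]
        selected.append(best_f)
        candidates = [f for f in candidates if f != best_f]
    return selected, best_score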
Feature Weighting

Feature weighting is an alternative to keeping or eliminating features. More important features are assigned a higher weight, while less important features are given a lower weight. These weights are sometimes assigned based on domain knowledge about the relative importance of features. Alternatively, they may be determined automatically. For example, some classification schemes, such as support vector machines (Chapter 5), produce classification models in which each feature is given a weight. Features with larger weights play a more important role in the model. The normalization of objects that takes place when computing the cosine similarity (Section 2.4.5) can also be regarded as a type of feature weighting.
2.3.5 Feature Creation

It is frequently possible to create, from the original attributes, a new set of attributes that captures the important information in a data set much more effectively. Furthermore, the number of new attributes can be smaller than the number of original attributes, allowing us to reap all the previously described benefits of dimensionality reduction. Three related methodologies for creating new attributes are described next: feature extraction, mapping the data to a new space, and feature construction.
Feature Extraction

The creation of a new set of features from the original raw data is known as feature extraction. Consider a set of photographs, where each photograph is to be classified according to whether or not it contains a human face. The raw data is a set of pixels, and as such, is not suitable for many types of classification algorithms. However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then a much broader set of classification techniques can be applied to this problem.

Unfortunately, in the sense in which it is most commonly used, feature extraction is highly domain-specific. For a particular field, such as image processing, various features and the techniques to extract them have been developed over a period of time, and often these techniques have limited applicability to other fields. Consequently, whenever data mining is applied to a relatively new area, a key task is the development of new features and feature extraction methods.
Figure 2.12. Application of the Fourier transform to identify the underlying frequencies in time series data: (a) two time series, (b) noisy time series, (c) power spectrum. (Plots not reproduced; the x-axes of (a) and (b) show time in seconds, and the x-axis of (c) shows frequency.)
Mapping the Data to a New Space

A totally different view of the data can reveal important and interesting features. Consider, for example, time series data, which often contain periodic patterns. If there is only a single periodic pattern and not much noise, then the pattern is easily detected. If, on the other hand, there are a number of periodic patterns and a significant amount of noise is present, then these patterns are hard to detect. Such patterns can, nonetheless, often be detected by applying a Fourier transform to the time series in order to change to a representation in which frequency information is explicit. In the example that follows, it will not be necessary to know the details of the Fourier transform. It is enough to know that, for each time series, the Fourier transform produces a new data object whose attributes are related to frequencies.
Example 2.10 (Fourier Analysis). The time series presented in Figure 2.12(b) is the sum of three other time series, two of which are shown in Figure 2.12(a) and have frequencies of 7 and 17 cycles per second, respectively. The third time series is random noise. Figure 2.12(c) shows the power spectrum that can be computed after applying a Fourier transform to the original time series. (Informally, the power spectrum is proportional to the square of each frequency attribute.) In spite of the noise, there are two peaks that correspond to the periods of the two original, non-noisy time series. Again, the main point is that better features can reveal important aspects of the data.
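As a rough illustration of Example 2.10 (a sketch assuming NumPy; the sampling rate and noise level are made up), the dominant frequencies can be recovered from the power spectrum:

import numpy as np

fs = 512                          # sampling rate (samples per second)
t = np.arange(0, 1, 1 / fs)       # one second of samples

# Sum of two sinusoids (7 and 17 cycles per second) plus random noise,
# analogous to the series in Figure 2.12(b).
signal = np.sin(2 * np.pi * 7 * t) + np.sin(2 * np.pi * 17 * t)
signal += np.random.normal(scale=2.0, size=t.size)

# The Fourier transform maps the series to frequency attributes; the
# power spectrum is the squared magnitude of each frequency component.
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# Despite the noise, the two largest peaks should sit near 7 and 17 Hz.
print(sorted(freqs[np.argsort(spectrum)[-2:]]))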
Many other sorts of transformations are also possible. Besides the Fourier transform, the wavelet transform has also proven very useful for time series and other types of data.
Feature Construction

Sometimes the features in the original data sets have the necessary information, but it is not in a form suitable for the data mining algorithm. In this situation, one or more new features constructed out of the original features can be more useful than the original features.

Example 2.11 (Density). To illustrate this, consider a data set consisting of information about historical artifacts, which, along with other information, contains the volume and mass of each artifact. For simplicity, assume that these artifacts are made of a small number of materials (wood, clay, bronze, gold) and that we want to classify the artifacts with respect to the material of which they are made. In this case, a density feature constructed from the mass and volume features, i.e., density = mass/volume, would most directly yield an accurate classification. Although there have been some attempts to automatically perform feature construction by exploring simple mathematical combinations of existing attributes, the most common approach is to construct features using domain expertise.
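A tiny sketch of this construction (hypothetical artifact values, assuming NumPy):

import numpy as np

# Hypothetical mass (kg) and volume (m^3) measurements for artifacts.
mass = np.array([0.9, 19.3, 8.8, 0.6])
volume = np.array([0.0015, 0.0010, 0.0010, 0.0010])

# The constructed feature separates materials far better than mass or
# volume alone, since each material has a characteristic density.
density = mass / volume
print(density)  # roughly 600 (wood), 19300 (gold), 8800 (bronze), 600 (wood)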
2.3.6 Discretization and Binarization

Some data mining algorithms, especially certain classification algorithms, require that the data be in the form of categorical attributes. Algorithms that find association patterns require that the data be in the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization). Additionally, if a categorical attribute has a large number of values (categories), or some values occur infrequently, then it may be beneficial for certain data mining tasks to reduce the number of categories by combining some of the values.

As with feature selection, the best discretization and binarization approach is the one that produces the best result for the data mining algorithm that will be used to analyze the data. It is typically not practical to apply such a criterion directly. Consequently, discretization or binarization is performed in a way that satisfies a criterion that is thought to have a relationship to good performance for the data mining task being considered.
Table 2.5. Conversion of a categorical attribute to three binary attributes.

  Categorical Value   Integer Value   x1   x2   x3
  awful               0               0    0    0
  poor                1               0    0    1
  OK                  2               0    1    0
  good                3               0    1    1
  great               4               1    0    0

Table 2.6. Conversion of a categorical attribute to five asymmetric binary attributes.

  Categorical Value   Integer Value   x1   x2   x3   x4   x5
  awful               0               1    0    0    0    0
  poor                1               0    1    0    0    0
  OK                  2               0    0    1    0    0
  good                3               0    0    0    1    0
  great               4               0    0    0    0    1
Binarization

A simple technique to binarize a categorical attribute is the following: If there are m categorical values, then uniquely assign each original value to an integer in the interval [0, m − 1]. If the attribute is ordinal, then order must be maintained by the assignment. (Note that even if the attribute is originally represented using integers, this process is necessary if the integers are not in the interval [0, m − 1].) Next, convert each of these m integers to a binary number. Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent these binary numbers using n binary attributes. To illustrate, a categorical variable with 5 values {awful, poor, OK, good, great} would require three binary variables x1, x2, and x3. The conversion is shown in Table 2.5.
Such a transformation can cause complications, such as creating unintended relationships among the transformed attributes. For example, in Table 2.5, attributes x2 and x3 are correlated because information about the good value is encoded using both attributes. Furthermore, association analysis requires asymmetric binary attributes, where only the presence of the attribute (value = 1) is important. For association problems, it is therefore necessary to introduce one binary attribute for each categorical value, as in Table 2.6. If the number of resulting attributes is too large, then the techniques described below can be used to reduce the number of categorical values before binarization.

Likewise, for association problems, it may be necessary to replace a single binary attribute with two asymmetric binary attributes. Consider a binary attribute that records a person's gender, male or female. For traditional association rule algorithms, this information needs to be transformed into two asymmetric binary attributes, one that is a 1 only when the person is male and one that is a 1 only when the person is female. (For asymmetric binary attributes, the information representation is somewhat inefficient in that two bits of storage are required to represent each bit of information.)
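A sketch of both conversions from Tables 2.5 and 2.6 (assuming NumPy; the helper names are ours):

import numpy as np

values = ["awful", "poor", "OK", "good", "great"]
to_int = {v: i for i, v in enumerate(values)}    # ordinal integers 0..m-1

m = len(values)
n_bits = int(np.ceil(np.log2(m)))                # 3 bits for 5 values

def binarize(v):
    """Pack the integer code into n_bits binary attributes (Table 2.5)."""
    i = to_int[v]
    return [(i >> b) & 1 for b in reversed(range(n_bits))]

def one_hot(v):
    """One asymmetric binary attribute per value (Table 2.6)."""
    row = [0] * m
    row[to_int[v]] = 1
    return row

print(binarize("good"))  # [0, 1, 1]
print(one_hot("good"))   # [0, 0, 0, 1, 0]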
Discretization of Continuous Attributes

Discretization is typically applied to attributes that are used in classification or association analysis. In general, the best discretization depends on the algorithm being used, as well as the other attributes being considered. Typically, however, the discretization of an attribute is considered in isolation.

Transformation of a continuous attribute to a categorical attribute involves two subtasks: deciding how many categories to have and determining how to map the values of the continuous attribute to these categories. In the first step, after the values of the continuous attribute are sorted, they are then divided into n intervals by specifying n − 1 split points. In the second, rather trivial step, all the values in one interval are mapped to the same categorical value. Therefore, the problem of discretization is one of deciding how many split points to choose and where to place them. The result can be represented either as a set of intervals

$$(x_0, x_1], (x_1, x_2], \ldots, (x_{n-1}, x_n),$$

where $x_0$ and $x_n$ may be $-\infty$ or $+\infty$, respectively, or equivalently, as a series of inequalities $x_0 < x \le x_1, \ldots, x_{n-1} < x \le x_n$.
Unsupervised Discretization  A basic distinction between discretization methods for classification is whether class information is used (supervised) or not (unsupervised). If class information is not used, then relatively simple approaches are common. For instance, the equal width approach divides the range of the attribute into a user-specified number of intervals each having the same width. Such an approach can be badly affected by outliers, and for that reason, an equal frequency (equal depth) approach, which tries to put the same number of objects into each interval, is often preferred. As another example of unsupervised discretization, a clustering method, such as K-means (see Chapter 8), can also be used. Finally, visually inspecting the data can sometimes be an effective approach.
Example 2.12 (Discretization Techniques). This example demonstrates how these approaches work on an actual data set. Figure 2.13(a) shows data points belonging to four different groups, along with two outliers, the large dots on either end. The techniques of the previous paragraph were applied to discretize the x values of these data points into four categorical values. (Points in the data set have a random y component to make it easy to see how many points are in each group.) Visually inspecting the data works quite well, but is not automatic, and thus, we focus on the other three approaches. The split points produced by the techniques equal width, equal frequency, and K-means are shown in Figures 2.13(b), 2.13(c), and 2.13(d), respectively. The split points are represented as dashed lines. If we measure the performance of a discretization technique by the extent to which different objects in different groups are assigned the same categorical value, then K-means performs best, followed by equal frequency, and finally, equal width.
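The equal width and equal frequency techniques are simple enough to state exactly. A sketch assuming NumPy (the grouped test data is made up):

import numpy as np

def equal_width_splits(x, k):
    """k-interval equal width discretization: k-1 evenly spaced
    split points between min(x) and max(x)."""
    return np.linspace(x.min(), x.max(), k + 1)[1:-1]

def equal_frequency_splits(x, k):
    """k-interval equal frequency (equal depth) discretization:
    split points at the k-quantiles, so each interval holds roughly
    the same number of objects."""
    return np.percentile(x, 100 * np.arange(1, k) / k)

x = np.concatenate([np.random.normal(m, 0.5, 50) for m in (2, 6, 10, 14)])
print(equal_width_splits(x, 4))
print(equal_frequency_splits(x, 4))
# np.digitize(x, splits) then maps each value to its interval label.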
Supervised Discretization  The discretization methods described above are usually better than no discretization, but keeping the end purpose in mind and using additional information (class labels) often produces better results. This should not be surprising, since an interval constructed with no knowledge of class labels often contains a mixture of class labels. A conceptually simple approach is to place the splits in a way that maximizes the purity of the intervals. In practice, however, such an approach requires potentially arbitrary decisions about the purity of an interval and the minimum size of an interval. To overcome such concerns, some statistically based approaches start with each attribute value as a separate interval and create larger intervals by merging adjacent intervals that are similar according to a statistical test. Entropy-based approaches are one of the most promising approaches to discretization, and a simple approach based on entropy will be presented.
First, it is necessary to define entropy. Let k be the number of different class labels, $m_i$ be the number of values in the ith interval of a partition, and $m_{ij}$ be the number of values of class j in interval i. Then the entropy $e_i$ of the ith interval is given by the equation

$$e_i = -\sum_{j=1}^{k} p_{ij} \log_2 p_{ij},$$

where $p_{ij} = m_{ij}/m_i$ is the probability (fraction of values) of class j in the ith interval. The total entropy, e, of the partition is the weighted average of the individual interval entropies, i.e.,

$$e = \sum_{i=1}^{n} w_i e_i,$$

where m is the number of values, $w_i = m_i/m$ is the fraction of values in the ith interval, and n is the number of intervals. Intuitively, the entropy of an interval is a measure of the purity of an interval. If an interval contains only values of one class (is perfectly pure), then the entropy is 0 and it contributes nothing to the overall entropy. If the classes of values in an interval occur equally often (the interval is as impure as possible), then the entropy is a maximum.

Figure 2.13. Different discretization techniques: (a) original data, (b) equal width discretization, (c) equal frequency discretization, (d) K-means discretization. (Plots not reproduced; split points appear as dashed lines.)
A simple approach for partitioning a continuous attribute starts by bisecting the initial values so that the resulting two intervals give minimum entropy. This technique only needs to consider each value as a possible split point, because it is assumed that intervals contain ordered sets of values. The splitting process is then repeated with another interval, typically choosing the interval with the worst (highest) entropy, until a user-specified number of intervals is reached, or a stopping criterion is satisfied.
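A sketch of one bisection step of this entropy-based scheme (assuming NumPy; the function names are ours): every candidate split point is scored by the weighted entropy of the two resulting intervals, and the minimizing split is kept.

import numpy as np

def entropy(labels):
    """Interval entropy: -sum over classes of p_ij * log2(p_ij)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, labels):
    """Return (total entropy, split point) minimizing the weighted
    entropy of the two intervals produced by one bisection."""
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    best = (np.inf, None)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # identical values cannot be separated
        w_left = i / len(x)
        e = w_left * entropy(labels[:i]) + (1 - w_left) * entropy(labels[i:])
        if e < best[0]:
            best = (e, (x[i - 1] + x[i]) / 2)
    return best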
Example 2.13 (Discretization of Two Attributes). This method was used to independently discretize both the x and y attributes of the two-dimensional data shown in Figure 2.14. In the first discretization, shown in Figure 2.14(a), the x and y attributes were both split into three intervals. (The dashed lines indicate the split points.) In the second discretization, shown in Figure 2.14(b), the x and y attributes were both split into five intervals. This simple example illustrates two aspects of discretization. First, in two dimensions, the classes of points are well separated, but in one dimension, this is not so. In general, discretizing each attribute separately often guarantees suboptimal results. Second, five intervals work better than three, but six intervals do not improve the discretization much, at least in terms of entropy. (Entropy values and results for six intervals are not shown.) Consequently, it is desirable to have a stopping criterion that automatically finds the right number of partitions.
Categorical Attributes with Too Many Values

Categorical attributes can sometimes have too many values. If the categorical attribute is an ordinal attribute, then techniques similar to those for continuous attributes can be used to reduce the number of categories. If the categorical attribute is nominal, however, then other approaches are needed. Consider a university that has a large number of departments. Consequently, a department name attribute might have dozens of different values. In this situation, we could use our knowledge of the relationships among different departments to combine departments into larger groups, such as engineering, social sciences, or biological sciences. If domain knowledge does not serve as a useful guide or such an approach results in poor classification performance, then it is necessary to use a more empirical approach, such as grouping values together only if such a grouping results in improved classification accuracy or achieves some other data mining objective.

Figure 2.14. Discretizing x and y attributes for four groups (classes) of points: (a) three intervals, (b) five intervals. (Plots not reproduced.)
2.3.7 Variable Transformation

A variable transformation refers to a transformation that is applied to all the values of a variable. (We use the term variable instead of attribute to adhere to common usage, although we will also refer to attribute transformation on occasion.) In other words, for each object, the transformation is applied to the value of the variable for that object. For example, if only the magnitude of a variable is important, then the values of the variable can be transformed by taking the absolute value. In the following section, we discuss two important types of variable transformations: simple functional transformations and normalization.
Simple Functions

For this type of variable transformation, a simple mathematical function is applied to each value individually. If x is a variable, then examples of such transformations include $x^k$, $\log x$, $e^x$, $\sqrt{x}$, $1/x$, $\sin x$, or $|x|$. In statistics, variable transformations, especially sqrt, log, and 1/x, are often used to transform data that does not have a Gaussian (normal) distribution into data that does. While this can be important, other reasons often take precedence in data mining.
Suppose the variable of interest is the number of data bytes in a session, and the number of bytes ranges from 1 to 1 billion. This is a huge range, and it may be advantageous to compress it by using a log10 transformation. In this case, sessions that transferred 10^8 and 10^9 bytes would be more similar to each other than sessions that transferred 10 and 1000 bytes (9 − 8 = 1 versus 3 − 1 = 2). For some applications, such as network intrusion detection, this may be what is desired, since the first two sessions most likely represent transfers of large files, while the latter two sessions could be two quite distinct types of sessions.
Variable transformations should be applied with caution since they change the nature of the data. While this is what is desired, there can be problems if the nature of the transformation is not fully appreciated. For instance, the transformation 1/x reduces the magnitude of values that are 1 or larger, but increases the magnitude of values between 0 and 1. To illustrate, the values {1, 2, 3} go to {1, 1/2, 1/3}, but the values {1, 1/2, 1/3} go to {1, 2, 3}. Thus, for all sets of values, the transformation 1/x reverses the order. To help clarify the effect of a transformation, it is important to ask questions such as the following: Does the order need to be maintained? Does the transformation apply to all values, especially negative values and 0? What is the effect of the transformation on the values between 0 and 1? Exercise 17 on page 92 explores other aspects of variable transformation.
Normalization or Standardization

Another common type of variable transformation is the standardization or normalization of a variable. (In the data mining community the terms are often used interchangeably. In statistics, however, the term normalization can be confused with the transformations used for making a variable normal, i.e., Gaussian.) The goal of standardization or normalization is to make an entire set of values have a particular property. A traditional example is that of standardizing a variable in statistics. If $\bar{x}$ is the mean (average) of the attribute values and $s_x$ is their standard deviation, then the transformation $x' = (x - \bar{x})/s_x$ creates a new variable that has a mean of 0 and a standard deviation of 1. If different variables are to be combined in some way, then such a transformation is often necessary to avoid having a variable with large values dominate the results of the calculation. To illustrate, consider comparing people based on two variables: age and income. For any two people, the difference in income will likely be much higher in absolute terms (hundreds or thousands of dollars) than the difference in age (less than 150). If the differences in the range of values of age and income are not taken into account, then the comparison between people will be dominated by differences in income. In particular, if the similarity or dissimilarity of two people is calculated using the similarity or dissimilarity measures defined later in this chapter, then in many cases, such as that of Euclidean distance, the income values will dominate the calculation.
The mean and standard deviation are strongly affected by outliers, so the above transformation is often modified. First, the mean is replaced by the median, i.e., the middle value. Second, the standard deviation is replaced by the absolute standard deviation. Specifically, if x is a variable, then the absolute standard deviation of x is given by

$$\sigma_A = \sum_{i=1}^{m} |x_i - \mu|,$$

where $x_i$ is the ith value of the variable, m is the number of objects, and $\mu$ is either the mean or median. Other approaches for computing estimates of the location (center) and spread of a set of values in the presence of outliers are described in Sections 3.2.3 and 3.2.4, respectively. These measures can also be used to define a standardization transformation.
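A sketch of both transformations (assuming NumPy; the income values are made up):

import numpy as np

def standardize(x):
    """Classical standardization: zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

def robust_standardize(x):
    """Outlier-resistant variant: center on the median and scale by
    the absolute deviation from the median, as described above."""
    med = np.median(x)
    abs_dev = np.sum(np.abs(x - med))
    return (x - med) / abs_dev

income = np.array([30_000, 45_000, 52_000, 61_000, 2_000_000])
print(standardize(income))         # the outlier stretches the scale
print(robust_standardize(income))  # far less affected by the outlier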
2.4 Measures of Similarity and Dissimilarity

Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many cases, the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis.

We begin with a discussion of the basics: high-level definitions of similarity and dissimilarity, and a discussion of how they are related. For convenience, the term proximity is used to refer to either similarity or dissimilarity. Since the proximity between two objects is a function of the proximity between the corresponding attributes of the two objects, we first describe how to measure the proximity between objects having only one simple attribute, and then consider proximity measures for objects with multiple attributes. This includes measures such as correlation and Euclidean distance, which are useful for dense data such as time series or two-dimensional points, as well as the Jaccard and cosine similarity measures, which are useful for sparse data like documents. Next, we consider several important issues concerning proximity measures. The section concludes with a brief discussion of how to select the right proximity measure.
2.4.1 Basics

Definitions

Informally, the similarity between two objects is a numerical measure of the degree to which the two objects are alike. Consequently, similarities are higher for pairs of objects that are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and 1 (complete similarity).

The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different. Dissimilarities are lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity, although, as we shall see, distance is often used to refer to a special class of dissimilarities. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to ∞.
Transformations

Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1]. For instance, we may have similarities that range from 1 to 10, but the particular algorithm or software package that we want to use may be designed to only work with dissimilarities, or it may only work with similarities in the interval [0, 1]. We discuss these issues here because we will employ such transformations later in our discussion of proximity. In addition, these issues are relatively independent of the details of specific proximity measures.

Frequently, proximity measures, especially similarities, are defined or transformed to have values in the interval [0, 1]. Informally, the motivation for this is to use a scale in which a proximity value indicates the fraction of similarity (or dissimilarity) between two objects. Such a transformation is often relatively straightforward. For example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can make them fall within the range [0, 1] by using the transformation $s' = (s - 1)/9$, where $s$ and $s'$ are the original and new similarity values, respectively. In the more general case, the transformation of similarities to the interval [0, 1] is given by the expression $s' = (s - \min s)/(\max s - \min s)$, where $\max s$ and $\min s$ are the maximum and minimum similarity values, respectively. Likewise, dissimilarity measures with a finite range can be mapped to the interval [0, 1] by using the formula $d' = (d - \min d)/(\max d - \min d)$.
There can be various complications in mapping proximity measures to the interval [0, 1], however. If, for example, the proximity measure originally takes values in the interval [0, ∞), then a non-linear transformation is needed and values will not have the same relationship to one another on the new scale. Consider the transformation $d' = d/(1 + d)$ for a dissimilarity measure that ranges from 0 to ∞. The dissimilarities 0, 0.5, 2, 10, 100, and 1000 will be transformed into the new dissimilarities 0, 0.33, 0.67, 0.90, 0.99, and 0.999, respectively. Larger values on the original dissimilarity scale are compressed into the range of values near 1, but whether or not this is desirable depends on the application. Another complication is that the meaning of the proximity measure may be changed. For example, correlation, which is discussed later, is a measure of similarity that takes values in the interval [−1, 1]. Mapping these values to the interval [0, 1] by taking the absolute value loses information about the sign, which can be important in some applications. See Exercise 22 on page 94.
Transforming similarities to dissimilarities and vice versa is also relatively straightforward, although we again face the issues of preserving meaning and changing a linear scale into a non-linear scale. If the similarity (or dissimilarity) falls in the interval [0, 1], then the dissimilarity can be defined as $d = 1 - s$ ($s = 1 - d$). Another simple approach is to define similarity as the negative of the dissimilarity (or vice versa). To illustrate, the dissimilarities 0, 1, 10, and 100 can be transformed into the similarities 0, −1, −10, and −100, respectively.

The similarities resulting from the negation transformation are not restricted to the range [0, 1], but if that is desired, then transformations such as $s = \frac{1}{d+1}$, $s = e^{-d}$, or $s = 1 - \frac{d - \min d}{\max d - \min d}$ can be used. For the transformation $s = \frac{1}{d+1}$, the dissimilarities 0, 1, 10, 100 are transformed into 1, 0.5, 0.09, 0.01, respectively. For $s = e^{-d}$, they become 1.00, 0.37, 0.00, 0.00, respectively, while for $s = 1 - \frac{d - \min d}{\max d - \min d}$, they become 1.00, 0.99, 0.90, 0.00, respectively. In this discussion, we have focused on converting dissimilarities to similarities. Conversion in the opposite direction is considered in Exercise 23 on page 94.

In general, any monotonic decreasing function can be used to convert dissimilarities to similarities, or vice versa. Of course, other factors also must be considered when transforming similarities to dissimilarities, or vice versa, or when transforming the values of a proximity measure to a new scale. We have mentioned issues related to preserving meaning, distortion of scale, and requirements of data analysis tools, but this list is certainly not exhaustive.
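These transformations are one-liners. A sketch (assuming NumPy) that reproduces the worked values from this discussion:

import numpy as np

d = np.array([0.0, 0.5, 2.0, 10.0, 100.0, 1000.0])
# Non-linear map from [0, inf) into [0, 1): compresses large values toward 1.
print(d / (1 + d))                               # 0, 0.33, 0.67, 0.91, 0.99, 0.999

d = np.array([0.0, 1.0, 10.0, 100.0])
# Three ways to turn dissimilarities into similarities in [0, 1].
print(1 / (d + 1))                               # 1.00, 0.50, 0.09, 0.01
print(np.exp(-d))                                # 1.00, 0.37, 0.00, 0.00
print(1 - (d - d.min()) / (d.max() - d.min()))   # 1.00, 0.99, 0.90, 0.00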
2.4.2 Similarity and Dissimilarity between Simple Attributes

The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss proximity between objects having a single attribute. Consider objects described by one nominal attribute. What would it mean for two such objects to be similar? Since nominal attributes only convey information about the distinctness of objects, all we can say is that two objects either have the same value or they do not. Hence, in this case similarity is traditionally defined as 1 if attribute values match, and as 0 otherwise. A dissimilarity would be defined in the opposite way: 0 if the attribute values match, and 1 if they do not.

For objects with a single ordinal attribute, the situation is more complicated because information about order should be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product, P1, which is rated wonderful, would be closer to a product P2, which is rated good, than it would be to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then, d(P1, P2) = 3 − 2 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (3 − 2)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 − d.

This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy since this assumes equal intervals, and this is not so. Otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice, our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes.

For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity, as previously described.

Table 2.7 summarizes this discussion. In this table, x and y are two objects that have one attribute of the indicated type. Also, d(x, y) and s(x, y) are the dissimilarity and similarity between x and y, respectively. Other approaches are possible; these are the most common ones.
Table 2.7. Similarity and dissimilarity for simple attributes.

  Attribute Type      Dissimilarity                           Similarity
  Nominal             d = 0 if x = y, d = 1 if x ≠ y          s = 1 if x = y, s = 0 if x ≠ y
  Ordinal             d = |x − y|/(n − 1)                     s = 1 − d
                      (values mapped to integers 0 to n − 1,
                      where n is the number of values)
  Interval or Ratio   d = |x − y|                             s = −d, s = 1/(1 + d), s = e^(−d),
                                                              s = 1 − (d − min d)/(max d − min d)

The following two sections consider more complicated measures of proximity between objects that involve multiple attributes: (1) dissimilarities between data objects and (2) similarities between data objects. This division allows us to more naturally display the underlying motivation for employing various proximity measures. We emphasize, however, that similarities can be transformed into dissimilarities and vice versa using the approaches described earlier.
2.4.3 Dissimilarities between Data Objects

In this section, we discuss various kinds of dissimilarities. We begin with a discussion of distances, which are dissimilarities with certain properties, and then provide examples of more general kinds of dissimilarities.

Distances

We first present some examples, and then offer a more formal description of distances in terms of the properties common to all distances. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}, \qquad (2.1)$$

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of x and y. We illustrate this formula with Figure 2.15 and Tables 2.8 and 2.9, which show a set of points, the x and y coordinates of these points, and the distance matrix containing the pairwise distances of these points.
The Euclidean distance measure given in Equation 2.1 is generalized by the Minkowski distance metric shown in Equation 2.2,

$$d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}, \qquad (2.2)$$

where r is a parameter. The following are the three most common examples of Minkowski distances.

r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is the number of bits that are different between two objects that have only binary attributes, i.e., between two binary vectors.

r = 2. Euclidean distance (L2 norm).

r = ∞. Supremum (L_max or L_∞ norm) distance. This is the maximum difference between any attribute of the objects. More formally, the L_∞ distance is defined by Equation 2.3:

$$d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}. \qquad (2.3)$$

The r parameter should not be confused with the number of dimensions (attributes) n. The Euclidean, Manhattan, and supremum distances are defined for all values of n: 1, 2, 3, . . ., and specify different ways of combining the differences in each dimension (attribute) into an overall distance.
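A sketch of the Minkowski family (assuming NumPy), checked against the distance matrices below:

import numpy as np

def minkowski(x, y, r):
    """Minkowski distance (Equation 2.2); r=1 is city block,
    r=2 is Euclidean, r=np.inf is the supremum (L_max) distance."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(r):
        return diff.max()
    return (diff ** r).sum() ** (1 / r)

p1, p2 = [0, 2], [2, 0]            # points from Table 2.8
print(minkowski(p1, p2, 1))        # 4.0  (matches Table 2.10)
print(minkowski(p1, p2, 2))        # 2.83 (matches Table 2.9, shown as 2.8)
print(minkowski(p1, p2, np.inf))   # 2.0  (matches Table 2.11)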
Tables 2.10 and 2.11, respectively, give the proximity matrices for the L1 and L∞ distances using data from Table 2.8. Notice that all these distance matrices are symmetric; i.e., the ijth entry is the same as the jith entry. In Table 2.9, for instance, the fourth row of the first column and the fourth column of the first row both contain the value 5.1.

Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the distance between two points, x and y, then the following properties hold.

1. Positivity
   (a) d(x, y) ≥ 0 for all x and y,
   (b) d(x, y) = 0 only if x = y.
Figure 2.15. Four two-dimensional points. (Plot not reproduced; the points of Table 2.8 are plotted in the xy-plane.)

Table 2.8. x and y coordinates of four points.

  point   x coordinate   y coordinate
  p1      0              2
  p2      2              0
  p3      3              1
  p4      5              1

Table 2.9. Euclidean distance matrix for Table 2.8.

        p1    p2    p3    p4
  p1    0.0   2.8   3.2   5.1
  p2    2.8   0.0   1.4   3.2
  p3    3.2   1.4   0.0   2.0
  p4    5.1   3.2   2.0   0.0

Table 2.10. L1 distance matrix for Table 2.8.

  L1    p1    p2    p3    p4
  p1    0.0   4.0   4.0   6.0
  p2    4.0   0.0   2.0   4.0
  p3    4.0   2.0   0.0   2.0
  p4    6.0   4.0   2.0   0.0

Table 2.11. L∞ distance matrix for Table 2.8.

  L∞    p1    p2    p3    p4
  p1    0.0   2.0   3.0   5.0
  p2    2.0   0.0   1.0   3.0
  p3    3.0   1.0   0.0   2.0
  p4    5.0   3.0   2.0   0.0
2. Symmetry
   d(x, y) = d(y, x) for all x and y.

3. Triangle Inequality
   d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics. Some people only use the term distance for dissimilarity measures that satisfy these properties, but that practice is often violated. The three properties described here are useful, as well as mathematically pleasing. Also, if the triangle inequality holds, then this property can be used to increase the efficiency of techniques (including clustering) that depend on distances possessing this property. (See Exercise 25.) Nonetheless, many dissimilarities do not satisfy one or more of the metric properties. We give two examples of such measures.
Example 2.14 (Non-metric Dissimilarities: Set Difference). This example is based on the notion of the difference of two sets, as defined in set theory. Given two sets A and B, A − B is the set of elements of A that are not in B. For example, if A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set. We can define the distance d between two sets A and B as d(A, B) = size(A − B), where size is a function returning the number of elements in a set. This distance measure, which is an integer value greater than or equal to 0, does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality. However, these properties can be made to hold if the dissimilarity measure is modified as follows: d(A, B) = size(A − B) + size(B − A). See Exercise 21 on page 94.
Example 2.15 (Non-metric Dissimilarities: Time). This example gives a more everyday example of a dissimilarity measure that is not a metric, but that is still useful. Define a measure of the distance between times of the day as follows:

$$d(t_1, t_2) = \begin{cases} t_2 - t_1 & \text{if } t_1 \le t_2 \\ 24 + (t_2 - t_1) & \text{if } t_1 \ge t_2 \end{cases} \qquad (2.4)$$

To illustrate, d(1PM, 2PM) = 1 hour, while d(2PM, 1PM) = 23 hours. Such a definition would make sense, for example, when answering the question: If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?
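A sketch of Equation 2.4 with hours on a 24-hour clock; the single modulo expression is equivalent to the two-case definition:

def time_distance(t1, t2):
    """Hours until the clock next reads t2, starting from t1
    (Equation 2.4); not symmetric, hence not a metric."""
    return (t2 - t1) % 24

print(time_distance(13, 14))  # d(1PM, 2PM) = 1
print(time_distance(14, 13))  # d(2PM, 1PM) = 23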
2.4.4 Similarities between Data Objects

For similarities, the triangle inequality (or the analogous property) typically does not hold, but symmetry and positivity typically do. To be explicit, if s(x, y) is the similarity between points x and y, then the typical properties of similarities are the following:

1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)

2. s(x, y) = s(y, x) for all x and y. (Symmetry)

There is no general analog of the triangle inequality for similarity measures. It is sometimes possible, however, to show that a similarity measure can easily be converted to a metric distance. The cosine and Jaccard similarity measures, which are discussed shortly, are two examples. Also, for specific similarity measures, it is possible to derive mathematical bounds on the similarity between two objects that are similar in spirit to the triangle inequality.
Example 2.16 (A Non-symmetric Similarity Measure). Consider an experiment in which people are asked to classify a small set of characters as they flash on a screen. The confusion matrix for this experiment records how often each character is classified as itself, and how often each is classified as another character. For instance, suppose that "0" appeared 200 times and was classified as a "0" 160 times, but as an "o" 40 times. Likewise, suppose that "o" appeared 200 times and was classified as an "o" 170 times, but as "0" only 30 times. If we take these counts as a measure of the similarity between two characters, then we have a similarity measure, but one that is not symmetric. In such situations, the similarity measure is often made symmetric by setting s'(x, y) = s'(y, x) = (s(x, y) + s(y, x))/2, where s' indicates the new similarity measure.
2.4.5 Examples of Proximity Measures

This section provides specific examples of some similarity and dissimilarity measures.

Similarity Measures for Binary Data

Similarity measures between objects that contain only binary attributes are called similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similar, while a value of 0 indicates that the objects are not at all similar. There are many rationales for why one coefficient is better than another in specific instances.

Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):

  f00 = the number of attributes where x is 0 and y is 0
  f01 = the number of attributes where x is 0 and y is 1
  f10 = the number of attributes where x is 1 and y is 0
  f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient  One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

$$\mathrm{SMC} = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}. \qquad (2.5)$$
This measure counts both presences and absences equally. Consequently, the SMC could be used to find students who had answered questions similarly on a test that consisted only of true/false questions.

Jaccard Coefficient  Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix (see Section 2.1.2). If each asymmetric binary attribute corresponds to an item in a store, then a 1 indicates that the item was purchased, while a 0 indicates that the product was not purchased. Since the number of products not purchased by any customer far outnumbers the number of products that were purchased, a similarity measure such as SMC would say that all transactions are very similar. As a result, the Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}. \qquad (2.6)$$
Example 2.17 (The SMC and Jaccard Similarity Coefficients). To illustrate the difference between these two similarity measures, we calculate SMC and J for the following two binary vectors.

  x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

  f01 = 2  (the number of attributes where x was 0 and y was 1)
  f10 = 1  (the number of attributes where x was 1 and y was 0)
  f00 = 7  (the number of attributes where x was 0 and y was 0)
  f11 = 0  (the number of attributes where x was 1 and y was 1)

  SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

  J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
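A sketch of both coefficients (assuming NumPy), reproducing Example 2.17:

import numpy as np

def smc(x, y):
    """Simple matching coefficient (Equation 2.5): fraction of
    attributes on which x and y agree (0-0 and 1-1 matches)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x == y)

def jaccard(x, y):
    """Jaccard coefficient (Equation 2.6): 1-1 matches divided by
    the attributes not involved in 0-0 matches."""
    x, y = np.asarray(x), np.asarray(y)
    f11 = np.sum((x == 1) & (y == 1))
    f00 = np.sum((x == 0) & (y == 0))
    return f11 / (len(x) - f00)

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))  # 0.7 0.0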
Cosine Similarity

Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. It is more complicated than this, of course, since certain common words are ignored and various processing techniques are used to account for different forms of the same word, differing document lengths, and different word frequencies. Even though documents have thousands or tens of thousands of attributes (terms), each document is sparse since it has relatively few non-zero attributes. (The normalizations used for documents do not create a non-zero entry where there was a zero entry; i.e., they preserve sparsity.) Thus, as with transaction data, similarity should not depend on the number of shared 0 values since any two documents are likely to not contain many of the same words, and therefore, if 0-0 matches are counted, most documents will be highly similar to most other documents. Therefore, a similarity measure for documents needs to ignore 0-0 matches like the Jaccard measure, but also must be able to handle non-binary vectors. The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}, \qquad (2.7)$$

where · indicates the vector dot product, $x \cdot y = \sum_{k=1}^{n} x_k y_k$, and $\|x\|$ is the length of vector x, $\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2} = \sqrt{x \cdot x}$.
Example 2.18 (Cosine Similarity of Two Document Vectors). This example calculates the cosine similarity for the following two data objects, which might represent document vectors:

  x = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  y = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

  x · y = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
  ‖x‖ = √(3·3 + 2·2 + 0·0 + 5·5 + 0·0 + 0·0 + 0·0 + 2·2 + 0·0 + 0·0) = 6.48
  ‖y‖ = √(1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 2·2) = 2.45
  cos(x, y) = 0.31
As indicated by Figure 2.16, cosine similarity really is a measure of the (cosine of the) angle between x and y. Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except for magnitude (length). If the cosine similarity is 0, then the angle between x and y is 90°, and they do not share any terms (words).
Figure 2.16. Geometric illustration of the cosine measure. (Plot not reproduced; it shows the angle θ between vectors x and y.)
Equation 2.7 can be written as Equation 2.8:

$$\cos(x, y) = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = x' \cdot y', \qquad (2.8)$$

where $x' = x/\|x\|$ and $y' = y/\|y\|$. Dividing x and y by their lengths normalizes them to have a length of 1. This means that cosine similarity does not take the magnitude of the two data objects into account when computing similarity. (Euclidean distance might be a better choice when magnitude is important.) For vectors with a length of 1, the cosine measure can be calculated by taking a simple dot product. Consequently, when many cosine similarities between objects are being computed, normalizing the objects to have unit length can reduce the time required.
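A sketch of Equations 2.7 and 2.8 (assuming NumPy), checked against Example 2.18:

import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (Equation 2.7)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

x = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
y = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(x, y), 2))  # 0.31, as in Example 2.18

# Equation 2.8: pre-normalizing to unit length reduces each later
# similarity computation to a plain dot product.
xn = np.asarray(x) / np.linalg.norm(x)
yn = np.asarray(y) / np.linalg.norm(y)
print(round(xn @ yn, 2))  # 0.31 again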
Extended Jaccard Coefficient (Tanimoto Coefficient)

The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient is also known as the Tanimoto coefficient. (However, there is another coefficient that is also known as the Tanimoto coefficient.) This coefficient, which we shall represent as EJ, is defined by the following equation:

$$EJ(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}. \qquad (2.9)$$
Correlation

The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects. (The calculation of correlation between attributes, which is more common, can be defined similarly.) More precisely, Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:

$$\mathrm{corr}(x, y) = \frac{\mathrm{covariance}(x, y)}{\mathrm{standard\ deviation}(x) \times \mathrm{standard\ deviation}(y)} = \frac{s_{xy}}{s_x s_y}, \qquad (2.10)$$

where we are using the following standard statistical notation and definitions:

$$\mathrm{covariance}(x, y) = s_{xy} = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y}) \qquad (2.11)$$

$$\mathrm{standard\ deviation}(x) = s_x = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2}$$

$$\mathrm{standard\ deviation}(y) = s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^{n} (y_k - \bar{y})^2}$$

$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} x_k \ \text{ is the mean of } x, \qquad \bar{y} = \frac{1}{n} \sum_{k=1}^{n} y_k \ \text{ is the mean of } y.$$
Example 2.19 (Perfect Correlation). Correlation is always in the range −1 to 1. A correlation of 1 (−1) means that x and y have a perfect positive (negative) linear relationship; that is, $x_k = a y_k + b$, where a and b are constants. The following two sets of values for x and y indicate cases where the correlation is −1 and +1, respectively. In the first case, the means of x and y were chosen to be 0, for simplicity.

  x = (−3, 6, 0, 3, −6)
  y = (1, −2, 0, −1, 2)

  x = (3, 6, 0, 3, 6)
  y = (1, 2, 0, 1, 2)
Figure 2.17. Scatter plots illustrating correlations from −1 to 1. (Plots not reproduced; the panels are labeled with correlations from −1.00 to 1.00 in steps of 0.10.)
Example 2.20 (Non-linear Relationships). If the correlation is 0, then there is no linear relationship between the attributes of the two data objects. However, non-linear relationships may still exist. In the following example, $y_k = x_k^2$, but their correlation is 0.

  x = (−3, −2, −1, 0, 1, 2, 3)
  y = (9, 4, 1, 0, 1, 4, 9)

Example 2.21 (Visualizing Correlation). It is also easy to judge the correlation between two data objects x and y by plotting pairs of corresponding attribute values. Figure 2.17 shows a number of these plots when x and y have 30 attributes and the values of these attributes are randomly generated (with a normal distribution) so that the correlation of x and y ranges from −1 to 1. Each circle in a plot represents one of the 30 attributes; its x coordinate is the value of one of the attributes for x, while its y coordinate is the value of the same attribute for y.

If we transform x and y by subtracting off their means and then normalizing them so that their lengths are 1, then their correlation can be calculated by taking the dot product. Notice that this is not the same as the standardization used in other contexts, where we make the transformations $x'_k = (x_k - \bar{x})/s_x$ and $y'_k = (y_k - \bar{y})/s_y$.
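A sketch of this dot-product formulation of correlation (assuming NumPy), checked against Examples 2.19 and 2.20:

import numpy as np

def correlation(x, y):
    """Pearson correlation (Equation 2.10): center each object,
    normalize to unit length, then take the dot product."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return (x / np.linalg.norm(x)) @ (y / np.linalg.norm(y))

print(correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))  # approx -1.0
print(correlation([3, 6, 0, 3, 6], [1, 2, 0, 1, 2]))      # approx  1.0
print(correlation([-3, -2, -1, 0, 1, 2, 3],
                  [9, 4, 1, 0, 1, 4, 9]))                 # approx  0.0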
Bregman Divergence  This section provides a brief description of Bregman divergences, which are a family of proximity functions that share some common properties. As a result, it is possible to construct general data mining algorithms, such as clustering algorithms, that work with any Bregman divergence. A concrete example is the K-means clustering algorithm (Section 8.2). Note that this section requires knowledge of vector calculus.

Bregman divergences are loss or distortion functions. To understand the idea of a loss function, consider the following. Let x and y be two points, where y is regarded as the original point and x is some distortion or approximation of it. For example, x may be a point that was generated, for example, by adding random noise to y. The goal is to measure the resulting distortion or loss that results if y is approximated by x. Of course, the more similar x and y are, the smaller the loss or distortion. Thus, Bregman divergences can be used as dissimilarity functions.

More formally, we have the following definition.

Definition 2.6 (Bregman Divergence). Given a strictly convex function $\phi$ (with a few modest restrictions that are generally satisfied), the Bregman divergence (loss function) $D(x, y)$ generated by that function is given by the following equation:

$$D(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), (x - y) \rangle, \qquad (2.12)$$

where $\nabla \phi(y)$ is the gradient of $\phi$ evaluated at y, $x - y$ is the vector difference between x and y, and $\langle \nabla \phi(y), (x - y) \rangle$ is the inner product between $\nabla \phi(y)$ and $(x - y)$. For points in Euclidean space, the inner product is just the dot product.

$D(x, y)$ can be written as $D(x, y) = \phi(x) - L(x)$, where $L(x) = \phi(y) + \langle \nabla \phi(y), (x - y) \rangle$ represents the equation of a plane that is tangent to the function $\phi$ at y. Using calculus terminology, $L(x)$ is the linearization of $\phi$ around the point y, and the Bregman divergence is just the difference between a function and a linear approximation to that function. Different Bregman divergences are obtained by using different choices for $\phi$.
Example 2.22. We provide a concrete example using squared Euclidean distance, but restrict ourselves to one dimension to simplify the mathematics. Let x and y be real numbers and $\phi(t)$ be the real-valued function, $\phi(t) = t^2$. In that case, the gradient reduces to the derivative and the dot product reduces to multiplication. Specifically, Equation 2.12 becomes Equation 2.13:

$$D(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2. \qquad (2.13)$$

The graph for this example, with y = 1, is shown in Figure 2.18. The Bregman divergence is shown for two values of x: x = 2 and x = 3.

Figure 2.18. Illustration of Bregman divergence. (Plot not reproduced; it shows $\phi(x) = x^2$, the tangent line y = 2x − 1 at the point y = 1, and the divergences D(2, 1) and D(3, 1) as gaps between the curve and the tangent.)
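A sketch of Definition 2.6 (assuming NumPy), instantiated with the squared-norm function of Example 2.22:

import numpy as np

def bregman(x, y, phi, grad_phi):
    """Bregman divergence (Equation 2.12): the gap between phi and
    its linearization around y, evaluated at x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# phi(t) = ||t||^2 generates squared Euclidean distance (Example 2.22).
phi = lambda t: t @ t
grad_phi = lambda t: 2 * t

print(bregman([2.0], [1.0], phi, grad_phi))  # D(2, 1) = 1.0
print(bregman([3.0], [1.0], phi, grad_phi))  # D(3, 1) = 4.0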
2.4.6 Issues in Proximity Calculation

This section discusses several important issues related to proximity measures: (1) how to handle the case in which attributes have different scales and/or are correlated, (2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and (3) how to handle proximity calculation when attributes have different weights; i.e., when not all attributes contribute equally to the proximity of objects.
Standardization and Correlation for Distance Measures

An important issue with distance measures is how to handle the situation when attributes do not have the same range of values. (This situation is often described by saying that the variables have different scales.) Earlier, Euclidean distance was used to measure the distance between people based on two attributes: age and income. Unless these two attributes are standardized, the distance between two people will be dominated by income.

A related issue is how to compute distance when there is correlation between some of the attributes, perhaps in addition to differences in the ranges of values. A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

$$\mathrm{mahalanobis}(x, y) = (x - y) \, \Sigma^{-1} \, (x - y)^T, \qquad (2.14)$$
where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data. Note that the covariance matrix $\Sigma$ is the matrix whose ijth entry is the covariance of the ith and jth attributes as defined by Equation 2.11.

Example 2.23. In Figure 2.19, there are 1000 points, whose x and y attributes have a correlation of 0.6. The distance between the two large points at the opposite ends of the long axis of the ellipse is 14.7 in terms of Euclidean distance, but only 6 with respect to Mahalanobis distance. In practice, computing the Mahalanobis distance is expensive, but can be worthwhile for data whose attributes are correlated. If the attributes are relatively uncorrelated, but have different ranges, then standardizing the variables is sufficient.
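A sketch of Equation 2.14 (assuming NumPy; the correlated sample data is made up, loosely in the spirit of Figure 2.19):

import numpy as np

def mahalanobis(x, y, data):
    """Mahalanobis distance between x and y as in Equation 2.14
    (defined there without a square root), with the covariance
    matrix estimated from `data` (rows = objects)."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
    return float(diff @ inv_cov @ diff)

# Hypothetical 2-D data with correlation 0.6 between the attributes.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1000)

x, y = data[data[:, 0].argmin()], data[data[:, 0].argmax()]
print(np.linalg.norm(x - y))     # Euclidean distance
print(mahalanobis(x, y, data))   # discounts separation along the long axis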
Combining Similarities for Heterogeneous Attributes

The previous definitions of similarity were based on approaches that assumed all the attributes were of the same type. A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity between each attribute separately using Table 2.7, and then combine these similarities using a method that results in a similarity between 0 and 1. Typically, the overall similarity is defined as the average of all the individual attribute similarities.
Figure 2.19. Set of two-dimensional points. The Mahalanobis distance between the two points represented by large dots is 6; their Euclidean distance is 14.7. (Plot not reproduced.)
Unfortunately, this approach does not work well if some of the attributes are asymmetric attributes. For example, if all the attributes are asymmetric binary attributes, then the similarity measure suggested previously reduces to the simple matching coefficient, a measure that is not appropriate for asymmetric binary attributes. The easiest way to fix this problem is to omit asymmetric attributes from the similarity calculation when their values are 0 for both of the objects whose similarity is being computed. A similar approach also works well for handling missing values.

In summary, Algorithm 2.1 is effective for computing an overall similarity between two objects, x and y, with different types of attributes. This procedure can be easily modified to work with dissimilarities.
Algorithm 2.1  Similarities of heterogeneous objects.

1: For the kth attribute, compute a similarity, $s_k(x, y)$, in the range [0, 1].
2: Define an indicator variable, $\delta_k$, for the kth attribute as follows:
     $\delta_k$ = 0 if the kth attribute is an asymmetric attribute and both objects have
           a value of 0, or if one of the objects has a missing value for the kth attribute
     $\delta_k$ = 1 otherwise
3: Compute the overall similarity between the two objects using the following formula:

$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k} \qquad (2.15)$$
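A sketch of Algorithm 2.1 in Python (the attribute types, similarity functions, and example objects are illustrative assumptions, with per-attribute similarities taken from Table 2.7):

import math

def heterogeneous_similarity(x, y, sim_fns, asymmetric):
    """Overall similarity per Algorithm 2.1. x, y are attribute lists
    (None marks a missing value); sim_fns[k] maps (x[k], y[k]) to a
    similarity in [0, 1]; asymmetric[k] flags asymmetric attributes."""
    num = den = 0.0
    for k, s_k in enumerate(sim_fns):
        missing = x[k] is None or y[k] is None
        both_zero = not missing and asymmetric[k] and x[k] == 0 and y[k] == 0
        delta = 0 if (missing or both_zero) else 1   # indicator variable
        if delta:
            num += s_k(x[k], y[k])
            den += 1
    return num / den if den else math.nan

# Hypothetical objects: (color: nominal, rating: ordinal 0-4,
# owns_car: asymmetric binary).
sims = [lambda a, b: float(a == b),        # nominal (Table 2.7)
        lambda a, b: 1 - abs(a - b) / 4,   # ordinal (Table 2.7)
        lambda a, b: float(a == b)]        # binary presence
print(heterogeneous_similarity(["red", 3, 0], ["red", 1, 0], sims,
                               asymmetric=[False, False, True]))  # 0.75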
Using Weights

In much of the previous discussion, all attributes were treated equally when computing proximity. This is not desirable when some attributes are more important to the definition of proximity than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute. If the weights $w_k$ sum to 1, then Equation 2.15 becomes

$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}. \qquad (2.16)$$
The definition of the Minkowski distance can also be modified as follows:

$$d(x, y) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}. \qquad (2.17)$$
2.4.7 Selecting the Right Proximity Measure

The following are a few general observations that may be helpful. First, the type of proximity measure should fit the type of data. For many types of dense, continuous data, metric distance measures such as Euclidean distance are often used. Proximity between continuous attributes is most often expressed in terms of differences, and distance measures provide a well-defined way of combining these differences into an overall proximity measure. Although attributes can have different scales and be of differing importance, these issues can often be dealt with as described earlier.

For sparse data, which often consists of asymmetric attributes, we typically employ similarity measures that ignore 0-0 matches. Conceptually, this reflects the fact that, for a pair of complex objects, similarity depends on the number of characteristics they both share, rather than the number of characteristics they both lack. More specifically, for sparse, asymmetric data, most objects have only a few of the characteristics described by the attributes, and thus, are highly similar in terms of the characteristics they do not have. The cosine, Jaccard, and extended Jaccard measures are appropriate for such data.
There are other characteristics of data vectors that may need to be considered. Suppose, for example, that we are interested in comparing time series. If the magnitude of the time series is important (for example, each time series represents the total sales of the same organization for a different year), then we could use Euclidean distance. If the time series represent different quantities (for example, blood pressure and oxygen consumption), then we usually want to determine if the time series have the same shape, not the same magnitude. Correlation, which uses a built-in normalization that accounts for differences in magnitude and level, would be more appropriate.
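A small Python example (with two hypothetical series that have the same shape but very different magnitude and level) makes the contrast concrete:

    import numpy as np

    t = np.linspace(0, 4 * np.pi, 100)
    s1 = np.sin(t)              # one quantity
    s2 = 100 + 50 * np.sin(t)   # same shape, different magnitude and level

    print(np.linalg.norm(s1 - s2))    # Euclidean distance: large
    print(np.corrcoef(s1, s2)[0, 1])  # correlation: 1.0, the shapes match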

In some cases, transformation or normalization of the data is important for obtaining a proper similarity measure since such transformations are not always present in proximity measures. For instance, time series may have trends or periodic patterns that significantly impact similarity. Also, a proper computation of similarity may require that time lags be taken into account. Finally, two time series may only be similar over specific periods of time. For example, there is a strong relationship between temperature and the use of natural gas, but only during the heating season.
Practical consideration can also be important. Sometimes, one or more proximity measures are already in use in a particular field, and thus, others will have answered the question of which proximity measures should be used. Other times, the software package or clustering algorithm being used may drastically limit the choices. If efficiency is a concern, then we may want to choose a proximity measure that has a property, such as the triangle inequality, that can be used to reduce the number of proximity calculations. (See Exercise 25.)

However, if common practice or practical restrictions do not dictate a choice, then the proper choice of a proximity measure can be a time-consuming task that requires careful consideration of both domain knowledge and the purpose for which the measure is being used. A number of different similarity measures may need to be evaluated to see which ones produce results that make the most sense.
2.5 Bibliographic Notes
It is essential to understand the nature o
the data that is b
eing analyzed,
and at a undamental level, this is the subject o
measurement
theory. In
2.5 Bibliographic Notes 85
particular, one o
the initial motivations or de ning types o
attr
ibutes was
to be precise about which statistical operations were valid or what
sorts o
data. We have presented the view o
measurement theory that was ini
tially
described in a classic paper by S. S. Stevens [79]. (Tables 2.2
and 2.3 are
derived rom those presented by Stevens [80].) While this is the most common
view and is reasonably easy to understand and apply, there is, o
course,

much more to measurement theory. An authoritative discussion can be 
ound
in a three volume series on the oundations o
measurement theory [6
3, 69,
81]. Also o
interest is a wide ranging article by Hand [55], whic
h discusses

measurement theory and statistics, and is accompanied by comments 


rom
other researchers in the
eld. Finally, there are many books and artic
les that
describe measurement issues or particular areas o science and enginee
ring.
Data quality is a broad subject that spans every discipline that uses
data.
Discussions o
precision, bias, accuracy, and signi cant gures can be
ound
in many introductory science, engineering, and statistics textbooks. The view
o data quality as tness or use is explained in more detail in the b
ook by
Redman [76]. Those interested in data quality may also be interested in MITs
Total Data Quality Management program [70, 84]. However, the knowle
dge
needed to deal with speci c data quality issues in a particular domain is oten
best obtained by investigating the data quality practices o researchers in that
eld.
Aggregation is a less well-defined subject than many other preprocessing tasks. However, aggregation is one of the main techniques used by the database area of Online Analytical Processing (OLAP), which is discussed in Chapter 3. There has also been relevant work in the area of symbolic data analysis (Bock and Diday [47]). One of the goals in this area is to summarize traditional record data in terms of symbolic data objects whose attributes are more complex than traditional attributes. Specifically, these attributes can have values that are sets of values (categories), intervals, or sets of values with weights (histograms). Another goal of symbolic data analysis is to be able to perform clustering, classification, and other kinds of data analysis on data that consists of symbolic data objects.

Sampling is a subject that has been well studied in statistics and related fields. Many introductory statistics books, such as the one by Lindgren [65], have some discussion on sampling, and there are entire books devoted to the subject, such as the classic text by Cochran [49]. A survey of sampling for data mining is provided by Gu and Liu [54], while a survey of sampling for databases is provided by Olken and Rotem [72]. There are a number of other data mining and database-related sampling references that may be of interest, including papers by Palmer and Faloutsos [74], Provost et al. [75], Toivonen [82], and Zaki et al. [85].
In statistics, the traditional techniques that have been used for dimensionality reduction are multidimensional scaling (MDS) (Borg and Groenen [48], Kruskal and Uslaner [64]) and principal component analysis (PCA) (Jolliffe [58]), which is similar to singular value decomposition (SVD) (Demmel [50]). Dimensionality reduction is discussed in more detail in Appendix B.

Discretization is a topic that has been extensively investigated in data mining. Some classification algorithms only work with categorical data, and association analysis requires binary data, and thus, there is a significant motivation to investigate how to best binarize or discretize continuous attributes. For association analysis, we refer the reader to work by Srikant and Agrawal [78], while some useful references for discretization in the area of classification include work by Dougherty et al. [51], Elomaa and Rousu [52], Fayyad and Irani [53], and Hussain et al. [56].

Feature selection is another topic well investigated in data mining. A broad coverage of this topic is provided in a survey by Molina et al. [71] and two books by Liu and Motoda [66, 67]. Other useful papers include those by Blum and Langley [46], Kohavi and John [62], and Liu et al. [68].

It is difficult to provide references for the subject of feature transformations because practices vary from one discipline to another. Many statistics books have a discussion of transformations, but typically the discussion is restricted to a particular purpose, such as ensuring the normality of a variable or making sure that variables have equal variance. We offer two references: Osborne [73] and Tukey [83].

While we have covered some of the most commonly used distance and similarity measures, there are hundreds of such measures and more are being created all the time. As with so many other topics in this chapter, many of these measures are specific to particular fields; e.g., in the area of time series see papers by Kalpakis et al. [59] and Keogh and Pazzani [61]. Clustering books provide the best general discussions. In particular, see the books by Anderberg [45], Jain and Dubes [57], Kaufman and Rousseeuw [60], and Sneath and Sokal [77].
Bibliography

[45] M. R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, December 1973.
[46] A. Blum and P. Langley. Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 97(1–2):245–271, 1997.
[47] H. H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (Studies in Classification, Data Analysis, and Knowledge Organization). Springer-Verlag Telos, January 2000.
[48] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer-Verlag, February 1997.
[49] W. G. Cochran. Sampling Techniques. John Wiley & Sons, 3rd edition, July 1977.
[50] J. W. Demmel. Applied Numerical Linear Algebra. Society for Industrial & Applied Mathematics, September 1997.
[51] J. Dougherty, R. Kohavi, and M. Sahami. Supervised and Unsupervised Discretization of Continuous Features. In Proc. of the 12th Intl. Conf. on Machine Learning, pages 194–202, 1995.
[52] T. Elomaa and J. Rousu. General and Efficient Multisplitting of Numerical Attributes. Machine Learning, 36(3):201–244, 1999.
[53] U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. 13th Int. Joint Conf. on Artificial Intelligence, pages 1022–1027. Morgan Kaufman, 1993.
[54] F. H. Gaohua Gu and H. Liu. Sampling and Its Application in Data Mining: A Survey. Technical Report TRA6/00, National University of Singapore, Singapore, 2000.
[55] D. J. Hand. Statistics and the Theory of Measurement. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159(3):445–492, 1996.
[56] F. Hussain, H. Liu, C. L. Tan, and M. Dash. TRC6/99: Discretization: an enabling technique. Technical report, National University of Singapore, Singapore, 1999.
[57] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall Advanced Reference Series. Prentice Hall, March 1988. Book available online at http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf.
[58] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 2nd edition, October 2002.
[59] K. Kalpakis, D. Gada, and V. Puttagunta. Distance Measures for Effective Clustering of ARIMA Time-Series. In Proc. of the 2001 IEEE Intl. Conf. on Data Mining, pages 273–280. IEEE Computer Society, 2001.
[60] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley and Sons, New York, November 1990.
[61] E. J. Keogh and M. J. Pazzani. Scaling up dynamic time warping for datamining applications. In KDD, pages 285–289, 2000.
[62] R. Kohavi and G. H. John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[63] D. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of Measurements: Volume 1: Additive and polynomial representations. Academic Press, New York, 1971.
[64] J. B. Kruskal and E. M. Uslaner. Multidimensional Scaling. Sage Publications, August 1978.
[65] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[66] H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer International Series in Engineering and Computer Science, 453. Kluwer Academic Publishers, July 1998.
[67] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science, 454. Kluwer Academic Publishers, July 1998.
[68] H. Liu, H. Motoda, and L. Yu. Feature Extraction, Selection, and Construction. In N. Ye, editor, The Handbook of Data Mining, pages 22–41. Lawrence Erlbaum Associates, Inc., Mahwah, NJ, 2003.
[69] R. D. Luce, D. Krantz, P. Suppes, and A. Tversky. Foundations of Measurements: Volume 3: Representation, Axiomatization, and Invariance. Academic Press, New York, 1990.
[70] MIT Total Data Quality Management Program. web.mit.edu/tdqm/www/index.shtml, 2003.
[71] L. C. Molina, L. Belanche, and A. Nebot. Feature Selection Algorithms: A Survey and Experimental Evaluation. In Proc. of the 2002 IEEE Intl. Conf. on Data Mining, 2002.
[72] F. Olken and D. Rotem. Random Sampling from Databases: A Survey. Statistics & Computing, 5(1):25–42, March 1995.
[73] J. Osborne. Notes on the Use of Data Transformations. Practical Assessment, Research & Evaluation, 28(6), 2002.
[74] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. ACM SIGMOD Record, 29(2):82–92, 2000.
[75] F. J. Provost, D. Jensen, and T. Oates. Efficient Progressive Sampling. In Proc. of the 5th Intl. Conf. on Knowledge Discovery and Data Mining, pages 23–32, 1999.
[76] T. C. Redman. Data Quality: The Field Guide. Digital Press, January 2001.
[77] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, San Francisco, 1971.
[78] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. In Proc. of 1996 ACM-SIGMOD Intl. Conf. on Management of Data, pages 1–12, Montreal, Quebec, Canada, August 1996.
[79] S. S. Stevens. On the Theory of Scales of Measurement. Science, 103(2684):677–680, June 1946.
[80] S. S. Stevens. Measurement. In G. M. Maranell, editor, Scaling: A Sourcebook for Behavioral Scientists, pages 22–41. Aldine Publishing Co., Chicago, 1974.
[81] P. Suppes, D. Krantz, R. D. Luce, and A. Tversky. Foundations of Measurements: Volume 2: Geometrical, Threshold, and Probabilistic Representations. Academic Press, New York, 1989.
[82] H. Toivonen. Sampling Large Databases for Association Rules. In VLDB96, pages 134–145. Morgan Kaufman, September 1996.
[83] J. W. Tukey. On the Comparative Anatomy of Transformations. Annals of Mathematical Statistics, 28(3):602–632, September 1957.
[84] R. Y. Wang, M. Ziad, Y. W. Lee, and Y. R. Wang. Data Quality. The Kluwer International Series on Advances in Database Systems, Volume 23. Kluwer Academic Publishers, January 2001.
[85] M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of Sampling for Data Mining of Association Rules. Technical Report TR617, Rensselaer Polytechnic Institute, 1996.
2.6 Exercises

1. In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
2. Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people's judgments.
(d) Angles as measured in degrees between 0 and 360.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
(j) Military rank.
(k) Distance from the center of campus.
(l) Density of a substance in grams per cubic centimeter.
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives you a number that you can use to claim your coat when you leave.)
3. You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple that I can't believe that no one has thought of it before. I just keep track of the number of customer complaints for each product. I read in a data mining book that counts are ratio attributes, and so, my measure of product satisfaction must be a ratio attribute. But when I rated the products based on my new customer satisfaction measure and showed them to my boss, he told me that I had overlooked the obvious, and that my measure was worthless. I think that he was just mad because our best-selling product had the worst satisfaction since it had the most complaints. Could you help me set him straight?"

(a) Who is right, the marketing director or his boss? If you answered, his boss, what would you do to fix the measure of satisfaction?
(b) What can you say about the attribute type of the original product satisfaction attribute?
4. A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product over other, similar products. He explains, "When we develop new products, we typically create several variations and evaluate which one customers prefer. Our standard procedure is to give our test subjects all of the product variations at one time and then ask them to rank the product variations in order of preference. However, our test subjects are very indecisive, especially when there are more than two products. As a result, testing takes forever. I suggested that we perform the comparisons in pairs and then use these comparisons to get the rankings. Thus, if we have three product variations, we have the customers compare variations 1 and 2, then 2 and 3, and finally 3 and 1. Our testing time with my new procedure is a third of what it was for the old procedure, but the employees conducting the tests complain that they cannot come up with a consistent ranking from the results. And my boss wants the latest product evaluations, yesterday. I should also mention that he was the person who came up with the old product evaluation approach. Can you help me?"

(a) Is the marketing director in trouble? Will his approach work for generating an ordinal ranking of the product variations in terms of customer preference? Explain.
(b) Is there a way to fix the marketing director's approach? More generally, what can you say about trying to create an ordinal measurement scale based on pairwise comparisons?
(c) For the original product evaluation scheme, the overall rankings of each product variation are found by computing its average over all test subjects. Comment on whether you think that this is a reasonable approach. What other approaches might you take?
5. Can you think of a situation in which identification numbers would be useful for prediction?

6. An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each.

(a) How would you convert this data into a form suitable for association analysis?
(b) In particular, what type of attributes would you have and how many of them are there?
7. Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?

8. Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
9. Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data mining.

10. Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that require 32 and 64 bits, respectively.

11. Give at least two advantages to working with data stored in text files instead of in a binary format.

12. Distinguish between noise and outliers. Be sure to consider the following questions.

(a) Is noise ever interesting or desirable? Outliers?
(b) Can noise objects be outliers?
(c) Are noise objects always outliers?
(d) Are outliers always noise objects?
(e) Can noise make a typical value into an unusual one, or vice versa?
13. Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.2 for this task.

Algorithm 2.2 Algorithm for finding K nearest neighbors.
1: for i = 1 to number of data objects do
2:   Find the distances of the ith object to all other objects.
3:   Sort these distances in decreasing order. (Keep track of which object is associated with each distance.)
4:   return the objects associated with the first K distances of the sorted list
5: end for

(a) Describe the potential problems with this algorithm if there are duplicate objects in the data set. Assume the distance function will only return a distance of 0 for objects that are the same.
(b) How would you fix this problem?
14. The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure from Section 2.4 would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
15. You are given a set of $m$ objects that is divided into $K$ groups, where the $i$th group is of size $m_i$. If the goal is to obtain a sample of size $n < m$, what is the difference between the following two sampling schemes? (Assume sampling with replacement.)

(a) We randomly select $n \times m_i/m$ elements from each group.
(b) We randomly select $n$ elements from the data set, without regard for the group to which an object belongs.
16. Consider a document-term matrix, where $tf_{ij}$ is the frequency of the $i$th word (term) in the $j$th document and $m$ is the number of documents. Consider the variable transformation that is defined by

$$tf'_{ij} = tf_{ij} \times \log \frac{m}{df_i}, \qquad (2.18)$$

where $df_i$ is the number of documents in which the $i$th term appears, which is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

(a) What is the effect of this transformation if a term occurs in one document? In every document?
(b) What might be the purpose of this transformation?
17. Assume that we apply a square root transformation to a ratio attribute $x$ to obtain the new attribute $x^*$. As part of your analysis, you identify an interval $(a, b)$ in which $x^*$ has a linear relationship to another attribute $y$.

(a) What is the corresponding interval $(a, b)$ in terms of $x$?
(b) Give an equation that relates $y$ to $x$.
18. This exercise compares and contrasts some similarity and distance measures.

(a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Compute the Hamming distance and the Jaccard similarity between the following two binary vectors.

x = 0101010001
y = 0100011000

(b) Which approach, Jaccard or Hamming distance, is more similar to the Simple Matching Coefficient, and which approach is more similar to the cosine measure? Explain. (Note: The Hamming measure is a distance, while the other three measures are similarities, but don't let this confuse you.)
(c) Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, you think would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

(d) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)
19. For the following vectors, x and y, calculate the indicated similarity or distance measures.

(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation, Euclidean
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0) cosine, correlation, Euclidean, Jaccard
(c) x = (0, −1, 0, 1), y = (1, 0, −1, 0) cosine, correlation, Euclidean
(d) x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1) cosine, correlation, Jaccard
(e) x = (2, −1, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1) cosine, correlation
20. Here, we further explore the cosine and correlation measures.

(a) What is the range of values that are possible for the cosine measure?
(b) If two objects have a cosine measure of 1, are they identical? Explain.
(c) What is the relationship of the cosine measure to correlation, if any? (Hint: Look at statistical measures such as mean and standard deviation in cases where cosine and correlation are the same and different.)
(d) Figure 2.20(a) shows the relationship of the cosine measure to Euclidean distance for 100,000 randomly generated points that have been normalized to have an L2 length of 1. What general observation can you make about the relationship between Euclidean distance and cosine similarity when vectors have an L2 norm of 1?
(e) Figure 2.20(b) shows the relationship of correlation to Euclidean distance for 100,000 randomly generated points that have been standardized to have a mean of 0 and a standard deviation of 1. What general observation can you make about the relationship between Euclidean distance and correlation when the vectors have been standardized to have a mean of 0 and a standard deviation of 1?
(f) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object has an L2 length of 1.
(g) Derive the mathematical relationship between correlation and Euclidean distance when each data point has been standardized by subtracting its mean and dividing by its standard deviation.
Figure 2.20. Graphs for Exercise 20: (a) relationship between Euclidean distance and the cosine measure; (b) relationship between Euclidean distance and correlation. (In both panels, the horizontal axis is the similarity measure, from 0 to 1, and the vertical axis is Euclidean distance, from 0 to 1.4.)
21. Show that the set difference metric given by

$$d(A, B) = \mathrm{size}(A - B) + \mathrm{size}(B - A) \qquad (2.19)$$

satisfies the metric axioms given on page 70. $A$ and $B$ are sets and $A - B$ is the set difference.
22. Discuss how you might map correlation values from the interval [−1,1] to the interval [0,1]. Note that the type of transformation that you use might depend on the application that you have in mind. Thus, consider two applications: clustering time series and predicting the behavior of one time series given another.

23. Given a similarity measure with values in the interval [0,1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0,∞].
24. Proximity is typically defined between a pair of objects.

(a) Define two ways in which you might define the proximity among a group of objects.
(b) How might you define the distance between two sets of points in Euclidean space?
(c) How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)
25. You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.)

(a) If the goal is to find all points within a specified distance ε of point y, y ≠ x, explain how you could use the triangle inequality and the already calculated distances to x to potentially reduce the number of distance calculations necessary. Hint: The triangle inequality, d(x, z) ≤ d(x, y) + d(y, z), can be rewritten as d(x, y) ≥ d(x, z) − d(y, z).

(b) In general, how would the distance between x and y affect the number of distance calculations?

(c) Suppose that you can find a small subset of points S′, from the original data set, such that every point in the data set is within a specified distance ε of at least one of the points in S′, and that you also have the pairwise distance matrix for S′. Describe a technique that uses this information to compute, with a minimum of distance calculations, the set of all points within a distance ε of a specified point from the data set.
26. Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, $d(\mathbf{x}, \mathbf{y}) = 1 - J(\mathbf{x}, \mathbf{y})$.

27. Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 70. Specifically, $d(\mathbf{x}, \mathbf{y}) = \arccos(\cos(\mathbf{x}, \mathbf{y}))$.

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
3
Exploring Data

The previous chapter addressed high-level data issues that are important in the knowledge discovery process. This chapter provides an introduction to data exploration, which is a preliminary investigation of the data in order to better understand its specific characteristics. Data exploration can aid in selecting the appropriate preprocessing and data analysis techniques. It can even address some of the questions typically answered by data mining. For example, patterns can sometimes be found by visually inspecting the data. Also, some of the techniques used in data exploration, such as visualization, can be used to understand and interpret data mining results.

This chapter covers three major topics: summary statistics, visualization, and On-Line Analytical Processing (OLAP). Summary statistics, such as the mean and standard deviation of a set of values, and visualization techniques, such as histograms and scatter plots, are standard methods that are widely employed for data exploration. OLAP, which is a more recent development, consists of a set of techniques for exploring multidimensional arrays of values. OLAP-related analysis functions focus on various ways to create summary data tables from a multidimensional data array. These techniques include aggregating data either across various dimensions or across various attribute values. For instance, if we are given sales information reported according to product, location, and date, OLAP techniques can be used to create a summary that describes the sales activity at a particular location by month and product category.

The topics covered in this chapter have considerable overlap with the area known as Exploratory Data Analysis (EDA), which was created in the 1970s by the prominent statistician, John Tukey. This chapter, like EDA, places a heavy emphasis on visualization. Unlike EDA, this chapter does not include topics such as cluster analysis or anomaly detection. There are two reasons for this. First, data mining views descriptive data analysis techniques as an end in themselves, whereas statistics, from which EDA originated, tends to view hypothesis-based testing as the final goal. Second, cluster analysis and anomaly detection are large areas and require full chapters for an in-depth discussion. Hence, cluster analysis is covered in Chapters 8 and 9, while anomaly detection is discussed in Chapter 10.
3.1 The Iris Data Set

In the following discussion, we will often refer to the Iris data set that is available from the University of California at Irvine (UCI) Machine Learning Repository. It consists of information on 150 Iris flowers, 50 each from one of three Iris species: Setosa, Versicolour, and Virginica. Each flower is characterized by five attributes:

1. sepal length in centimeters
2. sepal width in centimeters
3. petal length in centimeters
4. petal width in centimeters
5. class (Setosa, Versicolour, Virginica)

The sepals of a flower are the outer structures that protect the more fragile parts of the flower, such as the petals. In many flowers, the sepals are green, and only the petals are colorful. For Irises, however, the sepals are also colorful. As illustrated by the picture of a Virginica Iris in Figure 3.1, the sepals of an Iris are larger than the petals and are drooping, while the petals are upright.
3.2 Summary Statistics

Summary statistics are quantities, such as the mean and standard deviation, that capture various characteristics of a potentially large set of values with a single number or a small set of numbers. Everyday examples of summary statistics are the average household income or the fraction of college students who complete an undergraduate degree in four years. Indeed, for many people, summary statistics are the most visible manifestation of statistics. We will concentrate on summary statistics for the values of a single attribute, but will provide a brief description of some multivariate summary statistics.
Figure 3.1. Picture of Iris Virginica. Robert H. Mohlenbrock @ USDA-NRCS PLANTS Database / USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Background removed.
This section considers only the descriptive nature of summary statistics. However, as described in Appendix C, statistics views data as arising from an underlying statistical process that is characterized by various parameters, and some of the summary statistics discussed here can be viewed as estimates of statistical parameters of the underlying distribution that generated the data.
3.2.1 Frequencies and the Mode

Given a set of unordered categorical values, there is not much that can be done to further characterize the values except to compute the frequency with which each value occurs for a particular set of data. Given a categorical attribute $x$, which can take values $\{v_1, \ldots, v_i, \ldots, v_k\}$ and a set of $m$ objects, the frequency of a value $v_i$ is defined as

$$\mathrm{frequency}(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}. \qquad (3.1)$$

The mode of a categorical attribute is the value that has the highest frequency.
Example 3.1. Consider a set of students who have an attribute, class, which can take values from the set {freshman, sophomore, junior, senior}. Table 3.1 shows the number of students for each value of the class attribute. The mode of the class attribute is freshman, with a frequency of 0.33. This may indicate dropouts due to attrition or a larger than usual freshman class.

Table 3.1. Class size for students in a hypothetical college.

Class      Size  Frequency
freshman   200   0.33
sophomore  160   0.27
junior     130   0.22
senior     110   0.18
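The frequencies and mode are straightforward to compute; the following Python sketch reproduces the values of Table 3.1:

    from collections import Counter

    classes = (["freshman"] * 200 + ["sophomore"] * 160 +
               ["junior"] * 130 + ["senior"] * 110)
    m = len(classes)
    frequencies = {v: count / m for v, count in Counter(classes).items()}
    mode = max(frequencies, key=frequencies.get)
    print(frequencies)    # freshman 0.33, sophomore 0.27, junior 0.22, senior 0.18
    print("mode:", mode)  # freshman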
Categorical attributes often, but not always, have a small number of values, and consequently, the mode and frequencies of these values can be interesting and useful. Notice, though, that for the Iris data set and the class attribute, the three types of flower all have the same frequency, and therefore, the notion of a mode is not interesting.

For continuous data, the mode, as currently defined, is often not useful because a single value may not occur more than once. Nonetheless, in some cases, the mode may indicate important information about the nature of the values or the presence of missing values. For example, the heights of 20 people measured to the nearest millimeter will typically not repeat, but if the heights are measured to the nearest tenth of a meter, then some people may have the same height. Also, if a unique value is used to indicate a missing value, then this value will often show up as the mode.
3.2.2 Percentiles

For ordered data, it is more useful to consider the percentiles of a set of values. In particular, given an ordinal or continuous attribute $x$ and a number $p$ between 0 and 100, the $p$th percentile $x_p$ is a value of $x$ such that $p\%$ of the observed values of $x$ are less than $x_p$. For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of $x$ are less than $x_{50\%}$. Table 3.2 shows the percentiles for the four quantitative attributes of the Iris data set.
Table 3.2. Percentiles for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Percentile  Sepal Length  Sepal Width  Petal Length  Petal Width
0           4.3           2.0          1.0           0.1
10          4.8           2.5          1.4           0.2
20          5.0           2.7          1.5           0.2
30          5.2           2.8          1.7           0.4
40          5.6           3.0          3.9           1.2
50          5.8           3.0          4.4           1.3
60          6.1           3.1          4.6           1.5
70          6.3           3.2          5.0           1.8
80          6.6           3.4          5.4           1.9
90          6.9           3.6          5.8           2.2
100         7.9           4.4          6.9           2.5

Example 3.2. The percentiles, $x_{0\%}, x_{10\%}, \ldots, x_{90\%}, x_{100\%}$, of the integers from 1 to 10 are, in order, the following: 1.0, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.0. By tradition, $x_{0\%} = \min(x)$ and $x_{100\%} = \max(x)$.
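These values can be reproduced with NumPy. Several percentile conventions exist; the "midpoint" method happens to match the convention of Example 3.2 (in NumPy versions before 1.22, the keyword is interpolation rather than method):

    import numpy as np

    x = np.arange(1, 11)       # the integers 1 to 10
    p = np.arange(0, 101, 10)  # 0%, 10%, ..., 100%
    print(np.percentile(x, p, method="midpoint"))
    # [ 1.   1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10. ]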
3.2.3 Measures of Location: Mean and Median

For continuous data, two of the most widely used summary statistics are the mean and median, which are measures of the location of a set of values. Consider a set of $m$ objects and an attribute $x$. Let $\{x_1, \ldots, x_m\}$ be the attribute values of $x$ for these $m$ objects. As a concrete example, these values might be the heights of $m$ children. Let $\{x_{(1)}, \ldots, x_{(m)}\}$ represent the values of $x$ after they have been sorted in non-decreasing order. Thus, $x_{(1)} = \min(x)$ and $x_{(m)} = \max(x)$. Then, the mean and median are defined as follows:

$$\mathrm{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (3.2)$$

$$\mathrm{median}(x) = \begin{cases} x_{(r+1)} & \text{if $m$ is odd, i.e., $m = 2r + 1$} \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right) & \text{if $m$ is even, i.e., $m = 2r$} \end{cases} \qquad (3.3)$$

To summarize, the median is the middle value if there are an odd number of values, and the average of the two middle values if the number of values is even. Thus, for seven values, the median is $x_{(4)}$, while for ten values, the median is $\frac{1}{2}(x_{(5)} + x_{(6)})$.
Although the mean is sometimes interpreted as the middle of a set of values, this is only correct if the values are distributed in a symmetric manner. If the distribution of values is skewed, then the median is a better indicator of the middle. Also, the mean is sensitive to the presence of outliers. For data with outliers, the median again provides a more robust estimate of the middle of a set of values.

To overcome problems with the traditional definition of a mean, the notion of a trimmed mean is sometimes used. A percentage $p$ between 0 and 100 is specified, the top and bottom $(p/2)\%$ of the data is thrown out, and the mean is then calculated in the normal way. The median is a trimmed mean with $p = 100\%$, while the standard mean corresponds to $p = 0\%$.

Example 3.3. Consider the set of values {1, 2, 3, 4, 5, 90}. The mean of these values is 17.5, while the median is 3.5. The trimmed mean with $p = 40\%$ is also 3.5.
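The values in Example 3.3 are easy to verify with NumPy and SciPy; note that SciPy's trim_mean takes the fraction cut from each end, so $p = 40\%$ corresponds to an argument of 0.2:

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5, 90])
    print(np.mean(x))               # 17.5
    print(np.median(x))             # 3.5
    print(stats.trim_mean(x, 0.2))  # trimmed mean with p = 40%: 3.5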
Example 3.4. The means, medians, and trimmed means ($p = 20\%$) of the four quantitative attributes of the Iris data are given in Table 3.3. The three measures of location have similar values except for the attribute petal length.

Table 3.3. Means and medians for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Measure             Sepal Length  Sepal Width  Petal Length  Petal Width
mean                5.84          3.05         3.76          1.20
median              5.80          3.00         4.35          1.30
trimmed mean (20%)  5.79          3.02         3.72          1.12
3.2.4 Measures of Spread: Range and Variance

Another set of commonly used summary statistics for continuous data are those that measure the dispersion or spread of a set of values. Such measures indicate if the attribute values are widely spread out or if they are relatively concentrated around a single point such as the mean.

The simplest measure of spread is the range, which, given an attribute $x$ with a set of $m$ values $\{x_1, \ldots, x_m\}$, is defined as

$$\mathrm{range}(x) = \max(x) - \min(x) = x_{(m)} - x_{(1)}. \qquad (3.4)$$
Table 3.4. Range, standard deviation (std), absolute average difference (AAD), median absolute difference (MAD), and interquartile range (IQR) for sepal length, sepal width, petal length, and petal width. (All values are in centimeters.)

Measure  Sepal Length  Sepal Width  Petal Length  Petal Width
range    3.6           2.4          5.9           2.4
std      0.8           0.4          1.8           0.8
AAD      0.7           0.3          1.6           0.6
MAD      0.7           0.3          1.2           0.7
IQR      1.3           0.5          3.5           1.5
Although the range identifies the maximum spread, it can be misleading if most of the values are concentrated in a narrow band of values, but there are also a relatively small number of more extreme values. Hence, the variance is preferred as a measure of spread. The variance of the (observed) values of an attribute $x$ is typically written as $s_x^2$ and is defined below. The standard deviation, which is the square root of the variance, is written as $s_x$ and has the same units as $x$.

$$\mathrm{variance}(x) = s_x^2 = \frac{1}{m - 1} \sum_{i=1}^{m} (x_i - \bar{x})^2 \qquad (3.5)$$

The mean can be distorted by outliers, and since the variance is computed using the mean, it is also sensitive to outliers. Indeed, the variance is particularly sensitive to outliers since it uses the squared difference between the mean and other values. As a result, more robust estimates of the spread of a set of values are often used. Following are the definitions of three such measures: the absolute average deviation (AAD), the median absolute deviation (MAD), and the interquartile range (IQR). Table 3.4 shows these measures for the Iris data set.

$$\mathrm{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}| \qquad (3.6)$$

$$\mathrm{MAD}(x) = \mathrm{median}\bigl(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\bigr) \qquad (3.7)$$

$$\text{interquartile range}(x) = x_{75\%} - x_{25\%} \qquad (3.8)$$
3.2.5 Multivariate Summary Statistics

Measures of location for data that consists of several attributes (multivariate data) can be obtained by computing the mean or median separately for each attribute. Thus, given a data set, the mean of the data objects, $\bar{\mathbf{x}}$, is given by

$$\bar{\mathbf{x}} = (\bar{x}_1, \ldots, \bar{x}_n), \qquad (3.9)$$

where $\bar{x}_i$ is the mean of the $i$th attribute $x_i$.

For multivariate data, the spread of each attribute can be computed independently of the other attributes using any of the approaches described in Section 3.2.4. However, for data with continuous variables, the spread of the data is most commonly captured by the covariance matrix $\mathbf{S}$, whose $ij$th entry $s_{ij}$ is the covariance of the $i$th and $j$th attributes of the data. Thus, if $x_i$ and $x_j$ are the $i$th and $j$th attributes, then

$$s_{ij} = \mathrm{covariance}(x_i, x_j). \qquad (3.10)$$

In turn, $\mathrm{covariance}(x_i, x_j)$ is given by

$$\mathrm{covariance}(x_i, x_j) = \frac{1}{m - 1} \sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j), \qquad (3.11)$$

where $x_{ki}$ and $x_{kj}$ are the values of the $i$th and $j$th attributes for the $k$th object. Notice that $\mathrm{covariance}(x_i, x_i) = \mathrm{variance}(x_i)$. Thus, the covariance matrix has the variances of the attributes along the diagonal.

The covariance of two attributes is a measure of the degree to which two attributes vary together and depends on the magnitudes of the variables. A value near 0 indicates that two attributes do not have a (linear) relationship, but it is not possible to judge the degree of relationship between two variables by looking only at the value of the covariance. Because the correlation of two attributes immediately gives an indication of how strongly two attributes are (linearly) related, correlation is preferred to covariance for data exploration. (Also see the discussion of correlation in Section 2.4.5.) The $ij$th entry of the correlation matrix $\mathbf{R}$ is the correlation between the $i$th and $j$th attributes of the data. If $x_i$ and $x_j$ are the $i$th and $j$th attributes, then

$$r_{ij} = \mathrm{correlation}(x_i, x_j) = \frac{\mathrm{covariance}(x_i, x_j)}{s_i s_j}, \qquad (3.12)$$

where $s_i$ and $s_j$ are the standard deviations of $x_i$ and $x_j$, respectively. The diagonal entries of $\mathbf{R}$ are $\mathrm{correlation}(x_i, x_i) = 1$, while the other entries are between −1 and 1. It is also useful to consider correlation matrices that contain the pairwise correlations of objects instead of attributes.
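Both matrices can be obtained directly with NumPy; the 150 × 4 array below is randomly generated stand-in data (not the actual Iris measurements), with rows as objects and columns as attributes:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(150, 4))  # hypothetical data: 150 objects, 4 attributes

    S = np.cov(X, rowvar=False)       # covariance matrix, Equations 3.10-3.11
    R = np.corrcoef(X, rowvar=False)  # correlation matrix, Equation 3.12
    print(np.allclose(np.diag(S), X.var(axis=0, ddof=1)))  # variances on the diagonal
    print(np.allclose(np.diag(R), 1.0))                    # diagonal of R is 1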
3.2.6 Other Ways to Summarize the Data

There are, of course, other types of summary statistics. For instance, the skewness of a set of values measures the degree to which the values are symmetrically distributed around the mean. There are also other characteristics of the data that are not easy to measure quantitatively, such as whether the distribution of values is multimodal; i.e., the data has multiple "bumps" where most of the values are concentrated. In many cases, however, the most effective approach to understanding the more complicated or subtle aspects of how the values of an attribute are distributed, is to view the values graphically in the form of a histogram. (Histograms are discussed in the next section.)
3.3 Visualization

Data visualization is the display of information in a graphic or tabular format. Successful visualization requires that the data (information) be converted into a visual format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. The goal of visualization is the interpretation of the visualized information by a person and the formation of a mental model of the information.

In everyday life, visual techniques such as graphs and tables are often the preferred approach used to explain the weather, the economy, and the results of political elections. Likewise, while algorithmic or mathematical approaches are often emphasized in most technical disciplines, data mining included, visual techniques can play a key role in data analysis. In fact, sometimes the use of visualization techniques in data mining is referred to as visual data mining.
3.3.1 Motivations for Visualization

The overriding motivation for using visualization is that people can quickly absorb large amounts of visual information and find patterns in it. Consider Figure 3.2, which shows the Sea Surface Temperature (SST) in degrees Celsius for July, 1982. This picture summarizes the information from approximately 250,000 numbers and is readily interpreted in a few seconds. For example, it is easy to see that the ocean temperature is highest at the equator and lowest at the poles.

Figure 3.2. Sea Surface Temperature (SST) for July, 1982. (The plot maps temperature, from 0 to 30 degrees Celsius, over longitude and latitude.)
Another general motivation for visualization is to make use of the domain knowledge that is "locked up in people's heads." While the use of domain knowledge is an important task in data mining, it is often difficult or impossible to fully utilize such knowledge in statistical or algorithmic tools. In some cases, an analysis can be performed using non-visual tools, and then the results presented visually for evaluation by the domain expert. In other cases, having a domain specialist examine visualizations of the data may be the best way of finding patterns of interest since, by using domain knowledge, a person can often quickly eliminate many uninteresting patterns and direct the focus to the patterns that are important.
3.3.2 General Concepts

This section explores some of the general concepts related to visualization, in particular, general approaches for visualizing the data and its attributes. A number of visualization techniques are mentioned briefly and will be described in more detail when we discuss specific approaches later on. We assume that the reader is familiar with line graphs, bar charts, and scatter plots.
Representation: Mapping Data to Graphical Elements

The first step in visualization is the mapping of information to a visual format; i.e., mapping the objects, attributes, and relationships in a set of information to visual objects, attributes, and relationships. That is, data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.

Objects are usually represented in one of three ways. First, if only a single categorical attribute of the object is being considered, then objects are often lumped into categories based on the value of that attribute, and these categories are displayed as an entry in a table or an area on a screen. (Examples shown later in this chapter are a cross-tabulation table and a bar chart.) Second, if an object has multiple attributes, then the object can be displayed as a row (or column) of a table or as a line on a graph. Finally, an object is often interpreted as a point in two- or three-dimensional space, where graphically, the point might be represented by a geometric figure, such as a circle, cross, or box.

For attributes, the representation depends on the type of attribute, i.e., nominal, ordinal, or continuous (interval or ratio). Ordinal and continuous attributes can be mapped to continuous, ordered graphical features such as location along the x, y, or z axes; intensity; color; or size (diameter, width, height, etc.). For categorical attributes, each category can be mapped to a distinct position, color, shape, orientation, embellishment, or column in a table. However, for nominal attributes, whose values are unordered, care should be taken when using graphical features, such as color and position, that have an inherent ordering associated with their values. In other words, the graphical elements used to represent the nominal values often have an order, but nominal values do not.

The representation of relationships via graphical elements occurs either explicitly or implicitly. For graph data, the standard graph representation, a set of nodes with links between the nodes, is normally used. If the nodes (data objects) or links (relationships) have attributes or characteristics of their own, then this is represented graphically. To illustrate, if the nodes are cities and the links are highways, then the diameter of the nodes might represent population, while the width of the links might represent the volume of traffic.

In most cases, though, mapping objects and attributes to graphical elements implicitly maps the relationships in the data to relationships among graphical elements. To illustrate, if the data object represents a physical object that has a location, such as a city, then the relative positions of the graphical objects corresponding to the data objects tend to naturally preserve the actual relative positions of the objects. Likewise, if there are two or three continuous attributes that are taken as the coordinates of the data points, then the resulting plot often gives considerable insight into the relationships of the attributes and the data points because data points that are visually close to each other have similar values for their attributes.

In general, it is difficult to ensure that a mapping of objects and attributes will result in the relationships being mapped to easily observed relationships among graphical elements. Indeed, this is one of the most challenging aspects of visualization. In any given set of data, there are many implicit relationships, and hence, a key challenge of visualization is to choose a technique that makes the relationships of interest easily observable.
Arrangement

As discussed earlier, the proper choice of visual representation of objects and attributes is essential for good visualization. The arrangement of items within the visual display is also crucial. We illustrate this with two examples.

Example 3.5. This example illustrates the importance of rearranging a table of data. In Table 3.5, which shows nine objects with six binary attributes, there is no clear relationship between objects and attributes, at least at first glance. If the rows and columns of this table are permuted, however, as shown in Table 3.6, then it is clear that there are really only two types of objects in the table: one that has all ones for the first three attributes and one that has only ones for the last three attributes.

Table 3.5. A table of nine objects (rows) with six binary attributes (columns).

    1 2 3 4 5 6
1   0 1 0 1 1 0
2   1 0 1 0 0 1
3   0 1 0 1 1 0
4   1 0 1 0 0 1
5   0 1 0 1 1 0
6   1 0 1 0 0 1
7   0 1 0 1 1 0
8   1 0 1 0 0 1
9   0 1 0 1 1 0

Table 3.6. A table of nine objects (rows) with six binary attributes (columns) permuted so that the relationships of the rows and columns are clear.

    6 1 3 2 5 4
4   1 1 1 0 0 0
2   1 1 1 0 0 0
6   1 1 1 0 0 0
8   1 1 1 0 0 0
5   0 0 0 1 1 1
3   0 0 0 1 1 1
9   0 0 0 1 1 1
1   0 0 0 1 1 1
7   0 0 0 1 1 1
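The permutation itself is a one-line operation; a small NumPy sketch (with the rows and columns of Table 3.5 reordered as in Table 3.6) illustrates the idea:

    import numpy as np

    T = np.array([[0, 1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1, 0]])  # Table 3.5 (rows = objects 1-9)

    rows = np.array([4, 2, 6, 8, 5, 3, 9, 1, 7]) - 1  # row order of Table 3.6
    cols = np.array([6, 1, 3, 2, 5, 4]) - 1           # column order of Table 3.6
    print(T[np.ix_(rows, cols)])  # the two object types become visible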
Example 3.6. Consider Figure 3.3(a), which shows a visualization of a graph. If the connected components of the graph are separated, as in Figure 3.3(b), then the relationships between nodes and graphs become much simpler to understand.

Figure 3.3. Two visualizations of a graph: (a) original view of the graph; (b) uncoupled view of the connected components of the graph.
Selection

Another key concept in visualization is selection, which is the elimination or the de-emphasis of certain objects and attributes. Specifically, while data objects that only have a few dimensions can often be mapped to a two- or three-dimensional graphical representation in a straightforward way, there is no completely satisfactory and general approach to represent data with many attributes. Likewise, if there are many data objects, then visualizing all the objects can result in a display that is too crowded. If there are many attributes and many objects, then the situation is even more challenging.

The most common approach to handling many attributes is to choose a subset of attributes, usually two, for display. If the dimensionality is not too high, a matrix of bivariate (two-attribute) plots can be constructed for simultaneous viewing. (Figure 3.16 shows a matrix of scatter plots for the pairs of attributes of the Iris data set.) Alternatively, a visualization program can automatically show a series of two-dimensional plots, in which the sequence is user-directed or based on some predefined strategy. The hope is that visualizing a collection of two-dimensional plots will provide a more complete view of the data.
The technique of selecting a pair (or small number) of attributes is a type of dimensionality reduction, and there are many more sophisticated dimensionality reduction techniques that can be employed, e.g., principal components analysis (PCA). Consult Appendices A (Linear Algebra) and B (Dimensionality Reduction) for more information.

When the number of data points is high, e.g., more than a few hundred, or if the range of the data is large, it is difficult to display enough information about each object. Some data points can obscure other data points, or a data object may not occupy enough pixels to allow its features to be clearly displayed. For example, the shape of an object cannot be used to encode a characteristic of that object if there is only one pixel available to display it. In these situations, it is useful to be able to eliminate some of the objects, either by zooming in on a particular region of the data or by taking a sample of the data points.
3.3.3 Techniques

Visualization techniques are often specialized to the type of data being analyzed. Indeed, new visualization techniques and approaches, as well as specialized variations of existing approaches, are being continuously created, typically in response to new kinds of data and visualization tasks.

Despite this specialization and the ad hoc nature of visualization, there are some generic ways to classify visualization techniques. One such classification is based on the number of attributes involved (1, 2, 3, or many) or whether the data has some special characteristic, such as a hierarchical or graph structure. Visualization methods can also be classified according to the type of attributes involved. Yet another classification is based on the type of application: scientific, statistical, or information visualization. The following discussion will use three categories: visualization of a small number of attributes, visualization of data with spatial and/or temporal attributes, and visualization of data with many attributes.

Most of the visualization techniques discussed here can be found in a wide variety of mathematical and statistical packages, some of which are freely available. There are also a number of data sets that are freely available on the World Wide Web. Readers are encouraged to try these visualization techniques as they proceed through the following sections.
Visualizing Small Numbers of Attributes

This section examines techniques for visualizing data with respect to a small number of attributes. Some of these techniques, such as histograms, give insight into the distribution of the observed values for a single attribute. Other techniques, such as scatter plots, are intended to display the relationships between the values of two attributes.
Stem and Leaf Plots Stem and leaf plots can be used to provide insight into the distribution of one-dimensional integer or continuous data. (We will assume integer data initially, and then explain how stem and leaf plots can be applied to continuous data.) For the simplest type of stem and leaf plot, we split the values into groups, where each group contains those values that are the same except for the last digit. Each group becomes a stem, while the last digits of a group are the leaves. Hence, if the values are two-digit integers, e.g., 35, 36, 42, and 51, then the stems will be the high-order digits, e.g., 3, 4, and 5, while the leaves are the low-order digits, e.g., 1, 2, 5, and 6. By plotting the stems vertically and leaves horizontally, we can provide a visual representation of the distribution of the data.
Example 3.7. The set of integers shown in Figure 3.4 is the sepal length in centimeters (multiplied by 10 to make the values integers) taken from the Iris data set. For convenience, the values have also been sorted.

The stem and leaf plot for this data is shown in Figure 3.5. Each number in Figure 3.4 is first put into one of the vertical groups (4, 5, 6, or 7) according to its tens digit. Its last digit is then placed to the right of the colon. Often, especially if the amount of data is larger, it is desirable to split the stems. For example, instead of placing all values whose tens digit is 4 in the same bucket, the stem 4 is repeated twice; all values 40–44 are put in the bucket corresponding to the first stem and all values 45–49 are put in the bucket corresponding to the second stem. This approach is shown in the stem and leaf plot of Figure 3.6. Other variations are also possible.
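The construction just described is easy to automate. The following minimal sketch, written in Python (our own choice; the book's figures were produced with other tools), prints plots in the style of Figures 3.5 and 3.6 from a list of two-digit integers. The split_stems flag is a hypothetical option we introduce for the split-stem variation.

    from collections import defaultdict

    def stem_and_leaf(values, split_stems=False):
        # Group values by their stem (the high-order digit); optionally
        # split each stem into low (0-4) and high (5-9) leaf buckets.
        groups = defaultdict(list)
        for v in sorted(values):
            stem, leaf = divmod(v, 10)
            key = (stem, leaf >= 5) if split_stems else (stem, False)
            groups[key].append(leaf)
        for (stem, _), leaves in sorted(groups.items()):
            print(stem, ':', ''.join(str(leaf) for leaf in leaves))

    # Abbreviated sepal length data (cm x 10), as in Figure 3.4.
    values = [43, 44, 44, 44, 45, 46, 51, 52, 57, 63, 64, 70, 77, 79]
    stem_and_leaf(values)                     # one stem per tens digit
    stem_and_leaf(values, split_stems=True)   # split stems, as in Figure 3.6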
Histograms Stem and leaf plots are a type of histogram, a plot that displays the distribution of values for attributes by dividing the possible values into bins and showing the number of objects that fall into each bin. For categorical data, each value is a bin. If this results in too many values, then values are combined in some way. For continuous attributes, the range of values is divided into bins (typically, but not necessarily, of equal width) and the values in each bin are counted.
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70
71 72 72 72 73 74 76 77 77 77 77 79

Figure 3.4. Sepal length data from the Iris data set.

4 : 34444566667788888999999
5 : 0000000000111111111222234444445555555666666777777778888888999
6 : 000000111111222233333333344444445555566777777778889999
7 : 0122234677779

Figure 3.5. Stem and leaf plot for the sepal length from the Iris data set.

4 : 3444
4 : 566667788888999999
5 : 000000000011111111122223444444
5 : 5555555666666777777778888888999
6 : 00000011111122223333333334444444
6 : 5555566777777778889999
7 : 0122234
7 : 677779

Figure 3.6. Stem and leaf plot for the sepal length from the Iris data set when buckets corresponding to digits are split.
Once the counts are available for each bin, a bar plot is constructed such that each bin is represented by one bar and the area of each bar is proportional to the number of values (objects) that fall into the corresponding range. If all intervals are of equal width, then all bars are the same width and the height of a bar is proportional to the number of values in the corresponding bin.
Example 3.8. Figure 3.7 shows histograms (with 10 bins) for sepal length, sepal width, petal length, and petal width. Since the shape of a histogram can depend on the number of bins, histograms for the same data, but with 20 bins, are shown in Figure 3.8.
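A sketch of how such histograms might be produced, assuming Python with matplotlib and the copy of the Iris data bundled with scikit-learn (the text itself does not prescribe a toolkit):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    sepal_length = iris.data[:, 0]         # first column is sepal length (cm)

    fig, axes = plt.subplots(1, 2)
    axes[0].hist(sepal_length, bins=10)    # compare Figure 3.7(a)
    axes[1].hist(sepal_length, bins=20)    # compare Figure 3.8(a)
    for ax in axes:
        ax.set_xlabel('Sepal Length')
    axes[0].set_ylabel('Count')
    plt.show()

Changing only the bins argument changes the shape of the plot, which is the point of the comparison between Figures 3.7 and 3.8.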
Figure 3.7. Histograms of four Iris attributes (10 bins): (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.

Figure 3.8. Histograms of four Iris attributes (20 bins): (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.

There are variations of the histogram plot. A relative (frequency) histogram replaces the count by the relative frequency. However, this is just a change in scale of the y axis, and the shape of the histogram does not change. Another common variation, especially for unordered categorical data, is the Pareto histogram, which is the same as a normal histogram except that the categories are sorted by count so that the count is decreasing from left to right.
Two-Dimensional Histograms Two-dimensional histograms are also possible. Each attribute is divided into intervals and the two sets of intervals define two-dimensional rectangles of values.

Example 3.9. Figure 3.9 shows a two-dimensional histogram of petal length and petal width. Because each attribute is split into three bins, there are nine rectangular two-dimensional bins. The height of each rectangular bar indicates the number of objects (flowers in this case) that fall into each bin. Most of the flowers fall into only three of the bins, those along the diagonal. It is not possible to see this by looking at the one-dimensional distributions.
Figure 3.9. Two-dimensional histogram of petal length and width in the Iris data set.
While two-dimensional histograms can be used to discover interesting facts about how the values of two attributes co-occur, they are visually more complicated. For instance, it is easy to imagine a situation in which some of the columns are hidden by others.
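A two-dimensional histogram can be sketched with matplotlib's hist2d, which renders the counts as a shaded grid rather than the raised bars of Figure 3.9 (the concentration along the diagonal is visible either way). The Iris data is again assumed to come from scikit-learn:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    petal_length, petal_width = iris.data[:, 2], iris.data[:, 3]

    # Three bins per attribute yield the nine two-dimensional bins of Figure 3.9.
    plt.hist2d(petal_length, petal_width, bins=3)
    plt.xlabel('Petal Length')
    plt.ylabel('Petal Width')
    plt.colorbar(label='Count')
    plt.show()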
Box Plots Box plots are another method for showing the distribution of the values of a single numerical attribute. Figure 3.10 shows a labeled box plot for sepal length. The lower and upper ends of the box indicate the 25th and 75th percentiles, respectively, while the line inside the box indicates the value of the 50th percentile. The top and bottom lines of the tails indicate the 10th and 90th percentiles. Outliers are shown by + marks. Box plots are relatively compact, and thus, many of them can be shown on the same plot. Simplified versions of the box plot, which take less space, can also be used.

Example 3.10. The box plots for the first four attributes of the Iris data set are shown in Figure 3.11. Box plots can also be used to compare how attributes vary between different classes of objects, as shown in Figure 3.12.
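A minimal box plot sketch follows. Note that matplotlib's default whiskers extend to 1.5 times the interquartile range, so the whis parameter is set to the 10th and 90th percentiles here to match the convention of Figure 3.10 (Python and scikit-learn's Iris copy are our assumptions):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    # One box per attribute, as in Figure 3.11; whiskers at the 10th and
    # 90th percentiles, with points beyond them drawn as '+' outliers.
    plt.boxplot([iris.data[:, i] for i in range(4)], whis=(10, 90), sym='+')
    plt.xticks([1, 2, 3, 4], iris.feature_names, rotation=20)
    plt.ylabel('Values (centimeters)')
    plt.show()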
Pie Chart A pie chart is similar to a histogram, but is typically used with categorical attributes that have a relatively small number of values. Instead of showing the relative frequency of different values with the area or height of a bar, as in a histogram, a pie chart uses the relative area of a circle to indicate relative frequency. Although pie charts are common in popular articles, they are used less frequently in technical publications because the size of relative areas can be hard to judge. Histograms are preferred for technical work.

Figure 3.10. Description of box plot for sepal length, showing the 10th, 25th, 50th, 75th, and 90th percentiles; outliers appear as + marks.

Figure 3.11. Box plot for Iris attributes: sepal length, sepal width, petal length, and petal width.

Figure 3.12. Box plots of attributes by Iris species: (a) Setosa, (b) Versicolour, (c) Virginica.
Example 3.11. Figure 3.13 displays a pie chart that shows the distribution of Iris species in the Iris data set. In this case, all three flower types have the same frequency.

Figure 3.13. Distribution of the types of Iris flowers.

Percentile Plots and Empirical Cumulative Distribution Functions
A type of diagram that shows the distribution of the data more quantitatively is the plot of an empirical cumulative distribution function. While this type of plot may sound complicated, the concept is straightforward. For each value of a statistical distribution, a cumulative distribution function (CDF) shows the probability that a point is less than that value. For each observed value, an empirical cumulative distribution function (ECDF) shows the fraction of points that are less than this value. Since the number of points is finite, the empirical cumulative distribution function is a step function.

Example 3.12. Figure 3.14 shows the ECDFs of the Iris attributes. The percentiles of an attribute provide similar information. Figure 3.15 shows the percentile plots of the four continuous attributes of the Iris data set from Table 3.2. The reader should compare these figures with the histograms given in Figures 3.7 and 3.8.
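Because the ECDF is just the fraction of observations at or below each observed value, it can be computed directly from the sorted data. A sketch, again assuming Python and scikit-learn's Iris copy:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    x = np.sort(load_iris().data[:, 0])      # sepal length, sorted
    f = np.arange(1, len(x) + 1) / len(x)    # fraction of points <= x
    plt.step(x, f, where='post')             # the ECDF is a step function
    plt.xlabel('x')
    plt.ylabel('F(x)')
    plt.show()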
Scatter Plots Most people are familiar with scatter plots to some extent, and they were used in Section 2.4.5 to illustrate linear correlation. Each data object is plotted as a point in the plane using the values of the two attributes as x and y coordinates. It is assumed that the attributes are either integer- or real-valued.

Example 3.13. Figure 3.16 shows a scatter plot for each pair of attributes of the Iris data set. The different species of Iris are indicated by different markers. The arrangement of the scatter plots of pairs of attributes in this type of tabular format, which is known as a scatter plot matrix, provides an organized way to examine a number of scatter plots simultaneously.
Figure 3.14. Empirical CDFs of four Iris attributes: (a) sepal length, (b) sepal width, (c) petal length, (d) petal width.

Figure 3.15. Percentile plots for sepal length, sepal width, petal length, and petal width.

Figure 3.16. Matrix of scatter plots for the Iris data set.
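Scatter plot matrices are available in many packages. One possible sketch uses the scatter_matrix helper from pandas (our choice, not the book's):

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    # Color each point by species; histograms appear on the diagonal.
    scatter_matrix(df, c=iris.target, marker='o', diagonal='hist')
    plt.show()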
There are two main uses for scatter plots. First, they graphically show the relationship between two attributes. In Section 2.4.5, we saw how scatter plots could be used to judge the degree of linear correlation. (See Figure 2.17.) Scatter plots can also be used to detect non-linear relationships, either directly or by using a scatter plot of the transformed attributes.

Second, when class labels are available, they can be used to investigate the degree to which two attributes separate the classes. If it is possible to draw a line (or a more complicated curve) that divides the plane defined by the two attributes into separate regions that contain mostly objects of one class, then it is possible to construct an accurate classifier based on the specified pair of attributes. If not, then more attributes or more sophisticated methods are needed to build a classifier. In Figure 3.16, many of the pairs of attributes (for example, petal width and petal length) provide a moderate separation of the Iris species.
Example 3.14. There are two separate approaches for displaying three attributes of a data set with a scatter plot. First, each object can be displayed according to the values of three, instead of two, attributes. Figure 3.17 shows a three-dimensional scatter plot for three attributes in the Iris data set. Second, one of the attributes can be associated with some characteristic of the marker, such as its size, color, or shape. Figure 3.18 shows a plot of three attributes of the Iris data set, where one of the attributes, sepal width, is mapped to the size of the marker.
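Both approaches of Example 3.14 can be sketched as follows; the '3d' projection is registered automatically by recent versions of matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    sl, sw, pl, pw = iris.data.T

    fig = plt.figure()
    ax = fig.add_subplot(1, 2, 1, projection='3d')  # three attributes as axes
    ax.scatter(sw, sl, pw, c=iris.target)
    ax.set_xlabel('Sepal Width')
    ax.set_ylabel('Sepal Length')
    ax.set_zlabel('Petal Width')

    ax2 = fig.add_subplot(1, 2, 2)        # third attribute as marker size
    ax2.scatter(pl, pw, s=30 * sw, c=iris.target)
    ax2.set_xlabel('Petal Length')
    ax2.set_ylabel('Petal Width')
    plt.show()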
Extending Two- and Three-Dimensional Plots As illustrated by Figure 3.18, two- or three-dimensional plots can be extended to represent a few additional attributes. For example, scatter plots can display up to three additional attributes using color or shading, size, and shape, allowing five or six dimensions to be represented. There is a need for caution, however. As the complexity of a visual representation of the data increases, it becomes harder for the intended audience to interpret the information. There is no benefit in packing six dimensions' worth of information into a two- or three-dimensional plot, if doing so makes it impossible to understand.
Figure 3.17. Three-dimensional scatter plot of sepal width, sepal length, and petal width.

Figure 3.18. Scatter plot of petal length versus petal width, with the size of the marker indicating sepal width.

Visualizing Spatio-temporal Data
Data often has spatial or temporal attributes. For instance, the data may consist of a set of observations on a spatial grid, such as observations of pressure on the surface of the Earth or the modeled temperature at various grid points in the simulation of a physical object. These observations can also be made at various points in time. In addition, data may have only a temporal component, such as time series data that gives the daily prices of stocks.

Figure 3.19. Contour plot of SST for December 1998.
Contour Plots For some three-dimensional data, two attributes specify a position in a plane, while the third has a continuous value, such as temperature or elevation. A useful visualization for such data is a contour plot, which breaks the plane into separate regions where the values of the third attribute (temperature, elevation) are roughly the same. A common example of a contour plot is a contour map that shows the elevation of land locations.
Example 3.15. Figure 3.19 shows a contour plot of the average sea surface temperature (SST) for December 1998. The land is arbitrarily set to have a temperature of 0°C. In many contour maps, such as that of Figure 3.19, the contour lines that separate two regions are labeled with the value used to separate the regions. For clarity, some of these labels have been deleted.
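A contour plot sketch, using a synthetic temperature field in place of the SST grid of Figure 3.19 (the real data is not reproduced here):

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic field: warm at the equator, cold at the poles.
    lon, lat = np.meshgrid(np.linspace(-180, 180, 144),
                           np.linspace(-90, 90, 72))
    temp = 30 * np.cos(np.radians(lat))

    cs = plt.contour(lon, lat, temp, levels=[0, 5, 10, 15, 20, 25])
    plt.clabel(cs, inline=True, fmt='%d')   # label the contour lines
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.show()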
Surface Plots Like contour plots, surface plots use two attributes for the x and y coordinates. The third attribute is used to indicate the height above the plane defined by the first two attributes. While such graphs can be useful, they require that a value of the third attribute be defined for all combinations of values for the first two attributes, at least over some range. Also, if the surface is too irregular, then it can be difficult to see all the information, unless the plot is viewed interactively. Thus, surface plots are often used to describe mathematical functions or physical surfaces that vary in a relatively smooth manner.

Figure 3.20. Density of a set of 12 points: (a) the set of 12 points; (b) a surface plot of the overall density function.

Example 3.16. Figure 3.20 shows a surface plot of the density around a set of 12 points. This example is further discussed in Section 9.3.3.
Vector Field Plots In some data, a characteristic may have both a magnitude and a direction associated with it. For example, consider the flow of a substance or the change of density with location. In these situations, it can be useful to have a plot that displays both direction and magnitude. This type of plot is known as a vector plot.

Example 3.17. Figure 3.21 shows a contour plot of the density of the two smaller density peaks from Figure 3.20(b), annotated with the density gradient vectors.
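A vector plot like Figure 3.21 can be sketched by overlaying matplotlib's quiver (arrow) plot on a contour plot of a density; here a synthetic two-peak density stands in for the one in Figure 3.20:

    import numpy as np
    import matplotlib.pyplot as plt

    x, y = np.meshgrid(np.linspace(-3, 3, 25), np.linspace(-3, 3, 25))
    density = (np.exp(-((x - 1)**2 + y**2)) +
               np.exp(-((x + 1)**2 + y**2)))

    dy, dx = np.gradient(density)     # numerical gradient of the density
    plt.contour(x, y, density)        # contour plot of the density
    plt.quiver(x, y, dx, dy)          # gradient (change) vectors
    plt.show()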
Figure 3.21. Vector plot of the gradient (change) in density for the bottom two density peaks of Figure 3.20.

Lower-Dimensional Slices Consider a spatio-temporal data set that records some quantity, such as temperature or pressure, at various locations over time. Such a data set has four dimensions and cannot be easily displayed by the types of plots that we have described so far. However, separate slices of the data can be displayed by showing a set of plots, one for each month. By examining the change in a particular area from one month to another, it is possible to notice changes that occur, including those that may be due to seasonal factors.
Example 3.18. The underlying data set for this example consists of the average monthly sea level pressure (SLP) from 1982 to 1999 on a 2.5° by 2.5° latitude-longitude grid. The twelve monthly plots of pressure for one year are shown in Figure 3.22. In this example, we are interested in slices for a particular month in the year 1982. More generally, we can consider slices of the data along any arbitrary dimension.
Animation Another approach to dealing with slices of data, whether or not time is involved, is to employ animation. The idea is to display successive two-dimensional slices of the data. The human visual system is well suited to detecting visual changes and can often notice changes that might be difficult to detect in another manner. Despite the visual appeal of animation, a set of still plots, such as those of Figure 3.22, can be more useful since this type of visualization allows the information to be studied in arbitrary order and for arbitrary amounts of time.
Figure 3.22. Monthly plots of sea level pressure over the 12 months of 1982 (January through December).
3.3.4 Visualizing Higher-Dimensional Data
This section considers visualization techniques that can display more than the handful of dimensions that can be observed with the techniques just discussed. However, even these techniques are somewhat limited in that they only show some aspects of the data.

Matrices An image can be regarded as a rectangular array of pixels, where each pixel is characterized by its color and brightness. A data matrix is a rectangular array of values. Thus, a data matrix can be visualized as an image by associating each entry of the data matrix with a pixel in the image. The brightness or color of the pixel is determined by the value of the corresponding entry of the matrix.
Figure 3.23. Plot of the Iris data matrix where columns have been standardized to have a mean of 0 and standard deviation of 1.

Figure 3.24. Plot of the Iris correlation matrix.
There are some important practical considerations when visualizing a data matrix. If class labels are known, then it is useful to reorder the data matrix so that all objects of a class are together. This makes it easier, for example, to detect if all objects in a class have similar attribute values for some attributes. If different attributes have different ranges, then the attributes are often standardized to have a mean of zero and a standard deviation of 1. This prevents the attribute with the largest magnitude values from visually dominating the plot.

Example 3.19. Figure 3.23 shows the standardized data matrix for the Iris data set. The first 50 rows represent Iris flowers of the species Setosa, the next 50 Versicolour, and the last 50 Virginica. The Setosa flowers have petal width and length well below the average, while the Versicolour flowers have petal width and length around average. The Virginica flowers have petal width and length above average.
It can also be useful to look for structure in the plot of a proximity matrix for a set of data objects. Again, it is useful to sort the rows and columns of the similarity matrix (when class labels are known) so that all the objects of a class are together. This allows a visual evaluation of the cohesiveness of each class and its separation from other classes.

Example 3.20. Figure 3.24 shows the correlation matrix for the Iris data set. Again, the rows and columns are organized so that all the flowers of a particular species are together. The flowers in each group are most similar to each other, but Versicolour and Virginica are more similar to one another than to Setosa.
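A sketch of both matrix plots, with rows grouped by species and columns standardized as described above (Python and NumPy are our assumptions):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris

    iris = load_iris()
    order = np.argsort(iris.target, kind='stable')  # group rows by species
    X = iris.data[order]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each column

    fig, axes = plt.subplots(1, 2)
    axes[0].imshow(Z, aspect='auto')   # data matrix, as in Figure 3.23
    axes[1].imshow(np.corrcoef(X))     # 150 x 150 correlation between
    plt.show()                         # objects, as in Figure 3.24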
If class labels are not known, various techniques (matrix reordering and seriation) can be used to rearrange the rows and columns of the similarity matrix so that groups of highly similar objects and attributes are together and can be visually identified. Effectively, this is a simple kind of clustering. See Section 8.5.3 for a discussion of how a proximity matrix can be used to investigate the cluster structure of data.
Parallel Coordinates Parallel coordinates have one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular, as is traditional. Furthermore, an object is represented as a line instead of as a point. Specifically, the value of each attribute of an object is mapped to a point on the coordinate axis associated with that attribute, and these points are then connected to form the line that represents the object.

It might be feared that this would yield quite a mess. However, in many cases, objects tend to fall into a small number of groups, where the points in each group have similar values for their attributes. If so, and if the number of data objects is not too large, then the resulting parallel coordinates plot can reveal interesting patterns.
Example 3.21. Figure 3.25 shows a parallel coordinates plot of the four numerical attributes of the Iris data set. The lines representing objects of different classes are distinguished by their shading and the use of three different line styles: solid, dotted, and dashed. The parallel coordinates plot shows that the classes are reasonably well separated for petal width and petal length, but less well separated for sepal length and sepal width. Figure 3.26 is another parallel coordinates plot of the same data, but with a different ordering of the axes.

One of the drawbacks of parallel coordinates is that the detection of patterns in such a plot may depend on the order. For instance, if lines cross a lot, the picture can become confusing, and thus, it can be desirable to order the coordinate axes to obtain sequences of axes with less crossover. Compare Figure 3.26, where sepal width (the attribute that is most mixed) is at the left of the figure, to Figure 3.25, where this attribute is in the middle.
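One possible sketch uses the parallel_coordinates helper from pandas; reordering the DataFrame's columns reorders the axes, which is how the contrast between Figures 3.25 and 3.26 can be reproduced:

    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['species'] = [iris.target_names[t] for t in iris.target]

    # One axis per attribute; each flower becomes a line.
    parallel_coordinates(df, 'species')
    plt.ylabel('Value (centimeters)')
    plt.show()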
Figure 3.25. A parallel coordinates plot of the four Iris attributes.

Figure 3.26. A parallel coordinates plot of the four Iris attributes with the attributes reordered to emphasize similarities and dissimilarities of groups.

Star Coordinates and Chernoff Faces
Another approach to displaying multidimensional data is to encode objects as glyphs or icons, that is, symbols that impart information non-verbally. More specifically, each attribute of an object is mapped to a particular feature of a glyph, so that the value of the attribute determines the exact nature of the feature. Thus, at a glance, we can distinguish how two objects differ.
Star coordinates are one example of this approach. This technique uses one axis for each attribute. These axes all radiate from a center point, like the spokes of a wheel, and are evenly spaced. Typically, all the attribute values are mapped to the range [0,1].

An object is mapped onto this star-shaped set of axes using the following process: Each attribute value of the object is converted to a fraction that represents its distance between the minimum and maximum values of the attribute. This fraction is mapped to a point on the axis corresponding to this attribute. Each point is connected with a line segment to the point on the axis preceding or following its own axis; this forms a polygon. The size and shape of this polygon gives a visual description of the attribute values of the object. For ease of interpretation, a separate set of axes is used for each object. In other words, each object is mapped to a polygon. An example of a star coordinates plot of flower 150 is given in Figure 3.27(a).
It is also possible to map the values of features to those of more familiar objects, such as faces. This technique is named Chernoff faces for its creator, Herman Chernoff. In this technique, each attribute is associated with a specific feature of a face, and the attribute value is used to determine the way that the facial feature is expressed. Thus, the shape of the face may become more elongated as the value of the corresponding data feature increases. An example of a Chernoff face for flower 150 is given in Figure 3.27(b).

The program that we used to make this face mapped the features to the four features listed below. Other features of the face, such as width between the eyes and length of the mouth, are given default values.

Data Feature      Facial Feature
sepal length      size of face
sepal width       forehead/jaw relative arc length
petal length      shape of forehead
petal width       shape of jaw
Example 3.22. A more extensive illustration of these two approaches to viewing multidimensional data is provided by Figures 3.28 and 3.29, which show the star and face plots, respectively, of 15 flowers from the Iris data set. The first 5 flowers are of species Setosa, the second 5 are Versicolour, and the last 5 are Virginica.
Figure 3.27. Star coordinates graph (a) and Chernoff face (b) of the 150th flower of the Iris data set.

Figure 3.28. Plot of 15 Iris flowers (numbers 1-5, 51-55, and 101-105) using star coordinates.

Figure 3.29. A plot of 15 Iris flowers (numbers 1-5, 51-55, and 101-105) using Chernoff faces.
Despite the visual appeal of these sorts of diagrams, they do not scale well, and thus, they are of limited use for many data mining problems. Nonetheless, they may still be of use as a means to quickly compare small sets of objects that have been selected by other techniques.
3.3.5 Do's and Don'ts
To conclude this section on visualization, we provide a short list of visualization do's and don'ts. While these guidelines incorporate a lot of visualization wisdom, they should not be followed blindly. As always, guidelines are no substitute for thoughtful consideration of the problem at hand.
ACCENT Principles The following are the ACCENT principles for effective graphical display put forth by D. A. Burn (as adapted by Michael Friendly):

Apprehension Ability to correctly perceive relations among variables. Does the graph maximize apprehension of the relations among variables?

Clarity Ability to visually distinguish all the elements of a graph. Are the most important elements or relations visually most prominent?

Consistency Ability to interpret a graph based on similarity to previous graphs. Are the elements, symbol shapes, and colors consistent with their use in previous graphs?

Efficiency Ability to portray a possibly complex relation in as simple a way as possible. Are the elements of the graph economically used? Is the graph easy to interpret?

Necessity The need for the graph, and the graphical elements. Is the graph a more useful way to represent the data than alternatives (table, text)? Are all the graph elements necessary to convey the relations?

Truthfulness Ability to determine the true value represented by any graphical element by its magnitude relative to the implicit or explicit scale. Are the graph elements accurately positioned and scaled?
Tufte's Guidelines Edward R. Tufte has also enumerated the following principles for graphical excellence:

Graphical excellence is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design.

Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.

Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.

Graphical excellence is nearly always multivariate.

And graphical excellence requires telling the truth about the data.
3.4 OLAP and Multidimensional Data Analysis
In this section, we investigate the techniques and insights that come from viewing data sets as multidimensional arrays. A number of database systems support such a viewpoint, most notably, On-Line Analytical Processing (OLAP) systems. Indeed, some of the terminology and capabilities of OLAP systems have made their way into spreadsheet programs that are used by millions of people. OLAP systems also have a strong focus on the interactive analysis of data and typically provide extensive capabilities for visualizing the data and generating summary statistics. For these reasons, our approach to multidimensional data analysis will be based on the terminology and concepts common to OLAP systems.
3.4.1 Representing Iris Data as a Multidimensional Array
Most data sets can be represented as a table, where each row is an object and each column is an attribute. In many cases, it is also possible to view the data as a multidimensional array. We illustrate this approach by representing the Iris data set as a multidimensional array.

Table 3.7 was created by discretizing the petal length and petal width attributes to have values of low, medium, and high and then counting the number of flowers from the Iris data set that have particular combinations of petal width, petal length, and species type. (For petal width, the categories low, medium, and high correspond to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞), respectively. For petal length, the categories low, medium, and high correspond to the intervals [0, 2.5), [2.5, 5), and [5, ∞), respectively.)
Table 3.7. Number of flowers having a particular combination of petal width, petal length, and species type.

Petal Length   Petal Width   Species Type   Count
low            low           Setosa           46
low            medium        Setosa            2
medium         low           Setosa            2
medium         medium        Versicolour      43
medium         high          Versicolour       3
medium         high          Virginica         3
high           medium        Versicolour       2
high           medium        Virginica         3
high           high          Versicolour       2
high           high          Virginica        44
Figure 3.30. A multidimensional data representation for the Iris data set.
Table 3.8. Cross-tabulation of flowers according to petal length and width for flowers of the Setosa species.

                      Width
Length     low   medium   high
low         46        2      0
medium       2        0      0
high         0        0      0

Table 3.9. Cross-tabulation of flowers according to petal length and width for flowers of the Versicolour species.

                      Width
Length     low   medium   high
low          0        0      0
medium       0       43      3
high         0        2      2

Table 3.10. Cross-tabulation of flowers according to petal length and width for flowers of the Virginica species.

                      Width
Length     low   medium   high
low          0        0      0
medium       0        0      3
high         0        3     44

Empty combinations (those combinations that do not correspond to at least one flower) are not shown in Table 3.7.
The data can be organized as a multidimensional array with three dimensions corresponding to petal width, petal length, and species type, as illustrated in Figure 3.30. For clarity, slices of this array are shown as a set of three two-dimensional tables, one for each species; see Tables 3.8, 3.9, and 3.10. The information contained in both Table 3.7 and Figure 3.30 is the same. However, in the multidimensional representation shown in Figure 3.30 (and Tables 3.8, 3.9, and 3.10), the values of the attributes (petal width, petal length, and species type) are array indices.

What is important are the insights that can be gained by looking at data from a multidimensional viewpoint. Tables 3.8, 3.9, and 3.10 show that each species of Iris is characterized by a different combination of values of petal length and width. Setosa flowers have low width and length, Versicolour flowers have medium width and length, and Virginica flowers have high width and length.
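Cross-tabulations like Tables 3.8 to 3.10 can be sketched with pandas, discretizing with the interval boundaries quoted above (the column names here are our own abbreviations):

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris

    iris = load_iris()
    df = pd.DataFrame(iris.data[:, 2:], columns=['pl', 'pw'])
    df['species'] = [iris.target_names[t] for t in iris.target]

    labels = ['low', 'medium', 'high']
    df['length'] = pd.cut(df['pl'], [0, 2.5, 5, np.inf],
                          labels=labels, right=False)
    df['width'] = pd.cut(df['pw'], [0, 0.75, 1.75, np.inf],
                         labels=labels, right=False)

    for species, group in df.groupby('species'):   # one slice per species
        print(species)
        print(pd.crosstab(group['length'], group['width']))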
3.4.2 Multidimensional Data: The General Case
The previous section gave a specific example of using a multidimensional approach to represent and analyze a familiar data set. Here we describe the general approach in more detail.
The starting point is usually a tabular representation of the data, such as that of Table 3.7, which is called a fact table. Two steps are necessary in order to represent data as a multidimensional array: identification of the dimensions and identification of an attribute that is the focus of the analysis. The dimensions are categorical attributes or, as in the previous example, continuous attributes that have been converted to categorical attributes. The values of an attribute serve as indices into the array for the dimension corresponding to the attribute, and the number of attribute values is the size of that dimension. In the previous example, each attribute had three possible values, and thus, each dimension was of size three and could be indexed by three values. This produced a 3 × 3 × 3 multidimensional array.
Each combination of attribute values (one value for each different attribute) defines a cell of the multidimensional array. To illustrate using the previous example, if petal length = low, petal width = medium, and species = Setosa, a specific cell containing the value 2 is identified. That is, there are only two flowers in the data set that have the specified attribute values. Notice that each row (object) of the data set in Table 3.7 corresponds to a cell in the multidimensional array.

The contents of each cell represents the value of a target quantity (target variable or attribute) that we are interested in analyzing. In the Iris example, the target quantity is the number of flowers whose petal width and length fall within certain limits. The target attribute is quantitative because a key goal of multidimensional data analysis is to look at aggregate quantities, such as totals or averages.
The following summarizes the procedure for creating a multidimensional data representation from a data set represented in tabular form. First, identify the categorical attributes to be used as the dimensions and a quantitative attribute to be used as the target of the analysis. Each row (object) in the table is mapped to a cell of the multidimensional array. The indices of the cell are specified by the values of the attributes that were selected as dimensions, while the value of the cell is the value of the target attribute. Cells not defined by the data are assumed to have a value of 0.
Example 3.23. To further illustrate the ideas just discussed, we present a more traditional example involving the sale of products. The fact table for this example is given by Table 3.11. The dimensions of the multidimensional representation are the product ID, location, and date attributes, while the target attribute is the revenue. Figure 3.31 shows the multidimensional representation of this data set. This larger and more complicated data set will be used to illustrate additional concepts of multidimensional data analysis.
3.4.3 Analyzing Multidimensional Data
In this section, we describe different multidimensional analysis techniques. In particular, we discuss the creation of data cubes, and related operations, such as slicing, dicing, dimensionality reduction, roll-up, and drill-down.

Data Cubes: Computing Aggregate Quantities
A key motivation for taking a multidimensional viewpoint of data is the importance of aggregating data in various ways. In the sales example, we might wish to find the total sales revenue for a specific year and a specific product. Or we might wish to see the yearly sales revenue for each location across all products. Computing aggregate totals involves fixing specific values for some of the attributes that are being used as dimensions and then summing over all possible values for the attributes that make up the remaining dimensions. There are other types of aggregate quantities that are also of interest, but for simplicity, this discussion will use totals (sums).
Table 3.12 shows the result of summing over all locations for various combinations of date and product. For simplicity, assume that all the dates are within one year. If there are 365 days in a year and 1000 products, then Table 3.12 has 365,000 entries (totals), one for each product-date pair. We could also specify the store location and date and sum over products, or specify the location and product and sum over all dates.

Table 3.13 shows the marginal totals of Table 3.12. These totals are the result of further summing over either dates or products. In Table 3.13, the total sales revenue due to product 1, which is obtained by summing across row 1 (over all dates), is $370,000. The total sales revenue on January 1, 2004, which is obtained by summing down column 1 (over all products), is $527,362. The total sales revenue, which is obtained by summing over all rows and columns (all times and products), is $227,352,127. All of these totals are for all locations because the entries of Table 3.13 include all locations.
A key point of this example is that there are a number of different totals (aggregates) that can be computed for a multidimensional array, depending on how many attributes we sum over. Assume that there are $n$ dimensions and that the $i$th dimension (attribute) has $s_i$ possible values. There are $n$ different ways to sum only over a single attribute. If we sum over dimension $j$, then we obtain $s_1 \cdots s_{j-1} s_{j+1} \cdots s_n$ totals, one for each possible combination of attribute values of the $n-1$ other attributes (dimensions). The totals that result from summing over one attribute form a multidimensional array of $n-1$ dimensions, and there are $n$ such arrays of totals.
Table 3.11. Sales revenue of products (in dollars) for various locations and times.

Product ID   Location      Date            Revenue
...          ...           ...             ...
1            Minneapolis   Oct. 18, 2004   $250
1            Chicago       Oct. 18, 2004   $79
...          ...           ...             ...
1            Paris         Oct. 18, 2004   $301
...          ...           ...             ...
27           Minneapolis   Oct. 18, 2004   $2,321
27           Chicago       Oct. 18, 2004   $3,278
...          ...           ...             ...
27           Paris         Oct. 18, 2004   $1,325
...          ...           ...             ...

Figure 3.31. Multidimensional data representation for sales data.
Table 3.12. Totals that result from summing over all locations for a fixed time and product.

                                  date
product ID   Jan 1, 2004   Jan 2, 2004   ...   Dec 31, 2004
1            $1,001        $987          ...   $891
...          ...           ...           ...   ...
27           $10,265       $10,225       ...   $9,325
Table 3.13. Table 3.12 with marginal totals.

                                  date
product ID   Jan 1, 2004   Jan 2, 2004   ...   Dec 31, 2004   total
1            $1,001        $987          ...   $891           $370,000
...          ...           ...           ...   ...            ...
27           $10,265       $10,225       ...   $9,325         $3,800,020
total        $527,362      $532,953      ...   $631,221       $227,352,127
In the sales example, there are three sets of totals that result from summing over only one dimension, and each set of totals can be displayed as a two-dimensional table.
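With the data held as a multidimensional array, these totals are just sums along axes. A sketch with a small random cube (the sizes and values are illustrative only):

    import numpy as np

    # A toy cube of revenues: 3 products x 4 dates x 2 locations.
    rng = np.random.default_rng(0)
    cube = rng.integers(0, 1000, size=(3, 4, 2))

    by_product_date = cube.sum(axis=2)   # sum over locations (cf. Table 3.12)
    by_product = cube.sum(axis=(1, 2))   # marginal totals per product
    grand_total = cube.sum()             # sum over all dimensions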
If we sum over two dimensions (perhaps starting with one of the arrays of totals obtained by summing over one dimension), then we will obtain a multidimensional array of totals with $n-2$ dimensions. There will be $\binom{n}{2}$ distinct arrays of such totals. For the sales example, there will be $\binom{3}{2} = 3$ arrays of totals that result from summing over location and product, location and time, or product and time. In general, summing over $k$ dimensions yields $\binom{n}{k}$ arrays of totals, each with dimension $n-k$.
A multidimensional representation of the data, together with all possible totals (aggregates), is known as a data cube. Despite the name, the size of each dimension (the number of attribute values) does not need to be equal. Also, a data cube may have either more or fewer than three dimensions. More importantly, a data cube is a generalization of what is known in statistical terminology as a cross-tabulation. If marginal totals were added, Tables 3.8, 3.9, or 3.10 would be typical examples of cross-tabulations.
Dimensionality Reduction and Pivoting
The aggregation described in the last section can be viewed as a form of dimensionality reduction. Specifically, the $j$th dimension is eliminated by summing over it. Conceptually, this collapses each column of cells in the $j$th dimension into a single cell. For both the sales and Iris examples, aggregating over one dimension reduces the dimensionality of the data from 3 to 2. If $s_j$ is the number of possible values of the $j$th dimension, the number of cells is reduced by a factor of $s_j$. Exercise 17 on page 143 asks the reader to explore the difference between this type of dimensionality reduction and that of PCA.

Pivoting refers to aggregating over all dimensions except two. The result is a two-dimensional cross-tabulation with the two specified dimensions as the only remaining dimensions. Table 3.13 is an example of pivoting on date and product.
Slicing and Dicing
These two colorful names refer to rather straightforward operations. Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions. Tables 3.8, 3.9, and 3.10 are three slices from the Iris set that were obtained by specifying three separate values for the species dimension. Dicing involves selecting a subset of cells by specifying a range of attribute values. This is equivalent to defining a subarray from the complete array. In practice, both operations can also be accompanied by aggregation over some dimensions.
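In array terms, slicing and dicing are ordinary indexing operations; a sketch:

    import numpy as np

    cube = np.arange(3 * 4 * 2).reshape(3, 4, 2)  # products x dates x locations

    slice_ = cube[:, :, 0]       # slicing: fix one dimension to a single value
    dice = cube[0:2, 1:3, :]     # dicing: a subarray given ranges of values
    totals = dice.sum(axis=1)    # operations are often combined with aggregation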
Roll-Up and Drill-Down
In Chapter 2, attribute values were regarded as being atomic in some sense. However, this is not always the case. In particular, each date has a number of properties associated with it such as the year, month, and week. The data can also be identified as belonging to a particular business quarter, or if the application relates to education, a school quarter or semester. A location also has various properties: continent, country, state (province, etc.), and city. Products can also be divided into various categories, such as clothing, electronics, and furniture.

Often these categories can be organized as a hierarchical tree or lattice. For instance, years consist of months or weeks, both of which consist of days. Locations can be divided into nations, which contain states (or other units of local government), which in turn contain cities. Likewise, any category of products can be further subdivided. For example, the product category, furniture, can be subdivided into the subcategories: chairs, tables, sofas, etc.
This hierarchical structure gives rise to the roll-up and drill-down operations. To illustrate, starting with the original sales data, which is a multidimensional array with entries for each date, we can aggregate (roll up) the sales across all the dates in a month. Conversely, given a representation of the data where the time dimension is broken into months, we might want to split the monthly sales totals (drill down) into daily sales totals. Of course, this requires that the underlying sales data be available at a daily granularity.

Thus, roll-up and drill-down operations are related to aggregation. Notice, however, that they differ from the aggregation operations discussed until now in that they aggregate cells within a dimension, not across the entire dimension.
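A roll-up from daily to monthly totals can be sketched with pandas; drilling back down is only possible because the daily rows are kept:

    import pandas as pd

    sales = pd.DataFrame({
        'date': pd.to_datetime(['2004-01-01', '2004-01-15', '2004-02-03']),
        'revenue': [1001, 525, 987],   # illustrative values only
    })

    # Roll up: aggregate the days of each month into one cell.
    monthly = sales.groupby(sales['date'].dt.to_period('M'))['revenue'].sum()
    print(monthly)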
3.4.4 Final Comments on Multidimensional Data Analysis
Multidimensional data analysis, in the sense implied by OLAP and related systems, consists of viewing the data as a multidimensional array and aggregating data in order to better analyze the structure of the data. For the Iris data, the differences in petal width and length are clearly shown by such an analysis. The analysis of business data, such as sales data, can also reveal many interesting patterns, such as profitable (or unprofitable) stores or products.
As mentioned, there are various types of database systems that support the analysis of multidimensional data. Some of these systems are based on relational databases and are known as ROLAP systems. More specialized database systems that specifically employ a multidimensional data representation as their fundamental data model have also been designed. Such systems are known as MOLAP systems. In addition to these types of systems, statistical databases (SDBs) have been developed to store and analyze various types of statistical data, e.g., census and public health data, that are collected by governments or other large organizations. References to OLAP and SDBs are provided in the bibliographic notes.
3.5 Bibliographic Notes
Summary statistics are discussed in detail in most introductory statistics books, such as [92]. References for exploratory data analysis are the classic text by Tukey [104] and the book by Velleman and Hoaglin [105].

The basic visualization techniques are readily available, being an integral part of most spreadsheets (Microsoft EXCEL [95]), statistics programs (SAS [99], SPSS [102], R [96], and S-PLUS [98]), and mathematics software (MATLAB [94] and Mathematica [93]). Most of the graphics in this chapter were generated using MATLAB. The statistics package R is freely available as an open source software package from the R project.
The literature on visualization is extensive, covering many fields and many decades. One of the classics of the field is the book by Tufte [103]. The book by Spence [101], which strongly influenced the visualization portion of this chapter, is a useful reference for information visualization, both principles and techniques. This book also provides a thorough discussion of many dynamic visualization techniques that were not covered in this chapter. Two other books on visualization that may also be of interest are those by Card et al. [87] and Fayyad et al. [89].

Finally, there is a great deal of information available about data visualization on the World Wide Web. Since Web sites come and go frequently, the best strategy is a search using "information visualization," "data visualization," or "statistical graphics." However, we do want to single out for attention "The Gallery of Data Visualization," by Friendly [90]. The ACCENT Principles for effective graphical display as stated in this chapter can be found there, or as originally presented in the article by Burn [86].
There are a variety of graphical techniques that can be used to explore whether the distribution of the data is Gaussian or some other specified distribution. Also, there are plots that display whether the observed values are statistically significant in some sense. We have not covered any of these techniques here and refer the reader to the previously mentioned statistical and mathematical packages.
Multidimensional analysis has been around in a variety of forms for some time. One of the original papers was a white paper by Codd [88], the father of relational databases. The data cube was introduced by Gray et al. [91], who described various operations for creating and manipulating data cubes within a relational database framework. A comparison of statistical databases and OLAP is given by Shoshani [100]. Specific information on OLAP can be found in documentation from database vendors and many popular books. Many database textbooks also have general discussions of OLAP, often in the context of data warehousing. For example, see the text by Ramakrishnan and Gehrke [97].
Bibliography
[86] D. A. Burn. Designing Effective Statistical Graphs. In C. R. Rao, editor, Handbook of Statistics 9. Elsevier/North-Holland, Amsterdam, The Netherlands, September 1993.
[87] S. K. Card, J. D. MacKinlay, and B. Shneiderman, editors. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann Publishers, San Francisco, CA, January 1999.
[88] E. F. Codd, S. B. Codd, and C. T. Smalley. Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate. White Paper, E. F. Codd and Associates, 1993.
[89] U. M. Fayyad, G. G. Grinstein, and A. Wierse, editors. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann Publishers, San Francisco, CA, September 2001.
[90] M. Friendly. Gallery of Data Visualization. http://www.math.yorku.ca/SCS/Gallery/, 2005.
[91] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Journal Data Mining and Knowledge Discovery, 1(1):29–53, 1997.
[92] B. W. Lindgren. Statistical Theory. CRC Press, January 1993.
[93] Mathematica 5.1. Wolfram Research, Inc. http://www.wolfram.com/, 2005.
[94] MATLAB 7.0. The MathWorks, Inc. http://www.mathworks.com, 2005.
[95] Microsoft Excel 2003. Microsoft, Inc. http://www.microsoft.com/, 2003.
[96] R: A language and environment for statistical computing and graphics. The R Project for Statistical Computing. http://www.r-project.org/, 2005.
[97] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 3rd edition, August 2002.
[98] S-PLUS. Insightful Corporation. http://www.insightful.com, 2005.
[99] SAS: Statistical Analysis System. SAS Institute Inc. http://www.sas.com/, 2005.
[100] A. Shoshani. OLAP and statistical databases: similarities and differences. In Proc. of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 185–196. ACM Press, 1997.
[101] R. Spence. Information Visualization. ACM Press, New York, December 2000.
[102] SPSS: Statistical Package for the Social Sciences. SPSS, Inc. http://www.spss.com/, 2005.
[103] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, March 1986.
[104] J. W. Tukey. Exploratory data analysis. Addison-Wesley, 1977.
[105] P. Velleman and D. Hoaglin. The ABCs of EDA: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury, 1981.
3.6 Exercises
1. Obtain one of the data sets available at the UCI Machine Learning Repository and apply as many of the different visualization techniques described in the chapter as possible. The bibliographic notes and book Web site provide pointers to visualization software.
2. Identify at least two advantages and two disadvantages of using color to visually represent information.

3. What are the arrangement issues that arise with respect to three-dimensional plots?

4. Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
5. Describe how you would create visualizations to display information that describes the following types of systems.

(a) Computer networks. Be sure to include both the static aspects of the network, such as connectivity, and the dynamic aspects, such as traffic.

(b) The distribution of specific plant and animal species around the world for a specific moment in time.

(c) The use of computer resources, such as processor time, main memory, and disk, for a set of benchmark database programs.

(d) The change in occupation of workers in a particular country over the last thirty years. Assume that you have yearly information about each person that also includes gender and level of education.

Be sure to address the following issues:

Representation. How will you map objects, attributes, and relationships to visual elements?

Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.

Selection. How will you handle a large number of attributes and data objects?
6. Describe one advantage and one disadvantage of a stem and leaf plot with respect to a standard histogram.

7. How might you address the problem that a histogram depends on the number and location of the bins?

8. Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in Figure 3.11?

9. Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.
10. Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.

11. Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.

12. Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.

13. Simple line plots, such as that displayed in Figure 2.12 on page 56, which shows two time series, can be used to effectively display high-dimensional data. For example, in Figure 2.12 it is easy to tell that the frequencies of the two time series are different. What characteristic of time series allows the effective visualization of high-dimensional data?

14. Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.

15. How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would be of interest?

16. Construct a data cube from Table 3.14. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.
Table 3.14. Fact table for Exercise 16.

Product ID   Location ID   Number Sold
1            1             10
1            3              6
2            1              5
2            2             22
17. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.