Information Sciences 137 (2001) 1–15
www.elsevier.com/locate/ins
Efficient data reconciliation
Munir Cochinwala a, Verghese Kurien b, Gail Lalk a,*, Dennis Shasha c
a Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA
b Niksun, Inc., North Brunswick, NJ 08902, USA
c New York University, 70 Washington Square North, New York, NY 10012, USA
Received 27 September 1998; received in revised form 5 May 2000; accepted 20 October 2000
Abstract
Data reconciliation is the process of matching records across different databases.
Data reconciliation requires "joining" on fields that have traditionally been non-key
fields. Generally, the operational databases are of sufficient quality for the purposes for
which they were initially designed, but since the data in the different databases do not
have a canonical structure and may have errors, approximate matching algorithms are
required.
Approximate matching algorithms can have many different parameter settings. The
number of parameters will affect the complexity of the algorithm due to the number of
comparisons needed to identify matching records across different datasets. For large
datasets that are prevalent in data warehouses, the increased complexity may result in
impractical solutions.
In this paper, we describe an efficient method for data reconciliation. Our main
contribution is the incorporation of machine learning and statistical techniques to reduce the complexity of the matching algorithms via identification and elimination of
redundant or useless parameters. We have conducted experiments on actual data that
demonstrate the validity of our techniques. In our experiments, the techniques reduced
complexity by 50% while significantly increasing matching accuracy. © 2001 Telcordia
Technologies Inc. Published by Elsevier Science Inc. All rights reserved.
* Corresponding author. Tel.: +1-973-829-5213; fax: +1-973-829-5885.
E-mail address: grl@research.telcordia.com (G. Lalk).
0020-0255/01/$ - see front matter © 2001 Telcordia Technologies Inc. Published by Elsevier Science Inc. All rights reserved.
PII: S0020-0255(00)00070-0
1. Introduction
A high-priority business goal of corporate data warehouses is to generate
customer profiles that identify the complete set of products and services purchased by a customer. A good customer profile can support strategic decision
making by combining the many pieces of information available about a particular customer from a variety of sources. For marketing and new service development, it is useful to have a view of products and services customers
currently purchase from company subsidiaries. From a regulatory perspective, it
may be necessary to know exactly what data each subsidiary owns and how it can
be shared across different entities. From an advertising and market promotion
perspective, it is important to have knowledge of what the current customers
already own (e.g., a cellular phone), since the success of a direct marketing initiative often depends on reaching a targeted market segment and demonstrating
that a company understands the current market. In addition, accurate customer
profiles will pave the way for one-stop shopping and unified billing. Finally, a
good customer profile can be used for predictive modeling of customer churn,
winback and upsell, all functions of strategic interest to most corporations.
The data sources that are used for generating customer profiles are typically
disparate types of operations systems designed to support specific operations
functions (e.g., billing, networking, planning, ...). The generation of customer
profile data requires "joining" on fields that have traditionally been non-key
fields (e.g., customer name, service address) or on fields that have slightly
different semantics. The data in the non-key fields does not generally have a
canonical structure.
In most organizations, operational databases are of sufficient quality for the
purposes for which they were initially designed. It usually does not matter that
names are slightly misspelled, because accesses are by department or an application-specific key (e.g., telephone number). The problem arises when a
global view must be obtained (e.g., all the products we sell to a customer) and
the join must be done on a dirty field like "name". For this reason, conventional relational join operations will miss connections that should be present.
Sometimes, parsing will be enough, e.g., translating "Jones, Christopher" to
"Chris Jones", but often it will not be. The result of using conventional relational
join operations will be low-quality information about customers, hindering
both sales and analysis efforts.
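The parsing step mentioned above, turning "Jones, Christopher" into "Chris Jones", can be sketched as follows. The nickname table is a hypothetical illustration, not part of the paper; a real system would use a much larger domain-specific lookup:

```python
# Sketch of the name-parsing step described above. The nickname table
# is a hypothetical stand-in for a domain-specific lookup.
NICKNAMES = {"christopher": "chris", "robert": "bob", "william": "bill"}

def normalize_name(raw: str) -> str:
    """Turn 'Jones, Christopher' into the canonical form 'chris jones'."""
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
    else:
        parts = raw.strip().split()
        first, last = parts[0], " ".join(parts[1:])
    first = NICKNAMES.get(first.lower(), first.lower())
    return f"{first} {last.lower()}"
```

Parsing of this kind handles only systematic variation; misspellings still require the approximate matching discussed below.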
It is common practice to improve on the results of conventional relational
join techniques by using approximate matching instead of exact matching. The
matching procedure can be made even more robust by matching on multiple
fields or attributes. The increased robustness is due to the fact that multiple attributes can be used to mask errors in single fields. Matching on multiple
attributes involves the generation of matching rules that are used to decide if a
pair of records is matched.
Use of multiple attributes for generating matching rules introduces two
problems:
· How to generate a highly accurate matching rule?
· How to reduce the complexity of a matching rule while maintaining accuracy?
In this paper, we address these problems as follows:
· Machine learning (ML) techniques are used to generate accurate matching
rules.
· Statistical techniques [3,4,8,9] are used to reduce the complexity via identification and elimination of useless or redundant parameters.
Our contribution to data reconciliation techniques is the incorporation of
ML and statistical techniques into a general methodology for reducing
matching complexity for large datasets. We demonstrate the validity of our
techniques through experiments on actual telecommunications company
data. In our experiments, the techniques reduced the complexity of the
matching process by over 50% while significantly increasing matching accuracy.
The remainder of the paper is as follows: Section 2 outlines previous work in
the area of data quality in both the academic and industrial arenas. Section 3
describes our methodology and Section 4 provides a detailed explanation of
how ML and statistics are used to generate rules and reduce complexity.
Section 5 is the conclusion.
2. Previous work
Techniques to build data warehouses from disparate data sources have been
studied and implemented in both the commercial and academic arenas.
2.1. Industrial approaches
Many industrial companies do some kind of data cleaning using proprietary
technologies. The industrial leaders are Apertus enterprise/Integrator, Vality
technology and Gladstone computer services.
Vality's Burch [10] suggests the following methodology:
1. Perform a data quality audit: using sampling, estimate the percentage of records in error or that are inconsistent between one database and another.
2. Do a lexical analysis and decomposition of fields. For example, "Don McCarthy" should really be two fields. "123 Main Street, Somewhere, Colorado"
should be at least three fields.
3. Identify the source of each piece of data, its time of entry, who entered it,
and other possibly relevant information. This will help uncover systematic
sources of errors.
4. Use rules when possible to clean the data, e.g., "D" may mean Dallas in
Texas and Denver in Colorado.
5. Control the process to encourage standardization, allow for corrections, and
establish a single point of entry for replicated fields.
Our approach refines and extends this methodology by including a process to
derive low-complexity matching rules.
2.2. Academic approaches
In academia, the problem of data quality has been restricted to matching
records in databases. To our knowledge, there has been no general methodology or tool development for data quality. Finding approximately matching
records in two databases can be transformed into the problem of finding duplicates in a single database. The transformation is to append one database to
the other and to keep track of the source of each record.
For large datasets, a full pairwise comparison of records is infeasible. One
set of techniques for reducing the complexity is based on pre-sorting the datasets. When finding exact matches, one can sort the tuples and find those
neighbors that match exactly, as in [1]. For approximate matching, one can
generalize this approach by doing pairwise comparisons of nearby records in a
sort and then combining several sorts [2,7]. This is called a windowing–combining approach. The problem with this approach is that if records differ by
their first or second character, this approach might fail to find matches. Also,
combining may be too optimistic or too pessimistic. For example, we have
found that two records having the same name may represent different people.
The same address combined with a similar name, however, usually represents a
single person.
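A single windowing pass of the kind just described might look like this sketch; the window size, sort key and edit-distance threshold are illustrative, since the paper does not prescribe particular values:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit (Levenshtein) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def windowed_pairs(records, window=3, max_dist=2):
    """One sorted-neighborhood run: sort the records, compare each
    record only to its `window` successors in sort order, and keep
    pairs whose edit distance is at most `max_dist`."""
    ordered = sorted(records)
    pairs = set()
    for i, r in enumerate(ordered):
        for s in ordered[i + 1 : i + 1 + window]:
            if edit_distance(r, s) <= max_dist:
                pairs.add((r, s))
    return pairs
```

The first-character failure mode is visible here: "aaa" and "baa" are at edit distance 1, but a sort can place unrelated records between them, so a small window misses the pair; combining several runs with different keys mitigates this.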
Hernandez and Stolfo combine a windowing–combining method with the
idea of transitive closure [5]. They execute several independent runs of a
sort–merge strategy, each time using a different key and a relatively small
window. For instance, in one run, they use the address as the principal part
of the key, while in another they use the last name of the employee as the
principal part of the key. Each independent run will produce a set of pairs of
records, which can be merged. They then apply the transitive closure to those
pairs of records. The result will be a union of all pairs discovered by all
independent runs, plus all those pairs that can be inferred by transitivity.
Thus, they are assuming that similarity is an equivalence relation. While this
assumption is false in a mathematical sense (e.g., if similarity means edit distance
1, then "dog" is similar to "fog", which is similar to "for", which is similar to
"far", but "dog" and "far" are not at all similar), the authors nevertheless show
good results.
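The transitive-closure step can be sketched with a union-find structure; the dog/fog/for/far chain shows how closure merges records that are not pairwise similar:

```python
class DisjointSet:
    """Union-find structure used to take the transitive closure of the
    pairs produced by the independent sort-merge runs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def transitive_closure(pairs):
    """Merge every discovered pair into maximal groups."""
    ds = DisjointSet()
    for a, b in pairs:
        ds.union(a, b)
    groups = {}
    for x in list(ds.parent):
        groups.setdefault(ds.find(x), set()).add(x)
    return list(groups.values())
```

Here transitive_closure([("dog", "fog"), ("fog", "for"), ("for", "far")]) yields the single group {"dog", "fog", "for", "far"} even though "dog" and "far" agree in no position, which is exactly the equivalence-relation assumption discussed above.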
Monge and Elkan [6] propose a scheme for duplicate identification that
chooses several sortings of the data (currently left-to-right lexicographic and
right-to-left lexicographic) and maintains duplicates or near-duplicates in
"clusters", where each member of a cluster is an approximate duplicate of every
other member. When scanning the data, each record is compared to the set of
current clusters (the authors found that maintaining as few as four clusters
worked well). If the record is close enough to any cluster, it joins the cluster.
Otherwise, it forms a new singleton cluster. Whenever a new cluster causes the
number of clusters to exceed the allowed limit, some cluster is forgotten.
This approach also assumes that similarity is an equivalence relation, but the
authors show good results.
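A sketch of that scanning scheme, with edit distance standing in for the authors' similarity measure and illustrative limits:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_scan(records, max_clusters=4, max_dist=2):
    """Single pass over the data: a record joins the first cluster whose
    representative is close enough, otherwise it starts a new singleton
    cluster; the oldest cluster is forgotten when the limit is hit."""
    clusters = []  # each cluster is a list; element 0 is its representative
    for r in records:
        for c in clusters:
            if edit_distance(r, c[0]) <= max_dist:
                c.append(r)
                break
        else:
            clusters.append([r])
            if len(clusters) > max_clusters:
                clusters.pop(0)  # forget the oldest cluster
    return clusters
```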
3. Data reconciliation methodology
A matching algorithm does record matching based on a match between
pairs of columns, where one column comes from one database and the other
from a different database. The matching may be exact or approximate. Since
exact matching yields only very few matches, our methodology is based on
approximate matching. We focus on matching customer records across two
different databases to identify sets of customer records that exist in both databases. In the examples cited in this paper, a name/address database of
wireless service customers is matched against a name/address database of
wireline service customers in the same market area.
3.1. Wireless and wireline databases
The wireless database contains approximately 1,300,000 customer records.
The records contain several fields including: record identifier, customer name,
street address, city, state, zip, phone and service dates.
The wireline database contains approximately 860,000 records. The record
identifiers in this database are assigned based on name and address. The address
fields in the wireline database are much more sparsely populated than the address
fields of the wireless database.
3.2. General methodology for approximate matching
We now describe a general methodology that may be used to detect matches
across different databases. The methodology can be divided into the following
steps:
· Preprocessing. Preprocessing includes the elimination of common words like
"street", special characters and blanks. For specific applications, it could include re-ordering and substitution of words, e.g., "RR Box 20" becomes "20
PO Box". This step is usually highly domain dependent.
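A sketch of such a preprocessing step follows; the stop-word list and the rewrite rule are illustrative stand-ins for what would, as noted, be highly domain-dependent tables:

```python
import re

# Illustrative preprocessing: lower-case, strip special characters,
# drop common street words, and apply a domain rewrite such as
# "RR Box 20" -> "20 po box". The word list here is hypothetical.
COMMON_WORDS = {"street", "st", "road", "rd", "avenue", "ave"}

def preprocess_address(raw: str) -> str:
    cleaned = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    tokens = [t for t in cleaned.split() if t not in COMMON_WORDS]
    # Domain-specific substitution: "rr box <n>" becomes "<n> po box".
    if tokens[:2] == ["rr", "box"] and len(tokens) >= 3:
        tokens = [tokens[2], "po", "box"] + tokens[3:]
    return " ".join(tokens)
```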
Table 1
Average field lengths

Parameter                  Average string length
Wireline address length    6.72
Wireless address length    6.88
Wireline name length       10.7
Wireless name length       10.87
· Parameter space definition. The parameter space can be divided into two
measurement categories:
Distance measures. These include a set of distance measures that can be
used to perform approximate matching. In the case of string matching,
the distance measures could be Hamming distance, edit distance or alignment distance. In this paper, we only considered edit distance. The edit
distance parameters were (1) edit distance between the address fields of
the wireline and wireless databases and (2) edit distance between the name
fields of the wireline and wireless databases.
Record descriptors. Another group of parameters is descriptors of the records being compared, e.g., field lengths. For the Falcon dataset, we chose
four record descriptors: length of wireless address field, length of wireline
address field, length of wireless name field and length of wireline name
field. The average field length of these parameters in the datasets we studied is shown in Table 1.
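The six parameters above can be computed per candidate record pair; this sketch uses a textbook dynamic-programming edit distance, and the dictionary field names are illustrative:

```python
def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance; cost is proportional to
    len(a) * len(b), which is why field lengths drive complexity."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def parameter_vector(wireless: dict, wireline: dict) -> dict:
    """The two distance measures plus the four record descriptors."""
    return {
        "name distance": edit_distance(wireless["name"], wireline["name"]),
        "address distance": edit_distance(wireless["address"], wireline["address"]),
        "wireless name length": len(wireless["name"]),
        "wireline name length": len(wireline["name"]),
        "wireless address length": len(wireless["address"]),
        "wireline address length": len(wireline["address"]),
    }
```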
· Matching rule generation and complexity reduction. This consists of selecting
an ML algorithm to generate matching rules. After a specific ML algorithm is
selected, parameters are pruned to yield a matching rule of low complexity.
This process is described in detail in the next section.
· Application of pruned parameter rule. Once the improved matching rule has
been developed on the sample dataset, it can be applied to the original (large)
datasets. The pruning of the parameter space carried out in the previous step
will have significantly reduced the complexity of the matching process. We
demonstrate the effectiveness of this process in Section 4 of this paper.
4. Matching rule generation and complexity reduction
Since pre-processing and parameter identification are highly application
dependent and not very interesting, we focus on the matching rule generation
and pruning, which are typically the hardest steps of any data reconciliation
problem.
In any data reconciliation problem, the analyst typically generates multiple
parameters (such as distance measures) for use in the matching process. Once
these parameters have been calculated, the process of constructing a matching
rule based on these parameters consists of finding a rule to map these parameters to labels, such as "matched" and "not matched". For instance, an example of a matching rule may be: "If the edit distance between the address
fields in the wireline and wireless databases is < 3 and the length of the wireline
address field is ≥ 5, THEN declare that the records are matched."
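Written as code, that example rule is a cheap conjunction of comparisons, with the thresholds taken directly from the rule in the text:

```python
def example_rule(address_distance: int, wireline_address_length: int) -> bool:
    """The sample rule from the text: address edit distance below 3 and
    a wireline address field of at least 5 characters implies a match."""
    return address_distance < 3 and wireline_address_length >= 5
```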
The use of multiple parameters in formulating matching rules gives rise to
the following problems:
· Deriving accurate matching rules by hand is hard if there are four or more
parameters. ML techniques can help in the generation of these rules.
· The generated rules can have very high evaluation complexity.
– Statistical and ML techniques can be used to eliminate redundant parameters.
– Since the final matching rule may involve many parameters, it is difficult
to construct an index on the attributes that relate to the parameters. The
matching rules that we derive can be used to construct indexes on attributes so as to reduce evaluation complexity. We give an example of this later.
In the remainder of this section, we give a detailed example of how matching
rules are generated from parameters and subsequently show the complexity
reduction of the rules using ML techniques. The process of rule generation and
complexity reduction is explained on a subset of the wireless dataset. The
dataset consists of records in the wireless data that include the word "FALCON" in
the customer name. We refer to this dataset as the Falcon dataset for the rest of
this paper. For the Falcon dataset, the wireless data contains 241 records and
the wireline data contains 883 records.
Matching rule generation and complexity reduction consists of three steps:
1. Generation of training data for use by ML algorithms.
2. Selection of ML algorithms on the basis of the training data.
3. Pruning of parameters to reduce the complexity of rule evaluation.
4.1. Generating training data
This process consists of the following steps:
· Selection of a sample of records from one of the databases. Typically, one
would choose the sample from the database for which the matching relationship is surjective. 1 For example, in our case, we choose samples from the
wireless database since it is expected that people who have wireless service
also have wireline service. If no surjective relationship exists, a sample from
the database that has the higher degree of an "onto" relationship may be
selected. The reason for using sampling is that this allows a variety of analysis techniques to be studied without problems arising from computational
or storage complexity.

1 A surjective map f : X → Y is onto Y: every element of Y is the image of some element of X.
· Generation of a table of prospective matches. This is done by running a simple matching algorithm over the dataset. The algorithm should be lenient, in
the sense that only a small number of missed matches can result by using it.
For the Falcon dataset, we declared a possible match if the edit distances between both the name and address fields for the records being compared were
less than 3. This rule generated nearly 30% false matches but did not miss
any true matches.
· An analyst peruses the prospective matches and assigns a verdict (label) to
each of these as correctly matched, incorrectly matched or ambiguous. This
step is the most human-intensive in the whole matching process. In the future, we hope to reduce the extent of human involvement in this step by using rules derived from metadata to generate the verdicts. For the Falcon
dataset, at the end of this step we had a table with the following seven columns:
1. Verdict,
2. Edit distance between address fields,
3. Edit distance between name fields,
4. Length of wireline name field,
5. Length of wireline address field,
6. Length of wireless name field,
7. Length of wireless address field.
Each row in the table represents a record pair across the wireless and wireline
databases. This table was used as training data for the ML schemes discussed
in the next section.
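The lenient screening pass and the resulting seven-column table can be sketched as follows; the verdict column is left for the analyst, and the record fields are illustrative:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def training_rows(wireless, wireline, threshold=3):
    """Lenient candidate generation: keep a pair whenever both the name
    and the address edit distances are below `threshold` (the rule that
    produced ~30% false matches but missed no true ones)."""
    rows = []
    for u in wireless:
        for v in wireline:
            nd = edit_distance(u["name"], v["name"])
            ad = edit_distance(u["address"], v["address"])
            if nd < threshold and ad < threshold:
                rows.append({
                    "verdict": None,  # filled in by the analyst
                    "address distance": ad,
                    "name distance": nd,
                    "wireline name length": len(v["name"]),
                    "wireline address length": len(v["address"]),
                    "wireless name length": len(u["name"]),
                    "wireless address length": len(u["address"]),
                })
    return rows
```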
4.2. Algorithm selection for matching rule generation
This section discusses the selection of an algorithm that can be used to
generate matching rules. In the next section we show how the parameters can
be pruned in order to reduce the complexity of the final matching rule. This is done
as follows:
Identify the ML algorithm that gives the best matching performance in a
class of ML algorithms on the training data. In the experiments that we conducted, we used the classification and regression trees (CART) algorithm [11],
linear discriminant analysis (which attempts to find linear combinations of the
parameters that best separate the verdicts) [17], and vector quantization, which
is a generalization of nearest-neighbor algorithms [15]. A short description of
the CART algorithm can be found in Appendix A. Rules generated by CART
are binary decision trees, which have low evaluation complexity. Algorithms
such as neural nets [16] were initially considered but then rejected due to the
extreme complexity of the evaluated rules. The selected algorithm is data dependent. It is entirely possible that another algorithm might be optimal for a
different problem. The main advantage of using schemes such as ours is that
the data indicates which algorithm should be favored.

Table 2
Falcon: algorithm performance

Matching algorithm       Error rate (%)
CART                     5.1
Linear discriminants     5.3
Vector quantization      9.4
Table 2 shows the matching performance of the three algorithms considered,
based on the training data. These rates were based on a sample of 50 averaged
runs over the dataset, where half the dataset was used to build the matching rule
and the other half was used to judge its performance (cross-validation). For
this dataset, the linear discriminant rules performed almost as well as the
CART matching rules. Hence, linear discriminants can be considered to be a
valid competitor to CART, for the purpose of matching, on this dataset.
However, since the rules generated by linear discriminants involve a linear
combination of the parameters, the final evaluation complexity of the rules is
large. For example, if we had a rule of the form:

2 × Address distance + 1.3 × Name length < 3 → Match,

indices on name length would be useless for reducing evaluation complexity,
since name length has been convolved with the address distance parameter.
Tree-based methods do not suffer from this because parameters enter into relations in conjunctive normal form, and separate indices or clusters can be
formed for each parameter, speeding up the evaluation of the final matching
rule. Consequently, we chose CART as the algorithm for generating matching
rules for this dataset.
4.3. Complexity reduction
The methods that we used to prune down the complexity are borrowed from
the field in statistics referred to as model selection. Given a specific ML algorithm, model selection methods attempt to find the group of parameters
that achieves a good tradeoff between classification accuracy and the complexity
of the matching model generated by the ML algorithm. Model selection
methods can be quite complex [4,8,9], and so, for ease of explanation, we work
through the first steps of a simple model selection scheme for the Falcon dataset. (It turned out that some of the more involved model selection schemes,
such as bagging and subset selection [13], yielded the same matching rule as the
simple scheme.) It is to be emphasized that the scheme that we are proposing
here is not the best scheme, but was chosen for clarity of explanation as well as
good performance for the CART algorithm and the dataset at hand. This is, in
general, true of all our analyses, where ML algorithms and model selection
schemes may work well for some datasets and not for others. The first step in
this scheme consists of defining a tradeoff function that trades off model
complexity for classification accuracy.
In order to define the tradeoff function, we define a model complexity parameter called the complexity ratio. This is the percentage contribution to
overall complexity due to a parameter or group of parameters.
For example, Table 1 in Section 3.2 indicates that the compute complexity of
calculating the name distance is 10.7 × 10.87 = 116.31 (the product of the two
average name lengths). The complexity of calculating a length parameter like
wireline address length is 6.72. Hence the complexity ratio for the name distance
parameter is

116.31 / (116.31 + 46.23 + 6.72 + 6.88 + 10.87 + 10.7) = 0.5882.
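This arithmetic can be reproduced directly from the Table 1 averages, taking the cost of an edit-distance parameter to be the product of the two average field lengths and the cost of a length parameter to be the average length itself:

```python
# Complexity ratios from the Table 1 averages, following the text:
# an edit distance costs the product of the two average field lengths,
# a length descriptor costs its own average length.
costs = {
    "wireline address length": 6.72,
    "wireless address length": 6.88,
    "wireline name length": 10.7,
    "wireless name length": 10.87,
}
costs["name distance"] = 10.7 * 10.87     # = 116.31 as in the text
costs["address distance"] = 6.72 * 6.88   # = 46.23 as in the text
total = sum(costs.values())

def complexity_ratio(parameter: str) -> float:
    """Fractional contribution of one parameter to overall complexity."""
    return costs[parameter] / total
```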
In order to formally define the complexity tradeoff function, the following
notation is required. Let P denote the parameter space. For any tree T, let
P(T) denote the parameters present in T.
For a given tree T, we define the tree complexity function C(T) as follows:

C(T) = Σ_{p ∈ P(T)} complexity ratio(p).    (1)

C(T) is simply the sum of the complexity ratios of all the parameters in T. Let
M(T) denote the misclassification rate of tree T. We used the tradeoff function
J(T) = C(T) + M(T). Since both C(T) and M(T) are scale free, combining
them makes sense. The analyst could choose to weight C(T) and M(T) differently, but we have not chosen to do so.
Given two trees, a starting tree T and a reduced tree (i.e., a tree with some
parameters dropped) T*, the difference in the tradeoff function is simply

Δ(J; T, T*) = J(T) − J(T*) = C(T) − C(T*) + M(T) − M(T*).

The reduced tree T* represents an improvement over the tree T if Δ(J; T, T*) is
large.
The pruning process starts off with the full tree (i.e., with all parameters
included) and then attempts to move to a reduced tree by dropping parameters
one at a time so that the change in the tradeoff function is maximized.
For the Falcon dataset, Table 3 shows the effect of dropping the parameters
one at a time. The second column of the table shows the change in the misclassification rate (over the full tree) due to a single parameter being dropped,
and the third column shows the corresponding change in the complexity ratio.
The fourth column shows the change in the tradeoff function. Pruning the name
distance parameter gives the greatest increase in the tradeoff function, and so
the name distance parameter is selected for pruning.
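That first greedy pruning step can be replayed from the single-drop numbers in Table 3: the gain of a drop is its complexity-ratio saving minus its error-rate increase, and name distance wins by a wide margin:

```python
# First pruning round, using the single-drop deltas from Table 3.
delta_error = {
    "name distance": 0.0014, "address distance": 0.184,
    "wireline address length": 0.007, "wireless address length": 0.0081,
    "wireline name length": 0.016, "wireless name length": 0.061,
}
delta_complexity = {
    "name distance": 0.5882, "address distance": 0.2338,
    "wireline address length": 0.0339, "wireless address length": 0.0347,
    "wireline name length": 0.0541, "wireless name length": 0.0566,
}

def best_single_drop():
    """Pick the drop maximizing the change in the tradeoff function,
    i.e., complexity saved minus misclassification added."""
    gains = {p: delta_complexity[p] - delta_error[p] for p in delta_error}
    return max(gains, key=gains.get), gains
```

A negative gain, as for wireless name length, means the drop costs more accuracy than the complexity it saves.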
Table 3
Falcon: tradeoff for single drops

Parameter                 ΔError rate   ΔComplexity ratio   ΔJ(T)
Name distance             0.0014        0.5882              0.5742
Address distance          0.184         0.2338              0.0498
Wireline address length   0.007         0.0339              0.0269
Wireless address length   0.0081        0.0347              0.0266
Wireline name length      0.016         0.0541              0.0381
Wireless name length      0.061         0.0566              -0.0044
Table 4
Falcon: tradeoff for dual drops

Parameter tuple                         ΔError rate   ΔComplexity ratio   ΔJ(T)
Name distance/Wireline address length   0.0091        0.033               0.0239
Name distance/Wireless address length   0.0072        0.0347              0.0275
Name distance/Wireline name length      0.0088        0.0541              0.0453
The model selection process continues by dropping tuples of parameters. In
the next step, we selected pairs of parameters for which the individual degradation in performance from dropping each parameter was small.
As can be seen from Table 4, the best candidate was the name distance/
wireline name length pair. This tuple was passed to the next round of pruning.
The process was continued to find good triplets of parameters for dropping.
We do not show the details of this step. The model selection process suggested
the final matching tree shown in Fig. 1.
This tree has an error rate of about 8%, and the only parameters included are
the address distance and the wireless name length. The rule suggests that an
index on the wireless name field could be used to reduce the evaluation complexity.
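The shape of the final rule is a two-level tree, sketched below. The threshold values (3 and 6) are illustrative placeholders, since Fig. 1 is not reproduced here; the class labels follow the three verdicts used in the paper:

```python
def final_matching_rule(address_distance: int,
                        wireless_name_length: int) -> str:
    """Two-parameter tree of the pruned form: address distance is tested
    at the root, wireless name length below it. The threshold values are
    illustrative, not those of Fig. 1."""
    if address_distance < 3:
        if wireless_name_length > 6:
            return "matched"
        return "ambiguous"
    return "not matched"
```

Because only the address distance and the wireless name field appear, an index or clustering on the wireless name field can prefilter candidate pairs before the rule is evaluated, which is the indexing opportunity noted above.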
It can be seen from Fig. 1 that address distance is the most important
variable, since it is the closest to the root of the tree. The next (and only) other
variable of importance is the length of the wireless name field. The ovals indicate non-terminal nodes and the rectangles are terminal nodes. The counts
inside the boxes or ovals are estimated probabilities that the class assigned is
wrong, i.e., the misclassification rate at the node. The accuracy of this
matching rule on the Falcon dataset is shown in Table 5. The entries in the
table are conditional probabilities. For instance, the estimated conditional
probability of predicting correctly matched record pairs as matched is 0.965.
As can be seen from the table, we do not predict very well on the ambiguous
matches, which is to be expected. The numbers in the table are slightly different
Fig. 1. Falcon dataset: matching rule.
Table 5
Misclassification probability for CART rule

Classification probability   Given matched   Given not matched   Given ambiguous
Predicted as matched         0.965           0.014               0.03
Predicted as not matched     0.02            0.956               0.18
Predicted as ambiguous       0.0123          0.03                0.79
from the misclassification rates shown in the tree, since the numbers in the table
were estimated on the basis of multiple runs. However, the final classifier has
low complexity and good matching accuracy.
We carefully studied observations that the matching rule misclassified. We
show a typical example of a case where the matching rule failed. The first record is from the wireline database and the second record is from the wireless
database.

Wireline: AAY5MD, FALCON MOUNTAIN, 1200 N OLD DECATUR, NJ, 320013887
Wireless: FALCON MOUNTAIN IS JOHN PIERCE, 1200 OLD DECATUR RD, SAGINAW, NJ 07114-4300
Since one-character fields, like "N" in the record above, were assigned too high
a score in the weight computation for the matching algorithm, we decided to
drop such fields. This could be incorporated into the pre-processing steps. The
result of doing this was that the percentage of correct matches went to 94%
using the same rule. This shows the importance of carefully studying the
mismatched records.
Hence it can be seen that ML techniques combined with model selection
procedures can be used to attack the problem of matching and complexity
reduction of matching in database reconciliation. The complexity factors can
be explicitly included in the model selection process. In this problem, since the
number of parameters was small, a simple model selection scheme was adequate.
In addition to these experiments, we carried out further experiments on a
random sample of 250 records from the wireless database. As in the Falcon
example, CART generated the best matching rules. The matching tree had the
address distance and wireless name length as parameters, with an error rate of
approximately 10%. Our matching rule found approximately 47% more correctly matched records than a leading commercial vendor's tool.
5. Conclusion
Our contribution to data reconciliation techniques is a general methodology
for reducing the complexity of the matching process for large datasets. As
discussed in Section 4.3, the reduction in complexity was achieved by pruning
"useless" parameters. The reduced number of parameters directly impacts the
complexity of the matching process.
The generated rules have very high accuracy and low complexity. The automatically derived rule increased the accuracy of the matching process from
63% to 90%. The pruning methods used are capable of incorporating costs
related to complexity as well as different types of mismatches. The pruning
reduced complexity by over 50% through elimination of the name distance
parameter, whose complexity ratio of 0.5882 dominated the parameter set.
Currently, we use metadata in the preprocessing and the identification of
parameters for the matching algorithm. We plan on increasing our use of
metadata to:
· generate labels for rule derivation (this step is currently dependent on human involvement);
· identify dependencies among attributes from different data sources;
· use the dependency information in cleaning individual fields;
· develop an understanding of and provide documentation for the data in the
warehouse.
We also plan on working on improved string matching algorithms that will
also be useful in reducing complexity.
A related research area is the application of data visualization techniques
[14] to detect trends, which can reflect systematic errors in the data. Another
related research area is the incorporation of risk analysis techniques to control
data quality.
Appendix A
We now briefly review a common method of generating matching rules given
a training set of labeled (correctly matched, incorrectly matched, or uncertain if
matched or not) data. This method of rule generation is called CART. Trees
generated by CART are not guaranteed to give high accuracy, but the rules
that correspond to the trees have low evaluation complexity.
CART is a type of decision tree. Methods for automatic construction of
decision trees were first introduced in the social sciences by Morgan and
Sonquist [12]; later work was done by Breiman et al. [11], who introduced new
algorithms for automatic tree construction.
A little formalism will make the ideas pertaining to CART simpler. Let P
denote the space of parameters. Let Y denote the labeling. Tree-based models
seek to predict the labels, Y, on the basis of the parameters, P, by using rules of
the type

IF Parameter 1 < 2.3 AND Parameter 2 ∈ {A, B} → Y = MATCH.

For instance, a rule that CART might generate is:

IF NAME DISTANCE < 2 AND NAME LENGTH > 6 → MATCHED.
A classification or regression tree consists of many such rules constructed by a
method called recursive partitioning. We will not discuss recursive partitioning
in this paper and, instead, refer the reader to Breiman et al. [11] for the details.
The final result of tree construction is a disjoint partitioning of the parameter
space. The partitions correspond to the labels on the leaves of the tree.
The ordering of the parameters in a CART tree is important. Parameters
occurring higher up in the tree have a greater impact on the final rules than
parameters that occur lower down in the tree. The earlier decisions split the
parameter space into larger partitions than later decisions. Hence parameters
that are close to the root of the tree are the most important ones.
Finally, the parameters enter into the tree in conjunctive normal form. This
makes the evaluation complexity of a CART tree small.
References
[1] D. Bitton, D.J. DeWitt, Duplicate record elimination in large data files, ACM Trans. Database Syst. 8 (2) (1983) 255–265.
[2] C.A. Giles, A.A. Brooks, T. Doszkocs, D.J. Hummel, An experiment in computer-assisted duplicate checking, in: Proceedings of the ASIS Annual Meeting, 1976, p. 108.
[3] D. Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, New York, 1997, pp. 215–225.
[4] D.A. Belsley, E. Kuh, R.E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, New York, 1980.
[5] M. Hernandez, S. Stolfo, The merge/purge problem for large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1995, pp. 127–138.
[6] A.E. Monge, C.P. Elkan, The field matching problem: algorithms and applications, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, August 1996, pp. 267–270.
[7] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records, Science 130 (1959) 954–959.
[8] L. Breiman, P. Spector, Submodel selection and evaluation in regression – the X random case, Int. Rev. Stat. 3 (1994) 291–319.
[9] J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Ser. in Comp. Sci. 15 (1989).
[10] G. Burch, in: R. Barquin, H. Edelstein (Eds.), Building, Using and Managing the Data Warehouse, Prentice-Hall, Englewood Cliffs, NJ, 1997.
[11] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Monterey, 1984.
[12] J.N. Morgan, J.A. Sonquist, Problems in the analysis of survey data, and a proposal, J. Am. Stat. Assoc. 58 (1963) 415–434.
[13] L. Breiman, Bagging Predictors, Technical Report 421, Department of Statistics, University of California, Berkeley, 1994.
[14] M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki, K. Wenger, DEVise: integrated querying and visual exploration of large datasets, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1997.
[15] T. Kohonen, Self-Organization and Associative Memory, third ed., Springer, Berlin, 1989.
[16] B. Ripley, Neural networks and related methods for classification, J. Roy. Stat. Soc. Ser. B (1994) 409–456.
[17] T.W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, New York, 1984.