research-article

Open access

Unsupervised Graph-Based Entity Resolution for Complex Entities

Authors:

Nishadi Kirielle,

Peter Christen,

Thilina RanbadugeAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 1

Article No.: 12, Pages 1 - 30

https://doi.org/10.1145/3533016

Published: 20 February 2023 Publication History

All formats PDF

Abstract

Entity resolution (ER) is the process of linking records that refer to the same entity. Traditionally, this process compares attribute values of records to calculate similarities and then classifies pairs of records as referring to the same entity or not based on these similarities. Recently developed graph-based ER approaches combine relationships between records with attribute similarities to improve linkage quality. Most of these approaches only consider databases containing basic entities that have static attribute values and static relationships, such as publications in bibliographic databases. In contrast, temporal record linkage addresses the problem where attribute values of entities can change over time. However, neither existing graph-based ER nor temporal record linkage can achieve high linkage quality on databases with complex entities, where an entity (such as a person) can change its attribute values over time while having different relationships with other entities at different points in time. In this article, we propose an unsupervised graph-based ER framework that is aimed at linking records of complex entities. Our framework provides five key contributions. First, we propagate positive evidence encountered when linking records to use in subsequent links by propagating attribute values that have changed. Second, we employ negative evidence by applying temporal and link constraints to restrict which candidate record pairs to consider for linking. Third, we leverage the ambiguity of attribute values to disambiguate similar records that, however, belong to different entities. Fourth, we adaptively exploit the structure of relationships to link records that have different relationships. Fifth, using graph measures, we refine matched clusters of records by removing likely wrong links between records. We conduct extensive experiments on seven real-world datasets from different domains showing that on average our unsupervised graph-based ER framework can improve precision by up to 25% and recall by up to 29% compared to several state-of-the-art ER techniques.

1 Introduction

Entity resolution (ER) is the process used in data integration to identify and group records into clusters that refer to the same entity where records can be sourced from one or multiple databases [7, 41]. Generally, records used in ER have multiple attributes (commonly known as quasi-identifiers [10]) that describe an entity. For example, a person entity can have a birth record with attributes such as the baby’s name, sex, place of birth, date of birth, as well as the details of the parents. The process of integrating different databases is required in domains such as health analytics, national censuses, e-commerce, crime and fraud detection, national security, and the social sciences [7, 15, 41].

Traditional ER approaches only consider the similarities between attribute values of each compared record pair individually to identify matches [16], while graph-based collective ER techniques make use of the relationships between entities to improve match decisions [2, 14, 28, 33].¹

Most research in ER has only focused on entities that have static attribute values, where these values can contain variations, abbreviations, and errors, or be missing. Such entities also only have static relationships that are the same in all records that represent the same entity [2, 14]. We refer to such entities as basic entities. Basic entities are, for example, publications or consumer products. When linking publication records across two bibliographic databases, for example, an article published in a conference or journal has the same title, venue, and a single author or a group of coauthors across both databases, potentially with some data errors, variations, or missing values in these attributes. These attribute values (unless being corrected after publication), however, do not change for a given publication record. Similarly, the relationship of being an author in a given publication is fixed and also does not change over time. This is assuming the ER task is to link publications across two databases; the task of linking authors will involve complex entities (as described next) because the details of authors, such as their names and affiliations, can change over time.

Research in temporal record linkage has explored the effect of temporally changing attribute values, such as address or name changes when people move or get married, in the ER process [35]. However, these approaches are limited to adjusting the attribute similarities between individual records based on their temporal distances and the likelihood that an attribute value can change over time. For example, address values generally change more often than surname values [27] as people are more likely to change their address in a given period of time than their surname. These approaches, however, do not consider that the relationships encountered between certain types of entities, such as people, can also be different at various points in time. We refer to types of entities that can have changing attribute values as well as different types of relationships at various points in time as complex entities. As we show in Section 6, existing graph-based collective ER techniques fail to achieve high linkage quality for situations where complex entities need to be resolved [2] due to the changing nature of attribute values and different relationships.

As an example, if we consider a set of birth, marriage, and death certificates as a set of databases of complex entities, then these databases will contain records of people at different stages of their life. For instance, the same person can appear as a baby in a birth certificate, a bride in a marriage certificate, and then as a mother in the birth certificates of her own children. The structure of relationships in these certificates, within the same or across different databases, can be complex because the same entity can play a different role in each relationship and can have different types of relationships at different points in their lives. As a baby, an entity has a childOf relationship with her parents from the birth record, while as a married bride she then has a spouseOf relationship with her husband, and when she has a baby herself will have a motherOf relationship in her baby’s birth certificate. Using a motivating example, in Section 2, we describe the different challenges that can occur with complex entities.

Furthermore, ambiguity in attribute values is a common problem in the ER process that involves both basic and complex entities. Entities such as people seem to have higher levels of ambiguity in their attribute values compared to entities such as publications. It is common for many individuals, for example, to share the same first name or surname, the same city and postcode values, or the same occupation. On the other hand, publication titles are generally rather unique. In Table 1, we illustrate this issue by showing the least ambiguous attribute (where values are shared by the smallest numbers of entities based on ground truth data) from a variety of datasets as commonly used in ER research. Publication titles in Scholar, song titles in Million Songs, and movie titles in IMDB, stand out with an average of only slightly more than one entity having a given attribute value. On the other hand, for the Isle of Skye (IOS) dataset [45] (which we use in our evaluation in Section 6), the values of first names are on average shared by more than twenty individuals (at least five individuals for the other datasets that contain complex entities). This higher ambiguity makes the ER process more challenging.

Table 1.

Dataset	Domain	Entity	Attribute with	Entity	Attribute Value Frequencies
		Type	Least Ambiguity	Count	Minimum	Average	Maximum
Isle of Skye [45]	Demographic	Complex	First name (birth babies)	12,285	1	21.80	1,089
Kilmarnock [45]	Demographic	Complex	First name (birth babies)	23,715	1	5.82	1,837
IPUMS [47]	Census	Complex	First name	21,828	1	5.68	1,009
NCVR [47]	Census	Complex	Middle name	8,214,211	1	18.31	181,839
Scholar [13]	Bibliographic	Basic	Publication title	64,263	1	1.02	51
Million Songs [13]	Songs	Basic	Song title	35,463	1	1.14	110
IMDB [13]	Movies	Basic	Movie name	6,407	1	1.15	3

Table 1. Frequencies (Minimum, Average, and Maximum) of the Number of Entities that Share an Attribute Value in Databases that Contain Complex or Basic Entities, where we Only Show the Most Unique Attribute (with the Lowest Average Frequency)

Although there exist collective ER work that studies disambiguation [2] and changing attribute values [14], none of them has explored how to address the problem of disambiguation in a context where attribute values as well as relationships can change over time. For example, if we have two person records of a woman before and after her marriage in which she changes her surname, and both surnames before and after her marriage are ambiguous (such as “Miller” and “Smith”), then we still need to be able to identify that these two records refer to the same woman.

Another important aspect is that many practical ER applications suffer from missing, incomplete, or biased ground truth data (known true matches and non-matches). Particularly in the context of databases that contain complex entities such as person records, ground truth data are often not available, or if available, they might be limited to manually curated, biased, and incomplete matches [10]. Therefore, in spite of the growing interest in applying supervised techniques such as deep learning [31, 37, 39, 53], unsupervised techniques are still required in many practical ER applications.

Being able to link complex entities is highly important in domains such as medical research, where linking patient records of individuals and families over time can help detect patterns of how diseases spread through households and communities, and even facilitate novel genealogy studies [33]; in national census analyses that help governments to better understand patterns of education, migration, fertility, and social mobility over time [10, 18]; in social network analysis to identify the interests and connections of individuals; and in the domain of population reconstruction [3] that intends to link databases of whole populations to reconstruct family trees over time that can be used for data analysis in demography, sociology, and genealogy [17]. Of current interest, reconstructing a historical population from 1918 will allow the analysis of how the Spanish flu has spread [48]. Better understanding such historical pandemics at the scale of a full population can help public health researchers and governments when dealing with health crises, such as the current COVID-19 pandemic, and to be better prepared for future outbreaks of infectious diseases.

Our aim in this work is to provide an unsupervised ER framework that can link records of complex (as well as basic) entities while addressing the challenges current graph-based ER and temporal linkage cannot handle adequately. We address five challenges in ER which are fundamental in linking records about complex entities, where we elaborate on these challenges with a motivating example in the following section. We conduct extensive experiments on four datasets that contain complex entities and three datasets containing basic entities to illustrate how our proposed framework outperforms state-of-the-art ER approaches.

Contribution: We propose a novel unsupervised graph-based ER framework that is focused on addressing the challenges associated with resolving complex entities (referred to as RELATER, which stands for propagation of constRaints and attributE values, reLationships, Ambiguity, and refinemenT for Entity Resolution, the main contributions of our work). We propose a global method of propagating attribute values and constraints to capture changing attribute values and different relationships, a method for leveraging ambiguity in the ER process, an adaptive method of incorporating relationship structure, and a dynamic refinement step to improve clusters of records by removing likely wrong links between records. RELATER can be employed to resolve records of both basic and complex entities, as we will show using extensive experiments in Section 6.

2 Motivating Example

As shown in Table 2, let us consider a set of complex entities where we are interested in resolving eleven person records (\(r_1\) to \(r_{11}\)) from one birth certificate and two death certificates. We assume a birth (\(B\)) certificate describes a birth baby (\(Bb\)) and its mother (\(Bm\)) and father (\(Bf\)), while a death (\(D\)) certificate describes a deceased person (\(Dd\)), their mother (\(Dm\)) and father (\(Df\)), and possibly their spouse (\(Ds\)). Similarly, a marriage (\(M\)) certificate would describe a bride (\(Mb\)) and a groom (\(Mg\)), the bride’s mother (\(Mbm\)) and father (\(Mbf\)), and the groom’s mother (\(Mgm\)) and father (\(Mgf\)).

Table 2.

Certificate Type	ID	Event Year	Birth Baby/Deceased Person	Mother	Father	Spouse
Birth	\(B_1\)	1767	Mary Smith (\(r_1\))	Margeret Smith (\(r_2\))	John Smith (\(r_3\))	–
Death	\(D_1\)	1827	Mary Taylor (\(r_4\))	Margery Smyth (\(r_5\))	John Smith (\(r_6\))	Nichol Taylor (\(r_7\))
Death	\(D_2\)	1777	Anne Smith (\(r_8\))	Maria Smith (\(r_9\))	Jonn Smith (\(r_{10}\))	Duncan Hunter (\(r_{11}\))

Table 2. Sample Records from One Birth Certificate and Two Death Certificates, as Discussed in the Example in Section 2

For simplicity, we only show the name attribute of each record. However, in real data each such record will have various other attributes including an address, an occupation, and a date of birth, marriage or death, respectively, to name a few.

Given the three example certificates in Table 2, we are interested in finding which person entities are associated with these eleven records, hence which records need to be linked such that each resulting cluster of records represents one entity. As an initial step, we need to extract the records from the certificates where \(B_1\) will contribute three person records, Mary Smith (\(r_1\)), Margeret Smith (\(r_2\)), and John Smith (\(r_3\)), and likewise for the other certificates. We then need to determine if Mary Smith in \(B_1\) is the deceased person in \(D_1\) or \(D_2\), or if she is the mother on either/both of these two death certificates. Similarly, the other records in this example have different roles and relationships.

Assume that Mary Smith (\(r_1\)) in \(B_1\) is the deceased in \(D_1\), Mary Taylor (\(r_4\)), and the deceased in \(D_2\), Anne Smith (\(r_8\)) could be the sibling of Mary Smith (\(r_1\)). Therefore, the goal of our ER process is to find the following clusters of records which correspond to six different person entities: \((r_1,r_4)\), \((r_2,r_5,r_9)\), \((r_3,r_6,r_{10})\), \((r_7)\), \((r_8)\), and \((r_{11})\).

There exist different challenges in this example that are of interest particularly for resolving complex entities. The primary challenge is the identification problem as defined by Bhattacharya and Getoor [2], where we need to figure out the set of records that refer to each entity. While this problem has been explored in the collective ER literature [2, 14, 28, 33], as we discuss next, some aspects in our example have either not been investigated so far, or improvements are required because existing methods fail to obtain high linkage quality for complex entities (as we will show in Section 6).

The second challenge is how to resolve entities with changing attribute values. Mary Smith in \(r_1\) has a different surname (\(r_4\)) in her death certificate, \(D_1\), which is likely due to the change of her surname when she got married. Assume we have linked \(r_1\) with Mary Smith’s marriage certificate (not shown), where this link of records states that her surname has changed to Taylor. In such a scenario, if we can propagate the link decision (of her birth record with her marriage record) to the link decision of her birth and death records, then we can easily identify that Mary Smith is the same person as Mary Taylor based on her linked birth and marriage records. While existing temporal record linkage approaches [27, 35] address the challenge of changing attribute values by applying techniques such as temporal decays of attribute weights to capture temporal changes, these solutions do not address the problem of different relationships of the same entity found in records at various points in time.

The third challenge is how to incorporate the different relationships into the ER process to discover positive or negative evidence to guide the ER process. Assume \(r_1\) and \(r_4\) in the example in Table 2 have been linked, and now we are interested in knowing if \(r_4\) and \(r_9\) refer to the same entity as we still do not know if \(r_1\) and \(r_8\) are siblings. Here, even though both \(r_1\) and \(r_4\) have relationships with their mothers and fathers, \(r_9\) has different relationships, namely her baby Anne (\(r_8\)) and her spouse, John (\(r_{10}\)). These different relationships occurring in records at different points in time can provide negative evidence for any subsequent link decisions, for instance in the form of constraints. For example, in order to decide if \(r_4\) refers to the same entity as \(r_9\), we can propagate temporal information from the link decision of \(r_4\) with \(r_1 (Bb)\) discussed above. When it comes to the temporal domain, biological constraints become relevant, in our example for a birth baby to become a mother there should be a gap of at least around 15 years. Therefore, we can decide that \(r_1\) and \(r_9\) cannot be linked (refer to the same entity) as they are only 10 years apart.

In a context where relationships are considered, Dong et al. [14] propagated link decisions by considering attribute value changes and applying constraints. They perform an exhaustive search to find all record pairs associated with any of the linked records, and then merge attributes and use the transitive closure property to remove any additional record pairs [14]. However, no existing graph-based ER work has explored how to efficiently propagate attribute value changes and apply constraints. As we discuss in Section 4.1, we propose an efficient method that avoids an exhaustive search to propagate link decisions. Furthermore, no research has so far explored how this propagation of link decisions is affected when the attribute values of entities are ambiguous.

This disambiguation problem, as we showed in Table 1, is where a given attribute value is shared by multiple (possible many) entities. Values that are shared by only a smaller number of entities provide stronger evidence that two records refer to the same entity. For example, if we look at the attribute values of the records in Table 2, we can see that the surname Smith occurs more often than the surname Hunter. As a result, if we have two records, such as \(r_1\) and \(r_8\), which both have the surname Smith, then this shared value does not provide sufficient evidence to link those records because Smith is ambiguous. However, if we find a new record with a surname Hunter, it is more likely that the new record represents the same entity as \(r_{11}\) because Hunter is unique in our example. Bhattacharya and Getoor [2] have explored ambiguity of static attribute values in relational clustering for collective ER. However, they have not investigated how to incorporate disambiguation while propagating link decisions, or when attribute values can change over time.

In collective ER, we are interested in linking records that are relationally connected with other records. For example, consider the two connected record groups of \(B_1\) and \(D_2\) in Table 2. If we assume that Mary Smith and Anne Smith are siblings, then we should not link them. However, the parent record pairs in that group, \((r_2,r_9)\) and \((r_3,r_{10})\), need to be linked as they refer to their parents. We refer to this challenge as the partial match group problem, where only a subset of relationally connected records correspond to the same entities while others do not. While recent ER approaches take relationships into account by either incorporating relationship information into the similarity calculation [2, 14] or by making a group link decision [19, 40], these approaches would fail to properly link the parent records in this example because the overall similarity drops due to the different sibling first names.

The final challenge is the one of incorrect link decisions. Because the process of linking two records is no longer independent from linking other records in the context of collective ER [2, 14], a single wrong link might propagate into other link decisions and result in an increase in the number of false matches as well as missed true matches. For example, assume in Table 2 that we have incorrectly linked \(r_1\) with \(r_8\) given both their parent’s first names are similar and their surnames are the same. However, as a deceased person can only be linked to a single birth baby, \(r_8\) will then not be linked to its correct birth record, and similarly \(r_4\) might get linked to a wrong birth record. To the best of our knowledge, this challenge has not yet been addressed in the literature.

3 Problem Definition and Overview

We now define the problem of ER for databases of complex entities. We show the main notation we use throughout the article in Table 3, where we use bold letters for lists and sets (with upper-case bold letters for lists of sets and lists), and normal type letters for numbers and text.

Table 3.

Notation	Description
\(r\), \(\mathbf {R}\), \(\mathbf {B}\)	A record, a set of records, a set of blocks each consisting of a set of records
\(o\), \(\mathbf {O}\)	An entity, a set of entities
\(\mathbf {m}\), \(\mathbf {M}\)	A matched cluster of records, a set of matched clusters of records
\(a\), \(\mathbf {A}\)	An attribute, a set of attributes
\(v\), \(v_a\), \(\mathbf {v}_a\)	An attribute value, a value of attribute \(a\), the set of attribute values of attribute \(a\)
\(\mathbf {G}_O = (\mathbf {N}_O, \mathbf {E}_O\))	An entity graph with a set of nodes and edges
\(\mathbf {G}_D = (\mathbf {N}_D, \mathbf {E}_D\))	A dependency graph with a set of nodes and edges
\(n_A, \mathbf {N}_A\)	An atomic node, a set of atomic nodes
\(n_R, \mathbf {N}_R\)	A relational node, a set of relational nodes
\(\mathbf {g}\), \(\mathbf {Q}\)	A group of nodes, a priority queue of node groups
\(\mathbf {C} = \mathbf {C}_A \cup \mathbf {C}_R\)	A set of adjacent nodes consisting of adjacent atomic and adjacent relational nodes
\(\rho\), \(\mathbf {P}\)	A role type, a set of role types
\(\mathbf {T}, \mathbf {L}\)	Sets of temporal and link constraints
\(sim\), \(sim_a\), \(sim_d\)	Total, atomic, and disambiguation similarity score
\(\gamma\)	The weight distribution for the two similarity components, \(sim_a\) and \(sim_d\)
\(sim_M\), \(sim_C\), \(sim_E\)	Average similarities of must, core, and extra attributes
\(w_M\), \(w_C\), \(w_E\)	Weights given to must, core, and extra attribute categories
\(t_b, t_m\), \(t_a\)	Thresholds for bootstrapping, merging, and atomic node similarity
\(t_n\)	Threshold for minimum number of nodes in a cluster to split by bridges
\(t_d\)	Threshold for minimum density of a cluster to refine
\(Mof, Fof, Cof, Sof\)	motherOf, fatherOf, childOf, and spouseOf relationships

Table 3. Main Notation used Throughout the Article

Let \(\mathbf {R}\) be a set of records from a database and \(\mathbf {O}\) be a set of real-world entities (such as people). We assume each record \(r \in \mathbf {R}\) has a reference \(r.o\) to the entity \(o \in \mathbf {O}\) that is represented by \(r\), with \(\mathbf {O} = \lbrace r.o: \forall r \in \mathbf {R}\rbrace\), and where these \(r.o\) are unknown at the beginning of the ER process. Each entity \(o \in {\bf O}\) is represented by a set of records, \({\bf m} \subset {\bf R}\), that describe the entity. We denote such a record set \(\mathbf {m}\) as a cluster of records, and the set of all such clusters with \(\mathbf {M}\). Each record \(r \in \mathbf {R}\) can have a set of other records \(\mathbf {R}^{\prime } \subset \mathbf {R}\), with \(r \not\in \mathbf {R}^{\prime }\), that are connected to \(r\) by relationships such as motherOf (Mof), fatherOf (Fof), childOf (Cof), spouseOf (Sof), or authorOf. We refer to such a set \(\mathbf {R}^{\prime }\) as a record group. Each record \(r \in \mathbf {R}\) contains values, \(v\), for a set of attributes, \(\mathbf {A}\), that provide information such as the name, address, and gender for a person; or the author name, venue, and title for a publication. Each record \(r\) also has a timestamp, \(r.t\), that stands for the point in time (usually a date) when the event that corresponds to \(r\) occurred. Similarly, each record \(r\) is associated with a role, \(r.\rho \in \mathbf {P}\), where \(\mathbf {P}\) is the set of all possible roles such as a mother, a child, an author, or a publication.

We first describe the challenges that are specific for resolving complex entities, illustrating them in Figure 1. We then formally define the problem of ER to resolve records of complex entities.

Fig. 1.

(a)

Changing attribute values: Let \(r_i, r_j, r_k \in \mathbf {R}\) be three records that represent the same entity (i.e., \(r_i.o = r_j.o = r_k.o\)). Assume \(r_i.v_a = r_j.v_a\) and \(r_j.v_a \ne r_k.v_a\), where \(a \in \mathbf {A}\), and \(r_i.t \lt r_j.t \lt r_k.t\). If \(r_j.v_a\) and \(r_k.v_a\) are not variations of the same attribute value (i.e., are not different values due to typographical errors but actual changed values [7, 35], such as surname Smith to Taylor), then this is the challenge of changing attribute values where an entity has different values for the same attribute, \(a\), at the timestamps of \(r_j.t\) and \(r_k.t\).

(b)

Different relationships: Let \(r_i, r_j, r_u, r_v \in \mathbf {R}\) be four records where \(r_i\) and \(r_j\) are relationally connected with \(r_i.o \ne r_j.o\) and \(r_i.t = r_j.t\). Similarly, \(r_u\) and \(r_v\) are also relationally connected with \(r_u.o \ne r_v.o\) and \(r_u.t = r_v.t\). Assume \(r_i.o = r_u.o\) and \(r_i.t \ne r_u.t\) (and therefore \(r_j.t \ne r_v.t\)). There can be situations where \(r_j.o \ne r_v.o\) because at different timestamps \(r_i\) and \(r_u\) have different relationships with \(r_j\) and \(r_v\). This is the challenge where we encounter different relationships for the same entity in its records at different points in time, such as a baby to mother relationship at birth versus a bride to groom relationship at marriage.

(c)

Disambiguation problem: Let \(\mathbf {R}_{\alpha } \subset \mathbf {R}\) be a set of records having the same value for a given attribute, \(a \in \mathbf {A}\) (\(\forall r_i, r_j \in \mathbf {R}_{\alpha } : r_i.v_a = r_j.v_a \wedge i \ne j\)). Assume each of these records represents a different entity in \(\mathbf {O}\) such that \(\forall r_i, r_j \in \mathbf {R}_{\alpha } : r_i.o \ne r_j.o\). We refer to the challenge of distinguishing such entities having the same attribute value (for example, many people having the common surname Smith) as the disambiguation problem.

(d)

Partial match group problem: Let \(\mathbf {R}_{\alpha } \subset \mathbf {R}\) and \(\mathbf {R}_{\beta } \subset \mathbf {R}\) be two groups of records, with \(\mathbf {R}_{\alpha } \cap \mathbf {R}_{\beta } = \emptyset\), where we assume the records in each group are relationally connected with each other. When two such groups are compared for linking, in the set of paired records, \(\lbrace (r_i, r_k), (r_j, r_l)\rbrace\), where \(r_i, r_j \in \mathbf {R}_{\alpha }\) and \(r_k, r_l \in \mathbf {R}_{\beta }\), if \(\exists (r_i, r_k), (r_j, r_l) : r_i.o \ne r_j.o \wedge r_k.o = r_l.o\), we define such a group as a partial match group. We refer to this challenge of having some record pairs that refer to different entities while other record pairs referring to the same entity in relationally connected record groups, (such as linking parents across the birth records of siblings, but not linking the siblings), as the partial match group problem.

(e)

Incorrect link problem: Let \(\mathbf {M}\) be the set of record clusters in the record set \(\mathbf {R}\) that have been linked. Assume \(\mathbf {m}_k \subset \mathbf {R}\) where \(\mathbf {m}_k \in \mathbf {M}\) and \(\exists (r_i, r_j) \in \mathbf {m}_k : r_i.o \ne r_j. o\). This challenge is the incorrect link problem where we have records representing different entities in the same cluster of records, such as an entity cluster that represents a certain individual to contain a record of a sibling.

Definition 3.1 (ER of Complex Entities).

Given a set of records, \(\mathbf {R}\), the ER problem of resolving records of complex entities is to link records \(r_i \in \mathbf {R}\) into clusters of records \(\mathbf {m}_k\) such that \(\mathbf {R} = \lbrace r_i: \forall r_i \in \mathbf {m}_k, \forall \mathbf {m}_k \in \mathbf {M}\rbrace\) (all records in \(\mathbf {R}\) have been inserted into a cluster) with \(\mathbf {M} = \cup \lbrace \mathbf {m}_k\rbrace\) and \(\forall \mathbf {m}_i, \mathbf {m}_j \in \mathbf {M}: \mathbf {m}_i \cap \mathbf {m}_j = \emptyset\) (each record has been inserted into only one record cluster); \(\mathbf {O} = \lbrace r_i.o: \forall r_i \in \mathbf {m}_k, \forall \mathbf {m}_k \in \mathbf {M}\rbrace\) and \(\forall r_i \in {\bf m}_k: {r_i.o = o_j}, \forall {\bf m}_k \in {\bf M}\) (every entity in \(\mathbf {O}\) is represented by one record cluster); and \(\forall \mathbf {m}_k \in \mathbf {M}: |\mathbf {m}_k| \ge 1\) (each record cluster contains one or more records, where records that were not linked are clusters of size 1) in a context where relationally connected record groups can contain partial match groups and the records of an entity can have changing attribute values, different relationships in different records, and ambiguous attribute values shared with other entities.

Figure 2 shows the pipeline of our framework where the input is the groups of relationally connected records extracted from one or more databases, and the output is a set of entities represented as clusters of records. We now provide an overview of the three main steps of RELATER as described in detail in Section 5 (the white coloured boxes in Figure 2). In Section 4, we then discuss how each key technique (the blue coloured boxes in Figure 2) contributes to the pipeline.

Fig. 2.

(1)

Dependency Graph Generation: To resolve records, we need to represent them in a data structure that can capture the relationships among records. Hence, we generate a dependency graph [14] defined as follows.

Definition 3.2.

A dependency graph is a directed graph, \(\mathbf {G}_D\), that consists of a set of nodes, \(\mathbf {N}_D\), where these nodes represent pairs of attribute values or pairs of records; and a set of edges, \(\mathbf {E}_D\), that represent relationships between nodes. \(\mathbf {N}_D\) consists of atomic nodes, \(\mathbf {N}_A\), that represent pairs of attribute values, and relational nodes, \(\mathbf {N}_R\), that represent pairs of candidate records that possibly refer to the same entity, where \(\mathbf {N}_D = \mathbf {N}_A \cup \mathbf {N}_R\).

To generate the dependency graph, we potentially first have to extract records representing individual entities (unless the input dataset already contains such individual records). For example, as shown in Table 2 and Figure 3, to generate the dependency graph for person data, we first extract individual records from birth and death certificates. Then, as we describe in Section 5.1, for each pair of similar values in an attribute \(a \in {\bf A}\) (with similarities greater than a threshold \(t_a\)), \(v_i\) and \(v_j\), we add a node \((v_i, v_j) \in \mathbf {N}_A\) to \(\mathbf {G}_D\). We repeat this process for a selected set of quasi-identifying attributes that represent an entity. For each pair of records, \((r_i, r_j) \in \mathbf {R}\), that possibly refer to the same entity (based on blocking as we elaborate on in Section 5.1), we add a node \((r_i, r_j) \in \mathbf {N}_R\) to \(\mathbf {G}_D\).

Fig. 3.

A directed edge in \(\mathbf {G}_D\) represents that the similarity of the destination node depends on the similarity of the source node. Edges between nodes in \(\mathbf {N}_R\) represent relationships between records, such as motherOf, or authorOf. For each node in \(\mathbf {N}_R\), the set of its adjacent nodes with incoming edges is denoted by \(\mathbf {C} = \mathbf {C}_A \cup \mathbf {C}_R\), where \(\mathbf {C}_A\) and \(\mathbf {C}_R\) are the sets of adjacent atomic and relational nodes of the specified node, respectively. For each \(r_i.v_i\) and \(r_j.v_j\), if the node \((v_i, v_j) \in \mathbf {N}_A\), then there is a directed edge \((v_i, v_j) \rightarrow (r_i,r_j)\), with \((v_i, v_j) \in \mathbf {C}_A\) for the relational node \((r_i,r_j)\). For each pair of nodes \(n_i, n_j \in \mathbf {N}_R\), if there is a relationship between these nodes then there exist two directed edges between them: \(n_i \rightarrow n_j\) (where \(n_i \in \mathbf {C}_j\) for \(n_j\)) and \(n_j \rightarrow n_i\) (where \(n_j \in \mathbf {C}_i\) for \(n_i\)). For example, there will be two edges between a mother node and a child node. To keep it simple, we show these as double-arrowed edges in the example figures.

Figure 3(b) shows an example of a small dependency graph. Since each relational node in this graph is associated with two records, we refer to linking two records in a node as merging the node. Each node in \(\mathbf {G}_D\) is also associated with a node state [14], where this state changes throughout the running of our framework. The possible states are active (considered for merging), inactive (failed merging due to insufficient evidence such as low similarity), merged (two records in the node are linked), and non-merge (not considered for merging due to constraint violations).

(2)

Bootstrapping: In this step, described in more detail in Section 5.2, we merge highly similar groups of nodes in \(\mathbf {G}_D\) that have an average similarity greater than a predefined bootstrapping threshold, \(t_b\). As we then propagate link decisions, it is important to bootstrap the framework only with highly confident record pairs.

(3)

Iterative Merging and Entity Graph Generation: In this step, we iteratively merge candidate nodes in \(\mathbf {G}_D\) considering their relationship structure and the ambiguity of their attribute values. We also propagate link decisions to account for changing attribute values and different relationships. In this process, we generate an entity graph (defined below) to capture the relationships between entities as defined below. Finally, we dynamically refine the record clusters to remove likely wrong links, as we describe in Section 4.5.

Definition 3.3.

An entity graph is a directed graph, \(\mathbf {G}_O\) that consists of a set of nodes, \(\mathbf {N}_O\), where each node represents an entity \(o \in \mathbf {O}\); and a set of edges, \(\mathbf {E}_O\), that represent the relationships between these entities.

After describing the key techniques next, in Section 5, we then discuss these three steps of RELATER in detail.

4 Key Techniques

In this section, we describe the key techniques, including all novel contributions, underlying the RELATER framework that solve the five challenges described in the previous section. These techniques help our framework to achieve high linkage quality specifically for complex entities when compared to existing ER approaches.

4.1 Global Propagation of Attribute Values (PROP-A)

As we defined in Section 3, the first challenge is changing attribute values where values such as names and addresses can change over time. To solve this problem, we propagate those changing attribute values through the ER process in the bootstrapping and iterative merging steps of our framework. When an attribute value changes over time, this change makes it difficult to decide if two records refer to the same entity. Therefore, with this technique, we first check if any of the records are associated with a record cluster. If there are associated entities, we check the attribute values of all records associated with these entities to identify how its attribute values have changed over time.

We use such changing attribute values in the ER process as positive evidence for any subsequent links. For this, we maintain clusters of records, \(\mathbf {M}\), that aggregate all linked records and their attribute values. Let us assume, we merge a node containing two records, \((r_i, r_j)\). To add these two linked records to \(\mathbf {M}\), we consider three different cases. First, if neither \(r_i\) or \(r_j\) are associated with a record cluster, then we create a new cluster, \(\mathbf {m}_k\), add \(r_i\) and \(r_j\) into \(\mathbf {m}_k\), and then add \(\mathbf {m}_k\) to \(\mathbf {M}\). Second, if only one of \(r_i\) or \(r_j\) is associated with a record cluster, for instance \(r_i\) is already associated with a cluster \(\mathbf {m}_k\), based on a previous link, while \(r_j\) is not, then we add \(r_j\) to \(\mathbf {m}_k\). Third, if both records are associated with two different record clusters, we merge those two clusters. For example, in Figure 4, we can see that the merged node \((r_1,r_{12})\) is associated with the record cluster \(\mathbf {m}_1\), which contains all attribute values of \(r_1\) and \(r_{12}\).

Fig. 4.

In order to propagate attribute values, when linking two records, \(r_i\), \(r_j \in \mathbf {R}\), we find the most similar attribute value pair of the two records by considering the associated record cluster(s), if there are any. For example, when we consider the node \((r_1, r_4)\) in Figure 4, because \(r_1\) is part of the associated record cluster \(\mathbf {m}_1\), we compare all attributes of \(r_4\) with the corresponding attribute values of \(\mathbf {m}_1\) to find the best matching atomic nodes with the most similar values. As the surname of \(r_4\) is Taylor, the node \((r_1, r_4)\) is already associated with the atomic node (Smith, Taylor). When we compare Taylor with the surnames of \(\mathbf {m}_1\), and assuming the similarities \(sim(Tayler, Taylor) \gt sim(Smith,Taylor)\), we remove the edge from the atomic node (Smith, Taylor) and add a new edge from the (Tayler, Taylor) node to the relational node \((r_1, r_4)\). In this way, even if an individual changes their name or address over time, our framework can still identify them based on previous links or highly similar attribute values. With this attribute propagation step, as connected atomic nodes are changing, the similarity of each relational node can change through the ER process, which is a significant improvement over previous collective ER approaches that do not consider such attribute value changes.

In the context of collective ER, the idea of propagating linking decisions has first been proposed by Dong et al. [14]. However, our propagation method is different from this previous approach as we make a global propagation of attribute values using a unified view of all record clusters, \(\mathbf {M}\), that represent entities. On the other hand, Dong et al. [14] propagated link decisions with an exhaustive search that merges relational nodes in the graph.

4.2 Global Propagation of Constraints (PROP-C)

The purpose of this technique is to make use of the different relationships that we encounter in records at different points in time. Since different relationships correspond to different entities, we cannot directly compare them in making link decisions. However, relationships can be used as negative evidence for any subsequent links. In our framework, we model such negative evidence as constraints and apply these constraints in the steps of dependency graph generation, bootstrapping, and iterative merging. For example, for a birth baby (\(Bb\)) to become a birth mother (\(Bm\)), biologically, there should be a time gap of at least around 15 and at most around 55 years [45]. For other role pairs, there are different constraints that can be applied based on domain knowledge [8]. We refer to those constraints that are associated with temporal aspects as temporal constraints.

In addition to that, there can be constraints that are associated with the properties of certain relationships. For example, in Figure 4, after the node \((r_1, r_4)\) is merged, where we assume \(r_1\) and \(r_4\) refer to a baby and a deceased person, respectively, \(r_1\) cannot be linked with any other death record because a birth baby (\(Bb\)) can only be linked to one deceased person (\(Dd\)), and vice versa. We refer to constraints that are associated with such relationships as link constraints. Link constraints are one-to-one and one-to-many constraints that can be applied to pairs of entity roles.

As these constraints are domain dependent, they need to be manually specified by domain experts or learned from training data to be used by our framework. As we defined in Section 3, each record, \(r \in \mathbf {R}\), is associated with a role, \(r.\rho\), where \(\rho \in \mathbf {P}\) and \(\mathbf {P}\) is the set of all roles for a given domain, and a timestamp, \(r.t\). We define the temporal and link constraints that we use as follows.

Definition 4.1.

Temporal constraints apply for databases with complex entities where such constraints restrict (for specific role pairs) if two records should be considered for linking or not. We model temporal constraints as a set \(\mathbf {T} = \bigcup _{\rho _1,\rho _2 \in \mathbf {P}} \mathbf {T}_{\rho _1,\rho _2}\), of time periods where records can and cannot be linked.

Definition 4.2.

Link constraints restrict, for a given role pair, how many links a record can be involved in. We model link constraints as a set \(\mathbf {L} = \bigcup _{\rho _1,\rho _2 \in \mathbf {P}} \mathbf {L}_{\rho _1,\rho _2}\), of one-to-one or one-to-many constraints that determine how many records can be involved in a specific relationship for this role pair.

For example, the temporal constraint between the roles of birth baby and birth mother \(\mathbf {T}_{Bb,Bm}\) can be represented as \((r_i.\rho = Bb) \wedge (r_j.\rho = Bm) \wedge (15 \le YearTimeGap(r_i, r_j) \le 55) \Rightarrow ValidMerge(r_i,r_j)\). Similarly, the one-to-one link constraint between the roles of birth baby and deceased person \(\mathbf {L}_{Bb,Dd}\) can be represented as \((r_u.\rho = Bb) \wedge (r_v.\rho = Dd) \wedge (|Links(r_u, Dd)| = 0) \wedge (|Links(r_v, Bb)|=0) \Rightarrow ValidMerge(r_u,r_v)\), which means both records \(r_u\) and \(r_v\) cannot be involved in any other links to a deceased person and a birth baby, respectively.

4.3 Leverage Ambiguity of Attribute Values (AMB)

An interesting, yet sometimes overlooked, aspect of ER is the ambiguity of attribute values, where potentially many entities can share the same value, such as a common surname or city name [7]. Such values can therefore become ambiguous, and possibly lead to record pairs with high attribute similarities yet referring to different entities. To solve this challenge, we therefore propose a method in the dependency graph generation step of our framework to calculate the overall similarity, \(sim\), of a relational node that incorporates ambiguity where we combine the similarity of two components, an atomic similarity, \(sim_a\) and a disambiguation similarity, \(sim_d\), defined as

\begin{align} sim_a(r_i, r_j) & = \frac{w_M \cdot sim_M(r_i, r_j) + w_C \cdot sim_C(r_i, r_j) + w_E \cdot sim_E(r_i, r_j)}{w_M + w_C + w_E}, \end{align}

(1)

\begin{align} sim_d(r_i, r_j) & = \frac{\log _2 (|\mathbf {O}| / (r_i.f + r_j.f))}{\log _2 |\mathbf {O}|}, \end{align}

(2)

\begin{align} sim(r_i, r_j) & = \gamma \cdot sim_a(r_i, r_j) + (1-\gamma) \cdot sim_d(r_i, r_j), \end{align}

(3)

where \(0 \le \gamma \le 1\) is the weight distribution for the two similarity components, \(sim_a\) and \(sim_d\) as we describe next and illustrate in Figure 5.

Fig. 5.

To calculate the initial atomic similarity, \(sim_a\), of a relational node, we consider its set of adjacent atomic nodes with incoming edges, \(\mathbf {C}_A\). The similarities between attribute values in atomic nodes are assumed to be always between 0 (completely different values) and 1 (same values). Similarities are generally calculated using approximate string comparison functions, such as Jaro-Winkler or edit-distance [7], as appropriate to the values in an attribute. The importance of different attributes towards the calculation of \(sim_a\) also varies. For example, in databases with complex entities, attributes such as first names are more important because generally they are more complete and also more stable over time, whereas attributes such as occupation or address can be missing and they can change over time [8, 45]. In bibliographic data, for pairs of authors, their first names and surnames can be considered more important than the venue of a publication, which can be considered as a less important attribute that provides additional information because many publications share the same venue.

To this end, we group attributes into three categories: Must, Core, and Extra, based on their importance in the ER process determined using domain knowledge or data characteristics, such as completeness [10]. For two records to be classified similar, they need to have highly similar values in the Must attributes (such as first name), but they can have a comparatively lower similarity in Core attributes (like surname). Extra attributes (such as the occupation of a person or the venue of a publication) provide further evidence of similarities between records.

We calculate an initial atomic similarity as shown in Equation (1), where \(sim_M\), \(sim_C\), and \(sim_E\) represent the average of atomic node similarities of the Must, Core, and Extra attribute categories, while \(w_M\), \(w_C\), and \(w_E\) represent their corresponding weights, which can be learnt from a training dataset or determined via domain knowledge [8, 45]. As Extra attributes are subsidiary, the presence of an Extra attribute provides positive evidence for a match while its absence does not necessarily provide negative evidence. This is supported because we add atomic nodes only if the similarity of the two attribute values in a node are above a pre-defined threshold, \(t_a\). Therefore, we set \(w_E = 0.0\) if all Extra attributes are absent.

If the pair of records in a relational node has attribute values that occur frequently in the set of records \(\mathbf {R}\), then a high attribute similarity of a record pair is not important compared with a pair of records that have rare attribute values and the same attribute similarity [16]. In our example in Section 2, Smith occurs seven times whereas Hunter occurs only once. Two records having the surname Hunter, therefore, have a higher likelihood of referring to the same real-world entity compared to two records having the surname Smith. As the link decisions in our framework are dependent on each other, we need to prioritise record pairs with unique or rare attribute values such that they are processed before record pairs with ambiguous attribute values. As this is similar to the concept of inverse document frequency as used in information retrieval, we use a normalised score of inverse document frequency [46] as the disambiguation similarity score, \(sim_d\). Assume \(a_{\alpha }, a_{\beta } \in \mathbf {A}\) are the attributes that we consider for calculating ambiguity. Then, the frequency \(r.f\) for a record \(r\) is calculated as the frequency of attribute value combinations, \(v_{a_{\alpha }}\) and \(v_{a_{\beta }}\), in one of the duplicate free datasets that we aim at linking. For two records \(r_i\) and \(r_j\) in a relational node, let \(r_i.f\) and \(r_j.f\) be the frequencies calculated as described. If the number of unique records in the dataset (i.e., number of entities) is \(|\mathbf {O}|\), we define \(sim_d\) as shown in Equation (2), where we can estimate \(|\mathbf {O}|\) using the same duplicate free dataset we used to calculate frequencies.

4.4 Adaptive Leveraging of Relationship Structure (REL)

The purpose of this technique is to leverage the relationship structure to link records in the bootstrapping and iterative merging steps. Recent approaches have incorporated relational similarities between nodes into the similarity score in different ways [2, 14, 19, 28]. A limitation of directly incorporating relational similarities into overall similarity scores is that this can affect the link decisions in partial match groups, as we defined in Section 3. In our example, as illustrated in Figure 6, for linking the two mother records, \(r_2(Bm)\) and \(r_9(Dm)\), basically there are two different methods in the literature that incorporate the similarity of relational nodes. The first method [40] considers the average similarity of the group of nodes, \((r_2,r_9)\), \((r_3, r_{10}),\) and \((r_1,r_8)\). The second method [2, 14] has a component in the similarity calculation that provides a separate weight to the similarities of the relational nodes, \((r_3, r_{10})\) and \((r_1,r_8)\). In both of these methods, the average similarity or the similarity score of the node \((r_2,r_9)\) gets lowered because of the low similarity of \((r_1,r_8)\), where this node has a low similarity because its two records represent two separate entities (two siblings).

Fig. 6.

To overcome this problem, RELATER provides a novel adaptive method to exploit the relational structure of entities. As \(\mathbf {G}_D\) is a dependency graph, a connected component of a group of relational nodes represents the structure of relationships between records. In order to decide if a pair of records in a relational node needs to be linked, we consider the average similarity of the relationally connected node group. Then, if that average similarity is less than a predefined threshold \(t_m\), we adaptively remove the node with the lowest similarity from the group and recalculate the average similarity.

As per the previous example of siblings and as illustrated in Figure 6, \(\mathbf {G}_D\) will have a group of three relational nodes (a triangle) representing the two mothers \((r_2, r_9)\), two fathers \((r_3, r_{10})\), and two siblings \((r_1, r_8)\). To leverage the relational structure, we consider the average similarity (0.63) of all three nodes in the first iteration. If this average similarity is less than \(t_m\), then we remove the node with the lowest similarity and continue to consider the remaining nodes. Therefore, in the example in Figure 6, we ignore the lowest similarity sibling node \((r_1, r_8)\) (as the two records refer to two different individuals), and continue with the remaining pair of parent nodes, \((r_2, r_9)\) and \((r_3, r_{10})\), which now have an average similarity of 0.85 to proceed with the merging and solving of the partial match group problem we discussed in Section 3.

4.5 Dynamic Refining of Record Clusters (REF)

In collective ER, the propagation of link decisions to the subsequent linking of records can lead to poor linkage quality if two records that refer to two different entities have been linked in a previous iteration. However, if we can detect such incorrect links, removing them can facilitate the correct pairs to be linked in following iterations.

To solve this problem, we propose a novel method to dynamically refine record clusters, \(m_k \in \mathbf {M}\), by removing wrongly linked records from such clusters. This step is conducted after each of the bootstrapping and merging steps, as we show in Figure 2. Each time a node is merged, we add the pair of records associated with the node to the corresponding record cluster, \(\mathbf {m}_k\), and the node corresponding to \(\mathbf {m}_k\) in the entity graph \(\mathbf {G}_O\) is updated. As we need to keep track of how records are added into record clusters, we create separate undirected graphs for each such record cluster where the nodes indicate records and edges are added between each linked record pair. We utilise this graph structure of relations formed in each record cluster to identify likely wrong links.

Based on the hypothesis that loosely connected record clusters (such as chains) are more likely to contain errors compared to densely connected record clusters (such as cliques), we apply the graph measure based error identification method proposed by Randall et al. [44] on the graph generated from each record cluster. As a link decision in our framework can be propagated into future link decisions, early identification of wrongly linked record pairs allows correct record pairs to be linked in the next iteration.

We used the graph measures of density and bridges (illustrated in Figure 7) to identify loosely connected record clusters. A bridge is an edge that will disconnect the graph if removed; and density, \(d\), is measured by the number of edges out of the total number of possible edges in a graph [44] calculated as \(d = 2|E|/(|N|\cdot (|N|-1))\), where \(E\) and \(N\) are the sets of edges and nodes of the undirected graph generated from a record cluster, \(\mathbf {m}_k\). For such a cluster having at least three records, \(|\mathbf {m}_k| \ge 3\), we calculate the density and if it is less than a predefined threshold, \(t_d\), we remove the node with the lowest degree. Similarly, for a record cluster having more than \(t_n\) records, we split the record cluster by any existing bridges. In Section 6, we discuss how to set the parameters, \(t_d\) and \(t_n\) in our experimental evaluation.

Fig. 7.

5 Entity Resolution with RELATER

In this section, we present our ER framework, as shown in Figure 2, which consists of three main steps (dependency graph generation, bootstrapping, and iterative merging and entity graph generation). We detail how we utilise the five key techniques we discussed before in this framework. Finally, we provide the time complexity of our framework.

5.1 Dependency Graph Generation

As we are representing records in a dependency graph, \(\mathbf {G}_D\), if all possible record pairs and attribute pairs are added into this graph, then it can get very large. Therefore, we apply a blocking technique to reduce the comparison space by removing likely non-matching record pairs and grouping potentially matching pairs [42]. We employ a locality sensitive hashing based blocking technique that maps similar attribute value pairs to the same hash value to group likely matches [42]. After blocking, \(\mathbf {G}_D\) is generated in two phases considering only record pairs in the generated blocks. In the first phase, only attribute pairs that have a similarity of at least a threshold \(t_a\) are added to the graph from each block as atomic nodes, \(\mathbf {N}_A\), along with their similarities.

In the second phase, we add relational nodes, \(\mathbf {N}_R\), from each block to \(\mathbf {G}_D\) based on several filtering methods. First, we consider only possible pairs of entity role types, such as pairs of two authors or pairs of two birth mothers, while ignoring pairs with different genders, such as a birth mother paired with a groom. Second, we filter record pairs by temporal constraints, \(\mathbf {T}\), for databases having complex entities where such constraints can be applied. Then, we filter record pairs based on the number of atomic nodes available that correspond to the three attribute categories Must, Core, and Extra, as we discussed in Section 4.3. For example, we can define a rule such that at least one atomic node from the Must and two from the Core attributes need to exist in \(\mathbf {N}_A\) for a record pair to be added to \(\mathbf {N}_R\). Our framework provides the flexibility to select such rules to obtain a high-quality initial dependency graph, \(\mathbf {G}_D\). These rules can be determined from domain knowledge or can be learned using a training dataset containing known matches.

Relational nodes, \(\mathbf {N}_R\), are added to \(\mathbf {G}_D\) in the active state with directed edges from their corresponding atomic and relative relational nodes. For example, in Figure 3(b), the node \((r_1, r_4)\) has \((r_2, r_5)\), and \((r_3, r_6)\) as relational nodes. The incoming edges from \((r_2, r_5)\) to \((r_1, r_4)\), and from \((r_1, r_4)\) to \((r_2, r_5)\), which indicate the motherOf and childOf relationships, respectively, are added between those two nodes. Similarly, relationship edges are added between the father’s relational node as well. Then, for all relational nodes, we calculate the similarity score, \(sim\), using Equation (3) where we leverage the ambiguity of attribute values (AMB) as we discussed in Section 4.3.

5.2 Bootstrapping

As an initial step of the merging process, we first bootstrap the dependency graph, \(\mathbf {G}_D\), by merging highly similar relational nodes. In collective ER, any subsequent links always depend upon the previous links [2]. Therefore, it is important to bootstrap this graph by only linking record pairs for which we have a high confidence of them to be a correct match.

After \(\mathbf {G}_D\) is generated, we have groups of relational nodes of different sizes. In the bootstrapping step, we consider only nodes in groups (leaving the singletons), where the average atomic similarities, following Equation (1), of all nodes in a group must be at least the bootstrap threshold which we set to \(t_b = 0.95\) in our evaluation in Section 6 (based on a set of initial experiments) to bootstrap the graph by linking highly similar record pairs. We only consider groups of nodes connected with relationships at this stage rather than singleton nodes that are not connected with any other nodes by relationships because groups provide more relationship evidence than singletons. While linking such highly similar node groups, we also propagate the attribute values (PROP-A) and constraints (PROP-C) and adaptively leverage relationship structures (REL), as shown in Figure 2. After bootstrapping the dependency graph, \(\mathbf {G}_D\), we dynamically refine record clusters (REF) to remove any incorrectly linked record pairs, as we discussed in Section 4.5.

5.3 Iterative Merging and Entity Graph Generation

The merging process used in RELATER, as outlined in Algorithm 1, iteratively merges nodes in the dependency graph, \(\mathbf {G}_D\), to find the entities that correspond to the records in nodes in this graph. In the bootstrapping step, as we only linked the highly similar record pairs, we obtain the bootstrapped dependency graph, \(\mathbf {G}_D\), and the corresponding set of record clusters \(\mathbf {M}\). Therefore, \(\mathbf {G}_D\) and \(\mathbf {M}\) become the input to the iterative merging process. Additionally, we provide the set of temporal \(\mathbf {T}\) and link \(\mathbf {L}\) constraints and the thresholds for merging \(t_m\), graph bridges \(t_n\), and graph density \(t_d\), as input to the algorithm.

The merging algorithm starts in line 1 by initialising the entity graph, \(\mathbf {G}_O\), with the record clusters generated in the bootstrapping step. Then, the algorithm proceeds by generating a priority queue, \(\mathbf {Q}\), of node groups. These are relationally connected nodes with relationship edges that are in the active state in \(\mathbf {G}_D\), as we described in Section 5.1. This queue is initialised in line 2 using all active relational node groups, where precedence is given to larger groups and then to groups with high average similarity of nodes, \(sim\), as calculated using Equation (3).

In every iteration, we perform merging on the top relational node group, \(\mathbf {g}\), in \(\mathbf {Q}\). For each node in \(\mathbf {g}\) in line 6, we check if the node is valid to be merged based on the temporal, \(\mathbf {T}\), and link constraints, \(\mathbf {L}\). This is where we make use of the link decisions made in previous iterations to validate the temporal and link constraints, as we detailed in the description of our global propagation of constraints (PROP-C) in Section 4.2. Previous link decisions may or may not have associated record clusters, \(\mathbf {m}_k \in \mathbf {M}\) (that represents an entity), to the records in a node. Based on these associated clusters, there can be three different cases. First, if both records in a node are associated with two different record clusters, we validate the node by applying constraints on every possible record pair between the two clusters. Second, if only one record is associated with a record cluster, we validate all possible records in the cluster against the record that does not have an associated record cluster. Third, if none of the records is associated with record clusters, we apply constraints between the original records to validate the node.

For each valid node, we propagate the attribute values as detailed in Section 4.1 (PROP-A) (line 7 in Algorithm 1) and find the set of atomic nodes with the highest attribute similarities corresponding to the node. With those atomic nodes, we then calculate the new similarity of the node in line 8 while setting its state to inactive as it has been processed (line 9). Any node that violates any constraints is removed from the group in line 11 and its state is updated to non-merge in line 12.

We then calculate the average similarity, \(sim_g\), of all nodes in line 14. If \(sim_g\) exceeds the merge threshold, \(t_m\), then in lines 16 to 18, we merge all nodes in \(\mathbf {g}\), add the records to the corresponding record cluster \(\mathbf {m}_k\), update the entity graph, \(\mathbf {G}_O\), with the updated record cluster, update the state of the merged nodes to merged, and continue to the next group in the queue \(\mathbf {Q}\). Otherwise, we remove the node with the lowest similarity from the group \(\mathbf {g}\) and check the possibility to merge the group until it is reduced to a pair (in lines 13 to 20) by adaptively leveraging the relationship structure (REL), as we described in Section 4.4.

After we have processed all the nodes in the priority queue, \(\mathbf {Q}\), we finally refine (REF) the entities in the entity graph, \(\mathbf {G}_O\), in line 21. In order to identify and remove the likely wrong links in the entities associated with the record clusters in \(\mathbf {M}\), we utilise the measures of graph bridges and graph density as we described in Section 4.5. Finally, in line 22, we return the generated entity graph, \(\mathbf {G}_O\). The reason why we generate an entity graph as the end result of the framework is that such a graph can capture all direct and indirect relationships between entities. Each entity represented by a node in \(\mathbf {G}_O\) by now consists of a set of records. This set of records allows us to infer all possible links to all related entities, which in turn enumerates all the indirect relationships between entities.

5.4 Complexity Analysis

If we assume \(\Psi (\cdot)\) is the worst case time complexity of the attribute value similarity calculation function, then the atomic node generation time complexity is \(O(|\mathbf {A}| \cdot \Psi (|\mathbf {v}_{a^{\prime }}|^2))\), where \(|\mathbf {v}_{a^{\prime }}|\) is the number of values of the attribute \(a^{\prime } \in \mathbf {A}\) with the largest number of unique attribute values. Assuming \(\mathbf {B}\) is the set of blocks after blocking records, then the relational node generation time complexity is \(O(|\mathbf {R}|^2/|\mathbf {B}|)\) [42] where \(\mathbf {R}\) is the set of records being processed. The bootstrapping, iterative merging, and entity graph generation steps all have a complexity of \(O(|\mathbf {N}_R|)\), assuming each \(n \in \mathbf {N}_R\) has a small number of neighbouring nodes. In the datasets used for our experimental evaluation, the average number of relationships per node was no more than 3.

6 Experimental Evaluation

We conduct an extensive set of experiments to address the following questions: (1) How does RELATER compare to other state-of-the-art ER baselines? (2) How does RELATER scale to large datasets? (3) How do parameter values affect linkage quality? (4) How does each key technique in our framework affect linkage quality?

6.1 Experimental Setup

We first describe the setup, we used for our evaluation including the evaluated datasets, baselines, and settings.

6.1.1 Datasets.

We evaluate RELATER on real datasets from different domains as detailed in Table 4. To resolve complex entities, we use three demographic datasets, where the goal is to link person records across birth, marriage, and death certificates; and one census dataset where the interest is to link person records across several census snapshots. Furthermore, we resolve basic entities in three publicly available datasets from the bibliographic and music domains to show that our framework can obtain high linkage quality for both complex and basic entities.

Table 4.

Dataset	Domain	Entity	Role	Interpretation (Links between)	Num. of Records		Record	True
		Type	Pair		Role-1	Role-2	Pairs	Matches
Isle of Skye	Demographic	Complex	Bp-Bp	Birth parents in birth certificates	34,272	34,272	436,518	83,132
(IOS) [45]			Bp-Dp	Parents in birth and death certificates	34,272	23,938	628,141	38,662
Kilmarnock	Demographic	Complex	Bp-Bp	Birth parents in birth certificates	74,948	74,948	1,571,991	135,346
(KIL) [45]			Bp-Dp	Parents in birth and death certificates	74,948	45,186	2,357,625	80,819
North Brabant	Demographic	Complex	Bp-Bp	Birth parents in birth certificates	906,710	906,710	11,012,512	N/A
(BHIC) [45]			Bp-Dp	Parents in birth and death certificates	906,710	1,436,217	861,041	N/A
IPUMS [47]	Census	Complex	F-F	Fathers in households	10,914	10,914	1,169,048	10,914
			M-M	Mothers in households	10,908	10,908	1,225,887	10,908
			C-C	Children in households	16,875	16,875	3,042,050	16,875
DBLP-ACM [13]	Bibliographic	Basic	P-P	Publications	2,616	2,294	38,998	2,220
DBLP-Scholar [13]	Bibliographic	Basic	P-P	Publications	2,616	64,263	772,911	5,347
Million Songs	Music	Basic	S-S	Songs	35,463	35,463	33,120	19,070
(MSD) [13]

Table 4. Characteristics of the Datasets Used in the Experiments

The demographic datasets include two proprietary datasets from Scotland [45], one from the remote IOS and the other from the town of Kilmarnock (KIL). Both contain civil certificates (birth, marriage, and death) of their population over the period from 1861 to 1901. A third dataset is from the publicly available Brabant Historical Information Center (BHIC) [4]. It contains civil certificates from North Brabant, a province of the Netherlands, in the period from 1759 to 1969. Demographers with expertise in linking such data have curated and linked the IOS and KIL datasets [45]. Their semi-automatic approach is heavily biased towards certain types of links, such as Bp-Bp (links between birth parents), as their interests were in identifying siblings of the same mother to facilitate the analysis of child mortality. Therefore, we show results of Bp-Bp for which we have directly curated ground truth links, along with the results of Bp-Dp (links between birth and death parents) for which we have inferred ground truth links (where in Table 4 p represents both mother, m, and father, f). We utilise the BHIC dataset for evaluating the scalability of RELATER since it is a significantly larger dataset compared to IOS and KIL. However, we cannot show the linkage quality of the BHIC dataset as we do not have ground truth links. As census data, we used the 1870 and 1880 census snapshots of families from the US census data (IPUMS) publicly available from IPUMS [47].

To evaluate RELATER for resolving basic entities, we selected datasets with different data characteristics and levels of difficulty to match records. We use a music dataset, Million Songs (MSD) [13], and two bibliographic datasets, DBLP-ACM [13], and DBLP-Scholar [13]. As DBLP-ACM consists of two well-structured datasets, it can be considered as a simple dataset to resolve [32]. However, DBLP-Scholar is more challenging because the publications in Google Scholar have many quality problems, such as misspellings and different representations of authors and venues [32].

6.1.2 Baselines.

To compare RELATER with existing (state-of-the-art) ER approaches, we selected five baselines where each of them represents a different ER approach. The first baseline, Attr-Sim, provides a basic pairwise similarity approach such as used with traditional pairwise linkage techniques [7]. Second, Dep-Graph is an implementation of the collective ER approach proposed by Dong et al. [14] that propagates link decisions in the ER process. To allow a fair comparison, we used the same dependency graph and the same set of temporal and link constraints we used to evaluate RELATER. Third, Rel-Cluster is an implementation of the collective ER approach proposed by Bhattacharya and Getoor [2] that employs ambiguity of attribute values in the ER process. In Rel-Cluster, we apply the same set of temporal and link constraints applied to RELATER for a fair comparison. Fourth, ZeroER [50] is a recent unsupervised approach that employs generative modelling for learning match and non-match distributions to resolve entities. Finally, Magellan is a state-of-the-art supervised ER system available as an open-source library [31]. As training data, we used the record pairs generated in the blocking step as we will describe in Section 6.1.3. For our experiments, we selected four classifiers from Magellan (a support vector machine, a random forest, a logistic regression, and a decision tree) and averaged their linkage quality results.

6.1.3 Settings.

We implemented our framework and baselines in Python 2.7 except for Magellan, which is implemented in Python 3.6 and conducted all experiments on a server running Ubuntu 18.04 with 64-bit Intel Xeon 2.10 GHz CPUs, and 512 GBytes of memory. The code of our framework is available in an online repository² to facilitate repeatability.

For all baselines and RELATER, in the blocking step, we grouped potential matches by employing a locality sensitive hashing based indexing technique that maps records with similar attribute pairs to the same block [42]. In the record pair comparison step, we then employed similarity functions such as the Jaro-Winkler similarity for names and the Jaccard similarity for other textual strings [7] to compare attribute values between records. For numerical comparisons, we used the maximum absolute difference [7], and for comparing addresses in the IOS dataset, we geocoded address strings [30] and calculated similarities based on the distance between two locations. However, due to the absence or low quality of addresses, we did not consider geocoding for the other datasets.

For RELATER and all unsupervised baselines, we use the same set of attributes for Must, Core, and Extra attributes (as shown in Table 5) for calculating the attribute similarity for a fair comparison. In the presence and absence of the Extra attributes, we set \(w_M, w_C, w_E\) to \(0.6, 0.2, 0.2\) and 0.7,0.3,0.0, respectively. As there can be several Extra attributes, \(w_E\) is always lower than \(w_M\) for a single attribute.

Table 5.

Dataset	Must Attributes	Core Attributes	Extra Attributes
Isle of Skye	First name, Surname	Address, Parish	Occupation, Birth Year, Birth Place,
(IOS) [45]			Marriage Year, Marriage Place
Kilmarnock	First name, Surname	Address, Parish	Occupation, Birth Year, Birth Place,
(KIL) [45]			Marriage Year, Marriage Place
North Brabant	–	First name, Surname	Birth Year, Birth Place, Marriage Year,
(BHIC) [45]			Marriage Place
IPUMS [47]	First name, Race, Birth Place	Surname, Birth Year	Occupation, State, City
DBLP-ACM [13]	Publication Name	Publication Year	Publication Venue
DBLP-Scholar [13]	Publication Name	–	Publication Year
Million Songs	Title	Artist, Year	Length, Album
(MSD) [13]

Table 5. Attribute Categorisation of the Datasets used in the Experiments

We set the default merging threshold as \(t_m = 0.85\), the atomic node similarity threshold as \(t_a = 0.9\), the weighting distribution in similarity scores as \(\gamma = 0.6\), the graph measures \(t_n =15\) (bridges), and \(t_d=30\%\) (density), based on the parameter sensitivity analysis in Section 6.4. We do not show results varying \(t_d\) as it has a fairly small influence on precision and recall with regard to different thresholds as the record clusters we are obtaining are not very big.

6.2 Linkage Quality Evaluation

We now compare the linkage quality of RELATER with the aforementioned baselines with respect to three different evaluation measures: precision (P), recall (R), and the F\(^*\) measure [24]. To describe what each evaluation measure represents, we consider \(TP\), \(FP\), and \(FN\) as the number of true matches, false matches, and false non-matches, respectively [7]. Precision is the number of true matches against the total number of matches generated by a particular method, \(P = TP/(TP + FP)\), while recall is the number of true matches against the total number of matches in the linked ground truth data, \(R = TP/(TP + FN)\) [23]. We do not use the F-measure for evaluation as recent research has found that it is not suitable for measuring linkage quality in ER [23] because the relative importance given to precision and recall in the F-measure depends on the number of predicted matches. Therefore, we use a more interpretable alternative to the F-measure, the \(F^*\)-measure [24], calculated as \(F^* = TP/(TP + FP + FN)\), which corresponds to the number of true matches against the number of matches which are either misclassified or are correctly classified.

Table 6 provides the precision, recall, and \(F^*\) results for RELATER and the baselines evaluated. Based on the average results, we can see that RELATER outperforms all baselines. We first discuss how RELATER behaves with regard to resolving complex and basic entities. The databases with complex entities involve several types of role pairs to resolve entities. Since we do not have a complete set of ground truth links for the IOS and KIL datasets, we only show role pairs for which we have manually curated or inferred ground truth links [45]. For both these datasets, RELATER obtains both high precision and recall values for the role pair Bp-Bp, whereas for Bp-Dp we can see a drop in precision and recall. This is to be expected as we have an incomplete (inferred or biased) set of ground truth links for Bp-Dp [45]. In the IPUMS dataset, the F-F (father to father) and M-M (mother to mother) role pairs have high precision, recall, and \(F^*\) results whereas we can see a small drop for the C-C (children to children) role pair because the set of ground truth links from IPUMS are more focused on linking couples than children [47]. In the context of resolving basic entities, RELATER provides high precision and recall results for both DBLP-ACM and MSD. We can see that the DBLP-Scholar dataset is challenging to resolve because for all the baselines there is a drop in linkage quality. However, we can see that RELATER outperforms all the other unsupervised baselines even for the challenging DBLP-Scholar dataset.

Table 6.

Dataset	Role Pair		RELATER	Attr-Sim	Dep-Graph	Rel-Cluster	ZeroER	Magellan
		P	98.73	63.67	90.87	93.59	60.53	77.9 \(\pm\) 33.4
IOS	Bp-Bp	R	94.70	88.41	65.26	63.72	70.75	72.9 \(\pm\) 35.1
		F\(^*\)	93.56	58.76	61.25	61.06	48.41	60.4 \(\pm\) 38.6
		P	86.44	43.05	0.00	80.91	58.37	67.8 \(\pm\) 37.9
IOS	Bp-Dp	R	92.87	72.32	0.00	49.19	14.02	62.2 \(\pm\) 41.4
		F\(^*\)	81.06	36.96	0.00	44.07	12.74	46.1 \(\pm\) 40.4
		P	97.81	30.26	54.81	71.81	79.45	69.6 \(\pm\) 40.1
KIL	Bp-Bp	R	89.52	89.13	74.93	71.92	90.82	62.7 \(\pm\) 46.7
		F\(^*\)	87.76	29.18	46.32	56.09	73.54	51.6 \(\pm\) 45.9
		P	74.36	11.05	28.95	30.35	45.71	63.9 \(\pm\) 36.1
KIL	Bp-Dp	R	89.57	90.49	70.69	43.18	15.67	61.8 \(\pm\) 44.1
		F\(^*\)	68.44	10.93	25.85	21.69	13.21	45.6 \(\pm\) 39.4
		P	100.0	99.86	99.98	95.70	99.99	84.0 \(\pm\) 32.6
IPUMS	F-F	R	96.33	63.84	76.86	60.58	71.17	84.5 \(\pm\) 32.7
		F\(^*\)	96.33	63.78	76.85	58.98	71.16	81.1 \(\pm\) 32.0
		P	100.0	99.85	99.96	93.68	99.97	80.0 \(\pm\) 35.1
IPUMS	M-M	R	95.88	60.17	70.98	57.86	71.19	76.8 \(\pm\) 38.8
		F\(^*\)	95.88	60.11	70.97	55.68	71.17	74.2 \(\pm\) 37.8
		P	89.68	99.60	99.33	77.55	99.96	81.9 \(\pm\) 32.2
IPUMS	C-C	R	93.89	58.16	77.16	50.18	90.09	72.2 \(\pm\) 39.6
		F\(^*\)	84.73	58.02	76.76	43.81	90.06	68.1 \(\pm\) 37.8
		P	98.94	71.90	98.89	81.04	99.45	96.8 \(\pm\) 00.9
DBLP-ACM	P-P	R	96.49	96.71	96.67	96.44	98.60	97.8 \(\pm\) 01.6
		F\(^*\)	95.50	70.19	95.63	78.68	98.07	94.7 \(\pm\) 02.2
		P	77.89	54.65	69.71	78.54	98.47	88.0 \(\pm\) 03.3
DBLP-Scholar	P-P	R	80.10	79.60	80.94	49.41	44.72	87.5 \(\pm\) 04.0
		F\(^*\)	65.26	47.93	59.88	43.53	44.41	78.1 \(\pm\) 03.9
		P	99.99	99.49	99.99	92.97	99.93	99.5 \(\pm\) 00.3
MSD	S-S	R	99.26	99.77	95.20	99.79	91.81	98.2 \(\pm\) 02.2
		F\(^*\)	99.24	99.26	95.19	92.79	91.75	97.7 \(\pm\) 02.4
		P	92.4 \(\pm\) 9.3	67.3 \(\pm\) 30.9	74.2 \(\pm\) 33.9	79.6 \(\pm\) 18.2	84.2 \(\pm\) 21.5	80.9 \(\pm\) 11.2
Averages		R	92.9 \(\pm\) 5.1	79.9 \(\pm\) 14.6	70.9 \(\pm\) 25.5	64.2 \(\pm\) 18.7	65.9 \(\pm\) 31.1	77.7 \(\pm\) 13.2
		F\(^*\)	86.8 \(\pm\) 11.4	53.5 \(\pm\) 23.0	60.9 \(\pm\) 28.5	55.6 \(\pm\) 18.8	61.5 \(\pm\) 30.9	69.8 \(\pm\) 17.8

Table 6. Precision (P), Recall (R), and F\(^*\) Measure Results of RELATER Compared to the Baselines (Averages \(\pm\) Standard Deviations)

Best results in each row are shown in bold font.

The results of the Attr-Sim baseline are not showing acceptable linkage quality in any of the datasets with complex entities. This indicates that traditional pairwise linkage approaches are insufficient for linking databases with complex entities because these approaches do not address the challenges associated with resolving complex entities. With respect to the datasets with basic entities, good linkage quality can be achieved when resolving easy datasets such as MSD even for the Attr-Sim baseline. For this dataset, the recall and F\(^*\) measure have a slight improvement when compared with RELATER. However, for more challenging datasets with basic entities, such as DBLP-Scholar, the Attr-Sim baseline does not provide good results.

Dep-Graph [14] and Rel-Cluster [2] are two unsupervised collective ER baselines. Although they exploit relationship information to resolve entities, they have poor performance compared to RELATER when resolving complex entities. The Dep-Graph baseline addresses the problems of changing attribute values and different relationships by propagating attribute values and constraints. However, as it does not address the problems of disambiguation, partial match groups, or incorrect links that RELATER addresses, for some role pairs (such as IOS Bp-Dp) Dep-Graph cannot identify any true matches. Similarly, the drop in results in the Rel-Cluster baseline is because it only addresses the disambiguation problem. However, as we can see both of these baselines perform better in resolving basic entities, the type of entities these two techniques were developed for. Dep-Graph achieves slight improvements in recall and F\(^*\) results for the DBLP-ACM dataset pair. However, for all other datasets, RELATER performs better than Dep-Graph.

ZeroER [50] is a recently proposed unsupervised ER baseline that exploits the bi-modal nature of ER problems to resolve entities. Based on the observation that the similarity vectors for matches are different from those of non-matches, ZeroER employs generative models to learn the match and non-match distributions. Therefore, when the features are well separated in simple basic entities such as the DBLP-ACM dataset, ZeroER achieves the best results compared to all other baselines and RELATER. However, when the datasets become challenging (even for basic entities), such as DBLP-Scholar and MSD, we can see a drop of linkage quality results due to the absence of well separated features for the match and non-match classes. Interestingly, none of the datasets with complex entities achieve acceptable linkage quality with ZeroER compared to RELATER, because ZeroER is unable to address the challenges associated with complex entities such as changing attribute values and relationships, ambiguity, and the partial match group problem.

The precision, recall, and F\(^*\) values for Magellan [31] are presented as averages with standard deviations because we use four different classifiers and two different settings to generate the training and testing datasets. Since datasets with complex entities have different role pair types, we trained Magellan in two different settings, whereas in the first we trained it only on record pairs of the specific role pair that is being tested, and in the second we trained it on the full dataset. As most of the datasets with complex entities have incomplete ground truth data, in practical scenarios one likely will have to train on record pairs of all role pair types, for which Magellan obtains fairly poor results. However, as expected in the first setting Magellan provides better results compared to RELATER because Magellan is a supervised learning approach. For simple datasets with basic entities, such as DBLP-ACM and MSD, we can see that RELATER outperforms Magellan. However, for challenging datasets such as DBLP-Scholar, Magellan provides better results because it is a supervised approach.

6.3 Scalability

Table 7 shows the runtimes of RELATER compared with the baselines. Attr-Sim shows the best runtimes for most of the datasets because it simply links records without considering any relationships. The next best runtimes alternate between RELATER and Dep-Graph [14]. For datasets with complex entities RELATER takes more time compared to Dep-Graph because it addresses all problems specified in Section 3, whereas Dep-Graph addresses only the problems of changing attribute values and different relationships. However, for datasets with basic entities, RELATER runs faster than Dep-Graph because basic entities do not have most of the challenges complex entities have. Rel-Cluster has higher runtimes compared to both RELATER and Dep-Graph because of the iterative clustering method employed. ZeroER shows worse runtimes compared to all other unsupervised baselines because it involves a time consuming feature generation process. The worst performing baseline is Magellan (these runtimes are averages for the four supervised classifiers and two different settings we described in Section 6.2) as it consumes much time for training the supervised classification models. Overall, the runtimes of RELATER are comparatively better than the other baselines.

Table 7.

Dataset	\|\(\mathbf {N^A}\)\|	\|\(\mathbf {N^R}\)\|	RELATER	Attr-Sim	Dep-Graph	Rel-Cluster	ZeroER	Magellan
IOS	74,851	2,992,834	372	50	176	358	9,355	10,059
KIL	1,565,730	11,190,176	1,945	178	1,207	7,663	8,868	9,632
IPUMS	72,974	5,436,985	386	97	509	53,928	363	1,297
DBLP-ACM	43,726	52,200	3	3	6	20	254	15
DBLP-Scholar	1,259,501	1,093,444	152	70	172	3,837	6,063	30
MSD	70,136	33,120	3	1	4	105	88	86

Table 7. Runtime Results (in Seconds) for RELATER and Baselines

Best results in each row are shown in bold font.

Next, we evaluate the scalability by comparing the runtimes of RELATER on different sized datasets. For that, we vary the time periods of records considered for generating the graph with the BHIC dataset. Table 8 provides an overview of runtimes for the atomic node and relational node generation, bootstrapping, iterative merging, and entity graph generation steps of RELATER. These runtimes indicate that the iterative merging and entity graph generation step accounts for the largest component of the overall runtime because it is the most time consuming step that involves most of the key techniques. We use total linkage time to measure the scalability of our framework. Considering the values of linkage times per node and per edge, we can see that our proposed framework has a near linear scalability with both, which indicates that RELATER can scale to large graphs.

Table 8.

Time	Number of	Number of	Generate	Generate	Bootstrap	Iterative Merging and	Linkage	Linkage
Period	Nodes	Edges	\(\mathbf {N}_A\)	\(\mathbf {N}_R\)	time	Entity Graph	time (ms)	time (ms)
			time (s)	time (s)	(s)	Generation time (s)	per node	per edge
1900–1935	22,928,967	41,121,771	20,642	1,438	896	23,155	1.0	0.6
1890–1935	42,398,382	80,524,946	28,881	2,172	1,685	113,143	2.7	1.4
1880–1935	68,739,033	134,057,215	36,033	3,910	3,013	299,123	4.4	2.3
1870–1935	100,907,697	199,588,456	39,113	6,062	5,423	660,896	6.6	3.3

Table 8. Runtimes of RELATER for Different Graph Sizes of the BHIC Dataset Generated for Different Time Periods

Linkage time is the total of bootstrapping, merging, inferring, and refining using the default settings described in Section 6.

6.4 Parameter Sensitivity Analysis

We now show how RELATER is robust to the various parameter values used for \(t_a\), \(t_m\), \(\gamma\), and \(t_n\) using linkage quality results of two datasets with complex entities and one dataset with basic entities. We vary each parameter while keeping the other parameters at their default values, as we discuss below.

In Figure 8, we vary \(t_a\) in the range of [0.8, 0.95] to explore the sensitivity of \(t_a\). For all datasets, we can see that \(t_a\) is robust in the range of [0.8, 0.9] but results drop as we increase \(t_a\) further. A high \(t_a\) value keeps only highly similar atomic nodes in the initial graph, resulting in a graph that does not include many of the ground truth links. Lower \(t_a\) values add more atomic nodes to \(\mathbf {G}_D\) resulting in a larger graph. Therefore, we set \(t_a =0.9\) as the default choice that works well for all datasets in our experiments.

Fig. 8.

We then vary the merging threshold, \(t_m\). For lower values of \(t_m\) we can see that precision drops for the IOS dataset because many false matches are being linked. However, in the other datasets, we cannot see a drop in linkage quality results with lower \(t_m\) values because the blocking technique we used already provides an initial graph of high precision. We can see the precision, recall, and F\(^*\) results are robust in the [0.8, 0.85] range for all datasets. When we further increase \(t_m\), recall drops because we miss many true matches. Therefore, without a loss of generality, we set \(t_m = 0.85\) as the default value that works well for all datasets.

Next, we discuss the impact on the linkage quality results of the value of \(\gamma\) that defines the weight distribution of similarity components in Equation (3). When \(\gamma\) is lower a higher weight is assigned to the disambiguation similarity \(sim_d\) and unambiguous record pairs that refer to different entities can get linked. Therefore, we can see a drop in results at lower \(\gamma\) values. When we increase \(\gamma\) recall drops because a higher weight is given to atomic similarity and disambiguation is ignored. When we do not disambiguate (\(\gamma = 1.0\)), recall drops because ambiguous record pairs with high similarity get linked and these links along with link constraints restrict true matches with lower similarity getting linked. However, for the DBLP-ACM dataset, we cannot see a drop in recall when we do not involve disambiguation similarity because the attribute values of basic entities are not as ambiguous as complex entities, as we showed in Table 1. We, therefore, set \(\gamma = 0.6\) as this value provides a good balance between precision and recall for all datasets.

As we show in Figure 8, we can see that RELATER is fairly robust to \(t_n\), the threshold for the minimum number of nodes in a cluster to split by bridges in the range of [10,20] for the IOS Bp-Bp dataset. If we further increase \(t_n\) there is a small drop in precision and F\(^*\) because wrong links are not removed from small clusters for high \(t_n\) values. This drop is fairly small due to the small clusters generated for this dataset. For datasets with larger cluster sizes the effect of \(t_n\) will be more significant. In other datasets, we cannot see any effect of \(t_n\) because the clusters have at most two records because we link only two datasets. We set \(t_n=15\) as this value works well with all datasets.

6.5 Ablation Analysis

Table 9 shows the contributions of each key technique in the RELATER framework for different role pairs in two datasets with complex entities. In Section 4, we described the key techniques that we use to address the challenges related to complex entities, including PROP-A, PROP-C, AMB, REL, and REF. As both PROP-A and PROP-C propagate link decisions through the ER process, we consider them as a single component in this analysis, and show results without PROP-A and PROP-C, without AMB, without REL, and without REF, separately.

Table 9.

Dataset	Role Pair		RELATER	without PROP-A and PROP-C	without AMB	without REL	without REF
		P	98.73	86.79	99.22	99.88	98.02
IOS	Bp-Bp	R	94.70	95.20	93.56	61.58	94.87
		F\(^*\)	93.56	83.15	92.89	61.53	93.08
		P	86.44	72.56	89.72	0.00	85.28
IOS	Bp-Dp	R	92.87	93.24	88.62	0.00	93.14
		F\(^*\)	81.06	68.93	80.45	0.00	80.24
		P	100.0	92.75	100.0	0.00	100.0
IPUMS	F-F	R	96.33	96.33	87.16	0.00	96.33
		F\(^*\)	96.33	89.59	87.16	0.00	96.33
		P	100.0	92.19	100.0	0.00	100.0
IPUMS	M-M	R	95.88	95.89	86.77	0.00	95.88
		F\(^*\)	95.88	88.69	86.77	0.00	95.88
		P	89.68	47.96	90.75	0.00	89.68
IPUMS	C-C	R	93.89	93.93	86.71	0.00	93.89
		F\(^*\)	84.73	46.52	79.67	0.00	84.73

Table 9. Ablation Analysis for RELATER that Shows How Each Key Component in the Framework Affects Linkage Quality

Best results in each row are shown in bold font.

When we remove PROP-A and PROP-C from the RELATER framework, then we neither propagate negative nor positive evidence throughout the ER process. We, therefore, do not consider attribute value changes and we do not enforce any constraints while linking records. We can see that for all datasets precision is reduced along with a drop of \(F^*\) of up to 41% because we link record pairs that refer to different entities in the absence of constraints. Similarly, as we do not propagate attribute value changes, record pairs that are true links do not get linked due to lower similarity.

We incorporated ambiguity in RELATER by including a disambiguation component (AMB) in the similarity calculation, as we discussed in Section 4. Without AMB corresponds to RELATER when similarities are solely calculated based on attribute similarities by ignoring the disambiguation similarity. We can see a drop in recall when disambiguation similarity is not involved. This is because ambiguous record pairs with high attribute similarity get linked wrongly, and the enforced constraints prevent correct record pairs from being linked.

Next, we remove the adaptive leveraging of relationship structure (REL). Interestingly, except for the IOS Bp-Bp dataset, we can see all other datasets provide zero results for all linkage quality measures. For example, with the IPUMS dataset, we know that most families have siblings, and therefore almost every group that we consider to link in RELATER is a partial match group as we defined in Section 3. As a result none of the correct record pairs gets linked.

Finally, when we remove dynamic refining of record clusters from RELATER (without REF), we can see the precision drops for both role pairs in the IOS datasets. The reason for this is that the wrong links are removed when refining the record clusters. The improvement of linkage results is small here because we have small clusters in the IOS dataset. However, we cannot see any differences in results for the IPUMS dataset because the cluster sizes in this dataset are limited to two as we link only two census snapshots. Therefore those clusters cannot be refined.

An important aspect to notice in the ablation study is that whenever we remove a key technique from RELATER the overall results always drops. In some scenarios precision increases compromising recall and F\(^*\) while in other scenarios recall increases compromising precision and F\(^*\). However, in none of the scenarios the overall linkage quality results (as indicated by F\(^*\)) are higher than RELATER with all its key techniques included.

7 Related Work

Various ER approaches have been developed since the 1950s to link records in databases [7, 21, 26, 41]. Most recent ER methods are based on supervised learning and deep learning approaches, and the majority of recent works that aim at overcoming the lack of ground truth data are using active learning or transfer learning. We now describe ER approaches related to ours that exploit the relationships among entities to achieve high linkage quality.

On et al. [40] used relationship information for group linkage using weighted bipartite matching. However, because groups are linked independently from each other, this approach does not propagate relationship information. Fu et al. [19] pioneered the use of group linkage for person records by linking individuals within households in census data (only considering relationships within households), thereby substantially reducing the number of ambiguous links.

In contrast to pairwise classification based approaches, graph-based collective ER approaches provide more accurate results by exploiting relational information [21]. Kalashnikov et al. [28] proposed an approach for reference disambiguation based on random walks, aiming to identify the entity to which each record refers to Bhattacharya and Getoor [2] also used relational information between different types of entities by employing an iterative cluster merging process using a relationship graph. However, these approaches focused on basic entities that have static attribute values and static relationships, and they have mostly been evaluated on bibliographic data. In our work, we address the problems associated with complex entities which have changing attribute values and diverse relationships at different points in time. Kouki et al. [33] proposed a collective ER approach for building familial networks based on probabilistic soft logic. Although the predicates in their probabilistic soft logic capture relationships, they do not capture diverse relationships encountered at different points in time. Similarly, they do not capture attribute values that change over time.

Dong et al. [14] proposed a dependency graph-based approach to propagate link decisions among multiple types of entities through the linkage process. We consider their approach as a baseline (named Dep-Graph) because they also propagate link decisions to capture changing attribute values and apply constraints. However, as we showed in the experimental evaluation, their approach is not successful in addressing the problems associated with complex entities, such as the disambiguation problem, the partial match group problem, or the incorrect link problem.

The ambiguity of attribute values in ER has been discussed since the development of probabilistic record linkage by Fellegi and Sunter [16] in 1969. Li et al. [36] discussed the problem of ambiguity in entities that occur in unstructured textual documents. In their approach, Kalashnikov et al. [28] employed relationship analysis to enhance feature-based similarities between ambiguous reference entity choices. This approach is applicable when the set of entities are known prior to linking, and the task is to match records to entities. In our context, however, the set of entities is initially unknown. The approach developed by Bhattacharya and Getoor [2] incorporated ambiguity in neighbours into the calculation of relational similarities. As this is the closest approach to ours, we consider this as a baseline (named Rel-Cluster). However, as our experiments indicate, we can see that this approach is not providing good linkage results because besides the disambiguation problem it does not consider the other challenges our framework addresses.

Recently efforts have been made to identify ER errors using graph theory measures [12, 44]. However, none of them has been proposed in the context of collective ER. We utilise simple graph theory measures such as bridges and density [44] in our RELATER framework. The clusters of records we obtain are small and, therefore, more sophisticated “repair” operations such as those proposed by Croset et al. [12] cannot be applied.

In the context of ER, several approaches have been proposed to incorporate temporal constraints into the linkage process. Li et al. [35] and Chiang et al. [6] were the first to use temporal information for improved supervised pair-wise record pair classification in the bibliographic domain. More recent work on ER for person records provides strong evidence for the improvement of linkage quality when temporal constraints are applied [11, 38]. Although they address the problem of changing attribute values over time, none of these temporal record linkage solutions [5, 9, 27, 34] address the challenges associated with complex entities, such as that different relationships can occur at different points in time, or the disambiguation or partial match group problems.

A growing body of research has studied supervised techniques in the context of ER. Magellan is one such framework that supports an end-to-end ER pipeline with supervised techniques [31]. Recently, deep learning techniques have also been proposed that provide good linkage results [37, 39]. However, as we discussed in the experiments in Section 6, databases with complex entities generally suffer from a lack of ground truth links that makes it challenging if not impossible to use supervised techniques. Similarly, semi-supervised ER techniques including active learning approaches [29, 43] that query external sources to resolve challenging training cases, as well as crowd-based approaches [1, 22, 49] that employ hybrid machine and human-based systems for resolving entities are challenging with person data due to privacy and confidentiality issues[10]. Furthermore, as transfer learning ER approaches such as [53] use pre-trained models, it is questionable how to incorporate temporal and relational aspects into the linkage process.

Recent advancements in the ER literature have influenced unsupervised approaches towards self-supervised learning methods. One recently proposed state-of-the-art unsupervised ER approach is ZeroER [50], which employs generative modelling to learn match and non-match distributions to resolve entities. However, as we show in the experimental evaluation, ZeroER performs well only when the features representing similarities are well separated, such as in datasets that contain basic entities. When the datasets have complex entities, ZeroER does not perform well as it is unable to distinguish match and non-match distributions due to the challenges associated with complex entities.

The problem of graph-based ER is related to the graph alignment or graph matching problem [25, 51, 52], where the aim is to identify nodes that correspond to the same entity in two graphs. Similarly, ER is also related to link mining [20], which is a research area that focuses on classification, clustering, prediction, and modelling of links in graphs. However, these techniques are not suitable for resolving complex entities because they do not address the challenges in resolving complex entities, including the propagation of link decisions or the disambiguation problem.

8 Conclusion and Future Work

We have presented a novel unsupervised graph-based ER framework for resolving entities in datasets that contain complex (as well as basic) entities. Our framework, RELATER, addresses five challenges in resolving complex entities. First, we propagate positive evidence through the ER process to account for the attribute values of entities that change over time. Second, we consider diverse relationships encountered by an entity at different points in time by propagating negative evidence such as temporal and link constraints throughout the ER process. RELATER achieves an average improvement of 18% precision and 22% recall over the ER approach of Dong et al. [14] that propagates link decisions locally. Third, we address the ambiguities of attribute values by introducing a disambiguation similarity. Our framework achieves an average improvement of 13% precision and 29% recall over the ER method proposed by Bhattacharya and Getoor [2], which also leverages ambiguity. Fourth, the novel technique we propose to leverage the relationship structure is the first to link partial match groups. Fifth, we are the first to address the problem of likely wrong links in the context of collective ER by dynamically refining record clusters.

We show that our framework outperforms several state-of-the-art ER baselines on seven datasets from different domains, where it is one the most efficient ER methods among the compared baselines. Moreover, we show that our framework is robust to parameter settings and each key component substantially contributes to improving linkage quality. While in our work, we considered demographic and census datasets, for other types of person data, such as publication records containing author affiliations, a user will need to define suitable parameter settings for RELATER, including any temporal and linkage constraints, and categorise attributes into must, core, and extra. These settings can be defined through domain knowledge.

In future work, we plan to improve the scalability of RELATER by developing parallel versions of our framework, explore how to use other graph measures to identify any likely wrong links in record clusters, and examine how we can utilise transfer learning [53] when using existing biased ground truth datasets.

Footnotes

As we described formally in Section 3, throughout this article, we refer to a set of matched records that supposedly correspond to the same entity as a cluster of records while we name a set of records that are relationally connected as a group of records.

https://github.com/nishadi/RELATER.

References

[1]

Asma Abboura, Soror Sahrl, Mourad Ouziri, and Salima Benbernou. 2015. CrowdMD: Crowdsourcing-based approach for deduplication. In Proceedings of the International Conference on Big Data. IEEE, 2621–2627.

Abstract

1 Introduction

2 Motivating Example

3 Problem Definition and Overview

4 Key Techniques

4.1 Global Propagation of Attribute Values (PROP-A)

4.2 Global Propagation of Constraints (PROP-C)

4.3 Leverage Ambiguity of Attribute Values (AMB)

4.4 Adaptive Leveraging of Relationship Structure (REL)

4.5 Dynamic Refining of Record Clusters (REF)

5 Entity Resolution with RELATER

5.1 Dependency Graph Generation

5.2 Bootstrapping

5.3 Iterative Merging and Entity Graph Generation

5.4 Complexity Analysis

6 Experimental Evaluation

6.1 Experimental Setup

6.1.1 Datasets.

6.1.2 Baselines.

6.1.3 Settings.

6.2 Linkage Quality Evaluation

6.3 Scalability

6.4 Parameter Sensitivity Analysis

6.5 Ablation Analysis

7 Related Work

8 Conclusion and Future Work

Footnotes

References

Cited By

Index Terms

Recommendations

Collective entity resolution in relational data

Linking Temporal Records for Profiling Entities

Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

HTML Format

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations