LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Abstract

Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular software such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.

1 Introduction

Linking records across noisy datasets is central to many analyses in social science, business, and government. A recent literature, focused on matching across e-commerce datasets, shows the promise of transformer large language models (LLMs) for improving record linkage (alternatively termed entity resolution or approximate dictionary matching). Yet these methods have not yet made widespread inroads, with rule-based methods continuing to overwhelmingly predominate in social science and government applications (e.g., see reviews by Binette and Steorts (2022); Abramitzky et al. (2021)). In particular, users commonly employ string-based matching tools available in statistical software packages such as R or Stata.

We suspect that record linkage with large language models has not made further inroads at least in part due to the lack of packages that match the ease of the popular string matching packages, which are intuitive, extensible, and easy to use. They require little coding expertise and can easily be applied across different languages and settings. In contrast, existing tools for large language model matching require considerable technical expertise to implement. This makes sense in the context for which these models were developed - classifying and linking products for e-commerce firms, which employ data scientists - but it is a significant impediment for broader use.

To bridge the gap between the ease-of-use of widely employed string matching packages and the power of modern LLMs, we developed LinkTransformer, a general purpose, user-friendly package for record linkage with transformer LLMs. LinkTransformer treats record linkage as a text retrieval problem (see Figure 1). The API can be thought of as a drop-in replacement for popular dataframe manipulation frameworks like pandas or tools like R and Stata, catering to those who lack extensive exposure to coding.
To achieve its objective of democratizing access to the benefits of deep learning amongst those who may lack familiarity with deep learning frameworks, LinkTransformer integrates the following features:

1. An off-the-shelf toolkit for applying transformer models to record linkage and de-duplication with 4 lines of code
2. A rich repository of pre-trained semantic similarity models, supporting multiple languages, that underlies off-the-shelf usage
3. Easy integration of any language transformer model on Hugging Face or OpenAI
4. APIs to support related data processing tasks, e.g., aggregation, de-duplication, and translation-free cross-lingual linkage
5. Comprehensive tools for efficient model tuning to facilitate different levels of customization
6. Easy sharing of models, to promote reusability, reproducibility, and extensibility

The LinkTransformer model zoo currently contains English, Chinese, French, German, Japanese, Spanish, and multilingual pre-trained models. We initialize with semantic similarity models, e.g., Reimers and Gurevych (2019), which have desirable properties relative to using off-the-shelf embeddings from models like RoBERTa (see Section 2). We further tuned these models on a variety of linked datasets.

While transfer learning can facilitate strong off-the-shelf performance in many scenarios, supporting customization is important. Record linkage applications are extremely diverse in their languages, time periods, and domains, which vary significantly in how out-of-domain they are from the web corpora that underlying LLMs are trained on. Considerable heterogeneity - combined with settings that demand extremely high accuracy - creates many scenarios where custom training is useful.

We show that LinkTransformer performs well on challenging record linkage tasks. It is equally applicable to record linkage tasks with a single field - e.g., linking 1940s Mexican tariff product classes across time - and applications that require concatenating an array of noisily measured fields - e.g., linking 1950s Japanese firms across different large-scale, noisy databases using the firm name, location, products, shareholders, and banks. This type of linkage problem would be highly convoluted with traditional string matching methods, as there are many noisily measured fields of relevance (e.g., products can be described in different ways, different subsets of managers and shareholders are listed, etc.). Using LinkTransformer to automatically concatenate the information and feed it to an LLM handles these challenges with ease. A demo is available at https://www.youtube.com/watch?v=Sn47nmCvV9M. More resources are available on our package website https://linktransformer.github.io/.

LinkTransformer has a GNU General Public License. It is being actively maintained, and in the next release we will add support for vision-only and multimodal linkage models, including support to import and customize any timm model (Wightman, 2019). When OCR errors are rampant, vision-only or aligned vision-language transformer models can improve record linkage, relative to string matching or language-only transformer linking, e.g., (Yang et al., 2023; Arora et al., 2023).

The rest of the paper is organized as follows. Section 2 provides an overview of related work. The core LinkTransformer library, Model Zoo, customized model training, and model sharing are described in Section 3. Section 4 discusses various use cases. Section 5 discusses limitations.

Figure 1: Visualization. This figure shows the LinkTransformer architecture.

2 Relation to the Existing Literature

There is a large literature on record linkage spanning social science, statistics, and computer science. Record linkage serves as a prerequisite for many empirical analyses, which often require combining text data from multiple noisy sources. In social science, statistics, and government applications, large language models have made few inroads. A 2022 review of the record linkage literature in Science Advances (Binette and Steorts, 2022), entitled "(Almost) All of Entity Resolution", concludes that deep neural models are unlikely to be applicable to record linkage using structured data, arguing that training datasets are small and there is unlikely to be much gained from large language models since text fields are often short.
While there are undoubtedly linking tasks for which LLMs will not be of much use (discussed further in Section 5 on limitations), an extensive literature on e-commerce applications underscores their utility for linking structured datasets. Benchmarks in this literature (e.g., Köpcke et al. (2010); Das et al. (2015); Primpeli et al. (2019)) focus on applications such as matching electronics and software products between Amazon-Google and Walmart-Amazon listings, matching iTunes and Amazon music listings, and matching restaurants between Fodors and Zagats. The focus is on applications in English. Recent studies have used masked language models (e.g., BERT, DistilBERT, RoBERTa) (Li et al., 2020; Joshi et al., 2021; Brunner and Stockinger, 2020), GPT (Peeters and Bizer, 2023; Tang et al., 2022), or both, significantly outperforming static word embeddings and other older linkage methods. Zhou et al. (2022), like LinkTransformer, uses Sentence BERT (Reimers and Gurevych, 2019).

The main package in this space to our knowledge, Ditto (Li et al., 2020), implements the methods later published in Li et al. (2023). It requires significant programming expertise to deploy, appropriate for a literature focused on e-commerce - where data scientists predominate - but a hindrance for broader use.

Most of the literature examining record linkage with LLMs poses record linkage as a classification task, which is appropriate for the e-commerce benchmarks. However, this significantly limits extensibility, as in many social science and government applications the number of entities to be linked numbers in the millions, making it computationally infeasible to compute a softmax over all possible classes (entities).

LinkTransformer frames record linkage as a knn-retrieval task, in which the nearest neighbor for each entity in a query embedding dataset is retrieved from a key embedding dataset, using cosine similarity implemented with a FAISS backend (Johnson et al., 2019). LinkTransformer includes functionality to tune a no-match threshold - since not all entities in the query need to have a match in the key - and allows for retrieving multiple neighbors, to accommodate many-to-many matches between the query and the key. The LinkTransformer architecture was inspired by a variety of bi-encoder applications for unstructured texts, e.g., passage retrieval (Karpukhin et al., 2020), entity disambiguation (Wu et al., 2019), and entity co-reference resolution (Hsu and Horwood, 2022).
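To make the retrieval framing concrete, the following minimal sketch reproduces it outside the package with an off-the-shelf sentence-transformers model: query and key records are embedded, cosine nearest neighbors are retrieved, and pairs below a tunable similarity threshold are treated as no-matches. The model name, example records, and the 0.7 threshold are purely illustrative, not package defaults.

    # Minimal sketch of knn-retrieval linkage with a no-match threshold.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    query_records = ["Intl. Business Machines", "Gooogle LLC"]  # noisy records to link
    key_records = ["International Business Machines Corporation", "Google LLC",
                   "Microsoft Corporation"]

    # Normalized embeddings so the inner product equals cosine similarity.
    q = model.encode(query_records, normalize_embeddings=True)
    k = model.encode(key_records, normalize_embeddings=True)

    sims = q @ k.T                           # cosine similarity matrix (queries x keys)
    top = np.argsort(-sims, axis=1)[:, :2]   # retrieve 2 nearest neighbors per query

    threshold = 0.7                          # tunable no-match threshold
    for i, record in enumerate(query_records):
        for j in top[i]:
            if sims[i, j] >= threshold:
                print(record, "->", key_records[j], round(float(sims[i, j]), 2))
            else:
                print(record, "-> no match")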
LinkTransformer departs from much of the literature (with the exception of Zhou et al. (2022)) in utilizing LLMs trained for semantic similarity for its pre-trained models.¹ A large literature shows that off-the-shelf LLMs such as BERT have anisotropic geometries (Ethayarajh, 2019): representations of low frequency words are pushed outwards on the hypersphere, the sparsity of low frequency words violates convexity, and the distance between embeddings is correlated with lexical similarity. This leads to poor performance when individual term representations from the transformer model are pooled to create a representation for longer texts - as is often necessary for record linkage - since pooling assumes convexity, and leads to poor alignment between semantically similar texts. Contrastive training for semantic similarity reduces anisotropy, improving alignment between semantically similar pairs and improving sentence embeddings (Wang and Isola, 2020; Reimers and Gurevych, 2019).

¹ Compared to Zhou et al. (2022), LinkTransformer uses a different loss function, supervised contrastive loss (Khosla et al., 2020). It is well-suited to record linkage as there are multiple positive pairs in the datasets we use for pre-training.
LinkTransformer builds closely upon Sentence BERT (Reimers and Gurevych, 2019), whose excellent semantic similarity library inspired many of the features in LinkTransformer.

The knn retrieval structure of LinkTransformer also supports noisy de-duplication, a closely related task that finds noisily duplicated observations within a dataset. LinkTransformer follows the methods developed in Silcock et al. (2023), who show that de-duplication using a contrastively trained bi-encoder significantly outperforms n-gram and locality-sensitive hashing methods and is highly scalable.

3 The LinkTransformer Library

3.1 Off-the-shelf Toolkit

At the core of LinkTransformer is an off-the-shelf toolkit that streamlines record linkage with transformer language models. The record linkage models enable using pre-trained or self-trained transformer models with just 4 lines of code. Any Hugging Face or OpenAI model can be used by configuring the model and openai_key arguments. This future-proofs the package, allowing it to take advantage of the open-source revolution that Hugging Face has pioneered. Here is an example of the core merge functionality, based on embeddings sourced from an external language model.

    import linktransformer as lt
    import pandas as pd

    # Load the two data frames to be linked
    df1 = pd.read_csv("df1.csv")
    df2 = pd.read_csv("df2.csv")

    df_matched = lt.merge(df2, df1, merge_type='1:m', on=["Varname"],
                          model="sentence-transformers/all-MiniLM-L6-v2",
                          openai_key=None)
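The same call can also point at OpenAI embeddings rather than a Hugging Face checkpoint. The sketch below assumes that an OpenAI embedding model name is passed through the same model argument once openai_key is supplied; the key string is a placeholder.

    # Hedged sketch: swapping in OpenAI embeddings instead of a Hugging Face model.
    df_matched = lt.merge(
        df2, df1,
        merge_type="1:m",
        on=["Varname"],
        model="text-embedding-ada-002",
        openai_key="YOUR_OPENAI_API_KEY",   # placeholder
    )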
LinkTransformer provides a wealth of pre-trained model weights, covering different languages and domains. It currently supports six languages (English, Chinese, French, German, Japanese, and Spanish) with both multilingual and monolingual models. Training datasets include:

1. A novel dataset of firm aliases that we compiled from Wikidata for 6 languages
2. United Nations economic classification schedules (International Standard Industrial Classification, Standard International Trade Classification, and Central Product Classification), which homogenize industry and product classifications across varying national systems. We include models trained on these for 3 of the 6 official languages of the UN.
3. A variety of e-commerce linking datasets (see Supplemental Materials)

We name these models with a semantic syntax: {org_name}/lt-{data}-{task}-{lang}. Each model has a detailed model card, with the appropriate tags for quick model discovery. Additionally, for the tasks we trained our models on, we provide a high-level interface to download the right model by task through a wrapper that retrieves the best model for a task chosen by the user.

LinkTransformer makes no compromise in scalability. All functions are vectorized wherever possible and the vector similarity search underlying knn retrieval is accelerated by a FAISS (Johnson et al., 2019) backend that can easily be extended to perform retrieval on GPUs on massive datasets. We also allow "blocking" - running knn-search only within "blocks" that can be defined by the blocking_vars argument.

Record linkage frequently requires matching databases on multiple noisily measured keys. LinkTransformer allows a list of as many variables as needed in the "on" argument. The merge keys specified by the on variable are serialized by concatenating them with a <SEP> token, which is based on the underlying tokenizer of the selected base language model. This ensures that the serialization takes advantage of a token already introduced in training and the process is agnostic to the choice of the model.
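As an illustration of linking on several noisy fields at once, the hedged sketch below passes a list of merge keys along with a blocking variable; the dataframe and column names are illustrative, and the pre-trained Japanese company model comes from the model zoo (Table S-2).

    # Link Japanese firm records on several noisy fields, blocking on prefecture.
    df_matched = lt.merge(
        df_firms_a, df_firms_b,              # illustrative dataframes
        merge_type="1:m",
        on=["CompanyName", "MajorProducts", "Shareholders", "Banks"],
        blocking_vars=["Prefecture"],        # knn search runs only within blocks
        model="dell-research-harvard/lt-wikidata-comp-ja",
    )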
Since we have designed the API around dataframes - due to their familiarity amongst users of R, Stata, or Excel - all common import/export formats are supported.

The accessible LinkTransformer API supports a plethora of other features that are an essential part of routine data analysis pipelines. These include:

Aggregation: Data processing often requires the aggregation of fine descriptions into coarser categories that are consistent across datasets/time or facilitate interpretation. This problem can be thought of as a merge between finer categories and coarser ones, where LinkTransformer classifies the finer categories by means of finding their nearest coarser neighbor(s). We provide a high-level API for this task, lt.aggregate_rows, with a similar syntax to the main record linkage API.
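A hedged sketch of the aggregation call follows; the argument names for the fine and coarse description columns are assumptions made for illustration, and the model name is a product-aggregation model from the model zoo (Table S-2). Consult the online documentation for the exact signature.

    # Map fine product descriptions to their nearest coarse category (sketch).
    df_agg = lt.aggregate_rows(
        df_fine, df_coarse,                  # illustrative dataframes
        left_on="ProductDescription",        # fine descriptions (assumed column name)
        right_on="CoarseCategory",           # coarse categories (assumed column name)
        model="dell-research-harvard/lt-un-data-fine-coarse-multi",
    )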
Deduplication: Text datasets can contain noisy duplicates. Popular libraries like dedupe (Gregg and Eder, 2022) only support deduplication using metrics that most closely resemble edit distance. LinkTransformer allows for semantic deduplication with a single, intuitive function call.

    df = pd.read_csv("df1.csv")
    df_dedup = lt.dedup_rows(df, on="CompanyName",
                             model="sentence-transformers/all-MiniLM-L6-v2",
                             cluster_params={'threshold': 0.7})

LinkTransformer de-duplication clusters embeddings under the hood, with embeddings in the same cluster classified as duplicates. LinkTransformer supports several clustering methods like SLINK, DBSCAN, HDBSCAN, and agglomerative clustering.

Cross-lingual linkage: Analyses spanning multiple countries often require cross-lingual linkage. This task typically requires machine translation followed by a merge. Edit distance metrics tend to do particularly poorly in this scenario, necessitating costly hand linking. LinkTransformer users can bypass translation by using multilingual transformer models.
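For instance, the hedged sketch below links Spanish and French company records directly with a multilingual model from the model zoo, with no intermediate translation step; the dataframe and column names are illustrative.

    # Translation-free cross-lingual linkage with a multilingual model (sketch).
    df_linked = lt.merge(
        df_spanish, df_french,               # illustrative dataframes in two languages
        merge_type="1:m",
        on=["CompanyName"],
        model="dell-research-harvard/lt-wikidata-comp-multi",
    )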
We provide helpful notebooks and tutorials on the LinkTransformer website https://linktransformer.github.io/ to outline the use of these functionalities, along with some toy datasets. We also have a tutorial to help those who are less familiar with language models select ones that best fit their use case. More detailed information and additional features that we cannot highlight in the interest of brevity can be found in the online LinkTransformer documentation, available on our public repo (https://github.com/dell-research-harvard/linktransformer).

3.2 Customized Model Training

Record linkage tasks are highly diverse. Hence, LinkTransformer also supports easy model training. Custom training can be initialized using the weights of any Hugging Face transformer model.

Training data are expected in a pandas data frame, removing entry barriers for the typical social science user who is familiar with other programming languages and statistical packages. A data frame can include only positive labeled examples (linked observations) as inputs, in which case the model is evaluated using an information retrieval evaluator that measures top-1 retrieval accuracy. Alternately, it can take a list of both positive and negative pairs, in which case the model is evaluated using a binary classification objective.

Only the most important arguments are exposed and the rest have reasonable defaults which can be tweaked by more advanced users. Additionally, LinkTransformer supports logging of a training run on Weights and Biases (Biewald, 2020).

    best_model_path = lt.train_model(
        model_path="hf-path-to-base-model",
        data="df1.csv",
        left_col_names=["left_var"],
        right_col_names=['right_var'],
        left_id_name=['left_id'],
        right_id_name=['right_id'],
        label_col_name=None,
        log_wandb=False,
        training_args={"num_epochs": 1}
    )

Default training expects positive pairs. A simple argument that specifies label_col_name switches the dataset format and model evaluation to adapt to positive and negative labels. To make this extensible to most record linkage use cases, the model can also be trained on a dataset of cluster ids and texts by simply specifying clus_id_col_name and clus_text_col_names.
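A hedged sketch of the cluster-formatted case follows: each row carries a cluster id and a text, and all texts sharing an id are treated as matches. The file and column names are illustrative, and the remaining arguments mirror the listing above.

    # Train from cluster-formatted data (sketch; names are illustrative).
    best_model_path = lt.train_model(
        model_path="sentence-transformers/all-MiniLM-L6-v2",
        data="aliases.csv",
        clus_id_col_name="cluster_id",       # id shared by all aliases of an entity
        clus_text_col_names=["alias"],       # text column(s) holding the aliases
        training_args={"num_epochs": 10},
    )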
LinkTransformer is sufficiently sample efficient that most models in the model zoo were trained with a student Google Colab account, an integral feature since the vast majority of potential users have constrained compute budgets.

3.3 User Contributions

LinkTransformer aims to promote the reusability and reproducibility of record linkage pipelines. End-users can upload their self-trained models to the LinkTransformer Hugging Face hub with a simple model.save_to_hub command. Whenever a model is saved, a model card is automatically generated that follows best practices outlined in Hugging Face's Model Card Guidebook. This process adds a pipeline-tag, supported language(s) (given the base model), and other tags to the model card's header that facilitate model discovery by other Hugging Face users. Moreover, the automatically generated card contains instructions on how to use the model for record linkage and model-specific architecture and training details.

4 Applications

The LLMs in the LinkTransformer model zoo outperform non-neural methods by a wide margin.
We examine applications from both modern web data and historical datasets that we digitized from hard copy historical firm and government records. These tasks are representative of applications in quantitative social science, industry, and government. The results are reported in the supplementary materials.

First, we evaluate in-domain performance on semantic similarity datasets used to pre-train LinkTransformer models, with the supplementary materials comparing the accuracy of Levenshtein edit distance matching (Levenshtein et al., 1966), semantic similarity models off-the-shelf, and LinkTransformer tuned models. The LinkTransformer and OpenAI models tend to outperform edit distance metrics by a wide margin in matching Wikidata firm aliases, with the top-1 retrieval accuracy tending to be 20-30 points higher for LinkTransformer linkage than for edit distance based matching. Cases that LinkTransformer gets wrong are often impossible even for a skilled human to resolve from the firm names alone, e.g., in cases where a firm is referred to by two completely disparate acronyms. Off-the-shelf semantic similarity models also tend to outperform string-matching methods, albeit not by as much of a margin as tuned models.

Second, we examine historical applications. The first setting is a concordance between products in two 1940s Mexican tariff schedules, published by the Mexican government and digitized from the hard copy documents (Secretaria de Economía de Mexico, 1948). Tariffs were applied at an extremely disaggregated product level and each of the many thousands of products in the tariff schedule is identified only by a text description, which can change each time the tariff schedule is updated. Around 2,000 products map to different descriptions across the schedules. While there are considerable debates on the role that trade policies have played in long-run development, empirical evidence is limited in part due to the considerable challenges in homogenizing extremely detailed historical tariff schedules across time, as crosswalks do not exist for most tariff schedule changes.

We link the tariff schedules using an off-the-shelf semantic similarity model, as well as a model tuned on the in-domain historical data and the recommended OpenAI embeddings (from the model text-embedding-ada-002). All transformer models widely outperform edit distance, with the fine-tuned LinkTransformer model (at no charge) and the purchased OpenAI embeddings producing similar, near-perfect results.

We also link firms across two different 1950s publications created by different Japanese credit bureaus (Jinji Koshinjo, 1954; Teikoku Koshinjo, 1957). One has around 7,000 firms and the other has around 70,000, including many small firms. Firm names can be written differently across publications and there are many duplicated or similar firm names. To make this task feasible, we concatenate information on the firm's name, prefecture, major products, shareholders, and banks. These variables contain OCR noise and the information included in each publication varies somewhat, e.g., in terms of which products are described, which shareholders are included, etc. This makes rule-based methods quite brittle. Again, neural models significantly outperform string matching methods, with OpenAI giving the best results. (Due to the cost of labeling, our training dataset may not be large enough to capture the benefits of tuning.)

Finally, we examine the various e-commerce and industry benchmarks that prevail in this literature. We used exactly the same training procedure for each benchmark, to avoid overfitting, which is often not the case in the literature. We have generally comparable performance, sometimes outperformed by other models (that could be integrated into LinkTransformer as well if on Hugging Face) and sometimes outperforming other models.

5 Limitations

LinkTransformer is built upon transformer language models, and hence will not be suitable for lower resource languages that lack pre-trained LLMs. LinkTransformer will also be less useful in contexts where little language understanding enters record linkage, e.g., when linking records solely using individual names. For these contexts, the next release of LinkTransformer will integrate vision-only transformer models, which the authors have reported on extensively elsewhere (Arora et al., 2023; Yang et al., 2023).
Furthermore, in settings where OCR errors are considerable, too much information may have been destroyed to successfully link entities using garbled texts. In this case, a multimodal framework (e.g., Arora et al. (2023)) that uses aligned language and vision models to incorporate the original image crops, or a matching framework that incorporates character visual similarity (Yang et al., 2023) - as OCR errors tend to confuse visually similar characters - may be required. This again will be incorporated into LinkTransformer.
LinkTransformer relies on backends that many end users - accustomed primarily to working with statistical packages like Stata or R - will not be familiar with. We recommend that users new to LLMs deploy the package using a cloud service optimized for deep learning, such as Colab, to avoid the need to resolve dependencies. To reduce startup costs, we will provide detailed tutorials for LinkTransformer installation, inference, and training on Colab.
Supplementary Materials
S-1 Training and other details
LinkTransformer models use AdamW as the optimizer with a linear schedule, 100% warm-up, and a maximum learning rate of 2e-6. We use a batch size of 64 for models trained with Wikidata (companies)
and UN data (products). For industry benchmarks, we used a batch size of 128. We trained the models for
150 epochs for industrial benchmarks and 100 epochs for UN/Wikidata/Historic applications. We used
Supervised Contrastive loss (Khosla et al., 2020) as the training objective with default hyperparameters.
The implementation for the loss was based on the implementation shared on the sentence-transformers
repository (Reimers and Gurevych, 2019).
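For reference, the following self-contained PyTorch sketch implements the supervised contrastive objective on L2-normalized embeddings. The temperature of 0.07 is the default from Khosla et al. (2020); this simplified version ignores the multi-view batch construction used there and is not the exact implementation used for training.

    # Reference sketch of supervised contrastive loss (Khosla et al., 2020).
    import torch
    import torch.nn.functional as F

    def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
        z = F.normalize(embeddings, dim=1)                   # unit-norm embeddings (N, d)
        sim = z @ z.T / temperature                          # scaled pairwise similarities
        self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float("-inf"))      # never contrast a record with itself
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        pos_counts = pos_mask.sum(dim=1).clamp(min=1)
        # Mean negative log-probability of the positives for each anchor.
        pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
        loss_per_anchor = -pos_log_prob.sum(dim=1) / pos_counts
        return loss_per_anchor[pos_mask.any(dim=1)].mean()

    # Toy example: two groups of alias embeddings (random placeholders).
    emb = torch.randn(4, 384)
    labels = torch.tensor([0, 0, 1, 1])
    print(supcon_loss(emb, labels))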
LinkTransformer uses IndexFlatIP from FAISS (Johnson et al., 2019) as the index of choice, allowing
an exhaustive search to get k nearest neighbours. We use the inner-product as the metric. All embeddings
from the encoders are L2-normalized such that the distances (inner-products) given by the FAISS indices
are equivalent to cosine similarity.
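The index setup described above can be sketched as follows; the embedding matrices are random placeholders and the value of k is illustrative.

    # Exact inner-product search over L2-normalized embeddings (cosine similarity).
    import faiss
    import numpy as np

    d = 384                                            # embedding dimension (illustrative)
    key_emb = np.random.rand(10000, d).astype("float32")   # placeholder key embeddings
    query_emb = np.random.rand(5, d).astype("float32")     # placeholder query embeddings

    faiss.normalize_L2(key_emb)                        # in-place L2 normalization
    faiss.normalize_L2(query_emb)

    index = faiss.IndexFlatIP(d)                       # exhaustive inner-product search
    index.add(key_emb)
    scores, neighbors = index.search(query_emb, 10)    # cosine scores and ids of 10 nearest keys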
Code to replicate the tables below and train the models is available on our repository, which also contains links to our training data.
Table S-1: We used the above sentence-transformers models for different languages as base models to train
LinkTransformer models. They were selected from the Hugging Face model hub and the names correspond to the
repo names on the Hub.
Model    Training Data
lt-wikidata-comp-en    Wikidata English-language company names.
lt-wikidata-comp-fr    Wikidata French-language company names.
lt-wikidata-comp-de    Wikidata German-language company names.
lt-wikidata-comp-ja    Wikidata Japanese-language company names.
lt-wikidata-comp-zh    Wikidata Chinese-language company names.
lt-wikidata-comp-es    Wikidata Spanish-language company names.
lt-wikidata-comp-multi    Wikidata multilingual company names (en, fr, es, de, ja, zh).
lt-wikidata-comp-prod-ind-ja    Wikidata Japanese-language company names and industries.
lt-un-data-fine-fine-en    UN fine-level product data in English.
lt-un-data-fine-coarse-en    UN coarse-level product data in English.
lt-un-data-fine-industry-en    UN product data linked to industries in English.
lt-un-data-fine-fine-es    UN fine-level product data in Spanish.
lt-un-data-fine-coarse-es    UN coarse-level product data in Spanish.
lt-un-data-fine-industry-es    UN product data linked to industries in Spanish.
lt-un-data-fine-fine-fr    UN fine-level product data in French.
lt-un-data-fine-coarse-fr    UN coarse-level product data in French.
lt-un-data-fine-industry-fr    UN product data linked to industries in French.
lt-un-data-fine-fine-multi    UN fine-level product data in multiple languages.
lt-un-data-fine-coarse-multi    UN coarse-level product data in multiple languages.
lt-un-data-fine-industry-multi    UN product data linked to industries in multiple languages.
Table S-2: Model names and training data sources for various models in the LinkTransformer model zoo. Each of
these models is on the Hugging Face hub and can be found by prefixing the organization name dell-research-harvard
(for example, dell-research-harvard/lt-wikidata-comp-multi). Training code can be found on our package GitHub
repo and training configs containing the hyperparameters are available in the model repo on the Hugging Face Hub.
Model    Training Size    Validation Size    Test Size
lt-wikidata-comp-es 2252 359 345
lt-wikidata-comp-fr 6728 1034 1055
lt-wikidata-comp-ja 10120 1629 1616
lt-wikidata-comp-zh 5572 1087 1014
lt-wikidata-comp-de 10131 1514 1511
lt-wikidata-comp-en 28731 4132 4070
lt-wikidata-comp-multi 44660 149 149
lt-wikidata-comp-prod-ind-ja 1190 9950 9902
lt-un-data-fine-fine-en 1252 495 649
lt-un-data-fine-coarse-en 146 18 19
lt-un-data-fine-industry-en 146 18 19
lt-un-data-fine-fine-es 1252 274 310
lt-un-data-fine-coarse-es 146 18 19
lt-un-data-fine-industry-es 144 18 19
lt-un-data-fine-fine-fr 1185 210 255
lt-un-data-fine-coarse-fr 141 18 19
lt-un-data-fine-industry-fr 143 18 18
lt-un-data-fine-fine-multi 1252 210 255
lt-un-data-fine-coarse-multi 146 18 19
lt-un-data-fine-industry-multi 143 18 18
mexicantrade4748 1593 334 334
historicjapanesecompanies 412 33 36
Table S-3: Model names and training, validation, and test sizes for various models in the LinkTransformer model
zoo. The numbers correspond to the number of classes in each split - not the total number of examples. The data
were split into test-train-val at the class level to avoid test set leakage. For historicjapanesecompanies the dataset is
of positive and negative pairs - so we report only the counts of positive pairs for consistency with other tasks.
Model Edit Distance SBERT LT OpenAI
Company Linkage
lt-wikidata-comp-es 0.69 0.68 0.75 0.81
lt-wikidata-comp-fr 0.55 0.78 0.84 0.80
lt-wikidata-comp-ja 0.53 0.62 0.70 0.62
lt-wikidata-comp-zh 0.61 0.73 0.80 0.80
lt-wikidata-comp-de 0.63 0.69 0.77 0.77
lt-wikidata-comp-en 0.58 0.72 0.74 0.74
lt-wikidata-comp-multi 0.69 0.62 0.80 0.74
lt-wikidata-comp-prod-ind-ja 0.96 0.95 0.99 0.98
Fine Product Linkage
lt-un-data-fine-fine-en 0.58 0.76 0.83 0.78
lt-un-data-fine-fine-es 0.57 0.67 0.76 0.68
lt-un-data-fine-fine-fr 0.52 0.67 0.73 0.68
lt-un-data-fine-fine-multi 0.52 0.57 0.73 0.68
Product to Industry Linkage
lt-un-data-fine-industry-en 0.29 0.74 0.84 0.95
lt-un-data-fine-industry-es 0.41 0.95 0.89 0.89
lt-un-data-fine-industry-fr 0.19 0.72 0.67 0.61
lt-un-data-fine-industry-multi 0.19 0.67 0.67 0.61
Product Aggregation
lt-un-data-fine-coarse-en 0.61 0.95 0.95 0.95
lt-un-data-fine-coarse-es 0.47 0.95 0.95 0.95
lt-un-data-fine-coarse-fr 0.53 0.89 1.00 0.95
lt-un-data-fine-coarse-multi 0.53 0.95 1.00 0.95
Table S-4: Performance of various embedding models. Performance is measured by retrieval accuracy at 1 - whether
the nearest neighbor in terms of edit distance or cosine similarity is a "relevant" entity. Company linkage links
company aliases together, Fine Product Linkage links products from different product classifications together,
Product to Industry Linkage links products to their industry classifications, and Product Aggregation links a fine
product to its coarser product classification. LT gives the performance of the trained LinkTransformer model
specified in the Model column (and found on Hugging Face). Edit Distance gives the linkage accuracy when using
Levenshtein distance as the distance metric. SBERT gives linkage accuracy with the base off-the-shelf model used
to tune the LinkTransformer model (as specified for each language in S-1). OpenAI gives linkage performance
when using embeddings from the OpenAI embedding API (text-embedding-ada-002).
Dataset    Semantic Sim    Fine Tuned    Edit Distance    OpenAI ADA    LT UN/Wiki Model
mexicantrade4748 0.81 0.89 0.78 0.88 0.88
historicjapanesecompanies 0.66 0.79 0.29 0.83 0.80
Table S-5: Historical Linking. We examine the base semantic similarity model off-the-shelf, a fine-tuned
LinkTransformer version, Levenshtein edit distance on the tariff description or company name, OpenAI em-
beddings, and a pre-trained LinkTransformer model (lt-un-data-fine-fine-multi for mexicantrade4748 and lt-wikidata-comp-prod-ind-ja for historicjapanesecompanies). The table reports top-1 accuracy.
Type    Dataset    Domain    Size    # Pos.    # Attr.    Ours (ZS)    Ours (FT)    Magellan    DeepMatcher    Ditto    REMS
Structured    BeerAdvo-RateBeer    beer    450    68    4    82.35    87.5    78.8    72.7    84.59    96.65
Structured    iTunes-Amazon1    music    539    132    8    70    80    91.2    88.5    92.28    98.18
Structured    Fodors-Zagats    restaurant    946    110    6    88    93    100    100    98.14    100
Structured    DBLP-ACM1    citation    12,363    2,220    4    90    97.5    98.4    98.4    98.96    98.18
Structured    DBLP-Scholar1    citation    28,707    5,347    4    76    91.4    92.3    94.7    95.6    91.74
Structured    Amazon-Google    software    11,460    1,167    3    39.4    68    49.1    69.3    74.1    65.3
Structured    Walmart-Amazon1    electronics    10,242    962    5    28    69    71.9    67.6    85.81    71.34
Textual    Abt-Buy    product    9,575    1,028    3    33.6    78.4    33    55    88.85    67.4
Textual    Company    company    112,632    28,200    1    74.07    88    79.8    92.7    41.00    80.73
Dirty    iTunes-Amazon2    music    539    132    8    74    81.3    46.8    79.4    92.92    94.74
Dirty    DBLP-ACM2    citation    12,363    2,220    4    79.6    97.2    91.9    98.1    98.92    98.19
Dirty    DBLP-Scholar2    citation    28,707    5,347    4    75    91.2    82.5    93.8    95.44    91.76
Dirty    Walmart-Amazon2    electronics    10,242    962    5    25    65    37.4    53.8    82.56    65.74
Table S-6: Benchmarks. ZS is LinkTransformer models zero-shot and FT is LinkTransformer models fine-tuned on the benchmark. The remaining columns report comparisons. The metric is F1, as these datasets frame linkage as a binary classification problem.
References
Ran Abramitzky, Leah Boustan, Katherine Eriksson, James Feigenbaum, and Santiago Pérez. 2021. Automated
linking of historical data. Journal of Economic Literature, 59(3):865–918.
Abhishek Arora, Xinmei Yang, Shao Yu Jheng, and Melissa Dell. 2023. Linking representations with multimodal
contrastive learning. arXiv preprint arXiv:2304.03464.
Lukas Biewald. 2020. Experiment tracking with weights and biases. Software available from wandb.com.
Olivier Binette and Rebecca C Steorts. 2022. (almost) all of entity resolution. Science Advances, 8(12):eabi8021.
Ursin Brunner and Kurt Stockinger. 2020. Entity matching with transformer architectures-a step forward in data
integration. In 23rd International Conference on Extending Database Technology, Copenhagen, 30 March-2
April 2020, pages 463–473. OpenProceedings.
Sanjib Das, A Doan, C Gokhale Psgc, Pradap Konda, Yash Govind, and Derek Paulsen. 2015. The magellan data
repository.
Kawin Ethayarajh. 2019. How contextual are contextualized word representations? comparing the geometry of bert,
elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512.
Forest Gregg and Derek Eder. 2022. dedupe.
Benjamin Hsu and Graham Horwood. 2022. Contrastive representation learning for cross-document coreference
resolution of events and entities. arXiv preprint arXiv:2205.11438.
Jinji Koshinjo. 1954. Nihon shokuinroku. Jinji Koshinjo.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions
on Big Data, 7(3):535–547.
Salil Rajeev Joshi, Arpan Somani, and Shourya Roy. 2021. Relink: Complete-link industrial record linkage
over hybrid feature spaces. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages
2625–2636. IEEE.
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau
Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu,
and Dilip Krishnan. 2020. Supervised contrastive learning. arXiv preprint arXiv:2004.11362.
Hanna Köpcke, Andreas Thor, and Erhard Rahm. 2010. Evaluation of entity resolution approaches on real-world
match problems. Proceedings of the VLDB Endowment, 3(1-2):484–493.
Vladimir I Levenshtein et al. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet
physics doklady, volume 10, pages 707–710. Soviet Union.
Yuliang Li, Jinfeng Li, Yoshi Suhara, AnHai Doan, and Wang-Chiew Tan. 2023. Effective entity matching with
transformers. The VLDB Journal, pages 1–21.
Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, and Wang-Chiew Tan. 2020. Deep entity matching with
pre-trained language models. arXiv preprint arXiv:2004.00584.
Ralph Peeters and Christian Bizer. 2023. Using chatgpt for entity matching. arXiv preprint arXiv:2305.03423.
Anna Primpeli, Ralph Peeters, and Christian Bizer. 2019. The wdc training dataset and gold standard for large-scale
product matching. In Companion Proceedings of The 2019 World Wide Web Conference, pages 381–386.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv
preprint arXiv:1908.10084.
Secretaria de Economía de Mexico. 1948. Ajuste de las fracciones de la tarifa arancelaria que rigieron hasta el
año de 1947 con las de la tarifa que entró en vigor por decreto de fecha 13 de diciembre del mismo año y se
consideraron a partir de 1948. In Anuario Estadístico del Comercio Exterior de los Estados Unidos Mexicanos.
Gobierno de Mexico.
Emily Silcock, Luca D’Amico-Wong, Jinglin Yang, and Melissa Dell. 2023. Noise-robust de-duplication at scale.
International Conference on Learning Representations.
Jiawei Tang, Yifei Zuo, Lei Cao, and Samuel Madden. 2022. Generic entity resolution models. In NeurIPS 2022
First Table Representation Workshop.
Teikoku Koshinjo. 1957. Teikoku Ginko Kaisha Yoroku. Teikoku Koshinjo.
Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment
and uniformity on the hypersphere. In International Conference on Machine Learning, volume 119, pages
9929–9939. PMLR.
Ross Wightman. 2019. Pytorch image models. https://github.com/rwightman/pytorch-image-models.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2019. Scalable zero-shot
entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, and Melissa Dell. 2023. Quantifying character similarity with
vision transformers. arXiv preprint arXiv:2305.14672.
Huchen Zhou, Wenfeng Huang, Mohan Li, and Yulin Lai. 2022. Relation-aware entity matching using sentence-bert.
Computers, Materials & Continua, 71(1).