Febrl: a freely available record linkage system with a graphical user interface

Author: Peter Christen

HDKM '08: Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80

Pages 17 - 25

Published: 01 January 2008

Abstract

Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within 'black box' commercial software. This makes it difficult for users to learn about new record linkage techniques, as well as to compare existing linkage techniques with new ones. What is required are flexible tools that enable users to experiment with new record linkage techniques at low cost.

This paper describes the Febrl (Freely Extensible Biomedical Record Linkage) system, which is available under an open source software licence. It contains many recently developed advanced techniques for data cleaning and standardisation, indexing (blocking), field comparison, and record pair classification, and encapsulates them into a graphical user interface. Febrl can be seen as a training tool suitable for users to learn and experiment with both traditional and new record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records.
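The four stages named above (cleaning and standardisation, indexing or blocking, field comparison, and record pair classification) can be illustrated with a minimal Python sketch. This is not Febrl's code or API. The field names, the blocking key, the use of `difflib.SequenceMatcher` as the approximate string comparator, and the two thresholds are all assumptions chosen only to make the flow of data through the stages concrete.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical field names, used for illustration only.
FIELDS = ("given_name", "surname", "postcode")


def standardise(record):
    """Cleaning: lower-case every field and collapse runs of whitespace."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def blocking_key(record):
    """Indexing (blocking): assumed key = first letter of surname + postcode."""
    return record["surname"][:1] + record["postcode"]


def candidate_pairs(recs_a, recs_b):
    """Yield only the record pairs that share a blocking key."""
    index = defaultdict(list)
    for id_b, rec in recs_b.items():
        index[blocking_key(rec)].append(id_b)
    for id_a, rec in recs_a.items():
        for id_b in index.get(blocking_key(rec), []):
            yield id_a, id_b


def compare(rec_a, rec_b):
    """Field comparison: one approximate similarity in [0, 1] per field."""
    return [SequenceMatcher(None, rec_a[f], rec_b[f]).ratio() for f in FIELDS]


def classify(similarities, lower=0.6, upper=0.85):
    """Classification: threshold the mean similarity (illustrative values)."""
    score = sum(similarities) / len(similarities)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


def link(data_a, data_b):
    """Run all four stages and return {(id_a, id_b): class label}."""
    recs_a = {i: standardise(r) for i, r in data_a.items()}
    recs_b = {i: standardise(r) for i, r in data_b.items()}
    return {
        (a, b): classify(compare(recs_a[a], recs_b[b]))
        for a, b in candidate_pairs(recs_a, recs_b)
    }
```

Calling `link(data_a, data_b)` on two dictionaries that map record identifiers to field dictionaries returns a label for every candidate pair. Pairs that do not share a blocking key are never compared, which is what keeps the number of comparisons manageable, at the price of missing true matches whose key fields contain errors.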

Published In

HDKM '08: Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80, January 2008, 84 pages

ISBN: 9781920682613

Sponsors: Australian Computer Society; CORE - Computing Research and Education; University of Wollongong, Australia

Publisher: Australian Computer Society, Inc., Australia

Conference: HDKM '08: Health data and knowledge management, January 1, 2008, Wollongong, NSW, Australia
