DOI: 10.5555/1385089.1385094

Febrl: a freely available record linkage system with a graphical user interface

Published: 01 January 2008
    Abstract

    Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage techniques. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within 'black box' commercial software. This makes it difficult for users to learn about new record linkage techniques, as well as to compare existing linkage techniques with new ones. What is required are flexible tools that enable users to experiment with new record linkage techniques at low cost.
    This paper describes the Febrl (Freely Extensible Biomedical Record Linkage) system, which is available under an open source software licence. It contains many recently developed advanced techniques for data cleaning and standardisation, indexing (blocking), field comparison, and record pair classification, and encapsulates them into a graphical user interface. Febrl can be seen as a training tool suitable for users to learn and experiment with both traditional and new record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records.
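
    The four stages named in the abstract (cleaning and standardisation, indexing or blocking, field comparison, and record pair classification) can be illustrated with a minimal Python sketch. This is not Febrl's API or its algorithms: the records, the field names, the helper functions `clean`, `candidate_pairs`, `compare` and `link`, the use of postcode as the blocking key, the use of `difflib.SequenceMatcher` as the string comparator, and the threshold of 1.5 are all assumptions chosen only to show how the stages fit together.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; names and fields are illustrative only.
RECORDS = [
    {"id": 1, "given": " Gail ", "surname": "Smith", "postcode": "2600"},
    {"id": 2, "given": "gayle", "surname": "SMITH", "postcode": "2600"},
    {"id": 3, "given": "Tom", "surname": "Miller", "postcode": "2000"},
    {"id": 4, "given": "Sara", "surname": "Stone", "postcode": "2600"},
]


def clean(record):
    """Stage 1, cleaning: trim whitespace and lower-case every string field."""
    return {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in record.items()
    }


def candidate_pairs(records, key="postcode"):
    """Stage 2, blocking: only pair up records that share the blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key]].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def compare(a, b, fields=("given", "surname")):
    """Stage 3, field comparison: one similarity in [0, 1] per field."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def link(records, threshold=1.5):
    """Stage 4, classification: sum the similarities and apply a threshold."""
    cleaned = [clean(r) for r in records]
    return [
        (a["id"], b["id"])
        for a, b in candidate_pairs(cleaned)
        if sum(compare(a, b)) >= threshold
    ]


if __name__ == "__main__":
    print(link(RECORDS))
```

    Blocking on postcode means record 3 is never compared with the others, which reduces the six possible pairs to three. A single summed-similarity threshold is the simplest possible classifier and stands in here for the probabilistic and learning-based classifiers the abstract refers to.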


                Published In

                HDKM '08: Proceedings of the second Australasian workshop on Health data and knowledge management - Volume 80
                January 2008
                84 pages

                Sponsors

                • Australian Comp Soc: Australian Computer Society
                • CORE - Computing Research and Education
                • University of Wollongong, Australia

                Publisher

                Australian Computer Society, Inc.

                Australia

                Author Tags

                1. GUI
                2. data cleaning
                3. data integration
                4. data matching
                5. deduplication
                6. health data linkage
                7. open source software
                8. record linkage software

                Qualifiers

                • Research-article

                Conference

                HDKM '08
                Sponsor:
                • Australian Comp Soc
                HDKM '08: Health data and knowledge management
                January 1, 2008
                Wollongong, NSW, Australia
