Experience: Enhancing Address Matching with Geocoding and Similarity Measure Selection

Published: 07 September 2018
    Abstract

    Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge.
    In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure that, first, enriches each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, selects the best similarity measure for each address attribute, which in turn helps the classifier achieve the best F-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use-case.
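
    The second step of the procedure, choosing one similarity measure per address attribute so that the F-measure is maximized, can be illustrated with a minimal sketch. It is not the article's implementation: the two candidate measures, the fixed threshold of 0.7, the function names, and the toy records are all assumptions made for illustration, and a simple threshold rule stands in for the classifier mentioned in the abstract. The geocoding enrichment step is skipped entirely.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def levenshtein_sim(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard_sim(a: str, b: str) -> float:
    """Jaccard coefficient over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def f_measure(predicted: List[bool], gold: List[bool]) -> float:
    """Harmonic mean of precision and recall (0.0 if no true positives)."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate set; the article evaluates its own measures.
MEASURES: Dict[str, Callable[[str, str], float]] = {
    "levenshtein": levenshtein_sim,
    "jaccard": jaccard_sim,
}


def select_measures(pairs: List[Tuple[Record, Record]],
                    labels: List[bool],
                    threshold: float = 0.7) -> Dict[str, str]:
    """For each attribute, keep the measure whose thresholded scores
    give the highest F-measure on the labeled pairs."""
    best = {}
    for attr in pairs[0][0]:
        scores = {
            name: f_measure(
                [sim(a[attr], b[attr]) >= threshold for a, b in pairs],
                labels)
            for name, sim in MEASURES.items()
        }
        best[attr] = max(scores, key=scores.get)
    return best


# Toy labeled pairs (invented): one true match, one non-match.
PAIRS = [
    ({"street": "Main Street 12", "city": "Potsdam"},
     {"street": "12 Main Street", "city": "Potsdamm"}),
    ({"street": "Park Avenue 5", "city": "Berlin"},
     {"street": "Lake Road 9", "city": "Munich"}),
]
LABELS = [True, False]

print(select_measures(PAIRS, LABELS))
```

    On this toy data the token-based measure suits the street attribute, where word order varies, while the character-based measure suits the city attribute, where the difference is a typo. Which measure wins on real data depends on the error patterns of each attribute.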

    Information

    Published In

    Journal of Data and Information Quality  Volume 10, Issue 2
    Challenge Papers and Experience Paper
    June 2018
    31 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3276749
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 September 2018
    Accepted: 01 June 2018
    Revised: 01 April 2018
    Received: 01 October 2017
    Published in JDIQ Volume 10, Issue 2

    Author Tags

    1. Address matching
    2. address normalization
    3. address parsing
    4. conditional functional dependencies
    5. duplicate detection
    6. geocoding
    7. geographic information systems
    8. random forest
    9. record linkage
    10. similarity measures

    Qualifiers

    • Research-article
    • Research
    • Refereed
