DOI: 10.1145/2534848.2534850

A Comparison of String Similarity Measures for Toponym Matching

Published: 03 October 2013

Abstract

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
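As a rough illustration of the kind of matching task described above, the following Python sketch ranks gazetteer entries by a length-normalized edit-distance similarity. It is only a minimal sketch under stated assumptions: the function names (`levenshtein`, `similarity`, `best_match`), the normalization by the longer string, and the three-entry gazetteer are all hypothetical choices made for this example, and the code is not the authors' implementation or evaluation setup.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; normalizing by the longer string is an assumption."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, gazetteer: Iterable[str]) -> str:
    """Return the gazetteer entry most similar to the query."""
    return max(gazetteer, key=lambda name: similarity(query, name))


# Hypothetical gazetteer entries, used only for illustration.
names = ["Lisbon", "Lisburn", "Lismore"]
print(best_match("Lisboa", names))
```

Because the abstract reports that the best-performing measure varied across datasets, a measure like this one would need to be evaluated on the target country's or language's names before being relied on.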

Published In

COMP '13: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place
November 2013
75 pages
ISBN:9781450325356
DOI:10.1145/2534848
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data integration
  2. duplicate detection
  3. edit distance
  4. gazetteers
  5. geographic information retrieval
  6. string similarity
  7. toponyms

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

SIGSPATIAL'13

Acceptance Rates

COMP '13 Paper Acceptance Rate 8 of 14 submissions, 57%;
Overall Acceptance Rate 8 of 14 submissions, 57%

Cited By

  • (2024) Towards Automatically Matching Security Advisories to CPEs: String Similarity-based Vendor Matching. 2024 International Conference on Computing, Networking and Communications (ICNC), 233-238. DOI: 10.1109/ICNC59896.2024.10556231. Online publication date: 19-Feb-2024
  • (2023) Improving Embeddings for High-Accuracy Transformer-Based Address Matching Using a Multiple in-Batch Negatives Loss. 2023 International Conference on Machine Learning and Applications (ICMLA), 120-127. DOI: 10.1109/ICMLA58977.2023.00025. Online publication date: 15-Dec-2023
  • (2023) Few-shot entity linking of food names. Information Processing and Management: an International Journal, 60:5. DOI: 10.1016/j.ipm.2023.103463. Online publication date: 1-Sep-2023
  • (2021) Un gazetier des places portuaires françaises du xviiie siècle (A Gazetteer of French Eighteenth-Century Ports). Humanités numériques. DOI: 10.4000/revuehn.1164. Online publication date: 1-May-2021
  • (2020) A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching. Proceedings of the 28th International Conference on Advances in Geographic Information Systems, 385-388. DOI: 10.1145/3397536.3422236. Online publication date: 3-Nov-2020
  • (2020) Learning Advanced Similarities and Training Features for Toponym Interlinking. Advances in Information Retrieval, 111-125. DOI: 10.1007/978-3-030-45439-5_8. Online publication date: 14-Apr-2020
  • (2018) The Development and Implementation of the Data-Binding Algorithm in the State Civil Information System. Automatic Documentation and Mathematical Linguistics, 52:4, 195-202. DOI: 10.5555/3288805.3288864. Online publication date: 1-Jul-2018
  • (2018) Stable assessment of the quality of similarity algorithms of character strings and their normalizations. Program Systems: Theory and Applications (Программные системы: теория и приложения), 9:4, 579-596. DOI: 10.25209/2079-3316-2018-9-4-579-596. Online publication date: 2018
  • (2018) Stable assessment of the quality of similarity algorithms of character strings and their normalizations. Program Systems: Theory and Applications (Программные системы: теория и приложения), 9:4, 561-578. DOI: 10.25209/2079-3316-2018-9-4-561-578. Online publication date: 2018
  • (2017) L2D. Proceedings of the 2017 International Conference on Computer Science and Artificial Intelligence, 242-246. DOI: 10.1145/3168390.3168413. Online publication date: 5-Dec-2017