short-paper

Open access

Learning Geolocation by Accurately Matching Customer Addresses via Graph based Active Learning

Authors:

Saket Maheshwary,

Saurabh Sohoney

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

Pages 457 - 463

https://doi.org/10.1145/3543873.3584647

Published: 30 April 2023

Abstract

We propose a novel adaptation of graph-based active learning for customer address resolution (de-duplication), where the aim is to determine whether two addresses represent the same physical building. For delivery systems, better address resolution benefits multiple downstream systems such as geocoding, route planning and delivery time estimation, leading to a more efficient and reliable delivery experience for both customers and delivery agents. Our approach jointly leverages address text, past delivery information and concepts from graph theory to retrieve informative and diverse record pairs to label. We empirically show the effectiveness of our approach on a manually curated dataset of addresses from India (IN) and the United Arab Emirates (UAE). We achieved an absolute improvement in recall on average across IN and UAE while preserving precision relative to the existing production system. We also introduce delivery point (DP) geocode learning for cold-start addresses as a downstream application of address resolution. In addition to the offline evaluation, we performed online A/B experiments, which show that when the production model is augmented with actively learnt record pairs, delivery precision improved and delivery defects were reduced on average across shipments from IN and UAE.
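
The abstract does not spell out how record pairs are chosen for labeling. Purely as an illustration, the idea of retrieving pairs that are both informative and diverse could be sketched as below. Everything in the sketch is an assumption rather than the authors' method: the function names, the hypothetical match scores in [0, 1], the use of closeness to 0.5 as "informative", and the use of connected components of the candidate-pair graph as a stand-in for "diverse".

```python
from collections import defaultdict


def connected_components(pairs):
    """Map each address id to a component label using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {node: find(node) for node in list(parent)}


def select_pairs_to_label(scored_pairs, budget):
    """Pick up to `budget` candidate pairs for manual labeling.

    ASSUMPTION (not from the paper): `scored_pairs` maps
    (addr_a, addr_b) -> a hypothetical matcher score in [0, 1].
    Pairs scored near 0.5 count as informative; spreading picks
    across graph components stands in for diversity.
    """
    component = connected_components(scored_pairs.keys())
    by_component = defaultdict(list)
    for pair, score in scored_pairs.items():
        # 1.0 when the score is 0.5, 0.0 when it is 0 or 1.
        uncertainty = 1.0 - 2.0 * abs(score - 0.5)
        by_component[component[pair[0]]].append((uncertainty, pair))
    for bucket in by_component.values():
        bucket.sort(reverse=True)  # most uncertain pair first

    selected = []
    rank = 0
    # Round-robin: in each round take the next-best pair of every component.
    while len(selected) < budget:
        round_items = [b[rank] for b in by_component.values() if len(b) > rank]
        if not round_items:
            break
        round_items.sort(reverse=True)
        for _, pair in round_items:
            if len(selected) == budget:
                break
            selected.append(pair)
        rank += 1
    return selected


# Hypothetical scores for four candidate pairs in three components.
scored = {
    ("a1", "a2"): 0.55,
    ("a2", "a3"): 0.60,
    ("b1", "b2"): 0.95,
    ("c1", "c2"): 0.30,
}
print(select_pairs_to_label(scored, budget=3))
```

With a budget of three, the sketch picks one pair from each component, so the confident pair ("b1", "b2") is chosen ahead of the more uncertain ("a2", "a3"). That trade-off between uncertainty and coverage is the point of the illustration; a real system would tune it.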

Published In

WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023

April 2023

1567 pages

ISBN:9781450394192

DOI:10.1145/3543873

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

WWW '23

Sponsor:

SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
