DOI: 10.1145/3318464.3386143
research-article
Open access

Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial Applications

Published: 31 May 2020

    Abstract

    Entity matching, the task of clustering duplicated database records to their underlying entities, has become an increasingly critical component of modern data integration management. Amperity provides a platform for businesses to manage customer data that uses a machine-learning approach to entity matching, resolving billions of customer records on a daily basis. Deploying entity matching to industrial applications at scale raises several challenges that are less prominent in the literature: (1) Providing not just a single entity clustering, but clusterings at multiple confidence levels, to serve downstream applications with varying precision/recall trade-off needs. (2) Handling customer record attributes that are systematically missing from different sources of data, which creates many pairs of records in a cluster that appear not to match because of incomplete, rather than conflicting, information. Allowing these records to connect transitively without introducing conflicts is invaluable to businesses, because they can acquire a more comprehensive profile of their customers without incorrect entity merges. (3) Clustering records over time and assigning persistent cluster IDs that can be used for downstream use cases such as A/B tests or predictive model training. This is made more challenging by the fact that we receive new customer data every day, and clusters that naturally evolve over time still require persistent IDs that refer to the same entity. In this work, we describe Amperity's entity matching framework, Fusion, and how its design provides solutions to these challenges. In particular, we describe our pairwise matching model based on ordinal regression, which permits a well-defined way to produce entity clusterings at different confidence levels; a novel clustering algorithm that separates conflicting record pairs in clusters while allowing for pairs that may appear dissimilar due to missing data; and a persistent ID generation algorithm that balances stability of the identifier with ever-evolving entities.
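
The idea of producing clusterings at several confidence levels from ordinal pairwise scores can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the ordinal labels 0 to 3, the record IDs, the function name `cluster_at_level`, and the use of plain connected components (union-find) as the clustering step. In particular, it does not implement Fusion's conflict-aware clustering; it only shows how a single ordinal score per pair yields nested clusterings as the threshold is lowered.

```python
from collections import defaultdict


def cluster_at_level(records, pair_scores, min_level):
    """Group records into connected components, linking only pairs whose
    (hypothetical) ordinal match score is at least min_level."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), level in pair_scores.items():
        if level >= min_level:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for r in records:
        groups[find(r)].add(r)
    # Sort for a deterministic, comparable result.
    return sorted(sorted(g) for g in groups.values())


# Illustrative data: ordinal scores from 0 (non-match) to 3 (confident match).
records = ["r1", "r2", "r3", "r4"]
pair_scores = {
    ("r1", "r2"): 3,
    ("r2", "r3"): 2,
    ("r1", "r3"): 1,
    ("r3", "r4"): 0,
}

high_precision = cluster_at_level(records, pair_scores, min_level=3)
high_recall = cluster_at_level(records, pair_scores, min_level=2)
```

Because lowering the threshold only adds links, every cluster at the stricter level is contained in a cluster at the looser level, which is what lets downstream applications pick a precision/recall trade-off from the same set of scores.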

    Supplementary Material

    MP4 File (3318464.3386143.mp4)
    Presentation video

    Published In

    SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
    June 2020
    2925 pages
    ISBN:9781450367356
    DOI:10.1145/3318464
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cluster id assignment
    2. conflict resolution in clustering
    3. multi-level entity matching

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '20
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 744
    • Downloads (Last 6 weeks): 77
    Cited By

    • (2024) ERABQS: entity resolution based on active machine learning and balancing query strategy. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00853-0. Online publication date: 26-Mar-2024
    • (2023) Splitting Tuples of Mismatched Entities. Proceedings of the ACM on Management of Data 1:4 (1-29). DOI: 10.1145/3626763. Online publication date: 12-Dec-2023
    • (2023) A high-performance turnkey system for customer lifetime value prediction in retail brands. Quantitative Marketing and Economics 22:2 (169-192). DOI: 10.1007/s11129-023-09272-x. Online publication date: 8-Nov-2023
    • (2023) Big Data Integration for Industry 4.0. Digital Transformation (247-268). DOI: 10.1007/978-3-662-65004-2_10. Online publication date: 3-Feb-2023
    • (2022) Exploring the use of topological data analysis to automatically detect data quality faults. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.931398. Online publication date: 5-Dec-2022
    • (2022) Saga: A Platform for Continuous Construction and Serving of Knowledge at Scale. Proceedings of the 2022 International Conference on Management of Data (2259-2272). DOI: 10.1145/3514221.3526049. Online publication date: 10-Jun-2022
    • (2022) Machine Learning and Data Cleaning: Which Serves the Other? Journal of Data and Information Quality 14:3 (1-11). DOI: 10.1145/3506712. Online publication date: 21-Jul-2022
    • (2022) An Adaptable Framework for Entity Matching Model Selection in Business Enterprises. 2022 IEEE 24th Conference on Business Informatics (CBI) (90-99). DOI: 10.1109/CBI54897.2022.00017. Online publication date: Jun-2022
    • (2022) The Four Generations of Entity Resolution. Online publication date: 25-Feb-2022
    • (2021) FairER. Proceedings of the 30th ACM International Conference on Information & Knowledge Management (3004-3008). DOI: 10.1145/3459637.3482105. Online publication date: 26-Oct-2021
