DOI: 10.1145/3487075.3487142
research-article

Real-Time Entity Resolution by Forest-Based Indexing in Database Systems with Vertical Fragmentations

Published: 07 December 2021

Abstract

Entity resolution (ER) is the process of identifying which tuples (records) in a dataset (relation) refer to the same real-world entity. Real-time ER is challenging on large datasets. Schema decomposition, which partitions a relation (table) into a set of vertical fragments, is important in (distributed) database systems. This paper studies real-time ER in that setting. By building a forest-based index and defining ranking functions with corresponding algorithms, we propose an approach that resolves query tuples over dirty relations stored as a set of vertical fragments, where text attributes may contain duplicates, misspellings, or NULL values. Extensive experiments are conducted to demonstrate the performance of the proposed approach.
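
The setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the forest structure or the ranking functions, so the sketch substitutes a plain sorted list per fragment (with a small neighborhood window) for the forest index and uses `difflib.SequenceMatcher` as a stand-in ranking function. The relation, attribute names, and all function names are hypothetical.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

# Hypothetical dirty relation: (tuple id, name, city). None stands in for NULL.
RELATION = [
    (1, "john smith", "boston"),
    (2, "jon smith", "boston"),
    (3, "mary jones", None),
    (4, "peter brown", "austin"),
]

def vertical_fragments(rows):
    """Split the relation into two fragments that share the tuple id."""
    names = {tid: name for tid, name, _ in rows}
    cities = {tid: city for tid, _, city in rows}
    return {"name": names, "city": cities}

def build_sorted_index(fragment):
    """Sorted list of (value, tuple id); NULL values are left out."""
    return sorted((v, tid) for tid, v in fragment.items() if v is not None)

def candidates(index, value, window=1):
    """Tuple ids within `window` positions of where `value` would sort."""
    pos = bisect_left(index, (value,))
    lo, hi = max(0, pos - window), min(len(index), pos + window + 1)
    return {tid for _, tid in index[lo:hi]}

def similarity(a, b):
    """String similarity in [0, 1]; a NULL on either side scores 0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def resolve(query, fragments, indexes, window=1):
    """Rank candidate tuple ids by mean attribute similarity to the query."""
    cand = set()
    for attr, value in query.items():
        if value is not None:
            cand |= candidates(indexes[attr], value, window)
    scored = [
        (sum(similarity(query[a], fragments[a][tid]) for a in query) / len(query), tid)
        for tid in cand
    ]
    # Highest score first; ties broken by tuple id for a stable order.
    return sorted(scored, key=lambda s: (-s[0], s[1]))

FRAGMENTS = vertical_fragments(RELATION)
INDEXES = {attr: build_sorted_index(frag) for attr, frag in FRAGMENTS.items()}
RANKED = resolve({"name": "john smyth", "city": "boston"}, FRAGMENTS, INDEXES)
```

For the misspelled query above, tuples 1 and 2 (the two spellings of the same person) rank first and second. Each fragment is indexed and probed independently, so a NULL or a typo in one attribute does not prevent a match through another.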

Published In

CSAE '21: Proceedings of the 5th International Conference on Computer Science and Application Engineering
October 2021
660 pages
ISBN: 9781450389853
DOI: 10.1145/3487075
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Dirty dataset
2. Entity resolution
3. Forest index
4. Multi-tables
5. Ranking function

Qualifiers

• Research-article
• Research
• Refereed limited

Funding Sources

• Natural Science Foundation of Hebei Province of China

Conference

CSAE 2021

Acceptance Rates

Overall Acceptance Rate: 368 of 770 submissions, 48%
