research-article

Splitting Tuples of Mismatched Entities

Authors:

Mengyi Yan

Proceedings of the ACM on Management of Data, Volume 1, Issue 4

Article No.: 269, Pages 1 - 29

https://doi.org/10.1145/3626763

Published: 12 December 2023

Abstract

There has been a host of work on entity resolution (ER), to identify tuples that refer to the same entity. This paper studies the inverse of ER, to identify tuples to which distinct real-world entities are matched by mistake, and split such tuples into a set of tuples, one for each entity. We formulate the tuple splitting problem. We propose a scheme to decide what tuples to split and what tuples to correct without splitting, fix errors/assign attribute values to the split tuples, and impute missing values. The scheme introduces a class of rules, which embed predicates for aligning entities across relations and knowledge graphs G, assessing correlation between attributes, and extracting data from G. It unifies logic deduction, correlation models, and data extraction by chasing the data with the rules. We train machine learning models to assess attribute correlation and predict missing values. We develop algorithms for the tuple splitting scheme. Using real-life data, we empirically verify that the scheme is efficient and accurate, with F-measure 0.92 on average.
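The split-or-correct decision described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the rules, the chase, and the learned correlation models are replaced here by a simple agreement check against a toy knowledge graph. The names `split_tuple`, `KG`, the `name` key, and all data values are hypothetical and chosen only for illustration.

```python
# Hypothetical toy knowledge graph: entity id -> known attribute values.
KG = {
    "e1": {"name": "A. Smith", "birth_year": 1950, "city": "Munich", "job": "director"},
    "e2": {"name": "A. Smith", "birth_year": 1982, "city": "Zurich", "job": "author"},
    "e3": {"name": "B. Jones", "birth_year": 1950, "city": "Zurich"},
}


def split_tuple(t, kg, key="name"):
    """Return the tuples that replace t (toy heuristic, an assumption of this sketch)."""
    # Align: KG entities sharing t's key value are candidate matches.
    candidates = {eid: attrs for eid, attrs in kg.items()
                  if attrs.get(key) == t.get(key)}
    # Keep candidates that agree with t on at least one non-key, non-missing value.
    supported = [
        eid for eid, attrs in candidates.items()
        if any(a != key and v is not None and attrs.get(a) == v
               for a, v in t.items())
    ]
    if not supported:
        return [dict(t)]  # no evidence in the KG: leave t unchanged
    result = []
    for eid in supported:
        attrs = kg[eid]
        if len(supported) == 1:
            # One entity: correct without splitting; KG values override and fill gaps.
            new_t = {a: attrs.get(a, v) for a, v in t.items()}
        else:
            # Several entities: one tuple per entity, values extracted from the KG.
            new_t = {a: attrs.get(a) for a in t}
        result.append(new_t)
    return result


# A tuple mixing values of two entities: birth_year from e1, city from e2.
mixed = {"name": "A. Smith", "birth_year": 1950, "city": "Zurich", "job": None}
for row in split_tuple(mixed, KG):
    print(row)
```

In this sketch a tuple is split only when at least two aligned entities each explain some of its values; with a single supporting entity it is corrected in place, and missing values are imputed from the graph in both cases. A faithful implementation would instead rely on the rule class and trained models that the abstract describes.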

Index Terms

Splitting Tuples of Mismatched Entities
1. Information systems
  1. Data management systems
    1. Information integration

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 4

PACMMOD

December 2023

1317 pages

EISSN:2836-6573

DOI:10.1145/3637468

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

the National Key R&D Program of China
Longhua Science and Technology Innovation Bureau
NSFC
Royal Society Wolfson Research Merit Award
Guangdong Basic and Applied Basic Research Foundation
