research-article

Splitting Tuples of Mismatched Entities

Published: 12 December 2023
    Abstract

    There has been a host of work on entity resolution (ER), which identifies tuples that refer to the same entity. This paper studies the inverse of ER: identifying tuples to which distinct real-world entities have been matched by mistake, and splitting each such tuple into a set of tuples, one for each entity. We formulate the tuple splitting problem and propose a scheme that decides which tuples to split and which to correct without splitting, fixes errors in and assigns attribute values to the split tuples, and imputes missing values. The scheme introduces a class of rules that embed predicates for aligning entities across relations and knowledge graphs G, assessing correlation between attributes, and extracting data from G. It unifies logic deduction, correlation models, and data extraction by chasing the data with the rules. We train machine learning models to assess attribute correlation and predict missing values, and we develop algorithms for the tuple splitting scheme. Using real-life data, we empirically verify that the scheme is efficient and accurate, with an F-measure of 0.92 on average.
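
    The idea of splitting a mismatched tuple can be illustrated with a toy sketch. This is not the paper's algorithm: the rules, the chase, the correlation models, and the learned predictors are all replaced here by a plain dictionary lookup. The names KG, align, and split_tuple, and the sample records, are hypothetical and chosen only for illustration. The sketch aligns each attribute value of a tuple with entities in a small knowledge graph, emits one tuple per entity that is supported by at least one unambiguous value, and fills in attributes the original tuple lacked from the graph.

```python
from collections import defaultdict

# Hypothetical knowledge graph: entity id -> attribute values known for it.
KG = {
    "e1": {"name": "Alex Doe", "country": "Germany", "works": {"Film A"}},
    "e2": {"name": "Alex Doe", "country": "Switzerland", "works": {"Film B"}},
}


def align(attr, value):
    """Return the ids of KG entities whose attribute `attr` contains `value`."""
    hits = []
    for eid, props in KG.items():
        known = props.get(attr)
        if isinstance(known, set):
            if value in known:
                hits.append(eid)
        elif known == value:
            hits.append(eid)
    return hits


def split_tuple(t):
    """Split tuple `t` (attr -> set of values) into one tuple per KG entity.

    An entity counts as evidenced only if some value aligns with it alone.
    With fewer than two evidenced entities, the tuple is returned unsplit.
    """
    evidence = defaultdict(dict)  # eid -> {attr: values aligned with eid}
    distinguishing = set()        # entities supported by an unambiguous value
    for attr, values in t.items():
        for v in values:
            eids = align(attr, v)
            for eid in eids:
                evidence[eid].setdefault(attr, set()).add(v)
            if len(eids) == 1:
                distinguishing.add(eids[0])
    if len(distinguishing) < 2:
        return [t]
    result = []
    for eid in sorted(distinguishing):
        new_t = {attr: set(vals) for attr, vals in evidence[eid].items()}
        # Fill in attributes missing from the original tuple using the KG.
        for attr, known in KG[eid].items():
            if attr not in new_t:
                new_t[attr] = set(known) if isinstance(known, set) else {known}
        result.append(new_t)
    return result


# A tuple that conflates two people who share a name.
merged = {"name": {"Alex Doe"}, "works": {"Film A", "Film B"}}
for part in split_tuple(merged):
    print(part)
```

    In this sketch the shared name aligns with both entities and so gives no evidence for a split, while each film aligns with exactly one entity. A tuple carrying only the name is therefore returned unchanged.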

      Published In

      Proceedings of the ACM on Management of Data  Volume 1, Issue 4
      PACMMOD
      December 2023
      1317 pages
      EISSN:2836-6573
      DOI:10.1145/3637468
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data quality
      2. entity resolution
      3. tuple splitting

      Qualifiers

      • Research-article

      Funding Sources

      • the National Key R&D Program of China
      • Longhua Science and Technology Innovation Bureau
      • NSFC
      • Royal Society Wolfson Research Merit Award
      • Guangdong Basic and Applied Basic Research Foundation
