research-article

Enriching Relations with Additional Attributes for ER

Published: 30 August 2024

Abstract

This paper studies a new problem, relation enrichment. Given a relation D of schema R and a knowledge graph G with overlapping information, the problem is to identify a small number of relevant features from G and extend schema R with them as additional attributes, so as to maximally improve the accuracy of resolving the entities represented by the tuples of D. We formulate the enrichment problem and show that it is intractable. Nonetheless, we propose a method to extract features from G that are diverse from the existing attributes of R, minimize null values, and reduce the false positives and false negatives of entity resolution (ER) models. The method links tuples and vertices that refer to the same entity, learns a robust policy for extracting attributes via reinforcement learning, and jointly trains the policy and the ER models. Moreover, we develop algorithms for enriching D, including incrementally. Using real-life data, we experimentally verify that relation enrichment improves the accuracy of ER by more than 15.4 percentage points when 5 attributes are added, and by up to 33%.
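
To make the selection criteria in the abstract concrete, the following is a minimal sketch of a greedy baseline, not the reinforcement-learning policy the paper proposes. It assumes tuples have already been linked to vertices of G, so each row carries candidate values (with None for nulls). The names `fill_rate`, `redundancy`, `select_attributes`, and the `kg_` attribute prefix are hypothetical, and the score (fill rate minus redundancy) is only an illustrative proxy for "few nulls, diverse from existing attributes"; it does not measure ER accuracy.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def fill_rate(rows: List[Row], attr: str) -> float:
    """Fraction of rows that have a non-null value for attr."""
    if not rows:
        return 0.0
    return sum(r.get(attr) is not None for r in rows) / len(rows)


def redundancy(rows: List[Row], attr: str, existing: List[str]) -> float:
    """Largest fraction of rows where attr repeats the value of an existing attribute."""
    if not rows or not existing:
        return 0.0
    best = 0.0
    for other in existing:
        same = sum(
            r.get(attr) is not None and r.get(attr) == r.get(other) for r in rows
        )
        best = max(best, same / len(rows))
    return best


def select_attributes(
    rows: List[Row], existing: List[str], candidates: List[str], k: int
) -> List[str]:
    """Greedily pick at most k candidates that are well populated and non-redundant."""
    chosen: List[str] = []
    pool = list(candidates)
    while pool and len(chosen) < k:
        # Score each remaining candidate against the schema extended so far.
        scored = [
            (fill_rate(rows, a) - redundancy(rows, a, existing + chosen), a)
            for a in pool
        ]
        best_score, best_attr = max(scored, key=lambda s: s[0])
        if best_score <= 0:
            break  # no remaining candidate adds information under this proxy
        chosen.append(best_attr)
        pool.remove(best_attr)
    return chosen


# Toy data (invented for illustration): two existing attributes, three candidates.
rows = [
    {"name": "Ann Lee", "city": "Leeds", "kg_birthplace": "Leeds",
     "kg_employer": "Acme", "kg_spouse": None},
    {"name": "A. Lee", "city": "Leeds", "kg_birthplace": "Leeds",
     "kg_employer": "Acme", "kg_spouse": None},
    {"name": "Bob Ray", "city": "York", "kg_birthplace": "York",
     "kg_employer": "Initech", "kg_spouse": "Cy Ray"},
]
picked = select_attributes(
    rows, ["name", "city"], ["kg_birthplace", "kg_employer", "kg_spouse"], k=2
)
print(picked)
```

On this toy input, `kg_birthplace` is skipped because it duplicates `city` in every row, while `kg_spouse` is ranked below `kg_employer` because it is mostly null. A learned policy trained jointly with the ER model, as the abstract describes, would replace the hand-written score with feedback from ER accuracy.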


Published In

Proceedings of the VLDB Endowment, Volume 17, Issue 11
July 2024
1039 pages

Publisher

VLDB Endowment
