Abstract
Entity matching (EM) is a crucial step in data integration. Supervised machine learning (SML) approaches have attained state-of-the-art (SOTA) performance in EM. In real-world scenarios, however, SML suffers from the absence or scarcity of large labeled datasets for training. Active machine learning (AML) for EM aims to minimize the amount of training data required and to reduce the hand-labeling effort by selecting only the most helpful pairs. All AML approaches use just one of two criteria, informativeness or representativeness, for query selection, which limits their efficacy. In this work, we propose Combined Score Sampling (CSS), a query strategy that combines the informativeness and representativeness selection criteria. We evaluate CSS on the benchmark e-commerce dataset pair Abt-Buy, using AML with ensemble learning as the SML model, and demonstrate that it effectively addresses this limitation. Compared with SML, our strategy leads to an overall improvement in F1 score and in the stability of the learned models.
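To make the idea of combining the two criteria concrete, the following is a minimal sketch and not the scoring function defined in the paper, which the abstract does not specify. It assumes informativeness is measured as the binary entropy of a model's predicted match probability, representativeness as the mean cosine similarity of a candidate pair to the rest of the unlabeled pool, and the combination as a convex mix controlled by a hypothetical weight `alpha`. All function names and parameters are illustrative.

```python
import numpy as np


def informativeness(proba):
    """Binary entropy (in bits) of the predicted match probability.

    Assumption: uncertainty is used as the informativeness criterion.
    The value peaks at 1.0 when the probability is 0.5.
    """
    p = np.clip(np.asarray(proba, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def representativeness(features):
    """Mean cosine similarity of each candidate to the other pool members.

    Assumption: density in feature space stands in for representativeness.
    """
    x = np.asarray(features, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-12)  # guard against zero vectors
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity
    return sim.sum(axis=1) / max(len(x) - 1, 1)


def combined_score(proba, features, alpha=0.5):
    """Convex combination of the two criteria (hypothetical weighting)."""
    return alpha * informativeness(proba) + (1 - alpha) * representativeness(features)


def select_queries(proba, features, batch_size=1, alpha=0.5):
    """Return indices of the highest-scoring unlabeled pairs."""
    scores = combined_score(proba, features, alpha)
    return np.argsort(-scores)[:batch_size]


# Toy pool: each row is a similarity feature vector for one record pair.
pool_features = np.array([[0.9, 0.8], [0.5, 0.5], [0.1, 0.9], [0.52, 0.48]])
pool_proba = np.array([0.95, 0.5, 0.3, 0.9])  # model's match probabilities
print(select_queries(pool_proba, pool_features, batch_size=2))
```

Setting `alpha=1.0` reduces the sketch to pure uncertainty sampling and `alpha=0.0` to pure density-based sampling, which are the two single-criterion regimes the abstract contrasts CSS with.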
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jabrane, M., Hafidi, I., Rochd, Y. (2023). An Improved Active Machine Learning Query Strategy for Entity Matching Problem. In: Aboutabit, N., Lazaar, M., Hafidi, I. (eds) Advances in Machine Intelligence and Computer Science Applications. ICMICSA 2022. Lecture Notes in Networks and Systems, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-031-29313-9_28
DOI: https://doi.org/10.1007/978-3-031-29313-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28845-6
Online ISBN: 978-3-031-29313-9
eBook Packages: Intelligent Technologies and Robotics (R0)