An Improved Active Machine Learning Query Strategy for Entity Matching Problem

  • Conference paper
Advances in Machine Intelligence and Computer Science Applications (ICMICSA 2022)

Abstract

Entity matching (EM) is a crucial step in data integration. Supervised machine learning (SML) approaches have attained state-of-the-art (SOTA) performance in EM. In real-world scenarios, however, SML suffers from the absence or scarcity of large labeled data sets for training. Active machine learning (AML) for EM minimizes the amount of training data required and reduces the amount of hand labeling by picking only the helpful pairs. All AML approaches use just one of the two criteria, informativeness or representativeness, for query selection, which limits their efficacy. In this work, we propose Combined Score Sampling (CSS), a query strategy that combines the informativeness and representativeness selection criteria. We evaluate CSS on the benchmark e-commerce data set pair Abt-Buy, using AML with ensemble learning as the SML model, and we demonstrate that it effectively addresses the issue. Comparing our strategy to SML, we demonstrate that it leads to an overall enhanced F1 score and improved stability of the learnt models.
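The abstract does not give the CSS formula, so the following is only an illustrative sketch of how an informativeness criterion and a representativeness criterion might be combined into a single query score. Everything in it is an assumption rather than the authors' method: informativeness is taken as the entropy of a predicted match probability, representativeness as the mean cosine similarity to the other unlabeled pairs, and the two are mixed with a hypothetical weight `alpha`. The function names are likewise hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a match probability p; largest at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_scores(features, match_probs, alpha=0.5):
    """Score each unlabeled pair; alpha is an assumed mixing weight.

    features:    similarity feature vector of each candidate record pair
    match_probs: the current model's predicted match probability per pair
    """
    n = len(features)
    scores = []
    for i in range(n):
        # Informativeness (assumed): uncertainty of the current prediction.
        info = binary_entropy(match_probs[i])
        # Representativeness (assumed): average similarity to the other pairs.
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        rep = sum(sims) / len(sims) if sims else 0.0
        scores.append(alpha * info + (1 - alpha) * rep)
    return scores


def select_queries(features, match_probs, k=1, alpha=0.5):
    """Return the indices of the k highest-scoring pairs to send for labeling."""
    scores = combined_scores(features, match_probs, alpha)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

With `alpha=1.0` this sketch reduces to plain uncertainty sampling, and with `alpha=0.0` it selects purely by density, which are the two single-criterion extremes the abstract contrasts with a combined strategy.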

Author information

Corresponding author

Correspondence to Mourad Jabrane.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jabrane, M., Hafidi, I., Rochd, Y. (2023). An Improved Active Machine Learning Query Strategy for Entity Matching Problem. In: Aboutabit, N., Lazaar, M., Hafidi, I. (eds) Advances in Machine Intelligence and Computer Science Applications. ICMICSA 2022. Lecture Notes in Networks and Systems, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-031-29313-9_28
