Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution

Authors:

Nishanth Entoor Venkatarathnam,

David Broneske,

Gabriel Campero Durand,

Roman Zoun, and

Gunter Saake

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services

November 2020

Pages 446 - 455

https://doi.org/10.1145/3428757.3429140

Published: 27 January 2021

Abstract

Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. In recent years, in the face of ever-increasing data volumes, both blocking techniques and parallel computation have been proposed for ER to reduce its running time and improve efficiency. The MapReduce programming model is a popular and convenient choice for parallel computation. With the default load balancing strategy, however, skewed block sizes lead to an imbalanced reducer load, which significantly increases the runtime. One possible solution is block splitting: breaking overpopulated blocks into smaller sub-blocks to improve efficiency. In this paper, we analyze the advantages and disadvantages of state-of-the-art block-splitting methods (BlockSplit and BlockSlicer), and we propose two approaches, TLS and BOS, to overcome the identified drawbacks. We comprehensively evaluate and compare our proposed solutions, with Spark implementations, using real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When the memory of the cluster is not abundant for a given dataset, a high number of reducers is required to reduce the garbage collection (GC) time and thereby improve efficiency. In particular, our TLS and BOS have overwhelmingly lower overhead owing to their block-wise composite key assignment.
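
The effect of block skew on reducer load, and of block splitting combined with greedy assignment, can be illustrated with a minimal Python sketch. It is not the BlockSplit, BlockSlicer, TLS, or BOS algorithm from the paper: the chunking scheme in `split_block`, the `max_size` threshold, the function names, and the example block sizes are all assumptions made for illustration. The sketch assumes that every pair of records inside a block is compared, and that a greedy strategy places the largest remaining task on the currently least-loaded reducer.

```python
import heapq
from typing import Dict, List, Tuple


def pair_count(n: int) -> int:
    """Number of pairwise comparisons inside a block of n records."""
    return n * (n - 1) // 2


def split_block(block_id: str, size: int, max_size: int) -> List[Tuple[str, int]]:
    """Split an oversized block into match tasks (hypothetical scheme).

    Each task is (label, comparisons). Sub-blocks are compared internally
    and against each other, so the total comparison count is preserved.
    """
    if size <= max_size:
        return [(block_id, pair_count(size))]
    # Cut the block into chunks of at most max_size records.
    chunks = [min(max_size, size - start) for start in range(0, size, max_size)]
    tasks = []
    for i, a in enumerate(chunks):
        tasks.append((f"{block_id}.{i}", pair_count(a)))  # within sub-block i
        for j in range(i + 1, len(chunks)):
            tasks.append((f"{block_id}.{i}x{j}", a * chunks[j]))  # i against j
    return tasks


def greedy_assign(tasks: List[Tuple[str, int]], reducers: int) -> List[int]:
    """Assign the largest task first, always to the least-loaded reducer."""
    heap = [(0, r) for r in range(reducers)]
    heapq.heapify(heap)
    loads = [0] * reducers
    for _, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        load, r = heapq.heappop(heap)
        loads[r] = load + cost
        heapq.heappush(heap, (loads[r], r))
    return loads


# Illustrative, made-up block sizes: one overpopulated block and three small ones.
blocks: Dict[str, int] = {"a": 100, "b": 10, "c": 10, "d": 10}

# One task per whole block: the large block dominates a single reducer.
unsplit = [(b, pair_count(n)) for b, n in blocks.items()]
skewed = greedy_assign(unsplit, reducers=4)

# Split blocks larger than 25 records before assigning tasks.
split = [t for b, n in blocks.items() for t in split_block(b, n, max_size=25)]
balanced = greedy_assign(split, reducers=4)

print("per-reducer comparisons without splitting:", skewed)
print("per-reducer comparisons with splitting:   ", balanced)
```

In this toy setting the total number of comparisons is the same in both runs; only its distribution over the reducers changes. Greedy assignment alone cannot help when a single unsplit block outweighs all the others, which is why it is paired with splitting here.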

Index Terms

Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution

Published In

iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services

November 2020

492 pages

ISBN:9781450389228

DOI:10.1145/3428757

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

In-Cooperation

Johannes Kepler University, Linz, Austria

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

iiWAS '20

iiWAS '20: The 22nd International Conference on Information Integration and Web-based Applications & Services

November 30 - December 2, 2020

Chiang Mai, Thailand
