
Analysis and Comparison of Block-Splitting-Based Load Balancing Strategies for Parallel Entity Resolution

Published: 27 January 2021

Abstract

    Entity resolution (ER) is the process of identifying records that refer to the same real-world entity. In recent years, in response to ever-increasing data volumes, both blocking techniques and parallel computation have been proposed to reduce the running time of ER. The MapReduce programming model is a popular and convenient choice for the parallel computation. With its default load balancing strategy, however, skewed block sizes lead to an imbalanced reducer load that significantly increases the runtime. One possible solution is block splitting: breaking overpopulated blocks into smaller sub-blocks. In this paper, we analyze the advantages and disadvantages of the state-of-the-art block-splitting methods BlockSplit and BlockSlicer, and we propose two approaches, TLS and BOS, to overcome the identified drawbacks. We comprehensively evaluate and compare the proposed solutions, implemented in Spark, on real-world and synthetic datasets with different properties. The results show that all of them can balance the reducer load with the help of the greedy partition assignment strategy. When the cluster's memory is not abundant relative to the dataset, a high number of reducers is required to reduce garbage collection (GC) time and thereby improve efficiency. In particular, TLS and BOS have overwhelmingly lower overhead thanks to their ability to assign composite keys block-wise.
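    The general idea of block splitting combined with greedy assignment can be illustrated with a small sketch. This is not the paper's TLS or BOS, nor a faithful reimplementation of BlockSplit or BlockSlicer; it is a generic illustration under simple assumptions. The function names (`pair_count`, `split_block`, `greedy_assign`), the `max_size` threshold, and the plain-Python setting (no Spark) are all hypothetical choices made for this example. An oversized block is cut into sub-blocks, each sub-block and each pair of sub-blocks becomes one match task, and tasks are handed to the currently least-loaded reducer in descending order of cost.

```python
import heapq
from itertools import combinations


def pair_count(n):
    """Number of pairwise comparisons inside a block of n records."""
    return n * (n - 1) // 2


def split_block(block_id, size, max_size):
    """Split a block into match tasks of the form (task_id, comparisons).

    A block with at most max_size records stays a single task. A larger
    block is cut into sub-blocks of at most max_size records; one task is
    emitted per sub-block (comparisons inside it) and one per pair of
    sub-blocks (cross comparisons), so every record pair of the original
    block is still covered exactly once.
    """
    if size <= max_size:
        return [((block_id,), pair_count(size))]
    sizes = [min(max_size, size - start) for start in range(0, size, max_size)]
    tasks = [((block_id, i), pair_count(s)) for i, s in enumerate(sizes)]
    tasks += [((block_id, i, j), sizes[i] * sizes[j])
              for i, j in combinations(range(len(sizes)), 2)]
    return tasks


def greedy_assign(tasks, num_reducers):
    """Assign tasks, largest first, always to the least-loaded reducer."""
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer)
    assignment = {r: [] for r in range(num_reducers)}
    for task_id, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        load, r = heapq.heappop(heap)
        assignment[r].append(task_id)
        heapq.heappush(heap, (load + cost, r))
    loads = {r: load for load, r in heap}
    return assignment, loads


# Hypothetical skewed input: one large block and two small ones.
block_sizes = {"a": 100, "b": 10, "c": 10}
tasks = [t for b, n in block_sizes.items() for t in split_block(b, n, 25)]
assignment, loads = greedy_assign(tasks, num_reducers=4)
```

    Without splitting, the reducer that receives block "a" would perform all of that block's comparisons by itself; after splitting, the same comparisons are spread over several tasks that the greedy step can distribute across reducers.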



      Published In

      iiWAS '20: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services
      November 2020
      492 pages

      In-Cooperation

      • Johannes Kepler University, Linz, Austria

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Data skew handling
      2. Load balancing strategy
      3. Parallel data matching
      4. Parallel entity resolution
      5. Parallel record linkage

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      iiWAS '20
