
FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Published: 01 January 2017

Abstract

Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms: given a large dataset, their data partitioning strategies suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
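To make the partitioning idea concrete, the following is a minimal single-machine sketch and not the FiDoop-DP implementation. It assumes Jaccard similarity as the similarity metric and MinHash as the Locality-Sensitive Hashing family; the abstract names neither. The function names (`minhash_signature`, `estimated_similarity`, `assign_partitions`), the choice of pivots, and the sample transactions are all hypothetical. Each transaction is assigned to the pivot whose signature it matches most closely, which mirrors a Voronoi-style assignment to the nearest pivot.

```python
import hashlib
from typing import FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of an item; the seed selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(txn: Transaction, num_hashes: int = 32) -> List[int]:
    """MinHash signature: the minimum item hash under each seeded hash function.

    Assumes the transaction is non-empty.
    """
    return [min(_hash(item, seed) for item in txn) for seed in range(num_hashes)]


def estimated_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def assign_partitions(
    transactions: Sequence[Transaction],
    pivots: Sequence[Transaction],
    num_hashes: int = 32,
) -> List[int]:
    """Assign each transaction to the index of its most similar pivot.

    Ties are resolved in favor of the lowest pivot index.
    """
    pivot_sigs = [minhash_signature(p, num_hashes) for p in pivots]
    assignment = []
    for txn in transactions:
        sig = minhash_signature(txn, num_hashes)
        sims = [estimated_similarity(sig, ps) for ps in pivot_sigs]
        assignment.append(max(range(len(pivots)), key=sims.__getitem__))
    return assignment


if __name__ == "__main__":
    # Hypothetical market-basket data for illustration only.
    pivots = [
        frozenset({"bread", "milk", "butter"}),
        frozenset({"beer", "chips", "salsa"}),
    ]
    transactions = [
        frozenset({"bread", "milk"}),
        frozenset({"beer", "chips"}),
    ]
    print(assign_partitions(transactions, pivots))
```

In a MapReduce setting, the returned partition index would plausibly serve as the key that routes a transaction to a reducer, so that similar transactions are mined together. How FiDoop-DP selects pivots and bounds redundancy is not described in the abstract and is not modeled here.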




Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 1
January 2017
304 pages

Publisher

IEEE Press


