research-article

FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Authors:

Xujun Zhao

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 1

Pages 101 - 114

https://doi.org/10.1109/TPDS.2016.2560176

Published: 01 January 2017

Abstract

Traditional parallel algorithms for mining frequent itemsets aim to balance load by partitioning data equally among a group of computing nodes. We start this study by identifying a serious performance problem in existing parallel Frequent Itemset Mining algorithms: given a large dataset, their data partitioning strategies suffer from high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is a Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. By incorporating a similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement FiDoop-DP on a 24-node Hadoop cluster, driven by a wide range of datasets created by the IBM Quest Market-Basket Synthetic Data Generator. Experimental results reveal that FiDoop-DP reduces network and computing loads by eliminating redundant transactions on Hadoop nodes. FiDoop-DP improves the performance of the existing parallel frequent-pattern scheme by up to 31 percent, with an average of 18 percent.
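
The partitioning idea in the abstract can be illustrated with a small sketch. The abstract does not state which similarity metric, hashing scheme, or pivot-selection method FiDoop-DP uses, so everything here is an assumption made for illustration: Jaccard similarity as the metric, MinHash as the Locality-Sensitive Hashing family, hand-picked pivots, and hypothetical function names (`jaccard`, `minhash_signature`, `estimated_similarity`, `assign_to_pivots`). It is not the authors' implementation and involves no Hadoop code.

```python
import hashlib
from typing import FrozenSet, List, Sequence, Tuple

Transaction = FrozenSet[str]


def jaccard(a: Transaction, b: Transaction) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(item: str, seed: int) -> int:
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(t: Transaction, num_hashes: int = 32) -> Tuple[int, ...]:
    """MinHash signature of a non-empty transaction (assumed LSH family).

    For each seed, keep the minimum hash value over the items.
    """
    return tuple(min(_hash(item, seed) for item in t) for seed in range(num_hashes))


def estimated_similarity(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Fraction of matching signature positions; approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def assign_to_pivots(
    transactions: Sequence[Transaction], pivots: Sequence[Transaction]
) -> List[List[Transaction]]:
    """Voronoi-style assignment: each transaction goes to its most similar pivot.

    Exact Jaccard similarity is used here; ties go to the lowest pivot index.
    """
    partitions: List[List[Transaction]] = [[] for _ in pivots]
    for t in transactions:
        best = max(range(len(pivots)), key=lambda i: jaccard(t, pivots[i]))
        partitions[best].append(t)
    return partitions


if __name__ == "__main__":
    txs = [frozenset("abc"), frozenset("ab"), frozenset("xy"), frozenset("xyz")]
    pivots = [frozenset("abc"), frozenset("xyz")]
    for idx, part in enumerate(assign_to_pivots(txs, pivots)):
        print(idx, [sorted(t) for t in part])
```

In this sketch each pivot defines one cell of the partition, so transactions that share many items land on the same node. On large inputs, comparing MinHash signatures with `estimated_similarity` could stand in for the exact Jaccard computation at the cost of some accuracy.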

Index Terms

FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
1. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 1

January 2017

304 pages

ISSN: 1045-9219

Copyright © 2017.

Publisher

IEEE Press

Cited By

Bobed Lisbona C, Bernad J, Maillot P (2024). Language-Model Based Informed Partition of Databases to Speed Up Pattern Mining. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654987. Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654987
Starlin Jini S, Chenthalir Indra D (2022). Understanding the Impact of Data Parallelism on Neural Network Classification. Optical Memory and Neural Networks 31:1 (107-121). DOI: 10.3103/S1060992X22010106. Online publication date: 1-Mar-2022
https://dl.acm.org/doi/10.3103/S1060992X22010106
Fu W, Wang L (2022). Load Balancing Algorithms for Hadoop Cluster in Unbalanced Environment. Computational Intelligence and Neuroscience 2022. DOI: 10.1155/2022/1545024. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1155/2022/1545024
Djenouri Y, Lin J, Nørvåg K, Ramampiaro H, Yu P (2021). Exploring Decomposition for Solving Pattern Mining Problems. ACM Transactions on Management Information Systems 12:2 (1-36). DOI: 10.1145/3439771. Online publication date: 11-Feb-2021
https://dl.acm.org/doi/10.1145/3439771
Belhadi A, Djenouri Y, Lin J, Cano A (2020). A general-purpose distributed pattern mining system. Applied Intelligence 50:9 (2647-2662). DOI: 10.1007/s10489-020-01664-w. Online publication date: 18-Mar-2020
https://dl.acm.org/doi/10.1007/s10489-020-01664-w
Huang C, Leu Y (2019). Multi-level dataset decomposition for parallel frequent itemset mining on a cluster of personal computers. Cluster Computing 22:2 (2851-2863). DOI: 10.1007/s10586-017-1609-6. Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s10586-017-1609-6
Chon K, Kim M (2018). BIGMiner. Cluster Computing 21:3 (1507-1520). DOI: 10.5555/3287988.3288002. Online publication date: 1-Sep-2018
https://dl.acm.org/doi/10.5555/3287988.3288002
Shang F, Chen X, Yan C (2017). A strategy for scheduling reduce task based on intermediate data locality of the MapReduce. Cluster Computing 20:4 (2821-2831). DOI: 10.1007/s10586-017-0972-7. Online publication date: 1-Dec-2017
https://dl.acm.org/doi/10.1007/s10586-017-0972-7
