PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Authors: Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Pages 85 - 94

https://doi.org/10.1145/2396761.2396776

Published: 29 October 2012

Abstract

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (measured as the number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, whose size is independent of the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output is a very close approximation of the collection of Frequent Itemsets (FIs) or Association Rules (ARs) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent of the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.
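
The sample, mine, filter, and aggregate flow described in the abstract can be illustrated with a small single-process sketch. This is not the authors' Hadoop implementation: the function names (`mine_sample`, `approximate_fis`), the brute-force miner, the lowered threshold `min_freq - slack`, the majority quorum filter, and the median used as the frequency estimate are all assumptions chosen for illustration, and the sample size and number of samples are plain arguments rather than values derived from accuracy and error probability parameters.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import median


def mine_sample(sample, min_freq, max_len=3):
    """Brute-force miner (illustrative only): return every itemset of up to
    max_len items whose frequency in the sample is at least min_freq."""
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for k in range(1, max_len + 1):
            counts.update(combinations(items, k))
    n = len(sample)
    return {iset: c / n for iset, c in counts.items() if c / n >= min_freq}


def approximate_fis(dataset, min_freq, num_samples, sample_size, slack, seed=0):
    """Simulate the flow in one process: each 'machine' mines one random
    sample, then the per-sample results are filtered and aggregated."""
    rng = random.Random(seed)

    # "Map" side: every sample has the same fixed size, regardless of
    # how large the dataset is. Sampling is done with replacement.
    per_sample = []
    for _ in range(num_samples):
        sample = rng.choices(dataset, k=sample_size)
        # Assumption: mine at a slightly lowered threshold to reduce misses.
        per_sample.append(mine_sample(sample, min_freq - slack))

    # "Reduce" side, filtering: count in how many samples each itemset appeared.
    seen = Counter()
    for result in per_sample:
        seen.update(result.keys())

    # Assumption: keep itemsets reported by a majority of the samples and
    # estimate their frequency as the median of the reported frequencies.
    quorum = num_samples // 2 + 1
    return {
        iset: median(r[iset] for r in per_sample if iset in r)
        for iset, hits in seen.items()
        if hits >= quorum
    }
```

For example, on a toy dataset where items `a` and `b` occur in every transaction and `c` occurs in 10% of them, calling `approximate_fis(data, min_freq=0.5, num_samples=5, sample_size=50, slack=0.05)` reports the itemsets over `a` and `b` and leaves out everything containing `c`.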

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

October 2012

2840 pages

ISBN:9781450311564

DOI:10.1145/2396761

General Chair: Xuewen Chen (Wayne State University, USA)

Program Chairs: Guy Lebanon (Georgia Institute of Technology), Haixun Wang (Microsoft Research Asia), Mohammed J. Zaki (Rensselaer Polytechnic Institute)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM'12

CIKM'12: 21st ACM International Conference on Information and Knowledge Management

October 29 - November 2, 2012

Maui, Hawaii, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
