Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/2396761.2396776acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Published: 29 October 2012 Publication History

Abstract

Frequent Itemsets and Association Rules Mining (FIM) is a key task in knowledge discovery from data. As the dataset grows, the cost of solving this task is dominated by the component that depends on the number of transactions in the dataset. We address this issue by proposing PARMA, a parallel algorithm for the MapReduce framework, which scales well with the size of the dataset (as number of transactions) while minimizing data replication and communication cost. PARMA cuts down the dataset-size-dependent part of the cost by using a random sampling approach to FIM. Each machine mines a small random sample of the dataset, of size independent from the dataset size. The results from each machine are then filtered and aggregated to produce a single output collection. The output will be a very close approximation of the collection of Frequent Itemsets (FI's) or Association Rules (AR's) with their frequencies and confidence levels. The quality of the output is probabilistically guaranteed by our analysis to be within the user-specified accuracy and error probability parameters. The sizes of the random samples are independent from the size of the dataset, as is the number of samples. They depend on the user-chosen accuracy and error probability parameters and on the parallel computational model. We implemented PARMA in Hadoop MapReduce and show experimentally that it runs faster than previously introduced FIM algorithms for the same platform, while 1) scaling almost linearly, and 2) offering even higher accuracy and confidence than what is guaranteed by the analysis.

References

[1]
R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207--216, June 1993.
[2]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487--499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
[3]
Apache Foundation. Apache mahout. http://mahout.apache.org/.
[4]
G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '07, pages 2--12, New York, NY, USA, 2007. ACM.
[5]
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in Map-Reduce. In Proceedings of the 19th international conference on World wide web, WWW '10, pages 231--240, New York, NY, USA, 2010. ACM.
[6]
C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for machine learning on multicore. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4--7, 2006, pages 281--288. MIT Press, 2007.
[7]
F. Coenen. The LUCS-KDD FP-growth association rule mining algorithm.
[8]
S. Cong, J. Han, J. Hoeflinger, and D. Padua. A sampling-based framework for parallel data mining. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP '05, pages 255--265, New York, NY, USA, 2005. ACM.
[9]
L. Cristofor. ARTool. http://www.cs.umb.edu/ laur/ARtool/, 2006.
[10]
J.-D. Cryans, S. Ratté, and R. Champagne. Adaptation of APriori to MapReduce to build a warehouse of relations between named entities across the web. In Advances in Databases Knowledge and Data Applications (DBKDA), 2010 Second International Conference on, pages 185--189, april 2010.
[11]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008.
[12]
M. El-Hajj and O. Zaiane. Parallel leap: large-scale maximal pattern mining in a distributed environment. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, volume 1, page 8 pp., 0-0 2006.
[13]
W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, Y. Yang, B. He, Q. Luo, P. V. Sander, and K. Yang. Parallel data mining on graphics processors. Technical Report 07, The Hong Kong University of Science & Technology, 2008.
[14]
A. Ghoting, P. Kambadur, E. Pednault, and R. Kannan. NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 334--342, New York, NY, USA, 2011. ACM.
[15]
M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. CoRR, abs/1101.1902, 2011.
[16]
S. Hammoud. MapReduce Network Enabled Algorithms for Classification Based on Association Rules. PhD thesis, Brunel University, 2011.
[17]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD Rec., 29:1--12, May 2000.
[18]
H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. PFP: Parallel FP-Growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems, RecSys '08, pages 107--114, New York, NY, USA, 2008. ACM.
[19]
L. Li and M. Zhang. The strategy of mining association rule based on cloud computing. In Business Computing and Global Informatization (BCGIN), 2011 International Conference on, pages 475--478, july 2011.
[20]
Y. Li and R. Gopalan. Effective sampling for mining association rules. In G. Webb and X. Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 73--75. Springer, Berlin / Heidelberg, 2005.
[21]
J. Lin and M. Schatz. Design patterns for efficient graph algorithms in mapreduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, MLG '10, pages 78--85, New York, NY, USA, 2010. ACM.
[22]
L. Liu, E. Li, Y. Zhang, and Z. Tang. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international conference on Very large data bases, VLDB '07, pages 1275--1285. VLDB Endowment, 2007.
[23]
H. Mannila, H. Toivonen, and I. Verkamo. Efficient algorithms for discovering association rules. In KDD Workshop, pages 181--192, Menlo Park, CA, USA, 1994. The AAAI Press.
[24]
M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
[25]
E. Ozkural, B. Ucar, and C. Aykanat. Parallel frequent item set mining with selective item replication. Parallel and Distributed Systems, IEEE Transactions on, 22(10):1632--1640, oct. 2011.
[26]
S. Parthasarathy. Efficient progressive sampling for association rules. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 354--361. IEEE Computer Society, 2002.
[27]
A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. CoRR, abs/1111.2228, 2011.
[28]
A. Pietracaprina, M. Riondato, E. Upfal, and F. Vandin. Mining top-K frequent itemsets through progressive sampling. Data Mining and Knowledge Discovery, 21:310--326, 2010.
[29]
M. Riondato and E. Upfal. Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. CoRR, abs/1111.6937, November 2011.
[30]
J. Ruoming, Y. Ge, and G. Agrawal. Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance. Knowledge and Data Engineering, IEEE Transactions on, 17(1):71 -- 89, jan. 2005.
[31]
N. V. Sahinidis and M. Tawarmalani. BARON 9.0.4: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual, 2010.
[32]
H. Toivonen. Sampling large databases for association rules. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[33]
X. Y. Yang, Z. Liu, and Y. Fu. MapReduce as a programming model for association rules algorithm on Hadoop. In Information Sciences and Interaction Sciences (ICIS), 2010 3rd International Conference on, pages 99--102, june 2010.
[34]
M. Zaki. Parallel and distributed association mining: a survey. Concurrency, IEEE, 7(4):14 --25, oct-dec 1999.
[35]
M. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Proceedings of the Seventh International Workshop on Research Issues in Data Engineering, RIDE '97, pages 42--50. IEEE Computer Society, apr 1997.
[36]
L. Zhou, Z. Zhong, J. Chang, J. Li, J. Huang, and S. Feng. Balanced parallel FP-Growth with MapReduce. In Information Computing and Telecommunications (YC-ICT), 2010 IEEE Youth Conference on, pages 243--246, nov. 2010.

Cited By

View all
  • (2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
  • (2023)A Utility-Based Distributed Pattern Mining Algorithm With Reduced Shuffle OverheadIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.322121034:1(416-428)Online publication date: 1-Jan-2023
  • (2023)Association Rule Mining in Distributed Environment: A SurveyAdvanced Computational and Communication Paradigms10.1007/978-981-99-4284-8_9(113-120)Online publication date: 21-Sep-2023
  • Show More Cited By

Index Terms

  1. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
    October 2012
    2840 pages
    ISBN:9781450311564
    DOI:10.1145/2396761
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 October 2012

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. MapReduce
    2. association rules
    3. frequent itemsets
    4. sampling

    Qualifiers

    • Research-article

    Conference

    CIKM'12
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 30 Aug 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)MaNIACS: Approximate Mining of Frequent Subgraph Patterns through SamplingACM Transactions on Intelligent Systems and Technology10.1145/358725414:3(1-29)Online publication date: 13-Apr-2023
    • (2023)A Utility-Based Distributed Pattern Mining Algorithm With Reduced Shuffle OverheadIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.322121034:1(416-428)Online publication date: 1-Jan-2023
    • (2023)Association Rule Mining in Distributed Environment: A SurveyAdvanced Computational and Communication Paradigms10.1007/978-981-99-4284-8_9(113-120)Online publication date: 21-Sep-2023
    • (2022)ROOTECTOR: Robust Android Rooting Detection Framework Using Machine Learning AlgorithmsArabian Journal for Science and Engineering10.1007/s13369-022-06949-548:2(1771-1791)Online publication date: 26-Jun-2022
    • (2022)Sampling-Based Data Mining Algorithms: Modern Techniques and Case StudiesMachine Learning and Knowledge Discovery in Databases10.1007/978-3-662-44845-8_48(516-519)Online publication date: 10-Mar-2022
    • (2021)Exploring Decomposition for Solving Pattern Mining ProblemsACM Transactions on Management Information Systems10.1145/343977112:2(1-36)Online publication date: 11-Feb-2021
    • (2020)Research of Association Rules Mining based on Fuzzy Alarm ExtractionProceedings of the 2020 4th International Conference on Digital Signal Processing10.1145/3408127.3408154(259-262)Online publication date: 19-Jun-2020
    • (2020)A Optimization Algorithm for Association Rule Based on Spark Platform2020 International Conference on Computer Network, Electronic and Automation (ICCNEA)10.1109/ICCNEA50255.2020.00026(82-86)Online publication date: Sep-2020
    • (2020)Distributed synthesized association mining for big transactional dataSādhanā10.1007/s12046-020-01380-845:1Online publication date: 2-Jul-2020
    • (2020)Pattern Mining AlgorithmsBio-inspired Algorithms for Data Streaming and Visualization, Big Data Management, and Fog Computing10.1007/978-981-15-6695-0_4(67-80)Online publication date: 26-Aug-2020
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media