DOI: 10.1145/2602622.2602631

Scaling data mining in massively parallel dataflow systems

Published: 18 June 2014
Abstract

    The demand for mining large datasets using shared-nothing clusters is steadily on the rise. Despite the availability of parallel processing paradigms such as MapReduce, scalable data mining is still a tough problem. Naïve ports of existing algorithms to platforms like Hadoop exhibit various scalability bottlenecks, which prevent their application to large real-world datasets. These bottlenecks arise from various pitfalls that have to be overcome, including the scalability of the mathematical operations of the algorithm, the performance of the system when executing iterative computations, as well as its ability to efficiently execute meta learning techniques such as cross-validation and ensemble learning.
    In this paper, we present our work on overcoming these pitfalls. In particular, we show how to scale the mathematical operations of two popular recommendation mining algorithms, discuss an optimistic recovery mechanism that improves the performance of distributed iterative data processing, and outline future work on efficient sample generation for scalable meta learning. Early results of our work have been contributed to open source libraries, such as Apache Mahout and Stratosphere, and are already deployed in industry use cases.
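
    To make the first of these pitfalls concrete, the sketch below illustrates the kind of mathematical operation a recommendation miner must scale, taking item-based collaborative filtering as a common example: counting item co-occurrences over user interaction histories, which corresponds to the off-diagonal entries of A^T A for the binary user-item matrix A. This is a minimal single-machine illustration in Scala, not the Mahout or Stratosphere implementation referred to above; the `interactions` sample data and the pair-counting scheme are assumptions chosen for exposition. In a dataflow system, the same computation becomes a group-by-user, a flatMap emitting item pairs, and a reduce summing their counts.

    ```scala
    // Illustrative sketch only: a single-machine stand-in for the
    // distributed co-occurrence computation behind item-based recommendation.
    object CooccurrenceSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical (user, item) interaction pairs, stand-ins for a large dataset.
        val interactions = Seq(
          ("alice", "A"), ("alice", "B"),
          ("bob",   "A"), ("bob",   "C"),
          ("carol", "A"), ("carol", "B"), ("carol", "C")
        )

        // Group interactions by user to obtain each user's item history.
        val histories: Map[String, Set[String]] =
          interactions.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2).toSet }

        // Emit every unordered item pair per user history and sum the counts.
        // For the binary user-item matrix A this yields the off-diagonal
        // entries of A^T A, i.e. how many users interacted with both items.
        val cooccurrences: Map[(String, String), Int] =
          histories.values
            .flatMap(items => for { i <- items; j <- items if i.compareTo(j) < 0 } yield (i, j))
            .groupBy(identity)
            .map { case (pair, occurrences) => pair -> occurrences.size }

        cooccurrences.toSeq.sortBy { case (_, count) => -count }.foreach { case ((i, j), count) =>
          println(s"items $i and $j co-occur in $count user histories")
        }
      }
    }
    ```

    The pair-emission step is where the scalability bottleneck mentioned above typically surfaces: a user with n interactions emits on the order of n^2 pairs, so long-tailed interaction distributions inflate the intermediate data unless the per-user work is bounded.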


      Published In

      SIGMOD'14 PhD Symposium: Proceedings of the 2014 SIGMOD PhD Symposium
      June 2014, 58 pages
      ISBN: 9781450329248
      DOI: 10.1145/2602622


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. distributed processing
      2. scalable data mining

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'14

      Acceptance Rates

      SIGMOD'14 PhD Symposium paper acceptance rate: 10 of 13 submissions (77%)
      Overall acceptance rate: 40 of 60 submissions (67%)
