DOI: 10.1145/2602622.2602631

Scaling data mining in massively parallel dataflow systems

Published: 18 June 2014
Abstract

    The demand for mining large datasets using shared-nothing clusters is steadily on the rise. Despite the availability of parallel processing paradigms such as MapReduce, scalable data mining is still a tough problem. Naïve ports of existing algorithms to platforms like Hadoop exhibit various scalability bottlenecks, which prevent their application to large real-world datasets. These bottlenecks arise from various pitfalls that have to be overcome, including the scalability of the mathematical operations of the algorithm, the performance of the system when executing iterative computations, as well as its ability to efficiently execute meta learning techniques such as cross-validation and ensemble learning.
    In this paper, we present our work on overcoming these pitfalls. In particular, we show how to scale the mathematical operations of two popular recommendation mining algorithms, discuss an optimistic recovery mechanism that improves the performance of distributed iterative data processing, and outline future work on efficient sample generation for scalable meta learning. Early results of our work have been contributed to open source libraries, such as Apache Mahout and Stratosphere, and are already deployed in industry use cases.
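
    To make the first of these pitfalls concrete, the sketch below illustrates the kind of mathematical operation a recommendation miner must scale, taking item-based collaborative filtering as a common example: counting item co-occurrences over user interaction histories, which corresponds to the off-diagonal entries of A^T A for the binary user-item matrix A. This is a minimal single-machine illustration in Scala, not the Mahout or Stratosphere implementation referred to above; the `interactions` sample data and the pair-counting scheme are assumptions chosen for exposition. In a dataflow system, the same computation becomes a group-by-user, a flatMap emitting item pairs, and a reduce summing their counts.

    ```scala
    // Illustrative sketch only: a single-machine stand-in for the
    // distributed co-occurrence computation behind item-based recommendation.
    object CooccurrenceSketch {
      def main(args: Array[String]): Unit = {
        // Hypothetical (user, item) interaction pairs, stand-ins for a large dataset.
        val interactions = Seq(
          ("alice", "A"), ("alice", "B"),
          ("bob",   "A"), ("bob",   "C"),
          ("carol", "A"), ("carol", "B"), ("carol", "C")
        )

        // Group interactions by user to obtain each user's item history.
        val histories: Map[String, Set[String]] =
          interactions.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2).toSet }

        // Emit every unordered item pair per user history and sum the counts.
        // For the binary user-item matrix A this yields the off-diagonal
        // entries of A^T A, i.e. how many users interacted with both items.
        val cooccurrences: Map[(String, String), Int] =
          histories.values
            .flatMap(items => for { i <- items; j <- items if i.compareTo(j) < 0 } yield (i, j))
            .groupBy(identity)
            .map { case (pair, occurrences) => pair -> occurrences.size }

        cooccurrences.toSeq.sortBy { case (_, count) => -count }.foreach { case ((i, j), count) =>
          println(s"items $i and $j co-occur in $count user histories")
        }
      }
    }
    ```

    The pair-emission step is where the scalability bottleneck mentioned above typically surfaces: a user with n interactions emits on the order of n^2 pairs, so long-tailed interaction distributions inflate the intermediate data unless the per-user work is bounded.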


      Published In

      SIGMOD'14 PhD Symposium: Proceedings of the 2014 SIGMOD PhD Symposium
      June 2014, 58 pages
      ISBN: 9781450329248
      DOI: 10.1145/2602622


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. distributed processing
      2. scalable data mining

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS'14

      Acceptance Rates

      SIGMOD'14 PhD Symposium paper acceptance rate: 10 of 13 submissions (77%)
      Overall acceptance rate: 40 of 60 submissions (67%)
