research-article

Scaling big data mining infrastructure: the Twitter experience

Authors:

Dmitriy Ryaboy

ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2

Pages 6 - 19

https://doi.org/10.1145/2481244.2481247

Published: 30 April 2013

Abstract

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by much preparatory work that precedes the application of data mining algorithms, followed by substantial effort to turn preliminary models into robust solutions. In this context, we discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they are insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as "plumbing". This paper has two goals. For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.

Index Terms

Information systems

Published In

ACM SIGKDD Explorations Newsletter Volume 14, Issue 2

December 2012

81 pages

ISSN:1931-0145

EISSN:1931-0153

DOI:10.1145/2481244

Copyright © 2013 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
