Scaling big data mining infrastructure: the Twitter experience

Published: 30 April 2013

Abstract

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in practice is about much more than what most academics would consider data mining: life "in the trenches" is occupied by the extensive preparatory work that precedes the application of data mining algorithms and by the substantial effort that follows to turn preliminary models into robust solutions. In this context, we discuss two topics. First, schemas play an important role in helping data scientists understand petabyte-scale data stores, but they are insufficient to provide an overall "big picture" of the data available to generate insights. Second, we observe that a major challenge in building data analytics platforms stems from the heterogeneity of the various components that must be integrated into production workflows; we refer to this as "plumbing". This paper has two goals. For practitioners, we hope to share our experiences to flatten bumps in the road for those who come after us. For academic researchers, we hope to provide a broader context for data mining in production environments, pointing out opportunities for future work.

Published In

ACM SIGKDD Explorations Newsletter  Volume 14, Issue 2
December 2012
81 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/2481244

Publisher

Association for Computing Machinery

New York, NY, United States
