DOI: 10.1145/3210284.3210294

Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread

Published: 25 June 2018

Abstract

Distributed systems have become the preferred solution for Big Data analysis tasks. They achieve superior performance by managing a large pool of resources as a single entity. In many contexts, however, performance is not the only metric to consider: when two solutions deliver equivalent performance, their cost becomes an important factor, and distributed systems are usually more expensive to deploy than traditional single-threaded applications.
In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunications industry. The first solution is built on a popular distributed processing engine (Apache Spark), while the second is a single-threaded application built on a home-brew stream processing framework (Natron). We show that, for continuous analysis, the benefits of distributed processing are outweighed by the costs of distributed data ingestion. The same holds for periodic analysis. However, if data ingestion costs are fixed and small, the most cost-effective solution depends on the dataset size.

Cited By

  • (2019) A Survey of Distributed Data Stream Processing Frameworks. IEEE Access 7, 154300-154316. DOI: 10.1109/ACCESS.2019.2946884. Online publication date: 2019

Published In

DEBS '18: Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems
June 2018
289 pages
ISBN:9781450357821
DOI:10.1145/3210284

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cost-Aware Comparison
  2. Distributed Systems
  3. Stream Analytics

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DEBS '18

Acceptance Rates

DEBS '18 Paper Acceptance Rate: 12 of 31 submissions, 39%
Overall Acceptance Rate: 145 of 583 submissions, 25%