IncApprox: A Data Analytics System for Incremental Approximate Computing

Authors: Dhanya R. Krishnan, Pramod Bhatotia, Christof Fetzer, Rodrigo Rodrigues

WWW '16: Proceedings of the 25th International Conference on World Wide Web

Pages 1133 - 1144

https://doi.org/10.1145/2872427.2883026

Published: 11 April 2016

Abstract

Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items.
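
The two paradigms described above can be contrasted in a minimal Python sketch. It is an illustration only, not code from the paper: the function names, the use of squaring as a stand-in for an expensive sub-computation, and the per-item dictionary used as a memo table are all assumptions made for this example.

```python
import random

def incremental_sum_of_squares(items, memo):
    """Sum x*x over items, reusing memoized per-item results from earlier runs.

    `memo` is a hypothetical memo table (item -> result) that persists across runs.
    Returns the exact total and the number of sub-computations actually executed.
    """
    total = 0
    recomputed = 0
    for x in items:
        if x not in memo:      # sub-computation not seen in a previous run
            memo[x] = x * x    # stand-in for an expensive sub-computation
            recomputed += 1
        total += memo[x]
    return total, recomputed

def approximate_sum_of_squares(items, sample_size, rng):
    """Estimate the same sum from a uniform random sample, scaled to the full input."""
    sample = rng.sample(items, sample_size)
    return len(items) / sample_size * sum(x * x for x in sample)

if __name__ == "__main__":
    memo = {}
    run1 = list(range(100))
    run2 = list(range(10, 110))   # the input changes slightly between runs
    print(incremental_sum_of_squares(run1, memo)[1])   # every item is computed
    print(incremental_sum_of_squares(run2, memo)[1])   # only the new items are computed
    print(approximate_sum_of_squares(run2, 20, random.Random(0)))
```

In this sketch the incremental version stays exact but only skips work for items it has already seen, while the approximate version skips work for every unsampled item at the cost of an estimation error.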

In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
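
The idea of biasing sample selection toward memoized items can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's online algorithm: it works on an in-memory batch rather than a stream, it omits self-adjusting computation and the error-bound estimation entirely, and the names `biased_stratified_sample`, `estimate_sum`, `strata`, `sizes`, and `memo` are hypothetical. Items are assumed to be unique, hashable identifiers.

```python
import random

def biased_stratified_sample(strata, sizes, memo, rng):
    """Pick sizes[name] items from each stratum, taking memoized items first.

    strata: dict mapping stratum name -> list of unique items
    sizes:  dict mapping stratum name -> desired sample size (assumed given)
    memo:   dict mapping item -> previously computed result
    """
    sample = {}
    for name, items in strata.items():
        k = min(sizes[name], len(items))
        # Prefer items whose results are already memoized from earlier runs.
        reused = [x for x in items if x in memo][:k]
        # Fill the remainder uniformly at random from the non-memoized items.
        fresh_pool = [x for x in items if x not in memo]
        fresh = rng.sample(fresh_pool, k - len(reused))
        sample[name] = reused + fresh
    return sample

def estimate_sum(strata, sample, memo, f):
    """Stratified estimate of sum(f(x)): each stratum's sample sum is scaled by N_i / n_i."""
    total = 0.0
    for name, items in strata.items():
        chosen = sample[name]
        if not chosen:
            continue
        values = []
        for x in chosen:
            if x not in memo:      # compute only what was not memoized
                memo[x] = f(x)
            values.append(memo[x])
        total += len(items) / len(chosen) * sum(values)
    return total

if __name__ == "__main__":
    strata = {"a": list(range(10)), "b": list(range(100, 120))}
    memo = {0: 0, 1: 1, 2: 4, 100: 10000}   # results kept from a previous run
    picked = biased_stratified_sample(strata, {"a": 4, "b": 5}, memo, random.Random(0))
    print(picked)
    print(estimate_sum(strata, picked, memo, lambda x: x * x))
```

The sketch shows only where the savings come from: memoized items in the sample cost nothing to re-evaluate. Whether such a biased selection remains statistically representative, and how the error is bounded, is what the paper's algorithm addresses and is not captured here.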

Index Terms

IncApprox: A Data Analytics System for Incremental Approximate Computing
1. Information systems
  1. Information systems applications
    1. Computing platforms
    2. Data mining
      1. Data stream mining
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Vector / streaming algorithms

Published In

WWW '16: Proceedings of the 25th International Conference on World Wide Web

April 2016

1482 pages

ISBN:9781450341431

General Chairs:
Jacqueline Bourdeau, Tele-university (TELUQ), Montreal, QC, Canada
Jim A. Hendler, Rensselaer Polytechnic Institute, Troy, NY, USA
Roger Nkambou, Université du Québec à Montréal, Montreal, QC, Canada

Program Chairs:
Ian Horrocks, University of Oxford, UK
Ben Y. Zhao, University of California at Santa Barbara, CA, USA

Copyright © 2016 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

Sponsors

IW3C2: International World Wide Web Conference Committee

In-Cooperation

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Qualifiers

Research-article

Funding Sources

FCT
CFAED
Amazon Web Services

Conference

WWW '16

Sponsor:

IW3C2

WWW '16: 25th International World Wide Web Conference

April 11 - 15, 2016

Montréal, Québec, Canada

Acceptance Rates

WWW '16 Paper Acceptance Rate 115 of 727 submissions, 16%;

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

Total citations: 54
Total downloads: 614
Downloads (last 12 months): 21
Downloads (last 6 weeks): 0

Reflects downloads up to 09 Aug 2024
