DOI: 10.1145/2872427.2883026
research-article

IncApprox: A Data Analytics System for Incremental Approximate Computing

Published: 11 April 2016

    Abstract

    Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms compute over a subset of data items instead of the entire dataset, but they differ in how they skip parts of the computation. Incremental computing memoizes the intermediate results of sub-computations and reuses these memoized results across jobs. Approximate computing draws a representative sample of the entire dataset and computes over only the sampled data items.
    In this paper, we observe that these two paradigms are complementary and can be combined. Our idea is simple: design a sampling algorithm that biases the sample selection toward the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox, based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
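
The core idea of biasing a stratified sample toward memoized items can be illustrated with a minimal sketch. This is not the IncApprox implementation: the function names (`biased_stratified_sample`, `approximate_sum`), the dictionary-based window and memo table, and the fixed per-stratum sample sizes are all assumptions made for illustration. The sketch also omits the error estimation and the self-adjusting computation machinery that the paper relies on.

```python
import random


def biased_stratified_sample(window, memo, sample_sizes, rng):
    """Pick a per-stratum sample, preferring items with memoized results.

    window:       dict mapping stratum -> list of item ids in the current window
    memo:         dict mapping item id -> result cached by a previous run
    sample_sizes: dict mapping stratum -> desired sample size (hypothetical input)
    rng:          a random.Random instance
    """
    sample = {}
    for stratum, items in window.items():
        k = min(sample_sizes.get(stratum, 0), len(items))
        reusable = [x for x in items if x in memo]    # results already cached
        fresh = [x for x in items if x not in memo]   # would need recomputation
        n_reused = min(k, len(reusable))
        # Fill the sample from memoized items first, then top up with fresh ones.
        sample[stratum] = (rng.sample(reusable, n_reused)
                           + rng.sample(fresh, k - n_reused))
    return sample


def approximate_sum(window, sample, memo, compute):
    """Estimate the sum of compute(item) over the window from the sample.

    Each stratum's sampled total is scaled by (stratum size / sample size).
    Returns the estimate and the number of items actually recomputed.
    """
    total = 0.0
    recomputed = 0
    for stratum, chosen in sample.items():
        if not chosen:
            continue
        partial = 0.0
        for item in chosen:
            if item not in memo:
                memo[item] = compute(item)  # memoize for later runs
                recomputed += 1
            partial += memo[item]
        total += len(window[stratum]) / len(chosen) * partial
    return total, recomputed
```

In this sketch, the more the current window overlaps with previously processed items, the fewer calls to `compute` are needed. Whether the biased sample remains statistically representative depends on how the earlier samples were drawn, which this sketch does not address.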

    Published In

    WWW '16: Proceedings of the 25th International Conference on World Wide Web
    April 2016
    1482 pages
    ISBN:9781450341431

    Sponsors

    • IW3C2: International World Wide Web Conference Committee

    Publisher

    International World Wide Web Conferences Steering Committee

    Republic and Canton of Geneva, Switzerland

    Author Tags

    1. approximate computation
    2. dependence graph
    3. error estimation
    4. incremental computation
    5. memoization
    6. real-time processing
    7. self-adjusting computation
    8. stratified sampling
    9. stream processing

    Qualifiers

    • Research-article

    Funding Sources

    • FCT
    • CFAED
    • Amazon Web Services

    Conference

    WWW '16
    Sponsor:
    • IW3C2
    WWW '16: 25th International World Wide Web Conference
    April 11 - 15, 2016
    Montréal, Québec, Canada

    Acceptance Rates

    WWW '16 Paper Acceptance Rate 115 of 727 submissions, 16%;
    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
