DOI: 10.1145/2670979.2670992
tutorial

The Case for Drill-Ready Cloud Computing

Published: 03 November 2014
    Abstract

    As cloud computing has matured, more and more local applications are being replaced by easy-to-use, on-demand services accessible via computer networks (a.k.a. cloud services). Running behind these services are massive hardware infrastructures and complex management tasks (e.g., recovery, software upgrades) that, if not tested thoroughly, can exhibit failures that lead to major service disruptions. Some researchers estimate that 568 hours of downtime at 13 well-known cloud services since 2007 had an economic impact of more than $70 million [18]. Others predict worse: for every hour it is not up and running, a cloud service can take a hit of between $1 million and $5 million [32]. Moreover, an outage of a popular service can shut down other dependent services [11, 37, 59], leading to many more frustrated and furious users.

    References

    [1]
    http://cloutage.org.
    [2]
    Amazon Web Services. http://aws.amazon.com.
    [3]
    Apache HBase Operational Management. http://hbase.apache.org/book/ops_mgt.html.
    [4]
    Cassandra Operations. http://wiki.apache.org/cassandra/Operations.
    [5]
    DevOps GameDay. https://github.com/cloudworkshop/devopsgameday/wiki.
    [6]
    Open Sourced Vulnerability Database. http://www.osvdb.org.
    [7]
    Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C. Sears. BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud. In EuroSys '10.
    [8]
    Mona Attariyan, Michael Chow, and Jason Flinn. X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. In OSDI '12.
    [9]
    Cory Bennett and Ariel Tseitlin. Chaos Monkey Released Into The Wild. http://techblog.netflix.com, 2012.
    [10]
    Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski, and Larry Peterson. Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems. In OSDI '08.
    [11]
    Henry Blodget. Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers' Data. http://www.businessinsider.com, 2011.
    [12]
    Andrew Bosworth. Building and testing at Facebook. http://www.facebook.com/Engineering, 2012.
    [13]
    Marco Canini, Vojin Jovanović, Daniele Venzano, Boris Spasojević, Olivier Crameri, and Dejan Kostić. Toward Online Testing of Federated and Heterogeneous Distributed Systems. In USENIX ATC '11.
    [14]
    Boston Computing. Data Loss Statistics. http://www.bostoncomputing.net.
    [15]
    Olivier Crameri, Nikola Knezevic, Dejan Kostic, Ricardo Bianchini, and Willy Zwaenepoel. Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System. In SOSP '07.
    [16]
    Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In SoCC '13.
    [17]
    U. Erlingsson, M. Peinado, S. Peter, and M. Budiu. Fay: Extensible Distributed Tracing from Kernels to Clusters. In SOSP '11.
    [18]
    Loek Essers. Cloud Failures Cost More Than $70 Million Since 2007, Researchers Estimate. http://www.pcworld.com, 2012.
    [19]
    Daniel B. Giffin, Amit Levy, Deian Stefan, David Terei, David Mazieres, John C. Mitchell, and Alejandro Russo. Hails: Protecting Data Privacy in Untrusted Web Applications. In OSDI '12.
    [20]
    Haryadi S. Gunawi, Thanh Do, Joseph M. Hellerstein, Ion Stoica, Dhruba Borthakur, and Jesse Robbins. Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills. UC Berkeley Technical Report UCB/EECS-2011-87.
    [21]
    Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. Fate and Destini: A Framework for Cloud Recovery Testing. In NSDI '11.
    [22]
    Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang Satria. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In SoCC '14.
    [23]
    Zhenyu Guo, Sean McDirmid, Mao Yang, Li Zhuang, Pu Zhang, Yingwei Luo, Tom Bergan, Madan Musuvathi, Zheng Zhang, and Lidong Zhou. Failure Recovery: When the Cure Is Worse Than the Disease. In HotOS XIV, 2013.
    [24]
    Weihang Jiang, Chongfeng Hu, Shankar Pasupathy, Arkady Kanevsky, Zhenmin Li, and Yuanyuan Zhou. Understanding Customer Problem Troubleshooting from Storage System Logs. In FAST '09.
    [25]
    Baris Kasikci, Cristian Zamfir, and George Candea. RaceMob: Crowdsourced Data Race Detection. In SOSP '13.
    [26]
    Emre Kiciman and Benjamin Livshits. AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications. In SOSP '07.
    [27]
    Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. Efficient Patch-based Auditing for Web Application Vulnerabilities. In OSDI '12.
    [28]
    Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. Intrusion Recovery Using Selective Re-execution. In OSDI '10.
    [29]
    Oren Laadan, Nicolas Viennot, Chia-che Tsai, Chris Blinn, Junfeng Yang, and Jason Nieh. Pervasive Detection of Process Races in Deployed Systems. In SOSP '11.
    [30]
    H. Andres Lagar-Cavilla, Joseph A. Whitney, Adin Scannell, Stephen M. Rumble, Philip Patchin, Eyal de Lara, Michael Brudno, and M. Satyanarayanan. SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing. In EuroSys '09.
    [31]
    Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In OSDI '14.
    [32]
    David Linthicum. Calculating the true cost of cloud outages. http://www.infoworld.com, 2013.
    [33]
    Lionel Litty, H. Andres Lagar-Cavilla, and David Lie. Computer Meteorology: Monitoring Compute Clouds. In HotOS XII, 2009.
    [34]
    Changbin Liu, Boon Thau Loo, and Yun Mao. Declarative Automated Cloud Resource Orchestration. In SoCC '11.
    [35]
    Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. D3S: Debugging Deployed Distributed Systems. In NSDI '08.
    [36]
    Marissa Mayer. An Update on Yahoo Mail, December 2013.
    [37]
    Rich Miller. Amazon Cloud Outage KOs Reddit, Foursquare and Others. http://www.datacenterknowledge.com, 2012.
    [38]
    Michael J. Mior and Eyal de Lara. FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Cloning. In SYSTOR '11.
    [39]
    Iulian Neamtiu and Tudor Dumitras. Cloud Software Upgrades: Challenges and Opportunities. In MESOCA '11.
    [40]
    Netflix. 5 Lessons We've Learned Using AWS. http://techblog.netflix.com, December 2010.
    [41]
    Pertino. April 1st Service Disruption Postmortem, April 2013.
    [42]
    Ken Presti. 6 Devastating Cloud Outages Over The Last 6 Months. http://www.crn.com, 2013.
    [43]
    Ariel Rabkin and Randy Katz. Precomputing Possible Configuration Error Diagnoses. In ASE '11.
    [44]
    Patrick Reynolds, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Charles Killian, and Amin Vahdat. Pip: Detecting the unexpected in distributed systems. In NSDI '06.
    [45]
    Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli. Resilience Engineering: Learning to Embrace Failure. ACM Queue, 10(9), September 2012.
    [46]
    Chuck Rossi. Ship early and ship twice as often. https://www.facebook.com/Engineering, 2012.
    [47]
    Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. Diagnosing Performance Changes by Comparing Request Flows. In NSDI '11.
    [48]
    Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems. In SoCC '11.
    [49]
    Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado, Nick McKeown, and Guru Parulkar. Can the Production Network Be the Testbed? In OSDI '10.
    [50]
    Atul Singh, Petros Maniatis, Timothy Roscoe, and Peter Druschel. Using Queries for Distributed Monitoring and Forensics. In EuroSys '06.
    [51]
    AWS Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. http://aws.amazon.com/message/65648, 2011.
    [52]
    AWS Team. Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region. http://aws.amazon.com/message/680587, 2012.
    [53]
    Gmail Team. More on today's Gmail issue. http://gmailblog.blogspot.com, September 2009.
    [54]
    Google AppEngine Team. Post-mortem for February 24th, 2010 outage. https://groups.google.com/group/google-appengine, February 2010.
    [55]
    Google Apps Team. GoogleApps IncidentReport, March 2013.
    [56]
    Skype Team. CIO update: Post-mortem on the Skype outage (December 2010). http://blogs.skype.com, December 2010.
    [57]
    The Joyent Team. Postmortem for outage of us-east-1, May 2014.
    [58]
    The Verge. Microsoft apologizes for Outlook, ActiveSync downtime, says error overloaded servers, August 2013.
    [59]
    Christina Warren. How Facebook killed the Internet. http://www.cnn.com, 2013.
    [60]
    Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak. CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. In NSDI '09.
    [61]
    Wenchao Zhou, Qiong Fei, Arjun Narayan, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. Secure Network Provenance. In SOSP '11.
    [62]
    Wenchao Zhou, Micah Sherr, Tao Tao, Xiaozhou Li, Boon Thau Loo, and Yun Mao. Efficient Querying and Maintenance of Network Provenance at Internet-Scale. In SIGMOD '10.


    Published In

    SOCC '14: Proceedings of the ACM Symposium on Cloud Computing
    November 2014
    383 pages
    ISBN:9781450332521
    DOI:10.1145/2670979
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited

    Conference

    SOCC '14: ACM Symposium on Cloud Computing
    November 3-5, 2014
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Cited By

    • (2022) Maximizing Error Injection Realism for Chaos Engineering With System Calls. IEEE Transactions on Dependable and Secure Computing 19:4 (2695-2708). DOI: 10.1109/TDSC.2021.3069715. Online publication date: 1-Jul-2022
    • (2021) A Chaos Engineering System for Live Analysis and Falsification of Exception-Handling in the JVM. IEEE Transactions on Software Engineering 47:11 (2534-2548). DOI: 10.1109/TSE.2019.2954871. Online publication date: 1-Nov-2021
    • (2019) Scalecheck. Proceedings of the 17th USENIX Conference on File and Storage Technologies (359-373). DOI: 10.5555/3323298.3323332. Online publication date: 25-Feb-2019
    • (2018) Maelstrom. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (373-389). DOI: 10.5555/3291168.3291196. Online publication date: 8-Oct-2018
    • (2017) Scalability Bugs. Proceedings of the 16th Workshop on Hot Topics in Operating Systems (24-29). DOI: 10.1145/3102980.3102985. Online publication date: 7-May-2017
    • (2016) Why Does the Cloud Stop Computing? Proceedings of the Seventh ACM Symposium on Cloud Computing (1-16). DOI: 10.1145/2987550.2987583. Online publication date: 5-Oct-2016
    • (2015) SAMC: a fast model checker for finding heisenbugs in distributed systems (demo). Proceedings of the 2015 International Symposium on Software Testing and Analysis (423-427). DOI: 10.1145/2771783.2784771. Online publication date: 13-Jul-2015
    • (2014) What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Proceedings of the ACM Symposium on Cloud Computing (1-14). DOI: 10.1145/2670979.2670986. Online publication date: 3-Nov-2014
