DOI: 10.1145/2670979.2670986

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Published: 03 November 2014
  • Abstract

    We conduct a comprehensive study of development and deployment issues of six popular and important cloud systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume). From the bug repositories, we review in total 21,399 submitted issues within a three-year period (2011-2014). Among these issues, we perform a deep analysis of 3655 "vital" issues (i.e., real issues affecting deployments) with a set of detailed classifications. We name the product of our one-year study Cloud Bug Study database (CbsDB) [9], with which we derive numerous interesting insights unique to cloud systems. To the best of our knowledge, our work is the largest bug study for cloud systems to date.

    References

    [1]
    Apache Cassandra Project. http://cassandra.apache.org.
    [2]
    Apache Flume Project. http://flume.apache.org.
    [3]
    Apache Hadoop Project. http://hadoop.apache.org.
    [4]
    Apache HBase Project. http://hadoop.apache.org/hbase.
    [5]
    Apache ZooKeeper Project. http://zookeeper.apache.org.
    [6]
    HDFS Architecture. http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
    [7]
    The 10 Biggest Cloud Outages Of 2013. http://www.crn.com/slide-shows/cloud/240165024/the-10-biggest-cloud-outages-of-2013.htm.
    [8]
    The Apache Software Foundation. http://www.apache.org/.
    [9]
    The Cloud Bug Study (CBS) Project. http://ucare.cs.uchicago.edu/projects/cbs.
    [10]
    Mona Attariyan and Jason Flinn. Automating Configuration Troubleshooting with Dynamic Information Flow Analysis. In OSDI '10.
    [11]
    Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. An Analysis of Data Corruption in the Storage Stack. In FAST '08.
    [12]
    Cory Bennett and Ariel Tseitlin. Chaos Monkey Released Into The Wild. http://techblog.netflix.com, 2012.
    [13]
    Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI '06.
    [14]
    Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. Bigtable: A Distributed Storage System for Structured Data. In OSDI '06.
    [15]
    Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson Engler. An Empirical Study of Operating System Errors. In SOSP '01.
    [16]
    Miguel Correia, Daniel Gomez Ferro, Flavio P. Junqueira, and Marco Serafini. Practical Hardening of Crash-Tolerant Systems. In USENIX ATC '12.
    [17]
    Jeffrey Dean. Underneath the Covers at Google: Current Systems and Future Directions. In Google I/O '08.
    [18]
    Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04.
    [19]
    Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. In SOSP '07.
    [20]
    Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In SoCC '13.
    [21]
    Thanh Do, Tyler Harter, Yingchao Liu, Haryadi S. Gunawi, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. HARDFS: Hardening HDFS with Selective and Lightweight Versioning. In FAST '13.
    [22]
    Bin Fan, David G. Andersen, and Michael Kaminsky. MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing. In NSDI '13.
    [23]
    Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In SOSP '03.
    [24]
    Lokesh Gidra, Gaël Thomas, Julien Sopena, and Marc Shapiro. A study of the Scalability of Stop-the-World Garbage Collectors on Multicore. In ASPLOS '13.
    [25]
    Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. Fate and Destini: A Framework for Cloud Recovery Testing. In NSDI '11.
    [26]
    Haryadi S. Gunawi, Cindy Rubio-Gonzalez, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Ben Liblit. EIO: Error Handling is Occasionally Correct. In FAST '08.
    [27]
    Pradeep Kumar Gunda, Lenin Ravindranath, Chandramohan A. Thekkath, Yuan Yu, and Li Zhuang. Nectar: Automatic Management of Data and Computation in Datacenters. In OSDI '10.
    [28]
    Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. Practical Software Model Checking via Dynamic Interface Reduction. In SOSP '11.
    [29]
    Zhenyu Guo, Sean McDirmid, Mao Yang, Li Zhuang, Pu Zhang, Yingwei Luo, Tom Bergan, Madan Musuvathi, Zheng Zhang, and Lidong Zhou. Failure Recovery: When the Cure Is Worse Than the Disease. In HotOS XIV, 2013.
    [30]
    Herodotos Herodotou, Fei Dong, and Shivnath Babu. No One Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics. In SoCC '11.
    [31]
    Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI '11.
    [32]
    Tanakorn Leesatapornwongsa and Haryadi S. Gunawi. The Case for Drill-Ready Cloud Computing. In SoCC '14.
    [33]
    Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In OSDI '14.
    [34]
    Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos Kawazoe Aguilera, and Michael Walfish. Detecting failures in distributed systems with the FALCON spy network. In SOSP '11.
    [35]
    Kaituo Li, Pallavi Joshi, and Aarti Gupta. ReproLite: A Lightweight Tool to Quickly Reproduce Hard System Bugs. In SoCC '14.
    [36]
    Todd Lipcon. Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers, February 2011.
    [37]
    Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. In SOSP '11.
    [38]
    Lanyue Lu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Shan Lu. A Study of Linux File System Evolution. In FAST '13.
    [39]
    Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from Mistakes --- A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS '08.
    [40]
    Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martin Abadi. Naiad: A Timely Dataflow System. In SOSP '13.
    [41]
    Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon Howell, and Yutaka Suzue. Flat Datacenter Storage. In OSDI '12.
    [42]
    Ariel Rabkin and Randy Katz. Precomputing Possible Configuration Error Diagnoses. In ASE '11.
    [43]
    Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli. Resilience Engineering: Learning to Embrace Failure. ACM Queue, 10(9), September 2012.
    [44]
    Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout. Log-structured Memory for DRAM-based Storage. In FAST '14.
    [45]
    Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild: A Large-Scale Field Study. In SIGMETRICS '09.
    [46]
    David Shue, Michael J. Freedman, and Anees Shaikh. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In OSDI '12.
    [47]
    Douglas B. Terry, Vijayan Prabhakaran, Ramakrishna Kotla, Mahesh Balakrishnan, Marcos K. Aguilera, and Hussam Abu-Libdeh. Consistency-Based Service Level Agreements for Cloud Storage. In SOSP '13.
    [48]
    Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In SoCC '13.
    [49]
    Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy Katz, and Ion Stoica. Cake: Enabling High-level SLOs on Shared Storage Systems. In SoCC '12.
    [50]
    Xi Wang, Zhenyu Guo, Xuezheng Liu, Zhilei Xu, Haoxiang Lin, Xiaoge Wang, and Zheng Zhang. Hang analysis: fighting responsiveness bugs. In EuroSys '08.
    [51]
    Yin Wang, Terence Kelly, Manjunath Kudlur, Stephane Lafortune, and Scott Mahlke. Gadara: Dynamic Deadlock Avoidance for Multithreaded Programs. In OSDI '08.
    [52]
    Tianyin Xu, Jiaqi Zhang, Peng Huang, Jing Zheng, Tianwei Sheng, Ding Yuan, Yuanyuan Zhou, and Shankar Pasupathy. Do Not Blame Users for Misconfigurations. In SOSP '13.
    [53]
    Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. In NSDI '09.
    [54]
    Zuoning Yin, Xiao Ma, Jing Zheng, Yuanyuan Zhou, Lakshmi N. Bairavasundaram, and Shankar Pasupathy. An Empirical Study on Configuration Errors in Commercial and Open Source Systems. In SOSP '11.
    [55]
    Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In OSDI '14.
    [56]
    Yang Zhang, Russell Power, Siyuan Zhou, Yair Sovran, Marcos K. Aguilera, and Jinyang Li. Transaction Chains: Achieving Serializability with Low Latency in Geo-Distributed Storage Systems. In SOSP '13.

    Published In

    SOCC '14: Proceedings of the ACM Symposium on Cloud Computing
    November 2014
    383 pages
    ISBN: 9781450332521
    DOI: 10.1145/2670979
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Tutorial
    • Research
    • Refereed limited

    Conference

    SOCC '14: ACM Symposium on Cloud Computing
    November 3 - 5, 2014
    Seattle, WA, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Article Metrics

    • Downloads (Last 12 months): 108
    • Downloads (Last 6 weeks): 14
    Reflects downloads up to 10 Aug 2024

    Cited By

    • (2024) VConMC: Enabling Consistency Verification for Distributed Systems Using Implementation-Level Model Checkers and Consistency Oracles. Electronics 13:6 (1153). DOI: 10.3390/electronics13061153. Online publication date: 21-Mar-2024
    • (2024) An Empirical Study on the Challenges of eBPF Application Development. Proceedings of the ACM SIGCOMM 2024 Workshop on eBPF and Kernel Extensions (1-8). DOI: 10.1145/3672197.3673429. Online publication date: 4-Aug-2024
    • (2024) Diffy: Data-Driven Bug Finding for Configurations. Proceedings of the ACM on Programming Languages 8:PLDI (199-222). DOI: 10.1145/3656385. Online publication date: 20-Jun-2024
    • (2024) Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (381-391). DOI: 10.1145/3639477.3639753. Online publication date: 14-Apr-2024
    • (2024) Efficient Auditing of Event-driven Web Applications. Proceedings of the Nineteenth European Conference on Computer Systems (1208-1224). DOI: 10.1145/3627703.3650089. Online publication date: 22-Apr-2024
    • (2024) ECFuzz: Effective Configuration Fuzzing for Large-Scale Systems. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3623315. Online publication date: 20-May-2024
    • (2024) Understanding and Detecting Real-World Safety Issues in Rust. IEEE Transactions on Software Engineering 50:6 (1306-1324). DOI: 10.1109/TSE.2024.3380393. Online publication date: Jun-2024
    • (2024) QCFE: An Efficient Feature Engineering for Query Cost Estimation. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4302-4315). DOI: 10.1109/ICDE60146.2024.00328. Online publication date: 13-May-2024
    • (2024) Unsupervised Anomaly Detection on Distributed Log Tracing through Deep Learning. 2024 IEEE 25th International Conference of Young Professionals in Electron Devices and Materials (EDM) (1830-1833). DOI: 10.1109/EDM61683.2024.10615125. Online publication date: 28-Jun-2024
    • (2023) Development of Anomaly Detection System Based on Distributed Log Tracing. Vestnik NSU. Series: Information Technologies 21:1 (62-72). DOI: 10.25205/1818-7900-2023-21-1-62-72. Online publication date: 28-Aug-2023
