
An auditing language for preventing correlated failures in the cloud

Published: 12 October 2017

    Abstract

    Today's cloud services rely extensively on replication to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies can cause correlated failures across an entire replication deployment, defeating the purpose of replication. Existing cloud management and diagnosis tools offer post-failure forensics, but they typically lead to prolonged failure recovery times in cloud-scale systems. In this paper, we propose a novel language framework, named RepAudit, that prevents correlated failure risks before service outages occur by allowing cloud administrators to proactively audit the replication deployments of interest. RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, in which cloud administrators write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL-code generator that automatically produces complex RAL programs from easily written specifications. Our evaluation results show that RepAudit uses 80x fewer lines of code than state-of-the-art efforts to express the auditing task of determining the top-20 critical correlated-failure root causes. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate, and efficient correlated-failure auditing for cloud-scale replication systems.
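
    The shared-dependency problem described in the abstract can be illustrated with a minimal Python sketch. This is not RAL and not RepAudit's auditing engine: the component names (`replica1`, `tor1`, `agg1`, and so on), the `DEPENDS_ON` map, and both helper functions are hypothetical, and the approach (intersecting each replica's transitive dependency set) is a simplified stand-in chosen only to show how two seemingly independent replicas can share a deep dependency.

```python
# Hypothetical dependency map: each component lists what it directly relies on.
# None of these names come from the paper; they are illustrative assumptions.
DEPENDS_ON = {
    "replica1": ["tor1", "power_a"],
    "replica2": ["tor2", "power_b"],
    "tor1": ["agg1"],
    "tor2": ["agg1"],
    "agg1": ["core1"],
}


def transitive_dependencies(node, depends_on):
    """Return every component reachable from `node` via dependency edges."""
    seen = set()
    stack = list(depends_on.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return seen


def shared_dependencies(replicas, depends_on):
    """Return components that every replica depends on, directly or indirectly."""
    closures = [transitive_dependencies(r, depends_on) for r in replicas]
    return set.intersection(*closures) if closures else set()


common = shared_dependencies(["replica1", "replica2"], DEPENDS_ON)
print(sorted(common))  # ['agg1', 'core1']
```

    In this toy topology the two replicas use different top-of-rack switches and power sources, yet both sit behind the same aggregation switch `agg1` and the `core1` switch above it, so a failure of either one would take down both replicas at once. A real audit would also have to consider combinations of components and rank them, which this sketch does not attempt.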


    Cited By

    • (2024) RADig-X: a Tool for Regressions Analysis of User Digital Experience. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), 359-370. DOI: 10.1109/SANER60148.2024.00043. Online publication date: 12-Mar-2024
    • (2022) Capelin: Data-Driven Compute Capacity Procurement for Cloud Datacenters Using Portfolios of Scenarios. IEEE Transactions on Parallel and Distributed Systems 33:1, 26-39. DOI: 10.1109/TPDS.2021.3084816. Online publication date: 1-Jan-2022
    • (2021) Dynamic Resource Provisioning for Sustainable Cloud Computing Systems in the Presence of Correlated Failures. IEEE Transactions on Sustainable Computing 6:4, 641-654. DOI: 10.1109/TSUSC.2020.3025180. Online publication date: 1-Oct-2021


    Published In

    Proceedings of the ACM on Programming Languages  Volume 1, Issue OOPSLA
    October 2017
    1786 pages
    EISSN:2475-1421
    DOI:10.1145/3152284
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 October 2017
    Published in PACMPL Volume 1, Issue OOPSLA


    Author Tags

    1. Cloud auditing
    2. Cloud reliability
    3. Correlated failures
    4. Fault tree analysis

    Qualifiers

    • Research-article


    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 61
    • Downloads (Last 6 weeks): 16
    Reflects downloads up to 09 Aug 2024


    Citations

    Cited By

    • (2020) Check before you change. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, 575-590. DOI: 10.5555/3388242.3388285. Online publication date: 25-Feb-2020
