DOI: 10.1145/3143361.3143388
research-article

Towards Reliable Application Deployment in the Cloud

Published: 28 November 2017

Abstract

    A common practice for increasing the reliability of a cloud application is to deploy redundant instances. Unfortunately, such redundancy can be undermined if the application's instances share common dependencies. This paper presents ReCloud, a novel system that efficiently finds a reliable deployment plan for cloud applications. ReCloud identifies and avoids common dependencies shared across application instances that may lead to correlated failures, and it works even with applications that have complex internal structures. ReCloud uses the available dependency information about the cloud infrastructure (e.g., hardware, software, and/or network dependencies) to quantitatively assess the reliability of an application's deployment plan with rigorous error bounds. This assessment in turn enables ReCloud to find a deployment plan that balances reliability against other criteria such as application performance and resource utilization. We implemented a fully functional system. The experimental results show that, even in a large cloud environment with more than 27K hosts, ReCloud needs only 30 seconds to find a deployment plan that is one order of magnitude more reliable than the common practice.
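
    To make the correlated-failure problem concrete, the sketch below estimates the probability that every replica of an application is down at once, first when two replicas share a top-of-rack switch and then when they do not. It is an illustration only and is not ReCloud's algorithm: it uses plain Monte Carlo sampling rather than the dagger sampling named in the author tags, a normal-approximation interval rather than the paper's error bounds, and every component name and failure probability (`host1`, `tor1`, 0.05, 0.02) is a made-up assumption.

```python
import math
import random


def all_replicas_fail(deps, failed):
    """Return True if every replica has at least one failed dependency."""
    return all(any(c in failed for c in replica) for replica in deps)


def estimate_failure(deps, fail_prob, trials=100_000, seed=0):
    """Plain Monte Carlo estimate of P(all replicas down).

    Returns the estimate and an approximate 95% half-width based on the
    normal approximation to the binomial (an assumption of this sketch).
    """
    rng = random.Random(seed)
    components = sorted(fail_prob)  # fixed order keeps runs reproducible
    hits = 0
    for _ in range(trials):
        # Sample an independent up/down state for every component.
        failed = {c for c in components if rng.random() < fail_prob[c]}
        hits += all_replicas_fail(deps, failed)
    p_hat = hits / trials
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, half_width


# Hypothetical per-component failure probabilities.
fail_prob = {"host1": 0.05, "host2": 0.05, "tor1": 0.02, "tor2": 0.02}

# Plan A: both replicas sit behind the same switch (a common dependency).
shared = [{"host1", "tor1"}, {"host2", "tor1"}]
# Plan B: each replica has its own host and its own switch.
spread = [{"host1", "tor1"}, {"host2", "tor2"}]

p_shared, ci_shared = estimate_failure(shared, fail_prob)
p_spread, ci_spread = estimate_failure(spread, fail_prob)

print(f"shared switch:    {p_shared:.5f} +/- {ci_shared:.5f}")
print(f"separate switches: {p_spread:.5f} +/- {ci_spread:.5f}")
```

    Under these assumed numbers the exact values are 0.02 + 0.98 × 0.05² = 0.02245 for the shared-switch plan and (1 − 0.95 × 0.98)² ≈ 0.00476 for the separated plan, so the shared switch dominates the failure probability even though both plans run two replicas.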

    Cited By

    • (2020) Check before you change. Proceedings of the 17th USENIX Conference on Networked Systems Design and Implementation, pp. 575-590. DOI: 10.5555/3388242.3388285. Online publication date: 25-Feb-2020

    Published In

    CoNEXT '17: Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies
    November 2017
    492 pages
    ISBN: 9781450354226
    DOI: 10.1145/3143361

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Cloud reliability
    2. dagger sampling
    3. network transformations
    4. simulated annealing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    CoNEXT '17

    Acceptance Rates

    Overall Acceptance Rate 198 of 789 submissions, 25%
