research-article

Towards Reliable Application Deployment in the Cloud

Authors:

Istemi Ekin Akkus,

Bimal Viswanath,

Volker Hilt

CoNEXT '17: Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies

Pages 464 - 477

https://doi.org/10.1145/3143361.3143388

Published: 28 November 2017

Abstract

A common practice to increase the reliability of a cloud application is to deploy redundant instances. Unfortunately, such redundancy efforts can be undermined if the application's instances share common dependencies. This paper presents ReCloud, a novel system that can efficiently find a reliable deployment plan for cloud applications. ReCloud considers and avoids common dependencies shared across application instances that may lead to correlated failures, and works even with applications that have complex internal structures. ReCloud uses available dependency information about the cloud infrastructure (e.g., hardware, software, and/or network dependencies) to quantitatively assess the reliability of an application's deployment plan with rigorous error bounds. This assessment further enables ReCloud to find a deployment plan that balances reliability against other criteria such as application performance and resource utilization. We implemented a fully functional system. The experimental results show that, even in a large cloud environment with more than 27K hosts, ReCloud needs only 30 seconds to find a deployment plan that is one order of magnitude more reliable than the common practice.
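
To make the idea of shared dependencies concrete, the following is a minimal sketch of how a deployment plan's failure probability might be estimated by sampling. It is not ReCloud's algorithm, which the abstract does not detail. The function name `estimate_failure`, the component names (`host1`, `tor1`, and so on), and the failure probabilities are all hypothetical, and the 95% normal-approximation interval is only a simple stand-in for the rigorous error bounds the abstract mentions.

```python
import math
import random


def estimate_failure(instances, fail_prob, samples=20000, seed=0):
    """Monte Carlo estimate of P(every redundant instance is down).

    instances: list of sets; each set names the components one instance needs.
    fail_prob: dict mapping component name -> independent failure probability.
    Returns (estimate, half_width), where half_width is a 95%
    normal-approximation bound on the sampling error.
    """
    rng = random.Random(seed)
    # Sort so that the sequence of random draws is reproducible.
    components = sorted(set().union(*instances))
    failures = 0
    for _ in range(samples):
        down = {c for c in components if rng.random() < fail_prob[c]}
        # An instance is down if any of its dependencies is down; the
        # application fails only if all redundant instances are down.
        if all(deps & down for deps in instances):
            failures += 1
    p = failures / samples
    half_width = 1.96 * math.sqrt(p * (1 - p) / samples)
    return p, half_width


# Hypothetical failure probabilities (illustrative values only).
fail_prob = {"host1": 0.02, "host2": 0.02, "tor1": 0.01, "tor2": 0.01}

# Plan A: both instances sit behind the same top-of-rack switch.
shared_plan = [{"host1", "tor1"}, {"host2", "tor1"}]
# Plan B: the instances are spread across two switches.
spread_plan = [{"host1", "tor1"}, {"host2", "tor2"}]

p_shared, hw_shared = estimate_failure(shared_plan, fail_prob)
p_spread, hw_spread = estimate_failure(spread_plan, fail_prob)
print(f"shared switch: {p_shared:.4f} +/- {hw_shared:.4f}")
print(f"spread out:    {p_spread:.4f} +/- {hw_spread:.4f}")
```

Under these assumed numbers, the shared switch dominates the failure probability of plan A, so the second instance adds little protection. This is the kind of correlated failure the abstract describes.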

Cited By

Zhai E, Chen A, Piskac R, Balakrishnan M, Tian B, Song B, Zhang H, Bhagwan R, Porter G. (2020). Check before you change. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation. 10.5555/3388242.3388285 (575-590). Online publication date: 25-Feb-2020
https://dl.acm.org/doi/10.5555/3388242.3388285

Index Terms

Towards Reliable Application Deployment in the Cloud
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
  2. Dependable and fault-tolerant systems and networks
2. Networks
  1. Network types
    1. Data center networks

Published In

CoNEXT '17: Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies

November 2017

492 pages

ISBN:9781450354226

DOI:10.1145/3143361

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CoNEXT '17: The 13th International Conference on emerging Networking EXperiments and Technologies

December 12 - 15, 2017

Incheon, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 198 of 789 submissions, 25%
