DOI: 10.1145/3297858.3304004

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices

Published: 04 April 2019 Publication History

Abstract

Performance unpredictability is a major roadblock to cloud adoption, with performance, cost, and revenue ramifications. Predictable performance becomes even more critical as cloud services transition from monolithic designs to microservices. In microservice-based systems, detecting QoS violations only after they occur leads to long recovery times, because hotspots propagate and amplify across dependent services. We present Seer, an online cloud performance debugging system that leverages deep learning and the massive amount of tracing data cloud systems collect to learn the spatial and temporal patterns that translate to QoS violations. Seer combines lightweight distributed RPC-level tracing with detailed low-level hardware monitoring to signal an upcoming QoS violation and diagnose the source of unpredictable performance. Once an imminent QoS violation is detected, Seer notifies the cluster manager to take action and avoid performance degradation altogether. We evaluate Seer both in local clusters and in large-scale deployments of end-to-end applications built with microservices and serving hundreds of users. We show that Seer correctly anticipates QoS violations 91% of the time and avoids the violation altogether in 84% of cases. Finally, we show that Seer can identify application-level design bugs and provide insights on how to better architect microservices to achieve predictable performance.
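
As a rough illustration of the idea of anticipating a QoS violation from trace data before it occurs, the sketch below extrapolates per-service latency and flags services expected to exceed a latency target. It is not Seer's method: the abstract describes a deep learning model over RPC-level traces and hardware monitoring, whereas this uses a simple least-squares trend as a stand-in. The function names, the trace layout (service name mapped to recent latency samples in milliseconds), and the 50 ms target are all assumptions made for this example.

```python
from statistics import mean


def linear_forecast(samples, steps_ahead):
    """Extrapolate a least-squares line fitted to `samples` by `steps_ahead` steps."""
    n = len(samples)
    if n < 2:
        # Not enough points to fit a trend; fall back to the last observation.
        return samples[-1] if samples else 0.0
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(samples)
    denom = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples)) / denom
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + steps_ahead)


def anticipate_violations(traces, qos_target_ms, steps_ahead=3):
    """Return (service, forecast) pairs forecast to exceed the target, worst first."""
    flagged = []
    for service, latencies in traces.items():
        forecast = linear_forecast(latencies, steps_ahead)
        if forecast > qos_target_ms:
            flagged.append((service, forecast))
    # Rank by forecast latency so the most likely culprit is listed first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical traces: recent latency samples (ms) per microservice.
traces = {
    "frontend": [10.0, 10.0, 10.0, 10.0],
    "cache": [10.0, 20.0, 30.0, 40.0],
}
print(anticipate_violations(traces, qos_target_ms=50.0))
```

Here only the hypothetical "cache" service is flagged, because its latency is rising steadily toward the target while "frontend" stays flat. A cluster manager receiving such a signal could act on the flagged service before end-to-end latency degrades.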

Published In

ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages
ISBN: 9781450362405
DOI: 10.1145/3297858

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. QoS
  2. cloud computing
  3. data mining
  4. datacenter
  5. deep learning
  6. microservices
  7. monitoring
  8. performance debugging
  9. resource management
  10. tracing

Qualifiers

  • Research-article

Conference

ASPLOS '19

Acceptance Rates

ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%;
Overall Acceptance Rate 535 of 2,713 submissions, 20%
